text_analysis.ipynb
Text Analysis Using NetworkX

This notebook analyzes a plain text file, treating it as a list of newline-separated sentences (e.g. a list of paper titles).

It computes word frequencies (after some naive normalization: lowercasing and discarding a few overly common words). From the most common words it then builds a weighted graph of word co-occurrences, displays it, and summarizes the graph structure by ranking its nodes in descending order of eigenvector centrality.

This is meant as an illustration of text processing in Python, using matplotlib for visualization and NetworkX for graph-theoretical manipulation. It should not be considered production-strength code for serious text analysis.


Author: Fernando Perez

In [1]:
%run text_analysis.py
In [2]:
default_url = "http://bibserver.berkeley.edu/tmp/titles.txt"
n_words = 15
n_nodes = 15
url = default_url

Fetch text and do basic preprocessing.

In [3]:
text = get_text_from_url(url).lower()
lines = text.splitlines()
words = text_cleanup(text)
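The helpers used here are defined in text_analysis.py, loaded by the %run above. Their exact definitions are not shown in this notebook; a minimal sketch, in which the stopword list and whitespace tokenization are assumptions, might look like:

```python
from urllib.request import urlopen

# Assumed stopword list; the actual set lives in text_analysis.py.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "on", "for", "to", "with"}

def get_text_from_url(url):
    """Fetch a URL and return its contents decoded as text."""
    with urlopen(url) as f:
        return f.read().decode("utf-8", errors="replace")

def text_cleanup(text):
    """Split text on whitespace and drop a few overly common words."""
    return [w for w in text.lower().split() if w not in STOPWORDS]
```

Splitting on whitespace (rather than a word regex) is consistent with tokens like "reconstruction," and "benda)" appearing in the frequency summary below.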

Compute frequency histogram.

In [4]:
wf = word_freq(words)
sorted_wf = sort_freqs(wf)
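word_freq and sort_freqs also come from text_analysis.py. Since the next cell takes sorted_wf[-n_nodes:] as the most frequent words, sort_freqs evidently sorts in ascending order of count. A plausible sketch:

```python
from collections import Counter

def word_freq(words):
    """Map each word to its number of occurrences."""
    return Counter(words)

def sort_freqs(freqs):
    """Return (word, count) pairs in ascending order of count, so the
    most frequent words sit at the end of the list."""
    return sorted(freqs.items(), key=lambda wc: wc[1])
```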

Build a graph from the n_nodes most frequent words.

In [5]:
popular = sorted_wf[-n_nodes:]
pop_words = [wc[0] for wc in popular]
co_occur = co_occurrences(lines, pop_words)
wgraph = co_occurrences_graph(popular, co_occur, cutoff=1)
centrality = nx.eigenvector_centrality_numpy(wgraph)
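The co-occurrence helpers can be sketched as below. This is a guess at their behavior from the call signatures: two tracked words co-occur when they appear on the same line, node frequencies are stored as attributes, and edges at or below the cutoff weight are dropped.

```python
from itertools import combinations

import networkx as nx

def co_occurrences(lines, words):
    """Count, for each pair of tracked words, how many lines contain both."""
    wordset = set(words)
    counts = {}
    for line in lines:
        present = wordset.intersection(line.split())
        for pair in combinations(sorted(present), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts

def co_occurrences_graph(word_hist, co_occur, cutoff=0):
    """Build a weighted graph: nodes are words (with their frequency as a
    node attribute), edges carry co-occurrence counts above the cutoff."""
    g = nx.Graph()
    for word, count in word_hist:
        g.add_node(word, count=count)
    for (w1, w2), count in co_occur.items():
        if count > cutoff:
            g.add_edge(w1, w2, weight=count)
    return g
```

nx.eigenvector_centrality_numpy then scores each node by the corresponding entry of the principal eigenvector of the graph's adjacency matrix, so well-connected hub words rank highest.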

Print summaries of single-word frequencies and graph structure.

In [6]:
summarize_freq_hist(sorted_wf)
summarize_centrality(centrality)
Number of unique words: 9265

10 least frequent words:
    rosenhouse's -> 1
 reconstruction, -> 1
parameterization -> 1
          benda) -> 1
 reconstruction: -> 1
          payoff -> 1
     electricity -> 1
     kalman-bucy -> 1
    clark--ocone -> 1
    m-estimators -> 1

10 most frequent words:
     large -> 421
   process -> 428
    markov -> 440
 equations -> 471
     limit -> 518
  brownian -> 536
     model -> 542
stochastic -> 1000
 processes -> 1087
    random -> 2356

Graph centrality
         random: 0.474
          limit: 0.336
     stochastic: 0.312
          walks: 0.307
        process: 0.297
      processes: 0.259
        theorem: 0.251
           time: 0.232
      equations: 0.213
          model: 0.209
         markov: 0.197
          large: 0.19
       brownian: 0.125
         models: 0.107
    percolation: 0.0852

Plot histogram and graph.

In [7]:
plot_word_histogram(sorted_wf, n_words, "Frequencies for %s most frequent words" % n_words)
Out[7]:
<matplotlib.axes.AxesSubplot at 0x334b250>
[image: bar chart of frequencies for the 15 most frequent words]
In [8]:
plot_word_histogram(sorted_wf, 1.0, "Frequencies for entire word list")
Out[8]:
<matplotlib.axes.AxesSubplot at 0x3371e50>
[image: bar chart of frequencies for the entire word list]
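plot_word_histogram is called above both with an integer word count and with 1.0 for the entire list, suggesting an interface like the sketch below; the "float means fraction of the list" interpretation is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt

def plot_word_histogram(freqs, show, title=None):
    """Bar-plot the most frequent words from an ascending (word, count)
    list; show is a word count (int) or a fraction of the list (float)."""
    if isinstance(show, float):
        n = int(round(show * len(freqs)))
    else:
        n = show
    top = freqs[-n:]
    fig, ax = plt.subplots()
    ax.bar(range(len(top)), [count for _, count in top])
    ax.set_xticks(range(len(top)))
    ax.set_xticklabels([word for word, _ in top], rotation=90)
    if title:
        ax.set_title(title)
    return ax
```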
In [9]:
plot_graph(wgraph)
[image: co-occurrence graph of the most frequent words]