text_analysis.ipynb
Text Analysis Using NetworkX

This notebook analyzes a plain text file, treating it as a list of newline-separated sentences (e.g. a list of paper titles).

It computes word frequencies (after some naive normalization: lowercasing and discarding a few overly common words). From the most common words it then builds a weighted graph of word co-occurrences, displays it, and summarizes the graph structure by ranking its nodes in descending order of eigenvector centrality.

This is meant as an illustration of text processing in Python, using matplotlib for visualization and NetworkX for graph-theoretical manipulation. It should not be considered production-strength code for serious text analysis.


Author: Fernando Perez

In [1]:
%run text_analysis.py
In [2]:
default_url = "http://bibserver.berkeley.edu/tmp/titles.txt"
n_words = 15
n_nodes = 15
url = default_url

Fetch text and do basic preprocessing.

In [3]:
text = get_text_from_url(url).lower()
lines = text.splitlines()
words = text_cleanup(text)
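The helpers used here are defined in text_analysis.py, loaded by the %run above. Their exact definitions are not shown in this notebook; a minimal sketch, in which the stopword list and whitespace tokenization are assumptions, might look like:

```python
from urllib.request import urlopen

# Assumed stopword list; the actual set lives in text_analysis.py.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "on", "for", "to", "with"}

def get_text_from_url(url):
    """Fetch a URL and return its contents decoded as text."""
    with urlopen(url) as f:
        return f.read().decode("utf-8", errors="replace")

def text_cleanup(text):
    """Split text on whitespace and drop a few overly common words."""
    return [w for w in text.lower().split() if w not in STOPWORDS]
```

Splitting on whitespace (rather than a word regex) is consistent with tokens like "reconstruction," and "benda)" appearing in the frequency summary below.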

Compute frequency histogram.

In [4]:
wf = word_freq(words)
sorted_wf = sort_freqs(wf)
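word_freq and sort_freqs also come from text_analysis.py. Since the next cell takes sorted_wf[-n_nodes:] as the most frequent words, sort_freqs evidently sorts in ascending order of count. A plausible sketch:

```python
from collections import Counter

def word_freq(words):
    """Map each word to its number of occurrences."""
    return Counter(words)

def sort_freqs(freqs):
    """Return (word, count) pairs in ascending order of count, so the
    most frequent words sit at the end of the list."""
    return sorted(freqs.items(), key=lambda wc: wc[1])
```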

Build a graph from the n_nodes most frequent words.

In [5]:
popular = sorted_wf[-n_nodes:]
pop_words = [wc[0] for wc in popular]
co_occur = co_occurrences(lines, pop_words)
wgraph = co_occurrences_graph(popular, co_occur, cutoff=1)
centrality = nx.eigenvector_centrality_numpy(wgraph)
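The co-occurrence helpers can be sketched as below. This is a guess at their behavior from the call signatures: two tracked words co-occur when they appear on the same line, node frequencies are stored as attributes, and edges at or below the cutoff weight are dropped.

```python
from itertools import combinations

import networkx as nx

def co_occurrences(lines, words):
    """Count, for each pair of tracked words, how many lines contain both."""
    wordset = set(words)
    counts = {}
    for line in lines:
        present = wordset.intersection(line.split())
        for pair in combinations(sorted(present), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts

def co_occurrences_graph(word_hist, co_occur, cutoff=0):
    """Build a weighted graph: nodes are words (with their frequency as a
    node attribute), edges carry co-occurrence counts above the cutoff."""
    g = nx.Graph()
    for word, count in word_hist:
        g.add_node(word, count=count)
    for (w1, w2), count in co_occur.items():
        if count > cutoff:
            g.add_edge(w1, w2, weight=count)
    return g
```

nx.eigenvector_centrality_numpy then scores each node by the corresponding entry of the principal eigenvector of the graph's adjacency matrix, so well-connected hub words rank highest.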

Print summaries of single-word frequencies and graph structure.

In [6]:
summarize_freq_hist(sorted_wf)
summarize_centrality(centrality)
Number of unique words: 9265

10 least frequent words:
    rosenhouse's -> 1
 reconstruction, -> 1
parameterization -> 1
          benda) -> 1
 reconstruction: -> 1
          payoff -> 1
     electricity -> 1
     kalman-bucy -> 1
    clark--ocone -> 1
    m-estimators -> 1

10 most frequent words:
     large -> 421
   process -> 428
    markov -> 440
 equations -> 471
     limit -> 518
  brownian -> 536
     model -> 542
stochastic -> 1000
 processes -> 1087
    random -> 2356

Graph centrality
         random: 0.474
          limit: 0.336
     stochastic: 0.312
          walks: 0.307
        process: 0.297
      processes: 0.259
        theorem: 0.251
           time: 0.232
      equations: 0.213
          model: 0.209
         markov: 0.197
          large: 0.19
       brownian: 0.125
         models: 0.107
    percolation: 0.0852

Plot histogram and graph.

In [7]:
plot_word_histogram(sorted_wf, n_words, "Frequencies for %s most frequent words" % n_words)
Out[7]:
<matplotlib.axes.AxesSubplot at 0x334b250>
[image: bar chart of frequencies for the 15 most frequent words]
In [8]:
plot_word_histogram(sorted_wf, 1.0, "Frequencies for entire word list")
Out[8]:
<matplotlib.axes.AxesSubplot at 0x3371e50>
[image: bar chart of frequencies for the entire word list]
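plot_word_histogram is called above both with an integer word count and with 1.0 for the entire list, suggesting an interface like the sketch below; the "float means fraction of the list" interpretation is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt

def plot_word_histogram(freqs, show, title=None):
    """Bar-plot the most frequent words from an ascending (word, count)
    list; show is a word count (int) or a fraction of the list (float)."""
    if isinstance(show, float):
        n = int(round(show * len(freqs)))
    else:
        n = show
    top = freqs[-n:]
    fig, ax = plt.subplots()
    ax.bar(range(len(top)), [count for _, count in top])
    ax.set_xticks(range(len(top)))
    ax.set_xticklabels([word for word, _ in top], rotation=90)
    if title:
        ax.set_title(title)
    return ax
```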
In [9]:
plot_graph(wgraph)
[image: co-occurrence graph of the most frequent words]