Adding new notebook examples:
* Many that use sympy's quantum computing (github master required)
* One from Fernando Perez that does text analysis.

File last commit: r4328:7877585c
text_analysis.ipynb

Text Analysis Using NetworkX

This notebook analyzes a plain text file, treating it as a list of newline-separated sentences (e.g. a list of paper titles).


It computes word frequencies (after some naive normalization: lowercasing and discarding a few overly common words). From the most frequent words it then builds a weighted graph of word co-occurrences, displays it, and summarizes the graph structure by ranking its nodes in descending order of eigenvector centrality.


This is meant as an illustration of text processing in Python, using matplotlib for visualization and NetworkX for graph-theoretical manipulation. It should not be considered production-strength code for serious text analysis.


Author: Fernando Perez

In [3]:
%run text_analysis.py
In [4]:
default_url = "http://bibserver.berkeley.edu/tmp/titles.txt"
n_words = 15
n_nodes = 15
url = default_url

Fetch text and do basic preprocessing.

In [5]:
text = get_text_from_url(url).lower()
lines = text.splitlines()
words = text_cleanup(text)

Compute frequency histogram.

In [6]:
wf = word_freq(words)
sorted_wf = sort_freqs(wf)
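`word_freq` and `sort_freqs` are also defined in `text_analysis.py`. Given that the next cell takes `sorted_wf[-n_nodes:]` as the most popular words, `sort_freqs` presumably returns (word, count) pairs in ascending order of count; a minimal sketch under that assumption:

```python
from collections import Counter

def word_freq(words):
    # Map each word to its number of occurrences.
    return Counter(words)

def sort_freqs(wf):
    # (word, count) pairs sorted by ascending count, so the most
    # frequent words end up at the tail of the list.
    return sorted(wf.items(), key=lambda wc: wc[1])
```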

Build a graph from the n_nodes most frequent words.

In [7]:
popular = sorted_wf[-n_nodes:]
pop_words = [wc[0] for wc in popular]
co_occur = co_occurrences(lines, pop_words)
wgraph = co_occurrences_graph(popular, co_occur, cutoff=1)
centrality = nx.eigenvector_centrality_numpy(wgraph)
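`co_occurrences` and `co_occurrences_graph` come from the same script; one plausible reading is that `co_occurrences` counts, for each unordered pair of tracked words, how many lines contain both, and that the graph keeps only edges whose count exceeds the cutoff. A hedged sketch of the counting step (the real signatures and tokenization may differ):

```python
from collections import Counter
from itertools import combinations

def co_occurrences(lines, words):
    # For each unordered pair of tracked words, count the number of
    # lines in which both words appear.
    wordset = set(words)
    pairs = Counter()
    for line in lines:
        present = sorted(wordset.intersection(line.split()))
        for pair in combinations(present, 2):
            pairs[pair] += 1
    return pairs
```

The resulting counts become edge weights; `nx.eigenvector_centrality_numpy` then ranks nodes by how strongly they connect to other well-connected nodes.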

Print summaries of single-word frequencies and graph structure.

In [8]:
summarize_freq_hist(sorted_wf)
summarize_centrality(centrality)

Plot histogram and graph.

In [9]:
plot_word_histogram(sorted_wf, n_words, "Frequencies for %s most frequent words" % n_words)
In [10]:
plot_word_histogram(sorted_wf, 1.0, "Frequencies for entire word list")
In [11]:
plot_graph(wgraph)
In [10]:
%notebook save text_analysis.ipynb