Text Analysis Using NetworkX
This notebook analyzes a plain text file, treating it as a list of newline-separated sentences (e.g. a list of paper titles).
It computes word frequencies (after some naive normalization: lowercasing and discarding a few overly common words). From the most common words it then builds a weighted graph of word co-occurrences, displays it, and summarizes the graph structure by ranking its nodes in descending order of eigenvector centrality.
This is meant as an illustration of text processing in Python, using matplotlib for visualization and NetworkX for graph-theoretical manipulation. It should not be considered production-strength code for serious text analysis.
Author: Fernando Perez
%run text_analysis.py
default_url = "http://bibserver.berkeley.edu/tmp/titles.txt"
n_words = 15
n_nodes = 15
url = default_url
Fetch text and do basic preprocessing.
text = get_text_from_url(url).lower()
lines = text.splitlines()
words = text_cleanup(text)
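The helpers used here are defined in `text_analysis.py` and not shown in the notebook. As a rough idea of what `text_cleanup` might do, here is a minimal sketch (the word-length threshold and the small stopword set are assumptions, not the script's actual values):

```python
import re

def text_cleanup(text, min_length=3,
                 remove=frozenset({'for', 'the', 'and', 'with'})):
    """Split text into lowercase words, dropping very short words and a
    few overly common ones (a naive stopword filter)."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if len(w) >= min_length and w not in remove]

sample = "The analysis of networks and the theory of graphs"
print(text_cleanup(sample))
# → ['analysis', 'networks', 'theory', 'graphs']
```

`get_text_from_url` would simply fetch the document body (e.g. via `urllib.request.urlopen(url).read()`) before this cleanup step.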
Compute frequency histogram.
wf = word_freq(words)
sorted_wf = sort_freqs(wf)
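A plausible sketch of the two histogram helpers, assuming `sort_freqs` returns (word, count) pairs in *ascending* count order — which is consistent with the notebook later taking the most frequent words from the end of the list with `sorted_wf[-n_nodes:]`:

```python
from collections import Counter

def word_freq(words):
    """Return a {word: count} frequency dictionary."""
    return dict(Counter(words))

def sort_freqs(freqs):
    """Return (word, count) pairs sorted in ascending count order, so the
    most frequent words sit at the end of the list."""
    return sorted(freqs.items(), key=lambda wc: wc[1])

wf = word_freq(["graph", "word", "graph", "text", "graph", "word"])
sorted_wf = sort_freqs(wf)
print(sorted_wf)
# → [('text', 1), ('word', 2), ('graph', 3)]
```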
Build a graph from the n_nodes most frequent words.
popular = sorted_wf[-n_nodes:]
pop_words = [wc[0] for wc in popular]
co_occur = co_occurrences(lines, pop_words)
wgraph = co_occurrences_graph(popular, co_occur, cutoff=1)
centrality = nx.eigenvector_centrality_numpy(wgraph)
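`co_occurrences` and `co_occurrences_graph` also come from `text_analysis.py`. A minimal sketch of the idea, assuming two words "co-occur" when they appear on the same line (i.e. in the same title), and that `cutoff` drops edges whose weight does not exceed it — both assumptions about the real implementation:

```python
import itertools
from collections import Counter

import networkx as nx

def co_occurrences(lines, words):
    """Count how often each pair of tracked words appears on the same line."""
    wordset = set(words)
    counts = Counter()
    for line in lines:
        present = sorted(wordset.intersection(line.lower().split()))
        for pair in itertools.combinations(present, 2):
            counts[pair] += 1
    return counts

def co_occurrences_graph(word_hist, co_occur, cutoff=0):
    """Build a weighted graph: nodes carry word counts, edges carry
    co-occurrence weights above the cutoff."""
    g = nx.Graph()
    for word, count in word_hist:
        g.add_node(word, count=count)
    for (w1, w2), weight in co_occur.items():
        if weight > cutoff:
            g.add_edge(w1, w2, weight=weight)
    return g

cc = co_occurrences(["graph network", "graph network text"],
                    ["graph", "network", "text"])
g = co_occurrences_graph([("graph", 2), ("network", 2), ("text", 1)],
                         cc, cutoff=1)
```

With this toy input, only the ("graph", "network") pair co-occurs more than once, so it is the sole edge surviving `cutoff=1`. `nx.eigenvector_centrality_numpy` then scores each node by the corresponding entry of the dominant eigenvector of the graph's adjacency matrix.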
Print summaries of single-word frequencies and graph structure.
summarize_freq_hist(sorted_wf)
summarize_centrality(centrality)
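The centrality summary amounts to sorting the `{node: score}` dictionary returned by NetworkX. A sketch of what `summarize_centrality` might look like (the return value and formatting are assumptions; the real helper may only print):

```python
def summarize_centrality(centrality, n=5):
    """Print the top-n nodes ranked by descending centrality score,
    and return the full ranking."""
    ranked = sorted(centrality.items(), key=lambda nc: nc[1], reverse=True)
    for node, score in ranked[:n]:
        print(f"{node:>15}: {score:.3g}")
    return ranked

ranked = summarize_centrality({"graph": 0.7, "text": 0.1, "network": 0.9})
```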
Plot histogram and graph.
plot_word_histogram(sorted_wf, n_words, "Frequencies for %s most frequent words" % n_words)
plot_word_histogram(sorted_wf, 1.0, "Frequencies for entire word list")
plot_graph(wgraph)
%notebook save text_analysis.ipynb