Adding new notebook examples:
* Many that use sympy's quantum computing (github master required)
* One from Fernando Perez that does text analysis.

File last commit: r4328:7877585c
text_analysis.ipynb

Text Analysis Using NetworkX

This notebook analyzes a plain text file, treating it as a list of newline-separated sentences (e.g. a list of paper titles).


It computes word frequencies (after some naive normalization: lowercasing and discarding a few overly common words). From the most frequent words it then builds a weighted graph of word co-occurrences, displays it, and summarizes the graph structure by ranking its nodes in descending order of eigenvector centrality.


This is meant as an illustration of text processing in Python, using matplotlib for visualization and NetworkX for graph-theoretical manipulation. It should not be considered production-strength code for serious text analysis.


Author: Fernando Perez

In [3]:
%run text_analysis.py
In [4]:
default_url = "http://bibserver.berkeley.edu/tmp/titles.txt"
n_words = 15
n_nodes = 15
url = default_url

Fetch text and do basic preprocessing.

In [5]:
text = get_text_from_url(url).lower()
lines = text.splitlines()
words = text_cleanup(text)

Compute frequency histogram.

In [6]:
wf = word_freq(words)
sorted_wf = sort_freqs(wf)
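`word_freq` and `sort_freqs` are also defined in `text_analysis.py`. Given that the next cell takes `sorted_wf[-n_nodes:]` as the most popular words, `sort_freqs` presumably returns (word, count) pairs in ascending order of count; a minimal sketch under that assumption:

```python
from collections import Counter

def word_freq(words):
    # Map each word to its number of occurrences.
    return Counter(words)

def sort_freqs(wf):
    # (word, count) pairs sorted by ascending count, so the most
    # frequent words end up at the tail of the list.
    return sorted(wf.items(), key=lambda wc: wc[1])
```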

Build a graph from the n_nodes most frequent words.

In [7]:
popular = sorted_wf[-n_nodes:]
pop_words = [wc[0] for wc in popular]
co_occur = co_occurrences(lines, pop_words)
wgraph = co_occurrences_graph(popular, co_occur, cutoff=1)
centrality = nx.eigenvector_centrality_numpy(wgraph)
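`co_occurrences` and `co_occurrences_graph` come from the same script; one plausible reading is that `co_occurrences` counts, for each unordered pair of tracked words, how many lines contain both, and that the graph keeps only edges whose count exceeds the cutoff. A hedged sketch of the counting step (the real signatures and tokenization may differ):

```python
from collections import Counter
from itertools import combinations

def co_occurrences(lines, words):
    # For each unordered pair of tracked words, count the number of
    # lines in which both words appear.
    wordset = set(words)
    pairs = Counter()
    for line in lines:
        present = sorted(wordset.intersection(line.split()))
        for pair in combinations(present, 2):
            pairs[pair] += 1
    return pairs
```

The resulting counts become edge weights; `nx.eigenvector_centrality_numpy` then ranks nodes by how strongly they connect to other well-connected nodes.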

Print summaries of single-word frequencies and graph structure.

In [8]:
summarize_freq_hist(sorted_wf)
summarize_centrality(centrality)

Plot histogram and graph.

In [9]:
plot_word_histogram(sorted_wf, n_words, "Frequencies for %s most frequent words" % n_words)
In [10]:
plot_word_histogram(sorted_wf, 1.0, "Frequencies for entire word list")
In [11]:
plot_graph(wgraph)
In [10]:
%notebook save text_analysis.ipynb