upstream/ipython Files · docs/source/parallel/parallel_demos.txt


				=================
				Parallel examples
				=================

				In this section we describe two more involved examples of using an IPython
				cluster to perform a parallel computation. In these examples, we will be using
				IPython's "pylab" mode, which enables interactive plotting using the
				Matplotlib package. IPython can be started in this mode by typing::

				ipython -p pylab

				at the system command line. If this prints an error message, you will
				need to install the default profiles from within IPython by running:

				.. sourcecode:: ipython

				In [1]: %install_profiles

				and then restarting IPython.

				150 million digits of pi
				========================

				In this example we would like to study the distribution of digits in the
				number pi (in base 10). While it is not known whether pi is a normal number
				(loosely, a number is normal in base 10 if every digit 0-9, and more generally
				every finite sequence of digits, occurs with the frequency expected of a
				random sequence), numerical investigations suggest that it is. We will begin
				with a serial calculation on
				10,000 digits of pi and then perform a parallel calculation involving 150
				million digits.

				In both the serial and parallel calculation we will be using functions defined
				in the :file:`pidigits.py` file, which is available in the
				:file:`docs/examples/kernel` directory of the IPython source distribution.
				These functions provide basic facilities for working with the digits of pi and
				can be loaded into IPython by putting :file:`pidigits.py` in your current
				working directory and then doing:

				.. sourcecode:: ipython

				In [1]: run pidigits.py

				Serial calculation
				------------------

				For the serial calculation, we will use SymPy (http://www.sympy.org) to
				calculate 10,000 digits of pi and then look at the frequencies of the digits
				0-9. Out of 10,000 digits, we expect each digit to occur 1,000 times. While
				SymPy is capable of calculating many more digits of pi, our purpose here is to
				set the stage for the much larger parallel calculation.

				In this example, we use two functions from :file:`pidigits.py`:
				:func:`one_digit_freqs` (which calculates how many times each digit occurs)
				and :func:`plot_one_digit_freqs` (which uses Matplotlib to plot the result).
				Here is an interactive IPython session that uses these functions with
				SymPy:

				.. sourcecode:: ipython

				In [7]: import sympy

				In [8]: pi = sympy.pi.evalf(40)

				In [9]: pi
				Out[9]: 3.141592653589793238462643383279502884197

				In [10]: pi = sympy.pi.evalf(10000)

				In [11]: digits = (d for d in str(pi)[2:]) # create a sequence of digits

				In [12]: run pidigits.py # load one_digit_freqs/plot_one_digit_freqs

				In [13]: freqs = one_digit_freqs(digits)

				In [14]: plot_one_digit_freqs(freqs)
				Out[14]: [<matplotlib.lines.Line2D object at 0x18a55290>]

				The resulting plot of the single digit counts shows that each digit occurs
				approximately 1,000 times, but that with only 10,000 digits the
				statistical fluctuations are still rather large:

				.. image:: single_digits.*

				It is clear that to reduce the relative fluctuations in the counts, we need
				to look at many more digits of pi. That brings us to the parallel calculation.

				Parallel calculation
				--------------------

				Calculating many digits of pi is a challenging computational problem in itself.
				Because we want to focus on the distribution of digits in this example, we
				will use pre-computed digits of pi from the website of Professor Yasumasa
				Kanada at the University of Tokyo (http://www.super-computing.org). These
				digits come in a set of text files (ftp://pi.super-computing.org/.2/pi200m/)
				that each have 10 million digits of pi.

				For the parallel calculation, we have copied these files to the local hard
				drives of the compute nodes. A total of 15 of these files will be used, for a
				total of 150 million digits of pi. To make things a little more interesting we
				will calculate the frequencies of all two-digit sequences (00-99) and then plot
				the result using a 2D matrix in Matplotlib.

				The overall idea of the calculation is simple: each IPython engine will
				compute the two digit counts for the digits in a single file. Then in a final
				step the counts from each engine will be added up. To perform this
				calculation, we will need two top-level functions from :file:`pidigits.py`:

				.. literalinclude:: ../../examples/kernel/pidigits.py
				:language: python
				:lines: 34-49
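
The map-then-reduce idea described above can be sketched with the standard
library alone. This is not the code from :file:`pidigits.py`: the helper names
``count_two_digit_freqs`` and ``sum_freqs`` are hypothetical, short in-memory
strings stand in for the digit files, and ``ThreadPoolExecutor.map`` plays the
role that the engines play in the real calculation. We also assume that
overlapping pairs are counted.

```python
from concurrent.futures import ThreadPoolExecutor


def count_two_digit_freqs(digits):
    """Count overlapping two-digit sequences (00-99) in a string of digits."""
    freqs = [0] * 100
    for first, second in zip(digits, digits[1:]):
        freqs[int(first + second)] += 1
    return freqs


def sum_freqs(freqs_all):
    """Add up the per-chunk counts element by element."""
    return [sum(column) for column in zip(*freqs_all)]


# Hypothetical stand-ins for the digit files: three chunks of pi's decimals.
chunks = ["1415926535", "8979323846", "2643383279"]

# Map step: one counting job per chunk.
with ThreadPoolExecutor(max_workers=3) as pool:
    freqs_all = list(pool.map(count_two_digit_freqs, chunks))

# Reduce step: combine the partial counts into a single table.
freqs = sum_freqs(freqs_all)
print(sum(freqs))  # 27 pairs: 9 from each 10-digit chunk
```

Because each chunk is counted on its own, pairs that straddle a chunk boundary
are not counted in this sketch.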

				We will also use the :func:`plot_two_digit_freqs` function to plot the
				results. The code to run this calculation in parallel is contained in
				:file:`docs/examples/kernel/parallelpi.py`. This code can be run in parallel
				using IPython by following these steps:

				1. Copy the text files with the digits of pi
				(ftp://pi.super-computing.org/.2/pi200m/) to the working directory of the
				engines on the compute nodes.
				2. Use :command:`ipcluster` to start 15 engines. We used an 8-core cluster
				(2 quad-core CPUs) with hyperthreading enabled, which makes the 8 cores
				look like 16 (1 controller + 15 engines) to the OS. However, the maximum
				speedup we can observe is still only 8x.
				3. With the file :file:`parallelpi.py` in your current working directory, open
				up IPython in pylab mode and type ``run parallelpi.py``.

				When run on our 8 core cluster, we observe a speedup of 7.7x. This is slightly
				less than linear scaling (8x) because the controller is also running on one of
				the cores.
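
To put the 7.7x figure in perspective, it helps to express it as a parallel
efficiency, that is, the fraction of ideal linear scaling actually achieved.
The helper below is a hypothetical illustration, not part of the example files:

```python
def parallel_efficiency(speedup, n_cores):
    """Fraction of ideal linear scaling achieved on n_cores physical cores."""
    return speedup / n_cores


efficiency = parallel_efficiency(7.7, 8)
print(f"{efficiency:.2%}")  # 96.25%
```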

				To emphasize the interactive nature of IPython, we now show how the
				calculation can also be run by simply typing the commands from
				:file:`parallelpi.py` interactively into IPython:

				.. sourcecode:: ipython

				In [1]: from IPython.kernel import client
				2009-11-19 11:32:38-0800 [-] Log opened.

				# The MultiEngineClient allows us to use the engines interactively.
				# We simply pass MultiEngineClient the name of the cluster profile we
				# are using.
				In [2]: mec = client.MultiEngineClient(profile='mycluster')
				2009-11-19 11:32:44-0800 [-] Connecting [0]
				2009-11-19 11:32:44-0800 [Negotiation,client] Connected: ./ipcontroller-mec.furl

				In [3]: mec.get_ids()
				Out[3]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]

				In [4]: run pidigits.py

				In [5]: filestring = 'pi200m-ascii-%(i)02dof20.txt'

				# Create the list of files to process.
				In [6]: files = [filestring % {'i':i} for i in range(1,16)]

				In [7]: files
				Out[7]:
				['pi200m-ascii-01of20.txt',
				'pi200m-ascii-02of20.txt',
				'pi200m-ascii-03of20.txt',
				'pi200m-ascii-04of20.txt',
				'pi200m-ascii-05of20.txt',
				'pi200m-ascii-06of20.txt',
				'pi200m-ascii-07of20.txt',
				'pi200m-ascii-08of20.txt',
				'pi200m-ascii-09of20.txt',
				'pi200m-ascii-10of20.txt',
				'pi200m-ascii-11of20.txt',
				'pi200m-ascii-12of20.txt',
				'pi200m-ascii-13of20.txt',
				'pi200m-ascii-14of20.txt',
				'pi200m-ascii-15of20.txt']

				# This is the parallel calculation using the MultiEngineClient.map method
				# which applies compute_two_digit_freqs to each file in files in parallel.
				In [8]: freqs_all = mec.map(compute_two_digit_freqs, files)

				# Add up the frequencies from each engine.
				In [9]: freqs = reduce_freqs(freqs_all)

				In [10]: plot_two_digit_freqs(freqs)
				Out[10]: <matplotlib.image.AxesImage object at 0x18beb110>

				In [11]: plt.title('2 digit counts of 150m digits of pi')
				Out[11]: <matplotlib.text.Text object at 0x18d1f9b0>

				The resulting plot generated by Matplotlib is shown below. The colors indicate
				which two digit sequences are more (red) or less (blue) likely to occur in the
				first 150 million digits of pi. We clearly see that the sequence "41" is
				most likely and that "06" and "07" are least likely. Further analysis would
				show that the relative size of the statistical fluctuations has decreased
				compared to the 10,000 digit calculation.

				.. image:: two_digit_counts.*
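
The claim about shrinking fluctuations can be made concrete with a rough
estimate. The sketch below assumes that the digits behave like independent,
uniformly distributed random draws, which is a modeling assumption and not
something known about pi. Under that assumption each count is binomial, and
its relative standard deviation falls off like one over the square root of the
number of samples. The function name is hypothetical.

```python
import math


def expected_relative_fluctuation(n_samples, n_bins):
    """Relative standard deviation of one bin count under a uniform model.

    Assumes each sample falls independently into one of n_bins equally
    likely bins, so a bin count is binomial with p = 1 / n_bins.
    """
    p = 1.0 / n_bins
    mean = n_samples * p
    std = math.sqrt(n_samples * p * (1.0 - p))
    return std / mean


# Single digits in 10,000 digits versus two-digit sequences in 150 million.
one_digit = expected_relative_fluctuation(10_000, 10)
two_digit = expected_relative_fluctuation(150_000_000, 100)
print(f"{one_digit:.2%}")  # 3.00%
print(f"{two_digit:.3%}")  # 0.081%
```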


				Parallel options pricing
				========================

				An option is a financial contract that gives the buyer of the contract the
				right to buy (a "call") or sell (a "put") a secondary asset (a stock for
				example) at a particular date in the future (the expiration date) for a
				pre-agreed upon price (the strike price). For this right, the buyer pays the
				seller a premium (the option price). There are a wide variety of flavors of
				options (American, European, Asian, etc.) that are useful for different
				purposes: hedging against risk, speculation, etc.

				Much of modern finance is driven by the need to price these contracts
				accurately based on what is known about the properties (such as volatility) of
				the underlying asset. One method of pricing options is to use a Monte Carlo
				simulation of the underlying asset price. In this example we use this approach
				to price both European and Asian (path dependent) options for various strike
				prices and volatilities.
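
Before turning to the actual code, the general technique can be sketched as
follows. This is an illustration only and not the implementation in the
example files: it assumes the asset follows geometric Brownian motion with
constant volatility and risk-free rate, fixes the expiration at one year, uses
an arithmetic average for the Asian payoff, and relies on plain Python loops
so that it needs nothing beyond the standard library. The function name and
all parameter names are hypothetical.

```python
import math
import random


def simulate_option_prices(s0, strike, sigma, r, steps, paths, seed=0):
    """Monte Carlo prices for European and arithmetic-average Asian options.

    Sketch only: geometric Brownian motion, constant sigma and r, and an
    expiration of one year divided into equal time steps.
    """
    rng = random.Random(seed)
    dt = 1.0 / steps
    drift = (r - 0.5 * sigma ** 2) * dt
    vol = sigma * math.sqrt(dt)
    totals = {"ecall": 0.0, "eput": 0.0, "acall": 0.0, "aput": 0.0}
    for _path in range(paths):
        price = s0
        running_sum = 0.0
        for _step in range(steps):
            price *= math.exp(drift + vol * rng.gauss(0.0, 1.0))
            running_sum += price
        average = running_sum / steps
        # European payoffs depend on the final price only.
        totals["ecall"] += max(price - strike, 0.0)
        totals["eput"] += max(strike - price, 0.0)
        # Asian payoffs depend on the average price along the path.
        totals["acall"] += max(average - strike, 0.0)
        totals["aput"] += max(strike - average, 0.0)
    discount = math.exp(-r)  # one-year expiration in this sketch
    return {name: discount * total / paths for name, total in totals.items()}


prices = simulate_option_prices(
    s0=100.0, strike=100.0, sigma=0.25, r=0.05, steps=50, paths=2000
)
for name, value in sorted(prices.items()):
    print(f"{name}: {value:.2f}")
```

Each (strike, volatility) pair requires an independent simulation of this
kind, which is what makes the problem easy to distribute.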

				The code for this example can be found in the :file:`docs/examples/kernel`
				directory of the IPython source. The function :func:`price_options` in
				:file:`mcpricer.py` implements the basic Monte Carlo pricing algorithm using
				the NumPy package and is shown here:

				.. literalinclude:: ../../examples/kernel/mcpricer.py
				:language: python

				To run this code in parallel, we will use IPython's :class:`TaskClient` class,
				which distributes work to the engines using dynamic load balancing. This
				client can be used alongside the :class:`MultiEngineClient` class shown in
				the previous example. The parallel calculation using :class:`TaskClient` can
				be found in the file :file:`mcdriver.py`. The code in this file creates a
				:class:`TaskClient` instance and then submits a set of tasks using
				:meth:`TaskClient.run` that calculate the option prices for different
				volatilities and strike prices. The results are then plotted as a 2D contour
				plot using Matplotlib.

				.. literalinclude:: ../../examples/kernel/mcdriver.py
				:language: python
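
The task-farming pattern itself (one task per pair of strike price and
volatility, picked up by whichever worker is free) can be sketched without a
cluster. In the sketch below, ``concurrent.futures`` from the standard library
is only a stand-in for :class:`TaskClient`, and the closed-form Black-Scholes
price of a European call serves as a cheap placeholder for the Monte Carlo
task. All names are hypothetical and none of this is taken from
:file:`mcdriver.py`.

```python
import math
from concurrent.futures import ThreadPoolExecutor, as_completed


def black_scholes_call(s0, strike, sigma, r=0.05, t=1.0):
    """Closed-form European call price, used here as a cheap stand-in task."""
    sqrt_t = math.sqrt(t)
    d1 = (math.log(s0 / strike) + (r + 0.5 * sigma ** 2) * t) / (sigma * sqrt_t)
    d2 = d1 - sigma * sqrt_t

    def cdf(x):
        # Standard normal cumulative distribution function.
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    return s0 * cdf(d1) - strike * math.exp(-r * t) * cdf(d2)


strike_vals = [90.0, 100.0, 110.0]
sigma_vals = [0.1, 0.25, 0.4]

# One task per (strike, volatility) pair; idle workers pick up the next task.
prices = [[None] * len(sigma_vals) for _ in strike_vals]
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {
        pool.submit(black_scholes_call, 100.0, strike, sigma): (i, j)
        for i, strike in enumerate(strike_vals)
        for j, sigma in enumerate(sigma_vals)
    }
    # Collect results in completion order and place them back on the grid.
    for future in as_completed(futures):
        i, j = futures[future]
        prices[i][j] = future.result()
```

Keeping a mapping from each submitted task back to its grid position is what
lets results arrive in any order while still ending up in the right cell.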

				To use this code, start an IPython cluster using :command:`ipcluster`, open
				IPython in the pylab mode with the file :file:`mcdriver.py` in your current
				working directory and then type:

				.. sourcecode:: ipython

				In [7]: run mcdriver.py
				Submitted tasks: [0, 1, 2, ...]

				Once all the tasks have finished, the results can be plotted using the
				:func:`plot_options` function. Here we make contour plots of the Asian
				call and Asian put options as a function of the volatility and strike price:

				.. sourcecode:: ipython

				In [8]: plot_options(sigma_vals, K_vals, prices['acall'])

				In [9]: plt.figure()
				Out[9]: <matplotlib.figure.Figure object at 0x18c178d0>

				In [10]: plot_options(sigma_vals, K_vals, prices['aput'])

				These results are shown in the two figures below. On an 8-core cluster the
				entire calculation (10 strike prices, 10 volatilities, 100,000 paths for each)
				took 30 seconds in parallel, giving a speedup of 7.7x, which is comparable
				to the speedup observed in our previous example.

				.. image:: asian_call.*

				.. image:: asian_put.*

				Conclusion
				==========

				To conclude these examples, we summarize the key features of IPython's
				parallel architecture that have been demonstrated:

				* Serial code can often be parallelized with only a few extra lines of code.
				We have used the :class:`MultiEngineClient` and :class:`TaskClient` classes
				for this purpose.
				* The resulting parallel code can be run without ever leaving IPython's
				interactive shell.
				* Any data computed in parallel can be explored interactively through
				visualization or further numerical calculations.
				* We have run these examples on a cluster running Windows HPC Server 2008.
				IPython's built in support for the Windows HPC job scheduler makes it
				easy to get started with IPython's parallel capabilities.
