upstream/ipython Commit - r5926:d2afff84

add note about examples to head of parallel docs

MinRK -


docs/source/parallel/parallel_demos.txt

0 +2 0

+             .. _parallel_examples:
              =================
              Parallel examples
              =================
              In this section we describe two more involved examples of using an IPython
              cluster to perform a parallel computation. In these examples, we will be using
              IPython's "pylab" mode, which enables interactive plotting using the
              Matplotlib package. IPython can be started in this mode by typing::
                  ipython --pylab
              at the system command line.
              150 million digits of pi
              ========================
              In this example we would like to study the distribution of digits in the
              number pi (in base 10). While it is not known if pi is a normal number (a
              number is normal in base 10 if 0-9 occur with equal likelihood) numerical
              investigations suggest that it is. We will begin with a serial calculation on
              10,000 digits of pi and then perform a parallel calculation involving 150
              million digits.
              In both the serial and parallel calculation we will be using functions defined
              in the :file:`pidigits.py` file, which is available in the
              :file:`docs/examples/parallel` directory of the IPython source distribution.
              These functions provide basic facilities for working with the digits of pi and
              can be loaded into IPython by putting :file:`pidigits.py` in your current
              working directory and then doing:
              .. sourcecode:: ipython
                  In [1]: run pidigits.py
              Serial calculation
              ------------------
              For the serial calculation, we will use `SymPy <http://www.sympy.org>`_ to
              calculate 10,000 digits of pi and then look at the frequencies of the digits
              0-9. Out of 10,000 digits, we expect each digit to occur 1,000 times. While
              SymPy is capable of calculating many more digits of pi, our purpose here is to
              set the stage for the much larger parallel calculation.
              In this example, we use two functions from :file:`pidigits.py`:
              :func:`one_digit_freqs` (which calculates how many times each digit occurs)
              and :func:`plot_one_digit_freqs` (which uses Matplotlib to plot the result).
              Here is an interactive IPython session that uses these functions with
              SymPy:
              .. sourcecode:: ipython
                  In [7]: import sympy
                  In [8]: pi = sympy.pi.evalf(40)
                  In [9]: pi
                  Out[9]: 3.141592653589793238462643383279502884197
                  In [10]: pi = sympy.pi.evalf(10000)
                  In [11]: digits = (d for d in str(pi)[2:])  # create a sequence of digits
                  In [12]: run pidigits.py  # load one_digit_freqs/plot_one_digit_freqs
                  In [13]: freqs = one_digit_freqs(digits)
                  In [14]: plot_one_digit_freqs(freqs)
                  Out[14]: [<matplotlib.lines.Line2D object at 0x18a55290>]
              The resulting plot of the single digit counts shows that each digit occurs
              approximately 1,000 times, but that with only 10,000 digits the
              statistical fluctuations are still rather large:
              .. image:: figs/single_digits.*
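The counting step itself is simple. The following is a minimal sketch of what a function like :func:`one_digit_freqs` might do; it is not the code from :file:`pidigits.py`, and the name ``count_single_digits`` is hypothetical. The input is the 39 decimal digits printed in the session above.

```python
def count_single_digits(digits):
    """Return a list where entry d is how often digit d occurs."""
    counts = [0] * 10
    for d in digits:
        counts[int(d)] += 1
    return counts


# Decimal digits of pi as printed by sympy.pi.evalf(40) above.
first_digits = "141592653589793238462643383279502884197"
counts = count_single_digits(first_digits)

for digit, count in enumerate(counts):
    print(digit, count)
```

With so few digits the counts are far from uniform, which is the same fluctuation effect seen in the plot, only more pronounced.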
              It is clear that to reduce the relative fluctuations in the counts, we need
              to look at many more digits of pi. That brings us to the parallel calculation.
              Parallel calculation
              --------------------
              Calculating many digits of pi is a challenging computational problem in itself.
              Because we want to focus on the distribution of digits in this example, we
              will use pre-computed digits of pi from the website of Professor Yasumasa
              Kanada at the University of Tokyo (http://www.super-computing.org). These
              digits come in a set of text files (ftp://pi.super-computing.org/.2/pi200m/)
              that each have 10 million digits of pi.
              For the parallel calculation, we have copied these files to the local hard
              drives of the compute nodes. A total of 15 of these files will be used, for a
              total of 150 million digits of pi. To make things a little more interesting we
              will calculate the frequencies of all two-digit sequences (00-99) and then plot
              the result using a 2D matrix in Matplotlib.
              The overall idea of the calculation is simple: each IPython engine will
              compute the two digit counts for the digits in a single file. Then in a final
              step the counts from each engine will be added up. To perform this
              calculation, we will need two top-level functions from :file:`pidigits.py`:
              .. literalinclude:: ../../examples/parallel/pi/pidigits.py
                 :language: python
                 :lines: 47-62
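The included file is not reproduced on this page, so here is a rough sketch of the count-then-sum idea. It is an illustration under assumptions, not the contents of :file:`pidigits.py`: the names ``count_digit_pairs`` and ``add_counts`` are hypothetical, pairs are assumed to overlap, and short strings stand in for the digit files.

```python
def count_digit_pairs(digits):
    """Count overlapping two-digit sequences; index 10*a + b holds pair 'ab'."""
    counts = [0] * 100
    previous = None
    for d in digits:
        current = int(d)
        if previous is not None:
            counts[previous * 10 + current] += 1
        previous = current
    return counts


def add_counts(all_counts):
    """Sum several 100-entry count lists element by element."""
    return [sum(column) for column in zip(*all_counts)]


# Each chunk plays the role of one digit file handled by one engine.
chunks = ["0012", "1200"]
total = add_counts(map(count_digit_pairs, chunks))
```

In this sketch, pairs that straddle the boundary between two chunks are not counted, since each chunk is processed independently.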
              We will also use the :func:`plot_two_digit_freqs` function to plot the
              results. The code to run this calculation in parallel is contained in
              :file:`docs/examples/parallel/parallelpi.py`. This code can be run in parallel
              using IPython by following these steps:
              1. Use :command:`ipcluster` to start 15 engines. We used 16 cores of an SGE Linux
                 cluster (1 controller + 15 engines).
              2. With the file :file:`parallelpi.py` in your current working directory, open
                 up IPython in pylab mode and type ``run parallelpi.py``.  This will download
                 the pi files via FTP the first time you run it, if they are not
                 present in the Engines' working directory.
              When run on our 16 cores, we observe a speedup of 14.2x. This is slightly
              less than linear scaling (16x) because the controller is also running on one of
              the cores.
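To put the speedup in context, it can help to express it as a parallel efficiency (speedup divided by the number of workers). The short sketch below uses only the figures quoted above; the helper name ``parallel_efficiency`` is hypothetical.

```python
def parallel_efficiency(speedup, workers):
    """Fraction of ideal linear scaling achieved."""
    return speedup / workers


observed_speedup = 14.2

# Relative to all 16 cores (controller included) and to the 15 engines alone.
per_core = parallel_efficiency(observed_speedup, 16)
per_engine = parallel_efficiency(observed_speedup, 15)

print("efficiency per core:   %.1f%%" % (100 * per_core))
print("efficiency per engine: %.1f%%" % (100 * per_engine))
```

Measured against the 15 engines that actually do the work, the calculation is closer to linear scaling than the per-core figure suggests.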
              To emphasize the interactive nature of IPython, we now show how the
              calculation can also be run by simply typing the commands from
              :file:`parallelpi.py` interactively into IPython:
              .. sourcecode:: ipython
                  In [1]: from IPython.parallel import Client
                  # The Client allows us to use the engines interactively.
                  # We simply pass Client the name of the cluster profile we
                  # are using.
                  In [2]: c = Client(profile='mycluster')
                  In [3]: v = c[:]
                  In [3]: c.ids
                  Out[3]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
                  In [4]: run pidigits.py
                  In [5]: filestring = 'pi200m.ascii.%(i)02dof20'
                  # Create the list of files to process.
                  In [6]: files = [filestring % {'i':i} for i in range(1,16)]
                  In [7]: files
                  Out[7]:
                  ['pi200m.ascii.01of20',
                   'pi200m.ascii.02of20',
                   'pi200m.ascii.03of20',
                   'pi200m.ascii.04of20',
                   'pi200m.ascii.05of20',
                   'pi200m.ascii.06of20',
                   'pi200m.ascii.07of20',
                   'pi200m.ascii.08of20',
                   'pi200m.ascii.09of20',
                   'pi200m.ascii.10of20',
                   'pi200m.ascii.11of20',
                   'pi200m.ascii.12of20',
                   'pi200m.ascii.13of20',
                   'pi200m.ascii.14of20',
                   'pi200m.ascii.15of20']
                  # download the data files if they don't already exist:
                  In [8]: v.map(fetch_pi_file, files)
                  # This is the parallel calculation using the view's map method,
                  # which applies compute_two_digit_freqs to each file in files in parallel.
                  In [9]: freqs_all = v.map(compute_two_digit_freqs, files)
                  # Add up the frequencies from each engine.
                  In [10]: freqs = reduce_freqs(freqs_all)
                  In [11]: plot_two_digit_freqs(freqs)
                  Out[11]: <matplotlib.image.AxesImage object at 0x18beb110>
                  In [12]: plt.title('2 digit counts of 150m digits of pi')
                  Out[12]: <matplotlib.text.Text object at 0x18d1f9b0>
              The resulting plot generated by Matplotlib is shown below. The colors indicate
              which two digit sequences are more (red) or less (blue) likely to occur in the
              first 150 million digits of pi. We clearly see that the sequence "41" is
              most likely and that "06" and "07" are least likely. Further analysis would
              show that the relative size of the statistical fluctuations has decreased
              compared to the 10,000 digit calculation.
              .. image:: figs/two_digit_counts.*
              Parallel options pricing
              ========================
              An option is a financial contract that gives the buyer of the contract the
              right to buy (a "call") or sell (a "put") a secondary asset (a stock for
              example) at a particular date in the future (the expiration date) for a
              pre-agreed upon price (the strike price). For this right, the buyer pays the
              seller a premium (the option price). There are a wide variety of flavors of
              options (American, European, Asian, etc.) that are useful for different
              purposes: hedging against risk, speculation, etc.
              Much of modern finance is driven by the need to price these contracts
              accurately based on what is known about the properties (such as volatility) of
              the underlying asset. One method of pricing options is to use a Monte Carlo
              simulation of the underlying asset price. In this example we use this approach
              to price both European and Asian (path dependent) options for various strike
              prices and volatilities.
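As a rough illustration of the Monte Carlo approach, the sketch below prices a European and an Asian (arithmetic average) call using only the Python standard library. It is not the code from :file:`mckernel.py`: the function name ``mc_call_prices``, the geometric Brownian motion model, the one-year horizon and all default parameter values are assumptions made for this example.

```python
import math
import random


def mc_call_prices(s0=100.0, strike=100.0, sigma=0.25, r=0.05,
                   steps=50, paths=2000, seed=0):
    """Return (european_call, asian_call) estimated by Monte Carlo.

    Assumes geometric Brownian motion over a one-year horizon.
    """
    rng = random.Random(seed)
    dt = 1.0 / steps
    drift = (r - 0.5 * sigma ** 2) * dt
    vol = sigma * math.sqrt(dt)
    discount = math.exp(-r)

    euro_sum = 0.0
    asian_sum = 0.0
    for _ in range(paths):
        price = s0
        running = 0.0
        for _ in range(steps):
            price *= math.exp(drift + vol * rng.gauss(0.0, 1.0))
            running += price
        # European payoff depends only on the final price.
        euro_sum += max(price - strike, 0.0)
        # Asian payoff depends on the average price along the path.
        asian_sum += max(running / steps - strike, 0.0)

    return discount * euro_sum / paths, discount * asian_sum / paths


european, asian = mc_call_prices()
```

Each path is independent of the others, which is what makes this kind of calculation straightforward to distribute across engines.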
              The code for this example can be found in the :file:`docs/examples/parallel/options`
              directory of the IPython source. The function :func:`price_options` in
              :file:`mckernel.py` implements the basic Monte Carlo pricing algorithm using
              the NumPy package and is shown here:
              .. literalinclude:: ../../examples/parallel/options/mckernel.py
                 :language: python
              To run this code in parallel, we will use IPython's :class:`LoadBalancedView` class,
              which distributes work to the engines using dynamic load balancing. This
              view is a wrapper of the :class:`Client` class shown in
              the previous example. The parallel calculation using :class:`LoadBalancedView` can
              be found in the file :file:`mcpricer.py`. The code in this file creates a
              :class:`LoadBalancedView` instance and then submits a set of tasks using
              :meth:`LoadBalancedView.apply` that calculate the option prices for different
              volatilities and strike prices. The results are then plotted as a 2D contour
              plot using Matplotlib.
              .. literalinclude:: ../../examples/parallel/options/mcpricer.py
                 :language: python
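The submit-then-collect pattern described above can be tried without a cluster. The sketch below uses the standard library's ``concurrent.futures.ThreadPoolExecutor`` as a stand-in for :class:`LoadBalancedView`; it is not IPython's API, and ``toy_price`` and ``price_grid`` are hypothetical names with a deliberately trivial pricing formula.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_price(strike, sigma):
    """Hypothetical stand-in for an expensive pricing function."""
    return sigma * 100.0 - 0.1 * strike


def price_grid(strikes, sigmas, workers=4):
    """Submit one task per (strike, sigma) pair and gather the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The pool hands each task to whichever worker is free next.
        futures = {
            (strike, sigma): pool.submit(toy_price, strike, sigma)
            for strike in strikes
            for sigma in sigmas
        }
        return {key: future.result() for key, future in futures.items()}


grid = price_grid([90.0, 100.0, 110.0], [0.1, 0.2])
```

As with the tasks in :file:`mcpricer.py`, the caller does not choose which worker runs which task; it only submits work and waits for the results.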
              To use this code, start an IPython cluster using :command:`ipcluster`, open
              IPython in the pylab mode with the file :file:`mckernel.py` in your current
              working directory and then type:
              .. sourcecode:: ipython
                  In [7]: run mcpricer.py
                  Submitted tasks:  30
              Once all the tasks have finished, the results can be plotted using the
              :func:`plot_options` function. Here we make contour plots of the Asian
              call and Asian put options as a function of the volatility and strike price:
              .. sourcecode:: ipython
                  In [8]: plot_options(sigma_vals, strike_vals, prices['acall'])
                  In [9]: plt.figure()
                  Out[9]: <matplotlib.figure.Figure object at 0x18c178d0>
                  In [10]: plot_options(sigma_vals, strike_vals, prices['aput'])
              These results are shown in the two figures below. On our 15 engines, the
              entire calculation (15 strike prices, 15 volatilities, 100,000 paths for each)
              took 37 seconds in parallel, giving a speedup of 14.1x, which is comparable
              to the speedup observed in our previous example.
              .. image:: figs/asian_call.*
              .. image:: figs/asian_put.*
              Conclusion
              ==========
              To conclude these examples, we summarize the key features of IPython's
              parallel architecture that have been demonstrated:
              * Serial code can often be parallelized with only a few extra lines of code.
                We have used the :class:`DirectView` and :class:`LoadBalancedView` classes
                for this purpose.
              * The resulting parallel code can be run without ever leaving IPython's
                interactive shell.
              * Any data computed in parallel can be explored interactively through
                visualization or further numerical calculations.
              * We have run these examples on a cluster running RHEL 5 and Sun GridEngine.
                IPython's built-in support for SGE (and other batch systems) makes it easy
                to get started with IPython's parallel capabilities.

docs/source/parallel/parallel_intro.txt

0 +11 0

              .. _parallel_overview:
              ============================
              Overview and getting started
              ============================
+             Examples
+             ========
+             We have various example scripts and notebooks for using IPython.parallel in our
+             :file:`docs/examples/parallel` directory, or they can be found `on GitHub`__.
+             Some of these are covered in more detail in the :ref:`examples
+             <parallel_examples>` section.
+             .. __: https://github.com/ipython/ipython/tree/master/docs/examples/parallel
              Introduction
              ============
              This section gives an overview of IPython's sophisticated and powerful
              architecture for parallel and distributed computing. This architecture
              abstracts out parallelism in a very general way, which enables IPython to
              support many different styles of parallelism including:
              * Single program, multiple data (SPMD) parallelism.
              * Multiple program, multiple data (MPMD) parallelism.
              * Message passing using MPI.
              * Task farming.
              * Data parallel.
              * Combinations of these approaches.
              * Custom user defined approaches.
              Most importantly, IPython enables all types of parallel applications to
              be developed, executed, debugged and monitored *interactively*. Hence,
              the ``I`` in IPython.  The following are some example use cases for IPython:
              * Quickly parallelize algorithms that are embarrassingly parallel
                using a number of simple approaches.  Many simple things can be
                parallelized interactively in one or two lines of code.
              * Steer traditional MPI applications on a supercomputer from an
                IPython session on your laptop.
              * Analyze and visualize large datasets (that could be remote and/or
                distributed) interactively using IPython and tools like
                matplotlib/TVTK.
              * Develop, test and debug new parallel algorithms
                (that may use MPI) interactively.
              * Tie together multiple MPI jobs running on different systems into
                one giant distributed and parallel system.
              * Start a parallel job on your cluster and then have a remote
                collaborator connect to it and pull back data into their
                local IPython session for plotting and analysis.
              * Run a set of tasks on a set of CPUs using dynamic load balancing.
              .. tip::
                 At the SciPy 2011 conference in Austin, Min Ragan-Kelley presented a
                 complete 4-hour tutorial on the use of these features, and all the materials
                 for the tutorial are now `available online`__.  That tutorial provides an
                 excellent, hands-on oriented complement to the reference documentation
                 presented here.
              .. __: http://minrk.github.com/scipy-tutorial-2011
              Architecture overview
              =====================
              .. figure:: figs/wideView.png
                  :width: 300px
              The IPython architecture consists of four components:
              * The IPython engine.
              * The IPython hub.
              * The IPython schedulers.
              * The controller client.
              These components live in the :mod:`IPython.parallel` package and are
              installed with IPython.  They do, however, have additional dependencies
              that must be installed.  For more information, see our
              :ref:`installation documentation <install_index>`.
              .. TODO: include zmq in install_index
              IPython engine
              ---------------
              The IPython engine is a Python instance that takes Python commands over a
              network connection. Eventually, the IPython engine will be a full IPython
              interpreter, but for now, it is a regular Python interpreter. The engine
              can also handle incoming and outgoing Python objects sent over a network
              connection.  When multiple engines are started, parallel and distributed
              computing becomes possible. An important feature of an IPython engine is
              that it blocks while user code is being executed. Read on for how the
              IPython controller solves this problem to expose a clean asynchronous API
              to the user.
              IPython controller
              ------------------
              The IPython controller processes provide an interface for working with a set of engines.
              At a general level, the controller is a collection of processes to which IPython engines
              and clients can connect. The controller is composed of a :class:`Hub` and a collection of
              :class:`Schedulers`. These Schedulers are typically run in separate processes on the
              same machine as the Hub, but they can run anywhere, from local threads to remote machines.
              The controller also provides a single point of contact for users who wish to
              utilize the engines connected to the controller. There are different ways of
              working with a controller. In IPython, all of these models are implemented via
              the :meth:`.View.apply` method, after
              constructing :class:`.View` objects to represent subsets of engines. The two
              primary models for interacting with engines are:
              * A **Direct** interface, where engines are addressed explicitly.
              * A **LoadBalanced** interface, where the Scheduler is trusted with assigning work to
                appropriate engines.
              Advanced users can readily extend the View models to enable other
              styles of parallelism.
              .. note::
                  A single controller and set of engines can be used with multiple models
                  simultaneously. This opens the door for lots of interesting things.
              The Hub
              *******
              The center of an IPython cluster is the Hub. This is the process that keeps
              track of engine connections, schedulers, clients, as well as all task requests and
              results. The primary role of the Hub is to facilitate queries of the cluster state, and
              minimize the necessary information required to establish the many connections involved in
              connecting new clients and engines.
              Schedulers
              **********
              All actions that can be performed on the engine go through a Scheduler. While the engines
              themselves block when user code is run, the schedulers hide that from the user to provide
              a fully asynchronous interface to a set of engines.
              IPython client and views
              ------------------------
              There is one primary object, the :class:`~.parallel.Client`, for connecting to a cluster.
              For each execution model, there is a corresponding :class:`~.parallel.View`. These views
              allow users to interact with a set of engines through the interface. Here are the two default
              views:
              * The :class:`DirectView` class for explicit addressing.
              * The :class:`LoadBalancedView` class for destination-agnostic scheduling.
              Security
              --------
              IPython uses ZeroMQ for networking, which has provided many advantages, but
              one of the drawbacks is its utter lack of security [ZeroMQ]_. By default, no IPython
              connections are encrypted, but open ports only listen on localhost. The only
              source of security for IPython is via ssh-tunnel. IPython supports both shell
              (`openssh`) and `paramiko` based tunnels for connections.  There is a key necessary
              to submit requests, but due to the lack of encryption, it does not provide
              significant security if loopback traffic is compromised.
              In our architecture, the controller is the only process that listens on
              network ports, and is thus the main point of vulnerability. The standard model
              for secure connections is to designate that the controller listen on
              localhost, and use ssh-tunnels to connect clients and/or
              engines.
              To connect and authenticate to the controller an engine or client needs
              some information that the controller has stored in a JSON file.
              Thus, the JSON files need to be copied to a location where
              the clients and engines can find them. Typically, this is the
              :file:`~/.ipython/profile_default/security` directory on the host where the
              client/engine is running (which could be a different host than the controller).
              Once the JSON files are copied over, everything should work fine.
              Currently, there are two JSON files that the controller creates:
              ipcontroller-engine.json
                  This JSON file has the information necessary for an engine to connect
                  to a controller.
              ipcontroller-client.json
                  The client's connection information.  This may not differ from the engine's,
                  but since the controller may listen on different ports for clients and
                  engines, it is stored separately.
              ipcontroller-client.json will look something like this, under default localhost
              circumstances:
              .. sourcecode:: python
                  {
                    "url":"tcp:\/\/127.0.0.1:54424",
                    "exec_key":"a361fe89-92fc-4762-9767-e2f0a05e3130",
                    "ssh":"",
                    "location":"10.19.1.135"
                  }
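A connection file like this is ordinary JSON, so it can be inspected with the standard library. The sketch below parses the example shown above; the helper name ``describe_connection`` is hypothetical, and treating an empty ``ssh`` field as "no SSH server specified" is an assumption based on this example.

```python
import json

# The example connection file from above (note the escaped slashes).
EXAMPLE = r'''
{
  "url":"tcp:\/\/127.0.0.1:54424",
  "exec_key":"a361fe89-92fc-4762-9767-e2f0a05e3130",
  "ssh":"",
  "location":"10.19.1.135"
}
'''


def describe_connection(text):
    """Return (url, ssh_server), with ssh_server None when the field is empty."""
    info = json.loads(text)
    ssh_server = info["ssh"] or None
    return info["url"], ssh_server


url, ssh_server = describe_connection(EXAMPLE)
print(url)
print(ssh_server)
```

The ``\/`` sequences are valid JSON escapes for ``/``, so the parsed URL contains plain slashes.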
              If, however, you are running the controller on a work node on a cluster, you will likely
              need to use ssh tunnels to connect clients from your laptop to it.  You will also
              probably need to instruct the controller to listen for engines coming from other work nodes
              on the cluster.  An example of ipcontroller-client.json, as created by::
                  $> ipcontroller --ip=0.0.0.0 --ssh=login.mycluster.com
              .. sourcecode:: python
                  {
                    "url":"tcp:\/\/*:54424",
                    "exec_key":"a361fe89-92fc-4762-9767-e2f0a05e3130",
                    "ssh":"login.mycluster.com",
                    "location":"10.0.0.2"
                  }
              More details of how these JSON files are used are given below.
              A detailed description of the security model and its implementation in IPython
              can be found :ref:`here <parallelsecurity>`.
              .. warning::
                  Even at its most secure, the Controller listens on ports on localhost, and
                  every time you make a tunnel, you open a localhost port on the connecting
                  machine that points to the Controller. If localhost on the Controller's
                  machine, or the machine of any client or engine, is untrusted, then your
                  Controller is insecure. There is no way around this with ZeroMQ.
              Getting Started
              ===============
              To use IPython for parallel computing, you need to start one instance of the
              controller and one or more instances of the engine. Initially, it is best to
              simply start a controller and engines on a single host using the
              :command:`ipcluster` command. To start a controller and 4 engines on your
              localhost, just do::
                  $ ipcluster start -n 4
              More details about starting the IPython controller and engines can be found
              :ref:`here <parallel_process>`.
              Once you have started the IPython controller and one or more engines, you
              are ready to use the engines to do something useful. To make sure
              everything is working correctly, try the following commands:
              .. sourcecode:: ipython
              	In [1]: from IPython.parallel import Client
              	In [2]: c = Client()
              	In [4]: c.ids
              	Out[4]: set([0, 1, 2, 3])
              	In [5]: c[:].apply_sync(lambda : "Hello, World")
              	Out[5]: [ 'Hello, World', 'Hello, World', 'Hello, World', 'Hello, World' ]
              When a client is created with no arguments, it tries to find the corresponding JSON file
              in the local :file:`~/.ipython/profile_default/security` directory. If you specified a profile,
              you can pass it to the Client.  This should cover most cases:
              .. sourcecode:: ipython
                  In [2]: c = Client(profile='myprofile')
              If you have put the JSON file in a different location or it has a different name, create the
              client like this:
              .. sourcecode:: ipython
                  In [2]: c = Client('/path/to/my/ipcontroller-client.json')
              Remember, a client needs to be able to see the Hub's ports to connect. If the client and
              the Hub are on different machines, you may need to use an ssh server to tunnel access to
              the Hub's machine. In that case, connect with:
              .. sourcecode:: ipython
                  In [2]: c = Client('/path/to/my/ipcontroller-client.json', sshserver='me@myhub.example.com')
              Here 'myhub.example.com' is the hostname or IP address of the machine on
              which the Hub process is running (or another machine that has direct access to the Hub's ports).
              The SSH server may already be specified in ipcontroller-client.json, if the controller was
              given one when it was launched.
              You are now ready to learn more about the :ref:`Direct
              <parallel_multiengine>` and :ref:`LoadBalanced <parallel_task>` interfaces to the
              controller.
              .. [ZeroMQ] ZeroMQ.  http://www.zeromq.org
