.. _parallel_overview:

============================
Overview and getting started
============================

Introduction
============

This section gives an overview of IPython's sophisticated and powerful
architecture for parallel and distributed computing. This architecture
abstracts out parallelism in a very general way, which enables IPython to
support many different styles of parallelism, including:

* Single program, multiple data (SPMD) parallelism.
* Multiple program, multiple data (MPMD) parallelism.
* Message passing using MPI.
* Task farming.
* Data parallel.
* Combinations of these approaches.
* Custom user-defined approaches.

Most importantly, IPython enables all types of parallel applications to
be developed, executed, debugged and monitored *interactively*. Hence,
the ``I`` in IPython. The following are some example use cases for IPython:

* Quickly parallelize algorithms that are embarrassingly parallel
  using a number of simple approaches. Many simple things can be
  parallelized interactively in one or two lines of code.

* Steer traditional MPI applications on a supercomputer from an
  IPython session on your laptop.

* Analyze and visualize large datasets (that could be remote and/or
  distributed) interactively using IPython and tools like
  matplotlib/TVTK.

* Develop, test and debug new parallel algorithms
  (that may use MPI) interactively.

* Tie together multiple MPI jobs running on different systems into
  one giant distributed and parallel system.

* Start a parallel job on your cluster and then have a remote
  collaborator connect to it and pull back data into their
  local IPython session for plotting and analysis.

* Run a set of tasks on a set of CPUs using dynamic load balancing.

.. tip::

   At the SciPy 2011 conference in Austin, Min Ragan-Kelley presented a
   complete 4-hour tutorial on the use of these features, and all the materials
   for the tutorial are now `available online`__. That tutorial provides an
   excellent, hands-on oriented complement to the reference documentation
   presented here.

.. __: http://minrk.github.com/scipy-tutorial-2011

Architecture overview
=====================
.. figure:: figs/wideView.png
   :width: 300px

The IPython architecture consists of four components:

* The IPython engine.
* The IPython hub.
* The IPython schedulers.
* The controller client.

These components live in the :mod:`IPython.parallel` package and are
installed with IPython. They do, however, have additional dependencies
that must be installed. For more information, see our
:ref:`installation documentation <install_index>`.

.. TODO: include zmq in install_index

IPython engine
---------------
The IPython engine is a Python instance that takes Python commands over a
network connection. Eventually, the IPython engine will be a full IPython
interpreter, but for now, it is a regular Python interpreter. The engine
can also handle incoming and outgoing Python objects sent over a network
connection. When multiple engines are started, parallel and distributed
computing becomes possible. An important feature of an IPython engine is
that it blocks while user code is being executed. Read on for how the
IPython controller solves this problem to expose a clean asynchronous API
to the user.

IPython controller
------------------
The IPython controller processes provide an interface for working with a set of engines.
At a general level, the controller is a collection of processes to which IPython engines
and clients can connect. The controller is composed of a :class:`Hub` and a collection of
:class:`Schedulers`. These Schedulers typically run in separate processes on the
same machine as the Hub, but they can run anywhere, from local threads to remote machines.

The controller also provides a single point of contact for users who wish to
utilize the engines connected to the controller. There are different ways of
working with a controller. In IPython, all of these models are implemented via
the :meth:`.View.apply` method, after
constructing :class:`.View` objects to represent subsets of engines. The two
primary models for interacting with engines are:

* A **Direct** interface, where engines are addressed explicitly.
* A **LoadBalanced** interface, where the Scheduler is trusted with assigning work to
  appropriate engines.

Advanced users can readily extend the View models to enable other
styles of parallelism.

.. note::

   A single controller and set of engines can be used with multiple models
   simultaneously. This opens the door for lots of interesting things.

The Hub
*******
The center of an IPython cluster is the Hub. This is the process that keeps
track of engine connections, schedulers, and clients, as well as all task requests and
results. The primary role of the Hub is to facilitate queries of the cluster state and
to minimize the information required to establish the many connections involved in
connecting new clients and engines.

Schedulers
**********
All actions that can be performed on the engine go through a Scheduler. While the engines
themselves block when user code is run, the schedulers hide that from the user to provide
a fully asynchronous interface to a set of engines.
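
As a rough analogy only, the sketch below uses Python's standard ``concurrent.futures`` module, not IPython's schedulers. Each worker blocks while it runs its task, yet the caller gets a handle back immediately and only waits when it asks for a result. The names ``slow_square`` and ``run_async`` are hypothetical and exist only for this illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_square(x):
    """Stand-in for user code: the worker running it is blocked until it returns."""
    time.sleep(0.01)
    return x * x


def run_async(values, n_workers=4):
    """Submit every value without waiting, then gather results in submission order."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # submit() returns a Future right away; it does not wait for the task.
        futures = [pool.submit(slow_square, v) for v in values]
        # result() is the only point where the caller waits.
        return [f.result() for f in futures]


if __name__ == "__main__":
    print(run_async([1, 2, 3, 4]))
```

The pool plays the role of the layer that hides blocking workers from the caller. How IPython's schedulers achieve this over ZeroMQ is not shown here.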

IPython client and views
------------------------
There is one primary object, the :class:`~.parallel.Client`, for connecting to a cluster.
For each execution model, there is a corresponding :class:`~.parallel.View`. These views
allow users to interact with a set of engines through the interface. Here are the two default
views:

* The :class:`DirectView` class for explicit addressing.
* The :class:`LoadBalancedView` class for destination-agnostic scheduling.

Security
--------
IPython uses ZeroMQ for networking, which has provided many advantages, but
one of the drawbacks is its utter lack of security [ZeroMQ]_. By default, no IPython
connections are encrypted, but open ports only listen on localhost. The only
source of security for IPython is SSH tunneling. IPython supports both shell
(`openssh`) and `paramiko` based tunnels for connections. A key is required
to submit requests, but due to the lack of encryption, it does not provide
significant security if loopback traffic is compromised.

In our architecture, the controller is the only process that listens on
network ports, and is thus the main point of vulnerability. The standard model
for secure connections is to have the controller listen on
localhost, and to use SSH tunnels to connect clients and/or
engines.

To connect and authenticate to the controller, an engine or client needs
some information that the controller has stored in a JSON file.
Thus, the JSON files need to be copied to a location where
the clients and engines can find them. Typically, this is the
:file:`~/.ipython/profile_default/security` directory on the host where the
client/engine is running (which could be a different host than the controller).
Once the JSON files are copied over, everything should work fine.

Currently, there are two JSON files that the controller creates:

ipcontroller-engine.json
    This JSON file has the information necessary for an engine to connect
    to a controller.

ipcontroller-client.json
    The client's connection information. This may not differ from the engine's,
    but since the controller may listen on different ports for clients and
    engines, it is stored separately.

ipcontroller-client.json will look something like this, under default localhost
circumstances:

.. sourcecode:: python

    {
      "url":"tcp:\/\/127.0.0.1:54424",
      "exec_key":"a361fe89-92fc-4762-9767-e2f0a05e3130",
      "ssh":"",
      "location":"10.19.1.135"
    }

If, however, you are running the controller on a work node on a cluster, you will likely
need to use ssh tunnels to connect clients from your laptop to it. You will also
probably need to instruct the controller to listen for engines coming from other work nodes
on the cluster. An example of ipcontroller-client.json, as created by::

    $> ipcontroller --ip=0.0.0.0 --ssh=login.mycluster.com

.. sourcecode:: python

    {
      "url":"tcp:\/\/*:54424",
      "exec_key":"a361fe89-92fc-4762-9767-e2f0a05e3130",
      "ssh":"login.mycluster.com",
      "location":"10.0.0.2"
    }

More details of how these JSON files are used are given below.

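
To make the file layout concrete, here is a minimal sketch that reads a connection file of the shape described in this section using only the standard ``json`` module. The helper ``read_connection_info`` is hypothetical. The rule that an explicitly passed server takes precedence over the ``ssh`` field is an assumption made for illustration, not a description of how :class:`Client` resolves it.

```python
import json


def read_connection_info(text, sshserver=None):
    """Parse connection-file text and pick the SSH server to tunnel through, if any."""
    info = json.loads(text)
    # Assumption for illustration: an explicit argument wins; otherwise fall
    # back to the "ssh" field, treating an empty string as "no tunnel".
    info["ssh"] = sshserver or info.get("ssh") or None
    return info


# Sample content copied from the cluster example in this section.
example = r"""
{
  "url":"tcp:\/\/*:54424",
  "exec_key":"a361fe89-92fc-4762-9767-e2f0a05e3130",
  "ssh":"login.mycluster.com",
  "location":"10.0.0.2"
}
"""

info = read_connection_info(example)
print(info["url"])  # the JSON escape \/ decodes to a plain slash
print(info["ssh"])
```

Note that ``\/`` is a valid JSON escape for ``/``, so the decoded ``url`` is an ordinary ``tcp://`` address.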
A detailed description of the security model and its implementation in IPython
can be found :ref:`here <parallelsecurity>`.

.. warning::

   Even at its most secure, the Controller listens on ports on localhost, and
   every time you make a tunnel, you open a localhost port on the connecting
   machine that points to the Controller. If localhost on the Controller's
   machine, or the machine of any client or engine, is untrusted, then your
   Controller is insecure. There is no way around this with ZeroMQ.

Getting Started
===============
To use IPython for parallel computing, you need to start one instance of the
controller and one or more instances of the engine. Initially, it is best to
simply start a controller and engines on a single host using the
:command:`ipcluster` command. To start a controller and 4 engines on your
localhost, just do::

    $ ipcluster start -n 4

More details about starting the IPython controller and engines can be found
:ref:`here <parallel_process>`.

Once you have started the IPython controller and one or more engines, you
are ready to use the engines to do something useful. To make sure
everything is working correctly, try the following commands:

.. sourcecode:: ipython

    In [1]: from IPython.parallel import Client

    In [2]: c = Client()

    In [4]: c.ids
    Out[4]: set([0, 1, 2, 3])

    In [5]: c[:].apply_sync(lambda : "Hello, World")
    Out[5]: [ 'Hello, World', 'Hello, World', 'Hello, World', 'Hello, World' ]

When a client is created with no arguments, the client tries to find the corresponding JSON file
in the local `~/.ipython/profile_default/security` directory. If you started the cluster with a
named profile, pass that profile to the Client. This should cover most cases:

.. sourcecode:: ipython

    In [2]: c = Client(profile='myprofile')

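
The lookup location described above can be sketched with the standard library alone. The helper ``default_connection_file`` is hypothetical, and the ``profile_<name>`` directory pattern for non-default profiles is an assumption generalized from the ``profile_default`` path given in this section.

```python
import os


def default_connection_file(profile="default", ipython_dir="~/.ipython"):
    """Build the expected path of ipcontroller-client.json for a profile."""
    # Assumed layout: <ipython_dir>/profile_<name>/security/
    security_dir = os.path.join(
        os.path.expanduser(ipython_dir), "profile_" + profile, "security"
    )
    return os.path.join(security_dir, "ipcontroller-client.json")


print(default_connection_file())
print(default_connection_file("myprofile"))
```

This only computes a path; it does not check that the file exists or that a controller is running.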
If you have put the JSON file in a different location or it has a different name, create the
client like this:

.. sourcecode:: ipython

    In [2]: c = Client('/path/to/my/ipcontroller-client.json')

Remember, a client needs to be able to see the Hub's ports to connect. If the Hub is on a
different machine, you may need to use an SSH server to tunnel access to that machine.
In that case, connect with:

.. sourcecode:: ipython

    In [2]: c = Client('/path/to/my/ipcontroller-client.json', sshserver='me@myhub.example.com')

Here, 'myhub.example.com' is the hostname or IP address of the machine on
which the Hub process is running (or another machine that has direct access to the Hub's ports).

The SSH server may already be specified in ipcontroller-client.json, if the controller was
given one at launch time.

You are now ready to learn more about the :ref:`Direct
<parallel_multiengine>` and :ref:`LoadBalanced <parallel_task>` interfaces to the
controller.

.. [ZeroMQ] ZeroMQ. http://www.zeromq.org