upstream/ipython Files · docs/source/parallel/parallel_db.rst

move doc files to .rst so they render on github

Paul Ivanov - - Load All Authors

File last commit:

r11729:5cc34183


                r11729:5cc34183

Download file

             parallel_db.rst
        
                    159 lines
            
             | 5.8 KiB
            
                | text/x-rst
            
             |
                RstLexer
            
             / docs / source / parallel / parallel_db.rst
          
                    History
                
                 |
                  Source
                 | Raw
                 |Copy content
                 |Copy permalink

        MinRK
    
add db,resubmit/retries docs

              r3876
            
      .. _parallel_db:

      =======================

      IPython's Task Database

      =======================

        MinRK
    
update db docs with new NoDB default

              r7510
            
      Enabling a DB Backend

      =====================

      The IPython Hub can store all task requests and results in a database. 

      Currently supported backends are: MongoDB, SQLite, and an in-memory DictDB.

      This database behavior is optional due to its potential :ref:`db_cost`,

      so you must enable one, either at the command-line::

          $> ipcontroller --dictb # or --mongodb or --sqlitedb

      or in your :file:`ipcontroller_config.py`:

      .. sourcecode:: python

          c.HubFactory.db_class = "DictDB"

          c.HubFactory.db_class = "MongoDB"

          c.HubFactory.db_class = "SQLiteDB"

      Using the Task Database

      =======================

      The most common use case for this is clients requesting results for tasks they did not submit, via:

        MinRK
    
add db,resubmit/retries docs

              r3876
            
      .. sourcecode:: ipython

          In [1]: rc.get_result(task_id)

        MinRK
    
update db docs with new NoDB default

              r7510
            
      However, since we have this DB backend, we provide a direct query method in the :class:`~.Client`

        MinRK
    
add db,resubmit/retries docs

              r3876
            
      for users who want deeper introspection into their task history. The :meth:`db_query` method of

      the Client is modeled after MongoDB queries, so if you have used MongoDB it should look

        MinRK
    
update db docs with new NoDB default

              r7510
            
      familiar.  In fact, when the MongoDB backend is in use, the query is relayed directly.  

      When using other backends, the interface is emulated and only a subset of queries is possible.

        MinRK
    
add db,resubmit/retries docs

              r3876
            
      .. seealso::

          MongoDB query docs: http://www.mongodb.org/display/DOCS/Querying

      :meth:`Client.db_query` takes a dictionary query object, with keys from the TaskRecord key list,

      and values of either exact values to test, or MongoDB queries, which are dicts of The form:

      ``{'operator' : 'argument(s)'}``. There is also an optional `keys` argument, that specifies

      which subset of keys should be retrieved. The default is to retrieve all keys excluding the

      request and result buffers. :meth:`db_query` returns a list of TaskRecord dicts. Also like

      MongoDB, the `msg_id` key will always be included, whether requested or not.

      TaskRecord keys:

      =============== =============== =============

      Key             Type            Description

      =============== =============== =============

        MinRK
    
resubmitted tasks are now wholly separate (new msg_ids)...

              r6817
            
      msg_id          uuid(ascii)     The msg ID

        MinRK
    
add db,resubmit/retries docs

              r3876
            
      header          dict            The request header

      content         dict            The request content (likely empty)

      buffers         list(bytes)     buffers containing serialized request objects

      submitted       datetime        timestamp for time of submission (set by client)

        MinRK
    
update db docs with new NoDB default

              r7510
            
      client_uuid     uuid(ascii)     IDENT of client's socket

      engine_uuid     uuid(ascii)     IDENT of engine's socket

        MinRK
    
add db,resubmit/retries docs

              r3876
            
      started         datetime        time task began execution on engine

      completed       datetime        time task finished execution (success or failure) on engine

        MinRK
    
resubmitted tasks are now wholly separate (new msg_ids)...

              r6817
            
      resubmitted     uuid(ascii)     msg_id of resubmitted task (if applicable)

        MinRK
    
add db,resubmit/retries docs

              r3876
            
      result_header   dict            header for result

      result_content  dict            content for result

      result_buffers  list(bytes)     buffers containing serialized request objects

        MinRK
    
update db docs with new NoDB default

              r7510
            
      queue           str             The name of the queue for the task ('mux' or 'task')

      pyin            str             Python input source

      pyout           dict            Python output (pyout message content)

      pyerr           dict            Python traceback (pyerr message content)

        MinRK
    
add db,resubmit/retries docs

              r3876
            
      stdout          str             Stream of stdout data

      stderr          str             Stream of stderr data

      =============== =============== =============

      MongoDB operators we emulate on all backends:

      ==========  =================

      Operator    Python equivalent

      ==========  =================

        '$in'       in

        '$nin'      not in

        '$eq'       ==

        '$ne'       !=

        '$ge'       >

        '$gte'      >=

        '$le'       <

        '$lte'      <=

      ==========  =================

      The DB Query is useful for two primary cases:

      1. deep polling of task status or metadata

      2. selecting a subset of tasks, on which to perform a later operation (e.g. wait on result, purge records, resubmit,...)

        MinRK
    
update db docs with new NoDB default

              r7510
            
        MinRK
    
add db,resubmit/retries docs

              r3876
            
      Example Queries

      ===============

      To get all msg_ids that are not completed, only retrieving their ID and start time:

      .. sourcecode:: ipython

        MinRK
    
update db docs with new NoDB default

              r7510
            
          In [1]: incomplete = rc.db_query({'completed' : None}, keys=['msg_id', 'started'])

        MinRK
    
add db,resubmit/retries docs

              r3876
            
      All jobs started in the last hour by me:

      .. sourcecode:: ipython

          In [1]: from datetime import datetime, timedelta

          In [2]: hourago = datetime.now() - timedelta(1./24)

          In [3]: recent = rc.db_query({'started' : {'$gte' : hourago },

                                          'client_uuid' : rc.session.session})

      All jobs started more than an hour ago, by clients *other than me*:

      .. sourcecode:: ipython

          In [3]: recent = rc.db_query({'started' : {'$le' : hourago },

                                          'client_uuid' : {'$ne' : rc.session.session}})

      Result headers for all jobs on engine 3 or 4:

      .. sourcecode:: ipython

          In [1]: uuids = map(rc._engines.get, (3,4))

          In [2]: hist34 = rc.db_query({'engine_uuid' : {'$in' : uuids }, keys='result_header')

        MinRK
    
add NoDB for non-recording Hub...

              r5892
            
        MinRK
    
update db docs with new NoDB default

              r7510
            
      .. _db_cost:

        MinRK
    
add NoDB for non-recording Hub...

              r5892
            
      Cost

      ====

      The advantage of the database backends is, of course, that large amounts of

        MinRK
    
update db docs with new NoDB default

              r7510
            
      data can be stored that won't fit in memory.  The basic DictDB 'backend' is actually

        MinRK
    
add NoDB for non-recording Hub...

              r5892
            
      to just store all of this information in a Python dictionary.  This is very fast,

      but will run out of memory quickly if you move a lot of data around, or your

      cluster is to run for a long time.

      Unfortunately, the DB backends (SQLite and MongoDB) right now are rather slow,

      and can still consume large amounts of resources, particularly if large tasks

      or results are being created at a high frequency.

      For this reason, we have added :class:`~.NoDB`,a dummy backend that doesn't

      actually store any information. When you use this database, nothing is stored,

      and any request for results will result in a KeyError.  This obviously prevents

      later requests for results and task resubmission from functioning, but

      sometimes those nice features are not as useful as keeping Hub memory under

      control.

	Site-wide shortcuts
/	Use quick search box
g h	Goto home page
g g	Goto my private gists page
g G	Goto my public gists page
g 0-9	Goto bookmarked items from 0-9
n r	New repository page
n g	New gist page

	Repositories
g s	Goto summary page
g c	Goto changelog page
g f	Goto files page
g F	Goto files page with file search activated
g p	Goto pull requests page
g o	Goto repository settings
g O	Goto repository access permissions settings
t s	Toggle sidebar on some pages

MinRK add db,resubmit/retries docs	r3876	.. _parallel_db:

		=======================
		IPython's Task Database
		=======================

MinRK update db docs with new NoDB default	r7510	Enabling a DB Backend
		=====================

		The IPython Hub can store all task requests and results in a database.
		Currently supported backends are: MongoDB, SQLite, and an in-memory DictDB.

		This database behavior is optional due to its potential :ref:`db_cost`,
		so you must enable one, either at the command-line::

		$> ipcontroller --dictb # or --mongodb or --sqlitedb

		or in your :file:`ipcontroller_config.py`:

		.. sourcecode:: python

		c.HubFactory.db_class = "DictDB"
		c.HubFactory.db_class = "MongoDB"
		c.HubFactory.db_class = "SQLiteDB"


		Using the Task Database
		=======================

		The most common use case for this is clients requesting results for tasks they did not submit, via:
MinRK add db,resubmit/retries docs	r3876
		.. sourcecode:: ipython

		In [1]: rc.get_result(task_id)

MinRK update db docs with new NoDB default	r7510	However, since we have this DB backend, we provide a direct query method in the :class:`~.Client`
MinRK add db,resubmit/retries docs	r3876	for users who want deeper introspection into their task history. The :meth:`db_query` method of
		the Client is modeled after MongoDB queries, so if you have used MongoDB it should look
MinRK update db docs with new NoDB default	r7510	familiar. In fact, when the MongoDB backend is in use, the query is relayed directly.
		When using other backends, the interface is emulated and only a subset of queries is possible.
MinRK add db,resubmit/retries docs	r3876
		.. seealso::

		MongoDB query docs: http://www.mongodb.org/display/DOCS/Querying

		:meth:`Client.db_query` takes a dictionary query object, with keys from the TaskRecord key list,
		and values of either exact values to test, or MongoDB queries, which are dicts of The form:
		``{'operator' : 'argument(s)'}``. There is also an optional `keys` argument, that specifies
		which subset of keys should be retrieved. The default is to retrieve all keys excluding the
		request and result buffers. :meth:`db_query` returns a list of TaskRecord dicts. Also like
		MongoDB, the `msg_id` key will always be included, whether requested or not.

		TaskRecord keys:

		=============== =============== =============
		Key Type Description
		=============== =============== =============
MinRK resubmitted tasks are now wholly separate (new msg_ids)...	r6817	msg_id uuid(ascii) The msg ID
MinRK add db,resubmit/retries docs	r3876	header dict The request header
		content dict The request content (likely empty)
		buffers list(bytes) buffers containing serialized request objects
		submitted datetime timestamp for time of submission (set by client)
MinRK update db docs with new NoDB default	r7510	client_uuid uuid(ascii) IDENT of client's socket
		engine_uuid uuid(ascii) IDENT of engine's socket
MinRK add db,resubmit/retries docs	r3876	started datetime time task began execution on engine
		completed datetime time task finished execution (success or failure) on engine
MinRK resubmitted tasks are now wholly separate (new msg_ids)...	r6817	resubmitted uuid(ascii) msg_id of resubmitted task (if applicable)
MinRK add db,resubmit/retries docs	r3876	result_header dict header for result
		result_content dict content for result
		result_buffers list(bytes) buffers containing serialized request objects
MinRK update db docs with new NoDB default	r7510	queue str The name of the queue for the task ('mux' or 'task')
		pyin str Python input source
		pyout dict Python output (pyout message content)
		pyerr dict Python traceback (pyerr message content)
MinRK add db,resubmit/retries docs	r3876	stdout str Stream of stdout data
		stderr str Stream of stderr data

		=============== =============== =============

		MongoDB operators we emulate on all backends:

		========== =================
		Operator Python equivalent
		========== =================
		'$in' in
		'$nin' not in
		'$eq' ==
		'$ne' !=
		'$ge' >
		'$gte' >=
		'$le' <
		'$lte' <=
		========== =================


		The DB Query is useful for two primary cases:

		1. deep polling of task status or metadata
		2. selecting a subset of tasks, on which to perform a later operation (e.g. wait on result, purge records, resubmit,...)

MinRK update db docs with new NoDB default	r7510
MinRK add db,resubmit/retries docs	r3876	Example Queries
		===============

		To get all msg_ids that are not completed, only retrieving their ID and start time:

		.. sourcecode:: ipython

MinRK update db docs with new NoDB default	r7510	In [1]: incomplete = rc.db_query({'completed' : None}, keys=['msg_id', 'started'])
MinRK add db,resubmit/retries docs	r3876
		All jobs started in the last hour by me:

		.. sourcecode:: ipython

		In [1]: from datetime import datetime, timedelta

		In [2]: hourago = datetime.now() - timedelta(1./24)

		In [3]: recent = rc.db_query({'started' : {'$gte' : hourago },
		'client_uuid' : rc.session.session})

		All jobs started more than an hour ago, by clients other than me:

		.. sourcecode:: ipython

		In [3]: recent = rc.db_query({'started' : {'$le' : hourago },
		'client_uuid' : {'$ne' : rc.session.session}})

		Result headers for all jobs on engine 3 or 4:

		.. sourcecode:: ipython

		In [1]: uuids = map(rc._engines.get, (3,4))

		In [2]: hist34 = rc.db_query({'engine_uuid' : {'$in' : uuids }, keys='result_header')
MinRK add NoDB for non-recording Hub...	r5892
MinRK update db docs with new NoDB default	r7510	.. _db_cost:
MinRK add NoDB for non-recording Hub...	r5892
		Cost
		====

		The advantage of the database backends is, of course, that large amounts of
MinRK update db docs with new NoDB default	r7510	data can be stored that won't fit in memory. The basic DictDB 'backend' is actually
MinRK add NoDB for non-recording Hub...	r5892	to just store all of this information in a Python dictionary. This is very fast,
		but will run out of memory quickly if you move a lot of data around, or your
		cluster is to run for a long time.

		Unfortunately, the DB backends (SQLite and MongoDB) right now are rather slow,
		and can still consume large amounts of resources, particularly if large tasks
		or results are being created at a high frequency.

		For this reason, we have added :class:`~.NoDB`,a dummy backend that doesn't
		actually store any information. When you use this database, nothing is stored,
		and any request for results will result in a KeyError. This obviously prevents
		later requests for results and task resubmission from functioning, but
		sometimes those nice features are not as useful as keeping Hub memory under
		control.