parallel_db.txt
137 lines
| 5.3 KiB
| text/plain
|
TextLexer
MinRK
|
r3876 | .. _parallel_db: | ||
======================= | ||||
IPython's Task Database | ||||
======================= | ||||
The IPython Hub stores all task requests and results in a database. Currently supported backends | ||||
are: MongoDB, SQLite (the default), and an in-memory DictDB. The most common use case for | ||||
this is clients requesting results for tasks they did not submit, via: | ||||
.. sourcecode:: ipython | ||||
In [1]: rc.get_result(task_id) | ||||
However, since we have this DB backend, we provide a direct query method in the :class:`client` | ||||
for users who want deeper introspection into their task history. The :meth:`db_query` method of | ||||
the Client is modeled after MongoDB queries, so if you have used MongoDB it should look | ||||
familiar. In fact, when the MongoDB backend is in use, the query is relayed directly. However, | ||||
when using other backends, the interface is emulated and only a subset of queries is possible. | ||||
.. seealso:: | ||||
MongoDB query docs: http://www.mongodb.org/display/DOCS/Querying | ||||
:meth:`Client.db_query` takes a dictionary query object, with keys from the TaskRecord key list, | ||||
and values of either exact values to test, or MongoDB queries, which are dicts of The form: | ||||
``{'operator' : 'argument(s)'}``. There is also an optional `keys` argument, that specifies | ||||
which subset of keys should be retrieved. The default is to retrieve all keys excluding the | ||||
request and result buffers. :meth:`db_query` returns a list of TaskRecord dicts. Also like | ||||
MongoDB, the `msg_id` key will always be included, whether requested or not. | ||||
TaskRecord keys: | ||||
=============== =============== ============= | ||||
Key Type Description | ||||
=============== =============== ============= | ||||
msg_id uuid(bytes) The msg ID | ||||
header dict The request header | ||||
content dict The request content (likely empty) | ||||
buffers list(bytes) buffers containing serialized request objects | ||||
submitted datetime timestamp for time of submission (set by client) | ||||
client_uuid uuid(bytes) IDENT of client's socket | ||||
engine_uuid uuid(bytes) IDENT of engine's socket | ||||
started datetime time task began execution on engine | ||||
completed datetime time task finished execution (success or failure) on engine | ||||
resubmitted datetime time of resubmission (if applicable) | ||||
result_header dict header for result | ||||
result_content dict content for result | ||||
result_buffers list(bytes) buffers containing serialized request objects | ||||
queue bytes The name of the queue for the task ('mux' or 'task') | ||||
pyin <unused> Python input (unused) | ||||
pyout <unused> Python output (unused) | ||||
pyerr <unused> Python traceback (unused) | ||||
stdout str Stream of stdout data | ||||
stderr str Stream of stderr data | ||||
=============== =============== ============= | ||||
MongoDB operators we emulate on all backends: | ||||
========== ================= | ||||
Operator Python equivalent | ||||
========== ================= | ||||
'$in' in | ||||
'$nin' not in | ||||
'$eq' == | ||||
'$ne' != | ||||
'$ge' > | ||||
'$gte' >= | ||||
'$le' < | ||||
'$lte' <= | ||||
========== ================= | ||||
The DB Query is useful for two primary cases: | ||||
1. deep polling of task status or metadata | ||||
2. selecting a subset of tasks, on which to perform a later operation (e.g. wait on result, purge records, resubmit,...) | ||||
Example Queries | ||||
=============== | ||||
To get all msg_ids that are not completed, only retrieving their ID and start time: | ||||
.. sourcecode:: ipython | ||||
In [1]: incomplete = rc.db_query({'complete' : None}, keys=['msg_id', 'started']) | ||||
All jobs started in the last hour by me: | ||||
.. sourcecode:: ipython | ||||
In [1]: from datetime import datetime, timedelta | ||||
In [2]: hourago = datetime.now() - timedelta(1./24) | ||||
In [3]: recent = rc.db_query({'started' : {'$gte' : hourago }, | ||||
'client_uuid' : rc.session.session}) | ||||
All jobs started more than an hour ago, by clients *other than me*: | ||||
.. sourcecode:: ipython | ||||
In [3]: recent = rc.db_query({'started' : {'$le' : hourago }, | ||||
'client_uuid' : {'$ne' : rc.session.session}}) | ||||
Result headers for all jobs on engine 3 or 4: | ||||
.. sourcecode:: ipython | ||||
In [1]: uuids = map(rc._engines.get, (3,4)) | ||||
In [2]: hist34 = rc.db_query({'engine_uuid' : {'$in' : uuids }, keys='result_header') | ||||
MinRK
|
r5892 | |||
Cost | ||||
==== | ||||
The advantage of the database backends is, of course, that large amounts of | ||||
data can be stored that won't fit in memory. The default 'backend' is actually | ||||
to just store all of this information in a Python dictionary. This is very fast, | ||||
but will run out of memory quickly if you move a lot of data around, or your | ||||
cluster is to run for a long time. | ||||
Unfortunately, the DB backends (SQLite and MongoDB) right now are rather slow, | ||||
and can still consume large amounts of resources, particularly if large tasks | ||||
or results are being created at a high frequency. | ||||
For this reason, we have added :class:`~.NoDB`,a dummy backend that doesn't | ||||
actually store any information. When you use this database, nothing is stored, | ||||
and any request for results will result in a KeyError. This obviously prevents | ||||
later requests for results and task resubmission from functioning, but | ||||
sometimes those nice features are not as useful as keeping Hub memory under | ||||
control. | ||||