================
python-zstandard
================

This project provides Python bindings for interfacing with the
`Zstandard <http://www.zstd.net>`_ compression library. A C extension
and CFFI interface are provided.

The primary goal of the project is to expose the underlying C API through
a rich, Pythonic interface without sacrificing performance. This means
exposing most of the features and flexibility of the C API while keeping
the usability and safety that Python provides.

The canonical home for this project is
https://github.com/indygreg/python-zstandard.

|  |ci-status| |win-ci-status|

State of Project
================

The project is officially in beta state. The author is reasonably satisfied
with the current API and that functionality works as advertised. There
may be some backwards incompatible changes before 1.0, though the author
does not intend to make any major changes to the Python API.

This project is vendored and distributed with Mercurial 4.1, where it is
used in a production capacity.

There is continuous integration for Python versions 2.6, 2.7, and 3.3+
on Linux x86_64 and Windows x86 and x86_64. The author is reasonably
confident the extension is stable and works as advertised on these
platforms.

Expected Changes
----------------

The author is reasonably confident in the current state of what's
implemented on the ``ZstdCompressor`` and ``ZstdDecompressor`` types.
Those APIs likely won't change significantly. Some low-level behavior
(such as naming and types expected by arguments) may change.

There will likely be arguments added to control the input and output
buffer sizes (currently, certain operations read and write in chunk
sizes using zstd's preferred defaults).

There should be an API that accepts an object that conforms to the buffer
interface and returns an iterator over compressed or decompressed output.

The author is on the fence as to whether to support the extremely
low level compression and decompression APIs. It could be useful to
support compression without the framing headers. But the author doesn't
believe it a high priority at this time.

The CFFI bindings are feature complete and all tests run against both
the C extension and CFFI bindings to ensure behavior parity.

Requirements
============

This extension is designed to run with Python 2.6, 2.7, 3.3, 3.4, 3.5, and
3.6 on common platforms (Linux, Windows, and OS X). Only x86_64 is
currently well-tested as an architecture.

Installing
==========

This package is uploaded to PyPI at https://pypi.python.org/pypi/zstandard.
So, to install this package::

    $ pip install zstandard

Binary wheels are made available for some platforms. If you need to
install from a source distribution, all you should need is a working C
compiler and the Python development headers/libraries. On many Linux
distributions, you can install a ``python-dev`` or ``python-devel``
package to provide these dependencies.

Packages are also uploaded to Anaconda Cloud at
https://anaconda.org/indygreg/zstandard. See that URL for how to install
this package with ``conda``.

Performance
===========

Very crude and non-scientific benchmarking (most benchmarks fall in this
category because proper benchmarking is hard) shows that the Python bindings
perform within 10% of the native C implementation.

The following table compares the performance of compressing and decompressing
a 1.1 GB tar file comprised of the files in a Firefox source checkout. Values
obtained with the ``zstd`` program are on the left. The remaining columns detail
performance of various compression APIs in the Python bindings.

+-------+-----------------+-----------------+-----------------+---------------+
| Level | Native          | Simple          | Stream In       | Stream Out    |
|       | Comp / Decomp   | Comp / Decomp   | Comp / Decomp   | Comp          |
+=======+=================+=================+=================+===============+
|   1   | 490 / 1338 MB/s | 458 / 1266 MB/s | 407 / 1156 MB/s |      405 MB/s |
+-------+-----------------+-----------------+-----------------+---------------+
|   2   | 412 / 1288 MB/s | 381 / 1203 MB/s | 345 / 1128 MB/s |      349 MB/s |
+-------+-----------------+-----------------+-----------------+---------------+
|   3   | 342 / 1312 MB/s | 319 / 1182 MB/s | 285 / 1165 MB/s |      287 MB/s |
+-------+-----------------+-----------------+-----------------+---------------+
|  11   |  64 / 1506 MB/s |  66 / 1436 MB/s |  56 / 1342 MB/s |       57 MB/s |
+-------+-----------------+-----------------+-----------------+---------------+

Again, these numbers are very unscientific. But they show that Python is
capable of compressing at several hundred MB/s and decompressing at over 1 GB/s.
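
As a rough check of the figures in the table, the sketch below computes each
Python API's throughput as a fraction of the native ``zstd`` program. The
numbers are copied from the table; the helper names (``relative_throughput``,
``worst_ratio``) are made up for this illustration. Read this way, the simple
API stays within about 10% of native, while the streaming input API trails
somewhat further behind.

```python
# Throughput figures (MB/s) copied from the table above.
# Each entry is (compress, decompress) for that compression level.
NATIVE = {1: (490, 1338), 2: (412, 1288), 3: (342, 1312), 11: (64, 1506)}
SIMPLE = {1: (458, 1266), 2: (381, 1203), 3: (319, 1182), 11: (66, 1436)}
STREAM_IN = {1: (407, 1156), 2: (345, 1128), 3: (285, 1165), 11: (56, 1342)}


def relative_throughput(api, baseline=NATIVE):
    """Return {level: (comp_ratio, decomp_ratio)} of ``api`` versus ``baseline``."""
    return dict(
        (level, (float(api[level][0]) / baseline[level][0],
                 float(api[level][1]) / baseline[level][1]))
        for level in baseline
    )


def worst_ratio(ratios):
    """Smallest ratio across all levels and both directions."""
    return min(min(pair) for pair in ratios.values())


simple_worst = worst_ratio(relative_throughput(SIMPLE))
stream_worst = worst_ratio(relative_throughput(STREAM_IN))
print('simple API worst case: %.1f%% of native' % (simple_worst * 100))
print('stream in worst case: %.1f%% of native' % (stream_worst * 100))
```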

Comparison to Other Python Bindings
===================================

https://pypi.python.org/pypi/zstd is an alternate Python binding to
Zstandard. At the time this was written, the latest release of that
package (1.1.2) only exposed the simple APIs for compression and decompression.
This package exposes much more of the zstd API, including streaming and
dictionary compression. This package also has CFFI support.

Bundling of Zstandard Source Code
=================================

The source repository for this project contains a vendored copy of the
Zstandard source code. This is done for a few reasons.

First, Zstandard is relatively new and not yet widely available as a system
package. Providing a copy of the source code enables the Python C extension
to be compiled without requiring the user to obtain the Zstandard source code
separately.

Second, Zstandard has both a stable *public* API and an *experimental* API.
The *experimental* API is actually quite useful (contains functionality for
training dictionaries for example), so it is something we wish to expose to
Python. However, the *experimental* API is only available via static linking.
Furthermore, the *experimental* API can change at any time. So, control over
the exact version of the Zstandard library linked against is important to
ensure known behavior.

Instructions for Building and Testing
=====================================

Once you have the source code, the extension can be built via setup.py::

    $ python setup.py build_ext

We recommend testing with ``nose``::

    $ nosetests

A Tox configuration is present to test against multiple Python versions::

    $ tox

Tests use the ``hypothesis`` Python package to perform fuzzing. If you
don't have it, those tests won't run.

There is also an experimental CFFI module. You need the ``cffi`` Python
package installed to build and test that.

To create a virtualenv with all development dependencies, do something
like the following::

    # Python 2
    $ virtualenv venv

    # Python 3
    $ python3 -m venv venv

    $ source venv/bin/activate
    $ pip install cffi hypothesis nose tox

API
===

The compiled C extension provides a ``zstd`` Python module. This module
exposes the following interfaces.

ZstdCompressor
--------------

The ``ZstdCompressor`` class provides an interface for performing
compression operations.

Each instance is associated with parameters that control compression
behavior. These come from the following named arguments (all optional):

level
   Integer compression level. Valid values are between 1 and 22.
dict_data
   Compression dictionary to use.

   Note: When using dictionary data and ``compress()`` is called multiple
   times, the ``CompressionParameters`` derived from an integer compression
   ``level`` and the first compressed data's size will be reused for all
   subsequent operations. This may not be desirable if source data size
   varies significantly.
compression_params
   A ``CompressionParameters`` instance (overrides the ``level`` value).
write_checksum
   Whether a 4 byte checksum should be written with the compressed data.
   Defaults to False. If True, the decompressor can verify that decompressed
   data matches the original input data.
write_content_size
   Whether the size of the uncompressed data will be written into the
   header of compressed data. Defaults to False. The data will only be
   written if the compressor knows the size of the input data. This is
   likely not true for streaming compression.
write_dict_id
   Whether to write the dictionary ID into the compressed data.
   Defaults to True. The dictionary ID is only written if a dictionary
   is being used.

Unless specified otherwise, assume that no two methods of ``ZstdCompressor``
instances can be called from multiple Python threads simultaneously. In other
words, assume instances are not thread safe unless stated otherwise.
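
One way to respect this constraint is to give each thread its own instance.
The following is a minimal sketch using only ``threading.local`` from the
standard library. ``get_thread_compressor`` and its ``factory`` argument are
hypothetical names for this illustration and are not part of this package.

```python
import threading

# Per-thread storage: each thread sees its own attributes on this object.
_local = threading.local()


def get_thread_compressor(factory):
    """Return a compressor private to the calling thread.

    ``factory`` is a zero-argument callable that builds a new compressor,
    for example ``lambda: zstd.ZstdCompressor(level=3)``.
    """
    cctx = getattr(_local, 'cctx', None)
    if cctx is None:
        # First call on this thread: build and remember an instance.
        cctx = _local.cctx = factory()
    return cctx
```

Each thread then calls ``get_thread_compressor()`` instead of sharing one
global instance, so no two threads ever touch the same object.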

Simple API
^^^^^^^^^^

``compress(data)`` compresses and returns data as a one-shot operation::

    cctx = zstd.ZstdCompressor()
    compressed = cctx.compress(b'data to compress')

Unless ``compression_params`` or ``dict_data`` are passed to the
``ZstdCompressor``, each invocation of ``compress()`` will calculate the
optimal compression parameters for the configured compression ``level`` and
input data size (some parameters are fine-tuned for small input sizes).

If a compression dictionary is being used, the compression parameters
determined from the first input's size will be reused for subsequent
operations.

There is currently a deficiency in zstd's C APIs that makes it difficult
to round trip empty inputs when ``write_content_size=True``. Attempting
this will raise a ``ValueError`` unless ``allow_empty=True`` is passed
to ``compress()``.

Streaming Input API
^^^^^^^^^^^^^^^^^^^

``write_to(fh)`` (which behaves as a context manager) allows you to *stream*
data into a compressor::

    cctx = zstd.ZstdCompressor(level=10)
    with cctx.write_to(fh) as compressor:
        compressor.write(b'chunk 0')
        compressor.write(b'chunk 1')
        ...

The argument to ``write_to()`` must have a ``write(data)`` method. As
compressed data is available, ``write()`` will be called with the compressed
data as its argument. Many common Python types implement ``write()``, including
open file handles and ``io.BytesIO``.

``write_to()`` returns an object representing a streaming compressor instance.
It **must** be used as a context manager. That object's ``write(data)`` method
is used to feed data into the compressor.

A ``flush()`` method can be called to evict whatever data remains within the
compressor's internal state into the output object. This may result in 0 or
more ``write()`` calls to the output object.

Both ``write()`` and ``flush()`` return the number of bytes written to the
output object's ``write()``. In many cases, small inputs do not accumulate
enough data to cause a write and ``write()`` will return ``0``.

If the size of the data being fed to this streaming compressor is known,
you can declare it before compression begins::

    cctx = zstd.ZstdCompressor()
    with cctx.write_to(fh, size=data_len) as compressor:
        compressor.write(chunk0)
        compressor.write(chunk1)
        ...

Declaring the size of the source data allows compression parameters to
be tuned. And if ``write_content_size`` is used, it also results in the
content size being written into the frame header of the output data.

The size of the chunks passed to the destination's ``write()`` can be
specified::

    cctx = zstd.ZstdCompressor()
    with cctx.write_to(fh, write_size=32768) as compressor:
        ...

To see how much memory is being used by the streaming compressor::

    cctx = zstd.ZstdCompressor()
    with cctx.write_to(fh) as compressor:
        ...
        byte_size = compressor.memory_size()

Streaming Output API
^^^^^^^^^^^^^^^^^^^^

``read_from(reader)`` provides a mechanism to stream data out of a compressor
as an iterator of data chunks::

    cctx = zstd.ZstdCompressor()
    for chunk in cctx.read_from(fh):
        # Do something with emitted data.
        pass

``read_from()`` accepts an object that has a ``read(size)`` method or conforms
to the buffer protocol. (``bytes`` and ``memoryview`` are 2 common types that
provide the buffer protocol.)

Uncompressed data is fetched from the source either by calling ``read(size)``
or by fetching a slice of data from the object directly (in the case where
the buffer protocol is being used). The returned iterator consists of chunks
of compressed data.

If reading from the source via ``read()``, ``read()`` will be called until
it raises or returns an empty bytes (``b''``). It is perfectly valid for
the source to deliver fewer bytes than what was requested by ``read(size)``.
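
The ``read()`` contract described above can be sketched in pure Python. This
is an illustration of the protocol only, not the extension's implementation;
``iter_source_chunks`` and ``ShortReader`` are hypothetical names.

```python
import io


def iter_source_chunks(fh, read_size=16384):
    """Yield chunks from ``fh`` until ``read()`` returns empty bytes.

    Short reads are fine: only an empty result ends the loop.
    """
    while True:
        chunk = fh.read(read_size)
        if not chunk:
            break
        yield chunk


class ShortReader(object):
    """Hypothetical source that never returns more than 3 bytes per read."""

    def __init__(self, data):
        self._fh = io.BytesIO(data)

    def read(self, size):
        return self._fh.read(min(size, 3))


# The source delivers fewer bytes than requested, yet nothing is lost.
chunks = list(iter_source_chunks(ShortReader(b'0123456789'), read_size=8))
```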

Like ``write_to()``, ``read_from()`` also accepts a ``size`` argument
declaring the size of the input stream::

    cctx = zstd.ZstdCompressor()
    for chunk in cctx.read_from(fh, size=some_int):
        pass

You can also control the size of each ``read()`` from the source and
the ideal size of output chunks::

    cctx = zstd.ZstdCompressor()
    for chunk in cctx.read_from(fh, read_size=16384, write_size=8192):
        pass

Unlike ``write_to()``, ``read_from()`` does not give direct control over the
sizes of chunks fed into the compressor. Instead, chunk sizes will be whatever
the object being read from delivers. These will often be of a uniform size.

Stream Copying API
^^^^^^^^^^^^^^^^^^

``copy_stream(ifh, ofh)`` can be used to copy data between 2 streams while
compressing it::

    cctx = zstd.ZstdCompressor()
    cctx.copy_stream(ifh, ofh)

For example, say you wish to compress a file::

    cctx = zstd.ZstdCompressor()
    with open(input_path, 'rb') as ifh, open(output_path, 'wb') as ofh:
        cctx.copy_stream(ifh, ofh)

It is also possible to declare the size of the source stream::

    cctx = zstd.ZstdCompressor()
    cctx.copy_stream(ifh, ofh, size=len_of_input)

You can also specify the size of the chunks that are ``read()`` from and
``write()`` to the streams::

    cctx = zstd.ZstdCompressor()
    cctx.copy_stream(ifh, ofh, read_size=32768, write_size=16384)

The stream copier returns a 2-tuple of bytes read and written::

    cctx = zstd.ZstdCompressor()
    read_count, write_count = cctx.copy_stream(ifh, ofh)
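
To make the read/compress/write cycle and the returned counts concrete, here
is a rough pure-Python sketch of what a stream copier does. It is not this
package's implementation: ``copy_stream_sketch`` is a hypothetical helper, and
``zlib.compressobj()`` from the standard library stands in for the compressor
so the example runs without the extension.

```python
import io
import zlib


def copy_stream_sketch(cobj, ifh, ofh, read_size=32768):
    """Copy ``ifh`` to ``ofh`` through ``cobj``; return (bytes_read, bytes_written)."""
    read_count = 0
    write_count = 0
    while True:
        chunk = ifh.read(read_size)
        if not chunk:
            break
        read_count += len(chunk)
        out = cobj.compress(chunk)
        if out:
            # Compressors buffer internally, so output may lag behind input.
            ofh.write(out)
            write_count += len(out)
    # Emit whatever the compressor is still holding and end the stream.
    out = cobj.flush()
    if out:
        ofh.write(out)
        write_count += len(out)
    return read_count, write_count


source = io.BytesIO(b'example data ' * 1000)
dest = io.BytesIO()
read_count, write_count = copy_stream_sketch(zlib.compressobj(), source, dest)
```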

Compressor API
^^^^^^^^^^^^^^

``compressobj()`` returns an object that exposes ``compress(data)`` and
``flush()`` methods. Each returns compressed data or an empty bytes.

The purpose of ``compressobj()`` is to provide an API-compatible interface
with ``zlib.compressobj`` and ``bz2.BZ2Compressor``. This allows callers to
swap in different compressor objects while using the same API.

``flush()`` accepts an optional argument indicating how to end the stream.
``zstd.COMPRESSOBJ_FLUSH_FINISH`` (the default) ends the compression stream.
Once this type of flush is performed, ``compress()`` and ``flush()`` can
no longer be called. This type of flush **must** be called to end the
compression context. If not called, returned data may be incomplete.

A ``zstd.COMPRESSOBJ_FLUSH_BLOCK`` argument to ``flush()`` will flush a
zstd block. Flushes of this type can be performed multiple times. The next
call to ``compress()`` will begin a new zstd block.

Here is how this API should be used::

    cctx = zstd.ZstdCompressor()
    cobj = cctx.compressobj()
    data = cobj.compress(b'raw input 0')
    data = cobj.compress(b'raw input 1')
    data = cobj.flush()

Or to flush blocks::

    cctx = zstd.ZstdCompressor()
    cobj = cctx.compressobj()
    data = cobj.compress(b'chunk in first block')
    data = cobj.flush(zstd.COMPRESSOBJ_FLUSH_BLOCK)
    data = cobj.compress(b'chunk in second block')
    data = cobj.flush()

For best performance results, keep input chunks under 256KB. This avoids
extra allocations for a large output object.

It is possible to declare the input size of the data that will be fed into
the compressor::

    cctx = zstd.ZstdCompressor()
    cobj = cctx.compressobj(size=6)
    data = cobj.compress(b'foobar')
    data = cobj.flush()
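
The interchangeability with ``zlib.compressobj`` and ``bz2.BZ2Compressor``
mentioned above can be illustrated with a helper that only relies on
``compress()`` and ``flush()``. The sketch below runs against the two standard
library compressors; ``compress_chunks`` is a hypothetical name, and passing
``zstd.ZstdCompressor().compressobj()`` to it is an assumption based on the
description in this section.

```python
import bz2
import zlib


def compress_chunks(cobj, chunks):
    """Feed ``chunks`` through any object exposing ``compress()`` and ``flush()``."""
    parts = [cobj.compress(chunk) for chunk in chunks]
    # The final flush() ends the stream; without it the output is incomplete.
    parts.append(cobj.flush())
    return b''.join(parts)


chunks = [b'raw input 0', b'raw input 1']
as_zlib = compress_chunks(zlib.compressobj(), chunks)
as_bz2 = compress_chunks(bz2.BZ2Compressor(), chunks)
```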

ZstdDecompressor
----------------

The ``ZstdDecompressor`` class provides an interface for performing
decompression.

Each instance is associated with parameters that control decompression. These
come from the following named arguments (all optional):

dict_data
   Compression dictionary to use.

The interface of this class is very similar to ``ZstdCompressor`` (by design).

Unless specified otherwise, assume that no two methods of ``ZstdDecompressor``
instances can be called from multiple Python threads simultaneously. In other
words, assume instances are not thread safe unless stated otherwise.

Simple API
^^^^^^^^^^

``decompress(data)`` can be used to decompress an entire compressed zstd
frame in a single operation::

    dctx = zstd.ZstdDecompressor()
    decompressed = dctx.decompress(data)

By default, ``decompress(data)`` will only work on data written with the content
size encoded in its header. This can be achieved by creating a
``ZstdCompressor`` with ``write_content_size=True``. If compressed data without
an embedded content size is seen, ``zstd.ZstdError`` will be raised.

If the compressed data doesn't have its content size embedded within it,
decompression can be attempted by specifying the ``max_output_size``
argument::

    dctx = zstd.ZstdDecompressor()
    uncompressed = dctx.decompress(data, max_output_size=1048576)

Ideally, ``max_output_size`` will be identical to the decompressed output
size.

If ``max_output_size`` is too small to hold the decompressed data,
``zstd.ZstdError`` will be raised.

If ``max_output_size`` is larger than the decompressed data, the allocated
output buffer will be resized to only use the space required.

Please note that an allocation of the requested ``max_output_size`` will be
performed every time the method is called. Setting to a very large value could
result in a lot of work for the memory allocator and may result in
``MemoryError`` being raised if the allocation fails.

If the exact size of decompressed data is unknown, it is **strongly**
recommended to use a streaming API.

Streaming Input API
^^^^^^^^^^^^^^^^^^^

``write_to(fh)`` can be used to incrementally send compressed data to a
decompressor::

    dctx = zstd.ZstdDecompressor()
    with dctx.write_to(fh) as decompressor:
        decompressor.write(compressed_data)

This behaves similarly to ``zstd.ZstdCompressor``: compressed data is written to
the decompressor by calling ``write(data)`` and decompressed output is written
to the output object by calling its ``write(data)`` method.

Calls to ``write()`` will return the number of bytes written to the output
object. Not all inputs will result in bytes being written, so return values
of ``0`` are possible.

The size of the chunks passed to the destination's ``write()`` can be
specified::

    dctx = zstd.ZstdDecompressor()
    with dctx.write_to(fh, write_size=16384) as decompressor:
        pass

You can see how much memory is being used by the decompressor::

    dctx = zstd.ZstdDecompressor()
    with dctx.write_to(fh) as decompressor:
        byte_size = decompressor.memory_size()

Streaming Output API
^^^^^^^^^^^^^^^^^^^^

``read_from(fh)`` provides a mechanism to stream decompressed data out of a
compressed source as an iterator of data chunks::

    dctx = zstd.ZstdDecompressor()
    for chunk in dctx.read_from(fh):
        # Do something with original data.
        pass

``read_from()`` accepts either an object with a ``read(size)`` method that
returns compressed bytes or an object conforming to the buffer protocol that
can expose its data as a contiguous range of bytes. The ``bytes`` and
``memoryview`` types expose this buffer protocol.

``read_from()`` returns an iterator whose elements are chunks of the
decompressed data.

The size of each ``read()`` requested from the source can be specified::

    dctx = zstd.ZstdDecompressor()
    for chunk in dctx.read_from(fh, read_size=16384):
        pass

It is also possible to skip leading bytes in the input data::

    dctx = zstd.ZstdDecompressor()
    for chunk in dctx.read_from(fh, skip_bytes=1):
        pass

Skipping leading bytes is useful if the source data contains extra
*header* data but you want to avoid the overhead of making a buffer copy
or allocating a new ``memoryview`` object in order to decompress the data.

Similarly to ``ZstdCompressor.read_from()``, the consumer of the iterator
controls when data is decompressed. If the iterator isn't consumed,
decompression is put on hold.

When ``read_from()`` is passed an object conforming to the buffer protocol,
the behavior may seem similar to what occurs when the simple decompression
API is used. However, this API works when the decompressed size is unknown.
Furthermore, if feeding large inputs, the decompressor will work in chunks
instead of performing a single operation.

Stream Copying API
^^^^^^^^^^^^^^^^^^

``copy_stream(ifh, ofh)`` can be used to copy data across 2 streams while
performing decompression::

    dctx = zstd.ZstdDecompressor()
    dctx.copy_stream(ifh, ofh)

For example, to decompress a file to another file::

    dctx = zstd.ZstdDecompressor()
    with open(input_path, 'rb') as ifh, open(output_path, 'wb') as ofh:
        dctx.copy_stream(ifh, ofh)

The size of the chunks that are ``read()`` from and ``write()`` to the streams
can be specified::

    dctx = zstd.ZstdDecompressor()
    dctx.copy_stream(ifh, ofh, read_size=8192, write_size=16384)

Decompressor API
^^^^^^^^^^^^^^^^

``decompressobj()`` returns an object that exposes a ``decompress(data)``
method. Compressed data chunks are fed into ``decompress(data)`` and
uncompressed output (or an empty bytes) is returned. Output from subsequent
calls needs to be concatenated to reassemble the full decompressed byte
sequence.

The purpose of ``decompressobj()`` is to provide an API-compatible interface
with ``zlib.decompressobj`` and ``bz2.BZ2Decompressor``. This allows callers
to swap in different decompressor objects while using the same API.

Each object is single use: once an input frame is decoded, ``decompress()``
can no longer be called.

Here is how this API should be used::

    dctx = zstd.ZstdDecompressor()
    dobj = dctx.decompressobj()
    data = dobj.decompress(compressed_chunk_0)
    data = dobj.decompress(compressed_chunk_1)
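
Because each ``decompress()`` call returns only part of the output, callers
have to join the pieces themselves. The following sketch shows that pattern
with ``zlib.decompressobj()`` from the standard library as a stand-in, since
it exposes the same ``decompress(data)`` shape. ``decompress_chunks`` is a
hypothetical helper, not part of this package.

```python
import zlib


def decompress_chunks(dobj, compressed_chunks):
    """Feed chunks to ``dobj.decompress()`` and join the pieces it returns."""
    return b''.join(dobj.decompress(chunk) for chunk in compressed_chunks)


original = b'raw input ' * 500
frame = zlib.compress(original)
# Split the compressed frame at an arbitrary point to simulate chunked delivery.
pieces = [frame[:10], frame[10:]]
restored = decompress_chunks(zlib.decompressobj(), pieces)
```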

Content-Only Dictionary Chain Decompression
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``decompress_content_dict_chain(frames)`` performs decompression of a list of
zstd frames produced using chained *content-only* dictionary compression. Such
a list of frames is produced by compressing discrete inputs where each
non-initial input is compressed with a *content-only* dictionary consisting
of the content of the previous input.

For example, say you have the following inputs::

    inputs = [b'input 1', b'input 2', b'input 3']

The zstd frame chain consists of:

1. ``b'input 1'`` compressed in standalone/discrete mode
2. ``b'input 2'`` compressed using ``b'input 1'`` as a *content-only* dictionary
3. ``b'input 3'`` compressed using ``b'input 2'`` as a *content-only* dictionary

Each zstd frame **must** have the content size written.

The following Python code can be used to produce a *content-only dictionary
chain*::

    def make_chain(inputs):
        frames = []

        # First frame is compressed in standalone/discrete mode.
        zctx = zstd.ZstdCompressor(write_content_size=True)
        frames.append(zctx.compress(inputs[0]))

        # Subsequent frames use the previous fulltext as a content-only dictionary
        for i, raw in enumerate(inputs[1:]):
            dict_data = zstd.ZstdCompressionDict(inputs[i])
            zctx = zstd.ZstdCompressor(write_content_size=True, dict_data=dict_data)
            frames.append(zctx.compress(raw))

        return frames

``decompress_content_dict_chain()`` returns the uncompressed data of the last
element in the input chain.

It is possible to implement *content-only dictionary chain* decompression
on top of other Python APIs. However, this function will likely be significantly
faster, especially for long input chains, as it avoids the overhead of
instantiating and passing around intermediate objects between C and Python.
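
For illustration, the pure-Python approach mentioned above boils down to a
loop in which each decoded fulltext becomes the dictionary for the next frame.
The sketch below shows that loop; ``decompress_chain`` and the
``decompress_one`` callback are hypothetical names. So that it runs with the
standard library alone, the demonstration codec uses zlib preset dictionaries
(``zdict``) as a loose stand-in for zstd *content-only* dictionaries. With
this package, ``decompress_one`` would presumably wrap a ``ZstdDecompressor``
constructed with ``dict_data``.

```python
import zlib


def decompress_chain(frames, decompress_one):
    """Decode a chain of frames and return the content of the last one.

    ``decompress_one(frame, previous)`` must decode ``frame`` using
    ``previous`` (the prior fulltext, or ``None`` for the first frame).
    """
    previous = None
    for frame in frames:
        previous = decompress_one(frame, previous)
    return previous


# Stand-in codec built on zlib preset dictionaries (not the zstd API).
def zlib_compress(raw, previous):
    cobj = zlib.compressobj(zdict=previous) if previous else zlib.compressobj()
    return cobj.compress(raw) + cobj.flush()


def zlib_decompress(frame, previous):
    dobj = zlib.decompressobj(zdict=previous) if previous else zlib.decompressobj()
    return dobj.decompress(frame) + dobj.flush()


inputs = [b'input 1', b'input 2', b'input 3']
# Frame i is compressed against the fulltext of input i - 1.
frames = [zlib_compress(raw, inputs[i - 1] if i else None)
          for i, raw in enumerate(inputs)]
last = decompress_chain(frames, zlib_decompress)
```

Every intermediate fulltext is materialized as a Python object here, which is
the overhead the dedicated function avoids.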

Choosing an API
---------------

Various forms of compression and decompression APIs are provided because each
is suitable for different use cases.

The simple/one-shot APIs are useful for small data, when the decompressed
data size is known (either recorded in the zstd frame header via
``write_content_size`` or known via an out-of-band mechanism, such as a file
size).

A limitation of the simple APIs is that input or output data must fit in memory.
And unless using advanced tricks with Python *buffer objects*, both input and
output must fit in memory simultaneously.

Another limitation is that compression or decompression is performed as a single
operation. So if you feed large input, it could take a long time for the
function to return.

The streaming APIs do not have the limitations of the simple API. The cost
is that they are more complex to use than a single function call.

The streaming APIs put the caller in control of compression and decompression
behavior by allowing them to directly control either the input or output side
of the operation.

With the streaming input APIs, the caller feeds data into the compressor or
decompressor as they see fit. Output data will only be written after the caller
has explicitly written data.

With the streaming output APIs, the caller consumes output from the compressor
or decompressor as they see fit. The compressor or decompressor will only
consume data from the source when the caller is ready to receive it.

One end of the streaming APIs involves a file-like object that must
``write()`` output data or ``read()`` input data. Depending on what the
backing storage for these objects is, those operations may not complete quickly.
For example, when streaming compressed data to a file, the ``write()`` into
a streaming compressor could result in a ``write()`` to the filesystem, which
may take a long time to finish due to slow I/O on the filesystem. So, there
may be overhead in streaming APIs beyond the compression and decompression
operations.
Dictionary Creation and Management | ||||
---------------------------------- | ||||
Zstandard allows *dictionaries* to be used when compressing and | ||||
decompressing data. The idea is that if you are compressing a lot of similar | ||||
data, you can precompute common properties of that data (such as recurring | ||||
byte sequences) to achieve better compression ratios. | ||||
In Python, compression dictionaries are represented as the | ||||
``ZstdCompressionDict`` type. | ||||
Instances can be constructed from bytes:: | ||||
dict_data = zstd.ZstdCompressionDict(data) | ||||
Gregory Szorc
|
r30895 | It is possible to construct a dictionary from *any* data. Unless the | ||
data begins with a magic header, the dictionary will be treated as | ||||
*content-only*. *Content-only* dictionaries allow compression operations | ||||
that follow to reference raw data within the content. For one use of | ||||
*content-only* dictionaries, see | ||||
``ZstdDecompressor.decompress_content_dict_chain()``. | ||||
Gregory Szorc
|
r30435 | More interestingly, instances can be created by *training* on sample data:: | ||
dict_data = zstd.train_dictionary(size, samples) | ||||
This takes a list of bytes instances and creates and returns a | ||||
``ZstdCompressionDict``. | ||||
You can see how many bytes are in the dictionary by calling ``len()``:: | ||||
dict_data = zstd.train_dictionary(size, samples) | ||||
dict_size = len(dict_data) # will not be larger than ``size`` | ||||
Once you have a dictionary, you can pass it to the objects performing | ||||
compression and decompression:: | ||||
dict_data = zstd.train_dictionary(16384, samples) | ||||
cctx = zstd.ZstdCompressor(dict_data=dict_data) | ||||
for source_data in input_data: | ||||
compressed = cctx.compress(source_data) | ||||
# Do something with compressed data. | ||||
dctx = zstd.ZstdDecompressor(dict_data=dict_data) | ||||
for compressed_data in input_data: | ||||
buffer = io.BytesIO() | ||||
with dctx.write_to(buffer) as decompressor: | ||||
decompressor.write(compressed_data) | ||||
# Do something with raw data in ``buffer``. | ||||

Dictionaries have unique integer IDs. You can retrieve this ID via::

   dict_id = zstd.dictionary_id(dict_data)

You can obtain the raw data in the dict (useful for persisting and constructing
a ``ZstdCompressionDict`` later) via ``as_bytes()``::

   dict_data = zstd.train_dictionary(size, samples)
   raw_data = dict_data.as_bytes()
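
As a rough sketch of the persistence use case mentioned above, the helpers
below write a dictionary to disk and rebuild it later. The names
``save_dictionary`` and ``load_dictionary`` are hypothetical and not part of
this module. The dictionary type is passed in as an argument (it is assumed
to be ``zstd.ZstdCompressionDict``) so the sketch relies only on the
``as_bytes()`` method and the bytes constructor described above.

```python
def save_dictionary(dict_data, path):
    """Write the raw bytes of a dictionary object to ``path``.

    ``dict_data`` only needs an ``as_bytes()`` method, as
    ``ZstdCompressionDict`` provides.
    """
    with open(path, "wb") as fh:
        fh.write(dict_data.as_bytes())


def load_dictionary(path, dict_type):
    """Rebuild a dictionary object from the bytes stored at ``path``.

    ``dict_type`` is assumed to be ``zstd.ZstdCompressionDict`` (or any
    callable that accepts the raw bytes).
    """
    with open(path, "rb") as fh:
        return dict_type(fh.read())
```

With the real module this would be used as
``save_dictionary(dict_data, "samples.dict")`` followed later by
``load_dictionary("samples.dict", zstd.ZstdCompressionDict)``.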

Explicit Compression Parameters
-------------------------------

Zstandard's integer compression levels along with the input size and dictionary
size are converted into a data structure defining multiple parameters to tune
behavior of the compression algorithm. It is possible to define this
data structure explicitly to have lower-level control over compression behavior.

The ``zstd.CompressionParameters`` type represents this data structure.
You can see how Zstandard converts compression levels to this data structure
by calling ``zstd.get_compression_parameters()``. e.g.::

   params = zstd.get_compression_parameters(5)

This function also accepts the uncompressed data size and dictionary size
to adjust parameters::

   params = zstd.get_compression_parameters(3, source_size=len(data), dict_size=len(dict_data))

You can also construct compression parameters from their low-level components::

   params = zstd.CompressionParameters(20, 6, 12, 5, 4, 10, zstd.STRATEGY_FAST)

You can then configure a compressor to use the custom parameters::

   cctx = zstd.ZstdCompressor(compression_params=params)

The members/attributes of ``CompressionParameters`` instances are as follows:

* window_log
* chain_log
* hash_log
* search_log
* search_length
* target_length
* strategy

This is the order in which the arguments are passed to the constructor when
not using named arguments.
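
To make the positional order concrete, here is a small stand-in built with
``collections.namedtuple``. ``ParamsSketch`` is *not* the real
``zstd.CompressionParameters`` type, and the string ``STRATEGY_FAST`` is only
a placeholder for ``zstd.STRATEGY_FAST``. The sketch also assumes that the
keyword names accepted by the real constructor match the attribute names
listed above.

```python
from collections import namedtuple

# Stand-in for zstd.CompressionParameters (hypothetical, for illustration).
# The field order follows the attribute list above.
ParamsSketch = namedtuple("ParamsSketch", [
    "window_log",
    "chain_log",
    "hash_log",
    "search_log",
    "search_length",
    "target_length",
    "strategy",
])

# Placeholder for zstd.STRATEGY_FAST; the real constant's value is not assumed.
STRATEGY_FAST = "STRATEGY_FAST"

# Positional form, mirroring the constructor call shown earlier.
positional = ParamsSketch(20, 6, 12, 5, 4, 10, STRATEGY_FAST)

# Equivalent named form, which is easier to read and review.
named = ParamsSketch(
    window_log=20,
    chain_log=6,
    hash_log=12,
    search_log=5,
    search_length=4,
    target_length=10,
    strategy=STRATEGY_FAST,
)
```

Both forms describe the same parameters. The named form avoids accidentally
swapping two adjacent integers.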

You'll need to read the Zstandard documentation for what these parameters
do.

Frame Inspection
----------------

Data emitted from zstd compression is encapsulated in a *frame*. This frame
begins with a 4 byte *magic number* header followed by 2 to 14 bytes describing
the frame in more detail. For more info, see
https://github.com/facebook/zstd/blob/master/doc/zstd_compression_format.md.

``zstd.get_frame_parameters(data)`` parses a zstd *frame* header from a bytes
instance and returns a ``FrameParameters`` object describing the frame.

Depending on which fields are present in the frame and their values, the
length of the frame parameters varies. If insufficient bytes are passed
in to fully parse the frame parameters, ``ZstdError`` is raised. To ensure
frame parameters can be parsed, pass in at least 18 bytes.

``FrameParameters`` instances have the following attributes:

content_size
   Integer size of original, uncompressed content. This will be ``0`` if the
   original content size isn't written to the frame (controlled with the
   ``write_content_size`` argument to ``ZstdCompressor``) or if the input
   content size was ``0``.

window_size
   Integer size of maximum back-reference distance in compressed data.

dict_id
   Integer of dictionary ID used for compression. ``0`` if no dictionary
   ID was used or if the dictionary ID was ``0``.

has_checksum
   Bool indicating whether a 4 byte content checksum is stored at the end
   of the frame.
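
To illustrate why the header length varies, the following pure-Python sketch
decodes the same four fields by hand. It follows the frame header layout in
the format document linked above and is *not* how this module implements
``get_frame_parameters()``. The names ``parse_frame_header`` and ``FrameInfo``
are hypothetical, and the code assumes Python 3 (it uses ``int.from_bytes``).

```python
from collections import namedtuple

# The magic number 0xFD2FB528 as stored on disk (little-endian).
MAGIC = b"\x28\xb5\x2f\xfd"

FrameInfo = namedtuple(
    "FrameInfo", "content_size window_size dict_id has_checksum")


def parse_frame_header(data):
    """Decode the header fields of a zstd frame from ``data`` (bytes)."""
    if len(data) < 5 or data[:4] != MAGIC:
        raise ValueError("not a zstd frame")

    descriptor = data[4]
    fcs_flag = descriptor >> 6                # bits 7-6: content size flag
    single_segment = bool(descriptor & 0x20)  # bit 5
    has_checksum = bool(descriptor & 0x04)    # bit 2
    dict_id_size = (0, 1, 2, 4)[descriptor & 0x03]
    # With flag 0 the content size is only stored for single-segment frames.
    fcs_size = (1 if single_segment else 0, 2, 4, 8)[fcs_flag]

    pos = 5
    needed = pos + (0 if single_segment else 1) + dict_id_size + fcs_size
    if len(data) < needed:
        raise ValueError("need %d bytes, got %d" % (needed, len(data)))

    window_size = 0
    if not single_segment:
        # Window descriptor: 5-bit exponent, 3-bit mantissa.
        exponent, mantissa = data[pos] >> 3, data[pos] & 0x07
        window_base = 1 << (10 + exponent)
        window_size = window_base + (window_base // 8) * mantissa
        pos += 1

    dict_id = int.from_bytes(data[pos:pos + dict_id_size], "little")
    pos += dict_id_size

    content_size = int.from_bytes(data[pos:pos + fcs_size], "little")
    if fcs_size == 2:
        content_size += 256  # the 2-byte form stores the size minus 256
    if single_segment:
        window_size = content_size  # the whole content is the window

    return FrameInfo(content_size, window_size, dict_id, has_checksum)
```

The largest case is 4 magic bytes plus a 14 byte header (descriptor, window
descriptor, 4 byte dictionary ID and 8 byte content size), which is where the
18 byte recommendation comes from.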

Misc Functionality
------------------

estimate_compression_context_size(CompressionParameters)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Given a ``CompressionParameters`` struct, estimate the memory size required
to perform compression.

estimate_decompression_context_size()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Estimate the memory size requirements for a decompressor instance.

Constants
---------

The following module constants/attributes are exposed:

ZSTD_VERSION
    This module attribute exposes a 3-tuple of the Zstandard version. e.g.
    ``(1, 0, 0)``
MAX_COMPRESSION_LEVEL
    Integer max compression level accepted by compression functions
COMPRESSION_RECOMMENDED_INPUT_SIZE
    Recommended chunk size to feed to compressor functions
COMPRESSION_RECOMMENDED_OUTPUT_SIZE
    Recommended chunk size for compression output
DECOMPRESSION_RECOMMENDED_INPUT_SIZE
    Recommended chunk size to feed into decompressor functions
DECOMPRESSION_RECOMMENDED_OUTPUT_SIZE
    Recommended chunk size for decompression output

FRAME_HEADER
    bytes containing header of the Zstandard frame
MAGIC_NUMBER
    Frame header as an integer

WINDOWLOG_MIN
    Minimum value for compression parameter
WINDOWLOG_MAX
    Maximum value for compression parameter
CHAINLOG_MIN
    Minimum value for compression parameter
CHAINLOG_MAX
    Maximum value for compression parameter
HASHLOG_MIN
    Minimum value for compression parameter
HASHLOG_MAX
    Maximum value for compression parameter
SEARCHLOG_MIN
    Minimum value for compression parameter
SEARCHLOG_MAX
    Maximum value for compression parameter
SEARCHLENGTH_MIN
    Minimum value for compression parameter
SEARCHLENGTH_MAX
    Maximum value for compression parameter
TARGETLENGTH_MIN
    Minimum value for compression parameter
TARGETLENGTH_MAX
    Maximum value for compression parameter
STRATEGY_FAST
    Compression strategy
STRATEGY_DFAST
    Compression strategy
STRATEGY_GREEDY
    Compression strategy
STRATEGY_LAZY
    Compression strategy
STRATEGY_LAZY2
    Compression strategy
STRATEGY_BTLAZY2
    Compression strategy
STRATEGY_BTOPT
    Compression strategy

Performance Considerations
--------------------------

The ``ZstdCompressor`` and ``ZstdDecompressor`` types maintain state to a
persistent compression or decompression *context*. Reusing a ``ZstdCompressor``
or ``ZstdDecompressor`` instance for multiple operations is faster than
instantiating a new ``ZstdCompressor`` or ``ZstdDecompressor`` for each
operation. The differences are magnified as the size of data decreases. For
example, the difference between *context* reuse and non-reuse for 100,000
100 byte inputs will be significant (possibly over 10x faster to reuse contexts)
whereas ten 1,000,000 byte inputs will be more similar in speed (because the
time spent doing compression dwarfs time spent creating new *contexts*).
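
One way to measure this on your own data is a small timing harness like the
one below. It is only a sketch: ``time_compression`` is a hypothetical helper,
and it accepts any zero-argument callable that returns an object with a
``compress(data)`` method, so ``zstd.ZstdCompressor`` is assumed to fit. No
timings are claimed here; results depend on input sizes and hardware.

```python
import time


def time_compression(make_compressor, inputs, reuse):
    """Return the seconds spent compressing every item in ``inputs``.

    ``make_compressor`` is a zero-argument callable returning an object
    with a ``compress(data)`` method (assumed: ``zstd.ZstdCompressor``).
    """
    start = time.perf_counter()
    if reuse:
        # One context shared by all inputs.
        cctx = make_compressor()
        for data in inputs:
            cctx.compress(data)
    else:
        # A fresh context for every input.
        for data in inputs:
            make_compressor().compress(data)
    return time.perf_counter() - start
```

Calling it twice on the same inputs, once with ``reuse=True`` and once with
``reuse=False``, gives the two numbers to compare.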

Note on Zstandard's *Experimental* API
======================================

Many of the Zstandard APIs used by this module are marked as *experimental*
within the Zstandard project. This includes a large number of useful
features, such as compression and frame parameters and parts of dictionary
compression.

It is unclear how Zstandard's C API will evolve over time, especially with
regard to this *experimental* functionality. We will try to maintain
backwards compatibility at the Python API level. However, we cannot
guarantee this for things not under our control.

Since a copy of the Zstandard source code is distributed with this
module and since we compile against it, the behavior of a specific
version of this module should be constant for all of time. So if you
pin the version of this module used in your projects (which is a Python
best practice), you should be buffered from unwanted future changes.

Donate
======

A lot of time has been invested into this project by the author.

If you find this project useful and would like to thank the author for
their work, consider donating some money. Any amount is appreciated.

.. image:: https://www.paypalobjects.com/en_US/i/btn/btn_donate_LG.gif
    :target: https://www.paypal.com/cgi-bin/webscr?cmd=_donations&business=gregory%2eszorc%40gmail%2ecom&lc=US&item_name=python%2dzstandard&currency_code=USD&bn=PP%2dDonationsBF%3abtn_donate_LG%2egif%3aNonHosted
    :alt: Donate via PayPal

.. |ci-status| image:: https://travis-ci.org/indygreg/python-zstandard.svg?branch=master
    :target: https://travis-ci.org/indygreg/python-zstandard

.. |win-ci-status| image:: https://ci.appveyor.com/api/projects/status/github/indygreg/python-zstandard?svg=true
    :target: https://ci.appveyor.com/project/indygreg/python-zstandard
    :alt: Windows build status