|
|
Revision logs - or *revlogs* - are an append only data structure for
|
|
|
storing discrete entries, or *revisions*. They are the primary storage
|
|
|
mechanism of repository data.
|
|
|
|
|
|
Revlogs effectively model a directed acyclic graph (DAG). Each node
|
|
|
has edges to 1 or 2 *parent* nodes. Each node contains metadata and
|
|
|
the raw value for that node.
|
|
|
|
|
|
Revlogs consist of entries which have metadata and revision data.
|
|
|
Metadata includes the hash of the revision's content, sizes, and
|
|
|
links to its *parent* entries. The collective metadata is referred
|
|
|
to as the *index* and the revision data is the *data*.
|
|
|
|
|
|
Revision data is stored as a series of compressed deltas against
|
|
|
ancestor revisions.
|
|
|
|
|
|
Revlogs are written in an append-only fashion. We never need to rewrite
|
|
|
a file to insert nor do we need to remove data. Rolling back in-progress
|
|
|
writes can be performed by truncating files. Read locks can be avoided
|
|
|
using simple techniques. This means that references to other data in
|
|
|
the same revlog *always* refer to a previous entry.
|
|
|
|
|
|
Revlogs can be modeled as 0-indexed arrays. The first revision is
|
|
|
revision #0 and the second is revision #1. The revision -1 is typically
|
|
|
used to mean *does not exist* or *not defined*.
|
|
|
|
|
|
File Format
|
|
|
===========
|
|
|
|
|
|
A revlog begins with a 32-bit big endian integer holding version info
|
|
|
and feature flags. This integer overlaps with the first four bytes of
|
|
|
the first revision entry.
|
|
|
|
|
|
This integer is logically divided into 2 16-bit shorts. The least
|
|
|
significant half of the integer is the format/version short. The other
|
|
|
short holds feature flags that dictate behavior of the revlog.
|
|
|
|
|
|
The following values for the format/version short are defined:
|
|
|
|
|
|
0
|
|
|
The original revlog version.
|
|
|
1
|
|
|
RevlogNG (*next generation*). It replaced version 0 when it was
|
|
|
implemented in 2006.
|
|
|
2
|
|
|
In-development version incorporating accumulated knowledge and
|
|
|
missing features from 10+ years of revlog version 1.
|
|
|
57005 (0xdead)
|
|
|
Reserved for internal testing of new versions. No defined format
|
|
|
beyond 32-bit header.
|
|
|
|
|
|
The feature flags short consists of bit flags. Where 0 is the least
|
|
|
significant bit. The bit flags vary by revlog version.
|
|
|
|
|
|
Version 0 revlogs have no defined flags and the presence of a flag
|
|
|
is considered an error.
|
|
|
|
|
|
Version 1 revlogs have the following flags at the specified bit offsets:
|
|
|
|
|
|
0
|
|
|
Store revision data inline.
|
|
|
1
|
|
|
Generaldelta encoding.
|
|
|
|
|
|
Version 2 revlogs have the following flags at the specified bit offsets:
|
|
|
|
|
|
0
|
|
|
Store revision data inline.
|
|
|
|
|
|
The following header values are common:
|
|
|
|
|
|
00 00 00 01
|
|
|
v1
|
|
|
00 01 00 01
|
|
|
v1 + inline
|
|
|
00 02 00 01
|
|
|
v1 + generaldelta
|
|
|
00 03 00 01
|
|
|
v1 + inline + generaldelta
|
|
|
|
|
|
Following the 32-bit header is the remaining 60 bytes of the first index
|
|
|
entry. Following that are additional *index* entries. Inlined revision
|
|
|
data is possibly located between index entries. More on this inlined
|
|
|
layout is described below.
|
|
|
|
|
|
Version 1 Format
|
|
|
================
|
|
|
|
|
|
Version 1 (RevlogNG) begins with an index describing the revisions in
|
|
|
the revlog. If the ``inline`` flag is set, revision data is stored inline,
|
|
|
or between index entries (as opposed to in a separate container).
|
|
|
|
|
|
Each index entry is 64 bytes. The byte layout of each entry is as
|
|
|
follows, with byte 0 being the first byte (all data stored as big endian):
|
|
|
|
|
|
0-3 (4 bytes) (rev 0 only)
|
|
|
Revlog header
|
|
|
|
|
|
0-5 (6 bytes)
|
|
|
Absolute offset of revision data from beginning of revlog.
|
|
|
|
|
|
6-7 (2 bytes)
|
|
|
Bit flags impacting revision behavior. The following bit offsets define:
|
|
|
|
|
|
0: REVIDX_ISCENSORED revision has censor metadata, must be verified.
|
|
|
|
|
|
1: REVIDX_ELLIPSIS revision hash does not match its data. Used by
|
|
|
narrowhg
|
|
|
|
|
|
2: REVIDX_EXTSTORED revision data is stored externally.
|
|
|
|
|
|
8-11 (4 bytes)
|
|
|
Compressed length of revision data / chunk as stored in revlog.
|
|
|
|
|
|
12-15 (4 bytes)
|
|
|
Uncompressed length of revision data. This is the size of the full
|
|
|
revision data, not the size of the chunk post decompression.
|
|
|
|
|
|
16-19 (4 bytes)
|
|
|
Base or previous revision this revision's delta was produced against.
|
|
|
This revision holds full text (as opposed to a delta) if it points to
|
|
|
itself. For generaldelta repos, this is the previous revision in the
|
|
|
delta chain. For non-generaldelta repos, this is the base or first
|
|
|
revision in the delta chain.
|
|
|
|
|
|
20-23 (4 bytes)
|
|
|
A revision this revision is *linked* to. This allows a revision in
|
|
|
one revlog to be forever associated with a revision in another
|
|
|
revlog. For example, a file's revlog may point to the changelog
|
|
|
revision that introduced it.
|
|
|
|
|
|
24-27 (4 bytes)
|
|
|
Revision of 1st parent. -1 indicates no parent.
|
|
|
|
|
|
28-31 (4 bytes)
|
|
|
Revision of 2nd parent. -1 indicates no 2nd parent.
|
|
|
|
|
|
32-63 (32 bytes)
|
|
|
Hash of revision's full text. Currently, SHA-1 is used and only
|
|
|
the first 20 bytes of this field are used. The rest of the bytes
|
|
|
are ignored and should be stored as \0.
|
|
|
|
|
|
If inline revision data is being stored, the compressed revision data
|
|
|
(of length from bytes offset 8-11 from the index entry) immediately
|
|
|
follows the index entry. There is no header on the revision data. There
|
|
|
is no padding between it and the index entries before and after.
|
|
|
|
|
|
If revision data is not inline, then raw revision data is stored in a
|
|
|
separate byte container. The offsets from bytes 0-5 and the compressed
|
|
|
length from bytes 8-11 define how to access this data.
|
|
|
|
|
|
The 6 byte absolute offset field from the first revlog entry overlaps
|
|
|
with the revlog header. That is, the first 6 bytes of the first revlog
|
|
|
entry can be split into four bytes containing the header for the revlog
|
|
|
file and an additional two bytes containing the offset for the first
|
|
|
entry. Since this is the offset from the beginning of the file for the
|
|
|
first revision entry, the two bytes will always be set to zero.
|
|
|
|
|
|
Version 2 Format
|
|
|
================
|
|
|
|
|
|
(In development. Format not finalized or stable.)
|
|
|
|
|
|
Version 2 is identical to version 1 with the following differences.
|
|
|
|
|
|
There is no dedicated *generaldelta* revlog format flag. Instead,
|
|
|
the feature is implied enabled by default.
|
|
|
|
|
|
Delta Chains
|
|
|
============
|
|
|
|
|
|
Revision data is encoded as a chain of *chunks*. Each chain begins with
|
|
|
the compressed original full text for that revision. Each subsequent
|
|
|
*chunk* is a *delta* against the previous revision. We therefore call
|
|
|
these chains of chunks/deltas *delta chains*.
|
|
|
|
|
|
The full text for a revision is reconstructed by loading the original
|
|
|
full text for the base revision of a *delta chain* and then applying
|
|
|
*deltas* until the target revision is reconstructed.
|
|
|
|
|
|
*Delta chains* are limited in length so lookup time is bound. They are
|
|
|
limited to ~2x the length of the revision's data. The linear distance
|
|
|
between the base chunk and the final chunk is also limited so the
|
|
|
amount of read I/O to load all chunks in the delta chain is bound.
|
|
|
|
|
|
Deltas and delta chains are either computed against the previous
|
|
|
revision in the revlog or another revision (almost certainly one of
|
|
|
the parents of the revision). Historically, deltas were computed against
|
|
|
the previous revision. The *generaldelta* revlog feature flag (enabled
|
|
|
by default in Mercurial 3.7) activates the mode where deltas are
|
|
|
computed against an arbitrary revision (almost certainly a parent revision).
|
|
|
|
|
|
File Storage
|
|
|
============
|
|
|
|
|
|
Revlogs logically consist of an index (metadata of entries) and
|
|
|
revision data. This data may be stored together in a single file or in
|
|
|
separate files. The mechanism used is indicated by the ``inline`` feature
|
|
|
flag on the revlog.
|
|
|
|
|
|
Mercurial's behavior is to use inline storage until a revlog reaches a
|
|
|
certain size, at which point it will be converted to non-inline. The
|
|
|
reason there is a size limit on inline storage is to establish an upper
|
|
|
bound on how much data must be read to load the index. It would be a waste
|
|
|
to read tens or hundreds of extra megabytes of data just to access the
|
|
|
index data.
|
|
|
|
|
|
The actual layout of revlog files on disk is governed by the repository's
|
|
|
*store format*. Typically, a ``.i`` file represents the index revlog
|
|
|
(possibly containing inline data) and a ``.d`` file holds the revision data.
|
|
|
|
|
|
Revision Entries
|
|
|
================
|
|
|
|
|
|
Revision entries consist of an optional 1 byte header followed by an
|
|
|
encoding of the revision data. The headers are as follows:
|
|
|
|
|
|
\0 (0x00)
|
|
|
Revision data is the entirety of the entry, including this header.
|
|
|
( (0x28)
|
|
|
zstd https://github.com/facebook/zstd
|
|
|
u (0x75)
|
|
|
Raw revision data follows.
|
|
|
x (0x78)
|
|
|
zlib (RFC 1950) data.
|
|
|
|
|
|
The 0x78 value is actually the first byte of the zlib header (CMF byte).
|
|
|
|
|
|
Hash Computation
|
|
|
================
|
|
|
|
|
|
The hash of the revision is stored in the index and is used both as a primary
|
|
|
key and for data integrity verification.
|
|
|
|
|
|
Currently, SHA-1 is the only supported hashing algorithm. To obtain the SHA-1
|
|
|
hash of a revision:
|
|
|
|
|
|
1. Hash the parent nodes
|
|
|
2. Hash the fulltext of the revision
|
|
|
|
|
|
The 20 byte node ids of the parents are fed into the hasher in ascending order.
|
|
|
|