dirstate-v2.txt
375 lines
| 14.1 KiB
| text/plain
|
TextLexer
Simon Sapin
|
r48978 | The *dirstate* is what Mercurial uses internally to track | ||
the state of files in the working directory, | ||||
such as set by commands like `hg add` and `hg rm`. | ||||
It also contains some cached data that help make `hg status` faster. | ||||
The name refers both to `.hg/dirstate` on the filesystem | ||||
and the corresponding data structure in memory while a Mercurial process | ||||
is running. | ||||
The original file format, retroactively dubbed `dirstate-v1`, | ||||
is described at https://www.mercurial-scm.org/wiki/DirState. | ||||
It is made of a flat sequence of unordered variable-size entries, | ||||
so accessing any information in it requires parsing all of it. | ||||
Similarly, saving changes requires rewriting the entire file. | ||||
The newer `dirsate-v2` file format is designed to fix these limitations | ||||
and make `hg status` faster. | ||||
User guide | ||||
========== | ||||
Compatibility | ||||
------------- | ||||
The file format is experimental and may still change. | ||||
Different versions of Mercurial may not be compatible with each other | ||||
when working on a local repository that uses this format. | ||||
When using an incompatible version with the experimental format, | ||||
anything can happen including data corruption. | ||||
Since the dirstate is entirely local and not relevant to the wire protocol, | ||||
`dirstate-v2` does not affect compatibility with remote Mercurial versions. | ||||
When `share-safe` is enabled, different repositories sharing the same store | ||||
can use different dirstate formats. | ||||
Enabling `dirsate-v2` for new local repositories | ||||
------------------------------------------------ | ||||
When creating a new local repository such as with `hg init` or `hg clone`, | ||||
the `exp-dirstate-v2` boolean in the `format` configuration section | ||||
controls whether to use this file format. | ||||
This is disabled by default as of this writing. | ||||
To enable it for a single repository, run for example:: | ||||
$ hg init my-project --config format.exp-dirstate-v2=1 | ||||
Checking the format of an existing local repsitory | ||||
-------------------------------------------------- | ||||
The `debugformat` commands prints information about | ||||
which of multiple optional formats are used in the current repository, | ||||
including `dirstate-v2`:: | ||||
$ hg debugformat | ||||
format-variant repo | ||||
fncache: yes | ||||
dirstate-v2: yes | ||||
[…] | ||||
Upgrading or downgrading an existing local repository | ||||
----------------------------------------------------- | ||||
The `debugupgrade` command does various upgrades or downgrades | ||||
on a local repository | ||||
based on the current Mercurial version and on configuration. | ||||
The same `format.exp-dirstate-v2` configuration is used again. | ||||
Example to upgrade:: | ||||
$ hg debugupgrade --config format.exp-dirstate-v2=1 | ||||
Example to downgrade to `dirstate-v1`:: | ||||
$ hg debugupgrade --config format.exp-dirstate-v2=0 | ||||
Both of this commands do nothing but print a list of proposed changes, | ||||
which may include changes unrelated to the dirstate. | ||||
Those other changes are controlled by their own configuration keys. | ||||
Add `--run` to a command to actually apply the proposed changes. | ||||
Backups of `.hg/requires` and `.hg/dirstate` are created | ||||
in a `.hg/upgradebackup.*` directory. | ||||
If something goes wrong, restoring those files should undo the change. | ||||
Note that upgrading affects compatibility with older versions of Mercurial | ||||
as noted above. | ||||
This can be relevant when a repository’s files are on a USB drive | ||||
or some other removable media, or shared over the network, etc. | ||||
Internal filesystem representation | ||||
================================== | ||||
Requirements file | ||||
----------------- | ||||
The `.hg/requires` file indicates which of various optional file formats | ||||
are used by a given repository. | ||||
Mercurial aborts when seeing a requirement it does not know about, | ||||
which avoids older version accidentally messing up a respository | ||||
that uses a format that was introduced later. | ||||
For versions that do support a format, the presence or absence of | ||||
the corresponding requirement indicates whether to use that format. | ||||
When the file contains a `exp-dirstate-v2` line, | ||||
the `dirstate-v2` format is used. | ||||
With no such line `dirstate-v1` is used. | ||||
High level description | ||||
---------------------- | ||||
Whereas `dirstate-v1` uses a single `.hg/disrtate` file, | ||||
in `dirstate-v2` that file is a "docket" file | ||||
that only contains some metadata | ||||
and points to separate data file named `.hg/dirstate.{ID}`, | ||||
where `{ID}` is a random identifier. | ||||
This separation allows making data files append-only | ||||
and therefore safer to memory-map. | ||||
Creating a new data file (occasionally to clean up unused data) | ||||
can be done with a different ID | ||||
without disrupting another Mercurial process | ||||
that could still be using the previous data file. | ||||
Both files have a format designed to reduce the need for parsing, | ||||
by using fixed-size binary components as much as possible. | ||||
For data that is not fixed-size, | ||||
references to other parts of a file can be made by storing "pseudo-pointers": | ||||
integers counted in bytes from the start of a file. | ||||
For read-only access no data structure is needed, | ||||
only a bytes buffer (possibly memory-mapped directly from the filesystem) | ||||
with specific parts read on demand. | ||||
The data file contains "nodes" organized in a tree. | ||||
Each node represents a file or directory inside the working directory | ||||
or its parent changeset. | ||||
This tree has the same structure as the filesystem, | ||||
so a node representing a directory has child nodes representing | ||||
the files and subdirectories contained directly in that directory. | ||||
The docket file format | ||||
---------------------- | ||||
This is implemented in `rust/hg-core/src/dirstate_tree/on_disk.rs` | ||||
and `mercurial/dirstateutils/docket.py`. | ||||
Components of the docket file are found at fixed offsets, | ||||
counted in bytes from the start of the file: | ||||
* Offset 0: | ||||
The 12-bytes marker string "dirstate-v2\n" ending with a newline character. | ||||
This makes it easier to tell a dirstate-v2 file from a dirstate-v1 file, | ||||
although it is not strictly necessary | ||||
since `.hg/requires` determines which format to use. | ||||
* Offset 12: | ||||
The changeset node ID on the first parent of the working directory, | ||||
as up to 32 binary bytes. | ||||
If a node ID is shorter (20 bytes for SHA-1), | ||||
it is start-aligned and the rest of the bytes are set to zero. | ||||
* Offset 44: | ||||
The changeset node ID on the second parent of the working directory, | ||||
or all zeros if there isn’t one. | ||||
Also 32 binary bytes. | ||||
* Offset 76: | ||||
Tree metadata on 44 bytes, described below. | ||||
Its separation in this documentation from the rest of the docket | ||||
reflects a detail of the current implementation. | ||||
Since tree metadata is also made of fields at fixed offsets, those could | ||||
be inlined here by adding 76 bytes to each offset. | ||||
* Offset 120: | ||||
The used size of the data file, as a 32-bit big-endian integer. | ||||
The actual size of the data file may be larger | ||||
(if another Mercurial processis in appending to it | ||||
but has not updated the docket yet). | ||||
That extra data must be ignored. | ||||
* Offset 124: | ||||
The length of the data file identifier, as a 8-bit integer. | ||||
* Offset 125: | ||||
The data file identifier. | ||||
* Any additional data is current ignored, and dropped when updating the file. | ||||
Tree metadata in the docket file | ||||
-------------------------------- | ||||
Tree metadata is similarly made of components at fixed offsets. | ||||
These offsets are counted in bytes from the start of tree metadata, | ||||
which is 76 bytes after the start of the docket file. | ||||
This metadata can be thought of as the singular root of the tree | ||||
formed by nodes in the data file. | ||||
* Offset 0: | ||||
Pseudo-pointer to the start of root nodes, | ||||
counted in bytes from the start of the data file, | ||||
as a 32-bit big-endian integer. | ||||
These nodes describe files and directories found directly | ||||
at the root of the working directory. | ||||
* Offset 4: | ||||
Number of root nodes, as a 32-bit big-endian integer. | ||||
* Offset 8: | ||||
Total number of nodes in the entire tree that "have a dirstate entry", | ||||
as a 32-bit big-endian integer. | ||||
Those nodes represent files that would be present at all in `dirstate-v1`. | ||||
This is typically less than the total number of nodes. | ||||
This counter is used to implement `len(dirstatemap)`. | ||||
* Offset 12: | ||||
Number of nodes in the entire tree that have a copy source, | ||||
as a 32-bit big-endian integer. | ||||
At the next commit, these files are recorded | ||||
as having been copied or moved/renamed from that source. | ||||
(A move is recorded as a copy and separate removal of the source.) | ||||
This counter is used to implement `len(dirstatemap.copymap)`. | ||||
* Offset 16: | ||||
An estimation of how many bytes of the data file | ||||
(within its used size) are unused, as a 32-bit big-endian integer. | ||||
When appending to an existing data file, | ||||
some existing nodes or paths can be unreachable from the new root | ||||
but they still take up space. | ||||
This counter is used to decide when to write a new data file from scratch | ||||
instead of appending to an existing one, | ||||
in order to get rid of that unreachable data | ||||
and avoid unbounded file size growth. | ||||
* Offset 20: | ||||
These four bytes are currently ignored | ||||
and reset to zero when updating a docket file. | ||||
This is an attempt at forward compatibility: | ||||
future Mercurial versions could use this as a bit field | ||||
to indicate that a dirstate has additional data or constraints. | ||||
Finding a dirstate file with the relevant bit unset indicates that | ||||
it was written by a then-older version | ||||
which is not aware of that future change. | ||||
* Offset 24: | ||||
Either 20 zero bytes, or a SHA-1 hash as 20 binary bytes. | ||||
When present, the hash is of ignore patterns | ||||
that were used for some previous run of the `status` algorithm. | ||||
* (Offset 44: end of tree metadata) | ||||
Optional hash of ignore patterns | ||||
-------------------------------- | ||||
The implementation of `status` at `rust/hg-core/src/dirstate_tree/status.rs` | ||||
has been optimized such that its run time is dominated by calls | ||||
to `stat` for reading the filesystem metadata of a file or directory, | ||||
and to `readdir` for listing the contents of a directory. | ||||
In some cases the algorithm can skip calls to `readdir` | ||||
(saving significant time) | ||||
because the dirstate already contains enough of the relevant information | ||||
to build the correct `status` results. | ||||
The default configuration of `hg status` is to list unknown files | ||||
but not ignored files. | ||||
In this case, it matters for the `readdir`-skipping optimization | ||||
if a given file used to be ignored but became unknown | ||||
because `.hgignore` changed. | ||||
To detect the possibility of such a change, | ||||
the tree metadata contains an optional hash of all ignore patterns. | ||||
We define: | ||||
* "Root" ignore files as: | ||||
- `.hgignore` at the root of the repository if it exists | ||||
- And all files from `ui.ignore.*` config. | ||||
This set of files is sorted by the string representation of their path. | ||||
* The "expanded contents" of an ignore files is the byte string made | ||||
by the concatenation of its contents followed by the "expanded contents" | ||||
of other files included with `include:` or `subinclude:` directives, | ||||
in inclusion order. This definition is recursive, as included files can | ||||
themselves include more files. | ||||
This hash is defined as the SHA-1 of the concatenation (in sorted | ||||
order) of the "expanded contents" of each "root" ignore file. | ||||
(Note that computing this does not require actually concatenating byte ranges into | ||||
contiguous memory. | ||||
Instead a SHA-1 hasher object can be created and fed separate byte ranges one by | ||||
one.) | ||||
The data file format | ||||
-------------------- | ||||
This is implemented in `rust/hg-core/src/dirstate_tree/on_disk.rs` | ||||
and `mercurial/dirstateutils/v2.py`. | ||||
The data file contains two types of data: paths and nodes. | ||||
Paths and nodes can be organized in any order in the file, except that sibling | ||||
nodes must be next to each other and sorted by their path. Contiguity lets | ||||
the parent refer to them all by their count with a single pseudo-pointer, | ||||
instead of storing one pseudo-pointer per child node. Sorting allows using | ||||
binary seach to find a child node with a given name in `O(log(n))` byte ranges | ||||
comparisons. | ||||
The current implemention writes paths and child node before a given node | ||||
for ease of figuring out the value of pseudo-pointers by the time the are to be | ||||
written, but this is not an obligation and readers must not rely on it. | ||||
A path is stored as a byte string anywhere in the file, without delimiter. | ||||
It is refered to by one or more node by a pseudo-pointer to its start, and its | ||||
length in bytes. Since there is no delimiter, | ||||
when a path is a substring of another the same bytes could be reused, | ||||
although the implementation does not exploit this as of this writing. | ||||
A node is stored on 43 bytes with components at fixed offsets. Paths and | ||||
child nodes relevant to a node are stored externally and referenced though | ||||
pseudo-pointers. | ||||
All integers are stored in big-endian. All pseudo-pointers are 32-bit integers | ||||
counting bytes from the start of the data file. Path lengths and positions | ||||
are 16-bit integers, also counted in bytes. | ||||
Node components are: | ||||
* Offset 0: | ||||
Pseudo-pointer to the full path of this node, | ||||
from the working directory root. | ||||
* Offset 4: | ||||
Length of the full path. | ||||
* Offset 6: | ||||
Position of the last `/` path separator within the full path, | ||||
in bytes from the start of the full path, | ||||
or zero if there isn’t one. | ||||
The part of the full path after this position is the "base name". | ||||
Since sibling nodes have the same parent, only their base name vary | ||||
and needs to be considered when doing binary search to find a given path. | ||||
* Offset 8: | ||||
Pseudo-pointer to the "copy source" path for this node, | ||||
or zero if there is no copy source. | ||||
* Offset 12: | ||||
Length of the copy source path, or zero if there isn’t one. | ||||
* Offset 14: | ||||
Pseudo-pointer to the start of child nodes. | ||||
* Offset 18: | ||||
Number of child nodes, as a 32-bit integer. | ||||
They occupy 43 times this number of bytes | ||||
(not counting space for paths, and further descendants). | ||||
* Offset 22: | ||||
Number as a 32-bit integer of descendant nodes in this subtree, | ||||
not including this node itself, | ||||
that "have a dirstate entry". | ||||
Those nodes represent files that would be present at all in `dirstate-v1`. | ||||
This is typically less than the total number of descendants. | ||||
This counter is used to implement `has_dir`. | ||||
* Offset 26: | ||||
Number as a 32-bit integer of descendant nodes in this subtree, | ||||
not including this node itself, | ||||
that represent files tracked in the working directory. | ||||
(For example, `hg rm` makes a file untracked.) | ||||
This counter is used to implement `has_tracked_dir`. | ||||
* Offset 30 and more: | ||||
**TODO:** docs not written yet | ||||
as this part of the format might be changing soon. | ||||