|
|
The *dirstate* is what Mercurial uses internally to track
|
|
|
the state of files in the working directory,
|
|
|
such as set by commands like `hg add` and `hg rm`.
|
|
|
It also contains some cached data that help make `hg status` faster.
|
|
|
The name refers both to `.hg/dirstate` on the filesystem
|
|
|
and the corresponding data structure in memory while a Mercurial process
|
|
|
is running.
|
|
|
|
|
|
The original file format, retroactively dubbed `dirstate-v1`,
|
|
|
is described at https://www.mercurial-scm.org/wiki/DirState.
|
|
|
It is made of a flat sequence of unordered variable-size entries,
|
|
|
so accessing any information in it requires parsing all of it.
|
|
|
Similarly, saving changes requires rewriting the entire file.
|
|
|
|
|
|
The newer `dirsate-v2` file format is designed to fix these limitations
|
|
|
and make `hg status` faster.
|
|
|
|
|
|
User guide
|
|
|
==========
|
|
|
|
|
|
Compatibility
|
|
|
-------------
|
|
|
|
|
|
The file format is experimental and may still change.
|
|
|
Different versions of Mercurial may not be compatible with each other
|
|
|
when working on a local repository that uses this format.
|
|
|
When using an incompatible version with the experimental format,
|
|
|
anything can happen including data corruption.
|
|
|
|
|
|
Since the dirstate is entirely local and not relevant to the wire protocol,
|
|
|
`dirstate-v2` does not affect compatibility with remote Mercurial versions.
|
|
|
|
|
|
When `share-safe` is enabled, different repositories sharing the same store
|
|
|
can use different dirstate formats.
|
|
|
|
|
|
Enabling `dirsate-v2` for new local repositories
|
|
|
------------------------------------------------
|
|
|
|
|
|
When creating a new local repository such as with `hg init` or `hg clone`,
|
|
|
the `exp-dirstate-v2` boolean in the `format` configuration section
|
|
|
controls whether to use this file format.
|
|
|
This is disabled by default as of this writing.
|
|
|
To enable it for a single repository, run for example::
|
|
|
|
|
|
$ hg init my-project --config format.exp-dirstate-v2=1
|
|
|
|
|
|
Checking the format of an existing local repsitory
|
|
|
--------------------------------------------------
|
|
|
|
|
|
The `debugformat` commands prints information about
|
|
|
which of multiple optional formats are used in the current repository,
|
|
|
including `dirstate-v2`::
|
|
|
|
|
|
$ hg debugformat
|
|
|
format-variant repo
|
|
|
fncache: yes
|
|
|
dirstate-v2: yes
|
|
|
[…]
|
|
|
|
|
|
Upgrading or downgrading an existing local repository
|
|
|
-----------------------------------------------------
|
|
|
|
|
|
The `debugupgrade` command does various upgrades or downgrades
|
|
|
on a local repository
|
|
|
based on the current Mercurial version and on configuration.
|
|
|
The same `format.exp-dirstate-v2` configuration is used again.
|
|
|
|
|
|
Example to upgrade::
|
|
|
|
|
|
$ hg debugupgrade --config format.exp-dirstate-v2=1
|
|
|
|
|
|
Example to downgrade to `dirstate-v1`::
|
|
|
|
|
|
$ hg debugupgrade --config format.exp-dirstate-v2=0
|
|
|
|
|
|
Both of this commands do nothing but print a list of proposed changes,
|
|
|
which may include changes unrelated to the dirstate.
|
|
|
Those other changes are controlled by their own configuration keys.
|
|
|
Add `--run` to a command to actually apply the proposed changes.
|
|
|
|
|
|
Backups of `.hg/requires` and `.hg/dirstate` are created
|
|
|
in a `.hg/upgradebackup.*` directory.
|
|
|
If something goes wrong, restoring those files should undo the change.
|
|
|
|
|
|
Note that upgrading affects compatibility with older versions of Mercurial
|
|
|
as noted above.
|
|
|
This can be relevant when a repository’s files are on a USB drive
|
|
|
or some other removable media, or shared over the network, etc.
|
|
|
|
|
|
Internal filesystem representation
|
|
|
==================================
|
|
|
|
|
|
Requirements file
|
|
|
-----------------
|
|
|
|
|
|
The `.hg/requires` file indicates which of various optional file formats
|
|
|
are used by a given repository.
|
|
|
Mercurial aborts when seeing a requirement it does not know about,
|
|
|
which avoids older version accidentally messing up a respository
|
|
|
that uses a format that was introduced later.
|
|
|
For versions that do support a format, the presence or absence of
|
|
|
the corresponding requirement indicates whether to use that format.
|
|
|
|
|
|
When the file contains a `exp-dirstate-v2` line,
|
|
|
the `dirstate-v2` format is used.
|
|
|
With no such line `dirstate-v1` is used.
|
|
|
|
|
|
High level description
|
|
|
----------------------
|
|
|
|
|
|
Whereas `dirstate-v1` uses a single `.hg/disrtate` file,
|
|
|
in `dirstate-v2` that file is a "docket" file
|
|
|
that only contains some metadata
|
|
|
and points to separate data file named `.hg/dirstate.{ID}`,
|
|
|
where `{ID}` is a random identifier.
|
|
|
|
|
|
This separation allows making data files append-only
|
|
|
and therefore safer to memory-map.
|
|
|
Creating a new data file (occasionally to clean up unused data)
|
|
|
can be done with a different ID
|
|
|
without disrupting another Mercurial process
|
|
|
that could still be using the previous data file.
|
|
|
|
|
|
Both files have a format designed to reduce the need for parsing,
|
|
|
by using fixed-size binary components as much as possible.
|
|
|
For data that is not fixed-size,
|
|
|
references to other parts of a file can be made by storing "pseudo-pointers":
|
|
|
integers counted in bytes from the start of a file.
|
|
|
For read-only access no data structure is needed,
|
|
|
only a bytes buffer (possibly memory-mapped directly from the filesystem)
|
|
|
with specific parts read on demand.
|
|
|
|
|
|
The data file contains "nodes" organized in a tree.
|
|
|
Each node represents a file or directory inside the working directory
|
|
|
or its parent changeset.
|
|
|
This tree has the same structure as the filesystem,
|
|
|
so a node representing a directory has child nodes representing
|
|
|
the files and subdirectories contained directly in that directory.
|
|
|
|
|
|
The docket file format
|
|
|
----------------------
|
|
|
|
|
|
This is implemented in `rust/hg-core/src/dirstate_tree/on_disk.rs`
|
|
|
and `mercurial/dirstateutils/docket.py`.
|
|
|
|
|
|
Components of the docket file are found at fixed offsets,
|
|
|
counted in bytes from the start of the file:
|
|
|
|
|
|
* Offset 0:
|
|
|
The 12-bytes marker string "dirstate-v2\n" ending with a newline character.
|
|
|
This makes it easier to tell a dirstate-v2 file from a dirstate-v1 file,
|
|
|
although it is not strictly necessary
|
|
|
since `.hg/requires` determines which format to use.
|
|
|
|
|
|
* Offset 12:
|
|
|
The changeset node ID on the first parent of the working directory,
|
|
|
as up to 32 binary bytes.
|
|
|
If a node ID is shorter (20 bytes for SHA-1),
|
|
|
it is start-aligned and the rest of the bytes are set to zero.
|
|
|
|
|
|
* Offset 44:
|
|
|
The changeset node ID on the second parent of the working directory,
|
|
|
or all zeros if there isn’t one.
|
|
|
Also 32 binary bytes.
|
|
|
|
|
|
* Offset 76:
|
|
|
Tree metadata on 44 bytes, described below.
|
|
|
Its separation in this documentation from the rest of the docket
|
|
|
reflects a detail of the current implementation.
|
|
|
Since tree metadata is also made of fields at fixed offsets, those could
|
|
|
be inlined here by adding 76 bytes to each offset.
|
|
|
|
|
|
* Offset 120:
|
|
|
The used size of the data file, as a 32-bit big-endian integer.
|
|
|
The actual size of the data file may be larger
|
|
|
(if another Mercurial processis in appending to it
|
|
|
but has not updated the docket yet).
|
|
|
That extra data must be ignored.
|
|
|
|
|
|
* Offset 124:
|
|
|
The length of the data file identifier, as a 8-bit integer.
|
|
|
|
|
|
* Offset 125:
|
|
|
The data file identifier.
|
|
|
|
|
|
* Any additional data is current ignored, and dropped when updating the file.
|
|
|
|
|
|
Tree metadata in the docket file
|
|
|
--------------------------------
|
|
|
|
|
|
Tree metadata is similarly made of components at fixed offsets.
|
|
|
These offsets are counted in bytes from the start of tree metadata,
|
|
|
which is 76 bytes after the start of the docket file.
|
|
|
|
|
|
This metadata can be thought of as the singular root of the tree
|
|
|
formed by nodes in the data file.
|
|
|
|
|
|
* Offset 0:
|
|
|
Pseudo-pointer to the start of root nodes,
|
|
|
counted in bytes from the start of the data file,
|
|
|
as a 32-bit big-endian integer.
|
|
|
These nodes describe files and directories found directly
|
|
|
at the root of the working directory.
|
|
|
|
|
|
* Offset 4:
|
|
|
Number of root nodes, as a 32-bit big-endian integer.
|
|
|
|
|
|
* Offset 8:
|
|
|
Total number of nodes in the entire tree that "have a dirstate entry",
|
|
|
as a 32-bit big-endian integer.
|
|
|
Those nodes represent files that would be present at all in `dirstate-v1`.
|
|
|
This is typically less than the total number of nodes.
|
|
|
This counter is used to implement `len(dirstatemap)`.
|
|
|
|
|
|
* Offset 12:
|
|
|
Number of nodes in the entire tree that have a copy source,
|
|
|
as a 32-bit big-endian integer.
|
|
|
At the next commit, these files are recorded
|
|
|
as having been copied or moved/renamed from that source.
|
|
|
(A move is recorded as a copy and separate removal of the source.)
|
|
|
This counter is used to implement `len(dirstatemap.copymap)`.
|
|
|
|
|
|
* Offset 16:
|
|
|
An estimation of how many bytes of the data file
|
|
|
(within its used size) are unused, as a 32-bit big-endian integer.
|
|
|
When appending to an existing data file,
|
|
|
some existing nodes or paths can be unreachable from the new root
|
|
|
but they still take up space.
|
|
|
This counter is used to decide when to write a new data file from scratch
|
|
|
instead of appending to an existing one,
|
|
|
in order to get rid of that unreachable data
|
|
|
and avoid unbounded file size growth.
|
|
|
|
|
|
* Offset 20:
|
|
|
These four bytes are currently ignored
|
|
|
and reset to zero when updating a docket file.
|
|
|
This is an attempt at forward compatibility:
|
|
|
future Mercurial versions could use this as a bit field
|
|
|
to indicate that a dirstate has additional data or constraints.
|
|
|
Finding a dirstate file with the relevant bit unset indicates that
|
|
|
it was written by a then-older version
|
|
|
which is not aware of that future change.
|
|
|
|
|
|
* Offset 24:
|
|
|
Either 20 zero bytes, or a SHA-1 hash as 20 binary bytes.
|
|
|
When present, the hash is of ignore patterns
|
|
|
that were used for some previous run of the `status` algorithm.
|
|
|
|
|
|
* (Offset 44: end of tree metadata)
|
|
|
|
|
|
Optional hash of ignore patterns
|
|
|
--------------------------------
|
|
|
|
|
|
The implementation of `status` at `rust/hg-core/src/dirstate_tree/status.rs`
|
|
|
has been optimized such that its run time is dominated by calls
|
|
|
to `stat` for reading the filesystem metadata of a file or directory,
|
|
|
and to `readdir` for listing the contents of a directory.
|
|
|
In some cases the algorithm can skip calls to `readdir`
|
|
|
(saving significant time)
|
|
|
because the dirstate already contains enough of the relevant information
|
|
|
to build the correct `status` results.
|
|
|
|
|
|
The default configuration of `hg status` is to list unknown files
|
|
|
but not ignored files.
|
|
|
In this case, it matters for the `readdir`-skipping optimization
|
|
|
if a given file used to be ignored but became unknown
|
|
|
because `.hgignore` changed.
|
|
|
To detect the possibility of such a change,
|
|
|
the tree metadata contains an optional hash of all ignore patterns.
|
|
|
|
|
|
We define:
|
|
|
|
|
|
* "Root" ignore files as:
|
|
|
|
|
|
- `.hgignore` at the root of the repository if it exists
|
|
|
- And all files from `ui.ignore.*` config.
|
|
|
|
|
|
This set of files is sorted by the string representation of their path.
|
|
|
|
|
|
* The "expanded contents" of an ignore files is the byte string made
|
|
|
by the concatenation of its contents followed by the "expanded contents"
|
|
|
of other files included with `include:` or `subinclude:` directives,
|
|
|
in inclusion order. This definition is recursive, as included files can
|
|
|
themselves include more files.
|
|
|
|
|
|
This hash is defined as the SHA-1 of the concatenation (in sorted
|
|
|
order) of the "expanded contents" of each "root" ignore file.
|
|
|
(Note that computing this does not require actually concatenating
|
|
|
into a single contiguous byte sequence.
|
|
|
Instead a SHA-1 hasher object can be created
|
|
|
and fed separate chunks one by one.)
|
|
|
|
|
|
The data file format
|
|
|
--------------------
|
|
|
|
|
|
This is implemented in `rust/hg-core/src/dirstate_tree/on_disk.rs`
|
|
|
and `mercurial/dirstateutils/v2.py`.
|
|
|
|
|
|
The data file contains two types of data: paths and nodes.
|
|
|
|
|
|
Paths and nodes can be organized in any order in the file, except that sibling
|
|
|
nodes must be next to each other and sorted by their path.
|
|
|
Contiguity lets the parent refer to them all
|
|
|
by their count and a single pseudo-pointer,
|
|
|
instead of storing one pseudo-pointer per child node.
|
|
|
Sorting allows using binary seach to find a child node with a given name
|
|
|
in `O(log(n))` byte sequence comparisons.
|
|
|
|
|
|
The current implemention writes paths and child node before a given node
|
|
|
for ease of figuring out the value of pseudo-pointers by the time the are to be
|
|
|
written, but this is not an obligation and readers must not rely on it.
|
|
|
|
|
|
A path is stored as a byte string anywhere in the file, without delimiter.
|
|
|
It is refered to by one or more node by a pseudo-pointer to its start, and its
|
|
|
length in bytes. Since there is no delimiter,
|
|
|
when a path is a substring of another the same bytes could be reused,
|
|
|
although the implementation does not exploit this as of this writing.
|
|
|
|
|
|
A node is stored on 43 bytes with components at fixed offsets. Paths and
|
|
|
child nodes relevant to a node are stored externally and referenced though
|
|
|
pseudo-pointers.
|
|
|
|
|
|
All integers are stored in big-endian. All pseudo-pointers are 32-bit integers
|
|
|
counting bytes from the start of the data file. Path lengths and positions
|
|
|
are 16-bit integers, also counted in bytes.
|
|
|
|
|
|
Node components are:
|
|
|
|
|
|
* Offset 0:
|
|
|
Pseudo-pointer to the full path of this node,
|
|
|
from the working directory root.
|
|
|
|
|
|
* Offset 4:
|
|
|
Length of the full path.
|
|
|
|
|
|
* Offset 6:
|
|
|
Position of the last `/` path separator within the full path,
|
|
|
in bytes from the start of the full path,
|
|
|
or zero if there isn’t one.
|
|
|
The part of the full path after this position is the "base name".
|
|
|
Since sibling nodes have the same parent, only their base name vary
|
|
|
and needs to be considered when doing binary search to find a given path.
|
|
|
|
|
|
* Offset 8:
|
|
|
Pseudo-pointer to the "copy source" path for this node,
|
|
|
or zero if there is no copy source.
|
|
|
|
|
|
* Offset 12:
|
|
|
Length of the copy source path, or zero if there isn’t one.
|
|
|
|
|
|
* Offset 14:
|
|
|
Pseudo-pointer to the start of child nodes.
|
|
|
|
|
|
* Offset 18:
|
|
|
Number of child nodes, as a 32-bit integer.
|
|
|
They occupy 43 times this number of bytes
|
|
|
(not counting space for paths, and further descendants).
|
|
|
|
|
|
* Offset 22:
|
|
|
Number as a 32-bit integer of descendant nodes in this subtree,
|
|
|
not including this node itself,
|
|
|
that "have a dirstate entry".
|
|
|
Those nodes represent files that would be present at all in `dirstate-v1`.
|
|
|
This is typically less than the total number of descendants.
|
|
|
This counter is used to implement `has_dir`.
|
|
|
|
|
|
* Offset 26:
|
|
|
Number as a 32-bit integer of descendant nodes in this subtree,
|
|
|
not including this node itself,
|
|
|
that represent files tracked in the working directory.
|
|
|
(For example, `hg rm` makes a file untracked.)
|
|
|
This counter is used to implement `has_tracked_dir`.
|
|
|
|
|
|
* Offset 30:
|
|
|
Some boolean values packed as bits of a single byte.
|
|
|
Starting from least-significant, bit masks are::
|
|
|
|
|
|
WDIR_TRACKED = 1 << 0
|
|
|
P1_TRACKED = 1 << 1
|
|
|
P2_INFO = 1 << 2
|
|
|
HAS_MODE_AND_SIZE = 1 << 3
|
|
|
HAS_MTIME = 1 << 4
|
|
|
|
|
|
Other bits are unset. The meaning of these bits are:
|
|
|
|
|
|
`WDIR_TRACKED`
|
|
|
Set if the working directory contains a tracked file at this node’s path.
|
|
|
This is typically set and unset by `hg add` and `hg rm`.
|
|
|
|
|
|
`P1_TRACKED`
|
|
|
set if the working directory’s first parent changeset
|
|
|
(whose node identifier is found in tree metadata)
|
|
|
contains a tracked file at this node’s path.
|
|
|
This is a cache to reduce manifest lookups.
|
|
|
|
|
|
`P2_INFO`
|
|
|
Set if the file has been involved in some merge operation.
|
|
|
Either because it was actually merged,
|
|
|
or because the version in the second parent p2 version was ahead,
|
|
|
or because some rename moved it there.
|
|
|
In either case `hg status` will want it displayed as modified.
|
|
|
|
|
|
Files that would be mentioned at all in the `dirstate-v1` file format
|
|
|
have a node with at least one of the above three bits set in `dirstate-v2`.
|
|
|
Let’s call these files "tracked anywhere",
|
|
|
and "untracked" the nodes with all three of these bits unset.
|
|
|
Untracked nodes are typically for directories:
|
|
|
they hold child nodes and form the tree structure.
|
|
|
Additional untracked nodes may also exist.
|
|
|
Although implementations should strive to clean up nodes
|
|
|
that are entirely unused, other untracked nodes may also exist.
|
|
|
For example, a future version of Mercurial might in some cases
|
|
|
add nodes for untracked files or/and ignored files in the working directory
|
|
|
in order to optimize `hg status`
|
|
|
by enabling it to skip `readdir` in more cases.
|
|
|
|
|
|
When a node is for a file tracked anywhere,
|
|
|
the rest of the node data is three fields:
|
|
|
|
|
|
* Offset 31:
|
|
|
If `HAS_MODE_AND_SIZE` is unset, four zero bytes.
|
|
|
Otherwise, a 32-bit integer for the Unix mode (as in `stat_result.st_mode`)
|
|
|
expected for this file to be considered clean.
|
|
|
Only the `S_IXUSR` bit (owner has execute permission) is considered.
|
|
|
|
|
|
* Offset 35:
|
|
|
If `HAS_MODE_AND_SIZE` is unset, four zero bytes.
|
|
|
Otherwise, a 32-bit integer for expected size of the file
|
|
|
truncated to its 31 least-significant bits.
|
|
|
Unlike in dirstate-v1, negative values are not used.
|
|
|
|
|
|
* Offset 39:
|
|
|
If `HAS_MTIME` is unset, four zero bytes.
|
|
|
Otherwise, a 32-bit integer for expected modified time of the file
|
|
|
(as in `stat_result.st_mtime`),
|
|
|
truncated to its 31 least-significant bits.
|
|
|
Unlike in dirstate-v1, negative values are not used.
|
|
|
|
|
|
If an untracked node `HAS_MTIME` *unset*, this space is unused:
|
|
|
|
|
|
* Offset 31:
|
|
|
12 bytes set to zero
|
|
|
|
|
|
If an untracked node `HAS_MTIME` *set*,
|
|
|
what follows is the modification time of a directory
|
|
|
represented with separated second and sub-second components
|
|
|
since the Unix epoch:
|
|
|
|
|
|
* Offset 31:
|
|
|
The number of seconds as a signed (two’s complement) 64-bit integer.
|
|
|
|
|
|
* Offset 39:
|
|
|
The number of nanoseconds as 32-bit integer.
|
|
|
Always greater than or equal to zero, and strictly less than a billion.
|
|
|
Increasing this component makes the modification time
|
|
|
go forward or backward in time dependening
|
|
|
on the sign of the integral seconds components.
|
|
|
(Note: this is buggy because there is no negative zero integer,
|
|
|
but will be changed soon.)
|
|
|
|
|
|
The presence of a directory modification time means that at some point,
|
|
|
this path in the working directory was observed:
|
|
|
|
|
|
- To be a directory
|
|
|
- With the given modification time
|
|
|
- That time was already strictly in the past when observed,
|
|
|
meaning that later changes cannot happen in the same clock tick
|
|
|
and must cause a different modification time
|
|
|
(unless the system clock jumps back and we get unlucky,
|
|
|
which is not impossible but deemed unlikely enough).
|
|
|
- All direct children of this directory
|
|
|
(as returned by `std::fs::read_dir`)
|
|
|
either have a corresponding dirstate node,
|
|
|
or are ignored by ignore patterns whose hash is in tree metadata.
|
|
|
|
|
|
This means that if `std::fs::symlink_metadata` later reports
|
|
|
the same modification time
|
|
|
and ignored patterns haven’t changed,
|
|
|
a run of status that is not listing ignored files
|
|
|
can skip calling `std::fs::read_dir` again for this directory,
|
|
|
and iterate child dirstate nodes instead.
|
|
|
|
|
|
|
|
|
* (Offset 43: end of this node)
|
|
|
|