# HG changeset patch # User Simon Sapin # Date 2021-10-01 10:17:09 # Node ID e8a576de703f51273a464b563a2ecde5d6c82143 # Parent d467e44f71d7a023ccce24a80abb2ce6d2e06bf8 dirstate-v2: Add internal documentation It can be viewed by running `hg help internals.dirstate-v2` Since that command rewraps paragraphs, the source text is written with semantic line breaks: https://sembr.org/ Differential Revision: https://phab.mercurial-scm.org/D11546 diff --git a/mercurial/help.py b/mercurial/help.py --- a/mercurial/help.py +++ b/mercurial/help.py @@ -365,6 +365,11 @@ internalstable = sorted( loaddoc(b'config', subdir=b'internals'), ), ( + [b'dirstate-v2'], + _(b'dirstate-v2 file format'), + loaddoc(b'dirstate-v2', subdir=b'internals'), + ), + ( [b'extensions', b'extension'], _(b'Extension API'), loaddoc(b'extensions', subdir=b'internals'), diff --git a/mercurial/helptext/internals/dirstate-v2.txt b/mercurial/helptext/internals/dirstate-v2.txt new file mode 100644 --- /dev/null +++ b/mercurial/helptext/internals/dirstate-v2.txt @@ -0,0 +1,375 @@ +The *dirstate* is what Mercurial uses internally to track +the state of files in the working directory, +such as set by commands like `hg add` and `hg rm`. +It also contains some cached data that help make `hg status` faster. +The name refers both to `.hg/dirstate` on the filesystem +and the corresponding data structure in memory while a Mercurial process +is running. + +The original file format, retroactively dubbed `dirstate-v1`, +is described at https://www.mercurial-scm.org/wiki/DirState. +It is made of a flat sequence of unordered variable-size entries, +so accessing any information in it requires parsing all of it. +Similarly, saving changes requires rewriting the entire file. + +The newer `dirsate-v2` file format is designed to fix these limitations +and make `hg status` faster. + +User guide +========== + +Compatibility +------------- + +The file format is experimental and may still change. +Different versions of Mercurial may not be compatible with each other +when working on a local repository that uses this format. +When using an incompatible version with the experimental format, +anything can happen including data corruption. + +Since the dirstate is entirely local and not relevant to the wire protocol, +`dirstate-v2` does not affect compatibility with remote Mercurial versions. + +When `share-safe` is enabled, different repositories sharing the same store +can use different dirstate formats. + +Enabling `dirsate-v2` for new local repositories +------------------------------------------------ + +When creating a new local repository such as with `hg init` or `hg clone`, +the `exp-dirstate-v2` boolean in the `format` configuration section +controls whether to use this file format. +This is disabled by default as of this writing. +To enable it for a single repository, run for example:: + + $ hg init my-project --config format.exp-dirstate-v2=1 + +Checking the format of an existing local repsitory +-------------------------------------------------- + +The `debugformat` commands prints information about +which of multiple optional formats are used in the current repository, +including `dirstate-v2`:: + + $ hg debugformat + format-variant repo + fncache: yes + dirstate-v2: yes + […] + +Upgrading or downgrading an existing local repository +----------------------------------------------------- + +The `debugupgrade` command does various upgrades or downgrades +on a local repository +based on the current Mercurial version and on configuration. +The same `format.exp-dirstate-v2` configuration is used again. + +Example to upgrade:: + + $ hg debugupgrade --config format.exp-dirstate-v2=1 + +Example to downgrade to `dirstate-v1`:: + + $ hg debugupgrade --config format.exp-dirstate-v2=0 + +Both of this commands do nothing but print a list of proposed changes, +which may include changes unrelated to the dirstate. +Those other changes are controlled by their own configuration keys. +Add `--run` to a command to actually apply the proposed changes. + +Backups of `.hg/requires` and `.hg/dirstate` are created +in a `.hg/upgradebackup.*` directory. +If something goes wrong, restoring those files should undo the change. + +Note that upgrading affects compatibility with older versions of Mercurial +as noted above. +This can be relevant when a repository’s files are on a USB drive +or some other removable media, or shared over the network, etc. + +Internal filesystem representation +================================== + +Requirements file +----------------- + +The `.hg/requires` file indicates which of various optional file formats +are used by a given repository. +Mercurial aborts when seeing a requirement it does not know about, +which avoids older version accidentally messing up a respository +that uses a format that was introduced later. +For versions that do support a format, the presence or absence of +the corresponding requirement indicates whether to use that format. + +When the file contains a `exp-dirstate-v2` line, +the `dirstate-v2` format is used. +With no such line `dirstate-v1` is used. + +High level description +---------------------- + +Whereas `dirstate-v1` uses a single `.hg/disrtate` file, +in `dirstate-v2` that file is a "docket" file +that only contains some metadata +and points to separate data file named `.hg/dirstate.{ID}`, +where `{ID}` is a random identifier. + +This separation allows making data files append-only +and therefore safer to memory-map. +Creating a new data file (occasionally to clean up unused data) +can be done with a different ID +without disrupting another Mercurial process +that could still be using the previous data file. + +Both files have a format designed to reduce the need for parsing, +by using fixed-size binary components as much as possible. +For data that is not fixed-size, +references to other parts of a file can be made by storing "pseudo-pointers": +integers counted in bytes from the start of a file. +For read-only access no data structure is needed, +only a bytes buffer (possibly memory-mapped directly from the filesystem) +with specific parts read on demand. + +The data file contains "nodes" organized in a tree. +Each node represents a file or directory inside the working directory +or its parent changeset. +This tree has the same structure as the filesystem, +so a node representing a directory has child nodes representing +the files and subdirectories contained directly in that directory. + +The docket file format +---------------------- + +This is implemented in `rust/hg-core/src/dirstate_tree/on_disk.rs` +and `mercurial/dirstateutils/docket.py`. + +Components of the docket file are found at fixed offsets, +counted in bytes from the start of the file: + +* Offset 0: + The 12-bytes marker string "dirstate-v2\n" ending with a newline character. + This makes it easier to tell a dirstate-v2 file from a dirstate-v1 file, + although it is not strictly necessary + since `.hg/requires` determines which format to use. + +* Offset 12: + The changeset node ID on the first parent of the working directory, + as up to 32 binary bytes. + If a node ID is shorter (20 bytes for SHA-1), + it is start-aligned and the rest of the bytes are set to zero. + +* Offset 44: + The changeset node ID on the second parent of the working directory, + or all zeros if there isn’t one. + Also 32 binary bytes. + +* Offset 76: + Tree metadata on 44 bytes, described below. + Its separation in this documentation from the rest of the docket + reflects a detail of the current implementation. + Since tree metadata is also made of fields at fixed offsets, those could + be inlined here by adding 76 bytes to each offset. + +* Offset 120: + The used size of the data file, as a 32-bit big-endian integer. + The actual size of the data file may be larger + (if another Mercurial processis in appending to it + but has not updated the docket yet). + That extra data must be ignored. + +* Offset 124: + The length of the data file identifier, as a 8-bit integer. + +* Offset 125: + The data file identifier. + +* Any additional data is current ignored, and dropped when updating the file. + +Tree metadata in the docket file +-------------------------------- + +Tree metadata is similarly made of components at fixed offsets. +These offsets are counted in bytes from the start of tree metadata, +which is 76 bytes after the start of the docket file. + +This metadata can be thought of as the singular root of the tree +formed by nodes in the data file. + +* Offset 0: + Pseudo-pointer to the start of root nodes, + counted in bytes from the start of the data file, + as a 32-bit big-endian integer. + These nodes describe files and directories found directly + at the root of the working directory. + +* Offset 4: + Number of root nodes, as a 32-bit big-endian integer. + +* Offset 8: + Total number of nodes in the entire tree that "have a dirstate entry", + as a 32-bit big-endian integer. + Those nodes represent files that would be present at all in `dirstate-v1`. + This is typically less than the total number of nodes. + This counter is used to implement `len(dirstatemap)`. + +* Offset 12: + Number of nodes in the entire tree that have a copy source, + as a 32-bit big-endian integer. + At the next commit, these files are recorded + as having been copied or moved/renamed from that source. + (A move is recorded as a copy and separate removal of the source.) + This counter is used to implement `len(dirstatemap.copymap)`. + +* Offset 16: + An estimation of how many bytes of the data file + (within its used size) are unused, as a 32-bit big-endian integer. + When appending to an existing data file, + some existing nodes or paths can be unreachable from the new root + but they still take up space. + This counter is used to decide when to write a new data file from scratch + instead of appending to an existing one, + in order to get rid of that unreachable data + and avoid unbounded file size growth. + +* Offset 20: + These four bytes are currently ignored + and reset to zero when updating a docket file. + This is an attempt at forward compatibility: + future Mercurial versions could use this as a bit field + to indicate that a dirstate has additional data or constraints. + Finding a dirstate file with the relevant bit unset indicates that + it was written by a then-older version + which is not aware of that future change. + +* Offset 24: + Either 20 zero bytes, or a SHA-1 hash as 20 binary bytes. + When present, the hash is of ignore patterns + that were used for some previous run of the `status` algorithm. + +* (Offset 44: end of tree metadata) + +Optional hash of ignore patterns +-------------------------------- + +The implementation of `status` at `rust/hg-core/src/dirstate_tree/status.rs` +has been optimized such that its run time is dominated by calls +to `stat` for reading the filesystem metadata of a file or directory, +and to `readdir` for listing the contents of a directory. +In some cases the algorithm can skip calls to `readdir` +(saving significant time) +because the dirstate already contains enough of the relevant information +to build the correct `status` results. + +The default configuration of `hg status` is to list unknown files +but not ignored files. +In this case, it matters for the `readdir`-skipping optimization +if a given file used to be ignored but became unknown +because `.hgignore` changed. +To detect the possibility of such a change, +the tree metadata contains an optional hash of all ignore patterns. + +We define: + +* "Root" ignore files as: + + - `.hgignore` at the root of the repository if it exists + - And all files from `ui.ignore.*` config. + + This set of files is sorted by the string representation of their path. + +* The "expanded contents" of an ignore files is the byte string made + by the concatenation of its contents followed by the "expanded contents" + of other files included with `include:` or `subinclude:` directives, + in inclusion order. This definition is recursive, as included files can + themselves include more files. + +This hash is defined as the SHA-1 of the concatenation (in sorted +order) of the "expanded contents" of each "root" ignore file. +(Note that computing this does not require actually concatenating byte ranges into +contiguous memory. +Instead a SHA-1 hasher object can be created and fed separate byte ranges one by +one.) + +The data file format +-------------------- + +This is implemented in `rust/hg-core/src/dirstate_tree/on_disk.rs` +and `mercurial/dirstateutils/v2.py`. + +The data file contains two types of data: paths and nodes. + +Paths and nodes can be organized in any order in the file, except that sibling +nodes must be next to each other and sorted by their path. Contiguity lets +the parent refer to them all by their count with a single pseudo-pointer, +instead of storing one pseudo-pointer per child node. Sorting allows using +binary seach to find a child node with a given name in `O(log(n))` byte ranges +comparisons. + +The current implemention writes paths and child node before a given node +for ease of figuring out the value of pseudo-pointers by the time the are to be +written, but this is not an obligation and readers must not rely on it. + +A path is stored as a byte string anywhere in the file, without delimiter. +It is refered to by one or more node by a pseudo-pointer to its start, and its +length in bytes. Since there is no delimiter, +when a path is a substring of another the same bytes could be reused, +although the implementation does not exploit this as of this writing. + +A node is stored on 43 bytes with components at fixed offsets. Paths and +child nodes relevant to a node are stored externally and referenced though +pseudo-pointers. + +All integers are stored in big-endian. All pseudo-pointers are 32-bit integers +counting bytes from the start of the data file. Path lengths and positions +are 16-bit integers, also counted in bytes. + +Node components are: + +* Offset 0: + Pseudo-pointer to the full path of this node, + from the working directory root. + +* Offset 4: + Length of the full path. + +* Offset 6: + Position of the last `/` path separator within the full path, + in bytes from the start of the full path, + or zero if there isn’t one. + The part of the full path after this position is the "base name". + Since sibling nodes have the same parent, only their base name vary + and needs to be considered when doing binary search to find a given path. + +* Offset 8: + Pseudo-pointer to the "copy source" path for this node, + or zero if there is no copy source. + +* Offset 12: + Length of the copy source path, or zero if there isn’t one. + +* Offset 14: + Pseudo-pointer to the start of child nodes. + +* Offset 18: + Number of child nodes, as a 32-bit integer. + They occupy 43 times this number of bytes + (not counting space for paths, and further descendants). + +* Offset 22: + Number as a 32-bit integer of descendant nodes in this subtree, + not including this node itself, + that "have a dirstate entry". + Those nodes represent files that would be present at all in `dirstate-v1`. + This is typically less than the total number of descendants. + This counter is used to implement `has_dir`. + +* Offset 26: + Number as a 32-bit integer of descendant nodes in this subtree, + not including this node itself, + that represent files tracked in the working directory. + (For example, `hg rm` makes a file untracked.) + This counter is used to implement `has_tracked_dir`. + +* Offset 30 and more: + **TODO:** docs not written yet + as this part of the format might be changing soon. diff --git a/rust/hg-core/src/dirstate_tree/on_disk.rs b/rust/hg-core/src/dirstate_tree/on_disk.rs --- a/rust/hg-core/src/dirstate_tree/on_disk.rs +++ b/rust/hg-core/src/dirstate_tree/on_disk.rs @@ -1,22 +1,6 @@ //! The "version 2" disk representation of the dirstate //! -//! # File format -//! -//! In dirstate-v2 format, the `.hg/dirstate` file is a "docket that starts -//! with a fixed-sized header whose layout is defined by the `DocketHeader` -//! struct, followed by the data file identifier. -//! -//! A separate `.hg/dirstate.{uuid}.d` file contains most of the data. That -//! file may be longer than the size given in the docket, but not shorter. Only -//! the start of the data file up to the given size is considered. The -//! fixed-size "root" of the dirstate tree whose layout is defined by the -//! `Root` struct is found at the end of that slice of data. -//! -//! Its `root_nodes` field contains the slice (offset and length) to -//! the nodes representing the files and directories at the root of the -//! repository. Each node is also fixed-size, defined by the `Node` struct. -//! Nodes in turn contain slices to variable-size paths, and to their own child -//! nodes (if any) for nested files and directories. +//! See `mercurial/helptext/internals/dirstate-v2.txt` use crate::dirstate_tree::dirstate_map::{self, DirstateMap, NodeRef}; use crate::dirstate_tree::path_with_basename::WithBasename; diff --git a/tests/test-help.t b/tests/test-help.t --- a/tests/test-help.t +++ b/tests/test-help.t @@ -1121,6 +1121,7 @@ internals topic renders index of availab censor Censor changegroups Changegroups config Config Registrar + dirstate-v2 dirstate-v2 file format extensions Extension API mergestate Mergestate requirements Repository Requirements @@ -3566,6 +3567,13 @@ Sub-topic indexes rendered properly Config Registrar + + dirstate-v2 + + + dirstate-v2 file format + + extensions