upstream/mercurial-mirror Files · mercurial/helptext/internals/dirstate-v2.txt

Simon Sapin

r48978

The *dirstate* is what Mercurial uses internally to track

the state of files in the working directory,

such as set by commands like `hg add` and `hg rm`.

It also contains some cached data that help make `hg status` faster.

The name refers both to `.hg/dirstate` on the filesystem

and the corresponding data structure in memory while a Mercurial process

is running.

The original file format, retroactively dubbed `dirstate-v1`,

is described at https://www.mercurial-scm.org/wiki/DirState.

It is made of a flat sequence of unordered variable-size entries,

so accessing any information in it requires parsing all of it.

Similarly, saving changes requires rewriting the entire file.

The newer `dirsate-v2` file format is designed to fix these limitations

and make `hg status` faster.

User guide

==========

Compatibility

-------------

The file format is experimental and may still change.

Different versions of Mercurial may not be compatible with each other

when working on a local repository that uses this format.

When using an incompatible version with the experimental format,

anything can happen including data corruption.

Since the dirstate is entirely local and not relevant to the wire protocol,

`dirstate-v2` does not affect compatibility with remote Mercurial versions.

When `share-safe` is enabled, different repositories sharing the same store

can use different dirstate formats.

Enabling `dirsate-v2` for new local repositories

------------------------------------------------

When creating a new local repository such as with `hg init` or `hg clone`,

the `exp-dirstate-v2` boolean in the `format` configuration section

controls whether to use this file format.

This is disabled by default as of this writing.

To enable it for a single repository, run for example::

$ hg init my-project --config format.exp-dirstate-v2=1

Checking the format of an existing local repsitory

--------------------------------------------------

The `debugformat` commands prints information about

which of multiple optional formats are used in the current repository,

including `dirstate-v2`::

$ hg debugformat

format-variant repo

fncache: yes

dirstate-v2: yes

[…]

Upgrading or downgrading an existing local repository

-----------------------------------------------------

The `debugupgrade` command does various upgrades or downgrades

on a local repository

based on the current Mercurial version and on configuration.

The same `format.exp-dirstate-v2` configuration is used again.

Example to upgrade::

$ hg debugupgrade --config format.exp-dirstate-v2=1

Example to downgrade to `dirstate-v1`::

$ hg debugupgrade --config format.exp-dirstate-v2=0

Both of this commands do nothing but print a list of proposed changes,

which may include changes unrelated to the dirstate.

Those other changes are controlled by their own configuration keys.

Add `--run` to a command to actually apply the proposed changes.

Backups of `.hg/requires` and `.hg/dirstate` are created

in a `.hg/upgradebackup.*` directory.

If something goes wrong, restoring those files should undo the change.

Note that upgrading affects compatibility with older versions of Mercurial

as noted above.

This can be relevant when a repository’s files are on a USB drive

or some other removable media, or shared over the network, etc.

Internal filesystem representation

==================================

Requirements file

-----------------

The `.hg/requires` file indicates which of various optional file formats

are used by a given repository.

Mercurial aborts when seeing a requirement it does not know about,

which avoids older version accidentally messing up a respository

that uses a format that was introduced later.

For versions that do support a format, the presence or absence of

the corresponding requirement indicates whether to use that format.

When the file contains a `exp-dirstate-v2` line,

the `dirstate-v2` format is used.

With no such line `dirstate-v1` is used.

High level description

----------------------

Whereas `dirstate-v1` uses a single `.hg/disrtate` file,

in `dirstate-v2` that file is a "docket" file

that only contains some metadata

and points to separate data file named `.hg/dirstate.{ID}`,

where `{ID}` is a random identifier.

This separation allows making data files append-only

and therefore safer to memory-map.

Creating a new data file (occasionally to clean up unused data)

can be done with a different ID

without disrupting another Mercurial process

that could still be using the previous data file.

Both files have a format designed to reduce the need for parsing,

by using fixed-size binary components as much as possible.

For data that is not fixed-size,

references to other parts of a file can be made by storing "pseudo-pointers":

integers counted in bytes from the start of a file.

For read-only access no data structure is needed,

only a bytes buffer (possibly memory-mapped directly from the filesystem)

with specific parts read on demand.

The data file contains "nodes" organized in a tree.

Each node represents a file or directory inside the working directory

or its parent changeset.

This tree has the same structure as the filesystem,

so a node representing a directory has child nodes representing

the files and subdirectories contained directly in that directory.

The docket file format

----------------------

This is implemented in `rust/hg-core/src/dirstate_tree/on_disk.rs`

and `mercurial/dirstateutils/docket.py`.

Components of the docket file are found at fixed offsets,

counted in bytes from the start of the file:

* Offset 0:

The 12-bytes marker string "dirstate-v2\n" ending with a newline character.

This makes it easier to tell a dirstate-v2 file from a dirstate-v1 file,

although it is not strictly necessary

since `.hg/requires` determines which format to use.

* Offset 12:

The changeset node ID on the first parent of the working directory,

as up to 32 binary bytes.

If a node ID is shorter (20 bytes for SHA-1),

it is start-aligned and the rest of the bytes are set to zero.

* Offset 44:

The changeset node ID on the second parent of the working directory,

or all zeros if there isn’t one.

Also 32 binary bytes.

* Offset 76:

Tree metadata on 44 bytes, described below.

Its separation in this documentation from the rest of the docket

reflects a detail of the current implementation.

Since tree metadata is also made of fields at fixed offsets, those could

be inlined here by adding 76 bytes to each offset.

* Offset 120:

The used size of the data file, as a 32-bit big-endian integer.

The actual size of the data file may be larger

(if another Mercurial processis in appending to it

but has not updated the docket yet).

That extra data must be ignored.

* Offset 124:

The length of the data file identifier, as a 8-bit integer.

* Offset 125:

The data file identifier.

* Any additional data is current ignored, and dropped when updating the file.

Tree metadata in the docket file

--------------------------------

Tree metadata is similarly made of components at fixed offsets.

These offsets are counted in bytes from the start of tree metadata,

which is 76 bytes after the start of the docket file.

This metadata can be thought of as the singular root of the tree

formed by nodes in the data file.

* Offset 0:

Pseudo-pointer to the start of root nodes,

counted in bytes from the start of the data file,

as a 32-bit big-endian integer.

These nodes describe files and directories found directly

at the root of the working directory.

* Offset 4:

Number of root nodes, as a 32-bit big-endian integer.

* Offset 8:

Total number of nodes in the entire tree that "have a dirstate entry",

as a 32-bit big-endian integer.

Those nodes represent files that would be present at all in `dirstate-v1`.

This is typically less than the total number of nodes.

This counter is used to implement `len(dirstatemap)`.

* Offset 12:

Number of nodes in the entire tree that have a copy source,

as a 32-bit big-endian integer.

At the next commit, these files are recorded

as having been copied or moved/renamed from that source.

(A move is recorded as a copy and separate removal of the source.)

This counter is used to implement `len(dirstatemap.copymap)`.

* Offset 16:

An estimation of how many bytes of the data file

(within its used size) are unused, as a 32-bit big-endian integer.

When appending to an existing data file,

some existing nodes or paths can be unreachable from the new root

but they still take up space.

This counter is used to decide when to write a new data file from scratch

instead of appending to an existing one,

in order to get rid of that unreachable data

and avoid unbounded file size growth.

* Offset 20:

These four bytes are currently ignored

and reset to zero when updating a docket file.

This is an attempt at forward compatibility:

future Mercurial versions could use this as a bit field

to indicate that a dirstate has additional data or constraints.

Finding a dirstate file with the relevant bit unset indicates that

it was written by a then-older version

which is not aware of that future change.

* Offset 24:

Either 20 zero bytes, or a SHA-1 hash as 20 binary bytes.

When present, the hash is of ignore patterns

that were used for some previous run of the `status` algorithm.

* (Offset 44: end of tree metadata)

Optional hash of ignore patterns

--------------------------------

The implementation of `status` at `rust/hg-core/src/dirstate_tree/status.rs`

has been optimized such that its run time is dominated by calls

to `stat` for reading the filesystem metadata of a file or directory,

and to `readdir` for listing the contents of a directory.

In some cases the algorithm can skip calls to `readdir`

(saving significant time)

because the dirstate already contains enough of the relevant information

to build the correct `status` results.

The default configuration of `hg status` is to list unknown files

but not ignored files.

In this case, it matters for the `readdir`-skipping optimization

if a given file used to be ignored but became unknown

because `.hgignore` changed.

To detect the possibility of such a change,

the tree metadata contains an optional hash of all ignore patterns.

We define:

* "Root" ignore files as:

- `.hgignore` at the root of the repository if it exists

- And all files from `ui.ignore.*` config.

This set of files is sorted by the string representation of their path.

* The "expanded contents" of an ignore files is the byte string made

by the concatenation of its contents followed by the "expanded contents"

of other files included with `include:` or `subinclude:` directives,

in inclusion order. This definition is recursive, as included files can

themselves include more files.

This hash is defined as the SHA-1 of the concatenation (in sorted

order) of the "expanded contents" of each "root" ignore file.

(Note that computing this does not require actually concatenating byte ranges into

contiguous memory.

Instead a SHA-1 hasher object can be created and fed separate byte ranges one by

one.)

The data file format

--------------------

This is implemented in `rust/hg-core/src/dirstate_tree/on_disk.rs`

and `mercurial/dirstateutils/v2.py`.

The data file contains two types of data: paths and nodes.

Paths and nodes can be organized in any order in the file, except that sibling

nodes must be next to each other and sorted by their path. Contiguity lets

the parent refer to them all by their count with a single pseudo-pointer,

instead of storing one pseudo-pointer per child node. Sorting allows using

binary seach to find a child node with a given name in `O(log(n))` byte ranges

comparisons.

The current implemention writes paths and child node before a given node

for ease of figuring out the value of pseudo-pointers by the time the are to be

written, but this is not an obligation and readers must not rely on it.

A path is stored as a byte string anywhere in the file, without delimiter.

It is refered to by one or more node by a pseudo-pointer to its start, and its

length in bytes. Since there is no delimiter,

when a path is a substring of another the same bytes could be reused,

although the implementation does not exploit this as of this writing.

A node is stored on 43 bytes with components at fixed offsets. Paths and

child nodes relevant to a node are stored externally and referenced though

pseudo-pointers.

All integers are stored in big-endian. All pseudo-pointers are 32-bit integers

counting bytes from the start of the data file. Path lengths and positions

are 16-bit integers, also counted in bytes.

Node components are:

* Offset 0:

Pseudo-pointer to the full path of this node,

from the working directory root.

* Offset 4:

Length of the full path.

* Offset 6:

Position of the last `/` path separator within the full path,

in bytes from the start of the full path,

or zero if there isn’t one.

The part of the full path after this position is the "base name".

Since sibling nodes have the same parent, only their base name vary

and needs to be considered when doing binary search to find a given path.

* Offset 8:

Pseudo-pointer to the "copy source" path for this node,

or zero if there is no copy source.

* Offset 12:

Length of the copy source path, or zero if there isn’t one.

* Offset 14:

Pseudo-pointer to the start of child nodes.

* Offset 18:

Number of child nodes, as a 32-bit integer.

They occupy 43 times this number of bytes

(not counting space for paths, and further descendants).

* Offset 22:

Number as a 32-bit integer of descendant nodes in this subtree,

not including this node itself,

that "have a dirstate entry".

Those nodes represent files that would be present at all in `dirstate-v1`.

This is typically less than the total number of descendants.

This counter is used to implement `has_dir`.

* Offset 26:

Number as a 32-bit integer of descendant nodes in this subtree,

not including this node itself,

that represent files tracked in the working directory.

(For example, `hg rm` makes a file untracked.)

This counter is used to implement `has_tracked_dir`.

* Offset 30 and more:

**TODO:** docs not written yet

as this part of the format might be changing soon.

	Site-wide shortcuts
/	Use quick search box
g h	Goto home page
g g	Goto my private gists page
g G	Goto my public gists page
g 0-9	Goto bookmarked items from 0-9
n r	New repository page
n g	New gist page

	Repositories
g s	Goto summary page
g c	Goto changelog page
g f	Goto files page
g F	Goto files page with file search activated
g p	Goto pull requests page
g o	Goto repository settings
g O	Goto repository access permissions settings
t s	Toggle sidebar on some pages

Simon Sapin dirstate-v2: Add internal documentation...	r48978	The dirstate is what Mercurial uses internally to track
		the state of files in the working directory,
		such as set by commands like `hg add` and `hg rm`.
		It also contains some cached data that help make `hg status` faster.
		The name refers both to `.hg/dirstate` on the filesystem
		and the corresponding data structure in memory while a Mercurial process
		is running.

		The original file format, retroactively dubbed `dirstate-v1`,
		is described at https://www.mercurial-scm.org/wiki/DirState.
		It is made of a flat sequence of unordered variable-size entries,
		so accessing any information in it requires parsing all of it.
		Similarly, saving changes requires rewriting the entire file.

		The newer `dirsate-v2` file format is designed to fix these limitations
		and make `hg status` faster.

		User guide
		==========

		Compatibility
		-------------

		The file format is experimental and may still change.
		Different versions of Mercurial may not be compatible with each other
		when working on a local repository that uses this format.
		When using an incompatible version with the experimental format,
		anything can happen including data corruption.

		Since the dirstate is entirely local and not relevant to the wire protocol,
		`dirstate-v2` does not affect compatibility with remote Mercurial versions.

		When `share-safe` is enabled, different repositories sharing the same store
		can use different dirstate formats.

		Enabling `dirsate-v2` for new local repositories
		------------------------------------------------

		When creating a new local repository such as with `hg init` or `hg clone`,
		the `exp-dirstate-v2` boolean in the `format` configuration section
		controls whether to use this file format.
		This is disabled by default as of this writing.
		To enable it for a single repository, run for example::

		$ hg init my-project --config format.exp-dirstate-v2=1

		Checking the format of an existing local repsitory
		--------------------------------------------------

		The `debugformat` commands prints information about
		which of multiple optional formats are used in the current repository,
		including `dirstate-v2`::

		$ hg debugformat
		format-variant repo
		fncache: yes
		dirstate-v2: yes
		[…]

		Upgrading or downgrading an existing local repository
		-----------------------------------------------------

		The `debugupgrade` command does various upgrades or downgrades
		on a local repository
		based on the current Mercurial version and on configuration.
		The same `format.exp-dirstate-v2` configuration is used again.

		Example to upgrade::

		$ hg debugupgrade --config format.exp-dirstate-v2=1

		Example to downgrade to `dirstate-v1`::

		$ hg debugupgrade --config format.exp-dirstate-v2=0

		Both of this commands do nothing but print a list of proposed changes,
		which may include changes unrelated to the dirstate.
		Those other changes are controlled by their own configuration keys.
		Add `--run` to a command to actually apply the proposed changes.

		Backups of `.hg/requires` and `.hg/dirstate` are created
		in a `.hg/upgradebackup.*` directory.
		If something goes wrong, restoring those files should undo the change.

		Note that upgrading affects compatibility with older versions of Mercurial
		as noted above.
		This can be relevant when a repository’s files are on a USB drive
		or some other removable media, or shared over the network, etc.

		Internal filesystem representation
		==================================

		Requirements file
		-----------------

		The `.hg/requires` file indicates which of various optional file formats
		are used by a given repository.
		Mercurial aborts when seeing a requirement it does not know about,
		which avoids older version accidentally messing up a respository
		that uses a format that was introduced later.
		For versions that do support a format, the presence or absence of
		the corresponding requirement indicates whether to use that format.

		When the file contains a `exp-dirstate-v2` line,
		the `dirstate-v2` format is used.
		With no such line `dirstate-v1` is used.

		High level description
		----------------------

		Whereas `dirstate-v1` uses a single `.hg/disrtate` file,
		in `dirstate-v2` that file is a "docket" file
		that only contains some metadata
		and points to separate data file named `.hg/dirstate.{ID}`,
		where `{ID}` is a random identifier.

		This separation allows making data files append-only
		and therefore safer to memory-map.
		Creating a new data file (occasionally to clean up unused data)
		can be done with a different ID
		without disrupting another Mercurial process
		that could still be using the previous data file.

		Both files have a format designed to reduce the need for parsing,
		by using fixed-size binary components as much as possible.
		For data that is not fixed-size,
		references to other parts of a file can be made by storing "pseudo-pointers":
		integers counted in bytes from the start of a file.
		For read-only access no data structure is needed,
		only a bytes buffer (possibly memory-mapped directly from the filesystem)
		with specific parts read on demand.

		The data file contains "nodes" organized in a tree.
		Each node represents a file or directory inside the working directory
		or its parent changeset.
		This tree has the same structure as the filesystem,
		so a node representing a directory has child nodes representing
		the files and subdirectories contained directly in that directory.

		The docket file format
		----------------------

		This is implemented in `rust/hg-core/src/dirstate_tree/on_disk.rs`
		and `mercurial/dirstateutils/docket.py`.

		Components of the docket file are found at fixed offsets,
		counted in bytes from the start of the file:

		* Offset 0:
		The 12-bytes marker string "dirstate-v2\n" ending with a newline character.
		This makes it easier to tell a dirstate-v2 file from a dirstate-v1 file,
		although it is not strictly necessary
		since `.hg/requires` determines which format to use.

		* Offset 12:
		The changeset node ID on the first parent of the working directory,
		as up to 32 binary bytes.
		If a node ID is shorter (20 bytes for SHA-1),
		it is start-aligned and the rest of the bytes are set to zero.

		* Offset 44:
		The changeset node ID on the second parent of the working directory,
		or all zeros if there isn’t one.
		Also 32 binary bytes.

		* Offset 76:
		Tree metadata on 44 bytes, described below.
		Its separation in this documentation from the rest of the docket
		reflects a detail of the current implementation.
		Since tree metadata is also made of fields at fixed offsets, those could
		be inlined here by adding 76 bytes to each offset.

		* Offset 120:
		The used size of the data file, as a 32-bit big-endian integer.
		The actual size of the data file may be larger
		(if another Mercurial processis in appending to it
		but has not updated the docket yet).
		That extra data must be ignored.

		* Offset 124:
		The length of the data file identifier, as a 8-bit integer.

		* Offset 125:
		The data file identifier.

		* Any additional data is current ignored, and dropped when updating the file.

		Tree metadata in the docket file
		--------------------------------

		Tree metadata is similarly made of components at fixed offsets.
		These offsets are counted in bytes from the start of tree metadata,
		which is 76 bytes after the start of the docket file.

		This metadata can be thought of as the singular root of the tree
		formed by nodes in the data file.

		* Offset 0:
		Pseudo-pointer to the start of root nodes,
		counted in bytes from the start of the data file,
		as a 32-bit big-endian integer.
		These nodes describe files and directories found directly
		at the root of the working directory.

		* Offset 4:
		Number of root nodes, as a 32-bit big-endian integer.

		* Offset 8:
		Total number of nodes in the entire tree that "have a dirstate entry",
		as a 32-bit big-endian integer.
		Those nodes represent files that would be present at all in `dirstate-v1`.
		This is typically less than the total number of nodes.
		This counter is used to implement `len(dirstatemap)`.

		* Offset 12:
		Number of nodes in the entire tree that have a copy source,
		as a 32-bit big-endian integer.
		At the next commit, these files are recorded
		as having been copied or moved/renamed from that source.
		(A move is recorded as a copy and separate removal of the source.)
		This counter is used to implement `len(dirstatemap.copymap)`.

		* Offset 16:
		An estimation of how many bytes of the data file
		(within its used size) are unused, as a 32-bit big-endian integer.
		When appending to an existing data file,
		some existing nodes or paths can be unreachable from the new root
		but they still take up space.
		This counter is used to decide when to write a new data file from scratch
		instead of appending to an existing one,
		in order to get rid of that unreachable data
		and avoid unbounded file size growth.

		* Offset 20:
		These four bytes are currently ignored
		and reset to zero when updating a docket file.
		This is an attempt at forward compatibility:
		future Mercurial versions could use this as a bit field
		to indicate that a dirstate has additional data or constraints.
		Finding a dirstate file with the relevant bit unset indicates that
		it was written by a then-older version
		which is not aware of that future change.

		* Offset 24:
		Either 20 zero bytes, or a SHA-1 hash as 20 binary bytes.
		When present, the hash is of ignore patterns
		that were used for some previous run of the `status` algorithm.

		* (Offset 44: end of tree metadata)

		Optional hash of ignore patterns
		--------------------------------

		The implementation of `status` at `rust/hg-core/src/dirstate_tree/status.rs`
		has been optimized such that its run time is dominated by calls
		to `stat` for reading the filesystem metadata of a file or directory,
		and to `readdir` for listing the contents of a directory.
		In some cases the algorithm can skip calls to `readdir`
		(saving significant time)
		because the dirstate already contains enough of the relevant information
		to build the correct `status` results.

		The default configuration of `hg status` is to list unknown files
		but not ignored files.
		In this case, it matters for the `readdir`-skipping optimization
		if a given file used to be ignored but became unknown
		because `.hgignore` changed.
		To detect the possibility of such a change,
		the tree metadata contains an optional hash of all ignore patterns.

		We define:

		* "Root" ignore files as:

		- `.hgignore` at the root of the repository if it exists
		- And all files from `ui.ignore.*` config.

		This set of files is sorted by the string representation of their path.

		* The "expanded contents" of an ignore files is the byte string made
		by the concatenation of its contents followed by the "expanded contents"
		of other files included with `include:` or `subinclude:` directives,
		in inclusion order. This definition is recursive, as included files can
		themselves include more files.

		This hash is defined as the SHA-1 of the concatenation (in sorted
		order) of the "expanded contents" of each "root" ignore file.
		(Note that computing this does not require actually concatenating byte ranges into
		contiguous memory.
		Instead a SHA-1 hasher object can be created and fed separate byte ranges one by
		one.)

		The data file format
		--------------------

		This is implemented in `rust/hg-core/src/dirstate_tree/on_disk.rs`
		and `mercurial/dirstateutils/v2.py`.

		The data file contains two types of data: paths and nodes.

		Paths and nodes can be organized in any order in the file, except that sibling
		nodes must be next to each other and sorted by their path. Contiguity lets
		the parent refer to them all by their count with a single pseudo-pointer,
		instead of storing one pseudo-pointer per child node. Sorting allows using
		binary seach to find a child node with a given name in `O(log(n))` byte ranges
		comparisons.

		The current implemention writes paths and child node before a given node
		for ease of figuring out the value of pseudo-pointers by the time the are to be
		written, but this is not an obligation and readers must not rely on it.

		A path is stored as a byte string anywhere in the file, without delimiter.
		It is refered to by one or more node by a pseudo-pointer to its start, and its
		length in bytes. Since there is no delimiter,
		when a path is a substring of another the same bytes could be reused,
		although the implementation does not exploit this as of this writing.

		A node is stored on 43 bytes with components at fixed offsets. Paths and
		child nodes relevant to a node are stored externally and referenced though
		pseudo-pointers.

		All integers are stored in big-endian. All pseudo-pointers are 32-bit integers
		counting bytes from the start of the data file. Path lengths and positions
		are 16-bit integers, also counted in bytes.

		Node components are:

		* Offset 0:
		Pseudo-pointer to the full path of this node,
		from the working directory root.

		* Offset 4:
		Length of the full path.

		* Offset 6:
		Position of the last `/` path separator within the full path,
		in bytes from the start of the full path,
		or zero if there isn’t one.
		The part of the full path after this position is the "base name".
		Since sibling nodes have the same parent, only their base name vary
		and needs to be considered when doing binary search to find a given path.

		* Offset 8:
		Pseudo-pointer to the "copy source" path for this node,
		or zero if there is no copy source.

		* Offset 12:
		Length of the copy source path, or zero if there isn’t one.

		* Offset 14:
		Pseudo-pointer to the start of child nodes.

		* Offset 18:
		Number of child nodes, as a 32-bit integer.
		They occupy 43 times this number of bytes
		(not counting space for paths, and further descendants).

		* Offset 22:
		Number as a 32-bit integer of descendant nodes in this subtree,
		not including this node itself,
		that "have a dirstate entry".
		Those nodes represent files that would be present at all in `dirstate-v1`.
		This is typically less than the total number of descendants.
		This counter is used to implement `has_dir`.

		* Offset 26:
		Number as a 32-bit integer of descendant nodes in this subtree,
		not including this node itself,
		that represent files tracked in the working directory.
		(For example, `hg rm` makes a file untracked.)
		This counter is used to implement `has_tracked_dir`.

		* Offset 30 and more:
		TODO: docs not written yet
		as this part of the format might be changing soon.