|
|
Some notes about Mercurial's design
|
|
|
|
|
|
Revlogs:
|
|
|
|
|
|
The fundamental storage type in Mercurial is a "revlog". A revlog is
|
|
|
the set of all revisions to a file. Each revision is either stored
|
|
|
compressed in its entirety or as a compressed binary delta against the
|
|
|
previous version. The decision of when to store a full version is made
|
|
|
based on how much data would be needed to reconstruct the file. This
|
|
|
lets us ensure that we never need to read huge amounts of data to
|
|
|
reconstruct a file, regardless of how many revisions of it we store.
|
|
|
|
|
|
In fact, we should always be able to do it with a single read,
|
|
|
provided we know when and where to read. This is where the index comes
|
|
|
in. Each revlog has an index containing a special hash (nodeid) of the
|
|
|
text, hashes for its parents, and where and how much of the revlog
|
|
|
data we need to read to reconstruct it. Thus, with one read of the
|
|
|
index and one read of the data, we can reconstruct any version in time
|
|
|
proportional to the file size.
|
|
|
|
|
|
Similarly, revlogs and their indices are append-only. This means that
|
|
|
adding a new version is also O(1) seeks.
|
|
|
|
|
|
Generally revlogs are used to represent revisions of files, but they
|
|
|
also are used to represent manifests and changesets.
|
|
|
|
|
|
Manifests:
|
|
|
|
|
|
A manifest is simply a list of all files in a given revision of a
|
|
|
project along with the nodeids of the corresponding file revisions. So
|
|
|
grabbing a given version of the project means simply looking up its
|
|
|
manifest and reconstruction all the file revisions pointed to by it.
|
|
|
|
|
|
Changesets:
|
|
|
|
|
|
A changeset is a list of all files changed in a check-in along with a
|
|
|
change description and some metadata like user and date. It also
|
|
|
contains a nodeid to the relevent revision of the manifest. Changesets
|
|
|
and manifests are one-to-one, but contain different data for
|
|
|
convenience.
|
|
|
|
|
|
Nodeids:
|
|
|
|
|
|
Nodeids are unique ids that are used to represent the contents of a
|
|
|
file AND its position in the project history. That is, if you change a
|
|
|
file and then change it back, the result will have a different nodeid
|
|
|
because it has different history. This is accomplished by including
|
|
|
the parents of a given revision's nodeids with the revision's text
|
|
|
when calculating the hash.
|
|
|
|
|
|
Graph merging:
|
|
|
|
|
|
Nodeids are implemented as they are to simplify merging. Merging a
|
|
|
pair of directed acyclic graphs (aka "the family tree" of the file
|
|
|
history) requires some method of determining if nodes in different
|
|
|
graphs correspond. Simply comparing the contents of the node (by
|
|
|
comparing text of given revisions or their hashes) can get confused by
|
|
|
identical revisions in the tree.
|
|
|
|
|
|
The nodeid approach makes it trivial - the hash uniquely describes a
|
|
|
revision's contents and its graph position relative to the root, so
|
|
|
merge is simply checking whether each nodeid in graph A is in the hash
|
|
|
table of graph B. If not, we pull them in, adding them sequentially to
|
|
|
the revlog.
|
|
|
|
|
|
Branching and merging:
|
|
|
|
|
|
Everything in Mercurial is potentially a branch and every user
|
|
|
effectively works in their own branch. When you do a checkout,
|
|
|
Mercurial remembers what the parent changeset was and uses it for the
|
|
|
next check in.
|
|
|
|
|
|
To do a merge of branches in Mercurial, you check out the heads of the
|
|
|
two branches into the same working directory which causes a merge to
|
|
|
be performed, and then check in the result once you're happy with it.
|
|
|
The resulting checkin will have two parents.
|
|
|
|
|
|
It decides when a merge is necessary by first determining if there are
|
|
|
any uncommitted changes in the working directory. This effectively
|
|
|
makes the working directory a branch off the checked in version it's
|
|
|
based on. Then it also determines if the working directory is a direct
|
|
|
ancestor or descendent of the second version we're attempting to
|
|
|
checkout. If neither is true, we simply replace the working directory
|
|
|
version with the new version. Otherwise we perform a merge between the
|
|
|
two versions.
|
|
|
|
|
|
Merging files and manifests:
|
|
|
|
|
|
We begin by comparing two versions manifests and deciding which files
|
|
|
need to be added, deleted, and merged.
|
|
|
|
|
|
Then for each file, we perform a graph merge and resolve as above.
|
|
|
It's important to merge files using per-file DAGs rather than just
|
|
|
changeset level DAGs as this diagram illustrates:
|
|
|
|
|
|
M M1 M2
|
|
|
|
|
|
AB
|
|
|
|`-------v M2 clones M
|
|
|
aB AB file A is change in mainline
|
|
|
|`---v AB' file B is changed in M2
|
|
|
| aB / | M1 clones M
|
|
|
| ab/ | M1 changes B
|
|
|
| ab' | M1 merges from M2, changes to B conflict
|
|
|
| | A'B' M2 changes A
|
|
|
`---+--.|
|
|
|
| a'B' M2 merges from mainline, changes to A conflict
|
|
|
`--.|
|
|
|
??? depending on which ancestor we choose, we will have
|
|
|
to redo A hand-merge, B hand-merge, or both
|
|
|
but if we look at the files independently, everything
|
|
|
is fine
|
|
|
|
|
|
The result is a merged version in the working directory, waiting for
|
|
|
check-in.
|
|
|
|
|
|
Rollback:
|
|
|
|
|
|
When performing a commit or a merge, we order things so that the
|
|
|
changeset entry gets added last. We keep a transaction log of the name
|
|
|
of each file touched and its length prior to the transaction. On
|
|
|
abort, we simply truncate each file to its prior length. This is one
|
|
|
of the nice properties of the append-only structure of the revlogs.
|
|
|
We can also reuse this journal for "undo".
|
|
|
|
|
|
Merging between repositories:
|
|
|
|
|
|
One of the key features of Mercurial is the ability to merge between
|
|
|
independent repositories in a decentralized fashion. Each repository
|
|
|
can act as a read-only server or a client. Clients operating by
|
|
|
pulling all branches that it hasn't seen from the server and adding
|
|
|
them into its graph. This is done in two steps: searching for new
|
|
|
"roots" and pulling a "changegroup"
|
|
|
|
|
|
Searching for new "roots" begins by finding all new heads and
|
|
|
searching backwards from those heads to the first unknown nodes in
|
|
|
their respective branches. These nodes are the 'roots' that are used
|
|
|
to calculate the 'changegroup': the set of all changesets starting at
|
|
|
those roots. Mercurial takes pains to make this search efficient in
|
|
|
both bandwidth and round-trips.
|
|
|
|
|
|
Once the roots are found, the changegroup can be transferred as a
|
|
|
single streaming transfer. This is organized as an ordered set of
|
|
|
deltas for changesets, manifests, and files. Large chunks of deltas
|
|
|
can be directly added to the repository without unpacking so it's
|
|
|
fairly fast.
|
|
|
|