upstream/mercurial-mirror Commit - r59:2bff7c0e

1

Some notes about Mercurial's design

1

Some notes about Mercurial's design

2

3

Revlogs:

3

Revlogs:

4

5

The fundamental storage type in Mercurial is a "revlog". A revlog is

5

The fundamental storage type in Mercurial is a "revlog". A revlog is

6

the set of all revisions to a file. Each revision is either stored

6

the set of all revisions to a file. Each revision is either stored

7

compressed in its entirety or as a compressed binary delta against the

7

compressed in its entirety or as a compressed binary delta against the

8

previous version. The decision of when to store a full version is made

8

previous version. The decision of when to store a full version is made

9

based on how much data would be needed to reconstruct the file. This

9

based on how much data would be needed to reconstruct the file. This

10

lets us ensure that we never need to read huge amounts of data to

10

lets us ensure that we never need to read huge amounts of data to

11

reconstruct a file, regardless of how many revisions of it we store.

11

reconstruct a file, regardless of how many revisions of it we store.

12

13

In fact, we should always be able to do it with a single read,

13

In fact, we should always be able to do it with a single read,

14

provided we know when and where to read. This is where the index comes

14

provided we know when and where to read. This is where the index comes

15

in. Each revlog has an index containing a special hash (nodeid) of the

15

in. Each revlog has an index containing a special hash (nodeid) of the

16

text, hashes for its parents, and where and how much of the revlog

16

text, hashes for its parents, and where and how much of the revlog

17

data we need to read to reconstruct it. Thus, with one read of the

17

data we need to read to reconstruct it. Thus, with one read of the

18

index and one read of the data, we can reconstruct any version in time

18

index and one read of the data, we can reconstruct any version in time

19

proportional to the file size.

19

proportional to the file size.

20

21

Similarly, revlogs and their indices are append-only. This means that

21

Similarly, revlogs and their indices are append-only. This means that

22

adding a new version is also O(1) seeks.

22

adding a new version is also O(1) seeks.

23

24

Generally revlogs are used to represent revisions of files, but they

24

Generally revlogs are used to represent revisions of files, but they

25

also are used to represent manifests and changesets.

25

also are used to represent manifests and changesets.

26

27

Manifests:

27

Manifests:

28

29

A manifest is simply a list of all files in a given revision of a

29

A manifest is simply a list of all files in a given revision of a

30

project along with the nodeids of the corresponding file revisions. So

30

project along with the nodeids of the corresponding file revisions. So

31

grabbing a given version of the project means simply looking up its

31

grabbing a given version of the project means simply looking up its

32

manifest and reconstruction all the file revisions pointed to by it.

32

manifest and reconstruction all the file revisions pointed to by it.

33

34

Changesets:

34

Changesets:

35

36

A changeset is a list of all files changed in a check-in along with a

36

A changeset is a list of all files changed in a check-in along with a

37

change description and some metadata like user and date. It also

37

change description and some metadata like user and date. It also

38

contains a nodeid to the relevent revision of the manifest. Changesets

38

contains a nodeid to the relevent revision of the manifest. Changesets

39

and manifests are one-to-one, but contain different data for

39

and manifests are one-to-one, but contain different data for

40

convenience.

40

convenience.

41

42

Nodeids:

42

Nodeids:

43

44

Nodeids are unique ids that are used to represent the contents of a

44

Nodeids are unique ids that are used to represent the contents of a

45

file AND its position in the project history. That is, if you change a

45

file AND its position in the project history. That is, if you change a

46

file and then change it back, the result will have a different nodeid

46

file and then change it back, the result will have a different nodeid

47

because it has different history. This is accomplished by including

47

because it has different history. This is accomplished by including

48

the parents of a given revision's nodeids with the revision's text

48

the parents of a given revision's nodeids with the revision's text

49

when calculating the hash.

49

when calculating the hash.

50

51

Graph merging:

51

Graph merging:

52

53

Nodeids are implemented as they are to simplify merging. Merging a

53

Nodeids are implemented as they are to simplify merging. Merging a

54

pair of directed acyclic graphs (aka "the family tree" of the file

54

pair of directed acyclic graphs (aka "the family tree" of the file

55

history) requires some method of determining if nodes in different

55

history) requires some method of determining if nodes in different

56

graphs correspond. Simply comparing the contents of the node (by

56

graphs correspond. Simply comparing the contents of the node (by

57

comparing text of given revisions or their hashes) can get confused by

57

comparing text of given revisions or their hashes) can get confused by

58

identical revisions in the tree.

58

identical revisions in the tree.

59

60

The nodeid approach makes it trivial - the hash uniquely describes a

60

The nodeid approach makes it trivial - the hash uniquely describes a

61

revision's contents and its graph position relative to the root, so

61

revision's contents and its graph position relative to the root, so

62

merge is simply checking whether each nodeid in graph A is in the hash

62

merge is simply checking whether each nodeid in graph A is in the hash

63

table of graph B. If not, we pull them in, adding them sequentially to

63

table of graph B. If not, we pull them in, adding them sequentially to

64

the revlog.

64

the revlog.

65

66

Graph resolving:

66

Graph resolving:

67

68

Mercurial does branching by copying (or COWing) a repository and thus

68

Mercurial does branching by copying (or COWing) a repository and thus

69

keeps everything nice and linear within a repository. However, when a

69

keeps everything nice and linear within a repository. However, when a

70

merge of repositories (a "pull") is done, we may often have two head

70

merge of repositories (a "pull") is done, we may often have two head

71

revisions in a given graph. To keep things simple, Mercurial forces

71

revisions in a given graph. To keep things simple, Mercurial forces

72

the head revisions to be merged.

72

the head revisions to be merged.

73

74

It first finds the closest common ancestor of the two heads. If one is

74

It first finds the closest common ancestor of the two heads. If one is

75

a child of the other, it becomes the new head. Otherwise, we call out

75

a child of the other, it becomes the new head. Otherwise, we call out

76

to a user-specified 3-way merge tool.

76

to a user-specified 3-way merge tool.

77

78

Merging files, manifests, and changesets:

78

Merging files, manifests, and changesets:

79

80

We begin by comparing changeset DAGs, pulling all nodes we don't have

80

We begin by comparing changeset DAGs, pulling all nodes we don't have

81

in our DAG from the other repository. As we do so, we collect a list

81

in our DAG from the other repository. As we do so, we collect a list

82

of changed files to merge.

82

of changed files to merge.

83

84

Then for each file, we perform a graph merge and resolve as above.

84

Then for each file, we perform a graph merge and resolve as above.

85

It's important to merge files using per-file DAGs rather than just

85

It's important to merge files using per-file DAGs rather than just

86

changeset level DAGs as this diagram illustrates:

86

changeset level DAGs as this diagram illustrates:

87

88

M M1 M2

88

M M1 M2

89

90

AB

90

AB

91

|`-------v M2 clones M

91

|`-------v M2 clones M

92

aB AB file A is change in mainline

92

aB AB file A is change in mainline

93

|`---v AB' file B is changed in M2

93

|`---v AB' file B is changed in M2

94

| aB / | M1 clones M

94

| aB / | M1 clones M

95

| ab/ | M1 changes B

95

| ab/ | M1 changes B

96

| ab' | M1 merges from M2, changes to B conflict

96

| ab' | M1 merges from M2, changes to B conflict

97

| | A'B' M2 changes A

97

| | A'B' M2 changes A

98

`---+--.|

98

`---+--.|

99

| a'B' M2 merges from mainline, changes to A conflict

99

| a'B' M2 merges from mainline, changes to A conflict

100

`--.|

100

`--.|

101

??? depending on which ancestor we choose, we will have

101

??? depending on which ancestor we choose, we will have

102

to redo A hand-merge, B hand-merge, or both

102

to redo A hand-merge, B hand-merge, or both

103

but if we look at the files independently, everything

103

but if we look at the files independently, everything

104

is fine

104

is fine

105

106

After we've merged files, we merge the manifest log DAG and resolve

106

After we've merged files, we merge the manifest log DAG and resolve

107

additions and deletions. Then we are ready to resolve the changeset

107

additions and deletions. Then we are ready to resolve the changeset

108

DAG - if our merge required any changes (the new head is not a

108

DAG - if our merge required any changes (the new head is not a

109

decendent of our tip), we must create a new changeset describing all

109

decendent of our tip), we must create a new changeset describing all

110

of the changes needed to merge it into the tip.

110

of the changes needed to merge it into the tip.

111

112

Merge performance:

112

Merge performance:

113

114

The I/O operations for performing a merge are O(changed files), not

114

The I/O operations for performing a merge are O(changed files), not

115

O(total changes) and in many cases, we needn't even unpack the deltas

115

O(total changes) and in many cases, we needn't even unpack the deltas

116

to add them to our repository (though this optimization isn't

116

to add them to our repository (though this optimization isn't

117

necessary).

117

necessary).

118

119

Rollback:

119

Rollback:

120

121

When performing a commit or a merge, we order things so that the

121

When performing a commit or a merge, we order things so that the

122

changeset entry gets added last. We keep a transaction log of the name

122

changeset entry gets added last. We keep a transaction log of the name

123

of each file touched and its length prior to the transaction. On

123

of each file touched and its length prior to the transaction. On

124

abort, we simply truncate each file to its prior length. This is one

124

abort, we simply truncate each file to its prior length. This is one

125

of the nice properties of the append-only structure of the revlogs.

125

of the nice properties of the append-only structure of the revlogs.

126

127

Remote access:

127

Remote access:

128

129

Mercurial currently supports pulling from "serverless" repositories.

129

Mercurial currently supports pulling from "serverless" repositories.

130

Simply making the repo directory accessibly via the web and pointing

130

Simply making the repo directory accessibly via the web and pointing

131

hg at it can accomplish a pull. This is relatively bandwidth efficient

131

hg at it can accomplish a pull. This is relatively bandwidth efficient

132

but no effort has been spent on pipelining, so it won't work

132

but no effort has been spent on pipelining, so it won't work

133

especially well over LAN yet.

133

especially well over LAN yet.

134

135

It's also quite amenable to rsync, if you don't mind keeping an intact

135

It's also quite amenable to rsync, if you don't mind keeping an intact

136

copy of the master around locally.

136

copy of the master around locally.

137

138

Also note the append-only and ordering properties of the commit

138

Also note the append-only and ordering properties of the commit

139

guarantee that readers will always see a repository in a consistent

139

guarantee that readers will always see a repository in a consistent

140

state and no special locking is necessary. As there is generally only

140

state and no special locking is necessary. As there is generally only

141

one writer to an hg repository, there is in fact no exclusion

141

one writer to an hg repository, there is in fact no exclusion

142

implemented yet.

142

implemented yet.

143

144

145

Some comparisons to git:

145

Some comparisons to git:

146

147

Most notably, Mercurial uses delta compression and repositories

147

Most notably, Mercurial uses delta compression and repositories

148

created with it will grow much more slowly over time. This also allows

148

created with it will grow much more slowly over time. This also allows

149

it to be much more bandwidth efficient. I expect repos sizes and sync

149

it to be much more bandwidth efficient. I expect repos sizes and sync

150

speeds to be similar to or better than BK, given the use of binary diffs.

150

speeds to be similar to or better than BK, given the use of binary diffs.

151

152

Mercurial is roughly the same performance as git ~~and is faster in~~

152

Mercurial is roughly the same performance as git in some areas and is

153

others as it keeps around more metadata. One example is ~~listing and~~

153

faster in others as it keeps around more metadata. One example is

154

retrieving past versions of a file, which it can do ~~without reading~~

154

listing and retrieving past versions of a file, which it can do

155

all the changesets. This metadata will also allow it ~~to perform better~~

155

without reading all the changesets. This metadata will also allow it

156

merges as described above.

156

to perform better merges as described above.

157

158

	Site-wide shortcuts
/	Use quick search box
g h	Goto home page
g g	Goto my private gists page
g G	Goto my public gists page
g 0-9	Goto bookmarked items from 0-9
n r	New repository page
n g	New gist page

	Repositories
g s	Goto summary page
g c	Goto changelog page
g f	Goto files page
g F	Goto files page with file search activated
g p	Goto pull requests page
g o	Goto repository settings
g O	Goto repository access permissions settings
t s	Toggle sidebar on some pages

             Some notes about Mercurial's design
             Revlogs:
             The fundamental storage type in Mercurial is a "revlog". A revlog is
             the set of all revisions to a file. Each revision is either stored
             compressed in its entirety or as a compressed binary delta against the
             previous version. The decision of when to store a full version is made
             based on how much data would be needed to reconstruct the file. This
             lets us ensure that we never need to read huge amounts of data to
             reconstruct a file, regardless of how many revisions of it we store.
             In fact, we should always be able to do it with a single read,
             provided we know when and where to read. This is where the index comes
             in. Each revlog has an index containing a special hash (nodeid) of the
             text, hashes for its parents, and where and how much of the revlog
             data we need to read to reconstruct it. Thus, with one read of the
             index and one read of the data, we can reconstruct any version in time
             proportional to the file size.
             Similarly, revlogs and their indices are append-only. This means that
             adding a new version is also O(1) seeks.
             Generally revlogs are used to represent revisions of files, but they
             also are used to represent manifests and changesets.
             Manifests:
             A manifest is simply a list of all files in a given revision of a
             project along with the nodeids of the corresponding file revisions. So
             grabbing a given version of the project means simply looking up its
             manifest and reconstruction all the file revisions pointed to by it.
             Changesets:
             A changeset is a list of all files changed in a check-in along with a
             change description and some metadata like user and date. It also
             contains a nodeid to the relevent revision of the manifest. Changesets
             and manifests are one-to-one, but contain different data for
             convenience.
             Nodeids:
             Nodeids are unique ids that are used to represent the contents of a
             file AND its position in the project history. That is, if you change a
             file and then change it back, the result will have a different nodeid
             because it has different history. This is accomplished by including
             the parents of a given revision's nodeids with the revision's text
             when calculating the hash.
             Graph merging:
             Nodeids are implemented as they are to simplify merging. Merging a
             pair of directed acyclic graphs (aka "the family tree" of the file
             history) requires some method of determining if nodes in different
             graphs correspond. Simply comparing the contents of the node (by
             comparing text of given revisions or their hashes) can get confused by
             identical revisions in the tree.
             The nodeid approach makes it trivial - the hash uniquely describes a
             revision's contents and its graph position relative to the root, so
             merge is simply checking whether each nodeid in graph A is in the hash
             table of graph B. If not, we pull them in, adding them sequentially to
             the revlog.
             Graph resolving:
             Mercurial does branching by copying (or COWing) a repository and thus
             keeps everything nice and linear within a repository. However, when a
             merge of repositories (a "pull") is done, we may often have two head
             revisions in a given graph. To keep things simple, Mercurial forces
             the head revisions to be merged.
             It first finds the closest common ancestor of the two heads. If one is
             a child of the other, it becomes the new head. Otherwise, we call out
             to a user-specified 3-way merge tool.
             Merging files, manifests, and changesets:
             We begin by comparing changeset DAGs, pulling all nodes we don't have
             in our DAG from the other repository. As we do so, we collect a list
             of changed files to merge.
             Then for each file, we perform a graph merge and resolve as above.
             It's important to merge files using per-file DAGs rather than just
             changeset level DAGs as this diagram illustrates:
             M   M1   M2
             AB
              |`-------v     M2 clones M
             aB       AB     file A is change in mainline
              |`---v  AB'    file B is changed in M2
              |   aB / |     M1 clones M
              |   ab/  |     M1 changes B
              |   ab'  |     M1 merges from M2, changes to B conflict
              |    |  A'B'   M2 changes A
               `---+--.|
                   |  a'B'   M2 merges from mainline, changes to A conflict
                   `--.|
                      ???    depending on which ancestor we choose, we will have
             	        to redo A hand-merge, B hand-merge, or both
                             but if we look at the files independently, everything
             		is fine
             After we've merged files, we merge the manifest log DAG and resolve
             additions and deletions. Then we are ready to resolve the changeset
             DAG - if our merge required any changes (the new head is not a
             decendent of our tip), we must create a new changeset describing all
             of the changes needed to merge it into the tip.
             Merge performance:
             The I/O operations for performing a merge are O(changed files), not
             O(total changes) and in many cases, we needn't even unpack the deltas
             to add them to our repository (though this optimization isn't
             necessary).
             Rollback:
             When performing a commit or a merge, we order things so that the
             changeset entry gets added last. We keep a transaction log of the name
             of each file touched and its length prior to the transaction. On
             abort, we simply truncate each file to its prior length. This is one
             of the nice properties of the append-only structure of the revlogs.
             Remote access:
             Mercurial currently supports pulling from "serverless" repositories.
             Simply making the repo directory accessibly via the web and pointing
             hg at it can accomplish a pull. This is relatively bandwidth efficient
             but no effort has been spent on pipelining, so it won't work
             especially well over LAN yet.
             It's also quite amenable to rsync, if you don't mind keeping an intact
             copy of the master around locally.
             Also note the append-only and ordering properties of the commit
             guarantee that readers will always see a repository in a consistent
             state and no special locking is necessary. As there is generally only
             one writer to an hg repository, there is in fact no exclusion
             implemented yet.
             Some comparisons to git:
             Most notably, Mercurial uses delta compression and repositories
             created with it will grow much more slowly over time. This also allows
             it to be much more bandwidth efficient. I expect repos sizes and sync
             speeds to be similar to or better than BK, given the use of binary diffs.
-            Mercurial is roughly the same performance as git and is faster in
+            Mercurial is roughly the same performance as git in some areas and is
-            others as it keeps around more metadata. One example is listing and
+            faster in others as it keeps around more metadata. One example is
-            retrieving past versions of a file, which it can do without reading
+            listing and retrieving past versions of a file, which it can do
-            all the changesets. This metadata will also allow it to perform better
+            without reading all the changesets. This metadata will also allow it
-            merges as described above.
+            to perform better merges as described above.