linelog.txt
302 lines
| 11.1 KiB
| text/plain
|
TextLexer
Matt Harbison
|
r44031 | linelog is a storage format inspired by the "Interleaved deltas" idea. See | ||
https://en.wikipedia.org/wiki/Interleaved_deltas for its introduction. | ||||
0. SCCS Weave | ||||
To understand what linelog is, first we have a quick look at a simplified | ||||
(with header removed) SCCS weave format, which is an implementation of the | ||||
"Interleaved deltas" idea. | ||||
0.1 Basic SCCS Weave File Format | ||||
A SCCS weave file consists of plain text lines. Each line is either a | ||||
special instruction starting with "^A" or part of the content of the real | ||||
file the weave tracks. There are 3 important operations, where REV denotes | ||||
the revision number: | ||||
^AI REV, marking the beginning of an insertion block introduced by REV | ||||
^AD REV, marking the beginning of a deletion block introduced by REV | ||||
^AE REV, marking the end of the block started by "^AI REV" or "^AD REV" | ||||
Note on revision numbers: For any two different revision numbers, one must | ||||
be an ancestor of the other to make them comparable. This enforces linear | ||||
history. Besides, the comparison functions (">=", "<") should be efficient. | ||||
This means, if revisions are strings like git or hg, an external map is | ||||
required to convert them into integers. | ||||
For example, to represent the following changes: | ||||
REV 1 | REV 2 | REV 3 | ||||
------+-------+------- | ||||
a | a | a | ||||
b | b | 2 | ||||
c | 1 | c | ||||
| 2 | | ||||
| c | | ||||
A possible weave file looks like: | ||||
^AI 1 | ||||
a | ||||
^AD 3 | ||||
b | ||||
^AI 2 | ||||
1 | ||||
^AE 3 | ||||
2 | ||||
^AE 2 | ||||
c | ||||
^AE 1 | ||||
An "^AE" does not always match its nearest operation ("^AI" or "^AD"). In | ||||
the above example, "^AE 3" does not match the nearest "^AI 2" but "^AD 3". | ||||
Therefore we need some extra information for "^AE". The SCCS weave uses a | ||||
revision number. It could also be a boolean value about whether it is an | ||||
insertion or a deletion (see section 0.4). | ||||
0.2 Checkout | ||||
The "checkout" operation is to retrieve file content at a given revision, | ||||
say X. It's doable by going through the file line by line and: | ||||
- If meet ^AI rev, and rev > X, find the corresponding ^AE and jump there | ||||
- If meet ^AD rev, and rev <= X, find the corresponding ^AE and jump there | ||||
- Ignore ^AE | ||||
- For normal lines, just output them | ||||
0.3 Annotate | ||||
The "annotate" operation is to show extra metadata like the revision number | ||||
and the original line number a line comes from. | ||||
It's basically just a "Checkout". For the extra metadata, they can be stored | ||||
side by side with the line contents. Alternatively, we can infer the | ||||
revision number from "^AI"s. | ||||
Some SCM tools have to calculate diffs on the fly and thus are much slower | ||||
on this operation. | ||||
0.4 Tree Structure | ||||
The word "interleaved" is used because "^AI" .. "^AE" and "^AD" .. "^AE" | ||||
blocks can be interleaved. | ||||
If we consider insertions and deletions separately, they can form tree | ||||
structures, respectively. | ||||
+--- ^AI 1 +--- ^AD 3 | ||||
| +- ^AI 2 | +- ^AD 2 | ||||
| | | | | ||||
| +- ^AE 2 | +- ^AE 2 | ||||
| | | ||||
+--- ^AE 1 +--- ^AE 3 | ||||
More specifically, it's possible to build a tree for all insertions, where | ||||
the tree node has the structure "(rev, startline, endline)". "startline" is | ||||
the line number of "^AI" and "endline" is the line number of the matched | ||||
"^AE". The tree will have these properties: | ||||
1. child.rev > parent.rev | ||||
2. child.startline > parent.startline | ||||
3. child.endline < parent.endline | ||||
A similar tree for all deletions can also be built with the first property | ||||
changed to: | ||||
1. child.rev < parent.rev | ||||
0.5 Malformed Cases | ||||
The following cases are considered malformed in our implementation: | ||||
1. Interleaved insertions, or interleaved deletions. | ||||
It can be rewritten to a non-interleaved tree structure. | ||||
Take insertions as example, deletions are similar: | ||||
^AI x ^AI x | ||||
a a | ||||
^AI x + 1 -> ^AI x + 1 | ||||
b b | ||||
^AE x ^AE x + 1 | ||||
c ^AE x | ||||
^AE x + 1 ^AI x + 1 | ||||
c | ||||
^AE x + 1 | ||||
2. Nested insertions, where the inner one has a smaller revision number. | ||||
Or nested deletions, where the inner one has a larger revision number. | ||||
It can be rewritten to a non-nested form. | ||||
Take insertions as example, deletions are similar: | ||||
^AI x + 1 ^AI x + 1 | ||||
a a | ||||
^AI x -> ^AE x + 1 | ||||
b ^AI x | ||||
^AE x b | ||||
c ^AE x | ||||
^AE x + 1 ^AI x + 1 | ||||
c | ||||
^AE x + 1 | ||||
3. Insertion inside deletion with a smaller revision number. | ||||
Rewrite by duplicating the content inserted: | ||||
^AD x ^AD x | ||||
a a | ||||
^AI x + 1 -> b | ||||
b c | ||||
^AE x + 1 ^AE x | ||||
c ^AI x + 1 | ||||
^AE x b | ||||
^AE x + 1 | ||||
Note: If "annotate" purely depends on "^AI" information, then the | ||||
duplication content will lose track of where "b" is originally from. | ||||
Some of them may be valid in other implementations for special purposes. For | ||||
example, to "revive" a previously deleted block in a newer revision. | ||||
0.6 Cases Can Be Optimized | ||||
It's always better to get things nested. For example, the left is more | ||||
efficient than the right while they represent the same content: | ||||
+--- ^AD 2 +- ^AD 1 | ||||
| +- ^AD 1 | LINE A | ||||
| | LINE A +- ^AE 1 | ||||
| +- ^AE 1 +- ^AD 2 | ||||
| LINE B | LINE B | ||||
+--- ^AE 2 +- ^AE 2 | ||||
Our implementation sometimes generates the less efficient data. To always | ||||
get the optimal form, it requires extra code complexity that seems unworthy. | ||||
0.7 Inefficiency | ||||
The file format can be slow because: | ||||
- Inserting a new line at position P requires rewriting all data after P. | ||||
- Finding "^AE" requires walking through the content (O(N), where N is the | ||||
number of lines between "^AI/D" and "^AE"). | ||||
1. Linelog | ||||
The linelog is a binary format that dedicates to speed up mercurial (or | ||||
git)'s "annotate" operation. It's designed to avoid issues mentioned in | ||||
section 0.7. | ||||
1.1 Content Stored | ||||
Linelog is not another storage for file contents. It only stores line | ||||
numbers and corresponding revision numbers, instead of actual line content. | ||||
This is okay for the "annotate" operation because usually the external | ||||
source is fast to checkout the content of a file at a specific revision. | ||||
A typical SCCS weave is also fast on the "grep" operation, which needs | ||||
random accesses to line contents from different revisions of a file. This | ||||
can be slow with linelog's no-line-content design. However we could use | ||||
an extra map ((rev, line num) -> line content) to speed it up. | ||||
Note the revision numbers in linelog should be independent from mercurial | ||||
integer revision numbers. There should be some mapping between linelog rev | ||||
and hg hash stored side by side, to make the files reusable after being | ||||
copied to another machine. | ||||
1.2 Basic Format | ||||
A linelog file consists of "instruction"s. An "instruction" can be either: | ||||
- JGE REV ADDR # jump to ADDR if rev >= REV | ||||
- JL REV ADDR # jump to ADDR if rev < REV | ||||
- LINE REV LINENUM # append the (LINENUM+1)-th line in revision REV | ||||
For example, here is the example linelog representing the same file with | ||||
3 revisions mentioned in section 0.1: | ||||
SCCS | Linelog | ||||
Weave | Addr : Instruction | ||||
------+------+------------- | ||||
^AI 1 | 0 : JL 1 8 | ||||
a | 1 : LINE 1 0 | ||||
^AD 3 | 2 : JGE 3 6 | ||||
b | 3 : LINE 1 1 | ||||
^AI 2 | 4 : JL 2 7 | ||||
1 | 5 : LINE 2 2 | ||||
^AE 3 | | ||||
2 | 6 : LINE 2 3 | ||||
^AE 2 | | ||||
c | 7 : LINE 1 2 | ||||
^AE 1 | | ||||
| 8 : END | ||||
This way, "find ^AE" is O(1) because we just jump there. And we can insert | ||||
new lines without rewriting most part of the file by appending new lines and | ||||
changing a single instruction to jump to them. | ||||
The current implementation uses 64 bits for an instruction: The opcode (JGE, | ||||
JL or LINE) takes 2 bits, REV takes 30 bits and ADDR or LINENUM takes 32 | ||||
bits. It also stores the max revision number and buffer size at the first | ||||
64 bits for quick access to these values. | ||||
1.3 Comparing with Mercurial's revlog format | ||||
Apparently, linelog is very different from revlog: linelog stores rev and | ||||
line numbers, while revlog has line contents and other metadata (like | ||||
parents, flags). However, the revlog format could also be used to store rev | ||||
and line numbers. For example, to speed up the annotate operation, we could | ||||
also pre-calculate annotate results and just store them using the revlog | ||||
format. | ||||
Therefore, linelog is actually somehow similar to revlog, with the important | ||||
trade-off that it only supports linear history (mentioned in section 0.1). | ||||
Essentially, the differences are: | ||||
a) Linelog is full of deltas, while revlog could contain full file | ||||
contents sometimes. So linelog is smaller. Revlog could trade | ||||
reconstruction speed for file size - best case, revlog is as small as | ||||
linelog. | ||||
b) The interleaved delta structure allows skipping large portion of | ||||
uninteresting deltas so linelog's content reconstruction is faster than | ||||
the delta-only version of revlog (however it's possible to construct | ||||
a case where interleaved deltas degrade to plain deltas, so linelog | ||||
worst case would be delta-only revlog). Revlog could trade file size | ||||
for reconstruction speed. | ||||
c) Linelog implicitly maintains the order of all lines it stores. So it | ||||
could dump all the lines from all revisions, with a reasonable order. | ||||
While revlog could also dump all line additions, it requires extra | ||||
computation to figure out the order putting those lines - that's some | ||||
kind of "merge". | ||||
"c" makes "hg absorb" easier to implement and makes it possible to do | ||||
"annotate --deleted". | ||||
1.4 Malformed Cases Handling | ||||
The following "case 1", "case 2", and "case 3" refer to cases mentioned | ||||
in section 0.5. | ||||
Using the exposed API (replacelines), case 1 is impossible to generate, | ||||
although it's possible to generate it by constructing rawdata and load that | ||||
via linelog.fromdata. | ||||
Doing annotate(maxrev) before replacelines (aka. a1, a2 passed to | ||||
replacelines are related to the latest revision) eliminates the possibility | ||||
of case 3. That makes sense since usually you'd like to make edits on top of | ||||
the latest revision. Practically, both absorb and fastannotate do this. | ||||
Doing annotate(maxrev), plus replacelines(rev, ...) where rev >= maxrev | ||||
eliminates the possibility of case 2. That makes sense since usually the | ||||
edits belong to "new revisions", not "old revisions". Practically, | ||||
fastannotate does this. Absorb calls replacelines with rev < maxrev to edit | ||||
past revisions. So it needs some extra care to not generate case 2. | ||||
If case 1 occurs, that probably means linelog file corruption (assuming | ||||
linelog is edited via public APIs) the checkout or annotate result could | ||||
be less meaningful or even error out, but linelog wouldn't enter an infinite | ||||
loop. | ||||
If either case 2 or 3 occurs, linelog works as if the inner "^AI/D" and "^AE" | ||||
operations on the left side are silently ignored. | ||||