|
|
linelog is a storage format inspired by the "Interleaved deltas" idea. See
|
|
|
https://en.wikipedia.org/wiki/Interleaved_deltas for its introduction.
|
|
|
|
|
|
0. SCCS Weave
|
|
|
|
|
|
To understand what linelog is, first we have a quick look at a simplified
|
|
|
(with header removed) SCCS weave format, which is an implementation of the
|
|
|
"Interleaved deltas" idea.
|
|
|
|
|
|
0.1 Basic SCCS Weave File Format
|
|
|
|
|
|
A SCCS weave file consists of plain text lines. Each line is either a
|
|
|
special instruction starting with "^A" or part of the content of the real
|
|
|
file the weave tracks. There are 3 important operations, where REV denotes
|
|
|
the revision number:
|
|
|
|
|
|
^AI REV, marking the beginning of an insertion block introduced by REV
|
|
|
^AD REV, marking the beginning of a deletion block introduced by REV
|
|
|
^AE REV, marking the end of the block started by "^AI REV" or "^AD REV"
|
|
|
|
|
|
Note on revision numbers: For any two different revision numbers, one must
|
|
|
be an ancestor of the other to make them comparable. This enforces linear
|
|
|
history. Besides, the comparison functions (">=", "<") should be efficient.
|
|
|
This means, if revisions are strings like git or hg, an external map is
|
|
|
required to convert them into integers.
|
|
|
|
|
|
For example, to represent the following changes:
|
|
|
|
|
|
REV 1 | REV 2 | REV 3
|
|
|
------+-------+-------
|
|
|
a | a | a
|
|
|
b | b | 2
|
|
|
c | 1 | c
|
|
|
| 2 |
|
|
|
| c |
|
|
|
|
|
|
A possible weave file looks like:
|
|
|
|
|
|
^AI 1
|
|
|
a
|
|
|
^AD 3
|
|
|
b
|
|
|
^AI 2
|
|
|
1
|
|
|
^AE 3
|
|
|
2
|
|
|
^AE 2
|
|
|
c
|
|
|
^AE 1
|
|
|
|
|
|
An "^AE" does not always match its nearest operation ("^AI" or "^AD"). In
|
|
|
the above example, "^AE 3" does not match the nearest "^AI 2" but "^AD 3".
|
|
|
Therefore we need some extra information for "^AE". The SCCS weave uses a
|
|
|
revision number. It could also be a boolean value about whether it is an
|
|
|
insertion or a deletion (see section 0.4).
|
|
|
|
|
|
0.2 Checkout
|
|
|
|
|
|
The "checkout" operation is to retrieve file content at a given revision,
|
|
|
say X. It's doable by going through the file line by line and:
|
|
|
|
|
|
- If meet ^AI rev, and rev > X, find the corresponding ^AE and jump there
|
|
|
- If meet ^AD rev, and rev <= X, find the corresponding ^AE and jump there
|
|
|
- Ignore ^AE
|
|
|
- For normal lines, just output them
|
|
|
|
|
|
0.3 Annotate
|
|
|
|
|
|
The "annotate" operation is to show extra metadata like the revision number
|
|
|
and the original line number a line comes from.
|
|
|
|
|
|
It's basically just a "Checkout". For the extra metadata, they can be stored
|
|
|
side by side with the line contents. Alternatively, we can infer the
|
|
|
revision number from "^AI"s.
|
|
|
|
|
|
Some SCM tools have to calculate diffs on the fly and thus are much slower
|
|
|
on this operation.
|
|
|
|
|
|
0.4 Tree Structure
|
|
|
|
|
|
The word "interleaved" is used because "^AI" .. "^AE" and "^AD" .. "^AE"
|
|
|
blocks can be interleaved.
|
|
|
|
|
|
If we consider insertions and deletions separately, they can form tree
|
|
|
structures, respectively.
|
|
|
|
|
|
+--- ^AI 1 +--- ^AD 3
|
|
|
| +- ^AI 2 | +- ^AD 2
|
|
|
| | | |
|
|
|
| +- ^AE 2 | +- ^AE 2
|
|
|
| |
|
|
|
+--- ^AE 1 +--- ^AE 3
|
|
|
|
|
|
More specifically, it's possible to build a tree for all insertions, where
|
|
|
the tree node has the structure "(rev, startline, endline)". "startline" is
|
|
|
the line number of "^AI" and "endline" is the line number of the matched
|
|
|
"^AE". The tree will have these properties:
|
|
|
|
|
|
1. child.rev > parent.rev
|
|
|
2. child.startline > parent.startline
|
|
|
3. child.endline < parent.endline
|
|
|
|
|
|
A similar tree for all deletions can also be built with the first property
|
|
|
changed to:
|
|
|
|
|
|
1. child.rev < parent.rev
|
|
|
|
|
|
0.5 Malformed Cases
|
|
|
|
|
|
The following cases are considered malformed in our implementation:
|
|
|
|
|
|
1. Interleaved insertions, or interleaved deletions.
|
|
|
It can be rewritten to a non-interleaved tree structure.
|
|
|
|
|
|
Take insertions as example, deletions are similar:
|
|
|
|
|
|
^AI x ^AI x
|
|
|
a a
|
|
|
^AI x + 1 -> ^AI x + 1
|
|
|
b b
|
|
|
^AE x ^AE x + 1
|
|
|
c ^AE x
|
|
|
^AE x + 1 ^AI x + 1
|
|
|
c
|
|
|
^AE x + 1
|
|
|
|
|
|
2. Nested insertions, where the inner one has a smaller revision number.
|
|
|
Or nested deletions, where the inner one has a larger revision number.
|
|
|
It can be rewritten to a non-nested form.
|
|
|
|
|
|
Take insertions as example, deletions are similar:
|
|
|
|
|
|
^AI x + 1 ^AI x + 1
|
|
|
a a
|
|
|
^AI x -> ^AE x + 1
|
|
|
b ^AI x
|
|
|
^AE x b
|
|
|
c ^AE x
|
|
|
^AE x + 1 ^AI x + 1
|
|
|
c
|
|
|
^AE x + 1
|
|
|
|
|
|
3. Insertion inside deletion with a smaller revision number.
|
|
|
|
|
|
Rewrite by duplicating the content inserted:
|
|
|
|
|
|
^AD x ^AD x
|
|
|
a a
|
|
|
^AI x + 1 -> b
|
|
|
b c
|
|
|
^AE x + 1 ^AE x
|
|
|
c ^AI x + 1
|
|
|
^AE x b
|
|
|
^AE x + 1
|
|
|
|
|
|
Note: If "annotate" purely depends on "^AI" information, then the
|
|
|
duplication content will lose track of where "b" is originally from.
|
|
|
|
|
|
Some of them may be valid in other implementations for special purposes. For
|
|
|
example, to "revive" a previously deleted block in a newer revision.
|
|
|
|
|
|
0.6 Cases Can Be Optimized
|
|
|
|
|
|
It's always better to get things nested. For example, the left is more
|
|
|
efficient than the right while they represent the same content:
|
|
|
|
|
|
+--- ^AD 2 +- ^AD 1
|
|
|
| +- ^AD 1 | LINE A
|
|
|
| | LINE A +- ^AE 1
|
|
|
| +- ^AE 1 +- ^AD 2
|
|
|
| LINE B | LINE B
|
|
|
+--- ^AE 2 +- ^AE 2
|
|
|
|
|
|
Our implementation sometimes generates the less efficient data. To always
|
|
|
get the optimal form, it requires extra code complexity that seems unworthy.
|
|
|
|
|
|
0.7 Inefficiency
|
|
|
|
|
|
The file format can be slow because:
|
|
|
|
|
|
- Inserting a new line at position P requires rewriting all data after P.
|
|
|
- Finding "^AE" requires walking through the content (O(N), where N is the
|
|
|
number of lines between "^AI/D" and "^AE").
|
|
|
|
|
|
1. Linelog
|
|
|
|
|
|
The linelog is a binary format that dedicates to speed up mercurial (or
|
|
|
git)'s "annotate" operation. It's designed to avoid issues mentioned in
|
|
|
section 0.7.
|
|
|
|
|
|
1.1 Content Stored
|
|
|
|
|
|
Linelog is not another storage for file contents. It only stores line
|
|
|
numbers and corresponding revision numbers, instead of actual line content.
|
|
|
This is okay for the "annotate" operation because usually the external
|
|
|
source is fast to checkout the content of a file at a specific revision.
|
|
|
|
|
|
A typical SCCS weave is also fast on the "grep" operation, which needs
|
|
|
random accesses to line contents from different revisions of a file. This
|
|
|
can be slow with linelog's no-line-content design. However we could use
|
|
|
an extra map ((rev, line num) -> line content) to speed it up.
|
|
|
|
|
|
Note the revision numbers in linelog should be independent from mercurial
|
|
|
integer revision numbers. There should be some mapping between linelog rev
|
|
|
and hg hash stored side by side, to make the files reusable after being
|
|
|
copied to another machine.
|
|
|
|
|
|
1.2 Basic Format
|
|
|
|
|
|
A linelog file consists of "instruction"s. An "instruction" can be either:
|
|
|
|
|
|
- JGE REV ADDR # jump to ADDR if rev >= REV
|
|
|
- JL REV ADDR # jump to ADDR if rev < REV
|
|
|
- LINE REV LINENUM # append the (LINENUM+1)-th line in revision REV
|
|
|
|
|
|
For example, here is the example linelog representing the same file with
|
|
|
3 revisions mentioned in section 0.1:
|
|
|
|
|
|
SCCS | Linelog
|
|
|
Weave | Addr : Instruction
|
|
|
------+------+-------------
|
|
|
^AI 1 | 0 : JL 1 8
|
|
|
a | 1 : LINE 1 0
|
|
|
^AD 3 | 2 : JGE 3 6
|
|
|
b | 3 : LINE 1 1
|
|
|
^AI 2 | 4 : JL 2 7
|
|
|
1 | 5 : LINE 2 2
|
|
|
^AE 3 |
|
|
|
2 | 6 : LINE 2 3
|
|
|
^AE 2 |
|
|
|
c | 7 : LINE 1 2
|
|
|
^AE 1 |
|
|
|
| 8 : END
|
|
|
|
|
|
This way, "find ^AE" is O(1) because we just jump there. And we can insert
|
|
|
new lines without rewriting most part of the file by appending new lines and
|
|
|
changing a single instruction to jump to them.
|
|
|
|
|
|
The current implementation uses 64 bits for an instruction: The opcode (JGE,
|
|
|
JL or LINE) takes 2 bits, REV takes 30 bits and ADDR or LINENUM takes 32
|
|
|
bits. It also stores the max revision number and buffer size at the first
|
|
|
64 bits for quick access to these values.
|
|
|
|
|
|
1.3 Comparing with Mercurial's revlog format
|
|
|
|
|
|
Apparently, linelog is very different from revlog: linelog stores rev and
|
|
|
line numbers, while revlog has line contents and other metadata (like
|
|
|
parents, flags). However, the revlog format could also be used to store rev
|
|
|
and line numbers. For example, to speed up the annotate operation, we could
|
|
|
also pre-calculate annotate results and just store them using the revlog
|
|
|
format.
|
|
|
|
|
|
Therefore, linelog is actually somehow similar to revlog, with the important
|
|
|
trade-off that it only supports linear history (mentioned in section 0.1).
|
|
|
Essentially, the differences are:
|
|
|
|
|
|
a) Linelog is full of deltas, while revlog could contain full file
|
|
|
contents sometimes. So linelog is smaller. Revlog could trade
|
|
|
reconstruction speed for file size - best case, revlog is as small as
|
|
|
linelog.
|
|
|
b) The interleaved delta structure allows skipping large portion of
|
|
|
uninteresting deltas so linelog's content reconstruction is faster than
|
|
|
the delta-only version of revlog (however it's possible to construct
|
|
|
a case where interleaved deltas degrade to plain deltas, so linelog
|
|
|
worst case would be delta-only revlog). Revlog could trade file size
|
|
|
for reconstruction speed.
|
|
|
c) Linelog implicitly maintains the order of all lines it stores. So it
|
|
|
could dump all the lines from all revisions, with a reasonable order.
|
|
|
While revlog could also dump all line additions, it requires extra
|
|
|
computation to figure out the order putting those lines - that's some
|
|
|
kind of "merge".
|
|
|
|
|
|
"c" makes "hg absorb" easier to implement and makes it possible to do
|
|
|
"annotate --deleted".
|
|
|
|
|
|
1.4 Malformed Cases Handling
|
|
|
|
|
|
The following "case 1", "case 2", and "case 3" refer to cases mentioned
|
|
|
in section 0.5.
|
|
|
|
|
|
Using the exposed API (replacelines), case 1 is impossible to generate,
|
|
|
although it's possible to generate it by constructing rawdata and load that
|
|
|
via linelog.fromdata.
|
|
|
|
|
|
Doing annotate(maxrev) before replacelines (aka. a1, a2 passed to
|
|
|
replacelines are related to the latest revision) eliminates the possibility
|
|
|
of case 3. That makes sense since usually you'd like to make edits on top of
|
|
|
the latest revision. Practically, both absorb and fastannotate do this.
|
|
|
|
|
|
Doing annotate(maxrev), plus replacelines(rev, ...) where rev >= maxrev
|
|
|
eliminates the possibility of case 2. That makes sense since usually the
|
|
|
edits belong to "new revisions", not "old revisions". Practically,
|
|
|
fastannotate does this. Absorb calls replacelines with rev < maxrev to edit
|
|
|
past revisions. So it needs some extra care to not generate case 2.
|
|
|
|
|
|
If case 1 occurs, that probably means linelog file corruption (assuming
|
|
|
linelog is edited via public APIs) the checkout or annotate result could
|
|
|
be less meaningful or even error out, but linelog wouldn't enter an infinite
|
|
|
loop.
|
|
|
|
|
|
If either case 2 or 3 occurs, linelog works as if the inner "^AI/D" and "^AE"
|
|
|
operations on the left side are silently ignored.
|
|
|
|