linelog is a storage format inspired by the "Interleaved deltas" idea. See https://en.wikipedia.org/wiki/Interleaved_deltas for its introduction. 0. SCCS Weave To understand what linelog is, first we have a quick look at a simplified (with header removed) SCCS weave format, which is an implementation of the "Interleaved deltas" idea. 0.1 Basic SCCS Weave File Format A SCCS weave file consists of plain text lines. Each line is either a special instruction starting with "^A" or part of the content of the real file the weave tracks. There are 3 important operations, where REV denotes the revision number: ^AI REV, marking the beginning of an insertion block introduced by REV ^AD REV, marking the beginning of a deletion block introduced by REV ^AE REV, marking the end of the block started by "^AI REV" or "^AD REV" Note on revision numbers: For any two different revision numbers, one must be an ancestor of the other to make them comparable. This enforces linear history. Besides, the comparison functions (">=", "<") should be efficient. This means, if revisions are strings like git or hg, an external map is required to convert them into integers. For example, to represent the following changes: REV 1 | REV 2 | REV 3 ------+-------+------- a | a | a b | b | 2 c | 1 | c | 2 | | c | A possible weave file looks like: ^AI 1 a ^AD 3 b ^AI 2 1 ^AE 3 2 ^AE 2 c ^AE 1 An "^AE" does not always match its nearest operation ("^AI" or "^AD"). In the above example, "^AE 3" does not match the nearest "^AI 2" but "^AD 3". Therefore we need some extra information for "^AE". The SCCS weave uses a revision number. It could also be a boolean value about whether it is an insertion or a deletion (see section 0.4). 0.2 Checkout The "checkout" operation is to retrieve file content at a given revision, say X. It's doable by going through the file line by line and: - If meet ^AI rev, and rev > X, find the corresponding ^AE and jump there - If meet ^AD rev, and rev <= X, find the corresponding ^AE and jump there - Ignore ^AE - For normal lines, just output them 0.3 Annotate The "annotate" operation is to show extra metadata like the revision number and the original line number a line comes from. It's basically just a "Checkout". For the extra metadata, they can be stored side by side with the line contents. Alternatively, we can infer the revision number from "^AI"s. Some SCM tools have to calculate diffs on the fly and thus are much slower on this operation. 0.4 Tree Structure The word "interleaved" is used because "^AI" .. "^AE" and "^AD" .. "^AE" blocks can be interleaved. If we consider insertions and deletions separately, they can form tree structures, respectively. +--- ^AI 1 +--- ^AD 3 | +- ^AI 2 | +- ^AD 2 | | | | | +- ^AE 2 | +- ^AE 2 | | +--- ^AE 1 +--- ^AE 3 More specifically, it's possible to build a tree for all insertions, where the tree node has the structure "(rev, startline, endline)". "startline" is the line number of "^AI" and "endline" is the line number of the matched "^AE". The tree will have these properties: 1. child.rev > parent.rev 2. child.startline > parent.startline 3. child.endline < parent.endline A similar tree for all deletions can also be built with the first property changed to: 1. child.rev < parent.rev 0.5 Malformed Cases The following cases are considered malformed in our implementation: 1. Interleaved insertions, or interleaved deletions. It can be rewritten to a non-interleaved tree structure. Take insertions as example, deletions are similar: ^AI x ^AI x a a ^AI x + 1 -> ^AI x + 1 b b ^AE x ^AE x + 1 c ^AE x ^AE x + 1 ^AI x + 1 c ^AE x + 1 2. Nested insertions, where the inner one has a smaller revision number. Or nested deletions, where the inner one has a larger revision number. It can be rewritten to a non-nested form. Take insertions as example, deletions are similar: ^AI x + 1 ^AI x + 1 a a ^AI x -> ^AE x + 1 b ^AI x ^AE x b c ^AE x ^AE x + 1 ^AI x + 1 c ^AE x + 1 3. Insertion inside deletion with a smaller revision number. Rewrite by duplicating the content inserted: ^AD x ^AD x a a ^AI x + 1 -> b b c ^AE x + 1 ^AE x c ^AI x + 1 ^AE x b ^AE x + 1 Note: If "annotate" purely depends on "^AI" information, then the duplication content will lose track of where "b" is originally from. Some of them may be valid in other implementations for special purposes. For example, to "revive" a previously deleted block in a newer revision. 0.6 Cases Can Be Optimized It's always better to get things nested. For example, the left is more efficient than the right while they represent the same content: +--- ^AD 2 +- ^AD 1 | +- ^AD 1 | LINE A | | LINE A +- ^AE 1 | +- ^AE 1 +- ^AD 2 | LINE B | LINE B +--- ^AE 2 +- ^AE 2 Our implementation sometimes generates the less efficient data. To always get the optimal form, it requires extra code complexity that seems unworthy. 0.7 Inefficiency The file format can be slow because: - Inserting a new line at position P requires rewriting all data after P. - Finding "^AE" requires walking through the content (O(N), where N is the number of lines between "^AI/D" and "^AE"). 1. Linelog The linelog is a binary format that dedicates to speed up mercurial (or git)'s "annotate" operation. It's designed to avoid issues mentioned in section 0.7. 1.1 Content Stored Linelog is not another storage for file contents. It only stores line numbers and corresponding revision numbers, instead of actual line content. This is okay for the "annotate" operation because usually the external source is fast to checkout the content of a file at a specific revision. A typical SCCS weave is also fast on the "grep" operation, which needs random accesses to line contents from different revisions of a file. This can be slow with linelog's no-line-content design. However we could use an extra map ((rev, line num) -> line content) to speed it up. Note the revision numbers in linelog should be independent from mercurial integer revision numbers. There should be some mapping between linelog rev and hg hash stored side by side, to make the files reusable after being copied to another machine. 1.2 Basic Format A linelog file consists of "instruction"s. An "instruction" can be either: - JGE REV ADDR # jump to ADDR if rev >= REV - JL REV ADDR # jump to ADDR if rev < REV - LINE REV LINENUM # append the (LINENUM+1)-th line in revision REV For example, here is the example linelog representing the same file with 3 revisions mentioned in section 0.1: SCCS | Linelog Weave | Addr : Instruction ------+------+------------- ^AI 1 | 0 : JL 1 8 a | 1 : LINE 1 0 ^AD 3 | 2 : JGE 3 6 b | 3 : LINE 1 1 ^AI 2 | 4 : JL 2 7 1 | 5 : LINE 2 2 ^AE 3 | 2 | 6 : LINE 2 3 ^AE 2 | c | 7 : LINE 1 2 ^AE 1 | | 8 : END This way, "find ^AE" is O(1) because we just jump there. And we can insert new lines without rewriting most part of the file by appending new lines and changing a single instruction to jump to them. The current implementation uses 64 bits for an instruction: The opcode (JGE, JL or LINE) takes 2 bits, REV takes 30 bits and ADDR or LINENUM takes 32 bits. It also stores the max revision number and buffer size at the first 64 bits for quick access to these values. 1.3 Comparing with Mercurial's revlog format Apparently, linelog is very different from revlog: linelog stores rev and line numbers, while revlog has line contents and other metadata (like parents, flags). However, the revlog format could also be used to store rev and line numbers. For example, to speed up the annotate operation, we could also pre-calculate annotate results and just store them using the revlog format. Therefore, linelog is actually somehow similar to revlog, with the important trade-off that it only supports linear history (mentioned in section 0.1). Essentially, the differences are: a) Linelog is full of deltas, while revlog could contain full file contents sometimes. So linelog is smaller. Revlog could trade reconstruction speed for file size - best case, revlog is as small as linelog. b) The interleaved delta structure allows skipping large portion of uninteresting deltas so linelog's content reconstruction is faster than the delta-only version of revlog (however it's possible to construct a case where interleaved deltas degrade to plain deltas, so linelog worst case would be delta-only revlog). Revlog could trade file size for reconstruction speed. c) Linelog implicitly maintains the order of all lines it stores. So it could dump all the lines from all revisions, with a reasonable order. While revlog could also dump all line additions, it requires extra computation to figure out the order putting those lines - that's some kind of "merge". "c" makes "hg absorb" easier to implement and makes it possible to do "annotate --deleted". 1.4 Malformed Cases Handling The following "case 1", "case 2", and "case 3" refer to cases mentioned in section 0.5. Using the exposed API (replacelines), case 1 is impossible to generate, although it's possible to generate it by constructing rawdata and load that via linelog.fromdata. Doing annotate(maxrev) before replacelines (aka. a1, a2 passed to replacelines are related to the latest revision) eliminates the possibility of case 3. That makes sense since usually you'd like to make edits on top of the latest revision. Practically, both absorb and fastannotate do this. Doing annotate(maxrev), plus replacelines(rev, ...) where rev >= maxrev eliminates the possibility of case 2. That makes sense since usually the edits belong to "new revisions", not "old revisions". Practically, fastannotate does this. Absorb calls replacelines with rev < maxrev to edit past revisions. So it needs some extra care to not generate case 2. If case 1 occurs, that probably means linelog file corruption (assuming linelog is edited via public APIs) the checkout or annotate result could be less meaningful or even error out, but linelog wouldn't enter an infinite loop. If either case 2 or 3 occurs, linelog works as if the inner "^AI/D" and "^AE" operations on the left side are silently ignored.