util: implement zstd compression engine

Now that zstd is vendored and being built (in some configurations), we
can implement a compression engine for zstd!

The zstd engine is a little different from existing engines. Because
it may not always be present, we have to defer loading the module in
case importing it fails. We facilitate this via a cached property that
holds a reference to the module or None. The "available" method is
implemented to reflect reality.

The zstd engine declares its ability to handle bundles using the
"zstd" human name and the "ZS" internal name. The latter was chosen
because internal names are 2 characters (only by convention, I think)
and "ZS" seems reasonable.

The engine, like others, supports specifying the compression level.
However, there are no consumers of this API that yet pass in that
argument. I have plans to change that, so stay tuned.

Since all we need to do to support bundle generation with a new
compression engine is implement and register the compression engine,
bundle generation with zstd "just works!" Tests demonstrating this
have been added.

How does performance of zstd for bundle generation compare? On the
mozilla-unified repo, `hg bundle --all -t <engine>-v2` yields the
following on my i7-6700K on Linux:

  engine        CPU time     bundle size  vs orig size  throughput
  none             97.0s   4,054,405,584        100.0%   41.8 MB/s
  bzip2 (l=9)     393.6s     975,343,098         24.0%   10.3 MB/s
  gzip (l=6)      184.0s   1,140,533,074         28.1%   22.0 MB/s
  zstd (l=1)      108.2s   1,119,434,718         27.6%   37.5 MB/s
  zstd (l=2)      111.3s   1,078,328,002         26.6%   36.4 MB/s
  zstd (l=3)      113.7s   1,011,823,727         25.0%   35.7 MB/s
  zstd (l=4)      116.0s   1,008,965,888         24.9%   35.0 MB/s
  zstd (l=5)      121.0s     977,203,148         24.1%   33.5 MB/s
  zstd (l=6)      131.7s     927,360,198         22.9%   30.8 MB/s
  zstd (l=7)      139.0s     912,808,505         22.5%   29.2 MB/s
  zstd (l=12)     198.1s     854,527,714         21.1%   20.5 MB/s
  zstd (l=18)     681.6s     789,750,690         19.5%    5.9 MB/s

On compression, zstd for bundle generation delivers:

* better compression than gzip with significantly less CPU utilization
* better than bzip2 compression ratios while still being significantly
  faster than gzip
* ability to aggressively tune compression level to achieve
  significantly smaller bundles

That last point is important. With clone bundles, a server can
pre-generate a bundle file, upload it to a static file server, and
redirect clients to transparently download it during clone. The server
could choose to produce a zstd bundle with the highest compression
settings possible. This would take a very long time, an order of
magnitude longer than a typical zstd bundle generation, but the result
would be hundreds of megabytes smaller! For the clone volume we do at
Mozilla, this could translate to petabytes of bandwidth savings per
year and faster clones (due to smaller transfer size).

I don't have detailed numbers to report on decompression. However,
zstd decompression is fast: >1 GB/s output throughput on this machine,
even through the Python bindings. And it can do that regardless of the
compression level of the input. By the time you have enough data to
worry about overhead of decompression, you have plenty of other things
to worry about performance wise.

zstd wins all around. I can't wait to implement support for it on the
wire protocol and in revlogs.
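
The deferred-import approach described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the names `cachedproperty` and `optionalengine` are hypothetical, and the real engine also handles compression, levels, and registration. Only the "zstd" and "ZS" bundle names come from the message itself.

```python
import importlib


class cachedproperty(object):
    """Hypothetical helper: compute an attribute once per instance."""

    def __init__(self, func):
        self.func = func
        self.name = func.__name__

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        value = self.func(obj)
        # Store the result on the instance. Because this descriptor has no
        # __set__, later lookups find the instance attribute first and the
        # import is not attempted again.
        obj.__dict__[self.name] = value
        return value


class optionalengine(object):
    """Hypothetical engine whose backing module may be missing."""

    def __init__(self, modulename):
        self._modulename = modulename

    @cachedproperty
    def _module(self):
        # Defer the import until first use; remember None on failure.
        try:
            return importlib.import_module(self._modulename)
        except ImportError:
            return None

    def available(self):
        # Reflects whether the import actually succeeded.
        return self._module is not None

    def bundletype(self):
        # (human name, internal name) as given in the commit message.
        return 'zstd', 'ZS'
```

Constructing `optionalengine('some_missing_module')` never raises; the failure only shows up as `available()` returning False, which lets callers skip the engine instead of crashing at import time.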

File last commit:

r30061:8e805cf2 default
r30442:41a81067 default
check-commit
107 lines | 3.3 KiB | text/plain | TextLexer
Matt Mackall
contrib: add check-commit hook script to sanity-check commits
r22043 #!/usr/bin/env python
#
# Copyright 2014 Matt Mackall <mpm@selenic.com>
#
# A tool/hook to run basic sanity checks on commits/patches for
# submission to Mercurial. Install by adding the following to your
# .hg/hgrc:
#
# [hooks]
# pretxncommit = contrib/check-commit
#
# The hook can be temporarily bypassed with:
#
# $ BYPASS= hg commit
#
Matt Mackall
urls: bulk-change primary website URLs
r26421 # See also: https://mercurial-scm.org/wiki/ContributingChanges
Matt Mackall
contrib: add check-commit hook script to sanity-check commits
r22043
Pulkit Goyal
py3: make contrib/check-commit use print_function
r29164 from __future__ import absolute_import, print_function
Pulkit Goyal
py3: make contrib/check-commit use absolute_import
r29163
import os
import re
import sys
Matt Mackall
contrib: add check-commit hook script to sanity-check commits
r22043
timeless
check-commit: try to fix multiline handling...
r27782 commitheader = r"^(?:# [^\n]*\n)*"
afterheader = commitheader + r"(?!#)"
beforepatch = afterheader + r"(?!\n(?!@@))"
Matt Mackall
contrib: add check-commit hook script to sanity-check commits
r22043 errors = [
timeless
check-commit: try to fix multiline handling...
r27782     (beforepatch + r".*[(]bc[)]", "(BC) needs to be uppercase"),
FUJIWARA Katsunori
check-commit: wrap too long line...
r28042     (beforepatch + r".*[(]issue \d\d\d",
     "no space allowed between issue and number"),
timeless
check-commit: try to fix multiline handling...
r27782     (beforepatch + r".*[(]bug(\d|\s)", "use (issueDDDD) instead of bug"),
    (commitheader + r"# User [^@\n]+\n", "username is not an email address"),
    (commitheader + r"(?!merge with )[^#]\S+[^:] ",
Matt Mackall
contrib: add check-commit hook script to sanity-check commits
r22043      "summary line doesn't start with 'topic: '"),
timeless
check-commit: try to fix multiline handling...
r27782     (afterheader + r"[A-Z][a-z]\S+", "don't capitalize summary lines"),
    (afterheader + r"[^\n]*: *[A-Z][a-z]\S+", "don't capitalize summary lines"),
Mathias De Maré
check-commit: allow underscore as commit topic...
r30061     (afterheader + r"\S*[^A-Za-z0-9-_]\S*: ",
Matt Mackall
check-commit: try to curb bad commit summary keywords...
r27692      "summary keyword should be most user-relevant one-word command or topic"),
timeless
check-commit: try to fix multiline handling...
r27782     (afterheader + r".*\.\s*\n", "don't add trailing period on summary line"),
    (afterheader + r".{79,}", "summary line too long (limit is 78)"),
Matt Mackall
check-commit: check for double-addition of blank lines...
r28013     (r"\n\+\n( |\+)\n", "adds double empty line"),
timeless
check-commit: try to fix multiline handling...
r27782     (r"\n \n\+\n", "adds double empty line"),
Augie Fackler
check-commit: allow underbars in cffi_-prefix function names...
r29716     # Forbid "_" in function name.
    #
    # We skip the check for cffi related functions. They use names mapping the
    # name of the C function. C function names may contain "_".
    (r"\n\+[ \t]+def (?!cffi)[a-z]+_[a-z]",
     "adds a function with foo_bar naming"),
Matt Mackall
contrib: add check-commit hook script to sanity-check commits
r22043 ]
timeless
check-commit: try to fix multiline handling...
r27782 word = re.compile(r'\S')
def nonempty(first, second):
    if word.search(first):
        return first
    return second
FUJIWARA Katsunori
check-commit: omit whitespace...
r28043 def checkcommit(commit, node=None):
timeless
check-commit: modularize
r27780     exitcode = 0
timeless
check-commit: support REVs as commandline arguments...
r27781     printed = node is None
timeless
check-commit: sort errors by line number
r27783     hits = []
timeless
check-commit: modularize
r27780     for exp, msg in errors:
Matt Mackall
check-commit: scan for multiple instances of error patterns
r28012         for m in re.finditer(exp, commit):
timeless
check-commit: try to fix multiline handling...
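r27782             end = m.end()
            # Each trailing "\n" in the pattern source is two characters
            # but matches one; step back so the hit stays on its own line.
            trailing = re.search(r'(\\n)+$', exp)
            if trailing:
                end -= len(trailing.group()) // 2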
timeless
check-commit: sort errors by line number
r27783             hits.append((end, exp, msg))
    if hits:
        hits.sort()
        pos = 0
        last = ''
        for n, l in enumerate(commit.splitlines(True)):
            pos += len(l)
            while len(hits):
                end, exp, msg = hits[0]
timeless
check-commit: try to fix multiline handling...
r27782                 if pos < end:
timeless
check-commit: modularize
r27780                     break
timeless
check-commit: sort errors by line number
r27783                 if not printed:
                    printed = True
Pulkit Goyal
py3: make contrib/check-commit use print_function
r29164                     print("node: %s" % node)
                print("%d: %s" % (n, msg))
                print(" %s" % nonempty(l, last)[:-1])
timeless
check-commit: sort errors by line number
r27783                 if "BYPASS" not in os.environ:
                    exitcode = 1
                del hits[0]
            last = nonempty(l, last)
timeless
check-commit: modularize
r27780     return exitcode
Matt Mackall
contrib: add check-commit hook script to sanity-check commits
r22043
timeless
check-commit: modularize
r27780 def readcommit(node):
    return os.popen("hg export %s" % node).read()
if __name__ == "__main__":
timeless
check-commit: support REVs as commandline arguments...
r27781     exitcode = 0
timeless
check-commit: modularize
r27780     node = os.environ.get("HG_NODE")
Matt Mackall
contrib: add check-commit hook script to sanity-check commits
r22043
timeless
check-commit: modularize
r27780     if node:
        commit = readcommit(node)
timeless
check-commit: support REVs as commandline arguments...
r27781         exitcode = checkcommit(commit)
    elif sys.argv[1:]:
        for node in sys.argv[1:]:
            exitcode |= checkcommit(readcommit(node), node)
timeless
check-commit: modularize
r27780     else:
        commit = sys.stdin.read()
timeless
check-commit: support REVs as commandline arguments...
r27781         exitcode = checkcommit(commit)
timeless
check-commit: modularize
r27780     sys.exit(exitcode)
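
To see how the header-anchored patterns behave, the sketch below applies three of the rules from the `errors` list to two made-up commit texts. The `violations` helper, the `rules` subset, and both sample commits are assumptions for illustration only; the script itself reports hits through `checkcommit`, which also tracks line positions.

```python
import re

# Same prefixes as in the script: skip "# ..." export header lines, then
# require that the next line is not another header line.
commitheader = r"^(?:# [^\n]*\n)*"
afterheader = commitheader + r"(?!#)"

# A subset of the script's rules, copied verbatim.
rules = [
    (afterheader + r"[A-Z][a-z]\S+", "don't capitalize summary lines"),
    (afterheader + r".*\.\s*\n", "don't add trailing period on summary line"),
    (afterheader + r".{79,}", "summary line too long (limit is 78)"),
]


def violations(commit):
    """Return the messages of every rule that matches the commit text."""
    return [msg for exp, msg in rules if re.search(exp, commit)]


# Hypothetical input resembling the start of `hg export` output.
bad = (
    "# HG changeset patch\n"
    "# User Jane Doe <jane@example.com>\n"
    "Util: add a helper.\n"
)
good = (
    "# HG changeset patch\n"
    "# User Jane Doe <jane@example.com>\n"
    "util: add a helper\n"
)

for name, text in (("bad", bad), ("good", good)):
    print(name, violations(text))
```

Because `afterheader` consumes every leading `# ` line before the negative lookahead, the period in the email address on the `# User` line does not trigger the trailing-period rule; only the summary line is examined.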