upstream/mercurial-mirror Files · mercurial/pure/parsers.py

localrepo: experimental support for non-zlib revlog compression

The final part of integrating the compression manager APIs into revlog storage is the plumbing for repositories to advertise they are using non-zlib storage and for revlogs to instantiate a non-zlib compression engine.

The main intent of the compression manager work was to zstd all of the things. Adding zstd to revlogs has proved to be more involved than other places because revlogs are... special. Very small inputs and the use of delta chains (which are themselves a form of compression) are a completely different use case from streaming compression, which bundles and the wire protocol employ. I've conducted numerous experiments with zstd in revlogs and have yet to formalize compression settings and a storage architecture that I'm confident I won't regret later. In other words, I'm not yet ready to commit to a new mechanism for using zstd - or any other compression format - in revlogs.

That being said, having some support for zstd (and other compression formats) in revlogs in core is beneficial. It can allow others to conduct experiments.

This patch introduces *highly experimental* support for non-zlib compression formats in revlogs. Introduced is a config option to control which compression engine to use. Also introduced is a namespace of "exp-compression-*" requirements to denote support for non-zlib compression in revlogs. I've prefixed the namespace with "exp-" (short for "experimental") because I'm not confident of the requirements "schema" and in no way want to give the illusion of supporting these requirements in the future. I fully intend to drop support for these requirements once we figure out what we're doing with zstd in revlogs.

A good portion of the patch is teaching the requirements system about registered compression engines and passing the requested compression engine as an opener option so revlogs can instantiate the proper compression engine for new operations.

That's a verbose way of saying "we can now use zstd in revlogs!"

On an `hg pull` conversion of the mozilla-unified repo with no extra redelta settings (like aggressivemergedeltas), we can see the impact of zstd vs zlib in revlogs:

    $ hg perfrevlogchunks -c
    ! chunk
    ! wall 2.032052 comb 2.040000 user 1.990000 sys 0.050000 (best of 5)
    ! wall 1.866360 comb 1.860000 user 1.820000 sys 0.040000 (best of 6)
    ! chunk batch
    ! wall 1.877261 comb 1.870000 user 1.860000 sys 0.010000 (best of 6)
    ! wall 1.705410 comb 1.710000 user 1.690000 sys 0.020000 (best of 6)

    $ hg perfrevlogchunks -m
    ! chunk
    ! wall 2.721427 comb 2.720000 user 2.640000 sys 0.080000 (best of 4)
    ! wall 2.035076 comb 2.030000 user 1.950000 sys 0.080000 (best of 5)
    ! chunk batch
    ! wall 2.614561 comb 2.620000 user 2.580000 sys 0.040000 (best of 4)
    ! wall 1.910252 comb 1.910000 user 1.880000 sys 0.030000 (best of 6)

    $ hg perfrevlog -c -d 1
    ! wall 4.812885 comb 4.820000 user 4.800000 sys 0.020000 (best of 3)
    ! wall 4.699621 comb 4.710000 user 4.700000 sys 0.010000 (best of 3)

    $ hg perfrevlog -m -d 1000
    ! wall 34.252800 comb 34.250000 user 33.730000 sys 0.520000 (best of 3)
    ! wall 24.094999 comb 24.090000 user 23.320000 sys 0.770000 (best of 3)

Only modest wins for the changelog. But manifest reading is significantly faster. What's going on?

One reason might be data volume. zstd decompresses faster. So given more bytes, it will put more distance between it and zlib.

Another reason is size. In the current design, zstd revlogs are *larger*:

    debugcreatestreamclonebundle (size in bytes)
    zlib: 1,638,852,492
    zstd: 1,680,601,332

I haven't investigated this fully, but I reckon a significant cause of larger revlogs is that the zstd frame/header has more bytes than zlib's. For very small inputs or data that doesn't compress well, we'll tend to store more uncompressed chunks than with zlib (because the compressed size isn't smaller than original). This will make revlog reading faster because it is doing less decompression.

Moving on to bundle performance:

    $ hg bundle -a -t none-v2 (total CPU time)
    zlib: 102.79s
    zstd: 97.75s

So, marginal CPU decrease for reading all chunks in all revlogs (this is somewhat disappointing).

    $ hg bundle -a -t <engine>-v2 (total CPU time)
    zlib: 191.59s
    zstd: 115.36s

This last test effectively measures the difference between zlib->zlib and zstd->zstd for revlogs to bundle. This is a rough approximation of what a server does during `hg clone`.

There are some promising results for zstd. But not enough for me to feel comfortable advertising it to users. We'll get there...
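The remark about very small inputs being stored uncompressed (because the compressed form is not smaller than the original) can be illustrated with a short sketch. This is not Mercurial's revlog code: `maybe_compress` is a hypothetical helper, the `b"u"` marker for raw chunks is an assumption made for illustration, and zlib stands in for any engine because it ships with Python.

```python
import zlib


def maybe_compress(data):
    """Return (marker, payload) for one chunk.

    Hypothetical helper: b"u" marks a chunk stored as-is, an empty
    marker means the payload is zlib-compressed.
    """
    compressed = zlib.compress(data)
    if len(compressed) < len(data):
        return b"", compressed
    # The compressed form is not smaller (typical for tiny inputs,
    # where the format's own header dominates), so keep the raw bytes.
    return b"u", data


small = b"tiny"
large = b"a" * 1000

print(maybe_compress(small)[0])        # marker for the tiny chunk
print(len(maybe_compress(large)[1]))   # compressed size of the large chunk
```

An engine with a larger per-chunk header takes the "store raw" branch more often, which matches the observation that reads then need less decompression work.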

File last commit: r29133:25527471 default (Maciej Fijalkowski)
r30818:4c0a5a25 default
parsers.py | 178 lines | 5.5 KiB | text/x-python

      # parsers.py - Python implementation of parsers.c

      #

      # Copyright 2009 Matt Mackall <mpm@selenic.com> and others

      #

      # This software may be used and distributed according to the terms of the

      # GNU General Public License version 2 or any later version.

      from __future__ import absolute_import

      import struct

      import zlib

      from .node import nullid

      from . import pycompat

      stringio = pycompat.stringio

      _pack = struct.pack

      _unpack = struct.unpack

      _compress = zlib.compress

      _decompress = zlib.decompress

      # Some code below makes tuples directly because it's more convenient. However,

      # code outside this module should always use dirstatetuple.

      def dirstatetuple(*x):

          # x is a tuple

          return x

      indexformatng = ">Qiiiiii20s12x"

      indexfirst = struct.calcsize('Q')

      sizeint = struct.calcsize('i')

      indexsize = struct.calcsize(indexformatng)

      def gettype(q):

          return int(q & 0xFFFF)

      def offset_type(offset, type):

          return long(long(offset) << 16 | type)

      class BaseIndexObject(object):

          def __len__(self):

              return self._lgt + len(self._extra) + 1

          def insert(self, i, tup):

              assert i == -1

              self._extra.append(tup)

          def _fix_index(self, i):

              if not isinstance(i, int):

                  raise TypeError("expecting int indexes")

              if i < 0:

                  i = len(self) + i

              if i < 0 or i >= len(self):

                  raise IndexError

              return i

          def __getitem__(self, i):

              i = self._fix_index(i)

              if i == len(self) - 1:

                  return (0, 0, 0, -1, -1, -1, -1, nullid)

              if i >= self._lgt:

                  return self._extra[i - self._lgt]

              index = self._calculate_index(i)

              r = struct.unpack(indexformatng, self._data[index:index + indexsize])

              if i == 0:

                  e = list(r)

                  type = gettype(e[0])

                  e[0] = offset_type(0, type)

                  return tuple(e)

              return r

      class IndexObject(BaseIndexObject):

          def __init__(self, data):

              assert len(data) % indexsize == 0

              self._data = data

              self._lgt = len(data) // indexsize

              self._extra = []

          def _calculate_index(self, i):

              return i * indexsize

          def __delitem__(self, i):

              if not isinstance(i, slice) or not i.stop == -1 or not i.step is None:

                  raise ValueError("deleting slices only supports a:-1 with step 1")

              i = self._fix_index(i.start)

              if i < self._lgt:

                  self._data = self._data[:i * indexsize]

                  self._lgt = i

                  self._extra = []

              else:

                  self._extra = self._extra[:i - self._lgt]

      class InlinedIndexObject(BaseIndexObject):

          def __init__(self, data, inline=0):

              self._data = data

              self._lgt = self._inline_scan(None)

              self._inline_scan(self._lgt)

              self._extra = []

          def _inline_scan(self, lgt):

              off = 0

              if lgt is not None:

                  self._offsets = [0] * lgt

              count = 0

              while off <= len(self._data) - indexsize:

                  s, = struct.unpack('>i',

                      self._data[off + indexfirst:off + sizeint + indexfirst])

                  if lgt is not None:

                      self._offsets[count] = off

                  count += 1

                  off += indexsize + s

              if off != len(self._data):

                  raise ValueError("corrupted data")

              return count

          def __delitem__(self, i):

              if not isinstance(i, slice) or not i.stop == -1 or not i.step is None:

                  raise ValueError("deleting slices only supports a:-1 with step 1")

              i = self._fix_index(i.start)

              if i < self._lgt:

                  self._offsets = self._offsets[:i]

                  self._lgt = i

                  self._extra = []

              else:

                  self._extra = self._extra[:i - self._lgt]

          def _calculate_index(self, i):

              return self._offsets[i]

      def parse_index2(data, inline):

          if not inline:

              return IndexObject(data), None

          return InlinedIndexObject(data, inline), (0, data)

      def parse_dirstate(dmap, copymap, st):

          parents = [st[:20], st[20: 40]]

          # dereference fields so they will be local in loop

          format = ">cllll"

          e_size = struct.calcsize(format)

          pos1 = 40

          l = len(st)

          # the inner loop

          while pos1 < l:

              pos2 = pos1 + e_size

              e = _unpack(">cllll", st[pos1:pos2]) # a literal here is faster

              pos1 = pos2 + e[4]

              f = st[pos2:pos1]

              if '\0' in f:

                  f, c = f.split('\0')

                  copymap[f] = c

              dmap[f] = e[:4]

          return parents

      def pack_dirstate(dmap, copymap, pl, now):

          now = int(now)

          cs = stringio()

          write = cs.write

          write("".join(pl))

          for f, e in dmap.iteritems():

              if e[0] == 'n' and e[3] == now:

                  # The file was last modified "simultaneously" with the current

                  # write to dirstate (i.e. within the same second for file-

                  # systems with a granularity of 1 sec). This commonly happens

                  # for at least a couple of files on 'update'.

                  # The user could change the file without changing its size

                  # within the same second. Invalidate the file's mtime in

                  # dirstate, forcing future 'status' calls to compare the

                  # contents of the file if the size is the same. This prevents

                  # mistakenly treating such files as clean.

                  e = dirstatetuple(e[0], e[1], e[2], -1)

                  dmap[f] = e

              if f in copymap:

                  f = "%s\0%s" % (f, copymap[f])

              e = _pack(">cllll", e[0], e[1], e[2], e[3], len(f))

              write(e)

              write(f)

          return cs.getvalue()
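
The byte layout that `parse_dirstate` and `pack_dirstate` agree on (two 20-byte parents, then one `>cllll` record per file followed by the file name, with an optional `\0`-separated copy source) can be exercised with a small round trip. The listing above targets Python 2; what follows is a minimal Python 3 sketch with hypothetical helper names (`pack_entries`, `parse_entries`), and it deliberately leaves out the mtime invalidation step that `pack_dirstate` performs.

```python
import struct

ENTRY = ">cllll"  # state, mode, size, mtime, length of the name field
ENTRY_SIZE = struct.calcsize(ENTRY)  # 1 + 4 * 4 = 17 bytes


def pack_entries(parents, dmap, copymap):
    """Serialize parents and entries (hypothetical helper, bytes keys)."""
    out = [b"".join(parents)]
    for f, (state, mode, size, mtime) in dmap.items():
        if f in copymap:
            # A copied file stores "name\0source" in the name field.
            f = f + b"\0" + copymap[f]
        out.append(struct.pack(ENTRY, state, mode, size, mtime, len(f)))
        out.append(f)
    return b"".join(out)


def parse_entries(st):
    """Inverse of pack_entries; mirrors the loop in parse_dirstate."""
    parents = [st[:20], st[20:40]]
    dmap, copymap = {}, {}
    pos = 40
    while pos < len(st):
        end = pos + ENTRY_SIZE
        e = struct.unpack(ENTRY, st[pos:end])
        pos = end + e[4]          # e[4] is the length of the name field
        f = st[end:pos]
        if b"\0" in f:
            f, c = f.split(b"\0")
            copymap[f] = c
        dmap[f] = e[:4]
    return parents, dmap, copymap


parents = [b"\x11" * 20, b"\x00" * 20]
dmap = {b"a.txt": (b"n", 0o644, 12, 1000), b"b.txt": (b"a", 0, -1, -1)}
copymap = {b"b.txt": b"a.txt"}

blob = pack_entries(parents, dmap, copymap)
print(parse_entries(blob) == (parents, dmap, copymap))
```

Because each record carries the length of its name field, the parser can walk the buffer without any delimiter between entries.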
