# setdiscovery.py - improved discovery of common nodeset for mercurial
#
# Copyright 2010 Benoit Boissinot <bboissin@gmail.com>
# and Peter Arrenbrecht <peter@arrenbrecht.ch>
#
# This software may be used and distributed according to the terms of the
# GNU General Public License version 2 or any later version.
"""
The algorithm works in the following way. You have two repositories: local
and remote. They both contain a DAG of changelists.

The goal of the discovery protocol is to find one set of nodes, *common*,
the set of nodes shared by local and remote.

One of the issues with the original protocol was latency: it could
potentially require lots of roundtrips to discover that the local repo was a
subset of remote (which is a very common case, you usually have few changes
compared to upstream, while upstream probably had lots of development).

The new protocol only requires one interface for the remote repo: `known()`,
which given a set of changelists tells you if they are present in the DAG.

The algorithm then works as follows:

 - We will be using three sets, `common`, `missing`, `unknown`. Originally
   all nodes are in `unknown`.
 - Take a sample from `unknown`, call `remote.known(sample)`
   - For each node that remote knows, move it and all its ancestors to `common`
   - For each node that remote doesn't know, move it and all its descendants
     to `missing`
 - Iterate until `unknown` is empty

There are a couple of optimizations. The first is, instead of starting with
a random sample of `unknown`, to start by sending all heads: in the case where
the local repo is a subset, you have computed the answer in one round trip.

Then you can do something similar to the bisecting strategy used when
finding faulty changesets. Instead of random samples, you can try picking
nodes that will maximize the number of nodes that will be
classified with it (since all ancestors or descendants will be marked as well).
"""
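# A minimal, self-contained sketch of the loop described in the docstring
# above. It is an illustration under stated assumptions, not the code
# Mercurial runs: the DAG is a plain dict of rev -> parent revs,
# `remote_known` is a hypothetical stand-in for the remote `known()` call, and
# the helper names (`ancestors`, `descendants`, `discover`) are invented for
# this example. Samples are purely random here; the heads-first and
# bisection-like optimizations are left out.

```python
import random


def ancestors(parents, rev):
    """Return rev plus all of its ancestors."""
    seen, stack = set(), [rev]
    while stack:
        r = stack.pop()
        if r not in seen:
            seen.add(r)
            stack.extend(parents[r])
    return seen


def descendants(parents, rev):
    """Return rev plus every rev that has it as an ancestor."""
    return {r for r in parents if rev in ancestors(parents, r)}


def discover(parents, remote_known, samplesize=2, seed=0):
    """Classify every local rev as common or missing (toy version)."""
    rng = random.Random(seed)
    common, missing, unknown = set(), set(), set(parents)
    roundtrips = 0
    while unknown:
        # One "known(sample)" query per iteration.
        sample = rng.sample(sorted(unknown), min(samplesize, len(unknown)))
        roundtrips += 1
        for rev in sample:
            if remote_known(rev):
                common |= ancestors(parents, rev)
            else:
                missing |= descendants(parents, rev)
        unknown -= common | missing
    return common, missing, roundtrips


# Local DAG: 0 <- 1 <- 2, with two children of 2 (revs 3 and 4).
# The assumed remote has revs 0..3 but not rev 4.
local = {0: [], 1: [0], 2: [1], 3: [2], 4: [2]}
common, missing, roundtrips = discover(local, lambda r: r in {0, 1, 2, 3})
print(sorted(common), sorted(missing), roundtrips)
```

# Every sampled rev classifies itself, so `unknown` shrinks on each query and
# the loop terminates; the number of queries depends on which revs are drawn.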

from __future__ import absolute_import

import collections
import random

from .i18n import _
from .node import (
    nullid,
    nullrev,
)
from . import (
    error,
    util,
)

def _updatesample(revs, heads, sample, parentfn, quicksamplesize=0):
    """update an existing sample to match the expected size

    The sample is updated with revs exponentially distant from each head of the
    <revs> set. (H~1, H~2, H~4, H~8, etc).

    If a target size is specified, the sampling will stop once this size is
    reached. Otherwise sampling will happen until roots of the <revs> set are
    reached.

    :revs:  set of revs we want to discover (if None, assume the whole dag)
    :heads: set of DAG head revs
    :sample: a sample to update
    :parentfn: a callable to resolve parents for a revision
    :quicksamplesize: optional target size of the sample"""
    dist = {}
    visit = collections.deque(heads)
    seen = set()
    factor = 1
    while visit:
        curr = visit.popleft()
        if curr in seen:
            continue
        d = dist.setdefault(curr, 1)
        if d > factor:
            factor *= 2
        if d == factor:
            sample.add(curr)
            if quicksamplesize and (len(sample) >= quicksamplesize):
                return
        seen.add(curr)

        for p in parentfn(curr):
            if p != nullrev and (not revs or p in revs):
                dist.setdefault(p, d + 1)
                visit.append(p)
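
# To see which revs the walk above selects, here is a standalone sketch of the
# same breadth-first traversal on a linear history. Assumptions made for the
# example: `NULLREV`, `exponential_sample` and `linear_parents` are
# hypothetical names, the history is the chain 0 <- 1 <- ... <- 20, and there
# is no target size. Note that in this sketch the starting rev counts as
# distance 1, so the picks for head H are H, H~1, H~3, H~7 and H~15.

```python
import collections

NULLREV = -1  # stand-in for Mercurial's nullrev in this sketch


def exponential_sample(heads, parentfn, revs=None):
    """Collect revs whose distance from a head is 1, 2, 4, 8, ..."""
    dist = {}
    visit = collections.deque(heads)
    seen = set()
    sample = set()
    factor = 1
    while visit:
        curr = visit.popleft()
        if curr in seen:
            continue
        d = dist.setdefault(curr, 1)
        if d > factor:
            factor *= 2
        if d == factor:
            sample.add(curr)
        seen.add(curr)
        for p in parentfn(curr):
            if p != NULLREV and (not revs or p in revs):
                dist.setdefault(p, d + 1)
                visit.append(p)
    return sample


def linear_parents(rev):
    """Parents in a linear chain: the previous rev and no second parent."""
    return (rev - 1, NULLREV)


print(sorted(exponential_sample([20], linear_parents)))
# prints [5, 13, 17, 19, 20]
```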

def _limitsample(sample, desiredlen):
    """return a random subset of sample of at most desiredlen items"""
    if len(sample) > desiredlen:
        sample = set(random.sample(sample, desiredlen))
    return sample

class partialdiscovery(object):
    """an object representing ongoing discovery

    Fed with data from the remote repository, this object keeps track of the
    current set of changesets in various states:

    - common:    revs also known remotely
    - undecided: revs we don't have information on yet
    - missing:   revs missing remotely
    (all tracked revisions are known locally)
    """

    def __init__(self, repo, targetheads):
        self._repo = repo
        self._targetheads = targetheads
        self._common = repo.changelog.incrementalmissingrevs()
        self._undecided = None
        self.missing = set()
        self._childrenmap = None

    def addcommons(self, commons):
        """register nodes known as common"""
        self._common.addbases(commons)
        if self._undecided is not None:
            self._common.removeancestorsfrom(self._undecided)

    def addmissings(self, missings):
        """register some nodes as missing"""
        newmissing = self._repo.revs('%ld::%ld', missings, self.undecided)
        if newmissing:
            self.missing.update(newmissing)
            self.undecided.difference_update(newmissing)

    def addinfo(self, sample):
        """consume an iterable of (rev, known) tuples"""
        common = set()
        missing = set()
        for rev, known in sample:
            if known:
                common.add(rev)
            else:
                missing.add(rev)
        if common:
            self.addcommons(common)
        if missing:
            self.addmissings(missing)

    def hasinfo(self):
        """return True if we have any clue about the remote state"""
        return self._common.hasbases()

    def iscomplete(self):
        """True if all the necessary data have been gathered"""
        return self._undecided is not None and not self._undecided

    @property
    def undecided(self):
        if self._undecided is not None:
            return self._undecided
        self._undecided = set(self._common.missingancestors(self._targetheads))
        return self._undecided

    def stats(self):
        return {
            'undecided': len(self.undecided),
        }

    def commonheads(self):
        """the heads of the known common set"""
        # heads(common) == heads(common.bases) since common represents
        # common.bases and all its ancestors
        return self._common.basesheads()

    def _parentsgetter(self):
        getrev = self._repo.changelog.index.__getitem__
        def getparents(r):
            return getrev(r)[5:7]
        return getparents

    def _childrengetter(self):

        if self._childrenmap is not None:
            # During discovery, the `undecided` set keeps shrinking.
            # Therefore, the map computed for an iteration N will be
            # valid for iteration N+1. Instead of computing the same
            # data over and over we cache it the first time.
            return self._childrenmap.__getitem__

        # _updatesample() essentially iterates over revisions to look
        # up their children. This lookup is expensive and doing it in a loop is
        # quadratic. We precompute the children for all relevant revisions and
        # make the lookup in _updatesample() a simple dict lookup.
        self._childrenmap = children = {}

        parentrevs = self._parentsgetter()
        revs = self.undecided

        for rev in sorted(revs):
            # Always ensure revision has an entry so we don't need to worry
            # about missing keys.
            children[rev] = []
            for prev in parentrevs(rev):
                if prev == nullrev:
                    continue
                c = children.get(prev)
                if c is not None:
                    c.append(rev)
        return children.__getitem__

    def takequicksample(self, headrevs, size):
        """takes a quick sample of size <size>

        It is meant for initial sampling and focuses on querying heads and close
        ancestors of heads.

        :headrevs: set of head revisions in local DAG to consider
        :size: the maximum size of the sample"""
        revs = self.undecided
        if len(revs) <= size:
            return list(revs)
        sample = set(self._repo.revs('heads(%ld)', revs))

        if len(sample) >= size:
            return _limitsample(sample, size)

        _updatesample(None, headrevs, sample, self._parentsgetter(),
                      quicksamplesize=size)
        return sample

    def takefullsample(self, headrevs, size):
        revs = self.undecided
        if len(revs) <= size:
            return list(revs)
        repo = self._repo
        sample = set(repo.revs('heads(%ld)', revs))
        parentrevs = self._parentsgetter()

        # update from heads
        revsheads = sample.copy()
        _updatesample(revs, revsheads, sample, parentrevs)

        # update from roots
        revsroots = set(repo.revs('roots(%ld)', revs))

        childrenrevs = self._childrengetter()

        _updatesample(revs, revsroots, sample, childrenrevs)
        assert sample
        sample = _limitsample(sample, size)
        if len(sample) < size:
            more = size - len(sample)
            sample.update(random.sample(list(revs - sample), more))
        return sample

def findcommonheads(ui, local, remote,
                    initialsamplesize=100,
                    fullsamplesize=200,
                    abortwhenunrelated=True,
                    ancestorsof=None,
                    samplegrowth=1.05):
    '''Return a tuple (common, anyincoming, remoteheads) used to identify
    missing nodes from or in remote.
    '''
    start = util.timer()

    roundtrips = 0
    cl = local.changelog
    clnode = cl.node
    clrev = cl.rev

    if ancestorsof is not None:
        ownheads = [clrev(n) for n in ancestorsof]
    else:
        ownheads = [rev for rev in cl.headrevs() if rev != nullrev]

    # early exit if we know all the specified remote heads already
    ui.debug("query 1; heads\n")
    roundtrips += 1
    # We also ask remote about all the local heads. That set can be arbitrarily
    # large, so we used to limit its size to `initialsamplesize`. We no longer
    # do, as it proved counterproductive. The skipped heads could lead to a
    # large "undecided" set, slower to be clarified than if we asked the
    # question for all heads right away.
    #
    # We are already fetching all server heads using the `heads` command,
    # sending an equivalent number of heads the other way should not have a
    # significant impact. In addition, it is very likely that we are going to
    # have to issue "known" requests for an equivalent amount of revisions in
    # order to decide if these heads are common or missing.
    #
    # Find a detailed analysis below.
    #
    # Case A: local and server both have few heads
    #
    #   Ownheads is below initialsamplesize, limit would not have any effect.
    #
    # Case B: local has few heads and server has many
    #
    #   Ownheads is below initialsamplesize, limit would not have any effect.
    #
    # Case C: local and server both have many heads
    #
    #   We now transfer some more data, but not significantly more than is
    #   already transferred to carry the server heads.
    #
    # Case D: local has many heads, server has few
    #
    #   D.1 local heads are mostly known remotely
    #
    #     All the known heads will have to be part of a `known` request at
    #     some point for the discovery to finish. Sending them all earlier is
    #     actually helping.
    #
    #     (This case is fairly unlikely, it requires the numerous heads to all
    #     be merged server side in only a few heads)
    #
    #   D.2 local heads are mostly missing remotely
    #
    #     To determine that the heads are missing, we'll have to issue `known`
    #     requests for them or one of their ancestors. The number of `known`
    #     requests will likely be in the same order of magnitude as the number
    #     of local heads.
    #
    #     The only case where we can be more efficient using `known` requests
    #     on ancestors is the case where all the "missing" local heads are
    #     based on a few changesets, also "missing". This means we would have
    #     a "complex" graph (with many heads) attached to, but very
    #     independent of, the "simple" graph on the server. This is a fairly
    #     unusual case and has not been met in the wild so far.
    if remote.limitedarguments:
        sample = _limitsample(ownheads, initialsamplesize)
        # indices between sample and externalized version must match
        sample = list(sample)
    else:
        sample = ownheads

    with remote.commandexecutor() as e:
        fheads = e.callcommand('heads', {})
        fknown = e.callcommand('known', {
            'nodes': [clnode(r) for r in sample],
        })

    srvheadhashes, yesno = fheads.result(), fknown.result()

    if cl.tip() == nullid:
        if srvheadhashes != [nullid]:
            return [nullid], True, srvheadhashes
        return [nullid], False, []

    # start actual discovery (we note this before the next "if" for
    # compatibility reasons)
    ui.status(_("searching for changes\n"))

    knownsrvheads = []  # revnos of remote heads that are known locally
    for node in srvheadhashes:
        if node == nullid:
            continue

        try:
            knownsrvheads.append(clrev(node))
        # Catches unknown and filtered nodes.
        except error.LookupError:
            continue

    if len(knownsrvheads) == len(srvheadhashes):
        ui.debug("all remote heads known locally\n")
        return srvheadhashes, False, srvheadhashes

    if len(sample) == len(ownheads) and all(yesno):
        ui.note(_("all local heads known remotely\n"))
        ownheadhashes = [clnode(r) for r in ownheads]
        return ownheadhashes, True, srvheadhashes

    # full blown discovery

    disco = partialdiscovery(local, ownheads)
    # treat remote heads (and maybe own heads) as a first implicit sample
    # response
    disco.addcommons(knownsrvheads)
    disco.addinfo(zip(sample, yesno))

    full = False
    progress = ui.makeprogress(_('searching'), unit=_('queries'))
    while not disco.iscomplete():

        if full or disco.hasinfo():
            if full:
                ui.note(_("sampling from both directions\n"))
            else:
                ui.debug("taking initial sample\n")
            samplefunc = disco.takefullsample
            targetsize = fullsamplesize
            if not remote.limitedarguments:
                fullsamplesize = int(fullsamplesize * samplegrowth)
        else:
            # use even cheaper initial sample
            ui.debug("taking quick initial sample\n")
            samplefunc = disco.takequicksample
            targetsize = initialsamplesize
        sample = samplefunc(ownheads, targetsize)

        roundtrips += 1
        progress.update(roundtrips)
        stats = disco.stats()
        ui.debug("query %i; still undecided: %i, sample size is: %i\n"
                 % (roundtrips, stats['undecided'], len(sample)))

        # indices between sample and externalized version must match
        sample = list(sample)

        with remote.commandexecutor() as e:
            yesno = e.callcommand('known', {
                'nodes': [clnode(r) for r in sample],
            }).result()

        full = True

        disco.addinfo(zip(sample, yesno))

    result = disco.commonheads()
    elapsed = util.timer() - start
    progress.complete()
    ui.debug("%d total queries in %.4fs\n" % (roundtrips, elapsed))
    msg = ('found %d common and %d unknown server heads,'
           ' %d roundtrips in %.4fs\n')
    missing = set(result) - set(knownsrvheads)
    ui.log('discovery', msg, len(result), len(missing), roundtrips,
           elapsed)

    if not result and srvheadhashes != [nullid]:
        if abortwhenunrelated:
            raise error.Abort(_("repository is unrelated"))
        else:
            ui.warn(_("warning: repository is unrelated\n"))
        return ({nullid}, True, srvheadhashes,)

    anyincoming = (srvheadhashes != [nullid])
    result = {clnode(r) for r in result}
    return result, anyincoming, srvheadhashes
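
# The discovery loop in findcommonheads() multiplies `fullsamplesize` by
# `samplegrowth` after each full sample (when the remote does not have limited
# arguments). The sketch below only reproduces that arithmetic with the default
# values from the signature; `sample_sizes` and `rounds` are hypothetical names
# introduced for the illustration.

```python
def sample_sizes(fullsamplesize=200, samplegrowth=1.05, rounds=5):
    """Return the target size used by each successive full sample."""
    sizes = []
    for _ in range(rounds):
        # The current value is used as the target for this round ...
        sizes.append(fullsamplesize)
        # ... and then grown (and truncated to an int) for the next one.
        fullsamplesize = int(fullsamplesize * samplegrowth)
    return sizes


print(sample_sizes())
```

# With the defaults the first targets are 200, 210 and 220, so the growth is
# slow: roughly 5% more revisions per additional roundtrip.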