discovery: use a lower level but faster way to retrieve parents

We already know that no revision in the undecided set is filtered, so we can skip multiple checks and directly access lower level data. In a private pathological case, this improves the timing from about 70 seconds to about 50 seconds. Other actions can still be taken to improve that case, but this gives an idea of the general overhead.
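
As a rough illustration of the "lower level" access described in this commit message, the sketch below builds a parent getter directly on top of an index's `__getitem__`. It is not Mercurial's code: `toy_index`, `make_parents_getter` and `NULLREV` are hypothetical names, and the 8-field tuple layout with parents in fields 5 and 6 is an assumption made for the example.

```python
# Hypothetical toy index: each entry mimics an 8-field index tuple.
# Assumption for this sketch: fields 5 and 6 hold the two parent revisions,
# and -1 stands for "no parent".
NULLREV = -1

toy_index = [
    (0, 0, 0, 0, 0, NULLREV, NULLREV, b"node0"),  # rev 0: root
    (0, 0, 0, 0, 1, 0, NULLREV, b"node1"),        # rev 1: child of 0
    (0, 0, 0, 1, 2, 0, NULLREV, b"node2"),        # rev 2: child of 0
    (0, 0, 0, 2, 3, 1, 2, b"node3"),              # rev 3: merge of 1 and 2
]


def make_parents_getter(index):
    """Return a function mapping a revision number to its parent tuple."""
    getentry = index.__getitem__  # bind once, outside any hot loop

    def getparents(rev):
        # Slice out both parent fields in one step, with no filtering checks.
        return getentry(rev)[5:7]

    return getparents


getparents = make_parents_getter(toy_index)
merge_parents = getparents(3)
root_parents = getparents(0)
```

The getter skips any visibility or filtering check, which is only safe when the caller already knows that every revision it asks about is unfiltered, as the commit message states for the undecided set.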

File last commit:

r42047:e514799e default
setdiscovery.py
363 lines | 12.4 KiB | text/x-python | PythonLexer
# setdiscovery.py - improved discovery of common nodeset for mercurial
#
# Copyright 2010 Benoit Boissinot <bboissin@gmail.com>
# and Peter Arrenbrecht <peter@arrenbrecht.ch>
#
# This software may be used and distributed according to the terms of the
# GNU General Public License version 2 or any later version.
"""
The algorithm works in the following way. You have two repositories: local
and remote. They both contain a DAG of changelists.

The goal of the discovery protocol is to find one set of nodes, *common*:
the set of nodes shared by local and remote.

One of the issues with the original protocol was latency: it could
potentially require lots of roundtrips to discover that the local repo was a
subset of remote (which is a very common case, you usually have few changes
compared to upstream, while upstream probably had lots of development).

The new protocol only requires one interface for the remote repo: `known()`,
which given a set of changelists tells you if they are present in the DAG.

The algorithm then works as follows:

 - We will be using three sets, `common`, `missing`, `unknown`. Originally
 all nodes are in `unknown`.
 - Take a sample from `unknown`, call `remote.known(sample)`
   - For each node that remote knows, move it and all its ancestors to `common`
   - For each node that remote doesn't know, move it and all its descendants
   to `missing`
 - Iterate until `unknown` is empty

There are a couple of optimizations. The first is, instead of starting with a
random sample of `unknown`, to start by sending all heads: in the case where
the local repo is a subset, you have computed the answer in one round trip.

Then you can do something similar to the bisecting strategy used when
finding faulty changesets. Instead of random samples, you can try picking
nodes that will maximize the number of nodes that will be
classified with them (since all ancestors or descendants will be marked as
well).
"""

from __future__ import absolute_import

import collections
import random

from .i18n import _
from .node import (
    nullid,
    nullrev,
)
from . import (
    error,
    util,
)

def _updatesample(revs, heads, sample, parentfn, quicksamplesize=0):
    """update an existing sample to match the expected size

    The sample is updated with revs exponentially distant from each head of the
    <revs> set. (H~1, H~2, H~4, H~8, etc).

    If a target size is specified, the sampling will stop once this size is
    reached. Otherwise sampling will happen until roots of the <revs> set are
    reached.

    :revs:  set of revs we want to discover (if None, assume the whole dag)
    :heads: set of DAG head revs
    :sample: a sample to update
    :parentfn: a callable to resolve parents for a revision
    :quicksamplesize: optional target size of the sample"""
    dist = {}
    visit = collections.deque(heads)
    seen = set()
    factor = 1
    while visit:
        curr = visit.popleft()
        if curr in seen:
            continue
        d = dist.setdefault(curr, 1)
        if d > factor:
            factor *= 2
        if d == factor:
            sample.add(curr)
            if quicksamplesize and (len(sample) >= quicksamplesize):
                return
        seen.add(curr)

        for p in parentfn(curr):
            if p != nullrev and (not revs or p in revs):
                dist.setdefault(p, d + 1)
                visit.append(p)

def _limitsample(sample, desiredlen):
    """return a random subset of sample of at most desiredlen items"""
    if len(sample) > desiredlen:
        sample = set(random.sample(sample, desiredlen))
    return sample

class partialdiscovery(object):
    """an object representing ongoing discovery

    Fed with data from the remote repository, this object keeps track of the
    current set of changesets in various states:

    - common:    revs also known remotely
    - undecided: revs we don't have information on yet
    - missing:   revs missing remotely
    (all tracked revisions are known locally)
    """

    def __init__(self, repo, targetheads):
        self._repo = repo
        self._targetheads = targetheads
        self._common = repo.changelog.incrementalmissingrevs()
        self._undecided = None
        self.missing = set()

    def addcommons(self, commons):
        """register nodes known as common"""
        self._common.addbases(commons)
        if self._undecided is not None:
            self._common.removeancestorsfrom(self._undecided)

    def addmissings(self, missings):
        """register some nodes as missing"""
        newmissing = self._repo.revs('%ld::%ld', missings, self.undecided)
        if newmissing:
            self.missing.update(newmissing)
            self.undecided.difference_update(newmissing)

    def addinfo(self, sample):
        """consume an iterable of (rev, known) tuples"""
        common = set()
        missing = set()
        for rev, known in sample:
            if known:
                common.add(rev)
            else:
                missing.add(rev)
        if common:
            self.addcommons(common)
        if missing:
            self.addmissings(missing)

    def hasinfo(self):
        """return True if we have any clue about the remote state"""
        return self._common.hasbases()

    def iscomplete(self):
        """True if all the necessary data have been gathered"""
        return self._undecided is not None and not self._undecided

    @property
    def undecided(self):
        if self._undecided is not None:
            return self._undecided
        self._undecided = set(self._common.missingancestors(self._targetheads))
        return self._undecided

    def commonheads(self):
        """the heads of the known common set"""
        # heads(common) == heads(common.bases) since common represents
        # common.bases and all its ancestors
        return self._common.basesheads()

    def _parentsgetter(self):
        getrev = self._repo.changelog.index.__getitem__
        def getparents(r):
            # index entries store p1 at position 5 and p2 at position 6, so
            # the slice must be [5:7] to return both parents ([5:6] would
            # silently drop the second parent of merges)
            return getrev(r)[5:7]
        return getparents

    def takequicksample(self, headrevs, size):
        """takes a quick sample of size <size>

        It is meant for initial sampling and focuses on querying heads and close
        ancestors of heads.

        :headrevs: set of head revisions in local DAG to consider
        :size: the maximum size of the sample"""
        revs = self.undecided
        if len(revs) <= size:
            return list(revs)
        sample = set(self._repo.revs('heads(%ld)', revs))

        if len(sample) >= size:
            return _limitsample(sample, size)

        _updatesample(None, headrevs, sample, self._parentsgetter(),
                      quicksamplesize=size)
        return sample

    def takefullsample(self, headrevs, size):
        """takes a sample of size <size>, walking from both heads and roots"""
        revs = self.undecided
        if len(revs) <= size:
            return list(revs)
        repo = self._repo
        sample = set(repo.revs('heads(%ld)', revs))
        parentrevs = self._parentsgetter()

        # update from heads
        revsheads = sample.copy()
        _updatesample(revs, revsheads, sample, parentrevs)

        # update from roots
        revsroots = set(repo.revs('roots(%ld)', revs))

        # _updatesample() essentially iterates over revisions to look up
        # their children. This lookup is expensive and doing it in a loop is
        # quadratic. We precompute the children for all relevant revisions and
        # make the lookup in _updatesample() a simple dict lookup.
        #
        # Because this function can be called multiple times during discovery,
        # we may still perform redundant work and there is room to optimize
        # this by keeping a persistent cache of children across invocations.
        children = {}

        for rev in repo.changelog.revs(start=min(revsroots)):
            # Always ensure revision has an entry so we don't need to worry
            # about missing keys.
            children.setdefault(rev, [])

            for prev in parentrevs(rev):
                if prev == nullrev:
                    continue

                children.setdefault(prev, []).append(rev)

        _updatesample(revs, revsroots, sample, children.__getitem__)
        assert sample
        sample = _limitsample(sample, size)
        if len(sample) < size:
            more = size - len(sample)
            sample.update(random.sample(list(revs - sample), more))
        return sample

def findcommonheads(ui, local, remote,
                    initialsamplesize=100,
                    fullsamplesize=200,
                    abortwhenunrelated=True,
                    ancestorsof=None):
    '''Return a tuple (common, anyincoming, remoteheads) used to identify
    missing nodes from or in remote.
    '''
    start = util.timer()

    roundtrips = 0
    cl = local.changelog
    clnode = cl.node
    clrev = cl.rev

    if ancestorsof is not None:
        ownheads = [clrev(n) for n in ancestorsof]
    else:
        ownheads = [rev for rev in cl.headrevs() if rev != nullrev]

    # early exit if we know all the specified remote heads already
    ui.debug("query 1; heads\n")
    roundtrips += 1
    sample = _limitsample(ownheads, initialsamplesize)
    # indices between sample and externalized version must match
    sample = list(sample)

    with remote.commandexecutor() as e:
        fheads = e.callcommand('heads', {})
        fknown = e.callcommand('known', {
            'nodes': [clnode(r) for r in sample],
        })

    srvheadhashes, yesno = fheads.result(), fknown.result()

    if cl.tip() == nullid:
        if srvheadhashes != [nullid]:
            return [nullid], True, srvheadhashes
        return [nullid], False, []

    # start actual discovery (we note this before the next "if" for
    # compatibility reasons)
    ui.status(_("searching for changes\n"))

    knownsrvheads = []  # revnos of remote heads that are known locally
    for node in srvheadhashes:
        if node == nullid:
            continue

        try:
            knownsrvheads.append(clrev(node))
        # Catches unknown and filtered nodes.
        except error.LookupError:
            continue

    if len(knownsrvheads) == len(srvheadhashes):
        ui.debug("all remote heads known locally\n")
        return srvheadhashes, False, srvheadhashes

    if len(sample) == len(ownheads) and all(yesno):
        ui.note(_("all local heads known remotely\n"))
        ownheadhashes = [clnode(r) for r in ownheads]
        return ownheadhashes, True, srvheadhashes

    # full blown discovery

    disco = partialdiscovery(local, ownheads)
    # treat remote heads (and maybe own heads) as a first implicit sample
    # response
    disco.addcommons(knownsrvheads)
    disco.addinfo(zip(sample, yesno))

    full = False
    progress = ui.makeprogress(_('searching'), unit=_('queries'))
    while not disco.iscomplete():

        if full or disco.hasinfo():
            if full:
                ui.note(_("sampling from both directions\n"))
            else:
                ui.debug("taking initial sample\n")
            samplefunc = disco.takefullsample
            targetsize = fullsamplesize
        else:
            # use even cheaper initial sample
            ui.debug("taking quick initial sample\n")
            samplefunc = disco.takequicksample
            targetsize = initialsamplesize
        sample = samplefunc(ownheads, targetsize)

        roundtrips += 1
        progress.update(roundtrips)
        ui.debug("query %i; still undecided: %i, sample size is: %i\n"
                 % (roundtrips, len(disco.undecided), len(sample)))
        # indices between sample and externalized version must match
        sample = list(sample)

        with remote.commandexecutor() as e:
            yesno = e.callcommand('known', {
                'nodes': [clnode(r) for r in sample],
            }).result()

        full = True

        disco.addinfo(zip(sample, yesno))

    result = disco.commonheads()
    elapsed = util.timer() - start
    progress.complete()
    ui.debug("%d total queries in %.4fs\n" % (roundtrips, elapsed))
    msg = ('found %d common and %d unknown server heads,'
           ' %d roundtrips in %.4fs\n')
    missing = set(result) - set(knownsrvheads)
    ui.log('discovery', msg, len(result), len(missing), roundtrips,
           elapsed)

    if not result and srvheadhashes != [nullid]:
        if abortwhenunrelated:
            raise error.Abort(_("repository is unrelated"))
        else:
            ui.warn(_("warning: repository is unrelated\n"))
        return ({nullid}, True, srvheadhashes,)

    anyincoming = (srvheadhashes != [nullid])
    result = {clnode(r) for r in result}
    return result, anyincoming, srvheadhashes
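
The three-set loop described in the module docstring can be sketched on a toy DAG as follows. This is a simplified illustration, not the implementation in this file: the names `ancestors`, `descendants`, `discover` and `remote_known` are hypothetical, the DAG is a plain dict mapping each rev to its list of parents, and later samples are drawn at random rather than with the exponential head/root sampling used above. The first sample is the set of local heads, as the docstring suggests.

```python
import random


def ancestors(parents, rev):
    """Return rev plus every ancestor of rev in the toy DAG."""
    seen, stack = set(), [rev]
    while stack:
        r = stack.pop()
        if r not in seen:
            seen.add(r)
            stack.extend(parents[r])
    return seen


def descendants(parents, rev):
    """Return rev plus every descendant of rev in the toy DAG."""
    return {r for r in parents if rev in ancestors(parents, r)}


def discover(parents, remote_known, samplesize=2, seed=0):
    """Classify every local rev as common or missing; count round trips."""
    rng = random.Random(seed)
    common, missing, unknown = set(), set(), set(parents)
    roundtrips = 0
    # First sample: all local heads (revs that are nobody's parent).
    allparents = {p for ps in parents.values() for p in ps}
    sample = [r for r in parents if r not in allparents]
    while unknown:
        roundtrips += 1
        for rev, known in zip(sample, remote_known(sample)):
            if known:
                # Remote has rev, so it also has all of rev's ancestors.
                common |= ancestors(parents, rev)
            else:
                # Remote lacks rev, so it also lacks all of rev's descendants.
                missing |= descendants(parents, rev)
        unknown -= common | missing
        sample = rng.sample(sorted(unknown), min(samplesize, len(unknown)))
    return common, missing, roundtrips


# Local history: 0 <- 1 <- 2 <- 3 <- 4, plus a branch 2 <- 5.
local_parents = {0: [], 1: [0], 2: [1], 3: [2], 4: [3], 5: [2]}
remote_revs = {0, 1, 2, 5}


def remote_known(sample):
    """Stand-in for the remote `known()` call: one boolean per sampled rev."""
    return [rev in remote_revs for rev in sample]


common, missing, roundtrips = discover(local_parents, remote_known)
```

Here the head query settles everything except rev 3, which takes one more round trip. When the remote already has every local rev, the head query alone classifies the whole graph, which is the one-round-trip case the docstring highlights.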