fix: reduce number of tool executions...
Danny Hooper
r48992:f12a19d0 default
@@ -284,20 +284,29 @@ def fix(ui, repo, *pats, **opts):
         # There are no data dependencies between the workers fixing each file
         # revision, so we can use all available parallelism.
         def getfixes(items):
-            for rev, path in items:
-                ctx = repo[rev]
+            for srcrev, path, dstrevs in items:
+                ctx = repo[srcrev]
                 olddata = ctx[path].data()
                 metadata, newdata = fixfile(
-                    ui, repo, opts, fixers, ctx, path, basepaths, basectxs[rev]
+                    ui,
+                    repo,
+                    opts,
+                    fixers,
+                    ctx,
+                    path,
+                    basepaths,
+                    basectxs[srcrev],
                 )
-                # Don't waste memory/time passing unchanged content back, but
-                # produce one result per item either way.
-                yield (
-                    rev,
-                    path,
-                    metadata,
-                    newdata if newdata != olddata else None,
-                )
+                # We ungroup the work items now, because the code that consumes
+                # these results has to handle each dstrev separately, and in
+                # topological order. Because these are handled in topological
+                # order, it's important that we pass around references to
+                # "newdata" instead of copying it. Otherwise, we would be
+                # keeping more copies of file content in memory at a time than
+                # if we hadn't bothered to group/deduplicate the work items.
+                data = newdata if newdata != olddata else None
+                for dstrev in dstrevs:
+                    yield (dstrev, path, metadata, data)

         results = worker.worker(
             ui, 1.0, getfixes, tuple(), workqueue, threadsafe=False
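
The comment about passing references is the crux of the memory win here. Below is a minimal standalone sketch of the same ungrouping pattern; the stub fix_content callable, the ungroupfixes name, and the simplified result tuples are illustrative assumptions, not fix.py's actual API. Each work item runs the fixer tools once, then fans the single result object out to every dstrev.

# Sketch only; names and tuple shapes are illustrative, not from fix.py.
def ungroupfixes(items, fix_content):
    """Run fixers once per work item, then yield one result per dstrev,
    all sharing the same content object rather than copies of it."""
    for srcrev, path, dstrevs in items:
        newdata = fix_content(srcrev, path)  # one tool execution per item
        for dstrev in dstrevs:
            # Every yielded tuple references the same object, so n dstrevs
            # cost one content buffer in memory, not n.
            yield (dstrev, path, newdata)

items = [(1, b"foo/bar.txt", (1, 2, 3))]
results = list(ungroupfixes(items, lambda rev, path: b"fixed\n"))
assert all(r[2] is results[0][2] for r in results)  # shared, not copied
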
@@ -377,23 +386,32 @@ def cleanup(repo, replacements, wdirwrit


 def getworkqueue(ui, repo, pats, opts, revstofix, basectxs):
-    """Constructs the list of files to be fixed at specific revisions
+    """Constructs a list of files to fix and which revisions each fix applies to

-    It is up to the caller how to consume the work items, and the only
-    dependence between them is that replacement revisions must be committed in
-    topological order. Each work item represents a file in the working copy or
-    in some revision that should be fixed and written back to the working copy
-    or into a replacement revision.
+    To avoid duplicating work, there is usually only one work item for each file
+    revision that might need to be fixed. There can be multiple work items per
+    file revision if the same file needs to be fixed in multiple changesets with
+    different baserevs. Each work item also contains a list of changesets where
+    the file's data should be replaced with the fixed data. The work items for
+    earlier changesets come earlier in the work queue, to improve pipelining by
+    allowing the first changeset to be replaced while fixes are still being
+    computed for later changesets.

-    Work items for the same revision are grouped together, so that a worker
-    pool starting with the first N items in parallel is likely to finish the
-    first revision's work before other revisions. This can allow us to write
-    the result to disk and reduce memory footprint. At time of writing, the
-    partition strategy in worker.py seems favorable to this. We also sort the
-    items by ascending revision number to match the order in which we commit
-    the fixes later.
+    Also returned is a map from changesets to the count of work items that might
+    affect each changeset. This is used later to count when all of a changeset's
+    work items have been finished, without having to inspect the remaining work
+    queue in each worker subprocess.
+
+    The example work item (1, "foo/bar.txt", (1, 2, 3)) means that the data of
+    bar.txt should be read from revision 1, then fixed, and written back to
+    revisions 1, 2 and 3. Revision 1 is called the "srcrev" and the list of
+    revisions is called the "dstrevs". In practice the srcrev is always one of
+    the dstrevs, and we make that choice when constructing the work item so that
+    the choice can't be made inconsistently later on. The dstrevs should all
+    have the same file revision for the given path, so the choice of srcrev is
+    arbitrary. The wdirrev can be a dstrev and a srcrev.
     """
-    workqueue = []
+    dstrevmap = collections.defaultdict(list)
     numitems = collections.defaultdict(int)
     maxfilesize = ui.configbytes(b'fix', b'maxfilesize')
     for rev in sorted(revstofix):
@@ -411,8 +429,21 @@ def getworkqueue(ui, repo, pats, opts, r
                     % (util.bytecount(maxfilesize), path)
                 )
                 continue
-            workqueue.append((rev, path))
+            baserevs = tuple(ctx.rev() for ctx in basectxs[rev])
+            dstrevmap[(fctx.filerev(), baserevs, path)].append(rev)
             numitems[rev] += 1
+    workqueue = [
+        (min(dstrevs), path, dstrevs)
+        for (filerev, baserevs, path), dstrevs in dstrevmap.items()
+    ]
+    # Move work items for earlier changesets to the front of the queue, so we
+    # might be able to replace those changesets (in topological order) while
+    # we're still processing later work items. Note the min() in the previous
+    # expression, which means we don't need a custom comparator here. The path
+    # is also important in the sort order to make the output order stable. There
+    # are some situations where this doesn't help much, but some situations
+    # where it lets us buffer O(1) files instead of O(n) files.
+    workqueue.sort()
     return workqueue, numitems


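
Seen in isolation, the new getworkqueue logic is a group-by followed by a stable sort. Here is a self-contained sketch of that grouping under assumed inputs: plain (rev, path, filerev, baserevs) tuples stand in for the repo contexts, and the group_work_items name and candidate shape are invented for illustration.

import collections

def group_work_items(candidates):
    """Group (rev, path, filerev, baserevs) candidates so that identical
    file revisions with identical baserevs are fixed only once.

    Returns work items shaped like fix.py's: (srcrev, path, dstrevs).
    """
    dstrevmap = collections.defaultdict(list)
    for rev, path, filerev, baserevs in candidates:
        dstrevmap[(filerev, baserevs, path)].append(rev)
    workqueue = [
        (min(dstrevs), path, tuple(dstrevs))
        for (filerev, baserevs, path), dstrevs in dstrevmap.items()
    ]
    # Sorting by (srcrev, path) keeps earlier changesets at the front so
    # they can be replaced while later fixes are still being computed.
    workqueue.sort()
    return workqueue

# Revisions 1, 2 and 3 share one file revision of foo/bar.txt with one
# baserev, so they collapse into a single execution of the fixer tools.
candidates = [
    (1, "foo/bar.txt", "abc123", (0,)),
    (2, "foo/bar.txt", "abc123", (0,)),
    (3, "foo/bar.txt", "abc123", (0,)),
    (3, "baz.txt", "def456", (0,)),
]
print(group_work_items(candidates))
# [(1, 'foo/bar.txt', (1, 2, 3)), (3, 'baz.txt', (3,))]
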
@@ -517,9 +548,9 @@ def getbasepaths(repo, opts, workqueue, 
         return {}

     basepaths = {}
-    for rev, path in workqueue:
-        fixctx = repo[rev]
-        for basectx in basectxs[rev]:
+    for srcrev, path, _dstrevs in workqueue:
+        fixctx = repo[srcrev]
+        for basectx in basectxs[srcrev]:
             basepath = copies.pathcopies(basectx, fixctx).get(path, path)
             if basepath in basectx:
                 basepaths[(basectx.rev(), fixctx.rev(), path)] = basepath
@@ -642,10 +673,10 @@ def _prefetchfiles(repo, workqueue, base
     toprefetch = set()

     # Prefetch the files that will be fixed.
-    for rev, path in workqueue:
-        if rev == wdirrev:
+    for srcrev, path, _dstrevs in workqueue:
+        if srcrev == wdirrev:
             continue
-        toprefetch.add((rev, path))
+        toprefetch.add((srcrev, path))

     # Prefetch the base contents for lineranges().
@@ -1797,7 +1797,56 @@ fixed.
   $ cat $LOGFILE | sort | uniq -c
      4 bar.log
      4 baz.log
-     4 foo.log
-     4 qux.log
+     3 foo.log
+     2 qux.log

   $ cd ..
+
+For tools that support line ranges, it's wrong to blindly re-use fixed file
+content for the same file revision if it appears twice with different baserevs,
+because the line ranges could be different. Since computing line ranges is
+ambiguous, this isn't a matter of correctness, but it affects the usability of
+this extension. It could maybe be simpler if baserevs were computed on a
+per-file basis to make this situation impossible to construct.
+
+In the following example, we construct two subgraphs with the same file
+revisions, and fix different sub-subgraphs to get different baserevs and
+different changed line ranges. The key precondition is that revisions 1 and 4
+have the same file revision, and the key result is that their successors don't
+have the same file content, because we want to fix different areas of that same
+file revision's content.
+
+  $ hg init differentlineranges
+  $ cd differentlineranges
+
+  $ printf "a\nb\n" > file.changed
+  $ hg commit -Aqm "0 ab"
+  $ printf "a\nx\n" > file.changed
+  $ hg commit -Aqm "1 ax"
+  $ hg remove file.changed
+  $ hg commit -Aqm "2 removed"
+  $ hg revert file.changed -r 0
+  $ hg commit -Aqm "3 ab (reverted)"
+  $ hg revert file.changed -r 1
+  $ hg commit -Aqm "4 ax (reverted)"
+
+  $ hg manifest --debug --template "{hash}\n" -r 0; \
+  > hg manifest --debug --template "{hash}\n" -r 3
+  418f692145676128d2fb518b027ddbac624be76e
+  418f692145676128d2fb518b027ddbac624be76e
+  $ hg manifest --debug --template "{hash}\n" -r 1; \
+  > hg manifest --debug --template "{hash}\n" -r 4
+  09b8b3ce5a507caaa282f7262679e6d04091426c
+  09b8b3ce5a507caaa282f7262679e6d04091426c
+
+  $ hg fix --working-dir -r 1+3+4
+  3 new orphan changesets
+
+  $ hg cat file.changed -r "successors(1)" --hidden
+  a
+  X
+  $ hg cat file.changed -r "successors(4)" --hidden
+  A
+  X
+
+  $ cd ..
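
The two different outcomes for one file revision come down to changed line ranges depending on the base content: against base "a\nb" only line 2 of "a\nx" differs, while against a base where the file is absent (rev 1's successor vs. rev 4's, whose nearest unfixed ancestor removed the file) every line counts as changed. A rough standalone illustration of that dependence using difflib; the changed_lines helper is hypothetical and is not the extension's actual lineranges() machinery.

import difflib

def changed_lines(base, target):
    """Return 1-based line numbers of target that differ from base.
    If there is no base version at all, every line counts as changed."""
    if base is None:
        return set(range(1, len(target) + 1))
    matcher = difflib.SequenceMatcher(None, base, target)
    changed = set()
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag != 'equal':
            changed.update(range(j1 + 1, j2 + 1))
    return changed

content = ["a", "x"]  # the file revision shared by revisions 1 and 4
print(changed_lines(["a", "b"], content))  # base has "a\nb" -> {2}: fix yields a, X
print(changed_lines(None, content))        # no base       -> {1, 2}: fix yields A, X
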