lfs: fix the stall and corruption issue when concurrently uploading blobs

We've avoided the issue up to this point by gating worker usage with an
experimental config.  See 10e62d5efa73, and the thread linked there, for some
of the initial diagnosis.  Essentially, some data was read from the blob
before an error occurred, and `keepalive` retried without rewinding the file
pointer.  So the leading data was lost from the blob on the server, and the
connection stalled, trying to send more data than was available.

In trying to recreate this, I was unable to do so uploading from Windows to
CentOS 7.  But it reproduced every time going from CentOS 7 to another
CentOS 7 over https.

I found recent fixes in the Facebook repo to address this [1][2].  The commit
message for the first is:

    The KeepAlive HTTP implementation is bugged in its retry logic: it
    supports reading from a file pointer, but doesn't support rewinding the
    seek cursor when it performs a retry.  So it can happen that an upload
    fails for whatever reason and then 'hangs' on the retry event.

    The sequence of events that gets triggered is:

    - Upload file A, which goes OK.  Keep-Alive caches the connection.
    - Upload file B, which fails due to (for example) a failing Keep-Alive,
      but the LFS file pointer has been consumed for the upload and the fd
      has been closed.
    - The retry for file B starts, and sets the Content-Length properly to
      the expected file size.  But since the file pointer has been consumed,
      no data is uploaded, causing the server to wait for the uploaded data
      until either the client or the server reaches a timeout, making it
      seem as if our mercurial process hangs.

    This is just a stop-gap measure to prevent this behavior from blocking
    Mercurial (LFS has retry logic).  A proper solution needs to be built on
    top of this stop-gap measure: for uploads from file pointers, we should
    support fseek() on the interface.  Since we expect to consume the whole
    file always anyway, this should be safe.  This way we can seek back to
    the beginning on a retry.

I ported those two patches, and it works.  But I see that `url._sendfile()`
does a rewind on `httpsendfile` objects [3], so maybe it's better to keep
this all in one place and avoid a second seek.  We may still want the first
Facebook patch as extra protection for this problem in general.  The other
two uses of `httpsendfile` are in the wire protocol to upload bundles, and
to upload largefiles.  Neither of these appears to use a worker, and I'm not
sure why workers seem to trigger this, or if this could have happened
without a worker.

Since `httpsendfile` already has a `close()` method, that is dropped.  That
class also explicitly says there's no `__len__` attribute, so that is
removed too.  The override for `read()` is necessary to avoid the
progressbar usage per file.

[1] https://github.com/facebookexperimental/eden/commit/c350d6536d90c044c837abdd3675185644481469
[2] https://github.com/facebookexperimental/eden/commit/77f0d3fd0415e81b63e317e457af9c55c46103ee
[3] https://www.mercurial-scm.org/repo/hg/file/5.2.2/mercurial/url.py#l176

Differential Revision: https://phab.mercurial-scm.org/D7962
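
To make the failure mode concrete, here is a minimal Python sketch of the
rewind-before-send idea.  It is illustrative only, not Mercurial's actual
`url._sendfile()` or keepalive code; `connection` stands in for any object
with a `send()` method::

    def sendfile(connection, body):
        """Send a request body, rewinding file-like bodies first."""
        if hasattr(body, 'read'):
            if hasattr(body, 'seek'):
                # A failed earlier attempt may have consumed part of the
                # file; without this rewind, a retry sends fewer bytes
                # than the advertised Content-Length and the server
                # stalls waiting for the rest.
                body.seek(0)
            for chunk in iter(lambda: body.read(64 * 1024), b''):
                connection.send(chunk)
        else:
            connection.send(body)  # plain bytes need no rewind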

TODO.rst
========

Prior to removing (EXPERIMENTAL)
--------------------------------

These things affect UI and/or behavior, and should probably be implemented (or ruled out) prior to taking off the experimental shrinkwrap.

1. Finish the ``hg convert`` story

   * Add an argument to accept a rules file to apply during conversion?
     Currently ``lfs.track`` is the only way to affect the conversion.
   * Drop ``lfs.track`` config settings.
   * Splice in a ``.hglfs`` file for normal repo -> lfs conversions?

2. Stop uploading blobs when pushing between local repos

   * Could probably hardlink directly to the other local repo's store
     (see the sketch after this list).
   * Support inferring ``lfs.url`` for local push/pull (currently only
     http is supported).

3. Stop uploading blobs on strip/amend/histedit/etc.

   * This seems to be a side effect of doing it for ``hg bundle``, which
     probably makes sense.

4. Handle a server with the extension loaded and a client without the
   extension more gracefully.

5. Remove the ``lfs.retry`` hack in the client?  This came from FB, but
   it's not clear why it is/was needed.

6. ``hg export`` currently writes out the LFS blob.  Should it write the
   pointer instead?

   * ``hg diff`` is similar, and probably shouldn't see the pointer file.

7. Fix https multiplexing, and re-enable workers.

8. Show to-be-applied rules with ``hg files -r 'wdir()' 'set:lfs()'``

   * ``debugignore`` can show the file and line number, so a dedicated
     command could be useful too.

9. Filesets, revsets and templates

   * A dedicated revset should be faster than ``file(set:lfs())``.
   * Attach ``{lfsoid}`` and ``{lfspointer}`` to general keywords, IFF
     the file is a blob.  (The pointer format behind these is sketched
     after this list.)
   * Drop existing items that would be redundant with general support.

10. Can grep avoid downloading most things?

    * Add a command option to skip LFS blobs?

11. Add a flag that's visible in ``hg files -v`` to indicate external
    storage?

12. Server side issues

13. Add locks on the cache and blob store

    * This is complicated with a global store, and multiple potentially
      unrelated local repositories that reference the same blob.
    * Alternatively, maybe just handle collisions somehow when trying to
      create the same blob in the store (one possibility is sketched
      after this list).

14. Are proper file sizes reported in ``debugupgraderepo``?

15. Finish prefetching files

    * ``-T {data}`` (other than ``cat``?)
    * ``verify``
    * ``grep``

16. Output cleanup

    * Can we print the url when connecting to the blobstore?  (A sudden
      "connection refused" after pulling commits looks confusing.)  The
      problem is that 'pushing to main url' is printed, and then lfs
      wants to upload before going back to the main repo transfer, so
      extra output there could also be confusing.  (This is somewhat
      improved with 380f5131ee7b and 9f78d10742af.)
    * Add more progress indicators?  Uploading a large repo looks idle
      for a long time while it scans for blobs in each outgoing revision.
    * Print filenames instead of hashes in error messages.

      * Use subrepo aware paths, where necessary.

    * Is the existing output at the right status/note/debug level?

17. Can verify be done without downloading everything?

    * If we know that we are talking to an hg server, we can leverage the
      fact that it validates in the Batch API portion, and skip the
      download altogether.  OTOH, maybe we should download the files
      unconditionally for forensics.  The alternative is to define a
      custom transfer handler that definitively verifies without
      transferring, and then cache those results.  When verify comes
      looking, it would look in the cache instead of actually opening the
      file and processing it.
    * Yuya has concerns about when the blob fetch takes place vs. when
      the revlog is verified.  Since the visible hash matches the blob
      content, I don't think there's a way to verify the pointer file
      that's actually stored in the filelog (other than basic JSON
      checks).  Full verification requires the blob.  See
      https://www.mercurial-scm.org/pipermail/mercurial-devel/2018-April/116133.html
    * Opening a corrupt pointer file aborts.  It probably shouldn't for
      verify.
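
A hypothetical sketch of the store writes imagined in items 2 and 13 above
(the ``linkorwrite`` name and the flat ``storepath/oid`` layout are
illustrative, not Mercurial's actual API): hardlink from the other local
repo's store when possible, otherwise write through a temporary file and an
atomic rename, so racing writers of the same content-addressed blob are
harmless::

    import os
    import tempfile

    def linkorwrite(storepath, oid, srcpath=None, data=None):
        """Place blob `oid` at `storepath/oid` from `srcpath` or `data`."""
        dest = os.path.join(storepath, oid)
        if os.path.exists(dest):
            return  # same oid implies same content; nothing to do
        if srcpath is not None:
            try:
                # Item 2: share the blob with the other local repo's store.
                os.link(srcpath, dest)
                return
            except OSError:
                pass  # cross-device link, or dest appeared; fall through
            with open(srcpath, 'rb') as fp:
                data = fp.read()
        # Item 13: never expose a half-written blob.  rename() is atomic
        # on a single filesystem, so concurrent writers cannot corrupt
        # the store; whoever loses the race just rewrites identical bytes.
        fd, tmp = tempfile.mkstemp(dir=storepath, prefix='tmp-')
        try:
            os.write(fd, data)
        finally:
            os.close(fd)
        os.replace(tmp, dest)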

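For the ``{lfsoid}`` and ``{lfspointer}`` keywords in item 9, the data would
come from parsing the pointer file stored in the filelog.  The key/value
lines below follow the published Git LFS pointer format; ``parsepointer`` is
a hypothetical helper::

    def parsepointer(text):
        """Map pointer keys to values, e.g. 'oid' -> 'sha256:...'."""
        fields = {}
        for line in text.strip().splitlines():
            key, _, value = line.partition(' ')
            fields[key] = value
        return fields

    pointer = (
        'version https://git-lfs.github.com/spec/v1\n'
        'oid sha256:0123abcd...\n'  # hypothetical oid, truncated
        'size 12345\n'
    )
    assert parsepointer(pointer)['size'] == '12345'
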
Future ideas/features/polishing
-------------------------------

These aren't in any particular order, and are things that don't have obvious BC concerns.

1. Garbage collection (issue5790)

   * This gets complicated because of the global cache, which may or may
     not consist of hardlinks to the repo, and may be in use by other
     repos.  (So the gc may be pointless.)

2. Compress blobs

   * A 700MB repo becomes 2.5GB with all lfs blobs.
   * What implications are there for filesystem paths that don't indicate
     compression?  (i.e. how to share with the global cache and other
     local repos?)
   * Probably needs to be stored under ``.hg/store/lfs/zstd``, with a
     repo requirement.  (A possible write path is sketched after this
     list.)
   * Allow tuneable compression type and settings?
   * Support compression over the wire if both sides understand the
     compression?
   * ``debugupgraderepo`` to convert?
   * Probably not worth supporting compressed and uncompressed
     concurrently.

3. Determine things to upload with ``readfast()``

   * Significantly faster when pushing an entire large repo to http.
   * Causes test changes to fileset and templates; may need both this
     and the current methods of lookup.

4. Is a command to download everything needed?  This would allow copying
   the whole repo to a portable drive.  Currently this can be effected by
   running ``hg verify``.

5. Stop reading the entire file into one buffer when passing it through
   the filelog interface.

6. Keep corrupt files around in 'store/lfs/incoming' for forensics?

   * Files should be downloaded to 'incoming', and moved to the normal
     location when done.  (See the sketch after this list.)

7. Client side path enhancements

8. Server enhancements

9. Handle 3rd party server storage.

   * Teach the client to handle the LFS verify action.  This is needed
     after the server instructs the client to upload the file to another
     server, in order to tell the server that the upload completed.
   * Teach the server to send redirects if configured, and to process
     verify requests.

10. Is any hg-git work needed?
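
A sketch combining future items 2 and 6 (the ``storeincoming`` name, the
flat layout, and the use of the third-party ``zstandard`` bindings are all
assumptions for illustration): download into ``incoming``, verify the oid,
and only then move the blob atomically into its final location, optionally
zstd-compressed under ``lfs/zstd``::

    import hashlib
    import os

    def storeincoming(storepath, oid, data, compress=False):
        """Verify `data` against `oid` and move it into the blob store."""
        incoming = os.path.join(storepath, 'incoming')
        os.makedirs(incoming, exist_ok=True)
        tmp = os.path.join(incoming, oid)
        with open(tmp, 'wb') as fp:
            fp.write(data)
        if hashlib.sha256(data).hexdigest() != oid:
            # Future item 6: leave the corrupt download for forensics.
            raise ValueError('corrupt blob left in %s' % tmp)
        if compress:
            import zstandard  # third-party bindings, assumed available
            with open(tmp, 'wb') as fp:
                fp.write(zstandard.ZstdCompressor().compress(data))
            dest = os.path.join(storepath, 'zstd', oid)
        else:
            dest = os.path.join(storepath, oid)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        os.replace(tmp, dest)  # atomic: readers never see partial blobs
        return dest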