From: "Ben Peart" <peartben@gmail.com>
To: "'Jeff King'" <peff@peff.net>, "'Ben Peart'" <peartben@gmail.com>
Cc: <git@vger.kernel.org>
Subject: RE: [RFC] Add support for downloading blobs on demand
Date: Tue, 17 Jan 2017 16:50:47 -0500
Message-ID: <002601d2710b$c3715890$4a5409b0$@gmail.com>
In-Reply-To: <20170117184258.sd7h2hkv27w52gzt@sigill.intra.peff.net>

Thanks for the thoughtful response.  No need to apologize for the 
length; it's a tough problem to solve, so I don't expect it to be handled 
with a single, short email. :)

> -----Original Message-----
> From: Jeff King [mailto:peff@peff.net]
> Sent: Tuesday, January 17, 2017 1:43 PM
> To: Ben Peart <peartben@gmail.com>
> Cc: git@vger.kernel.org; Ben Peart <Ben.Peart@microsoft.com>
> Subject: Re: [RFC] Add support for downloading blobs on demand
> 
> This is an issue I've thought a lot about. So apologies in advance that this
> response turned out a bit long. :)
> 
> On Fri, Jan 13, 2017 at 10:52:53AM -0500, Ben Peart wrote:
> 
> > Design
> > ~~~~~~
> >
> > Clone and fetch will pass a  --lazy-clone  flag (open to a better name
> > here) similar to  --depth  that instructs the server to only return
> > commits and trees and to ignore blobs.
> >
> > Later during git operations like checkout, when a blob cannot be found
> > after checking all the regular places (loose, pack, alternates, etc),
> > git will download the missing object and place it into the local
> > object store (currently as a loose object) then resume the operation.
> 
> Have you looked at the "external odb" patches I wrote a while ago, and
> which Christian has been trying to resurrect?
> 
>   http://public-inbox.org/git/20161130210420.15982-1-chriscool@tuxfamily.org/
> 
> This is a similar approach, though I pushed the policy for "how do you get the
> objects" out into an external script. One advantage there is that large objects
> could easily be fetched from another source entirely (e.g., S3 or equivalent)
> rather than the repo itself.
> 
> The downside is that it makes things more complicated, because a push or a
> fetch now involves three parties (server, client, and the alternate object
> store). So questions like "do I have all the objects I need" are hard to reason
> about.
> 
> If you assume that there's going to be _some_ central Git repo which has all
> of the objects, you might as well fetch from there (and do it over normal git
> protocols). And that simplifies things a bit, at the cost of being less flexible.
> 

We looked quite a bit at the external odb patches, as well as lfs and 
even using alternates.  They all share a common downside: you must 
maintain a separate service that contains _some_ of the files.  Those 
files must also be versioned, replicated, and backed up, and the service 
itself scaled out to handle the load.  As you mentioned, having multiple 
services involved increases flexibility, but it also increases the 
complexity and decreases the reliability of the overall version control 
service.

For operational simplicity, we opted to go with a design that uses a 
single, central git repo which has _all_ the objects and to focus on 
enhancing it to handle large numbers of files efficiently.  This allows 
us to focus our efforts on a great git service and to avoid having to 
build out these other services.

> > To prevent git from accidentally downloading all missing blobs, some
> > git operations are updated to be aware of the potential for missing blobs.
> > The most obvious being check_connected which will return success as if
> > everything in the requested commits is available locally.
> 
> Actually, Git is pretty good about trying not to access blobs when it doesn't
> need to. The important thing is that you know enough about the blobs to
> fulfill has_sha1_file() and sha1_object_info() requests without actually
> fetching the data.
> 
> So the client definitely needs to have some list of which objects exist, and
> which it _could_ get if it needed to.
> 
> The one place you'd probably want to tweak things is in the diff code, as a
> single "git log -Sfoo" would fault in all of the blobs.
> 

It is an interesting idea to explore how we could be smarter about 
preventing blobs from faulting in if we had enough information to fulfill 
has_sha1_file() and sha1_object_info().  Given that we also heavily prune 
the working directory using sparse-checkout, this hasn't been our top 
focus, but it is certainly something worth looking into.
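
To make that concrete, the kind of check I have in mind would look
roughly like this (a sketch only; the "promised object" index and the
function names below are made up for illustration, not something we
have implemented):

    #include "cache.h"

    /*
     * Hypothetical: a local index of objects we know exist upstream
     * but have not downloaded, populated at clone/fetch time with
     * just enough information to answer sha1_object_info() queries.
     */
    struct promised_object {
            unsigned char sha1[20];
            enum object_type type;
            unsigned long size;
    };

    /* hypothetical lookup into that index; no network access */
    extern int find_promised_object(const unsigned char *sha1,
                                    struct promised_object *out);

    static int have_object_or_promise(const unsigned char *sha1)
    {
            struct promised_object po;

            if (has_sha1_file(sha1))
                    return 1;       /* loose, packed, or alternate */
            return find_promised_object(sha1, &po);
    }

With something like that in place, commands that only need existence,
type, or size information would never trigger a download; only an
actual read of a missing blob's contents would.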

> > To minimize the impact on the server, the existing dumb HTTP protocol
> > endpoint  objects/<sha>  can be used to retrieve the individual
> > missing blobs when needed.
> 
> This is going to behave badly on well-packed repositories, because there isn't
> a good way to fetch a single object. The best case (which is not implemented
> at all in Git) is that you grab the pack .idx, then grab "slices" of the pack
> corresponding to specific objects, including hunting down delta bases.
> 
> But then next time the server repacks, you have to throw away your .idx file.
> And those can be big. The .idx for linux.git is 135MB. You really wouldn't	
> want to do an incremental fetch of 1MB worth of objects and have to grab
> the whole .idx just to figure out which bytes you needed.
> 
> You can solve this by replacing the dumb-http server with a smart one that
> actually serves up the individual objects as if they were truly sitting on the
> filesystem. But then you haven't really minimized impact on the server, and
> you might as well teach the smart protocols to do blob fetches.
> 

Yeah, we actually implemented a new endpoint that we are using to fetch 
individual blobs.  I only found the dumb endpoint recently and thought 
"hey, maybe we can use this to make it easier for other git servers."  
For a number of good reasons, I don't think this is the right approach.

> 
> One big hurdle to this approach, no matter the protocol, is how you are
> going to handle deltas. Right now, a git client tells the server "I have this
> commit, but I want this other one". And the server knows which objects the
> client has from the first, and which it needs from the second. Moreover, it
> knows that it can send objects in delta form directly from disk if the other
> side has the delta base.
> 
> So what happens in this system? We know we don't need to send any blobs
> in a regular fetch, because the whole idea is that we only send blobs on
> demand. So we wait for the client to ask us for blob A. But then what do we
> send? If we send the whole blob without deltas, we're going to waste a lot of
> bandwidth.
> 
> The on-disk size of all of the blobs in linux.git is ~500MB. The actual data size
> is ~48GB. Some of that is from zlib, which you get even for non-deltas. But
> the rest of it is from the delta compression. I don't think it's feasible to give
> that up, at least not for "normal" source repos like linux.git (more on that in
> a minute).
> 
> So ideally you do want to send deltas. But how do you know which objects
> the other side already has, which you can use as a delta base? Sending the
> list of "here are the blobs I have" doesn't scale. Just the sha1s start to add
> up, especially when you are doing incremental fetches.
> 
> I think this sort of things performs a lot better when you just focus on large
> objects. Because they don't tend to delta well anyway, and the savings are
> much bigger by avoiding ones you don't want. So a directive like "don't
> bother sending blobs larger than 1MB" avoids a lot of these issues. In other
> words, you have some quick shorthand to communicate between the client
> and server: this is what I have, and what I don't.
> Normal git relies on commit reachability for that, but there are obviously
> other dimensions. The key thing is that both sides be able to express the
> filters succinctly, and apply them efficiently.
> 

Our challenge has been more the sheer _number_ of files that exist in 
the repo than the _size_ of those files.  With >3M source files, and 
any typical developer only needing a small percentage of those files to 
do their job, our focus has been pruning the tree as much as possible 
so that developers only pay the cost for the files they actually need.  
With typical text source files being 10K - 20K in size, the round-trip 
overhead is a significant part of the overall transfer time, so deltas 
don't help as much.  I agree that large files are also a problem, but 
they aren't my top focus at this point in time.
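
Just to put rough numbers on that (illustrative only, not measured): if
a developer needs ~35K blobs averaging ~13 KB each, the payload is only
~460 MB, but at even ~50 ms of round-trip latency per object, fetching
them one at a time spends 35,000 * 50 ms, or roughly 29 minutes, purely
on round trips.  Deltas can only shrink the ~460 MB; they do nothing
for the 29 minutes, which is why batching matters more to us than delta
compression here.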

> > After cloning, the developer can use sparse-checkout to limit the set
> > of files to the subset they need (typically only 1-10% in these large
> > repos).  This allows the initial checkout to only download the set of
> > files actually needed to complete their task.  At any point, the
> > sparse-checkout file can be updated to include additional files which
> > will be fetched transparently on demand.
> 
> If most of your benefits are not from avoiding blobs in general, but rather
> just from sparsely populating the tree, then it sounds like sparse clone might
> be an easier path forward. The general idea is to restrict not just the
> checkout, but the actual object transfer and reachability (in the tree
> dimension, the way shallow clone limits it in the time dimension, which will
> require cooperation between the client and server).
> 
> So that's another dimension of filtering, which should be expressed pretty
> succinctly: "I'm interested in these paths, and not these other ones." It's
> pretty easy to compute on the server side during graph traversal (though it
> interacts badly with reachability bitmaps, so there would need to be some
> hacks there).
> 
> It's an idea that's been talked about many times, but I don't recall that there
> were ever working patches. You might dig around in the list archive under
> the name "sparse clone" or possibly "narrow clone".

While a sparse/narrow clone would work with this proposal, it isn't 
required.  You'd still probably want all the commits and trees, but the 
clone would also bring down the blobs for the specified paths.  Combined 
with using "depth", you could further limit that to the blobs at tip.

We did run into problems with this model, however, as our usage patterns 
are such that our working directories often contain very sparse trees, 
and as a result we can end up with thousands of entries in the 
sparse-checkout file.  That makes it difficult for users to manually 
specify a sparse-checkout before they even do a clone.  We have 
implemented a hashmap-based sparse-checkout to deal with the performance 
issues of having that many entries, but that's a different RFC/PATCH.  In 
short, we found that a "lazy clone" plus downloading blobs on demand 
provided a better developer experience.
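
For a sense of scale, a sparse-checkout file for one of our developers
ends up looking something like this (the paths are made up; the real
file has an entry for each directory the developer needs, which adds up
to thousands of lines):

    /build/config/
    /src/framework/io/
    /src/framework/net/
    /src/product/featureA/
    /tools/scripts/
    [... thousands more entries ...]

Matching every path in the index against a pattern list that long on
each command is what pushed us to the hashmap-based implementation
mentioned above.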

> 
> > Now some numbers
> > ~~~~~~~~~~~~~~~~
> >
> > One repo has 3+ million files at tip across 500K folders with 5-6K
> > active developers.  They have done a lot of work to remove large files
> > from the repo so it is down to < 100GB.
> >
> > Before changes: clone took hours to transfer the 87GB .pack + 119MB
> > .idx
> >
> > After changes: clone took 4 minutes to transfer 305MB .pack + 37MB
> > .idx
> >
> > After hydrating 35K files (the typical number any individual developer
> > needs to do their work), there was an additional 460 MB of loose files
> > downloaded.
> 
> It sounds like you have a case where the repository has a lot of large files
> that are either historical, or uninteresting in the sparse-tree dimension.
> 
> How big is that 460MB if it were actually packed with deltas?
> 

Uninteresting in the sparse-tree dimension.  460 MB divided by 35K files 
is roughly 13 KB per file, which is fairly typical for source code.  
Given that there are no prior versions to compute deltas against, 
compressing them into a pack file would help some, but I don't have 
numbers on how much.  When we get to the "future work" below and start 
batching up requests, we'll have better data on that.

> > Future Work
> > ~~~~~~~~~~~
> >
> > The current prototype calls a new hook proc in
> > sha1_object_info_extended and read_object, to download each missing
> > blob.  A better solution would be to implement this via a long running
> > process that is spawned on the first download and listens for requests
> > to download additional objects until it terminates when the parent git
> > operation exits (similar to the recent long running smudge and clean filter
> work).
> 
> Yeah, see the external-odb discussion. Those prototypes use a process per
> object, but I think we all agree after seeing how the git-lfs interface has
> scaled that this is a non-starter. Recent versions of git-lfs do the single-
> process thing, and I think any sort of external-odb hook should be modeled
> on that protocol.
> 

I'm looking into this now and plan to re-implement it this way before 
sending out the first patch series.  Glad to hear you think it is a good 
protocol to model it on.
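
Roughly, I picture the client side of that long-running process looking
something like the following (a sketch only: the "get"/"status" keys
and the helper wiring are placeholders loosely modeled on the
long-running filter protocol's pkt-line framing, and the handshake and
process spawning are not shown):

    #include "cache.h"
    #include "pkt-line.h"

    /*
     * Ask an already-running helper process for one missing object.
     * to_helper/from_helper are the pipes set up when the helper was
     * spawned; the capability handshake is assumed to have happened.
     */
    static int fetch_object_from_helper(int to_helper, int from_helper,
                                        const char *sha1_hex)
    {
            char *line;

            packet_write_fmt(to_helper, "command=get\n");
            packet_write_fmt(to_helper, "sha1=%s\n", sha1_hex);
            packet_flush(to_helper);

            line = packet_read_line(from_helper, NULL);
            if (!line || strcmp(line, "status=success"))
                    return -1;

            /*
             * The object data would then follow as pkt-lines,
             * terminated by a flush packet, and be written into the
             * local object store.
             */
            return 0;
    }

Spawning the helper once per git process and keeping the pipes open is
what makes this workable for operations that touch many objects.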

> > Need to investigate an alternate batching scheme where we can make a
> > single request for a set of "related" blobs and receive single a
> > packfile (especially during checkout).
> 
> I think this sort of batching is going to be the really hard part to retrofit onto
> git. Because you're throwing out the procedural notion that you can loop
> over a set of objects and ask for each individually.
> You have to start deferring computation until answers are ready. Some
> operations can do that reasonably well (e.g., checkout), but something like
> "git log -p" is constantly digging down into history. I suppose you could just
> perform the skeleton of the operation _twice_, once to find the list of objects
> to fault in, and the second time to actually do it.
> 
> That will make git feel a lot slower, because a lot of the illusion of speed is
> the way it streams out results. OTOH, if you have to wait to fault in objects
> from the network, it's going to feel pretty slow anyway. :)
> 

The good news is that for most operations, git doesn't need access to 
all the blobs.  You're right that any command that does need them ends 
up faulting in a bunch of blobs from the network and can get pretty 
slow.  Sometimes you get streaming results, and sometimes it just 
"hangs" while we go off downloading blobs in the background.  We capture 
telemetry to detect these types of issues, but typically the users are 
more than happy to send us an "I just ran command 'foo' and it hung" 
email. :)

> -Peff

