From: Derrick Stolee <stolee@gmail.com>
To: Tao Klerks <tao@klerks.biz>
Cc: git@vger.kernel.org
Subject: Re: Removing Partial Clone / Filtered Clone on a repo
Date: Tue, 1 Jun 2021 09:40:40 -0400
Message-ID: <0b57cba9-3ab3-dfdf-5589-a0016eaea634@gmail.com>
In-Reply-To: <CAPMMpohOuXX-0YOjV46jFZFvx7mQdj0p7s8SDR4SQxj5hEhCgg@mail.gmail.com>

On 6/1/2021 9:16 AM, Tao Klerks wrote:
> On Tue, Jun 1, 2021 at 12:39 PM Derrick Stolee <stolee@gmail.com> wrote:
> 
>> Could you describe more about your scenario and why you want to
>> get all objects?
> 
> A 13GB (with 1.2GB shallow head) repo is in that in-between spot where
> you want to be able to get something useful to the user as fast as
> possible (read: in less than the 4 hours it would take to download the
> whole thing over a mediocre VPN, with corresponding risk of errors
> partway), but where a user might later (eg overnight) want to get the
> rest of the repo, to avoid history inconsistency issues.

As you describe below, the inconsistency is in terms of performance,
not correctness; I thought that was worth clarifying.

...
> With the filtered clone there are still little edge-cases that might
> motivate a user to "bite the bullet" and unfilter their clone,
> however: The most obvious one I've found so far is "git blame" - it
> loops fetch requests serially until it bottoms out, which on an older
> poorly-factored file (hundreds or thousands of commits, each touching
> different bits of a file) will effectively never complete, at
> 10s/fetch. And depending on the UI tooling the user is using, they may
> have almost no visibility into why this "git blame" (or "annotate", or
> whatever the given UI calls it) seems to hang forever.

I'm aware that the first 'git blame' on a file is a bit slow in the
partial clone case. It's been on my list for improvement whenever I
get the "spare" time to do it. However, in case someone else wants to
work on it, I will briefly outline the approach I was going to
investigate:

  During the history walk for 'git blame', it might be helpful to
  collect a batch of blobs to download in a single round trip. This
  requires refactoring the search to walk the commit history and
  collect a list of (commit id, blob id) pairs as if we were doing
  a simplified history walk. We can then ask for that list of blob
  ids in a single request and perform the line-by-line blaming
  logic on the results. [If we ever hit a point where we would do a
  rename check, pause the walk, request all blobs collected so far,
  and flush the line-by-line diff before continuing.]

This idea is likely difficult to implement, but it could
dramatically improve the first 'git blame' in a blobless clone. A
similar approach might also help the line-log logic
('git log -L').
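
In the meantime, you can approximate the batching from the outside
by prefetching every missing blob in one file's history before
running 'git blame' (this is essentially the "magic command" you
describe below). A rough sketch, with a placeholder path, assuming
the promisor remote allows fetching objects by id
(uploadpack.allowAnySHA1InWant, which partial-clone servers
generally enable):

  # Enumerate every object in the file's history. With
  # '--missing=print', missing objects are listed with a '?'
  # prefix; strip it and fetch those ids in bulk (xargs batches
  # the list, so this is a few round trips, not one per commit).
  git rev-list --objects --missing=print HEAD -- path/to/file |
    sed -n 's/^?//p' |
    xargs -r git fetch origin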

> You can work around this "git blame" issue for *most* situations, in
> the case of our repo, by using a different initial filter spec, eg
> "--filter=blob:limit=200k", which only costs you an extra 1GB or so...
> But then you still have outliers - and in fact, the most "blameable"
> files will tend to be the larger ones... :)

I'm interested in this claim that 'the most "blameable" files will
tend to be the larger ones.' I typically expect blame to be used on
human-readable text files, and my initial reaction is that larger
files are harder to use with 'git blame'.

However, your 200k limit isn't so large that we can't expect _some_
files to reach that size. Looking at the root of git.git I see a
few files above 100k, and files like diff.c come very close to
200k (uncompressed). I tend to find that the files in git.git are
smaller than those in a typical large project.
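
For reference, a quick way to check is to ask the object database
for blob sizes directly; a small sketch (best run in a clone that
already has the blobs, since listing sizes in a blobless clone
would trigger lazy fetches):

  # Blob sizes at the root of HEAD, largest last; the fourth
  # column of 'ls-tree -l' is the uncompressed size in bytes.
  # Add -r to scan the whole tree.
  git ls-tree -l HEAD | sort -nk4 | tail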
 
> My working theory is that we should explain all the following to users:
> * Your initial download is a nice compromise between functionality and
> download delay
> * You have almost all the useful history, and you have it within less
> than an hour
> * If you try to use "git blame" (or some other as-yet-undiscovered
> scenarios) on a larger file, it may hang. In that case cancel, run a
> magic command we provide which fetches all the blobs in that specific
> file's history, and try again. (the magic command is a path-filtered
> rev-list looking for missing objects, passed into fetch)
> * If you ever get tired of the rare weird hangs, you have the option
> of running *some process* that "unfilters" the repo, paying down that
> initial compromise (and taking up a bit more HD space), eg overnight

Partial clone is all about tradeoffs: you get faster clones that
download missing objects as needed. User behavior dictates how many
objects are needed, so the user can adjust that need. The fewer
objects needed locally, the faster the repo will be.

Your concern about slow commands is noted, but blindly downloading
every file in history will also slow the repo, since the full size
of the objects ends up on disk.

I think there is merit to your back-filling idea. There are likely
benefits to the "download everything missing" concept, but it would
also be good to design such a feature with other custom knobs, such
as:

* Get only "recent" history, perhaps with a "--since=<date>"
  kind of flag. This would walk commits only to a certain date,
  then find all missing blobs reachable from their root trees.

* Get only a "cone" of history. This could work especially well
  with sparse-checkout, but other pathspecs could be used to
  limit the walk. (Rough sketches of both knobs follow below.)
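
Neither knob exists today, but both can be prototyped with
plumbing. A rough sketch of the enumeration each one implies (the
date and paths here are only examples):

  # "Recent" knob: missing objects reachable from commits newer
  # than a cutoff date.
  git rev-list --objects --missing=print --since=2021-01-01 HEAD

  # "Cone" knob: missing objects limited by a pathspec.
  git rev-list --objects --missing=print HEAD -- src/ docs/

  # Either listing can be piped through the same sed/xargs fetch
  # used for the blame prefetch above.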

...
> Of course, if unfiltering a large repo is impractical (and if it will
> remain so), then we will probably need to err on the side of
> generosity in the original clone - eg 1M instead of 200k as the blob
> filter, 3GB vs 2.5GB as the initial download - and remove the last
> line of the explanation! If unfiltering, or refiltering, were
> practical, then we would probably err on the side of
> less-blobs-by-default to optimize first download.

I'm glad that you have self-discovered a workaround to handle
these cases. If we had a refiltering feature, then you could even
start with a blobless clone to have an extremely fast initial
clone, followed by a background job that downloads the remaining
objects.
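
That background job does not exist as a single command yet, but
the same plumbing as above approximates it; a rough sketch, with
the same fetch-by-id assumption:

  # Enumerate everything missing across all refs and fetch it in
  # bulk. On a 13GB repository, expect this to run for a while.
  git rev-list --objects --all --missing=print |
    sed -n 's/^?//p' |
    xargs -r git fetch origin

Note that even after such a job completes, the repository is still
configured as a partial clone (remote.origin.promisor and
remote.origin.partialclonefilter stay set), so a true "unfilter"
would presumably also clear that configuration once everything is
local.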

> Over time, as we refactor the project itself to reduce the incidence
> of megafiles, I would expect to be able to drop the
> standard/recommended blob-size-limit too.

My experience working with large repos and partial clone is
similar: the new pain points introduced by these features make
users aware of "repository smells" in their organization, and they
tend to self-correct by refactoring the repository. This is a
never-ending process as repos grow, especially with many
contributors.

Thank you for sharing your experience!

-Stolee
