Re: Questions about partial clone with '--filter=tree:0'

From: Jonathan Tan <jonathantanmy@google.com>
To: alexandr.miloslavskiy@syntevo.com
Cc: jonathantanmy@google.com, git@vger.kernel.org,
	christian.couder@gmail.com, marc.strapetz@syntevo.com,
	me@ttaylorr.com
Subject: Re: Questions about partial clone with '--filter=tree:0'
Date: Mon, 26 Oct 2020 12:46:35 -0700	[thread overview]
Message-ID: <20201026194635.2119420-1-jonathantanmy@google.com> (raw)
In-Reply-To: <2f04c074-3eee-766c-bedb-2e3cc0a91528@syntevo.com>

> > Having such an option (and teaching "blame" to use it to prefetch) would
> > indeed speed up "blame". But if we implement this, what would happen if
> > the user ran "blame" on the same file twice? I can't think of a way of
> > preventing the same fetch from happening twice except by checking the
> > existence of, say, the last 10 OIDs corresponding to that path. But if
> > we have the list of those 10 OIDs, we could just prefetch those 10 OIDs
> > without needing a new filter.
> 
> I must admit that I didn't notice this problem. Still, it seems easy 
> enough to solve with this approach:
> 
> 1) Estimate number of missing things
> 2) If "many", just download everything for <path> as described before
>     and consider it done.
> 3) If "not so many", assemble a list of OIDs on the boundary of unknown
>     (for example, all root tree OIDs for commits that are missing any
>     trees) and use the usual fetch to download all OIDs in one go.
> 4) Repeat step 3 multiple times. Only N=<maximum tree depth> requests
>     are needed, regardless of the number of commits.

My point was that if you can estimate it ("have the list of those 10
OIDs"), then you can just fetch it. This does send "quite a bit of
OIDs", as you said below - I'll address it below.

> > Another possible solution that has been discussed before (but a much
> > more involved one) is to teach Git to be able to serve results of
> > computations, and then have "blame" be able to stitch that with local
> > data. (For example, "blame" could check the history of a certain path to
> > find the commit(s) that the remote has information of, query the remote
> > for those commits, and then stitch the results together with local
> > history.) This scheme would work not only for "blame" but for things
> > like "grep" (with history) and "log -S", whereas
> > "--filter=sparse:parthlist" would only work with "blame". But
> > admittedly, this solution is more involved.
> 
> I understand that you're basically talking about implementing 
> prefetching in git itself?

No - I did talk about prefetching earlier, but here I mean having Git on
the server perform the "blame" computation itself.

For example, let's say I want to run "blame" on foo.txt at HEAD. HEAD
and HEAD^ are commits that only the local client has, whereas HEAD^^ was
fetched from the remote. By comparing HEAD, HEAD^, and HEAD^^, Git knows
which lines come from HEAD and HEAD^. For the rest, Git would make a
request to the server, passing the commit ID and the path, and would get
back a list of line numbers and commits.

> To my understanding, this will still need 
> either the command I suggested, or implement graph walking with massive 
> OID requests as described above in 1)2)3)4). The latter will not require 
> protocol changes, but will involve sending quite a bit of OIDs around.

Yes, prefetching will require graph walking with large OID requests but
will not require protocol changes, as you say. I'm not too worried about
the large numbers of OIDs - Git servers already have to support
relatively large numbers of OIDs to support the bulk prefetch we do
during things like checkout and diff.