From: Alexandr Miloslavskiy <alexandr.miloslavskiy@syntevo.com>
To: Jonathan Tan <jonathantanmy@google.com>
Cc: git@vger.kernel.org, christian.couder@gmail.com,
	marc.strapetz@syntevo.com, me@ttaylorr.com
Subject: Re: Questions about partial clone with '--filter=tree:0'
Date: Mon, 26 Oct 2020 19:44:27 +0100
Message-ID: <2f04c074-3eee-766c-bedb-2e3cc0a91528@syntevo.com>
In-Reply-To: <20201026182417.2105954-1-jonathantanmy@google.com>

On 26.10.2020 19:24, Jonathan Tan wrote:
> Sorry for the late reply - I have been out of office for a while.

I'm quite happy to get replies at all, even if late. Thanks!

> As Taylor said in another email, it's good for some use cases but
> perhaps not for the "blame" one that you describe later.

OK, so our expectations seem to match yours; that's good.

> That's true. I made some progress with cbe566a071 ("negotiator/noop: add
> noop fetch negotiator", 2020-08-18) (which adds a no-op negotiatior, so
> the client never reports its own commits as "have") but as you said in
> another email, we still run into the problem that if we have the commit
> that we're fetching, we still won't fetch it.

Right, I already discovered 'fetch.negotiationAlgorithm=noop' and gave 
it a quick try, but it didn't seem to help at all.
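
For reference, the knob is exercised with something along these lines
(just an illustration, not my exact test):

    # ask the noop negotiator to send no "have" lines at all
    git -c fetch.negotiationAlgorithm=noop fetch origin

Even with no "have"s sent, fetch still skips anything the client
believes it already has, which is exactly the problem you mention.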

> To clarify: we partially support the last point - "git clone" now
> supports "--sparse". When used with "--filter", only the blobs in the
> sparse checkout specification will be fetched, so users are already able
> to download only the objects in a specific path.

I see. Still, it seems that this will solve the two other problems.

> Having said that, I
> think you also want the histories of these objects, so admittedly this
> is not complete for your use case.

Right.
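
For anyone following along, my understanding is that the combination
looks roughly like this (a sketch of the --sparse behaviour as I
understand it, with a placeholder URL and path):

    # partial clone without trees, sparse checkout limited to one path
    git clone --filter=tree:0 --sparse https://example.com/repo.git
    cd repo
    git sparse-checkout set some/path

This fetches the objects needed to check out some/path, but not their
history, so running "blame" there still triggers one lazy fetch per
historical tree or blob.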

> Having such an option (and teaching "blame" to use it to prefetch) would
> indeed speed up "blame". But if we implement this, what would happen if
> the user ran "blame" on the same file twice? I can't think of a way of
> preventing the same fetch from happening twice except by checking the
> existence of, say, the last 10 OIDs corresponding to that path. But if
> we have the list of those 10 OIDs, we could just prefetch those 10 OIDs
> without needing a new filter.

I must admit that I didn't notice this problem. Still, it seems easy
enough to solve with the following approach (sketched in shell after
the list):

1) Estimate the number of missing objects.
2) If "many", just download everything for <path> as described before
    and consider it done.
3) If "not so many", assemble a list of OIDs on the boundary of the
    unknown (for example, all root tree OIDs for commits that are
    missing any trees) and use a regular fetch to download all of
    those OIDs in one go.
4) Repeat step 3 until nothing is missing; at most N=<maximum tree
    depth> requests are needed, regardless of the number of commits.
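
To make steps 3)-4) a bit more concrete, here is a rough shell sketch
(hypothetical and untested; it stands in for what internal prefetch
machinery would do, skips the "is it actually missing?" check, and
assumes the promisor remote "origin" allows fetching arbitrary OIDs,
i.e. uploadpack.allowAnySHA1InWant):

    remote=origin

    # Level 0: root tree OIDs of all interesting commits. Reading %T
    # only needs the commit objects, which a tree:0 clone already has.
    git log --all --format='%T' | sort -u >level

    while [ -s level ]; do
        # One batched request per level instead of one lazy fetch
        # per missing tree.
        xargs git fetch "$remote" <level

        # Next level: subtree OIDs of the trees we just obtained.
        >next
        while read -r tree; do
            git ls-tree "$tree" | awk '$2 == "tree" {print $3}' >>next
        done <level
        sort -u next >level
    done

The loop runs at most <maximum tree depth> times; the "many vs. not so
many" estimate from steps 1)-2) is left out of the sketch.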

> Another issue (but a smaller one) is this does not fetch all objects
> necessary if the file being "blame"d has been renamed, but that is
> probably solvable - we can just refetch with the old name.

Right, we also discussed this and figured that we'd just query more
objects as needed, perhaps also the individual blobs needed for
rename detection.

> Another possible solution that has been discussed before (but a much
> more involved one) is to teach Git to be able to serve results of
> computations, and then have "blame" be able to stitch that with local
> data. (For example, "blame" could check the history of a certain path to
> find the commit(s) that the remote has information of, query the remote
> for those commits, and then stitch the results together with local
> history.) This scheme would work not only for "blame" but for things
> like "grep" (with history) and "log -S", whereas
> "--filter=sparse:parthlist" would only work with "blame". But
> admittedly, this solution is more involved.

I understand that you're basically talking about implementing
prefetching in git itself? To my understanding, this would still
require either the command I suggested, or implementing graph walking
with massive OID requests as described in steps 1)-4) above. The
latter would not require protocol changes, but it would involve
sending quite a few OIDs around.
