Re: Questions about partial clone with '--filter=tree:0'

From: Jonathan Tan <jonathantanmy@google.com>
To: alexandr.miloslavskiy@syntevo.com
Cc: git@vger.kernel.org, christian.couder@gmail.com,
	jonathantanmy@google.com, marc.strapetz@syntevo.com,
	me@ttaylorr.com
Subject: Re: Questions about partial clone with '--filter=tree:0'
Date: Mon, 26 Oct 2020 11:24:17 -0700
Message-ID: <20201026182417.2105954-1-jonathantanmy@google.com>
In-Reply-To: <aa7b89ee-08aa-7943-6a00-28dcf344426e@syntevo.com>

> (1) Is it even considered a realistic use case?
> -----------------------------------------------
> Summary: is '--filter=tree:0' a realistic or "crazy" scenario that is
> not considered worthy of supporting?
> 
> I decided to use Linux repo, which is reasonably large, and it seems
> that '--filter=tree:0' could be desired because it helps with disk
> space (~0.66gb) and network (~0.54gb):

Sorry for the late reply - I have been out of office for a while.

As Taylor said in another email, it's good for some use cases but
perhaps not for the "blame" one that you describe later.

> (2) A command to enrich repo with trees
> ---------------------------------------
> There is no good way to "un-partial" repository that was cloned with
> '--filter=tree:0' to have all trees, but no blobs.
> 
> There seems to be a dirty way of doing that by abusing 'fetch --deepen'
> which happens to skip "ref tip already present locally" check, but
> it will also re-download all commits, which means extra ~0.5gb network
> in case of Linux repo.

That's true. I made some progress with cbe566a071 ("negotiator/noop: add
noop fetch negotiator", 2020-08-18), which adds a no-op negotiator so
that the client never reports its own commits as "have". But as you said
in another email, we still run into the problem that if we already have
the commit that we're fetching, we still won't fetch it.
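
For reference, here is a minimal sketch of how the no-op negotiator can
be selected. It assumes a Git new enough to include the commit above;
the scratch repository and the remote name "origin" in the comment are
placeholders for illustration, not taken from a real session.

```shell
# Scratch repository so the commands can run anywhere (placeholder path).
repo=$(mktemp -d)
git init -q "$repo"

# Select the no-op negotiator for every fetch from this repository.
git -C "$repo" config fetch.negotiationAlgorithm noop
algo=$(git -C "$repo" config --get fetch.negotiationAlgorithm)
echo "negotiator: $algo"

# One-off form for a single fetch ("origin" is a placeholder remote):
#   git -c fetch.negotiationAlgorithm=noop fetch origin
```

This only changes what the client advertises as "have"; by itself it
does not get around the already-present ref tip problem.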

> (3) A command to download ALL trees and/or blobs for a subpath
> -----------------------------------------------
> Summary: Running a Blame or file log in '--filter=tree:0' repo is
> currently very inefficient, up to a point where it can be discussed
> as not really working.
> 
> The suggested command will be able to accept a path and download ALL
> trees and/or blobs that match it.
> 
> This will solve many problems at once:
> * Solve (2)
> * Make it possible to prepare for efficient blame and file log
> * Make a new experience with super-mono-repos, where user will now
>    be able to only download a part of it by path.

To clarify: we partially support the last point - "git clone" now
supports "--sparse". When used with "--filter", only the blobs in the
sparse checkout specification will be fetched, so users are already able
to download only the objects in a specific path. Having said that, I
think you also want the histories of these objects, so admittedly this
is not complete for your use case.
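
As a rough sketch of that combination (a throwaway local "server"
repository stands in for a real remote; every path and name here is a
placeholder, and it assumes a Git that has the "sparse-checkout"
command):

```shell
# Throwaway "server" repository with two directories (names are placeholders).
tmp=$(mktemp -d)
git init -q "$tmp/server"
mkdir "$tmp/server/a" "$tmp/server/b"
echo A >"$tmp/server/a/file"
echo B >"$tmp/server/b/file"
git -C "$tmp/server" add .
git -C "$tmp/server" -c user.name=demo -c user.email=demo@example.com \
	commit -q -m initial
# Let the server honor --filter and on-demand object requests.
git -C "$tmp/server" config uploadpack.allowFilter true
git -C "$tmp/server" config uploadpack.allowAnySHA1InWant true

# Partial, sparse clone: no blobs up front, only top-level files checked out.
git clone -q --filter=blob:none --sparse "file://$tmp/server" "$tmp/client"
# Widen the sparse checkout to directory "a"; its blob is fetched on demand.
git -C "$tmp/client" sparse-checkout set a

# Count objects still missing locally ("?" prefix) without triggering a fetch.
missing=$(git -C "$tmp/client" rev-list --objects --missing=print HEAD | grep -c '^?')
echo "missing objects: $missing"
```

Only the blob needed for the "a" checkout is transferred. Older versions
of files under "a" would still be fetched lazily, one request at a time,
which is why this does not cover the history use case.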

> Currently '--filter=sparse:oid' is there to support that, but it is
> very hard to use on client side, because it requires paths to be
> already present in a commit on server.
> 
> For a possible solution, it sounds reasonable to have such filter:
>    --filter=sparse:pathlist=/1/2'
> Path list could be delimited with some special character, and paths
> themselves could be escaped.

Having such an option (and teaching "blame" to use it to prefetch) would
indeed speed up "blame". But if we implement this, what would happen if
the user ran "blame" on the same file twice? I can't think of a way of
preventing the same fetch from happening twice except by checking the
existence of, say, the last 10 OIDs corresponding to that path. But if
we have the list of those 10 OIDs, we could just prefetch those 10 OIDs
without needing a new filter.

Another issue (but a smaller one) is that this does not fetch all the
objects necessary if the file being "blame"d has been renamed, but that
is probably solvable - we can just refetch with the old name.

Another possible solution that has been discussed before (but a much
more involved one) is to teach Git to serve the results of computations,
and then have "blame" stitch those results together with local data.
(For example, "blame" could check the history of a certain path to
find the commit(s) that the remote has information about, query the
remote for those commits, and then stitch the results together with
local history.) This scheme would work not only for "blame" but also for
things like "grep" (with history) and "log -S", whereas
"--filter=sparse:pathlist" would only work with "blame". But
admittedly, this solution is more involved.