* Optimizing for partial clone with '--filter=tree:0'
@ 2020-10-05 16:38 Alexandr Miloslavskiy
From: Alexandr Miloslavskiy @ 2020-10-05 16:38 UTC (permalink / raw)
  To: git

We are implementing a git UI. One interesting case is a repository
cloned with '--filter=tree:0', because such a clone makes it a lot
harder to run basic git operations such as file log and blame.

Eventually we arrived at a number of problems. We should be able to
contribute patches, at least for (2) and (4), if this is deemed
worthwhile and the plan is clear enough. Note that the optimal patches
(as we see them) will involve a protocol change.

(1) Is it even considered a realistic use case?
As an example of a reasonably large repo, I cloned the Linux repository
(951'025 commits) with various filters and got these stats:
   git clone --bare <url>
	7'624'042 objects
	   2.86 GB network
	   3.10 GB disk
   git clone --bare --filter=blob:none <url>
	5'484'714 (71.9%) objects
	   1.01 GB (35.3%) network
	   1.16 GB (37.4%) disk
   git clone --bare --filter=tree:0 <url>
	  951'693 (12.5%) objects
	   0.47 GB (16.4%) network
	   0.50 GB (16.1%) disk
   git clone --bare --depth 1 --branch master <url>
	   74'380 ( 0.9%) objects
	   0.19 GB ( 6.6%) network
	   0.19 GB ( 6.1%) disk
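For reference, figures like the above can be gathered with stock git
commands. A minimal sketch ('<url>' and the directory name are
placeholders):

```shell
# Clone with the filter under test ('<url>' is a placeholder).
git clone --bare --filter=tree:0 <url> linux-tree0.git
cd linux-tree0.git

# Object count: 'count' is loose objects, 'in-pack' is packed objects;
# their sum gives totals like those quoted above.
git count-objects -v

# Disk usage of the resulting repository.
du -sh .
```

Network usage can be read off the progress output of the clone itself
("Receiving objects: ..., X MiB").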

My conclusion is that '--filter=tree:0' could be desirable, because it
saves a substantial amount of disk space and network traffic.

(2) A command to enrich repo with trees
Since all existing filters include commit objects, it doesn't seem
possible to append just the trees to a repository that already has the
commits. It does, however, seem possible to download trees+commits
like this:

   git -c remote.origin.partialclonefilter=blob:none \
       fetch --deepen=999999 origin

   Here, '--deepen' is a dirty hack to convince git to re-download
   commits that are already present locally (without trees though).

   Here, '-c' is a workaround for the problem where 'git fetch'
   overwrites filter in config. This problem is probably solved in
   cooking topic: 'fetch: do not override partial clone filter'.

However, according to the figures in (1), re-downloading the commits
costs roughly as much as the original 'clone --filter=tree:0', i.e.
about 0.5 GB extra for the Linux repo. It would be nice to avoid that
by having a filter that means "trees only, please".

It would also be nice to get rid of the '--deepen' hack.
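Either way, it is possible to check whether such a backfill actually
worked: 'git rev-list --missing=print' walks the reachable objects
without triggering lazy fetches and prints the still-missing ones with
a leading '?'. A sketch:

```shell
# After the backfill fetch, list objects reachable from HEAD that are
# still absent locally; missing objects get a leading '?'. After a
# successful blob:none backfill, only blobs should remain missing.
git rev-list --objects --missing=print HEAD | grep '^?'
```

(If nothing is missing at all, grep prints nothing and exits non-zero.)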

(3) Properly supporting 'git blame' and 'git log -- path'
Currently, the promisor mechanism downloads missing objects one at a
time, which is very slow. For example, 'git blame' will download the
trees commit by commit as it processes each one. See (4) for a
possible solution.
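The per-object round-trips are easy to observe with git's packet
tracing (a sketch; 'some/file.c' is a placeholder for a blamed path):

```shell
# In a '--filter=tree:0' clone, every lazy fetch from the promisor
# remote shows up in the packet trace as 'want <oid>' lines; counting
# them gives a feel for how chatty 'git blame' is.
GIT_TRACE_PACKET=1 git blame some/file.c 2>&1 | grep -c 'want '
```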

(4) Command to download ALL trees for a subpath
E.g. for blamed path '/1/2/3/4.txt', only the parent trees would be
downloaded - the root tree and the trees for '/1', '/1/2' and
'/1/2/3', across all commits.

Such a minimal approach should fall in line with the user's intention
in using '--filter=tree:0' - the user obviously wanted to minimize
something, be that disk or network usage. It doesn't sound nice if the
first 'git blame' reverts the repo to one with all trees, as if it had
been cloned with '--filter=blob:none'.
Currently, '--filter=sparse:oid' is there to support that, but it is
very hard to use from the client side, because it requires the path
list to already be present in a commit on the server.

For a possible solution, it sounds reasonable to have a filter that
accepts the desired path list directly from the client: the path list
could be delimited with some special character, and the paths
themselves could be escaped.

On top of helping with 'git blame' and 'git log', this feature should
help a lot with sparse clones of large mono-repos, such as Google's.
