git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jonathan Tan <jonathantanmy@google.com>
To: alexandr.miloslavskiy@syntevo.com
Cc: git@vger.kernel.org, christian.couder@gmail.com,
	jonathantanmy@google.com, marc.strapetz@syntevo.com,
	me@ttaylorr.com
Subject: Re: Questions about partial clone with '--filter=tree:0'
Date: Mon, 26 Oct 2020 11:24:17 -0700	[thread overview]
Message-ID: <20201026182417.2105954-1-jonathantanmy@google.com> (raw)
In-Reply-To: <aa7b89ee-08aa-7943-6a00-28dcf344426e@syntevo.com>

> (1) Is it even considered a realistic use case?
> -----------------------------------------------
> Summary: is '--filter=tree:0' a realistic or "crazy" scenario that is
> not considered worthy of supporting?
> 
> I decided to use Linux repo, which is reasonably large, and it seems
> that '--filter=tree:0' could be desired because it helps with disk
> space (~0.66gb) and network (~0.54gb):

Sorry for the late reply - I have been out of office for a while.

As Taylor said in another email, it's good for some use cases but
perhaps not for the "blame" one that you describe later.

> (2) A command to enrich repo with trees
> ---------------------------------------
> There is no good way to "un-partial" repository that was cloned with
> '--filter=tree:0' to have all trees, but no blobs.
> 
> There seems to be a dirty way of doing that by abusing 'fetch --deepen'
> which happens to skip "ref tip already present locally" check, but
> it will also re-download all commits, which means extra ~0.5gb network
> in case of Linux repo.

That's true. I made some progress with cbe566a071 ("negotiator/noop: add
noop fetch negotiator", 2020-08-18) (which adds a no-op negotiatior, so
the client never reports its own commits as "have") but as you said in
another email, we still run into the problem that if we have the commit
that we're fetching, we still won't fetch it.

> (3) A command to download ALL trees and/or blobs for a subpath
> -----------------------------------------------
> Summary: Running a Blame or file log in '--filter=tree:0' repo is
> currently very inefficient, up to a point where it can be discussed
> as not really working.
> 
> The suggested command will be able to accept a path and download ALL
> trees and/or blobs that match it.
> 
> This will solve many problems at once:
> * Solve (2)
> * Make it possible to prepare for efficient blame and file log
> * Make a new experience with super-mono-repos, where user will now
>    be able to only download a part of it by path.

To clarify: we partially support the last point - "git clone" now
supports "--sparse". When used with "--filter", only the blobs in the
sparse checkout specification will be fetched, so users are already able
to download only the objects in a specific path. Having said that, I
think you also want the histories of these objects, so admittedly this
is not complete for your use case.

> Currently '--filter=sparse:oid' is there to support that, but it is
> very hard to use on client side, because it requires paths to be
> already present in a commit on server.
> 
> For a possible solution, it sounds reasonable to have such filter:
>    --filter=sparse:pathlist=/1/2'
> Path list could be delimited with some special character, and paths
> themselves could be escaped.

Having such an option (and teaching "blame" to use it to prefetch) would
indeed speed up "blame". But if we implement this, what would happen if
the user ran "blame" on the same file twice? I can't think of a way of
preventing the same fetch from happening twice except by checking the
existence of, say, the last 10 OIDs corresponding to that path. But if
we have the list of those 10 OIDs, we could just prefetch those 10 OIDs
without needing a new filter.

Another issue (but a smaller one) is this does not fetch all objects
necessary if the file being "blame"d has been renamed, but that is
probably solvable - we can just refetch with the old name.

Another possible solution that has been discussed before (but a much
more involved one) is to teach Git to be able to serve results of
computations, and then have "blame" be able to stitch that with local
data. (For example, "blame" could check the history of a certain path to
find the commit(s) that the remote has information of, query the remote
for those commits, and then stitch the results together with local
history.) This scheme would work not only for "blame" but for things
like "grep" (with history) and "log -S", whereas
"--filter=sparse:parthlist" would only work with "blame". But
admittedly, this solution is more involved.

  parent reply	other threads:[~2020-10-26 18:24 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-20 17:09 Questions about partial clone with '--filter=tree:0' Alexandr Miloslavskiy
2020-10-20 22:29 ` Taylor Blau
2020-10-21 17:10   ` Alexandr Miloslavskiy
2020-10-21 17:31     ` Taylor Blau
2020-10-21 17:46       ` Alexandr Miloslavskiy
2020-10-26 18:24 ` Jonathan Tan [this message]
2020-10-26 18:44   ` Alexandr Miloslavskiy
2020-10-26 19:46     ` Jonathan Tan
2020-10-26 20:08       ` Alexandr Miloslavskiy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201026182417.2105954-1-jonathantanmy@google.com \
    --to=jonathantanmy@google.com \
    --cc=alexandr.miloslavskiy@syntevo.com \
    --cc=christian.couder@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=marc.strapetz@syntevo.com \
    --cc=me@ttaylorr.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).