git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jonathan Tan <jonathantanmy@google.com>
To: alexandr.miloslavskiy@syntevo.com
Cc: jonathantanmy@google.com, git@vger.kernel.org,
	christian.couder@gmail.com, marc.strapetz@syntevo.com,
	me@ttaylorr.com
Subject: Re: Questions about partial clone with '--filter=tree:0'
Date: Mon, 26 Oct 2020 12:46:35 -0700	[thread overview]
Message-ID: <20201026194635.2119420-1-jonathantanmy@google.com> (raw)
In-Reply-To: <2f04c074-3eee-766c-bedb-2e3cc0a91528@syntevo.com>

> > Having such an option (and teaching "blame" to use it to prefetch) would
> > indeed speed up "blame". But if we implement this, what would happen if
> > the user ran "blame" on the same file twice? I can't think of a way of
> > preventing the same fetch from happening twice except by checking the
> > existence of, say, the last 10 OIDs corresponding to that path. But if
> > we have the list of those 10 OIDs, we could just prefetch those 10 OIDs
> > without needing a new filter.
> 
> I must admit that I didn't notice this problem. Still, it seems easy 
> enough to solve with this approach:
> 
> 1) Estimate number of missing things
> 2) If "many", just download everything for <path> as described before
>     and consider it done.
> 3) If "not so many", assemble a list of OIDs on the boundary of unknown
>     (for example, all root tree OIDs for commits that are missing any
>     trees) and use the usual fetch to download all OIDs in one go.
> 4) Repeat step 3 multiple times. Only N=<maximum tree depth> requests
>     are needed, regardless of the number of commits.

My point was that if you can estimate it ("have the list of those 10
OIDs"), then you can just fetch it. This does send "quite a bit of
OIDs", as you said below - I'll address it below.

> > Another possible solution that has been discussed before (but a much
> > more involved one) is to teach Git to be able to serve results of
> > computations, and then have "blame" be able to stitch that with local
> > data. (For example, "blame" could check the history of a certain path to
> > find the commit(s) that the remote has information of, query the remote
> > for those commits, and then stitch the results together with local
> > history.) This scheme would work not only for "blame" but for things
> > like "grep" (with history) and "log -S", whereas
> > "--filter=sparse:parthlist" would only work with "blame". But
> > admittedly, this solution is more involved.
> 
> I understand that you're basically talking about implementing 
> prefetching in git itself?

No - I did talk about prefetching earlier, but here I mean having Git on
the server perform the "blame" computation itself.

For example, let's say I want to run "blame" on foo.txt at HEAD. HEAD
and HEAD^ are commits that only the local client has, whereas HEAD^^ was
fetched from the remote. By comparing HEAD, HEAD^, and HEAD^^, Git knows
which lines come from HEAD and HEAD^. For the rest, Git would make a
request to the server, passing the commit ID and the path, and would get
back a list of line numbers and commits.

> To my understanding, this will still need 
> either the command I suggested, or implement graph walking with massive 
> OID requests as described above in 1)2)3)4). The latter will not require 
> protocol changes, but will involve sending quite a bit of OIDs around.

Yes, prefetching will require graph walking with large OID requests but
will not require protocol changes, as you say. I'm not too worried about
the large numbers of OIDs - Git servers already have to support
relatively large numbers of OIDs to support the bulk prefetch we do
during things like checkout and diff.

  reply	other threads:[~2020-10-26 19:46 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-20 17:09 Alexandr Miloslavskiy
2020-10-20 22:29 ` Taylor Blau
2020-10-21 17:10   ` Alexandr Miloslavskiy
2020-10-21 17:31     ` Taylor Blau
2020-10-21 17:46       ` Alexandr Miloslavskiy
2020-10-26 18:24 ` Jonathan Tan
2020-10-26 18:44   ` Alexandr Miloslavskiy
2020-10-26 19:46     ` Jonathan Tan [this message]
2020-10-26 20:08       ` Alexandr Miloslavskiy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201026194635.2119420-1-jonathantanmy@google.com \
    --to=jonathantanmy@google.com \
    --cc=alexandr.miloslavskiy@syntevo.com \
    --cc=christian.couder@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=marc.strapetz@syntevo.com \
    --cc=me@ttaylorr.com \
    --subject='Re: Questions about partial clone with '\''--filter=tree:0'\''' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).