From: Alexandr Miloslavskiy <alexandr.miloslavskiy@syntevo.com>
To: Jonathan Tan <jonathantanmy@google.com>
Cc: git@vger.kernel.org, christian.couder@gmail.com,
	marc.strapetz@syntevo.com, me@ttaylorr.com
Subject: Re: Questions about partial clone with '--filter=tree:0'
Date: Mon, 26 Oct 2020 19:44:27 +0100
Message-ID: <2f04c074-3eee-766c-bedb-2e3cc0a91528@syntevo.com>
In-Reply-To: <20201026182417.2105954-1-jonathantanmy@google.com>

On 26.10.2020 19:24, Jonathan Tan wrote:
> Sorry for the late reply - I have been out of office for a while.

I'm quite happy to get replies at all, even if late. Thanks!

> As Taylor said in another email, it's good for some use cases but
> perhaps not for the "blame" one that you describe later.

OK, so our expectations seem to match yours; that's good.

> That's true. I made some progress with cbe566a071 ("negotiator/noop: add
> noop fetch negotiator", 2020-08-18) (which adds a no-op negotiatior, so
> the client never reports its own commits as "have") but as you said in
> another email, we still run into the problem that if we have the commit
> that we're fetching, we still won't fetch it.

Right, I already discovered 'fetch.negotiationAlgorithm=noop' and gave 
it a quick try, but it didn't seem to help at all.
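
For reference, the knob is exercised with something along these lines
(just an illustration, not my exact test):

    # ask the noop negotiator to send no "have" lines at all
    git -c fetch.negotiationAlgorithm=noop fetch origin

Even with no "have"s sent, fetch still skips anything the client
believes it already has, which is exactly the problem you mention.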

> To clarify: we partially support the last point - "git clone" now
> supports "--sparse". When used with "--filter", only the blobs in the
> sparse checkout specification will be fetched, so users are already able
> to download only the objects in a specific path.

I see. Still, it seems that this will solve the two other problems.

> Having said that, I
> think you also want the histories of these objects, so admittedly this
> is not complete for your use case.

Right.
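
For anyone following along, my understanding is that the combination
looks roughly like this (a sketch of the --sparse behaviour as I
understand it, with a placeholder URL and path):

    # partial clone without trees, sparse checkout limited to one path
    git clone --filter=tree:0 --sparse https://example.com/repo.git
    cd repo
    git sparse-checkout set some/path

This fetches the objects needed to check out some/path, but not their
history, so running "blame" there still triggers one lazy fetch per
historical tree or blob.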

> Having such an option (and teaching "blame" to use it to prefetch) would
> indeed speed up "blame". But if we implement this, what would happen if
> the user ran "blame" on the same file twice? I can't think of a way of
> preventing the same fetch from happening twice except by checking the
> existence of, say, the last 10 OIDs corresponding to that path. But if
> we have the list of those 10 OIDs, we could just prefetch those 10 OIDs
> without needing a new filter.

I must admit that I didn't notice this problem. Still, it seems easy
enough to solve with the following approach (sketched in shell after
the list):

1) Estimate the number of missing objects.
2) If "many", just download everything for <path> as described before
    and consider it done.
3) If "not so many", assemble a list of OIDs on the boundary of the
    unknown (for example, all root tree OIDs for commits that are
    missing any trees) and use a regular fetch to download all of
    those OIDs in one go.
4) Repeat step 3 until nothing is missing; at most N=<maximum tree
    depth> requests are needed, regardless of the number of commits.
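
To make steps 3)-4) a bit more concrete, here is a rough shell sketch
(hypothetical and untested; it stands in for what internal prefetch
machinery would do, skips the "is it actually missing?" check, and
assumes the promisor remote "origin" allows fetching arbitrary OIDs,
i.e. uploadpack.allowAnySHA1InWant):

    remote=origin

    # Level 0: root tree OIDs of all interesting commits. Reading %T
    # only needs the commit objects, which a tree:0 clone already has.
    git log --all --format='%T' | sort -u >level

    while [ -s level ]; do
        # One batched request per level instead of one lazy fetch
        # per missing tree.
        xargs git fetch "$remote" <level

        # Next level: subtree OIDs of the trees we just obtained.
        >next
        while read -r tree; do
            git ls-tree "$tree" | awk '$2 == "tree" {print $3}' >>next
        done <level
        sort -u next >level
    done

The loop runs at most <maximum tree depth> times; the "many vs. not so
many" estimate from steps 1)-2) is left out of the sketch.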

> Another issue (but a smaller one) is this does not fetch all objects
> necessary if the file being "blame"d has been renamed, but that is
> probably solvable - we can just refetch with the old name.

Right, we also discussed this and figured that we'd just query more
objects as needed, perhaps also the individual blobs needed for
rename detection.

> Another possible solution that has been discussed before (but a much
> more involved one) is to teach Git to be able to serve results of
> computations, and then have "blame" be able to stitch that with local
> data. (For example, "blame" could check the history of a certain path to
> find the commit(s) that the remote has information of, query the remote
> for those commits, and then stitch the results together with local
> history.) This scheme would work not only for "blame" but for things
> like "grep" (with history) and "log -S", whereas
> "--filter=sparse:parthlist" would only work with "blame". But
> admittedly, this solution is more involved.

I understand that you're basically talking about implementing
prefetching in git itself? To my understanding, this would still
require either the command I suggested, or implementing graph walking
with massive OID requests as described in steps 1)-4) above. The
latter would not require protocol changes, but it would involve
sending quite a few OIDs around.
