From: Alexandr Miloslavskiy <alexandr.miloslavskiy@syntevo.com>
To: Jonathan Tan <jonathantanmy@google.com>
Cc: git@vger.kernel.org, christian.couder@gmail.com,
marc.strapetz@syntevo.com, me@ttaylorr.com
Subject: Re: Questions about partial clone with '--filter=tree:0'
Date: Mon, 26 Oct 2020 19:44:27 +0100 [thread overview]
Message-ID: <2f04c074-3eee-766c-bedb-2e3cc0a91528@syntevo.com> (raw)
In-Reply-To: <20201026182417.2105954-1-jonathantanmy@google.com>
On 26.10.2020 19:24, Jonathan Tan wrote:
> Sorry for the late reply - I have been out of office for a while.
I'm quite happy to get replies at all, even if late. Thanks!
> As Taylor said in another email, it's good for some use cases but
> perhaps not for the "blame" one that you describe later.
OK, so our expectations seem to match your expectations, that's good.
> That's true. I made some progress with cbe566a071 ("negotiator/noop: add
> noop fetch negotiator", 2020-08-18) (which adds a no-op negotiator, so
> the client never reports its own commits as "have") but as you said in
> another email, we still run into the problem that if we have the commit
> that we're fetching, we still won't fetch it.
Right, I already discovered 'fetch.negotiationAlgorithm=noop' and gave
it a quick try, but it didn't seem to help at all.
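(For readers following along: the negotiator can be selected per-invocation without touching the config file; the remote name below is a placeholder:)

```shell
# Select the no-op negotiator for a single fetch (Git 2.29+);
# the client then reports no "have" lines to the server.
git -c fetch.negotiationAlgorithm=noop fetch origin
```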
> To clarify: we partially support the last point - "git clone" now
> supports "--sparse". When used with "--filter", only the blobs in the
> sparse checkout specification will be fetched, so users are already able
> to download only the objects in a specific path.
I see. Still, it seems the other two problems would be solved.
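(For context, the clone invocation being described would look roughly like this; the URL and path are placeholders:)

```shell
# Partial clone that skips all blobs, combined with a sparse checkout;
# only blobs under src/ are fetched when the worktree is populated.
git clone --filter=blob:none --sparse https://example.com/repo.git
cd repo
git sparse-checkout set src/
```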
> Having said that, I
> think you also want the histories of these objects, so admittedly this
> is not complete for your use case.
Right.
> Having such an option (and teaching "blame" to use it to prefetch) would
> indeed speed up "blame". But if we implement this, what would happen if
> the user ran "blame" on the same file twice? I can't think of a way of
> preventing the same fetch from happening twice except by checking the
> existence of, say, the last 10 OIDs corresponding to that path. But if
> we have the list of those 10 OIDs, we could just prefetch those 10 OIDs
> without needing a new filter.
I must admit that I didn't notice this problem. Still, it seems easy
enough to solve with this approach:
1) Estimate the number of missing objects.
2) If "many", just download everything for <path> as described before
and consider it done.
3) If "not so many", assemble a list of OIDs on the boundary of the
unknown (for example, all root tree OIDs for commits that are missing
any trees) and use the usual fetch to download all of them in one go.
4) Repeat step 3 until nothing is missing. Only N=<maximum tree depth>
requests are needed, regardless of the number of commits.
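The loop in steps 3)-4) can be sketched as a toy simulation. The in-memory dicts and the fetch callback below are stand-ins for the real object store and pack negotiation, not Git APIs:

```python
# Toy model of the batched "boundary OID" prefetch: trees are fetched
# one whole depth level per network round-trip, so the number of
# requests is bounded by the maximum tree depth, not the commit count.

def prefetch_missing_trees(roots, remote_trees, local_trees, fetch):
    """Fetch all trees reachable from `roots`, one batch per depth level.

    remote_trees / local_trees map a tree OID to its child tree OIDs;
    fetch(oids) stands in for a single batched fetch request.
    Returns the number of round-trips performed.
    """
    frontier = [oid for oid in roots if oid not in local_trees]
    rounds = 0
    while frontier:
        fetch(frontier)  # one round-trip for the whole boundary
        for oid in frontier:
            local_trees[oid] = remote_trees[oid]
        # Next boundary: children we still don't have locally.
        frontier = sorted({child
                           for oid in frontier
                           for child in remote_trees[oid]
                           if child not in local_trees})
        rounds += 1
    return rounds
```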
> Another issue (but a smaller one) is this does not fetch all objects
> necessary if the file being "blame"d has been renamed, but that is
> probably solvable - we can just refetch with the old name.
Right, we also discussed this and figured we'd just query more things
as needed, perhaps including individual blobs for rename detection.
> Another possible solution that has been discussed before (but a much
> more involved one) is to teach Git to be able to serve results of
> computations, and then have "blame" be able to stitch that with local
> data. (For example, "blame" could check the history of a certain path to
> find the commit(s) that the remote has information of, query the remote
> for those commits, and then stitch the results together with local
> history.) This scheme would work not only for "blame" but for things
> like "grep" (with history) and "log -S", whereas
> "--filter=sparse:pathlist" would only work with "blame". But
> admittedly, this solution is more involved.
I understand that you're basically talking about implementing
prefetching in git itself? To my understanding, this will still need
either the command I suggested, or graph walking with batched OID
requests as described in steps 1)-4) above. The latter will not require
protocol changes, but will involve sending quite a few OIDs around.
Thread overview: 9+ messages
2020-10-20 17:09 Questions about partial clone with '--filter=tree:0' Alexandr Miloslavskiy
2020-10-20 22:29 ` Taylor Blau
2020-10-21 17:10 ` Alexandr Miloslavskiy
2020-10-21 17:31 ` Taylor Blau
2020-10-21 17:46 ` Alexandr Miloslavskiy
2020-10-26 18:24 ` Jonathan Tan
2020-10-26 18:44 ` Alexandr Miloslavskiy [this message]
2020-10-26 19:46 ` Jonathan Tan
2020-10-26 20:08 ` Alexandr Miloslavskiy