Re: [PATCH 1/1] builtin/pack-objects.c: avoid iterating all refs

From: Jeff King <peff@peff.net>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: Jacob Vosmaer <jacob@gitlab.com>,
	git@vger.kernel.org, Taylor Blau <me@ttaylorr.com>
Subject: Re: [PATCH 1/1] builtin/pack-objects.c: avoid iterating all refs
Date: Thu, 21 Jan 2021 10:34:26 -0500	[thread overview]
Message-ID: <YAmfAjZYDHoXBevr@coredump.intra.peff.net> (raw)
In-Reply-To: <874kja8u2i.fsf@evledraar.gmail.com>

On Thu, Jan 21, 2021 at 11:26:13AM +0100, Ævar Arnfjörð Bjarmason wrote:

> > So we are not just saving the client from sending a "want", but making a
> > second full connection to do so. That seems to be an optimization worth
> > continuing to have.
> 
> Correct. Suppose a history COMMIT(TAG) history like:
> 
>     mainline: A(X) -> B(Y) -> C
>     topic:    A(X) -> B(Y) -> D(Z)
> 
> You only want the "mainline" history and its tags in your fetch. Now a
> server will give you tags X & Y for free, ignoring Z.
> 
> The note in protocol-capabilities.txt supposes that you'll need to only
> get A/B/C in one fetch, then do another one where you see from the OIDs
> you have and advertised (peeled) OIDs for the tags and know you just
> need X/Y, not Z.

Right. That note basically indicates that what the server is doing is
"best effort". If it sends you B, it will also send you Y. But it is
ultimately up to the client to decide if they want or need a tag the
server didn't send and do the follow-up.

So include-tag is an optimization, but it works often enough that it
frequently saves the second fetch from happening at all.

> So we could also just fetch Z, then do our own walk on the client to see
> if it's reachable, and throw it away if not. Although I suppose that'll
> need a list of "bad" tags on the client so we don't repeat the process
> the next time around, hrm...

Not only that, but now you've fetched "D" that you never wanted. If it's
one commit, that's not so bad. But it could be pulling in arbitrary
amounts of reachable history that you don't want.

> > You seem to be against auto-following at all. And certainly I can see an
> > argument that it is not worth the trouble it causes. But it is the
> > default behavior, and I suspect many people are relying on it. Fetching
> > every tag indiscriminately is going to grab a bunch of extra unwanted
> > objects in some repos. An obvious case is any time "clone
> > --single-branch --depth" is used.
> 
> I wonder if the use-cases for --depth in the wild aren't 99.9%
> --depth=1, in which case the peeled commit on the main branch being
> advertised by a tag would alredy give you this information.

I'd have thought there was basically no use for anything but --depth=1,
but people seem really enamored with the notion of --depth=50. Many CI
systems use that for some reason.

--depth is just one case where you might not have all of history,
though. Even without it, --single-branch means you wouldn't want to get
the history of all of the other branches. They may only be a few commits
ahead of the main history, in which case fetching extra tags that point
to them might not be a big deal. But it really depends on how your repo
is shaped, what tags point to, etc.

Even without any options, your repo may have a disjoint chunk of history
mentioned by a tag (e.g., a tag pointing to a segment of history that is
meant to be grafted on, or even an archived version of history from
before filter-branch rewrite).

Perhaps if we were designing from scratch we might write those off as
unusual cases we don't need to care about. But I'd be very hesitant to
change the way tag following works now, after so many years. If we were
to make any change, I think one plausible one would be to have clone set
up refspecs to fetch all tags (and then avoid sending include-tag, since
we know we're not auto-following). That would let people still use
auto-follow when they wanted to, but make things simpler to reason
about. And after all, clone already fetches all of the tags.

But I haven't thought too hard about it.

> > Maybe I'm not quite understanding what you're proposing.
> 
> Not much really, just that looking deeper into this area might be a
> useful exercise. I.e. it's a case of server<->client cooperation where
> we offlod the work to one or the other based on an old design decision,
> maybe that trade-off is not optimal in most cases anymore.

I don't think it's very much work on the server, though. Sure, iterating
the tags is O(nr_tags). But we do that iteration lots of other places,
too (the advertisement unless the client asks for a very narrow prefix,
or upload-pack's ref-marking for reachability).

And saving us from sending unwanted objects is a potential win on the
server (even a single unlucky delta where we have to reconstruct and
re-deflate an object will likely blow a ref iteration out of the water).
Likewise, preventing the client from connecting a second time is a win
for the server, which doesn't have to spin up a new upload-pack
(likewise for the client of course; it might even have to prompt the
user for auth again!).

-Peff