git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: Taylor Blau <me@ttaylorr.com>,
	git@vger.kernel.org, Derrick Stolee <derrickstolee@github.com>
Subject: Re: [PATCH v3] builtin/pack-objects.c: introduce `pack.extraCruftTips`
Date: Fri, 5 May 2023 17:39:46 -0400	[thread overview]
Message-ID: <20230505213946.GC3321533@coredump.intra.peff.net> (raw)
In-Reply-To: <xmqqv8h9m2az.fsf@gitster.g>

On Wed, May 03, 2023 at 04:18:44PM -0700, Junio C Hamano wrote:

> > +	When generating a cruft pack, use the shell to execute the
> > +	specified command(s), and interpret their output as additional
> > +	tips of objects to keep in the cruft pack, regardless of their
> 
> What is a "tip of an object"?  The first byte ;-)?
> 
> A "tip of history" would only imply commit objects, but presumably
> you would want to specify a tree and protect all the blobs and trees
> it recursively contains, so that is not a good name for it.

"tips of the object graph" perhaps?

> > +		if (parse_oid_hex(buf.buf, &oid, &rest) || *rest) {
> > +			ret = error(_("invalid extra cruft tip: '%s'"), buf.buf);
> > +			goto done;
> > +		}
> > +
> > +		type = oid_object_info(the_repository, &oid, NULL);
> > +		if (type < 0)
> > +			continue;
> > +
> > +		obj = lookup_object_by_type(the_repository, &oid, type);
> > +		if (!obj)
> > +			continue;
> 
> Hmph, we may want to have an interface that lets us avoid looking up
> the same oid twice in the same set of tables.  Given an object
> unseen so far, oid_object_info() should have done most of the work
> necessary for lookup_object_by_type() to get to and start parsing
> the data of the object in the good case (i.e. object exists and in a
> pack---just we haven't needed it yet), but in the above sequence
> there is not enough information passed between two calls to take
> advantage of it.

This code was my suggestion, but it may have actually been a bad
direction.

I don't think communicating between oid_object_info() and
lookup_object_by_type() is important. The latter is only doing a lookup
in the internal hash with lookup_object(), and then auto-vivifying using
the type if necessary (which we provide to it).

The bigger inefficiency is that we call oid_object_info() before seeing
if we have already instantiated an object struct via lookup_object().

Obviously we could do that first. But let's take a step back. My
original suggestion was thinking that we don't want to call
parse_object() because it's expensive, especially for a blob. But in the
long run, most of these objects (except blobs!) will end up parsed
anyway, because we are going to see which other objects they reach.

So it's OK to parse anything except blobs. And indeed, we have a better
tool for that these days:

  obj = parse_object_with_flags(r, oid, PARSE_OBJECT_SKIP_HASH_CHECK);

That does exactly what we want. If we already saw and parsed the object,
it's quick noop after a hash lookup. If we didn't, then it already has
optimizations to avoid reading object contents if possible (checking the
commit graph, checking the type for blobs).

Skipping the hash check might seem like a bad idea for a repack, but
it's what we already do for blobs found via traversing. A disk repack
uses the much cheaper pack idx crc for exactly this purpose: to avoid
expanding objects unnecessarily.

-Peff

  parent reply	other threads:[~2023-05-05 21:39 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-04-20 17:27 [PATCH] builtin/pack-objects.c: introduce `pack.extraCruftTips` Taylor Blau
2023-04-20 18:12 ` Junio C Hamano
2023-04-20 19:30   ` Taylor Blau
2023-04-20 19:52     ` Junio C Hamano
2023-04-20 20:48       ` Taylor Blau
2023-04-21  0:10 ` Chris Torek
2023-04-21  2:14   ` Taylor Blau
2023-04-25 19:42 ` Derrick Stolee
2023-04-25 21:25   ` Taylor Blau
2023-04-26 10:52     ` Derrick Stolee
2023-05-03  0:06       ` Taylor Blau
2023-05-03  0:09 ` [PATCH v2] " Taylor Blau
2023-05-03 14:01   ` Derrick Stolee
2023-05-03 19:59   ` Jeff King
2023-05-03 21:22     ` Taylor Blau
2023-05-05 21:23       ` Jeff King
2023-05-06  0:06         ` Taylor Blau
2023-05-06  0:14           ` Taylor Blau
2023-05-03 21:28     ` Taylor Blau
2023-05-05 21:26       ` Jeff King
2023-05-05 22:13         ` Jeff King
2023-05-06  0:13           ` Taylor Blau
2023-05-06  0:20             ` Taylor Blau
2023-05-06  2:12             ` Jeff King
2023-05-03 22:05 ` [PATCH v3] " Taylor Blau
2023-05-03 23:18   ` Junio C Hamano
2023-05-03 23:42     ` Junio C Hamano
2023-05-03 23:48       ` Taylor Blau
2023-05-03 23:50       ` Taylor Blau
2023-05-05 21:39     ` Jeff King [this message]
2023-05-05 22:19   ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230505213946.GC3321533@coredump.intra.peff.net \
    --to=peff@peff.net \
    --cc=derrickstolee@github.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=me@ttaylorr.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).