From: Jeff King <peff@peff.net>
To: Taylor Blau <me@ttaylorr.com>
Cc: git@vger.kernel.org, gitster@pobox.com, derrickstolee@github.com
Subject: Re: [PATCH] midx.c: use `pack-objects --stdin-packs` when repacking
Date: Tue, 20 Sep 2022 15:28:37 -0400
Message-ID: <YyoUZb90HeJnOuAV@coredump.intra.peff.net>
In-Reply-To: <9195a9ecd11a19f2c7fb1c70136d2d13fa308010.1663639662.git.me@ttaylorr.com>

On Mon, Sep 19, 2022 at 10:08:35PM -0400, Taylor Blau wrote:

> This patch replaces the pre-`--stdin-packs` invocation (where each
> object is given to `pack-objects` one by one) with the more modern
> `--stdin-packs` option.
> 
> This allows us to avoid some CPU cycles serializing and deserializing
> every object ID in all of the packs we're aggregating. It also avoids us
> having to send a potentially large amount of data down to
> `pack-objects`.

Makes sense. Just playing devil's advocate for a moment: is there any
way that getting the list of packs could be worse? I'm thinking
particularly of a race condition where a pack goes away while we're
running, but if we had the actual object list, we could fall back to
finding it elsewhere.

I think that could only happen if we had two gc's running
simultaneously, which is something we try to avoid already. And the
worst case would be that one would say "oops, this pack went away" and
bail, and not any kind of corruption.

So I think it's fine, but just trying to talk through any unexpected
implications.
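
To make the difference in input concrete, here is a minimal sketch of
what the pack-based input might look like. It assumes the documented
`--stdin-packs` input format (one pack name per line, with a leading
`^` marking a pack whose objects are excluded). The helper name
format_stdin_packs() and its parameters are hypothetical and are not
taken from the patch.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not from midx.c): format the stdin payload for
 * `pack-objects --stdin-packs`. Each pack name goes on its own line;
 * a leading '^' marks a pack whose objects should be excluded.
 *
 * Returns the number of bytes written (not counting the trailing NUL),
 * or -1 if buf is too small to hold the whole payload.
 */
static int format_stdin_packs(char *buf, size_t len,
			      const char *const *names,
			      const unsigned char *include, size_t nr)
{
	size_t i, used = 0;

	for (i = 0; i < nr; i++) {
		int n = snprintf(buf + used, len - used, "%s%s\n",
				 include[i] ? "" : "^", names[i]);
		/* snprintf reports the length it wanted; detect truncation */
		if (n < 0 || (size_t)n >= len - used)
			return -1;
		used += (size_t)n;
	}
	return (int)used;
}
```

The payload grows with the number of packs rather than the number of
objects, which is where the savings described in the commit message
would come from.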

> But more importantly, it generates slightly higher quality (read: more
> tightly compressed) packs, because of the reachability traversal that
> `--stdin-packs` does after the fact in order to gather namehash values
> which seed the delta selection process.

I think we _could_ do that same traversal even in objects mode. Or do
--stdin-packs without it. If we were starting from scratch, it might be
nice for the two features to be orthogonal so we could evaluate the
changes independently. But I don't think it's worth going back and
trying to split them out now. Although...

> In practice, this seems to add a slight amount of overhead (on the order
> of a few seconds for git.git broken up into ~100 packs), in exchange for
> a modest reduction (on the order of ~3.5%) in the resulting pack size.

Hmm. I thought we'd have some code to reuse the cached name-hashes in
the .bitmap file, if one is present. But I don't see any such code in
the stdin-packs feature. I think for "repack --geometric" it doesn't
matter. There the "main" pack with the bitmap would also be excluded
from the rollup (unless we are rolling all-into-one, in which case we do
the full from-scratch repack with a traversal).

Is that true also of "multi-pack-index repack"? I guess it would depend
on how you invoke it. I admit I don't think I've ever used it myself,
since the new "repack --geometric --write-midx" approach matches my
mental model. I'm not sure when you'd actually run the "multi-pack-index
repack" command. But if you did it with --batch-size=0 (the default), I
think we'd end up traversing every object in history.
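
For anyone following along, the name-hash in question is roughly the
following. This is a sketch written from memory of pack_name_hash()
rather than copied from the tree, so treat the details as an
assumption; the function name name_hash_sketch() is made up for this
example.

```c
#include <ctype.h>
#include <stdint.h>

/*
 * Sketch of a path name-hash in the style pack-objects uses to seed
 * delta selection. It builds a sortable number dominated by the last
 * characters of the path, so that paths with the same ending (say,
 * everything ending in ".c") land near each other when sorted.
 */
static uint32_t name_hash_sketch(const char *name)
{
	uint32_t hash = 0;
	unsigned char c;

	if (!name)
		return 0;

	while ((c = (unsigned char)*name++) != 0) {
		if (isspace(c))
			continue; /* whitespace does not contribute */
		/* older characters decay; the newest takes the top byte */
		hash = (hash >> 2) + ((uint32_t)c << 24);
	}
	return hash;
}
```

An object fed to pack-objects without a path gets a hash of zero,
which is why the traversal (or a cached value) is needed to get
useful delta candidates.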

>  midx.c | 17 +++++++++--------
>  1 file changed, 9 insertions(+), 8 deletions(-)

The patch itself is mostly as expected, but I did have one question:

> @@ -2026,17 +2027,17 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
> 
>  	cmd_in = xfdopen(cmd.in, "w");
> 
> -	for (i = 0; i < m->num_objects; i++) {
> -		struct object_id oid;
> -		uint32_t pack_int_id = nth_midxed_pack_int_id(m, i);
> +	for (i = 0; i < m->num_packs; i++) {
> +		strbuf_reset(&scratch);

The old code went in object order within the midx. Is that order sorted
by sha1, or is it the pack pseudo-order? If the former, then the new
code will yield a different order of objects inside pack-objects (since
it is seeing the packs in order of our m->pack_names array). I don't
_think_ it matters, but I just wanted to double check.
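
A toy model of the ordering question, under the assumption that the
midx object table is sorted by object ID and that pack-objects
enumerates each listed pack in turn. The struct and both functions
are hypothetical stand-ins, not code from midx.c or pack-objects.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one midx entry. */
struct toy_entry {
	uint32_t oid;         /* stand-in for an object ID; array is sorted by this */
	uint32_t pack_int_id; /* index of the pack that holds the object */
};

/* Old shape: walk objects in midx order, emitting those in included packs. */
static size_t emit_by_object(const struct toy_entry *e, size_t nr,
			     const unsigned char *include_pack,
			     uint32_t *out)
{
	size_t i, n = 0;

	for (i = 0; i < nr; i++)
		if (include_pack[e[i].pack_int_id])
			out[n++] = e[i].oid;
	return n;
}

/* New shape: walk packs in index order; each pack's objects come out together. */
static size_t emit_by_pack(const struct toy_entry *e, size_t nr,
			   const unsigned char *include_pack, size_t nr_packs,
			   uint32_t *out)
{
	size_t p, i, n = 0;

	for (p = 0; p < nr_packs; p++) {
		if (!include_pack[p])
			continue;
		for (i = 0; i < nr; i++)
			if (e[i].pack_int_id == p)
				out[n++] = e[i].oid;
	}
	return n;
}
```

Both walks emit the same set of objects; only the sequence differs,
which is the part I am asking about.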

-Peff

