git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Taylor Blau <me@ttaylorr.com>
To: Patrick Steinhardt <ps@pks.im>
Cc: git@vger.kernel.org, Junio C Hamano <gitster@pobox.com>,
	Jeff King <peff@peff.net>,
	Jonathan Tan <jonathantanmy@google.com>
Subject: Re: [PATCH 2/2] builtin/repack.c: implement support for `--cruft-max-size`
Date: Mon, 2 Oct 2023 16:30:58 -0400	[thread overview]
Message-ID: <ZRsogqtbd6klqypL@nand.local> (raw)
In-Reply-To: <ZPsDyKOa76Mxb9u-@tanuki>

On Fri, Sep 08, 2023 at 01:21:44PM +0200, Patrick Steinhardt wrote:
> On Thu, Sep 07, 2023 at 05:52:04PM -0400, Taylor Blau wrote:
> [snip]
> > @@ -125,17 +133,39 @@ static void mark_packs_for_deletion_1(struct string_list *names,
> >  		if (len < hexsz)
> >  			continue;
> >  		sha1 = item->string + len - hexsz;
> > -		/*
> > -		 * Mark this pack for deletion, which ensures that this
> > -		 * pack won't be included in a MIDX (if `--write-midx`
> > -		 * was given) and that we will actually delete this pack
> > -		 * (if `-d` was given).
> > -		 */
> > -		if (!string_list_has_string(names, sha1))
> > -			item->util = (void*)1;
> > +
> > +		if (pack_is_retained(item)) {
> > +			item->util = NULL;
> > +		} else if (!string_list_has_string(names, sha1)) {
> > +			/*
> > +			 * Mark this pack for deletion, which ensures
> > +			 * that this pack won't be included in a MIDX
> > +			 * (if `--write-midx` was given) and that we
> > +			 * will actually delete this pack (if `-d` was
> > +			 * given).
> > +			 */
> > +			item->util = DELETE_PACK;
> > +		}
>
> I find the behaviour of this function a tad surprising as it doesn't
> only mark a pack for deletion, but it also marks a pack as not being
> retained anymore. Shouldn't we rather:
>
>     if (pack_is_retained(item)) {
>         // Theoretically speaking we shouldn't even do this bit here as
>         // we _un_mark the pack for deletion. But at least we shouldn't
>         // be removing the `RETAIN_PACK` bit, I'd think.
>         item->util &= ~DELETE_PACK;
>     } else if (!string_list_has_string(names, sha1)) {
>         // And here we shouldn't discard the `RETAIN_PACK` bit either.
>         item->util |= DELETE_PACK;
>     }

I think the new version should address these issues. But yeah, I
definitely understand your confusion here. I think what's written in
this patch is OK, because we check only whether the `->util` field is
non-NULL before deleting, which is why we have to remove the RETAINED
bit.

But the new version looks like this instead:

    if (pack_is_retained(item))
        pack_unmark_for_deletion(item);
    else if (!string_list_has_string(names, sha1))
        pack_mark_for_deletion(item);

the RETAINED bits still stick around (pack_unmark_for_deletion() just
does `item->util &= ~DELETE_PACK`), but we don't consult them after
mark_packs_for_deletion_1() has finished executing. Instead we just
check for the existence of the DELETE_PACK bit, rather than whether or
not the whole util field is NULL.

> > @@ -799,6 +831,72 @@ static void remove_redundant_bitmaps(struct string_list *include,
> >  	strbuf_release(&path);
> >  }
> >
> > +static int existing_cruft_pack_cmp(const void *va, const void *vb)
> > +{
> > +	struct packed_git *a = *(struct packed_git **)va;
> > +	struct packed_git *b = *(struct packed_git **)vb;
> > +
> > +	if (a->pack_size < b->pack_size)
> > +		return -1;
> > +	if (a->pack_size > b->pack_size)
> > +		return 1;
> > +	return 0;
> > +}
> > +
> > +static void collapse_small_cruft_packs(FILE *in, unsigned long max_size,
>
> We might want to use `size_t` to denote file sizes instead of `unsigned
> long`.

We can safely change these to use size_t, but let's leave OPT_MAGNITUDE
alone (and treat that portion as #leftoverbits).

> > +		p = existing_cruft[i];
> > +		proposed = st_add(total_size, p->pack_size);
> > +
> > +		if (proposed <= max_size) {
> > +			total_size = proposed;
> > +			fprintf(in, "-%s\n", pack_basename(p));
> > +		} else {
> > +			retain_cruft_pack(existing, p);
> > +			fprintf(in, "%s\n", pack_basename(p));
> > +		}
>
> It's a bit funny that we re-check whether we have exceeded the maximum
> size in subsequente iterations once we hit the limit, but it arguably
> makes the logic a bit simpler.

Yeah. Those checks are all noops (IOW, once we end up in the else
branch, we'll stay there for the rest of the loop). But we don't want to
break early, because we have to call retain_cruft_pack() on everything.
In theory you could do something like:

    for (i = 0; i < existing_cruft_nr; i++) {
        size_t proposed;
        p = existing_cruft[i];
        proposed = st_add(total_size, p->pack_size);

        if (proposed <= max_size) {
            total_size = proposed;
            fprintf(in, "-%s\n", pack_basename(p));
        } else {
            break;
        }
    }

    for (; i < existing_cruft_nr; i++) {
        retain_cruft_pack(existing, existing_cruft[i]);
        fprintf(in, "%s\n", pack_basename(existing_cruft[i]));
    }

But I think that the above is slightly more error-prone than what is
written in the original patch. I have only the vaguest of preferences
towards the former, but I'm happy to change it around if you feel
strongly.

> If I understand correctly, we only collapse small cruft packs in case
> we're not expiring any objects at the same time. Is there an inherent
> reason why? I would imagine that it can indeed be useful to expire
> objects contained in cruft packs and then have git-repack(1) recombine
> whatever is left into larger packs.
>
> If the reason is basically "it's complicated" then that is fine with me,
> we can still implement the functionality at a later point in time. But
> until then I think that we should let callers know that the two options
> are incompatible with each other by producing an error when both are
> passed.

Your understanding is correct. We could try to leave existing cruft
packs alone when none of their objects are removed as a result of
pruning, but that case should be relatively rare. Another thing you
could do is handle cruft packs which have only part of their objects
being pruned by combining the non-pruned parts into a new pack.

The latter should be mostly straightforward to implement, but since
we're often ending up with very few cruft objects post-pruning, it
likely wouldn't help much.

Thanks,
Taylor

  reply	other threads:[~2023-10-02 20:31 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-09-07 21:51 [PATCH 0/2] repack: implement `--cruft-max-size` Taylor Blau
2023-09-07 21:52 ` [PATCH 1/2] t7700: split cruft-related tests to t7704 Taylor Blau
2023-09-08  0:01   ` Eric Sunshine
2023-09-07 21:52 ` [PATCH 2/2] builtin/repack.c: implement support for `--cruft-max-size` Taylor Blau
2023-09-07 23:42   ` Junio C Hamano
2023-09-25 18:01     ` Taylor Blau
2023-09-08 11:21   ` Patrick Steinhardt
2023-10-02 20:30     ` Taylor Blau [this message]
2023-10-03  0:44 ` [PATCH v2 0/3] repack: implement `--cruft-max-size` Taylor Blau
2023-10-03  0:44   ` [PATCH v2 1/3] t7700: split cruft-related tests to t7704 Taylor Blau
2023-10-03  0:44   ` [PATCH v2 2/3] builtin/repack.c: parse `--max-pack-size` with OPT_MAGNITUDE Taylor Blau
2023-10-05 11:31     ` Patrick Steinhardt
2023-10-05 17:28       ` Taylor Blau
2023-10-05 20:22         ` Junio C Hamano
2023-10-03  0:44   ` [PATCH v2 3/3] builtin/repack.c: implement support for `--max-cruft-size` Taylor Blau
2023-10-05 12:08     ` Patrick Steinhardt
2023-10-05 17:35       ` Taylor Blau
2023-10-05 20:25       ` Junio C Hamano
2023-10-07 17:20     ` [PATCH] repack: free existing_cruft array after use Jeff King
2023-10-09  1:24       ` Taylor Blau
2023-10-09 17:28         ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZRsogqtbd6klqypL@nand.local \
    --to=me@ttaylorr.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jonathantanmy@google.com \
    --cc=peff@peff.net \
    --cc=ps@pks.im \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).