git.vger.kernel.org archive mirror
From: Taylor Blau <me@ttaylorr.com>
To: Jonathan Tan <jonathantanmy@google.com>
Cc: Taylor Blau <me@ttaylorr.com>,
	git@vger.kernel.org, derrickstolee@github.com, gitster@pobox.com
Subject: Re: [RFC PATCH 0/4] move pruned objects to a separate repository
Date: Wed, 29 Jun 2022 22:47:23 -0400
Message-ID: <Yr0OuwCyDot0wJjs@nand.local>
In-Reply-To: <20220629225405.1864460-1-jonathantanmy@google.com>

On Wed, Jun 29, 2022 at 03:54:04PM -0700, Jonathan Tan wrote:
> Taylor Blau <me@ttaylorr.com> writes:
> > This series is an RFC for now since I'm interested in discussing whether
> > or not this is a feature that people would actually want to use.
> > But if it is, I'm happy to polish this up and turn it into a
> > non-RFC-quality series ;-).
> >
> > In the meantime, thanks for your review!
>
> Thanks for this patch set. I can see this being used by, say, someone
> who wants to preserve a repo that rewinds branches all the time (the
> refs would need to be backed-up separately, but at least this provides a
> way for objects to be stored efficiently, in that reachable objects are
> still stored in the main repo and unreachable objects are stored in the
> backup with no overlap between them).

Yes, definitely.

If it helps, I can share a little bit about the motivating use-case
within GitHub. All objects from a fork network are stored together in a
repository that we call the network.git, with individual forks keeping
track of their own references.

The network.git repository can often grow quite large, and/or contain
data that the owner of an individual fork would like removed (e.g., they
accidentally pushed sensitive credentials, force-pushed over it, but
would like the now-unreachable objects to be removed).

We don't usually do pruning GCs except during manual intervention or
upon request through a support ticket. But when we do, it is often
infeasible to lock the entire network's push traffic and reference
updates. So it is not unheard of to encounter the race that I described
above.

The idea is that, at least for non-sensitive pruning, we would move the
pruned objects to a separate repository and hold them there until we
could run `git fsck` on the repository after pruning and verify that the
repository is intact. If it is, then the expired.git repository can be
emptied, too, permanently removing the pruned objects. If not, the
expired.git repository then becomes a donor for the missing objects,
which are used to heal the corrupt main repository. Once *that* is done,
and fsck comes back clean, then the expired.git repository can be
removed.
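
As a rough model of that procedure (a sketch only, not code from this
series), the verify-then-heal step can be written over plain sets of
object IDs. The function name `heal` and its three arguments are
hypothetical; `required` stands in for whatever a post-prune `git fsck`
reports as needed, and no real Git commands are run here.

```python
def heal(main, expired, required):
    """Model the post-prune check on sets of object IDs.

    main     -- objects left in the main repository after pruning
    expired  -- objects that were moved to expired.git
    required -- objects the main repository must contain to be intact
                (assumed to come from an fsck-like check)

    Returns (main_after, recovered).
    """
    missing = required - main
    if not missing:
        # Repository is intact; expired.git could now be emptied.
        return set(main), set()

    # Some needed objects were pruned: expired.git acts as the donor.
    unrecoverable = missing - expired
    if unrecoverable:
        raise LookupError(f"not in expired.git: {sorted(unrecoverable)}")

    recovered = missing & expired
    return main | recovered, recovered
```

In this model, expired.git is only safe to discard once `heal` returns
with an empty `missing` set on a second check, mirroring the "fsck comes
back clean" condition above.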

> I think there is at least one more alternative that should be
> considered, though: since the cruft pack is unlikely to have its objects
> "resurrected" (since the reason why they're there is because they are
> unreachable), it is likely that the objects that are pruned are exactly
> the same as those in the cruft pack. So it would be more efficient to
> just unconditionally rename the cruft pack to the backup destination.

This isn't quite right. The contents written into the expired.git
repository are everything that *didn't* end up in the cruft pack.

Suppose your cruft expiration is 1.hour.ago, and you're doing a repack on
repository foo.git, expiring objects into expired.git. There are three
disjoint sets of objects:

  - reachable objects, which will stay in foo.git
  - unreachable objects which were written within the last hour (and are
    thus too new to prune) which will stay in foo.git
  - unreachable objects which *weren't* written within the last hour
    (and thus will be pruned) which are moved to a new pack in
    expired.git (and removed from foo.git)

So the cruft pack in foo.git and the one written to expired.git are a
disjoint cut of the unreachable objects in foo.git based on their mtime,
with the recent objects staying in the source repository and the stale
ones moving to the expired.git repository.
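
To make the three-way split concrete, here is a minimal Python sketch of
the partition. It is an illustration under stated assumptions, not how
builtin/repack.c implements it: the `Obj` type, the `partition` function,
and the treatment of an object whose mtime falls exactly on the cutoff
are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    oid: str          # object ID
    reachable: bool   # reachable from some reference?
    mtime: int        # last-modified time, seconds since the epoch


def partition(objects, now, grace_seconds):
    """Split objects into (reachable, recent_cruft, expired) ID sets."""
    cutoff = now - grace_seconds
    reachable, recent_cruft, expired = set(), set(), set()
    for obj in objects:
        if obj.reachable:
            reachable.add(obj.oid)       # stays in foo.git
        elif obj.mtime >= cutoff:
            recent_cruft.add(obj.oid)    # too new to prune, stays in foo.git
        else:
            expired.add(obj.oid)         # moved to a pack in expired.git
    return reachable, recent_cruft, expired
```

With a one-hour grace period (`grace_seconds=3600`), an unreachable
object written a minute ago lands in `recent_cruft`, while one written
two hours ago lands in `expired`. The three sets never overlap, which is
the property the description above relies on.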

The original implementation of this feature was to move the entire cruft
pack out of the way like you describe. This is sub-optimal because you
are forced to generate that cruft pack with `--cruft-expiration=never`,
since you can't actually prune any objects when generating the cruft
pack, or they would be gone forever. But since you have to move the
entire cruft pack out of the way, the visible effect looks like you
actually pruned *all* unreachable objects, as if you had supplied
`--cruft-expiration=now`.

Being able to expire just the objects which have aged out of the grace
period should cause this race to happen less frequently in practice.

> Having said that, if there is a compelling use case for repacking even
> when we're moving from cruft pack to backup, the design of this patch
> set looks good overall. There are some minor points (e.g. the naming of
> the parameter "out" in patch 1), but I understand that this is an RFC
> and I'll wait for a non-RFC patch set before looking more closely at
> these things.

Thanks,
Taylor

Thread overview: 9+ messages
2022-06-29 18:45 [RFC PATCH 0/4] move pruned objects to a separate repository Taylor Blau
2022-06-29 18:45 ` [RFC PATCH 1/4] builtin/repack.c: pass "out" to `prepare_pack_objects` Taylor Blau
2022-06-29 18:47 ` [RFC PATCH 2/4] builtin/repack.c: pass "cruft_expiration" to `write_cruft_pack` Taylor Blau
2022-06-29 18:47 ` [RFC PATCH 3/4] builtin/repack.c: write cruft packs to arbitrary locations Taylor Blau
2022-06-29 18:47 ` [RFC PATCH 4/4] builtin/repack.c: implement `--expire-to` for storing pruned objects Taylor Blau
2022-06-29 22:54 ` [RFC PATCH 0/4] move pruned objects to a separate repository Jonathan Tan
2022-06-30  2:47   ` Taylor Blau [this message]
2022-06-30 21:15     ` Jonathan Tan
2022-06-30  8:00 ` Ævar Arnfjörð Bjarmason
