git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] builtin/pack-objects.c: introduce `pack.extraCruftTips`
@ 2023-04-20 17:27 Taylor Blau
  2023-04-20 18:12 ` Junio C Hamano
                   ` (4 more replies)
  0 siblings, 5 replies; 31+ messages in thread
From: Taylor Blau @ 2023-04-20 17:27 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Derrick Stolee, Junio C Hamano

This patch introduces a new multi-valued configuration option,
`pack.extraCruftTips` as an alternative means to mark certain objects in
the cruft pack as rescued, even if they would otherwise be pruned.

Traditionally, cruft packs are generated in one of two ways:

  - When not pruning, all packed and loose objects which do not appear
    in the any of the packs which wil remain (as indicated over stdin,
    with `--cruft`) are packed separately into a cruft pack.

  - When pruning, the above process happens, with two additional tweaks.
    Instead of adding every object into the cruft pack, only objects
    which have mtime that is within the grace period are kept. In
    addition, a "rescuing" traversal is performed over the remaining
    cruft objects to keep around cruft objects which would have aged out
    of the repository, but are reachable from other cruft objects which
    have not yet aged out.

This works well for repositories which have all of their "interesting"
objects worth keeping around reachable via at least one reference.

However, there is no option to be able to keep around certain objects
that have otherwise aged out of the grace period. The only way to retain
those objects is:

  - to point a reference at them, which may be undesirable or
    infeasible,
  - to extend the grace period, which may keep around other objects that
    the caller *does* want to discard,
  - or, to force the caller to construct the pack of objects they want
    to keep themselves, and then mark the pack as kept by adding a
    ".keep" file.

This patch introduces a new configuration, `pack.extraCruftTips` which
allows the caller to specify a program (or set of programs) whose output
is treated as a set of objects to treat as "kept", even if they are
unreachable and have aged out of the retention period.

The implementation is straightforward: after determining the set of
objects to pack into the cruft pack (either by calling
`enumerate_cruft_objects()` or `enumerate_and_traverse_cruft_objects()`)
we figure out if there are any program(s) we need to consult in order to
determine if there are any objects which we need to "rescue". We then
add those as tips to another reachability traversal, marking every
object along the way as cruft (and thus retaining it in the cruft pack).

A potential alternative here is to introduce a new mode to alter the
contents of the reachable pack instead of the cruft one. One could
imagine a new option to `pack-objects`, say `--extra-tips` that does the
same thing as above, adding the visited set of objects along the
traversal to the pack.

But this has the unfortunate side-effect of altering the reachability
closure of that pack. If parts of the unreachable object graph mentioned
by one or more of the "extra tips" programs is not closed, then the
resulting pack won't be either. This makes it impossible in the general
case to write out reachability bitmaps for that pack, since closure is a
requirement there.

Instead, keep these unreachable objects in the cruft pack instead, to
ensure that we can continue to have a pack containing just reachable
objects, which is always safe to write a bitmap over.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/pack.txt |  9 ++++
 builtin/pack-objects.c        | 95 +++++++++++++++++++++++++++++++++
 t/t5329-pack-objects-cruft.sh | 98 +++++++++++++++++++++++++++++++++++
 3 files changed, 202 insertions(+)

diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index 53093d9996..ab6aadd4b0 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -67,6 +67,15 @@ pack.deltaCacheLimit::
 	result once the best match for all objects is found.
 	Defaults to 1000. Maximum value is 65535.
 
+pack.extraCruftTips::
+	When generating a cruft pack, use the shell to execute the
+	specified command(s), and interpret their output as additional
+	tips of objects to keep in the cruft pack, regardless of their
+	age.
++
+Output must contain exactly one hex object ID per line, and nothing
+else. Objects which cannot be found in the repository are ignored.
+
 pack.threads::
 	Specifies the number of threads to spawn when searching for best
 	delta matches.  This requires that linkgit:git-pack-objects[1]
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 77d88f85b0..3b575c79dc 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -43,6 +43,7 @@
 #include "pack-mtimes.h"
 #include "parse-options.h"
 #include "wrapper.h"
+#include "run-command.h"
 
 /*
  * Objects we are going to pack are collected in the `to_pack` structure.
@@ -3586,6 +3587,98 @@ static void enumerate_and_traverse_cruft_objects(struct string_list *fresh_packs
 	stop_progress(&progress_state);
 }
 
+static int enumerate_extra_cruft_tips_1(struct rev_info *revs, const char *args)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct strbuf buf = STRBUF_INIT;
+	FILE *out = NULL;
+	int ret = 0;
+
+	cmd.use_shell = 1;
+	cmd.out = -1;
+
+	strvec_push(&cmd.args, args);
+	strvec_pushv(&cmd.env, (const char **)local_repo_env);
+
+	if (start_command(&cmd)) {
+		ret = -1;
+		goto done;
+	}
+
+	out = xfdopen(cmd.out, "r");
+	while (strbuf_getline_lf(&buf, out) != EOF) {
+		struct object_id oid;
+		struct object *obj;
+		const char *rest;
+
+		if (parse_oid_hex(buf.buf, &oid, &rest) || *rest) {
+			ret = error(_("invalid extra cruft tip: '%s'"), buf.buf);
+			goto done;
+		}
+
+		obj = parse_object(the_repository, &oid);
+		if (!obj)
+			continue;
+
+		display_progress(progress_state, ++nr_seen);
+		add_pending_object(revs, obj, "");
+	}
+
+done:
+	if (out)
+		fclose(out);
+	strbuf_release(&buf);
+	child_process_clear(&cmd);
+
+	return ret;
+}
+
+static int enumerate_extra_cruft_tips(void)
+{
+	struct rev_info revs;
+	const struct string_list *programs;
+	int ret = 0;
+	size_t i;
+
+	if (git_config_get_string_multi("pack.extracrufttips", &programs))
+		return ret;
+
+	repo_init_revisions(the_repository, &revs, NULL);
+
+	revs.tag_objects = 1;
+	revs.tree_objects = 1;
+	revs.blob_objects = 1;
+
+	revs.include_check = cruft_include_check;
+	revs.include_check_obj = cruft_include_check_obj;
+
+	revs.ignore_missing_links = 1;
+
+	if (progress)
+		progress_state = start_progress(_("Enumerating extra cruft tips"), 0);
+	nr_seen = 0;
+	for (i = 0; i < programs->nr; i++) {
+		ret = enumerate_extra_cruft_tips_1(&revs,
+						   programs->items[i].string);
+		if (!ret)
+			break;
+	}
+	stop_progress(&progress_state);
+
+	if (ret)
+		die(_("unable to enumerate additional cruft tips"));
+
+	if (prepare_revision_walk(&revs))
+		die(_("revision walk setup failed"));
+	if (progress)
+		progress_state = start_progress(_("Traversing extra cruft objects"), 0);
+	nr_seen = 0;
+	traverse_commit_list(&revs, show_cruft_commit, show_cruft_object, NULL);
+	stop_progress(&progress_state);
+
+	return ret;
+}
+
 static void read_cruft_objects(void)
 {
 	struct strbuf buf = STRBUF_INIT;
@@ -3641,6 +3734,8 @@ static void read_cruft_objects(void)
 	else
 		enumerate_cruft_objects();
 
+	enumerate_extra_cruft_tips();
+
 	strbuf_release(&buf);
 	string_list_clear(&discard_packs, 0);
 	string_list_clear(&fresh_packs, 0);
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index 303f7a5d84..1260e11d41 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -739,4 +739,102 @@ test_expect_success 'cruft objects are freshend via loose' '
 	)
 '
 
+test_expect_success 'additional cruft tips may be specified via pack.extraCruftTips' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		# Create a handful of objects.
+		#
+		#   - one reachable commit, "base", designated for the reachable
+		#     pack
+		#   - one unreachable commit, "cruft.discard", which is marked
+		#     for deletion
+		#   - one unreachable commit, "cruft.old", which would be marked
+		#     for deletion, but is rescued as an extra cruft tip
+		#   - one unreachable commit, "cruft.new", which is not marked
+		#     for deletion
+		test_commit base &&
+		git branch -M main &&
+
+		git checkout --orphan discard &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.discard &&
+
+		git checkout --orphan old &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.old &&
+		cruft_old="$(git rev-parse HEAD)" &&
+
+		git checkout --orphan new &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.new &&
+		cruft_new="$(git rev-parse HEAD)" &&
+
+		git checkout main &&
+		git branch -D old new &&
+		git reflog expire --all --expire=all &&
+
+		# mark cruft.old with an mtime that is many minutes
+		# older than the expiration period, and mark cruft.new
+		# with an mtime that is in the future (and thus not
+		# eligible for pruning).
+		test-tool chmtime -2000 "$objdir/$(test_oid_to_path $cruft_old)" &&
+		test-tool chmtime +1000 "$objdir/$(test_oid_to_path $cruft_new)" &&
+
+		# Write the list of cruft objects we expect to
+		# accumulate, which is comprised of everything reachable
+		# from cruft.old and cruft.new, but not cruft.discard.
+		git rev-list --objects --no-object-names \
+			$cruft_old $cruft_new >cruft.raw &&
+		sort cruft.raw >cruft.expect &&
+
+		# Write the script to list extra tips, which are limited
+		# to cruft.old, in this case.
+		write_script extra-tips <<-EOF &&
+		echo $cruft_old
+		EOF
+		git config pack.extraCruftTips ./extra-tips &&
+
+		git repack --cruft --cruft-expiration=now -d &&
+
+		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
+		git show-index <${mtimes%.mtimes}.idx >cruft &&
+		cut -d" " -f2 cruft | sort >cruft.actual &&
+		test_cmp cruft.expect cruft.actual
+	)
+'
+
+test_expect_success 'additional cruft blobs via pack.extraCruftTips' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+
+		blob=$(echo "unreachable" | git hash-object -w --stdin) &&
+
+		# mark the unreachable blob we wrote above as having
+		# aged out of the retention period
+		test-tool chmtime -2000 "$objdir/$(test_oid_to_path $blob)" &&
+
+		# Write the script to list extra tips, which is just the
+		# extra blob as above.
+		write_script extra-tips <<-EOF &&
+		echo $blob
+		EOF
+		git config pack.extraCruftTips ./extra-tips &&
+
+		git repack --cruft --cruft-expiration=now -d &&
+
+		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
+		git show-index <${mtimes%.mtimes}.idx >cruft &&
+		cut -d" " -f2 cruft >actual &&
+		echo $blob >expect &&
+		test_cmp expect actual
+	)
+'
+
 test_done
-- 
2.40.0.353.g8af478ebe3

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH] builtin/pack-objects.c: introduce `pack.extraCruftTips`
  2023-04-20 17:27 [PATCH] builtin/pack-objects.c: introduce `pack.extraCruftTips` Taylor Blau
@ 2023-04-20 18:12 ` Junio C Hamano
  2023-04-20 19:30   ` Taylor Blau
  2023-04-21  0:10 ` Chris Torek
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 31+ messages in thread
From: Junio C Hamano @ 2023-04-20 18:12 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Derrick Stolee

Taylor Blau <me@ttaylorr.com> writes:

> This patch introduces a new configuration, `pack.extraCruftTips` which
> allows the caller to specify a program (or set of programs) whose output
> is treated as a set of objects to treat as "kept", even if they are
> unreachable and have aged out of the retention period.

I can see why this would make the implementation simpler, simply
because no existing codepaths that determines if an object is still
live knows about such a program hence all the codepaths by default
keep considering them expungeable crufts except for a very limited
set of codepaths that will pay attention to the program.  But it
makes me wonder if it would make the life of end-users simpler if we
reserve a special ref hierarchy, say "refs/crufts/*", than having to
write a program for doing something like this.

Thanks.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] builtin/pack-objects.c: introduce `pack.extraCruftTips`
  2023-04-20 18:12 ` Junio C Hamano
@ 2023-04-20 19:30   ` Taylor Blau
  2023-04-20 19:52     ` Junio C Hamano
  0 siblings, 1 reply; 31+ messages in thread
From: Taylor Blau @ 2023-04-20 19:30 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King, Derrick Stolee

On Thu, Apr 20, 2023 at 11:12:34AM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > This patch introduces a new configuration, `pack.extraCruftTips` which
> > allows the caller to specify a program (or set of programs) whose output
> > is treated as a set of objects to treat as "kept", even if they are
> > unreachable and have aged out of the retention period.
>
> I can see why this would make the implementation simpler, simply
> because no existing codepaths that determines if an object is still
> live knows about such a program hence all the codepaths by default
> keep considering them expungeable crufts except for a very limited
> set of codepaths that will pay attention to the program.

Right, this is about band-aiding a situation where there are objects
that we want to save for whatever reason but which do not (need to) be
reachable via any reference(s).

> But it makes me wonder if it would make the life of end-users simpler
> if we reserve a special ref hierarchy, say "refs/crufts/*", than
> having to write a program for doing something like this.

Ideally, yes. But I think there are certain instances where there are
far too many (disconnected) objects that creating a reference for each
part of the unreachable object graph that we want to keep is infeasible.

Another way to think about pack.extraCruftTips is that the program
invocation is acting like the refs/crufts hierarchy would if it existed,
but without actually having to write all of those references down.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] builtin/pack-objects.c: introduce `pack.extraCruftTips`
  2023-04-20 19:30   ` Taylor Blau
@ 2023-04-20 19:52     ` Junio C Hamano
  2023-04-20 20:48       ` Taylor Blau
  0 siblings, 1 reply; 31+ messages in thread
From: Junio C Hamano @ 2023-04-20 19:52 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Derrick Stolee

Taylor Blau <me@ttaylorr.com> writes:

>> But it makes me wonder if it would make the life of end-users simpler
>> if we reserve a special ref hierarchy, say "refs/crufts/*", than
>> having to write a program for doing something like this.
>
> Ideally, yes. But I think there are certain instances where there are
> far too many (disconnected) objects that creating a reference for each
> part of the unreachable object graph that we want to keep is infeasible.
>
> Another way to think about pack.extraCruftTips is that the program
> invocation is acting like the refs/crufts hierarchy would if it existed,
> but without actually having to write all of those references down.

I do not need another way, actually.

If it is infeasible to keep track of what we want to keep from
expiration because they are too numerous and disconnected, I have to
wonder how a programatic way to enumerate them is feasible,
especially the enumeration is for doing something useful.  

You can write a program that enumerates all the loose objects and
existing object names that appear in .idx files and such a program
will cover "everything is too disconnected" case, but that does not
count as a useful use case---disabling cruft expiration would
achieve the same thing.  Is there a less hand-wavy use case you have
in mind?

THanks.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] builtin/pack-objects.c: introduce `pack.extraCruftTips`
  2023-04-20 19:52     ` Junio C Hamano
@ 2023-04-20 20:48       ` Taylor Blau
  0 siblings, 0 replies; 31+ messages in thread
From: Taylor Blau @ 2023-04-20 20:48 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King, Derrick Stolee, Michael Haggerty

On Thu, Apr 20, 2023 at 12:52:55PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> >> But it makes me wonder if it would make the life of end-users simpler
> >> if we reserve a special ref hierarchy, say "refs/crufts/*", than
> >> having to write a program for doing something like this.
> >
> > Ideally, yes. But I think there are certain instances where there are
> > far too many (disconnected) objects that creating a reference for each
> > part of the unreachable object graph that we want to keep is infeasible.
> >
> > Another way to think about pack.extraCruftTips is that the program
> > invocation is acting like the refs/crufts hierarchy would if it existed,
> > but without actually having to write all of those references down.
>
> [...] Is there a less hand-wavy use case you have in mind?

Sure. The use-case I have in mind directly is keeping certain entries
from GitHub's `audit_log` file (see a description from Peff in [1])
while excluding others.

We use the audit_log to track every single reference change, like the
reflog but with the reference name prepended to each entry and some
optional metadata attached to the end of each entry.

Our goal is to be able to prune the test-merge objects that GitHub
creates without (usually) pruning any objects that was pushed by a user.
E.g., even if a user force-pushes their branch from A to B (where B is
not strictly ahead of A) we want to keep the objects from A around, even
though a reference is no longer pointing at it.

This already works with reflogs, which are considered as reachable
objects when generating a cruft pack (that is, they go in the "big"
pack with the rest of the reachable objects. This is extending that
mechanism to work with GitHub's custom format, but doing so in a way
that is not tied to that format whatsoever.

The hope is that others may find it useful in other special
circumstances like above.

Thanks,
Taylor

[1]: https://lore.kernel.org/git/20150624094919.GC5436@peff.net/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] builtin/pack-objects.c: introduce `pack.extraCruftTips`
  2023-04-20 17:27 [PATCH] builtin/pack-objects.c: introduce `pack.extraCruftTips` Taylor Blau
  2023-04-20 18:12 ` Junio C Hamano
@ 2023-04-21  0:10 ` Chris Torek
  2023-04-21  2:14   ` Taylor Blau
  2023-04-25 19:42 ` Derrick Stolee
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 31+ messages in thread
From: Chris Torek @ 2023-04-21  0:10 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Derrick Stolee, Junio C Hamano

On Thu, Apr 20, 2023 at 10:35 AM Taylor Blau <me@ttaylorr.com> wrote:
> +Output must contain exactly one hex object ID per line, and nothing
> +else. Objects which cannot be found in the repository are ignored.

The first part sounds good, as a rigid output format can always be
relaxed later if necessary. The second I'm less sure about: should
there perhaps be a --{no-,}missing-objects option to control it?

In particular I'd be a bit concerned if I always wanted to keep object
$H and ran a command that is supposed to keep $H, but it was
silently dropped (accidentally or not) and is then gone.

One could always run `gitt rev-parse` or `git cat-file -t` on the hash
afterward to double check, I suppose...

Chris

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] builtin/pack-objects.c: introduce `pack.extraCruftTips`
  2023-04-21  0:10 ` Chris Torek
@ 2023-04-21  2:14   ` Taylor Blau
  0 siblings, 0 replies; 31+ messages in thread
From: Taylor Blau @ 2023-04-21  2:14 UTC (permalink / raw)
  To: Chris Torek; +Cc: git, Jeff King, Derrick Stolee, Junio C Hamano

On Thu, Apr 20, 2023 at 05:10:31PM -0700, Chris Torek wrote:
> On Thu, Apr 20, 2023 at 10:35 AM Taylor Blau <me@ttaylorr.com> wrote:
> > +Output must contain exactly one hex object ID per line, and nothing
> > +else. Objects which cannot be found in the repository are ignored.
>
> The first part sounds good, as a rigid output format can always be
> relaxed later if necessary. The second I'm less sure about: should
> there perhaps be a --{no-,}missing-objects option to control it?

Maybe, although specifying it might be a little tricky.

I think you'd have to tie that setting into the configuration instead of
passing it through as a command-line argument. So things get interesting
when you have multiple `pack.extraCruftTips` commands: which of them are
allowed (or not allowed) to specify missing objects?

I think you could do something like:

    [packExtraCruft "foo"]
        exec = /path/to/script
        allowMissing = true

and specify multiple pack.extraCruftTips-esque programs, that each allow
(or not) passing missing objects.

I dunno. It might be overkill, but I could also see a compelling
argument towards doing something like that. It may equally be OK to set
`pack.allowMissingExtraCruftTips` once and have it apply to all values
of `pack.extraCruftTips`.

That might be a reasonable compromise, and it doesn't feel like overkill
to me. I can't easily imagine a circumstance where you'd want to allow
some programs to specify missing objects and not others.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] builtin/pack-objects.c: introduce `pack.extraCruftTips`
  2023-04-20 17:27 [PATCH] builtin/pack-objects.c: introduce `pack.extraCruftTips` Taylor Blau
  2023-04-20 18:12 ` Junio C Hamano
  2023-04-21  0:10 ` Chris Torek
@ 2023-04-25 19:42 ` Derrick Stolee
  2023-04-25 21:25   ` Taylor Blau
  2023-05-03  0:09 ` [PATCH v2] " Taylor Blau
  2023-05-03 22:05 ` [PATCH v3] " Taylor Blau
  4 siblings, 1 reply; 31+ messages in thread
From: Derrick Stolee @ 2023-04-25 19:42 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: Jeff King, Junio C Hamano

On 4/20/2023 1:27 PM, Taylor Blau wrote:
> This patch introduces a new multi-valued configuration option,
> `pack.extraCruftTips` as an alternative means to mark certain objects in
> the cruft pack as rescued, even if they would otherwise be pruned.

> However, there is no option to be able to keep around certain objects
> that have otherwise aged out of the grace period. The only way to retain
> those objects is:
> 
>   - to point a reference at them, which may be undesirable or
>     infeasible,
>   - to extend the grace period, which may keep around other objects that
>     the caller *does* want to discard,
>   - or, to force the caller to construct the pack of objects they want
>     to keep themselves, and then mark the pack as kept by adding a
>     ".keep" file.
> 
> This patch introduces a new configuration, `pack.extraCruftTips` which
> allows the caller to specify a program (or set of programs) whose output
> is treated as a set of objects to treat as "kept", even if they are
> unreachable and have aged out of the retention period.

One way to think about this is that we want to store a second set of
references to objects we do not want to prune, but those references
are not useful for _any other Git operation_, so polluting the ref
space would be too expensive. And reflogs are insufficient because
deleted references also delete their reflogs.

These "forever" references can be kept in a completely different format
if we use a hook interface. This helps us in particular because we
already have such a format on our systems. This may be a common
scenario in other systems, too.

> The implementation is straightforward: after determining the set of
> objects to pack into the cruft pack (either by calling
> `enumerate_cruft_objects()` or `enumerate_and_traverse_cruft_objects()`)
> we figure out if there are any program(s) we need to consult in order to
> determine if there are any objects which we need to "rescue". We then
> add those as tips to another reachability traversal, marking every
> object along the way as cruft (and thus retaining it in the cruft pack).

I initially wondered why we would need a second walk, and I _think_
one reason is that we don't want to update the mtimes for these
objects as if they were reachable. The next point about the reachable
pack is also critical.

> A potential alternative here is to introduce a new mode to alter the
> contents of the reachable pack instead of the cruft one. One could
> imagine a new option to `pack-objects`, say `--extra-tips` that does the
> same thing as above, adding the visited set of objects along the
> traversal to the pack.
> 
> But this has the unfortunate side-effect of altering the reachability
> closure of that pack. If parts of the unreachable object graph mentioned
> by one or more of the "extra tips" programs is not closed, then the
> resulting pack won't be either. This makes it impossible in the general
> case to write out reachability bitmaps for that pack, since closure is a
> requirement there.

This caught me off guard at first, but the important thing is that
Git allows objects to be missing if they are only reachable from
objects that are not reachable from references. This "the repository
is not necessarily closed under reachability when including
unreachable objects" is always a confusing state to keep in mind.

There's another reason to keep them in the cruft pack, even if they
were closed under reachability: their presence muddies the reachable
pack(s) so their deltas are worse and their bitmaps a less efficient.

> Instead, keep these unreachable objects in the cruft pack instead, to
> ensure that we can continue to have a pack containing just reachable
> objects, which is always safe to write a bitmap over.

Makes sense. Also: don't update their mtime so they could be removed
immediately if the external pointers change.

> +pack.extraCruftTips::
> +	When generating a cruft pack, use the shell to execute the
> +	specified command(s), and interpret their output as additional
> +	tips of objects to keep in the cruft pack, regardless of their
> +	age.
> ++
> +Output must contain exactly one hex object ID per line, and nothing
> +else. Objects which cannot be found in the repository are ignored.
> +

This only indicates it is multi-valued by the word "command(s)". It
also doesn't specify that we execute them all or stop when one
completes with success. (see code)

> +static int enumerate_extra_cruft_tips_1(struct rev_info *revs, const char *args)
> +{
> +	struct child_process cmd = CHILD_PROCESS_INIT;
> +	struct strbuf buf = STRBUF_INIT;
> +	FILE *out = NULL;
> +	int ret = 0;
> +
> +	cmd.use_shell = 1;
> +	cmd.out = -1;
> +
> +	strvec_push(&cmd.args, args);
> +	strvec_pushv(&cmd.env, (const char **)local_repo_env);

I'm surprised this is necessary to include here. However, I see
one other use like this is for core.alternateRefsCommand, which
serves a similar purpose.

> +
> +	if (start_command(&cmd)) {
> +		ret = -1;
> +		goto done;
> +	}
> +
> +	out = xfdopen(cmd.out, "r");
> +	while (strbuf_getline_lf(&buf, out) != EOF) {
> +		struct object_id oid;
> +		struct object *obj;
> +		const char *rest;
> +
> +		if (parse_oid_hex(buf.buf, &oid, &rest) || *rest) {
> +			ret = error(_("invalid extra cruft tip: '%s'"), buf.buf);
> +			goto done;
> +		}

Reporting an error on malformed output from hook...

> +		obj = parse_object(the_repository, &oid);
> +		if (!obj)
> +			continue;

...but skipping over missing objects. Good.

> +
> +		display_progress(progress_state, ++nr_seen);
> +		add_pending_object(revs, obj, "");
> +	}
> +
> +done:
> +	if (out)
> +		fclose(out);
> +	strbuf_release(&buf);
> +	child_process_clear(&cmd);

This cleanup looks sufficient from my reading.

> +	return ret;
> +}
> +
> +static int enumerate_extra_cruft_tips(void)
> +{
> +	struct rev_info revs;
> +	const struct string_list *programs;
> +	int ret = 0;
> +	size_t i;
> +
> +	if (git_config_get_string_multi("pack.extracrufttips", &programs))
> +		return ret;
> +
> +	repo_init_revisions(the_repository, &revs, NULL);
> +
> +	revs.tag_objects = 1;
> +	revs.tree_objects = 1;
> +	revs.blob_objects = 1;
> +
> +	revs.include_check = cruft_include_check;
> +	revs.include_check_obj = cruft_include_check_obj;
> +
> +	revs.ignore_missing_links = 1;
> +
> +	if (progress)
> +		progress_state = start_progress(_("Enumerating extra cruft tips"), 0);

Should we be including two progress counts? or would it be sufficient
to say "traversing extra cruft objects" and include both of these steps
in that one progress mode?

> +	nr_seen = 0;
> +	for (i = 0; i < programs->nr; i++) {
> +		ret = enumerate_extra_cruft_tips_1(&revs,
> +						   programs->items[i].string);
> +		if (!ret)
> +			break;
> +	}
> +	stop_progress(&progress_state);
> +
> +	if (ret)
> +		die(_("unable to enumerate additional cruft tips"));

If we are going to die() here, we might as well do it in the loop
instead of the break. But wait... this seems backwards.

I suppose what this is saying is: try each possible hook until we
get a success (in which case we stop trying). If we have a failure
in all attempts, then we die().

I was expecting this to be a cumulative operation: we execute each
hook in order and include all of their output. This implementation
doesn't seem to match that (or I'm reading something wrong).

> +	if (prepare_revision_walk(&revs))
> +		die(_("revision walk setup failed"));
> +	if (progress)
> +		progress_state = start_progress(_("Traversing extra cruft objects"), 0);
> +	nr_seen = 0;
> +	traverse_commit_list(&revs, show_cruft_commit, show_cruft_object, NULL);

Hm. I'm trying to look at these helpers and figure out what they
are doing to prevent these objects from being pruned. This could
use a comment or something, as I'm unable to grok it at present.

> +	stop_progress(&progress_state);
> +
> +	return ret;
> +}


> +test_expect_success 'additional cruft tips may be specified via pack.extraCruftTips' '
> +	git init repo &&
> +	test_when_finished "rm -fr repo" &&
> +	(
> +		cd repo &&
> +
> +		# Create a handful of objects.
> +		#
> +		#   - one reachable commit, "base", designated for the reachable
> +		#     pack
> +		#   - one unreachable commit, "cruft.discard", which is marked
> +		#     for deletion
> +		#   - one unreachable commit, "cruft.old", which would be marked
> +		#     for deletion, but is rescued as an extra cruft tip
> +		#   - one unreachable commit, "cruft.new", which is not marked
> +		#     for deletion
> +		test_commit base &&
> +		git branch -M main &&
> +
> +		git checkout --orphan discard &&
> +		git rm -fr . &&
> +		test_commit --no-tag cruft.discard &&

Could we store this commit to make sure it is removed from the
repository later?

> +		git checkout --orphan old &&
> +		git rm -fr . &&
> +		test_commit --no-tag cruft.old &&
> +		cruft_old="$(git rev-parse HEAD)" &&
> +
> +		git checkout --orphan new &&
> +		git rm -fr . &&
> +		test_commit --no-tag cruft.new &&
> +		cruft_new="$(git rev-parse HEAD)" &&
> +
> +		git checkout main &&
> +		git branch -D old new &&

Do you need to delete the 'discard' branch here?

> +		git reflog expire --all --expire=all &&
> +
> +		# mark cruft.old with an mtime that is many minutes
> +		# older than the expiration period, and mark cruft.new
> +		# with an mtime that is in the future (and thus not
> +		# eligible for pruning).
> +		test-tool chmtime -2000 "$objdir/$(test_oid_to_path $cruft_old)" &&
> +		test-tool chmtime +1000 "$objdir/$(test_oid_to_path $cruft_new)" &&
> +
> +		# Write the list of cruft objects we expect to
> +		# accumulate, which is comprised of everything reachable
> +		# from cruft.old and cruft.new, but not cruft.discard.
> +		git rev-list --objects --no-object-names \
> +			$cruft_old $cruft_new >cruft.raw &&
> +		sort cruft.raw >cruft.expect &&

Ok, here you list the exact object set you are expecting post-prune.

> +		# Write the script to list extra tips, which are limited
> +		# to cruft.old, in this case.
> +		write_script extra-tips <<-EOF &&
> +		echo $cruft_old
> +		EOF
> +		git config pack.extraCruftTips ./extra-tips &&
> +
> +		git repack --cruft --cruft-expiration=now -d &&

So, at this point, it seems we are not including the 'discard'
objects in the cruft pack because they are still reachable. Or,
am I reading that incorrectly?

> +		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
> +		git show-index <${mtimes%.mtimes}.idx >cruft &&
> +		cut -d" " -f2 cruft | sort >cruft.actual &&
> +		test_cmp cruft.expect cruft.actual

One thing that would be good, as a safety check, is to remove
the pack.extraCruftTips config and re-run the repack and be
sure that we are pruning the objects previously saved by the
hook.

> +	)
> +'
> +
> +test_expect_success 'additional cruft blobs via pack.extraCruftTips' '
> +	git init repo &&
> +	test_when_finished "rm -fr repo" &&
> +	(
> +		cd repo &&
> +
> +		test_commit base &&
> +
> +		blob=$(echo "unreachable" | git hash-object -w --stdin) &&
> +
> +		# mark the unreachable blob we wrote above as having
> +		# aged out of the retention period
> +		test-tool chmtime -2000 "$objdir/$(test_oid_to_path $blob)" &&
> +
> +		# Write the script to list extra tips, which is just the
> +		# extra blob as above.
> +		write_script extra-tips <<-EOF &&
> +		echo $blob
> +		EOF
> +		git config pack.extraCruftTips ./extra-tips &&
> +
> +		git repack --cruft --cruft-expiration=now -d &&
> +
> +		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
> +		git show-index <${mtimes%.mtimes}.idx >cruft &&
> +		cut -d" " -f2 cruft >actual &&
> +		echo $blob >expect &&
> +		test_cmp expect actual
> +	)
> +'

It would be helpful to include a test for the multi-value case where
multiple scripts are executed and maybe include an "exit 1" in some
of them?

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] builtin/pack-objects.c: introduce `pack.extraCruftTips`
  2023-04-25 19:42 ` Derrick Stolee
@ 2023-04-25 21:25   ` Taylor Blau
  2023-04-26 10:52     ` Derrick Stolee
  0 siblings, 1 reply; 31+ messages in thread
From: Taylor Blau @ 2023-04-25 21:25 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, Jeff King, Junio C Hamano

On Tue, Apr 25, 2023 at 03:42:51PM -0400, Derrick Stolee wrote:
> On 4/20/2023 1:27 PM, Taylor Blau wrote:
> > This patch introduces a new multi-valued configuration option,
> > `pack.extraCruftTips` as an alternative means to mark certain objects in
> > the cruft pack as rescued, even if they would otherwise be pruned.
>
> > However, there is no option to be able to keep around certain objects
> > that have otherwise aged out of the grace period. The only way to retain
> > those objects is:
> >
> >   - to point a reference at them, which may be undesirable or
> >     infeasible,
> >   - to extend the grace period, which may keep around other objects that
> >     the caller *does* want to discard,
> >   - or, to force the caller to construct the pack of objects they want
> >     to keep themselves, and then mark the pack as kept by adding a
> >     ".keep" file.
> >
> > This patch introduces a new configuration, `pack.extraCruftTips` which
> > allows the caller to specify a program (or set of programs) whose output
> > is treated as a set of objects to treat as "kept", even if they are
> > unreachable and have aged out of the retention period.
>
> One way to think about this is that we want to store a second set of
> references to objects we do not want to prune, but those references
> are not useful for _any other Git operation_, so polluting the ref
> space would be too expensive. And reflogs are insufficient because
> deleted references also delete their reflogs.

Well put. I think that reflogs only stick around as long as their
references do is the key point and is under-emphasized in my original patch.

At the risk of repeating what you already said: yes, we (a) want to keep
around a second set of objects, but (b) don't want to point references
at them, since they are useless to all other operations as well as
expensive to track, and (c) don't want to use reflogs since their
lifetime is tied to the reference they're tracking.

> > The implementation is straightforward: after determining the set of
> > objects to pack into the cruft pack (either by calling
> > `enumerate_cruft_objects()` or `enumerate_and_traverse_cruft_objects()`)
> > we figure out if there are any program(s) we need to consult in order to
> > determine if there are any objects which we need to "rescue". We then
> > add those as tips to another reachability traversal, marking every
> > object along the way as cruft (and thus retaining it in the cruft pack).
>
> I initially wondered why we would need a second walk, and I _think_
> one reason is that we don't want to update the mtimes for these
> objects as if they were reachable. The next point about the reachable
> pack is also critical.

Their mtimes might be updated during that second walk, but the key point
is that we want to rescue any objects that the extra tips might reach
(but aren't listed themselves in the output of one of the hooks).

The idea there is similar to the rescuing pass we do for pruning GC's,
where we'll save any objects which would have been pruned, but are
reachable from another object which won't yet be pruned.

> > Instead, keep these unreachable objects in the cruft pack instead, to
> > ensure that we can continue to have a pack containing just reachable
> > objects, which is always safe to write a bitmap over.
>
> Makes sense. Also: don't update their mtime so they could be removed
> immediately if the external pointers change.

I don't think I'm quite following why we would want to leave their
mtimes alone here. I think we'd want their mtimes to be accurate as of
the last time we updated the *.mtimes file so that if they (a) no longer
appear in the output of one of the extra-cruft hooks, and (b) they have
been written recently, that they would not be pruned immediately.

> > +pack.extraCruftTips::
> > +	When generating a cruft pack, use the shell to execute the
> > +	specified command(s), and interpret their output as additional
> > +	tips of objects to keep in the cruft pack, regardless of their
> > +	age.
> > ++
> > +Output must contain exactly one hex object ID per line, and nothing
> > +else. Objects which cannot be found in the repository are ignored.
> > +
>
> This only indicates it is multi-valued by the word "command(s)". It
> also doesn't specify that we execute them all or stop when one
> completes with success. (see code)

> > +	return ret;
> > +}
> > +
> > +static int enumerate_extra_cruft_tips(void)
> > +{
> > +	struct rev_info revs;
> > +	const struct string_list *programs;
> > +	int ret = 0;
> > +	size_t i;
> > +
> > +	if (git_config_get_string_multi("pack.extracrufttips", &programs))
> > +		return ret;
> > +
> > +	repo_init_revisions(the_repository, &revs, NULL);
> > +
> > +	revs.tag_objects = 1;
> > +	revs.tree_objects = 1;
> > +	revs.blob_objects = 1;
> > +
> > +	revs.include_check = cruft_include_check;
> > +	revs.include_check_obj = cruft_include_check_obj;
> > +
> > +	revs.ignore_missing_links = 1;
> > +
> > +	if (progress)
> > +		progress_state = start_progress(_("Enumerating extra cruft tips"), 0);
>
> Should we be including two progress counts? or would it be sufficient
> to say "traversing extra cruft objects" and include both of these steps
> in that one progress mode?

I may be missing something, but there are already two progress meters
here: one (above) for enumerating the list of extra cruft tips, which
increments each time we read a new line that refers to an extant object,
and another (below) for the number of objects that we visit along the
second walk.

> > +	nr_seen = 0;
> > +	for (i = 0; i < programs->nr; i++) {
> > +		ret = enumerate_extra_cruft_tips_1(&revs,
> > +						   programs->items[i].string);
> > +		if (!ret)
> > +			break;
> > +	}
> > +	stop_progress(&progress_state);
> > +
> > +	if (ret)
> > +		die(_("unable to enumerate additional cruft tips"));
>
> If we are going to die() here, we might as well do it in the loop
> instead of the break. But wait... this seems backwards.
>
> I suppose what this is saying is: try each possible hook until we
> get a success (in which case we stop trying). If we have a failure
> in all attempts, then we die().
>
> I was expecting this to be a cumulative operation: we execute each
> hook in order and include all of their output. This implementation
> doesn't seem to match that (or I'm reading something wrong).

Yeah, this is definitely a bug, and the if-statement should read "if
(ret)" not "if (!ret)". We should ideally consult all of the hooks here,
but if any one of them fails, we should stop and refuse to generate a
cruft pack.

> > +	if (prepare_revision_walk(&revs))
> > +		die(_("revision walk setup failed"));
> > +	if (progress)
> > +		progress_state = start_progress(_("Traversing extra cruft objects"), 0);
> > +	nr_seen = 0;
> > +	traverse_commit_list(&revs, show_cruft_commit, show_cruft_object, NULL);
>
> Hm. I'm trying to look at these helpers and figure out what they
> are doing to prevent these objects from being pruned. This could
> use a comment or something, as I'm unable to grok it at present.

Very fair, the show_cruft_object() callback calls
add_cruft_object_entry(), which adds the visited object to the cruft
pack regardless of its mtime with respect to pruning.

A comment here might look like:

    /* retain all objects reachable from extra tips via show_cruft_object() */

or something.

> Could we store this commit to make sure it is removed from the
> repository later?

Possibly, though I think we already have good coverage of this in other
tests. So I'd be equally happy to assert on the exact contents of the
cruft pack (which wouldn't include the above commit), but I'm happy to
assert on it more directly if you feel strongly.

> > +		git checkout --orphan old &&
> > +		git rm -fr . &&
> > +		test_commit --no-tag cruft.old &&
> > +		cruft_old="$(git rev-parse HEAD)" &&
> > +
> > +		git checkout --orphan new &&
> > +		git rm -fr . &&
> > +		test_commit --no-tag cruft.new &&
> > +		cruft_new="$(git rev-parse HEAD)" &&
> > +
> > +		git checkout main &&
> > +		git branch -D old new &&
>
> Do you need to delete the 'discard' branch here?

Great catch. I didn't notice it here because even though it was
reachable (and not mentioned as an extra tip), so didn't appear in the
cruft pack.

> > +		# Write the script to list extra tips, which are limited
> > +		# to cruft.old, in this case.
> > +		write_script extra-tips <<-EOF &&
> > +		echo $cruft_old
> > +		EOF
> > +		git config pack.extraCruftTips ./extra-tips &&
> > +
> > +		git repack --cruft --cruft-expiration=now -d &&
>
> So, at this point, it seems we are not including the 'discard'
> objects in the cruft pack because they are still reachable. Or,
> am I reading that incorrectly?

My mistake, see above.

> > +		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
> > +		git show-index <${mtimes%.mtimes}.idx >cruft &&
> > +		cut -d" " -f2 cruft | sort >cruft.actual &&
> > +		test_cmp cruft.expect cruft.actual
>
> One thing that would be good, as a safety check, is to remove
> the pack.extraCruftTips config and re-run the repack and be
> sure that we are pruning the objects previously saved by the
> hook.

Good call. I think this amounts to:

--- 8< ---
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index 1260e11d41..44852e6925 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -773,7 +773,7 @@ test_expect_success 'additional cruft tips may be specified via pack.extraCruftT
 		cruft_new="$(git rev-parse HEAD)" &&

 		git checkout main &&
-		git branch -D old new &&
+		git branch -D discard old new &&
 		git reflog expire --all --expire=all &&

 		# mark cruft.old with an mtime that is many minutes
@@ -802,6 +802,19 @@ test_expect_success 'additional cruft tips may be specified via pack.extraCruftT
 		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
 		git show-index <${mtimes%.mtimes}.idx >cruft &&
 		cut -d" " -f2 cruft | sort >cruft.actual &&
+		test_cmp cruft.expect cruft.actual &&
+
+		# Ensure that the "old" objects are removed after
+		# dropping the pack.extraCruftTips hook.
+		git config --unset pack.extraCruftTips &&
+		git repack --cruft --cruft-expiration=now -d &&
+
+		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
+		git show-index <${mtimes%.mtimes}.idx >cruft &&
+		cut -d" " -f2 cruft | sort >cruft.actual &&
+
+		git rev-list --objects --no-object-names $cruft_new >cruft.raw &&
+		sort cruft.raw >cruft.expect &&
 		test_cmp cruft.expect cruft.actual
 	)
 '
--- >8 ---

> It would be helpful to include a test for the multi-value case where
> multiple scripts are executed and maybe include an "exit 1" in some
> of them?

Definitely, that is a great suggestion (especially because it would have
caught the "if (!ret)" bug that you pointed out above).

Thanks for a thorough review. I'll wait for other feedback before
re-rolling, or do so in a couple of days if I haven't heard anything.

Thanks,
Taylor

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH] builtin/pack-objects.c: introduce `pack.extraCruftTips`
  2023-04-25 21:25   ` Taylor Blau
@ 2023-04-26 10:52     ` Derrick Stolee
  2023-05-03  0:06       ` Taylor Blau
  0 siblings, 1 reply; 31+ messages in thread
From: Derrick Stolee @ 2023-04-26 10:52 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano

On 4/25/2023 5:25 PM, Taylor Blau wrote:
> On Tue, Apr 25, 2023 at 03:42:51PM -0400, Derrick Stolee wrote:
>> On 4/20/2023 1:27 PM, Taylor Blau wrote:

>>> The implementation is straightforward: after determining the set of
>>> objects to pack into the cruft pack (either by calling
>>> `enumerate_cruft_objects()` or `enumerate_and_traverse_cruft_objects()`)
>>> we figure out if there are any program(s) we need to consult in order to
>>> determine if there are any objects which we need to "rescue". We then
>>> add those as tips to another reachability traversal, marking every
>>> object along the way as cruft (and thus retaining it in the cruft pack).
>>
>> I initially wondered why we would need a second walk, and I _think_
>> one reason is that we don't want to update the mtimes for these
>> objects as if they were reachable. The next point about the reachable
>> pack is also critical.
> 
> Their mtimes might be updated during that second walk, but the key point
> is that we want to rescue any objects that the extra tips might reach
> (but aren't listed themselves in the output of one of the hooks).
> 
> The idea there is similar to the rescuing pass we do for pruning GC's,
> where we'll save any objects which would have been pruned, but are
> reachable from another object which won't yet be pruned.
> 
>>> Instead, keep these unreachable objects in the cruft pack instead, to
>>> ensure that we can continue to have a pack containing just reachable
>>> objects, which is always safe to write a bitmap over.
>>
>> Makes sense. Also: don't update their mtime so they could be removed
>> immediately if the external pointers change.
> 
> I don't think I'm quite following why we would want to leave their
> mtimes alone here. I think we'd want their mtimes to be accurate as of
> the last time we updated the *.mtimes file so that if they (a) no longer
> appear in the output of one of the extra-cruft hooks, and (b) they have
> been written recently, that they would not be pruned immediately.

You corrected me that we _are_ updating mtimes, so my understanding was
incorrect and this follow-up is incorrect. Thanks.

Is it possible to demonstrate this mtime-updating in a test?

>>> +	if (progress)
>>> +		progress_state = start_progress(_("Enumerating extra cruft tips"), 0);
>>
>> Should we be including two progress counts? or would it be sufficient
>> to say "traversing extra cruft objects" and include both of these steps
>> in that one progress mode?
> 
> I may be missing something, but there are already two progress meters
> here: one (above) for enumerating the list of extra cruft tips, which
> increments each time we read a new line that refers to an extant object,
> and another (below) for the number of objects that we visit along the
> second walk.

I meant: why two counts? Should we instead only have one?

>>> +	if (prepare_revision_walk(&revs))
>>> +		die(_("revision walk setup failed"));
>>> +	if (progress)
>>> +		progress_state = start_progress(_("Traversing extra cruft objects"), 0);
>>> +	nr_seen = 0;
>>> +	traverse_commit_list(&revs, show_cruft_commit, show_cruft_object, NULL);
>>
>> Hm. I'm trying to look at these helpers and figure out what they
>> are doing to prevent these objects from being pruned. This could
>> use a comment or something, as I'm unable to grok it at present.
> 
> Very fair, the show_cruft_object() callback calls
> add_cruft_object_entry(), which adds the visited object to the cruft
> pack regardless of its mtime with respect to pruning.

Ah, thanks. That helps a lot.
 
> A comment here might look like:
> 
>     /* retain all objects reachable from extra tips via show_cruft_object() */
> 
> or something.

Sounds good.

>> Could we store this commit to make sure it is removed from the
>> repository later?
> 
> Possibly, though I think we already have good coverage of this in other
> tests. So I'd be equally happy to assert on the exact contents of the
> cruft pack (which wouldn't include the above commit), but I'm happy to
> assert on it more directly if you feel strongly.

This is a case of me thinking "can we test the test?" to make sure
the test is doing everything we think it is doing...

>>> +		git checkout --orphan old &&
>>> +		git rm -fr . &&
>>> +		test_commit --no-tag cruft.old &&
>>> +		cruft_old="$(git rev-parse HEAD)" &&
>>> +
>>> +		git checkout --orphan new &&
>>> +		git rm -fr . &&
>>> +		test_commit --no-tag cruft.new &&
>>> +		cruft_new="$(git rev-parse HEAD)" &&
>>> +
>>> +		git checkout main &&
>>> +		git branch -D old new &&
>>
>> Do you need to delete the 'discard' branch here?
> 
> Great catch. I didn't notice it here because even though it was
> reachable (and not mentioned as an extra tip), so didn't appear in the
> cruft pack.

...so we catch things like this automatically.

>>> +		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
>>> +		git show-index <${mtimes%.mtimes}.idx >cruft &&
>>> +		cut -d" " -f2 cruft | sort >cruft.actual &&
>>> +		test_cmp cruft.expect cruft.actual
>>
>> One thing that would be good, as a safety check, is to remove
>> the pack.extraCruftTips config and re-run the repack and be
>> sure that we are pruning the objects previously saved by the
>> hook.
> 
> Good call. I think this amounts to:
> 
> --- 8< ---
> diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
> index 1260e11d41..44852e6925 100755
> --- a/t/t5329-pack-objects-cruft.sh
> +++ b/t/t5329-pack-objects-cruft.sh
> @@ -773,7 +773,7 @@ test_expect_success 'additional cruft tips may be specified via pack.extraCruftT
>  		cruft_new="$(git rev-parse HEAD)" &&
> 
>  		git checkout main &&
> -		git branch -D old new &&
> +		git branch -D discard old new &&
>  		git reflog expire --all --expire=all &&
> 
>  		# mark cruft.old with an mtime that is many minutes
> @@ -802,6 +802,19 @@ test_expect_success 'additional cruft tips may be specified via pack.extraCruftT
>  		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
>  		git show-index <${mtimes%.mtimes}.idx >cruft &&
>  		cut -d" " -f2 cruft | sort >cruft.actual &&
> +		test_cmp cruft.expect cruft.actual &&

If we store "discard=$(git rev-parse discard)" earlier, we can also
check

		test_must_fail git show $discard &&

> +		# Ensure that the "old" objects are removed after
> +		# dropping the pack.extraCruftTips hook.
> +		git config --unset pack.extraCruftTips &&
> +		git repack --cruft --cruft-expiration=now -d &&
> +
> +		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
> +		git show-index <${mtimes%.mtimes}.idx >cruft &&
> +		cut -d" " -f2 cruft | sort >cruft.actual &&
> +
> +		git rev-list --objects --no-object-names $cruft_new >cruft.raw &&

I like how $cruft_new is still kept because its mtime is still
"later than now".

> +		sort cruft.raw >cruft.expect &&
>  		test_cmp cruft.expect cruft.actual
>  	)
>  '
> --- >8 ---
> 
>> It would be helpful to include a test for the multi-value case where
>> multiple scripts are executed and maybe include an "exit 1" in some
>> of them?
> 
> Definitely, that is a great suggestion (especially because it would have
> caught the "if (!ret)" bug that you pointed out above).
> 
> Thanks for a thorough review. I'll wait for other feedback before
> re-rolling, or do so in a couple of days if I haven't heard anything.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] builtin/pack-objects.c: introduce `pack.extraCruftTips`
  2023-04-26 10:52     ` Derrick Stolee
@ 2023-05-03  0:06       ` Taylor Blau
  0 siblings, 0 replies; 31+ messages in thread
From: Taylor Blau @ 2023-05-03  0:06 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, Jeff King, Junio C Hamano

On Wed, Apr 26, 2023 at 06:52:26AM -0400, Derrick Stolee wrote:
> You corrected me that we _are_ updating mtimes, so my understanding was
> incorrect and this follow-up is incorrect. Thanks.
>
> Is it possible to demonstrate this mtime-updating in a test?

It is, though I think that this might be out of scope for the
`pack.extraCruftTips` feature, and instead is more directly related to
the core machinery behind cruft packs. We have fairly thorough tests
throughout t5329 that cover this behavior implicitly, but let me know if
you think there is anything missing there.

> >>> +	if (progress)
> >>> +		progress_state = start_progress(_("Enumerating extra cruft tips"), 0);
> >>
> >> Should we be including two progress counts? or would it be sufficient
> >> to say "traversing extra cruft objects" and include both of these steps
> >> in that one progress mode?
> >
> > I may be missing something, but there are already two progress meters
> > here: one (above) for enumerating the list of extra cruft tips, which
> > increments each time we read a new line that refers to an extant object,
> > and another (below) for the number of objects that we visit along the
> > second walk.
>
> I meant: why two counts? Should we instead only have one?

Ah, thanks for helping my understanding. We have two counts to match the
behavior when generating cruft packs with pruning, where one counter
tracks the number of tips we add to the traversal, and the second
counter tracks the number of objects that we actually visit along the
traversal (and thus end up in the cruft pack).

Since we have to do an identical traversal here to ensure reachability
closure (where possible) over the extra cruft tips, I added a second
counter to track that progress.

> >> Could we store this commit to make sure it is removed from the
> >> repository later?
> >
> > Possibly, though I think we already have good coverage of this in other
> > tests. So I'd be equally happy to assert on the exact contents of the
> > cruft pack (which wouldn't include the above commit), but I'm happy to
> > assert on it more directly if you feel strongly.
>
> This is a case of me thinking "can we test the test?" to make sure
> the test is doing everything we think it is doing...

Good call, and doing so isn't too bad, either:

--- 8< ---
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index 0ac2f515b8..3ad16a9fca 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -814,8 +814,16 @@ test_expect_success 'additional cruft tips may be specified via pack.extraCruftT
 		cut -d" " -f2 cruft | sort >cruft.actual &&

 		git rev-list --objects --no-object-names $cruft_new >cruft.raw &&
+		cp cruft.expect cruft.old &&
 		sort cruft.raw >cruft.expect &&
-		test_cmp cruft.expect cruft.actual
+		test_cmp cruft.expect cruft.actual &&
+
+		# ensure objects which are no longer in the cruft pack were
+		# removed from the repository
+		for object in $(comm -13 cruft.expect cruft.old)
+		do
+			test_must_fail git cat-file -t $object || return 1
+		done
 	)
 '
--- >8 ---

> > +		# Ensure that the "old" objects are removed after
> > +		# dropping the pack.extraCruftTips hook.
> > +		git config --unset pack.extraCruftTips &&
> > +		git repack --cruft --cruft-expiration=now -d &&
> > +
> > +		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
> > +		git show-index <${mtimes%.mtimes}.idx >cruft &&
> > +		cut -d" " -f2 cruft | sort >cruft.actual &&
> > +
> > +		git rev-list --objects --no-object-names $cruft_new >cruft.raw &&
>
> I like how $cruft_new is still kept because its mtime is still
> "later than now".

Thanks. I added this object to make sure that we weren't testing trivial
behavior, such as removing all objects regardless of whether or not they
appear as extra cruft tips.

Thanks,
Taylor

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2] builtin/pack-objects.c: introduce `pack.extraCruftTips`
  2023-04-20 17:27 [PATCH] builtin/pack-objects.c: introduce `pack.extraCruftTips` Taylor Blau
                   ` (2 preceding siblings ...)
  2023-04-25 19:42 ` Derrick Stolee
@ 2023-05-03  0:09 ` Taylor Blau
  2023-05-03 14:01   ` Derrick Stolee
  2023-05-03 19:59   ` Jeff King
  2023-05-03 22:05 ` [PATCH v3] " Taylor Blau
  4 siblings, 2 replies; 31+ messages in thread
From: Taylor Blau @ 2023-05-03  0:09 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Derrick Stolee

This patch introduces a new multi-valued configuration option,
`pack.extraCruftTips` as an alternative means to mark certain objects in
the cruft pack as rescued, even if they would otherwise be pruned.

Traditionally, cruft packs are generated in one of two ways:

  - When not pruning, all packed and loose objects which do not appear
    in the any of the packs which wil remain (as indicated over stdin,
    with `--cruft`) are packed separately into a cruft pack.

  - When pruning, the above process happens, with two additional tweaks.
    Instead of adding every object into the cruft pack, only objects
    which have mtime that is within the grace period are kept. In
    addition, a "rescuing" traversal is performed over the remaining
    cruft objects to keep around cruft objects which would have aged out
    of the repository, but are reachable from other cruft objects which
    have not yet aged out.

This works well for repositories which have all of their "interesting"
objects worth keeping around reachable via at least one reference.

However, there is no option to be able to keep around certain objects
that have otherwise aged out of the grace period. The only way to retain
those objects is:

  - to point a reference at them, which may be undesirable or
    infeasible,
  - to track them via the reflog, which may be undesirable since the
    reflog's lifetime is limited to that of the reference it's tracking
    (and callers may want to keep those unreachable objects around for
    longer)
  - to extend the grace period, which may keep around other objects that
    the caller *does* want to discard,
  - or, to force the caller to construct the pack of objects they want
    to keep themselves, and then mark the pack as kept by adding a
    ".keep" file.

This patch introduces a new configuration, `pack.extraCruftTips` which
allows the caller to specify a program (or set of programs) whose output
is treated as a set of objects to treat as "kept", even if they are
unreachable and have aged out of the retention period.

The implementation is straightforward: after determining the set of
objects to pack into the cruft pack (either by calling
`enumerate_cruft_objects()` or `enumerate_and_traverse_cruft_objects()`)
we figure out if there are any program(s) we need to consult in order to
determine if there are any objects which we need to "rescue". We then
add those as tips to another reachability traversal, marking every
object along the way as cruft (and thus retaining it in the cruft pack).

A potential alternative here is to introduce a new mode to alter the
contents of the reachable pack instead of the cruft one. One could
imagine a new option to `pack-objects`, say `--extra-tips` that does the
same thing as above, adding the visited set of objects along the
traversal to the pack.

But this has the unfortunate side-effect of altering the reachability
closure of that pack. If parts of the unreachable object graph mentioned
by one or more of the "extra tips" programs is not closed, then the
resulting pack won't be either. This makes it impossible in the general
case to write out reachability bitmaps for that pack, since closure is a
requirement there.

Instead, keep these unreachable objects in the cruft pack instead, to
ensure that we can continue to have a pack containing just reachable
objects, which is always safe to write a bitmap over.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
Range-diff against v1:
1:  8af478ebe3 ! 1:  73ad7b90e1 builtin/pack-objects.c: introduce `pack.extraCruftTips`
    @@ Commit message
     
           - to point a reference at them, which may be undesirable or
             infeasible,
    +      - to track them via the reflog, which may be undesirable since the
    +        reflog's lifetime is limited to that of the reference it's tracking
    +        (and callers may want to keep those unreachable objects around for
    +        longer)
           - to extend the grace period, which may keep around other objects that
             the caller *does* want to discard,
           - or, to force the caller to construct the pack of objects they want
    @@ Documentation/config/pack.txt: pack.deltaCacheLimit::
     ++
     +Output must contain exactly one hex object ID per line, and nothing
     +else. Objects which cannot be found in the repository are ignored.
    ++Multiple hooks are supported, but all must exit successfully, else no
    ++cruft pack will be generated.
     +
      pack.threads::
      	Specifies the number of threads to spawn when searching for best
    @@ builtin/pack-objects.c: static void enumerate_and_traverse_cruft_objects(struct
     +		add_pending_object(revs, obj, "");
     +	}
     +
    ++	ret = finish_command(&cmd);
    ++
     +done:
     +	if (out)
     +		fclose(out);
    @@ builtin/pack-objects.c: static void enumerate_and_traverse_cruft_objects(struct
     +	for (i = 0; i < programs->nr; i++) {
     +		ret = enumerate_extra_cruft_tips_1(&revs,
     +						   programs->items[i].string);
    -+		if (!ret)
    ++		if (ret)
     +			break;
     +	}
     +	stop_progress(&progress_state);
    @@ builtin/pack-objects.c: static void enumerate_and_traverse_cruft_objects(struct
     +	if (progress)
     +		progress_state = start_progress(_("Traversing extra cruft objects"), 0);
     +	nr_seen = 0;
    ++
    ++	/*
    ++	 * Retain all objects reachable from extra tips via
    ++	 * show_cruft_commit(), and show_cruft_object(), regardless of
    ++	 * their mtime.
    ++	 */
     +	traverse_commit_list(&revs, show_cruft_commit, show_cruft_object, NULL);
     +	stop_progress(&progress_state);
     +
    @@ t/t5329-pack-objects-cruft.sh: test_expect_success 'cruft objects are freshend v
     +		cruft_new="$(git rev-parse HEAD)" &&
     +
     +		git checkout main &&
    -+		git branch -D old new &&
    ++		git branch -D discard old new &&
     +		git reflog expire --all --expire=all &&
     +
     +		# mark cruft.old with an mtime that is many minutes
    @@ t/t5329-pack-objects-cruft.sh: test_expect_success 'cruft objects are freshend v
     +		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
     +		git show-index <${mtimes%.mtimes}.idx >cruft &&
     +		cut -d" " -f2 cruft | sort >cruft.actual &&
    -+		test_cmp cruft.expect cruft.actual
    ++		test_cmp cruft.expect cruft.actual &&
    ++
    ++		# Ensure that the "old" objects are removed after
    ++		# dropping the pack.extraCruftTips hook.
    ++		git config --unset pack.extraCruftTips &&
    ++		git repack --cruft --cruft-expiration=now -d &&
    ++
    ++		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
    ++		git show-index <${mtimes%.mtimes}.idx >cruft &&
    ++		cut -d" " -f2 cruft | sort >cruft.actual &&
    ++
    ++		git rev-list --objects --no-object-names $cruft_new >cruft.raw &&
    ++		cp cruft.expect cruft.old &&
    ++		sort cruft.raw >cruft.expect &&
    ++		test_cmp cruft.expect cruft.actual &&
    ++
    ++		# ensure objects which are no longer in the cruft pack were
    ++		# removed from the repository
    ++		for object in $(comm -13 cruft.expect cruft.old)
    ++		do
    ++			test_must_fail git cat-file -t $object || return 1
    ++		done
    ++	)
    ++'
    ++
    ++test_expect_success 'multi-valued pack.extraCruftTips' '
    ++	git init repo &&
    ++	test_when_finished "rm -fr repo" &&
    ++	(
    ++		cd repo &&
    ++
    ++		test_commit base &&
    ++		git branch -M main &&
    ++
    ++		git checkout --orphan cruft.a &&
    ++		git rm -fr . &&
    ++		test_commit --no-tag cruft.a &&
    ++		cruft_a="$(git rev-parse HEAD)" &&
    ++
    ++		git checkout --orphan cruft.b &&
    ++		git rm -fr . &&
    ++		test_commit --no-tag cruft.b &&
    ++		cruft_b="$(git rev-parse HEAD)" &&
    ++
    ++		git checkout main &&
    ++		git branch -D cruft.a cruft.b &&
    ++		git reflog expire --all --expire=all &&
    ++
    ++		echo "echo $cruft_a" | write_script extra-tips.a &&
    ++		echo "echo $cruft_b" | write_script extra-tips.b &&
    ++		echo "false" | write_script extra-tips.c &&
    ++
    ++		git rev-list --objects --no-object-names $cruft_a $cruft_b \
    ++			>cruft.raw &&
    ++		sort cruft.raw >cruft.expect &&
    ++
    ++		# ensure that each extra cruft tip is saved by its
    ++		# respective hook
    ++		git config --add pack.extraCruftTips ./extra-tips.a &&
    ++		git config --add pack.extraCruftTips ./extra-tips.b &&
    ++		git repack --cruft --cruft-expiration=now -d &&
    ++
    ++		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
    ++		git show-index <${mtimes%.mtimes}.idx >cruft &&
    ++		cut -d" " -f2 cruft | sort >cruft.actual &&
    ++		test_cmp cruft.expect cruft.actual &&
    ++
    ++		# ensure that a dirty exit halts cruft pack generation
    ++		git config --add pack.extraCruftTips ./extra-tips.c &&
    ++		test_must_fail git repack --cruft -d 2>err &&
    ++		grep "unable to enumerate additional cruft tips" err &&
    ++
    ++		# and that the existing cruft pack is left alone
    ++		test_path_is_file "$mtimes"
     +	)
     +'
     +

 Documentation/config/pack.txt |  11 +++
 builtin/pack-objects.c        | 103 ++++++++++++++++++++
 t/t5329-pack-objects-cruft.sh | 171 ++++++++++++++++++++++++++++++++++
 3 files changed, 285 insertions(+)

diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index d4c7c9d4e4..0b79bd1751 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -67,6 +67,17 @@ pack.deltaCacheLimit::
 	result once the best match for all objects is found.
 	Defaults to 1000. Maximum value is 65535.
 
+pack.extraCruftTips::
+	When generating a cruft pack, use the shell to execute the
+	specified command(s), and interpret their output as additional
+	tips of objects to keep in the cruft pack, regardless of their
+	age.
++
+Output must contain exactly one hex object ID per line, and nothing
+else. Objects which cannot be found in the repository are ignored.
+Multiple hooks are supported, but all must exit successfully, else no
+cruft pack will be generated.
+
 pack.threads::
 	Specifies the number of threads to spawn when searching for best
 	delta matches.  This requires that linkgit:git-pack-objects[1]
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index a5b466839b..79a29796fe 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -44,6 +44,7 @@
 #include "pack-mtimes.h"
 #include "parse-options.h"
 #include "wrapper.h"
+#include "run-command.h"
 
 /*
  * Objects we are going to pack are collected in the `to_pack` structure.
@@ -3587,6 +3588,106 @@ static void enumerate_and_traverse_cruft_objects(struct string_list *fresh_packs
 	stop_progress(&progress_state);
 }
 
+static int enumerate_extra_cruft_tips_1(struct rev_info *revs, const char *args)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct strbuf buf = STRBUF_INIT;
+	FILE *out = NULL;
+	int ret = 0;
+
+	cmd.use_shell = 1;
+	cmd.out = -1;
+
+	strvec_push(&cmd.args, args);
+	strvec_pushv(&cmd.env, (const char **)local_repo_env);
+
+	if (start_command(&cmd)) {
+		ret = -1;
+		goto done;
+	}
+
+	out = xfdopen(cmd.out, "r");
+	while (strbuf_getline_lf(&buf, out) != EOF) {
+		struct object_id oid;
+		struct object *obj;
+		const char *rest;
+
+		if (parse_oid_hex(buf.buf, &oid, &rest) || *rest) {
+			ret = error(_("invalid extra cruft tip: '%s'"), buf.buf);
+			goto done;
+		}
+
+		obj = parse_object(the_repository, &oid);
+		if (!obj)
+			continue;
+
+		display_progress(progress_state, ++nr_seen);
+		add_pending_object(revs, obj, "");
+	}
+
+	ret = finish_command(&cmd);
+
+done:
+	if (out)
+		fclose(out);
+	strbuf_release(&buf);
+	child_process_clear(&cmd);
+
+	return ret;
+}
+
+static int enumerate_extra_cruft_tips(void)
+{
+	struct rev_info revs;
+	const struct string_list *programs;
+	int ret = 0;
+	size_t i;
+
+	if (git_config_get_string_multi("pack.extracrufttips", &programs))
+		return ret;
+
+	repo_init_revisions(the_repository, &revs, NULL);
+
+	revs.tag_objects = 1;
+	revs.tree_objects = 1;
+	revs.blob_objects = 1;
+
+	revs.include_check = cruft_include_check;
+	revs.include_check_obj = cruft_include_check_obj;
+
+	revs.ignore_missing_links = 1;
+
+	if (progress)
+		progress_state = start_progress(_("Enumerating extra cruft tips"), 0);
+	nr_seen = 0;
+	for (i = 0; i < programs->nr; i++) {
+		ret = enumerate_extra_cruft_tips_1(&revs,
+						   programs->items[i].string);
+		if (ret)
+			break;
+	}
+	stop_progress(&progress_state);
+
+	if (ret)
+		die(_("unable to enumerate additional cruft tips"));
+
+	if (prepare_revision_walk(&revs))
+		die(_("revision walk setup failed"));
+	if (progress)
+		progress_state = start_progress(_("Traversing extra cruft objects"), 0);
+	nr_seen = 0;
+
+	/*
+	 * Retain all objects reachable from extra tips via
+	 * show_cruft_commit(), and show_cruft_object(), regardless of
+	 * their mtime.
+	 */
+	traverse_commit_list(&revs, show_cruft_commit, show_cruft_object, NULL);
+	stop_progress(&progress_state);
+
+	return ret;
+}
+
 static void read_cruft_objects(void)
 {
 	struct strbuf buf = STRBUF_INIT;
@@ -3642,6 +3743,8 @@ static void read_cruft_objects(void)
 	else
 		enumerate_cruft_objects();
 
+	enumerate_extra_cruft_tips();
+
 	strbuf_release(&buf);
 	string_list_clear(&discard_packs, 0);
 	string_list_clear(&fresh_packs, 0);
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index 303f7a5d84..3ad16a9fca 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -739,4 +739,175 @@ test_expect_success 'cruft objects are freshend via loose' '
 	)
 '
 
+test_expect_success 'additional cruft tips may be specified via pack.extraCruftTips' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		# Create a handful of objects.
+		#
+		#   - one reachable commit, "base", designated for the reachable
+		#     pack
+		#   - one unreachable commit, "cruft.discard", which is marked
+		#     for deletion
+		#   - one unreachable commit, "cruft.old", which would be marked
+		#     for deletion, but is rescued as an extra cruft tip
+		#   - one unreachable commit, "cruft.new", which is not marked
+		#     for deletion
+		test_commit base &&
+		git branch -M main &&
+
+		git checkout --orphan discard &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.discard &&
+
+		git checkout --orphan old &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.old &&
+		cruft_old="$(git rev-parse HEAD)" &&
+
+		git checkout --orphan new &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.new &&
+		cruft_new="$(git rev-parse HEAD)" &&
+
+		git checkout main &&
+		git branch -D discard old new &&
+		git reflog expire --all --expire=all &&
+
+		# mark cruft.old with an mtime that is many minutes
+		# older than the expiration period, and mark cruft.new
+		# with an mtime that is in the future (and thus not
+		# eligible for pruning).
+		test-tool chmtime -2000 "$objdir/$(test_oid_to_path $cruft_old)" &&
+		test-tool chmtime +1000 "$objdir/$(test_oid_to_path $cruft_new)" &&
+
+		# Write the list of cruft objects we expect to
+		# accumulate, which is comprised of everything reachable
+		# from cruft.old and cruft.new, but not cruft.discard.
+		git rev-list --objects --no-object-names \
+			$cruft_old $cruft_new >cruft.raw &&
+		sort cruft.raw >cruft.expect &&
+
+		# Write the script to list extra tips, which are limited
+		# to cruft.old, in this case.
+		write_script extra-tips <<-EOF &&
+		echo $cruft_old
+		EOF
+		git config pack.extraCruftTips ./extra-tips &&
+
+		git repack --cruft --cruft-expiration=now -d &&
+
+		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
+		git show-index <${mtimes%.mtimes}.idx >cruft &&
+		cut -d" " -f2 cruft | sort >cruft.actual &&
+		test_cmp cruft.expect cruft.actual &&
+
+		# Ensure that the "old" objects are removed after
+		# dropping the pack.extraCruftTips hook.
+		git config --unset pack.extraCruftTips &&
+		git repack --cruft --cruft-expiration=now -d &&
+
+		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
+		git show-index <${mtimes%.mtimes}.idx >cruft &&
+		cut -d" " -f2 cruft | sort >cruft.actual &&
+
+		git rev-list --objects --no-object-names $cruft_new >cruft.raw &&
+		cp cruft.expect cruft.old &&
+		sort cruft.raw >cruft.expect &&
+		test_cmp cruft.expect cruft.actual &&
+
+		# ensure objects which are no longer in the cruft pack were
+		# removed from the repository
+		for object in $(comm -13 cruft.expect cruft.old)
+		do
+			test_must_fail git cat-file -t $object || return 1
+		done
+	)
+'
+
+test_expect_success 'multi-valued pack.extraCruftTips' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+		git branch -M main &&
+
+		git checkout --orphan cruft.a &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.a &&
+		cruft_a="$(git rev-parse HEAD)" &&
+
+		git checkout --orphan cruft.b &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.b &&
+		cruft_b="$(git rev-parse HEAD)" &&
+
+		git checkout main &&
+		git branch -D cruft.a cruft.b &&
+		git reflog expire --all --expire=all &&
+
+		echo "echo $cruft_a" | write_script extra-tips.a &&
+		echo "echo $cruft_b" | write_script extra-tips.b &&
+		echo "false" | write_script extra-tips.c &&
+
+		git rev-list --objects --no-object-names $cruft_a $cruft_b \
+			>cruft.raw &&
+		sort cruft.raw >cruft.expect &&
+
+		# ensure that each extra cruft tip is saved by its
+		# respective hook
+		git config --add pack.extraCruftTips ./extra-tips.a &&
+		git config --add pack.extraCruftTips ./extra-tips.b &&
+		git repack --cruft --cruft-expiration=now -d &&
+
+		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
+		git show-index <${mtimes%.mtimes}.idx >cruft &&
+		cut -d" " -f2 cruft | sort >cruft.actual &&
+		test_cmp cruft.expect cruft.actual &&
+
+		# ensure that a dirty exit halts cruft pack generation
+		git config --add pack.extraCruftTips ./extra-tips.c &&
+		test_must_fail git repack --cruft -d 2>err &&
+		grep "unable to enumerate additional cruft tips" err &&
+
+		# and that the existing cruft pack is left alone
+		test_path_is_file "$mtimes"
+	)
+'
+
+test_expect_success 'additional cruft blobs via pack.extraCruftTips' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+
+		blob=$(echo "unreachable" | git hash-object -w --stdin) &&
+
+		# mark the unreachable blob we wrote above as having
+		# aged out of the retention period
+		test-tool chmtime -2000 "$objdir/$(test_oid_to_path $blob)" &&
+
+		# Write the script to list extra tips, which is just the
+		# extra blob as above.
+		write_script extra-tips <<-EOF &&
+		echo $blob
+		EOF
+		git config pack.extraCruftTips ./extra-tips &&
+
+		git repack --cruft --cruft-expiration=now -d &&
+
+		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
+		git show-index <${mtimes%.mtimes}.idx >cruft &&
+		cut -d" " -f2 cruft >actual &&
+		echo $blob >expect &&
+		test_cmp expect actual
+	)
+'
+
 test_done
-- 
2.40.1.477.g73ad7b90e1

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v2] builtin/pack-objects.c: introduce `pack.extraCruftTips`
  2023-05-03  0:09 ` [PATCH v2] " Taylor Blau
@ 2023-05-03 14:01   ` Derrick Stolee
  2023-05-03 19:59   ` Jeff King
  1 sibling, 0 replies; 31+ messages in thread
From: Derrick Stolee @ 2023-05-03 14:01 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: Jeff King, Junio C Hamano

On 5/2/2023 8:09 PM, Taylor Blau wrote:
> This patch introduces a new multi-valued configuration option,
> `pack.extraCruftTips` as an alternative means to mark certain objects in
> the cruft pack as rescued, even if they would otherwise be pruned.

> Range-diff against v1:
> 1:  8af478ebe3 ! 1:  73ad7b90e1 builtin/pack-objects.c: introduce `pack.extraCruftTips`
>     @@ Commit message
>      
>            - to point a reference at them, which may be undesirable or
>              infeasible,
>     +      - to track them via the reflog, which may be undesirable since the
>     +        reflog's lifetime is limited to that of the reference it's tracking
>     +        (and callers may want to keep those unreachable objects around for
>     +        longer)
>            - to extend the grace period, which may keep around other objects that
>              the caller *does* want to discard,
>            - or, to force the caller to construct the pack of objects they want
>     @@ Documentation/config/pack.txt: pack.deltaCacheLimit::
>      ++
>      +Output must contain exactly one hex object ID per line, and nothing
>      +else. Objects which cannot be found in the repository are ignored.
>     ++Multiple hooks are supported, but all must exit successfully, else no
>     ++cruft pack will be generated.

Thanks for these updates.

>       pack.threads::
>       	Specifies the number of threads to spawn when searching for best
>     @@ builtin/pack-objects.c: static void enumerate_and_traverse_cruft_objects(struct
>      +		add_pending_object(revs, obj, "");
>      +	}
>      +
>     ++	ret = finish_command(&cmd);
>     ++
>      +done:
>      +	if (out)
>      +		fclose(out);
>     @@ builtin/pack-objects.c: static void enumerate_and_traverse_cruft_objects(struct
>      +	for (i = 0; i < programs->nr; i++) {
>      +		ret = enumerate_extra_cruft_tips_1(&revs,
>      +						   programs->items[i].string);
>     -+		if (!ret)
>     ++		if (ret)
>      +			break;
>      +	}

If we have a failure, then we stop immediately and report a failure
(from this bit outside the range-diff's context):

> +	if (ret)
> +		die(_("unable to enumerate additional cruft tips"));

>     @@ t/t5329-pack-objects-cruft.sh: test_expect_success 'cruft objects are freshend v
>      +		cruft_new="$(git rev-parse HEAD)" &&
>      +
>      +		git checkout main &&
>     -+		git branch -D old new &&
>     ++		git branch -D discard old new &&

Good update.

>      +		git reflog expire --all --expire=all &&
>      +
>      +		# mark cruft.old with an mtime that is many minutes
>     @@ t/t5329-pack-objects-cruft.sh: test_expect_success 'cruft objects are freshend v
>      +		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
>      +		git show-index <${mtimes%.mtimes}.idx >cruft &&
>      +		cut -d" " -f2 cruft | sort >cruft.actual &&
>     -+		test_cmp cruft.expect cruft.actual
>     ++		test_cmp cruft.expect cruft.actual &&
>     ++
>     ++		# Ensure that the "old" objects are removed after
>     ++		# dropping the pack.extraCruftTips hook.
>     ++		git config --unset pack.extraCruftTips &&
>     ++		git repack --cruft --cruft-expiration=now -d &&

Double-checking that removing the tips correctly removes the objects
at this time is good, but...

>     ++		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
>     ++		git show-index <${mtimes%.mtimes}.idx >cruft &&
>     ++		cut -d" " -f2 cruft | sort >cruft.actual &&
>     ++
>     ++		git rev-list --objects --no-object-names $cruft_new >cruft.raw &&
>     ++		cp cruft.expect cruft.old &&
>     ++		sort cruft.raw >cruft.expect &&
>     ++		test_cmp cruft.expect cruft.actual &&
>     ++
>     ++		# ensure objects which are no longer in the cruft pack were
>     ++		# removed from the repository
>     ++		for object in $(comm -13 cruft.expect cruft.old)
>     ++		do
>     ++			test_must_fail git cat-file -t $object || return 1
>     ++		done

...it would be good to do this check before the removal of the hook
to be sure the objects from 'discard' are removed as part of the step
with the hook. I can imagine a world where the hook-based approach
erroneously keeps the 'discard' objects in the non-cruft pack only
for them to be deleted by the repack where hooks are disabled, and
having a test would help guarantee we don't live in that hypothetical
world.

I would also be satisfied with checking just the commits that were
previously referenced by 'discard' and 'old', but this more
thorough check is also good.

>     ++test_expect_success 'multi-valued pack.extraCruftTips' '
>     ++	git init repo &&
>     ++	test_when_finished "rm -fr repo" &&
>     ++	(
>     ++		cd repo &&
>     ++
>     ++		test_commit base &&
>     ++		git branch -M main &&
>     ++
>     ++		git checkout --orphan cruft.a &&
>     ++		git rm -fr . &&
>     ++		test_commit --no-tag cruft.a &&
>     ++		cruft_a="$(git rev-parse HEAD)" &&
>     ++
>     ++		git checkout --orphan cruft.b &&
>     ++		git rm -fr . &&
>     ++		test_commit --no-tag cruft.b &&
>     ++		cruft_b="$(git rev-parse HEAD)" &&
>     ++
>     ++		git checkout main &&
>     ++		git branch -D cruft.a cruft.b &&
>     ++		git reflog expire --all --expire=all &&
>     ++
>     ++		echo "echo $cruft_a" | write_script extra-tips.a &&
>     ++		echo "echo $cruft_b" | write_script extra-tips.b &&
>     ++		echo "false" | write_script extra-tips.c &&
>     ++
>     ++		git rev-list --objects --no-object-names $cruft_a $cruft_b \
>     ++			>cruft.raw &&
>     ++		sort cruft.raw >cruft.expect &&
>     ++
>     ++		# ensure that each extra cruft tip is saved by its
>     ++		# respective hook
>     ++		git config --add pack.extraCruftTips ./extra-tips.a &&
>     ++		git config --add pack.extraCruftTips ./extra-tips.b &&
>     ++		git repack --cruft --cruft-expiration=now -d &&
>     ++
>     ++		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
>     ++		git show-index <${mtimes%.mtimes}.idx >cruft &&
>     ++		cut -d" " -f2 cruft | sort >cruft.actual &&
>     ++		test_cmp cruft.expect cruft.actual &&
>     ++
>     ++		# ensure that a dirty exit halts cruft pack generation
>     ++		git config --add pack.extraCruftTips ./extra-tips.c &&
>     ++		test_must_fail git repack --cruft -d 2>err &&
>     ++		grep "unable to enumerate additional cruft tips" err &&
>     ++
>     ++		# and that the existing cruft pack is left alone
>     ++		test_path_is_file "$mtimes"
>      +	)

This new test looks good to me.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2] builtin/pack-objects.c: introduce `pack.extraCruftTips`
  2023-05-03  0:09 ` [PATCH v2] " Taylor Blau
  2023-05-03 14:01   ` Derrick Stolee
@ 2023-05-03 19:59   ` Jeff King
  2023-05-03 21:22     ` Taylor Blau
  2023-05-03 21:28     ` Taylor Blau
  1 sibling, 2 replies; 31+ messages in thread
From: Jeff King @ 2023-05-03 19:59 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Junio C Hamano, Derrick Stolee

On Tue, May 02, 2023 at 08:09:47PM -0400, Taylor Blau wrote:

> However, there is no option to be able to keep around certain objects
> that have otherwise aged out of the grace period. The only way to retain
> those objects is:
> 
>   - to point a reference at them, which may be undesirable or
>     infeasible,
>   - to track them via the reflog, which may be undesirable since the
>     reflog's lifetime is limited to that of the reference it's tracking
>     (and callers may want to keep those unreachable objects around for
>     longer)
>   - to extend the grace period, which may keep around other objects that
>     the caller *does* want to discard,
>   - or, to force the caller to construct the pack of objects they want
>     to keep themselves, and then mark the pack as kept by adding a
>     ".keep" file.

OK, I understand the use case you're trying to support, and your
approach mostly makes sense. But there are two things I was surprised by
in the implementation:

  1. Does this need to be tied to cruft packs? The same logic would
     apply to "repack -A" which turns objects loose (of course you
     probably don't want to do that in the long term for performance
     reasons, but it's conceptually the same thing; see below).

  2. Why is there a separate walk distinct from the one that rescues
     recent-but-unreachable objects?

Conceptually it seems to me that the simplest and most flexible way to
think of this new feature is: pretend these objects are recent enough to
be kept in the grace period, even though their mtimes do not qualify".

And then everything else would just fall out naturally. Am I missing
something?

In a pre-cruft-pack world, I'd have just expected the patch to look like
this:

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index bd6ad016d6..1d655dc758 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4077,6 +4077,7 @@ static void get_object_list(struct rev_info *revs, int ac, const char **av)
 		if (add_unseen_recent_objects_to_traversal(revs,
 				unpack_unreachable_expiration, NULL, 0))
 			die(_("unable to add recent objects"));
+		add_tips_from_program(revs);
 		if (prepare_revision_walk(revs))
 			die(_("revision walk setup failed"));
 		traverse_commit_list(revs, record_recent_commit,

Sadly the cruft-pack feature doesn't really share this code, but I think
it does something similar, and could just consider those synthetic tips
as "recent" for this run.

> +static int enumerate_extra_cruft_tips_1(struct rev_info *revs, const char *args)
> +{
> +	struct child_process cmd = CHILD_PROCESS_INIT;
> +	struct strbuf buf = STRBUF_INIT;
> +	FILE *out = NULL;
> +	int ret = 0;
> +
> +	cmd.use_shell = 1;
> +	cmd.out = -1;
> +
> +	strvec_push(&cmd.args, args);
> +	strvec_pushv(&cmd.env, (const char **)local_repo_env);

Why does this clear the environment of local-repo variables? That seems
unlike most other hooks we have, and would cause confusion if $GIT_DIR
or various config variables are important (e.g., should "git -c foo.bar
gc" persist foo.bar when running the hook? It usually does).

I know that some hooks that try to change repositories by changing
directories have the opposite confusion ($GIT_DIR is set, but they did
not want it). But it makes more sense to me to remain consistent with
how other hooks behave.

> +	if (start_command(&cmd)) {
> +		ret = -1;
> +		goto done;
> +	}

This may be a matter of taste, but you can "return -1" directly here, as
nothing has been allocated (and a failed start_command() will call
child_process_clear() for you). This would mean "out" is always set in
the "done:" label, so it wouldn't need a NULL-check (nor an
initialization).

> +
> +	out = xfdopen(cmd.out, "r");
> +	while (strbuf_getline_lf(&buf, out) != EOF) {

is there any reason to be a stickler for LF versus CRLF here? I.e.,
wouldn't strbuf_getline() be more friendly, with little chance that we
misinterpret the result?

> +		struct object_id oid;
> +		struct object *obj;
> +		const char *rest;
> +
> +		if (parse_oid_hex(buf.buf, &oid, &rest) || *rest) {
> +			ret = error(_("invalid extra cruft tip: '%s'"), buf.buf);
> +			goto done;
> +		}
> +
> +		obj = parse_object(the_repository, &oid);
> +		if (!obj)
> +			continue;

This parse_object() can be pretty expensive, especially if you are
rescuing a lot of objects (or if your program directly references large
blobs). Can we use oid_object_info() here in combination with
lookup_object_by_type()?

-Peff

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v2] builtin/pack-objects.c: introduce `pack.extraCruftTips`
  2023-05-03 19:59   ` Jeff King
@ 2023-05-03 21:22     ` Taylor Blau
  2023-05-05 21:23       ` Jeff King
  2023-05-03 21:28     ` Taylor Blau
  1 sibling, 1 reply; 31+ messages in thread
From: Taylor Blau @ 2023-05-03 21:22 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano, Derrick Stolee

On Wed, May 03, 2023 at 03:59:06PM -0400, Jeff King wrote:
> On Tue, May 02, 2023 at 08:09:47PM -0400, Taylor Blau wrote:
>
> > However, there is no option to be able to keep around certain objects
> > that have otherwise aged out of the grace period. The only way to retain
> > those objects is:
> >
> >   - to point a reference at them, which may be undesirable or
> >     infeasible,
> >   - to track them via the reflog, which may be undesirable since the
> >     reflog's lifetime is limited to that of the reference it's tracking
> >     (and callers may want to keep those unreachable objects around for
> >     longer)
> >   - to extend the grace period, which may keep around other objects that
> >     the caller *does* want to discard,
> >   - or, to force the caller to construct the pack of objects they want
> >     to keep themselves, and then mark the pack as kept by adding a
> >     ".keep" file.
>
> OK, I understand the use case you're trying to support, and your
> approach mostly makes sense. But there are two things I was surprised by
> in the implementation:
>
>   1. Does this need to be tied to cruft packs? The same logic would
>      apply to "repack -A" which turns objects loose (of course you
>      probably don't want to do that in the long term for performance
>      reasons, but it's conceptually the same thing; see below).
>
>   2. Why is there a separate walk distinct from the one that rescues
>      recent-but-unreachable objects?

> Conceptually it seems to me that the simplest and most flexible way to
> think of this new feature is: pretend these objects are recent enough to
> be kept in the grace period, even though their mtimes do not qualify".
>
> And then everything else would just fall out naturally. Am I missing
> something?

I originally shied away from it, thinking that I wouldn't want to do an
expensive walk with all of the recent objects during a non-pruning
repack.

The two code paths are quite different in practice. In the pruning case,
we add only new objects from the kept packs and then start our traversal
there. In the non-pruning case, we just do
`add_objects_in_unpacked_packs()` which is really just a call to
`for_each_packed_object()`.

So it gets tricky when you have a pack.extraCruftTips program and want
to invoke it in a non-pruning case. You could do something like:

  - call enumerate_and_traverse_cruft_objects() *always*, either because
    we were doing a pruning GC, or calling it after
    `enumerate_cruft_objects()` (in the non-pruning case)

  - ensure that enumerate_and_traverse_cruft_objects() is a noop when
    (a) cruft_expiration is set to zero, and (b) there are no
    pack.extraCruftTips programs specified

> > +static int enumerate_extra_cruft_tips_1(struct rev_info *revs, const char *args)
> > +{
> > +	struct child_process cmd = CHILD_PROCESS_INIT;
> > +	struct strbuf buf = STRBUF_INIT;
> > +	FILE *out = NULL;
> > +	int ret = 0;
> > +
> > +	cmd.use_shell = 1;
> > +	cmd.out = -1;
> > +
> > +	strvec_push(&cmd.args, args);
> > +	strvec_pushv(&cmd.env, (const char **)local_repo_env);
>
> Why does this clear the environment of local-repo variables? That seems
> unlike most other hooks we have, and would cause confusion if $GIT_DIR
> or various config variables are important (e.g., should "git -c foo.bar
> gc" persist foo.bar when running the hook? It usually does).
>
> I know that some hooks that try to change repositories by changing
> directories have the opposite confusion ($GIT_DIR is set, but they did
> not want it). But it makes more sense to me to remain consistent with
> how other hooks behave.

Copy-and-pasted from the core.alternateRefsCommand stuff, there is no
need for this here.

> > +	if (start_command(&cmd)) {
> > +		ret = -1;
> > +		goto done;
> > +	}
>
> This may be a matter of taste, but you can "return -1" directly here, as
> nothing has been allocated (and a failed start_command() will call
> child_process_clear() for you). This would mean "out" is always set in
> the "done:" label, so it wouldn't need a NULL-check (nor an
> initialization).

Very nice, thanks for the suggestion.

> > +
> > +	out = xfdopen(cmd.out, "r");
> > +	while (strbuf_getline_lf(&buf, out) != EOF) {
>
> is there any reason to be a stickler for LF versus CRLF here? I.e.,
> wouldn't strbuf_getline() be more friendly, with little chance that we
> misinterpret the result?

No reason, other than I happened to type this one out first ;-). We
should definitely be more permissive here, I agree that strbuf_getline()
is the friendlier choice.

> > +		struct object_id oid;
> > +		struct object *obj;
> > +		const char *rest;
> > +
> > +		if (parse_oid_hex(buf.buf, &oid, &rest) || *rest) {
> > +			ret = error(_("invalid extra cruft tip: '%s'"), buf.buf);
> > +			goto done;
> > +		}
> > +
> > +		obj = parse_object(the_repository, &oid);
> > +		if (!obj)
> > +			continue;
>
> This parse_object() can be pretty expensive, especially if you are
> rescuing a lot of objects (or if your program directly references large
> blobs). Can we use oid_object_info() here in combination with
> lookup_object_by_type()?

Yep, definitely. Nice catch ;-).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2] builtin/pack-objects.c: introduce `pack.extraCruftTips`
  2023-05-03 19:59   ` Jeff King
  2023-05-03 21:22     ` Taylor Blau
@ 2023-05-03 21:28     ` Taylor Blau
  2023-05-05 21:26       ` Jeff King
  1 sibling, 1 reply; 31+ messages in thread
From: Taylor Blau @ 2023-05-03 21:28 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano, Derrick Stolee

On Wed, May 03, 2023 at 03:59:06PM -0400, Jeff King wrote:
> On Tue, May 02, 2023 at 08:09:47PM -0400, Taylor Blau wrote:
>
> > However, there is no option to be able to keep around certain objects
> > that have otherwise aged out of the grace period. The only way to retain
> > those objects is:
> >
> >   - to point a reference at them, which may be undesirable or
> >     infeasible,
> >   - to track them via the reflog, which may be undesirable since the
> >     reflog's lifetime is limited to that of the reference it's tracking
> >     (and callers may want to keep those unreachable objects around for
> >     longer)
> >   - to extend the grace period, which may keep around other objects that
> >     the caller *does* want to discard,
> >   - or, to force the caller to construct the pack of objects they want
> >     to keep themselves, and then mark the pack as kept by adding a
> >     ".keep" file.
>
> OK, I understand the use case you're trying to support, and your
> approach mostly makes sense. But there are two things I was surprised by
> in the implementation:
>
>   1. Does this need to be tied to cruft packs? The same logic would
>      apply to "repack -A" which turns objects loose (of course you
>      probably don't want to do that in the long term for performance
>      reasons, but it's conceptually the same thing; see below).

I agree that you wouldn't want to do it for performance reasons, but I'm
comfortable with the asymmetry here, since this is `pack.extraCruftTips`
(emphasis on "cruft"), so it's not clear that it has to be related to
"repack -A".

Happy to change things up if you feel strongly, though.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v3] builtin/pack-objects.c: introduce `pack.extraCruftTips`
  2023-04-20 17:27 [PATCH] builtin/pack-objects.c: introduce `pack.extraCruftTips` Taylor Blau
                   ` (3 preceding siblings ...)
  2023-05-03  0:09 ` [PATCH v2] " Taylor Blau
@ 2023-05-03 22:05 ` Taylor Blau
  2023-05-03 23:18   ` Junio C Hamano
  2023-05-05 22:19   ` Jeff King
  4 siblings, 2 replies; 31+ messages in thread
From: Taylor Blau @ 2023-05-03 22:05 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Derrick Stolee

This patch introduces a new multi-valued configuration option,
`pack.extraCruftTips` as an alternative means to mark certain objects in
the cruft pack as rescued, even if they would otherwise be pruned.

Traditionally, cruft packs are generated in one of two ways:

  - When not pruning, all packed and loose objects which do not appear
    in the any of the packs which wil remain (as indicated over stdin,
    with `--cruft`) are packed separately into a cruft pack.

  - When pruning, the above process happens, with two additional tweaks.
    Instead of adding every object into the cruft pack, only objects
    which have mtime that is within the grace period are kept. In
    addition, a "rescuing" traversal is performed over the remaining
    cruft objects to keep around cruft objects which would have aged out
    of the repository, but are reachable from other cruft objects which
    have not yet aged out.

This works well for repositories which have all of their "interesting"
objects worth keeping around reachable via at least one reference.

However, there is no option to be able to keep around certain objects
that have otherwise aged out of the grace period. The only way to retain
those objects is:

  - to point a reference at them, which may be undesirable or
    infeasible,
  - to track them via the reflog, which may be undesirable since the
    reflog's lifetime is limited to that of the reference it's tracking
    (and callers may want to keep those unreachable objects around for
    longer)
  - to extend the grace period, which may keep around other objects that
    the caller *does* want to discard,
  - or, to force the caller to construct the pack of objects they want
    to keep themselves, and then mark the pack as kept by adding a
    ".keep" file.

This patch introduces a new configuration, `pack.extraCruftTips` which
allows the caller to specify a program (or set of programs) whose output
is treated as a set of objects to treat as "kept", even if they are
unreachable and have aged out of the retention period.

The implementation is straightforward, and pretends as if any objects
listed by one of the `pack.extraCruftTips` hooks is recent-enough to
warrant rescuing (even for non-pruning GCs).

We then add those as tips to another reachability traversal (along with
any recent objects, if pruning), marking every object along the way as
cruft (and thus retaining it in the cruft pack).

A potential alternative here is to introduce a new mode to alter the
contents of the reachable pack instead of the cruft one. One could
imagine a new option to `pack-objects`, say `--extra-tips` that does the
same thing as above, adding the visited set of objects along the
traversal to the pack.

But this has the unfortunate side-effect of altering the reachability
closure of that pack. If parts of the unreachable object graph mentioned
by one or more of the "extra tips" programs is not closed, then the
resulting pack won't be either. This makes it impossible in the general
case to write out reachability bitmaps for that pack, since closure is a
requirement there.

Instead, keep these unreachable objects in the cruft pack instead, to
ensure that we can continue to have a pack containing just reachable
objects, which is always safe to write a bitmap over.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
Here is a reworked version of the pack.extraCruftTips feature to reuse
parts of the "rescue old/unreachable, but reachable-from-recent" traversal.

It is somewhat more complicated than the original, and I would
definitely appreciate a close review from a second (or third) set of
eyes ;-).

Range-diff against v2:
1:  da1711b13b ! 1:  27a7f16aab builtin/pack-objects.c: introduce `pack.extraCruftTips`
    @@ Commit message
         is treated as a set of objects to treat as "kept", even if they are
         unreachable and have aged out of the retention period.

    -    The implementation is straightforward: after determining the set of
    -    objects to pack into the cruft pack (either by calling
    -    `enumerate_cruft_objects()` or `enumerate_and_traverse_cruft_objects()`)
    -    we figure out if there are any program(s) we need to consult in order to
    -    determine if there are any objects which we need to "rescue". We then
    -    add those as tips to another reachability traversal, marking every
    -    object along the way as cruft (and thus retaining it in the cruft pack).
    +    The implementation is straightforward, and pretends as if any objects
    +    listed by one of the `pack.extraCruftTips` hooks is recent-enough to
    +    warrant rescuing (even for non-pruning GCs).
    +
    +    We then add those as tips to another reachability traversal (along with
    +    any recent objects, if pruning), marking every object along the way as
    +    cruft (and thus retaining it in the cruft pack).

         A potential alternative here is to introduce a new mode to alter the
         contents of the reachable pack instead of the cruft one. One could
    @@ builtin/pack-objects.c

      /*
       * Objects we are going to pack are collected in the `to_pack` structure.
    -@@ builtin/pack-objects.c: static void enumerate_and_traverse_cruft_objects(struct string_list *fresh_packs
    +@@ builtin/pack-objects.c: static void enumerate_cruft_objects(void)
      	stop_progress(&progress_state);
      }

    @@ builtin/pack-objects.c: static void enumerate_and_traverse_cruft_objects(struct
     +{
     +	struct child_process cmd = CHILD_PROCESS_INIT;
     +	struct strbuf buf = STRBUF_INIT;
    -+	FILE *out = NULL;
    ++	FILE *out;
     +	int ret = 0;
     +
     +	cmd.use_shell = 1;
     +	cmd.out = -1;
     +
     +	strvec_push(&cmd.args, args);
    -+	strvec_pushv(&cmd.env, (const char **)local_repo_env);
     +
    -+	if (start_command(&cmd)) {
    -+		ret = -1;
    -+		goto done;
    -+	}
    ++	if (start_command(&cmd))
    ++		return -1;
     +
     +	out = xfdopen(cmd.out, "r");
    -+	while (strbuf_getline_lf(&buf, out) != EOF) {
    ++	while (strbuf_getline(&buf, out) != EOF) {
     +		struct object_id oid;
     +		struct object *obj;
    ++		int type;
     +		const char *rest;
     +
     +		if (parse_oid_hex(buf.buf, &oid, &rest) || *rest) {
    @@ builtin/pack-objects.c: static void enumerate_and_traverse_cruft_objects(struct
     +			goto done;
     +		}
     +
    -+		obj = parse_object(the_repository, &oid);
    ++		type = oid_object_info(the_repository, &oid, NULL);
    ++		if (type < 0)
    ++			continue;
    ++
    ++		obj = lookup_object_by_type(the_repository, &oid, type);
     +		if (!obj)
     +			continue;
     +
    @@ builtin/pack-objects.c: static void enumerate_and_traverse_cruft_objects(struct
     +	return ret;
     +}
     +
    -+static int enumerate_extra_cruft_tips(void)
    ++static void add_extra_cruft_tips_to_traversal(struct rev_info *revs)
     +{
    -+	struct rev_info revs;
     +	const struct string_list *programs;
     +	int ret = 0;
     +	size_t i;
     +
     +	if (git_config_get_string_multi("pack.extracrufttips", &programs))
    -+		return ret;
    -+
    -+	repo_init_revisions(the_repository, &revs, NULL);
    -+
    -+	revs.tag_objects = 1;
    -+	revs.tree_objects = 1;
    -+	revs.blob_objects = 1;
    -+
    -+	revs.include_check = cruft_include_check;
    -+	revs.include_check_obj = cruft_include_check_obj;
    -+
    -+	revs.ignore_missing_links = 1;
    ++		return;
     +
     +	if (progress)
     +		progress_state = start_progress(_("Enumerating extra cruft tips"), 0);
     +	nr_seen = 0;
     +	for (i = 0; i < programs->nr; i++) {
    -+		ret = enumerate_extra_cruft_tips_1(&revs,
    ++		ret = enumerate_extra_cruft_tips_1(revs,
     +						   programs->items[i].string);
     +		if (ret)
     +			break;
    @@ builtin/pack-objects.c: static void enumerate_and_traverse_cruft_objects(struct
     +
     +	if (ret)
     +		die(_("unable to enumerate additional cruft tips"));
    -+
    -+	if (prepare_revision_walk(&revs))
    -+		die(_("revision walk setup failed"));
    -+	if (progress)
    -+		progress_state = start_progress(_("Traversing extra cruft objects"), 0);
    -+	nr_seen = 0;
    -+
    -+	/*
    -+	 * Retain all objects reachable from extra tips via
    -+	 * show_cruft_commit(), and show_cruft_object(), regardless of
    -+	 * their mtime.
    -+	 */
    -+	traverse_commit_list(&revs, show_cruft_commit, show_cruft_object, NULL);
    -+	stop_progress(&progress_state);
    -+
    -+	return ret;
     +}
     +
    - static void read_cruft_objects(void)
    + static void enumerate_and_traverse_cruft_objects(struct string_list *fresh_packs)
      {
    - 	struct strbuf buf = STRBUF_INIT;
    + 	struct packed_git *p;
    + 	struct rev_info revs;
    +-	int ret;
    ++	int ret = 0;
    +
    + 	repo_init_revisions(the_repository, &revs, NULL);
    +
    +@@ builtin/pack-objects.c: static void enumerate_and_traverse_cruft_objects(struct string_list *fresh_packs
    +
    + 	revs.ignore_missing_links = 1;
    +
    +-	if (progress)
    +-		progress_state = start_progress(_("Enumerating cruft objects"), 0);
    +-	ret = add_unseen_recent_objects_to_traversal(&revs, cruft_expiration,
    +-						     set_cruft_mtime, 1);
    +-	stop_progress(&progress_state);
    ++	if (cruft_expiration) {
    ++		if (progress)
    ++			progress_state = start_progress(_("Enumerating cruft objects"), 0);
    ++		ret = add_unseen_recent_objects_to_traversal(&revs, cruft_expiration,
    ++							     set_cruft_mtime, 1);
    ++		stop_progress(&progress_state);
    ++	}
    ++
    ++	add_extra_cruft_tips_to_traversal(&revs);
    +
    + 	if (ret)
    + 		die(_("unable to add cruft objects"));
     @@ builtin/pack-objects.c: static void read_cruft_objects(void)
    - 	else
    +
    + 	if (cruft_expiration)
    + 		enumerate_and_traverse_cruft_objects(&fresh_packs);
    +-	else
    ++	else {
    ++		/*
    ++		 * call both enumerate_cruft_objects() and
    ++		 * enumerate_and_traverse_cruft_objects(), since:
    ++		 *
    ++		 *   - the former is required to implement non-pruning GCs
    ++		 *   - the latter is a noop for "cruft_expiration == 0", but
    ++		 *     picks up any additional tips mentioned via
    ++		 *     `pack.extraCruftTips`.
    ++		 */
      		enumerate_cruft_objects();
    ++		enumerate_and_traverse_cruft_objects(&fresh_packs);
    ++	}

    -+	enumerate_extra_cruft_tips();
    -+
      	strbuf_release(&buf);
      	string_list_clear(&discard_packs, 0);
    - 	string_list_clear(&fresh_packs, 0);

      ## t/t5329-pack-objects-cruft.sh ##
     @@ t/t5329-pack-objects-cruft.sh: test_expect_success 'cruft objects are freshend via loose' '

 Documentation/config/pack.txt |  11 +++
 builtin/pack-objects.c        | 104 +++++++++++++++++++--
 t/t5329-pack-objects-cruft.sh | 171 ++++++++++++++++++++++++++++++++++
 3 files changed, 279 insertions(+), 7 deletions(-)

diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index d4c7c9d4e4..0b79bd1751 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -67,6 +67,17 @@ pack.deltaCacheLimit::
 	result once the best match for all objects is found.
 	Defaults to 1000. Maximum value is 65535.

+pack.extraCruftTips::
+	When generating a cruft pack, use the shell to execute the
+	specified command(s), and interpret their output as additional
+	tips of objects to keep in the cruft pack, regardless of their
+	age.
++
+Output must contain exactly one hex object ID per line, and nothing
+else. Objects which cannot be found in the repository are ignored.
+Multiple hooks are supported, but all must exit successfully, else no
+cruft pack will be generated.
+
 pack.threads::
 	Specifies the number of threads to spawn when searching for best
 	delta matches.  This requires that linkgit:git-pack-objects[1]
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index a5b466839b..6f6e7872cd 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -44,6 +44,7 @@
 #include "pack-mtimes.h"
 #include "parse-options.h"
 #include "wrapper.h"
+#include "run-command.h"

 /*
  * Objects we are going to pack are collected in the `to_pack` structure.
@@ -3543,11 +3544,85 @@ static void enumerate_cruft_objects(void)
 	stop_progress(&progress_state);
 }

+static int enumerate_extra_cruft_tips_1(struct rev_info *revs, const char *args)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct strbuf buf = STRBUF_INIT;
+	FILE *out;
+	int ret = 0;
+
+	cmd.use_shell = 1;
+	cmd.out = -1;
+
+	strvec_push(&cmd.args, args);
+
+	if (start_command(&cmd))
+		return -1;
+
+	out = xfdopen(cmd.out, "r");
+	while (strbuf_getline(&buf, out) != EOF) {
+		struct object_id oid;
+		struct object *obj;
+		int type;
+		const char *rest;
+
+		if (parse_oid_hex(buf.buf, &oid, &rest) || *rest) {
+			ret = error(_("invalid extra cruft tip: '%s'"), buf.buf);
+			goto done;
+		}
+
+		type = oid_object_info(the_repository, &oid, NULL);
+		if (type < 0)
+			continue;
+
+		obj = lookup_object_by_type(the_repository, &oid, type);
+		if (!obj)
+			continue;
+
+		display_progress(progress_state, ++nr_seen);
+		add_pending_object(revs, obj, "");
+	}
+
+	ret = finish_command(&cmd);
+
+done:
+	if (out)
+		fclose(out);
+	strbuf_release(&buf);
+	child_process_clear(&cmd);
+
+	return ret;
+}
+
+static void add_extra_cruft_tips_to_traversal(struct rev_info *revs)
+{
+	const struct string_list *programs;
+	int ret = 0;
+	size_t i;
+
+	if (git_config_get_string_multi("pack.extracrufttips", &programs))
+		return;
+
+	if (progress)
+		progress_state = start_progress(_("Enumerating extra cruft tips"), 0);
+	nr_seen = 0;
+	for (i = 0; i < programs->nr; i++) {
+		ret = enumerate_extra_cruft_tips_1(revs,
+						   programs->items[i].string);
+		if (ret)
+			break;
+	}
+	stop_progress(&progress_state);
+
+	if (ret)
+		die(_("unable to enumerate additional cruft tips"));
+}
+
 static void enumerate_and_traverse_cruft_objects(struct string_list *fresh_packs)
 {
 	struct packed_git *p;
 	struct rev_info revs;
-	int ret;
+	int ret = 0;

 	repo_init_revisions(the_repository, &revs, NULL);

@@ -3560,11 +3635,15 @@ static void enumerate_and_traverse_cruft_objects(struct string_list *fresh_packs

 	revs.ignore_missing_links = 1;

-	if (progress)
-		progress_state = start_progress(_("Enumerating cruft objects"), 0);
-	ret = add_unseen_recent_objects_to_traversal(&revs, cruft_expiration,
-						     set_cruft_mtime, 1);
-	stop_progress(&progress_state);
+	if (cruft_expiration) {
+		if (progress)
+			progress_state = start_progress(_("Enumerating cruft objects"), 0);
+		ret = add_unseen_recent_objects_to_traversal(&revs, cruft_expiration,
+							     set_cruft_mtime, 1);
+		stop_progress(&progress_state);
+	}
+
+	add_extra_cruft_tips_to_traversal(&revs);

 	if (ret)
 		die(_("unable to add cruft objects"));
@@ -3639,8 +3718,19 @@ static void read_cruft_objects(void)

 	if (cruft_expiration)
 		enumerate_and_traverse_cruft_objects(&fresh_packs);
-	else
+	else {
+		/*
+		 * call both enumerate_cruft_objects() and
+		 * enumerate_and_traverse_cruft_objects(), since:
+		 *
+		 *   - the former is required to implement non-pruning GCs
+		 *   - the latter is a noop for "cruft_expiration == 0", but
+		 *     picks up any additional tips mentioned via
+		 *     `pack.extraCruftTips`.
+		 */
 		enumerate_cruft_objects();
+		enumerate_and_traverse_cruft_objects(&fresh_packs);
+	}

 	strbuf_release(&buf);
 	string_list_clear(&discard_packs, 0);
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index 303f7a5d84..3ad16a9fca 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -739,4 +739,175 @@ test_expect_success 'cruft objects are freshend via loose' '
 	)
 '

+test_expect_success 'additional cruft tips may be specified via pack.extraCruftTips' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		# Create a handful of objects.
+		#
+		#   - one reachable commit, "base", designated for the reachable
+		#     pack
+		#   - one unreachable commit, "cruft.discard", which is marked
+		#     for deletion
+		#   - one unreachable commit, "cruft.old", which would be marked
+		#     for deletion, but is rescued as an extra cruft tip
+		#   - one unreachable commit, "cruft.new", which is not marked
+		#     for deletion
+		test_commit base &&
+		git branch -M main &&
+
+		git checkout --orphan discard &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.discard &&
+
+		git checkout --orphan old &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.old &&
+		cruft_old="$(git rev-parse HEAD)" &&
+
+		git checkout --orphan new &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.new &&
+		cruft_new="$(git rev-parse HEAD)" &&
+
+		git checkout main &&
+		git branch -D discard old new &&
+		git reflog expire --all --expire=all &&
+
+		# mark cruft.old with an mtime that is many minutes
+		# older than the expiration period, and mark cruft.new
+		# with an mtime that is in the future (and thus not
+		# eligible for pruning).
+		test-tool chmtime -2000 "$objdir/$(test_oid_to_path $cruft_old)" &&
+		test-tool chmtime +1000 "$objdir/$(test_oid_to_path $cruft_new)" &&
+
+		# Write the list of cruft objects we expect to
+		# accumulate, which is comprised of everything reachable
+		# from cruft.old and cruft.new, but not cruft.discard.
+		git rev-list --objects --no-object-names \
+			$cruft_old $cruft_new >cruft.raw &&
+		sort cruft.raw >cruft.expect &&
+
+		# Write the script to list extra tips, which are limited
+		# to cruft.old, in this case.
+		write_script extra-tips <<-EOF &&
+		echo $cruft_old
+		EOF
+		git config pack.extraCruftTips ./extra-tips &&
+
+		git repack --cruft --cruft-expiration=now -d &&
+
+		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
+		git show-index <${mtimes%.mtimes}.idx >cruft &&
+		cut -d" " -f2 cruft | sort >cruft.actual &&
+		test_cmp cruft.expect cruft.actual &&
+
+		# Ensure that the "old" objects are removed after
+		# dropping the pack.extraCruftTips hook.
+		git config --unset pack.extraCruftTips &&
+		git repack --cruft --cruft-expiration=now -d &&
+
+		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
+		git show-index <${mtimes%.mtimes}.idx >cruft &&
+		cut -d" " -f2 cruft | sort >cruft.actual &&
+
+		git rev-list --objects --no-object-names $cruft_new >cruft.raw &&
+		cp cruft.expect cruft.old &&
+		sort cruft.raw >cruft.expect &&
+		test_cmp cruft.expect cruft.actual &&
+
+		# ensure objects which are no longer in the cruft pack were
+		# removed from the repository
+		for object in $(comm -13 cruft.expect cruft.old)
+		do
+			test_must_fail git cat-file -t $object || return 1
+		done
+	)
+'
+
+test_expect_success 'multi-valued pack.extraCruftTips' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+		git branch -M main &&
+
+		git checkout --orphan cruft.a &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.a &&
+		cruft_a="$(git rev-parse HEAD)" &&
+
+		git checkout --orphan cruft.b &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.b &&
+		cruft_b="$(git rev-parse HEAD)" &&
+
+		git checkout main &&
+		git branch -D cruft.a cruft.b &&
+		git reflog expire --all --expire=all &&
+
+		echo "echo $cruft_a" | write_script extra-tips.a &&
+		echo "echo $cruft_b" | write_script extra-tips.b &&
+		echo "false" | write_script extra-tips.c &&
+
+		git rev-list --objects --no-object-names $cruft_a $cruft_b \
+			>cruft.raw &&
+		sort cruft.raw >cruft.expect &&
+
+		# ensure that each extra cruft tip is saved by its
+		# respective hook
+		git config --add pack.extraCruftTips ./extra-tips.a &&
+		git config --add pack.extraCruftTips ./extra-tips.b &&
+		git repack --cruft --cruft-expiration=now -d &&
+
+		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
+		git show-index <${mtimes%.mtimes}.idx >cruft &&
+		cut -d" " -f2 cruft | sort >cruft.actual &&
+		test_cmp cruft.expect cruft.actual &&
+
+		# ensure that a dirty exit halts cruft pack generation
+		git config --add pack.extraCruftTips ./extra-tips.c &&
+		test_must_fail git repack --cruft -d 2>err &&
+		grep "unable to enumerate additional cruft tips" err &&
+
+		# and that the existing cruft pack is left alone
+		test_path_is_file "$mtimes"
+	)
+'
+
+test_expect_success 'additional cruft blobs via pack.extraCruftTips' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+
+		blob=$(echo "unreachable" | git hash-object -w --stdin) &&
+
+		# mark the unreachable blob we wrote above as having
+		# aged out of the retention period
+		test-tool chmtime -2000 "$objdir/$(test_oid_to_path $blob)" &&
+
+		# Write the script to list extra tips, which is just the
+		# extra blob as above.
+		write_script extra-tips <<-EOF &&
+		echo $blob
+		EOF
+		git config pack.extraCruftTips ./extra-tips &&
+
+		git repack --cruft --cruft-expiration=now -d &&
+
+		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
+		git show-index <${mtimes%.mtimes}.idx >cruft &&
+		cut -d" " -f2 cruft >actual &&
+		echo $blob >expect &&
+		test_cmp expect actual
+	)
+'
+
 test_done
--
2.40.1.477.g73ad7b90e1.dirty

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v3] builtin/pack-objects.c: introduce `pack.extraCruftTips`
  2023-05-03 22:05 ` [PATCH v3] " Taylor Blau
@ 2023-05-03 23:18   ` Junio C Hamano
  2023-05-03 23:42     ` Junio C Hamano
  2023-05-05 21:39     ` Jeff King
  2023-05-05 22:19   ` Jeff King
  1 sibling, 2 replies; 31+ messages in thread
From: Junio C Hamano @ 2023-05-03 23:18 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Derrick Stolee

Taylor Blau <me@ttaylorr.com> writes:

>   - When not pruning, all packed and loose objects which do not appear
>     in the any of the packs which wil remain (as indicated over stdin,
>     with `--cruft`) are packed separately into a cruft pack.

"wil remain" -> "will remain"

But more importantly, it is not clear to me whose "stdin" we are
talking about and what is sent "over stdin" here.  Somebody pipes
the output of "printf %s --cruft" to "pack-objects"???

> However, there is no option to be able to keep around certain objects
> that have otherwise aged out of the grace period. The only way to retain
> those objects is:
>
>   - to point a reference at them, which may be undesirable or
>     infeasible,
>   - to track them via the reflog, which may be undesirable since the
>     reflog's lifetime is limited to that of the reference it's tracking
>     (and callers may want to keep those unreachable objects around for
>     longer)
>   - to extend the grace period, which may keep around other objects that
>     the caller *does* want to discard,
>   - or, to force the caller to construct the pack of objects they want
>     to keep themselves, and then mark the pack as kept by adding a
>     ".keep" file.

If you do not want to point objects you want to keep with a ref,
then you are making a problem for yourself, as gc decisions are
designed around the notion that anything worthwhile to keep are
supposed to be reachable from some anchoring points (like the refs,
the reflogs, the index and its various extensions, etc.).

> Here is a reworked version of the pack.extraCruftTips feature to reuse
> parts of the "rescue old/unreachable, but reachable-from-recent" traversal.

That sounds like a sensible way to go, but as I said earlier, it
still is not clear, at least to me, why the pack.extraCruftTips
program can compute them cheaply enough when you cannot have these
tips pointed at with refs in some reserved namespaces of your own.

> +pack.extraCruftTips::

Is this consistent with the way we name a variable whose value
specifies a command to run?  At first, I thought this was a
multi-valued configuration variable, each of the value is an object
name.  extraCruftTipsCmd?  extraCruftTipsHook?

> +	When generating a cruft pack, use the shell to execute the
> +	specified command(s), and interpret their output as additional
> +	tips of objects to keep in the cruft pack, regardless of their

What is a "tip of an object"?  The first byte ;-)?

A "tip of history" would only imply commit objects, but presumably
you would want to specify a tree and protect all the blobs and trees
it recursively contains, so that is not a good name for it.

> +	age.
> ++
> +Output must contain exactly one hex object ID per line, and nothing
> +else. Objects which cannot be found in the repository are ignored.

Are objects that cannot be found the same as objects that are
missing?  How do we determine if a given object name refers to an
object which cannot be found?  Would the lazy fetching from promisor
remotes come into play?

> +Multiple hooks are supported, but all must exit successfully, else no
> +cruft pack will be generated.
> +

Now, we are told this variable refers to "hooks".  If that is the
name we want to call them, we'd better use the term from the
beginning.

> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index a5b466839b..6f6e7872cd 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -44,6 +44,7 @@
>  #include "pack-mtimes.h"
>  #include "parse-options.h"
>  #include "wrapper.h"
> +#include "run-command.h"
>
>  /*
>   * Objects we are going to pack are collected in the `to_pack` structure.
> @@ -3543,11 +3544,85 @@ static void enumerate_cruft_objects(void)
>  	stop_progress(&progress_state);
>  }
>
> +static int enumerate_extra_cruft_tips_1(struct rev_info *revs, const char *args)

Let's lose the _1 suffix unless you have
enumerate_extra_cruft_tips() that farms out bulk of its inner
workings to this one.

> +{
> +	struct child_process cmd = CHILD_PROCESS_INIT;
> +	struct strbuf buf = STRBUF_INIT;
> +	FILE *out;
> +	int ret = 0;
> +
> +	cmd.use_shell = 1;
> +	cmd.out = -1;
> +
> +	strvec_push(&cmd.args, args);
> +
> +	if (start_command(&cmd))
> +		return -1;
> +
> +	out = xfdopen(cmd.out, "r");

It somehow feels funny to be reading from "out", which is usually
where we write things to.

> +	while (strbuf_getline(&buf, out) != EOF) {

Ditto.

> +		struct object_id oid;
> +		struct object *obj;
> +		int type;
> +		const char *rest;
> +
> +		if (parse_oid_hex(buf.buf, &oid, &rest) || *rest) {
> +			ret = error(_("invalid extra cruft tip: '%s'"), buf.buf);
> +			goto done;
> +		}
> +
> +		type = oid_object_info(the_repository, &oid, NULL);
> +		if (type < 0)
> +			continue;
> +
> +		obj = lookup_object_by_type(the_repository, &oid, type);
> +		if (!obj)
> +			continue;

Hmph, we may want to have an interface that lets us avoid looking up
the same oid twice in the same set of tables.  Given an object
unseen so far, oid_object_info() should have done most of the work
necessary for lookup_object_by_type() to get to and start parsing
the data of the object in the good case (i.e. object exists and in a
pack---just we haven't needed it yet), but in the above sequence
there is not enough information passed between two calls to take
advantage of it.

> +		display_progress(progress_state, ++nr_seen);
> +		add_pending_object(revs, obj, "");
> +	}
> +
> +	ret = finish_command(&cmd);
> +
> +done:
> +	if (out)
> +		fclose(out);
> +	strbuf_release(&buf);
> +	child_process_clear(&cmd);
> +
> +	return ret;
> +}
> +
> +static void add_extra_cruft_tips_to_traversal(struct rev_info *revs)
> +{
> +	const struct string_list *programs;
> +	int ret = 0;
> +	size_t i;
> +
> +	if (git_config_get_string_multi("pack.extracrufttips", &programs))
> +		return;
> +
> +	if (progress)
> +		progress_state = start_progress(_("Enumerating extra cruft tips"), 0);
> +	nr_seen = 0;
> +	for (i = 0; i < programs->nr; i++) {
> +		ret = enumerate_extra_cruft_tips_1(revs,
> +						   programs->items[i].string);
> +		if (ret)
> +			break;
> +	}
> +	stop_progress(&progress_state);

We iterate over the value list internal to the repo_config, but we
are only borrowing them so there is no need to do any clean-up upon
leaving.  OK.

> +	if (ret)
> +		die(_("unable to enumerate additional cruft tips"));
> +}
> +
>  static void enumerate_and_traverse_cruft_objects(struct string_list *fresh_packs)
>  {
>  	struct packed_git *p;
>  	struct rev_info revs;
> -	int ret;
> +	int ret = 0;
>
>  	repo_init_revisions(the_repository, &revs, NULL);
>
> @@ -3560,11 +3635,15 @@ static void enumerate_and_traverse_cruft_objects(struct string_list *fresh_packs
>
>  	revs.ignore_missing_links = 1;
>
> -	if (progress)
> -		progress_state = start_progress(_("Enumerating cruft objects"), 0);
> -	ret = add_unseen_recent_objects_to_traversal(&revs, cruft_expiration,
> -						     set_cruft_mtime, 1);
> -	stop_progress(&progress_state);
> +	if (cruft_expiration) {
> +		if (progress)
> +			progress_state = start_progress(_("Enumerating cruft objects"), 0);
> +		ret = add_unseen_recent_objects_to_traversal(&revs, cruft_expiration,
> +							     set_cruft_mtime, 1);
> +		stop_progress(&progress_state);
> +	}
> +
> +	add_extra_cruft_tips_to_traversal(&revs);

Hmph, is this hunk doing two unrelated things at the same time?
What was the reason why we didn't have to be conditional on
cruft_expiration in the original code, and why does it start
mattering when we teach the code to call out to hooks?

> @@ -3639,8 +3718,19 @@ static void read_cruft_objects(void)
>
>  	if (cruft_expiration)
>  		enumerate_and_traverse_cruft_objects(&fresh_packs);
> -	else
> +	else {
> +		/*
> +		 * call both enumerate_cruft_objects() and
> +		 * enumerate_and_traverse_cruft_objects(), since:
> +		 *
> +		 *   - the former is required to implement non-pruning GCs
> +		 *   - the latter is a noop for "cruft_expiration == 0", but
> +		 *     picks up any additional tips mentioned via
> +		 *     `pack.extraCruftTips`.
> +		 */
>  		enumerate_cruft_objects();
> +		enumerate_and_traverse_cruft_objects(&fresh_packs);
> +	}

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3] builtin/pack-objects.c: introduce `pack.extraCruftTips`
  2023-05-03 23:18   ` Junio C Hamano
@ 2023-05-03 23:42     ` Junio C Hamano
  2023-05-03 23:48       ` Taylor Blau
  2023-05-03 23:50       ` Taylor Blau
  2023-05-05 21:39     ` Jeff King
  1 sibling, 2 replies; 31+ messages in thread
From: Junio C Hamano @ 2023-05-03 23:42 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Derrick Stolee

Junio C Hamano <gitster@pobox.com> writes:

>> +Multiple hooks are supported, but all must exit successfully, else no
>> +cruft pack will be generated.
>> +
>
> Now, we are told this variable refers to "hooks".  If that is the
> name we want to call them, we'd better use the term from the
> beginning.

I was updating the "What's cooking" draft and here is my tentative
summary of the topic.

    * tb/pack-extra-cruft-tips (2023-05-03) 1 commit
     - builtin/pack-objects.c: introduce `pack.extraCruftTips`

     "git pack-objects" learned to invoke a new hook program that
     enumerates extra objects to be used as anchoring points to keep
     otherwise unreachable objects in cruft packs.

     source: <27a7f16aab35b5cac391d9831aadb0f2e2146313.1683151485.git.me@ttaylorr.com>

I think I earlier saw review comments on a side thread that
mentioned "repack -A"; when I read it I didn't quite get the point
of the discussion, but after summarizing the gist of the topic as
above, I _think_ I tend to agree.  This should not be about "should
we have these extra objects in cruft pack?" but is about "should we
keep these extra objects alive and exempt from pruning".  What these
extra tips tell us is that the part of the reachability graph rooted
at these extra objects should be treated as if they are reachable.

Storing them inside cruft packs may be a reasonable choice to make
today, in the sense that among various object storage mechansim, the
cruft pack mechanism may be the best fit in today's system, but it
does not have to stay that way.  Naming the variable to specify the
hooks with name "cruft" in them would make it hard to explain once
we find an even better storage mechanism to store such a "not really
used but want to keep" objects.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3] builtin/pack-objects.c: introduce `pack.extraCruftTips`
  2023-05-03 23:42     ` Junio C Hamano
@ 2023-05-03 23:48       ` Taylor Blau
  2023-05-03 23:50       ` Taylor Blau
  1 sibling, 0 replies; 31+ messages in thread
From: Taylor Blau @ 2023-05-03 23:48 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King, Derrick Stolee

On Wed, May 03, 2023 at 04:42:41PM -0700, Junio C Hamano wrote:
> Storing them inside cruft packs may be a reasonable choice to make
> today, in the sense that among various object storage mechansim, the
> cruft pack mechanism may be the best fit in today's system, but it
> does not have to stay that way.  Naming the variable to specify the
> hooks with name "cruft" in them would make it hard to explain once
> we find an even better storage mechanism to store such a "not really
> used but want to keep" objects.

I dunno. I thought about this too, and I get your argument, but I am not
convinced that a future mechanism would lend itself well to keeping
around additional sets of objects in the same way cruft packs do. In
that case, we would prefer having called this `pack.extraCruftTips` and
relegating it to the cruft pack system.

We could make this more generic, and extend support to the legacy
prune-via-loose mechanism. But like I said to Peff, I have a hard time
imagining anybody using it.

So, I'm torn. I see what you're saying, but I think I still tend to fall
on the side of leaving it as-is.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3] builtin/pack-objects.c: introduce `pack.extraCruftTips`
  2023-05-03 23:42     ` Junio C Hamano
  2023-05-03 23:48       ` Taylor Blau
@ 2023-05-03 23:50       ` Taylor Blau
  1 sibling, 0 replies; 31+ messages in thread
From: Taylor Blau @ 2023-05-03 23:50 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King, Derrick Stolee

On Wed, May 03, 2023 at 04:42:41PM -0700, Junio C Hamano wrote:
> Storing them inside cruft packs may be a reasonable choice to make
> today, in the sense that among various object storage mechansim, the
> cruft pack mechanism may be the best fit in today's system, but it
> does not have to stay that way.  Naming the variable to specify the
> hooks with name "cruft" in them would make it hard to explain once
> we find an even better storage mechanism to store such a "not really
> used but want to keep" objects.

I dunno. I thought about this too, and I get your argument, but I am not
convinced that a future mechanism would lend itself well to keeping
around additional sets of objects in the same way cruft packs do. In
that case, we would prefer having called this `pack.extraCruftTips` and
relegating it to the cruft pack system.

We could make this more generic, and extend support to the legacy
prune-via-loose mechanism. But like I said to Peff, I have a hard time
imagining anybody using it.

So, I'm torn. I see what you're saying, but I think I still tend to fall
on the side of leaving it as-is.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2] builtin/pack-objects.c: introduce `pack.extraCruftTips`
  2023-05-03 21:22     ` Taylor Blau
@ 2023-05-05 21:23       ` Jeff King
  2023-05-06  0:06         ` Taylor Blau
  0 siblings, 1 reply; 31+ messages in thread
From: Jeff King @ 2023-05-05 21:23 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Junio C Hamano, Derrick Stolee

On Wed, May 03, 2023 at 05:22:04PM -0400, Taylor Blau wrote:

> > OK, I understand the use case you're trying to support, and your
> > approach mostly makes sense. But there are two things I was surprised by
> > in the implementation:
> >
> >   1. Does this need to be tied to cruft packs? The same logic would
> >      apply to "repack -A" which turns objects loose (of course you
> >      probably don't want to do that in the long term for performance
> >      reasons, but it's conceptually the same thing; see below).
> >
> >   2. Why is there a separate walk distinct from the one that rescues
> >      recent-but-unreachable objects?
> 
> > Conceptually it seems to me that the simplest and most flexible way to
> > think of this new feature is: pretend these objects are recent enough to
> > be kept in the grace period, even though their mtimes do not qualify".
> >
> > And then everything else would just fall out naturally. Am I missing
> > something?
> 
> I originally shied away from it, thinking that I wouldn't want to do an
> expensive walk with all of the recent objects during a non-pruning
> repack.
>
> The two code paths are quite different in practice. In the pruning case,
> we add only new objects from the kept packs and then start our traversal
> there. In the non-pruning case, we just do
> `add_objects_in_unpacked_packs()` which is really just a call to
> `for_each_packed_object()`.

Right, that's what I'd expect. I think you are describing a cruft-pack
world here (because you say "kept packs"), but the traditional "repack
-k" versus "repack -A" is similar (if we are not doing something special
with recent objects, then there is no need to figure out what they
reach; we can just add them all).

> So it gets tricky when you have a pack.extraCruftTips program and want
> to invoke it in a non-pruning case. You could do something like:
> 
>   - call enumerate_and_traverse_cruft_objects() *always*, either because
>     we were doing a pruning GC, or calling it after
>     `enumerate_cruft_objects()` (in the non-pruning case)
> 
>   - ensure that enumerate_and_traverse_cruft_objects() is a noop when
>     (a) cruft_expiration is set to zero, and (b) there are no
>     pack.extraCruftTips programs specified

I'm not sure why you'd need to traverse, though. If we are in "-k" mode,
we are keeping everything anyway (so I don't even see the point of
asking the helper about extra tips). And all of those objects that are
not reachable from the regular traversal are by definition "cruft" and
go into the cruft pack.

Maybe I don't understand what you mean by "non-pruning" here.

-Peff

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2] builtin/pack-objects.c: introduce `pack.extraCruftTips`
  2023-05-03 21:28     ` Taylor Blau
@ 2023-05-05 21:26       ` Jeff King
  2023-05-05 22:13         ` Jeff King
  0 siblings, 1 reply; 31+ messages in thread
From: Jeff King @ 2023-05-05 21:26 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Junio C Hamano, Derrick Stolee

On Wed, May 03, 2023 at 05:28:45PM -0400, Taylor Blau wrote:

> > OK, I understand the use case you're trying to support, and your
> > approach mostly makes sense. But there are two things I was surprised by
> > in the implementation:
> >
> >   1. Does this need to be tied to cruft packs? The same logic would
> >      apply to "repack -A" which turns objects loose (of course you
> >      probably don't want to do that in the long term for performance
> >      reasons, but it's conceptually the same thing; see below).
> 
> I agree that you wouldn't want to do it for performance reasons, but I'm
> comfortable with the asymmetry here, since this is `pack.extraCruftTips`
> (emphasis on "cruft"), so it's not clear that it has to be related to
> "repack -A".
> 
> Happy to change things up if you feel strongly, though.

I don't feel strongly. I certainly have no intent to run "git repack -A"
with extra tips specified. Mostly I just thought that it would be
simpler to just apply it everywhere, both conceptually to the user and
within the implementation.

But since the cruft-pack implementation diverges quite a bit from the
regular "-A" handling, I guess it makes the code more complex rather
than less. The asymmetry feels like a wart to me, but I guess in the
long run we'd hope that the "turn unreachable loose" strategy is
deprecated anyway.

-Peff

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3] builtin/pack-objects.c: introduce `pack.extraCruftTips`
  2023-05-03 23:18   ` Junio C Hamano
  2023-05-03 23:42     ` Junio C Hamano
@ 2023-05-05 21:39     ` Jeff King
  1 sibling, 0 replies; 31+ messages in thread
From: Jeff King @ 2023-05-05 21:39 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, Derrick Stolee

On Wed, May 03, 2023 at 04:18:44PM -0700, Junio C Hamano wrote:

> > +	When generating a cruft pack, use the shell to execute the
> > +	specified command(s), and interpret their output as additional
> > +	tips of objects to keep in the cruft pack, regardless of their
> 
> What is a "tip of an object"?  The first byte ;-)?
> 
> A "tip of history" would only imply commit objects, but presumably
> you would want to specify a tree and protect all the blobs and trees
> it recursively contains, so that is not a good name for it.

"tips of the object graph" perhaps?

> > +		if (parse_oid_hex(buf.buf, &oid, &rest) || *rest) {
> > +			ret = error(_("invalid extra cruft tip: '%s'"), buf.buf);
> > +			goto done;
> > +		}
> > +
> > +		type = oid_object_info(the_repository, &oid, NULL);
> > +		if (type < 0)
> > +			continue;
> > +
> > +		obj = lookup_object_by_type(the_repository, &oid, type);
> > +		if (!obj)
> > +			continue;
> 
> Hmph, we may want to have an interface that lets us avoid looking up
> the same oid twice in the same set of tables.  Given an object
> unseen so far, oid_object_info() should have done most of the work
> necessary for lookup_object_by_type() to get to and start parsing
> the data of the object in the good case (i.e. object exists and in a
> pack---just we haven't needed it yet), but in the above sequence
> there is not enough information passed between two calls to take
> advantage of it.

This code was my suggestion, but it may have actually been a bad
direction.

I don't think communicating between oid_object_info() and
lookup_object_by_type() is important. The latter is only doing a lookup
in the internal hash with lookup_object(), and then auto-vivifying using
the type if necessary (which we provide to it).

The bigger inefficiency is that we call oid_object_info() before seeing
if we have already instantiated an object struct via lookup_object().

Obviously we could do that first. But let's take a step back. My
original suggestion was thinking that we don't want to call
parse_object() because it's expensive, especially for a blob. But in the
long run, most of these objects (except blobs!) will end up parsed
anyway, because we are going to see which other objects they reach.

So it's OK to parse anything except blobs. And indeed, we have a better
tool for that these days:

  obj = parse_object_with_flags(r, oid, PARSE_OBJECT_SKIP_HASH_CHECK);

That does exactly what we want. If we already saw and parsed the object,
it's quick noop after a hash lookup. If we didn't, then it already has
optimizations to avoid reading object contents if possible (checking the
commit graph, checking the type for blobs).

Skipping the hash check might seem like a bad idea for a repack, but
it's what we already do for blobs found via traversing. A disk repack
uses the much cheaper pack idx crc for exactly this purpose: to avoid
expanding objects unnecessarily.

-Peff

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2] builtin/pack-objects.c: introduce `pack.extraCruftTips`
  2023-05-05 21:26       ` Jeff King
@ 2023-05-05 22:13         ` Jeff King
  2023-05-06  0:13           ` Taylor Blau
  0 siblings, 1 reply; 31+ messages in thread
From: Jeff King @ 2023-05-05 22:13 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Junio C Hamano, Derrick Stolee

On Fri, May 05, 2023 at 05:26:31PM -0400, Jeff King wrote:

> But since the cruft-pack implementation diverges quite a bit from the
> regular "-A" handling, I guess it makes the code more complex rather
> than less. The asymmetry feels like a wart to me, but I guess in the
> long run we'd hope that the "turn unreachable loose" strategy is
> deprecated anyway.

One thing that could make this a lot simpler is if the code was added to
"are we recent" code paths in the first place.

Something like this:

diff --git a/reachable.c b/reachable.c
index 55bb114353..86bb5e021e 100644
--- a/reachable.c
+++ b/reachable.c
@@ -16,6 +16,9 @@
 #include "object-store.h"
 #include "pack-bitmap.h"
 #include "pack-mtimes.h"
+#include "oidset.h"
+#include "run-command.h"
+#include "config.h"
 
 struct connectivity_progress {
 	struct progress *progress;
@@ -67,8 +70,71 @@ struct recent_data {
 	timestamp_t timestamp;
 	report_recent_object_fn *cb;
 	int ignore_in_core_kept_packs;
+	struct oidset fake_recent_oids;
+	int fake_recent_oids_loaded;
 };
 
+static int run_one_recent_cmd(struct oidset *set, const char *args)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct strbuf buf = STRBUF_INIT;
+	FILE *out;
+	int ret = 0;
+
+	cmd.use_shell = 1;
+	cmd.out = -1;
+
+	strvec_push(&cmd.args, args);
+
+	if (start_command(&cmd))
+		return -1;
+
+	out = xfdopen(cmd.out, "r");
+	while (strbuf_getline(&buf, out) != EOF) {
+		struct object_id oid;
+		const char *rest;
+
+		if (parse_oid_hex(buf.buf, &oid, &rest) || *rest)
+			die(_("invalid extra cruft tip: '%s'"), buf.buf);
+
+		oidset_insert(set, &oid);
+	}
+
+	if (finish_command(&cmd))
+		die(_("cruft tip hook returned error"));
+
+	fclose(out);
+	strbuf_release(&buf);
+
+	return ret;
+}
+
+static int obj_is_recent(const struct object_id *oid, timestamp_t mtime,
+			 struct recent_data *data)
+{
+	if (mtime > data->timestamp)
+		return 1;
+
+	if (!data->fake_recent_oids_loaded) {
+		const struct string_list *programs;
+
+		if (!git_config_get_string_multi("pack.extracrufttips", &programs)) {
+			size_t i;
+			for (i = 0; i < programs->nr; i++) {
+				if (run_one_recent_cmd(&data->fake_recent_oids,
+						       programs->items[i].string) < 0)
+					die(_("unable to enumerate additional cruft tips"));
+			}
+		}
+		data->fake_recent_oids_loaded = 1;
+	}
+
+	if (oidset_contains(&data->fake_recent_oids, oid))
+		return 1;
+
+	return 0;
+}
+
 static void add_recent_object(const struct object_id *oid,
 			      struct packed_git *pack,
 			      off_t offset,
@@ -78,7 +144,7 @@ static void add_recent_object(const struct object_id *oid,
 	struct object *obj;
 	enum object_type type;
 
-	if (mtime <= data->timestamp)
+	if (!obj_is_recent(oid, mtime, data))
 		return;
 
 	/*
@@ -193,6 +259,10 @@ int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
 	data.cb = cb;
 	data.ignore_in_core_kept_packs = ignore_in_core_kept_packs;
 
+	/* XXX must free before exiting function */
+	oidset_init(&data.fake_recent_oids, 0);
+	data.fake_recent_oids_loaded = 0;
+
 	r = for_each_loose_object(add_recent_loose, &data,
 				  FOR_EACH_OBJECT_LOCAL_ONLY);
 	if (r)

That would affect both "repack -A" and "repack --cruft". It would also
affect "git prune", but that seems like a good thing to me.

It would not affect a straight "repack -ad", since there we do not ask
about recency at all. I'm not sure if that use case is important or not.
If it is, we could easily trigger the "is it recent" code in that case
only if a hook is configured (we'd just set the cutoff to "0" so that
everything looks "too old").

Note that this inverts the loop logic from what you have. In your patch,
you ask the hook to enumerate all faux-recent objects, and then you
treat them separately from objects we found on disk. Here we enumerate
the objects on disk as usual, and just adjust our idea of "recent" based
on the hook. I think this is a bit simpler because objects we don't even
bother to ask it about objects we've already handled, are already marked
as recent, or are not present in the repository.

-Peff

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v3] builtin/pack-objects.c: introduce `pack.extraCruftTips`
  2023-05-03 22:05 ` [PATCH v3] " Taylor Blau
  2023-05-03 23:18   ` Junio C Hamano
@ 2023-05-05 22:19   ` Jeff King
  1 sibling, 0 replies; 31+ messages in thread
From: Jeff King @ 2023-05-05 22:19 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Junio C Hamano, Derrick Stolee

On Wed, May 03, 2023 at 06:05:45PM -0400, Taylor Blau wrote:

> +	out = xfdopen(cmd.out, "r");
> +	while (strbuf_getline(&buf, out) != EOF) {
> +		struct object_id oid;
> +		struct object *obj;
> +		int type;
> +		const char *rest;
> +
> +		if (parse_oid_hex(buf.buf, &oid, &rest) || *rest) {
> +			ret = error(_("invalid extra cruft tip: '%s'"), buf.buf);
> +			goto done;
> +		}

While adapting this for my "how about this" patch elsewhere in the
thread, I noticed the error handling here is a little funny.

The other side sent us unexpected output, so we stopped parsing. But we
never reap the child process. We just jump to "done", which clears the
memory used by the process struct, but never call finish_command().

I think you want to break out of this loop and run finish_command. But
you have to make sure to fclose(out) first, so that it knows to exit
(otherwise it may fill up the pipe buffer and block waiting for us to
read, creating a deadlock).

And then

So something like:

  while (strbuf_getline(...)) {
	if (some_error) {
		ret = error(...);
		break;
  }

  fclose(out); /* no need for "if (out)" here; it's always open */
  ret |= finish_command(&cmd);

  /* no need to child_process_clear(); already done by finish_command() */
  strbuf_release(&buf);
  return ret;

-Peff

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2] builtin/pack-objects.c: introduce `pack.extraCruftTips`
  2023-05-05 21:23       ` Jeff King
@ 2023-05-06  0:06         ` Taylor Blau
  2023-05-06  0:14           ` Taylor Blau
  0 siblings, 1 reply; 31+ messages in thread
From: Taylor Blau @ 2023-05-06  0:06 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano, Derrick Stolee

On Fri, May 05, 2023 at 05:23:22PM -0400, Jeff King wrote:
> On Wed, May 03, 2023 at 05:22:04PM -0400, Taylor Blau wrote:
>
> > So it gets tricky when you have a pack.extraCruftTips program and want
> > to invoke it in a non-pruning case. You could do something like:
> >
> >   - call enumerate_and_traverse_cruft_objects() *always*, either because
> >     we were doing a pruning GC, or calling it after
> >     `enumerate_cruft_objects()` (in the non-pruning case)
> >
> >   - ensure that enumerate_and_traverse_cruft_objects() is a noop when
> >     (a) cruft_expiration is set to zero, and (b) there are no
> >     pack.extraCruftTips programs specified
>
> I'm not sure why you'd need to traverse, though. If we are in "-k" mode,
> we are keeping everything anyway (so I don't even see the point of
> asking the helper about extra tips). And all of those objects that are
> not reachable from the regular traversal are by definition "cruft" and
> go into the cruft pack.
>
> Maybe I don't understand what you mean by "non-pruning" here.

By non-pruning, I meant something like "git gc --prune=never", which
would run the equivalent of `git repack -A` to generate the pack
containing just reachable objects, and then invoke `git pack-objects
--cruft` to generate the cruft pack.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2] builtin/pack-objects.c: introduce `pack.extraCruftTips`
  2023-05-05 22:13         ` Jeff King
@ 2023-05-06  0:13           ` Taylor Blau
  2023-05-06  0:20             ` Taylor Blau
  2023-05-06  2:12             ` Jeff King
  0 siblings, 2 replies; 31+ messages in thread
From: Taylor Blau @ 2023-05-06  0:13 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano, Derrick Stolee

On Fri, May 05, 2023 at 06:13:57PM -0400, Jeff King wrote:
> One thing that could make this a lot simpler is if the code was added to
> "are we recent" code paths in the first place.
>
> Something like this:

This is quite nice, and I think that it's a good direction to push this
~series~ patch in before queuing. That said, I wasn't sure about...

> @@ -78,7 +144,7 @@ static void add_recent_object(const struct object_id *oid,
>  	struct object *obj;
>  	enum object_type type;
>
> -	if (mtime <= data->timestamp)
> +	if (!obj_is_recent(oid, mtime, data))
>  		return;
>
>  	/*

...this hunk. That only kicks in if you have other recent object(s),
since the hooks are consulted as a side-effect of calling your new
`obj_is_recent()` function.

I think in most cases that's fine, but if you had no otherwise-recent
objects around, then this code wouldn't kick in in the first place.

I wonder if it might make more sense to call the hooks directly in
add_unseen_recent_objects_to_traversal().

> That would affect both "repack -A" and "repack --cruft". It would also
> affect "git prune", but that seems like a good thing to me.

I was going to say, I'm not sure this would work since we don't use any
of this code from enumerate_cruft_objects() when the cruft_expiration is
set to "never", but that's fine since we're keeping all of those objects
anyway.

OK, my bad, I think that was the point you were trying to make in your
previous email, but I didn't quite grok it at the time. Yes, I agree,
this code only needs to kick in when pruning.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2] builtin/pack-objects.c: introduce `pack.extraCruftTips`
  2023-05-06  0:06         ` Taylor Blau
@ 2023-05-06  0:14           ` Taylor Blau
  0 siblings, 0 replies; 31+ messages in thread
From: Taylor Blau @ 2023-05-06  0:14 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano, Derrick Stolee

On Fri, May 05, 2023 at 08:06:14PM -0400, Taylor Blau wrote:
> On Fri, May 05, 2023 at 05:23:22PM -0400, Jeff King wrote:
> > On Wed, May 03, 2023 at 05:22:04PM -0400, Taylor Blau wrote:
> >
> > > So it gets tricky when you have a pack.extraCruftTips program and want
> > > to invoke it in a non-pruning case. You could do something like:
> > >
> > >   - call enumerate_and_traverse_cruft_objects() *always*, either because
> > >     we were doing a pruning GC, or calling it after
> > >     `enumerate_cruft_objects()` (in the non-pruning case)
> > >
> > >   - ensure that enumerate_and_traverse_cruft_objects() is a noop when
> > >     (a) cruft_expiration is set to zero, and (b) there are no
> > >     pack.extraCruftTips programs specified
> >
> > I'm not sure why you'd need to traverse, though. If we are in "-k" mode,
> > we are keeping everything anyway (so I don't even see the point of
> > asking the helper about extra tips). And all of those objects that are
> > not reachable from the regular traversal are by definition "cruft" and
> > go into the cruft pack.
> >
> > Maybe I don't understand what you mean by "non-pruning" here.
>
> By non-pruning, I meant something like "git gc --prune=never", which
> would run the equivalent of `git repack -A` to generate the pack
> containing just reachable objects, and then invoke `git pack-objects
> --cruft` to generate the cruft pack.

Oops, my misunderstanding. I see what you're saying: in a `git gc
--prune=never`, it does not matter whether we ask for extra cruft tips,
since we're keeping all of those objects anyway. Duh.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2] builtin/pack-objects.c: introduce `pack.extraCruftTips`
  2023-05-06  0:13           ` Taylor Blau
@ 2023-05-06  0:20             ` Taylor Blau
  2023-05-06  2:12             ` Jeff King
  1 sibling, 0 replies; 31+ messages in thread
From: Taylor Blau @ 2023-05-06  0:20 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano, Derrick Stolee

On Fri, May 05, 2023 at 08:13:45PM -0400, Taylor Blau wrote:
> On Fri, May 05, 2023 at 06:13:57PM -0400, Jeff King wrote:
> > One thing that could make this a lot simpler is if the code was added to
> > "are we recent" code paths in the first place.
> >
> > Something like this:
>
> This is quite nice, and I think that it's a good direction to push this
> ~series~ patch in before queuing. That said, I wasn't sure about...
>
> > @@ -78,7 +144,7 @@ static void add_recent_object(const struct object_id *oid,
> >  	struct object *obj;
> >  	enum object_type type;
> >
> > -	if (mtime <= data->timestamp)
> > +	if (!obj_is_recent(oid, mtime, data))
> >  		return;
> >
> >  	/*
>
> ...this hunk. That only kicks in if you have other recent object(s),
> since the hooks are consulted as a side-effect of calling your new
> `obj_is_recent()` function.
>
> I think in most cases that's fine, but if you had no otherwise-recent
> objects around, then this code wouldn't kick in in the first place.
>
> I wonder if it might make more sense to call the hooks directly in
> add_unseen_recent_objects_to_traversal().

OK, no, this is fine, but we'd need to make sure that
want_recent_object() also understood the fake recent set here, since
add_recent_loose() and add_recent_packed() both bail early when
want_recent_object() returns 0.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2] builtin/pack-objects.c: introduce `pack.extraCruftTips`
  2023-05-06  0:13           ` Taylor Blau
  2023-05-06  0:20             ` Taylor Blau
@ 2023-05-06  2:12             ` Jeff King
  1 sibling, 0 replies; 31+ messages in thread
From: Jeff King @ 2023-05-06  2:12 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Junio C Hamano, Derrick Stolee

On Fri, May 05, 2023 at 08:13:45PM -0400, Taylor Blau wrote:

> > @@ -78,7 +144,7 @@ static void add_recent_object(const struct object_id *oid,
> >  	struct object *obj;
> >  	enum object_type type;
> >
> > -	if (mtime <= data->timestamp)
> > +	if (!obj_is_recent(oid, mtime, data))
> >  		return;
> >
> >  	/*
> 
> ...this hunk. That only kicks in if you have other recent object(s),
> since the hooks are consulted as a side-effect of calling your new
> `obj_is_recent()` function.

I think we'll evaluate each object in the repo for recent-ness, via
for_each_loose_object and for_each_packed_object. So if an object exists
in the repo it will be evaluated here, as long as we are checking for
recent objects at all. And if it doesn't exist, then having the hook
tell us about it won't help; there is nothing for us to save (nor even
to use as a source of reachability, since we don't know what's in it).

Modulo the want_recent_object() thing you mentioned, of course, which is
evaluated first. And I could see that yeah, that might need to let the
hook override it, which shouldn't be too hard to do.

I'm not super familiar with this aspect of the cruft pack code. That
function is trying to avoid looking at objects that are in in-core kept
packs, which implies repack feeding those via stdin. Looking at
repack.c, we are feeding the existing packs we just wrote (so all of the
reachable stuff). So I can see why we might skip those packs, but at the
same time, we know we are keeping them (they were just written!) so they
are not really cruft candidates anyway. If an extraCruftTips hook told
us about them, would it matter? They are already being kept as
non-cruft.

But that is just an analysis based on 5 minutes of poking at the code. I
won't be surprised if I'm misunderstanding it or missing subtle cases.

-Peff

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2023-05-06  2:13 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-20 17:27 [PATCH] builtin/pack-objects.c: introduce `pack.extraCruftTips` Taylor Blau
2023-04-20 18:12 ` Junio C Hamano
2023-04-20 19:30   ` Taylor Blau
2023-04-20 19:52     ` Junio C Hamano
2023-04-20 20:48       ` Taylor Blau
2023-04-21  0:10 ` Chris Torek
2023-04-21  2:14   ` Taylor Blau
2023-04-25 19:42 ` Derrick Stolee
2023-04-25 21:25   ` Taylor Blau
2023-04-26 10:52     ` Derrick Stolee
2023-05-03  0:06       ` Taylor Blau
2023-05-03  0:09 ` [PATCH v2] " Taylor Blau
2023-05-03 14:01   ` Derrick Stolee
2023-05-03 19:59   ` Jeff King
2023-05-03 21:22     ` Taylor Blau
2023-05-05 21:23       ` Jeff King
2023-05-06  0:06         ` Taylor Blau
2023-05-06  0:14           ` Taylor Blau
2023-05-03 21:28     ` Taylor Blau
2023-05-05 21:26       ` Jeff King
2023-05-05 22:13         ` Jeff King
2023-05-06  0:13           ` Taylor Blau
2023-05-06  0:20             ` Taylor Blau
2023-05-06  2:12             ` Jeff King
2023-05-03 22:05 ` [PATCH v3] " Taylor Blau
2023-05-03 23:18   ` Junio C Hamano
2023-05-03 23:42     ` Junio C Hamano
2023-05-03 23:48       ` Taylor Blau
2023-05-03 23:50       ` Taylor Blau
2023-05-05 21:39     ` Jeff King
2023-05-05 22:19   ` Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).