All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 0/4] move pruned objects to a separate repository
@ 2022-06-29 18:45 Taylor Blau
  2022-06-29 18:45 ` [RFC PATCH 1/4] builtin/repack.c: pass "out" to `prepare_pack_objects` Taylor Blau
                   ` (5 more replies)
  0 siblings, 6 replies; 9+ messages in thread
From: Taylor Blau @ 2022-06-29 18:45 UTC (permalink / raw)
  To: git; +Cc: derrickstolee, jonathantanmy, gitster

Now that cruft packs are available in v2.37.0, here is an interesting
application of that new feature to enable a two-phase object pruning
approach.

This came out of a discussion within GitHub about ways we could support
storing a set of pruned objects in "limbo" so that they were not
accessible from the repository which pruned them, but instead stored in
a cruft pack in a separate repository which lists the original one as an
alternate.

This makes it possible to take the collection of all pruned objects and
store them in a cruft pack in a separate repository. This repository
(which I have been referring to as the "expired.git") can then be used
as a donor repository for any missing objects (like the ones described
by the race in [1]).

The first few patches are preparatory. The final one implements writing
the pruned objects separately. The trick is to write another cruft pack
to a separate repository, with two tweaks:

  - the `--cruft-expiration` value is set to "never", since we want to
    keep around all of the objects we expired in the previous step, and

  - the original cruft pack appears as a pack that we are going to keep,
    meaning all unreachable objects that are stored in the original
    cruft pack are excluded from the one we write to the "expired.git"
    repository.

You can try this out yourself by doing something like:

    $ git init --bare ../expired.git $ git repack --cruft
    --cruft-expiration=1.day.ago -d \
    --expire-to=../expired.git/objects/pack/pack

which will create two cruft packs:

  - one in the repository which ran `git repack` containing all
    unreachable objects written within the last day, and
  - another in the "expired.git" repository which contains all
    unreachable objects written prior to the last day

This series is an RFC for now since I'm interested in discussing whether
or not this is a feature that people would actually want to use or not.
But if it is, I'm happy to polish this up and turn it into a
non-RFC-quality series ;-).

In the meantime, thanks for your review!

[1]: https://lore.kernel.org/git/YryF+vkosJOXf+mQ@nand.local/

Taylor Blau (4):
  builtin/repack.c: pass "out" to `prepare_pack_objects`
  builtin/repack.c: pass "cruft_expiration" to `write_cruft_pack`
  builtin/repack.c: write cruft packs to arbitrary locations
  builtin/repack.c: implement `--expire-to` for storing pruned objects

 Documentation/git-repack.txt |   6 ++
 builtin/repack.c             |  67 ++++++++++++++++---
 t/t7700-repack.sh            | 121 +++++++++++++++++++++++++++++++++++
 3 files changed, 186 insertions(+), 8 deletions(-)

-- 
2.37.0.1.g1379af2e9d

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [RFC PATCH 1/4] builtin/repack.c: pass "out" to `prepare_pack_objects`
  2022-06-29 18:45 [RFC PATCH 0/4] move pruned objects to a separate repository Taylor Blau
@ 2022-06-29 18:45 ` Taylor Blau
  2022-06-29 18:47 ` [RFC PATCH 2/4] builtin/repack.c: pass "cruft_expiration" to `write_cruft_pack` Taylor Blau
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Taylor Blau @ 2022-06-29 18:45 UTC (permalink / raw)
  To: git; +Cc: derrickstolee, jonathantanmy, gitster

`builtin/repack.c`'s `prepare_pack_objects()` is used to prepare a set
of arguments to a `pack-objects` process which will generate a desired
pack.

A future patch will add an `--expire-to` option which allows `git
repack` to write a cruft pack containing the pruned objects out to a
separate repository. Prepare for this by teaching that function to write
packs to an arbitrary location specified by the caller.

All existing callers of `prepare_pack_objects()` will pass `packtmp` for
`out`, retaining the existing behavior.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 4a7ae4cf48..025882a075 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -188,7 +188,8 @@ static void remove_redundant_pack(const char *dir_name, const char *base_name)
 }
 
 static void prepare_pack_objects(struct child_process *cmd,
-				 const struct pack_objects_args *args)
+				 const struct pack_objects_args *args,
+				 const char *out)
 {
 	strvec_push(&cmd->args, "pack-objects");
 	if (args->window)
@@ -211,7 +212,7 @@ static void prepare_pack_objects(struct child_process *cmd,
 		strvec_push(&cmd->args,  "--quiet");
 	if (delta_base_offset)
 		strvec_push(&cmd->args,  "--delta-base-offset");
-	strvec_push(&cmd->args, packtmp);
+	strvec_push(&cmd->args, out);
 	cmd->git_cmd = 1;
 	cmd->out = -1;
 }
@@ -275,7 +276,7 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
 	FILE *out;
 	struct strbuf line = STRBUF_INIT;
 
-	prepare_pack_objects(&cmd, args);
+	prepare_pack_objects(&cmd, args, packtmp);
 	cmd.in = -1;
 
 	/*
@@ -673,7 +674,7 @@ static int write_cruft_pack(const struct pack_objects_args *args,
 	FILE *in, *out;
 	int ret;
 
-	prepare_pack_objects(&cmd, args);
+	prepare_pack_objects(&cmd, args, packtmp);
 
 	strvec_push(&cmd.args, "--cruft");
 	if (cruft_expiration)
@@ -862,7 +863,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 
 	sigchain_push_common(remove_pack_on_signal);
 
-	prepare_pack_objects(&cmd, &po_args);
+	prepare_pack_objects(&cmd, &po_args, packtmp);
 
 	show_progress = !po_args.quiet && isatty(2);
 
-- 
2.37.0.1.g1379af2e9d


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH 2/4] builtin/repack.c: pass "cruft_expiration" to `write_cruft_pack`
  2022-06-29 18:45 [RFC PATCH 0/4] move pruned objects to a separate repository Taylor Blau
  2022-06-29 18:45 ` [RFC PATCH 1/4] builtin/repack.c: pass "out" to `prepare_pack_objects` Taylor Blau
@ 2022-06-29 18:47 ` Taylor Blau
  2022-06-29 18:47 ` [RFC PATCH 3/4] builtin/repack.c: write cruft packs to arbitrary locations Taylor Blau
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Taylor Blau @ 2022-06-29 18:47 UTC (permalink / raw)
  To: git; +Cc: derrickstolee, jonathantanmy, gitster

`builtin/repack.c`'s `write_cruft_pack()` is used to generate the cruft
pack when `--cruft` is supplied. It uses a static variable
"cruft_expiration" which is filled in by option parsing.

A future patch will add an `--expire-to` option which allows `git
repack` to write a cruft pack containing the pruned objects out to a
separate repository. In order to implement this functionality, some
callers will have to pass a value for `cruft_expiration` different than
the one filled out by option parsing.

Prepare for this by teaching `write_cruft_pack` to take a
"cruft_expiration" parameter, instead of reading a single static
variable.

The (sole) existing caller of `write_cruft_pack()` will pass the value
for "cruft_expiration" filled in by option parsing, retaining existing
behavior. This means that we can make the variable local to
`cmd_repack()`, and eliminate the static declaration.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 025882a075..19e789d745 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -32,7 +32,6 @@ static int write_bitmaps = -1;
 static int use_delta_islands;
 static int run_update_server_info = 1;
 static char *packdir, *packtmp_name, *packtmp;
-static char *cruft_expiration;
 
 static const char *const git_repack_usage[] = {
 	N_("git repack [<options>]"),
@@ -664,6 +663,7 @@ static int write_midx_included_packs(struct string_list *include,
 
 static int write_cruft_pack(const struct pack_objects_args *args,
 			    const char *pack_prefix,
+			    const char *cruft_expiration,
 			    struct string_list *names,
 			    struct string_list *existing_packs,
 			    struct string_list *existing_kept_packs)
@@ -747,6 +747,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct pack_objects_args cruft_po_args = {NULL};
 	int geometric_factor = 0;
 	int write_midx = 0;
+	const char *cruft_expiration = NULL;
 
 	struct option builtin_repack_options[] = {
 		OPT_BIT('a', NULL, &pack_everything,
@@ -986,7 +987,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		cruft_po_args.local = po_args.local;
 		cruft_po_args.quiet = po_args.quiet;
 
-		ret = write_cruft_pack(&cruft_po_args, pack_prefix, &names,
+		ret = write_cruft_pack(&cruft_po_args, pack_prefix,
+				       cruft_expiration, &names,
 				       &existing_nonkept_packs,
 				       &existing_kept_packs);
 		if (ret)
-- 
2.37.0.1.g1379af2e9d


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH 3/4] builtin/repack.c: write cruft packs to arbitrary locations
  2022-06-29 18:45 [RFC PATCH 0/4] move pruned objects to a separate repository Taylor Blau
  2022-06-29 18:45 ` [RFC PATCH 1/4] builtin/repack.c: pass "out" to `prepare_pack_objects` Taylor Blau
  2022-06-29 18:47 ` [RFC PATCH 2/4] builtin/repack.c: pass "cruft_expiration" to `write_cruft_pack` Taylor Blau
@ 2022-06-29 18:47 ` Taylor Blau
  2022-06-29 18:47 ` [RFC PATCH 4/4] builtin/repack.c: implement `--expire-to` for storing pruned objects Taylor Blau
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Taylor Blau @ 2022-06-29 18:47 UTC (permalink / raw)
  To: git; +Cc: derrickstolee, jonathantanmy, gitster

In the following commit, a new write_cruft_pack() caller will be added
which wants to write a cruft pack to an arbitrary location. Prepare for
this by adding a parameter which controls the destination of the cruft
pack.

For now, provide "packtmp" so that this commit does not change any
behavior.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 19e789d745..ab976007e1 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -662,6 +662,7 @@ static int write_midx_included_packs(struct string_list *include,
 }
 
 static int write_cruft_pack(const struct pack_objects_args *args,
+			    const char *destination,
 			    const char *pack_prefix,
 			    const char *cruft_expiration,
 			    struct string_list *names,
@@ -673,8 +674,10 @@ static int write_cruft_pack(const struct pack_objects_args *args,
 	struct string_list_item *item;
 	FILE *in, *out;
 	int ret;
+	const char *scratch;
+	int local = skip_prefix(destination, packdir, &scratch);
 
-	prepare_pack_objects(&cmd, args, packtmp);
+	prepare_pack_objects(&cmd, args, destination);
 
 	strvec_push(&cmd.args, "--cruft");
 	if (cruft_expiration)
@@ -714,7 +717,12 @@ static int write_cruft_pack(const struct pack_objects_args *args,
 		if (line.len != the_hash_algo->hexsz)
 			die(_("repack: Expecting full hex object ID lines only "
 			      "from pack-objects."));
-		string_list_append(names, line.buf);
+                /*
+		 * avoid putting packs written outside of the repository in the
+		 * list of names
+		 */
+		if (local)
+			string_list_append(names, line.buf);
 	}
 	fclose(out);
 
@@ -987,7 +995,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		cruft_po_args.local = po_args.local;
 		cruft_po_args.quiet = po_args.quiet;
 
-		ret = write_cruft_pack(&cruft_po_args, pack_prefix,
+		ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
 				       cruft_expiration, &names,
 				       &existing_nonkept_packs,
 				       &existing_kept_packs);
-- 
2.37.0.1.g1379af2e9d


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH 4/4] builtin/repack.c: implement `--expire-to` for storing pruned objects
  2022-06-29 18:45 [RFC PATCH 0/4] move pruned objects to a separate repository Taylor Blau
                   ` (2 preceding siblings ...)
  2022-06-29 18:47 ` [RFC PATCH 3/4] builtin/repack.c: write cruft packs to arbitrary locations Taylor Blau
@ 2022-06-29 18:47 ` Taylor Blau
  2022-06-29 22:54 ` [RFC PATCH 0/4] move pruned objects to a separate repository Jonathan Tan
  2022-06-30  8:00 ` Ævar Arnfjörð Bjarmason
  5 siblings, 0 replies; 9+ messages in thread
From: Taylor Blau @ 2022-06-29 18:47 UTC (permalink / raw)
  To: git; +Cc: derrickstolee, jonathantanmy, gitster

When pruning objects with `--cruft`, `git repack` offers some
flexibility when selecting the set of which objects are pruned via the
`--cruft-expiration` option.

This is useful for expiring objects which are older than the grace
period, making races where to-be-pruned objects become reachable and
then ancestors of freshly pushed objects, leaving the repository in a
corrupt state after pruning substantially less likely.

But in practice, such races are impossible to avoid entirely, no matter
how long the grace period is. To prevent this race, it is often
advisable to temporarily put a repository into a read-only state. But in
practice, this is not always practical, and so some middle ground would
be nice.

This patch introduces a new option, `--expire-to`, which teaches `git
repack` to write an additional cruft pack containing just the objects
which were pruned from the repository. The caller can specify a
directory outside of the current repository as the destination for this
second cruft pack.

This makes it possible to prune objects from a repository, while still
holding onto a supplemental copy of them outside of the original
repository. Having this copy on-disk makes it substantially easier to
recover objects when the aforementioned race is encountered.

`--expire-to` is implemented in a somewhat convoluted manner, which is
to take advantage of the fact that the first time `write_cruft_pack()`
is called, it adds the name of the cruft pack to the `names` string
list. That means the second time we call `write_cruft_pack()`, objects
in the previously-written cruft pack will be excluded.

As long as the caller ensures that no objects are expired during the
second pass, this is sufficient to generate a cruft pack containing all
objects which don't appear in any of the new packs written by `git
repack`, including the cruft pack. In other words, all of the objects
which are about to be pruned from the repository.

It is important to note that the destination in `--expire-to` does not
necessarily need to be a Git repository (though it can be) Notably, the
expired packs do not contain all ancestors of expired objects. So if the
source repository contains something like:

              <unreachable>
             /
    C1 --- C2
      \
       refs/heads/master

where C2 is unreachable, but has a parent (C1) which is reachable, and
C2 would be pruned, then the expiry pack will contain only C2, not C1.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-repack.txt |   6 ++
 builtin/repack.c             |  40 ++++++++++++
 t/t7700-repack.sh            | 121 +++++++++++++++++++++++++++++++++++
 3 files changed, 167 insertions(+)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 0bf13893d8..4017157949 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -74,6 +74,12 @@ to the new separate pack will be written.
 	immediately instead of waiting for the next `git gc` invocation.
 	Only useful with `--cruft -d`.
 
+--expire-to=<dir>::
+	Write a cruft pack containing pruned objects (if any) to the
+	directory `<dir>`. This option is useful for keeping a copy of
+	any pruned objects in a separate directory as a backup. Only
+	useful with `--cruft -d`.
+
 -l::
 	Pass the `--local` option to 'git pack-objects'. See
 	linkgit:git-pack-objects[1].
diff --git a/builtin/repack.c b/builtin/repack.c
index ab976007e1..d789150a2e 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -702,6 +702,10 @@ static int write_cruft_pack(const struct pack_objects_args *args,
 	 * By the time it is read here, it contains only the pack(s)
 	 * that were just written, which is exactly the set of packs we
 	 * want to consider kept.
+	 *
+	 * If `--expire-to` is given, the double-use served by `names`
+	 * ensures that the pack written to `--expire-to` excludes any
+	 * objects contained in the cruft pack.
 	 */
 	in = xfdopen(cmd.in, "w");
 	for_each_string_list_item(item, names)
@@ -756,6 +760,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int geometric_factor = 0;
 	int write_midx = 0;
 	const char *cruft_expiration = NULL;
+	const char *expire_to = NULL;
 
 	struct option builtin_repack_options[] = {
 		OPT_BIT('a', NULL, &pack_everything,
@@ -805,6 +810,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			    N_("find a geometric progression with factor <N>")),
 		OPT_BOOL('m', "write-midx", &write_midx,
 			   N_("write a multi-pack index of the resulting packs")),
+		OPT_STRING(0, "expire-to", &expire_to, N_("dir"),
+			   N_("pack prefix to store a pack containing pruned objects")),
 		OPT_END()
 	};
 
@@ -1001,6 +1008,39 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				       &existing_kept_packs);
 		if (ret)
 			return ret;
+
+		if (delete_redundant && expire_to) {
+			/*
+			 * If `--expire-to` is given with `-d`, it's possible
+			 * that we're about to prune some objects. With cruft
+			 * packs, pruning is implicit: any objects from existing
+			 * packs that weren't picked up by new packs are removed
+			 * when their packs are deleted.
+			 *
+			 * Generate an additional cruft pack, with one twist:
+			 * `names` now includes the name of the cruft pack
+			 * written in the previous step. So the contents of
+			 * _this_ cruft pack exclude everything contained in the
+			 * existing cruft pack (that is, all of the unreachable
+			 * objects which are no older than
+			 * `--cruft-expiration`).
+			 *
+			 * To make this work, cruft_expiration must become NULL
+			 * so that this cruft pack doesn't actually prune any
+			 * objects. If it were non-NULL, this call would always
+			 * generate an empty pack (since every object not in the
+			 * cruft pack generated above will have an mtime older
+			 * than the expiration).
+			 */
+			ret = write_cruft_pack(&cruft_po_args, expire_to,
+					       pack_prefix,
+					       NULL,
+					       &names,
+					       &existing_nonkept_packs,
+					       &existing_kept_packs);
+			if (ret)
+				return ret;
+		}
 	}
 
 	string_list_sort(&names);
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index ca45c4cd2c..7ffd3c7b54 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -482,4 +482,125 @@ test_expect_success '-n overrides repack.updateServerInfo=true' '
 	test_server_info_missing
 '
 
+test_expect_success '--expire to stores pruned objects (now)' '
+	git init expire-to-now &&
+	(
+		cd expire-to-now &&
+
+		git branch -M main &&
+
+		test_commit base &&
+
+		git checkout -b cruft &&
+		test_commit --no-tag cruft &&
+
+		git rev-list --objects --no-object-names main..cruft >moved.raw &&
+		sort moved.raw >moved.want &&
+
+		git rev-list --all --objects --no-object-names >expect.raw &&
+		sort expect.raw >expect &&
+
+		git checkout main &&
+		git branch -D cruft &&
+		git reflog expire --all --expire=all &&
+
+		git init --bare expired.git &&
+		git repack -d \
+			--cruft --cruft-expiration="now" \
+			--expire-to="expired.git/objects/pack/pack" &&
+
+		expired="$(ls expired.git/objects/pack/pack-*.idx)" &&
+		test_path_is_file "${expired%.idx}.mtimes" &&
+
+		# Since the `--cruft-expiration` is "now", the effective
+		# behavior is to move _all_ unreachable objects out to
+		# the location in `--expire-to`.
+		git show-index <$expired >expired.raw &&
+		cut -d" " -f2 expired.raw | sort >expired.objects &&
+		git rev-list --all --objects --no-object-names \
+			>remaining.objects &&
+
+		# ...in other words, the combined contents of this
+		# repository and expired.git should be the same as the
+		# set of objects we started with.
+		cat expired.objects remaining.objects | sort >actual &&
+		test_cmp expect actual &&
+
+		# The "moved" objects (i.e., those in expired.git)
+		# should be the same as the cruft objects which were
+		# expired in the previous step.
+		test_cmp moved.want expired.objects
+	)
+'
+
+test_expect_success '--expire to stores pruned objects (5.minutes.ago)' '
+	git init expire-to-5.minutes.ago &&
+	(
+		cd expire-to-5.minutes.ago &&
+
+		git branch -M main &&
+
+		test_commit base &&
+
+		# Create two classes of unreachable objects, one which
+		# is older than 5 minutes (stale), and another which is
+		# newer (recent).
+		for kind in stale recent
+		do
+			git checkout -b $kind main &&
+			test_commit --no-tag $kind
+		done &&
+
+		git rev-list --objects --no-object-names main..stale >in &&
+		stale="$(git pack-objects $objdir/pack/pack <in)" &&
+		mtime="$(test-tool chmtime --get =-600 $objdir/pack/pack-$stale.pack)" &&
+
+		# expect holds the set of objects we expect to find in
+		# this repository after repacking
+		git rev-list --objects --no-object-names recent >expect.raw &&
+		sort expect.raw >expect &&
+
+		# moved.want holds the set of objects we expect to find
+		# in expired.git
+		git rev-list --objects --no-object-names main..stale >out &&
+		sort out >moved.want &&
+
+		git checkout main &&
+		git branch -D stale recent &&
+		git reflog expire --all --expire=all &&
+		git prune-packed &&
+
+		git init --bare expired.git &&
+		git repack -d \
+			--cruft --cruft-expiration=5.minutes.ago \
+			--expire-to="expired.git/objects/pack/pack" &&
+
+		# Some of the remaining objects in this repository are
+		# unreachable, so use `cat-file --batch-all-objects`
+		# instead of `rev-list` to get their names
+		git cat-file --batch-all-objects --batch-check="%(objectname)" \
+			>remaining.objects &&
+		sort remaining.objects >actual &&
+		test_cmp expect actual &&
+
+		(
+			cd expired.git &&
+
+			expired="$(ls objects/pack/pack-*.mtimes)" &&
+			test-tool pack-mtimes $(basename $expired) >out &&
+			cut -d" " -f1 out | sort >../moved.got &&
+
+			# Ensure that there are as many objects with the
+			# expected mtime as were moved to expired.git.
+			#
+			# In other words, ensure that the recorded
+			# mtimes of any moved objects was written
+			# correctly.
+			grep " $mtime$" out >matching &&
+			test_line_count = $(wc -l <../moved.want) matching
+		) &&
+		test_cmp moved.want moved.got
+	)
+'
+
 test_done
-- 
2.37.0.1.g1379af2e9d

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 0/4] move pruned objects to a separate repository
  2022-06-29 18:45 [RFC PATCH 0/4] move pruned objects to a separate repository Taylor Blau
                   ` (3 preceding siblings ...)
  2022-06-29 18:47 ` [RFC PATCH 4/4] builtin/repack.c: implement `--expire-to` for storing pruned objects Taylor Blau
@ 2022-06-29 22:54 ` Jonathan Tan
  2022-06-30  2:47   ` Taylor Blau
  2022-06-30  8:00 ` Ævar Arnfjörð Bjarmason
  5 siblings, 1 reply; 9+ messages in thread
From: Jonathan Tan @ 2022-06-29 22:54 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jonathan Tan, git, derrickstolee, gitster

Taylor Blau <me@ttaylorr.com> writes:
> This series is an RFC for now since I'm interested in discussing whether
> or not this is a feature that people would actually want to use or not.
> But if it is, I'm happy to polish this up and turn it into a
> non-RFC-quality series ;-).
> 
> In the meantime, thanks for your review!

Thanks for this patch set. I can see this being used by, say, someone
who wants to preserve a repo that rewinds branches all the time (the
refs would need to be backed-up separately, but at least this provides a
way for objects to be stored efficiently, in that reachable objects are
still stored in the main repo and unreachable objects are stored in the
backup with no overlap between them).

I think there is at least one more alternative that should be
considered, though: since the cruft pack is unlikely to have its objects
"resurrected" (since the reason why they're there is because they are
unreachable), it is likely that the objects that are pruned are exactly
the same as those in the craft pack. So it would be more efficient to
just unconditionally rename the cruft pack to the backup destination.

Having said that, if there is a compelling use case for repacking even
when we're moving from cruft pack to backup, the design of this patch
set looks good overall. There are some minor points (e.g. the naming of
the parameter "out" in patch 1), but I understand that this is an RFC
and I'll wait for a non-RFC patch set before looking more closely at
these things.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 0/4] move pruned objects to a separate repository
  2022-06-29 22:54 ` [RFC PATCH 0/4] move pruned objects to a separate repository Jonathan Tan
@ 2022-06-30  2:47   ` Taylor Blau
  2022-06-30 21:15     ` Jonathan Tan
  0 siblings, 1 reply; 9+ messages in thread
From: Taylor Blau @ 2022-06-30  2:47 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Taylor Blau, git, derrickstolee, gitster

On Wed, Jun 29, 2022 at 03:54:04PM -0700, Jonathan Tan wrote:
> Taylor Blau <me@ttaylorr.com> writes:
> > This series is an RFC for now since I'm interested in discussing whether
> > or not this is a feature that people would actually want to use or not.
> > But if it is, I'm happy to polish this up and turn it into a
> > non-RFC-quality series ;-).
> >
> > In the meantime, thanks for your review!
>
> Thanks for this patch set. I can see this being used by, say, someone
> who wants to preserve a repo that rewinds branches all the time (the
> refs would need to be backed-up separately, but at least this provides a
> way for objects to be stored efficiently, in that reachable objects are
> still stored in the main repo and unreachable objects are stored in the
> backup with no overlap between them).

Yes, definitely.

If it helps, I can share a little bit about the motivating use-case
within GitHub. All objects from a fork network are stored together in a
repository that we call the network.git, with individual forks keeping
track of their own references.

The network.git repository can often grow quite large, and/or contain
data that the owner of an individual fork would like removed (e.g., they
accidentally pushed sensitive credentials, force-pushed over it, but
would like the now-unreachable objects to be removed).

We don't usually do pruning GC's except during manual intervention or
upon request through a support ticket. But when we do it is often
infeasible to lock the entire network's push traffic and reference
updates. So it is not an unheard of event to encounter the race that I
described above.

The idea is that, at least for non-sensitive pruning, we would move the
pruned objects to a separate repository and hold them there until we
could run `git fsck` on the repository after pruning and verify that the
repository is intact. If it is, then the expired.git repository can be
emptied, too, permanently removing the pruned objects. If not, the
expired.git repository then becomes a donor for the missing objects,
which are used to heal the corrupt main repository. Once *that* is done,
and fsck comes back clean, then the expired.git repository can be
removed.

> I think there is at least one more alternative that should be
> considered, though: since the cruft pack is unlikely to have its objects
> "resurrected" (since the reason why they're there is because they are
> unreachable), it is likely that the objects that are pruned are exactly
> the same as those in the craft pack. So it would be more efficient to
> just unconditionally rename the cruft pack to the backup destination.

This isn't quite right. The contents that are written into the
expired.git repository is everything that *didn't* end up in the cruft
pack.

Suppose your cruft expiration is 1.hour.ago, and your doing a repack on
repository foo.git, expiring objects into expired.git. There are three
disjoint sets of objects:

  - reachable objects, which will stay in foo.git
  - unreachable objects which were written within the last hour (and are
    thus too new to prune) which will stay in foo.git
  - unreachable objects which *weren't* written within the last hour
    (and thus will be pruned) which are moved to a new pack in
    expired.git (and removed from foo.git)

So the cruft pack in foo.git and the one written to expired.git are a
disjoint cut of the unreachable objects in foo.git based on their mtime,
with the recent objects staying in the source repository and the stale
ones moving to the expired.git repository.

The original implementation of this feature was to move the entire cruft
pack out of the way like you describe. This is sub-optimal because you
are forced to generate that cruft pack with `--cruft-expiration=never`,
since you can't actually prune any objects when generating the cruft
pack, or they would be gone forever. But since you have to move the
entire cruft pack out of the way, the visible effect looks like you
actually pruned *all* unreachable objects, as if you had supplied
`--cruft-expiration=now`.

Being able to expire just the objects which have aged out of the grace
period should cause this race to happen less frequently in practice.

> Having said that, if there is a compelling use case for repacking even
> when we're moving from cruft pack to backup, the design of this patch
> set looks good overall. There are some minor points (e.g. the naming of
> the parameter "out" in patch 1), but I understand that this is an RFC
> and I'll wait for a non-RFC patch set before looking more closely at
> these things.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 0/4] move pruned objects to a separate repository
  2022-06-29 18:45 [RFC PATCH 0/4] move pruned objects to a separate repository Taylor Blau
                   ` (4 preceding siblings ...)
  2022-06-29 22:54 ` [RFC PATCH 0/4] move pruned objects to a separate repository Jonathan Tan
@ 2022-06-30  8:00 ` Ævar Arnfjörð Bjarmason
  5 siblings, 0 replies; 9+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-06-30  8:00 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, derrickstolee, jonathantanmy, gitster, Jeff King


On Wed, Jun 29 2022, Taylor Blau wrote:

> Now that cruft packs are available in v2.37.0, here is an interesting
> application of that new feature to enable a two-phase object pruning
> approach.
>
> This came out of a discussion within GitHub about ways we could support
> storing a set of pruned objects in "limbo" so that they were not
> accessible from the repository which pruned them, but instead stored in
> a cruft pack in a separate repository which lists the original one as an
> alternate.
>
> This makes it possible to take the collection of all pruned objects and
> store them in a cruft pack in a separate repository. This repository
> (which I have been referring to as the "expired.git") can then be used
> as a donor repository for any missing objects (like the ones described
> by the race in [1]).
> [...]
> [1]: https://lore.kernel.org/git/YryF+vkosJOXf+mQ@nand.local/

I think the best description of that race on-list is this by Jeff King,
if so I think it would be nice to work it into a commit message (for
4/4):

	https://public-inbox.org/git/20190319001829.GL29661@sigill.intra.peff.net/

Downthread of that, starting at:

	https://public-inbox.org/git/878svjj4t5.fsf@evledraar.gmail.com/

I describe a proposed mechanism to address the race condition, which
seems to me be functionally the same as parts of what you're proposing
here. I.e. the "limbo" here being the same as the proposed "gc
quarantine".

The main difference being one that this RFC leaves on the table, which
is how you'd get these objects back into the non-cruft repository once
they're erroneously/racily expired. I imagined that we'd add it as a
special alternate, read it last, and make the object reading code aware
that any object needed from such an alternate is one that we'd need to
COR over to our primary repository:

	https://public-inbox.org/git/8736lnxlig.fsf@evledraar.gmail.com/

Whereas it seems like you're imagining just having the "cruft pack"
repository around so that a support engineer can manually recover from
corruption, or have some other out-of-tree mechanism not part of this
RFC to (semi-?)automate that step.

If you haven't it would be nice if you could read that thread & see if
what I'm describing there is essentially a superset of what you have
here, and if any of the concerns Jeff King brought up there are ones you
think apply here.

Particularly as there was a reference to an off-list (presumably at
GitHub) discussion with Michael Haggerty about these sorts of races. I
don't know if either Jeff or Michael were involved in the discussions
you had.

I think that the mechanism I proposed there was subtly different from
what Jeff was concerned about being racy, but that thread was left
hanging as the last reply is from me trying to clarify that point.

So maybe I'm wrong, but I think if that was the case you'd also be wrong
about this approach being viable, so it would be nice to clear that up
:)

I'd also be very interested to know if you have anything like my
proposed auto-healing via a special alternate planned.  I think that
would allow aggressive pruning of live repositories not by fixing our
underlying race conditions, but by "leaning into" them as it were.

I.e. we'd race even more, but as we could always auto-heal by "no, I'll
actually need that" COR-ing the relevant object(s) back from the "gc
quarantine" (or your "cruft repository") that would be OK.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 0/4] move pruned objects to a separate repository
  2022-06-30  2:47   ` Taylor Blau
@ 2022-06-30 21:15     ` Jonathan Tan
  0 siblings, 0 replies; 9+ messages in thread
From: Jonathan Tan @ 2022-06-30 21:15 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jonathan Tan, git, derrickstolee, gitster

Taylor Blau <me@ttaylorr.com> writes:
> We don't usually do pruning GC's except during manual intervention or
> upon request through a support ticket. But when we do it is often
> infeasible to lock the entire network's push traffic and reference
> updates. So it is not an unheard of event to encounter the race that I
> described above.
> 
> The idea is that, at least for non-sensitive pruning, we would move the
> pruned objects to a separate repository and hold them there until we
> could run `git fsck` on the repository after pruning and verify that the
> repository is intact. If it is, then the expired.git repository can be
> emptied, too, permanently removing the pruned objects. If not, the
> expired.git repository then becomes a donor for the missing objects,
> which are used to heal the corrupt main repository. Once *that* is done,
> and fsck comes back clean, then the expired.git repository can be
> removed.

Thanks for the information. Presumably, during such pruning GCs (because
manual intervention was needed or because of a support ticket), you
would use a cruft expiration of "now", not something like "14 days ago".
If so, more on this specific case later...

> > I think there is at least one more alternative that should be
> > considered, though: since the cruft pack is unlikely to have its objects
> > "resurrected" (since the reason why they're there is because they are
> > unreachable), it is likely that the objects that are pruned are exactly
> > the same as those in the craft pack. So it would be more efficient to
> > just unconditionally rename the cruft pack to the backup destination.
> 
> This isn't quite right. The contents that are written into the
> expired.git repository is everything that *didn't* end up in the cruft
> pack.

I reread what I wrote and realized that I didn't give a good description
of what I was thinking. So hopefully this is a better one: I was
thinking of the case in which GC is regularly run on a repo, say, every
14 days with a 14-day expiration time. So on the first run you would
have:

  a1.pack      a2.pack (+ .mtime)
  Reachable    Unreachable (cruft pack)

and on the second run, assuming this patch set is in effect:

  b1.pack      b2.pack (+ .mtime)          expired.git/b3.pack
  Reachable    Unreachable (cruft pack)    In expired.git

and my idea was that it is very likely that a2.pack and
expired.git/b3.pack are the same, so I thought that a feature in which
cruft packs could be moved instead of deleted would be more efficient.

Going back to the specific case at the start of this email. I now see
that my idea wouldn't work in that specific case, because you would want
to generate expired.git packs from a repository that is unlikely to have
cruft packs (or, at least, it is unlikely that the existing cruft packs
match exactly what you would write in expired.git).

If it is likely that cruft pack(s) would only be written in one place
(whether in the same repository or in expired.git, but not both),
another design option would be to be able to tell Git where to write
cruft packs, but not give Git the ability to write cruft packs both to
the same repository and expired.git. (Unless in your use case, Taylor,
you envision that you would occasionally need to write cruft packs in
both places.) That would simplify the UI and the code a bit, but not by
much (this patch set, as written, is already elegantly written), so I
don't feel too strongly about it and I would be happy with the current
pattern.



^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2022-06-30 21:15 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-29 18:45 [RFC PATCH 0/4] move pruned objects to a separate repository Taylor Blau
2022-06-29 18:45 ` [RFC PATCH 1/4] builtin/repack.c: pass "out" to `prepare_pack_objects` Taylor Blau
2022-06-29 18:47 ` [RFC PATCH 2/4] builtin/repack.c: pass "cruft_expiration" to `write_cruft_pack` Taylor Blau
2022-06-29 18:47 ` [RFC PATCH 3/4] builtin/repack.c: write cruft packs to arbitrary locations Taylor Blau
2022-06-29 18:47 ` [RFC PATCH 4/4] builtin/repack.c: implement `--expire-to` for storing pruned objects Taylor Blau
2022-06-29 22:54 ` [RFC PATCH 0/4] move pruned objects to a separate repository Jonathan Tan
2022-06-30  2:47   ` Taylor Blau
2022-06-30 21:15     ` Jonathan Tan
2022-06-30  8:00 ` Ævar Arnfjörð Bjarmason

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.