All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] repack: make `collect_pack_filenames()` consistent
@ 2023-07-10 19:37 Taylor Blau
  2023-07-10 19:37 ` [PATCH 1/2] builtin/repack.c: only repack `.pack`s that exist Taylor Blau
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Taylor Blau @ 2023-07-10 19:37 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jeff King, Johannes Schindelin,
	Victoria Dye

This series combines a patch that Stolee wrote[1] as an RFC, along with
one that I added as a follow-up on top[2].

The details are spelled out in the commit messages, but the gist is that
the first patch restores behavior from prior to 73320e49ad
(builtin/repack.c: only collect fully-formed packs, 2023-06-07).

The second patch does not change any behavior, but reimplements
`collect_pack_filenames()` in terms of `get_all_packs()`, to make `git
repack`'s notion of which packs exist in a usable state in the
repository is consistent with `add_packed_git()`,
`install_packed_git()`, etc.

[1]: https://lore.kernel.org/git/pull.1546.git.1687287782439.gitgitgadget@gmail.com/
[2]: https://lore.kernel.org/git/ZJ1N2I6sDfxhrJo8@nand.local/

Derrick Stolee (1):
  builtin/repack.c: only repack `.pack`s that exist

Taylor Blau (1):
  builtin/repack.c: avoid dir traversal in `collect_pack_filenames()`

 builtin/repack.c  | 38 +++++++++++++++-----------------------
 t/t7700-repack.sh | 47 ++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 57 insertions(+), 28 deletions(-)

-- 
2.41.0.320.gb3d0d9308ef

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/2] builtin/repack.c: only repack `.pack`s that exist
  2023-07-10 19:37 [PATCH 0/2] repack: make `collect_pack_filenames()` consistent Taylor Blau
@ 2023-07-10 19:37 ` Taylor Blau
  2023-07-10 20:09   ` Junio C Hamano
  2023-07-10 19:37 ` [PATCH 2/2] builtin/repack.c: avoid dir traversal in `collect_pack_filenames()` Taylor Blau
  2023-07-11 17:32 ` [PATCH v2 0/2] repack: make `collect_pack_filenames()` consistent Taylor Blau
  2 siblings, 1 reply; 10+ messages in thread
From: Taylor Blau @ 2023-07-10 19:37 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jeff King, Johannes Schindelin,
	Victoria Dye

From: Derrick Stolee <derrickstolee@github.com>

In 73320e49add (builtin/repack.c: only collect fully-formed packs,
2023-06-07), we switched the check for which packs to collect by
starting at the .idx files and looking for matching .pack files. This
avoids trying to repack pack-files that have not had their pack-indexes
installed yet.

However, it does cause maintenance to halt if we find the (problematic,
but not insurmountable) case of a .idx file without a corresponding
.pack file. In an environment where packfile maintenance is a critical
function, such a hard stop is costly and requires human intervention to
resolve (by deleting the .idx file).

This was not the case before. We successfully repacked through this
scenario until the recent change to scan for .idx files.

Further, if we are actually in a case where objects are missing, we
detect this at a different point during the reachability walk.

In other cases, Git prepares its list of packfiles by scanning .idx
files and then only adds it to the packfile list if the corresponding
.pack file exists. It even does so without a warning! (See
add_packed_git() in packfile.c for details.)

This case is much less likely to occur than the failures seen before
73320e49add. Packfiles are "installed" by writing the .pack file before
the .idx and that process can be interrupted. Packfiles _should_ be
deleted by deleting the .idx first, followed by the .pack file, but
unlink_pack_path() does not do this: it deletes the .pack _first_,
allowing a window where this process could be interrupted. We leave the
consideration of changing this order as a separate concern. Knowing that
this condition is possible from interrupted Git processes and not other
tools lends some weight that Git should be more flexible around this
scenario.

Add a check to see if the .pack file exists before adding it to the list
for repacking. This will stop a number of maintenance failures seen in
production but fixed by deleting the .idx files.

This brings us closer to the case before 73320e49add in that 'git
repack' will not fail when there is an orphaned .idx file, at least, not
due to the way we scan for packfiles. In the case that the .pack file
was erroneously deleted without copies of its objects in other installed
packfiles, then 'git repack' will fail due to the reachable object walk.

This does resolve the case where automated repacks will no longer be
halted on this case. The tests in t7700 show both these successful
scenarios and the case of failing if the .pack was truly required.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c  |  3 +++
 t/t7700-repack.sh | 47 ++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 45 insertions(+), 5 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 51698e3c68..724e09536e 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -125,6 +125,9 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list,
 		strbuf_add(&buf, e->d_name, len);
 		strbuf_addstr(&buf, ".pack");
 
+		if (!file_exists(mkpath("%s/%s", packdir, buf.buf)))
+			continue;
+
 		for (i = 0; i < extra_keep->nr; i++)
 			if (!fspathcmp(buf.buf, extra_keep->items[i].string))
 				break;
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index af79266c58..284954791d 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -213,16 +213,16 @@ test_expect_success 'repack --keep-pack' '
 	test_create_repo keep-pack &&
 	(
 		cd keep-pack &&
-		# avoid producing difference packs to delta/base choices
+		# avoid producing different packs due to delta/base choices
 		git config pack.window 0 &&
 		P1=$(commit_and_pack 1) &&
 		P2=$(commit_and_pack 2) &&
 		P3=$(commit_and_pack 3) &&
 		P4=$(commit_and_pack 4) &&
-		ls .git/objects/pack/*.pack >old-counts &&
+		find .git/objects/pack -name "*.pack" -type f | sort >old-counts &&
 		test_line_count = 4 old-counts &&
 		git repack -a -d --keep-pack $P1 --keep-pack $P4 &&
-		ls .git/objects/pack/*.pack >new-counts &&
+		find .git/objects/pack -name "*.pack" -type f | sort >new-counts &&
 		grep -q $P1 new-counts &&
 		grep -q $P4 new-counts &&
 		test_line_count = 3 new-counts &&
@@ -239,14 +239,51 @@ test_expect_success 'repack --keep-pack' '
 			mv "$from" "$to" || return 1
 		done &&
 
+		# A .idx file without a .pack should not stop us from
+		# repacking what we can.
+		touch .git/objects/pack/pack-does-not-exist.idx &&
+
+		find .git/objects/pack -name "*.pack" -type f | sort >packs.before &&
 		git repack --cruft -d --keep-pack $P1 --keep-pack $P4 &&
+		find .git/objects/pack -name "*.pack" -type f | sort >packs.after &&
 
-		ls .git/objects/pack/*.pack >newer-counts &&
-		test_cmp new-counts newer-counts &&
+		test_cmp packs.before packs.after &&
 		git fsck
 	)
 '
 
+test_expect_success 'repacking fails when missing .pack actually means missing objects' '
+	test_create_repo idx-without-pack &&
+	(
+		cd idx-without-pack &&
+
+		# Avoid producing different packs due to delta/base choices
+		git config pack.window 0 &&
+		P1=$(commit_and_pack 1) &&
+		P2=$(commit_and_pack 2) &&
+		P3=$(commit_and_pack 3) &&
+		P4=$(commit_and_pack 4) &&
+		find .git/objects/pack -name "*.pack" -type f | sort >old-counts &&
+		test_line_count = 4 old-counts &&
+
+		# Remove one .pack file
+		rm .git/objects/pack/$P2 &&
+
+		find .git/objects/pack -name "*.pack" -type f |
+			sort >before-pack-dir &&
+
+		test_must_fail git fsck &&
+		test_must_fail git repack --cruft -d 2>err &&
+		grep "bad object" err &&
+
+		# Before failing, the repack did not modify the
+		# pack directory.
+		find .git/objects/pack -name "*.pack" -type f |
+			sort >after-pack-dir &&
+		test_cmp before-pack-dir after-pack-dir
+	)
+'
+
 test_expect_success 'bitmaps are created by default in bare repos' '
 	git clone --bare .git bare.git &&
 	rm -f bare.git/objects/pack/*.bitmap &&
-- 
2.41.0.320.gb3d0d9308ef


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/2] builtin/repack.c: avoid dir traversal in `collect_pack_filenames()`
  2023-07-10 19:37 [PATCH 0/2] repack: make `collect_pack_filenames()` consistent Taylor Blau
  2023-07-10 19:37 ` [PATCH 1/2] builtin/repack.c: only repack `.pack`s that exist Taylor Blau
@ 2023-07-10 19:37 ` Taylor Blau
  2023-07-10 21:35   ` Junio C Hamano
  2023-07-11 17:32 ` [PATCH v2 0/2] repack: make `collect_pack_filenames()` consistent Taylor Blau
  2 siblings, 1 reply; 10+ messages in thread
From: Taylor Blau @ 2023-07-10 19:37 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jeff King, Johannes Schindelin,
	Victoria Dye

When repacking, the function `collect_pack_filenames()` is responsible
for collecting the set of existing packs in the repository, and
partitioning them into "kept" (if the pack has a ".keep" file or was
given via `--keep-pack`) and "nonkept" (otherwise) lists.

This function comes from the original C port of git-repack.sh from back
in a1bbc6c0176 (repack: rewrite the shell script in C, 2013-09-15),
where it first appears as `get_non_kept_pack_filenames()`. At the time,
the implementation was a fairly direct translation from the relevant
portion of git-repack.sh, which looped over the results of

    find "$PACKDIR" -type f -name '*.pack'

either ignoring the pack as kept, or adding it to the list of existing
packs.

So the choice to directly translate this function in terms of
`readdir()` in a1bbc6c0176 made sense. At the time, it was possible to
refine the C version in terms of packed_git structs, but was never done.

However, manually enumerating a repository's packs via `readdir()` is
confusing and error-prone. It leads to frustrating inconsistencies
between which packs Git considers to be part of a repository (i.e.,
could be found in the list of packs from `get_all_packs()`), and which
packs `collect_pack_filenames()` considers to meet the same criteria.

This bit us in 73320e49ad (builtin/repack.c: only collect fully-formed
packs, 2023-06-07), and again in the previous commit.

Prevent these issues from biting us in the future by implementing the
`collect_pack_filenames()` function by looping over an array of pointers
to `packed_git` structs, ensuring that we use the same criteria to
determine the set of available packs.

One gotcha here is that we have to ignore non-local packs, since the
original version of `collect_pack_filenames()` only looks at the local
pack directory to collect existing packs.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c | 41 +++++++++++++++--------------------------
 1 file changed, 15 insertions(+), 26 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 724e09536e..ac9fc61d2e 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -106,49 +106,38 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list,
 				   struct string_list *fname_kept_list,
 				   const struct string_list *extra_keep)
 {
-	DIR *dir;
-	struct dirent *e;
-	char *fname;
+	struct packed_git *p;
 	struct strbuf buf = STRBUF_INIT;
 
-	if (!(dir = opendir(packdir)))
-		return;
-
-	while ((e = readdir(dir)) != NULL) {
-		size_t len;
+	for (p = get_all_packs(the_repository); p; p = p->next) {
 		int i;
+		const char *base;
 
-		if (!strip_suffix(e->d_name, ".idx", &len))
+		if (!p->pack_local)
 			continue;
 
-		strbuf_reset(&buf);
-		strbuf_add(&buf, e->d_name, len);
-		strbuf_addstr(&buf, ".pack");
-
-		if (!file_exists(mkpath("%s/%s", packdir, buf.buf)))
-			continue;
+		base = pack_basename(p);
 
 		for (i = 0; i < extra_keep->nr; i++)
-			if (!fspathcmp(buf.buf, extra_keep->items[i].string))
+			if (!fspathcmp(base, extra_keep->items[i].string))
 				break;
 
-		fname = xmemdupz(e->d_name, len);
+		strbuf_reset(&buf);
+		strbuf_addstr(&buf, base);
+		strbuf_strip_suffix(&buf, ".pack");
 
-		if ((extra_keep->nr > 0 && i < extra_keep->nr) ||
-		    (file_exists(mkpath("%s/%s.keep", packdir, fname)))) {
-			string_list_append_nodup(fname_kept_list, fname);
-		} else {
+		if ((extra_keep->nr > 0 && i < extra_keep->nr) || p->pack_keep)
+			string_list_append(fname_kept_list, buf.buf);
+		else {
 			struct string_list_item *item;
-			item = string_list_append_nodup(fname_nonkept_list,
-							fname);
-			if (file_exists(mkpath("%s/%s.mtimes", packdir, fname)))
+			item = string_list_append(fname_nonkept_list, buf.buf);
+			if (p->is_cruft)
 				item->util = (void*)(uintptr_t)CRUFT_PACK;
 		}
 	}
-	closedir(dir);
-	strbuf_release(&buf);
 
 	string_list_sort(fname_kept_list);
+	strbuf_release(&buf);
 }
 
 static void remove_redundant_pack(const char *dir_name, const char *base_name)
-- 
2.41.0.320.gb3d0d9308ef

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] builtin/repack.c: only repack `.pack`s that exist
  2023-07-10 19:37 ` [PATCH 1/2] builtin/repack.c: only repack `.pack`s that exist Taylor Blau
@ 2023-07-10 20:09   ` Junio C Hamano
  2023-07-11 17:28     ` Taylor Blau
  0 siblings, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2023-07-10 20:09 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Derrick Stolee, Jeff King, Johannes Schindelin, Victoria Dye

Taylor Blau <me@ttaylorr.com> writes:

> Add a check to see if the .pack file exists before adding it to the list
> for repacking. This will stop a number of maintenance failures seen in
> production but fixed by deleting the .idx files.

When we are adding we'd eventually add both and something that lack
one of them will eventually become complete and will be part of the
repacking once that happens.  When we are removing (because another
repack has about to finish), removing either one of them will make
the pack ineligible, which is OK.  If somebody crashed while adding
or removing and ended up leaving only one, not both, on the
filesystem, ignoring such a leftover stuff would be the least
disruptive for automation.  Makes sense.

> diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
> index af79266c58..284954791d 100755
> --- a/t/t7700-repack.sh
> +++ b/t/t7700-repack.sh
> @@ -213,16 +213,16 @@ test_expect_success 'repack --keep-pack' '
>  	test_create_repo keep-pack &&
>  	(
>  		cd keep-pack &&
> -		# avoid producing difference packs to delta/base choices
> +		# avoid producing different packs due to delta/base choices

OK.

>  		git config pack.window 0 &&
>  		P1=$(commit_and_pack 1) &&
>  		P2=$(commit_and_pack 2) &&
>  		P3=$(commit_and_pack 3) &&
>  		P4=$(commit_and_pack 4) &&
> -		ls .git/objects/pack/*.pack >old-counts &&
> +		find .git/objects/pack -name "*.pack" -type f | sort >old-counts &&
>  		test_line_count = 4 old-counts &&
>  		git repack -a -d --keep-pack $P1 --keep-pack $P4 &&
> -		ls .git/objects/pack/*.pack >new-counts &&
> +		find .git/objects/pack -name "*.pack" -type f | sort >new-counts &&
>  		grep -q $P1 new-counts &&
>  		grep -q $P4 new-counts &&
>  		test_line_count = 3 new-counts &&

I do not think "sort" in both of these added lines is doing anything
useful.  Does the test break without this hunk and if so how?

> @@ -239,14 +239,51 @@ test_expect_success 'repack --keep-pack' '
>  			mv "$from" "$to" || return 1
>  		done &&
>  
> +		# A .idx file without a .pack should not stop us from
> +		# repacking what we can.
> +		touch .git/objects/pack/pack-does-not-exist.idx &&

	>.git/objects/pack/pack-does-not-exist.idx &&

> +		find .git/objects/pack -name "*.pack" -type f | sort >packs.before &&
>  		git repack --cruft -d --keep-pack $P1 --keep-pack $P4 &&
> +		find .git/objects/pack -name "*.pack" -type f | sort >packs.after &&
>  
> -		ls .git/objects/pack/*.pack >newer-counts &&
> -		test_cmp new-counts newer-counts &&
> +		test_cmp packs.before packs.after &&


I'd say the changes from "ls" to "find" in the earlier hunk is
excusable in the name of consistency with this updated on, if the
use of "sort" matters in this hunk, but as "ls" gives a consistent
output unless you say "ls (-U|--sort=none)", I am not sure if we
need "sort" in this hunk, either.  So, I dunno.

"before vs after" does look like an improvement over "new vs newer".

> +test_expect_success 'repacking fails when missing .pack actually means missing objects' '
> +	test_create_repo idx-without-pack &&
> +	(
> +		cd idx-without-pack &&
> +
> +		# Avoid producing different packs due to delta/base choices
> +		git config pack.window 0 &&
> +		P1=$(commit_and_pack 1) &&
> +		P2=$(commit_and_pack 2) &&
> +		P3=$(commit_and_pack 3) &&
> +		P4=$(commit_and_pack 4) &&
> +		find .git/objects/pack -name "*.pack" -type f | sort >old-counts &&
> +		test_line_count = 4 old-counts &&
> +
> +		# Remove one .pack file
> +		rm .git/objects/pack/$P2 &&
> +
> +		find .git/objects/pack -name "*.pack" -type f |
> +			sort >before-pack-dir &&
> +
> +		test_must_fail git fsck &&
> +		test_must_fail git repack --cruft -d 2>err &&
> +		grep "bad object" err &&
> +
> +		# Before failing, the repack did not modify the
> +		# pack directory.
> +		find .git/objects/pack -name "*.pack" -type f |
> +			sort >after-pack-dir &&
> +		test_cmp before-pack-dir after-pack-dir
> +	)
> +'

Ditto for the use of "find | sort" vs "ls", but otherwise it looks
like it is testing the right thing.

Thanks.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] builtin/repack.c: avoid dir traversal in `collect_pack_filenames()`
  2023-07-10 19:37 ` [PATCH 2/2] builtin/repack.c: avoid dir traversal in `collect_pack_filenames()` Taylor Blau
@ 2023-07-10 21:35   ` Junio C Hamano
  2023-07-11 17:29     ` Taylor Blau
  0 siblings, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2023-07-10 21:35 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Derrick Stolee, Jeff King, Johannes Schindelin, Victoria Dye

Taylor Blau <me@ttaylorr.com> writes:

> However, manually enumerating a repository's packs via `readdir()` is
> confusing and error-prone. It leads to frustrating inconsistencies
> between which packs Git considers to be part of a repository (i.e.,
> could be found in the list of packs from `get_all_packs()`), and which
> packs `collect_pack_filenames()` considers to meet the same criteria.

Makes sense.

> One gotcha here is that we have to ignore non-local packs, since the
> original version of `collect_pack_filenames()` only looks at the local
> pack directory to collect existing packs.

Yup.

Does this "fix" anything, or just makes the resulting code clearer
and protects it from future breakage?  I think it is the latter and
not having any test is justified.

Thanks.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] builtin/repack.c: only repack `.pack`s that exist
  2023-07-10 20:09   ` Junio C Hamano
@ 2023-07-11 17:28     ` Taylor Blau
  0 siblings, 0 replies; 10+ messages in thread
From: Taylor Blau @ 2023-07-11 17:28 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Derrick Stolee, Jeff King, Johannes Schindelin, Victoria Dye

On Mon, Jul 10, 2023 at 01:09:50PM -0700, Junio C Hamano wrote:
> >  		git config pack.window 0 &&
> >  		P1=$(commit_and_pack 1) &&
> >  		P2=$(commit_and_pack 2) &&
> >  		P3=$(commit_and_pack 3) &&
> >  		P4=$(commit_and_pack 4) &&
> > -		ls .git/objects/pack/*.pack >old-counts &&
> > +		find .git/objects/pack -name "*.pack" -type f | sort >old-counts &&
> >  		test_line_count = 4 old-counts &&
> >  		git repack -a -d --keep-pack $P1 --keep-pack $P4 &&
> > -		ls .git/objects/pack/*.pack >new-counts &&
> > +		find .git/objects/pack -name "*.pack" -type f | sort >new-counts &&
> >  		grep -q $P1 new-counts &&
> >  		grep -q $P4 new-counts &&
> >  		test_line_count = 3 new-counts &&
>
> I do not think "sort" in both of these added lines is doing anything
> useful.  Does the test break without this hunk and if so how?

This is an error on my part; I don't think I realized that a bare ls
gives sorted output, so changing these to 'find ... | sort' was
unnecessary. I'll send a version that removes that noise.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] builtin/repack.c: avoid dir traversal in `collect_pack_filenames()`
  2023-07-10 21:35   ` Junio C Hamano
@ 2023-07-11 17:29     ` Taylor Blau
  0 siblings, 0 replies; 10+ messages in thread
From: Taylor Blau @ 2023-07-11 17:29 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Derrick Stolee, Jeff King, Johannes Schindelin, Victoria Dye

On Mon, Jul 10, 2023 at 02:35:19PM -0700, Junio C Hamano wrote:
> Does this "fix" anything, or just makes the resulting code clearer
> and protects it from future breakage?  I think it is the latter and
> not having any test is justified.

Nope, this doesn't change any existing behavior, and there is nothing
broken with the original. So this is more about clarity and hardening
this function against future breakage.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v2 0/2] repack: make `collect_pack_filenames()` consistent
  2023-07-10 19:37 [PATCH 0/2] repack: make `collect_pack_filenames()` consistent Taylor Blau
  2023-07-10 19:37 ` [PATCH 1/2] builtin/repack.c: only repack `.pack`s that exist Taylor Blau
  2023-07-10 19:37 ` [PATCH 2/2] builtin/repack.c: avoid dir traversal in `collect_pack_filenames()` Taylor Blau
@ 2023-07-11 17:32 ` Taylor Blau
  2023-07-11 17:32   ` [PATCH v2 1/2] builtin/repack.c: only repack `.pack`s that exist Taylor Blau
  2023-07-11 17:32   ` [PATCH v2 2/2] builtin/repack.c: avoid dir traversal in `collect_pack_filenames()` Taylor Blau
  2 siblings, 2 replies; 10+ messages in thread
From: Taylor Blau @ 2023-07-11 17:32 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jeff King, Junio C Hamano

Here is a small replacement for a series that Stolee and I worked on
together to make the behavior of `collect_pack_filenames()` consistent
with the rest of Git's pack liveness rules.

Nothing much has changed since v1, with the exception of removing
unnecessary conversions from 'ls' -> 'find ... | sort'. As always, a
range-diff is included below for convenience.

Thanks in advance for another look.

Derrick Stolee (1):
  builtin/repack.c: only repack `.pack`s that exist

Taylor Blau (1):
  builtin/repack.c: avoid dir traversal in `collect_pack_filenames()`

 builtin/repack.c  | 38 +++++++++++++++-----------------------
 t/t7700-repack.sh | 36 +++++++++++++++++++++++++++++++++++-
 2 files changed, 50 insertions(+), 24 deletions(-)

Range-diff against v1:
1:  f14a88f107 ! 1:  410d2f0165 builtin/repack.c: only repack `.pack`s that exist
    @@ t/t7700-repack.sh: test_expect_success 'repack --keep-pack' '
      		git config pack.window 0 &&
      		P1=$(commit_and_pack 1) &&
      		P2=$(commit_and_pack 2) &&
    - 		P3=$(commit_and_pack 3) &&
    - 		P4=$(commit_and_pack 4) &&
    --		ls .git/objects/pack/*.pack >old-counts &&
    -+		find .git/objects/pack -name "*.pack" -type f | sort >old-counts &&
    - 		test_line_count = 4 old-counts &&
    - 		git repack -a -d --keep-pack $P1 --keep-pack $P4 &&
    --		ls .git/objects/pack/*.pack >new-counts &&
    -+		find .git/objects/pack -name "*.pack" -type f | sort >new-counts &&
    - 		grep -q $P1 new-counts &&
    - 		grep -q $P4 new-counts &&
    - 		test_line_count = 3 new-counts &&
     @@ t/t7700-repack.sh: test_expect_success 'repack --keep-pack' '
      			mv "$from" "$to" || return 1
      		done &&
    @@ t/t7700-repack.sh: test_expect_success 'repack --keep-pack' '
     +		# repacking what we can.
     +		touch .git/objects/pack/pack-does-not-exist.idx &&
     +
    -+		find .git/objects/pack -name "*.pack" -type f | sort >packs.before &&
      		git repack --cruft -d --keep-pack $P1 --keep-pack $P4 &&
    -+		find .git/objects/pack -name "*.pack" -type f | sort >packs.after &&
      
    --		ls .git/objects/pack/*.pack >newer-counts &&
    --		test_cmp new-counts newer-counts &&
    -+		test_cmp packs.before packs.after &&
    - 		git fsck
    + 		ls .git/objects/pack/*.pack >newer-counts &&
    +@@ t/t7700-repack.sh: test_expect_success 'repack --keep-pack' '
      	)
      '
      
    @@ t/t7700-repack.sh: test_expect_success 'repack --keep-pack' '
     +		P2=$(commit_and_pack 2) &&
     +		P3=$(commit_and_pack 3) &&
     +		P4=$(commit_and_pack 4) &&
    -+		find .git/objects/pack -name "*.pack" -type f | sort >old-counts &&
    ++		ls .git/objects/pack/*.pack >old-counts &&
     +		test_line_count = 4 old-counts &&
     +
     +		# Remove one .pack file
     +		rm .git/objects/pack/$P2 &&
     +
    -+		find .git/objects/pack -name "*.pack" -type f |
    -+			sort >before-pack-dir &&
    ++		ls .git/objects/pack/*.pack >before-pack-dir &&
     +
     +		test_must_fail git fsck &&
     +		test_must_fail git repack --cruft -d 2>err &&
    @@ t/t7700-repack.sh: test_expect_success 'repack --keep-pack' '
     +
     +		# Before failing, the repack did not modify the
     +		# pack directory.
    -+		find .git/objects/pack -name "*.pack" -type f |
    -+			sort >after-pack-dir &&
    ++		ls .git/objects/pack/*.pack >after-pack-dir &&
     +		test_cmp before-pack-dir after-pack-dir
     +	)
     +'
2:  b3d0d9308e = 2:  ffdf85f6d3 builtin/repack.c: avoid dir traversal in `collect_pack_filenames()`
-- 
2.41.0.329.gffdf85f6d39

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v2 1/2] builtin/repack.c: only repack `.pack`s that exist
  2023-07-11 17:32 ` [PATCH v2 0/2] repack: make `collect_pack_filenames()` consistent Taylor Blau
@ 2023-07-11 17:32   ` Taylor Blau
  2023-07-11 17:32   ` [PATCH v2 2/2] builtin/repack.c: avoid dir traversal in `collect_pack_filenames()` Taylor Blau
  1 sibling, 0 replies; 10+ messages in thread
From: Taylor Blau @ 2023-07-11 17:32 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jeff King, Junio C Hamano

From: Derrick Stolee <derrickstolee@github.com>

In 73320e49add (builtin/repack.c: only collect fully-formed packs,
2023-06-07), we switched the check for which packs to collect by
starting at the .idx files and looking for matching .pack files. This
avoids trying to repack pack-files that have not had their pack-indexes
installed yet.

However, it does cause maintenance to halt if we find the (problematic,
but not insurmountable) case of a .idx file without a corresponding
.pack file. In an environment where packfile maintenance is a critical
function, such a hard stop is costly and requires human intervention to
resolve (by deleting the .idx file).

This was not the case before. We successfully repacked through this
scenario until the recent change to scan for .idx files.

Further, if we are actually in a case where objects are missing, we
detect this at a different point during the reachability walk.

In other cases, Git prepares its list of packfiles by scanning .idx
files and then only adds it to the packfile list if the corresponding
.pack file exists. It even does so without a warning! (See
add_packed_git() in packfile.c for details.)

This case is much less likely to occur than the failures seen before
73320e49add. Packfiles are "installed" by writing the .pack file before
the .idx and that process can be interrupted. Packfiles _should_ be
deleted by deleting the .idx first, followed by the .pack file, but
unlink_pack_path() does not do this: it deletes the .pack _first_,
allowing a window where this process could be interrupted. We leave the
consideration of changing this order as a separate concern. Knowing that
this condition is possible from interrupted Git processes and not other
tools lends some weight that Git should be more flexible around this
scenario.

Add a check to see if the .pack file exists before adding it to the list
for repacking. This will stop a number of maintenance failures seen in
production but fixed by deleting the .idx files.

This brings us closer to the case before 73320e49add in that 'git
repack' will not fail when there is an orphaned .idx file, at least, not
due to the way we scan for packfiles. In the case that the .pack file
was erroneously deleted without copies of its objects in other installed
packfiles, then 'git repack' will fail due to the reachable object walk.

This does resolve the case where automated repacks will no longer be
halted on this case. The tests in t7700 show both these successful
scenarios and the case of failing if the .pack was truly required.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c  |  3 +++
 t/t7700-repack.sh | 36 +++++++++++++++++++++++++++++++++++-
 2 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 51698e3c68..724e09536e 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -125,6 +125,9 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list,
 		strbuf_add(&buf, e->d_name, len);
 		strbuf_addstr(&buf, ".pack");
 
+		if (!file_exists(mkpath("%s/%s", packdir, buf.buf)))
+			continue;
+
 		for (i = 0; i < extra_keep->nr; i++)
 			if (!fspathcmp(buf.buf, extra_keep->items[i].string))
 				break;
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index af79266c58..27b66807cd 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -213,7 +213,7 @@ test_expect_success 'repack --keep-pack' '
 	test_create_repo keep-pack &&
 	(
 		cd keep-pack &&
-		# avoid producing difference packs to delta/base choices
+		# avoid producing different packs due to delta/base choices
 		git config pack.window 0 &&
 		P1=$(commit_and_pack 1) &&
 		P2=$(commit_and_pack 2) &&
@@ -239,6 +239,10 @@ test_expect_success 'repack --keep-pack' '
 			mv "$from" "$to" || return 1
 		done &&
 
+		# A .idx file without a .pack should not stop us from
+		# repacking what we can.
+		touch .git/objects/pack/pack-does-not-exist.idx &&
+
 		git repack --cruft -d --keep-pack $P1 --keep-pack $P4 &&
 
 		ls .git/objects/pack/*.pack >newer-counts &&
@@ -247,6 +251,36 @@ test_expect_success 'repack --keep-pack' '
 	)
 '
 
+test_expect_success 'repacking fails when missing .pack actually means missing objects' '
+	test_create_repo idx-without-pack &&
+	(
+		cd idx-without-pack &&
+
+		# Avoid producing different packs due to delta/base choices
+		git config pack.window 0 &&
+		P1=$(commit_and_pack 1) &&
+		P2=$(commit_and_pack 2) &&
+		P3=$(commit_and_pack 3) &&
+		P4=$(commit_and_pack 4) &&
+		ls .git/objects/pack/*.pack >old-counts &&
+		test_line_count = 4 old-counts &&
+
+		# Remove one .pack file
+		rm .git/objects/pack/$P2 &&
+
+		ls .git/objects/pack/*.pack >before-pack-dir &&
+
+		test_must_fail git fsck &&
+		test_must_fail git repack --cruft -d 2>err &&
+		grep "bad object" err &&
+
+		# Before failing, the repack did not modify the
+		# pack directory.
+		ls .git/objects/pack/*.pack >after-pack-dir &&
+		test_cmp before-pack-dir after-pack-dir
+	)
+'
+
 test_expect_success 'bitmaps are created by default in bare repos' '
 	git clone --bare .git bare.git &&
 	rm -f bare.git/objects/pack/*.bitmap &&
-- 
2.41.0.329.gffdf85f6d39


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v2 2/2] builtin/repack.c: avoid dir traversal in `collect_pack_filenames()`
  2023-07-11 17:32 ` [PATCH v2 0/2] repack: make `collect_pack_filenames()` consistent Taylor Blau
  2023-07-11 17:32   ` [PATCH v2 1/2] builtin/repack.c: only repack `.pack`s that exist Taylor Blau
@ 2023-07-11 17:32   ` Taylor Blau
  1 sibling, 0 replies; 10+ messages in thread
From: Taylor Blau @ 2023-07-11 17:32 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jeff King, Junio C Hamano

When repacking, the function `collect_pack_filenames()` is responsible
for collecting the set of existing packs in the repository, and
partitioning them into "kept" (if the pack has a ".keep" file or was
given via `--keep-pack`) and "nonkept" (otherwise) lists.

This function comes from the original C port of git-repack.sh from back
in a1bbc6c0176 (repack: rewrite the shell script in C, 2013-09-15),
where it first appears as `get_non_kept_pack_filenames()`. At the time,
the implementation was a fairly direct translation from the relevant
portion of git-repack.sh, which looped over the results of

    find "$PACKDIR" -type f -name '*.pack'

either ignoring the pack as kept, or adding it to the list of existing
packs.

So the choice to directly translate this function in terms of
`readdir()` in a1bbc6c0176 made sense. At the time, it was possible to
refine the C version in terms of packed_git structs, but was never done.

However, manually enumerating a repository's packs via `readdir()` is
confusing and error-prone. It leads to frustrating inconsistencies
between which packs Git considers to be part of a repository (i.e.,
could be found in the list of packs from `get_all_packs()`), and which
packs `collect_pack_filenames()` considers to meet the same criteria.

This bit us in 73320e49ad (builtin/repack.c: only collect fully-formed
packs, 2023-06-07), and again in the previous commit.

Prevent these issues from biting us in the future by implementing the
`collect_pack_filenames()` function by looping over an array of pointers
to `packed_git` structs, ensuring that we use the same criteria to
determine the set of available packs.

One gotcha here is that we have to ignore non-local packs, since the
original version of `collect_pack_filenames()` only looks at the local
pack directory to collect existing packs.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c | 41 +++++++++++++++--------------------------
 1 file changed, 15 insertions(+), 26 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 724e09536e..ac9fc61d2e 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -106,49 +106,38 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list,
 				   struct string_list *fname_kept_list,
 				   const struct string_list *extra_keep)
 {
-	DIR *dir;
-	struct dirent *e;
-	char *fname;
+	struct packed_git *p;
 	struct strbuf buf = STRBUF_INIT;
 
-	if (!(dir = opendir(packdir)))
-		return;
-
-	while ((e = readdir(dir)) != NULL) {
-		size_t len;
+	for (p = get_all_packs(the_repository); p; p = p->next) {
 		int i;
+		const char *base;
 
-		if (!strip_suffix(e->d_name, ".idx", &len))
+		if (!p->pack_local)
 			continue;
 
-		strbuf_reset(&buf);
-		strbuf_add(&buf, e->d_name, len);
-		strbuf_addstr(&buf, ".pack");
-
-		if (!file_exists(mkpath("%s/%s", packdir, buf.buf)))
-			continue;
+		base = pack_basename(p);
 
 		for (i = 0; i < extra_keep->nr; i++)
-			if (!fspathcmp(buf.buf, extra_keep->items[i].string))
+			if (!fspathcmp(base, extra_keep->items[i].string))
 				break;
 
-		fname = xmemdupz(e->d_name, len);
+		strbuf_reset(&buf);
+		strbuf_addstr(&buf, base);
+		strbuf_strip_suffix(&buf, ".pack");
 
-		if ((extra_keep->nr > 0 && i < extra_keep->nr) ||
-		    (file_exists(mkpath("%s/%s.keep", packdir, fname)))) {
-			string_list_append_nodup(fname_kept_list, fname);
-		} else {
+		if ((extra_keep->nr > 0 && i < extra_keep->nr) || p->pack_keep)
+			string_list_append(fname_kept_list, buf.buf);
+		else {
 			struct string_list_item *item;
-			item = string_list_append_nodup(fname_nonkept_list,
-							fname);
-			if (file_exists(mkpath("%s/%s.mtimes", packdir, fname)))
+			item = string_list_append(fname_nonkept_list, buf.buf);
+			if (p->is_cruft)
 				item->util = (void*)(uintptr_t)CRUFT_PACK;
 		}
 	}
-	closedir(dir);
-	strbuf_release(&buf);
 
 	string_list_sort(fname_kept_list);
+	strbuf_release(&buf);
 }
 
 static void remove_redundant_pack(const char *dir_name, const char *base_name)
-- 
2.41.0.329.gffdf85f6d39

^ permalink raw reply related	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2023-07-11 17:32 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-10 19:37 [PATCH 0/2] repack: make `collect_pack_filenames()` consistent Taylor Blau
2023-07-10 19:37 ` [PATCH 1/2] builtin/repack.c: only repack `.pack`s that exist Taylor Blau
2023-07-10 20:09   ` Junio C Hamano
2023-07-11 17:28     ` Taylor Blau
2023-07-10 19:37 ` [PATCH 2/2] builtin/repack.c: avoid dir traversal in `collect_pack_filenames()` Taylor Blau
2023-07-10 21:35   ` Junio C Hamano
2023-07-11 17:29     ` Taylor Blau
2023-07-11 17:32 ` [PATCH v2 0/2] repack: make `collect_pack_filenames()` consistent Taylor Blau
2023-07-11 17:32   ` [PATCH v2 1/2] builtin/repack.c: only repack `.pack`s that exist Taylor Blau
2023-07-11 17:32   ` [PATCH v2 2/2] builtin/repack.c: avoid dir traversal in `collect_pack_filenames()` Taylor Blau

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.