All of lore.kernel.org
 help / color / mirror / Atom feed
From: Taylor Blau <me@ttaylorr.com>
To: git@vger.kernel.org
Cc: Derrick Stolee <derrickstolee@github.com>,
	Jeff King <peff@peff.net>, Junio C Hamano <gitster@pobox.com>
Subject: [PATCH] builtin/repack.c: only collect fully-formed packs
Date: Wed, 7 Jun 2023 06:16:12 -0400	[thread overview]
Message-ID: <dda5a34a3e879787ce8651674962db6cf913b7b2.1686132967.git.me@ttaylorr.com> (raw)

To partition the set of packs based on which ones are "kept" (either
they have a .keep file, or were otherwise marked via the `--keep-pack`
option) and "non-kept" ones (anything else), `git repack` uses its
`collect_pack_filenames()` function.

Ordinarily, we would rely on a convenience function such as
`get_all_packs()` to enumerate and partition the set of packs. But
`collect_pack_filenames()` uses `readdir()` directly to read the
contents of the "$GIT_DIR/objects/pack" directory, and adds each entry
ending in ".pack" to the appropriate list (either kept, or non-kept as
above).

This is subtly racy, since `collect_pack_filenames()` may see a pack
that is not fully staged (i.e., it is missing its ".idx" file).
Ordinarily, this doesn't cause a problem. But it can cause issues when
generating a cruft pack.

This is because `git repack` feeds (among other things) the list of
existing kept packs down to `git pack-objects --cruft` to indicate that
any kept packs will not be removed from the repository (so that the
cruft pack machinery can avoid packing objects that appear in those
packs as cruft).

But `read_cruft_objects()` lists packfiles by calling `get_all_packs()`.
So if a ".pack" file exists (necessary to get that pack to appear to
`collect_pack_filenames()`), but doesn't have a corresponding ".idx"
file (necessary to get that pack to appear via `get_all_packs()`), we'll
complain with:

    fatal: could not find pack '.tmp-5841-pack-a6b0150558609c323c496ced21de6f4b66589260.pack'

Fix the above by teaching `collect_pack_filenames()` to only collect
packs with their corresponding `*.idx` files in place, indicating that
those packs have been fully staged.

There are a couple of things worth noting:

  - Since each entry in the `extra_keep` list (which contains the
    `--keep-pack` names) has a `*.pack` suffix, we'll have to swap the
    suffix from ".pack" to ".idx", and compare that instead.

  - Since we use the the `fname_kept_list` to figure out which packs to
    delete (with `git repack -d`), we would have previously deleted a
    `*.pack` with no index (since the existince of a ".pack" file is
    necessary and sufficient to include that pack in the list of
    existing non-kept packs).

    Now we will leave it alone (since that pack won't appear in the
    list). This is far more correct behavior, since we don't want
    to race with a pack being staged. Deleting a partially staged pack
    is unlikely, however, since the window of time between staging a
    pack and moving its .idx file into place is miniscule.

    Note that this window does *not* include the time it takes to
    receive and index the pack, since the incoming data goes into
    "$GIT_DIR/objects/tmp_pack_XXXXXX", which does not end in ".pack"
    and is thus ignored by collect_pack_filenames().

In the future, this function should probably be rewritten as a callback
to `for_each_file_in_pack_dir()`, but this is the simplest change we
could do in the short-term.

Reported-by: Michael Haggerty <mhagger@github.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c  | 14 ++++++++++----
 t/t7700-repack.sh | 23 +++++++++++++++++++++++
 2 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 0541c3ce15..1e21a21ea8 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -95,8 +95,8 @@ static int repack_config(const char *var, const char *value, void *cb)
 }
 
 /*
- * Adds all packs hex strings to either fname_nonkept_list or
- * fname_kept_list based on whether each pack has a corresponding
+ * Adds all packs hex strings (pack-$HASH) to either fname_nonkept_list
+ * or fname_kept_list based on whether each pack has a corresponding
  * .keep file or not.  Packs without a .keep file are not to be kept
  * if we are going to pack everything into one file.
  */
@@ -107,6 +107,7 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list,
 	DIR *dir;
 	struct dirent *e;
 	char *fname;
+	struct strbuf buf = STRBUF_INIT;
 
 	if (!(dir = opendir(packdir)))
 		return;
@@ -115,11 +116,15 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list,
 		size_t len;
 		int i;
 
-		if (!strip_suffix(e->d_name, ".pack", &len))
+		if (!strip_suffix(e->d_name, ".idx", &len))
 			continue;
 
+		strbuf_reset(&buf);
+		strbuf_add(&buf, e->d_name, len);
+		strbuf_addstr(&buf, ".pack");
+
 		for (i = 0; i < extra_keep->nr; i++)
-			if (!fspathcmp(e->d_name, extra_keep->items[i].string))
+			if (!fspathcmp(buf.buf, extra_keep->items[i].string))
 				break;
 
 		fname = xmemdupz(e->d_name, len);
@@ -136,6 +141,7 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list,
 		}
 	}
 	closedir(dir);
+	strbuf_release(&buf);
 
 	string_list_sort(fname_kept_list);
 }
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index faa739eeb9..08b5ba0c15 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -10,6 +10,10 @@ test_description='git repack works correctly'
 commit_and_pack () {
 	test_commit "$@" 1>&2 &&
 	incrpackid=$(git pack-objects --all --unpacked --incremental .git/objects/pack/pack </dev/null) &&
+	# Remove any loose object(s) created by test_commit, since they have
+	# already been packed. Leaving these around can create subtly different
+	# packs with `pack-objects`'s `--unpacked` option.
+	git prune-packed 1>&2 &&
 	echo pack-${incrpackid}.pack
 }
 
@@ -209,6 +213,8 @@ test_expect_success 'repack --keep-pack' '
 	test_create_repo keep-pack &&
 	(
 		cd keep-pack &&
+		# avoid producing difference packs to delta/base choices
+		git config pack.window 0 &&
 		P1=$(commit_and_pack 1) &&
 		P2=$(commit_and_pack 2) &&
 		P3=$(commit_and_pack 3) &&
@@ -220,6 +226,23 @@ test_expect_success 'repack --keep-pack' '
 		grep -q $P1 new-counts &&
 		grep -q $P4 new-counts &&
 		test_line_count = 3 new-counts &&
+		git fsck &&
+
+		P5=$(commit_and_pack --no-tag 5) &&
+		git reset --hard HEAD^ &&
+		git reflog expire --all --expire=all &&
+		rm -f ".git/objects/pack/${P5%.pack}.idx" &&
+		rm -f ".git/objects/info/commit-graph" &&
+		for from in $(find .git/objects/pack -type f -name "${P5%.pack}.*")
+		do
+			to="$(dirname "$from")/.tmp-1234-$(basename "$from")" &&
+			mv "$from" "$to" || return 1
+		done &&
+
+		git repack --cruft -d --keep-pack $P1 --keep-pack $P4 &&
+
+		ls .git/objects/pack/*.pack >newer-counts &&
+		test_cmp new-counts newer-counts &&
 		git fsck
 	)
 '
-- 
2.41.0.1.gcf79d13182

WARNING: multiple messages have this Message-ID (diff)
From: Taylor Blau <me@ttaylorr.com>
To: git@vger.kernel.org
Cc: Derrick Stolee <derrickstolee@github.com>,
	Jeff King <peff@peff.net>, Junio C Hamano <gitster@pobox.com>
Subject: [PATCH] builtin/repack.c: only collect fully-formed packs
Date: Wed, 7 Jun 2023 06:16:17 -0400	[thread overview]
Message-ID: <dda5a34a3e879787ce8651674962db6cf913b7b2.1686132967.git.me@ttaylorr.com> (raw)
Message-ID: <20230607101617.ges6tnMry4E52lDGld43QgtNUsIS4YQq6w-t71hEfkQ@z> (raw)

To partition the set of packs based on which ones are "kept" (either
they have a .keep file, or were otherwise marked via the `--keep-pack`
option) and "non-kept" ones (anything else), `git repack` uses its
`collect_pack_filenames()` function.

Ordinarily, we would rely on a convenience function such as
`get_all_packs()` to enumerate and partition the set of packs. But
`collect_pack_filenames()` uses `readdir()` directly to read the
contents of the "$GIT_DIR/objects/pack" directory, and adds each entry
ending in ".pack" to the appropriate list (either kept, or non-kept as
above).

This is subtly racy, since `collect_pack_filenames()` may see a pack
that is not fully staged (i.e., it is missing its ".idx" file).
Ordinarily, this doesn't cause a problem. But it can cause issues when
generating a cruft pack.

This is because `git repack` feeds (among other things) the list of
existing kept packs down to `git pack-objects --cruft` to indicate that
any kept packs will not be removed from the repository (so that the
cruft pack machinery can avoid packing objects that appear in those
packs as cruft).

But `read_cruft_objects()` lists packfiles by calling `get_all_packs()`.
So if a ".pack" file exists (necessary to get that pack to appear to
`collect_pack_filenames()`), but doesn't have a corresponding ".idx"
file (necessary to get that pack to appear via `get_all_packs()`), we'll
complain with:

    fatal: could not find pack '.tmp-5841-pack-a6b0150558609c323c496ced21de6f4b66589260.pack'

Fix the above by teaching `collect_pack_filenames()` to only collect
packs with their corresponding `*.idx` files in place, indicating that
those packs have been fully staged.

There are a couple of things worth noting:

  - Since each entry in the `extra_keep` list (which contains the
    `--keep-pack` names) has a `*.pack` suffix, we'll have to swap the
    suffix from ".pack" to ".idx", and compare that instead.

  - Since we use the the `fname_kept_list` to figure out which packs to
    delete (with `git repack -d`), we would have previously deleted a
    `*.pack` with no index (since the existince of a ".pack" file is
    necessary and sufficient to include that pack in the list of
    existing non-kept packs).

    Now we will leave it alone (since that pack won't appear in the
    list). This is far more correct behavior, since we don't want
    to race with a pack being staged. Deleting a partially staged pack
    is unlikely, however, since the window of time between staging a
    pack and moving its .idx file into place is miniscule.

    Note that this window does *not* include the time it takes to
    receive and index the pack, since the incoming data goes into
    "$GIT_DIR/objects/tmp_pack_XXXXXX", which does not end in ".pack"
    and is thus ignored by collect_pack_filenames().

In the future, this function should probably be rewritten as a callback
to `for_each_file_in_pack_dir()`, but this is the simplest change we
could do in the short-term.

Reported-by: Michael Haggerty <mhagger@github.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c  | 14 ++++++++++----
 t/t7700-repack.sh | 23 +++++++++++++++++++++++
 2 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 0541c3ce15..1e21a21ea8 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -95,8 +95,8 @@ static int repack_config(const char *var, const char *value, void *cb)
 }
 
 /*
- * Adds all packs hex strings to either fname_nonkept_list or
- * fname_kept_list based on whether each pack has a corresponding
+ * Adds all packs hex strings (pack-$HASH) to either fname_nonkept_list
+ * or fname_kept_list based on whether each pack has a corresponding
  * .keep file or not.  Packs without a .keep file are not to be kept
  * if we are going to pack everything into one file.
  */
@@ -107,6 +107,7 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list,
 	DIR *dir;
 	struct dirent *e;
 	char *fname;
+	struct strbuf buf = STRBUF_INIT;
 
 	if (!(dir = opendir(packdir)))
 		return;
@@ -115,11 +116,15 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list,
 		size_t len;
 		int i;
 
-		if (!strip_suffix(e->d_name, ".pack", &len))
+		if (!strip_suffix(e->d_name, ".idx", &len))
 			continue;
 
+		strbuf_reset(&buf);
+		strbuf_add(&buf, e->d_name, len);
+		strbuf_addstr(&buf, ".pack");
+
 		for (i = 0; i < extra_keep->nr; i++)
-			if (!fspathcmp(e->d_name, extra_keep->items[i].string))
+			if (!fspathcmp(buf.buf, extra_keep->items[i].string))
 				break;
 
 		fname = xmemdupz(e->d_name, len);
@@ -136,6 +141,7 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list,
 		}
 	}
 	closedir(dir);
+	strbuf_release(&buf);
 
 	string_list_sort(fname_kept_list);
 }
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index faa739eeb9..08b5ba0c15 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -10,6 +10,10 @@ test_description='git repack works correctly'
 commit_and_pack () {
 	test_commit "$@" 1>&2 &&
 	incrpackid=$(git pack-objects --all --unpacked --incremental .git/objects/pack/pack </dev/null) &&
+	# Remove any loose object(s) created by test_commit, since they have
+	# already been packed. Leaving these around can create subtly different
+	# packs with `pack-objects`'s `--unpacked` option.
+	git prune-packed 1>&2 &&
 	echo pack-${incrpackid}.pack
 }
 
@@ -209,6 +213,8 @@ test_expect_success 'repack --keep-pack' '
 	test_create_repo keep-pack &&
 	(
 		cd keep-pack &&
+		# avoid producing difference packs to delta/base choices
+		git config pack.window 0 &&
 		P1=$(commit_and_pack 1) &&
 		P2=$(commit_and_pack 2) &&
 		P3=$(commit_and_pack 3) &&
@@ -220,6 +226,23 @@ test_expect_success 'repack --keep-pack' '
 		grep -q $P1 new-counts &&
 		grep -q $P4 new-counts &&
 		test_line_count = 3 new-counts &&
+		git fsck &&
+
+		P5=$(commit_and_pack --no-tag 5) &&
+		git reset --hard HEAD^ &&
+		git reflog expire --all --expire=all &&
+		rm -f ".git/objects/pack/${P5%.pack}.idx" &&
+		rm -f ".git/objects/info/commit-graph" &&
+		for from in $(find .git/objects/pack -type f -name "${P5%.pack}.*")
+		do
+			to="$(dirname "$from")/.tmp-1234-$(basename "$from")" &&
+			mv "$from" "$to" || return 1
+		done &&
+
+		git repack --cruft -d --keep-pack $P1 --keep-pack $P4 &&
+
+		ls .git/objects/pack/*.pack >newer-counts &&
+		test_cmp new-counts newer-counts &&
 		git fsck
 	)
 '
-- 
2.41.0.1.gcf79d13182

             reply	other threads:[~2023-06-07 10:16 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-06-07 10:16 Taylor Blau [this message]
2023-06-07 10:16 ` [PATCH] builtin/repack.c: only collect fully-formed packs Taylor Blau
2023-06-07 13:51 ` Derrick Stolee
2023-06-07 10:16 Taylor Blau
2023-06-12 21:00 ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=dda5a34a3e879787ce8651674962db6cf913b7b2.1686132967.git.me@ttaylorr.com \
    --to=me@ttaylorr.com \
    --cc=derrickstolee@github.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.