From: Taylor Blau <me@ttaylorr.com> To: git@vger.kernel.org Cc: Derrick Stolee <derrickstolee@github.com>, Jeff King <peff@peff.net>, Junio C Hamano <gitster@pobox.com> Subject: [PATCH] builtin/repack.c: only collect fully-formed packs Date: Wed, 7 Jun 2023 06:16:12 -0400 [thread overview] Message-ID: <dda5a34a3e879787ce8651674962db6cf913b7b2.1686132967.git.me@ttaylorr.com> (raw) To partition the set of packs based on which ones are "kept" (either they have a .keep file, or were otherwise marked via the `--keep-pack` option) and "non-kept" ones (anything else), `git repack` uses its `collect_pack_filenames()` function. Ordinarily, we would rely on a convenience function such as `get_all_packs()` to enumerate and partition the set of packs. But `collect_pack_filenames()` uses `readdir()` directly to read the contents of the "$GIT_DIR/objects/pack" directory, and adds each entry ending in ".pack" to the appropriate list (either kept, or non-kept as above). This is subtly racy, since `collect_pack_filenames()` may see a pack that is not fully staged (i.e., it is missing its ".idx" file). Ordinarily, this doesn't cause a problem. But it can cause issues when generating a cruft pack. This is because `git repack` feeds (among other things) the list of existing kept packs down to `git pack-objects --cruft` to indicate that any kept packs will not be removed from the repository (so that the cruft pack machinery can avoid packing objects that appear in those packs as cruft). But `read_cruft_objects()` lists packfiles by calling `get_all_packs()`. So if a ".pack" file exists (necessary to get that pack to appear to `collect_pack_filenames()`), but doesn't have a corresponding ".idx" file (necessary to get that pack to appear via `get_all_packs()`), we'll complain with: fatal: could not find pack '.tmp-5841-pack-a6b0150558609c323c496ced21de6f4b66589260.pack' Fix the above by teaching `collect_pack_filenames()` to only collect packs with their corresponding `*.idx` files in place, indicating that those packs have been fully staged. There are a couple of things worth noting: - Since each entry in the `extra_keep` list (which contains the `--keep-pack` names) has a `*.pack` suffix, we'll have to swap the suffix from ".pack" to ".idx", and compare that instead. - Since we use the the `fname_kept_list` to figure out which packs to delete (with `git repack -d`), we would have previously deleted a `*.pack` with no index (since the existince of a ".pack" file is necessary and sufficient to include that pack in the list of existing non-kept packs). Now we will leave it alone (since that pack won't appear in the list). This is far more correct behavior, since we don't want to race with a pack being staged. Deleting a partially staged pack is unlikely, however, since the window of time between staging a pack and moving its .idx file into place is miniscule. Note that this window does *not* include the time it takes to receive and index the pack, since the incoming data goes into "$GIT_DIR/objects/tmp_pack_XXXXXX", which does not end in ".pack" and is thus ignored by collect_pack_filenames(). In the future, this function should probably be rewritten as a callback to `for_each_file_in_pack_dir()`, but this is the simplest change we could do in the short-term. Reported-by: Michael Haggerty <mhagger@github.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> --- builtin/repack.c | 14 ++++++++++---- t/t7700-repack.sh | 23 +++++++++++++++++++++++ 2 files changed, 33 insertions(+), 4 deletions(-) diff --git a/builtin/repack.c b/builtin/repack.c index 0541c3ce15..1e21a21ea8 100644 --- a/builtin/repack.c +++ b/builtin/repack.c @@ -95,8 +95,8 @@ static int repack_config(const char *var, const char *value, void *cb) } /* - * Adds all packs hex strings to either fname_nonkept_list or - * fname_kept_list based on whether each pack has a corresponding + * Adds all packs hex strings (pack-$HASH) to either fname_nonkept_list + * or fname_kept_list based on whether each pack has a corresponding * .keep file or not. Packs without a .keep file are not to be kept * if we are going to pack everything into one file. */ @@ -107,6 +107,7 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list, DIR *dir; struct dirent *e; char *fname; + struct strbuf buf = STRBUF_INIT; if (!(dir = opendir(packdir))) return; @@ -115,11 +116,15 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list, size_t len; int i; - if (!strip_suffix(e->d_name, ".pack", &len)) + if (!strip_suffix(e->d_name, ".idx", &len)) continue; + strbuf_reset(&buf); + strbuf_add(&buf, e->d_name, len); + strbuf_addstr(&buf, ".pack"); + for (i = 0; i < extra_keep->nr; i++) - if (!fspathcmp(e->d_name, extra_keep->items[i].string)) + if (!fspathcmp(buf.buf, extra_keep->items[i].string)) break; fname = xmemdupz(e->d_name, len); @@ -136,6 +141,7 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list, } } closedir(dir); + strbuf_release(&buf); string_list_sort(fname_kept_list); } diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh index faa739eeb9..08b5ba0c15 100755 --- a/t/t7700-repack.sh +++ b/t/t7700-repack.sh @@ -10,6 +10,10 @@ test_description='git repack works correctly' commit_and_pack () { test_commit "$@" 1>&2 && incrpackid=$(git pack-objects --all --unpacked --incremental .git/objects/pack/pack </dev/null) && + # Remove any loose object(s) created by test_commit, since they have + # already been packed. Leaving these around can create subtly different + # packs with `pack-objects`'s `--unpacked` option. + git prune-packed 1>&2 && echo pack-${incrpackid}.pack } @@ -209,6 +213,8 @@ test_expect_success 'repack --keep-pack' ' test_create_repo keep-pack && ( cd keep-pack && + # avoid producing difference packs to delta/base choices + git config pack.window 0 && P1=$(commit_and_pack 1) && P2=$(commit_and_pack 2) && P3=$(commit_and_pack 3) && @@ -220,6 +226,23 @@ test_expect_success 'repack --keep-pack' ' grep -q $P1 new-counts && grep -q $P4 new-counts && test_line_count = 3 new-counts && + git fsck && + + P5=$(commit_and_pack --no-tag 5) && + git reset --hard HEAD^ && + git reflog expire --all --expire=all && + rm -f ".git/objects/pack/${P5%.pack}.idx" && + rm -f ".git/objects/info/commit-graph" && + for from in $(find .git/objects/pack -type f -name "${P5%.pack}.*") + do + to="$(dirname "$from")/.tmp-1234-$(basename "$from")" && + mv "$from" "$to" || return 1 + done && + + git repack --cruft -d --keep-pack $P1 --keep-pack $P4 && + + ls .git/objects/pack/*.pack >newer-counts && + test_cmp new-counts newer-counts && git fsck ) ' -- 2.41.0.1.gcf79d13182
WARNING: multiple messages have this Message-ID (diff)
From: Taylor Blau <me@ttaylorr.com> To: git@vger.kernel.org Cc: Derrick Stolee <derrickstolee@github.com>, Jeff King <peff@peff.net>, Junio C Hamano <gitster@pobox.com> Subject: [PATCH] builtin/repack.c: only collect fully-formed packs Date: Wed, 7 Jun 2023 06:16:17 -0400 [thread overview] Message-ID: <dda5a34a3e879787ce8651674962db6cf913b7b2.1686132967.git.me@ttaylorr.com> (raw) Message-ID: <20230607101617.ges6tnMry4E52lDGld43QgtNUsIS4YQq6w-t71hEfkQ@z> (raw) To partition the set of packs based on which ones are "kept" (either they have a .keep file, or were otherwise marked via the `--keep-pack` option) and "non-kept" ones (anything else), `git repack` uses its `collect_pack_filenames()` function. Ordinarily, we would rely on a convenience function such as `get_all_packs()` to enumerate and partition the set of packs. But `collect_pack_filenames()` uses `readdir()` directly to read the contents of the "$GIT_DIR/objects/pack" directory, and adds each entry ending in ".pack" to the appropriate list (either kept, or non-kept as above). This is subtly racy, since `collect_pack_filenames()` may see a pack that is not fully staged (i.e., it is missing its ".idx" file). Ordinarily, this doesn't cause a problem. But it can cause issues when generating a cruft pack. This is because `git repack` feeds (among other things) the list of existing kept packs down to `git pack-objects --cruft` to indicate that any kept packs will not be removed from the repository (so that the cruft pack machinery can avoid packing objects that appear in those packs as cruft). But `read_cruft_objects()` lists packfiles by calling `get_all_packs()`. So if a ".pack" file exists (necessary to get that pack to appear to `collect_pack_filenames()`), but doesn't have a corresponding ".idx" file (necessary to get that pack to appear via `get_all_packs()`), we'll complain with: fatal: could not find pack '.tmp-5841-pack-a6b0150558609c323c496ced21de6f4b66589260.pack' Fix the above by teaching `collect_pack_filenames()` to only collect packs with their corresponding `*.idx` files in place, indicating that those packs have been fully staged. There are a couple of things worth noting: - Since each entry in the `extra_keep` list (which contains the `--keep-pack` names) has a `*.pack` suffix, we'll have to swap the suffix from ".pack" to ".idx", and compare that instead. - Since we use the the `fname_kept_list` to figure out which packs to delete (with `git repack -d`), we would have previously deleted a `*.pack` with no index (since the existince of a ".pack" file is necessary and sufficient to include that pack in the list of existing non-kept packs). Now we will leave it alone (since that pack won't appear in the list). This is far more correct behavior, since we don't want to race with a pack being staged. Deleting a partially staged pack is unlikely, however, since the window of time between staging a pack and moving its .idx file into place is miniscule. Note that this window does *not* include the time it takes to receive and index the pack, since the incoming data goes into "$GIT_DIR/objects/tmp_pack_XXXXXX", which does not end in ".pack" and is thus ignored by collect_pack_filenames(). In the future, this function should probably be rewritten as a callback to `for_each_file_in_pack_dir()`, but this is the simplest change we could do in the short-term. Reported-by: Michael Haggerty <mhagger@github.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> --- builtin/repack.c | 14 ++++++++++---- t/t7700-repack.sh | 23 +++++++++++++++++++++++ 2 files changed, 33 insertions(+), 4 deletions(-) diff --git a/builtin/repack.c b/builtin/repack.c index 0541c3ce15..1e21a21ea8 100644 --- a/builtin/repack.c +++ b/builtin/repack.c @@ -95,8 +95,8 @@ static int repack_config(const char *var, const char *value, void *cb) } /* - * Adds all packs hex strings to either fname_nonkept_list or - * fname_kept_list based on whether each pack has a corresponding + * Adds all packs hex strings (pack-$HASH) to either fname_nonkept_list + * or fname_kept_list based on whether each pack has a corresponding * .keep file or not. Packs without a .keep file are not to be kept * if we are going to pack everything into one file. */ @@ -107,6 +107,7 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list, DIR *dir; struct dirent *e; char *fname; + struct strbuf buf = STRBUF_INIT; if (!(dir = opendir(packdir))) return; @@ -115,11 +116,15 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list, size_t len; int i; - if (!strip_suffix(e->d_name, ".pack", &len)) + if (!strip_suffix(e->d_name, ".idx", &len)) continue; + strbuf_reset(&buf); + strbuf_add(&buf, e->d_name, len); + strbuf_addstr(&buf, ".pack"); + for (i = 0; i < extra_keep->nr; i++) - if (!fspathcmp(e->d_name, extra_keep->items[i].string)) + if (!fspathcmp(buf.buf, extra_keep->items[i].string)) break; fname = xmemdupz(e->d_name, len); @@ -136,6 +141,7 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list, } } closedir(dir); + strbuf_release(&buf); string_list_sort(fname_kept_list); } diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh index faa739eeb9..08b5ba0c15 100755 --- a/t/t7700-repack.sh +++ b/t/t7700-repack.sh @@ -10,6 +10,10 @@ test_description='git repack works correctly' commit_and_pack () { test_commit "$@" 1>&2 && incrpackid=$(git pack-objects --all --unpacked --incremental .git/objects/pack/pack </dev/null) && + # Remove any loose object(s) created by test_commit, since they have + # already been packed. Leaving these around can create subtly different + # packs with `pack-objects`'s `--unpacked` option. + git prune-packed 1>&2 && echo pack-${incrpackid}.pack } @@ -209,6 +213,8 @@ test_expect_success 'repack --keep-pack' ' test_create_repo keep-pack && ( cd keep-pack && + # avoid producing difference packs to delta/base choices + git config pack.window 0 && P1=$(commit_and_pack 1) && P2=$(commit_and_pack 2) && P3=$(commit_and_pack 3) && @@ -220,6 +226,23 @@ test_expect_success 'repack --keep-pack' ' grep -q $P1 new-counts && grep -q $P4 new-counts && test_line_count = 3 new-counts && + git fsck && + + P5=$(commit_and_pack --no-tag 5) && + git reset --hard HEAD^ && + git reflog expire --all --expire=all && + rm -f ".git/objects/pack/${P5%.pack}.idx" && + rm -f ".git/objects/info/commit-graph" && + for from in $(find .git/objects/pack -type f -name "${P5%.pack}.*") + do + to="$(dirname "$from")/.tmp-1234-$(basename "$from")" && + mv "$from" "$to" || return 1 + done && + + git repack --cruft -d --keep-pack $P1 --keep-pack $P4 && + + ls .git/objects/pack/*.pack >newer-counts && + test_cmp new-counts newer-counts && git fsck ) ' -- 2.41.0.1.gcf79d13182
next reply other threads:[~2023-06-07 10:16 UTC|newest] Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top 2023-06-07 10:16 Taylor Blau [this message] 2023-06-07 10:16 ` [PATCH] builtin/repack.c: only collect fully-formed packs Taylor Blau 2023-06-07 13:51 ` Derrick Stolee 2023-06-07 10:16 Taylor Blau 2023-06-12 21:00 ` Junio C Hamano
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=dda5a34a3e879787ce8651674962db6cf913b7b2.1686132967.git.me@ttaylorr.com \ --to=me@ttaylorr.com \ --cc=derrickstolee@github.com \ --cc=git@vger.kernel.org \ --cc=gitster@pobox.com \ --cc=peff@peff.net \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.