* [PATCH] builtin/repack.c: only collect fully-formed packs
@ 2023-06-07 10:16 ` Taylor Blau
0 siblings, 0 replies; 4+ messages in thread
From: Taylor Blau @ 2023-06-07 10:16 UTC (permalink / raw)
To: git; +Cc: Derrick Stolee, Jeff King, Junio C Hamano
To partition the set of packs based on which ones are "kept" (either
they have a .keep file, or were otherwise marked via the `--keep-pack`
option) and "non-kept" ones (anything else), `git repack` uses its
`collect_pack_filenames()` function.
Ordinarily, we would rely on a convenience function such as
`get_all_packs()` to enumerate and partition the set of packs. But
`collect_pack_filenames()` uses `readdir()` directly to read the
contents of the "$GIT_DIR/objects/pack" directory, and adds each entry
ending in ".pack" to the appropriate list (either kept, or non-kept as
above).
This is subtly racy, since `collect_pack_filenames()` may see a pack
that is not fully staged (i.e., it is missing its ".idx" file).
Ordinarily, this doesn't cause a problem. But it can cause issues when
generating a cruft pack.
This is because `git repack` feeds (among other things) the list of
existing kept packs down to `git pack-objects --cruft` to indicate that
any kept packs will not be removed from the repository (so that the
cruft pack machinery can avoid packing objects that appear in those
packs as cruft).
But `read_cruft_objects()` lists packfiles by calling `get_all_packs()`.
So if a ".pack" file exists (necessary to get that pack to appear to
`collect_pack_filenames()`), but doesn't have a corresponding ".idx"
file (necessary to get that pack to appear via `get_all_packs()`), we'll
complain with:
fatal: could not find pack '.tmp-5841-pack-a6b0150558609c323c496ced21de6f4b66589260.pack'
Fix the above by teaching `collect_pack_filenames()` to only collect
packs with their corresponding `*.idx` files in place, indicating that
those packs have been fully staged.
There are a couple of things worth noting:
- Since each entry in the `extra_keep` list (which contains the
`--keep-pack` names) has a `*.pack` suffix, we'll have to swap the
suffix from ".pack" to ".idx", and compare that instead.
- Since we use the the `fname_kept_list` to figure out which packs to
delete (with `git repack -d`), we would have previously deleted a
`*.pack` with no index (since the existince of a ".pack" file is
necessary and sufficient to include that pack in the list of
existing non-kept packs).
Now we will leave it alone (since that pack won't appear in the
list). This is far more correct behavior, since we don't want
to race with a pack being staged. Deleting a partially staged pack
is unlikely, however, since the window of time between staging a
pack and moving its .idx file into place is miniscule.
Note that this window does *not* include the time it takes to
receive and index the pack, since the incoming data goes into
"$GIT_DIR/objects/tmp_pack_XXXXXX", which does not end in ".pack"
and is thus ignored by collect_pack_filenames().
In the future, this function should probably be rewritten as a callback
to `for_each_file_in_pack_dir()`, but this is the simplest change we
could do in the short-term.
Reported-by: Michael Haggerty <mhagger@github.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
builtin/repack.c | 14 ++++++++++----
t/t7700-repack.sh | 23 +++++++++++++++++++++++
2 files changed, 33 insertions(+), 4 deletions(-)
diff --git a/builtin/repack.c b/builtin/repack.c
index 0541c3ce15..1e21a21ea8 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -95,8 +95,8 @@ static int repack_config(const char *var, const char *value, void *cb)
}
/*
- * Adds all packs hex strings to either fname_nonkept_list or
- * fname_kept_list based on whether each pack has a corresponding
+ * Adds all packs hex strings (pack-$HASH) to either fname_nonkept_list
+ * or fname_kept_list based on whether each pack has a corresponding
* .keep file or not. Packs without a .keep file are not to be kept
* if we are going to pack everything into one file.
*/
@@ -107,6 +107,7 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list,
DIR *dir;
struct dirent *e;
char *fname;
+ struct strbuf buf = STRBUF_INIT;
if (!(dir = opendir(packdir)))
return;
@@ -115,11 +116,15 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list,
size_t len;
int i;
- if (!strip_suffix(e->d_name, ".pack", &len))
+ if (!strip_suffix(e->d_name, ".idx", &len))
continue;
+ strbuf_reset(&buf);
+ strbuf_add(&buf, e->d_name, len);
+ strbuf_addstr(&buf, ".pack");
+
for (i = 0; i < extra_keep->nr; i++)
- if (!fspathcmp(e->d_name, extra_keep->items[i].string))
+ if (!fspathcmp(buf.buf, extra_keep->items[i].string))
break;
fname = xmemdupz(e->d_name, len);
@@ -136,6 +141,7 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list,
}
}
closedir(dir);
+ strbuf_release(&buf);
string_list_sort(fname_kept_list);
}
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index faa739eeb9..08b5ba0c15 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -10,6 +10,10 @@ test_description='git repack works correctly'
commit_and_pack () {
test_commit "$@" 1>&2 &&
incrpackid=$(git pack-objects --all --unpacked --incremental .git/objects/pack/pack </dev/null) &&
+ # Remove any loose object(s) created by test_commit, since they have
+ # already been packed. Leaving these around can create subtly different
+ # packs with `pack-objects`'s `--unpacked` option.
+ git prune-packed 1>&2 &&
echo pack-${incrpackid}.pack
}
@@ -209,6 +213,8 @@ test_expect_success 'repack --keep-pack' '
test_create_repo keep-pack &&
(
cd keep-pack &&
+ # avoid producing difference packs to delta/base choices
+ git config pack.window 0 &&
P1=$(commit_and_pack 1) &&
P2=$(commit_and_pack 2) &&
P3=$(commit_and_pack 3) &&
@@ -220,6 +226,23 @@ test_expect_success 'repack --keep-pack' '
grep -q $P1 new-counts &&
grep -q $P4 new-counts &&
test_line_count = 3 new-counts &&
+ git fsck &&
+
+ P5=$(commit_and_pack --no-tag 5) &&
+ git reset --hard HEAD^ &&
+ git reflog expire --all --expire=all &&
+ rm -f ".git/objects/pack/${P5%.pack}.idx" &&
+ rm -f ".git/objects/info/commit-graph" &&
+ for from in $(find .git/objects/pack -type f -name "${P5%.pack}.*")
+ do
+ to="$(dirname "$from")/.tmp-1234-$(basename "$from")" &&
+ mv "$from" "$to" || return 1
+ done &&
+
+ git repack --cruft -d --keep-pack $P1 --keep-pack $P4 &&
+
+ ls .git/objects/pack/*.pack >newer-counts &&
+ test_cmp new-counts newer-counts &&
git fsck
)
'
--
2.41.0.1.gcf79d13182
^ permalink raw reply related [flat|nested] 4+ messages in thread
* [PATCH] builtin/repack.c: only collect fully-formed packs
@ 2023-06-07 10:16 ` Taylor Blau
0 siblings, 0 replies; 4+ messages in thread
From: Taylor Blau @ 2023-06-07 10:16 UTC (permalink / raw)
To: git; +Cc: Derrick Stolee, Jeff King, Junio C Hamano
To partition the set of packs based on which ones are "kept" (either
they have a .keep file, or were otherwise marked via the `--keep-pack`
option) and "non-kept" ones (anything else), `git repack` uses its
`collect_pack_filenames()` function.
Ordinarily, we would rely on a convenience function such as
`get_all_packs()` to enumerate and partition the set of packs. But
`collect_pack_filenames()` uses `readdir()` directly to read the
contents of the "$GIT_DIR/objects/pack" directory, and adds each entry
ending in ".pack" to the appropriate list (either kept, or non-kept as
above).
This is subtly racy, since `collect_pack_filenames()` may see a pack
that is not fully staged (i.e., it is missing its ".idx" file).
Ordinarily, this doesn't cause a problem. But it can cause issues when
generating a cruft pack.
This is because `git repack` feeds (among other things) the list of
existing kept packs down to `git pack-objects --cruft` to indicate that
any kept packs will not be removed from the repository (so that the
cruft pack machinery can avoid packing objects that appear in those
packs as cruft).
But `read_cruft_objects()` lists packfiles by calling `get_all_packs()`.
So if a ".pack" file exists (necessary to get that pack to appear to
`collect_pack_filenames()`), but doesn't have a corresponding ".idx"
file (necessary to get that pack to appear via `get_all_packs()`), we'll
complain with:
fatal: could not find pack '.tmp-5841-pack-a6b0150558609c323c496ced21de6f4b66589260.pack'
Fix the above by teaching `collect_pack_filenames()` to only collect
packs with their corresponding `*.idx` files in place, indicating that
those packs have been fully staged.
There are a couple of things worth noting:
- Since each entry in the `extra_keep` list (which contains the
`--keep-pack` names) has a `*.pack` suffix, we'll have to swap the
suffix from ".pack" to ".idx", and compare that instead.
- Since we use the the `fname_kept_list` to figure out which packs to
delete (with `git repack -d`), we would have previously deleted a
`*.pack` with no index (since the existince of a ".pack" file is
necessary and sufficient to include that pack in the list of
existing non-kept packs).
Now we will leave it alone (since that pack won't appear in the
list). This is far more correct behavior, since we don't want
to race with a pack being staged. Deleting a partially staged pack
is unlikely, however, since the window of time between staging a
pack and moving its .idx file into place is miniscule.
Note that this window does *not* include the time it takes to
receive and index the pack, since the incoming data goes into
"$GIT_DIR/objects/tmp_pack_XXXXXX", which does not end in ".pack"
and is thus ignored by collect_pack_filenames().
In the future, this function should probably be rewritten as a callback
to `for_each_file_in_pack_dir()`, but this is the simplest change we
could do in the short-term.
Reported-by: Michael Haggerty <mhagger@github.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
builtin/repack.c | 14 ++++++++++----
t/t7700-repack.sh | 23 +++++++++++++++++++++++
2 files changed, 33 insertions(+), 4 deletions(-)
diff --git a/builtin/repack.c b/builtin/repack.c
index 0541c3ce15..1e21a21ea8 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -95,8 +95,8 @@ static int repack_config(const char *var, const char *value, void *cb)
}
/*
- * Adds all packs hex strings to either fname_nonkept_list or
- * fname_kept_list based on whether each pack has a corresponding
+ * Adds all packs hex strings (pack-$HASH) to either fname_nonkept_list
+ * or fname_kept_list based on whether each pack has a corresponding
* .keep file or not. Packs without a .keep file are not to be kept
* if we are going to pack everything into one file.
*/
@@ -107,6 +107,7 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list,
DIR *dir;
struct dirent *e;
char *fname;
+ struct strbuf buf = STRBUF_INIT;
if (!(dir = opendir(packdir)))
return;
@@ -115,11 +116,15 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list,
size_t len;
int i;
- if (!strip_suffix(e->d_name, ".pack", &len))
+ if (!strip_suffix(e->d_name, ".idx", &len))
continue;
+ strbuf_reset(&buf);
+ strbuf_add(&buf, e->d_name, len);
+ strbuf_addstr(&buf, ".pack");
+
for (i = 0; i < extra_keep->nr; i++)
- if (!fspathcmp(e->d_name, extra_keep->items[i].string))
+ if (!fspathcmp(buf.buf, extra_keep->items[i].string))
break;
fname = xmemdupz(e->d_name, len);
@@ -136,6 +141,7 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list,
}
}
closedir(dir);
+ strbuf_release(&buf);
string_list_sort(fname_kept_list);
}
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index faa739eeb9..08b5ba0c15 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -10,6 +10,10 @@ test_description='git repack works correctly'
commit_and_pack () {
test_commit "$@" 1>&2 &&
incrpackid=$(git pack-objects --all --unpacked --incremental .git/objects/pack/pack </dev/null) &&
+ # Remove any loose object(s) created by test_commit, since they have
+ # already been packed. Leaving these around can create subtly different
+ # packs with `pack-objects`'s `--unpacked` option.
+ git prune-packed 1>&2 &&
echo pack-${incrpackid}.pack
}
@@ -209,6 +213,8 @@ test_expect_success 'repack --keep-pack' '
test_create_repo keep-pack &&
(
cd keep-pack &&
+ # avoid producing difference packs to delta/base choices
+ git config pack.window 0 &&
P1=$(commit_and_pack 1) &&
P2=$(commit_and_pack 2) &&
P3=$(commit_and_pack 3) &&
@@ -220,6 +226,23 @@ test_expect_success 'repack --keep-pack' '
grep -q $P1 new-counts &&
grep -q $P4 new-counts &&
test_line_count = 3 new-counts &&
+ git fsck &&
+
+ P5=$(commit_and_pack --no-tag 5) &&
+ git reset --hard HEAD^ &&
+ git reflog expire --all --expire=all &&
+ rm -f ".git/objects/pack/${P5%.pack}.idx" &&
+ rm -f ".git/objects/info/commit-graph" &&
+ for from in $(find .git/objects/pack -type f -name "${P5%.pack}.*")
+ do
+ to="$(dirname "$from")/.tmp-1234-$(basename "$from")" &&
+ mv "$from" "$to" || return 1
+ done &&
+
+ git repack --cruft -d --keep-pack $P1 --keep-pack $P4 &&
+
+ ls .git/objects/pack/*.pack >newer-counts &&
+ test_cmp new-counts newer-counts &&
git fsck
)
'
--
2.41.0.1.gcf79d13182
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [PATCH] builtin/repack.c: only collect fully-formed packs
2023-06-07 10:16 ` Taylor Blau
(?)
@ 2023-06-07 13:51 ` Derrick Stolee
-1 siblings, 0 replies; 4+ messages in thread
From: Derrick Stolee @ 2023-06-07 13:51 UTC (permalink / raw)
To: Taylor Blau, git; +Cc: Jeff King, Junio C Hamano
On 6/7/2023 6:16 AM, Taylor Blau wrote:
> To partition the set of packs based on which ones are "kept" (either
> they have a .keep file, or were otherwise marked via the `--keep-pack`
> option) and "non-kept" ones (anything else), `git repack` uses its
> `collect_pack_filenames()` function.
...
> Fix the above by teaching `collect_pack_filenames()` to only collect
> packs with their corresponding `*.idx` files in place, indicating that
> those packs have been fully staged.
I reviewed an internal version of this patch, so I'm happy with this
version. I'd be very interested if anyone has concerns with the
approach, though.
Thanks,
-Stolee
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH] builtin/repack.c: only collect fully-formed packs
@ 2023-06-12 21:00 ` Junio C Hamano
0 siblings, 0 replies; 4+ messages in thread
From: Junio C Hamano @ 2023-06-12 21:00 UTC (permalink / raw)
To: Taylor Blau; +Cc: git, Derrick Stolee, Jeff King
Taylor Blau <me@ttaylorr.com> writes:
> Fix the above by teaching `collect_pack_filenames()` to only collect
> packs with their corresponding `*.idx` files in place, indicating that
> those packs have been fully staged.
Hmph, after following the problem description (which by the way was
very well written), I expected to see "collect_pack_filenames() is
taught to use get_all_packs() and then enumerate the names from the
in-core packed_git structures---this way there is no possibility for
the two methods to produce inconsistent results".
But that is not what is going on. Rather, it is "we still keep
using a separate readdir() loop, but made sure that the criteria to
include a packfile in the result is identical to the logic currently
used by the other function", whose implication is that they can
diverge again. I am somewhat puzzled.
Will queue. Thanks.
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2023-06-12 21:04 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-07 10:16 [PATCH] builtin/repack.c: only collect fully-formed packs Taylor Blau
2023-06-12 21:00 ` Junio C Hamano
-- strict thread matches above, loose matches on Subject: below --
2023-06-07 10:16 Taylor Blau
2023-06-07 10:16 ` Taylor Blau
2023-06-07 13:51 ` Derrick Stolee
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.