All of lore.kernel.org
 help / color / mirror / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: Taylor Blau <me@ttaylorr.com>
Cc: Jeff King <peff@peff.net>, git@vger.kernel.org, dstolee@microsoft.com
Subject: Re: [PATCH] builtin/repack.c: invalidate MIDX only when necessary
Date: Tue, 25 Aug 2020 11:14:41 -0400	[thread overview]
Message-ID: <6a34d7ee-8c6b-8c6c-93bd-0013dccccafb@gmail.com> (raw)
In-Reply-To: <20200825144146.GA7671@syl.lan>

On 8/25/2020 10:41 AM, Taylor Blau wrote:
> On Tue, Aug 25, 2020 at 09:14:19AM -0400, Derrick Stolee wrote:
>> The code in builtin/repack.c looks good for sure. I have a quick question
>> about this new test:
>>
>> +test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
>> +	git multi-pack-index write &&
>> +	cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak &&
>> +	test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" &&
>> +
>> +	# Write a new pack that is unknown to the multi-pack-index.
>> +	git hash-object -w </dev/null >blob &&
>> +	git pack-objects $objdir/pack/pack <blob &&
>> +
>> +	GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
>> +	test_cmp_bin $objdir/pack/multi-pack-index \
>> +		$objdir/pack/multi-pack-index.bak
>> +'
>> +
>>
>> You create an arbitrary blob, and then add it to a pack-file. Do we
>> know that 'git repack' is definitely creating a new pack-file that makes
>> our manually-created pack-file redundant?
>>
>> My suggestion is to have the test check itself:
>>
>> +test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
>> +	git multi-pack-index write &&
>> +	cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak &&
>> +	test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" &&
>> +
>> +	# Write a new pack that is unknown to the multi-pack-index.
>> +	git hash-object -w </dev/null >blob &&
>> +	HASH=$(git pack-objects $objdir/pack/pack <blob) &&
>> +
>> +	GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
>> +	test_cmp_bin $objdir/pack/multi-pack-index \
>> +		$objdir/pack/multi-pack-index.bak &&
>> +	test_path_is_missing $objdir/pack/pack-$HASH.pack
>> +'
>> +
>>
>> This test fails for me, on the 'test_path_is_missing'. Likely, the
>> blob is seen as already in a pack-file so is just pruned by 'git repack'
>> instead. I thought that perhaps we need to add a new pack ourselves that
>> overrides the small pack. Here is my attempt:
>>
>> test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
>> 	git multi-pack-index write &&
>> 	cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak &&
>> 	test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" &&
>>
>> 	# Write a new pack that is unknown to the multi-pack-index.
>> 	BLOB1=$(echo blob1 | git hash-object -w --stdin) &&
>> 	BLOB2=$(echo blob2 | git hash-object -w --stdin) &&
>> 	cat >blobs <<-EOF &&
>> 	$BLOB1
>> 	$BLOB2
>> 	EOF
>> 	HASH1=$(echo $BLOB1 | git pack-objects $objdir/pack/pack) &&
>> 	HASH2=$(git pack-objects $objdir/pack/pack <blobs) &&
>> 	GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
>> 	test_cmp_bin $objdir/pack/multi-pack-index \
>> 		$objdir/pack/multi-pack-index.bak &&
>> 	test_path_is_file $objdir/pack/pack-$HASH2.pack &&
>> 	test_path_is_missing $objdir/pack/pack-$HASH1.pack
>> '
>>
>> However, this _still_ fails on the "test_path_is_missing" line, so I'm not sure
>> how to make sure your logic is tested. I saw that 'git repack' was writing
>> "nothing new to pack" in the output, so I also tested adding a few commits and
>> trying to force it to repack reachable data, but I cannot seem to trigger it
>> to create a new pack that overrides only one pack that is not in the MIDX.
>>
>> Likely, I just don't know how 'git rebase' works well enough to trigger this
>> behavior. But the test as-is is not testing what you want it to test.
> 
> I think this case might actually be impossible to tickle in a test. I
> thought that 'git repack -d' looked for existing packs whose objects are
> a subset of some new pack generated. But, it's much simpler than that:
> '-d' by itself just looks for packs that were already on disk with the
> same SHA-1 as a new pack, and it removes the old one.

If 'git repack' never calls remove_redundant_pack() unless we are doing
a "full repack", then we could simplify this logic:

 static void remove_redundant_pack(const char *dir_name, const char *base_name)
 {
 	struct strbuf buf = STRBUF_INIT;
-	strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name);
+	struct multi_pack_index *m = get_multi_pack_index(the_repository);
+	strbuf_addf(&buf, "%s.pack", base_name);
+	if (m && midx_contains_pack(m, buf.buf))
+		clear_midx_file(the_repository);
+	strbuf_insertf(&buf, 0, "%s/", dir_name);
 	unlink_pack_path(buf.buf, 1);
 	strbuf_release(&buf);
 }

to

 static void remove_redundant_pack(const char *dir_name, const char *base_name)
 {
 	struct strbuf buf = STRBUF_INIT;
	strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name);
+	clear_midx_file(the_repository);
 	unlink_pack_path(buf.buf, 1);
 	strbuf_release(&buf);
 }

and get the same results as we are showing in these tests. This does
move us incrementally to a better situation: don't delete the MIDX
if we aren't deleting pack files. But, I think we can get around it.

> Note that 'git repack' uses 'git pack-objects' internally to find
> objects and generate a packfile. When calling 'git pack-objects', 'git
> repack -d' passes '--all' and '--unpacked', which means that there is no
> way we'd generate a new pack with the same SHA-1 as an existing pack
> ordinarily.
> 
> So, I think this case is impossible, or at least astronomically
> unlikely. What is more interesting (and untested) is that adding a _new_
> pack doesn't cause us to invalidate the MIDX. Here's a patch that does
> that:
> 
>   diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
>   index 16a1ad040e..620f2058d6 100755
>   --- a/t/t5319-multi-pack-index.sh
>   +++ b/t/t5319-multi-pack-index.sh
>   @@ -391,18 +391,27 @@ test_expect_success 'repack removes multi-pack-index when deleting packs' '
>           test_path_is_missing $objdir/pack/multi-pack-index
>    '
> 
>   -test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
>   -       git multi-pack-index write &&
>   -       cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak &&
>   -       test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" &&
>   -
>   -       # Write a new pack that is unknown to the multi-pack-index.
>   -       git hash-object -w </dev/null >blob &&
>   -       git pack-objects $objdir/pack/pack <blob &&
>   -
>   -       GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
>   -       test_cmp_bin $objdir/pack/multi-pack-index \
>   -               $objdir/pack/multi-pack-index.bak
>   +test_expect_success 'repack preserves multi-pack-index when creating packs' '
>   +       git init preserve &&
>   +       test_when_finished "rm -fr preserve" &&
>   +       (
>   +               cd preserve &&
>   +               midx=.git/objects/pack/multi-pack-index &&
>   +
>   +               test_commit "initial" &&
>   +               git repack -ad &&
>   +               git multi-pack-index write &&
>   +               ls .git/objects/pack | grep "\.pack$" >before &&
>   +
>   +               cp $midx $midx.bak &&
>   +
>   +               test_commit "another" &&
>   +               GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
>   +               ls .git/objects/pack | grep "\.pack$" >after &&
>   +
>   +               test_cmp_bin $midx.bak $midx &&
>   +               ! test_cmp before after
>   +       )
>    '

After looking at the callers to remote_redundant_pack() I noticed that it is only
called after inspecting the "names" struct, which contains the names of the packs
to group into a new pack-file. We can use a .keep file to preserve the pack-file(s) in
the MIDX but also to ensure multiple pack-files outside of the MIDX are repacked and
deleted. While this is very unlikely in the wild, it is definitely possible.

test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
	git init preserve &&
	test_when_finished "rm -fr preserve" &&
	(
		cd preserve &&
		midx=.git/objects/pack/multi-pack-index &&

		test_commit 1 &&
		HASH1=$(git pack-objects --all .git/objects/pack/pack) &&
		touch .git/objects/pack/pack-$HASH1.keep &&

		cat >pack-input <<-\EOF &&
		HEAD
		^HEAD~1
		EOF

		test_commit 2 &&
		HASH2=$(git pack-objects --revs .git/objects/pack/pack <pack-input) &&
		touch .git/objects/pack/pack-$HASH2.keep &&

		git multi-pack-index write &&
		cp $midx $midx.bak &&

		test_commit 3 &&
		HASH3=$(git pack-objects --revs .git/objects/pack/pack <pack-input) &&

		test_commit 4 &&
		HASH4=$(git pack-objects --revs .git/objects/pack/pack <pack-input) &&

		GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -ad &&
		test_path_is_file .git/objects/pack/pack-$HASH1.pack &&
		test_path_is_file .git/objects/pack/pack-$HASH2.pack &&
		test_path_is_missing .git/objects/pack/pack-$HASH3.pack &&
		test_path_is_missing .git/objects/pack/pack-$HASH4.pack
       )
'

I believe this checks your condition properly enough.

Thanks,
-Stolee

  reply	other threads:[~2020-08-25 15:14 UTC|newest]

Thread overview: 78+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-25  2:01 [PATCH] builtin/repack.c: invalidate MIDX only when necessary Taylor Blau
2020-08-25  2:26 ` Jeff King
2020-08-25  2:37   ` Taylor Blau
2020-08-25 13:14     ` Derrick Stolee
2020-08-25 14:41       ` Taylor Blau
2020-08-25 15:14         ` Derrick Stolee [this message]
2020-08-25 15:42           ` Taylor Blau
2020-08-25 16:56             ` Jeff King
2020-08-25 15:58   ` Junio C Hamano
2020-08-25 16:08     ` Taylor Blau
2020-08-25 16:18     ` Derrick Stolee
2020-08-25 17:34       ` Jeff King
2020-08-25 17:22     ` Jeff King
2020-08-25 18:05       ` Junio C Hamano
2020-08-25 18:27         ` Jeff King
2020-08-25 22:45           ` [PATCH] pack-redundant: gauge the usage before proposing its removal Junio C Hamano
2020-08-25 23:09             ` Taylor Blau
2020-08-25 23:22               ` Junio C Hamano
2020-08-26  1:17             ` [PATCH v1 0/3] War on dashed-git Junio C Hamano
2020-08-26  1:17               ` [PATCH v1 1/3] transport-helper: do not run git-remote-ext etc. in dashed form Junio C Hamano
2020-08-26  1:24                 ` Eric Sunshine
2020-08-26  7:55                   ` Johannes Schindelin
2020-08-26 16:27                   ` Junio C Hamano
2020-08-26  1:17               ` [PATCH v1 2/3] cvsexportcommit: do not run git programs " Junio C Hamano
2020-08-26  1:28                 ` Eric Sunshine
2020-08-26  1:42                   ` Junio C Hamano
2020-08-26 16:08                   ` Junio C Hamano
2020-08-26 16:28                     ` Junio C Hamano
2020-08-26  8:02                 ` Johannes Schindelin
2020-08-26  1:17               ` [PATCH v1 3/3] git: catch an attempt to run "git-foo" Junio C Hamano
2020-08-26  1:19                 ` Junio C Hamano
2020-08-26  8:06                 ` Johannes Schindelin
2020-08-26 16:30                   ` Junio C Hamano
2020-08-28  2:13                 ` Johannes Schindelin
2020-08-28 22:03                   ` Junio C Hamano
2020-08-31  9:59                     ` Johannes Schindelin
2020-08-31 17:45                       ` Junio C Hamano
2020-12-20 15:25                   ` Johannes Schindelin
2020-12-21 22:24                     ` Junio C Hamano
2020-12-30  5:30                       ` Johannes Schindelin
2020-08-26  8:09               ` [PATCH v1 0/3] War on dashed-git Johannes Schindelin
2020-08-26 16:45                 ` Junio C Hamano
2020-08-26 19:46                   ` [PATCH v2 0/2] avoid running "git-subcmd" in the dashed form Junio C Hamano
2020-08-26 19:46                     ` [PATCH v2 1/2] transport-helper: do not run git-remote-ext etc. in " Junio C Hamano
2020-08-26 19:46                     ` [PATCH v2 2/2] cvsexportcommit: do not run git programs " Junio C Hamano
2020-08-26 21:37                       ` [PATCH v2 3/2] credential-cache: use child_process.args Junio C Hamano
2020-08-26 22:25                         ` [PATCH] run_command: teach API users to use embedded 'args' more Junio C Hamano
2020-08-27  4:21                           ` Jeff King
2020-08-27  4:30                             ` Junio C Hamano
2020-08-27  4:31                             ` Eric Sunshine
2020-08-27  4:44                               ` Jeff King
2020-08-27  5:03                                 ` Eric Sunshine
2020-08-27  5:25                                   ` [PATCH] worktree: fix leak in check_clean_worktree() Jeff King
2020-08-27  5:56                                     ` Eric Sunshine
2020-08-27 15:31                                       ` Junio C Hamano
2020-08-27  4:13                         ` [PATCH v2 3/2] credential-cache: use child_process.args Jeff King
2020-08-27  4:22                           ` Jeff King
2020-08-27  4:31                           ` Junio C Hamano
2020-08-27  4:14                         ` Jeff King
2020-08-27 15:34                           ` Junio C Hamano
2020-08-31 22:56                         ` Junio C Hamano
2020-09-01  4:49                           ` Jeff King
2020-09-01 16:11                             ` Junio C Hamano
2020-08-27  0:57                     ` [PATCH v2 0/2] avoid running "git-subcmd" in the dashed form Derrick Stolee
2020-08-27  1:22                       ` Junio C Hamano
2020-08-28  9:14             ` [PATCH] pack-redundant: gauge the usage before proposing its removal Jeff King
2020-08-28 22:45               ` Junio C Hamano
2020-08-25  7:55 ` [PATCH] builtin/repack.c: invalidate MIDX only when necessary Son Luong Ngoc
2020-08-25 12:45   ` Derrick Stolee
2020-08-25 14:45   ` Taylor Blau
2020-08-25 16:04     ` [PATCH v2] " Taylor Blau
2020-08-26 20:51       ` Derrick Stolee
2020-08-26 20:54         ` Junio C Hamano
2020-08-25 16:47     ` [PATCH] " Jeff King
2020-08-25 17:10       ` Derrick Stolee
2020-08-25 17:29         ` Jeff King
2020-08-25 17:34           ` Taylor Blau
2020-08-25 17:42             ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=6a34d7ee-8c6b-8c6c-93bd-0013dccccafb@gmail.com \
    --to=stolee@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.