All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] Add failing test for fetching from multiple packs over dumb httpd
@ 2015-01-27 15:20 Charles Bailey
  2015-01-27 18:12 ` Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: Charles Bailey @ 2015-01-27 15:20 UTC (permalink / raw)
  To: git

From: Charles Bailey <cbailey32@bloomberg.net>

When objects are spread across multiple packs, if an initial fetch does
require all pack files, a subsequent fetch for objects in packs not
retrieved in the initial fetch will fail.
---

I'm not very familiar with the http client code so this analysis is based
purely on observed behaviour.

When fetching only some refs from a repository served over dumb httpd Git
appears to download all of the index files for the available packs but then
only chooses the pack files that help it resolve the objects which we need.

If we then later try to fetch an object which is in a pack file for
which Git has previously downloaded an index file, it seems to trip because it
believes it already has the object locally due to the presence of the index
file but doesn't actually have it because it never retrieved the corresponding
pack file. It reports an error of the form "Cannot obtain needed object ...".

Manually deleting index files which have no corresponding local pack
file will allow a repeat of the failed fetch to succeed.

 t/t5550-http-fetch-dumb.sh | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/t/t5550-http-fetch-dumb.sh b/t/t5550-http-fetch-dumb.sh
index ac71418..cf2362a 100755
--- a/t/t5550-http-fetch-dumb.sh
+++ b/t/t5550-http-fetch-dumb.sh
@@ -165,6 +165,24 @@ test_expect_success 'fetch notices corrupt idx' '
 	)
 '
 
+test_expect_failure 'fetch packed branches' '
+	git checkout --orphan branch1 &&
+	echo base >file &&
+	git add file &&
+	git commit -m base &&
+	git --bare init "$HTTPD_DOCUMENT_ROOT_PATH"/repo_packed_branches.git &&
+	git push "$HTTPD_DOCUMENT_ROOT_PATH"/repo_packed_branches.git branch1 &&
+	git --git-dir="$HTTPD_DOCUMENT_ROOT_PATH"/repo_packed_branches.git repack -d &&
+	git checkout -b branch2 branch1 &&
+	echo b2 >>file &&
+	git commit -a -m b2 &&
+	git push "$HTTPD_DOCUMENT_ROOT_PATH"/repo_packed_branches.git branch2 &&
+	git --git-dir="$HTTPD_DOCUMENT_ROOT_PATH"/repo_packed_branches.git repack -d &&
+	git --bare init clone_packed_branches.git &&
+	git --git-dir=clone_packed_branches.git fetch "$HTTPD_URL"/dumb/repo_packed_branches.git branch1:branch1 &&
+	git --git-dir=clone_packed_branches.git fetch "$HTTPD_URL"/dumb/repo_packed_branches.git branch2:branch2
+'
+
 test_expect_success 'did not use upload-pack service' '
 	grep '/git-upload-pack' <"$HTTPD_ROOT_PATH"/access.log >act
 	: >exp
-- 
2.0.2.611.g8c85416

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] Add failing test for fetching from multiple packs over dumb httpd
  2015-01-27 15:20 [PATCH] Add failing test for fetching from multiple packs over dumb httpd Charles Bailey
@ 2015-01-27 18:12 ` Jeff King
  2015-01-27 18:29   ` Charles Bailey
  2015-01-27 20:02   ` [PATCH] dumb-http: do not pass NULL path to parse_pack_index Jeff King
  0 siblings, 2 replies; 6+ messages in thread
From: Jeff King @ 2015-01-27 18:12 UTC (permalink / raw)
  To: Charles Bailey; +Cc: git

On Tue, Jan 27, 2015 at 03:20:41PM +0000, Charles Bailey wrote:

> From: Charles Bailey <cbailey32@bloomberg.net>
> 
> When objects are spread across multiple packs, if an initial fetch does
> require all pack files, a subsequent fetch for objects in packs not
> retrieved in the initial fetch will fail.

s/does/does not/, I think?

> I'm not very familiar with the http client code so this analysis is based
> purely on observed behaviour.

Debugging the http code is a royal pain because all the work happens in
a separate helper. I use a git-remote-debug script like this:

  #!/bin/sh
  host=localhost:5001
  proto=$(echo "${2:-$1}" | sed 's/:.*//')
  prog=git-remote-$proto
  echo >&2 "gdb -ex 'target remote $host' $prog"
  gdbserver localhost:5001 "$prog" "$@"

and then you can use:

  git fetch debug::http://...

in the test script, cut-and-paste the gdb command printed to stderr, and
you're dropped into the appropriate debugger without worrying about all
of the stdio mess.

> When fetching only some refs from a repository served over dumb httpd Git
> appears to download all of the index files for the available packs but then
> only chooses the pack files that help it resolve the objects which we need.

Right. And it looks like we have special code in sha1_file.c to make
sure we do not trust an index which does not have a matching packfile.
So that's good.

The http-walker code does its own check, in fetch_and_setup_pack_index,
that checks for an existing valid copy of the index. If we don't have
it, we download the index and proceed. If we do, we skip straight to
grabbing the pack. But if we have it and it doesn't appear valid, we
return an error. And there seems to be a bug with checking the validity.

It looks like the culprit is 7b64469 (Allow parse_pack_index on
temporary files, 2010-04-19). It added a new "idx_path" parameter to
parse_pack_index, which we pass as NULL.  That causes its call to
check_packed_git_idx to fail (because it has no idea what file we are
talking about!).

This seems to fix it:

diff --git a/sha1_file.c b/sha1_file.c
index 30995e6..eda4d90 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1149,6 +1149,9 @@ struct packed_git *parse_pack_index(unsigned char *sha1, const char *idx_path)
 	const char *path = sha1_pack_name(sha1);
 	struct packed_git *p = alloc_packed_git(strlen(path) + 1);
 
+	if (!idx_path)
+		idx_path = sha1_pack_index_name(sha1);
+
 	strcpy(p->pack_name, path);
 	hashcpy(p->sha1, sha1);
 	if (check_packed_git_idx(idx_path, p)) {

(Alternatively, we could pass in sha1_pack_index_name instead of NULL in
the first place, but I think it is reasonable for parse_pack_index to
take care of this).

I think it may also make sense for fetch_and_setup_pack_index to delete
and re-download a broken .idx file (rather than aborting), but I don't
think that's a big deal. It should only happen in the face of on-disk
data corruption, and the user can remove the broken .idx themselves.

-Peff

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] Add failing test for fetching from multiple packs over dumb httpd
  2015-01-27 18:12 ` Jeff King
@ 2015-01-27 18:29   ` Charles Bailey
  2015-01-27 20:02   ` [PATCH] dumb-http: do not pass NULL path to parse_pack_index Jeff King
  1 sibling, 0 replies; 6+ messages in thread
From: Charles Bailey @ 2015-01-27 18:29 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Tue, Jan 27, 2015 at 01:12:21PM -0500, Jeff King wrote:
> On Tue, Jan 27, 2015 at 03:20:41PM +0000, Charles Bailey wrote:
> 
> > From: Charles Bailey <cbailey32@bloomberg.net>
> > 
> > When objects are spread across multiple packs, if an initial fetch does
> > require all pack files, a subsequent fetch for objects in packs not
> > retrieved in the initial fetch will fail.
> 
> s/does/does not/, I think?

Yes, that's definitely what I meant to write.

[...]
> It looks like the culprit is 7b64469 (Allow parse_pack_index on
> temporary files, 2010-04-19). It added a new "idx_path" parameter to
> parse_pack_index, which we pass as NULL.  That causes its call to
> check_packed_git_idx to fail (because it has no idea what file we are
> talking about!).

That change looks like it went into 1.7.1.1. I cannot confirm this
working before then but we've definitely seen the bug in 1.7.12.3 and
more recent versions.

> This seems to fix it:
> 
> diff --git a/sha1_file.c b/sha1_file.c
> index 30995e6..eda4d90 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -1149,6 +1149,9 @@ struct packed_git *parse_pack_index(unsigned char *sha1, const char *idx_path)
>  	const char *path = sha1_pack_name(sha1);
>  	struct packed_git *p = alloc_packed_git(strlen(path) + 1);
>  
> +	if (!idx_path)
> +		idx_path = sha1_pack_index_name(sha1);
> +
>  	strcpy(p->pack_name, path);
>  	hashcpy(p->sha1, sha1);
>  	if (check_packed_git_idx(idx_path, p)) {

It certainly fixes my test script and I can give this patch a test in
the 'real' world.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH] dumb-http: do not pass NULL path to parse_pack_index
  2015-01-27 18:12 ` Jeff King
  2015-01-27 18:29   ` Charles Bailey
@ 2015-01-27 20:02   ` Jeff King
  2015-01-27 20:19     ` Charles Bailey
  1 sibling, 1 reply; 6+ messages in thread
From: Jeff King @ 2015-01-27 20:02 UTC (permalink / raw)
  To: Charles Bailey; +Cc: Shawn Pearce, Junio C Hamano, git

On Tue, Jan 27, 2015 at 01:12:20PM -0500, Jeff King wrote:

> It looks like the culprit is 7b64469 (Allow parse_pack_index on
> temporary files, 2010-04-19). It added a new "idx_path" parameter to
> parse_pack_index, which we pass as NULL.  That causes its call to
> check_packed_git_idx to fail (because it has no idea what file we are
> talking about!).

This is not quite right. That refactoring commit correctly passes the
result of sha1_pack_index_name() for the newly-added parameter. It's the
_next_ commit which then erroneously passes NULL. :)

Here's my fix with a proper commit message. I tweaked the title on your
test, as I didn't find the original very descriptive. I also bumped the
call to sha1_pack_index_name() into the caller, as that makes more sense
to me after reading the older commits (and there are literally no other
callers outside of this function).

-- >8 --
Once upon a time, dumb http always fetched .idx files
directly into their final location, and then checked their
validity with parse_pack_index. This was refactored in
commit 750ef42 (http-fetch: Use temporary files for
pack-*.idx until verified, 2010-04-19), which uses the
following logic:

  1. If we have the idx already in place, see if it's
     valid (using parse_pack_index). If so, use it.

  2. Otherwise, fetch the .idx to a tempfile, check
     that, and if so move it into place.

  3. Either way, fetch the pack itself if necessary.

However, it got step 1 wrong. We pass a NULL path parameter
to parse_pack_index, so an existing .idx file always looks
broken. Worse, we do not treat this broken .idx as an
opportunity to re-fetch, but instead return an error,
ignoring the pack entirely. This can lead to a dumb-http
fetch failing to retrieve the necessary objects.

This doesn't come up much in practice, because it must be a
packfile that we found out about (and whose .idx we stored)
during an earlier dumb-http fetch, but whose packfile we
_didn't_ fetch. I.e., we did a partial clone of a
repository, didn't need some packfiles, and now a followup
fetch needs them.

Discovery and tests by Charles Bailey <charles@hashpling.org>.

Signed-off-by: Jeff King <peff@peff.net>
---
I'm happy to flip the authorship on this. You have more lines in it than
I do. :)

As I mentioned before, I think we could treat a broken .idx file
differently (by deleting and refetching), but I doubt it matters in
practice. The only reason this came up at all is that our test for
"broken" was bogus, and it may be better to let a user diagnose and
deal with breakage.

 http.c                     |  2 +-
 t/t5550-http-fetch-dumb.sh | 18 ++++++++++++++++++
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/http.c b/http.c
index 040f362..f2373c0 100644
--- a/http.c
+++ b/http.c
@@ -1240,7 +1240,7 @@ static int fetch_and_setup_pack_index(struct packed_git **packs_head,
 	int ret;
 
 	if (has_pack_index(sha1)) {
-		new_pack = parse_pack_index(sha1, NULL);
+		new_pack = parse_pack_index(sha1, sha1_pack_index_name(sha1));
 		if (!new_pack)
 			return -1; /* parse_pack_index() already issued error message */
 		goto add_pack;
diff --git a/t/t5550-http-fetch-dumb.sh b/t/t5550-http-fetch-dumb.sh
index ac71418..6da9422 100755
--- a/t/t5550-http-fetch-dumb.sh
+++ b/t/t5550-http-fetch-dumb.sh
@@ -165,6 +165,24 @@ test_expect_success 'fetch notices corrupt idx' '
 	)
 '
 
+test_expect_success 'fetch can handle previously-fetched .idx files' '
+	git checkout --orphan branch1 &&
+	echo base >file &&
+	git add file &&
+	git commit -m base &&
+	git --bare init "$HTTPD_DOCUMENT_ROOT_PATH"/repo_packed_branches.git &&
+	git push "$HTTPD_DOCUMENT_ROOT_PATH"/repo_packed_branches.git branch1 &&
+	git --git-dir="$HTTPD_DOCUMENT_ROOT_PATH"/repo_packed_branches.git repack -d &&
+	git checkout -b branch2 branch1 &&
+	echo b2 >>file &&
+	git commit -a -m b2 &&
+	git push "$HTTPD_DOCUMENT_ROOT_PATH"/repo_packed_branches.git branch2 &&
+	git --git-dir="$HTTPD_DOCUMENT_ROOT_PATH"/repo_packed_branches.git repack -d &&
+	git --bare init clone_packed_branches.git &&
+	git --git-dir=clone_packed_branches.git fetch "$HTTPD_URL"/dumb/repo_packed_branches.git branch1:branch1 &&
+	git --git-dir=clone_packed_branches.git fetch "$HTTPD_URL"/dumb/repo_packed_branches.git branch2:branch2
+'
+
 test_expect_success 'did not use upload-pack service' '
 	grep '/git-upload-pack' <"$HTTPD_ROOT_PATH"/access.log >act
 	: >exp
-- 
2.3.0.rc1.287.g761fd19

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] dumb-http: do not pass NULL path to parse_pack_index
  2015-01-27 20:02   ` [PATCH] dumb-http: do not pass NULL path to parse_pack_index Jeff King
@ 2015-01-27 20:19     ` Charles Bailey
  2015-01-27 20:46       ` Junio C Hamano
  0 siblings, 1 reply; 6+ messages in thread
From: Charles Bailey @ 2015-01-27 20:19 UTC (permalink / raw)
  To: Jeff King; +Cc: Shawn Pearce, Junio C Hamano, git

On Tue, Jan 27, 2015 at 03:02:27PM -0500, Jeff King wrote:
> Discovery and tests by Charles Bailey <charles@hashpling.org>.
> 
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> I'm happy to flip the authorship on this. You have more lines in it than
> I do. :)

No, I'm happy with you taking the blame/praise for this, it's definitely
your fix.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] dumb-http: do not pass NULL path to parse_pack_index
  2015-01-27 20:19     ` Charles Bailey
@ 2015-01-27 20:46       ` Junio C Hamano
  0 siblings, 0 replies; 6+ messages in thread
From: Junio C Hamano @ 2015-01-27 20:46 UTC (permalink / raw)
  To: Charles Bailey; +Cc: Jeff King, Shawn Pearce, git

Charles Bailey <charles@hashpling.org> writes:

> On Tue, Jan 27, 2015 at 03:02:27PM -0500, Jeff King wrote:
>> Discovery and tests by Charles Bailey <charles@hashpling.org>.
>> 
>> Signed-off-by: Jeff King <peff@peff.net>
>> ---
>> I'm happy to flip the authorship on this. You have more lines in it than
>> I do. :)
>
> No, I'm happy with you taking the blame/praise for this, it's definitely
> your fix.

Looks good (rather, makes the original look obviously broken and
makes me wonder why our review process did not catch that bogus NULL
in the first place).

Thanks, both.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2015-01-27 20:46 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-27 15:20 [PATCH] Add failing test for fetching from multiple packs over dumb httpd Charles Bailey
2015-01-27 18:12 ` Jeff King
2015-01-27 18:29   ` Charles Bailey
2015-01-27 20:02   ` [PATCH] dumb-http: do not pass NULL path to parse_pack_index Jeff King
2015-01-27 20:19     ` Charles Bailey
2015-01-27 20:46       ` Junio C Hamano

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.