* duplicate objects in packfile
@ 2013-08-14 18:17 Jeff King
  2013-08-14 18:29 ` Junio C Hamano
  2013-08-14 18:50 ` duplicate objects in packfile Nicolas Pitre
  0 siblings, 2 replies; 37+ messages in thread
From: Jeff King @ 2013-08-14 18:17 UTC (permalink / raw)
  To: git; +Cc: Shawn O. Pearce, Junio C Hamano, Nicolas Pitre

I'm tracking down a rather odd problem in a packfile I found on GitHub.
This particular packfile contains the same object at various offsets
within the file. In fact there are several packfiles that exhibit this
behavior, all in the same repository, and in each one there are several
duplicated objects (some of which are present 3 or even 4 times).

index-pack is happy to index these packfiles, and just creates multiple
entries for the object. The usual binary search in find_pack_entry_one
will find one of them (though of course which one depends on the exact
layout of the index). But curiously, the experimental sha1_entry_pos
lookup does not.  It hits an assert() that can only be triggered in the
face of duplicate objects. For example:

  $ git cat-file -t 4ea4acbcb0930ac42acc87a0d203864dec1a9697
  commit

  $ GIT_USE_LOOKUP=1 git cat-file -t 4ea4acbcb0930ac42acc87a0d203864dec1a9697
  git: sha1-lookup.c:222: sha1_entry_pos: Assertion `lov < hiv' failed.
  Aborted

It's easy enough to fix with a repack, but I am curious:

  1. Is sha1_entry_pos wrong to barf on duplicate items in the index? If
     so, do we want to fix it, or simply retire GIT_USE_LOOKUP?

     Related, should we consider duplicate items in a packfile to be a
     bogus packfile (and consequently notice and complain during
     indexing)? I don't think it _hurts_ anything (aside from the assert
     above), though it is of course wasteful.

     I didn't go into detail on how the assertion above is triggered,
     but I can break it down line by line if anybody cares; the short of
     it is that it can only be caused by an unsorted index or a
     duplicate entry (see the sketch after this list).

  2. How can duplicate entries get into a packfile?

     Git itself should not generate duplicate entries (pack-objects is
     careful to remove duplicates). Since these packs almost certainly
     were pushed by a client, I wondered if "index-pack --fix-thin"
     might accidentally add multiple copies of an object when it is the
     preferred base for multiple objects, but it specifically avoids
     doing so.

     The packs in question were received by us in 2010. Did we ever have
     bugs in this area? I don't recall any, nor could I find any in the
     history.

     That makes me suspect the user may have been using an alternate git
     implementation. libgit2 did not know how to pack in 2010, and Grit
     still doesn't.  JGit did, and I don't know offhand about Dulwich.
     Does anyone know of bugs in those implementations that might have
     caused this?
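
Here is the sketch promised in item 1 above -- a deliberately
simplified model (one-byte keys rather than 20-byte sha1s; not the
actual sha1-lookup.c code) of the step that trips the assert:

  #include <assert.h>
  #include <stddef.h>

  /*
   * One probe of an interpolation search over sorted one-byte keys
   * (lo and hi are inclusive indices). The real sha1_entry_pos works
   * on 20-byte sha1s and derives lov/hiv from the first byte that
   * differs between the endpoint keys, but the idea is the same.
   */
  static size_t probe(const unsigned char *keys, size_t lo, size_t hi,
                      unsigned char want)
  {
          unsigned lov = keys[lo];   /* value at the low endpoint */
          unsigned hiv = keys[hi];   /* value at the high endpoint */

          /*
           * Inside a run of duplicate keys, lov == hiv and the
           * scaling below would divide by zero; the real code guards
           * this with assert(lov < hiv), which is the assertion
           * quoted above. A vanilla binary search never looks at the
           * key values, so it just lands on one of the equal entries.
           */
          assert(lov < hiv);
          return lo + (hi - lo) * (want - lov) / (hiv - lov);
  }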

The packs in question are public, so I can share them if anybody is
curious to look (but the whole repository is on the order of 700M).

Given the dates on the packs and how rare this is, I'm pretty much
willing to chalk it up to a random bug (in git or otherwise) that no
longer exists. But I was curious whether anybody else knew anything,
and we may want to fix sha1_entry_pos to behave more like the regular
binary search.

-Peff


* Re: duplicate objects in packfile
  2013-08-14 18:17 duplicate objects in packfile Jeff King
@ 2013-08-14 18:29 ` Junio C Hamano
  2013-08-14 18:39   ` Junio C Hamano
  2013-08-14 18:50 ` duplicate objects in packfile Nicolas Pitre
  1 sibling, 1 reply; 37+ messages in thread
From: Junio C Hamano @ 2013-08-14 18:29 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Shawn O. Pearce, Nicolas Pitre

Jeff King <peff@peff.net> writes:

> lookup does not.  It hits an assert() that can only be triggered in the
> face of duplicate objects. For example:
>
>   $ git cat-file -t 4ea4acbcb0930ac42acc87a0d203864dec1a9697
>   commit
>
>   $ GIT_USE_LOOKUP=1 git cat-file -t 4ea4acbcb0930ac42acc87a0d203864dec1a9697
>   git: sha1-lookup.c:222: sha1_entry_pos: Assertion `lov < hiv' failed.
>   Aborted

Older versions of JGit used to put duplicate objects in a pack, and
IIRC Shawn declared that a bug in the packer and fixed it, so from
that point of view, I think rejecting is probably the right thing,
even though I think warning and continuing is also acceptable and
indeed may be better.


* Re: duplicate objects in packfile
  2013-08-14 18:29 ` Junio C Hamano
@ 2013-08-14 18:39   ` Junio C Hamano
  2013-08-14 18:54     ` Nicolas Pitre
  2013-08-16 15:01     ` Jeff King
  0 siblings, 2 replies; 37+ messages in thread
From: Junio C Hamano @ 2013-08-14 18:39 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Shawn O. Pearce, Nicolas Pitre

Junio C Hamano <gitster@pobox.com> writes:

> Jeff King <peff@peff.net> writes:
>
>> lookup does not.  It hits an assert() that can only be triggered in the
>> face of duplicate objects. For example:
>>
>>   $ git cat-file -t 4ea4acbcb0930ac42acc87a0d203864dec1a9697
>>   commit
>>
>>   $ GIT_USE_LOOKUP=1 git cat-file -t 4ea4acbcb0930ac42acc87a0d203864dec1a9697
>>   git: sha1-lookup.c:222: sha1_entry_pos: Assertion `lov < hiv' failed.
>>   Aborted
>
> Older versions of JGit used to put duplicate objects in a pack, and
> IIRC Shawn declared that a bug in the packer and fixed it, so from
> that point of view, I think rejecting is probably the right thing,
> even though I think warning and continuing is also acceptable and
> indeed may be better.

Also repacking may have a funny corner case. I do not recall the
details as the above was a long time ago, but when I was tracking it
down, a delta was made against one copy of the base object, and
referred to it using delta-offset, while there was another copy of
the base object which was found by the bisection search, and from
there on, the inconsistencies between these two (they represent the
same payload, but they are at different offsets in the same pack and
with different in-pack sizes) led to some funky behaviour.
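
A hypothetical layout (made-up offsets and sizes, just to illustrate
the shape of the problem) would be:

  offset 1000: object X, stored whole   (in-pack size 40)
  offset 2000: object X, stored whole   (in-pack size 44)  <- duplicate
  offset 3000: delta with base "offset 1000" (ofs-delta)

The delta at 3000 reaches its base directly by offset and sees the
copy at 1000, while a lookup of X by name through the .idx bisection
may just as well land on the copy at 2000; code comparing the two
answers sees the same payload with different offsets and in-pack
sizes.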


* Re: duplicate objects in packfile
  2013-08-14 18:17 duplicate objects in packfile Jeff King
  2013-08-14 18:29 ` Junio C Hamano
@ 2013-08-14 18:50 ` Nicolas Pitre
  1 sibling, 0 replies; 37+ messages in thread
From: Nicolas Pitre @ 2013-08-14 18:50 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Shawn O. Pearce, Junio C Hamano

On Wed, 14 Aug 2013, Jeff King wrote:

>   1. Is sha1_entry_pos wrong to barf on duplicate items in the index? If
>      so, do we want to fix it, or simply retire GIT_USE_LOOKUP?

I'd think that sha1_entry_pos should be more lenient here, especially if
this doesn't compromise the overall git behavior.

>      Related, should we consider duplicate items in a packfile to be a
>      bogus packfile (and consequently notice and complain during
>      indexing)? I don't think it _hurts_ anything (aside from the assert
>      above), though it is of course wasteful.

This should indeed be considered a bogus pack file.  But we have a lot 
of code to be able to cope with bogus/corrupted pack files already.  
Handling this case as well would not hurt.

More importantly we should make sure the code we have doesn't generate 
such packs.

>   2. How can duplicate entries get into a packfile?
> 
>      Git itself should not generate duplicate entries (pack-objects is
>      careful to remove duplicates). Since these packs almost certainly
>      were pushed by a client, I wondered if "index-pack --fix-thin"
>      might accidentally add multiple copies of an object when it is the
>      preferred base for multiple objects, but it specifically avoids
>      doing so.

It is probably simpler than that.  An alternative pack-objects 
implementation could stream multiple copies of an object upon a push, 
and index-pack on the receiving end would simply store what's been 
received to disk as is without a fuss.

> Given the dates on the packs and how rare this is, I'm pretty much
> willing to chalk it up to a random bug (in git or otherwise) that does
> not any longer exist.

Possibly.  Given this is not compromising the validity of the pack, and
a simple repack "fixes" it, I would not worry too much about it.


Nicolas


* Re: duplicate objects in packfile
  2013-08-14 18:39   ` Junio C Hamano
@ 2013-08-14 18:54     ` Nicolas Pitre
  2013-08-16 15:01     ` Jeff King
  1 sibling, 0 replies; 37+ messages in thread
From: Nicolas Pitre @ 2013-08-14 18:54 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git, Shawn O. Pearce

On Wed, 14 Aug 2013, Junio C Hamano wrote:

> Also repacking may have a funny corner case. I do not recall the
> details as the above was a long time ago, but when I was tracking it
> down, a delta was made against one copy of the base object, and
> referred to it using delta-offset, while there was another copy of
> the base object which was found by the bisection search, and from
> there on, the inconsistencies between these two (they represent the
> same payload, but they are at different offsets in the same pack and
> with different in-pack sizes) led to some funky behaviour.

Crap.

Better try to cope with this (with a test case, etc.) rather than
rejecting them now, I'd say.  Strictly speaking, they're still valid pack
files.


Nicolas


* Re: duplicate objects in packfile
  2013-08-14 18:39   ` Junio C Hamano
  2013-08-14 18:54     ` Nicolas Pitre
@ 2013-08-16 15:01     ` Jeff King
  2013-08-21 20:49       ` [RFC/PATCH 0/4] duplicate objects in packfiles Jeff King
  1 sibling, 1 reply; 37+ messages in thread
From: Jeff King @ 2013-08-16 15:01 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Shawn O. Pearce, Nicolas Pitre

On Wed, Aug 14, 2013 at 11:39:34AM -0700, Junio C Hamano wrote:

> > Older versions of JGit used to put duplicate objects in a pack, and
> > IIRC Shawn declared that a bug in the packer and fixed it, so from
> > that point of view, I think rejecting is probably the right thing,
> > even though I think warning and continuing is also acceptable and
> > indeed may be better.
> 
> Also repacking may have a funny corner case. I do not recall the
> details as the above was a long time ago, but when I was tracking it
> down, a delta was made against one copy of the base object, and
> referred to it using delta-offset, while there was another copy of
> the base object which was found by the bisection search, and from
> there on, the inconsistencies between these two (they represent the
> same payload, but they are at different offsets in the same pack and
> with different in-pack sizes) led to some funky behaviour.

Thanks for the pointer. I found this commit:

  https://eclipse.googlesource.com/jgit/jgit/+/2fbf296fda205446eac11a13abd4fcdb182f28d9

which is presumably what you're thinking of.

I did not run into the problem described in my case, but presumably I
did not have a delta cycle between the multiple versions. In theory we
should find the same copy of the object each time we search, but there
are enough code paths to access the objects that I would not be
surprised if such funkiness is still triggerable, including infinite
loops.

That makes me inclined to teach index-pack to reject duplicate objects
in a single pack in order to prevent denial-of-service attacks. We can
potentially make them work in all code paths, but given that nobody
should be doing this legitimately, rejecting the duplicates outright
keeps our attack surface small, and nobody but attackers or users of
broken implementations should care.

-Peff


* [RFC/PATCH 0/4] duplicate objects in packfiles
  2013-08-16 15:01     ` Jeff King
@ 2013-08-21 20:49       ` Jeff King
  2013-08-21 20:51         ` [PATCH 1/4] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
                           ` (4 more replies)
  0 siblings, 5 replies; 37+ messages in thread
From: Jeff King @ 2013-08-21 20:49 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Shawn O. Pearce, Nicolas Pitre

On Fri, Aug 16, 2013 at 11:01:38AM -0400, Jeff King wrote:

> That makes me inclined to teach index-pack to reject duplicate objects
> in a single pack in order to prevent denial-of-service attacks. We can
> potentially make them work in all code paths, but given that nobody
> should be doing this legitimately, rejecting the duplicates outright
> keeps our attack surface small, and nobody but attackers or users of
> broken implementations should care.

Here's a patch series in that direction:

  [1/4]: sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP

This reproduces and fixes the sha1-lookup bug. We should do this no
matter what else we do.

  [2/4]: index-pack: optionally reject packs with duplicate objects

This adds a pack.indexDuplicates option so that sites receiving
packfiles from random folks on the internet can protect themselves from
the potential denial-of-service mentioned above. The default remains to
allow it.
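
Concretely, with that patch applied (the object name below is just a
placeholder):

  $ git -c pack.indexDuplicates=reject index-pack --stdin <dups.pack
  fatal: pack has duplicate entries for <sha1>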

  [3/4]: reject duplicates when indexing a packfile we created

This is a safety check for packs we generate. Optional, but I think it's
probably a good idea (and doesn't cost very much).

  [4/4]: index-pack: optionally skip duplicate packfile entries

I really wanted to have a "fix" mode where we could take in packs with
duplicate entries and just use them as-is. It's not correct to throw
away the duplicates (they may be bases in a delta cycle), but we could
maybe get by with simply not referencing them in the pack index.
Unfortunately, the pack reader does not like the index we generate; see
the patch for details and possible solutions (all of which are ugly).
And it only helps in a delta cycle when delta base offsets are used.

I had hoped to have a 5/4 flipping the default to "skip", since it would
potentially fix the infinite loop problem and wouldn't have a downside.
But since it doesn't work (and cannot fix the REF_DELTA case), it seems
like a bad idea.

Which leaves the open question: should the default for index-pack flip
to reject duplicates rather than allow? It seems like it would be worth
it to identify buggy packfiles before they move between repos. And in an
emergency, we have the config flag to be more permissive in case
somebody really needs to move the data via git.

Thoughts?

-Peff


* [PATCH 1/4] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP
  2013-08-21 20:49       ` [RFC/PATCH 0/4] duplicate objects in packfiles Jeff King
@ 2013-08-21 20:51         ` Jeff King
  2013-08-21 20:52         ` [PATCH 2/4] index-pack: optionally reject packs with duplicate objects Jeff King
                           ` (3 subsequent siblings)
  4 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2013-08-21 20:51 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Shawn O. Pearce, Nicolas Pitre

The sha1_entry_pos function tries to be smart about
selecting the middle of a range for its binary search by
looking at the value differences between the "lo" and "hi"
constraints. However, it is unable to cope with entries with
duplicate keys in the sorted list.

We may hit a point in the search where both our "lo" and
"hi" point to the same key. In this case, the range of
values between our endpoints is 0, and trying to scale the
difference between our key and the endpoints over that range
is undefined (i.e., divide by zero). The current code
catches this with an "assert(lov < hiv)".

Moreover, after seeing that the first 20 bytes of the key are
the same, we will try to establish a value from the 21st
byte, which is nonsensical.

Instead, we can detect the case that we are in a run of
duplicates, and simply do a final comparison against any one
of them (since they are all the same, it does not matter
which). If the keys match, we have found our entry (or one
of them, anyway).  If not, then we know that we do not need
to look further, as we must be in a run of the duplicate
key.

Furthermore, we know that one of our endpoints must be
the edge of the run of duplicates. For example, given this
sequence:

 idx 0 1 2 3 4 5
 key A C C C C D

If we are searching for "B", we might hit the duplicate run
at lo=1, hi=3 (e.g., by first mi=3, then mi=0). But we can
never have lo > 1, because B < C. That is, if our key is
less than the run, we know that "lo" is the edge, but we can
say nothing of "hi". Similarly, if our key is greater than
the run, we know that "hi" is the edge, but we can say
nothing of "lo". But that is enough for us to return not
only "not found", but show the position at which we would
insert the new item.

Signed-off-by: Jeff King <peff@peff.net>
---
 sha1-lookup.c                     | 23 ++++++++++
 t/t5308-pack-detect-duplicates.sh | 97 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 120 insertions(+)
 create mode 100755 t/t5308-pack-detect-duplicates.sh

diff --git a/sha1-lookup.c b/sha1-lookup.c
index c4dc55d..614cbb6 100644
--- a/sha1-lookup.c
+++ b/sha1-lookup.c
@@ -204,7 +204,30 @@ int sha1_entry_pos(const void *table,
 			 * byte 0 thru (ofs-1) are the same between
 			 * lo and hi; ofs is the first byte that is
 			 * different.
+			 *
+			 * If ofs==20, then no bytes are different,
+			 * meaning we have entries with duplicate
+			 * keys. We know that we are in a solid run
+			 * of this entry (because the entries are
+			 * sorted, and our lo and hi are the same,
+			 * there can be nothing but this single key
+			 * in between). So we can stop the search.
+			 * Either one of these entries is it (and
+			 * we do not care which), or we do not have
+			 * it.
 			 */
+			if (ofs == 20) {
+				mi = lo;
+				mi_key = base + elem_size * mi + key_offset;
+				cmp = memcmp(mi_key, key, 20);
+				if (!cmp)
+					return mi;
+				if (cmp < 0)
+					return -1 - hi;
+				else
+					return -1 - lo;
+			}
+
 			hiv = hi_key[ofs_0];
 			if (ofs_0 < 19)
 				hiv = (hiv << 8) | hi_key[ofs_0+1];
diff --git a/t/t5308-pack-detect-duplicates.sh b/t/t5308-pack-detect-duplicates.sh
new file mode 100755
index 0000000..4800379
--- /dev/null
+++ b/t/t5308-pack-detect-duplicates.sh
@@ -0,0 +1,97 @@
+#!/bin/sh
+
+test_description='handling of duplicate objects in incoming packfiles'
+. ./test-lib.sh
+
+# git will never intentionally create packfiles with
+# duplicate objects, so we have to construct them by hand.
+#
+# $1 is the number of times to duplicate each object (must be
+# less than 127 for simplicity of implementation).
+create_pack() {
+	{
+		# header magic
+		printf 'PACK' &&
+		# version 2
+		printf '\0\0\0\2' &&
+		# num of objects
+		printf '\0\0\0\'"$(printf "%o" $(($1*2)))" &&
+
+		for i in $(test_seq 1 "$1"); do
+			# blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
+			printf '\060\170\234\003\0\0\0\0\1' &&
+			# blob e68fe8129b546b101aee9510c5328e7f21ca1d18
+			printf '\062\170\234\143\267\3\0\0\116\0\106'
+		done
+	} >tmp.pack &&
+	# we store and cat the pack because we also have to output
+	# the sha1 pack trailer
+	cat tmp.pack &&
+	test-sha1 <tmp.pack | hex2bytes &&
+	rm -f tmp.pack
+}
+
+# convert an ascii hex representation of bytes into binary
+hex2bytes() {
+	"$PERL_PATH" -ne 's/[0-9a-f]{2}/print chr hex $&/ge'
+}
+
+# remove any existing packs, since we want to make
+# sure future operations use any new packs we are about
+# to install
+clear_packs() {
+	rm -f .git/objects/pack/*
+}
+
+# The sha1s we have in our pack. It's important that these have the same
+# starting byte, so that they end up in the same fanout section of the index.
+# That lets us make sure we are exercising the binary search with both sets.
+LO_SHA1=e68fe8129b546b101aee9510c5328e7f21ca1d18
+HI_SHA1=e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
+
+# And here's a "missing sha1" which will produce failed lookups. It must also
+# be in the same fanout section, and should be between the two (so that during
+# our binary search, we are sure to end up looking at one or the other of the
+# duplicate runs).
+MISSING_SHA1='e69d000000000000000000000000000000000000'
+
+# double-check that create_pack actually works
+test_expect_success 'pack with no duplicates' '
+	create_pack 1 >no-dups.pack &&
+	git index-pack --stdin <no-dups.pack
+'
+
+test_expect_success 'index-pack will allow duplicate objects by default' '
+	clear_packs &&
+	create_pack 100 >dups.pack &&
+	git index-pack --stdin <dups.pack
+'
+
+test_expect_success 'create test vectors' '
+	cat >input <<-EOF &&
+	$LO_SHA1
+	$HI_SHA1
+	$MISSING_SHA1
+	EOF
+	cat >expect <<-EOF
+	$LO_SHA1 blob 2
+	$HI_SHA1 blob 0
+	$MISSING_SHA1 missing
+	EOF
+'
+
+test_expect_success 'lookup in duplicated pack (binary search)' '
+	git cat-file --batch-check <input >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'lookup in duplicated pack (GIT_USE_LOOKUP)' '
+	(
+		GIT_USE_LOOKUP=1 &&
+		export GIT_USE_LOOKUP &&
+		git cat-file --batch-check <input >actual
+	) &&
+	test_cmp expect actual
+'
+
+test_done
-- 
1.8.4.rc2.28.g6bb5f3f


* [PATCH 2/4] index-pack: optionally reject packs with duplicate objects
  2013-08-21 20:49       ` [RFC/PATCH 0/4] duplicate objects in packfiles Jeff King
  2013-08-21 20:51         ` [PATCH 1/4] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
@ 2013-08-21 20:52         ` Jeff King
  2013-08-22 13:45           ` Duy Nguyen
  2013-08-21 20:53         ` [PATCH 3/4] reject duplicates when indexing a packfile we created Jeff King
                           ` (2 subsequent siblings)
  4 siblings, 1 reply; 37+ messages in thread
From: Jeff King @ 2013-08-21 20:52 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Shawn O. Pearce, Nicolas Pitre

Git should never generate packs with duplicate objects.
However, we may see such packs due to bugs in Git or other
implementations (e.g., JGit had such a bug a few years ago).

In theory, such packs should not be a problem for us (we
will simply find one of the instances of the object when
looking in the pack). However, the JGit bug report mentioned
possible infinite loops during repacking due to cycles in
the delta chain.  Though this problem hasn't specifically
been reproduced on modern git, there is no reason not to be
careful with incoming packs, given that only buggy
implementations should be producing such packs, anyway.

This patch introduces the pack.indexDuplicates option to
allow or reject such packs from index-pack. The default
remains to allow it.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/index-pack.c              | 10 ++++++++++
 pack-write.c                      | 23 +++++++++++++++++++++++
 pack.h                            |  5 +++++
 t/t5308-pack-detect-duplicates.sh |  8 ++++++++
 4 files changed, 46 insertions(+)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 9c1cfac..83fd3bb 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1369,6 +1369,16 @@ static int git_index_pack_config(const char *k, const char *v, void *cb)
 #endif
 		return 0;
 	}
+	if (!strcmp(k, "pack.indexduplicates")) {
+		int boolval = git_config_maybe_bool(k, v);
+		if (boolval > 0 || !strcmp(v, "ignore"))
+			opts->duplicates = WRITE_IDX_DUPLICATES_IGNORE;
+		else if (boolval == 0 || !strcmp(v, "reject"))
+			opts->duplicates = WRITE_IDX_DUPLICATES_REJECT;
+		else
+			die("unknown value for pack.indexduplicates: %s", v);
+		return 0;
+	}
 	return git_default_config(k, v, cb);
 }
 
diff --git a/pack-write.c b/pack-write.c
index ca9e63b..542b081 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -37,6 +37,19 @@ static int need_large_offset(off_t offset, const struct pack_idx_option *opts)
 			 sizeof(ofsval), cmp_uint32);
 }
 
+static void *find_duplicate(void *vbase, size_t n, size_t size,
+			    int (*cmp)(const void *, const void *))
+{
+	unsigned char *base = vbase;
+	while (n > 1) {
+		if (!cmp(base, base + size))
+			return base;
+		base += size;
+		n--;
+	}
+	return NULL;
+}
+
 /*
  * On entry *sha1 contains the pack content SHA1 hash, on exit it is
  * the SHA1 hash of sorted object names. The objects array passed in
@@ -68,6 +81,16 @@ const char *write_idx_file(const char *index_name, struct pack_idx_entry **objec
 	else
 		sorted_by_sha = list = last = NULL;
 
+	if (opts->duplicates == WRITE_IDX_DUPLICATES_REJECT) {
+		struct pack_idx_entry **dup;
+
+		dup = find_duplicate(sorted_by_sha, nr_objects,
+				     sizeof(*sorted_by_sha), sha1_compare);
+		if (dup)
+			die("pack has duplicate entries for %s",
+			    sha1_to_hex((*dup)->sha1));
+	}
+
 	if (opts->flags & WRITE_IDX_VERIFY) {
 		assert(index_name);
 		f = sha1fd_check(index_name);
diff --git a/pack.h b/pack.h
index aa6ee7d..cd4b536 100644
--- a/pack.h
+++ b/pack.h
@@ -44,6 +44,11 @@ struct pack_idx_option {
 	uint32_t version;
 	uint32_t off32_limit;
 
+	enum {
+		WRITE_IDX_DUPLICATES_IGNORE = 0,
+		WRITE_IDX_DUPLICATES_REJECT
+	} duplicates;
+
 	/*
 	 * List of offsets that would fit within off32_limit but
 	 * need to be written out as 64-bit entity for byte-for-byte
diff --git a/t/t5308-pack-detect-duplicates.sh b/t/t5308-pack-detect-duplicates.sh
index 4800379..0f2d928 100755
--- a/t/t5308-pack-detect-duplicates.sh
+++ b/t/t5308-pack-detect-duplicates.sh
@@ -94,4 +94,12 @@ test_expect_success 'lookup in duplicated pack (GIT_USE_LOOKUP)' '
 	test_cmp expect actual
 '
 
+test_expect_success 'index-pack can reject packs with duplicates' '
+	clear_packs &&
+	create_pack 2 >dups.pack &&
+	test_must_fail \
+		git -c pack.indexDuplicates=0 index-pack --stdin <dups.pack &&
+	test_expect_code 1 git cat-file -e $LO_SHA1
+'
+
 test_done
-- 
1.8.4.rc2.28.g6bb5f3f


* [PATCH 3/4] reject duplicates when indexing a packfile we created
  2013-08-21 20:49       ` [RFC/PATCH 0/4] duplicate objects in packfiles Jeff King
  2013-08-21 20:51         ` [PATCH 1/4] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
  2013-08-21 20:52         ` [PATCH 2/4] index-pack: optionally reject packs with duplicate objects Jeff King
@ 2013-08-21 20:53         ` Jeff King
  2013-08-21 20:55         ` [DO NOT APPLY PATCH 4/4] index-pack: optionally skip duplicate packfile entries Jeff King
  2013-08-21 22:17         ` [RFC/PATCH 0/4] duplicate objects in packfiles Junio C Hamano
  4 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2013-08-21 20:53 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Shawn O. Pearce, Nicolas Pitre

The pack index-writing code is capable of detecting and
rejecting duplicate entries. This can be used with
index-pack to prevent buggy foreign packs from entering the
repository.  However, we can also be careful about the packs
we generate, by double-checking during the index write that
we do not have any duplicate objects. This should never
happen, but it serves as an additional check that we are not
generating such packfiles.

Signed-off-by: Jeff King <peff@peff.net>
---
This turns on rejection everywhere _except_ a separately-run
index-pack. If we decide to turn it on there, too, it would make sense
to scrap this patch and just put it in reset_pack_idx_option().

 builtin/pack-objects.c | 1 +
 bulk-checkin.c         | 1 +
 fast-import.c          | 1 +
 3 files changed, 3 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index f069462..8da2a66 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2511,6 +2511,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	read_replace_refs = 0;
 
 	reset_pack_idx_option(&pack_idx_opts);
+	pack_idx_opts.duplicates = WRITE_IDX_DUPLICATES_REJECT;
 	git_config(git_pack_config, NULL);
 	if (!pack_compression_seen && core_compression_seen)
 		pack_compression_level = core_compression_level;
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 6b0b6d4..bad191f 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -174,6 +174,7 @@ static void prepare_to_stream(struct bulk_checkin_state *state,
 
 	state->f = create_tmp_packfile(&state->pack_tmp_name);
 	reset_pack_idx_option(&state->pack_idx_opts);
+	state->pack_idx_opts.duplicates = WRITE_IDX_DUPLICATES_REJECT;
 
 	/* Pretend we are going to write only one object */
 	state->offset = write_pack_header(state->f, 1);
diff --git a/fast-import.c b/fast-import.c
index 23f625f..b7beab0 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -3360,6 +3360,7 @@ int main(int argc, char **argv)
 
 	setup_git_directory();
 	reset_pack_idx_option(&pack_idx_opts);
+	pack_idx_opts.duplicates = WRITE_IDX_DUPLICATES_REJECT;
 	git_config(git_pack_config, NULL);
 	if (!pack_compression_seen && core_compression_seen)
 		pack_compression_level = core_compression_level;
-- 
1.8.4.rc2.28.g6bb5f3f


* [DO NOT APPLY PATCH 4/4] index-pack: optionally skip duplicate packfile entries
  2013-08-21 20:49       ` [RFC/PATCH 0/4] duplicate objects in packfiles Jeff King
                           ` (2 preceding siblings ...)
  2013-08-21 20:53         ` [PATCH 3/4] reject duplicates when indexing a packfile we created Jeff King
@ 2013-08-21 20:55         ` Jeff King
  2013-08-21 23:20           ` Junio C Hamano
  2013-08-21 22:17         ` [RFC/PATCH 0/4] duplicate objects in packfiles Junio C Hamano
  4 siblings, 1 reply; 37+ messages in thread
From: Jeff King @ 2013-08-21 20:55 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Shawn O. Pearce, Nicolas Pitre

When we are building the pack index, we can notice that
there are duplicate objects, pick one "winner" instance, and
mention the object only once in the index (mapped to the
winner's offset).

This has the effect that the duplicate packfile entries are
never found by lookup. The data still exists in the
packfile, though, and can be used as a delta base if delta
base offsets are used. If delta refs are used, then it is
possible that some deltas may be broken.

Unfortunately, this scheme does not quite work, as the pack
reader checks that the index and packfile claim the same
number of objects, and will refuse to open such a packfile.

We have a few options:

  1. Loosen the check in the reader. This drops a
     potentially useful sanity check on the data, and it
     won't work for any other implementations (including
     older versions of git).

  2. Loosen the check only if a special flag is found in the
     index indicating that we removed duplicates. This means
     that we only lose the safety check in the rare case
     that duplicates are found. But there is actually no
     place in the index to put such a flag, and in addition
     no other implementation would understand our flag.

  3. Instead of reducing the number of objects in the
     index, include "dummy" entries using the null sha1 or a
     similar sentinel value (and pointing to some invalid
     offset). This should work, but is awfully hacky (and
     will probably create havoc as we will now claim to have
     the null sha1, but will throw errors if you actually
     look at it).

Signed-off-by: Jeff King <peff@peff.net>
---
I think this line of "fixing" should probably be scrapped. Truly fixing
it and covering the REF_DELTA case would involve magic in the actual
object lookups (we would have to detect cycles and find the "other"
object that is the real base). And it's probably just not worth the
effort.

 builtin/index-pack.c              |  2 ++
 pack-write.c                      | 30 ++++++++++++++++++++++++++++++
 pack.h                            |  3 ++-
 t/t5308-pack-detect-duplicates.sh |  8 ++++++++
 4 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 83fd3bb..1dadd56 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1375,6 +1375,8 @@ static int git_index_pack_config(const char *k, const char *v, void *cb)
 			opts->duplicates = WRITE_IDX_DUPLICATES_IGNORE;
 		else if (boolval == 0 || !strcmp(v, "reject"))
 			opts->duplicates = WRITE_IDX_DUPLICATES_REJECT;
+		else if (!strcmp(v, "skip"))
+			opts->duplicates = WRITE_IDX_DUPLICATES_SKIP;
 		else
 			die("unknown value for pack.indexduplicates: %s", v);
 		return 0;
diff --git a/pack-write.c b/pack-write.c
index 542b081..657da2a 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -50,6 +50,27 @@ static void *find_duplicate(void *vbase, size_t n, size_t size,
 	return NULL;
 }
 
+static size_t remove_duplicates(void *base, size_t n, size_t size,
+				int (*cmp)(const void *, const void *))
+{
+	unsigned char *from = base, *to = base;
+
+	from += size;
+	to += size;
+	n--;
+
+	while (n > 0) {
+		if (cmp(from, from - size)) {
+			if (to != from)
+				memcpy(to, from, size);
+			to += size;
+		}
+		from += size;
+		n--;
+	}
+	return (to - (unsigned char *)base) / size;
+}
+
 /*
  * On entry *sha1 contains the pack content SHA1 hash, on exit it is
  * the SHA1 hash of sorted object names. The objects array passed in
@@ -89,6 +110,15 @@ const char *write_idx_file(const char *index_name, struct pack_idx_entry **objec
 		if (dup)
 			die("pack has duplicate entries for %s",
 			    sha1_to_hex((*dup)->sha1));
+	} else if (opts->duplicates == WRITE_IDX_DUPLICATES_SKIP) {
+		int nr_unique = remove_duplicates(sorted_by_sha,
+						  nr_objects,
+						  sizeof(*sorted_by_sha),
+						  sha1_compare);
+		if (nr_unique != nr_objects) {
+			nr_objects = nr_unique;
+			last = sorted_by_sha + nr_objects;
+		}
 	}
 
 	if (opts->flags & WRITE_IDX_VERIFY) {
diff --git a/pack.h b/pack.h
index cd4b536..3017ea4 100644
--- a/pack.h
+++ b/pack.h
@@ -46,7 +46,8 @@ struct pack_idx_option {
 
 	enum {
 		WRITE_IDX_DUPLICATES_IGNORE = 0,
-		WRITE_IDX_DUPLICATES_REJECT
+		WRITE_IDX_DUPLICATES_REJECT,
+		WRITE_IDX_DUPLICATES_SKIP
 	} duplicates;
 
 	/*
diff --git a/t/t5308-pack-detect-duplicates.sh b/t/t5308-pack-detect-duplicates.sh
index 0f2d928..963d7b9 100755
--- a/t/t5308-pack-detect-duplicates.sh
+++ b/t/t5308-pack-detect-duplicates.sh
@@ -102,4 +102,12 @@ test_expect_success 'index-pack can reject packs with duplicates' '
 	test_expect_code 1 git cat-file -e $LO_SHA1
 '
 
+test_expect_success 'index-pack can fix packs with duplicates' '
+	clear_packs &&
+	create_pack 2 >dups.pack &&
+	git -c pack.indexDuplicates=skip index-pack --stdin <dups.pack &&
+	git cat-file --batch-check <input >actual &&
+	test_cmp expect actual
+'
+
 test_done
-- 
1.8.4.rc2.28.g6bb5f3f


* Re: [RFC/PATCH 0/4] duplicate objects in packfiles
  2013-08-21 20:49       ` [RFC/PATCH 0/4] duplicate objects in packfiles Jeff King
                           ` (3 preceding siblings ...)
  2013-08-21 20:55         ` [DO NOT APPLY PATCH 4/4] index-pack: optionally skip duplicate packfile entries Jeff King
@ 2013-08-21 22:17         ` Junio C Hamano
  4 siblings, 0 replies; 37+ messages in thread
From: Junio C Hamano @ 2013-08-21 22:17 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Shawn O. Pearce, Nicolas Pitre

Jeff King <peff@peff.net> writes:

> Which leaves the open question: should the default for index-pack flip
> to reject duplicates rather than allow?

I'd say so.

In an emergency, people with broken packfiles can feed them to
unpack-objects ;-)

More seriously, an alternative emergency mode may be an option to
unpack-objects that does not unpack objects but streams to a new
pack or something to resurrect the objects that are salvageable from
a corrupt packfile.
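
Concretely, that emergency path is just

  $ git unpack-objects <broken.pack

which explodes whatever it can into loose objects, where the
duplicates collapse naturally (both copies become the same loose
object).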


* Re: [DO NOT APPLY PATCH 4/4] index-pack: optionally skip duplicate packfile entries
  2013-08-21 20:55         ` [DO NOT APPLY PATCH 4/4] index-pack: optionally skip duplicate packfile entries Jeff King
@ 2013-08-21 23:20           ` Junio C Hamano
  2013-08-22  0:47             ` Jeff King
  0 siblings, 1 reply; 37+ messages in thread
From: Junio C Hamano @ 2013-08-21 23:20 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Shawn O. Pearce, Nicolas Pitre

Jeff King <peff@peff.net> writes:

> When we are building the pack index, we can notice that
> there are duplicate objects, pick one "winner" instance, and
> mention the object only once in the index (mapped to the
> winner's offset).
>
> This has the effect that the duplicate packfile entries are
> never found by lookup. The data still exists in the
> packfile, though, and can be used as a delta base if delta
> base offsets are used. If delta refs are used, then it is
> possible that some deltas may be broken.

I do not understand the last bit.  If two copies of an object exist
but you have only one slot for the object in the index, and another
object names it as its base with ref-delta, then reconstituting it
should work just fine---whichever representation of the base object
is recorded in the .idx, that first needs to be reconstituted before
the delta is applied to it, and both copies should yield identical
contents for the delta base object, no?

In any case, ejecting one from the pack .idx would not help in the
presense of either broken or malicious packer that reuses delta too
aggressively.  Suppose you have objects A and B and somehow manage
to create a cycle of deltas, A names B as its delta base and B names
A as its delta base.  The packer may notice its mistake and then add
another copy of A as a base object.  The pack contains two copies of
A (one is delta based on B, the other is full) and B (delta against
A).
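
Schematically (offsets made up):

  offset 100: A, stored as delta against B
  offset 200: B, stored as delta against A
  offset 300: A, stored whole            <- the extra copy

Everything is recoverable only if B's base reference resolves to the
whole copy at offset 300; if it instead resolves to the copy at
offset 100, reading B needs A, which needs B again.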

If B refers to the copy of A that is delta against B using ofs-delta,
fixing the .idx file will have no effect.  read_packed_sha1(B) will
read the delta data of B, finds the offset to start reading the data
for A which was excised from the .idx and unless that codepath is
updated to consult revindex (i.e. you mark one copy of A in the .pack
as bad, and when B refers to that bad copy of A via ofs-delta, you
check the offset against revindex to get the object name of A and go
to the good copy of A), you will never finish reading B because
reading the bad copy of A will lead you to first reconstitute B.

> I think this line of "fixing" should probably be scrapped.

I tend to agree.


* Re: [DO NOT APPLY PATCH 4/4] index-pack: optionally skip duplicate packfile entries
  2013-08-21 23:20           ` Junio C Hamano
@ 2013-08-22  0:47             ` Jeff King
  0 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2013-08-22  0:47 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Shawn O. Pearce, Nicolas Pitre

On Wed, Aug 21, 2013 at 04:20:07PM -0700, Junio C Hamano wrote:

> I do not understand the last bit.  If two copies of an object exist
> but you have only one slot for the object in the index, and another
> object names it as its base with ref-delta, then reconstituting it
> should work just fine---whichever representation of the base object
> is recorded in the .idx, that first needs to be reconstituted before
> the delta is applied to it, and both copies should yield identical
> contents for the delta base object, no?
> 
> In any case, ejecting one from the pack .idx would not help in the
> presense of either broken or malicious packer that reuses delta too
> aggressively.  Suppose you have objects A and B and somehow manage
> to create a cycle of deltas, A names B as its delta base and B names
> A as its delta base.  The packer may notice its mistake and then add
> another copy of A as a base object.  The pack contains two copies of
> A (one is delta based on B, the other is full) and B (delta against
> A).

Yes, that is the potentially problematic scenario...

> If B refers to the copy of A that is delta against B using ofs-delta,
> fixing the .idx file will have no effect.  read_packed_sha1(B) will
> read the delta data of B, finds the offset to start reading the data
> for A which was excised from the .idx and unless that codepath is
> updated to consult revindex (i.e. you mark one copy of A in the .pack
> as bad, and when B refers to that bad copy of A via ofs-delta, you
> check the offset against revindex to get the object name of A and go
> to the good copy of A), you will never finish reading B because
> reading the bad copy of A will lead you to first reconstitute B.

Yes. There are two levels of brokenness here.

One is that the pack has a delta base offset for B that leads to the
"wrong" A, creating a cycle. This is an utterly broken pack and cannot
be fixed automatically (you do not even know the name of the cycled A
because you cannot constitute it to find its name and realize that it
has a duplicate!). But we should notice this during index-pack, because
we cannot reconstruct the object.

Another situation is that your delta base points to the "right" A. You
can reconstruct either A or B, because although there is a duplicate,
there are no cycles in the delta chain (i.e., the chain does not care
about object name, only offset, and offsets point only one way).

So we do not have a problem at all with reconstructing the object data.
We do have two ways we might access A from the same pack. For regular
object lookup, that is probably not a big deal. For operations like
repacking that look more closely at the on-disk representation, I do not
know. We may mark A as "has a delta" or "does not have a delta" at one
point in the code, but then later find the other copy of A which does
not match. I did not trace through all of pack-objects to find whether
that is possible, or what bugs it might cause. But indexing only one
copy as the "master" means that we will always get a consistent view of
the object (from that particular pack, at least).

Now consider REF_DELTA. We lookup the base object by name, so we may get
a different answer depending on how it is indexed. E.g., index-pack
keeps an internal hash table, whereas later callers will use the .idx
file. When looking at the .idx file, we may use a regular binary search,
or sha1_entry_pos. The exact order of the search may even change from
git version to git version.
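
For reference, the two delta entry layouts in the pack format differ
exactly here:

  OFS_DELTA: [type/size header][offset back to base][delta data]
             -> the base is pinned to one specific copy by position
  REF_DELTA: [type/size header][20-byte base sha1][delta data]
             -> the base is found by a name lookup, which may hit
                either copy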

So you may have a situation where index-pack finds the "right" A and
properly reconstructs the file while creating the .idx, and thinks all
is well. But later lookups may find the "wrong" A, and fail. In other
words, we cannot trust that data that makes it through index-pack is not
corrupted (and it is index-pack's output that lets receive-pack commit
to the client that we have their objects, so we want to be reasonably
sure we have uncorrupted copies at that point).

Choosing a single "master" A to go in the .idx means we will get a
consistent view of A once the .idx is generated. But it may not be the
right one, and it may not be the one we used to check that we can
resolve A and B in the first place.

The right fix for that situation is to keep both entries in the .idx,
detect delta cycles, then look up extra copies of A, but being aware
that you already found one copy in the .idx and that sha1_entry_pos (or
the vanilla binary search) should return any other copies, if they
exist. I do not think we detect delta cycles at all (though I didn't
check), but certainly we do not have a way to tell sha1_entry_pos any
different information so that it will find the "other" A.

So I think it is a recoverable state, but it is a non-trivial bit of
code for something that should never happen. Rejecting on push/fetch via
index-pack is simple and easy, and we can deal with the fallout if and
when it ever actually happens.

> > I think this line of "fixing" should probably be scrapped.
> 
> I tend to agree.

OK. Let's drop the "skip" patch, then. I'll re-roll with a patch to flip
the default to "reject".

-Peff


* Re: [PATCH 2/4] index-pack: optionally reject packs with duplicate objects
  2013-08-21 20:52         ` [PATCH 2/4] index-pack: optionally reject packs with duplicate objects Jeff King
@ 2013-08-22 13:45           ` Duy Nguyen
  2013-08-22 14:43             ` Jeff King
  0 siblings, 1 reply; 37+ messages in thread
From: Duy Nguyen @ 2013-08-22 13:45 UTC (permalink / raw)
  To: Jeff King
  Cc: Git Mailing List, Junio C Hamano, Shawn O. Pearce, Nicolas Pitre

On Thu, Aug 22, 2013 at 3:52 AM, Jeff King <peff@peff.net> wrote:
> @@ -68,6 +81,16 @@ const char *write_idx_file(const char *index_name, struct pack_idx_entry **objec
>         else
>                 sorted_by_sha = list = last = NULL;
>
> +       if (opts->duplicates == WRITE_IDX_DUPLICATES_REJECT) {
> +               struct pack_idx_entry **dup;
> +
> +               dup = find_duplicate(sorted_by_sha, nr_objects,
> +                                    sizeof(*sorted_by_sha), sha1_compare);
> +               if (dup)
> +                       die("pack has duplicate entries for %s",
> +                           sha1_to_hex((*dup)->sha1));
> +       }
> +
>         if (opts->flags & WRITE_IDX_VERIFY) {
>                 assert(index_name);
>                 f = sha1fd_check(index_name);

write_idx_file() is called after index-pack processes all delta
objects. Could resolve_deltas() go cyclic with certain duplicate
object setup?
-- 
Duy


* Re: [PATCH 2/4] index-pack: optionally reject packs with duplicate objects
  2013-08-22 13:45           ` Duy Nguyen
@ 2013-08-22 14:43             ` Jeff King
  2013-08-22 23:12               ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Jeff King
  0 siblings, 1 reply; 37+ messages in thread
From: Jeff King @ 2013-08-22 14:43 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Junio C Hamano, Shawn O. Pearce, Nicolas Pitre

On Thu, Aug 22, 2013 at 08:45:19PM +0700, Nguyen Thai Ngoc Duy wrote:

> On Thu, Aug 22, 2013 at 3:52 AM, Jeff King <peff@peff.net> wrote:
> > @@ -68,6 +81,16 @@ const char *write_idx_file(const char *index_name, struct pack_idx_entry **objec
> >         else
> >                 sorted_by_sha = list = last = NULL;
> >
> > +       if (opts->duplicates == WRITE_IDX_DUPLICATES_REJECT) {
> > +               struct pack_idx_entry **dup;
> > +
> > +               dup = find_duplicate(sorted_by_sha, nr_objects,
> > +                                    sizeof(*sorted_by_sha), sha1_compare);
> > +               if (dup)
> > +                       die("pack has duplicate entries for %s",
> > +                           sha1_to_hex((*dup)->sha1));
> > +       }
> > +
> >         if (opts->flags & WRITE_IDX_VERIFY) {
> >                 assert(index_name);
> >                 f = sha1fd_check(index_name);
> 
> write_idx_file() is called after index-pack processes all delta
> objects. Could resolve_deltas() go cyclic with certain duplicate
> object setup?

Good question. I'm not sure. I'll check it out.

-Peff


* [PATCHv2 0/6] duplicate objects and delta cycles, oh my!
  2013-08-22 14:43             ` Jeff King
@ 2013-08-22 23:12               ` Jeff King
  2013-08-22 23:12                 ` [PATCH 1/6] test-sha1: add a binary output mode Jeff King
                                   ` (6 more replies)
  0 siblings, 7 replies; 37+ messages in thread
From: Jeff King @ 2013-08-22 23:12 UTC (permalink / raw)
  To: Git Mailing List
  Cc: Duy Nguyen, Junio C Hamano, Shawn O. Pearce, Nicolas Pitre

On Thu, Aug 22, 2013 at 10:43:05AM -0400, Jeff King wrote:

> > write_idx_file() is called after index-pack processes all delta
> > objects. Could resolve_deltas() go cyclic with certain duplicate
> > object setup?
> 
> Good question. I'm not sure. I'll check it out.

I think the answer is "no", based on both reasoning and testing (both of
which are included in patches 3-4 of the series below).

So here's my re-roll of the series.

  [1/6]: test-sha1: add a binary output mode

    New in this iteration; the previous version piped test-sha1 into
    perl to create the pack trailer, but with this simple change we can
    drop the perl dependency.

  [2/6]: sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP

    Same code as before. I've factored the pack-creation bits from the
    tests into lib-pack.sh, so they can be reused elsewhere when we want
    to create bogus packs (and patches 3-4 reuse them here).

  [3/6]: add tests for indexing packs with delta cycles
  [4/6]: test index-pack on packs with recoverable delta cycles

    New tests covering delta cycles.

  [5/6]: index-pack: optionally reject packs with duplicate objects

    Similar to before, but I converted the config flag to a simple
    boolean (since we scrapped the "fix" of the tri-state "allow,
    reject, fix").

  [6/6]: default pack.indexDuplicates to false

    This flips the safety check on by default everywhere (before, it was
    left off for index-pack).

-Peff


* [PATCH 1/6] test-sha1: add a binary output mode
  2013-08-22 23:12               ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Jeff King
@ 2013-08-22 23:12                 ` Jeff King
  2013-08-22 23:14                 ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
                                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2013-08-22 23:12 UTC (permalink / raw)
  To: Git Mailing List
  Cc: Duy Nguyen, Junio C Hamano, Shawn O. Pearce, Nicolas Pitre

The test-sha1 helper program will run our internal sha1
routines over its input and output the 40-byte hex sha1.
Sometimes, however, it is useful to have the binary 20-byte
sha1 (and it's a pain to convert back in the shell). Let's
add a "-b" option to output the binary version.

Signed-off-by: Jeff King <peff@peff.net>
---
 test-sha1.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/test-sha1.c b/test-sha1.c
index 80daba9..e57eae1 100644
--- a/test-sha1.c
+++ b/test-sha1.c
@@ -5,10 +5,15 @@ int main(int ac, char **av)
 	git_SHA_CTX ctx;
 	unsigned char sha1[20];
 	unsigned bufsz = 8192;
+	int binary = 0;
 	char *buffer;
 
-	if (ac == 2)
-		bufsz = strtoul(av[1], NULL, 10) * 1024 * 1024;
+	if (ac == 2) {
+		if (!strcmp(av[1], "-b"))
+			binary = 1;
+		else
+			bufsz = strtoul(av[1], NULL, 10) * 1024 * 1024;
+	}
 
 	if (!bufsz)
 		bufsz = 8192;
@@ -42,6 +47,10 @@ int main(int ac, char **av)
 		git_SHA1_Update(&ctx, buffer, this_sz);
 	}
 	git_SHA1_Final(sha1, &ctx);
-	puts(sha1_to_hex(sha1));
+
+	if (binary)
+		fwrite(sha1, 1, 20, stdout);
+	else
+		puts(sha1_to_hex(sha1));
 	exit(0);
 }
-- 
1.8.4.rc2.28.g6bb5f3f


* [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP
  2013-08-22 23:12               ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Jeff King
  2013-08-22 23:12                 ` [PATCH 1/6] test-sha1: add a binary output mode Jeff King
@ 2013-08-22 23:14                 ` Jeff King
  2013-08-23 16:41                   ` Junio C Hamano
  2013-08-23 19:41                   ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Johannes Sixt
  2013-08-22 23:14                 ` [PATCH 3/6] add tests for indexing packs with delta cycles Jeff King
                                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 37+ messages in thread
From: Jeff King @ 2013-08-22 23:14 UTC (permalink / raw)
  To: Git Mailing List
  Cc: Duy Nguyen, Junio C Hamano, Shawn O. Pearce, Nicolas Pitre

The sha1_entry_pos function tries to be smart about
selecting the middle of a range for its binary search by
looking at the value differences between the "lo" and "hi"
constraints. However, it is unable to cope with entries with
duplicate keys in the sorted list.

We may hit a point in the search where both our "lo" and
"hi" point to the same key. In this case, the range of
values between our endpoints is 0, and trying to scale the
difference between our key and the endpoints over that range
is undefined (i.e., divide by zero). The current code
catches this with an "assert(lov < hiv)".

Moreover, after seeing that the first 20 bytes of the key are
the same, we will try to establish a value from the 21st
byte, which is nonsensical.

Instead, we can detect the case that we are in a run of
duplicates, and simply do a final comparison against any one
of them (since they are all the same, it does not matter
which). If the keys match, we have found our entry (or one
of them, anyway).  If not, then we know that we do not need
to look further, as we must be in a run of the duplicate
key.

Furthermore, we know that one of our endpoints must be
the edge of the run of duplicates. For example, given this
sequence:

 idx 0 1 2 3 4 5
 key A C C C C D

If we are searching for "B", we might hit the duplicate run
at lo=1, hi=3 (e.g., by first mi=3, then mi=0). But we can
never have lo > 1, because B < C. That is, if our key is
less than the run, we know that "lo" is the edge, but we can
say nothing of "hi". Similarly, if our key is greater than
the run, we know that "hi" is the edge, but we can say
nothing of "lo". But that is enough for us to return not
only "not found", but show the position at which we would
insert the new item.

Signed-off-by: Jeff King <peff@peff.net>
---
 sha1-lookup.c                     | 23 ++++++++++++
 t/lib-pack.sh                     | 78 +++++++++++++++++++++++++++++++++++++++
 t/t5308-pack-detect-duplicates.sh | 73 ++++++++++++++++++++++++++++++++++++
 3 files changed, 174 insertions(+)
 create mode 100644 t/lib-pack.sh
 create mode 100755 t/t5308-pack-detect-duplicates.sh

diff --git a/sha1-lookup.c b/sha1-lookup.c
index c4dc55d..614cbb6 100644
--- a/sha1-lookup.c
+++ b/sha1-lookup.c
@@ -204,7 +204,30 @@ int sha1_entry_pos(const void *table,
 			 * byte 0 thru (ofs-1) are the same between
 			 * lo and hi; ofs is the first byte that is
 			 * different.
+			 *
+			 * If ofs==20, then no bytes are different,
+			 * meaning we have entries with duplicate
+			 * keys. We know that we are in a solid run
+			 * of this entry (because the entries are
+			 * sorted, and our lo and hi are the same,
+			 * there can be nothing but this single key
+			 * in between). So we can stop the search.
+			 * Either one of these entries is it (and
+			 * we do not care which), or we do not have
+			 * it.
 			 */
+			if (ofs == 20) {
+				mi = lo;
+				mi_key = base + elem_size * mi + key_offset;
+				cmp = memcmp(mi_key, key, 20);
+				if (!cmp)
+					return mi;
+				if (cmp < 0)
+					return -1 - hi;
+				else
+					return -1 - lo;
+			}
+
 			hiv = hi_key[ofs_0];
 			if (ofs_0 < 19)
 				hiv = (hiv << 8) | hi_key[ofs_0+1];
diff --git a/t/lib-pack.sh b/t/lib-pack.sh
new file mode 100644
index 0000000..61c5d19
--- /dev/null
+++ b/t/lib-pack.sh
@@ -0,0 +1,78 @@
+#!/bin/sh
+#
+# Support routines for hand-crafting weird or malicious packs.
+#
+# You can make a complete pack like:
+#
+#   pack_header 2 >foo.pack &&
+#   pack_obj e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 >>foo.pack &&
+#   pack_obj e68fe8129b546b101aee9510c5328e7f21ca1d18 >>foo.pack &&
+#   pack_trailer foo.pack
+
+# Print the big-endian 4-byte octal representation of $1
+uint32_octal() {
+	n=$1
+	printf '\%o' $(($n / 16777216)); n=$((n % 16777216))
+	printf '\%o' $(($n /    65536)); n=$((n %    65536))
+	printf '\%o' $(($n /      256)); n=$((n %      256))
+	printf '\%o' $(($n           ));
+}
+
+# Print the big-endian 4-byte binary representation of $1
+uint32_binary() {
+	printf "$(uint32_octal "$1")"
+}
+
+# Print a pack header, version 2, for a pack with $1 objects
+pack_header() {
+	printf 'PACK' &&
+	printf '\0\0\0\2' &&
+	uint32_binary "$1"
+}
+
+# Print the pack data for object $1, as a delta against object $2 (or as a full
+# object if $2 is missing or empty). The output is suitable for including
+# directly in the packfile, and represents the entirety of the object entry.
+# Doing this on the fly (especially picking your deltas) is quite tricky, so we
+# have hardcoded some well-known objects. See the case statements below for the
+# complete list.
+pack_obj() {
+	case "$1" in
+	# empty blob
+	e69de29bb2d1d6434b8b29ae775ad8c2e48c5391)
+		case "$2" in
+		'')
+			printf '\060\170\234\003\0\0\0\0\1'
+			return
+			;;
+		esac
+		;;
+
+	# blob containing "\7\76"
+	e68fe8129b546b101aee9510c5328e7f21ca1d18)
+		case "$2" in
+		'')
+			printf '\062\170\234\143\267\3\0\0\116\0\106'
+			return
+			;;
+		esac
+		;;
+	esac
+
+	echo >&2 "BUG: don't know how to print $1${2:+ (from $2)}"
+	return 1
+}
+
+# Compute and append pack trailer to "$1"
+pack_trailer() {
+	test-sha1 -b <"$1" >trailer.tmp &&
+	cat trailer.tmp >>"$1" &&
+	rm -f trailer.tmp
+}
+
+# Remove any existing packs to make sure that
+# whatever we index next will be the pack that we
+# actually use.
+clear_packs() {
+	rm -f .git/objects/pack/*
+}
diff --git a/t/t5308-pack-detect-duplicates.sh b/t/t5308-pack-detect-duplicates.sh
new file mode 100755
index 0000000..719d48f
--- /dev/null
+++ b/t/t5308-pack-detect-duplicates.sh
@@ -0,0 +1,73 @@
+#!/bin/sh
+
+test_description='handling of duplicate objects in incoming packfiles'
+. ./test-lib.sh
+. ../lib-pack.sh
+
+# The sha1s we have in our pack. It's important that these have the same
+# starting byte, so that they end up in the same fanout section of the index.
+# That lets us make sure we are exercising the binary search with both sets.
+LO_SHA1=e68fe8129b546b101aee9510c5328e7f21ca1d18
+HI_SHA1=e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
+
+# And here's a "missing sha1" which will produce failed lookups. It must also
+# be in the same fanout section, and should be between the two (so that during
+# our binary search, we are sure to end up looking at one or the other of the
+# duplicate runs).
+MISSING_SHA1='e69d000000000000000000000000000000000000'
+
+# git will never intentionally create packfiles with
+# duplicate objects, so we have to construct them by hand.
+#
+# $1 is the name of the packfile to create
+#
+# $2 is the number of times to duplicate each object
+create_pack() {
+	pack_header "$((2 * $2))" >"$1" &&
+	for i in $(test_seq 1 "$2"); do
+		pack_obj $LO_SHA1 &&
+		pack_obj $HI_SHA1
+	done >>"$1" &&
+	pack_trailer "$1"
+}
+
+# double-check that create_pack actually works
+test_expect_success 'pack with no duplicates' '
+	create_pack no-dups.pack 1 &&
+	git index-pack --stdin <no-dups.pack
+'
+
+test_expect_success 'index-pack will allow duplicate objects by default' '
+	clear_packs &&
+	create_pack dups.pack 100 &&
+	git index-pack --stdin <dups.pack
+'
+
+test_expect_success 'create batch-check test vectors' '
+	cat >input <<-EOF &&
+	$LO_SHA1
+	$HI_SHA1
+	$MISSING_SHA1
+	EOF
+	cat >expect <<-EOF
+	$LO_SHA1 blob 2
+	$HI_SHA1 blob 0
+	$MISSING_SHA1 missing
+	EOF
+'
+
+test_expect_success 'lookup in duplicated pack (binary search)' '
+	git cat-file --batch-check <input >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'lookup in duplicated pack (GIT_USE_LOOKUP)' '
+	(
+		GIT_USE_LOOKUP=1 &&
+		export GIT_USE_LOOKUP &&
+		git cat-file --batch-check <input >actual
+	) &&
+	test_cmp expect actual
+'
+
+test_done
-- 
1.8.4.rc2.28.g6bb5f3f

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 3/6] add tests for indexing packs with delta cycles
  2013-08-22 23:12               ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Jeff King
  2013-08-22 23:12                 ` [PATCH 1/6] test-sha1: add a binary output mode Jeff King
  2013-08-22 23:14                 ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
@ 2013-08-22 23:14                 ` Jeff King
  2013-08-22 23:15                 ` [PATCH 4/6] test index-pack on packs with recoverable " Jeff King
                                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2013-08-22 23:14 UTC (permalink / raw)
  To: Git Mailing List
  Cc: Duy Nguyen, Junio C Hamano, Shawn O. Pearce, Nicolas Pitre

If we receive a broken or malicious pack from a remote, we
will feed it to index-pack. As index-pack processes the
objects as a stream, reconstructing and hashing each object
to get its name, it is not very susceptible to doing the
wrong with bad data (it simply notices that the data is
bogus and aborts).

However, one question raised on the list is whether it could
be susceptible to problems during the delta-resolution
phase. In particular, can a cycle in the packfile deltas
cause us to go into an infinite loop or cause any other
problem?

The answer is no.

We cannot have a cycle of delta-base offsets, because they
go only in one direction (the OFS_DELTA object mentions its
base by an offset towards the beginning of the file, and we
explicitly reject negative offsets).

We can have a cycle of REF_DELTA objects, which refer to
base objects by sha1 name. However, index-pack does not know
these sha1 names ahead of time; it has to reconstruct the
objects to get their names, and it cannot do so if there is
a delta cycle (in other words, it does not even realize
there is a cycle, but only that there are items that cannot
be resolved).

Even though we can reason out that index-pack should handle
this fine, let's add a few tests to make sure it behaves
correctly.

Signed-off-by: Jeff King <peff@peff.net>
---
 t/lib-pack.sh                | 22 +++++++++++++++++
 t/t5309-pack-delta-cycles.sh | 59 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 81 insertions(+)
 create mode 100755 t/t5309-pack-delta-cycles.sh

diff --git a/t/lib-pack.sh b/t/lib-pack.sh
index 61c5d19..e028c40 100644
--- a/t/lib-pack.sh
+++ b/t/lib-pack.sh
@@ -55,6 +55,28 @@ pack_obj() {
 			printf '\062\170\234\143\267\3\0\0\116\0\106'
 			return
 			;;
+		01d7713666f4de822776c7622c10f1b07de280dc)
+			printf '\165\1\327\161\66\146\364\336\202\47\166' &&
+			printf '\307\142\54\20\361\260\175\342\200\334\170' &&
+			printf '\234\143\142\142\142\267\003\0\0\151\0\114'
+			return
+			;;
+		esac
+		;;
+
+	# blob containing "\7\0"
+	01d7713666f4de822776c7622c10f1b07de280dc)
+		case "$2" in
+		'')
+			printf '\062\170\234\143\147\0\0\0\20\0\10'
+			return
+			;;
+		e68fe8129b546b101aee9510c5328e7f21ca1d18)
+			printf '\165\346\217\350\22\233\124\153\20\32\356' &&
+			printf '\225\20\305\62\216\177\41\312\35\30\170\234' &&
+			printf '\143\142\142\142\147\0\0\0\53\0\16'
+			return
+			;;
 		esac
 		;;
 	esac
diff --git a/t/t5309-pack-delta-cycles.sh b/t/t5309-pack-delta-cycles.sh
new file mode 100755
index 0000000..9b3f2c3
--- /dev/null
+++ b/t/t5309-pack-delta-cycles.sh
@@ -0,0 +1,59 @@
+#!/bin/sh
+
+test_description='test index-pack handling of delta cycles in packfiles'
+. ./test-lib.sh
+. ../lib-pack.sh
+
+# Two similar-ish objects that we have computed deltas between.
+A=01d7713666f4de822776c7622c10f1b07de280dc
+B=e68fe8129b546b101aee9510c5328e7f21ca1d18
+
+# double-check our hand-constructed packs
+test_expect_success 'index-pack works with a single delta (A->B)' '
+	clear_packs &&
+	{
+		pack_header 2 &&
+		pack_obj $A $B &&
+		pack_obj $B
+	} >ab.pack &&
+	pack_trailer ab.pack &&
+	git index-pack --stdin <ab.pack &&
+	git cat-file -t $A &&
+	git cat-file -t $B
+'
+
+test_expect_success 'index-pack works with a single delta (B->A)' '
+	clear_packs &&
+	{
+		pack_header 2 &&
+		pack_obj $A &&
+		pack_obj $B $A
+	} >ba.pack &&
+	pack_trailer ba.pack &&
+	git index-pack --stdin <ba.pack &&
+	git cat-file -t $A &&
+	git cat-file -t $B
+'
+
+test_expect_success 'index-pack detects missing base objects' '
+	clear_packs &&
+	{
+		pack_header 1 &&
+		pack_obj $A $B
+	} >missing.pack &&
+	pack_trailer missing.pack &&
+	test_must_fail git index-pack --fix-thin --stdin <missing.pack
+'
+
+test_expect_success 'index-pack detects REF_DELTA cycles' '
+	clear_packs &&
+	{
+		pack_header 2 &&
+		pack_obj $A $B &&
+		pack_obj $B $A
+	} >cycle.pack &&
+	pack_trailer cycle.pack &&
+	test_must_fail git index-pack --fix-thin --stdin <cycle.pack
+'
+
+test_done
-- 
1.8.4.rc2.28.g6bb5f3f

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 4/6] test index-pack on packs with recoverable delta cycles
  2013-08-22 23:12               ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Jeff King
                                   ` (2 preceding siblings ...)
  2013-08-22 23:14                 ` [PATCH 3/6] add tests for indexing packs with delta cycles Jeff King
@ 2013-08-22 23:15                 ` Jeff King
  2013-08-22 23:15                 ` [PATCH 5/6] index-pack: optionally reject packs with duplicate objects Jeff King
                                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2013-08-22 23:15 UTC (permalink / raw)
  To: Git Mailing List
  Cc: Duy Nguyen, Junio C Hamano, Shawn O. Pearce, Nicolas Pitre

The previous commit added tests to show that index-pack
correctly bails in unrecoverable situations. There are some
situations where the data could be recovered, but it is not
currently:

  1. If we can break the cycle using an object from another
     pack via --fix-thin.

  2. If we can break the cycle using a duplicate of one of
     the objects found in the same pack.

Note that neither of these is particularly high priority; a
delta cycle within a pack should never occur, and we have no
record of even a buggy git implementation creating such a
pack.

However, it's worth adding these tests for two reasons. One,
to document that we do not currently handle the situation,
even though it is possible. And two, to exercise the code
that runs in this situation; even though it fails, by
running it we can confirm that index-pack detects the
situation and aborts, and does not misbehave (e.g., by
following the cycle in an infinite loop).

In both cases, we hit an assert that aborts index-pack.

Signed-off-by: Jeff King <peff@peff.net>
---
 t/t5309-pack-delta-cycles.sh | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/t/t5309-pack-delta-cycles.sh b/t/t5309-pack-delta-cycles.sh
index 9b3f2c3..38e1809 100755
--- a/t/t5309-pack-delta-cycles.sh
+++ b/t/t5309-pack-delta-cycles.sh
@@ -56,4 +56,22 @@ test_expect_success 'index-pack detects REF_DELTA cycles' '
 	test_must_fail git index-pack --fix-thin --stdin <cycle.pack
 '
 
+test_expect_failure 'failover to an object in another pack' '
+	clear_packs &&
+	git index-pack --stdin <ab.pack &&
+	git index-pack --stdin --fix-thin <cycle.pack
+'
+
+test_expect_failure 'failover to a duplicate object in the same pack' '
+	clear_packs &&
+	{
+		pack_header 3 &&
+		pack_obj $A $B &&
+		pack_obj $B $A &&
+		pack_obj $A
+	} >recoverable.pack &&
+	pack_trailer recoverable.pack &&
+	git index-pack --fix-thin --stdin <recoverable.pack
+'
+
 test_done
-- 
1.8.4.rc2.28.g6bb5f3f

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 5/6] index-pack: optionally reject packs with duplicate objects
  2013-08-22 23:12               ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Jeff King
                                   ` (3 preceding siblings ...)
  2013-08-22 23:15                 ` [PATCH 4/6] test index-pack on packs with recoverable " Jeff King
@ 2013-08-22 23:15                 ` Jeff King
  2013-08-22 23:16                 ` [PATCH 6/6] default pack.indexDuplicates to false Jeff King
  2013-08-22 23:35                 ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Nicolas Pitre
  6 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2013-08-22 23:15 UTC (permalink / raw)
  To: Git Mailing List
  Cc: Duy Nguyen, Junio C Hamano, Shawn O. Pearce, Nicolas Pitre

Git should never generate packs with duplicate objects.
However, we may see such packs due to bugs in Git or other
implementations (e.g., JGit had such a bug a few years ago).

In theory, such packs should not be a problem for us (we
will simply find one of the instances of the object when
looking in the pack). However, the JGit bug report mentioned
possible infinite loops during repacking due to cycles in
the delta chain.  Though this problem hasn't specifically
been reproduced on modern git, there is no reason not to be
careful with incoming packs, given that only buggy
implementations should be producing such packs, anyway.

This patch introduces the pack.indexDuplicates option to
allow or reject such packs from index-pack. The default
remains to allow it.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/index-pack.c              |  4 ++++
 pack-write.c                      | 24 ++++++++++++++++++++++++
 pack.h                            |  2 ++
 t/t5308-pack-detect-duplicates.sh |  8 ++++++++
 4 files changed, 38 insertions(+)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 9c1cfac..c27556f 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1369,6 +1369,10 @@ static int git_index_pack_config(const char *k, const char *v, void *cb)
 #endif
 		return 0;
 	}
+	if (!strcmp(k, "pack.indexduplicates")) {
+		opts->allow_duplicates = git_config_bool(k, v);
+		return 0;
+	}
 	return git_default_config(k, v, cb);
 }
 
diff --git a/pack-write.c b/pack-write.c
index ca9e63b..da946a7 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -7,6 +7,7 @@ void reset_pack_idx_option(struct pack_idx_option *opts)
 	memset(opts, 0, sizeof(*opts));
 	opts->version = 2;
 	opts->off32_limit = 0x7fffffff;
+	opts->allow_duplicates = 1;
 }
 
 static int sha1_compare(const void *_a, const void *_b)
@@ -37,6 +38,19 @@ static int need_large_offset(off_t offset, const struct pack_idx_option *opts)
 			 sizeof(ofsval), cmp_uint32);
 }
 
+static void *find_duplicate(void *vbase, size_t n, size_t size,
+			    int (*cmp)(const void *, const void *))
+{
+	unsigned char *base = vbase;
+	while (n > 1) {
+		if (!cmp(base, base + size))
+			return base;
+		base += size;
+		n--;
+	}
+	return NULL;
+}
+
 /*
  * On entry *sha1 contains the pack content SHA1 hash, on exit it is
  * the SHA1 hash of sorted object names. The objects array passed in
@@ -68,6 +82,16 @@ const char *write_idx_file(const char *index_name, struct pack_idx_entry **objec
 	else
 		sorted_by_sha = list = last = NULL;
 
+	if (!opts->allow_duplicates) {
+		struct pack_idx_entry **dup;
+
+		dup = find_duplicate(sorted_by_sha, nr_objects,
+				     sizeof(*sorted_by_sha), sha1_compare);
+		if (dup)
+			die("pack has duplicate entries for %s",
+			    sha1_to_hex((*dup)->sha1));
+	}
+
 	if (opts->flags & WRITE_IDX_VERIFY) {
 		assert(index_name);
 		f = sha1fd_check(index_name);
diff --git a/pack.h b/pack.h
index aa6ee7d..45217b6 100644
--- a/pack.h
+++ b/pack.h
@@ -44,6 +44,8 @@ struct pack_idx_option {
 	uint32_t version;
 	uint32_t off32_limit;
 
+	int allow_duplicates;
+
 	/*
 	 * List of offsets that would fit within off32_limit but
 	 * need to be written out as 64-bit entity for byte-for-byte
diff --git a/t/t5308-pack-detect-duplicates.sh b/t/t5308-pack-detect-duplicates.sh
index 719d48f..cb9b44e 100755
--- a/t/t5308-pack-detect-duplicates.sh
+++ b/t/t5308-pack-detect-duplicates.sh
@@ -70,4 +70,12 @@ test_expect_success 'lookup in duplicated pack (GIT_USE_LOOKUP)' '
 	test_cmp expect actual
 '
 
+test_expect_success 'index-pack can reject packs with duplicates' '
+	clear_packs &&
+	create_pack dups.pack 2 &&
+	test_must_fail \
+		git -c pack.indexDuplicates=0 index-pack --stdin <dups.pack &&
+	test_expect_code 1 git cat-file -e $LO_SHA1
+'
+
 test_done
-- 
1.8.4.rc2.28.g6bb5f3f

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 6/6] default pack.indexDuplicates to false
  2013-08-22 23:12               ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Jeff King
                                   ` (4 preceding siblings ...)
  2013-08-22 23:15                 ` [PATCH 5/6] index-pack: optionally reject packs with duplicate objects Jeff King
@ 2013-08-22 23:16                 ` Jeff King
  2013-08-22 23:35                 ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Nicolas Pitre
  6 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2013-08-22 23:16 UTC (permalink / raw)
  To: Git Mailing List
  Cc: Duy Nguyen, Junio C Hamano, Shawn O. Pearce, Nicolas Pitre

We should never see duplicate objects in packs, and it is
unknown what kind of complications such packs could create
during the repacking process. The previous commit introduced
a safety valve for checking packs coming into the repository
and being indexed by index-pack.

Let's turn the safety valve on by default. This helps
protect sites receiving packfiles from potentially malicious
strangers, and shouldn't affect normal use (and if it does,
it will have helped us identify a buggy packfile writer).
In a pinch, users can recover by turning off the option, or
by resorting to unpack-objects to explode the pack.
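
(For example, roughly:

  git -c pack.indexDuplicates=true index-pack --stdin <dups.pack

or simply "git unpack-objects <dups.pack".)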

Note that this also turns the option on for packs we write
ourselves (e.g., via pack-objects, fast-import, or streaming
large blobs). We do not expect maliciously constructed
packfiles in these code paths, of course, but it can serve
as an extra check that we have not accidentally created a
buggy pack (and the check itself is cheap to perform).

Signed-off-by: Jeff King <peff@peff.net>
---
 pack-write.c                      | 1 -
 t/t5308-pack-detect-duplicates.sh | 9 ++++-----
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/pack-write.c b/pack-write.c
index da946a7..1e3c459 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -7,7 +7,6 @@ void reset_pack_idx_option(struct pack_idx_option *opts)
 	memset(opts, 0, sizeof(*opts));
 	opts->version = 2;
 	opts->off32_limit = 0x7fffffff;
-	opts->allow_duplicates = 1;
 }
 
 static int sha1_compare(const void *_a, const void *_b)
diff --git a/t/t5308-pack-detect-duplicates.sh b/t/t5308-pack-detect-duplicates.sh
index cb9b44e..e982095 100755
--- a/t/t5308-pack-detect-duplicates.sh
+++ b/t/t5308-pack-detect-duplicates.sh
@@ -37,10 +37,10 @@ test_expect_success 'pack with no duplicates' '
 	git index-pack --stdin <no-dups.pack
 '
 
-test_expect_success 'index-pack will allow duplicate objects by default' '
+test_expect_success 'index-pack will allow duplicate objects if asked' '
 	clear_packs &&
 	create_pack dups.pack 100 &&
-	git index-pack --stdin <dups.pack
+	git -c pack.indexDuplicates=true index-pack --stdin <dups.pack
 '
 
 test_expect_success 'create batch-check test vectors' '
@@ -70,11 +70,10 @@ test_expect_success 'index-pack can reject packs with duplicates' '
 	test_cmp expect actual
 '
 
-test_expect_success 'index-pack can reject packs with duplicates' '
+test_expect_success 'index-pack rejects packs with duplicates by default' '
 	clear_packs &&
 	create_pack dups.pack 2 &&
-	test_must_fail \
-		git -c pack.indexDuplicates=0 index-pack --stdin <dups.pack &&
+	test_must_fail git index-pack --stdin <dups.pack &&
 	test_expect_code 1 git cat-file -e $LO_SHA1
 '
 
-- 
1.8.4.rc2.28.g6bb5f3f

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCHv2 0/6] duplicate objects and delta cycles, oh my!
  2013-08-22 23:12               ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Jeff King
                                   ` (5 preceding siblings ...)
  2013-08-22 23:16                 ` [PATCH 6/6] default pack.indexDuplicates to false Jeff King
@ 2013-08-22 23:35                 ` Nicolas Pitre
  6 siblings, 0 replies; 37+ messages in thread
From: Nicolas Pitre @ 2013-08-22 23:35 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Mailing List, Duy Nguyen, Junio C Hamano, Shawn O. Pearce

On Thu, 22 Aug 2013, Jeff King wrote:

> On Thu, Aug 22, 2013 at 10:43:05AM -0400, Jeff King wrote:
> 
> > > write_idx_file() is called after index-pack processes all delta
> > > objects. Could resolve_deltas() go cyclic with certain duplicate
> > > object setup?
> > 
> > Good question. I'm not sure. I'll check it out.
> 
> I think the answer is "no", based on both reasoning and testing (both of
> which are included in patches 3-4 of the series below).
> 
> So here's my re-roll of the series.

This looks all good to me.


Acked-by: Nicolas Pitre <nico@fluxnic.net>


Nicolas

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP
  2013-08-22 23:14                 ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
@ 2013-08-23 16:41                   ` Junio C Hamano
  2013-08-23 18:24                     ` Jeff King
  2013-08-23 19:41                   ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Johannes Sixt
  1 sibling, 1 reply; 37+ messages in thread
From: Junio C Hamano @ 2013-08-23 16:41 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Mailing List, Duy Nguyen, Shawn O. Pearce, Nicolas Pitre

Jeff King <peff@peff.net> writes:

> Furthermore, we know that one of our endpoints must be
> the edge of the run of duplicates. For example, given this
> sequence:
>
>  idx 0 1 2 3 4 5
>  key A C C C C D
>
> If we are searching for "B", we might hit the duplicate run
> at lo=1, hi=3 (e.g., by first mi=3, then mi=0). But we can
> never have lo > 1, because B < C. That is, if our key is
> less than the run, we know that "lo" is the edge, but we can
> say nothing of "hi". Similarly, if our key is greater than
> the run, we know that "hi" is the edge, but we can say
> nothing of "lo". But that is enough for us to return not
> only "not found", but show the position at which we would
> insert the new item.

This is somewhat tricky and may deserve an in-code comment.

> diff --git a/sha1-lookup.c b/sha1-lookup.c
> index c4dc55d..614cbb6 100644
> --- a/sha1-lookup.c
> +++ b/sha1-lookup.c
> @@ -204,7 +204,30 @@ int sha1_entry_pos(const void *table,
>  			 * byte 0 thru (ofs-1) are the same between
>  			 * lo and hi; ofs is the first byte that is
>  			 * different.
> +			 *
> +			 * If ofs==20, then no bytes are different,
> +			 * meaning we have entries with duplicate
> +			 * keys. We know that we are in a solid run
> +			 * of this entry (because the entries are
> +			 * sorted, and our lo and hi are the same,
> +			 * there can be nothing but this single key
> +			 * in between). So we can stop the search.
> +			 * Either one of these entries is it (and
> +			 * we do not care which), or we do not have
> +			 * it.
>  			 */
> +			if (ofs == 20) {
> +				mi = lo;
> +				mi_key = base + elem_size * mi + key_offset;
> +				cmp = memcmp(mi_key, key, 20);

I think we already know that mi_key[0:ofs_0] and key[0:ofs_0] are
the same at this point and we do not have to compare the full 20 bytes
again, but this is done only once and the better readability of the
above trumps the micro-optimization possibility, I think.

> +				if (!cmp)
> +					return mi;
> +				if (cmp < 0)
> +					return -1 - hi;
> +				else
> +					return -1 - lo;
> +			}
> +
>  			hiv = hi_key[ofs_0];
>  			if (ofs_0 < 19)
>  				hiv = (hiv << 8) | hi_key[ofs_0+1];
> diff --git a/t/lib-pack.sh b/t/lib-pack.sh
> new file mode 100644
> index 0000000..61c5d19
> --- /dev/null
> +++ b/t/lib-pack.sh
> @@ -0,0 +1,78 @@
> +#!/bin/sh
> +#
> +# Support routines for hand-crafting weird or malicious packs.
> +#
> +# You can make a complete pack like:
> +#
> +#   pack_header 2 >foo.pack &&
> +#   pack_obj e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 >>foo.pack &&
> +#   pack_obj e68fe8129b546b101aee9510c5328e7f21ca1d18 >>foo.pack &&
> +#   pack_trailer foo.pack
> +
> +# Print the big-endian 4-byte octal representation of $1
> +uint32_octal() {

micronit (style):

	uint32_octal () {

> +	n=$1
> +	printf '\%o' $(($n / 16777216)); n=$((n % 16777216))
> +	printf '\%o' $(($n /    65536)); n=$((n %    65536))
> +	printf '\%o' $(($n /      256)); n=$((n %      256))
> +	printf '\%o' $(($n           ));
> +}
> +
> +# Print the big-endian 4-byte binary representation of $1
> +uint32_binary() {
> +	printf "$(uint32_octal "$1")"
> +}
> +
> +# Print a pack header, version 2, for a pack with $1 objects
> +pack_header() {
> +	printf 'PACK' &&
> +	printf '\0\0\0\2' &&
> +	uint32_binary "$1"
> +}
> +
> +# Print the pack data for object $1, as a delta against object $2 (or as a full
> +# object if $2 is missing or empty). The output is suitable for including
> +# directly in the packfile, and represents the entirety of the object entry.
> +# Doing this on the fly (especially picking your deltas) is quite tricky, so we
> +# have hardcoded some well-known objects. See the case statements below for the
> +# complete list.

Cute ;-) I like the idea of having this function with the right API in
place, and cheating by limiting its implementation to what is
necessary.

Thanks.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP
  2013-08-23 16:41                   ` Junio C Hamano
@ 2013-08-23 18:24                     ` Jeff King
  2013-08-23 18:54                       ` Nicolas Pitre
  2013-08-24  0:01                       ` [PATCHv3 0/6] duplicate objects and delta cycles Jeff King
  0 siblings, 2 replies; 37+ messages in thread
From: Jeff King @ 2013-08-23 18:24 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Git Mailing List, Duy Nguyen, Shawn O. Pearce, Nicolas Pitre

On Fri, Aug 23, 2013 at 09:41:57AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Furthermore, we know that one of our endpoints must be
> > the edge of the run of duplicates. For example, given this
> > sequence:
> >
> >  idx 0 1 2 3 4 5
> >  key A C C C C D
> >
> > If we are searching for "B", we might hit the duplicate run
> > at lo=1, hi=3 (e.g., by first mi=3, then mi=0). But we can
> > never have lo > 1, because B < C. That is, if our key is
> > less than the run, we know that "lo" is the edge, but we can
> > say nothing of "hi". Similarly, if our key is greater than
> > the run, we know that "hi" is the edge, but we can say
> > nothing of "lo". But that is enough for us to return not
> > only "not found", but show the position at which we would
> > insert the new item.
> 
> This is somewhat tricky and may deserve an in-code comment.

Do you want me to re-roll, pushing it down into the comment, or do you
want to mark it up yourself? I think there might be some value in the
latter as your re-writing of it as a comment may cross-check that my
logic is sound.

> > +			if (ofs == 20) {
> > +				mi = lo;
> > +				mi_key = base + elem_size * mi + key_offset;
> > +				cmp = memcmp(mi_key, key, 20);
> 
> I think we already know that mi_key[0:ofs_0] and key[0:ofs_0] are
> the same at this point and we do not have to compare the full 20 bytes
> again, but this is done only once and the better readability of the
> above trumps the micro-optimization possibility, I think.

Yes, I had the same idea, and came to the same conclusion. Though if
anybody did want to try it, note that we have just overwritten the old
ofs_0, so you would want to bump the new code up above that line.

> > +uint32_octal() {
> 
> micronit (style):
> 
> 	uint32_octal () {

Hmph. I always forget which one we prefer, and we seem to have equal
numbers of both already. Again, want a re-roll or to mark it up
yourself?

> > +# Print the pack data for object $1, as a delta against object $2 (or as a full
> > +# object if $2 is missing or empty). The output is suitable for including
> > +# directly in the packfile, and represents the entirety of the object entry.
> > +# Doing this on the fly (especially picking your deltas) is quite tricky, so we
> > +# have hardcoded some well-known objects. See the case statements below for the
> > +# complete list.
> 
> Cute ;-) I like the idea of having this function with the right API in
> place, and cheating by limiting its implementation to what is
> necessary.

Just for reference, the procedure I used to generate the "base" data is
reasonably straightforward:

  sha1=$(printf %s "$content" | git hash-object -w --stdin)
  echo $sha1 | git pack-objects --stdout >tmp.pack
  tail -c +13 tmp.pack >no-header.pack
  head -c -20 no-header.pack >no-trailer.pack
  od -b no-trailer.pack | grep ' ' | cut -d' ' -f2- | tr ' ' '\\'

Since we want binary, we can skip the "od" call at the end (I needed it
to convert to something readable to hand to "printf"). But "head -c" is not
portable, nor is head with a negative count.
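
(If you need a portable replacement, something like this dd-based
sketch should work, though I have not actually tested it:

  size=$(wc -c <no-header.pack) &&
  dd if=no-header.pack bs=1 count=$((size - 20)) >no-trailer.pack

dd with bs=1 is slow, but these files are tiny.)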

To find items in the same fanout, I just used for-loops to calculate the
sha1s of all 2-byte blobs. And that is why we have the odd magic "\7\76"
blob.
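
That search was nothing fancier than this sort of thing (a sketch from
memory, not the exact script I ran):

  for hi in $(seq 0 255); do
  	for lo in $(seq 0 255); do
  		sha1=$(printf "\\$(printf '%03o' $hi)\\$(printf '%03o' $lo)" |
  		       git hash-object --stdin) &&
  		echo "$sha1 $hi $lo"
  	done
  done | grep ^e6

and then eyeballing the output for entries in the same fanout.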

Making the deltas was considerably less elegant, since we cannot provoke
pack-objects to pick arbitrary deltas (and it will not even try to delta
tiny objects, anyway, which would bloat our samples). I ended up with
the horrible patch below. We _could_ clean it up (error-checking? Who
needs it?) and make it a debug-and-testing-only option for pack-objects,
but I just didn't think the grossness was worth it. Still, it's probably
worth documenting here on the list in case somebody else ever needs to
add new samples to lib-pack.sh.

---
 builtin/pack-objects.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 8da2a66..e8937f5 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2442,6 +2442,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	const char *rp_av[6];
 	int rp_ac = 0;
 	int rev_list_unpacked = 0, rev_list_all = 0, rev_list_reflog = 0;
+	int magic = 0;
 	struct option pack_objects_options[] = {
 		OPT_SET_INT('q', "quiet", &progress,
 			    N_("do not show progress meter"), 0),
@@ -2505,6 +2506,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			    N_("pack compression level")),
 		OPT_SET_INT(0, "keep-true-parents", &grafts_replace_parents,
 			    N_("do not hide commits by grafts"), 0),
+		OPT_BOOL(0, "magic", &magic, "make deltas"),
 		OPT_END(),
 	};
 
@@ -2520,6 +2522,34 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	argc = parse_options(argc, argv, prefix, pack_objects_options,
 			     pack_usage, 0);
 
+	if (magic) {
+		unsigned char sha1[20];
+		struct delta_index *index;
+		void *src, *trg, *delta;
+		enum object_type src_type, trg_type;
+		unsigned long src_size, trg_size, delta_size, z_delta_size;
+		unsigned char header[10];
+		unsigned long header_len;
+
+		get_sha1(argv[0], sha1);
+		trg = read_sha1_file(sha1, &trg_type, &trg_size);
+
+		get_sha1(argv[1], sha1);
+		src = read_sha1_file(sha1, &src_type, &src_size);
+
+		index = create_delta_index(src, src_size);
+		delta = create_delta(index, trg, trg_size, &delta_size, 8192);
+
+		z_delta_size = do_compress(&delta, delta_size);
+		header_len = encode_in_pack_object_header(OBJ_REF_DELTA, delta_size,
+							  header);
+		fwrite(header, 1, header_len, stdout);
+		fwrite(sha1, 1, 20, stdout);
+		fwrite(delta, 1, z_delta_size, stdout);
+		return 0;
+	}
+
+
 	if (argc) {
 		base_name = argv[0];
 		argc--;

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP
  2013-08-23 18:24                     ` Jeff King
@ 2013-08-23 18:54                       ` Nicolas Pitre
  2013-08-23 18:56                         ` Jeff King
  2013-08-24  0:01                       ` [PATCHv3 0/6] duplicate objects and delta cycles Jeff King
  1 sibling, 1 reply; 37+ messages in thread
From: Nicolas Pitre @ 2013-08-23 18:54 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Git Mailing List, Duy Nguyen, Shawn O. Pearce

On Fri, 23 Aug 2013, Jeff King wrote:

> Making the deltas was considerably less elegant, since we cannot provoke
> pack-objects to pick arbitrary deltas (and it will not even try to delta
> tiny objects, anyway, which would bloat our samples). I ended up with
> the horrible patch below. We _could_ clean it up (error-checking? Who
> needs it?) and make it a debug-and-testing-only option for pack-objects,
> but I just didn't think the grossness was worth it. Still, it's probably
> worth documenting here on the list in case somebody else ever needs to
> add new samples to lib-pack.sh.

Maybe using test-delta (from test-delta.c) would have helped here?

In any case, if something needs to be permanently added into the code to 
help in the creation of test objects, I think test-delta.c is a far 
better place than pack-objects.c.
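
(If I remember the interface right, the usage is along these lines:

  test-delta -d <from_file> <data_file> <delta_out>
  test-delta -p <from_file> <delta_file> <result_out>

i.e., "-d" creates a delta and "-p" applies one. You would still need
to deflate the output and prepend the pack object header to put it
into a packfile, of course.)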


Nicolas

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP
  2013-08-23 18:54                       ` Nicolas Pitre
@ 2013-08-23 18:56                         ` Jeff King
  0 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2013-08-23 18:56 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Junio C Hamano, Git Mailing List, Duy Nguyen, Shawn O. Pearce

On Fri, Aug 23, 2013 at 02:54:19PM -0400, Nicolas Pitre wrote:

> On Fri, 23 Aug 2013, Jeff King wrote:
> 
> > Making the deltas was considerably less elegant, since we cannot provoke
> > pack-objects to pick arbitrary deltas (and it will not even try to delta
> > tiny objects, anyway, which would bloat our samples). I ended up with
> > the horrible patch below. We _could_ clean it up (error-checking? Who
> > needs it?) and make it a debug-and-testing-only option for pack-objects,
> > but I just didn't think the grossness was worth it. Still, it's probably
> > worth documenting here on the list in case somebody else ever needs to
> > add new samples to lib-pack.sh.
> 
> Maybe using test-delta (from test-delta.c) would have helped here?
> 
> In any case, if something needs to be permanently added into the code to 
> help in the creation of test objects, I think test-delta.c is a far 
> better place than pack-objects.c.

*forehead palm*

I didn't even know we had test-delta. Yes, that is obviously a way
better place (I initially looked at pack-objects because it has the
helpers to do compression and the type/size header properly).

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP
  2013-08-22 23:14                 ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
  2013-08-23 16:41                   ` Junio C Hamano
@ 2013-08-23 19:41                   ` Johannes Sixt
  2013-08-23 23:37                     ` Jeff King
  1 sibling, 1 reply; 37+ messages in thread
From: Johannes Sixt @ 2013-08-23 19:41 UTC (permalink / raw)
  To: Jeff King
  Cc: Git Mailing List, Duy Nguyen, Junio C Hamano, Shawn O. Pearce,
	Nicolas Pitre

Am 23.08.2013 01:14, schrieb Jeff King:
> +++ b/t/t5308-pack-detect-duplicates.sh
> @@ -0,0 +1,73 @@
> +#!/bin/sh
> +
> +test_description='handling of duplicate objects in incoming packfiles'
> +. ./test-lib.sh
> +. ../lib-pack.sh

This should be

. "$TEST_DIRECTORY"/lib-pack.sh

to support running tests with --root (also in patch 3/6).

-- Hannes

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP
  2013-08-23 19:41                   ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Johannes Sixt
@ 2013-08-23 23:37                     ` Jeff King
  0 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2013-08-23 23:37 UTC (permalink / raw)
  To: Johannes Sixt
  Cc: Git Mailing List, Duy Nguyen, Junio C Hamano, Shawn O. Pearce,
	Nicolas Pitre

On Fri, Aug 23, 2013 at 09:41:39PM +0200, Johannes Sixt wrote:

> Am 23.08.2013 01:14, schrieb Jeff King:
> >+++ b/t/t5308-pack-detect-duplicates.sh
> >@@ -0,0 +1,73 @@
> >+#!/bin/sh
> >+
> >+test_description='handling of duplicate objects in incoming packfiles'
> >+. ./test-lib.sh
> >+. ../lib-pack.sh
> 
> This should be
> 
> . "$TEST_DIRECTORY"/lib-pack.sh
> 
> to support running tests with --root (also in patch 3/6).

Doh, you would think that I would remember that, as the one who
introduced "--root" in the first place.

Will fix. Thanks for noticing.

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCHv3 0/6] duplicate objects and delta cycles
  2013-08-23 18:24                     ` Jeff King
  2013-08-23 18:54                       ` Nicolas Pitre
@ 2013-08-24  0:01                       ` Jeff King
  2013-08-24  0:01                         ` [PATCH 1/6] test-sha1: add a binary output mode Jeff King
                                           ` (5 more replies)
  1 sibling, 6 replies; 37+ messages in thread
From: Jeff King @ 2013-08-24  0:01 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Git Mailing List, Duy Nguyen, Shawn O. Pearce, Nicolas Pitre

On Fri, Aug 23, 2013 at 02:24:10PM -0400, Jeff King wrote:

> > This is somewhat tricky and may deserve an in-code comment.
> 
> Do you want me to re-roll[...]

Since there were a few things to fix, I went ahead and re-rolled. Here's
v3, which changes:

  1. Move "edge of run" logic description into the in-code comment
     rather than the commit message.

  2. Include space between shell function names and parentheses.

  3. Use $TEST_DIRECTORY to find lib-pack so that "--root" works.

I added in Nicolas's acks, too, as this version makes no changes of
substance from the previous ack'd version. Interdiff is below.

 sha1-lookup.c                     | 24 ++++++++++++++++++++++++
 t/lib-pack.sh                     | 12 ++++++------
 t/t5308-pack-detect-duplicates.sh |  4 ++--
 t/t5309-pack-delta-cycles.sh      |  2 +-
 4 files changed, 33 insertions(+), 9 deletions(-)

diff --git a/sha1-lookup.c b/sha1-lookup.c
index 614cbb6..2dd8515 100644
--- a/sha1-lookup.c
+++ b/sha1-lookup.c
@@ -215,6 +215,30 @@ int sha1_entry_pos(const void *table,
 			 * Either one of these entries is it (and
 			 * we do not care which), or we do not have
 			 * it.
+			 *
+			 * Furthermore, we know that one of our
+			 * endpoints must be the edge of the run of
+			 * duplicates. For example, given this
+			 * sequence:
+			 *
+			 *     idx 0 1 2 3 4 5
+			 *     key A C C C C D
+			 *
+			 * If we are searching for "B", we might
+			 * hit the duplicate run at lo=1, hi=3
+			 * (e.g., by first mi=3, then mi=0). But we
+			 * can never have lo > 1, because B < C.
+			 * That is, if our key is less than the
+			 * run, we know that "lo" is the edge, but
+			 * we can say nothing of "hi". Similarly,
+			 * if our key is greater than the run, we
+			 * know that "hi" is the edge, but we can
+			 * say nothing of "lo".
+			 *
+			 * Therefore if we do not find it, we also
+			 * know where it would go if it did exist:
+			 * just on the far side of the edge that we
+			 * know about.
 			 */
 			if (ofs == 20) {
 				mi = lo;
diff --git a/t/lib-pack.sh b/t/lib-pack.sh
index e028c40..7e8685b 100644
--- a/t/lib-pack.sh
+++ b/t/lib-pack.sh
@@ -10,7 +10,7 @@
 #   pack_trailer foo.pack
 
 # Print the big-endian 4-byte octal representation of $1
-uint32_octal() {
+uint32_octal () {
 	n=$1
 	printf '\%o' $(($n / 16777216)); n=$((n % 16777216))
 	printf '\%o' $(($n /    65536)); n=$((n %    65536))
@@ -19,12 +19,12 @@ uint32_octal() {
 }
 
 # Print the big-endian 4-byte binary representation of $1
-uint32_binary() {
+uint32_binary () {
 	printf "$(uint32_octal "$1")"
 }
 
 # Print a pack header, version 2, for a pack with $1 objects
-pack_header() {
+pack_header () {
 	printf 'PACK' &&
 	printf '\0\0\0\2' &&
 	uint32_binary "$1"
@@ -36,7 +36,7 @@ pack_header() {
 # Doing this on the fly (especially picking your deltas) is quite tricky, so we
 # have hardcoded some well-known objects. See the case statements below for the
 # complete list.
-pack_obj() {
+pack_obj () {
 	case "$1" in
 	# empty blob
 	e69de29bb2d1d6434b8b29ae775ad8c2e48c5391)
@@ -86,7 +86,7 @@ pack_obj() {
 }
 
 # Compute and append pack trailer to "$1"
-pack_trailer() {
+pack_trailer () {
 	test-sha1 -b <"$1" >trailer.tmp &&
 	cat trailer.tmp >>"$1" &&
 	rm -f trailer.tmp
@@ -95,6 +95,6 @@ pack_trailer() {
 # Remove any existing packs to make sure that
 # whatever we index next will be the pack that we
 # actually use.
-clear_packs() {
+clear_packs () {
 	rm -f .git/objects/pack/*
 }
diff --git a/t/t5308-pack-detect-duplicates.sh b/t/t5308-pack-detect-duplicates.sh
index e982095..e0ba5e1 100755
--- a/t/t5308-pack-detect-duplicates.sh
+++ b/t/t5308-pack-detect-duplicates.sh
@@ -2,7 +2,7 @@
 
 test_description='handling of duplicate objects in incoming packfiles'
 . ./test-lib.sh
-. ../lib-pack.sh
+. "$TEST_DIRECTORY"/lib-pack.sh
 
 # The sha1s we have in our pack. It's important that these have the same
 # starting byte, so that they end up in the same fanout section of the index.
@@ -22,7 +22,7 @@ MISSING_SHA1='e69d000000000000000000000000000000000000'
 # $1 is the name of the packfile to create
 #
 # $2 is the number of times to duplicate each object
-create_pack() {
+create_pack () {
 	pack_header "$((2 * $2))" >"$1" &&
 	for i in $(test_seq 1 "$2"); do
 		pack_obj $LO_SHA1 &&
diff --git a/t/t5309-pack-delta-cycles.sh b/t/t5309-pack-delta-cycles.sh
index 38e1809..3e7861b 100755
--- a/t/t5309-pack-delta-cycles.sh
+++ b/t/t5309-pack-delta-cycles.sh
@@ -2,7 +2,7 @@
 
 test_description='test index-pack handling of delta cycles in packfiles'
 . ./test-lib.sh
-. ../lib-pack.sh
+. "$TEST_DIRECTORY"/lib-pack.sh
 
 # Two similar-ish objects that we have computed deltas between.
 A=01d7713666f4de822776c7622c10f1b07de280dc

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 1/6] test-sha1: add a binary output mode
  2013-08-24  0:01                       ` [PATCHv3 0/6] duplicate objects and delta cycles Jeff King
@ 2013-08-24  0:01                         ` Jeff King
  2013-08-24  0:02                         ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
                                           ` (4 subsequent siblings)
  5 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2013-08-24  0:01 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Git Mailing List, Duy Nguyen, Shawn O. Pearce, Nicolas Pitre

The test-sha1 helper program will run our internal sha1
routines over its input and output the 40-byte hex sha1.
Sometimes, however, it is useful to have the binary 20-byte
sha1 (and it's a pain to convert back in the shell). Let's
add a "-b" option to output the binary version.

Signed-off-by: Jeff King <peff@peff.net>
Acked-by: Nicolas Pitre <nico@fluxnic.net>
---
 test-sha1.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/test-sha1.c b/test-sha1.c
index 80daba9..e57eae1 100644
--- a/test-sha1.c
+++ b/test-sha1.c
@@ -5,10 +5,15 @@ int main(int ac, char **av)
 	git_SHA_CTX ctx;
 	unsigned char sha1[20];
 	unsigned bufsz = 8192;
+	int binary = 0;
 	char *buffer;
 
-	if (ac == 2)
-		bufsz = strtoul(av[1], NULL, 10) * 1024 * 1024;
+	if (ac == 2) {
+		if (!strcmp(av[1], "-b"))
+			binary = 1;
+		else
+			bufsz = strtoul(av[1], NULL, 10) * 1024 * 1024;
+	}
 
 	if (!bufsz)
 		bufsz = 8192;
@@ -42,6 +47,10 @@ int main(int ac, char **av)
 		git_SHA1_Update(&ctx, buffer, this_sz);
 	}
 	git_SHA1_Final(sha1, &ctx);
-	puts(sha1_to_hex(sha1));
+
+	if (binary)
+		fwrite(sha1, 1, 20, stdout);
+	else
+		puts(sha1_to_hex(sha1));
 	exit(0);
 }
-- 
1.8.4.rc2.28.g6bb5f3f

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP
  2013-08-24  0:01                       ` [PATCHv3 0/6] duplicate objects and delta cycles Jeff King
  2013-08-24  0:01                         ` [PATCH 1/6] test-sha1: add a binary output mode Jeff King
@ 2013-08-24  0:02                         ` Jeff King
  2013-08-24  0:02                         ` [PATCH 3/6] add tests for indexing packs with delta cycles Jeff King
                                           ` (3 subsequent siblings)
  5 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2013-08-24  0:02 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Git Mailing List, Duy Nguyen, Shawn O. Pearce, Nicolas Pitre

The sha1_entry_pos function tries to be smart about
selecting the middle of a range for its binary search by
looking at the value differences between the "lo" and "hi"
constraints. However, it is unable to cope with entries with
duplicate keys in the sorted list.

We may hit a point in the search where both our "lo" and
"hi" point to the same key. In this case, the range of
values between our endpoints is 0, and trying to scale the
difference between our key and the endpoints over that range
is undefined (i.e., divide by zero). The current code
catches this with an "assert(lov < hiv)".

Moreover, after seeing that the first 20 bytes of the key are
the same, we will try to establish a value from the 21st
byte, which is nonsensical.

Instead, we can detect the case that we are in a run of
duplicates, and simply do a final comparison against any one
of them (since they are all the same, it does not matter
which). If the keys match, we have found our entry (or one
of them, anyway).  If not, then we know that we do not need
to look further, as we must be in a run of the duplicate
key.

Signed-off-by: Jeff King <peff@peff.net>
Acked-by: Nicolas Pitre <nico@fluxnic.net>
---
 sha1-lookup.c                     | 47 +++++++++++++++++++++++
 t/lib-pack.sh                     | 78 +++++++++++++++++++++++++++++++++++++++
 t/t5308-pack-detect-duplicates.sh | 73 ++++++++++++++++++++++++++++++++++++
 3 files changed, 198 insertions(+)
 create mode 100644 t/lib-pack.sh
 create mode 100755 t/t5308-pack-detect-duplicates.sh

diff --git a/sha1-lookup.c b/sha1-lookup.c
index c4dc55d..2dd8515 100644
--- a/sha1-lookup.c
+++ b/sha1-lookup.c
@@ -204,7 +204,54 @@ int sha1_entry_pos(const void *table,
 			 * byte 0 thru (ofs-1) are the same between
 			 * lo and hi; ofs is the first byte that is
 			 * different.
+			 *
+			 * If ofs==20, then no bytes are different,
+			 * meaning we have entries with duplicate
+			 * keys. We know that we are in a solid run
+			 * of this entry (because the entries are
+			 * sorted, and our lo and hi are the same,
+			 * there can be nothing but this single key
+			 * in between). So we can stop the search.
+			 * Either one of these entries is it (and
+			 * we do not care which), or we do not have
+			 * it.
+			 *
+			 * Furthermore, we know that one of our
+			 * endpoints must be the edge of the run of
+			 * duplicates. For example, given this
+			 * sequence:
+			 *
+			 *     idx 0 1 2 3 4 5
+			 *     key A C C C C D
+			 *
+			 * If we are searching for "B", we might
+			 * hit the duplicate run at lo=1, hi=3
+			 * (e.g., by first mi=3, then mi=0). But we
+			 * can never have lo > 1, because B < C.
+			 * That is, if our key is less than the
+			 * run, we know that "lo" is the edge, but
+			 * we can say nothing of "hi". Similarly,
+			 * if our key is greater than the run, we
+			 * know that "hi" is the edge, but we can
+			 * say nothing of "lo".
+			 *
+			 * Therefore if we do not find it, we also
+			 * know where it would go if it did exist:
+			 * just on the far side of the edge that we
+			 * know about.
 			 */
+			if (ofs == 20) {
+				mi = lo;
+				mi_key = base + elem_size * mi + key_offset;
+				cmp = memcmp(mi_key, key, 20);
+				if (!cmp)
+					return mi;
+				if (cmp < 0)
+					return -1 - hi;
+				else
+					return -1 - lo;
+			}
+
 			hiv = hi_key[ofs_0];
 			if (ofs_0 < 19)
 				hiv = (hiv << 8) | hi_key[ofs_0+1];
diff --git a/t/lib-pack.sh b/t/lib-pack.sh
new file mode 100644
index 0000000..fecd5a0
--- /dev/null
+++ b/t/lib-pack.sh
@@ -0,0 +1,78 @@
+#!/bin/sh
+#
+# Support routines for hand-crafting weird or malicious packs.
+#
+# You can make a complete pack like:
+#
+#   pack_header 2 >foo.pack &&
+#   pack_obj e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 >>foo.pack &&
+#   pack_obj e68fe8129b546b101aee9510c5328e7f21ca1d18 >>foo.pack &&
+#   pack_trailer foo.pack
+
+# Print the big-endian 4-byte octal representation of $1
+uint32_octal () {
+	n=$1
+	printf '\%o' $(($n / 16777216)); n=$((n % 16777216))
+	printf '\%o' $(($n /    65536)); n=$((n %    65536))
+	printf '\%o' $(($n /      256)); n=$((n %      256))
+	printf '\%o' $(($n           ));
+}
+
+# Print the big-endian 4-byte binary representation of $1
+uint32_binary () {
+	printf "$(uint32_octal "$1")"
+}
+
+# Print a pack header, version 2, for a pack with $1 objects
+pack_header () {
+	printf 'PACK' &&
+	printf '\0\0\0\2' &&
+	uint32_binary "$1"
+}
+
+# Print the pack data for object $1, as a delta against object $2 (or as a full
+# object if $2 is missing or empty). The output is suitable for including
+# directly in the packfile, and represents the entirety of the object entry.
+# Doing this on the fly (especially picking your deltas) is quite tricky, so we
+# have hardcoded some well-known objects. See the case statements below for the
+# complete list.
+pack_obj () {
+	case "$1" in
+	# empty blob
+	e69de29bb2d1d6434b8b29ae775ad8c2e48c5391)
+		case "$2" in
+		'')
+			printf '\060\170\234\003\0\0\0\0\1'
+			return
+			;;
+		esac
+		;;
+
+	# blob containing "\7\76"
+	e68fe8129b546b101aee9510c5328e7f21ca1d18)
+		case "$2" in
+		'')
+			printf '\062\170\234\143\267\3\0\0\116\0\106'
+			return
+			;;
+		esac
+		;;
+	esac
+
+	echo >&2 "BUG: don't know how to print $1${2:+ (from $2)}"
+	return 1
+}
+
+# Compute and append pack trailer to "$1"
+pack_trailer () {
+	test-sha1 -b <"$1" >trailer.tmp &&
+	cat trailer.tmp >>"$1" &&
+	rm -f trailer.tmp
+}
+
+# Remove any existing packs to make sure that
+# whatever we index next will be the pack that we
+# actually use.
+clear_packs () {
+	rm -f .git/objects/pack/*
+}
diff --git a/t/t5308-pack-detect-duplicates.sh b/t/t5308-pack-detect-duplicates.sh
new file mode 100755
index 0000000..04fe242
--- /dev/null
+++ b/t/t5308-pack-detect-duplicates.sh
@@ -0,0 +1,73 @@
+#!/bin/sh
+
+test_description='handling of duplicate objects in incoming packfiles'
+. ./test-lib.sh
+. "$TEST_DIRECTORY"/lib-pack.sh
+
+# The sha1s we have in our pack. It's important that these have the same
+# starting byte, so that they end up in the same fanout section of the index.
+# That lets us make sure we are exercising the binary search with both sets.
+LO_SHA1=e68fe8129b546b101aee9510c5328e7f21ca1d18
+HI_SHA1=e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
+
+# And here's a "missing sha1" which will produce failed lookups. It must also
+# be in the same fanout section, and should be between the two (so that during
+# our binary search, we are sure to end up looking at one or the other of the
+# duplicate runs).
+MISSING_SHA1='e69d000000000000000000000000000000000000'
+
+# git will never intentionally create packfiles with
+# duplicate objects, so we have to construct them by hand.
+#
+# $1 is the name of the packfile to create
+#
+# $2 is the number of times to duplicate each object
+create_pack () {
+	pack_header "$((2 * $2))" >"$1" &&
+	for i in $(test_seq 1 "$2"); do
+		pack_obj $LO_SHA1 &&
+		pack_obj $HI_SHA1
+	done >>"$1" &&
+	pack_trailer "$1"
+}
+
+# double-check that create_pack actually works
+test_expect_success 'pack with no duplicates' '
+	create_pack no-dups.pack 1 &&
+	git index-pack --stdin <no-dups.pack
+'
+
+test_expect_success 'index-pack will allow duplicate objects by default' '
+	clear_packs &&
+	create_pack dups.pack 100 &&
+	git index-pack --stdin <dups.pack
+'
+
+test_expect_success 'create batch-check test vectors' '
+	cat >input <<-EOF &&
+	$LO_SHA1
+	$HI_SHA1
+	$MISSING_SHA1
+	EOF
+	cat >expect <<-EOF
+	$LO_SHA1 blob 2
+	$HI_SHA1 blob 0
+	$MISSING_SHA1 missing
+	EOF
+'
+
+test_expect_success 'lookup in duplicated pack (binary search)' '
+	git cat-file --batch-check <input >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'lookup in duplicated pack (GIT_USE_LOOKUP)' '
+	(
+		GIT_USE_LOOKUP=1 &&
+		export GIT_USE_LOOKUP &&
+		git cat-file --batch-check <input >actual
+	) &&
+	test_cmp expect actual
+'
+
+test_done
-- 
1.8.4.rc2.28.g6bb5f3f

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 3/6] add tests for indexing packs with delta cycles
  2013-08-24  0:01                       ` [PATCHv3 0/6] duplicate objects and delta cycles Jeff King
  2013-08-24  0:01                         ` [PATCH 1/6] test-sha1: add a binary output mode Jeff King
  2013-08-24  0:02                         ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
@ 2013-08-24  0:02                         ` Jeff King
  2013-08-24  0:02                         ` [PATCH 4/6] test index-pack on packs with recoverable " Jeff King
                                           ` (2 subsequent siblings)
  5 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2013-08-24  0:02 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Git Mailing List, Duy Nguyen, Shawn O. Pearce, Nicolas Pitre

If we receive a broken or malicious pack from a remote, we
will feed it to index-pack. As index-pack processes the
objects as a stream, reconstructing and hashing each object
to get its name, it is not very susceptible to doing the
wrong with bad data (it simply notices that the data is
bogus and aborts).

However, one question raised on the list is whether it could
be susceptible to problems during the delta-resolution
phase. In particular, can a cycle in the packfile deltas
cause us to go into an infinite loop or cause any other
problem?

The answer is no.

We cannot have a cycle of delta-base offsets, because they
go only in one direction (the OFS_DELTA object mentions its
base by an offset towards the beginning of the file, and we
explicitly reject negative offsets).

We can have a cycle of REF_DELTA objects, which refer to
base objects by sha1 name. However, index-pack does not know
these sha1 names ahead of time; it has to reconstruct the
objects to get their names, and it cannot do so if there is
a delta cycle (in other words, it does not even realize
there is a cycle, but only that there are items that cannot
be resolved).

Even though we can reason out that index-pack should handle
this fine, let's add a few tests to make sure it behaves
correctly.

Signed-off-by: Jeff King <peff@peff.net>
Acked-by: Nicolas Pitre <nico@fluxnic.net>
---
 t/lib-pack.sh                | 22 +++++++++++++++++
 t/t5309-pack-delta-cycles.sh | 59 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 81 insertions(+)
 create mode 100755 t/t5309-pack-delta-cycles.sh

diff --git a/t/lib-pack.sh b/t/lib-pack.sh
index fecd5a0..7e8685b 100644
--- a/t/lib-pack.sh
+++ b/t/lib-pack.sh
@@ -55,6 +55,28 @@ pack_obj () {
 			printf '\062\170\234\143\267\3\0\0\116\0\106'
 			return
 			;;
+		01d7713666f4de822776c7622c10f1b07de280dc)
+			printf '\165\1\327\161\66\146\364\336\202\47\166' &&
+			printf '\307\142\54\20\361\260\175\342\200\334\170' &&
+			printf '\234\143\142\142\142\267\003\0\0\151\0\114'
+			return
+			;;
+		esac
+		;;
+
+	# blob containing "\7\0"
+	01d7713666f4de822776c7622c10f1b07de280dc)
+		case "$2" in
+		'')
+			printf '\062\170\234\143\147\0\0\0\20\0\10'
+			return
+			;;
+		e68fe8129b546b101aee9510c5328e7f21ca1d18)
+			printf '\165\346\217\350\22\233\124\153\20\32\356' &&
+			printf '\225\20\305\62\216\177\41\312\35\30\170\234' &&
+			printf '\143\142\142\142\147\0\0\0\53\0\16'
+			return
+			;;
 		esac
 		;;
 	esac
diff --git a/t/t5309-pack-delta-cycles.sh b/t/t5309-pack-delta-cycles.sh
new file mode 100755
index 0000000..1640bf7
--- /dev/null
+++ b/t/t5309-pack-delta-cycles.sh
@@ -0,0 +1,59 @@
+#!/bin/sh
+
+test_description='test index-pack handling of delta cycles in packfiles'
+. ./test-lib.sh
+. "$TEST_DIRECTORY"/lib-pack.sh
+
+# Two similar-ish objects that we have computed deltas between.
+A=01d7713666f4de822776c7622c10f1b07de280dc
+B=e68fe8129b546b101aee9510c5328e7f21ca1d18
+
+# double-check our hand-constructed packs
+test_expect_success 'index-pack works with a single delta (A->B)' '
+	clear_packs &&
+	{
+		pack_header 2 &&
+		pack_obj $A $B &&
+		pack_obj $B
+	} >ab.pack &&
+	pack_trailer ab.pack &&
+	git index-pack --stdin <ab.pack &&
+	git cat-file -t $A &&
+	git cat-file -t $B
+'
+
+test_expect_success 'index-pack works with a single delta (B->A)' '
+	clear_packs &&
+	{
+		pack_header 2 &&
+		pack_obj $A &&
+		pack_obj $B $A
+	} >ba.pack &&
+	pack_trailer ba.pack &&
+	git index-pack --stdin <ba.pack &&
+	git cat-file -t $A &&
+	git cat-file -t $B
+'
+
+test_expect_success 'index-pack detects missing base objects' '
+	clear_packs &&
+	{
+		pack_header 1 &&
+		pack_obj $A $B
+	} >missing.pack &&
+	pack_trailer missing.pack &&
+	test_must_fail git index-pack --fix-thin --stdin <missing.pack
+'
+
+test_expect_success 'index-pack detects REF_DELTA cycles' '
+	clear_packs &&
+	{
+		pack_header 2 &&
+		pack_obj $A $B &&
+		pack_obj $B $A
+	} >cycle.pack &&
+	pack_trailer cycle.pack &&
+	test_must_fail git index-pack --fix-thin --stdin <cycle.pack
+'
+
+test_done
-- 
1.8.4.rc2.28.g6bb5f3f

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 4/6] test index-pack on packs with recoverable delta cycles
  2013-08-24  0:01                       ` [PATCHv3 0/6] duplicate objects and delta cycles Jeff King
                                           ` (2 preceding siblings ...)
  2013-08-24  0:02                         ` [PATCH 3/6] add tests for indexing packs with delta cycles Jeff King
@ 2013-08-24  0:02                         ` Jeff King
  2013-08-24  0:02                         ` [PATCH 5/6] index-pack: optionally reject packs with duplicate objects Jeff King
  2013-08-24  0:02                         ` [PATCH 6/6] default pack.indexDuplicates to false Jeff King
  5 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2013-08-24  0:02 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Git Mailing List, Duy Nguyen, Shawn O. Pearce, Nicolas Pitre

The previous commit added tests to show that index-pack
correctly bails in unrecoverable situations. There are some
situations where the data could be recovered, but index-pack
does not currently do so:

  1. If we can break the cycle using an object from another
     pack via --fix-thin.

  2. If we can break the cycle using a duplicate of one of
     the objects found in the same pack (sketched below).
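
The second situation corresponds to a pack laid out like this (a
sketch of the same shape the new test builds; $A and $B as in
t5309):

  {
    pack_header 3 &&
    pack_obj $A $B &&
    pack_obj $B $A &&
    pack_obj $A         # full copy that could break the cycle
  } >recoverable.pack &&
  pack_trailer recoverable.pack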

Note that neither of these is particularly high priority; a
delta cycle within a pack should never occur, and we have no
record of even a buggy git implementation creating such a
pack.

However, it's worth adding these tests for two reasons. One,
to document that we do not currently handle the situation,
even though it is possible. And two, to exercise the code
that runs in this situation; even though it fails, by
running it we can confirm that index-pack detects the
situation and aborts, and does not misbehave (e.g., by
following the cycle in an infinite loop).

In both cases, we hit an assert that aborts index-pack.
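
For reference, the first situation boils down to this sequence (a
sketch; ab.pack and cycle.pack are the packs built by the earlier
tests in t5309):

  $ git index-pack --stdin <ab.pack          # full copies of A and B
  $ git index-pack --fix-thin --stdin <cycle.pack  # currently asserts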

Signed-off-by: Jeff King <peff@peff.net>
Acked-by: Nicolas Pitre <nico@fluxnic.net>
---
 t/t5309-pack-delta-cycles.sh | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/t/t5309-pack-delta-cycles.sh b/t/t5309-pack-delta-cycles.sh
index 1640bf7..3e7861b 100755
--- a/t/t5309-pack-delta-cycles.sh
+++ b/t/t5309-pack-delta-cycles.sh
@@ -56,4 +56,22 @@ test_expect_success 'index-pack detects REF_DELTA cycles' '
 	test_must_fail git index-pack --fix-thin --stdin <cycle.pack
 '
 
+test_expect_failure 'failover to an object in another pack' '
+	clear_packs &&
+	git index-pack --stdin <ab.pack &&
+	git index-pack --stdin --fix-thin <cycle.pack
+'
+
+test_expect_failure 'failover to a duplicate object in the same pack' '
+	clear_packs &&
+	{
+		pack_header 3 &&
+		pack_obj $A $B &&
+		pack_obj $B $A &&
+		pack_obj $A
+	} >recoverable.pack &&
+	pack_trailer recoverable.pack &&
+	git index-pack --fix-thin --stdin <recoverable.pack
+'
+
 test_done
-- 
1.8.4.rc2.28.g6bb5f3f

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 5/6] index-pack: optionally reject packs with duplicate objects
  2013-08-24  0:01                       ` [PATCHv3 0/6] duplicate objects and delta cycles Jeff King
                                           ` (3 preceding siblings ...)
  2013-08-24  0:02                         ` [PATCH 4/6] test index-pack on packs with recoverable " Jeff King
@ 2013-08-24  0:02                         ` Jeff King
  2013-08-24  0:02                         ` [PATCH 6/6] default pack.indexDuplicates to false Jeff King
  5 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2013-08-24  0:02 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Git Mailing List, Duy Nguyen, Shawn O. Pearce, Nicolas Pitre

Git should never generate packs with duplicate objects.
However, we may see such packs due to bugs in Git or other
implementations (e.g., JGit had such a bug a few years ago).

In theory, such packs should not be a problem for us (we
will simply find one of the instances of the object when
looking in the pack). However, the JGit bug report mentioned
possible infinite loops during repacking due to cycles in
the delta chain.  Though this problem has not specifically
been reproduced on modern git, there is no reason not to be
careful with incoming packs: only buggy implementations
should be producing such packs anyway.

This patch introduces the pack.indexDuplicates option to
allow or reject such packs from index-pack. The default
remains to allow it.
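
For example, a receiving site can turn the check on while indexing
an incoming pack ("incoming.pack" is just a stand-in name here); if
the pack really does contain duplicates, the new check in
write_idx_file dies:

  $ git -c pack.indexDuplicates=false index-pack --stdin <incoming.pack
  fatal: pack has duplicate entries for <object name>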

Signed-off-by: Jeff King <peff@peff.net>
Acked-by: Nicolas Pitre <nico@fluxnic.net>
---
 builtin/index-pack.c              |  4 ++++
 pack-write.c                      | 24 ++++++++++++++++++++++++
 pack.h                            |  2 ++
 t/t5308-pack-detect-duplicates.sh |  8 ++++++++
 4 files changed, 38 insertions(+)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 79dfe47..72e19a0 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1364,6 +1364,10 @@ static int git_index_pack_config(const char *k, const char *v, void *cb)
 #endif
 		return 0;
 	}
+	if (!strcmp(k, "pack.indexduplicates")) {
+		opts->allow_duplicates = git_config_bool(k, v);
+		return 0;
+	}
 	return git_default_config(k, v, cb);
 }
 
diff --git a/pack-write.c b/pack-write.c
index ca9e63b..da946a7 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -7,6 +7,7 @@ void reset_pack_idx_option(struct pack_idx_option *opts)
 	memset(opts, 0, sizeof(*opts));
 	opts->version = 2;
 	opts->off32_limit = 0x7fffffff;
+	opts->allow_duplicates = 1;
 }
 
 static int sha1_compare(const void *_a, const void *_b)
@@ -37,6 +38,19 @@ static int need_large_offset(off_t offset, const struct pack_idx_option *opts)
 			 sizeof(ofsval), cmp_uint32);
 }
 
+static void *find_duplicate(void *vbase, size_t n, size_t size,
+			    int (*cmp)(const void *, const void *))
+{
+	unsigned char *base = vbase;
+	while (n > 1) {
+		if (!cmp(base, base + size))
+			return base;
+		base += size;
+		n--;
+	}
+	return NULL;
+}
+
 /*
  * On entry *sha1 contains the pack content SHA1 hash, on exit it is
  * the SHA1 hash of sorted object names. The objects array passed in
@@ -68,6 +82,16 @@ const char *write_idx_file(const char *index_name, struct pack_idx_entry **objec
 	else
 		sorted_by_sha = list = last = NULL;
 
+	if (!opts->allow_duplicates) {
+		struct pack_idx_entry **dup;
+
+		dup = find_duplicate(sorted_by_sha, nr_objects,
+				     sizeof(*sorted_by_sha), sha1_compare);
+		if (dup)
+			die("pack has duplicate entries for %s",
+			    sha1_to_hex((*dup)->sha1));
+	}
+
 	if (opts->flags & WRITE_IDX_VERIFY) {
 		assert(index_name);
 		f = sha1fd_check(index_name);
diff --git a/pack.h b/pack.h
index aa6ee7d..45217b6 100644
--- a/pack.h
+++ b/pack.h
@@ -44,6 +44,8 @@ struct pack_idx_option {
 	uint32_t version;
 	uint32_t off32_limit;
 
+	int allow_duplicates;
+
 	/*
 	 * List of offsets that would fit within off32_limit but
 	 * need to be written out as 64-bit entity for byte-for-byte
diff --git a/t/t5308-pack-detect-duplicates.sh b/t/t5308-pack-detect-duplicates.sh
index 04fe242..97ce2e0 100755
--- a/t/t5308-pack-detect-duplicates.sh
+++ b/t/t5308-pack-detect-duplicates.sh
@@ -70,4 +70,12 @@ test_expect_success 'lookup in duplicated pack (GIT_USE_LOOKUP)' '
 	test_cmp expect actual
 '
 
+test_expect_success 'index-pack can reject packs with duplicates' '
+	clear_packs &&
+	create_pack dups.pack 2 &&
+	test_must_fail \
+		git -c pack.indexDuplicates=0 index-pack --stdin <dups.pack &&
+	test_expect_code 1 git cat-file -e $LO_SHA1
+'
+
 test_done
-- 
1.8.4.rc2.28.g6bb5f3f

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 6/6] default pack.indexDuplicates to false
  2013-08-24  0:01                       ` [PATCHv3 0/6] duplicate objects and delta cycles Jeff King
                                           ` (4 preceding siblings ...)
  2013-08-24  0:02                         ` [PATCH 5/6] index-pack: optionally reject packs with duplicate objects Jeff King
@ 2013-08-24  0:02                         ` Jeff King
  5 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2013-08-24  0:02 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Git Mailing List, Duy Nguyen, Shawn O. Pearce, Nicolas Pitre

We should never see duplicate objects in packs, and it is
unknown what kind of complications such packs could create
during the repacking process. The previous commit introduced
a safety valve for checking packs coming into the repository
and being indexed by index-pack.

Let's turn the safety valve on by default. This helps
protect sites receiving packfiles from potentially malicious
strangers, and shouldn't affect normal use (and if it does,
it will have helped us identify a buggy packfile writer).
In a pinch, users can recover by turning off the option, or
by resorting to unpack-objects to explode the pack.
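
As a sketch, the two escape hatches look like this (dups.pack
standing in for the offending pack):

  $ git -c pack.indexDuplicates=true index-pack --stdin <dups.pack
  $ git unpack-objects <dups.pack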

Note that this also turns the option on for packs we write
ourselves (e.g., via pack-objects, fast-import, or streaming
large blobs). We do not expect maliciously constructed
packfiles in these code paths, of course, but it can serve
as an extra check that we have not accidentally created a
buggy pack (and the check itself is cheap to perform).

Signed-off-by: Jeff King <peff@peff.net>
Acked-by: Nicolas Pitre <nico@fluxnic.net>
---
 pack-write.c                      | 1 -
 t/t5308-pack-detect-duplicates.sh | 9 ++++-----
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/pack-write.c b/pack-write.c
index da946a7..1e3c459 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -7,7 +7,6 @@ void reset_pack_idx_option(struct pack_idx_option *opts)
 	memset(opts, 0, sizeof(*opts));
 	opts->version = 2;
 	opts->off32_limit = 0x7fffffff;
-	opts->allow_duplicates = 1;
 }
 
 static int sha1_compare(const void *_a, const void *_b)
diff --git a/t/t5308-pack-detect-duplicates.sh b/t/t5308-pack-detect-duplicates.sh
index 97ce2e0..e0ba5e1 100755
--- a/t/t5308-pack-detect-duplicates.sh
+++ b/t/t5308-pack-detect-duplicates.sh
@@ -37,10 +37,10 @@ test_expect_success 'index-pack will allow duplicate objects by default' '
 	git index-pack --stdin <no-dups.pack
 '
 
-test_expect_success 'index-pack will allow duplicate objects by default' '
+test_expect_success 'index-pack will allow duplicate objects if asked' '
 	clear_packs &&
 	create_pack dups.pack 100 &&
-	git index-pack --stdin <dups.pack
+	git -c pack.indexDuplicates=true index-pack --stdin <dups.pack
 '
 
 test_expect_success 'create batch-check test vectors' '
@@ -70,11 +70,10 @@ test_expect_success 'index-pack can reject packs with duplicates' '
 	test_cmp expect actual
 '
 
-test_expect_success 'index-pack can reject packs with duplicates' '
+test_expect_success 'index-pack rejects packs with duplicates by default' '
 	clear_packs &&
 	create_pack dups.pack 2 &&
-	test_must_fail \
-		git -c pack.indexDuplicates=0 index-pack --stdin <dups.pack &&
+	test_must_fail git index-pack --stdin <dups.pack &&
 	test_expect_code 1 git cat-file -e $LO_SHA1
 '
 
-- 
1.8.4.rc2.28.g6bb5f3f

^ permalink raw reply related	[flat|nested] 37+ messages in thread

end of thread

Thread overview: 37+ messages
2013-08-14 18:17 duplicate objects in packfile Jeff King
2013-08-14 18:29 ` Junio C Hamano
2013-08-14 18:39   ` Junio C Hamano
2013-08-14 18:54     ` Nicolas Pitre
2013-08-16 15:01     ` Jeff King
2013-08-21 20:49       ` [RFC/PATCH 0/4] duplicate objects in packfiles Jeff King
2013-08-21 20:51         ` [PATCH 1/4] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
2013-08-21 20:52         ` [PATCH 2/4] index-pack: optionally reject packs with duplicate objects Jeff King
2013-08-22 13:45           ` Duy Nguyen
2013-08-22 14:43             ` Jeff King
2013-08-22 23:12               ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Jeff King
2013-08-22 23:12                 ` [PATCH 1/6] test-sha1: add a binary output mode Jeff King
2013-08-22 23:14                 ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
2013-08-23 16:41                   ` Junio C Hamano
2013-08-23 18:24                     ` Jeff King
2013-08-23 18:54                       ` Nicolas Pitre
2013-08-23 18:56                         ` Jeff King
2013-08-24  0:01                       ` [PATCHv3 0/6] duplicate objects and delta cycles Jeff King
2013-08-24  0:01                         ` [PATCH 1/6] test-sha1: add a binary output mode Jeff King
2013-08-24  0:02                         ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
2013-08-24  0:02                         ` [PATCH 3/6] add tests for indexing packs with delta cycles Jeff King
2013-08-24  0:02                         ` [PATCH 4/6] test index-pack on packs with recoverable " Jeff King
2013-08-24  0:02                         ` [PATCH 5/6] index-pack: optionally reject packs with duplicate objects Jeff King
2013-08-24  0:02                         ` [PATCH 6/6] default pack.indexDuplicates to false Jeff King
2013-08-23 19:41                   ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Johannes Sixt
2013-08-23 23:37                     ` Jeff King
2013-08-22 23:14                 ` [PATCH 3/6] add tests for indexing packs with delta cycles Jeff King
2013-08-22 23:15                 ` [PATCH 4/6] test index-pack on packs with recoverable " Jeff King
2013-08-22 23:15                 ` [PATCH 5/6] index-pack: optionally reject packs with duplicate objects Jeff King
2013-08-22 23:16                 ` [PATCH 6/6] default pack.indexDuplicates to false Jeff King
2013-08-22 23:35                 ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Nicolas Pitre
2013-08-21 20:53         ` [PATCH 3/4] reject duplicates when indexing a packfile we created Jeff King
2013-08-21 20:55         ` [DO NOT APPLY PATCH 4/4] index-pack: optionally skip duplicate packfile entries Jeff King
2013-08-21 23:20           ` Junio C Hamano
2013-08-22  0:47             ` Jeff King
2013-08-21 22:17         ` [RFC/PATCH 0/4] duplicate objects in packfiles Junio C Hamano
2013-08-14 18:50 ` duplicate objects in packfile Nicolas Pitre
