Git Mailing List Archive on lore.kernel.org
 help / color / Atom feed
* invalid tree and commit object
@ 2020-05-09  6:19 Brandon Williams
  2020-05-09 10:16 ` René Scharfe
  0 siblings, 1 reply; 23+ messages in thread
From: Brandon Williams @ 2020-05-09  6:19 UTC (permalink / raw)
  To: git; +Cc: Jeff King

Hey!

Its been a minute since I've written to the list but I was recently looking
into the rules fsck uses to identify valid or invalid objects and I believe I
found a case that I believe fsck is currently missing. One of the things fsck
looks for when validating a tree object is that it doesn't contain any
duplicate entries. It even has a nice comment about how `git-write-tree` used
to write out trees with duplicate entries:

    /*
     * git-write-tree used to write out a nonsense tree that has
     * entries with the same name, one blob and one tree.  Make
     * sure we do not have duplicate entries.
     */

Here's the setup:
    tree c63d067eaeed0cbc68b7e4fdf40d267c6b152fe8
    tree 6241ab2a5314798183b5c4ee8a7b0ccd12c651e6
    blob 5e1c309dae7f45e0f39b1bf3ac3cd9db12e7d689

    $ git ls-tree c63d067eaeed0cbc68b7e4fdf40d267c6b152fe8
    100644 blob 5e1c309dae7f45e0f39b1bf3ac3cd9db12e7d689    hello
    100644 blob 5e1c309dae7f45e0f39b1bf3ac3cd9db12e7d689    hello.c
    040000 tree 6241ab2a5314798183b5c4ee8a7b0ccd12c651e6    hello

    $ git ls-tree 6241ab2a5314798183b5c4ee8a7b0ccd12c651e6
    100644 blob 5e1c309dae7f45e0f39b1bf3ac3cd9db12e7d689    hello

    # '%' here indicates that there is no newline at the end of the object
    $ git cat-file blob 5e1c309dae7f45e0f39b1bf3ac3cd9db12e7d689
    Hello World%

fsck currently passes when being passed these objects despite c63d067eae having
a duplicate entry. This seems to be due to the duplicate entry check in
`fsck_tree` only checking if adjacent entries are duplicates but due to the
sorting rules its unable to realize that there is both a blob and a tree with
the name "hello".

I was even able to produce a commit and push it to Github[1] (which
didn't complain)

    $ git show --pretty=raw 62f1ff6e109f8b77edd7eeb65f6634faa76a93b2
    commit 62f1ff6e109f8b77edd7eeb65f6634faa76a93b2
    tree c63d067eaeed0cbc68b7e4fdf40d267c6b152fe8
    author Brandon Williams <bwilliams.eng@gmail.com> 1589004242 -0700
    committer Brandon Williams <bwilliams.eng@gmail.com> 1589004242 -0700

        hello

Checking out that commit leaves your working directory in a somewhat
broken and 'unclean' state (although Github's UI seems to be able to handle
displaying it properly).

Am I correct in assuming that this object is indeed invalid and should be
rejected by fsck?

-Brandon

[1]: https://github.com/bmwill/invalid-commit

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: invalid tree and commit object
  2020-05-09 10:16 ` René Scharfe
@ 2020-05-09  7:16   ` Johannes Schindelin
  2020-05-09 11:51     ` René Scharfe
  2020-05-09 17:28   ` Junio C Hamano
  1 sibling, 1 reply; 23+ messages in thread
From: Johannes Schindelin @ 2020-05-09  7:16 UTC (permalink / raw)
  To: René Scharfe; +Cc: Brandon Williams, git, Jeff King


[-- Attachment #1: Type: text/plain, Size: 1321 bytes --]

Hi,

On Sat, 9 May 2020, René Scharfe wrote:

> Am 09.05.20 um 08:19 schrieb Brandon Williams:
> > Here's the setup:
> >     tree c63d067eaeed0cbc68b7e4fdf40d267c6b152fe8
> >     tree 6241ab2a5314798183b5c4ee8a7b0ccd12c651e6
> >     blob 5e1c309dae7f45e0f39b1bf3ac3cd9db12e7d689
> >
> >     $ git ls-tree c63d067eaeed0cbc68b7e4fdf40d267c6b152fe8
> >     100644 blob 5e1c309dae7f45e0f39b1bf3ac3cd9db12e7d689    hello
> >     100644 blob 5e1c309dae7f45e0f39b1bf3ac3cd9db12e7d689    hello.c
> >     040000 tree 6241ab2a5314798183b5c4ee8a7b0ccd12c651e6    hello
>
> > Am I correct in assuming that this object is indeed invalid and should be
> > rejected by fsck?
>
> I'd say yes twice -- what good is a tree that you can't check out because
> it contains a d/f conflict?
>
> So I got curious if such trees might be in popular repos, wrote the patch
> below and checked around a bit, but couldn't find any.
>
> Is there a smarter way to check for duplicates?  One that doesn't need
> allocations?  Perhaps by having a version of tree_entry_extract() that
> seeks backwards somehow?

Maybe we should verify that the entries are sorted? That would not need
any allocation, and it could even use the return value of the same
comparison we already perform to check for duplicates.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: invalid tree and commit object
  2020-05-09  6:19 invalid tree and commit object Brandon Williams
@ 2020-05-09 10:16 ` René Scharfe
  2020-05-09  7:16   ` Johannes Schindelin
  2020-05-09 17:28   ` Junio C Hamano
  0 siblings, 2 replies; 23+ messages in thread
From: René Scharfe @ 2020-05-09 10:16 UTC (permalink / raw)
  To: Brandon Williams, git; +Cc: Jeff King

Am 09.05.20 um 08:19 schrieb Brandon Williams:
> Here's the setup:
>     tree c63d067eaeed0cbc68b7e4fdf40d267c6b152fe8
>     tree 6241ab2a5314798183b5c4ee8a7b0ccd12c651e6
>     blob 5e1c309dae7f45e0f39b1bf3ac3cd9db12e7d689
>
>     $ git ls-tree c63d067eaeed0cbc68b7e4fdf40d267c6b152fe8
>     100644 blob 5e1c309dae7f45e0f39b1bf3ac3cd9db12e7d689    hello
>     100644 blob 5e1c309dae7f45e0f39b1bf3ac3cd9db12e7d689    hello.c
>     040000 tree 6241ab2a5314798183b5c4ee8a7b0ccd12c651e6    hello

> Am I correct in assuming that this object is indeed invalid and should be
> rejected by fsck?

I'd say yes twice -- what good is a tree that you can't check out because
it contains a d/f conflict?

So I got curious if such trees might be in popular repos, wrote the patch
below and checked around a bit, but couldn't find any.

Is there a smarter way to check for duplicates?  One that doesn't need
allocations?  Perhaps by having a version of tree_entry_extract() that
seeks backwards somehow?

---
 fsck.c          | 10 ++++++++++
 t/t1450-fsck.sh | 16 ++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/fsck.c b/fsck.c
index 087a7f1ffc..f47b35fee8 100644
--- a/fsck.c
+++ b/fsck.c
@@ -587,6 +587,8 @@ static int fsck_tree(const struct object_id *oid,
 	struct tree_desc desc;
 	unsigned o_mode;
 	const char *o_name;
+	struct string_list names = STRING_LIST_INIT_NODUP;
+	size_t nr;

 	if (init_tree_desc_gently(&desc, buffer, size)) {
 		retval += report(options, oid, OBJ_TREE, FSCK_MSG_BAD_TREE, "cannot be parsed as a tree");
@@ -680,8 +682,16 @@ static int fsck_tree(const struct object_id *oid,

 		o_mode = mode;
 		o_name = name;
+		string_list_append(&names, name);
 	}

+	nr = names.nr;
+	string_list_sort(&names);
+	string_list_remove_duplicates(&names, 0);
+	if (names.nr != nr)
+		has_dup_entries = 1;
+	string_list_clear(&names, 0);
+
 	if (has_null_sha1)
 		retval += report(options, oid, OBJ_TREE, FSCK_MSG_NULL_SHA1, "contains entries pointing to null sha1");
 	if (has_full_path)
diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh
index 449ebc5657..91a6e34f38 100755
--- a/t/t1450-fsck.sh
+++ b/t/t1450-fsck.sh
@@ -257,6 +257,22 @@ test_expect_success 'tree object with duplicate entries' '
 	test_i18ngrep "error in tree .*contains duplicate file entries" out
 '

+test_expect_success 'tree object with dublicate names' '
+	test_when_finished "remove_object \$blob" &&
+	test_when_finished "remove_object \$tree" &&
+	test_when_finished "remove_object \$badtree" &&
+	blob=$(echo blob | git hash-object -w --stdin) &&
+	printf "100644 blob %s\t%s\n" $blob x.2 >tree &&
+	tree=$(git mktree <tree) &&
+	printf "100644 blob %s\t%s\n" $blob x.1 >badtree &&
+	printf "100644 blob %s\t%s\n" $blob x >>badtree &&
+	printf "040000 tree %s\t%s\n" $tree x >>badtree &&
+	badtree=$(git mktree <badtree) &&
+	test_must_fail git fsck 2>out &&
+	test_i18ngrep "$badtree" out &&
+	test_i18ngrep "error in tree .*contains duplicate file entries" out
+'
+
 test_expect_success 'unparseable tree object' '
 	test_oid_cache <<-\EOF &&
 	junk sha1:twenty-bytes-of-junk
--
2.26.2

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: invalid tree and commit object
  2020-05-09  7:16   ` Johannes Schindelin
@ 2020-05-09 11:51     ` René Scharfe
  0 siblings, 0 replies; 23+ messages in thread
From: René Scharfe @ 2020-05-09 11:51 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Brandon Williams, git, Jeff King

Am 09.05.20 um 09:16 schrieb Johannes Schindelin:
> Hi,
>
> On Sat, 9 May 2020, René Scharfe wrote:
>
>> Am 09.05.20 um 08:19 schrieb Brandon Williams:
>>> Here's the setup:
>>>     tree c63d067eaeed0cbc68b7e4fdf40d267c6b152fe8
>>>     tree 6241ab2a5314798183b5c4ee8a7b0ccd12c651e6
>>>     blob 5e1c309dae7f45e0f39b1bf3ac3cd9db12e7d689
>>>
>>>     $ git ls-tree c63d067eaeed0cbc68b7e4fdf40d267c6b152fe8
>>>     100644 blob 5e1c309dae7f45e0f39b1bf3ac3cd9db12e7d689    hello
>>>     100644 blob 5e1c309dae7f45e0f39b1bf3ac3cd9db12e7d689    hello.c
>>>     040000 tree 6241ab2a5314798183b5c4ee8a7b0ccd12c651e6    hello
>>
>>> Am I correct in assuming that this object is indeed invalid and should be
>>> rejected by fsck?
>>
>> I'd say yes twice -- what good is a tree that you can't check out because
>> it contains a d/f conflict?
>>
>> So I got curious if such trees might be in popular repos, wrote the patch
>> below and checked around a bit, but couldn't find any.
>>
>> Is there a smarter way to check for duplicates?  One that doesn't need
>> allocations?  Perhaps by having a version of tree_entry_extract() that
>> seeks backwards somehow?
>
> Maybe we should verify that the entries are sorted? That would not need
> any allocation, and it could even use the return value of the same
> comparison we already perform to check for duplicates.

Yes, but the key question is: Sorted by what?  Tree entries are sorted
by name, only that a slash is appended implicitly to the name of entries
of type tree.  The current code already checks for duplicates by
comparing neighboring entries' names.  That's not sufficient to find
duplicates that differ in type, as Brandon's example above shows (which
is sorted correctly).

René

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: invalid tree and commit object
  2020-05-09 10:16 ` René Scharfe
  2020-05-09  7:16   ` Johannes Schindelin
@ 2020-05-09 17:28   ` Junio C Hamano
  2020-05-09 19:24     ` René Scharfe
  1 sibling, 1 reply; 23+ messages in thread
From: Junio C Hamano @ 2020-05-09 17:28 UTC (permalink / raw)
  To: René Scharfe; +Cc: Brandon Williams, git, Jeff King

René Scharfe <l.s.r@web.de> writes:

> So I got curious if such trees might be in popular repos, wrote the patch
> below and checked around a bit, but couldn't find any.
>
> Is there a smarter way to check for duplicates?  One that doesn't need
> allocations?  Perhaps by having a version of tree_entry_extract() that
> seeks backwards somehow?

I've never looked into seeking backwards in a tree object, but in
unpack-trees, I had to deal with this exact problem that a blob
'hello' sorts before 'hello.c' which in turn sorts after a tree
'hello' because of the "implicit slash after the name for an entry
for a tree object" rule by introducing the "cache_bottom" hack in
the traversal logic to limit how far we must scan back.

We may be able to limit the list of "seen recently" names in a
similar way.

If the tree we are scanning has 'hello' (blob), 'hello.c' and
'hellp', the bottom pointer initially would be at 'hello' (blob),
then stay there when we see 'hello.c' (because the next entry might
be 'hello' (tree) that would crash with 'hello'), and when we see
the entry 'hellp', we know that the entry at the bottom pointer
'hello' (blob) cannot crash with any entry that comes later in the
tree object we are scanning, so we can advance the bottom pointer
forward.  To decide if we can advance the bottom pointer beyond
'hello.c' (blob), we see if 'hello.c' (tree) can appear after the
current entry we are looking at (i.e. 'hellp'), and we know it
cannot without violating the sort order.  So the bottom would move
to point at 'hellp' we just saw.

If we had 'hello' (tree) instead of 'hellp', when we look at it
after looking at 'hello' (blob) and 'hello.c', we scan from the
bottom pointer up to the previous entry, which is still pointing at
'hello' (blob), and notice the crash.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: invalid tree and commit object
  2020-05-09 17:28   ` Junio C Hamano
@ 2020-05-09 19:24     ` René Scharfe
  2020-05-09 20:27       ` Junio C Hamano
  0 siblings, 1 reply; 23+ messages in thread
From: René Scharfe @ 2020-05-09 19:24 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Brandon Williams, git, Jeff King

Am 09.05.20 um 19:28 schrieb Junio C Hamano:
> René Scharfe <l.s.r@web.de> writes:
>
>> So I got curious if such trees might be in popular repos, wrote the patch
>> below and checked around a bit, but couldn't find any.
>>
>> Is there a smarter way to check for duplicates?  One that doesn't need
>> allocations?  Perhaps by having a version of tree_entry_extract() that
>> seeks backwards somehow?
>
> I've never looked into seeking backwards in a tree object, but in
> unpack-trees, I had to deal with this exact problem that a blob
> 'hello' sorts before 'hello.c' which in turn sorts after a tree
> 'hello' because of the "implicit slash after the name for an entry
> for a tree object" rule by introducing the "cache_bottom" hack in
> the traversal logic to limit how far we must scan back.
>
> We may be able to limit the list of "seen recently" names in a
> similar way.
>
> If the tree we are scanning has 'hello' (blob), 'hello.c' and
> 'hellp', the bottom pointer initially would be at 'hello' (blob),
> then stay there when we see 'hello.c' (because the next entry might
> be 'hello' (tree) that would crash with 'hello'), and when we see
> the entry 'hellp', we know that the entry at the bottom pointer
> 'hello' (blob) cannot crash with any entry that comes later in the
> tree object we are scanning, so we can advance the bottom pointer
> forward.  To decide if we can advance the bottom pointer beyond
> 'hello.c' (blob), we see if 'hello.c' (tree) can appear after the
> current entry we are looking at (i.e. 'hellp'), and we know it
> cannot without violating the sort order.  So the bottom would move
> to point at 'hellp' we just saw.
>
> If we had 'hello' (tree) instead of 'hellp', when we look at it
> after looking at 'hello' (blob) and 'hello.c', we scan from the
> bottom pointer up to the previous entry, which is still pointing at
> 'hello' (blob), and notice the crash.
>

Hmm, this could lead to quadratic behavior in the worst case, can't it?
Imagine you have:

  a
  a.b
  a.b.c
  ...
  a.b.c/
  a.b/
  a/

If you have a single bottom pointer, it needs to stay at "a" the whole
time, so that you can detect the d/f conflict with "a/" at the end.
And for all entries in between you need to scan from "a" on and compare
each entry -- quadratic.  The blobs "a.b" and "a.b.c" don't need to
actually be present, but we need to scan all entries to determine
"a.b/" and "a.b.c/" are not conflicting anyway.  Right?

We could, however, reduce the names we add to the string_list to
those that are possible candidates for conflict -- blobs followed by an
entry whose name starts with the blob name followed by a dot and trees
that follow an entry whose name matches in the same way.

René

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: invalid tree and commit object
  2020-05-09 19:24     ` René Scharfe
@ 2020-05-09 20:27       ` Junio C Hamano
  2020-05-10  9:07         ` René Scharfe
  0 siblings, 1 reply; 23+ messages in thread
From: Junio C Hamano @ 2020-05-09 20:27 UTC (permalink / raw)
  To: René Scharfe; +Cc: Brandon Williams, git, Jeff King

René Scharfe <l.s.r@web.de> writes:

> Hmm, this could lead to quadratic behavior in the worst case, can't it?

Oh, absolutely.  But as you have shown, you'd need a specially
crafted tree with early entries that are prefixes of later ones,
which would rather be rare, and most of the time the bottom pointer
would advance by one every time we consume one path.  

So it is trading (hopefully rare) worst-case runtime with reduced
storage cost.

> We could, however, reduce the names we add to the string_list to
> those that are possible candidates for conflict -- blobs followed by an
> entry whose name starts with the blob name followed by a dot and trees
> that follow an entry whose name matches in the same way.

Yes, that is a valid solution that strikes a different balance
between allocation and runtime.

We may want to survey how commonly "bad" trees appear in real
projects.  Depending on the result, we might want to update the
"limit re-scanning using the bottom pointer" hack we have been using
in the unpack-trees code.

Thanks.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: invalid tree and commit object
  2020-05-09 20:27       ` Junio C Hamano
@ 2020-05-10  9:07         ` René Scharfe
  2020-05-10 16:12           ` René Scharfe
  2020-05-10 16:37           ` invalid tree and commit object Junio C Hamano
  0 siblings, 2 replies; 23+ messages in thread
From: René Scharfe @ 2020-05-10  9:07 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Brandon Williams, git, Jeff King

Am 09.05.20 um 22:27 schrieb Junio C Hamano:
> René Scharfe <l.s.r@web.de> writes:
>
>> Hmm, this could lead to quadratic behavior in the worst case, can't it?
>
> Oh, absolutely.  But as you have shown, you'd need a specially
> crafted tree with early entries that are prefixes of later ones,
> which would rather be rare, and most of the time the bottom pointer
> would advance by one every time we consume one path.
>
> So it is trading (hopefully rare) worst-case runtime with reduced
> storage cost.
>
>> We could, however, reduce the names we add to the string_list to
>> those that are possible candidates for conflict -- blobs followed by an
>> entry whose name starts with the blob name followed by a dot and trees
>> that follow an entry whose name matches in the same way.
>
> Yes, that is a valid solution that strikes a different balance
> between allocation and runtime.

fsck should ideally handle the worst cases gracefully -- I assume it has
to deal with the weirdest "natural" corruptions and the "best" that
pranksters and DoSers can come up with.

> We may want to survey how commonly "bad" trees appear in real
> projects.  Depending on the result, we might want to update the
> "limit re-scanning using the bottom pointer" hack we have been using
> in the unpack-trees code.

They are surprisingly common in Git's own repo.  Ca. 74% of the trees
in my clone had at least one candidate and the average number of
candidates per tree in those was ca. 8.9.  We have e.g. the t/ tree
with its several tnnnn/ vs. tnnnn-foo.sh entries, or builtin/ vs.
builtin.h.

In the Linux repo ca. 20% of the checked trees have at least a
candidate and the average number of candidates in those is ca. 2.4.

So the patch for only adding possible candidates to the string_list
is below (on top of my earlier one), but I wonder if we can do
better.

When we look for a d/f name clash we don't necessarily have to look
back upon finding a candidate directory -- we can also scan ahead
upon finding a candidate non-directory.  That could be done using
repeated update_tree_entry_gently() calls on a private copy.  It
wouldn't require any string_list allocations, but its worst case
complexity would still be quadratic, of course.

Would a stack work?  When we see a candidate non-directory, we put
it on the stack.  When we see a candidate directory, we compare it
to the entry at the top of the stack using strcmp().  Equality
indicates a duplicate and we are done.  If the directory name is
less then we can pop the entry from the stack and check again, as
we're past the point where a duplicate would be.  Makes sense?

The candidate stack solution would have linear complexity and
require less memory than a full list of candidates.

It would rely on the entries to be in the correct order (same as
the patch below, come to think of it), but that's probably OK.
We may miss DUPLICATE_ENTRIES (just like today :), but
TREE_NOT_SORTED would still be reported.

---
 fsck.c | 25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/fsck.c b/fsck.c
index f47b35fee8..7d5bfef804 100644
--- a/fsck.c
+++ b/fsck.c
@@ -569,6 +569,25 @@ static int verify_ordered(unsigned mode1, const char *name1, unsigned mode2, con
 	return c1 < c2 ? 0 : TREE_UNORDERED;
 }

+/*
+ * Consecutive entries are checked for duplicates in verify_ordered().
+ * However, there can be non-consecutive duplicates because a slash
+ * ('/') is appended implicitly to directory names during sorting, so
+ * there can be other names between a non-directory entry and a
+ * directory entry with the same name.  E.g.:
+ *
+ *    foo
+ *    foo.bar
+ *    foo/
+ *
+ */
+static int may_be_df_dup(const char *candidate, const char *interloper)
+{
+	const char *p;
+
+	return skip_prefix(interloper, candidate, &p) && '\0' < *p && *p < '/';
+}
+
 static int fsck_tree(const struct object_id *oid,
 		     const char *buffer, unsigned long size,
 		     struct fsck_options *options)
@@ -678,11 +697,14 @@ static int fsck_tree(const struct object_id *oid,
 			default:
 				break;
 			}
+			if (!S_ISDIR(o_mode) && may_be_df_dup(o_name, name))
+				string_list_append(&names, o_name);
+			if (S_ISDIR(mode) && may_be_df_dup(name, o_name))
+				string_list_append(&names, name);
 		}

 		o_mode = mode;
 		o_name = name;
-		string_list_append(&names, name);
 	}

 	nr = names.nr;
@@ -1221,7 +1243,6 @@ int fsck_finish(struct fsck_options *options)
 		free(buf);
 	}

-
 	oidset_clear(&gitmodules_found);
 	oidset_clear(&gitmodules_done);
 	return ret;
--
2.26.2


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: invalid tree and commit object
  2020-05-10  9:07         ` René Scharfe
@ 2020-05-10 16:12           ` René Scharfe
  2020-05-11 16:25             ` Junio C Hamano
  2020-05-10 16:37           ` invalid tree and commit object Junio C Hamano
  1 sibling, 1 reply; 23+ messages in thread
From: René Scharfe @ 2020-05-10 16:12 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Brandon Williams, git, Jeff King

Am 10.05.20 um 11:07 schrieb René Scharfe:
> Would a stack work?  When we see a candidate non-directory, we put
> it on the stack.  When we see a candidate directory, we compare it
> to the entry at the top of the stack using strcmp().  Equality
> indicates a duplicate and we are done.  If the directory name is
> less then we can pop the entry from the stack and check again, as
> we're past the point where a duplicate would be.  Makes sense?
>
> The candidate stack solution would have linear complexity and
> require less memory than a full list of candidates.
>
> It would rely on the entries to be in the correct order (same as
> the patch below, come to think of it), but that's probably OK.
> We may miss DUPLICATE_ENTRIES (just like today :), but
> TREE_NOT_SORTED would still be reported.

Here's what I mean.  It seems to work, but could use some critical
thought and testing.

-- >8 --
Subject: [PATCH] fsck: report non-consecutive duplicate names in trees

Tree entries are sorted in path order, meaning that directory names get
a slash ('/') appended implicitly.  Git fsck checks if trees contains
consecutive duplicates, but due to that ordering there can be
non-consecutive duplicates as well if one of them is a directory and the
other one isn't.  Such a tree cannot be fully checked out.

Find these duplicates by recording candidate file names on a stack and
check candidate directory names against that stack to find matches.

Suggested-by: Brandon Williams <bwilliamseng@gmail.com>
Original-test-by: Brandon Williams <bwilliamseng@gmail.com>
Signed-off-by: René Scharfe <l.s.r@web.de>
---
 fsck.c          | 72 +++++++++++++++++++++++++++++++++++++++++++++++--
 t/t1450-fsck.sh | 16 +++++++++++
 2 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/fsck.c b/fsck.c
index 087a7f1ffc..8bb3ecf282 100644
--- a/fsck.c
+++ b/fsck.c
@@ -523,6 +523,28 @@ int fsck_walk(struct object *obj, void *data, struct fsck_options *options)
 	}
 }

+struct name_stack {
+	const char **names;
+	size_t nr, alloc;
+};
+
+static void name_stack_push(struct name_stack *stack, const char *name)
+{
+	ALLOC_GROW(stack->names, stack->nr + 1, stack->alloc);
+	stack->names[stack->nr++] = name;
+}
+
+static const char *name_stack_pop(struct name_stack *stack)
+{
+	return stack->nr ? stack->names[--stack->nr] : NULL;
+}
+
+static void name_stack_clear(struct name_stack *stack)
+{
+	FREE_AND_NULL(stack->names);
+	stack->nr = stack->alloc = 0;
+}
+
 /*
  * The entries in a tree are ordered in the _path_ order,
  * which means that a directory entry is ordered by adding
@@ -534,7 +556,14 @@ int fsck_walk(struct object *obj, void *data, struct fsck_options *options)
 #define TREE_UNORDERED (-1)
 #define TREE_HAS_DUPS  (-2)

-static int verify_ordered(unsigned mode1, const char *name1, unsigned mode2, const char *name2)
+static int is_less_than_slash(unsigned char c)
+{
+	return '\0' < c && c < '/';
+}
+
+static int verify_ordered(unsigned mode1, const char *name1,
+			  unsigned mode2, const char *name2,
+			  struct name_stack *candidates)
 {
 	int len1 = strlen(name1);
 	int len2 = strlen(name2);
@@ -566,6 +595,41 @@ static int verify_ordered(unsigned mode1, const char *name1, unsigned mode2, con
 		c1 = '/';
 	if (!c2 && S_ISDIR(mode2))
 		c2 = '/';
+
+	/*
+	 * There can be non-consecutive duplicates due to the implicitly
+	 * add slash, e.g.:
+	 *
+	 *   foo
+	 *   foo.bar
+	 *   foo.bar.baz
+	 *   foo.bar/
+	 *   foo/
+	 *
+	 * Record non-directory candidates (like "foo" and "foo.bar" in
+	 * the example) on a stack and check directory candidates (like
+	 * foo/" and "foo.bar/") against that stack.
+	 */
+	if (!c1 && is_less_than_slash(c2)) {
+		name_stack_push(candidates, name1);
+	} else if (c2 == '/' && is_less_than_slash(c1)) {
+		for (;;) {
+			const char *p;
+			const char *f_name = name_stack_pop(candidates);
+
+			if (!f_name)
+				break;
+			if (!skip_prefix(name2, f_name, &p))
+				break;
+			if (!*p)
+				return TREE_HAS_DUPS;
+			if (is_less_than_slash(*p)) {
+				name_stack_push(candidates, f_name);
+				break;
+			}
+		}
+	}
+
 	return c1 < c2 ? 0 : TREE_UNORDERED;
 }

@@ -587,6 +651,7 @@ static int fsck_tree(const struct object_id *oid,
 	struct tree_desc desc;
 	unsigned o_mode;
 	const char *o_name;
+	struct name_stack df_dup_candidates = { NULL };

 	if (init_tree_desc_gently(&desc, buffer, size)) {
 		retval += report(options, oid, OBJ_TREE, FSCK_MSG_BAD_TREE, "cannot be parsed as a tree");
@@ -666,7 +731,8 @@ static int fsck_tree(const struct object_id *oid,
 		}

 		if (o_name) {
-			switch (verify_ordered(o_mode, o_name, mode, name)) {
+			switch (verify_ordered(o_mode, o_name, mode, name,
+					       &df_dup_candidates)) {
 			case TREE_UNORDERED:
 				not_properly_sorted = 1;
 				break;
@@ -682,6 +748,8 @@ static int fsck_tree(const struct object_id *oid,
 		o_name = name;
 	}

+	name_stack_clear(&df_dup_candidates);
+
 	if (has_null_sha1)
 		retval += report(options, oid, OBJ_TREE, FSCK_MSG_NULL_SHA1, "contains entries pointing to null sha1");
 	if (has_full_path)
diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh
index 449ebc5657..91a6e34f38 100755
--- a/t/t1450-fsck.sh
+++ b/t/t1450-fsck.sh
@@ -257,6 +257,22 @@ test_expect_success 'tree object with duplicate entries' '
 	test_i18ngrep "error in tree .*contains duplicate file entries" out
 '

+test_expect_success 'tree object with dublicate names' '
+	test_when_finished "remove_object \$blob" &&
+	test_when_finished "remove_object \$tree" &&
+	test_when_finished "remove_object \$badtree" &&
+	blob=$(echo blob | git hash-object -w --stdin) &&
+	printf "100644 blob %s\t%s\n" $blob x.2 >tree &&
+	tree=$(git mktree <tree) &&
+	printf "100644 blob %s\t%s\n" $blob x.1 >badtree &&
+	printf "100644 blob %s\t%s\n" $blob x >>badtree &&
+	printf "040000 tree %s\t%s\n" $tree x >>badtree &&
+	badtree=$(git mktree <badtree) &&
+	test_must_fail git fsck 2>out &&
+	test_i18ngrep "$badtree" out &&
+	test_i18ngrep "error in tree .*contains duplicate file entries" out
+'
+
 test_expect_success 'unparseable tree object' '
 	test_oid_cache <<-\EOF &&
 	junk sha1:twenty-bytes-of-junk
--
2.26.2

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: invalid tree and commit object
  2020-05-10  9:07         ` René Scharfe
  2020-05-10 16:12           ` René Scharfe
@ 2020-05-10 16:37           ` Junio C Hamano
  2020-05-21  9:51             ` René Scharfe
  1 sibling, 1 reply; 23+ messages in thread
From: Junio C Hamano @ 2020-05-10 16:37 UTC (permalink / raw)
  To: René Scharfe; +Cc: Brandon Williams, git, Jeff King

René Scharfe <l.s.r@web.de> writes:

> Would a stack work?  When we see a candidate non-directory, we put
> it on the stack.  When we see a candidate directory, we compare it
> to the entry at the top of the stack using strcmp().  Equality
> indicates a duplicate and we are done.  If the directory name is
> less then we can pop the entry from the stack and check again, as
> we're past the point where a duplicate would be.  Makes sense?

Perfectly and quite excited ;-)  I wonder if we can do the same in
the unpack-trees side.



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: invalid tree and commit object
  2020-05-10 16:12           ` René Scharfe
@ 2020-05-11 16:25             ` Junio C Hamano
  2020-05-13 16:27               ` Brandon Williams
                                 ` (5 more replies)
  0 siblings, 6 replies; 23+ messages in thread
From: Junio C Hamano @ 2020-05-11 16:25 UTC (permalink / raw)
  To: René Scharfe; +Cc: Brandon Williams, git, Jeff King

René Scharfe <l.s.r@web.de> writes:

>  fsck.c          | 72 +++++++++++++++++++++++++++++++++++++++++++++++--
>  t/t1450-fsck.sh | 16 +++++++++++
>  2 files changed, 86 insertions(+), 2 deletions(-)
>
> diff --git a/fsck.c b/fsck.c
> index 087a7f1ffc..8bb3ecf282 100644
> --- a/fsck.c
> +++ b/fsck.c
> @@ -523,6 +523,28 @@ int fsck_walk(struct object *obj, void *data, struct fsck_options *options)
>  	}
>  }
>
> +struct name_stack {
> +	const char **names;
> +	size_t nr, alloc;
> +};
> +
> +static void name_stack_push(struct name_stack *stack, const char *name)
> +{
> +	ALLOC_GROW(stack->names, stack->nr + 1, stack->alloc);
> +	stack->names[stack->nr++] = name;
> +}
> +
> +static const char *name_stack_pop(struct name_stack *stack)
> +{
> +	return stack->nr ? stack->names[--stack->nr] : NULL;
> +}
> +
> +static void name_stack_clear(struct name_stack *stack)
> +{
> +	FREE_AND_NULL(stack->names);
> +	stack->nr = stack->alloc = 0;
> +}

OK, names are just pointing into tree objects' buffer, and will be
there until we no longer need the stack and call stack_clear().
Good.

> @@ -534,7 +556,14 @@ int fsck_walk(struct object *obj, void *data, struct fsck_options *options)
>  #define TREE_UNORDERED (-1)
>  #define TREE_HAS_DUPS  (-2)
>
> -static int verify_ordered(unsigned mode1, const char *name1, unsigned mode2, const char *name2)
> +static int is_less_than_slash(unsigned char c)
> +{
> +	return '\0' < c && c < '/';
> +}

Mental note: the terminating byte does not count as
"less than".

> +static int verify_ordered(unsigned mode1, const char *name1,
> +			  unsigned mode2, const char *name2,
> +			  struct name_stack *candidates)

Mental note: the caller is iterating over tree entries, and calls
this helper with two adjacent entries (mode2/name2 is what it just
saw).  The function wants to see if name1 has duplicate in the tree
object, and before this fix, we thought that it is sufficient to
compare it with name2, but now we realized that an entry with the
same name as name1 can come much later than name2.  But it is still
the same in spirit---we want to make sure name1 does not have any
duplicate (name2 will get its turn by the next call to this function
the caller makes).

>  {
>  	int len1 = strlen(name1);
>  	int len2 = strlen(name2);
> @@ -566,6 +595,41 @@ static int verify_ordered(unsigned mode1, const char *name1, unsigned mode2, con
>  		c1 = '/';
>  	if (!c2 && S_ISDIR(mode2))
>  		c2 = '/';

Mental note: at this point, we have rejected two adjacent and
identical names, and c1 and c2 are the first byte that are different
between these two names; but if it is NUL at the end of the name,
and the object is a tree, c1/c2 is changed to a slash.

> +
> +	/*
> +	 * There can be non-consecutive duplicates due to the implicitly
> +	 * add slash, e.g.:

s/add slash/added slash/, or even "added slash at the end of the
name of a tree object".

> +	 *
> +	 *   foo
> +	 *   foo.bar
> +	 *   foo.bar.baz
> +	 *   foo.bar/
> +	 *   foo/
> +	 *
> +	 * Record non-directory candidates (like "foo" and "foo.bar" in
> +	 * the example) on a stack and check directory candidates (like
> +	 * foo/" and "foo.bar/") against that stack.
> +	 */
> +	if (!c1 && is_less_than_slash(c2)) {
> +		name_stack_push(candidates, name1);

If name1 is a blob and the name2 sorts before a hypothetical entry
that is a tree object with the same name as name1, we need to
remember that we saw name1.  We need to keep that record until we
see an entry that sorts after such a hypothetical tree.

We earlier made a mental note that c2==NUL does not count as being
less than slash, but it does not matter, as we rejected !c1 && !c2
much earlier.

When we remember name1 this way, we cannot yet tell if it is
duplicate, so we don't do anything unusual.

> +	} else if (c2 == '/' && is_less_than_slash(c1)) {

Now we are seeing a tree.  Does it crash with the last "suspicious"
name we saw?

> +		for (;;) {
> +			const char *p;
> +			const char *f_name = name_stack_pop(candidates);

We pop one name.  We know that 

 - it is a name of a blob object
 - all we have seen since then are blobs and commits and no tree

> +			if (!f_name)
> +				break;

If there is no remembered name, we are done.

> +			if (!skip_prefix(name2, f_name, &p))
> +				break;

If the remembered name (e.g. "foo") is not a prefix of the current
name (e.g. "fop"), we cannot have "foo/" after "fop" we are seeing
without violating ordering constraints that we will notice while
inspecting the later entries of this tree.  So "foo" can be
discarded at this point.

> +			if (!*p)
> +				return TREE_HAS_DUPS;

If name2 and f_name are the same, we have found a duplicate.

> +			if (is_less_than_slash(*p)) {

If name2 (the one we are currently looking at) has f_name as its
prefix, and the first byte that differs is less than slash (again,
our earlier observation that NUL is not counted as less than slash
does not matter here, as we just handled NUL case before this
check), then it still is possible that a tree entry with the same
name as f_name is hiding behind name2, so we push it back to the
stack (i.e.  we cannot tell f_name is a duplicat).  Entries in the
stack below f_name sorts earlier than f_name, are they are prefixes
of f_name in the ascending order of their length (due to the "if not
prefix we can pop" logic we saw earlier), and a tree with the same
name as stack[n] should sort earlier than a tree with the same name
as stack[n-1], so after seeing that we are not ready to determine
if f_name has duplicate, we know we are not ready for the names
deeper in the stack.

	Side note. this is nice but is subtle.  I'd need to retrace
	the thoughts on this part again later to convince myself
	that we are not missing anything.

> +				name_stack_push(candidates, f_name);
> +				break;
> +			}

Otherwise, we know name2 sorts later than the hypothetical entry
that is a tree with the same name as f_name, so f_name can be
discarded (i.e. we do not push it again).  Further, we may notice
that name2 makes it impossile for other names in the stack to have
entries with the same name, so we iterate.

> +		}
> +	}
> +
>  	return c1 < c2 ? 0 : TREE_UNORDERED;
>  }
>
> @@ -587,6 +651,7 @@ static int fsck_tree(const struct object_id *oid,
>  	struct tree_desc desc;
>  	unsigned o_mode;
>  	const char *o_name;
> +	struct name_stack df_dup_candidates = { NULL };
>
>  	if (init_tree_desc_gently(&desc, buffer, size)) {
>  		retval += report(options, oid, OBJ_TREE, FSCK_MSG_BAD_TREE, "cannot be parsed as a tree");
> @@ -666,7 +731,8 @@ static int fsck_tree(const struct object_id *oid,
>  		}
>
>  		if (o_name) {
> -			switch (verify_ordered(o_mode, o_name, mode, name)) {
> +			switch (verify_ordered(o_mode, o_name, mode, name,
> +					       &df_dup_candidates)) {
>  			case TREE_UNORDERED:
>  				not_properly_sorted = 1;
>  				break;
> @@ -682,6 +748,8 @@ static int fsck_tree(const struct object_id *oid,
>  		o_name = name;
>  	}
>
> +	name_stack_clear(&df_dup_candidates);
> +

OK.  Nicely done.

Thanks.

>  	if (has_null_sha1)
>  		retval += report(options, oid, OBJ_TREE, FSCK_MSG_NULL_SHA1, "contains entries pointing to null sha1");
>  	if (has_full_path)
> diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh
> index 449ebc5657..91a6e34f38 100755
> --- a/t/t1450-fsck.sh
> +++ b/t/t1450-fsck.sh
> @@ -257,6 +257,22 @@ test_expect_success 'tree object with duplicate entries' '
>  	test_i18ngrep "error in tree .*contains duplicate file entries" out
>  '
>
> +test_expect_success 'tree object with dublicate names' '
> +	test_when_finished "remove_object \$blob" &&
> +	test_when_finished "remove_object \$tree" &&
> +	test_when_finished "remove_object \$badtree" &&
> +	blob=$(echo blob | git hash-object -w --stdin) &&
> +	printf "100644 blob %s\t%s\n" $blob x.2 >tree &&
> +	tree=$(git mktree <tree) &&
> +	printf "100644 blob %s\t%s\n" $blob x.1 >badtree &&
> +	printf "100644 blob %s\t%s\n" $blob x >>badtree &&
> +	printf "040000 tree %s\t%s\n" $tree x >>badtree &&
> +	badtree=$(git mktree <badtree) &&
> +	test_must_fail git fsck 2>out &&
> +	test_i18ngrep "$badtree" out &&
> +	test_i18ngrep "error in tree .*contains duplicate file entries" out
> +'
> +
>  test_expect_success 'unparseable tree object' '
>  	test_oid_cache <<-\EOF &&
>  	junk sha1:twenty-bytes-of-junk
> --
> 2.26.2

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: invalid tree and commit object
  2020-05-11 16:25             ` Junio C Hamano
@ 2020-05-13 16:27               ` Brandon Williams
  2020-05-21  9:51               ` René Scharfe
                                 ` (4 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Brandon Williams @ 2020-05-13 16:27 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: René Scharfe, git, Jeff King

Thanks for coming up with a solution so quickly! Glad it wasn't too
difficult to solve.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: invalid tree and commit object
  2020-05-11 16:25             ` Junio C Hamano
  2020-05-13 16:27               ` Brandon Williams
@ 2020-05-21  9:51               ` René Scharfe
  2020-05-21  9:52               ` [PATCH 1/4] fsck: fix a typo in a comment René Scharfe
                                 ` (3 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: René Scharfe @ 2020-05-21  9:51 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Brandon Williams, git, Jeff King

Am 11.05.20 um 18:25 schrieb Junio C Hamano:
> René Scharfe <l.s.r@web.de> writes:
>> +
>> +	/*
>> +	 * There can be non-consecutive duplicates due to the implicitly
>> +	 * add slash, e.g.:
>
> s/add slash/added slash/, or even "added slash at the end of the
> name of a tree object".

Thanks for spotting the typo.  I think the short one suffices as the
sentence below (and earlier comments) already cover it.

>> +	 *
>> +	 *   foo
>> +	 *   foo.bar
>> +	 *   foo.bar.baz
>> +	 *   foo.bar/
>> +	 *   foo/
>> +	 *
>> +	 * Record non-directory candidates (like "foo" and "foo.bar" in
>> +	 * the example) on a stack and check directory candidates (like
>> +	 * foo/" and "foo.bar/") against that stack.
>> +	 */

> 	Side note. this is nice but is subtle.  I'd need to retrace
> 	the thoughts on this part again later to convince myself
> 	that we are not missing anything.

Yes, it's tricky, and we do miss something, as I mentioned in the test
coverage thread.  I'll reply with patches.

René

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: invalid tree and commit object
  2020-05-10 16:37           ` invalid tree and commit object Junio C Hamano
@ 2020-05-21  9:51             ` René Scharfe
  0 siblings, 0 replies; 23+ messages in thread
From: René Scharfe @ 2020-05-21  9:51 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Brandon Williams, git, Jeff King

Am 10.05.20 um 18:37 schrieb Junio C Hamano:
> René Scharfe <l.s.r@web.de> writes:
>
>> Would a stack work?  When we see a candidate non-directory, we put
>> it on the stack.  When we see a candidate directory, we compare it
>> to the entry at the top of the stack using strcmp().  Equality
>> indicates a duplicate and we are done.  If the directory name is
>> less then we can pop the entry from the stack and check again, as
>> we're past the point where a duplicate would be.  Makes sense?
>
> Perfectly and quite excited ;-)  I wonder if we can do the same in
> the unpack-trees side.

Walking multiple trees in lockstep should be possible as well.  We may
want to include a bitmap indicating which trees contain the name in the
stack entry, or any other necessary information.

And output of candidates would be delayed until the matching directory
name is either reached or found to be missing, so it can be out of
order.  Sorting like in diffcore_fix_diff_index() is an option if the
order is important and the entries are collected before being used
further.  Not sure if streaming in the right order could be done
efficiently.

But lets see if that scheme works in fsck first.  I'll send a patch for
making testing a bit easier.

René

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/4] fsck: fix a typo in a comment
  2020-05-11 16:25             ` Junio C Hamano
  2020-05-13 16:27               ` Brandon Williams
  2020-05-21  9:51               ` René Scharfe
@ 2020-05-21  9:52               ` René Scharfe
  2020-05-21 10:10                 ` Denton Liu
  2020-05-21 11:15                 ` René Scharfe
  2020-05-21  9:52               ` [PATCH 2/4] t1450: increase test coverage of in-tree d/f detection René Scharfe
                                 ` (2 subsequent siblings)
  5 siblings, 2 replies; 23+ messages in thread
From: René Scharfe @ 2020-05-21  9:52 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Brandon Williams, git, Jeff King

Reported-by: Junio C Hamano <gitster@pobox.com>
---
 fsck.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fsck.c b/fsck.c
index 8bb3ecf282..b48426262c 100644
--- a/fsck.c
+++ b/fsck.c
@@ -598,7 +598,7 @@ static int verify_ordered(unsigned mode1, const char *name1,

 	/*
 	 * There can be non-consecutive duplicates due to the implicitly
-	 * add slash, e.g.:
+	 * added slash, e.g.:
 	 *
 	 *   foo
 	 *   foo.bar
--
2.26.2

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 2/4] t1450: increase test coverage of in-tree d/f detection
  2020-05-11 16:25             ` Junio C Hamano
                                 ` (2 preceding siblings ...)
  2020-05-21  9:52               ` [PATCH 1/4] fsck: fix a typo in a comment René Scharfe
@ 2020-05-21  9:52               ` René Scharfe
  2020-05-21 10:20                 ` Denton Liu
  2020-05-21  9:52               ` [PATCH 3/4] t1450: demonstrate undetected in-tree d/f conflict René Scharfe
  2020-05-21  9:52               ` [PATCH 4/4] fsck: detect more in-tree d/f conflicts René Scharfe
  5 siblings, 1 reply; 23+ messages in thread
From: René Scharfe @ 2020-05-21  9:52 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Brandon Williams, git, Jeff King

Exercise the case of putting a conflict candidate file name back on the
stack because a matching directory might yet come up later.

Do that by factoring out the test code into a function to allow for more
concise notation in the form of parameters indicating names of trees
(with trailing slash) and blobs (without trailing slash) in no
particular order (they are sorted by git mktree).  Then add the new test
case as a second function call.

Fix a typo in the test title while at it ("dublicate").

Reported-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: René Scharfe <l.s.r@web.de>
---
 t/t1450-fsck.sh | 42 +++++++++++++++++++++++++++---------------
 1 file changed, 27 insertions(+), 15 deletions(-)

diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh
index 91a6e34f38..9640ac8ff2 100755
--- a/t/t1450-fsck.sh
+++ b/t/t1450-fsck.sh
@@ -257,21 +257,33 @@ test_expect_success 'tree object with duplicate entries' '
 	test_i18ngrep "error in tree .*contains duplicate file entries" out
 '

-test_expect_success 'tree object with dublicate names' '
-	test_when_finished "remove_object \$blob" &&
-	test_when_finished "remove_object \$tree" &&
-	test_when_finished "remove_object \$badtree" &&
-	blob=$(echo blob | git hash-object -w --stdin) &&
-	printf "100644 blob %s\t%s\n" $blob x.2 >tree &&
-	tree=$(git mktree <tree) &&
-	printf "100644 blob %s\t%s\n" $blob x.1 >badtree &&
-	printf "100644 blob %s\t%s\n" $blob x >>badtree &&
-	printf "040000 tree %s\t%s\n" $tree x >>badtree &&
-	badtree=$(git mktree <badtree) &&
-	test_must_fail git fsck 2>out &&
-	test_i18ngrep "$badtree" out &&
-	test_i18ngrep "error in tree .*contains duplicate file entries" out
-'
+check_duplicate_names () {
+	expect=$1 &&
+	shift &&
+	names=$@ &&
+	test_expect_$expect "tree object with duplicate names: $names" '
+		test_when_finished "remove_object \$blob" &&
+		test_when_finished "remove_object \$tree" &&
+		test_when_finished "remove_object \$badtree" &&
+		blob=$(echo blob | git hash-object -w --stdin) &&
+		printf "100644 blob %s\t%s\n" $blob x.2 >tree &&
+		tree=$(git mktree <tree) &&
+		for name in $names
+		do
+			case "$name" in
+			*/) printf "040000 tree %s\t%s\n" $tree "${name%/}" ;;
+			*)  printf "100644 blob %s\t%s\n" $blob "$name" ;;
+			esac
+		done >badtree &&
+		badtree=$(git mktree <badtree) &&
+		test_must_fail git fsck 2>out &&
+		test_i18ngrep "$badtree" out &&
+		test_i18ngrep "error in tree .*contains duplicate file entries" out
+	'
+}
+
+check_duplicate_names success x x.1 x/
+check_duplicate_names success x x.1.2 x.1/ x/

 test_expect_success 'unparseable tree object' '
 	test_oid_cache <<-\EOF &&
--
2.26.2

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 3/4] t1450: demonstrate undetected in-tree d/f conflict
  2020-05-11 16:25             ` Junio C Hamano
                                 ` (3 preceding siblings ...)
  2020-05-21  9:52               ` [PATCH 2/4] t1450: increase test coverage of in-tree d/f detection René Scharfe
@ 2020-05-21  9:52               ` René Scharfe
  2020-05-21  9:52               ` [PATCH 4/4] fsck: detect more in-tree d/f conflicts René Scharfe
  5 siblings, 0 replies; 23+ messages in thread
From: René Scharfe @ 2020-05-21  9:52 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Brandon Williams, git, Jeff King

Signed-off-by: René Scharfe <l.s.r@web.de>
---
 t/t1450-fsck.sh | 1 +
 1 file changed, 1 insertion(+)

diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh
index 9640ac8ff2..5780e10cbc 100755
--- a/t/t1450-fsck.sh
+++ b/t/t1450-fsck.sh
@@ -284,6 +284,7 @@ check_duplicate_names () {

 check_duplicate_names success x x.1 x/
 check_duplicate_names success x x.1.2 x.1/ x/
+check_duplicate_names failure x x.1 x.1.2 x/

 test_expect_success 'unparseable tree object' '
 	test_oid_cache <<-\EOF &&
--
2.26.2

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 4/4] fsck: detect more in-tree d/f conflicts
  2020-05-11 16:25             ` Junio C Hamano
                                 ` (4 preceding siblings ...)
  2020-05-21  9:52               ` [PATCH 3/4] t1450: demonstrate undetected in-tree d/f conflict René Scharfe
@ 2020-05-21  9:52               ` René Scharfe
  5 siblings, 0 replies; 23+ messages in thread
From: René Scharfe @ 2020-05-21  9:52 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Brandon Williams, git, Jeff King

If the conflict candidate file name from the top of the stack is not a
prefix of the current candiate directory then we can discard it as no
matching directory can come up later.  But we are not done checking the
candidate directory -- the stack might still hold a matching file name,
so stay in the loop and check the next candidate file name.

Signed-off-by: René Scharfe <l.s.r@web.de>
---
 fsck.c          | 2 +-
 t/t1450-fsck.sh | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fsck.c b/fsck.c
index b48426262c..f82e2fe9e3 100644
--- a/fsck.c
+++ b/fsck.c
@@ -620,7 +620,7 @@ static int verify_ordered(unsigned mode1, const char *name1,
 			if (!f_name)
 				break;
 			if (!skip_prefix(name2, f_name, &p))
-				break;
+				continue;
 			if (!*p)
 				return TREE_HAS_DUPS;
 			if (is_less_than_slash(*p)) {
diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh
index 5780e10cbc..344a2aad82 100755
--- a/t/t1450-fsck.sh
+++ b/t/t1450-fsck.sh
@@ -284,7 +284,7 @@ check_duplicate_names () {

 check_duplicate_names success x x.1 x/
 check_duplicate_names success x x.1.2 x.1/ x/
-check_duplicate_names failure x x.1 x.1.2 x/
+check_duplicate_names success x x.1 x.1.2 x/

 test_expect_success 'unparseable tree object' '
 	test_oid_cache <<-\EOF &&
--
2.26.2

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/4] fsck: fix a typo in a comment
  2020-05-21  9:52               ` [PATCH 1/4] fsck: fix a typo in a comment René Scharfe
@ 2020-05-21 10:10                 ` Denton Liu
  2020-05-21 11:15                 ` René Scharfe
  1 sibling, 0 replies; 23+ messages in thread
From: Denton Liu @ 2020-05-21 10:10 UTC (permalink / raw)
  To: René Scharfe; +Cc: Junio C Hamano, Brandon Williams, git, Jeff King

Hi René,

On Thu, May 21, 2020 at 11:52:04AM +0200, René Scharfe wrote:
> Reported-by: Junio C Hamano <gitster@pobox.com>

Missing sign-off? I'm not sure it matters since I don't think there's
anything here copyrightable but just pointing it out.

-Denton

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/4] t1450: increase test coverage of in-tree d/f detection
  2020-05-21  9:52               ` [PATCH 2/4] t1450: increase test coverage of in-tree d/f detection René Scharfe
@ 2020-05-21 10:20                 ` Denton Liu
  2020-05-21 13:31                   ` René Scharfe
  0 siblings, 1 reply; 23+ messages in thread
From: Denton Liu @ 2020-05-21 10:20 UTC (permalink / raw)
  To: René Scharfe; +Cc: Junio C Hamano, Brandon Williams, git, Jeff King

Hi René,

On Thu, May 21, 2020 at 11:52:28AM +0200, René Scharfe wrote:
> diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh
> index 91a6e34f38..9640ac8ff2 100755
> --- a/t/t1450-fsck.sh
> +++ b/t/t1450-fsck.sh
> @@ -257,21 +257,33 @@ test_expect_success 'tree object with duplicate entries' '
>  	test_i18ngrep "error in tree .*contains duplicate file entries" out
>  '
> 
> -test_expect_success 'tree object with dublicate names' '
> -	test_when_finished "remove_object \$blob" &&
> -	test_when_finished "remove_object \$tree" &&
> -	test_when_finished "remove_object \$badtree" &&
> -	blob=$(echo blob | git hash-object -w --stdin) &&
> -	printf "100644 blob %s\t%s\n" $blob x.2 >tree &&
> -	tree=$(git mktree <tree) &&
> -	printf "100644 blob %s\t%s\n" $blob x.1 >badtree &&
> -	printf "100644 blob %s\t%s\n" $blob x >>badtree &&
> -	printf "040000 tree %s\t%s\n" $tree x >>badtree &&
> -	badtree=$(git mktree <badtree) &&
> -	test_must_fail git fsck 2>out &&
> -	test_i18ngrep "$badtree" out &&
> -	test_i18ngrep "error in tree .*contains duplicate file entries" out
> -'
> +check_duplicate_names () {
> +	expect=$1 &&
> +	shift &&
> +	names=$@ &&

It doesn't really make sense to use $@ here since we're not using the
argument list behaviour of $@; we're just expanding it normally. I would
replace this with $* instead.

> +	test_expect_$expect "tree object with duplicate names: $names" '
> +		test_when_finished "remove_object \$blob" &&
> +		test_when_finished "remove_object \$tree" &&
> +		test_when_finished "remove_object \$badtree" &&
> +		blob=$(echo blob | git hash-object -w --stdin) &&
> +		printf "100644 blob %s\t%s\n" $blob x.2 >tree &&
> +		tree=$(git mktree <tree) &&
> +		for name in $names
> +		do
> +			case "$name" in
> +			*/) printf "040000 tree %s\t%s\n" $tree "${name%/}" ;;
> +			*)  printf "100644 blob %s\t%s\n" $blob "$name" ;;
> +			esac
> +		done >badtree &&
> +		badtree=$(git mktree <badtree) &&
> +		test_must_fail git fsck 2>out &&
> +		test_i18ngrep "$badtree" out &&
> +		test_i18ngrep "error in tree .*contains duplicate file entries" out
> +	'
> +}
> +
> +check_duplicate_names success x x.1 x/
> +check_duplicate_names success x x.1.2 x.1/ x/
> 
>  test_expect_success 'unparseable tree object' '
>  	test_oid_cache <<-\EOF &&
> --
> 2.26.2

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/4] fsck: fix a typo in a comment
  2020-05-21  9:52               ` [PATCH 1/4] fsck: fix a typo in a comment René Scharfe
  2020-05-21 10:10                 ` Denton Liu
@ 2020-05-21 11:15                 ` René Scharfe
  1 sibling, 0 replies; 23+ messages in thread
From: René Scharfe @ 2020-05-21 11:15 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Brandon Williams, git, Jeff King, Denton Liu

Am 21.05.20 um 11:52 schrieb René Scharfe:
> Reported-by: Junio C Hamano <gitster@pobox.com>

Forgot to add:

Signed-off-by: René Scharfe <l.s.r@web.de>

Thanks for noticing, Denton!

> ---
>  fsck.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fsck.c b/fsck.c
> index 8bb3ecf282..b48426262c 100644
> --- a/fsck.c
> +++ b/fsck.c
> @@ -598,7 +598,7 @@ static int verify_ordered(unsigned mode1, const char *name1,
>
>  	/*
>  	 * There can be non-consecutive duplicates due to the implicitly
> -	 * add slash, e.g.:
> +	 * added slash, e.g.:
>  	 *
>  	 *   foo
>  	 *   foo.bar
> --
> 2.26.2
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/4] t1450: increase test coverage of in-tree d/f detection
  2020-05-21 10:20                 ` Denton Liu
@ 2020-05-21 13:31                   ` René Scharfe
  2020-05-21 18:01                     ` Junio C Hamano
  0 siblings, 1 reply; 23+ messages in thread
From: René Scharfe @ 2020-05-21 13:31 UTC (permalink / raw)
  To: Denton Liu; +Cc: Junio C Hamano, Brandon Williams, git, Jeff King

Am 21.05.20 um 12:20 schrieb Denton Liu:
> Hi René,
>
> On Thu, May 21, 2020 at 11:52:28AM +0200, René Scharfe wrote:
>> diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh
>> index 91a6e34f38..9640ac8ff2 100755
>> --- a/t/t1450-fsck.sh
>> +++ b/t/t1450-fsck.sh
>> @@ -257,21 +257,33 @@ test_expect_success 'tree object with duplicate entries' '
>>  	test_i18ngrep "error in tree .*contains duplicate file entries" out
>>  '
>>
>> -test_expect_success 'tree object with dublicate names' '
>> -	test_when_finished "remove_object \$blob" &&
>> -	test_when_finished "remove_object \$tree" &&
>> -	test_when_finished "remove_object \$badtree" &&
>> -	blob=$(echo blob | git hash-object -w --stdin) &&
>> -	printf "100644 blob %s\t%s\n" $blob x.2 >tree &&
>> -	tree=$(git mktree <tree) &&
>> -	printf "100644 blob %s\t%s\n" $blob x.1 >badtree &&
>> -	printf "100644 blob %s\t%s\n" $blob x >>badtree &&
>> -	printf "040000 tree %s\t%s\n" $tree x >>badtree &&
>> -	badtree=$(git mktree <badtree) &&
>> -	test_must_fail git fsck 2>out &&
>> -	test_i18ngrep "$badtree" out &&
>> -	test_i18ngrep "error in tree .*contains duplicate file entries" out
>> -'
>> +check_duplicate_names () {
>> +	expect=$1 &&
>> +	shift &&
>> +	names=$@ &&
>
> It doesn't really make sense to use $@ here since we're not using the
> argument list behaviour of $@; we're just expanding it normally. I would
> replace this with $* instead.

The assignment to $names flattens the list, so $@ and $* behave the same
here.  I didn't think much about it, but it would be nice to support names
that contain spaces, which we could do by writing the arguments to a file.
And if we go that route then support for names with newlines would be nice
as well.  I'm not sure it's worth it, but a patch for that is below.

At least I'd like to keep the $@ as kind of a reminder that we want to
pass on arguments (full names), not words.

>
>> +	test_expect_$expect "tree object with duplicate names: $names" '
>> +		test_when_finished "remove_object \$blob" &&
>> +		test_when_finished "remove_object \$tree" &&
>> +		test_when_finished "remove_object \$badtree" &&
>> +		blob=$(echo blob | git hash-object -w --stdin) &&
>> +		printf "100644 blob %s\t%s\n" $blob x.2 >tree &&
>> +		tree=$(git mktree <tree) &&
>> +		for name in $names
>> +		do
>> +			case "$name" in
>> +			*/) printf "040000 tree %s\t%s\n" $tree "${name%/}" ;;
>> +			*)  printf "100644 blob %s\t%s\n" $blob "$name" ;;
>> +			esac
>> +		done >badtree &&
>> +		badtree=$(git mktree <badtree) &&
>> +		test_must_fail git fsck 2>out &&
>> +		test_i18ngrep "$badtree" out &&
>> +		test_i18ngrep "error in tree .*contains duplicate file entries" out
>> +	'
>> +}
>> +
>> +check_duplicate_names success x x.1 x/
>> +check_duplicate_names success x x.1.2 x.1/ x/
>>
>>  test_expect_success 'unparseable tree object' '
>>  	test_oid_cache <<-\EOF &&
>> --
>> 2.26.2

-- >8 --
Subject: [PATCH 5/4] t1450: support names with spaces in check_duplicate_names()

Signed-off-by: René Scharfe <l.s.r@web.de>
---
 t/t1450-fsck.sh | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh
index 344a2aad82..b1766c8e11 100755
--- a/t/t1450-fsck.sh
+++ b/t/t1450-fsck.sh
@@ -257,10 +257,20 @@ test_expect_success 'tree object with duplicate entries' '
 	test_i18ngrep "error in tree .*contains duplicate file entries" out
 '

+names_to_tree () {
+	awk -v blob="$1" -v tree="$2" -v RS='\0' -v FS='/' '
+		NF == 1 {printf "100644 blob %s\t%s%c", blob, $1, 0}
+		NF == 2 {printf "040000 tree %s\t%s%c", tree, $1, 0}
+	'
+}
+
 check_duplicate_names () {
 	expect=$1 &&
 	shift &&
-	names=$@ &&
+	for name in "$@"
+	do
+		printf "%s\0" "$name"
+	done >names &&
 	test_expect_$expect "tree object with duplicate names: $names" '
 		test_when_finished "remove_object \$blob" &&
 		test_when_finished "remove_object \$tree" &&
@@ -268,14 +278,8 @@ check_duplicate_names () {
 		blob=$(echo blob | git hash-object -w --stdin) &&
 		printf "100644 blob %s\t%s\n" $blob x.2 >tree &&
 		tree=$(git mktree <tree) &&
-		for name in $names
-		do
-			case "$name" in
-			*/) printf "040000 tree %s\t%s\n" $tree "${name%/}" ;;
-			*)  printf "100644 blob %s\t%s\n" $blob "$name" ;;
-			esac
-		done >badtree &&
-		badtree=$(git mktree <badtree) &&
+		names_to_tree $blob $tree <names >badtree &&
+		badtree=$(git mktree -z <badtree) &&
 		test_must_fail git fsck 2>out &&
 		test_i18ngrep "$badtree" out &&
 		test_i18ngrep "error in tree .*contains duplicate file entries" out
--
2.26.2

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/4] t1450: increase test coverage of in-tree d/f detection
  2020-05-21 13:31                   ` René Scharfe
@ 2020-05-21 18:01                     ` Junio C Hamano
  0 siblings, 0 replies; 23+ messages in thread
From: Junio C Hamano @ 2020-05-21 18:01 UTC (permalink / raw)
  To: René Scharfe; +Cc: Denton Liu, Brandon Williams, git, Jeff King

René Scharfe <l.s.r@web.de> writes:

>>> +check_duplicate_names () {
>>> +	expect=$1 &&
>>> +	shift &&
>>> +	names=$@ &&
>>
>> It doesn't really make sense to use $@ here since we're not using the
>> argument list behaviour of $@; we're just expanding it normally. I would
>> replace this with $* instead.
>
> The assignment to $names flattens the list, so $@ and $* behave the same
> here.
> ...
> At least I'd like to keep the $@ as kind of a reminder that we want to
> pass on arguments (full names), not words.

I personally prefer to use "$*" when we are not invoking the "list"
magic of "$@" and switch it to "$@" when it starts to matter, but I
can also understand your "reminder value" reasoning, so I am on the
fence.

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, back to index

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-09  6:19 invalid tree and commit object Brandon Williams
2020-05-09 10:16 ` René Scharfe
2020-05-09  7:16   ` Johannes Schindelin
2020-05-09 11:51     ` René Scharfe
2020-05-09 17:28   ` Junio C Hamano
2020-05-09 19:24     ` René Scharfe
2020-05-09 20:27       ` Junio C Hamano
2020-05-10  9:07         ` René Scharfe
2020-05-10 16:12           ` René Scharfe
2020-05-11 16:25             ` Junio C Hamano
2020-05-13 16:27               ` Brandon Williams
2020-05-21  9:51               ` René Scharfe
2020-05-21  9:52               ` [PATCH 1/4] fsck: fix a typo in a comment René Scharfe
2020-05-21 10:10                 ` Denton Liu
2020-05-21 11:15                 ` René Scharfe
2020-05-21  9:52               ` [PATCH 2/4] t1450: increase test coverage of in-tree d/f detection René Scharfe
2020-05-21 10:20                 ` Denton Liu
2020-05-21 13:31                   ` René Scharfe
2020-05-21 18:01                     ` Junio C Hamano
2020-05-21  9:52               ` [PATCH 3/4] t1450: demonstrate undetected in-tree d/f conflict René Scharfe
2020-05-21  9:52               ` [PATCH 4/4] fsck: detect more in-tree d/f conflicts René Scharfe
2020-05-10 16:37           ` invalid tree and commit object Junio C Hamano
2020-05-21  9:51             ` René Scharfe

Git Mailing List Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/git/0 git/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 git git/ https://lore.kernel.org/git \
		git@vger.kernel.org
	public-inbox-index git

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.git


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git