git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/20] bounds-checks for chunk-based files
@ 2023-10-09 20:55 Jeff King
  2023-10-09 20:58 ` [PATCH 01/20] chunk-format: note that pair_chunk() is unsafe Jeff King
                   ` (22 more replies)
  0 siblings, 23 replies; 67+ messages in thread
From: Jeff King @ 2023-10-09 20:55 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

As part of my -Wunused-parameter series, I noticed that a few callbacks
used with the chunk-format API ignored the "chunk_size" parameter. I had
initially just annotated these with UNUSED under the usual "well, not
all callbacks need all parameters" logic.

But if you think about it, a chunk callback that does not look at the
chunk size _must_ be buggy, as there is no way it could ensure that it
does not read past the end of the chunk. In a well-formed file this
isn't a problem, since most chunks have expected sizes (e.g., a
commit-graph has a fixed-size commit-data record for every commit
mentioned in its index chunk). But it is very easy to get out-of-bounds
reads for files that are not well-formed.

I think the security implications here are pretty minor. The only files
which use the chunk format are multi-pack-index and commit-graph,
neither of which we'd expect to receive over the network (so you'd need
to access an untrusted tarball, etc, to even see a malicious file). And
we'd never try to write to this memory (they're read-only mmaps of the
files). So your worst case is probably an unexpected out-of-bounds read
and segfault, on a file that is hard to put in front of the victim.

But I think it's still worth fixing. The extra checks aren't very
expensive, and let us handle bugs or corruption more gracefully, as
well.

It's a lot of patches because there are a lot of chunk types. ;) But
each one is hopefully pretty straightforward in isolation. I tried to
group similar chunks together (e.g., commit-graph and midx both have
OIDF and OIDL chunks), but otherwise just fixed the midx chunks in the
order they appear in the code, followed by the same for commit-graph.

  [01/20]: chunk-format: note that pair_chunk() is unsafe
  [02/20]: t: add library for munging chunk-format files
  [03/20]: midx: stop ignoring malformed oid fanout chunk
  [04/20]: commit-graph: check size of oid fanout chunk
  [05/20]: midx: check size of oid lookup chunk
  [06/20]: commit-graph: check consistency of fanout table
  [07/20]: midx: check size of pack names chunk
  [08/20]: midx: enforce chunk alignment on reading
  [09/20]: midx: check size of object offset chunk
  [10/20]: midx: bounds-check large offset chunk
  [11/20]: midx: check size of revindex chunk
  [12/20]: commit-graph: check size of commit data chunk
  [13/20]: commit-graph: detect out-of-bounds extra-edges pointers
  [14/20]: commit-graph: bounds-check base graphs chunk
  [15/20]: commit-graph: check size of generations chunk
  [16/20]: commit-graph: bounds-check generation overflow chunk
  [17/20]: commit-graph: check bounds when accessing BDAT chunk
  [18/20]: commit-graph: check bounds when accessing BIDX chunk
  [19/20]: commit-graph: detect out-of-order BIDX offsets
  [20/20]: chunk-format: drop pair_chunk_unsafe()

 bloom.c                            |  34 +++++++++
 chunk-format.c                     |  24 ++++--
 chunk-format.h                     |   9 ++-
 commit-graph.c                     | 119 ++++++++++++++++++++++++-----
 commit-graph.h                     |   4 +
 midx.c                             |  68 +++++++++++++----
 midx.h                             |   3 +
 pack-revindex.c                    |  13 +++-
 t/lib-chunk.sh                     |  17 +++++
 t/lib-chunk/corrupt-chunk-file.pl  |  66 ++++++++++++++++
 t/t4216-log-bloom.sh               |  50 ++++++++++++
 t/t5318-commit-graph.sh            |  76 +++++++++++++++++-
 t/t5319-multi-pack-index.sh        | 102 ++++++++++++++++++++++++-
 t/t5324-split-commit-graph.sh      |  20 ++++-
 t/t5328-commit-graph-64bit-time.sh |  10 +++
 15 files changed, 568 insertions(+), 47 deletions(-)
 create mode 100644 t/lib-chunk.sh
 create mode 100644 t/lib-chunk/corrupt-chunk-file.pl

-Peff

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 01/20] chunk-format: note that pair_chunk() is unsafe
  2023-10-09 20:55 [PATCH 0/20] bounds-checks for chunk-based files Jeff King
@ 2023-10-09 20:58 ` Jeff King
  2023-10-10 23:45   ` Taylor Blau
  2023-10-09 20:58 ` [PATCH 02/20] t: add library for munging chunk-format files Jeff King
                   ` (21 subsequent siblings)
  22 siblings, 1 reply; 67+ messages in thread
From: Jeff King @ 2023-10-09 20:58 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

The pair_chunk() function is provided as an easy helper for parsing
chunks that just want a pointer to a set of bytes. But every caller has
a hidden bug: because we return only the pointer without the matching
chunk size, the callers have no clue how many bytes they are allowed to
look at. And as a result, they may read off the end of the mmap'd data
when the on-disk file does not match their expectations.

Since chunk files are typically used for local-repository data like
commit-graph files and midx's, the security implications here are pretty
mild. The worst that can happen is that you hand somebody a corrupted
repository tarball, and running Git on it does an out-of-bounds read and
crashes. So it's worth being more defensive, but we don't need to drop
everything and fix every caller immediately.

I noticed the problem because the pair_chunk_fn() callback does not look
at its chunk_size argument, and wanted to annotate it to silence
-Wunused-parameter. We could do that now, but we'd lose the hint that
this code should be audited and fixed.

So instead, let's set ourselves up for going down that path:

  1. Provide a pair_chunk() function that does return the size, which
     prepares us for fixing these cases.

  2. Rename the existing function to pair_chunk_unsafe(). That gives us
     an easy way to grep for cases which still need to be fixed, and the
     name should cause anybody adding new calls to think twice before
     using it.

There are no callers of the "safe" version yet, but we'll add some in
subsequent patches.

Signed-off-by: Jeff King <peff@peff.net>
---
I originally wasn't sure if I'd convert every case, or if we'd have to
leave some longer-term. But as you'll see at the end, we can get rid of
pair_chunk_unsafe() in the final patch (but we need both forms to
co-exist during the series, since the intermediate state is that we have
some safe and some unsafe calls).

 chunk-format.c | 24 ++++++++++++++++++++----
 chunk-format.h | 19 +++++++++++++++++--
 commit-graph.c | 14 +++++++-------
 midx.c         | 10 +++++-----
 4 files changed, 49 insertions(+), 18 deletions(-)

diff --git a/chunk-format.c b/chunk-format.c
index 140dfa0dcc..8d910e23f6 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -154,20 +154,36 @@ int read_table_of_contents(struct chunkfile *cf,
 	return 0;
 }
 
+struct pair_chunk_data {
+	const unsigned char **p;
+	size_t *size;
+};
+
 static int pair_chunk_fn(const unsigned char *chunk_start,
 			 size_t chunk_size,
 			 void *data)
 {
-	const unsigned char **p = data;
-	*p = chunk_start;
+	struct pair_chunk_data *pcd = data;
+	*pcd->p = chunk_start;
+	*pcd->size = chunk_size;
 	return 0;
 }
 
 int pair_chunk(struct chunkfile *cf,
 	       uint32_t chunk_id,
-	       const unsigned char **p)
+	       const unsigned char **p,
+	       size_t *size)
+{
+	struct pair_chunk_data pcd = { .p = p, .size = size };
+	return read_chunk(cf, chunk_id, pair_chunk_fn, &pcd);
+}
+
+int pair_chunk_unsafe(struct chunkfile *cf,
+		      uint32_t chunk_id,
+		      const unsigned char **p)
 {
-	return read_chunk(cf, chunk_id, pair_chunk_fn, p);
+	size_t dummy;
+	return pair_chunk(cf, chunk_id, p, &dummy);
 }
 
 int read_chunk(struct chunkfile *cf,
diff --git a/chunk-format.h b/chunk-format.h
index c7794e84ad..8dce7967f4 100644
--- a/chunk-format.h
+++ b/chunk-format.h
@@ -43,13 +43,28 @@ int read_table_of_contents(struct chunkfile *cf,
 /*
  * Find 'chunk_id' in the given chunkfile and assign the
  * given pointer to the position in the mmap'd file where
- * that chunk begins.
+ * that chunk begins. Likewise the "size" parameter is filled
+ * with the size of the chunk.
  *
  * Returns CHUNK_NOT_FOUND if the chunk does not exist.
  */
 int pair_chunk(struct chunkfile *cf,
 	       uint32_t chunk_id,
-	       const unsigned char **p);
+	       const unsigned char **p,
+	       size_t *size);
+
+/*
+ * Unsafe version of pair_chunk; it does not return the size,
+ * meaning that the caller cannot possibly be careful about
+ * reading out of bounds from the mapped memory.
+ *
+ * No new callers should use this function, and old callers should
+ * be audited and migrated over to using the regular pair_chunk()
+ * function.
+ */
+int pair_chunk_unsafe(struct chunkfile *cf,
+		      uint32_t chunk_id,
+		      const unsigned char **p);
 
 typedef int (*chunk_read_fn)(const unsigned char *chunk_start,
 			     size_t chunk_size, void *data);
diff --git a/commit-graph.c b/commit-graph.c
index 1a56efcf69..a689a55b79 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -394,25 +394,25 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 				   GRAPH_HEADER_SIZE, graph->num_chunks))
 		goto free_and_return;
 
-	pair_chunk(cf, GRAPH_CHUNKID_OIDFANOUT,
+	pair_chunk_unsafe(cf, GRAPH_CHUNKID_OIDFANOUT,
 		   (const unsigned char **)&graph->chunk_oid_fanout);
 	read_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP, graph_read_oid_lookup, graph);
-	pair_chunk(cf, GRAPH_CHUNKID_DATA, &graph->chunk_commit_data);
-	pair_chunk(cf, GRAPH_CHUNKID_EXTRAEDGES, &graph->chunk_extra_edges);
-	pair_chunk(cf, GRAPH_CHUNKID_BASE, &graph->chunk_base_graphs);
+	pair_chunk_unsafe(cf, GRAPH_CHUNKID_DATA, &graph->chunk_commit_data);
+	pair_chunk_unsafe(cf, GRAPH_CHUNKID_EXTRAEDGES, &graph->chunk_extra_edges);
+	pair_chunk_unsafe(cf, GRAPH_CHUNKID_BASE, &graph->chunk_base_graphs);
 
 	if (s->commit_graph_generation_version >= 2) {
-		pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA,
+		pair_chunk_unsafe(cf, GRAPH_CHUNKID_GENERATION_DATA,
 			&graph->chunk_generation_data);
-		pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
+		pair_chunk_unsafe(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
 			&graph->chunk_generation_data_overflow);
 
 		if (graph->chunk_generation_data)
 			graph->read_generation_data = 1;
 	}
 
 	if (s->commit_graph_read_changed_paths) {
-		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
+		pair_chunk_unsafe(cf, GRAPH_CHUNKID_BLOOMINDEXES,
 			   &graph->chunk_bloom_indexes);
 		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
 			   graph_read_bloom_data, graph);
diff --git a/midx.c b/midx.c
index 931f557350..3165218ab5 100644
--- a/midx.c
+++ b/midx.c
@@ -143,19 +143,19 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 				   MIDX_HEADER_SIZE, m->num_chunks))
 		goto cleanup_fail;
 
-	if (pair_chunk(cf, MIDX_CHUNKID_PACKNAMES, &m->chunk_pack_names) == CHUNK_NOT_FOUND)
+	if (pair_chunk_unsafe(cf, MIDX_CHUNKID_PACKNAMES, &m->chunk_pack_names) == CHUNK_NOT_FOUND)
 		die(_("multi-pack-index missing required pack-name chunk"));
 	if (read_chunk(cf, MIDX_CHUNKID_OIDFANOUT, midx_read_oid_fanout, m) == CHUNK_NOT_FOUND)
 		die(_("multi-pack-index missing required OID fanout chunk"));
-	if (pair_chunk(cf, MIDX_CHUNKID_OIDLOOKUP, &m->chunk_oid_lookup) == CHUNK_NOT_FOUND)
+	if (pair_chunk_unsafe(cf, MIDX_CHUNKID_OIDLOOKUP, &m->chunk_oid_lookup) == CHUNK_NOT_FOUND)
 		die(_("multi-pack-index missing required OID lookup chunk"));
-	if (pair_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS, &m->chunk_object_offsets) == CHUNK_NOT_FOUND)
+	if (pair_chunk_unsafe(cf, MIDX_CHUNKID_OBJECTOFFSETS, &m->chunk_object_offsets) == CHUNK_NOT_FOUND)
 		die(_("multi-pack-index missing required object offsets chunk"));
 
-	pair_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS, &m->chunk_large_offsets);
+	pair_chunk_unsafe(cf, MIDX_CHUNKID_LARGEOFFSETS, &m->chunk_large_offsets);
 
 	if (git_env_bool("GIT_TEST_MIDX_READ_RIDX", 1))
-		pair_chunk(cf, MIDX_CHUNKID_REVINDEX, &m->chunk_revindex);
+		pair_chunk_unsafe(cf, MIDX_CHUNKID_REVINDEX, &m->chunk_revindex);
 
 	m->num_objects = ntohl(m->chunk_oid_fanout[255]);
 
-- 
2.42.0.884.g35e1fe1a6a


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 02/20] t: add library for munging chunk-format files
  2023-10-09 20:55 [PATCH 0/20] bounds-checks for chunk-based files Jeff King
  2023-10-09 20:58 ` [PATCH 01/20] chunk-format: note that pair_chunk() is unsafe Jeff King
@ 2023-10-09 20:58 ` Jeff King
  2023-10-10 23:47   ` Taylor Blau
  2023-10-09 20:59 ` [PATCH 03/20] midx: stop ignoring malformed oid fanout chunk Jeff King
                   ` (20 subsequent siblings)
  22 siblings, 1 reply; 67+ messages in thread
From: Jeff King @ 2023-10-09 20:58 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

When testing corruption of files using the chunk format (like
commit-graphs and midx files), it's helpful to be able to modify bytes
in specific chunks. This requires being able both to read the
table-of-contents (to find the chunk to modify) but also to adjust it
(to account for size changes in the offsets of subsequent chunks).

We have some tests already which corrupt chunk files, but they have some
downsides:

  1. They are very brittle, as they manually compute the expected size
     of a particular instance of the file (e.g., see the definitions
     starting with NUM_OBJECTS in t5319).

  2. Because they rely on manual offsets and don't read the
     table-of-contents, they're limited to overwriting bytes. But there
     are many interesting corruptions that involve changing the sizes of
     chunks (especially smaller-than-expected ones).

This patch adds a perl script which makes such corruptions easy. We'll
use it in subsequent patches.

Note that we could get by with just a big "perl -e" inside the helper
function. I chose to put it in a separate script for two reasons. One,
so we don't have to worry about the extra layer of shell quoting. And
two, the script is kind of big, and running the tests with "-x" would
repeatedly dump it into the log output.

Signed-off-by: Jeff King <peff@peff.net>
---
 t/lib-chunk.sh                    | 17 ++++++++
 t/lib-chunk/corrupt-chunk-file.pl | 66 +++++++++++++++++++++++++++++++
 2 files changed, 83 insertions(+)
 create mode 100644 t/lib-chunk.sh
 create mode 100644 t/lib-chunk/corrupt-chunk-file.pl

diff --git a/t/lib-chunk.sh b/t/lib-chunk.sh
new file mode 100644
index 0000000000..a7cd9c3c6d
--- /dev/null
+++ b/t/lib-chunk.sh
@@ -0,0 +1,17 @@
+# Shell library for working with "chunk" files (commit-graph, midx, etc).
+
+# corrupt_chunk_file <fn> <chunk> <offset> <bytes>
+#
+# Corrupt a chunk-based file (like a commit-graph) by overwriting the bytes
+# found in the chunk specified by the 4-byte <chunk> identifier. If <offset> is
+# "clear", replace the chunk entirely. Otherwise, overwrite data <offset> bytes
+# into the chunk.
+#
+# The <bytes> are interpreted as pairs of hex digits (so "000000FE" would be
+# big-endian 254).
+corrupt_chunk_file () {
+	fn=$1; shift
+	perl "$TEST_DIRECTORY"/lib-chunk/corrupt-chunk-file.pl \
+		"$@" <"$fn" >"$fn.tmp" &&
+	mv "$fn.tmp" "$fn"
+}
diff --git a/t/lib-chunk/corrupt-chunk-file.pl b/t/lib-chunk/corrupt-chunk-file.pl
new file mode 100644
index 0000000000..cd6d386fef
--- /dev/null
+++ b/t/lib-chunk/corrupt-chunk-file.pl
@@ -0,0 +1,66 @@
+#!/usr/bin/perl
+
+my ($chunk, $seek, $bytes) = @ARGV;
+$bytes =~ s/../chr(hex($&))/ge;
+
+binmode STDIN;
+binmode STDOUT;
+
+# A few helpers to read bytes, or read and copy them to the
+# output.
+sub get {
+	my $n = shift;
+	return unless $n;
+	read(STDIN, my $buf, $n)
+		or die "read error or eof: $!\n";
+	return $buf;
+}
+sub copy {
+	my $buf = get(@_);
+	print $buf;
+	return $buf;
+}
+
+# read until we find table-of-contents entry for chunk;
+# note that we cheat a bit by assuming 4-byte alignment and
+# that no ToC entry will accidentally look like a header.
+#
+# If we don't find the entry, copy() will hit EOF and exit
+# (which should cause the caller to fail the test).
+while (copy(4) ne $chunk) { }
+my $offset = unpack("Q>", copy(8));
+
+# In clear mode, our length will change. So figure out
+# the length by comparing to the offset of the next chunk, and
+# then adjust that offset (and all subsequent) ones.
+my $len;
+if ($seek eq "clear") {
+	my $id;
+	do {
+		$id = copy(4);
+		my $next = unpack("Q>", get(8));
+		if (!defined $len) {
+			$len = $next - $offset;
+		}
+		print pack("Q>", $next - $len + length($bytes));
+	} while (unpack("N", $id));
+}
+
+# and now copy up to our existing chunk data
+copy($offset - tell(STDIN));
+if ($seek eq "clear") {
+	# if clearing, skip past existing data
+	get($len);
+} else {
+	# otherwise, copy up to the requested offset,
+	# and skip past the overwritten bytes
+	copy($seek);
+	get(length($bytes));
+}
+
+# now write out the requested bytes, along
+# with any other remaining data
+print $bytes;
+while (read(STDIN, my $buf, 4096)) {
+	print $buf;
+}
-- 
2.42.0.884.g35e1fe1a6a


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 03/20] midx: stop ignoring malformed oid fanout chunk
  2023-10-09 20:55 [PATCH 0/20] bounds-checks for chunk-based files Jeff King
  2023-10-09 20:58 ` [PATCH 01/20] chunk-format: note that pair_chunk() is unsafe Jeff King
  2023-10-09 20:58 ` [PATCH 02/20] t: add library for munging chunk-format files Jeff King
@ 2023-10-09 20:59 ` Jeff King
  2023-10-10 23:50   ` Taylor Blau
  2023-10-09 20:59 ` [PATCH 04/20] commit-graph: check size of " Jeff King
                   ` (19 subsequent siblings)
  22 siblings, 1 reply; 67+ messages in thread
From: Jeff King @ 2023-10-09 20:59 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

When we load the oid-fanout chunk, our callback checks that its size is
reasonable and returns an error if not. However, the caller only checks
our return value against CHUNK_NOT_FOUND, so we end up ignoring the
error completely! Using a too-small fanout table means we end up
accessing random memory for the fanout and segfault.

We can fix this by checking for any non-zero return value, rather than
just CHUNK_NOT_FOUND, and adjusting our error message to cover both
cases. We could handle each error code individually, but there's not
much point for such a rare case. The extra message produced in the
callback makes it clear what is going on.

The same pattern is used in the adjacent code. Those cases are actually
OK for now because they do not use a custom callback, so the only error
they can get is CHUNK_NOT_FOUND. But let's convert them, as this is an
accident waiting to happen (especially as we convert some of them away
from pair_chunk). The error messages are more verbose, but it should be
rare for a user to see these anyway.

Signed-off-by: Jeff King <peff@peff.net>
---
 midx.c                      | 16 ++++++++--------
 t/t5319-multi-pack-index.sh | 20 +++++++++++++++++++-
 2 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/midx.c b/midx.c
index 3165218ab5..21d7dd15ef 100644
--- a/midx.c
+++ b/midx.c
@@ -143,14 +143,14 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 				   MIDX_HEADER_SIZE, m->num_chunks))
 		goto cleanup_fail;
 
-	if (pair_chunk_unsafe(cf, MIDX_CHUNKID_PACKNAMES, &m->chunk_pack_names) == CHUNK_NOT_FOUND)
-		die(_("multi-pack-index missing required pack-name chunk"));
-	if (read_chunk(cf, MIDX_CHUNKID_OIDFANOUT, midx_read_oid_fanout, m) == CHUNK_NOT_FOUND)
-		die(_("multi-pack-index missing required OID fanout chunk"));
-	if (pair_chunk_unsafe(cf, MIDX_CHUNKID_OIDLOOKUP, &m->chunk_oid_lookup) == CHUNK_NOT_FOUND)
-		die(_("multi-pack-index missing required OID lookup chunk"));
-	if (pair_chunk_unsafe(cf, MIDX_CHUNKID_OBJECTOFFSETS, &m->chunk_object_offsets) == CHUNK_NOT_FOUND)
-		die(_("multi-pack-index missing required object offsets chunk"));
+	if (pair_chunk_unsafe(cf, MIDX_CHUNKID_PACKNAMES, &m->chunk_pack_names))
+		die(_("multi-pack-index required pack-name chunk missing or corrupted"));
+	if (read_chunk(cf, MIDX_CHUNKID_OIDFANOUT, midx_read_oid_fanout, m))
+		die(_("multi-pack-index required OID fanout chunk missing or corrupted"));
+	if (pair_chunk_unsafe(cf, MIDX_CHUNKID_OIDLOOKUP, &m->chunk_oid_lookup))
+		die(_("multi-pack-index required OID lookup chunk missing or corrupted"));
+	if (pair_chunk_unsafe(cf, MIDX_CHUNKID_OBJECTOFFSETS, &m->chunk_object_offsets))
+		die(_("multi-pack-index required object offsets chunk missing or corrupted"));
 
 	pair_chunk_unsafe(cf, MIDX_CHUNKID_LARGEOFFSETS, &m->chunk_large_offsets);
 
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 1bcc02004d..b8fe85aeba 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -2,6 +2,7 @@
 
 test_description='multi-pack-indexes'
 . ./test-lib.sh
+. "$TEST_DIRECTORY"/lib-chunk.sh
 
 GIT_TEST_MULTI_PACK_INDEX=0
 objdir=.git/objects
@@ -438,7 +439,7 @@ test_expect_success 'verify extended chunk count' '
 
 test_expect_success 'verify missing required chunk' '
 	corrupt_midx_and_verify $MIDX_BYTE_CHUNK_ID "\01" $objdir \
-		"missing required"
+		"required pack-name chunk missing"
 '
 
 test_expect_success 'verify invalid chunk offset' '
@@ -1055,4 +1056,21 @@ test_expect_success 'repack with delta islands' '
 	)
 '
 
+corrupt_chunk () {
+	midx=.git/objects/pack/multi-pack-index &&
+	test_when_finished "rm -rf $midx" &&
+	git repack -ad --write-midx &&
+	corrupt_chunk_file $midx "$@"
+}
+
+test_expect_success 'reader notices too-small oid fanout chunk' '
+	corrupt_chunk OIDF clear 00000000 &&
+	test_must_fail git log 2>err &&
+	cat >expect <<-\EOF &&
+	error: multi-pack-index OID fanout is of the wrong size
+	fatal: multi-pack-index required OID fanout chunk missing or corrupted
+	EOF
+	test_cmp expect err
+'
+
 test_done
-- 
2.42.0.884.g35e1fe1a6a


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 04/20] commit-graph: check size of oid fanout chunk
  2023-10-09 20:55 [PATCH 0/20] bounds-checks for chunk-based files Jeff King
                   ` (2 preceding siblings ...)
  2023-10-09 20:59 ` [PATCH 03/20] midx: stop ignoring malformed oid fanout chunk Jeff King
@ 2023-10-09 20:59 ` Jeff King
  2023-10-11  0:08   ` Taylor Blau
  2023-10-09 21:02 ` [PATCH 05/20] midx: check size of oid lookup chunk Jeff King
                   ` (18 subsequent siblings)
  22 siblings, 1 reply; 67+ messages in thread
From: Jeff King @ 2023-10-09 20:59 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

We load the oid fanout chunk with pair_chunk(), which means we never see
the size of the chunk. We just assume the on-disk file uses the
appropriate size, and if it's too small we'll access random memory.

It's easy to check this up-front; the fanout always consists of 256
uint32's, since it is a fanout of the first byte of the hash pointing
into the oid index. These parameters can't be changed without
introducing a new chunk type.

This matches the similar check in the midx OIDF chunk (but note that
rather than checking for the error immediately, the graph code just
leaves parts of the struct NULL and checks for required fields later).

Signed-off-by: Jeff King <peff@peff.net>
---
 commit-graph.c          | 13 +++++++++++--
 t/t5318-commit-graph.sh | 26 ++++++++++++++++++++++++++
 2 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index a689a55b79..9b3b01da61 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -305,6 +305,16 @@ static int verify_commit_graph_lite(struct commit_graph *g)
 	return 0;
 }
 
+static int graph_read_oid_fanout(const unsigned char *chunk_start,
+				 size_t chunk_size, void *data)
+{
+	struct commit_graph *g = data;
+	if (chunk_size != 256 * sizeof(uint32_t))
+		return error("commit-graph oid fanout chunk is wrong size");
+	g->chunk_oid_fanout = (const uint32_t *)chunk_start;
+	return 0;
+}
+
 static int graph_read_oid_lookup(const unsigned char *chunk_start,
 				 size_t chunk_size, void *data)
 {
@@ -394,8 +404,7 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 				   GRAPH_HEADER_SIZE, graph->num_chunks))
 		goto free_and_return;
 
-	pair_chunk_unsafe(cf, GRAPH_CHUNKID_OIDFANOUT,
-		   (const unsigned char **)&graph->chunk_oid_fanout);
+	read_chunk(cf, GRAPH_CHUNKID_OIDFANOUT, graph_read_oid_fanout, graph);
 	read_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP, graph_read_oid_lookup, graph);
 	pair_chunk_unsafe(cf, GRAPH_CHUNKID_DATA, &graph->chunk_commit_data);
 	pair_chunk_unsafe(cf, GRAPH_CHUNKID_EXTRAEDGES, &graph->chunk_extra_edges);
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index ba65f17dd9..d25bea3ec5 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -2,6 +2,7 @@
 
 test_description='commit graph'
 . ./test-lib.sh
+. "$TEST_DIRECTORY"/lib-chunk.sh
 
 GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
 
@@ -821,4 +822,29 @@ test_expect_success 'overflow during generation version upgrade' '
 	)
 '
 
+corrupt_chunk () {
+	graph=full/.git/objects/info/commit-graph &&
+	test_when_finished "rm -rf $graph" &&
+	git -C full commit-graph write --reachable &&
+	corrupt_chunk_file $graph "$@"
+}
+
+check_corrupt_chunk () {
+	corrupt_chunk "$@" &&
+	git -C full -c core.commitGraph=false log >expect.out &&
+	git -C full -c core.commitGraph=true log >out 2>err &&
+	test_cmp expect.out out
+}
+
+test_expect_success 'reader notices too-small oid fanout chunk' '
+	# make it big enough that the graph file is plausible,
+	# otherwise we hit an earlier check
+	check_corrupt_chunk OIDF clear $(printf "000000%02x" $(test_seq 250)) &&
+	cat >expect.err <<-\EOF &&
+	error: commit-graph oid fanout chunk is wrong size
+	error: commit-graph is missing the OID Fanout chunk
+	EOF
+	test_cmp expect.err err
+'
+
 test_done
-- 
2.42.0.884.g35e1fe1a6a


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 05/20] midx: check size of oid lookup chunk
  2023-10-09 20:55 [PATCH 0/20] bounds-checks for chunk-based files Jeff King
                   ` (3 preceding siblings ...)
  2023-10-09 20:59 ` [PATCH 04/20] commit-graph: check size of " Jeff King
@ 2023-10-09 21:02 ` Jeff King
  2023-10-09 21:04 ` [PATCH 06/20] commit-graph: check consistency of fanout table Jeff King
                   ` (17 subsequent siblings)
  22 siblings, 0 replies; 67+ messages in thread
From: Jeff King @ 2023-10-09 21:02 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

When reading an on-disk multi-pack-index, we take the number of objects
in the midx from the final value of the fanout table. But we just
blindly assume that the chunk containing the actual oid entries is the
correct size. This can lead to us reading out-of-bounds memory if the
lookup chunk is too small (or if the fanout is corrupted; when they
don't agree we cannot tell which one is wrong).

Note that we bump the assignment of m->num_objects into the fanout
parser callback, so that it's set when we parse the lookup table
(otherwise we'd have to manually record the lookup table size and check
it later).

Signed-off-by: Jeff King <peff@peff.net>
---
As an aside, the first hunk of the diff annoyingly slides the "return 0"
into the wrong spot. I thought this was our heuristics gone wrong, but I
suspect it's actually the shortest diff because of the one-line change
in midx_read_oid_fanout(), which would otherwise require extra context
to split.

 midx.c                      | 18 +++++++++++++++---
 t/t5319-multi-pack-index.sh | 10 ++++++++++
 2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/midx.c b/midx.c
index 21d7dd15ef..62e4c03e79 100644
--- a/midx.c
+++ b/midx.c
@@ -71,6 +71,20 @@ static int midx_read_oid_fanout(const unsigned char *chunk_start,
 		error(_("multi-pack-index OID fanout is of the wrong size"));
 		return 1;
 	}
+	m->num_objects = ntohl(m->chunk_oid_fanout[255]);
+	return 0;
+}
+
+static int midx_read_oid_lookup(const unsigned char *chunk_start,
+				size_t chunk_size, void *data)
+{
+	struct multi_pack_index *m = data;
+	m->chunk_oid_lookup = chunk_start;
+
+	if (chunk_size != st_mult(m->hash_len, m->num_objects)) {
+		error(_("multi-pack-index OID lookup chunk is the wrong size"));
+		return 1;
+	}
 	return 0;
 }
 
@@ -147,7 +161,7 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 		die(_("multi-pack-index required pack-name chunk missing or corrupted"));
 	if (read_chunk(cf, MIDX_CHUNKID_OIDFANOUT, midx_read_oid_fanout, m))
 		die(_("multi-pack-index required OID fanout chunk missing or corrupted"));
-	if (pair_chunk_unsafe(cf, MIDX_CHUNKID_OIDLOOKUP, &m->chunk_oid_lookup))
+	if (read_chunk(cf, MIDX_CHUNKID_OIDLOOKUP, midx_read_oid_lookup, m))
 		die(_("multi-pack-index required OID lookup chunk missing or corrupted"));
 	if (pair_chunk_unsafe(cf, MIDX_CHUNKID_OBJECTOFFSETS, &m->chunk_object_offsets))
 		die(_("multi-pack-index required object offsets chunk missing or corrupted"));
@@ -157,8 +171,6 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 	if (git_env_bool("GIT_TEST_MIDX_READ_RIDX", 1))
 		pair_chunk_unsafe(cf, MIDX_CHUNKID_REVINDEX, &m->chunk_revindex);
 
-	m->num_objects = ntohl(m->chunk_oid_fanout[255]);
-
 	CALLOC_ARRAY(m->pack_names, m->num_packs);
 	CALLOC_ARRAY(m->packs, m->num_packs);
 
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index b8fe85aeba..2722e495b2 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -1073,4 +1073,14 @@ test_expect_success 'reader notices too-small oid fanout chunk' '
 	test_cmp expect err
 '
 
+test_expect_success 'reader notices too-small oid lookup chunk' '
+	corrupt_chunk OIDL clear 00000000 &&
+	test_must_fail git log 2>err &&
+	cat >expect <<-\EOF &&
+	error: multi-pack-index OID lookup chunk is the wrong size
+	fatal: multi-pack-index required OID lookup chunk missing or corrupted
+	EOF
+	test_cmp expect err
+'
+
 test_done
-- 
2.42.0.884.g35e1fe1a6a


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 06/20] commit-graph: check consistency of fanout table
  2023-10-09 20:55 [PATCH 0/20] bounds-checks for chunk-based files Jeff King
                   ` (4 preceding siblings ...)
  2023-10-09 21:02 ` [PATCH 05/20] midx: check size of oid lookup chunk Jeff King
@ 2023-10-09 21:04 ` Jeff King
  2023-10-11 14:45   ` Taylor Blau
  2023-10-09 21:05 ` [PATCH 07/20] midx: check size of pack names chunk Jeff King
                   ` (16 subsequent siblings)
  22 siblings, 1 reply; 67+ messages in thread
From: Jeff King @ 2023-10-09 21:04 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

We use bsearch_hash() to look up items in the oid index of a
commit-graph. It also has a fanout table to reduce the initial range in
which we'll search. But since the fanout comes from the on-disk file, a
corrupted or malicious file can cause us to look outside of the
allocated index memory.

One solution here would be to pass the total table size to
bsearch_hash(), which could then bounds check the values it reads from
the fanout. But there's an inexpensive up-front check we can do, and
it's the same one used by the midx and pack idx code (both of which
likewise have fanout tables and use bsearch_hash(), but are not affected
by this bug):

  1. We can check the value of the final fanout entry against the size
     of the table we got from the index chunk. These must always match,
     since the fanout is just slicing up the index.

       As a side note, the midx and pack idx code compute it the other
       way around: they use the final fanout value as the object count, and
       check the index size against it. Either is valid; if they
       disagree we cannot know which is wrong (a corrupted fanout value,
       or a too-small table of oids).

  2. We can quickly scan the fanout table to make sure it is
     monotonically increasing. If it is, then we know that every value
     is less than or equal to the final value, and therefore less than
     or equal to the table size.

     It would also be sufficient to just check that each fanout value is
     smaller than the final one, but the midx and pack idx code both do
     a full monotonicity check. It's the same cost, and it catches some
     other corruptions (though not all; the checks done by "commit-graph
     verify" are more complete but more expensive, and our goal here is
     to be fast and memory-safe).

There are two new tests. One just checks the final fanout value (this is
the mirror image of the "too small oid lookup" case added for the midx
in the previous commit; it's flipped here because commit-graph considers
the oid lookup chunk to be the source of truth).

The other actually creates a fanout with many out-of-bounds entries, and
prior to this patch, it does cause the segfault you'd expect. But note
that the error is not "your fanout entry is out-of-bounds", but rather
"fanout value out of order". That's because we leave the final fanout
value in place (to get past the table size check), making the index
non-monotonic (the second-to-last entry is big, but the last one must
remain small to match the actual table).

We need adjustments to a few existing tests, as well:

  - an earlier test in t5318 corrupts the fanout and runs "commit-graph
    verify". Its message is now changed, since we catch the problem
    earlier (during the load step, rather than the careful validation
    step).

  - in t5324, we test that "commit-graph verify --shallow" does not do
    expensive verification on the base file of the chain. But the
    corruption it uses (munging a byte at offset 1000) happens to be in
    the middle of the fanout table. And now we detect that problem in
    the cheaper checks that are performed for every part of the graph.
    We'll push this back to offset 1500, which is only caught by the
    more expensive checksum validation.

    Likewise, there's a later test in t5324 which munges an offset 100
    bytes into a file (also in the fanout table) that is referenced by
    an alternates file. So we now find that corruption during the load
    step, rather than the verification step. At the very least we need
    to change the error message (like the case above in t5318). But it
    is probably good to make sure we handle all parts of the
    verification even for alternate graph files. So let's likewise
    corrupt byte 1500 and make sure we found the invalid checksum.

Signed-off-by: Jeff King <peff@peff.net>
---
So I actually implemented the bsearch_hash() bounds checks and wrote
tests for midx and idx files before realizing how they handle this. ;)
Which makes sense, because the usual outcome for a corrupted idx file is
for it to say "non-monotonic index", which I have seen lead to user
confusion. Arguably we should have it say something about "hey, your idx
file seems to be corrupted, because...". But that can be its own topic.

 commit-graph.c                | 16 ++++++++++++++++
 t/t5318-commit-graph.sh       | 25 ++++++++++++++++++++++++-
 t/t5324-split-commit-graph.sh |  6 +++---
 3 files changed, 43 insertions(+), 4 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 9b3b01da61..b217e19194 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -277,6 +277,8 @@ struct commit_graph *load_commit_graph_one_fd_st(struct repository *r,
 
 static int verify_commit_graph_lite(struct commit_graph *g)
 {
+	int i;
+
 	/*
 	 * Basic validation shared between parse_commit_graph()
 	 * which'll be called every time the graph is used, and the
@@ -302,6 +304,20 @@ static int verify_commit_graph_lite(struct commit_graph *g)
 		return 1;
 	}
 
+	for (i = 0; i < 255; i++) {
+		uint32_t oid_fanout1 = ntohl(g->chunk_oid_fanout[i]);
+		uint32_t oid_fanout2 = ntohl(g->chunk_oid_fanout[i + 1]);
+
+		if (oid_fanout1 > oid_fanout2) {
+			error("commit-graph fanout values out of order");
+			return 1;
+		}
+	}
+	if (ntohl(g->chunk_oid_fanout[255]) != g->num_commits) {
+		error("commit-graph oid table and fanout disagree on size");
+		return 1;
+	}
+
 	return 0;
 }
 
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index d25bea3ec5..d10658de9e 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -560,7 +560,7 @@ test_expect_success 'detect incorrect fanout' '
 
 test_expect_success 'detect incorrect fanout final value' '
 	corrupt_graph_and_verify $GRAPH_BYTE_FANOUT2 "\01" \
-		"fanout value"
+		"oid table and fanout disagree on size"
 '
 
 test_expect_success 'detect incorrect OID order' '
@@ -847,4 +847,27 @@ test_expect_success 'reader notices too-small oid fanout chunk' '
 	test_cmp expect.err err
 '
 
+test_expect_success 'reader notices fanout/lookup table mismatch' '
+	check_corrupt_chunk OIDF 1020 "FFFFFFFF" &&
+	cat >expect.err <<-\EOF &&
+	error: commit-graph oid table and fanout disagree on size
+	EOF
+	test_cmp expect.err err
+'
+
+test_expect_success 'reader notices out-of-bounds fanout' '
+	# Rather than try to corrupt a specific hash, we will just
+	# wreck them all. But we cannot just set them all to 0xFFFFFFFF or
+	# similar, as they are used for hi/lo starts in a binary search (so if
+	# they are identical, that indicates that the search should abort
+	# immediately). Instead, we will give them high values that differ by
+	# 2^24, ensuring that any that are used would cause an out-of-bounds
+	# read.
+	check_corrupt_chunk OIDF 0 $(printf "%02x000000" $(test_seq 0 254)) &&
+	cat >expect.err <<-\EOF &&
+	error: commit-graph fanout values out of order
+	EOF
+	test_cmp expect.err err
+'
+
 test_done
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 06bb897f02..55b5765e2d 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -317,7 +317,7 @@ test_expect_success 'verify --shallow does not check base contents' '
 		cd verify-shallow &&
 		git commit-graph verify &&
 		base_file=$graphdir/graph-$(head -n 1 $graphdir/commit-graph-chain).graph &&
-		corrupt_file "$base_file" 1000 "\01" &&
+		corrupt_file "$base_file" 1500 "\01" &&
 		git commit-graph verify --shallow &&
 		test_must_fail git commit-graph verify 2>test_err &&
 		grep -v "^+" test_err >err &&
@@ -391,10 +391,10 @@ test_expect_success 'verify across alternates' '
 		test_commit extra &&
 		git commit-graph write --reachable --split &&
 		tip_file=$graphdir/graph-$(tail -n 1 $graphdir/commit-graph-chain).graph &&
-		corrupt_file "$tip_file" 100 "\01" &&
+		corrupt_file "$tip_file" 1500 "\01" &&
 		test_must_fail git commit-graph verify --shallow 2>test_err &&
 		grep -v "^+" test_err >err &&
-		test_i18ngrep "commit-graph has incorrect fanout value" err
+		test_i18ngrep "incorrect checksum" err
 	)
 '
 
-- 
2.42.0.884.g35e1fe1a6a


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 07/20] midx: check size of pack names chunk
  2023-10-09 20:55 [PATCH 0/20] bounds-checks for chunk-based files Jeff King
                   ` (5 preceding siblings ...)
  2023-10-09 21:04 ` [PATCH 06/20] commit-graph: check consistency of fanout table Jeff King
@ 2023-10-09 21:05 ` Jeff King
  2023-10-11 14:52   ` Taylor Blau
  2023-10-09 21:05 ` [PATCH 08/20] midx: enforce chunk alignment on reading Jeff King
                   ` (15 subsequent siblings)
  22 siblings, 1 reply; 67+ messages in thread
From: Jeff King @ 2023-10-09 21:05 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

We parse the pack-name chunk as a series of NUL-terminated strings. But
since we don't look at the chunk size, there's nothing to guarantee that
we don't parse off the end of the chunk (or even off the end of the
mapped file).

We can record the length, and then as we parse make sure that we never
walk past it.

The new test exercises the case, though note that it does not actually
segfault before this patch. It hits a NUL byte somewhere in one of the
other chunks, and comes up with a garbage pack name. You could construct
one that reads out-of-bounds (e.g., a PNAM chunk at the end of file),
but this case is simple and sufficient to check that we detect the
problem.

Signed-off-by: Jeff King <peff@peff.net>
---
 midx.c                      | 11 +++++++++--
 midx.h                      |  1 +
 t/t5319-multi-pack-index.sh | 11 +++++++++++
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/midx.c b/midx.c
index 62e4c03e79..ec585cae1b 100644
--- a/midx.c
+++ b/midx.c
@@ -157,7 +157,7 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 				   MIDX_HEADER_SIZE, m->num_chunks))
 		goto cleanup_fail;
 
-	if (pair_chunk_unsafe(cf, MIDX_CHUNKID_PACKNAMES, &m->chunk_pack_names))
+	if (pair_chunk(cf, MIDX_CHUNKID_PACKNAMES, &m->chunk_pack_names, &m->chunk_pack_names_len))
 		die(_("multi-pack-index required pack-name chunk missing or corrupted"));
 	if (read_chunk(cf, MIDX_CHUNKID_OIDFANOUT, midx_read_oid_fanout, m))
 		die(_("multi-pack-index required OID fanout chunk missing or corrupted"));
@@ -176,9 +176,16 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 
 	cur_pack_name = (const char *)m->chunk_pack_names;
 	for (i = 0; i < m->num_packs; i++) {
+		const char *end;
+		size_t avail = m->chunk_pack_names_len -
+				(cur_pack_name - (const char *)m->chunk_pack_names);
+
 		m->pack_names[i] = cur_pack_name;
 
-		cur_pack_name += strlen(cur_pack_name) + 1;
+		end = memchr(cur_pack_name, '\0', avail);
+		if (!end)
+			die(_("multi-pack-index pack-name chunk is too short"));
+		cur_pack_name = end + 1;
 
 		if (i && strcmp(m->pack_names[i], m->pack_names[i - 1]) <= 0)
 			die(_("multi-pack-index pack names out of order: '%s' before '%s'"),
diff --git a/midx.h b/midx.h
index 5578cd7b83..5b2a7da043 100644
--- a/midx.h
+++ b/midx.h
@@ -32,6 +32,7 @@ struct multi_pack_index {
 	int local;
 
 	const unsigned char *chunk_pack_names;
+	size_t chunk_pack_names_len;
 	const uint32_t *chunk_oid_fanout;
 	const unsigned char *chunk_oid_lookup;
 	const unsigned char *chunk_object_offsets;
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 2722e495b2..0a0ccec8a4 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -1083,4 +1083,15 @@ test_expect_success 'reader notices too-small oid lookup chunk' '
 	test_cmp expect err
 '
 
+test_expect_success 'reader notices too-small pack names chunk' '
+	# There is no NUL to terminate the name here, so the
+	# chunk is too short.
+	corrupt_chunk PNAM clear 70656666 &&
+	test_must_fail git log 2>err &&
+	cat >expect <<-\EOF &&
+	fatal: multi-pack-index pack-name chunk is too short
+	EOF
+	test_cmp expect err
+'
+
 test_done
-- 
2.42.0.884.g35e1fe1a6a


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 08/20] midx: enforce chunk alignment on reading
  2023-10-09 20:55 [PATCH 0/20] bounds-checks for chunk-based files Jeff King
                   ` (6 preceding siblings ...)
  2023-10-09 21:05 ` [PATCH 07/20] midx: check size of pack names chunk Jeff King
@ 2023-10-09 21:05 ` Jeff King
  2023-10-11 14:56   ` Taylor Blau
  2023-10-11 15:01   ` Taylor Blau
  2023-10-09 21:05 ` [PATCH 09/20] midx: check size of object offset chunk Jeff King
                   ` (14 subsequent siblings)
  22 siblings, 2 replies; 67+ messages in thread
From: Jeff King @ 2023-10-09 21:05 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

The midx reader assumes chunks are aligned to a 4-byte boundary: we
treat the fanout chunk as an array of uint32_t, indexing it to feed the
results to ntohl(). Without aligning the chunks, we may violate the
CPU's alignment constraints. Though many platforms allow this, some do
not. And certanily UBSan will complain, since it is undefined behavior.

Even though most chunks are naturally 4-byte-aligned (because they are
storing uint32_t or larger types), PNAM is not. It stores NUL-terminated
pack names, so you can have a valid chunk with any length. The writing
side handles this by 4-byte-aligning the chunk, introducing a few extra
NULs as necessary. But since we don't check this on the reading side, we
may end up with a misaligned fanout and trigger the undefined behavior.

We have two options here:

  1. Swap out ntohl(fanout[i]) for get_be32(fanout+i) everywhere. The
     latter handles alignment itself. It's possible that it's slightly
     slower (though in practice I'm not sure how true that is,
     especially for these code paths which then go on to do a binary
     search).

  2. Enforce the alignment when reading the chunks. This is easy to do,
     since the table-of-contents reader can check it in one spot.

I went with the second option here, just because it places less burden
on maintenance going forward (it is OK to continue using ntohl), and we
know it can't have any performance impact on the actual reads.

The commit-graph code uses the same chunk API. It's usually also 4-byte
aligned, but some chunks are not (like Bloom filter BDAT chunks). So
we'll pass "1" here to allow any alignment. It doesn't suffer from the
same problem as midx with its fanout because the fanout chunk is always
the first (and the rest of the format dictates that the first chunk will
start aligned).

The new test shows the effect on a midx with a misaligned PNAM chunk.
Note that the midx-reading code treats chunk-toc errors as soft, falling
back to the non-midx path rather than calling die(), as we do for other
parsing errors. Arguably we should make all of these behave the same,
but that's out of scope for this patch. For now the test just expects
the fallback behavior.

Signed-off-by: Jeff King <peff@peff.net>
---
 chunk-format.c              |  8 +++++++-
 chunk-format.h              |  3 ++-
 commit-graph.c              |  2 +-
 midx.c                      |  3 ++-
 t/t5319-multi-pack-index.sh | 14 ++++++++++++++
 5 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/chunk-format.c b/chunk-format.c
index 8d910e23f6..09ef86afa6 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -102,7 +102,8 @@ int read_table_of_contents(struct chunkfile *cf,
 			   const unsigned char *mfile,
 			   size_t mfile_size,
 			   uint64_t toc_offset,
-			   int toc_length)
+			   int toc_length,
+			   unsigned expected_alignment)
 {
 	int i;
 	uint32_t chunk_id;
@@ -120,6 +121,11 @@ int read_table_of_contents(struct chunkfile *cf,
 			error(_("terminating chunk id appears earlier than expected"));
 			return 1;
 		}
+		if (chunk_offset % expected_alignment != 0) {
+			error(_("chunk id %"PRIx32" not %d-byte aligned"),
+			      chunk_id, expected_alignment);
+			return 1;
+		}
 
 		table_of_contents += CHUNK_TOC_ENTRY_SIZE;
 		next_chunk_offset = get_be64(table_of_contents + 4);
diff --git a/chunk-format.h b/chunk-format.h
index 8dce7967f4..d608b8135c 100644
--- a/chunk-format.h
+++ b/chunk-format.h
@@ -36,7 +36,8 @@ int read_table_of_contents(struct chunkfile *cf,
 			   const unsigned char *mfile,
 			   size_t mfile_size,
 			   uint64_t toc_offset,
-			   int toc_length);
+			   int toc_length,
+			   unsigned expected_alignment);
 
 #define CHUNK_NOT_FOUND (-2)
 
diff --git a/commit-graph.c b/commit-graph.c
index b217e19194..472332f603 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -417,7 +417,7 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 	cf = init_chunkfile(NULL);
 
 	if (read_table_of_contents(cf, graph->data, graph_size,
-				   GRAPH_HEADER_SIZE, graph->num_chunks))
+				   GRAPH_HEADER_SIZE, graph->num_chunks, 1))
 		goto free_and_return;
 
 	read_chunk(cf, GRAPH_CHUNKID_OIDFANOUT, graph_read_oid_fanout, graph);
diff --git a/midx.c b/midx.c
index ec585cae1b..98f84fbe61 100644
--- a/midx.c
+++ b/midx.c
@@ -154,7 +154,8 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 	cf = init_chunkfile(NULL);
 
 	if (read_table_of_contents(cf, m->data, midx_size,
-				   MIDX_HEADER_SIZE, m->num_chunks))
+				   MIDX_HEADER_SIZE, m->num_chunks,
+				   MIDX_CHUNK_ALIGNMENT))
 		goto cleanup_fail;
 
 	if (pair_chunk(cf, MIDX_CHUNKID_PACKNAMES, &m->chunk_pack_names, &m->chunk_pack_names_len))
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 0a0ccec8a4..34f5944142 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -1094,4 +1094,18 @@ test_expect_success 'reader notices too-small pack names chunk' '
 	test_cmp expect err
 '
 
+test_expect_success 'reader handles unaligned chunks' '
+	# A 9-byte PNAM means all of the subsequent chunks
+	# will no longer be 4-byte aligned, but it is still
+	# a valid one-pack chunk on its own (it is "foo.pack\0").
+	corrupt_chunk PNAM clear 666f6f2e7061636b00 &&
+	git -c core.multipackindex=false log >expect.out &&
+	git -c core.multipackindex=true log >out 2>err &&
+	test_cmp expect.out out &&
+	cat >expect.err <<-\EOF &&
+	error: chunk id 4f494446 not 4-byte aligned
+	EOF
+	test_cmp expect.err err
+'
+
 test_done
-- 
2.42.0.884.g35e1fe1a6a


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 09/20] midx: check size of object offset chunk
  2023-10-09 20:55 [PATCH 0/20] bounds-checks for chunk-based files Jeff King
                   ` (7 preceding siblings ...)
  2023-10-09 21:05 ` [PATCH 08/20] midx: enforce chunk alignment on reading Jeff King
@ 2023-10-09 21:05 ` Jeff King
  2023-10-11 18:31   ` Taylor Blau
  2023-10-09 21:05 ` [PATCH 10/20] midx: bounds-check large " Jeff King
                   ` (13 subsequent siblings)
  22 siblings, 1 reply; 67+ messages in thread
From: Jeff King @ 2023-10-09 21:05 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

The object offset chunk has one fixed-size entry for each object in the
midx. But since we don't check its size, we may access out-of-bounds
memory if we see a corrupt or malicious midx file.

Sine the entries are fixed-size, the total length can be known up-front,
and we can just check it while parsing the chunk (this is similar to
what we do when opening pack idx files, which contain a similar offset
table).

Signed-off-by: Jeff King <peff@peff.net>
---
 midx.c                      | 15 ++++++++++++++-
 t/t5319-multi-pack-index.sh | 10 ++++++++++
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index 98f84fbe61..7b1b45f381 100644
--- a/midx.c
+++ b/midx.c
@@ -88,6 +88,19 @@ static int midx_read_oid_lookup(const unsigned char *chunk_start,
 	return 0;
 }
 
+static int midx_read_object_offsets(const unsigned char *chunk_start,
+				    size_t chunk_size, void *data)
+{
+	struct multi_pack_index *m = data;
+	m->chunk_object_offsets = chunk_start;
+
+	if (chunk_size != st_mult(m->num_objects, MIDX_CHUNK_OFFSET_WIDTH)) {
+		error(_("multi-pack-index object offset chunk is the wrong size"));
+		return 1;
+	}
+	return 0;
+}
+
 struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local)
 {
 	struct multi_pack_index *m = NULL;
@@ -164,7 +177,7 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 		die(_("multi-pack-index required OID fanout chunk missing or corrupted"));
 	if (read_chunk(cf, MIDX_CHUNKID_OIDLOOKUP, midx_read_oid_lookup, m))
 		die(_("multi-pack-index required OID lookup chunk missing or corrupted"));
-	if (pair_chunk_unsafe(cf, MIDX_CHUNKID_OBJECTOFFSETS, &m->chunk_object_offsets))
+	if (read_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS, midx_read_object_offsets, m))
 		die(_("multi-pack-index required object offsets chunk missing or corrupted"));
 
 	pair_chunk_unsafe(cf, MIDX_CHUNKID_LARGEOFFSETS, &m->chunk_large_offsets);
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 34f5944142..30687d5452 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -1108,4 +1108,14 @@ test_expect_success 'reader handles unaligned chunks' '
 	test_cmp expect.err err
 '
 
+test_expect_success 'reader notices too-small object offset chunk' '
+	corrupt_chunk OOFF clear 00000000 &&
+	test_must_fail git log 2>err &&
+	cat >expect <<-\EOF &&
+	error: multi-pack-index object offset chunk is the wrong size
+	fatal: multi-pack-index required object offsets chunk missing or corrupted
+	EOF
+	test_cmp expect err
+'
+
 test_done
-- 
2.42.0.884.g35e1fe1a6a


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 10/20] midx: bounds-check large offset chunk
  2023-10-09 20:55 [PATCH 0/20] bounds-checks for chunk-based files Jeff King
                   ` (8 preceding siblings ...)
  2023-10-09 21:05 ` [PATCH 09/20] midx: check size of object offset chunk Jeff King
@ 2023-10-09 21:05 ` Jeff King
  2023-10-11 18:38   ` Taylor Blau
  2023-10-09 21:05 ` [PATCH 11/20] midx: check size of revindex chunk Jeff King
                   ` (12 subsequent siblings)
  22 siblings, 1 reply; 67+ messages in thread
From: Jeff King @ 2023-10-09 21:05 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

When we see a large offset bit in the regular midx offset table, we
use the entry as an index into a separate large offset table (just like
a pack idx does). But we don't bounds-check the access to that large
offset table (nor even record its size when we parse the chunk!).

The equivalent code for a regular pack idx is in check_pack_index_ptr().
But things are a bit simpler here because of the chunked format: we can
just check our array index directly.

As a bonus, we can get rid of the st_mult() here. If our array
bounds-check is successful, then we know that the result will fit in a
size_t (and the bounds check uses a division to avoid overflow
entirely).

Signed-off-by: Jeff King <peff@peff.net>
---
 midx.c                      |  8 +++++---
 midx.h                      |  1 +
 t/t5319-multi-pack-index.sh | 20 ++++++++++++++++++++
 3 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/midx.c b/midx.c
index 7b1b45f381..3e768d0df0 100644
--- a/midx.c
+++ b/midx.c
@@ -180,7 +180,8 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 	if (read_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS, midx_read_object_offsets, m))
 		die(_("multi-pack-index required object offsets chunk missing or corrupted"));
 
-	pair_chunk_unsafe(cf, MIDX_CHUNKID_LARGEOFFSETS, &m->chunk_large_offsets);
+	pair_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS, &m->chunk_large_offsets,
+		   &m->chunk_large_offsets_len);
 
 	if (git_env_bool("GIT_TEST_MIDX_READ_RIDX", 1))
 		pair_chunk_unsafe(cf, MIDX_CHUNKID_REVINDEX, &m->chunk_revindex);
@@ -303,8 +304,9 @@ off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos)
 			die(_("multi-pack-index stores a 64-bit offset, but off_t is too small"));
 
 		offset32 ^= MIDX_LARGE_OFFSET_NEEDED;
-		return get_be64(m->chunk_large_offsets +
-				st_mult(sizeof(uint64_t), offset32));
+		if (offset32 >= m->chunk_large_offsets_len / sizeof(uint64_t))
+			die(_("multi-pack-index large offset out of bounds"));
+		return get_be64(m->chunk_large_offsets + sizeof(uint64_t) * offset32);
 	}
 
 	return offset32;
diff --git a/midx.h b/midx.h
index 5b2a7da043..e8e8884d16 100644
--- a/midx.h
+++ b/midx.h
@@ -37,6 +37,7 @@ struct multi_pack_index {
 	const unsigned char *chunk_oid_lookup;
 	const unsigned char *chunk_object_offsets;
 	const unsigned char *chunk_large_offsets;
+	size_t chunk_large_offsets_len;
 	const unsigned char *chunk_revindex;
 
 	const char **pack_names;
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 30687d5452..16050f39d9 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -1118,4 +1118,24 @@ test_expect_success 'reader notices too-small object offset chunk' '
 	test_cmp expect err
 '
 
+test_expect_success 'reader bounds-checks large offset table' '
+	# re-use the objects64 dir here to cheaply get access to a midx
+	# with large offsets.
+	git init repo &&
+	test_when_finished "rm -rf repo" &&
+	(
+		cd repo &&
+		(cd ../objects64 && pwd) >.git/objects/info/alternates &&
+		git multi-pack-index --object-dir=../objects64 write &&
+		midx=../objects64/pack/multi-pack-index &&
+		corrupt_chunk_file $midx LOFF clear &&
+		test_must_fail git cat-file \
+			--batch-check --batch-all-objects 2>err &&
+		cat >expect <<-\EOF &&
+		fatal: multi-pack-index large offset out of bounds
+		EOF
+		test_cmp expect err
+	)
+'
+
 test_done
-- 
2.42.0.884.g35e1fe1a6a


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 11/20] midx: check size of revindex chunk
  2023-10-09 20:55 [PATCH 0/20] bounds-checks for chunk-based files Jeff King
                   ` (9 preceding siblings ...)
  2023-10-09 21:05 ` [PATCH 10/20] midx: bounds-check large " Jeff King
@ 2023-10-09 21:05 ` Jeff King
  2023-10-11 18:41   ` Taylor Blau
  2023-10-09 21:05 ` [PATCH 12/20] commit-graph: check size of commit data chunk Jeff King
                   ` (11 subsequent siblings)
  22 siblings, 1 reply; 67+ messages in thread
From: Jeff King @ 2023-10-09 21:05 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

When we load a revindex from disk, we check the size of the file
compared to the number of objects we expect it to have. But when we use
a RIDX chunk stored directly in the midx, we just access the memory
directly. This can lead to out-of-bounds memory access for a corrupted
or malicious multi-pack-index file.

We can catch this by recording the RIDX chunk size, and then checking it
against the expected size when we "load" the revindex. Note that this
check is much simpler than the one that load_revindex_from_disk() does,
because we just have the data array with no header (so we do not need
to account for the header size, and nor do we need to bother validating
the header values).

The test confirms both that we catch this case, and that we continue the
process (the revindex is required to use the midx bitmaps, but we
fallback to a non-bitmap traversal).

Signed-off-by: Jeff King <peff@peff.net>
---
 midx.c                      |  3 ++-
 midx.h                      |  1 +
 pack-revindex.c             | 13 ++++++++++++-
 t/t5319-multi-pack-index.sh | 17 +++++++++++++++++
 4 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/midx.c b/midx.c
index 3e768d0df0..2f3863c936 100644
--- a/midx.c
+++ b/midx.c
@@ -184,7 +184,8 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 		   &m->chunk_large_offsets_len);
 
 	if (git_env_bool("GIT_TEST_MIDX_READ_RIDX", 1))
-		pair_chunk_unsafe(cf, MIDX_CHUNKID_REVINDEX, &m->chunk_revindex);
+		pair_chunk(cf, MIDX_CHUNKID_REVINDEX, &m->chunk_revindex,
+			   &m->chunk_revindex_len);
 
 	CALLOC_ARRAY(m->pack_names, m->num_packs);
 	CALLOC_ARRAY(m->packs, m->num_packs);
diff --git a/midx.h b/midx.h
index e8e8884d16..a5d98919c8 100644
--- a/midx.h
+++ b/midx.h
@@ -39,6 +39,7 @@ struct multi_pack_index {
 	const unsigned char *chunk_large_offsets;
 	size_t chunk_large_offsets_len;
 	const unsigned char *chunk_revindex;
+	size_t chunk_revindex_len;
 
 	const char **pack_names;
 	struct packed_git **packs;
diff --git a/pack-revindex.c b/pack-revindex.c
index 7fffcad912..6d8fd3645a 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -343,6 +343,17 @@ int verify_pack_revindex(struct packed_git *p)
 	return res;
 }
 
+static int can_use_midx_ridx_chunk(struct multi_pack_index *m)
+{
+	if (!m->chunk_revindex)
+		return 0;
+	if (m->chunk_revindex_len != st_mult(sizeof(uint32_t), m->num_objects)) {
+		error(_("multi-pack-index reverse-index chunk is the wrong size"));
+		return 0;
+	}
+	return 1;
+}
+
 int load_midx_revindex(struct multi_pack_index *m)
 {
 	struct strbuf revindex_name = STRBUF_INIT;
@@ -351,7 +362,7 @@ int load_midx_revindex(struct multi_pack_index *m)
 	if (m->revindex_data)
 		return 0;
 
-	if (m->chunk_revindex) {
+	if (can_use_midx_ridx_chunk(m)) {
 		/*
 		 * If the MIDX `m` has a `RIDX` chunk, then use its contents for
 		 * the reverse index instead of trying to load a separate `.rev`
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 16050f39d9..2a11dd1af6 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -1138,4 +1138,21 @@ test_expect_success 'reader bounds-checks large offset table' '
 	)
 '
 
+test_expect_success 'reader notices too-small revindex chunk' '
+	# We only get a revindex with bitmaps (and likewise only
+	# load it when they are asked for).
+	test_config repack.writeBitmaps true &&
+	corrupt_chunk RIDX clear 00000000 &&
+	git -c core.multipackIndex=false rev-list \
+		--all --use-bitmap-index >expect.out &&
+	git -c core.multipackIndex=true rev-list \
+		--all --use-bitmap-index >out 2>err &&
+	test_cmp expect.out out &&
+	cat >expect.err <<-\EOF &&
+	error: multi-pack-index reverse-index chunk is the wrong size
+	warning: multi-pack bitmap is missing required reverse index
+	EOF
+	test_cmp expect.err err
+'
+
 test_done
-- 
2.42.0.884.g35e1fe1a6a


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 12/20] commit-graph: check size of commit data chunk
  2023-10-09 20:55 [PATCH 0/20] bounds-checks for chunk-based files Jeff King
                   ` (10 preceding siblings ...)
  2023-10-09 21:05 ` [PATCH 11/20] midx: check size of revindex chunk Jeff King
@ 2023-10-09 21:05 ` Jeff King
  2023-10-11 18:46   ` Taylor Blau
  2023-10-09 21:05 ` [PATCH 13/20] commit-graph: detect out-of-bounds extra-edges pointers Jeff King
                   ` (10 subsequent siblings)
  22 siblings, 1 reply; 67+ messages in thread
From: Jeff King @ 2023-10-09 21:05 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

We expect a commit-graph file to have a fixed-size data record for each
commit in the file (and we know the number of commits to expct from the
size of the lookup table). If we encounter a file where this is too
small, we'll look past the end of the chunk (and possibly even off the
mapped memory).

We can fix this by checking the size up front when we record the
pointer.

The included test doesn't segfault, since it ends up reading bytes
from another chunk. But it produces nonsense results, since the values
it reads are garbage. Our test notices this by comparing the output to a
non-corrupted run of the same command (and of course we also check that
the expected error is printed to stderr).

Signed-off-by: Jeff King <peff@peff.net>
---
 commit-graph.c          | 12 +++++++++++-
 t/t5318-commit-graph.sh |  9 +++++++++
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index 472332f603..9b80bbd75b 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -340,6 +340,16 @@ static int graph_read_oid_lookup(const unsigned char *chunk_start,
 	return 0;
 }
 
+static int graph_read_commit_data(const unsigned char *chunk_start,
+				  size_t chunk_size, void *data)
+{
+	struct commit_graph *g = data;
+	if (chunk_size != g->num_commits * GRAPH_DATA_WIDTH)
+		return error("commit-graph commit data chunk is wrong size");
+	g->chunk_commit_data = chunk_start;
+	return 0;
+}
+
 static int graph_read_bloom_data(const unsigned char *chunk_start,
 				  size_t chunk_size, void *data)
 {
@@ -422,7 +432,7 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 
 	read_chunk(cf, GRAPH_CHUNKID_OIDFANOUT, graph_read_oid_fanout, graph);
 	read_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP, graph_read_oid_lookup, graph);
-	pair_chunk_unsafe(cf, GRAPH_CHUNKID_DATA, &graph->chunk_commit_data);
+	read_chunk(cf, GRAPH_CHUNKID_DATA, graph_read_commit_data, graph);
 	pair_chunk_unsafe(cf, GRAPH_CHUNKID_EXTRAEDGES, &graph->chunk_extra_edges);
 	pair_chunk_unsafe(cf, GRAPH_CHUNKID_BASE, &graph->chunk_base_graphs);
 
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index d10658de9e..492460157d 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -870,4 +870,13 @@ test_expect_success 'reader notices out-of-bounds fanout' '
 	test_cmp expect.err err
 '
 
+test_expect_success 'reader notices too-small commit data chunk' '
+	check_corrupt_chunk CDAT clear 00000000 &&
+	cat >expect.err <<-\EOF &&
+	error: commit-graph commit data chunk is wrong size
+	error: commit-graph is missing the Commit Data chunk
+	EOF
+	test_cmp expect.err err
+'
+
 test_done
-- 
2.42.0.884.g35e1fe1a6a


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 13/20] commit-graph: detect out-of-bounds extra-edges pointers
  2023-10-09 20:55 [PATCH 0/20] bounds-checks for chunk-based files Jeff King
                   ` (11 preceding siblings ...)
  2023-10-09 21:05 ` [PATCH 12/20] commit-graph: check size of commit data chunk Jeff King
@ 2023-10-09 21:05 ` Jeff King
  2023-10-11 19:02   ` Taylor Blau
  2023-10-09 21:05 ` [PATCH 14/20] commit-graph: bounds-check base graphs chunk Jeff King
                   ` (9 subsequent siblings)
  22 siblings, 1 reply; 67+ messages in thread
From: Jeff King @ 2023-10-09 21:05 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

If an entry in a commit-graph file has more than 2 parents, the
fixed-size parent fields instead point to an offset within an "extra
edges" chunk. We blindly follow these, assuming that the chunk is
present and sufficiently large; this can lead to an out-of-bounds read
for a corrupt or malicious file.

We can fix this by recording the size of the chunk and adding a
bounds-check in fill_commit_in_graph(). There are a few tricky bits:

  1. We'll switch from working with a pointer to an offset. This makes
     some corner cases just fall out naturally:

      a. If we did not find an EDGE chunk at all, our size will
         correctly be zero (so everything is "out of bounds").

      b. Comparing "size / 4" lets us make sure we have at least 4 bytes
         to read, and we never compute a pointer more than one element
         past the end of the array (computing a larger pointer is
         probably OK in practice, but is technically undefined
         behavior).

      c. The current code casts to "uint32_t *". Replacing it with an
         offset avoids any comparison between different types of pointer
         (since the chunk is stored as "unsigned char *").

  2. This is the first case in which fill_commit_in_graph() may return
     anything but success. We need to make sure to roll back the
     "parsed" flag (and any parents we might have added before running
     out of buffer) so that the caller can cleanly fall back to
     loading the commit object itself.

     It's a little non-trivial to do this, and we might benefit from
     factoring it out. But we can wait on that until we actually see a
     second case where we return an error.

As a bonus, this lets us drop the st_mult() call. Since we've already
done a bounds check, we know there won't be any integer overflow (it
would imply our buffer is larger than a size_t can hold).

The included test does not actually segfault before this patch (though
you could construct a case where it does). Instead, it reads garbage
from the next chunk which results in it complaining about a bogus parent
id. This is sufficient for our needs, though (we care that the fallback
succeeds, and that stderr mentions the out-of-bounds read).

Signed-off-by: Jeff King <peff@peff.net>
---
 commit-graph.c          | 20 ++++++++++++++------
 commit-graph.h          |  1 +
 t/t5318-commit-graph.sh |  8 ++++++++
 3 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 9b80bbd75b..e4860841fc 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -433,7 +433,8 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 	read_chunk(cf, GRAPH_CHUNKID_OIDFANOUT, graph_read_oid_fanout, graph);
 	read_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP, graph_read_oid_lookup, graph);
 	read_chunk(cf, GRAPH_CHUNKID_DATA, graph_read_commit_data, graph);
-	pair_chunk_unsafe(cf, GRAPH_CHUNKID_EXTRAEDGES, &graph->chunk_extra_edges);
+	pair_chunk(cf, GRAPH_CHUNKID_EXTRAEDGES, &graph->chunk_extra_edges,
+		   &graph->chunk_extra_edges_size);
 	pair_chunk_unsafe(cf, GRAPH_CHUNKID_BASE, &graph->chunk_base_graphs);
 
 	if (s->commit_graph_generation_version >= 2) {
@@ -899,7 +900,7 @@ static int fill_commit_in_graph(struct repository *r,
 				struct commit_graph *g, uint32_t pos)
 {
 	uint32_t edge_value;
-	uint32_t *parent_data_ptr;
+	uint32_t parent_data_pos;
 	struct commit_list **pptr;
 	const unsigned char *commit_data;
 	uint32_t lex_index;
@@ -931,14 +932,21 @@ static int fill_commit_in_graph(struct repository *r,
 		return 1;
 	}
 
-	parent_data_ptr = (uint32_t*)(g->chunk_extra_edges +
-			  st_mult(4, edge_value & GRAPH_EDGE_LAST_MASK));
+	parent_data_pos = edge_value & GRAPH_EDGE_LAST_MASK;
 	do {
-		edge_value = get_be32(parent_data_ptr);
+		if (g->chunk_extra_edges_size / sizeof(uint32_t) <= parent_data_pos) {
+			error("commit-graph extra-edges pointer out of bounds");
+			free_commit_list(item->parents);
+			item->parents = NULL;
+			item->object.parsed = 0;
+			return 0;
+		}
+		edge_value = get_be32(g->chunk_extra_edges +
+				      sizeof(uint32_t) * parent_data_pos);
 		pptr = insert_parent_or_die(r, g,
 					    edge_value & GRAPH_EDGE_LAST_MASK,
 					    pptr);
-		parent_data_ptr++;
+		parent_data_pos++;
 	} while (!(edge_value & GRAPH_LAST_EDGE));
 
 	return 1;
diff --git a/commit-graph.h b/commit-graph.h
index 20ada7e891..1f8a9de4fb 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -95,6 +95,7 @@ struct commit_graph {
 	const unsigned char *chunk_generation_data;
 	const unsigned char *chunk_generation_data_overflow;
 	const unsigned char *chunk_extra_edges;
+	size_t chunk_extra_edges_size;
 	const unsigned char *chunk_base_graphs;
 	const unsigned char *chunk_bloom_indexes;
 	const unsigned char *chunk_bloom_data;
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 492460157d..05bafcfe5f 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -879,4 +879,12 @@ test_expect_success 'reader notices too-small commit data chunk' '
 	test_cmp expect.err err
 '
 
+test_expect_success 'reader notices out-of-bounds extra edge' '
+	check_corrupt_chunk EDGE clear &&
+	cat >expect.err <<-\EOF &&
+	error: commit-graph extra-edges pointer out of bounds
+	EOF
+	test_cmp expect.err err
+'
+
 test_done
-- 
2.42.0.884.g35e1fe1a6a


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 14/20] commit-graph: bounds-check base graphs chunk
  2023-10-09 20:55 [PATCH 0/20] bounds-checks for chunk-based files Jeff King
                   ` (12 preceding siblings ...)
  2023-10-09 21:05 ` [PATCH 13/20] commit-graph: detect out-of-bounds extra-edges pointers Jeff King
@ 2023-10-09 21:05 ` Jeff King
  2023-10-11 19:05   ` Taylor Blau
  2023-10-09 21:05 ` [PATCH 15/20] commit-graph: check size of generations chunk Jeff King
                   ` (8 subsequent siblings)
  22 siblings, 1 reply; 67+ messages in thread
From: Jeff King @ 2023-10-09 21:05 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

When we are loading a commit-graph chain, we check that each slice of the
chain points to the appropriate set of base graphs via its BASE chunk.
But since we don't record the size of the chunk, we may access
out-of-bounds memory if the file is corrupted.

Since we know the number of entries we expect to find (based on the
position within the commit-graph-chain file), we can just check the size
up front.

In theory this would also let us drop the st_mult() call a few lines
later when we actually access the memory, since we know that the
computed offset will fit in a size_t. But because the operands
"g->hash_len" and "n" have types "unsigned char" and "int", we'd have to
cast to size_t first. Leaving the st_mult() does that cast, and makes it
more obvious that we don't have an overflow problem.

Note that the test does not actually segfault before this patch, since
it just reads garbage from the chunk after BASE (and indeed, it even
rejects the file because that garbage does not have the expected hash
value). You could construct a file with BASE at the end that did
segfault, but corrupting the existing one is easy, and we can check
stderr for the expected message.

Signed-off-by: Jeff King <peff@peff.net>
---
 commit-graph.c                |  8 +++++++-
 commit-graph.h                |  1 +
 t/t5324-split-commit-graph.sh | 14 ++++++++++++++
 3 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index e4860841fc..4377b547c8 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -435,7 +435,8 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 	read_chunk(cf, GRAPH_CHUNKID_DATA, graph_read_commit_data, graph);
 	pair_chunk(cf, GRAPH_CHUNKID_EXTRAEDGES, &graph->chunk_extra_edges,
 		   &graph->chunk_extra_edges_size);
-	pair_chunk_unsafe(cf, GRAPH_CHUNKID_BASE, &graph->chunk_base_graphs);
+	pair_chunk(cf, GRAPH_CHUNKID_BASE, &graph->chunk_base_graphs,
+		   &graph->chunk_base_graphs_size);
 
 	if (s->commit_graph_generation_version >= 2) {
 		pair_chunk_unsafe(cf, GRAPH_CHUNKID_GENERATION_DATA,
@@ -546,6 +547,11 @@ static int add_graph_to_chain(struct commit_graph *g,
 		return 0;
 	}
 
+	if (g->chunk_base_graphs_size / g->hash_len < n) {
+		warning(_("commit-graph base graphs chunk is too small"));
+		return 0;
+	}
+
 	while (n) {
 		n--;
 
diff --git a/commit-graph.h b/commit-graph.h
index 1f8a9de4fb..e4248ea05d 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -97,6 +97,7 @@ struct commit_graph {
 	const unsigned char *chunk_extra_edges;
 	size_t chunk_extra_edges_size;
 	const unsigned char *chunk_base_graphs;
+	size_t chunk_base_graphs_size;
 	const unsigned char *chunk_bloom_indexes;
 	const unsigned char *chunk_bloom_data;
 
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 55b5765e2d..3c8482d073 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -2,6 +2,7 @@
 
 test_description='split commit graph'
 . ./test-lib.sh
+. "$TEST_DIRECTORY"/lib-chunk.sh
 
 GIT_TEST_COMMIT_GRAPH=0
 GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
@@ -398,6 +399,19 @@ test_expect_success 'verify across alternates' '
 	)
 '
 
+test_expect_success 'reader bounds-checks base-graph chunk' '
+	git clone --no-hardlinks . corrupt-base-chunk &&
+	(
+		cd corrupt-base-chunk &&
+		tip_file=$graphdir/graph-$(tail -n 1 $graphdir/commit-graph-chain).graph &&
+		corrupt_chunk_file "$tip_file" BASE clear 01020304 &&
+		git -c core.commitGraph=false log >expect.out &&
+		git -c core.commitGraph=true log >out 2>err &&
+		test_cmp expect.out out &&
+		grep "commit-graph base graphs chunk is too small" err
+	)
+'
+
 test_expect_success 'add octopus merge' '
 	git reset --hard commits/10 &&
 	git merge commits/3 commits/4 &&
-- 
2.42.0.884.g35e1fe1a6a


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 15/20] commit-graph: check size of generations chunk
  2023-10-09 20:55 [PATCH 0/20] bounds-checks for chunk-based files Jeff King
                   ` (13 preceding siblings ...)
  2023-10-09 21:05 ` [PATCH 14/20] commit-graph: bounds-check base graphs chunk Jeff King
@ 2023-10-09 21:05 ` Jeff King
  2023-10-09 21:05 ` [PATCH 16/20] commit-graph: bounds-check generation overflow chunk Jeff King
                   ` (7 subsequent siblings)
  22 siblings, 0 replies; 67+ messages in thread
From: Jeff King @ 2023-10-09 21:05 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

We neither check nor record the size of the generations chunk we parse
from a commit-graph file. This should have one uint32_t for each commit
in the file; if it is smaller (due to corruption, etc), we may read
outside the mapped memory.

The included test segfaults without this patch, as it shrinks the size
considerably (and the chunk is near the end of the file, so we read off
the end of the array rather than accidentally reading another chunk).

We can fix this by checking the size up front (like we do for other
fixed-size chunks, like CDAT).

Signed-off-by: Jeff King <peff@peff.net>
---
 commit-graph.c          | 14 ++++++++++++--
 t/t5318-commit-graph.sh |  8 ++++++++
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 4377b547c8..ca26870d1b 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -350,6 +350,16 @@ static int graph_read_commit_data(const unsigned char *chunk_start,
 	return 0;
 }
 
+static int graph_read_generation_data(const unsigned char *chunk_start,
+				      size_t chunk_size, void *data)
+{
+	struct commit_graph *g = data;
+	if (chunk_size != g->num_commits * sizeof(uint32_t))
+		return error("commit-graph generations chunk is wrong size");
+	g->chunk_generation_data = chunk_start;
+	return 0;
+}
+
 static int graph_read_bloom_data(const unsigned char *chunk_start,
 				  size_t chunk_size, void *data)
 {
@@ -439,8 +449,8 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 		   &graph->chunk_base_graphs_size);
 
 	if (s->commit_graph_generation_version >= 2) {
-		pair_chunk_unsafe(cf, GRAPH_CHUNKID_GENERATION_DATA,
-			&graph->chunk_generation_data);
+		read_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA,
+			   graph_read_generation_data, graph);
 		pair_chunk_unsafe(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
 			&graph->chunk_generation_data_overflow);
 
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 05bafcfe5f..6505ff595a 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -887,4 +887,12 @@ test_expect_success 'reader notices out-of-bounds extra edge' '
 	test_cmp expect.err err
 '
 
+test_expect_success 'reader notices too-small generations chunk' '
+	check_corrupt_chunk GDA2 clear 00000000 &&
+	cat >expect.err <<-\EOF &&
+	error: commit-graph generations chunk is wrong size
+	EOF
+	test_cmp expect.err err
+'
+
 test_done
-- 
2.42.0.884.g35e1fe1a6a


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 16/20] commit-graph: bounds-check generation overflow chunk
  2023-10-09 20:55 [PATCH 0/20] bounds-checks for chunk-based files Jeff King
                   ` (14 preceding siblings ...)
  2023-10-09 21:05 ` [PATCH 15/20] commit-graph: check size of generations chunk Jeff King
@ 2023-10-09 21:05 ` Jeff King
  2023-10-09 21:05 ` [PATCH 17/20] commit-graph: check bounds when accessing BDAT chunk Jeff King
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 67+ messages in thread
From: Jeff King @ 2023-10-09 21:05 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

If the generation entry in a commit-graph doesn't fit, we instead insert
an offset into a generation overflow chunk. But since we don't record
the size of the chunk, we may read outside the chunk if the offset we
find on disk is malicious or corrupted.

We can't check the size of the chunk up-front; it will vary based on how
many entries need overflow. So instead, we'll do a bounds-check before
accessing the chunk memory. Unfortunately there is no error-return from
this function, so we'll just have to die(), which is what it does for
other forms of corruption.

As with other cases, we can drop the st_mult() call, since we know our
bounds-checked value will fit within a size_t.

Before this patch, the test here actually "works" because we read
garbage data from the next chunk. And since that garbage data happens
not to provide a generation number which changes the output, it appears
to work. We could construct a case that actually segfaults or produces
wrong output, but it would be a bit tricky. For our purposes its
sufficient to check that we've detected the bounds error.

Signed-off-by: Jeff King <peff@peff.net>
---
 commit-graph.c                     | 10 +++++++---
 commit-graph.h                     |  1 +
 t/t5328-commit-graph-64bit-time.sh | 10 ++++++++++
 3 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index ca26870d1b..f446e76c28 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -451,8 +451,9 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 	if (s->commit_graph_generation_version >= 2) {
 		read_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA,
 			   graph_read_generation_data, graph);
-		pair_chunk_unsafe(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
-			&graph->chunk_generation_data_overflow);
+		pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
+			   &graph->chunk_generation_data_overflow,
+			   &graph->chunk_generation_data_overflow_size);
 
 		if (graph->chunk_generation_data)
 			graph->read_generation_data = 1;
@@ -896,7 +897,10 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 				die(_("commit-graph requires overflow generation data but has none"));
 
 			offset_pos = offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW;
-			graph_data->generation = item->date + get_be64(g->chunk_generation_data_overflow + st_mult(8, offset_pos));
+			if (g->chunk_generation_data_overflow_size / sizeof(uint64_t) <= offset_pos)
+				die(_("commit-graph overflow generation data is too small"));
+			graph_data->generation = item->date +
+				get_be64(g->chunk_generation_data_overflow + sizeof(uint64_t) * offset_pos);
 		} else
 			graph_data->generation = item->date + offset;
 	} else
diff --git a/commit-graph.h b/commit-graph.h
index e4248ea05d..b373f15802 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -94,6 +94,7 @@ struct commit_graph {
 	const unsigned char *chunk_commit_data;
 	const unsigned char *chunk_generation_data;
 	const unsigned char *chunk_generation_data_overflow;
+	size_t chunk_generation_data_overflow_size;
 	const unsigned char *chunk_extra_edges;
 	size_t chunk_extra_edges_size;
 	const unsigned char *chunk_base_graphs;
diff --git a/t/t5328-commit-graph-64bit-time.sh b/t/t5328-commit-graph-64bit-time.sh
index e9c521c061..e5ff3e07ad 100755
--- a/t/t5328-commit-graph-64bit-time.sh
+++ b/t/t5328-commit-graph-64bit-time.sh
@@ -10,6 +10,7 @@ then
 fi
 
 . "$TEST_DIRECTORY"/lib-commit-graph.sh
+. "$TEST_DIRECTORY/lib-chunk.sh"
 
 UNIX_EPOCH_ZERO="@0 +0000"
 FUTURE_DATE="@4147483646 +0000"
@@ -72,4 +73,13 @@ test_expect_success 'single commit with generation data exceeding UINT32_MAX' '
 	git -C repo-uint32-max commit-graph verify
 '
 
+test_expect_success 'reader notices out-of-bounds generation overflow' '
+	graph=.git/objects/info/commit-graph &&
+	test_when_finished "rm -rf $graph" &&
+	git commit-graph write --reachable &&
+	corrupt_chunk_file $graph GDO2 clear &&
+	test_must_fail git log 2>err &&
+	grep "commit-graph overflow generation data is too small" err
+'
+
 test_done
-- 
2.42.0.884.g35e1fe1a6a


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 17/20] commit-graph: check bounds when accessing BDAT chunk
  2023-10-09 20:55 [PATCH 0/20] bounds-checks for chunk-based files Jeff King
                   ` (15 preceding siblings ...)
  2023-10-09 21:05 ` [PATCH 16/20] commit-graph: bounds-check generation overflow chunk Jeff King
@ 2023-10-09 21:05 ` Jeff King
  2023-10-11 19:11   ` Taylor Blau
  2023-10-09 21:05 ` [PATCH 18/20] commit-graph: check bounds when accessing BIDX chunk Jeff King
                   ` (5 subsequent siblings)
  22 siblings, 1 reply; 67+ messages in thread
From: Jeff King @ 2023-10-09 21:05 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

When loading Bloom filters from a commit-graph file, we use the offset
values in the BIDX chunk to index into the memory mapped for the BDAT
chunk. But since we don't record how big the BDAT chunk is, we just
trust that the BIDX offsets won't cause us to read outside of the chunk
memory. A corrupted or malicious commit-graph file will cause us to
segfault (in practice this isn't a very interesting attack, since
commit-graph files are local-only, and the worst case is an
out-of-bounds read).

We can't fix this by checking the chunk size during parsing, since the
data in the BDAT chunk doesn't have a fixed size (that's why we need the
BIDX in the first place). So we'll fix it in two parts:

  1. Record the BDAT chunk size during parsing, and then later check
     that the BIDX offsets we look up are within bounds.

  2. Because the offsets are relative to the end of the BDAT header, we
     must also make sure that the BDAT chunk is at least as large as the
     expected header size. Otherwise, we overflow when trying to move
     past the header, even for an offset of "0". We can check this
     early, during the parsing stage.

The error messages are rather verbose, but since this is not something
you'd expect to see outside of severe bugs or corruption, it makes sense
to err on the side of too many details. Sadly we can't mention the
filename during the chunk-parsing stage, as we haven't set g->filename
at this point, nor passed it down through the stack.

Signed-off-by: Jeff King <peff@peff.net>
---
 bloom.c              | 24 ++++++++++++++++++++++++
 commit-graph.c       | 10 ++++++++++
 commit-graph.h       |  1 +
 t/t4216-log-bloom.sh | 28 ++++++++++++++++++++++++++++
 4 files changed, 63 insertions(+)

diff --git a/bloom.c b/bloom.c
index aef6b5fea2..61abad7f8c 100644
--- a/bloom.c
+++ b/bloom.c
@@ -29,6 +29,26 @@ static inline unsigned char get_bitmask(uint32_t pos)
 	return ((unsigned char)1) << (pos & (BITS_PER_WORD - 1));
 }
 
+static int check_bloom_offset(struct commit_graph *g, uint32_t pos,
+			      uint32_t offset)
+{
+	/*
+	 * Note that we allow offsets equal to the data size, which would set
+	 * our pointers at one past the end of the chunk memory. This is
+	 * necessary because the on-disk index points to the end of the
+	 * entries (so we can compute size by comparing adjacent ones). And
+	 * naturally the final entry's end is one-past-the-end of the chunk.
+	 */
+	if (offset <= g->chunk_bloom_data_size - BLOOMDATA_CHUNK_HEADER_SIZE)
+		return 0;
+
+	warning("ignoring out-of-range offset (%"PRIuMAX") for changed-path"
+		" filter at pos %"PRIuMAX" of %s (chunk size: %"PRIuMAX")",
+		(uintmax_t)offset, (uintmax_t)pos,
+		g->filename, (uintmax_t)g->chunk_bloom_data_size);
+	return -1;
+}
+
 static int load_bloom_filter_from_graph(struct commit_graph *g,
 					struct bloom_filter *filter,
 					uint32_t graph_pos)
@@ -51,6 +71,10 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
 	else
 		start_index = 0;
 
+	if (check_bloom_offset(g, lex_pos, end_index) < 0 ||
+	    check_bloom_offset(g, lex_pos - 1, start_index) < 0)
+		return 0;
+
 	filter->len = end_index - start_index;
 	filter->data = (unsigned char *)(g->chunk_bloom_data +
 					sizeof(unsigned char) * start_index +
diff --git a/commit-graph.c b/commit-graph.c
index f446e76c28..f7a42be6d0 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -365,7 +365,17 @@ static int graph_read_bloom_data(const unsigned char *chunk_start,
 {
 	struct commit_graph *g = data;
 	uint32_t hash_version;
+
+	if (chunk_size < BLOOMDATA_CHUNK_HEADER_SIZE) {
+		warning("ignoring too-small changed-path chunk"
+			" (%"PRIuMAX" < %"PRIuMAX") in commit-graph file",
+			(uintmax_t)chunk_size,
+			(uintmax_t)BLOOMDATA_CHUNK_HEADER_SIZE);
+		return -1;
+	}
+
 	g->chunk_bloom_data = chunk_start;
+	g->chunk_bloom_data_size = chunk_size;
 	hash_version = get_be32(chunk_start);
 
 	if (hash_version != 1)
diff --git a/commit-graph.h b/commit-graph.h
index b373f15802..c6870274c5 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -101,6 +101,7 @@ struct commit_graph {
 	size_t chunk_base_graphs_size;
 	const unsigned char *chunk_bloom_indexes;
 	const unsigned char *chunk_bloom_data;
+	size_t chunk_bloom_data_size;
 
 	struct topo_level_slab *topo_levels;
 	struct bloom_filter_settings *bloom_filter_settings;
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index fa9d32facf..7a727bcddd 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -5,6 +5,7 @@ GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
 export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
 
 . ./test-lib.sh
+. "$TEST_DIRECTORY"/lib-chunk.sh
 
 GIT_TEST_COMMIT_GRAPH=0
 GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
@@ -404,4 +405,31 @@ test_expect_success 'Bloom generation backfills empty commits' '
 	)
 '
 
+corrupt_graph () {
+	graph=.git/objects/info/commit-graph &&
+	test_when_finished "rm -rf $graph" &&
+	git commit-graph write --reachable --changed-paths &&
+	corrupt_chunk_file $graph "$@"
+}
+
+check_corrupt_graph () {
+	corrupt_graph "$@" &&
+	git -c core.commitGraph=false log -- A/B/file2 >expect.out &&
+	git -c core.commitGraph=true log -- A/B/file2 >out 2>err &&
+	test_cmp expect.out out
+}
+
+test_expect_success 'Bloom reader notices too-small data chunk' '
+	check_corrupt_graph BDAT clear 00000000 &&
+	echo "warning: ignoring too-small changed-path chunk" \
+		"(4 < 12) in commit-graph file" >expect.err &&
+	test_cmp expect.err err
+'
+
+test_expect_success 'Bloom reader notices out-of-bounds filter offsets' '
+	check_corrupt_graph BIDX 12 FFFFFFFF &&
+	# use grep to avoid depending on exact chunk size
+	grep "warning: ignoring out-of-range offset (4294967295) for changed-path filter at pos 3 of .git/objects/info/commit-graph" err
+'
+
 test_done
-- 
2.42.0.884.g35e1fe1a6a


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 18/20] commit-graph: check bounds when accessing BIDX chunk
  2023-10-09 20:55 [PATCH 0/20] bounds-checks for chunk-based files Jeff King
                   ` (16 preceding siblings ...)
  2023-10-09 21:05 ` [PATCH 17/20] commit-graph: check bounds when accessing BDAT chunk Jeff King
@ 2023-10-09 21:05 ` Jeff King
  2023-10-11 19:15   ` Taylor Blau
  2023-10-09 21:05 ` [PATCH 19/20] commit-graph: detect out-of-order BIDX offsets Jeff King
                   ` (4 subsequent siblings)
  22 siblings, 1 reply; 67+ messages in thread
From: Jeff King @ 2023-10-09 21:05 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

We load the bloom_filter_indexes chunk using pair_chunk(), so we have no
idea how big it is. This can lead to out-of-bounds reads if it is
smaller than expected, since we index it based on the number of commits
found elsewhere in the graph file.

We can check the chunk size up front, like we do for CDAT and other
chunks with one fixed-size record per commit.

The test case demonstrates the problem. It actually won't segfault,
because we end up reading random data from the follow-on chunk (BDAT in
this case), and the bounds checks added in the previous patch complain.
But this is by no means assured, and you can craft a commit-graph file
with BIDX at the end (or a smaller BDAT) that does segfault.

Signed-off-by: Jeff King <peff@peff.net>
---
 commit-graph.c       | 16 ++++++++++++++--
 t/t4216-log-bloom.sh |  9 +++++++++
 2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index f7a42be6d0..1f334987b5 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -360,6 +360,18 @@ static int graph_read_generation_data(const unsigned char *chunk_start,
 	return 0;
 }
 
+static int graph_read_bloom_index(const unsigned char *chunk_start,
+				  size_t chunk_size, void *data)
+{
+	struct commit_graph *g = data;
+	if (chunk_size != g->num_commits * 4) {
+		warning("commit-graph changed-path index chunk is too small");
+		return -1;
+	}
+	g->chunk_bloom_indexes = chunk_start;
+	return 0;
+}
+
 static int graph_read_bloom_data(const unsigned char *chunk_start,
 				  size_t chunk_size, void *data)
 {
@@ -470,8 +482,8 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 	}
 
 	if (s->commit_graph_read_changed_paths) {
-		pair_chunk_unsafe(cf, GRAPH_CHUNKID_BLOOMINDEXES,
-			   &graph->chunk_bloom_indexes);
+		read_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
+			   graph_read_bloom_index, graph);
 		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
 			   graph_read_bloom_data, graph);
 	}
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 7a727bcddd..f6054cbb27 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -432,4 +432,13 @@ test_expect_success 'Bloom reader notices out-of-bounds filter offsets' '
 	grep "warning: ignoring out-of-range offset (4294967295) for changed-path filter at pos 3 of .git/objects/info/commit-graph" err
 '
 
+test_expect_success 'Bloom reader notices too-small index chunk' '
+	# replace the index with a single entry, making most
+	# lookups out-of-bounds
+	check_corrupt_graph BIDX clear 00000000 &&
+	echo "warning: commit-graph changed-path index chunk" \
+		"is too small" >expect.err &&
+	test_cmp expect.err err
+'
+
 test_done
-- 
2.42.0.884.g35e1fe1a6a


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 19/20] commit-graph: detect out-of-order BIDX offsets
  2023-10-09 20:55 [PATCH 0/20] bounds-checks for chunk-based files Jeff King
                   ` (17 preceding siblings ...)
  2023-10-09 21:05 ` [PATCH 18/20] commit-graph: check bounds when accessing BIDX chunk Jeff King
@ 2023-10-09 21:05 ` Jeff King
  2023-10-11 19:16   ` Taylor Blau
  2023-10-09 21:06 ` [PATCH 20/20] chunk-format: drop pair_chunk_unsafe() Jeff King
                   ` (3 subsequent siblings)
  22 siblings, 1 reply; 67+ messages in thread
From: Jeff King @ 2023-10-09 21:05 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

The BIDX chunk tells us the offsets at which each commit's Bloom filters
can be found in the BDAT chunk. We compute the length of each filter by
checking the offsets of neighbors and subtracting them.

If the offsets are out of order, then we'll get a negative length, which
we then store as a very large unsigned value. This can cause us to read
out-of-bounds memory, as we access the hash data modulo "filter->len *
BITS_PER_WORD".

We can easily detect this case when loading the individual filters.

Signed-off-by: Jeff King <peff@peff.net>
---
 bloom.c              | 10 ++++++++++
 t/t4216-log-bloom.sh | 13 +++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/bloom.c b/bloom.c
index 61abad7f8c..1474aa19fa 100644
--- a/bloom.c
+++ b/bloom.c
@@ -75,6 +75,16 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
 	    check_bloom_offset(g, lex_pos - 1, start_index) < 0)
 		return 0;
 
+	if (end_index < start_index) {
+		warning("ignoring decreasing changed-path index offsets"
+			" (%"PRIuMAX" > %"PRIuMAX") for positions"
+			" %"PRIuMAX" and %"PRIuMAX" of %s",
+			(uintmax_t)start_index, (uintmax_t)end_index,
+			(uintmax_t)(lex_pos-1), (uintmax_t)lex_pos,
+			g->filename);
+		return 0;
+	}
+
 	filter->len = end_index - start_index;
 	filter->data = (unsigned char *)(g->chunk_bloom_data +
 					sizeof(unsigned char) * start_index +
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index f6054cbb27..2ba0324a69 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -441,4 +441,17 @@ test_expect_success 'Bloom reader notices too-small index chunk' '
 	test_cmp expect.err err
 '
 
+test_expect_success 'Bloom reader notices out-of-order index offsets' '
+	# we do not know any real offsets, but we can pick
+	# something plausible; we should not get to the point of
+	# actually reading from the bogus offsets anyway.
+	corrupt_graph BIDX 4 0000000c00000005 &&
+	echo "warning: ignoring decreasing changed-path index offsets" \
+		"(12 > 5) for positions 1 and 2 of .git/objects/info/commit-graph" >expect.err &&
+	git -c core.commitGraph=false log -- A/B/file2 >expect.out &&
+	git -c core.commitGraph=true log -- A/B/file2 >out 2>err &&
+	test_cmp expect.out out &&
+	test_cmp expect.err err
+'
+
 test_done
-- 
2.42.0.884.g35e1fe1a6a


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 20/20] chunk-format: drop pair_chunk_unsafe()
  2023-10-09 20:55 [PATCH 0/20] bounds-checks for chunk-based files Jeff King
                   ` (18 preceding siblings ...)
  2023-10-09 21:05 ` [PATCH 19/20] commit-graph: detect out-of-order BIDX offsets Jeff King
@ 2023-10-09 21:06 ` Jeff King
  2023-10-11 19:19 ` [PATCH 0/20] bounds-checks for chunk-based files Taylor Blau
                   ` (2 subsequent siblings)
  22 siblings, 0 replies; 67+ messages in thread
From: Jeff King @ 2023-10-09 21:06 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

There are no callers left, and we don't want anybody to add new ones (they
should use the not-unsafe version instead). So let's drop the function.

Signed-off-by: Jeff King <peff@peff.net>
---
 chunk-format.c |  8 --------
 chunk-format.h | 13 -------------
 2 files changed, 21 deletions(-)

diff --git a/chunk-format.c b/chunk-format.c
index 09ef86afa6..cdc7f39b70 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -184,14 +184,6 @@ int pair_chunk(struct chunkfile *cf,
 	return read_chunk(cf, chunk_id, pair_chunk_fn, &pcd);
 }
 
-int pair_chunk_unsafe(struct chunkfile *cf,
-		      uint32_t chunk_id,
-		      const unsigned char **p)
-{
-	size_t dummy;
-	return pair_chunk(cf, chunk_id, p, &dummy);
-}
-
 int read_chunk(struct chunkfile *cf,
 	       uint32_t chunk_id,
 	       chunk_read_fn fn,
diff --git a/chunk-format.h b/chunk-format.h
index d608b8135c..14b76180ef 100644
--- a/chunk-format.h
+++ b/chunk-format.h
@@ -54,19 +54,6 @@ int pair_chunk(struct chunkfile *cf,
 	       const unsigned char **p,
 	       size_t *size);
 
-/*
- * Unsafe version of pair_chunk; it does not return the size,
- * meaning that the caller cannot possibly be careful about
- * reading out of bounds from the mapped memory.
- *
- * No new callers should use this function, and old callers should
- * be audited and migrated over to using the regular pair_chunk()
- * function.
- */
-int pair_chunk_unsafe(struct chunkfile *cf,
-		      uint32_t chunk_id,
-		      const unsigned char **p);
-
 typedef int (*chunk_read_fn)(const unsigned char *chunk_start,
 			     size_t chunk_size, void *data);
 /*
-- 
2.42.0.884.g35e1fe1a6a

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH 01/20] chunk-format: note that pair_chunk() is unsafe
  2023-10-09 20:58 ` [PATCH 01/20] chunk-format: note that pair_chunk() is unsafe Jeff King
@ 2023-10-10 23:45   ` Taylor Blau
  2023-10-11 22:49     ` Jeff King
  0 siblings, 1 reply; 67+ messages in thread
From: Taylor Blau @ 2023-10-10 23:45 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Mon, Oct 09, 2023 at 04:58:23PM -0400, Jeff King wrote:
> There are no callers of the "safe" version yet, but we'll add some in
> subsequent patches.

Makes sense.

> +int pair_chunk_unsafe(struct chunkfile *cf,
> +		      uint32_t chunk_id,
> +		      const unsigned char **p)
>  {
> -	return read_chunk(cf, chunk_id, pair_chunk_fn, p);
> +	size_t dummy;
> +	return pair_chunk(cf, chunk_id, p, &dummy);

I have always disliked functions that require you to pass a non-NULL
pointer to some value that you may or may not want to have that function
fill out. So I was going to suggest something along the lines of
"pair_chunk() should tolerate a NULL fourth argument instead of filling
out the size unconditionally".

But that's (a) the whole point of the series ;-), and (b) unnecessary,
since this function is going to go away entirely by the end of the
series, too.

So I think that the call you made here was the right one. The rest of
the patch all looks good to me, let's read on...

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 02/20] t: add library for munging chunk-format files
  2023-10-09 20:58 ` [PATCH 02/20] t: add library for munging chunk-format files Jeff King
@ 2023-10-10 23:47   ` Taylor Blau
  0 siblings, 0 replies; 67+ messages in thread
From: Taylor Blau @ 2023-10-10 23:47 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Mon, Oct 09, 2023 at 04:58:38PM -0400, Jeff King wrote:
> ---
>  t/lib-chunk.sh                    | 17 ++++++++
>  t/lib-chunk/corrupt-chunk-file.pl | 66 +++++++++++++++++++++++++++++++
>  2 files changed, 83 insertions(+)
>  create mode 100644 t/lib-chunk.sh
>  create mode 100644 t/lib-chunk/corrupt-chunk-file.pl

I can't claim to be a competent reviewer for the Perl portions of this
patch. But I agree with the goal and am glad to see us getting away from
the brittle implementation that we have been living with in the test
suite.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 03/20] midx: stop ignoring malformed oid fanout chunk
  2023-10-09 20:59 ` [PATCH 03/20] midx: stop ignoring malformed oid fanout chunk Jeff King
@ 2023-10-10 23:50   ` Taylor Blau
  2023-10-11 22:52     ` Jeff King
  0 siblings, 1 reply; 67+ messages in thread
From: Taylor Blau @ 2023-10-10 23:50 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Mon, Oct 09, 2023 at 04:59:19PM -0400, Jeff King wrote:
> When we load the oid-fanout chunk, our callback checks that its size is
> reasonable and returns an error if not. However, the caller only checks
> our return value against CHUNK_NOT_FOUND, so we end up ignoring the
> error completely! Using a too-small fanout table means we end up
> accessing random memory for the fanout and segfault.
>
> We can fix this by checking for any non-zero return value, rather than
> just CHUNK_NOT_FOUND, and adjusting our error message to cover both
> cases. We could handle each error code individually, but there's not
> much point for such a rare case. The extra message produced in the
> callback makes it clear what is going on.
>
> The same pattern is used in the adjacent code. Those cases are actually
> OK for now because they do not use a custom callback, so the only error
> they can get is CHUNK_NOT_FOUND. But let's convert them, as this is an
> accident waiting to happen (especially as we convert some of them away
> from pair_chunk). The error messages are more verbose, but it should be
> rare for a user to see these anyway.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  midx.c                      | 16 ++++++++--------
>  t/t5319-multi-pack-index.sh | 20 +++++++++++++++++++-
>  2 files changed, 27 insertions(+), 9 deletions(-)
>
> diff --git a/midx.c b/midx.c
> index 3165218ab5..21d7dd15ef 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -143,14 +143,14 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
>  				   MIDX_HEADER_SIZE, m->num_chunks))
>  		goto cleanup_fail;
>
> -	if (pair_chunk_unsafe(cf, MIDX_CHUNKID_PACKNAMES, &m->chunk_pack_names) == CHUNK_NOT_FOUND)
> -		die(_("multi-pack-index missing required pack-name chunk"));
> -	if (read_chunk(cf, MIDX_CHUNKID_OIDFANOUT, midx_read_oid_fanout, m) == CHUNK_NOT_FOUND)
> -		die(_("multi-pack-index missing required OID fanout chunk"));
> -	if (pair_chunk_unsafe(cf, MIDX_CHUNKID_OIDLOOKUP, &m->chunk_oid_lookup) == CHUNK_NOT_FOUND)
> -		die(_("multi-pack-index missing required OID lookup chunk"));
> -	if (pair_chunk_unsafe(cf, MIDX_CHUNKID_OBJECTOFFSETS, &m->chunk_object_offsets) == CHUNK_NOT_FOUND)
> -		die(_("multi-pack-index missing required object offsets chunk"));
> +	if (pair_chunk_unsafe(cf, MIDX_CHUNKID_PACKNAMES, &m->chunk_pack_names))
> +		die(_("multi-pack-index required pack-name chunk missing or corrupted"));
> +	if (read_chunk(cf, MIDX_CHUNKID_OIDFANOUT, midx_read_oid_fanout, m))
> +		die(_("multi-pack-index required OID fanout chunk missing or corrupted"));
> +	if (pair_chunk_unsafe(cf, MIDX_CHUNKID_OIDLOOKUP, &m->chunk_oid_lookup))
> +		die(_("multi-pack-index required OID lookup chunk missing or corrupted"));
> +	if (pair_chunk_unsafe(cf, MIDX_CHUNKID_OBJECTOFFSETS, &m->chunk_object_offsets))
> +		die(_("multi-pack-index required object offsets chunk missing or corrupted"));

All makes sense. I have a mild preference for "missing or corrupt" over
"missing or corrupted", but it's mild enough that I wouldn't be sad if
you kept it as-is ;-).

I do wonder if translators would be happy with:

      die(_("multi-pack-index required %s chunk missing or corrupt"),
          "OID fanout");

or if that is assuming too much about the languages that we translate
into.

> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
> index 1bcc02004d..b8fe85aeba 100755
> --- a/t/t5319-multi-pack-index.sh
> +++ b/t/t5319-multi-pack-index.sh

The rest of the patch is looking good...

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 04/20] commit-graph: check size of oid fanout chunk
  2023-10-09 20:59 ` [PATCH 04/20] commit-graph: check size of " Jeff King
@ 2023-10-11  0:08   ` Taylor Blau
  2023-10-11  1:24     ` Taylor Blau
  2023-10-11 23:01     ` Jeff King
  0 siblings, 2 replies; 67+ messages in thread
From: Taylor Blau @ 2023-10-11  0:08 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Mon, Oct 09, 2023 at 04:59:51PM -0400, Jeff King wrote:
> We load the oid fanout chunk with pair_chunk(), which means we never see
> the size of the chunk. We just assume the on-disk file uses the
> appropriate size, and if it's too small we'll access random memory.
>
> It's easy to check this up-front; the fanout always consists of 256
> uint32's, since it is a fanout of the first byte of the hash pointing
> into the oid index. These parameters can't be changed without
> introducing a new chunk type.

Cool, this is the first patch that should start reducing our usage of
the new pair_chunk_unsafe() and hardening these reads. Let's take a
look...

> This matches the similar check in the midx OIDF chunk (but note that
> rather than checking for the error immediately, the graph code just
> leaves parts of the struct NULL and checks for required fields later).
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  commit-graph.c          | 13 +++++++++++--
>  t/t5318-commit-graph.sh | 26 ++++++++++++++++++++++++++
>  2 files changed, 37 insertions(+), 2 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index a689a55b79..9b3b01da61 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -305,6 +305,16 @@ static int verify_commit_graph_lite(struct commit_graph *g)
>  	return 0;
>  }
>
> +static int graph_read_oid_fanout(const unsigned char *chunk_start,
> +				 size_t chunk_size, void *data)
> +{
> +	struct commit_graph *g = data;
> +	if (chunk_size != 256 * sizeof(uint32_t))
> +		return error("commit-graph oid fanout chunk is wrong size");

Should we mark this string for translation?

> +	g->chunk_oid_fanout = (const uint32_t *)chunk_start;
> +	return 0;
> +}
> +

Nice. This makes sense and seems like an obvious improvement over the
existing code.

I wonder how common this pattern is. We have read_chunk() which is for
handling more complex scenarios than this. But the safe version of
pair_chunk() really just wants to check that the size of the chunk is as
expected and assign the location in the mmap to some pointer.

Do you think it would be worth changing pair_chunk() to take an expected
size_t and handle this generically? I.e. have a version of
chunk-format::pair_chunk_fn() that looks something like:

    static int pair_chunk_fn(const unsigned char *chunk_start,
                             size_t chunk_size, void *data)
    {
        const unsigned char **p = data;
        if (chunk_size != data->size)
            return -1;
        *p = chunk_start;
        return 0;
    }

and then our call here would be:

  if (pair_chunk(cf, GRAPH_CHUNKID_OIDFANOUT,
                 (const unsigned char **)&graph->chunk_oid_fanout,
                 256 * sizeof(uint32_t)) < 0)
      return error("commit-graph oid fanout chunk is wrong size");

I dunno. It's hard to have a more concrete recomendation without having
read the rest of the series. So it's possible that this is just complete
nonsense ;-). But my hunch is that there are a number of callers that
would benefit from having this built in.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 04/20] commit-graph: check size of oid fanout chunk
  2023-10-11  0:08   ` Taylor Blau
@ 2023-10-11  1:24     ` Taylor Blau
  2023-10-11 23:01     ` Jeff King
  1 sibling, 0 replies; 67+ messages in thread
From: Taylor Blau @ 2023-10-11  1:24 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Tue, Oct 10, 2023 at 08:08:05PM -0400, Taylor Blau wrote:
> Do you think it would be worth changing pair_chunk() to take an expected
> size_t and handle this generically? I.e. have a version of
> chunk-format::pair_chunk_fn() that looks something like:
>
>     static int pair_chunk_fn(const unsigned char *chunk_start,
>                              size_t chunk_size, void *data)
>     {
>         const unsigned char **p = data;
>         if (chunk_size != data->size)
>             return -1;
>         *p = chunk_start;
>         return 0;
>     }
>
> and then our call here would be:
>
>   if (pair_chunk(cf, GRAPH_CHUNKID_OIDFANOUT,
>                  (const unsigned char **)&graph->chunk_oid_fanout,
>                  256 * sizeof(uint32_t)) < 0)
>       return error("commit-graph oid fanout chunk is wrong size");

I took a brief stab at implementing this tonight and came up with this,
which applies on top of this patch. Looking through the rest of the
series briefly[^1], I think having something like this would be useful:

--- 8< ---
diff --git a/chunk-format.c b/chunk-format.c
index 8d910e23f6..38b0f847be 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -157,6 +157,8 @@ int read_table_of_contents(struct chunkfile *cf,
 struct pair_chunk_data {
 	const unsigned char **p;
 	size_t *size;
+
+	size_t expected_size;
 };

 static int pair_chunk_fn(const unsigned char *chunk_start,
@@ -169,6 +171,20 @@ static int pair_chunk_fn(const unsigned char *chunk_start,
 	return 0;
 }

+static int pair_chunk_expect_fn(const unsigned char *chunk_start,
+				size_t chunk_size,
+				void *data)
+{
+	struct pair_chunk_data *pcd = data;
+	if (pcd->expected_size != chunk_size)
+		return error(_("mismatched chunk size, got: %"PRIuMAX", wanted:"
+			       " %"PRIuMAX),
+			     (uintmax_t)chunk_size,
+			     (uintmax_t)pcd->expected_size);
+	*pcd->p = chunk_start;
+	return 0;
+}
+
 int pair_chunk(struct chunkfile *cf,
 	       uint32_t chunk_id,
 	       const unsigned char **p,
@@ -178,6 +194,14 @@ int pair_chunk(struct chunkfile *cf,
 	return read_chunk(cf, chunk_id, pair_chunk_fn, &pcd);
 }

+int pair_chunk_expect(struct chunkfile *cf,
+		      uint32_t chunk_id, const unsigned char **p,
+		      size_t expected_size)
+{
+	struct pair_chunk_data pcd = { .p = p, .expected_size = expected_size };
+	return read_chunk(cf, chunk_id, pair_chunk_expect_fn, &pcd);
+}
+
 int pair_chunk_unsafe(struct chunkfile *cf,
 		      uint32_t chunk_id,
 		      const unsigned char **p)
diff --git a/chunk-format.h b/chunk-format.h
index 8dce7967f4..778f81f555 100644
--- a/chunk-format.h
+++ b/chunk-format.h
@@ -53,6 +53,10 @@ int pair_chunk(struct chunkfile *cf,
 	       const unsigned char **p,
 	       size_t *size);

+int pair_chunk_expect(struct chunkfile *cf,
+		      uint32_t chunk_id, const unsigned char **p,
+		      size_t expected_size);
+
 /*
  * Unsafe version of pair_chunk; it does not return the size,
  * meaning that the caller cannot possibly be careful about
diff --git a/commit-graph.c b/commit-graph.c
index 9b3b01da61..ed85161fdb 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -305,16 +305,6 @@ static int verify_commit_graph_lite(struct commit_graph *g)
 	return 0;
 }

-static int graph_read_oid_fanout(const unsigned char *chunk_start,
-				 size_t chunk_size, void *data)
-{
-	struct commit_graph *g = data;
-	if (chunk_size != 256 * sizeof(uint32_t))
-		return error("commit-graph oid fanout chunk is wrong size");
-	g->chunk_oid_fanout = (const uint32_t *)chunk_start;
-	return 0;
-}
-
 static int graph_read_oid_lookup(const unsigned char *chunk_start,
 				 size_t chunk_size, void *data)
 {
@@ -404,7 +394,10 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 				   GRAPH_HEADER_SIZE, graph->num_chunks))
 		goto free_and_return;

-	read_chunk(cf, GRAPH_CHUNKID_OIDFANOUT, graph_read_oid_fanout, graph);
+	if (pair_chunk_expect(cf, GRAPH_CHUNKID_OIDFANOUT,
+			      (const unsigned char **)&graph->chunk_oid_fanout,
+			      256 * sizeof(uint32_t)) < 0)
+		error(_("commit-graph oid fanout chunk is wrong size"));
 	read_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP, graph_read_oid_lookup, graph);
 	pair_chunk_unsafe(cf, GRAPH_CHUNKID_DATA, &graph->chunk_commit_data);
 	pair_chunk_unsafe(cf, GRAPH_CHUNKID_EXTRAEDGES, &graph->chunk_extra_edges);
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index d25bea3ec5..467b5f5e9c 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -841,6 +841,7 @@ test_expect_success 'reader notices too-small oid fanout chunk' '
 	# otherwise we hit an earlier check
 	check_corrupt_chunk OIDF clear $(printf "000000%02x" $(test_seq 250)) &&
 	cat >expect.err <<-\EOF &&
+	error: mismatched chunk size, got: 1000, wanted: 1024
 	error: commit-graph oid fanout chunk is wrong size
 	error: commit-graph is missing the OID Fanout chunk
 	EOF
--- >8 ---

Or, quite honestly, having the "expected_size" parameter be required in
the safe version of `pair_chunk()`.

Thanks,
Taylor

[^1]: A non-brief review is on my to-do list for tomorrow.

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH 06/20] commit-graph: check consistency of fanout table
  2023-10-09 21:04 ` [PATCH 06/20] commit-graph: check consistency of fanout table Jeff King
@ 2023-10-11 14:45   ` Taylor Blau
  0 siblings, 0 replies; 67+ messages in thread
From: Taylor Blau @ 2023-10-11 14:45 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Mon, Oct 09, 2023 at 05:04:58PM -0400, Jeff King wrote:
> We use bsearch_hash() to look up items in the oid index of a
> commit-graph. It also has a fanout table to reduce the initial range in
> which we'll search. But since the fanout comes from the on-disk file, a
> corrupted or malicious file can cause us to look outside of the
> allocated index memory.

This is all very well written and explained. The patch LGTM.

> ---
> So I actually implemented the bsearch_hash() bounds checks and wrote
> tests for midx and idx files before realizing how they handle this. ;)
> Which makes sense, because the usual outcome for a corrupted idx file is
> for it to say "non-monotonic index", which I have seen lead to user
> confusion. Arguably we should have it say something about "hey, your idx
> file seems to be corrupted, because...". But that can be its own topic.

Yeah, I definitely agree that that is out of scope here, and can be left
as #leftoverbits.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 07/20] midx: check size of pack names chunk
  2023-10-09 21:05 ` [PATCH 07/20] midx: check size of pack names chunk Jeff King
@ 2023-10-11 14:52   ` Taylor Blau
  2023-10-11 23:06     ` Jeff King
  0 siblings, 1 reply; 67+ messages in thread
From: Taylor Blau @ 2023-10-11 14:52 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Mon, Oct 09, 2023 at 05:05:14PM -0400, Jeff King wrote:
> @@ -176,9 +176,16 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
>
>  	cur_pack_name = (const char *)m->chunk_pack_names;
>  	for (i = 0; i < m->num_packs; i++) {
> +		const char *end;
> +		size_t avail = m->chunk_pack_names_len -
> +				(cur_pack_name - (const char *)m->chunk_pack_names);
> +

This patch all looks good to me, but reading this hunk gave me a little
bit of pause. I was wondering what might happen if chunk_pack_names_len
was zero, and subtracting some other non-zero size_t from it might cause
us to wrap around.

But I think that's a non-issue here, since we'd set cur_pack_name to the
beginning of the chunk, and compute avail as 0 - (m->chunk_pack_names -
m->chunk_pack_names), and get 0 back, causing our memchr() lookup below
to fail, and for us to report this chunk is garbage.

And since cur_pack_name monotonically increases, I think that there is
an inductive argument to be made that this subtraction is always safe to
do. But it couldn't hurt to do something like:

    size_t read = cur_pack_name - (const char *)m->chunk_pack_names;
    size_t avail = m->chunk_pack_names_len;

    if (read > avail)
        die("...");
    avail -= read;

to make absolutely sure that we would never underflow here.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 08/20] midx: enforce chunk alignment on reading
  2023-10-09 21:05 ` [PATCH 08/20] midx: enforce chunk alignment on reading Jeff King
@ 2023-10-11 14:56   ` Taylor Blau
  2023-10-11 15:01   ` Taylor Blau
  1 sibling, 0 replies; 67+ messages in thread
From: Taylor Blau @ 2023-10-11 14:56 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Mon, Oct 09, 2023 at 05:05:23PM -0400, Jeff King wrote:
> ---
>  chunk-format.c              |  8 +++++++-
>  chunk-format.h              |  3 ++-
>  commit-graph.c              |  2 +-
>  midx.c                      |  3 ++-
>  t/t5319-multi-pack-index.sh | 14 ++++++++++++++
>  5 files changed, 26 insertions(+), 4 deletions(-)

Very nicely written and explained. I agree that the choice you made here
(to validate the alignment in the chunk-format API itself when reading
the table of contents) feels more sensible and places less burden on the
callers.

LGTM, let's keep reading...

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 08/20] midx: enforce chunk alignment on reading
  2023-10-09 21:05 ` [PATCH 08/20] midx: enforce chunk alignment on reading Jeff King
  2023-10-11 14:56   ` Taylor Blau
@ 2023-10-11 15:01   ` Taylor Blau
  2023-10-11 23:09     ` Jeff King
  1 sibling, 1 reply; 67+ messages in thread
From: Taylor Blau @ 2023-10-11 15:01 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Mon, Oct 09, 2023 at 05:05:23PM -0400, Jeff King wrote:
> @@ -120,6 +121,11 @@ int read_table_of_contents(struct chunkfile *cf,
>  			error(_("terminating chunk id appears earlier than expected"));
>  			return 1;
>  		}
> +		if (chunk_offset % expected_alignment != 0) {

Oops, I missed this in my first read. I'm definitely nit-picking here,
but this should probably be:

    if (chunk_offset % expected_alignment)

without the trailing "!= 0".

I don't have a strong preference here, since we are doing a comparison
of an arithmetic operation against an (un-)expected value. But I think
the CodingGuidelines would technically call this out of style...

> +			error(_("chunk id %"PRIx32" not %d-byte aligned"),
> +			      chunk_id, expected_alignment);
> +			return 1;
> +		}
>
>  		table_of_contents += CHUNK_TOC_ENTRY_SIZE;
>  		next_chunk_offset = get_be64(table_of_contents + 4);

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 09/20] midx: check size of object offset chunk
  2023-10-09 21:05 ` [PATCH 09/20] midx: check size of object offset chunk Jeff King
@ 2023-10-11 18:31   ` Taylor Blau
  0 siblings, 0 replies; 67+ messages in thread
From: Taylor Blau @ 2023-10-11 18:31 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Mon, Oct 09, 2023 at 05:05:27PM -0400, Jeff King wrote:
> @@ -88,6 +88,19 @@ static int midx_read_oid_lookup(const unsigned char *chunk_start,
>  	return 0;
>  }
>
> +static int midx_read_object_offsets(const unsigned char *chunk_start,
> +				    size_t chunk_size, void *data)
> +{
> +	struct multi_pack_index *m = data;
> +	m->chunk_object_offsets = chunk_start;
> +
> +	if (chunk_size != st_mult(m->num_objects, MIDX_CHUNK_OFFSET_WIDTH)) {
> +		error(_("multi-pack-index object offset chunk is the wrong size"));
> +		return 1;
> +	}
> +	return 0;
> +}

Makes sense, and the (elided) test below looks good, too. I think that
this is another case that would benefit from having the chunk-format API
take in an "expected size" and validate that the requested chunk is
actually that size before assigning its address to some pointer.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 10/20] midx: bounds-check large offset chunk
  2023-10-09 21:05 ` [PATCH 10/20] midx: bounds-check large " Jeff King
@ 2023-10-11 18:38   ` Taylor Blau
  2023-10-11 23:18     ` Jeff King
  0 siblings, 1 reply; 67+ messages in thread
From: Taylor Blau @ 2023-10-11 18:38 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Mon, Oct 09, 2023 at 05:05:30PM -0400, Jeff King wrote:
> When we see a large offset bit in the regular midx offset table, we
> use the entry as an index into a separate large offset table (just like
> a pack idx does). But we don't bounds-check the access to that large
> offset table (nor even record its size when we parse the chunk!).
>
> The equivalent code for a regular pack idx is in check_pack_index_ptr().
> But things are a bit simpler here because of the chunked format: we can
> just check our array index directly.
>
> As a bonus, we can get rid of the st_mult() here. If our array
> bounds-check is successful, then we know that the result will fit in a
> size_t (and the bounds check uses a division to avoid overflow
> entirely).
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  midx.c                      |  8 +++++---
>  midx.h                      |  1 +
>  t/t5319-multi-pack-index.sh | 20 ++++++++++++++++++++
>  3 files changed, 26 insertions(+), 3 deletions(-)
>
> diff --git a/midx.c b/midx.c
> index 7b1b45f381..3e768d0df0 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -180,7 +180,8 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
>  	if (read_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS, midx_read_object_offsets, m))
>  		die(_("multi-pack-index required object offsets chunk missing or corrupted"));
>
> -	pair_chunk_unsafe(cf, MIDX_CHUNKID_LARGEOFFSETS, &m->chunk_large_offsets);
> +	pair_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS, &m->chunk_large_offsets,
> +		   &m->chunk_large_offsets_len);
>
>  	if (git_env_bool("GIT_TEST_MIDX_READ_RIDX", 1))
>  		pair_chunk_unsafe(cf, MIDX_CHUNKID_REVINDEX, &m->chunk_revindex);
> @@ -303,8 +304,9 @@ off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos)
>  			die(_("multi-pack-index stores a 64-bit offset, but off_t is too small"));
>
>  		offset32 ^= MIDX_LARGE_OFFSET_NEEDED;
> -		return get_be64(m->chunk_large_offsets +
> -				st_mult(sizeof(uint64_t), offset32));
> +		if (offset32 >= m->chunk_large_offsets_len / sizeof(uint64_t))
> +			die(_("multi-pack-index large offset out of bounds"));
> +		return get_be64(m->chunk_large_offsets + sizeof(uint64_t) * offset32);

Makes sense, and this seems very reasonable. I had a couple of thoughts
on this hunk, one nitpick, and one non-nitpick ;-).

The nitpick is the easy one, which is that I typically think of scaling
some index to produce an offset into some chunk, instead of dividing and
going the other way.

So I probably would have written something like:

    if (st_mult(offset32, sizeof(uint64_t)) >= m->chunk_large_offsets_len)
        die(_("multi-pack-index large offset out of bounds"));

But that is definitely a subjective/minor point, and I would not at all
be sad if you felt differently about it. That said, I do wish for a
little more information in the die() message, perhaps:

    if (st_mult(offset32, sizeof(uint64_t)) >= m->chunk_large_offsets_len)
        die(_("multi-pack-index large offset for %s out of bounds "
              "(%"PRIuMAX" is beyond %"PRIuMAX")"),
            (uintmax_t)(offset32 * sizeof(uint64_t)), /* checked earlier */
            (uintmax_t)m->chunk_large_offsets_len);

I can imagine that for debugging corrupt MIDXs in the future, having
some extra information like the above would give us a better starting
point than popping into a debugger to get the same information.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 11/20] midx: check size of revindex chunk
  2023-10-09 21:05 ` [PATCH 11/20] midx: check size of revindex chunk Jeff King
@ 2023-10-11 18:41   ` Taylor Blau
  0 siblings, 0 replies; 67+ messages in thread
From: Taylor Blau @ 2023-10-11 18:41 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Mon, Oct 09, 2023 at 05:05:33PM -0400, Jeff King wrote:
> +static int can_use_midx_ridx_chunk(struct multi_pack_index *m)
> +{
> +	if (!m->chunk_revindex)
> +		return 0;
> +	if (m->chunk_revindex_len != st_mult(sizeof(uint32_t), m->num_objects)) {
> +		error(_("multi-pack-index reverse-index chunk is the wrong size"));
> +		return 0;
> +	}
> +	return 1;
> +}

This all looks good to me. I was going to suggest that it might be worth
just NULL-ing out the chunk_revindex field here altogether. I think that
we have some prior art for doing that in the commit-graph code (though
I'd have to look to be sure...), but that's all within the same
compilation unit.

Having the pack-revindex machinery NULL out a field of the MIDX
structure feels icky to me, so I think that having a "can we use this
function?" instead makes much more sense.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 12/20] commit-graph: check size of commit data chunk
  2023-10-09 21:05 ` [PATCH 12/20] commit-graph: check size of commit data chunk Jeff King
@ 2023-10-11 18:46   ` Taylor Blau
  2023-10-11 23:22     ` Jeff King
  0 siblings, 1 reply; 67+ messages in thread
From: Taylor Blau @ 2023-10-11 18:46 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Mon, Oct 09, 2023 at 05:05:36PM -0400, Jeff King wrote:
> We expect a commit-graph file to have a fixed-size data record for each
> commit in the file (and we know the number of commits to expct from the
> size of the lookup table). If we encounter a file where this is too
> small, we'll look past the end of the chunk (and possibly even off the
> mapped memory).
>
> We can fix this by checking the size up front when we record the
> pointer.
>
> The included test doesn't segfault, since it ends up reading bytes
> from another chunk. But it produces nonsense results, since the values
> it reads are garbage. Our test notices this by comparing the output to a
> non-corrupted run of the same command (and of course we also check that
> the expected error is printed to stderr).
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  commit-graph.c          | 12 +++++++++++-
>  t/t5318-commit-graph.sh |  9 +++++++++
>  2 files changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 472332f603..9b80bbd75b 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -340,6 +340,16 @@ static int graph_read_oid_lookup(const unsigned char *chunk_start,
>  	return 0;
>  }
>
> +static int graph_read_commit_data(const unsigned char *chunk_start,
> +				  size_t chunk_size, void *data)
> +{
> +	struct commit_graph *g = data;
> +	if (chunk_size != g->num_commits * GRAPH_DATA_WIDTH)

Should this be guarded with an st_mult? I think that GRAPH_DATA_WIDTH is
defined as (the_hash_algo->rawsz + 16), so I *think* that the expression
in the parenthesis would get done as a size_t, and then g->num_commits
would be widened to a size_t for the purposes of evaluating this
expression.

So I think that this is all OK in the sense that we'd never underflow
the 64-bit space, and having more than 2^64-1/36 (some eighteen
quintillion) commits is... unlikely ;-).

But it may be worth wrapping this computation in an st_mult() anyway to
avoid future readers having to think about this.

> +		return error("commit-graph commit data chunk is wrong size");
> +	g->chunk_commit_data = chunk_start;
> +	return 0;
> +}
> +
>  static int graph_read_bloom_data(const unsigned char *chunk_start,
>  				  size_t chunk_size, void *data)
>  {
> @@ -422,7 +432,7 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
>
>  	read_chunk(cf, GRAPH_CHUNKID_OIDFANOUT, graph_read_oid_fanout, graph);
>  	read_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP, graph_read_oid_lookup, graph);
> -	pair_chunk_unsafe(cf, GRAPH_CHUNKID_DATA, &graph->chunk_commit_data);
> +	read_chunk(cf, GRAPH_CHUNKID_DATA, graph_read_commit_data, graph);

Here again would be a good use-case for a `pair_chunk_expect()`
function, but I don't want to beat a dead horse ;-).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/20] commit-graph: detect out-of-bounds extra-edges pointers
  2023-10-09 21:05 ` [PATCH 13/20] commit-graph: detect out-of-bounds extra-edges pointers Jeff King
@ 2023-10-11 19:02   ` Taylor Blau
  0 siblings, 0 replies; 67+ messages in thread
From: Taylor Blau @ 2023-10-11 19:02 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Mon, Oct 09, 2023 at 05:05:38PM -0400, Jeff King wrote:
> ---
>  commit-graph.c          | 20 ++++++++++++++------
>  commit-graph.h          |  1 +
>  t/t5318-commit-graph.sh |  8 ++++++++
>  3 files changed, 23 insertions(+), 6 deletions(-)

All looks good here, and the proposed log message is very clear and
well-written. One minor note below...
> @@ -931,14 +932,21 @@ static int fill_commit_in_graph(struct repository *r,
>  		return 1;
>  	}
>
> -	parent_data_ptr = (uint32_t*)(g->chunk_extra_edges +
> -			  st_mult(4, edge_value & GRAPH_EDGE_LAST_MASK));
> +	parent_data_pos = edge_value & GRAPH_EDGE_LAST_MASK;
>  	do {
> -		edge_value = get_be32(parent_data_ptr);
> +		if (g->chunk_extra_edges_size / sizeof(uint32_t) <= parent_data_pos) {
> +			error("commit-graph extra-edges pointer out of bounds");

It might be nice to add some extra data here, too, like which commit OID
we were trying to load, the offset we were supposed to read at, and the
size of the extra edges chunk itself.

We should probably also mark this string for translation, but both are
minor points.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 14/20] commit-graph: bounds-check base graphs chunk
  2023-10-09 21:05 ` [PATCH 14/20] commit-graph: bounds-check base graphs chunk Jeff King
@ 2023-10-11 19:05   ` Taylor Blau
  0 siblings, 0 replies; 67+ messages in thread
From: Taylor Blau @ 2023-10-11 19:05 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Mon, Oct 09, 2023 at 05:05:41PM -0400, Jeff King wrote:
> When we are loading a commit-graph chain, we check that each slice of the
> chain points to the appropriate set of base graphs via its BASE chunk.
> But since we don't record the size of the chunk, we may access
> out-of-bounds memory if the file is corrupted.
>
> Since we know the number of entries we expect to find (based on the
> position within the commit-graph-chain file), we can just check the size
> up front.
>
> In theory this would also let us drop the st_mult() call a few lines
> later when we actually access the memory, since we know that the
> computed offset will fit in a size_t. But because the operands
> "g->hash_len" and "n" have types "unsigned char" and "int", we'd have to
> cast to size_t first. Leaving the st_mult() does that cast, and makes it
> more obvious that we don't have an overflow problem.
>
> Note that the test does not actually segfault before this patch, since
> it just reads garbage from the chunk after BASE (and indeed, it even
> rejects the file because that garbage does not have the expected hash
> value). You could construct a file with BASE at the end that did
> segfault, but corrupting the existing one is easy, and we can check
> stderr for the expected message.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  commit-graph.c                |  8 +++++++-
>  commit-graph.h                |  1 +
>  t/t5324-split-commit-graph.sh | 14 ++++++++++++++
>  3 files changed, 22 insertions(+), 1 deletion(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index e4860841fc..4377b547c8 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -435,7 +435,8 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
>  	read_chunk(cf, GRAPH_CHUNKID_DATA, graph_read_commit_data, graph);
>  	pair_chunk(cf, GRAPH_CHUNKID_EXTRAEDGES, &graph->chunk_extra_edges,
>  		   &graph->chunk_extra_edges_size);
> -	pair_chunk_unsafe(cf, GRAPH_CHUNKID_BASE, &graph->chunk_base_graphs);
> +	pair_chunk(cf, GRAPH_CHUNKID_BASE, &graph->chunk_base_graphs,
> +		   &graph->chunk_base_graphs_size);
>
>  	if (s->commit_graph_generation_version >= 2) {
>  		pair_chunk_unsafe(cf, GRAPH_CHUNKID_GENERATION_DATA,
> @@ -546,6 +547,11 @@ static int add_graph_to_chain(struct commit_graph *g,
>  		return 0;
>  	}
>
> +	if (g->chunk_base_graphs_size / g->hash_len < n) {
> +		warning(_("commit-graph base graphs chunk is too small"));
> +		return 0;
> +	}
> +

Nice. Here's a spot where we would not benefit from a function like
`pair_chunk_expect()`, since we don't know about the chain when we are
parsing an individual layer of it. So storing the length off to the side
and checking it within `add_graph_to_chain()` makes sense.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 17/20] commit-graph: check bounds when accessing BDAT chunk
  2023-10-09 21:05 ` [PATCH 17/20] commit-graph: check bounds when accessing BDAT chunk Jeff King
@ 2023-10-11 19:11   ` Taylor Blau
  2023-10-11 23:27     ` Jeff King
  0 siblings, 1 reply; 67+ messages in thread
From: Taylor Blau @ 2023-10-11 19:11 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Mon, Oct 09, 2023 at 05:05:50PM -0400, Jeff King wrote:
> ---
>  bloom.c              | 24 ++++++++++++++++++++++++
>  commit-graph.c       | 10 ++++++++++
>  commit-graph.h       |  1 +
>  t/t4216-log-bloom.sh | 28 ++++++++++++++++++++++++++++
>  4 files changed, 63 insertions(+)
>
> diff --git a/bloom.c b/bloom.c
> index aef6b5fea2..61abad7f8c 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -29,6 +29,26 @@ static inline unsigned char get_bitmask(uint32_t pos)
>  	return ((unsigned char)1) << (pos & (BITS_PER_WORD - 1));
>  }
>
> +static int check_bloom_offset(struct commit_graph *g, uint32_t pos,
> +			      uint32_t offset)
> +{
> +	/*
> +	 * Note that we allow offsets equal to the data size, which would set
> +	 * our pointers at one past the end of the chunk memory. This is
> +	 * necessary because the on-disk index points to the end of the
> +	 * entries (so we can compute size by comparing adjacent ones). And
> +	 * naturally the final entry's end is one-past-the-end of the chunk.
> +	 */
> +	if (offset <= g->chunk_bloom_data_size - BLOOMDATA_CHUNK_HEADER_SIZE)
> +		return 0;
> +
> +	warning("ignoring out-of-range offset (%"PRIuMAX") for changed-path"
> +		" filter at pos %"PRIuMAX" of %s (chunk size: %"PRIuMAX")",
> +		(uintmax_t)offset, (uintmax_t)pos,
> +		g->filename, (uintmax_t)g->chunk_bloom_data_size);

Should this be marked for translation?

> +	return -1;
> +}
> +
>  static int load_bloom_filter_from_graph(struct commit_graph *g,
>  					struct bloom_filter *filter,
>  					uint32_t graph_pos)
> @@ -51,6 +71,10 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
>  	else
>  		start_index = 0;
>
> +	if (check_bloom_offset(g, lex_pos, end_index) < 0 ||
> +	    check_bloom_offset(g, lex_pos - 1, start_index) < 0)

Can lex_pos ever be zero? I can't think of any reason that it couldn't,
and indeed the (elided) diff context shows that immediately preceding
this if-statement is another that checks "if (lex_pos > 0)".

So I think we'd want to avoid checking lex_pos - 1 if we know that
lex_pos is zero. Not that any of this really matters, since the only
thing we use the computed value for is showing the graph position in the
warning() message. So seeing a bogus value there isn't the end of the
world.

But avoiding this check when lex_pos == 0 is straightforward, so is
probably worth doing.

(As an aside, we should be able to simplify that if statement to just
"(if lex_pos)", since lex_pos will never be negative).

> +		return 0;
> +
>  	filter->len = end_index - start_index;
>  	filter->data = (unsigned char *)(g->chunk_bloom_data +
>  					sizeof(unsigned char) * start_index +
> diff --git a/commit-graph.c b/commit-graph.c
> index f446e76c28..f7a42be6d0 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -365,7 +365,17 @@ static int graph_read_bloom_data(const unsigned char *chunk_start,
>  {
>  	struct commit_graph *g = data;
>  	uint32_t hash_version;
> +
> +	if (chunk_size < BLOOMDATA_CHUNK_HEADER_SIZE) {
> +		warning("ignoring too-small changed-path chunk"
> +			" (%"PRIuMAX" < %"PRIuMAX") in commit-graph file",
> +			(uintmax_t)chunk_size,
> +			(uintmax_t)BLOOMDATA_CHUNK_HEADER_SIZE);

Ditto on the translation suggestion from above.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 18/20] commit-graph: check bounds when accessing BIDX chunk
  2023-10-09 21:05 ` [PATCH 18/20] commit-graph: check bounds when accessing BIDX chunk Jeff King
@ 2023-10-11 19:15   ` Taylor Blau
  0 siblings, 0 replies; 67+ messages in thread
From: Taylor Blau @ 2023-10-11 19:15 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Mon, Oct 09, 2023 at 05:05:53PM -0400, Jeff King wrote:
> We load the bloom_filter_indexes chunk using pair_chunk(), so we have no
> idea how big it is. This can lead to out-of-bounds reads if it is
> smaller than expected, since we index it based on the number of commits
> found elsewhere in the graph file.
>
> We can check the chunk size up front, like we do for CDAT and other
> chunks with one fixed-size record per commit.
>
> The test case demonstrates the problem. It actually won't segfault,
> because we end up reading random data from the follow-on chunk (BDAT in
> this case), and the bounds checks added in the previous patch complain.
> But this is by no means assured, and you can craft a commit-graph file
> with BIDX at the end (or a smaller BDAT) that does segfault.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  commit-graph.c       | 16 ++++++++++++++--
>  t/t4216-log-bloom.sh |  9 +++++++++
>  2 files changed, 23 insertions(+), 2 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index f7a42be6d0..1f334987b5 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -360,6 +360,18 @@ static int graph_read_generation_data(const unsigned char *chunk_start,
>  	return 0;
>  }
>
> +static int graph_read_bloom_index(const unsigned char *chunk_start,
> +				  size_t chunk_size, void *data)
> +{
> +	struct commit_graph *g = data;
> +	if (chunk_size != g->num_commits * 4) {

Here we probably should wrap `g->num_commits * 4` in an st_mult() call.
Having 4B commits is probably pretty unlikely in practice, but we have
definitely seen issues in the wild at GitHub where we have more than
2^32-1/20 commits in a single network.git.

> +		warning("commit-graph changed-path index chunk is too small");

Should this be marked for translation?

> +		return -1;
> +	}
> +	g->chunk_bloom_indexes = chunk_start;
> +	return 0;
> +}
> +
>  static int graph_read_bloom_data(const unsigned char *chunk_start,
>  				  size_t chunk_size, void *data)
>  {
> @@ -470,8 +482,8 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
>  	}
>
>  	if (s->commit_graph_read_changed_paths) {
> -		pair_chunk_unsafe(cf, GRAPH_CHUNKID_BLOOMINDEXES,
> -			   &graph->chunk_bloom_indexes);
> +		read_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
> +			   graph_read_bloom_index, graph);

Since we know g->num_commits by this point, this would be another spot
that would benefit from a `pair_chunk_expect()` function. But, again,
not trying to beat a dead horse here or anything ;-).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 19/20] commit-graph: detect out-of-order BIDX offsets
  2023-10-09 21:05 ` [PATCH 19/20] commit-graph: detect out-of-order BIDX offsets Jeff King
@ 2023-10-11 19:16   ` Taylor Blau
  0 siblings, 0 replies; 67+ messages in thread
From: Taylor Blau @ 2023-10-11 19:16 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Mon, Oct 09, 2023 at 05:05:56PM -0400, Jeff King wrote:
> diff --git a/bloom.c b/bloom.c
> index 61abad7f8c..1474aa19fa 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -75,6 +75,16 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
>  	    check_bloom_offset(g, lex_pos - 1, start_index) < 0)
>  		return 0;
>
> +	if (end_index < start_index) {
> +		warning("ignoring decreasing changed-path index offsets"
> +			" (%"PRIuMAX" > %"PRIuMAX") for positions"
> +			" %"PRIuMAX" and %"PRIuMAX" of %s",

Should this be marked for translation?

Otherwise this LGTM.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/20] bounds-checks for chunk-based files
  2023-10-09 20:55 [PATCH 0/20] bounds-checks for chunk-based files Jeff King
                   ` (19 preceding siblings ...)
  2023-10-09 21:06 ` [PATCH 20/20] chunk-format: drop pair_chunk_unsafe() Jeff King
@ 2023-10-11 19:19 ` Taylor Blau
  2023-10-11 23:31   ` Jeff King
  2023-10-13 19:25 ` [PATCH 0/8] chunk-format: introduce `pair_chunk_expect()` convenience API Taylor Blau
  2023-10-14  0:43 ` [PATCH 21/20] t5319: make corrupted large-offset test more robust Jeff King
  22 siblings, 1 reply; 67+ messages in thread
From: Taylor Blau @ 2023-10-11 19:19 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Mon, Oct 09, 2023 at 04:55:44PM -0400, Jeff King wrote:
>  bloom.c                            |  34 +++++++++
>  chunk-format.c                     |  24 ++++--
>  chunk-format.h                     |   9 ++-
>  commit-graph.c                     | 119 ++++++++++++++++++++++++-----
>  commit-graph.h                     |   4 +
>  midx.c                             |  68 +++++++++++++----
>  midx.h                             |   3 +
>  pack-revindex.c                    |  13 +++-
>  t/lib-chunk.sh                     |  17 +++++
>  t/lib-chunk/corrupt-chunk-file.pl  |  66 ++++++++++++++++
>  t/t4216-log-bloom.sh               |  50 ++++++++++++
>  t/t5318-commit-graph.sh            |  76 +++++++++++++++++-
>  t/t5319-multi-pack-index.sh        | 102 ++++++++++++++++++++++++-
>  t/t5324-split-commit-graph.sh      |  20 ++++-
>  t/t5328-commit-graph-64bit-time.sh |  10 +++
>  15 files changed, 568 insertions(+), 47 deletions(-)
>  create mode 100644 t/lib-chunk.sh
>  create mode 100644 t/lib-chunk/corrupt-chunk-file.pl

I reviewed this carefully (well, except for the new Perl script, for
obvious[^1] reasons ;-)).

Everything mostly looks good to me, though I
had a handful of review comments throughout. Many of them are trivial
(e.g. a number of warning() and error() strings should be marked for
translation, etc.), but a couple of them I think are worth looking at.

Most notably, I think that by the end of the series, I was convinced
that having some kind of 'pair_chunk_expectsz()' or similar would be
useful and eliminate a good chunk of the boilerplate you have to write
to check the chunk size against an expected value when using
read_chunk().

Otherwise, this looks great. I appreciate the care you took in finding
and fixing these issues, as well as thoroughly documenting the process
(and the security implications, or lack thereof). Thanks for working on
this!

Thanks,
Taylor

[^1]: That I may be the world's least competent Perl programmer.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 01/20] chunk-format: note that pair_chunk() is unsafe
  2023-10-10 23:45   ` Taylor Blau
@ 2023-10-11 22:49     ` Jeff King
  0 siblings, 0 replies; 67+ messages in thread
From: Jeff King @ 2023-10-11 22:49 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git

On Tue, Oct 10, 2023 at 07:45:41PM -0400, Taylor Blau wrote:

> On Mon, Oct 09, 2023 at 04:58:23PM -0400, Jeff King wrote:
> > There are no callers of the "safe" version yet, but we'll add some in
> > subsequent patches.
> 
> Makes sense.
> 
> > +int pair_chunk_unsafe(struct chunkfile *cf,
> > +		      uint32_t chunk_id,
> > +		      const unsigned char **p)
> >  {
> > -	return read_chunk(cf, chunk_id, pair_chunk_fn, p);
> > +	size_t dummy;
> > +	return pair_chunk(cf, chunk_id, p, &dummy);
> 
> I have always disliked functions that require you to pass a non-NULL
> pointer to some value that you may or may not want to have that function
> fill out. So I was going to suggest something along the lines of
> "pair_chunk() should tolerate a NULL fourth argument instead of filling
> out the size unconditionally".
> 
> But that's (a) the whole point of the series ;-), and (b) unnecessary,
> since this function is going to go away entirely by the end of the
> series, too.

Yeah, for the record, I think a dummy variable like this is usually a
code smell. And it truly is a "problem" here, because we are
intentionally doing the unsafe and stupid thing. :)

-Peff

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 03/20] midx: stop ignoring malformed oid fanout chunk
  2023-10-10 23:50   ` Taylor Blau
@ 2023-10-11 22:52     ` Jeff King
  0 siblings, 0 replies; 67+ messages in thread
From: Jeff King @ 2023-10-11 22:52 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git

On Tue, Oct 10, 2023 at 07:50:31PM -0400, Taylor Blau wrote:

> All makes sense. I have a mild preference for "missing or corrupt" over
> "missing or corrupted", but it's mild enough that I wouldn't be sad if
> you kept it as-is ;-).

Hmm, now I wonder which is grammatically more appropriate.

According to the internet (which is never wrong), the distinction is
generally one of intrinsic corrupt state versus something that has
become corrupted. So I guess it depends on whether the file was created
corrupt by a bug or corrupted by a hard disk failure. ;)

I do not have a strong preference myself. The commits went into next
already, but I think you also pointed out that many of these could be
marked for translation (though this one is already).

So we could do one or two patches to adjust the error messages on top.

-Peff

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 04/20] commit-graph: check size of oid fanout chunk
  2023-10-11  0:08   ` Taylor Blau
  2023-10-11  1:24     ` Taylor Blau
@ 2023-10-11 23:01     ` Jeff King
  1 sibling, 0 replies; 67+ messages in thread
From: Jeff King @ 2023-10-11 23:01 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git

On Tue, Oct 10, 2023 at 08:08:05PM -0400, Taylor Blau wrote:

> Nice. This makes sense and seems like an obvious improvement over the
> existing code.
> 
> I wonder how common this pattern is. We have read_chunk() which is for
> handling more complex scenarios than this. But the safe version of
> pair_chunk() really just wants to check that the size of the chunk is as
> expected and assign the location in the mmap to some pointer.

Sometimes yes, sometimes no. For fixed-size ones like this, that's
sufficient. For others we have to record the size and use it for later
bounds-checking. IIRC it's about 50/50 between the two.

> Do you think it would be worth changing pair_chunk() to take an expected
> size_t and handle this generically? I.e. have a version of
> chunk-format::pair_chunk_fn() that looks something like:
> 
>     static int pair_chunk_fn(const unsigned char *chunk_start,
>                              size_t chunk_size, void *data)
>     {
>         const unsigned char **p = data;
>         if (chunk_size != data->size)
>             return -1;
>         *p = chunk_start;
>         return 0;
>     }
> 
> and then our call here would be:
> 
>   if (pair_chunk(cf, GRAPH_CHUNKID_OIDFANOUT,
>                  (const unsigned char **)&graph->chunk_oid_fanout,
>                  256 * sizeof(uint32_t)) < 0)
>       return error("commit-graph oid fanout chunk is wrong size");
> 
> I dunno. It's hard to have a more concrete recomendation without having
> read the rest of the series. So it's possible that this is just complete
> nonsense ;-). But my hunch is that there are a number of callers that
> would benefit from having this built in.

I don't think it's nonsense, and I do think other callers would benefit.
On the other hand, I kind of like the notion that there is a complete
validation callback for each of these chunks. Even though it just checks
the size for now, it could handle other things. In the case of OIDF, for
example, we can check whether the entries are monotonic. It's just that
we happen to do those checks elsewhere.

Hmm, actually, looking at that again, I think I may have missed a case
in patch 6. For pack .idx files, we check the fanout table when they are
loaded. And patch 6 adds the same for commit-graph files. I thought midx
files were handled the same .idx, but looking at it again, I only see
the monotonicity check in the "multi-pack-index verify" code paths. So
it might need the same treatment.

I'm not sure how I missed that (I started by making a corrupted midx
first and couldn't get it to fail, which is when I discovered the
existing checks, but maybe I am mixing up .idx and midx in my memory).

-Peff

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 07/20] midx: check size of pack names chunk
  2023-10-11 14:52   ` Taylor Blau
@ 2023-10-11 23:06     ` Jeff King
  0 siblings, 0 replies; 67+ messages in thread
From: Jeff King @ 2023-10-11 23:06 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git

On Wed, Oct 11, 2023 at 10:52:13AM -0400, Taylor Blau wrote:

> On Mon, Oct 09, 2023 at 05:05:14PM -0400, Jeff King wrote:
> > @@ -176,9 +176,16 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
> >
> >  	cur_pack_name = (const char *)m->chunk_pack_names;
> >  	for (i = 0; i < m->num_packs; i++) {
> > +		const char *end;
> > +		size_t avail = m->chunk_pack_names_len -
> > +				(cur_pack_name - (const char *)m->chunk_pack_names);
> > +
> 
> This patch all looks good to me, but reading this hunk gave me a little
> bit of pause. I was wondering what might happen if chunk_pack_names_len
> was zero, and subtracting some other non-zero size_t from it might cause
> us to wrap around.
> 
> But I think that's a non-issue here, since we'd set cur_pack_name to the
> beginning of the chunk, and compute avail as 0 - (m->chunk_pack_names -
> m->chunk_pack_names), and get 0 back, causing our memchr() lookup below
> to fail, and for us to report this chunk is garbage.

Right. If it is 0, then we should have 0 avail here in the first loop.

I was more worried while writing this that:

  cur_pack_name = end + 1;

later in the loop could get us an off-by-one. But we know we are always
pointing to one past an available NUL there, so at most our subtraction
will equal m->chunk_pack_names_len.

> And since cur_pack_name monotonically increases, I think that there is
> an inductive argument to be made that this subtraction is always safe to
> do. But it couldn't hurt to do something like:
> 
>     size_t read = cur_pack_name - (const char *)m->chunk_pack_names;
>     size_t avail = m->chunk_pack_names_len;
> 
>     if (read > avail)
>         die("...");
>     avail -= read;
> 
> to make absolutely sure that we would never underflow here.

You end up with two die() calls, then. One for "hey, we somehow read too
far", and one for "hey, I ran out of data while reading the entry". And
the first one cannot be triggered.

IOW, I think your die() here is a BUG().

-Peff

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 08/20] midx: enforce chunk alignment on reading
  2023-10-11 15:01   ` Taylor Blau
@ 2023-10-11 23:09     ` Jeff King
  0 siblings, 0 replies; 67+ messages in thread
From: Jeff King @ 2023-10-11 23:09 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git

On Wed, Oct 11, 2023 at 11:01:09AM -0400, Taylor Blau wrote:

> On Mon, Oct 09, 2023 at 05:05:23PM -0400, Jeff King wrote:
> > @@ -120,6 +121,11 @@ int read_table_of_contents(struct chunkfile *cf,
> >  			error(_("terminating chunk id appears earlier than expected"));
> >  			return 1;
> >  		}
> > +		if (chunk_offset % expected_alignment != 0) {
> 
> Oops, I missed this in my first read. I'm definitely nit-picking here,
> but this should probably be:
> 
>     if (chunk_offset % expected_alignment)
> 
> without the trailing "!= 0".
> 
> I don't have a strong preference here, since we are doing a comparison
> of an arithmetic operation against an (un-)expected value. But I think
> the CodingGuidelines would technically call this out of style...

I suppose so, but somehow I consider the subtlety of "%" to merit the
more explicit comparison (versus something like "if (foo)"). Grepping
for "if (.* %)' seems to show some of both.

-Peff

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 10/20] midx: bounds-check large offset chunk
  2023-10-11 18:38   ` Taylor Blau
@ 2023-10-11 23:18     ` Jeff King
  0 siblings, 0 replies; 67+ messages in thread
From: Jeff King @ 2023-10-11 23:18 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git

On Wed, Oct 11, 2023 at 02:38:07PM -0400, Taylor Blau wrote:

> >  		offset32 ^= MIDX_LARGE_OFFSET_NEEDED;
> > -		return get_be64(m->chunk_large_offsets +
> > -				st_mult(sizeof(uint64_t), offset32));
> > +		if (offset32 >= m->chunk_large_offsets_len / sizeof(uint64_t))
> > +			die(_("multi-pack-index large offset out of bounds"));
> > +		return get_be64(m->chunk_large_offsets + sizeof(uint64_t) * offset32);
> 
> Makes sense, and this seems very reasonable. I had a couple of thoughts
> on this hunk, one nitpick, and one non-nitpick ;-).
> 
> The nitpick is the easy one, which is that I typically think of scaling
> some index to produce an offset into some chunk, instead of dividing and
> going the other way.
> 
> So I probably would have written something like:
> 
>     if (st_mult(offset32, sizeof(uint64_t)) >= m->chunk_large_offsets_len)
>         die(_("multi-pack-index large offset out of bounds"));

Yeah, I admit that my inclination is to think of it that way, too, and
the division is a bit of an inversion. But I guess I am hard-wired to do
bounds checks with division, because I know it avoids overflow issues.
And the behavior is actually different, since st_mult() dies (with a
somewhat vague message) rather than returning with an error.

I was actually tempted to write a small inline helper to do
bounds-checks (since this pattern appears a few times in this series).
But of course it suffers from the same issues (it dies with a vague
message, or it returns a result code and then is awkward to call/use
because you have to stick the output in an out-parameter).

> But that is definitely a subjective/minor point, and I would not at all
> be sad if you felt differently about it. That said, I do wish for a
> little more information in the die() message, perhaps:
> 
>     if (st_mult(offset32, sizeof(uint64_t)) >= m->chunk_large_offsets_len)
>         die(_("multi-pack-index large offset for %s out of bounds "
>               "(%"PRIuMAX" is beyond %"PRIuMAX")"),
>             (uintmax_t)(offset32 * sizeof(uint64_t)), /* checked earlier */
>             (uintmax_t)m->chunk_large_offsets_len);
> 
> I can imagine that for debugging corrupt MIDXs in the future, having
> some extra information like the above would give us a better starting
> point than popping into a debugger to get the same information.

As you'll see when you get to the BDAT/BIDX messages, I put in a load of
detail. That's because I did those ones first, and then after seeing the
terse "eh, it's the wrong size" messages in the rest of the commit-graph
and midx code, I just followed suit.

We can circle back and improve the detail in the messages. One slightly
annoying thing is dealing with it in the tests. We'd have to do one of:

  - make the tests more brittle by hard-coding positions and offsets we
    don't necessarily care about

  - loosen the tests by just matching with "grep", which can sometimes
    miss other unexpected output

  - do some post-processing on the output to massage out the details we
    don't care about; this in theory is the best of both worlds, but
    reliably post-processing is non-trivial.

So of the three I'd probably just loosen to use "grep" in most spots.

-Peff

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 12/20] commit-graph: check size of commit data chunk
  2023-10-11 18:46   ` Taylor Blau
@ 2023-10-11 23:22     ` Jeff King
  0 siblings, 0 replies; 67+ messages in thread
From: Jeff King @ 2023-10-11 23:22 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git

On Wed, Oct 11, 2023 at 02:46:28PM -0400, Taylor Blau wrote:

> > +static int graph_read_commit_data(const unsigned char *chunk_start,
> > +				  size_t chunk_size, void *data)
> > +{
> > +	struct commit_graph *g = data;
> > +	if (chunk_size != g->num_commits * GRAPH_DATA_WIDTH)
> 
> Should this be guarded with an st_mult? I think that GRAPH_DATA_WIDTH is
> defined as (the_hash_algo->rawsz + 16), so I *think* that the expression
> in the parenthesis would get done as a size_t, and then g->num_commits
> would be widened to a size_t for the purposes of evaluating this
> expression.
> 
> So I think that this is all OK in the sense that we'd never underflow
> the 64-bit space, and having more than 2^64-1/36 (some eighteen
> quintillion) commits is... unlikely ;-).

Hmm, yeah, I think you are right, but I agree it's awfully subtle. There
is no reason somebody couldn't later change "rawsz" to a smaller type
(after all, we know it's going to be tiny), and it would be quite
surprising if that introduces an overflow in far-away code. We should
protect ourselves here.

-Peff

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 17/20] commit-graph: check bounds when accessing BDAT chunk
  2023-10-11 19:11   ` Taylor Blau
@ 2023-10-11 23:27     ` Jeff King
  0 siblings, 0 replies; 67+ messages in thread
From: Jeff King @ 2023-10-11 23:27 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git

On Wed, Oct 11, 2023 at 03:11:59PM -0400, Taylor Blau wrote:

> > +	if (check_bloom_offset(g, lex_pos, end_index) < 0 ||
> > +	    check_bloom_offset(g, lex_pos - 1, start_index) < 0)
> 
> Can lex_pos ever be zero? I can't think of any reason that it couldn't,
> and indeed the (elided) diff context shows that immediately preceding
> this if-statement is another that checks "if (lex_pos > 0)".
> 
> So I think we'd want to avoid checking lex_pos - 1 if we know that
> lex_pos is zero. Not that any of this really matters, since the only
> thing we use the computed value for is showing the graph position in the
> warning() message. So seeing a bogus value there isn't the end of the
> world.

If lex_pos is 0, then we set start_index to 0 manually, which we know to
be valid. So we can't trigger a bogus warning(). My thinking was that it
was OK to just let this fall out naturally as it does here, since it
makes the code shorter. But if we want to cover that case separately,
I think you'd want to split the checks like:

diff --git a/bloom.c b/bloom.c
index 1474aa19fa..d9ad69ad82 100644
--- a/bloom.c
+++ b/bloom.c
@@ -65,16 +65,16 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
 	lex_pos = graph_pos - g->num_commits_in_base;
 
 	end_index = get_be32(g->chunk_bloom_indexes + 4 * lex_pos);
+	if (check_bloom_offset(g, lex_pos, end_index) < 0)
+		return 0;
 
-	if (lex_pos > 0)
+	if (lex_pos > 0) {
 		start_index = get_be32(g->chunk_bloom_indexes + 4 * (lex_pos - 1));
-	else
+		if (check_bloom_offset(g, lex_pos - 1, start_index) < 0)
+			return 0;
+	} else
 		start_index = 0;
 
-	if (check_bloom_offset(g, lex_pos, end_index) < 0 ||
-	    check_bloom_offset(g, lex_pos - 1, start_index) < 0)
-		return 0;
-
 	if (end_index < start_index) {
 		warning("ignoring decreasing changed-path index offsets"
 			" (%"PRIuMAX" > %"PRIuMAX") for positions"

-Peff

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/20] bounds-checks for chunk-based files
  2023-10-11 19:19 ` [PATCH 0/20] bounds-checks for chunk-based files Taylor Blau
@ 2023-10-11 23:31   ` Jeff King
  0 siblings, 0 replies; 67+ messages in thread
From: Jeff King @ 2023-10-11 23:31 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git

On Wed, Oct 11, 2023 at 03:19:28PM -0400, Taylor Blau wrote:

> I reviewed this carefully (well, except for the new Perl script, for
> obvious[^1] reasons ;-)).
> 
> Everything mostly looks good to me, though I
> had a handful of review comments throughout. Many of them are trivial
> (e.g. a number of warning() and error() strings should be marked for
> translation, etc.), but a couple of them I think are worth looking at.

Thanks for taking a look. I think it may make sense to come back on top
and adjust a few of the commit messages, along with adding a few
st_mult() overflow checks that you suggest.

> Most notably, I think that by the end of the series, I was convinced
> that having some kind of 'pair_chunk_expectsz()' or similar would be
> useful and eliminate a good chunk of the boilerplate you have to write
> to check the chunk size against an expected value when using
> read_chunk().

This I'm less convinced by. In fact, I _almost_ just dropped
pair_chunk() entirely. Adding an out-parameter for the size at least
forces the caller to consider what to do with the size. But really, I
think the right mindset is "we should be sanity-checking this chunk as
we load it". And having a callback, even if it is a little bit of
boilerplate, helps set that frame of mind.

I dunno. Maybe that is all just programmer pseudo-psychology. But I also
don't like that about half the calls to pair_chunk() can't do a size
check (so we need two functions, or to make the "expect" parameter
optional).

-Peff

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 0/8] chunk-format: introduce `pair_chunk_expect()` convenience API
  2023-10-09 20:55 [PATCH 0/20] bounds-checks for chunk-based files Jeff King
                   ` (20 preceding siblings ...)
  2023-10-11 19:19 ` [PATCH 0/20] bounds-checks for chunk-based files Taylor Blau
@ 2023-10-13 19:25 ` Taylor Blau
  2023-10-13 19:25   ` [PATCH 1/8] chunk-format: introduce `pair_chunk_expect()` helper Taylor Blau
                     ` (8 more replies)
  2023-10-14  0:43 ` [PATCH 21/20] t5319: make corrupted large-offset test more robust Jeff King
  22 siblings, 9 replies; 67+ messages in thread
From: Taylor Blau @ 2023-10-13 19:25 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King

While reviewing this series, I noticed a couple of spots that would be
helped by having a convenience function that stores the start of a
chunk in a designated location *after* checking that the chunk has the
expected size.

I called this function `pair_chunk_expect()` and touched up seven spots
that I think could benefit from having a convenience function like this.

This applies directly on top of 'jk/chunk-bounds'. Thanks in advance for
your review!

Taylor Blau (8):
  chunk-format: introduce `pair_chunk_expect()` helper
  commit-graph: read `OIDF` chunk with `pair_chunk_expect()`
  commit-graph: read `CDAT` chunk with `pair_chunk_expect()`
  commit-graph: read `GDAT` chunk with `pair_chunk_expect()`
  commit-graph: read `BIDX` chunk with `pair_chunk_expect()`
  midx: read `OIDF` chunk with `pair_chunk_expect()`
  midx: read `OIDL` chunk with `pair_chunk_expect()`
  midx: read `OOFF` chunk with `pair_chunk_expect()`

 chunk-format.c | 22 +++++++++++++++++
 chunk-format.h | 12 +++++++++-
 commit-graph.c | 65 +++++++++++++-------------------------------------
 midx.c         | 58 ++++++++++++--------------------------------
 4 files changed, 65 insertions(+), 92 deletions(-)

-- 
2.42.0.352.gd9c5062ff7.dirty

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 1/8] chunk-format: introduce `pair_chunk_expect()` helper
  2023-10-13 19:25 ` [PATCH 0/8] chunk-format: introduce `pair_chunk_expect()` convenience API Taylor Blau
@ 2023-10-13 19:25   ` Taylor Blau
  2023-10-13 19:25   ` [PATCH 2/8] commit-graph: read `OIDF` chunk with `pair_chunk_expect()` Taylor Blau
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 67+ messages in thread
From: Taylor Blau @ 2023-10-13 19:25 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King

In previous commits, the pair_chunk() interface grew a required "size"
pointer, so that the caller is forced to at least have a handle on the
actual size of the given chunk.

Many callers were converted to the new interface. A handful were instead
converted from the unsafe version of pair_chunk() to read_chunk() so
that they could check their expected size.

This led to a lot of code like:

    static int graph_read_oid_fanout(const unsigned char *chunk_start,
                                     size_t chunk_size, void *data)
    {
      struct commit_graph *g = data;
      if (chunk_size != 256 * sizeof(uint32_t))
        return error("commit-graph oid fanout chunk is wrong size");
      g->chunk_oid_fanout = (const uint32_t *)chunk_start;
      return 0;
    }

, leaving the caller to use read_chunk(), like so:

    read_chunk(cf, GRAPH_CHUNKID_OIDFANOUT, graph_read_oid_fanout, graph);

The callback to read_chunk() (in the above, `graph_read_oid_fanout()`)
does nothing more than (a) assign a pointer to the location of the start
of the chunk in the mmap'd file, and (b) assert that it has the correct
size.

For callers that know the expected size of their chunk(s) up-front, we
can simplify this by teaching the chunk-format API itself to validate
the expected size for us.

This is wrapped in a new function, called `pair_chunk_expect()` which
takes a "size_t" instead of a "size_t *", and validates that the given
chunk matches the expected size as given.

This will allow us to reduce the number of lines of code it takes to
perform these basic read_chunk() operations, by taking the above and
replacing it with something like:

    if (pair_chunk_expect(cf, GRAPH_CHUNKID_OIDFANOUT,
                          (const unsigned char **)&graph->chunk_oid_fanout,
                          256 * sizeof(uint32_t)))
      error(_("commit-graph oid fanout chunk is wrong size"));

We will perform those transformations in the following commits.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 chunk-format.c | 22 ++++++++++++++++++++++
 chunk-format.h | 12 +++++++++++-
 2 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/chunk-format.c b/chunk-format.c
index cdc7f39b70..9550f15699 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -163,6 +163,8 @@ int read_table_of_contents(struct chunkfile *cf,
 struct pair_chunk_data {
 	const unsigned char **p;
 	size_t *size;
+
+	size_t expected_size;
 };
 
 static int pair_chunk_fn(const unsigned char *chunk_start,
@@ -175,6 +177,17 @@ static int pair_chunk_fn(const unsigned char *chunk_start,
 	return 0;
 }
 
+static int pair_chunk_expect_fn(const unsigned char *chunk_start,
+				size_t chunk_size,
+				void *data)
+{
+	struct pair_chunk_data *pcd = data;
+	if (pcd->expected_size != chunk_size)
+		return -1;
+	*pcd->p = chunk_start;
+	return 0;
+}
+
 int pair_chunk(struct chunkfile *cf,
 	       uint32_t chunk_id,
 	       const unsigned char **p,
@@ -184,6 +197,15 @@ int pair_chunk(struct chunkfile *cf,
 	return read_chunk(cf, chunk_id, pair_chunk_fn, &pcd);
 }
 
+int pair_chunk_expect(struct chunkfile *cf,
+		      uint32_t chunk_id,
+		      const unsigned char **p,
+		      size_t expected_size)
+{
+	struct pair_chunk_data pcd = { .p = p, .expected_size = expected_size };
+	return read_chunk(cf, chunk_id, pair_chunk_expect_fn, &pcd);
+}
+
 int read_chunk(struct chunkfile *cf,
 	       uint32_t chunk_id,
 	       chunk_read_fn fn,
diff --git a/chunk-format.h b/chunk-format.h
index 14b76180ef..92c529d7ab 100644
--- a/chunk-format.h
+++ b/chunk-format.h
@@ -17,7 +17,8 @@ struct chunkfile;
  *
  * If reading a file, use a NULL 'struct hashfile *' and then call
  * read_table_of_contents(). Supply the memory-mapped data to the
- * pair_chunk() or read_chunk() methods, as appropriate.
+ * pair_chunk(), pair_chunk_expect(), or read_chunk() methods, as
+ * appropriate.
  *
  * DO NOT MIX THESE MODES. Use different 'struct chunkfile' instances
  * for reading and writing.
@@ -54,6 +55,15 @@ int pair_chunk(struct chunkfile *cf,
 	       const unsigned char **p,
 	       size_t *size);
 
+/*
+ * Similar to 'pair_chunk', but used for callers who have an expected
+ * size for the given 'chunk_id' in advance.
+ */
+int pair_chunk_expect(struct chunkfile *cf,
+		      uint32_t chunk_id,
+		      const unsigned char **p,
+		      size_t expected_size);
+
 typedef int (*chunk_read_fn)(const unsigned char *chunk_start,
 			     size_t chunk_size, void *data);
 /*
-- 
2.42.0.352.gd9c5062ff7.dirty


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 2/8] commit-graph: read `OIDF` chunk with `pair_chunk_expect()`
  2023-10-13 19:25 ` [PATCH 0/8] chunk-format: introduce `pair_chunk_expect()` convenience API Taylor Blau
  2023-10-13 19:25   ` [PATCH 1/8] chunk-format: introduce `pair_chunk_expect()` helper Taylor Blau
@ 2023-10-13 19:25   ` Taylor Blau
  2023-10-13 19:25   ` [PATCH 3/8] commit-graph: read `CDAT` " Taylor Blau
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 67+ messages in thread
From: Taylor Blau @ 2023-10-13 19:25 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King

The `OIDF` chunk can benefit from the new chunk-format API function
described in the previous commit. Convert it to `pair_chunk_expect()`
accordingly.

While we're at it, let's mark the error() string for translation.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 commit-graph.c | 15 ++++-----------
 1 file changed, 4 insertions(+), 11 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 1f334987b5..cdefd7f926 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -321,16 +321,6 @@ static int verify_commit_graph_lite(struct commit_graph *g)
 	return 0;
 }
 
-static int graph_read_oid_fanout(const unsigned char *chunk_start,
-				 size_t chunk_size, void *data)
-{
-	struct commit_graph *g = data;
-	if (chunk_size != 256 * sizeof(uint32_t))
-		return error("commit-graph oid fanout chunk is wrong size");
-	g->chunk_oid_fanout = (const uint32_t *)chunk_start;
-	return 0;
-}
-
 static int graph_read_oid_lookup(const unsigned char *chunk_start,
 				 size_t chunk_size, void *data)
 {
@@ -462,7 +452,10 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 				   GRAPH_HEADER_SIZE, graph->num_chunks, 1))
 		goto free_and_return;
 
-	read_chunk(cf, GRAPH_CHUNKID_OIDFANOUT, graph_read_oid_fanout, graph);
+	if (pair_chunk_expect(cf, GRAPH_CHUNKID_OIDFANOUT,
+			      (const unsigned char **)&graph->chunk_oid_fanout,
+			      256 * sizeof(uint32_t)))
+		error(_("commit-graph oid fanout chunk is wrong size"));
 	read_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP, graph_read_oid_lookup, graph);
 	read_chunk(cf, GRAPH_CHUNKID_DATA, graph_read_commit_data, graph);
 	pair_chunk(cf, GRAPH_CHUNKID_EXTRAEDGES, &graph->chunk_extra_edges,
-- 
2.42.0.352.gd9c5062ff7.dirty


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 3/8] commit-graph: read `CDAT` chunk with `pair_chunk_expect()`
  2023-10-13 19:25 ` [PATCH 0/8] chunk-format: introduce `pair_chunk_expect()` convenience API Taylor Blau
  2023-10-13 19:25   ` [PATCH 1/8] chunk-format: introduce `pair_chunk_expect()` helper Taylor Blau
  2023-10-13 19:25   ` [PATCH 2/8] commit-graph: read `OIDF` chunk with `pair_chunk_expect()` Taylor Blau
@ 2023-10-13 19:25   ` Taylor Blau
  2023-10-13 19:25   ` [PATCH 4/8] commit-graph: read `GDAT` " Taylor Blau
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 67+ messages in thread
From: Taylor Blau @ 2023-10-13 19:25 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King

Perform a similar conversion as in the previous commit read the CDAT
bits.

While we're here, mark the error() string for translation, and guard
against overflow when computing the expected size by wrapping it in an
st_mult() call.

Note that the pre-image of this patch was already sufficiently guarded
against overflow, since GRAPH_DATA_WIDTH is defined as
(the_hash_algo->rawsz + 16), so the expression in the parenthesis would
get performed as a size_t, and then g->num_commits would be promoted to
the width of size_t for the purposes of evaluating this expression.

But let's make it explicitly clear that this computation is safe by
wrapping it in an st_mult() call.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 commit-graph.c | 15 ++++-----------
 1 file changed, 4 insertions(+), 11 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index cdefd7f926..97d4824673 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -330,16 +330,6 @@ static int graph_read_oid_lookup(const unsigned char *chunk_start,
 	return 0;
 }
 
-static int graph_read_commit_data(const unsigned char *chunk_start,
-				  size_t chunk_size, void *data)
-{
-	struct commit_graph *g = data;
-	if (chunk_size != g->num_commits * GRAPH_DATA_WIDTH)
-		return error("commit-graph commit data chunk is wrong size");
-	g->chunk_commit_data = chunk_start;
-	return 0;
-}
-
 static int graph_read_generation_data(const unsigned char *chunk_start,
 				      size_t chunk_size, void *data)
 {
@@ -457,7 +447,10 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 			      256 * sizeof(uint32_t)))
 		error(_("commit-graph oid fanout chunk is wrong size"));
 	read_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP, graph_read_oid_lookup, graph);
-	read_chunk(cf, GRAPH_CHUNKID_DATA, graph_read_commit_data, graph);
+	if (pair_chunk_expect(cf, GRAPH_CHUNKID_DATA,
+			      &graph->chunk_commit_data,
+			      st_mult(graph->num_commits, GRAPH_DATA_WIDTH)))
+		error(_("commit-graph commit data chunk is wrong size"));
 	pair_chunk(cf, GRAPH_CHUNKID_EXTRAEDGES, &graph->chunk_extra_edges,
 		   &graph->chunk_extra_edges_size);
 	pair_chunk(cf, GRAPH_CHUNKID_BASE, &graph->chunk_base_graphs,
-- 
2.42.0.352.gd9c5062ff7.dirty


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 4/8] commit-graph: read `GDAT` chunk with `pair_chunk_expect()`
  2023-10-13 19:25 ` [PATCH 0/8] chunk-format: introduce `pair_chunk_expect()` convenience API Taylor Blau
                     ` (2 preceding siblings ...)
  2023-10-13 19:25   ` [PATCH 3/8] commit-graph: read `CDAT` " Taylor Blau
@ 2023-10-13 19:25   ` Taylor Blau
  2023-10-13 19:25   ` [PATCH 5/8] commit-graph: read `BIDX` " Taylor Blau
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 67+ messages in thread
From: Taylor Blau @ 2023-10-13 19:25 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King

Perform an identical conversion as in previous commits to read the GDAT
chunk.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 commit-graph.c | 17 +++++------------
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 97d4824673..0fab99f5dd 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -330,16 +330,6 @@ static int graph_read_oid_lookup(const unsigned char *chunk_start,
 	return 0;
 }
 
-static int graph_read_generation_data(const unsigned char *chunk_start,
-				      size_t chunk_size, void *data)
-{
-	struct commit_graph *g = data;
-	if (chunk_size != g->num_commits * sizeof(uint32_t))
-		return error("commit-graph generations chunk is wrong size");
-	g->chunk_generation_data = chunk_start;
-	return 0;
-}
-
 static int graph_read_bloom_index(const unsigned char *chunk_start,
 				  size_t chunk_size, void *data)
 {
@@ -457,8 +447,11 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 		   &graph->chunk_base_graphs_size);
 
 	if (s->commit_graph_generation_version >= 2) {
-		read_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA,
-			   graph_read_generation_data, graph);
+		if (pair_chunk_expect(cf, GRAPH_CHUNKID_GENERATION_DATA,
+				      &graph->chunk_generation_data,
+				      st_mult(graph->num_commits,
+					      sizeof(uint32_t))))
+			error(_("commit-graph generations chunk is wrong size"));
 		pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
 			   &graph->chunk_generation_data_overflow,
 			   &graph->chunk_generation_data_overflow_size);
-- 
2.42.0.352.gd9c5062ff7.dirty


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 5/8] commit-graph: read `BIDX` chunk with `pair_chunk_expect()`
  2023-10-13 19:25 ` [PATCH 0/8] chunk-format: introduce `pair_chunk_expect()` convenience API Taylor Blau
                     ` (3 preceding siblings ...)
  2023-10-13 19:25   ` [PATCH 4/8] commit-graph: read `GDAT` " Taylor Blau
@ 2023-10-13 19:25   ` Taylor Blau
  2023-10-13 19:49     ` Taylor Blau
  2023-10-14 16:10     ` Junio C Hamano
  2023-10-13 19:25   ` [PATCH 6/8] midx: read `OIDF` " Taylor Blau
                     ` (3 subsequent siblings)
  8 siblings, 2 replies; 67+ messages in thread
From: Taylor Blau @ 2023-10-13 19:25 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King

Perform an identical conversion as in previous commits to read the BIDX
chunk.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 commit-graph.c | 18 ++++--------------
 1 file changed, 4 insertions(+), 14 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 0fab99f5dd..ad98f6334d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -330,18 +330,6 @@ static int graph_read_oid_lookup(const unsigned char *chunk_start,
 	return 0;
 }
 
-static int graph_read_bloom_index(const unsigned char *chunk_start,
-				  size_t chunk_size, void *data)
-{
-	struct commit_graph *g = data;
-	if (chunk_size != g->num_commits * 4) {
-		warning("commit-graph changed-path index chunk is too small");
-		return -1;
-	}
-	g->chunk_bloom_indexes = chunk_start;
-	return 0;
-}
-
 static int graph_read_bloom_data(const unsigned char *chunk_start,
 				  size_t chunk_size, void *data)
 {
@@ -461,8 +449,10 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 	}
 
 	if (s->commit_graph_read_changed_paths) {
-		read_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
-			   graph_read_bloom_index, graph);
+		if (pair_chunk_expect(cf, GRAPH_CHUNKID_BLOOMINDEXES,
+				      &graph->chunk_bloom_indexes,
+				      st_mult(graph->num_commits, 4)) == -1)
+			warning(_("commit-graph changed-path index chunk is too small (%d)"), graph->num_commits * 4);
 		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
 			   graph_read_bloom_data, graph);
 	}
-- 
2.42.0.352.gd9c5062ff7.dirty


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 6/8] midx: read `OIDF` chunk with `pair_chunk_expect()`
  2023-10-13 19:25 ` [PATCH 0/8] chunk-format: introduce `pair_chunk_expect()` convenience API Taylor Blau
                     ` (4 preceding siblings ...)
  2023-10-13 19:25   ` [PATCH 5/8] commit-graph: read `BIDX` " Taylor Blau
@ 2023-10-13 19:25   ` Taylor Blau
  2023-10-13 21:04     ` Junio C Hamano
  2023-10-13 19:25   ` [PATCH 7/8] midx: read `OIDL` " Taylor Blau
                     ` (2 subsequent siblings)
  8 siblings, 1 reply; 67+ messages in thread
From: Taylor Blau @ 2023-10-13 19:25 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King

Perform an identical conversion as in previous commits to read the
OIDF chunk in the MIDX machinery.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c | 21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)

diff --git a/midx.c b/midx.c
index 2f3863c936..38bf816cce 100644
--- a/midx.c
+++ b/midx.c
@@ -61,20 +61,6 @@ void get_midx_rev_filename(struct strbuf *out, struct multi_pack_index *m)
 	strbuf_addf(out, "-%s.rev", hash_to_hex(get_midx_checksum(m)));
 }
 
-static int midx_read_oid_fanout(const unsigned char *chunk_start,
-				size_t chunk_size, void *data)
-{
-	struct multi_pack_index *m = data;
-	m->chunk_oid_fanout = (uint32_t *)chunk_start;
-
-	if (chunk_size != 4 * 256) {
-		error(_("multi-pack-index OID fanout is of the wrong size"));
-		return 1;
-	}
-	m->num_objects = ntohl(m->chunk_oid_fanout[255]);
-	return 0;
-}
-
 static int midx_read_oid_lookup(const unsigned char *chunk_start,
 				size_t chunk_size, void *data)
 {
@@ -173,8 +159,13 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 
 	if (pair_chunk(cf, MIDX_CHUNKID_PACKNAMES, &m->chunk_pack_names, &m->chunk_pack_names_len))
 		die(_("multi-pack-index required pack-name chunk missing or corrupted"));
-	if (read_chunk(cf, MIDX_CHUNKID_OIDFANOUT, midx_read_oid_fanout, m))
+	if (pair_chunk_expect(cf, MIDX_CHUNKID_OIDFANOUT,
+			      (const unsigned char **)&m->chunk_oid_fanout,
+			      256 * sizeof(uint32_t))) {
+		error(_("multi-pack-index OID fanout is of the wrong size"));
 		die(_("multi-pack-index required OID fanout chunk missing or corrupted"));
+	}
+	m->num_objects = ntohl(m->chunk_oid_fanout[255]);
 	if (read_chunk(cf, MIDX_CHUNKID_OIDLOOKUP, midx_read_oid_lookup, m))
 		die(_("multi-pack-index required OID lookup chunk missing or corrupted"));
 	if (read_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS, midx_read_object_offsets, m))
-- 
2.42.0.352.gd9c5062ff7.dirty


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 7/8] midx: read `OIDL` chunk with `pair_chunk_expect()`
  2023-10-13 19:25 ` [PATCH 0/8] chunk-format: introduce `pair_chunk_expect()` convenience API Taylor Blau
                     ` (5 preceding siblings ...)
  2023-10-13 19:25   ` [PATCH 6/8] midx: read `OIDF` " Taylor Blau
@ 2023-10-13 19:25   ` Taylor Blau
  2023-10-13 19:25   ` [PATCH 8/8] midx: read `OOFF` " Taylor Blau
  2023-10-20 10:23   ` [PATCH 0/8] chunk-format: introduce `pair_chunk_expect()` convenience API Jeff King
  8 siblings, 0 replies; 67+ messages in thread
From: Taylor Blau @ 2023-10-13 19:25 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King

Perform an identical conversion as in previous commits to read the
OIDL chunk in the MIDX machinery.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c | 18 ++++--------------
 1 file changed, 4 insertions(+), 14 deletions(-)

diff --git a/midx.c b/midx.c
index 38bf816cce..74167b8fdb 100644
--- a/midx.c
+++ b/midx.c
@@ -61,19 +61,6 @@ void get_midx_rev_filename(struct strbuf *out, struct multi_pack_index *m)
 	strbuf_addf(out, "-%s.rev", hash_to_hex(get_midx_checksum(m)));
 }
 
-static int midx_read_oid_lookup(const unsigned char *chunk_start,
-				size_t chunk_size, void *data)
-{
-	struct multi_pack_index *m = data;
-	m->chunk_oid_lookup = chunk_start;
-
-	if (chunk_size != st_mult(m->hash_len, m->num_objects)) {
-		error(_("multi-pack-index OID lookup chunk is the wrong size"));
-		return 1;
-	}
-	return 0;
-}
-
 static int midx_read_object_offsets(const unsigned char *chunk_start,
 				    size_t chunk_size, void *data)
 {
@@ -166,8 +153,11 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 		die(_("multi-pack-index required OID fanout chunk missing or corrupted"));
 	}
 	m->num_objects = ntohl(m->chunk_oid_fanout[255]);
-	if (read_chunk(cf, MIDX_CHUNKID_OIDLOOKUP, midx_read_oid_lookup, m))
+	if (pair_chunk_expect(cf, MIDX_CHUNKID_OIDLOOKUP, &m->chunk_oid_lookup,
+			      st_mult(m->hash_len, m->num_objects))) {
+		error(_("multi-pack-index OID lookup chunk is the wrong size"));
 		die(_("multi-pack-index required OID lookup chunk missing or corrupted"));
+	}
 	if (read_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS, midx_read_object_offsets, m))
 		die(_("multi-pack-index required object offsets chunk missing or corrupted"));
 
-- 
2.42.0.352.gd9c5062ff7.dirty


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 8/8] midx: read `OOFF` chunk with `pair_chunk_expect()`
  2023-10-13 19:25 ` [PATCH 0/8] chunk-format: introduce `pair_chunk_expect()` convenience API Taylor Blau
                     ` (6 preceding siblings ...)
  2023-10-13 19:25   ` [PATCH 7/8] midx: read `OIDL` " Taylor Blau
@ 2023-10-13 19:25   ` Taylor Blau
  2023-10-20 10:23   ` [PATCH 0/8] chunk-format: introduce `pair_chunk_expect()` convenience API Jeff King
  8 siblings, 0 replies; 67+ messages in thread
From: Taylor Blau @ 2023-10-13 19:25 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King

Perform an identical conversion as in previous commits to read the
OOFF chunk in the MIDX machinery.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c | 19 +++++--------------
 1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/midx.c b/midx.c
index 74167b8fdb..4aeadd5e00 100644
--- a/midx.c
+++ b/midx.c
@@ -61,19 +61,6 @@ void get_midx_rev_filename(struct strbuf *out, struct multi_pack_index *m)
 	strbuf_addf(out, "-%s.rev", hash_to_hex(get_midx_checksum(m)));
 }
 
-static int midx_read_object_offsets(const unsigned char *chunk_start,
-				    size_t chunk_size, void *data)
-{
-	struct multi_pack_index *m = data;
-	m->chunk_object_offsets = chunk_start;
-
-	if (chunk_size != st_mult(m->num_objects, MIDX_CHUNK_OFFSET_WIDTH)) {
-		error(_("multi-pack-index object offset chunk is the wrong size"));
-		return 1;
-	}
-	return 0;
-}
-
 struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local)
 {
 	struct multi_pack_index *m = NULL;
@@ -158,8 +145,12 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 		error(_("multi-pack-index OID lookup chunk is the wrong size"));
 		die(_("multi-pack-index required OID lookup chunk missing or corrupted"));
 	}
-	if (read_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS, midx_read_object_offsets, m))
+	if (pair_chunk_expect(cf, MIDX_CHUNKID_OBJECTOFFSETS,
+			      &m->chunk_object_offsets,
+			      st_mult(m->num_objects, MIDX_CHUNK_OFFSET_WIDTH))) {
+		error(_("multi-pack-index object offset chunk is the wrong size"));
 		die(_("multi-pack-index required object offsets chunk missing or corrupted"));
+	}
 
 	pair_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS, &m->chunk_large_offsets,
 		   &m->chunk_large_offsets_len);
-- 
2.42.0.352.gd9c5062ff7.dirty

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH 5/8] commit-graph: read `BIDX` chunk with `pair_chunk_expect()`
  2023-10-13 19:25   ` [PATCH 5/8] commit-graph: read `BIDX` " Taylor Blau
@ 2023-10-13 19:49     ` Taylor Blau
  2023-10-14 16:10     ` Junio C Hamano
  1 sibling, 0 replies; 67+ messages in thread
From: Taylor Blau @ 2023-10-13 19:49 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King

On Fri, Oct 13, 2023 at 03:25:30PM -0400, Taylor Blau wrote:
> Perform an identical conversion as in previous commits to read the BIDX
> chunk.
>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  commit-graph.c | 18 ++++--------------
>  1 file changed, 4 insertions(+), 14 deletions(-)

Oops. This fails t4216 because it changes the warning() message, which I
thought I excluded from this patch :-<.

Here is a replacement that passes the test. I can reroll the entire
"series" if we decide that this is a worthwhile direction to go in:

--- 8< ---

Subject: [PATCH] commit-graph: read `BIDX` chunk with `pair_chunk_expect()`

Perform an identical conversion as in previous commits to read the BIDX
chunk.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 commit-graph.c | 18 ++++--------------
 1 file changed, 4 insertions(+), 14 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 0fab99f5dd..66c2e628d8 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -330,18 +330,6 @@ static int graph_read_oid_lookup(const unsigned char *chunk_start,
 	return 0;
 }

-static int graph_read_bloom_index(const unsigned char *chunk_start,
-				  size_t chunk_size, void *data)
-{
-	struct commit_graph *g = data;
-	if (chunk_size != g->num_commits * 4) {
-		warning("commit-graph changed-path index chunk is too small");
-		return -1;
-	}
-	g->chunk_bloom_indexes = chunk_start;
-	return 0;
-}
-
 static int graph_read_bloom_data(const unsigned char *chunk_start,
 				  size_t chunk_size, void *data)
 {
@@ -461,8 +449,10 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 	}

 	if (s->commit_graph_read_changed_paths) {
-		read_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
-			   graph_read_bloom_index, graph);
+		if (pair_chunk_expect(cf, GRAPH_CHUNKID_BLOOMINDEXES,
+				      &graph->chunk_bloom_indexes,
+				      st_mult(graph->num_commits, 4)) == -1)
+			warning(_("commit-graph changed-path index chunk is too small"));
 		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
 			   graph_read_bloom_data, graph);
 	}
--
2.38.0.16.g393fd4c6db


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH 6/8] midx: read `OIDF` chunk with `pair_chunk_expect()`
  2023-10-13 19:25   ` [PATCH 6/8] midx: read `OIDF` " Taylor Blau
@ 2023-10-13 21:04     ` Junio C Hamano
  0 siblings, 0 replies; 67+ messages in thread
From: Junio C Hamano @ 2023-10-13 21:04 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King

Taylor Blau <me@ttaylorr.com> writes:

> +	if (pair_chunk_expect(cf, MIDX_CHUNKID_OIDFANOUT,
> +			      (const unsigned char **)&m->chunk_oid_fanout,
> +			      256 * sizeof(uint32_t))) {
> +		error(_("multi-pack-index OID fanout is of the wrong size"));
>  		die(_("multi-pack-index required OID fanout chunk missing or corrupted"));
> +	}

This is not a new problem, but when laid out this way, the doubled
messages look a bit suboptimal.

Together with reporting the actual and expected byte counts and
doing so consistenly I alluded to in a separate message, cleaning
these up should probably be left outside of the topic, I suspect.

> +	m->num_objects = ntohl(m->chunk_oid_fanout[255]);
>  	if (read_chunk(cf, MIDX_CHUNKID_OIDLOOKUP, midx_read_oid_lookup, m))
>  		die(_("multi-pack-index required OID lookup chunk missing or corrupted"));
>  	if (read_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS, midx_read_object_offsets, m))

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 21/20] t5319: make corrupted large-offset test more robust
  2023-10-09 20:55 [PATCH 0/20] bounds-checks for chunk-based files Jeff King
                   ` (21 preceding siblings ...)
  2023-10-13 19:25 ` [PATCH 0/8] chunk-format: introduce `pair_chunk_expect()` convenience API Taylor Blau
@ 2023-10-14  0:43 ` Jeff King
  2023-10-14 19:42   ` Junio C Hamano
  22 siblings, 1 reply; 67+ messages in thread
From: Jeff King @ 2023-10-14  0:43 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Taylor Blau

On Mon, Oct 09, 2023 at 04:55:44PM -0400, Jeff King wrote:

>   [10/20]: midx: bounds-check large offset chunk

Here's one more patch on top that fixes a flaky test. Sorry not to have
caught it before sending out the original series. I did see a flaky CI
run before then, but I tweaked the test with what I thought was the
solution, and then it didn't re-occur until I ran CI again today.

I'm pretty sure this should fix it for good.

(I also saw Taylor posted some further patches; I haven't looked closely
yet, this should be orthogonal to those).

-- >8 --
Subject: [PATCH] t5319: make corrupted large-offset test more robust

The test t5319.88 ("reader bounds-checks large offset table") can fail
intermittently. The failure mode looks like this:

  1. An earlier test sets up "objects64", a directory that can be used
     to produce a midx with a corrupted large-offsets table. To get the
     large offsets, it corrupts the normal ".idx" file to have a fake
     large offset, and then builds a midx from that.

     That midx now has a large offset table, which is what we want. But
     we also have a .idx on disk that has a corrupted entry. We'll call
     the object with the corrupted large-offset "X".

  2. In t5319.88, we further corrupt the midx by reducing the size of
     the large-offset chunk (because our goal is to make sure we do not
     do an out-of-bounds read on it).

  3. We then enumerate all of the objects with "cat-file --batch-check
     --batch-all-objects", expecting to see a complaint when we try to
     show object X. We use --batch-all-objects because our objects64
     repo doesn't actually have any refs (but if we check them all, one
     of them will be the failing one). The default batch-check format
     includes %(objecttype) and %(objectsize), both of which require us
     to access the actual pack data (and thus requires looking at the
     offset).

  4a. Usually, this succeeds. We try to output object X, do a lookup via
      the midx for the type/size lookup, and run into the corrupt
      large-offset table.

  4b. But sometimes we hit a different error. If another object points
      to X as a delta base, then trying to find the type of that object
      requires walking the delta chain to the base entry (since only the
      base has the concrete type; deltas themselves are either OFS_DELTA
      or REF_DELTA).

      Normally this would not require separate offset lookups at all, as
      deltas are usually stored as OFS_DELTA, specifying the relative
      offset to the base. But the corrupt idx created in step 1 is done
      directly with "git pack-objects" and does not pass the
      --delta-base-offset option, meaning we have REF_DELTA entries!
      Those do have to consult an index to find the location of the base
      object, and they use the pack .idx to do this. The same pack .idx
      that we know is corrupted from step 1!

      Git does notice the error, but it does so by seeing the corrupt
      .idx file, not the corrupt midx file, and the error it reports is
      different, causing the test to fail.

The set of objects created in the test is deterministic. But the delta
selection seems not to be (which is not too surprising, as it is
multi-threaded). I have seen the failure in Windows CI but haven't
reproduced it locally (not even with --stress). Re-running a failed
Windows CI job tends to work. But when I download and examine the trash
directory from a failed run, it shows a different set of deltas than I
get locally. But the exact source of non-determinism isn't that
important; our test should be robust against any order.

There are a few options to fix this:

  a. It would be OK for the "objects64" setup to "unbreak" the .idx file
     after generating the midx. But then it would be hard for subsequent
     tests to reuse it, since it is the corrupted idx that forces the
     midx to have a large offset table.

  b. The "objects64" setup could use --delta-base-offset. This would fix
     our problem, but earlier tests have many hard-coded offsets. Using
     OFS_DELTA would change the locations of objects in the pack (this
     might even be OK because I think most of the offsets are within the
     .idx file, but it seems brittle and I'm afraid to touch it).

  c. Our cat-file output is in oid order by default. Since we store
     bases before deltas, if we went in pack order (using the
     "--unordered" flag), we'd always see our corrupt X before any delta
     which depends on it. But using "--unordered" means we skip the midx
     entirely. That makes sense, since it is just enumerating all of
     the packs, using the offsets found in their .idx files directly.
     So it doesn't work for our test.

  d. We could ask directly about object X, rather than enumerating all
     of them. But that requires further hard-coding of the oid (both
     sha1 and sha256) of object X. I'd prefer not to introduce more
     brittleness.

  e. We can use a --batch-check format that looks at the pack data, but
     doesn't have to chase deltas. The problem in this case is
     %(objecttype), which has to walk to the base. But %(objectsize)
     does not; we can get the value directly from the delta itself.
     Another option would be %(deltabase), where we report the REF_DELTA
     name but don't look at its data.

I've gone with option (e) here. It's kind of subtle, but it's simple and
has no side effects.

Signed-off-by: Jeff King <peff@peff.net>
---
 t/t5319-multi-pack-index.sh | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 2a11dd1af6..d3c9e97feb 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -1129,8 +1129,10 @@ test_expect_success 'reader bounds-checks large offset table' '
 		git multi-pack-index --object-dir=../objects64 write &&
 		midx=../objects64/pack/multi-pack-index &&
 		corrupt_chunk_file $midx LOFF clear &&
-		test_must_fail git cat-file \
-			--batch-check --batch-all-objects 2>err &&
+		# using only %(objectsize) is important here; see the commit
+		# message for more details
+		test_must_fail git cat-file --batch-all-objects \
+			--batch-check="%(objectsize)" 2>err &&
 		cat >expect <<-\EOF &&
 		fatal: multi-pack-index large offset out of bounds
 		EOF
-- 
2.42.0.937.g4d9eb42d36


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH 5/8] commit-graph: read `BIDX` chunk with `pair_chunk_expect()`
  2023-10-13 19:25   ` [PATCH 5/8] commit-graph: read `BIDX` " Taylor Blau
  2023-10-13 19:49     ` Taylor Blau
@ 2023-10-14 16:10     ` Junio C Hamano
  2023-10-20 10:31       ` Jeff King
  1 sibling, 1 reply; 67+ messages in thread
From: Junio C Hamano @ 2023-10-14 16:10 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King

Taylor Blau <me@ttaylorr.com> writes:

> -static int graph_read_bloom_index(const unsigned char *chunk_start,
> -				  size_t chunk_size, void *data)
> -{
> -	struct commit_graph *g = data;
> -	if (chunk_size != g->num_commits * 4) {
> -		warning("commit-graph changed-path index chunk is too small");
> -		return -1;
> -	}
> ...
> @@ -461,8 +449,10 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
>  	}
>  
>  	if (s->commit_graph_read_changed_paths) {
> +		if (pair_chunk_expect(cf, GRAPH_CHUNKID_BLOOMINDEXES,
> +				      &graph->chunk_bloom_indexes,
> +				      st_mult(graph->num_commits, 4)) == -1)
> +			warning(_("commit-graph changed-path index chunk is too small (%d)"), graph->num_commits * 4);
>  		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
>  			   graph_read_bloom_data, graph);
>  	}

Overall the series looked sane, but the need for each caller to
supply error messages, when the helper perfectly well knows how many
bytes the caller expected and how many bytes there are avaiable, was
a bit disturbing, as the level of detail given per each caller will
inevitably become uneven.  Even now, some give an error() while
others give a warning(), even though I suspect all of them should be
data errors.

I wonder if it makes sense to stuff the message template in the
pair_chunk_data structure and do

static int pair_chunk_expect_fn(const unsigned char *chunk_start,
				size_t chunk_size,
				void *data)
{
	struct pair_chunk_data *pcd = data;
	if (pcd->expected_size != chunk_size)
		return error(_(pcd->message), pcd->expected_size, chunk_size);
	*pcd->p = chunk_start;
	return 0;
}


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 21/20] t5319: make corrupted large-offset test more robust
  2023-10-14  0:43 ` [PATCH 21/20] t5319: make corrupted large-offset test more robust Jeff King
@ 2023-10-14 19:42   ` Junio C Hamano
  2023-10-15  3:17     ` Jeff King
  0 siblings, 1 reply; 67+ messages in thread
From: Junio C Hamano @ 2023-10-14 19:42 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Taylor Blau

Jeff King <peff@peff.net> writes:

>   4b. But sometimes we hit a different error. If another object points
>       to X as a delta base, then trying to find the type of that object
>       requires walking the delta chain to the base entry (since only the
>       base has the concrete type; deltas themselves are either OFS_DELTA
>       or REF_DELTA).
>
>       Normally this would not require separate offset lookups at all, as
>       deltas are usually stored as OFS_DELTA, specifying the relative
>       offset to the base. But the corrupt idx created in step 1 is done
>       directly with "git pack-objects" and does not pass the
>       --delta-base-offset option, meaning we have REF_DELTA entries!
>       Those do have to consult an index to find the location of the base
>       object, and they use the pack .idx to do this. The same pack .idx
>       that we know is corrupted from step 1!

Tricky.

> The set of objects created in the test is deterministic. But the delta
> selection seems not to be (which is not too surprising, as it is
> multi-threaded).

So, the offsets of the objects are also not deterministic?

> I have seen the failure in Windows CI but haven't
> reproduced it locally (not even with --stress). Re-running a failed
> Windows CI job tends to work. But when I download and examine the trash
> directory from a failed run, it shows a different set of deltas than I
> get locally. But the exact source of non-determinism isn't that
> important; our test should be robust against any order.

Yeah, thanks for digging this tricky situation through.

>   b. The "objects64" setup could use --delta-base-offset. This would fix
>      our problem, but earlier tests have many hard-coded offsets. Using
>      OFS_DELTA would change the locations of objects in the pack (this
>      might even be OK because I think most of the offsets are within the
>      .idx file, but it seems brittle and I'm afraid to touch it).

I am not sure I follow, as it does not sound a solution to anything
if the offsets are not deterministic (and "earlier tests" that have
hard coded offsets are broken no matter what, which is not a problem
this patch introduces).  Puzzled, but not curious enough to think
about it further, as you have already rejected this approach ;-)

>   d. We could ask directly about object X, rather than enumerating all
>      of them. But that requires further hard-coding of the oid (both
>      sha1 and sha256) of object X. I'd prefer not to introduce more
>      brittleness.

Right.

>   e. We can use a --batch-check format that looks at the pack data, but
>      doesn't have to chase deltas. The problem in this case is
>      %(objecttype), which has to walk to the base. But %(objectsize)
>      does not; we can get the value directly from the delta itself.
>      Another option would be %(deltabase), where we report the REF_DELTA
>      name but don't look at its data.
>
> I've gone with option (e) here. It's kind of subtle, but it's simple and
> has no side effects.

Nice.

> @@ -1129,8 +1129,10 @@ test_expect_success 'reader bounds-checks large offset table' '
>  		git multi-pack-index --object-dir=../objects64 write &&
>  		midx=../objects64/pack/multi-pack-index &&
>  		corrupt_chunk_file $midx LOFF clear &&
> -		test_must_fail git cat-file \
> -			--batch-check --batch-all-objects 2>err &&
> +		# using only %(objectsize) is important here; see the commit
> +		# message for more details
> +		test_must_fail git cat-file --batch-all-objects \
> +			--batch-check="%(objectsize)" 2>err &&

A rather unfriendly message to readers, as it is unclear which
commit you are talking about, and a fun thing is that you cannot
say which one.

Will queue.  Thanks.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 21/20] t5319: make corrupted large-offset test more robust
  2023-10-14 19:42   ` Junio C Hamano
@ 2023-10-15  3:17     ` Jeff King
  2023-10-15 17:04       ` Junio C Hamano
  0 siblings, 1 reply; 67+ messages in thread
From: Jeff King @ 2023-10-15  3:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Taylor Blau

On Sat, Oct 14, 2023 at 12:42:01PM -0700, Junio C Hamano wrote:

> > The set of objects created in the test is deterministic. But the delta
> > selection seems not to be (which is not too surprising, as it is
> > multi-threaded).
> 
> So, the offsets of the objects are also not deterministic?

Hmm, yeah, you're right. The pack clearly is not deterministic, and so
neither will its offsets be.

And thus...

> >   b. The "objects64" setup could use --delta-base-offset. This would fix
> >      our problem, but earlier tests have many hard-coded offsets. Using
> >      OFS_DELTA would change the locations of objects in the pack (this
> >      might even be OK because I think most of the offsets are within the
> >      .idx file, but it seems brittle and I'm afraid to touch it).
> 
> I am not sure I follow, as it does not sound a solution to anything
> if the offsets are not deterministic (and "earlier tests" that have
> hard coded offsets are broken no matter what, which is not a problem
> this patch introduces).  Puzzled, but not curious enough to think
> about it further, as you have already rejected this approach ;-)

...yes, my "this might even be OK" is true. So it would work to solve
the problem by using --delta-base-offset. Those earlier tests are
actually OK (they munge only the idx and the midx, not the pack). And if
we have delta base offsets, then walking delta chains never requires
looking at the pack idx.

I'm not sure if that is any less subtle than the solution I did come up
with, though.

The most unsubtle thing is cleaning up the broken .idx immediately
after generating the midx. I didn't want to do that because it disrupts
the state of the objects64 directory. But we could always turn its setup
into a function or something. I dunno. It is probably not worth spending
too many brain cycles on. My hope is that nobody ever has to touch this
test ever again. ;)

> > @@ -1129,8 +1129,10 @@ test_expect_success 'reader bounds-checks large offset table' '
> >  		git multi-pack-index --object-dir=../objects64 write &&
> >  		midx=../objects64/pack/multi-pack-index &&
> >  		corrupt_chunk_file $midx LOFF clear &&
> > -		test_must_fail git cat-file \
> > -			--batch-check --batch-all-objects 2>err &&
> > +		# using only %(objectsize) is important here; see the commit
> > +		# message for more details
> > +		test_must_fail git cat-file --batch-all-objects \
> > +			--batch-check="%(objectsize)" 2>err &&
> 
> A rather unfriendly message to readers, as it is unclear which
> commit you are talking about, and a fun thing is that you cannot
> say which one.

Yeah, I had a similar thought process. I sort of assume anybody working
on git.git is capable of turning to "git blame" in a situation like
this. But maybe:

  # using only %(objectsize) is important here; run "git blame"
  # on these lines for more details

would spell it out more clearly.

-Peff

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 21/20] t5319: make corrupted large-offset test more robust
  2023-10-15  3:17     ` Jeff King
@ 2023-10-15 17:04       ` Junio C Hamano
  0 siblings, 0 replies; 67+ messages in thread
From: Junio C Hamano @ 2023-10-15 17:04 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Taylor Blau

Jeff King <peff@peff.net> writes:

> Yeah, I had a similar thought process. I sort of assume anybody working
> on git.git is capable of turning to "git blame" in a situation like
> this. But maybe:
>
>   # using only %(objectsize) is important here; run "git blame"
>   # on these lines for more details
>
> would spell it out more clearly.

The comment was about "the commit" being not so clear which commit.
"see the message of the commit that added this comment" would have
been perfectly fine and they are not required to use "blame" if they
don't like it.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/8] chunk-format: introduce `pair_chunk_expect()` convenience API
  2023-10-13 19:25 ` [PATCH 0/8] chunk-format: introduce `pair_chunk_expect()` convenience API Taylor Blau
                     ` (7 preceding siblings ...)
  2023-10-13 19:25   ` [PATCH 8/8] midx: read `OOFF` " Taylor Blau
@ 2023-10-20 10:23   ` Jeff King
  8 siblings, 0 replies; 67+ messages in thread
From: Jeff King @ 2023-10-20 10:23 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Junio C Hamano

On Fri, Oct 13, 2023 at 03:25:12PM -0400, Taylor Blau wrote:

> While reviewing this series, I noticed a couple of spots that would be
> helped by having a convenience function that stores the start of a
> chunk in a designated location *after* checking that the chunk has the
> expected size.
> 
> I called this function `pair_chunk_expect()` and touched up seven spots
> that I think could benefit from having a convenience function like this.
> 
> This applies directly on top of 'jk/chunk-bounds'. Thanks in advance for
> your review!

I'm still a little skeptical of this approach, just because I think it
it discourages adding further checks. And in particular, I was planning
to add monotonicity checks to the midx OIDF chunk based on the
discussion in the earlier thread. And likewise, I think I probably
should have put the commit-graph checks into its OIDF reader, rather
than saving them for verify_commit_graph_lite(). That would match the
way we check the validity of the other chunks.

I haven't actually started writing those patches, though, so I'm not
sure of the full details yet.

-Peff

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 5/8] commit-graph: read `BIDX` chunk with `pair_chunk_expect()`
  2023-10-14 16:10     ` Junio C Hamano
@ 2023-10-20 10:31       ` Jeff King
  0 siblings, 0 replies; 67+ messages in thread
From: Jeff King @ 2023-10-20 10:31 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git

On Sat, Oct 14, 2023 at 09:10:22AM -0700, Junio C Hamano wrote:

> > @@ -461,8 +449,10 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
> >  	}
> >  
> >  	if (s->commit_graph_read_changed_paths) {
> > +		if (pair_chunk_expect(cf, GRAPH_CHUNKID_BLOOMINDEXES,
> > +				      &graph->chunk_bloom_indexes,
> > +				      st_mult(graph->num_commits, 4)) == -1)
> > +			warning(_("commit-graph changed-path index chunk is too small (%d)"), graph->num_commits * 4);
> >  		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
> >  			   graph_read_bloom_data, graph);
> >  	}
> 
> Overall the series looked sane, but the need for each caller to
> supply error messages, when the helper perfectly well knows how many
> bytes the caller expected and how many bytes there are avaiable, was
> a bit disturbing, as the level of detail given per each caller will
> inevitably become uneven.  Even now, some give an error() while
> others give a warning(), even though I suspect all of them should be
> data errors.
> 
> I wonder if it makes sense to stuff the message template in the
> pair_chunk_data structure and do
> 
> static int pair_chunk_expect_fn(const unsigned char *chunk_start,
> 				size_t chunk_size,
> 				void *data)
> {
> 	struct pair_chunk_data *pcd = data;
> 	if (pcd->expected_size != chunk_size)
> 		return error(_(pcd->message), pcd->expected_size, chunk_size);
> 	*pcd->p = chunk_start;
> 	return 0;
> }

One issue with the series as-is is that the "chunk is too small"
messages can be misleading when the chunk is in fact missing. We do say
"missing or corrupt" in the die message (at least for midx; I did not
update the similar ones for commit-graph), but the explicit "too small"
for a missing chunk seems to me to cross the line.

The caller can distinguish the cases by the actual value returned from
pair_chunk_expect(), but doing so makes the code a lot longer and harder
to read.

Your suggestion above takes care of it naturally (in the same way that
the existing code does, which basically is emitting the same message in
each read_chunk callback).

-Peff

^ permalink raw reply	[flat|nested] 67+ messages in thread

end of thread, other threads:[~2023-10-20 10:31 UTC | newest]

Thread overview: 67+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-09 20:55 [PATCH 0/20] bounds-checks for chunk-based files Jeff King
2023-10-09 20:58 ` [PATCH 01/20] chunk-format: note that pair_chunk() is unsafe Jeff King
2023-10-10 23:45   ` Taylor Blau
2023-10-11 22:49     ` Jeff King
2023-10-09 20:58 ` [PATCH 02/20] t: add library for munging chunk-format files Jeff King
2023-10-10 23:47   ` Taylor Blau
2023-10-09 20:59 ` [PATCH 03/20] midx: stop ignoring malformed oid fanout chunk Jeff King
2023-10-10 23:50   ` Taylor Blau
2023-10-11 22:52     ` Jeff King
2023-10-09 20:59 ` [PATCH 04/20] commit-graph: check size of " Jeff King
2023-10-11  0:08   ` Taylor Blau
2023-10-11  1:24     ` Taylor Blau
2023-10-11 23:01     ` Jeff King
2023-10-09 21:02 ` [PATCH 05/20] midx: check size of oid lookup chunk Jeff King
2023-10-09 21:04 ` [PATCH 06/20] commit-graph: check consistency of fanout table Jeff King
2023-10-11 14:45   ` Taylor Blau
2023-10-09 21:05 ` [PATCH 07/20] midx: check size of pack names chunk Jeff King
2023-10-11 14:52   ` Taylor Blau
2023-10-11 23:06     ` Jeff King
2023-10-09 21:05 ` [PATCH 08/20] midx: enforce chunk alignment on reading Jeff King
2023-10-11 14:56   ` Taylor Blau
2023-10-11 15:01   ` Taylor Blau
2023-10-11 23:09     ` Jeff King
2023-10-09 21:05 ` [PATCH 09/20] midx: check size of object offset chunk Jeff King
2023-10-11 18:31   ` Taylor Blau
2023-10-09 21:05 ` [PATCH 10/20] midx: bounds-check large " Jeff King
2023-10-11 18:38   ` Taylor Blau
2023-10-11 23:18     ` Jeff King
2023-10-09 21:05 ` [PATCH 11/20] midx: check size of revindex chunk Jeff King
2023-10-11 18:41   ` Taylor Blau
2023-10-09 21:05 ` [PATCH 12/20] commit-graph: check size of commit data chunk Jeff King
2023-10-11 18:46   ` Taylor Blau
2023-10-11 23:22     ` Jeff King
2023-10-09 21:05 ` [PATCH 13/20] commit-graph: detect out-of-bounds extra-edges pointers Jeff King
2023-10-11 19:02   ` Taylor Blau
2023-10-09 21:05 ` [PATCH 14/20] commit-graph: bounds-check base graphs chunk Jeff King
2023-10-11 19:05   ` Taylor Blau
2023-10-09 21:05 ` [PATCH 15/20] commit-graph: check size of generations chunk Jeff King
2023-10-09 21:05 ` [PATCH 16/20] commit-graph: bounds-check generation overflow chunk Jeff King
2023-10-09 21:05 ` [PATCH 17/20] commit-graph: check bounds when accessing BDAT chunk Jeff King
2023-10-11 19:11   ` Taylor Blau
2023-10-11 23:27     ` Jeff King
2023-10-09 21:05 ` [PATCH 18/20] commit-graph: check bounds when accessing BIDX chunk Jeff King
2023-10-11 19:15   ` Taylor Blau
2023-10-09 21:05 ` [PATCH 19/20] commit-graph: detect out-of-order BIDX offsets Jeff King
2023-10-11 19:16   ` Taylor Blau
2023-10-09 21:06 ` [PATCH 20/20] chunk-format: drop pair_chunk_unsafe() Jeff King
2023-10-11 19:19 ` [PATCH 0/20] bounds-checks for chunk-based files Taylor Blau
2023-10-11 23:31   ` Jeff King
2023-10-13 19:25 ` [PATCH 0/8] chunk-format: introduce `pair_chunk_expect()` convenience API Taylor Blau
2023-10-13 19:25   ` [PATCH 1/8] chunk-format: introduce `pair_chunk_expect()` helper Taylor Blau
2023-10-13 19:25   ` [PATCH 2/8] commit-graph: read `OIDF` chunk with `pair_chunk_expect()` Taylor Blau
2023-10-13 19:25   ` [PATCH 3/8] commit-graph: read `CDAT` " Taylor Blau
2023-10-13 19:25   ` [PATCH 4/8] commit-graph: read `GDAT` " Taylor Blau
2023-10-13 19:25   ` [PATCH 5/8] commit-graph: read `BIDX` " Taylor Blau
2023-10-13 19:49     ` Taylor Blau
2023-10-14 16:10     ` Junio C Hamano
2023-10-20 10:31       ` Jeff King
2023-10-13 19:25   ` [PATCH 6/8] midx: read `OIDF` " Taylor Blau
2023-10-13 21:04     ` Junio C Hamano
2023-10-13 19:25   ` [PATCH 7/8] midx: read `OIDL` " Taylor Blau
2023-10-13 19:25   ` [PATCH 8/8] midx: read `OOFF` " Taylor Blau
2023-10-20 10:23   ` [PATCH 0/8] chunk-format: introduce `pair_chunk_expect()` convenience API Jeff King
2023-10-14  0:43 ` [PATCH 21/20] t5319: make corrupted large-offset test more robust Jeff King
2023-10-14 19:42   ` Junio C Hamano
2023-10-15  3:17     ` Jeff King
2023-10-15 17:04       ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).