[PATCH 0/7] Commit-graph: Generation Number v2 Fixes, v3 implementation

* [PATCH 0/7] Commit-graph: Generation Number v2 Fixes, v3 implementation
@ 2022-02-24 20:38 Derrick Stolee via GitGitGadget
  2022-02-24 20:38 ` [PATCH 1/7] test-read-graph: include extra post-parse info Derrick Stolee via GitGitGadget
                   ` (8 more replies)
  0 siblings, 9 replies; 70+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-24 20:38 UTC (permalink / raw)
  To: git; +Cc: me, gitster, abhishekkumar8222, Derrick Stolee

This patch series includes two distinct, but similarly-motivated parts:

 * Patches 1-4 fix some bugs in the commit-graph generation number v2.
 * Patches 5-7 add a new generation number v3 by incrementing the
   commit-graph file format version.

I had been thinking about generation number v3, which is the same corrected
commit date as generation number v2, but it is stored in the Commit Data
chunk, requiring a new commit-graph file format version. This breaks
compatibility with older versions of Git, so it requires opt-in via the
commitGraph.generationVersion config value. The only improvement over
version 2 is that the commit-graph file is smaller, so I/O time is reduced.
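
For readers less familiar with the term, here is a minimal sketch of how a corrected commit date can be computed. It assumes the usual definition: a commit's corrected date is the larger of its own commit date and one more than the largest corrected date among its parents. The struct toy_commit type and both function names are hypothetical and do not appear in Git; the offset is simply the corrected date minus the commit date.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory commit; parents are indices into the same array. */
struct toy_commit {
	uint64_t date;      /* committer timestamp */
	int parents[2];     /* -1 means "no parent" */
	uint64_t corrected; /* filled in by compute_corrected_dates() */
};

/*
 * Assumes commits[] is ordered so that every parent appears before
 * its children (a topological order).
 */
static void compute_corrected_dates(struct toy_commit *commits, size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		uint64_t best = commits[i].date;

		for (int p = 0; p < 2; p++) {
			int idx = commits[i].parents[p];

			if (idx < 0)
				continue;
			/* A child must sort strictly after each parent. */
			if (commits[idx].corrected + 1 > best)
				best = commits[idx].corrected + 1;
		}
		commits[i].corrected = best;
	}
}

/* The value that would be stored on disk: corrected date minus date. */
static uint64_t corrected_offset(const struct toy_commit *c)
{
	return c->corrected - c->date;
}
```

In a history with well-behaved clocks every offset is zero, which is why a small fixed-width field is usually enough to hold it.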

However, while exploring this idea I found bugs in generation number v2. In
particular, Git has been ignoring corrected commit dates since shortly after
they were introduced. This is due to a bug I introduced when trying to make
split commit-graphs safer with mixed generation number versions. I also found
an issue with the offset overflows, which surfaced only after I wrote
generation number v3 using a smaller offset size; that change actually
triggered the bug in the test suite.

I'm submitting these two things together so we can see them all at once, but
I'd be happy to split this into two series. The first four patches are
important bug fixes, so we can consider them as higher-priority.

Thanks, -Stolee

Derrick Stolee (7):
  test-read-graph: include extra post-parse info
  commit-graph: fix ordering bug in generation numbers
  commit-graph: start parsing generation v2 (again)
  commit-graph: fix generation number v2 overflow values
  commit-graph: document file format v2
  commit-graph: parse file format v2
  commit-graph: write file format v2

 Documentation/config/commitgraph.txt          |  4 +-
 .../technical/commit-graph-format.txt         | 22 ++++-
 commit-graph.c                                | 98 +++++++++++++++----
 commit-graph.h                                |  6 ++
 commit.h                                      |  1 +
 t/helper/test-read-graph.c                    | 13 +++
 t/t4216-log-bloom.sh                          |  1 +
 t/t5318-commit-graph.sh                       | 65 ++++++++++--
 t/t5324-split-commit-graph.sh                 | 10 ++
 9 files changed, 189 insertions(+), 31 deletions(-)

base-commit: dab1b7905d0b295f1acef9785bb2b9cbb0fdec84
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1163%2Fderrickstolee%2Fgen-v3-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1163/derrickstolee/gen-v3-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1163
-- 
gitgitgadget


* [PATCH 1/7] test-read-graph: include extra post-parse info
  2022-02-24 20:38 [PATCH 0/7] Commit-graph: Generation Number v2 Fixes, v3 implementation Derrick Stolee via GitGitGadget
@ 2022-02-24 20:38 ` Derrick Stolee via GitGitGadget
  2022-02-24 20:38 ` [PATCH 2/7] commit-graph: fix ordering bug in generation numbers Derrick Stolee via GitGitGadget
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 70+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-24 20:38 UTC (permalink / raw)
  To: git; +Cc: me, gitster, abhishekkumar8222, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

It can be helpful to verify that the 'struct commit_graph' that results
from parsing a commit-graph is correctly structured. The existence of
different chunks is not enough to verify that all of the optional
features are correctly enabled.

Update 'test-tool read-graph' to output an "options:" line that includes
information for different parts of the struct commit_graph.

In particular, this change demonstrates that the read_generation_data
option is never being enabled, which will be fixed in a later change.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 t/helper/test-read-graph.c    | 13 +++++++++++++
 t/t4216-log-bloom.sh          |  1 +
 t/t5318-commit-graph.sh       |  1 +
 t/t5324-split-commit-graph.sh |  5 +++++
 4 files changed, 20 insertions(+)

diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index 75927b2c81d..c3b6b8d1734 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -3,6 +3,7 @@
 #include "commit-graph.h"
 #include "repository.h"
 #include "object-store.h"
+#include "bloom.h"
 
 int cmd__read_graph(int argc, const char **argv)
 {
@@ -45,6 +46,18 @@ int cmd__read_graph(int argc, const char **argv)
 		printf(" bloom_data");
 	printf("\n");
 
+	printf("options:");
+	if (graph->bloom_filter_settings)
+		printf(" bloom(%d,%d,%d)",
+		       graph->bloom_filter_settings->hash_version,
+		       graph->bloom_filter_settings->bits_per_entry,
+		       graph->bloom_filter_settings->num_hashes);
+	if (graph->read_generation_data)
+		printf(" read_generation_data");
+	if (graph->topo_levels)
+		printf(" topo_levels");
+	printf("\n");
+
 	UNLEAK(graph);
 
 	return 0;
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index cc3cebf6722..5ed6d2a21c1 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -48,6 +48,7 @@ graph_read_expect () {
 	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
 	num_commits: $1
 	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
+	options: bloom(1,10,7)
 	EOF
 	test-tool read-graph >actual &&
 	test_cmp expect actual
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index edb728f77c3..2b05026cf6d 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -104,6 +104,7 @@ graph_read_expect() {
 	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
 	num_commits: $1
 	chunks: oid_fanout oid_lookup commit_metadata$OPTIONAL
+	options:
 	EOF
 	test-tool read-graph >output &&
 	test_cmp expect output
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 847b8097109..778fa418de2 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -34,6 +34,7 @@ graph_read_expect() {
 	header: 43475048 1 $(test_oid oid_version) 4 $NUM_BASE
 	num_commits: $1
 	chunks: oid_fanout oid_lookup commit_metadata generation_data
+	options:
 	EOF
 	test-tool read-graph >output &&
 	test_cmp expect output
@@ -508,6 +509,7 @@ test_expect_success 'setup repo for mixed generation commit-graph-chain' '
 		header: 43475048 1 $(test_oid oid_version) 4 1
 		num_commits: $NUM_SECOND_LAYER_COMMITS
 		chunks: oid_fanout oid_lookup commit_metadata
+		options:
 		EOF
 		test_cmp expect output &&
 		git commit-graph verify &&
@@ -540,6 +542,7 @@ test_expect_success 'do not write generation data chunk if not present on existi
 		header: 43475048 1 $(test_oid oid_version) 4 2
 		num_commits: $NUM_THIRD_LAYER_COMMITS
 		chunks: oid_fanout oid_lookup commit_metadata
+		options:
 		EOF
 		test_cmp expect output &&
 		git commit-graph verify
@@ -581,6 +584,7 @@ test_expect_success 'do not write generation data chunk if the topmost remaining
 		header: 43475048 1 $(test_oid oid_version) 4 2
 		num_commits: $(($NUM_THIRD_LAYER_COMMITS + $NUM_FOURTH_LAYER_COMMITS))
 		chunks: oid_fanout oid_lookup commit_metadata
+		options:
 		EOF
 		test_cmp expect output &&
 		git commit-graph verify
@@ -620,6 +624,7 @@ test_expect_success 'write generation data chunk if topmost remaining layer has
 		header: 43475048 1 $(test_oid oid_version) 5 1
 		num_commits: $(($NUM_SECOND_LAYER_COMMITS + $NUM_THIRD_LAYER_COMMITS + $NUM_FOURTH_LAYER_COMMITS + $NUM_FIFTH_LAYER_COMMITS))
 		chunks: oid_fanout oid_lookup commit_metadata generation_data
+		options:
 		EOF
 		test_cmp expect output
 	)
-- 
gitgitgadget



* [PATCH 2/7] commit-graph: fix ordering bug in generation numbers
  2022-02-24 20:38 [PATCH 0/7] Commit-graph: Generation Number v2 Fixes, v3 implementation Derrick Stolee via GitGitGadget
  2022-02-24 20:38 ` [PATCH 1/7] test-read-graph: include extra post-parse info Derrick Stolee via GitGitGadget
@ 2022-02-24 20:38 ` Derrick Stolee via GitGitGadget
  2022-02-24 22:15   ` Junio C Hamano
  2022-02-24 20:38 ` [PATCH 3/7] commit-graph: start parsing generation v2 (again) Derrick Stolee via GitGitGadget
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 70+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-24 20:38 UTC (permalink / raw)
  To: git; +Cc: me, gitster, abhishekkumar8222, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

When computing the generation numbers for a commit-graph, we compute
the corrected commit dates and then check if their offsets from the
actual dates are too large to fit in the 32-bit Generation Data chunk.
However, there is a problem with this approach: if we have parsed the
generation data from the previous commit-graph, then we continue the
loop because the corrected commit date is already computed.

It is incorrect to add an increment to num_generation_data_overflows
here, because we might start double-counting commits that are computed
because of the depth-first search walk from a commit with an earlier
OID.

Instead, iterate over the full commit list at the end, checking the
offsets to see how many grow beyond the maximum value.
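
As a rough illustration of the approach described above, the following sketch counts overflowing offsets in a separate pass over a finished list. The names struct toy_entry, TOY_OFFSET_MAX and count_offset_overflows() are hypothetical; TOY_OFFSET_MAX is assumed to play the role of GENERATION_NUMBER_V2_OFFSET_MAX, taken here to be the largest value that fits in 31 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: stands in for GENERATION_NUMBER_V2_OFFSET_MAX. */
#define TOY_OFFSET_MAX ((UINT64_C(1) << 31) - 1)

struct toy_entry {
	uint64_t date;       /* commit date */
	uint64_t generation; /* corrected commit date, never below date */
};

/*
 * Count entries whose offset does not fit in the fixed-width slot.
 * Because this visits each entry exactly once and keeps no state
 * between calls, it cannot double-count the way an increment buried
 * inside the depth-first walk can.
 */
static size_t count_offset_overflows(const struct toy_entry *list, size_t nr)
{
	size_t overflows = 0;

	for (size_t i = 0; i < nr; i++) {
		uint64_t offset = list[i].generation - list[i].date;

		if (offset > TOY_OFFSET_MAX)
			overflows++;
	}
	return overflows;
}
```

The result depends only on the final generation values, so it is the same whether those values were computed in this run or parsed from an existing commit-graph.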

Update a test in t5318 to use a larger time value, which will help
demonstrate this bug in more cases. It still won't hit all potential
cases until the next change, which reenables reading generation numbers.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 commit-graph.c          | 10 +++++++---
 t/t5318-commit-graph.sh |  4 ++--
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 265c010122e..a19bd96c2ee 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1556,12 +1556,16 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 				if (current->date && current->date > max_corrected_commit_date)
 					max_corrected_commit_date = current->date - 1;
 				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
-
-				if (commit_graph_data_at(current)->generation - current->date > GENERATION_NUMBER_V2_OFFSET_MAX)
-					ctx->num_generation_data_overflows++;
 			}
 		}
 	}
+
+	for (i = 0; i < ctx->commits.nr; i++) {
+		struct commit *c = ctx->commits.list[i];
+		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
+		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX)
+			ctx->num_generation_data_overflows++;
+	}
 	stop_progress(&ctx->progress);
 }
 
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 2b05026cf6d..f9bffe38013 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -467,10 +467,10 @@ test_expect_success 'warn on improper hash version' '
 	)
 '
 
-test_expect_success 'lower layers have overflow chunk' '
+test_expect_success TIME_IS_64BIT,TIME_T_IS_64BIT 'lower layers have overflow chunk' '
 	cd "$TRASH_DIRECTORY/full" &&
 	UNIX_EPOCH_ZERO="@0 +0000" &&
-	FUTURE_DATE="@2147483646 +0000" &&
+	FUTURE_DATE="@4147483646 +0000" &&
 	rm -f .git/objects/info/commit-graph &&
 	test_commit --date "$FUTURE_DATE" future-1 &&
 	test_commit --date "$UNIX_EPOCH_ZERO" old-1 &&
-- 
gitgitgadget



* [PATCH 3/7] commit-graph: start parsing generation v2 (again)
  2022-02-24 20:38 [PATCH 0/7] Commit-graph: Generation Number v2 Fixes, v3 implementation Derrick Stolee via GitGitGadget
  2022-02-24 20:38 ` [PATCH 1/7] test-read-graph: include extra post-parse info Derrick Stolee via GitGitGadget
  2022-02-24 20:38 ` [PATCH 2/7] commit-graph: fix ordering bug in generation numbers Derrick Stolee via GitGitGadget
@ 2022-02-24 20:38 ` Derrick Stolee via GitGitGadget
  2022-02-28 15:18   ` Patrick Steinhardt
  2022-02-24 20:38 ` [PATCH 4/7] commit-graph: fix generation number v2 overflow values Derrick Stolee via GitGitGadget
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 70+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-24 20:38 UTC (permalink / raw)
  To: git; +Cc: me, gitster, abhishekkumar8222, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The 'read_generation_data' member of 'struct commit_graph' was
introduced by 1fdc383c5 (commit-graph: use generation v2 only if entire
chain does, 2021-01-16). The intention was to avoid using corrected
commit dates if not all layers of a commit-graph had that data stored.
The logic in validate_mixed_generation_chain() at that point incorrectly
initialized read_generation_data to 1 if and only if the tip
commit-graph contained the Corrected Commit Date chunk.

This was "fixed" in 448a39e65 (commit-graph: validate layers for
generation data, 2021-02-02) to validate that read_generation_data was
either non-zero for all layers, or it would set read_generation_data to
zero for all layers.

The problem here is that read_generation_data is not initialized to be
non-zero anywhere!

This change initializes read_generation_data immediately after the chunk
is parsed, so each layer will have its value present as soon as
possible.

The read_generation_data member is used in fill_commit_graph_info() to
determine if we should use the corrected commit date or the topological
levels stored in the Commit Data chunk. Due to this bug, all previous
versions of Git were defaulting to topological levels in all cases!
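
The interaction between per-layer parsing and chain validation can be sketched as below. This is a simplified model, not the code in commit-graph.c: struct toy_graph, toy_parse_layer() and toy_validate_chain() are hypothetical names, and the validation step only captures the all-or-nothing rule described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one layer of a split commit-graph chain. */
struct toy_graph {
	struct toy_graph *base;   /* next lower layer, or NULL */
	int has_generation_chunk; /* chunk was found while parsing */
	int read_generation_data; /* may we use corrected commit dates? */
};

/* The fix: set the flag as soon as the chunk is known to exist. */
static void toy_parse_layer(struct toy_graph *g)
{
	if (g->has_generation_chunk)
		g->read_generation_data = 1;
}

/* Keep the flag only if every layer in the chain has it set. */
static void toy_validate_chain(struct toy_graph *tip)
{
	struct toy_graph *g;
	int all_layers_have_data = 1;

	for (g = tip; g; g = g->base)
		if (!g->read_generation_data)
			all_layers_have_data = 0;

	if (!all_layers_have_data)
		for (g = tip; g; g = g->base)
			g->read_generation_data = 0;
}
```

If toy_parse_layer() is never called, every layer starts with the flag at zero and validation can only ever clear it, which models the behavior this patch corrects.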

This can be measured with some performance tests. Using the Linux kernel
as a testbed, I generated a complete commit-graph containing corrected
commit dates and tested the 'new' version against the previous, 'old'
version.

First, rev-list with --topo-order demonstrates a 26% improvement using
corrected commit dates:

hyperfine \
	-n "old" "$OLD_GIT rev-list --topo-order -1000 v3.6" \
	-n "new" "$NEW_GIT rev-list --topo-order -1000 v3.6" \
	--warmup=10

Benchmark 1: old
  Time (mean ± σ):      57.1 ms ±   3.1 ms
  Range (min … max):    52.9 ms …  62.0 ms    55 runs

Benchmark 2: new
  Time (mean ± σ):      45.5 ms ±   3.3 ms
  Range (min … max):    39.9 ms …  51.7 ms    59 runs

Summary
  'new' ran
    1.26 ± 0.11 times faster than 'old'

These performance improvements are due to the algorithmic improvements
given by walking fewer commits due to the higher cutoffs from corrected
commit dates.

However, this comes at a cost. The additional I/O cost of parsing the
corrected commit dates is visible in case of merge-base commands that do
not reduce the overall number of walked commits.

hyperfine \
        -n "old" "$OLD_GIT merge-base v4.8 v4.9" \
        -n "new" "$NEW_GIT merge-base v4.8 v4.9" \
        --warmup=10

Benchmark 1: old
  Time (mean ± σ):     110.4 ms ±   6.4 ms
  Range (min … max):    96.0 ms … 118.3 ms    25 runs

Benchmark 2: new
  Time (mean ± σ):     150.7 ms ±   1.1 ms
  Range (min … max):   149.3 ms … 153.4 ms    19 runs

Summary
  'old' ran
    1.36 ± 0.08 times faster than 'new'

Performance issues like this are what motivated 702110aac (commit-graph:
use config to specify generation type, 2021-02-25).

In the future, we could fix this performance problem by inserting the
corrected commit date offsets into the Commit Data chunk instead of
having that data in an extra chunk.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 commit-graph.c                |  3 +++
 t/t4216-log-bloom.sh          |  2 +-
 t/t5318-commit-graph.sh       | 14 ++++++++++++--
 t/t5324-split-commit-graph.sh |  9 +++++++--
 4 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index a19bd96c2ee..8e52bb09552 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -407,6 +407,9 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 			&graph->chunk_generation_data);
 		pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
 			&graph->chunk_generation_data_overflow);
+
+		if (graph->chunk_generation_data)
+			graph->read_generation_data = 1;
 	}
 
 	if (r->settings.commit_graph_read_changed_paths) {
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 5ed6d2a21c1..fa9d32facfb 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -48,7 +48,7 @@ graph_read_expect () {
 	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
 	num_commits: $1
 	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
-	options: bloom(1,10,7)
+	options: bloom(1,10,7) read_generation_data
 	EOF
 	test-tool read-graph >actual &&
 	test_cmp expect actual
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index f9bffe38013..1afee1c2705 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -100,11 +100,21 @@ graph_read_expect() {
 		OPTIONAL=" $2"
 		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
 	fi
+	GENERATION_VERSION=2
+	if test ! -z "$3"
+	then
+		GENERATION_VERSION=$3
+	fi
+	OPTIONS=
+	if test $GENERATION_VERSION -gt 1
+	then
+		OPTIONS=" read_generation_data"
+	fi
 	cat >expect <<- EOF
 	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
 	num_commits: $1
 	chunks: oid_fanout oid_lookup commit_metadata$OPTIONAL
-	options:
+	options:$OPTIONS
 	EOF
 	test-tool read-graph >output &&
 	test_cmp expect output
@@ -498,7 +508,7 @@ test_expect_success 'git commit-graph verify' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git rev-parse commits/8 | git -c commitGraph.generationVersion=1 commit-graph write --stdin-commits &&
 	git commit-graph verify >output &&
-	graph_read_expect 9 extra_edges
+	graph_read_expect 9 extra_edges 1
 '
 
 NUM_COMMITS=9
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 778fa418de2..669ddc645fa 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -30,11 +30,16 @@ graph_read_expect() {
 	then
 		NUM_BASE=$2
 	fi
+	OPTIONS=
+	if test -z "$3"
+	then
+		OPTIONS=" read_generation_data"
+	fi
 	cat >expect <<- EOF
 	header: 43475048 1 $(test_oid oid_version) 4 $NUM_BASE
 	num_commits: $1
 	chunks: oid_fanout oid_lookup commit_metadata generation_data
-	options:
+	options:$OPTIONS
 	EOF
 	test-tool read-graph >output &&
 	test_cmp expect output
@@ -624,7 +629,7 @@ test_expect_success 'write generation data chunk if topmost remaining layer has
 		header: 43475048 1 $(test_oid oid_version) 5 1
 		num_commits: $(($NUM_SECOND_LAYER_COMMITS + $NUM_THIRD_LAYER_COMMITS + $NUM_FOURTH_LAYER_COMMITS + $NUM_FIFTH_LAYER_COMMITS))
 		chunks: oid_fanout oid_lookup commit_metadata generation_data
-		options:
+		options: read_generation_data
 		EOF
 		test_cmp expect output
 	)
-- 
gitgitgadget



* [PATCH 4/7] commit-graph: fix generation number v2 overflow values
  2022-02-24 20:38 [PATCH 0/7] Commit-graph: Generation Number v2 Fixes, v3 implementation Derrick Stolee via GitGitGadget
                   ` (2 preceding siblings ...)
  2022-02-24 20:38 ` [PATCH 3/7] commit-graph: start parsing generation v2 (again) Derrick Stolee via GitGitGadget
@ 2022-02-24 20:38 ` Derrick Stolee via GitGitGadget
  2022-02-24 22:35   ` Junio C Hamano
  2022-02-24 20:38 ` [PATCH 5/7] commit-graph: document file format v2 Derrick Stolee via GitGitGadget
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 70+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-24 20:38 UTC (permalink / raw)
  To: git; +Cc: me, gitster, abhishekkumar8222, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The Generation Data Chunk was implemented and tested in e8b63005c
(commit-graph: implement generation data chunk, 2021-01-16), but the
test was carefully constructed to work on systems with 32-bit dates.
Since the corrected commit date offsets still required more than 31
bits, this triggered writing the generation_data_overflow chunk.

However, upon closer look, the
write_graph_chunk_generation_data_overflow() method writes the offsets
to the chunk (as dictated by the format) but fill_commit_graph_info()
treats the value in the chunk as if it is the full corrected commit date
(not an offset). For some reason, this does not cause an issue when
using the FUTURE_DATE specified in t5318-commit-graph.sh, but it does
show up as a failure in 'git commit-graph verify' if we increase that
FUTURE_DATE to be above four billion.

Fix this error and update the test to require 64-bit dates so we can
safely use this large value in our test.
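
To make the mismatch concrete, here is a small sketch of the read side under the assumption that the overflow chunk holds big-endian 8-byte offsets and that a stored 32-bit value with its top bit set is an index into that chunk. toy_get_be64() and toy_generation() are hypothetical helpers, and TOY_OVERFLOW_BIT is assumed to correspond to CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: mirrors CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW. */
#define TOY_OVERFLOW_BIT (UINT32_C(1) << 31)

/* Read a big-endian 64-bit value from a byte buffer. */
static uint64_t toy_get_be64(const unsigned char *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Turn the stored 32-bit value into a corrected commit date.
 * In both branches the stored quantity is an offset, so the
 * commit date has to be added back.
 */
static uint64_t toy_generation(uint64_t date, uint32_t stored,
			       const unsigned char *overflow_chunk)
{
	if (stored & TOY_OVERFLOW_BIT) {
		uint32_t pos = stored ^ TOY_OVERFLOW_BIT;

		return date + toy_get_be64(overflow_chunk + 8 * (uint64_t)pos);
	}
	return date + stored;
}
```

Dropping the "date +" term in the overflow branch goes unnoticed whenever the commit date is zero, which may be part of why a test built around epoch-zero commits did not catch it; that explanation is a guess and is not confirmed by the patch.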

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 commit-graph.c          |  2 +-
 t/t5318-commit-graph.sh | 21 +++++++++++++++++++--
 2 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 8e52bb09552..b86a6a634fe 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -806,7 +806,7 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 				die(_("commit-graph requires overflow generation data but has none"));
 
 			offset_pos = offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW;
-			graph_data->generation = get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
+			graph_data->generation = item->date + get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
 		} else
 			graph_data->generation = item->date + offset;
 	} else
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 1afee1c2705..5e4b0216fa6 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -815,6 +815,19 @@ test_expect_success 'corrupt commit-graph write (missing tree)' '
 	)
 '
 
+# The remaining tests check timestamps that flow over
+# 32-bits. The graph_git_behavior checks can't take a
+# prereq, so just stop here if we are on a 32-bit machine.
+
+if ! test_have_prereq TIME_IS_64BIT
+then
+	test_done
+fi
+if ! test_have_prereq TIME_T_IS_64BIT
+then
+	test_done
+fi
+
 # We test the overflow-related code with the following repo history:
 #
 #               4:F - 5:N - 6:U
@@ -832,10 +845,10 @@ test_expect_success 'corrupt commit-graph write (missing tree)' '
 # The largest offset observed is 2 ^ 31, just large enough to overflow.
 #
 
-test_expect_success 'set up and verify repo with generation data overflow chunk' '
+test_expect_success TIME_IS_64BIT,TIME_T_IS_64BIT 'set up and verify repo with generation data overflow chunk' '
 	objdir=".git/objects" &&
 	UNIX_EPOCH_ZERO="@0 +0000" &&
-	FUTURE_DATE="@2147483646 +0000" &&
+	FUTURE_DATE="@4000000000 +0000" &&
 	test_oid_cache <<-EOF &&
 	oid_version sha1:1
 	oid_version sha256:2
@@ -867,4 +880,8 @@ test_expect_success 'set up and verify repo with generation data overflow chunk'
 
 graph_git_behavior 'generation data overflow chunk repo' repo left right
 
+# Do not add tests at the end of this file, unless they require 64-bit
+# timestamps, since this portion of the script is only executed when
+# time data types have 64 bits.
+
 test_done
-- 
gitgitgadget



* [PATCH 5/7] commit-graph: document file format v2
  2022-02-24 20:38 [PATCH 0/7] Commit-graph: Generation Number v2 Fixes, v3 implementation Derrick Stolee via GitGitGadget
                   ` (3 preceding siblings ...)
  2022-02-24 20:38 ` [PATCH 4/7] commit-graph: fix generation number v2 overflow values Derrick Stolee via GitGitGadget
@ 2022-02-24 20:38 ` Derrick Stolee via GitGitGadget
  2022-02-24 22:55   ` Junio C Hamano
  2022-02-25 22:31   ` Ævar Arnfjörð Bjarmason
  2022-02-24 20:38 ` [PATCH 6/7] commit-graph: parse " Derrick Stolee via GitGitGadget
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 70+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-24 20:38 UTC (permalink / raw)
  To: git; +Cc: me, gitster, abhishekkumar8222, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The corrected commit date was first documented in 5a3b130ca (doc: add
corrected commit date info, 2021-01-16) and it used an optional chunk to
augment the commit-graph format without modifying the file format
version.

One major benefit to this approach is that corrected commit dates could
be written without causing a backwards compatibility issue with Git
versions that do not understand them. The topological level was still
available in the CDAT chunk as it was before.

However, this causes a different issue: more data needs to be loaded
from disk when parsing commits from the commit-graph. In cases where
there is no significant algorithmic gain from using corrected commit
dates, commit walks take up to 20% longer because of this extra data.

Create a new file format version for the commit-graph format that
differs only in the CDAT chunk: it now stores corrected commit date
offsets. This brings our data back to normal and will demonstrate
performance gains in almost all cases.
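
The 8-byte field in question can be modeled as two 32-bit words. The sketch below assumes the layout given in the format document: the upper 30 bits of the first word hold the generation field, its lowest 2 bits hold bits 33 and 34 of the commit time, and the second word holds the low 32 bits of the commit time. The toy_* names are hypothetical, and the words are kept in host order for simplicity, whereas the file stores them big-endian.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a 30-bit generation field and a 34-bit commit time. */
static void toy_pack(uint32_t gen30, uint64_t date, uint32_t out[2])
{
	out[0] = ((gen30 & 0x3fffffffu) << 2) |
		 (uint32_t)((date >> 32) & 0x3);
	out[1] = (uint32_t)(date & 0xffffffffu);
}

/* Topological level in format v1, corrected date offset in format v2. */
static uint32_t toy_unpack_generation(const uint32_t in[2])
{
	return in[0] >> 2;
}

static uint64_t toy_unpack_date(const uint32_t in[2])
{
	return ((uint64_t)(in[0] & 0x3) << 32) | in[1];
}
```

Only the interpretation of the 30-bit field changes between the two format versions; the commit time bits are untouched, so no extra chunk has to be read to obtain the offset.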

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 .../technical/commit-graph-format.txt         | 22 ++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index 87971c27dd7..2cb48993314 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -36,7 +36,7 @@ HEADER:
       The signature is: {'C', 'G', 'P', 'H'}
 
   1-byte version number:
-      Currently, the only valid version is 1.
+      This version number can be 1 or 2.
 
   1-byte Hash Version
       We infer the hash length (H) from this value:
@@ -85,13 +85,22 @@ CHUNK DATA:
       position. If there are more than two parents, the second value
       has its most-significant bit on and the other bits store an array
       position into the Extra Edge List chunk.
-    * The next 8 bytes store the topological level (generation number v1)
-      of the commit and
-      the commit time in seconds since EPOCH. The generation number
-      uses the higher 30 bits of the first 4 bytes, while the commit
+    * The next 8 bytes store the generation number information of the
+      commit and the commit time in seconds since EPOCH. The generation
+      number uses the higher 30 bits of the first 4 bytes, while the commit
       time uses the 32 bits of the second 4 bytes, along with the lowest
       2 bits of the lowest byte, storing the 33rd and 34th bit of the
       commit time.
+      - If the commit-graph file format is version 1, then the higher 30
+	bits contain the topological level (generation number v1) for the
+	commit.
+      - If the commit-graph file format is version 2, then the higher 30
+	bits contain the corrected commit date offset (generation number
+	v2) for the commit, except if the offset cannot be stored within
+	29 bits. If the offset is too large for 29 bits, then the value
+	stored here has its most-significant bit on and the other bits
+	store the position of the corrected commit date in the Generation
+	Data Overflow chunk.
 
   Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
     * This list of 4-byte values store corrected commit date offsets for the
@@ -103,6 +112,9 @@ CHUNK DATA:
     * Generation Data chunk is present only when commit-graph file is written
       by compatible versions of Git and in case of split commit-graph chains,
       the topmost layer also has Generation Data chunk.
+    * This chunk does not exist if the commit-graph file format version is 2,
+      because the corrected commit date offset data is stored in the Commit
+      Data chunk.
 
   Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
     * This list of 8-byte values stores the corrected commit date offsets
-- 
gitgitgadget



* [PATCH 6/7] commit-graph: parse file format v2
  2022-02-24 20:38 [PATCH 0/7] Commit-graph: Generation Number v2 Fixes, v3 implementation Derrick Stolee via GitGitGadget
                   ` (4 preceding siblings ...)
  2022-02-24 20:38 ` [PATCH 5/7] commit-graph: document file format v2 Derrick Stolee via GitGitGadget
@ 2022-02-24 20:38 ` Derrick Stolee via GitGitGadget
  2022-02-24 23:01   ` Junio C Hamano
  2022-02-24 20:38 ` [PATCH 7/7] commit-graph: write " Derrick Stolee via GitGitGadget
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 70+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-24 20:38 UTC (permalink / raw)
  To: git; +Cc: me, gitster, abhishekkumar8222, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The commit-graph file format v2 stores the corrected commit date
offsets within the Commit Data chunk instead of in a separate chunk.
The idea is to significantly reduce the amount of data loaded from
disk while parsing the commit-graph.

We need to alter the error message when we see a file format version
outside of our range now that multiple are possible. This has a
non-functional side-effect of altering a use of GRAPH_VERSION within
write_commit_graph().

By storing the file format version in 'struct commit_graph', we can
alter the parsing code to depend on that version value. This involves
changing where we look for the corrected commit date offset, but also
which constants we use for jumping into the Generation Data Overflow
chunk. The Commit Data chunk only has 30 bits available for the offset
while the Generation Data chunk has 32 bits. This only makes a
meaningful difference in very malformed repositories.

Also, we need to be careful about how we enable using corrected commit
dates and generation numbers: we must rely upon the read_generation_data
value instead of a non-zero value in the Commit Data chunk. In
generation_numbers_enabled(), the first_generation variable is
attempting to look for the first topological level stored to see that it
is nonzero. However, for a v2 commit-graph, this value is actually
likely to be zero because the corrected commit date offset is probably
zero.
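
A simplified model of the two checks discussed in this message might look like the following. It is only a sketch: struct toy_graph_info and the toy_* functions are hypothetical, and first_generation_field stands for the upper 30 bits of the first Commit Data entry.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_VERSION_MIN 1
#define TOY_VERSION_MAX 2

/* Hypothetical summary of a parsed commit-graph. */
struct toy_graph_info {
	unsigned char version;
	uint32_t num_commits;
	uint32_t first_generation_field; /* upper 30 bits of first entry */
	int read_generation_data;
};

/* Accept any version inside the supported range. */
static int toy_version_ok(unsigned char version)
{
	return version >= TOY_VERSION_MIN && version <= TOY_VERSION_MAX;
}

static int toy_generation_numbers_enabled(const struct toy_graph_info *g)
{
	if (!g->num_commits)
		return 0;
	/*
	 * In format v2 the field holds an offset that is usually zero,
	 * so a nonzero test would wrongly report "disabled".
	 */
	if (g->version >= 2)
		return g->read_generation_data;
	/* In format v1 a topological level is at least 1 once computed. */
	return g->first_generation_field != 0;
}
```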

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 commit-graph.c          | 43 +++++++++++++++++++++++++++++------------
 commit-graph.h          |  6 ++++++
 t/t5318-commit-graph.sh |  2 +-
 3 files changed, 38 insertions(+), 13 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index b86a6a634fe..366fc4d6e41 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -49,7 +49,9 @@ void git_test_write_commit_graph_or_die(void)
 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
 
 #define GRAPH_VERSION_1 0x1
-#define GRAPH_VERSION GRAPH_VERSION_1
+#define GRAPH_VERSION_2 0x2
+#define GRAPH_VERSION_MIN GRAPH_VERSION_1
+#define GRAPH_VERSION_MAX GRAPH_VERSION_2
 
 #define GRAPH_EXTRA_EDGES_NEEDED 0x80000000
 #define GRAPH_EDGE_LAST_MASK 0x7fffffff
@@ -63,6 +65,7 @@ void git_test_write_commit_graph_or_die(void)
 			+ GRAPH_FANOUT_SIZE + the_hash_algo->rawsz)
 
 #define CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW (1ULL << 31)
+#define CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW_V3 (1ULL << 29)
 
 /* Remember to update object flag allocation in object.h */
 #define REACHABLE       (1u<<15)
@@ -358,9 +361,10 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 	}
 
 	graph_version = *(unsigned char*)(data + 4);
-	if (graph_version != GRAPH_VERSION) {
-		error(_("commit-graph version %X does not match version %X"),
-		      graph_version, GRAPH_VERSION);
+	if (graph_version < GRAPH_VERSION_MIN ||
+	    graph_version > GRAPH_VERSION_MAX) {
+		error(_("commit-graph version %X is not understood"),
+		      graph_version);
 		return NULL;
 	}
 
@@ -375,6 +379,7 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 
 	graph = alloc_commit_graph();
 
+	graph->version = graph_version;
 	graph->hash_len = the_hash_algo->rawsz;
 	graph->num_chunks = *(unsigned char*)(data + 6);
 	graph->data = graph_map;
@@ -402,13 +407,17 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 	pair_chunk(cf, GRAPH_CHUNKID_EXTRAEDGES, &graph->chunk_extra_edges);
 	pair_chunk(cf, GRAPH_CHUNKID_BASE, &graph->chunk_base_graphs);
 
-	if (get_configured_generation_version(r) >= 2) {
-		pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA,
-			&graph->chunk_generation_data);
+	if (graph_version >= GRAPH_VERSION_2 ||
+	    get_configured_generation_version(r) >= 2) {
+		/* Skip this chunk if GRAPH_VERSION_2 or higher. */
+		if (graph_version == GRAPH_VERSION_1)
+			pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA,
+				   &graph->chunk_generation_data);
 		pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
 			&graph->chunk_generation_data_overflow);
 
-		if (graph->chunk_generation_data)
+		if (graph_version >= GRAPH_VERSION_2 ||
+		    graph->chunk_generation_data)
 			graph->read_generation_data = 1;
 	}
 
@@ -683,6 +692,9 @@ int generation_numbers_enabled(struct repository *r)
 	if (!g->num_commits)
 		return 0;
 
+	if (g->version >= GRAPH_VERSION_2)
+		return g->read_generation_data;
+
 	first_generation = get_be32(g->chunk_commit_data +
 				    g->hash_len + 8) >> 2;
 
@@ -799,13 +811,20 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
 	if (g->read_generation_data) {
-		offset = (timestamp_t)get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
+		timestamp_t overflow_bit;
+		if (g->version == GRAPH_VERSION_2) {
+			offset = (timestamp_t)(get_be32(commit_data + g->hash_len + 8) >> 2);
+			overflow_bit = CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW_V3;
+		} else {
+			offset = (timestamp_t)get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
+			overflow_bit = CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW;
+		}
 
-		if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
+		if (offset & overflow_bit) {
 			if (!g->chunk_generation_data_overflow)
 				die(_("commit-graph requires overflow generation data but has none"));
 
-			offset_pos = offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW;
+			offset_pos = offset ^ overflow_bit;
 			graph_data->generation = item->date + get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
 		} else
 			graph_data->generation = item->date + offset;
@@ -1917,7 +1936,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 
 	hashwrite_be32(f, GRAPH_SIGNATURE);
 
-	hashwrite_u8(f, GRAPH_VERSION);
+	hashwrite_u8(f, GRAPH_VERSION_1);
 	hashwrite_u8(f, oid_version());
 	hashwrite_u8(f, get_num_chunks(cf));
 	hashwrite_u8(f, ctx->num_commit_graphs_after - 1);
diff --git a/commit-graph.h b/commit-graph.h
index 04a94e18302..b379b8eae25 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -63,6 +63,12 @@ struct commit_graph {
 	const unsigned char *data;
 	size_t data_len;
 
+	/**
+	 * The 'version' byte mirrors the file format version. This is
+	 * necessary to consider when parsing commits.
+	 */
+	unsigned version;
+
 	unsigned char hash_len;
 	unsigned char num_chunks;
 	uint32_t num_commits;
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 5e4b0216fa6..a14a13e5f7b 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -605,7 +605,7 @@ test_expect_success 'detect bad signature' '
 '
 
 test_expect_success 'detect bad version' '
-	corrupt_graph_and_verify $GRAPH_BYTE_VERSION "\02" \
+	corrupt_graph_and_verify $GRAPH_BYTE_VERSION "\03" \
 		"graph version"
 '
 
-- 
gitgitgadget


* [PATCH 7/7] commit-graph: write file format v2
  2022-02-24 20:38 [PATCH 0/7] Commit-graph: Generation Number v2 Fixes, v3 implementation Derrick Stolee via GitGitGadget
                   ` (5 preceding siblings ...)
  2022-02-24 20:38 ` [PATCH 6/7] commit-graph: parse " Derrick Stolee via GitGitGadget
@ 2022-02-24 20:38 ` Derrick Stolee via GitGitGadget
  2022-02-24 21:42 ` [PATCH 0/7] Commit-graph: Generation Number v2 Fixes, v3 implementation Junio C Hamano
  2022-02-28 13:53 ` [PATCH v2 0/4] Commit-graph: Generation Number v2 Fixes Derrick Stolee via GitGitGadget
  8 siblings, 0 replies; 70+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-24 20:38 UTC (permalink / raw)
  To: git; +Cc: me, gitster, abhishekkumar8222, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The commit-graph file format v2 exists so we can modify the meaning of
the Commit Data chunk to store corrected commit date offsets. Thus, we
trigger the write to use this different file format only if the
configured generation number version is 3.

The implementation needs to be careful of a few things to ensure we
enable computing corrected commit dates and do not compute topological
levels. We also still need the Generation Data Overflow chunk, but we
compute the offsets into that chunk while writing the Commit Data chunk
instead of the Generation Data chunk.
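
As a rough illustration of that writer-side logic, the sketch below assigns overflow positions while producing the 30-bit field for each commit. The helper names encode_v3_field() and pack_v3_word() are hypothetical; only the constant values are taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define V3_OFFSET_MAX   ((1ULL << 29) - 1) /* largest offset stored inline */
#define V3_OVERFLOW_BIT (1ULL << 29)       /* marks an overflow index */

/*
 * Hypothetical helper, not part of Git.
 *
 * Returns the 30-bit generation field for one commit. Offsets that do
 * not fit inline are replaced by the overflow bit plus the position of
 * the next entry in the overflow chunk, and the counter is advanced.
 */
uint32_t encode_v3_field(uint64_t offset, uint32_t *num_overflows)
{
	assert(num_overflows);
	if (offset > V3_OFFSET_MAX)
		return (uint32_t)(V3_OVERFLOW_BIT | (*num_overflows)++);
	return (uint32_t)offset;
}

/* Combine the field with the two high bits of the commit date. */
uint32_t pack_v3_word(uint32_t field, uint32_t date_high_bits)
{
	assert(field < (1u << 30));
	return (field << 2) | (date_high_bits & 0x3);
}
```

Because the counter advances in commit order, the overflow chunk has to be written in the same order for the stored positions to line up.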

Testing 'git merge-base v4.8 v4.9' in the Linux kernel with corrected
commit dates, where the only difference is the file format (generation
number v2 versus v3), we get these results:

Benchmark 1: generation number v2
  Time (mean ± σ):     144.4 ms ±   8.3 ms
  Range (min … max):   127.4 ms … 154.6 ms    20 runs

Benchmark 2: generation number v3
  Time (mean ± σ):     139.3 ms ±   7.3 ms
  Range (min … max):   125.1 ms … 148.1 ms    20 runs

This provides a 3.6% improvement, and the only reason is the reduced
I/O. This test was run with hot caches, so I re-ran it in the cold-cache
case, trying to demonstrate that this I/O cost is higher when reading
directly from disk every time:

Benchmark 1: generation number v2
  Time (mean ± σ):     469.9 ms ±  14.8 ms
  Range (min … max):   434.5 ms … 494.4 ms    10 runs

Benchmark 2: generation number v3
  Time (mean ± σ):     413.4 ms ±  18.9 ms
  Range (min … max):   372.8 ms … 428.3 ms    10 runs

With cold caches, the improvement increases to 13.4%.
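
For readers who want to rederive the percentages from the reported means, a small sketch follows. percent_saved() is a hypothetical helper, and the result depends on which run is taken as the baseline: relative to the v2 mean the savings come out at roughly 3.5% (hot) and 12.0% (cold), while relative to the v3 mean they are roughly 3.7% and 13.7%. The figures quoted above lie between those two conventions.

```c
#include <assert.h>

/*
 * Hypothetical helper: percentage of time saved when going from
 * 'before' to 'after', expressed relative to 'baseline'.
 */
double percent_saved(double before, double after, double baseline)
{
	assert(baseline > 0.0);
	return 100.0 * (before - after) / baseline;
}
```

For example, percent_saved(144.4, 139.3, 144.4) uses the hot-cache means with the v2 run as the baseline.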

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/config/commitgraph.txt |  4 ++-
 commit-graph.c                       | 46 +++++++++++++++++++++++++---
 commit.h                             |  1 +
 t/t5318-commit-graph.sh              | 25 ++++++++++++++-
 4 files changed, 69 insertions(+), 7 deletions(-)

diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
index 30604e4a4c2..79d57d06a67 100644
--- a/Documentation/config/commitgraph.txt
+++ b/Documentation/config/commitgraph.txt
@@ -1,7 +1,9 @@
 commitGraph.generationVersion::
 	Specifies the type of generation number version to use when writing
 	or reading the commit-graph file. If version 1 is specified, then
-	the corrected commit dates will not be written or read. Defaults to
+	the corrected commit dates will not be written or read. If version
+	3 is specified, then the commit-graph file will be slightly smaller,
+	but will be incompatible with some old versions of Git. Defaults to
 	2.
 
 commitGraph.maxNewFilters::
diff --git a/commit-graph.c b/commit-graph.c
index 366fc4d6e41..82f7401b283 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1035,6 +1035,7 @@ struct write_commit_graph_context {
 	struct progress *progress;
 	int progress_done;
 	uint64_t progress_cnt;
+	int version;
 
 	char *base_graph_name;
 	int num_commit_graphs_before;
@@ -1118,12 +1119,14 @@ static int write_graph_chunk_data(struct hashfile *f,
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
 	uint32_t num_extra_edges = 0;
+	int num_generation_data_overflows = 0;
 
 	while (list < last) {
 		struct commit_list *parent;
 		struct object_id *tree;
 		int edge_value;
 		uint32_t packedDate[2];
+		uint32_t generation_data;
 		display_progress(ctx->progress, ++ctx->progress_cnt);
 
 		if (repo_parse_commit_no_graph(ctx->r, *list))
@@ -1203,7 +1206,18 @@ static int write_graph_chunk_data(struct hashfile *f,
 		else
 			packedDate[0] = 0;
 
-		packedDate[0] |= htonl(*topo_level_slab_at(ctx->topo_levels, *list) << 2);
+		if (ctx->version == GRAPH_VERSION_1)
+			generation_data = *topo_level_slab_at(ctx->topo_levels, *list);
+		else {
+			generation_data = commit_graph_data_at(*list)->generation - (*list)->date;
+			if (generation_data > GENERATION_NUMBER_V3_OFFSET_MAX) {
+				generation_data = CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW_V3 |
+						  num_generation_data_overflows;
+				num_generation_data_overflows++;
+			}
+		}
+
+		packedDate[0] |= htonl(generation_data << 2);
 
 		packedDate[1] = htonl((*list)->date);
 		hashwrite(f, packedDate, 8);
@@ -1243,12 +1257,16 @@ static int write_graph_chunk_generation_data_overflow(struct hashfile *f,
 {
 	struct write_commit_graph_context *ctx = data;
 	int i;
+	timestamp_t offset_max = ctx->version >= 2 ?
+					GENERATION_NUMBER_V3_OFFSET_MAX :
+					GENERATION_NUMBER_V2_OFFSET_MAX;
+
 	for (i = 0; i < ctx->commits.nr; i++) {
 		struct commit *c = ctx->commits.list[i];
 		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
 		display_progress(ctx->progress, ++ctx->progress_cnt);
 
-		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
+		if (offset > offset_max) {
 			hashwrite_be32(f, offset >> 32);
 			hashwrite_be32(f, (uint32_t) offset);
 		}
@@ -1474,6 +1492,13 @@ static void compute_topological_levels(struct write_commit_graph_context *ctx)
 	int i;
 	struct commit_list *list = NULL;
 
+	/*
+	 * Skip topological levels if file format version is two or more,
+	 * since the Commit Data chunk uses corrected commit date offsets.
+	 */
+	if (ctx->version >= 2)
+		return;
+
 	if (ctx->report_progress)
 		ctx->progress = start_delayed_progress(
 					_("Computing commit graph topological levels"),
@@ -1526,6 +1551,9 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 {
 	int i;
 	struct commit_list *list = NULL;
+	timestamp_t offset_max = ctx->version >= 2 ?
+					GENERATION_NUMBER_V3_OFFSET_MAX :
+					GENERATION_NUMBER_V2_OFFSET_MAX;
 
 	if (ctx->report_progress)
 		ctx->progress = start_delayed_progress(
@@ -1585,7 +1613,7 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 	for (i = 0; i < ctx->commits.nr; i++) {
 		struct commit *c = ctx->commits.list[i];
 		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
-		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX)
+		if (offset > offset_max)
 			ctx->num_generation_data_overflows++;
 	}
 	stop_progress(&ctx->progress);
@@ -1908,7 +1936,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	add_chunk(cf, GRAPH_CHUNKID_DATA, (hashsz + 16) * ctx->commits.nr,
 		  write_graph_chunk_data);
 
-	if (ctx->write_generation_data)
+	if (ctx->write_generation_data && ctx->version == GRAPH_VERSION_1)
 		add_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA,
 			  sizeof(uint32_t) * ctx->commits.nr,
 			  write_graph_chunk_generation_data);
@@ -1936,7 +1964,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 
 	hashwrite_be32(f, GRAPH_SIGNATURE);
 
-	hashwrite_u8(f, GRAPH_VERSION_1);
+	hashwrite_u8(f, ctx->version);
 	hashwrite_u8(f, oid_version());
 	hashwrite_u8(f, get_num_chunks(cf));
 	hashwrite_u8(f, ctx->num_commit_graphs_after - 1);
@@ -2317,6 +2345,14 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->write_generation_data = (get_configured_generation_version(r) == 2);
 	ctx->num_generation_data_overflows = 0;
 
+	if (get_configured_generation_version(r) == 3)
+		ctx->version = GRAPH_VERSION_2;
+	else
+		ctx->version = GRAPH_VERSION_1;
+
+	if (ctx->version >= GRAPH_VERSION_2)
+		ctx->write_generation_data = 1;
+
 	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
 						      bloom_settings.bits_per_entry);
 	bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
diff --git a/commit.h b/commit.h
index 38cc5426615..a668b5cdec0 100644
--- a/commit.h
+++ b/commit.h
@@ -15,6 +15,7 @@
 #define GENERATION_NUMBER_V1_MAX 0x3FFFFFFF
 #define GENERATION_NUMBER_ZERO 0
 #define GENERATION_NUMBER_V2_OFFSET_MAX ((1ULL << 31) - 1)
+#define GENERATION_NUMBER_V3_OFFSET_MAX ((1ULL << 29) - 1)
 
 struct commit_list {
 	struct commit *item;
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index a14a13e5f7b..77e130ef63e 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -110,8 +110,13 @@ graph_read_expect() {
 	then
 		OPTIONS=" read_generation_data"
 	fi
+	VERSION=1
+	if test $GENERATION_VERSION -gt 2
+	then
+		VERSION=2
+	fi
 	cat >expect <<- EOF
-	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
+	header: 43475048 $VERSION $(test_oid oid_version) $NUM_CHUNKS 0
 	num_commits: $1
 	chunks: oid_fanout oid_lookup commit_metadata$OPTIONAL
 	options:$OPTIONS
@@ -343,6 +348,15 @@ test_expect_success 'build graph using --reachable' '
 graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
 graph_git_behavior 'append graph, commit 8 vs merge 2' full commits/8 merge/2
 
+test_expect_success 'write file format v2 with generation number v3' '
+	cd "$TRASH_DIRECTORY/full" &&
+	git -c commitGraph.generationVersion=3 commit-graph write --reachable &&
+	graph_read_expect "11" "extra_edges" 3
+'
+
+graph_git_behavior 'graph v2, commit 8 vs merge 1' full commits/8 merge/1
+graph_git_behavior 'graph v2, commit 8 vs merge 2' full commits/8 merge/2
+
 test_expect_success 'setup bare repo' '
 	cd "$TRASH_DIRECTORY" &&
 	git clone --bare --no-local full bare &&
@@ -880,6 +894,15 @@ test_expect_success TIME_IS_64BIT,TIME_T_IS_64BIT 'set up and verify repo with g
 
 graph_git_behavior 'generation data overflow chunk repo' repo left right
 
+test_expect_success TIME_IS_64BIT,TIME_T_IS_64BIT 'set up and verify repo with generation data overflow chunk (v3)' '
+	cd "$TRASH_DIRECTORY/repo" &&
+	git -c commitGraph.generationVersion=3 commit-graph write --reachable &&
+	graph_read_expect 10 "generation_data_overflow" 3 &&
+	git commit-graph verify
+'
+
+graph_git_behavior 'generation data overflow chunk repo' repo left right
+
 # Do not add tests at the end of this file, unless they require 64-bit
 # timestamps, since this portion of the script is only executed when
 # time data types have 64 bits.
-- 
gitgitgadget

* Re: [PATCH 0/7] Commit-graph: Generation Number v2 Fixes, v3 implementation
  2022-02-24 20:38 [PATCH 0/7] Commit-graph: Generation Number v2 Fixes, v3 implementation Derrick Stolee via GitGitGadget
                   ` (6 preceding siblings ...)
  2022-02-24 20:38 ` [PATCH 7/7] commit-graph: write " Derrick Stolee via GitGitGadget
@ 2022-02-24 21:42 ` Junio C Hamano
  2022-02-24 23:06   ` Junio C Hamano
  2022-02-28 13:53 ` [PATCH v2 0/4] Commit-graph: Generation Number v2 Fixes Derrick Stolee via GitGitGadget
  8 siblings, 1 reply; 70+ messages in thread
From: Junio C Hamano @ 2022-02-24 21:42 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, abhishekkumar8222, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This patch series includes two distinct, but similarly-motivated parts:
>
>  * Patches 1-4 fix some bugs in the commit-graph generation number v2.
>  * Patches 5-7 add a new generation number v3 by incrementing the
>    commit-graph file format.
>
> I had been thinking about generation number v3, which is the same corrected
> commit date as generation number v2, but it is stored in the Commit Data
> chunk, requiring a new commit-graph file format version. This breaks
> compatibility with older versions of Git, so it requires opt-in via the
> commitGraph.generationVersion config value. The only improvement over
> version 2 is that the commit-graph file is smaller, so I/O time is reduced.

Sounds exciting.  Locality of on-disk data does matter.

> However, while exploring this idea I found bugs in generation number v2. In
> particular, Git has been ignoring them since shortly after they were
> introduced. This is due to a bug I introduced when trying to make split
> commit-graphs safer with mixed generation number versions. I also noticed an
> issue with the offset overflows that I only noticed after writing generation
> number v3 using a smaller offset size, actually triggering the bug in the
> test suite.
>
> I'm submitting these two things together so we can see them all at once, but
> I'd be happy to split this into two series. The first four patches are
> important bug fixes, so we can consider them as higher-priority.
>
> Thanks, -Stolee

Thanks, will take a look.


* Re: [PATCH 2/7] commit-graph: fix ordering bug in generation numbers
  2022-02-24 20:38 ` [PATCH 2/7] commit-graph: fix ordering bug in generation numbers Derrick Stolee via GitGitGadget
@ 2022-02-24 22:15   ` Junio C Hamano
  2022-02-25 13:51     ` Derrick Stolee
  0 siblings, 1 reply; 70+ messages in thread
From: Junio C Hamano @ 2022-02-24 22:15 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, abhishekkumar8222, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <derrickstolee@github.com>
>
> When computing the generation numbers for a commit-graph, we compute
> the corrected commit dates and then check if their offsets from the
> actual dates is too large to fit in the 32-bit Generation Data chunk.
> However, there is a problem with this approach: if we have parsed the
> generation data from the previous commit-graph, then we continue the
> loop because the corrected commit date is already computed.
>
> It is incorrect to add an increment to num_generation_data_overflows
> here, because we might start double-counting commits that are computed
> because of the depth-first search walk from a commit with an earlier
> OID.
>
> Instead, iterate over the full commit list at the end, checking the
> offsets to see how many grow beyond the maximum value.

Hmph, I can see how the new code correctly counts the commits that
require offsets that are too large, but I am not sure why the fix is
needed.  The overall loop structure is

	for each commit ctx->commits.list[i] {
		continue if generation number has been computed for it already

		set up a commit-list for depth first search
		while (we are still digging) {
			for each parent {
				if generation for the parent is not known yet:
					push it down and redo
				else
					compute max of parents' generation number
			}
			if (all parents' generation number is known) {
				set the generation number for ourselves
				count if we needed an offset that is too big
			}
		}
	}

The only case where we may double-count near the end of inner loop I
can think of is when we end up computing generation for the same
commit in the while () loop.  But isn't that "we dig the same thing
twice" by itself something we want to fix, regardless of the
double-counting issue?

IOW,

>  				if (current->date && current->date > max_corrected_commit_date)
>  					max_corrected_commit_date = current->date - 1;
>  				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
> -
> -				if (commit_graph_data_at(current)->generation - current->date > GENERATION_NUMBER_V2_OFFSET_MAX)
> -					ctx->num_generation_data_overflows++;
>  			}
>  		}
>  	}

here, before doing the assignment for the "current" commit's
generation number, if we added

		if (commit_graph_data_at(current)->generation !=
		    GENERATION_NUMBER_ZERO)
			BUG("why are we digging it twice?");

would it trigger?  If so, isn't that already a bug worth fixing?

Perhaps avoiding the second round, like this, may be a better fix?

	while (list) {
		struct commit *current = list->item;
		struct commit_list *parent;
		int all_parents_computed = 1;
		timestamp_t max_corrected_commit_date = 0;

+		if (commit_graph_data_at(current)->generation !=
+		    GENERATION_NUMBER_ZERO) {
+			pop_commit(&list);
+			continue;
+		}
+
		for (parent = current->parents; parent; parent = parent->next) {

Or am I grossly misunderstanding why the original code is incorrect
to have the counting at this place?

> +
> +	for (i = 0; i < ctx->commits.nr; i++) {
> +		struct commit *c = ctx->commits.list[i];
> +		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
> +		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX)
> +			ctx->num_generation_data_overflows++;
> +	}
>  	stop_progress(&ctx->progress);
>  }
>  
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 2b05026cf6d..f9bffe38013 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -467,10 +467,10 @@ test_expect_success 'warn on improper hash version' '
>  	)
>  '
>  
> -test_expect_success 'lower layers have overflow chunk' '
> +test_expect_success TIME_IS_64BIT,TIME_T_IS_64BIT 'lower layers have overflow chunk' '
>  	cd "$TRASH_DIRECTORY/full" &&
>  	UNIX_EPOCH_ZERO="@0 +0000" &&
> -	FUTURE_DATE="@2147483646 +0000" &&
> +	FUTURE_DATE="@4147483646 +0000" &&
>  	rm -f .git/objects/info/commit-graph &&
>  	test_commit --date "$FUTURE_DATE" future-1 &&
>  	test_commit --date "$UNIX_EPOCH_ZERO" old-1 &&

* Re: [PATCH 4/7] commit-graph: fix generation number v2 overflow values
  2022-02-24 20:38 ` [PATCH 4/7] commit-graph: fix generation number v2 overflow values Derrick Stolee via GitGitGadget
@ 2022-02-24 22:35   ` Junio C Hamano
  2022-02-25 13:53     ` Derrick Stolee
  0 siblings, 1 reply; 70+ messages in thread
From: Junio C Hamano @ 2022-02-24 22:35 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, abhishekkumar8222, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> -			graph_data->generation = get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
> +			graph_data->generation = item->date + get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);

Wow, that's embarrassing.

> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 1afee1c2705..5e4b0216fa6 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -815,6 +815,19 @@ test_expect_success 'corrupt commit-graph write (missing tree)' '
>  	)
>  '
>  
> +# The remaining tests check timestamps that flow over
> +# 32-bits. The graph_git_behavior checks can't take a
> +# prereq, so just stop here if we are on a 32-bit machine.
> +
> +if ! test_have_prereq TIME_IS_64BIT
> +then
> +	test_done
> +fi
> +if ! test_have_prereq TIME_T_IS_64BIT
> +then
> +	test_done
> +fi

The above is OK but is there a reason why we cannot do

	if A || B
	then
		test_done
	fi

here (I am assuming not, but in case I am missing the reason why it
has to be separate)?

> @@ -832,10 +845,10 @@ test_expect_success 'corrupt commit-graph write (missing tree)' '
>  # The largest offset observed is 2 ^ 31, just large enough to overflow.
>  #
>  
> -test_expect_success 'set up and verify repo with generation data overflow chunk' '
> +test_expect_success TIME_IS_64BIT,TIME_T_IS_64BIT 'set up and verify repo with generation data overflow chunk' '
>  	objdir=".git/objects" &&
>  	UNIX_EPOCH_ZERO="@0 +0000" &&
> -	FUTURE_DATE="@2147483646 +0000" &&
> +	FUTURE_DATE="@4000000000 +0000" &&

OK. 16#EE6B2800 is too large to fit and will wrap around in a
signed 32-bit integer.
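
A quick way to check that claim is sketched below. Both helper names are hypothetical, and the wrapped value is computed arithmetically to avoid relying on implementation-defined narrowing conversions.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero when 'seconds' cannot be stored in a signed 32-bit value. */
int overflows_int32(uint64_t seconds)
{
	return seconds > (uint64_t)INT32_MAX;
}

/*
 * Value seen if the low 32 bits of 'seconds' are reinterpreted as a
 * two's-complement signed 32-bit integer.
 */
int64_t as_wrapped_int32(uint64_t seconds)
{
	assert(seconds <= UINT32_MAX);
	uint32_t low = (uint32_t)seconds;
	return low >= 0x80000000u ? (int64_t)low - 4294967296LL : (int64_t)low;
}
```

The new FUTURE_DATE of 4000000000 (0xEE6B2800) exceeds INT32_MAX and would wrap to -294967296, whereas the old value 2147483646 still fits.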

> @@ -867,4 +880,8 @@ test_expect_success 'set up and verify repo with generation data overflow chunk'
>  
>  graph_git_behavior 'generation data overflow chunk repo' repo left right
>  
> +# Do not add tests at the end of this file, unless they require 64-bit
> +# timestamps, since this portion of the script is only executed when
> +# time data types have 64 bits.
> +
>  test_done

* Re: [PATCH 5/7] commit-graph: document file format v2
  2022-02-24 20:38 ` [PATCH 5/7] commit-graph: document file format v2 Derrick Stolee via GitGitGadget
@ 2022-02-24 22:55   ` Junio C Hamano
  2022-02-25 22:31   ` Ævar Arnfjörð Bjarmason
  1 sibling, 0 replies; 70+ messages in thread
From: Junio C Hamano @ 2022-02-24 22:55 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, abhishekkumar8222, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <derrickstolee@github.com>
>
> The corrected commit date was first documented in 5a3b130ca (doc: add
> corrected commit date info, 2021-01-16) and it used an optional chunk to
> augment the commit-graph format without modifying the file format
> version.
>
> One major benefit to this approach is that corrected commit dates could
> be written without causing a backwards compatibility issue with Git
> versions that do not understand them. The topological level was still
> available in the CDAT chunk as it was before.
>
> However, this causes a different issue: more data needs to be loaded
> from disk when parsing commits from the commit-graph. In cases where
> there is no significant algorithmic gain from using corrected commit
> dates, commit walks take up to 20% longer because of this extra data.
>
> Create a new file format version for the commit-graph format that
> differs only in the CDAT chunk: it now stores corrected commit date
> offsets. This brings our data back to normal and will demonstrate
> performance gains in almost all cases.

OK.

* Re: [PATCH 6/7] commit-graph: parse file format v2
  2022-02-24 20:38 ` [PATCH 6/7] commit-graph: parse " Derrick Stolee via GitGitGadget
@ 2022-02-24 23:01   ` Junio C Hamano
  2022-02-25 13:54     ` Derrick Stolee
  0 siblings, 1 reply; 70+ messages in thread
From: Junio C Hamano @ 2022-02-24 23:01 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, abhishekkumar8222, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <derrickstolee@github.com>
>
> The commit-graph file format v2 alters how it stores the corrected
> commit date offsets within the Commit Data chunk instead of a separate
> chunk. The idea is to significantly reduce the amount of data loaded
> from disk while parsing the commit-graph.
>
> We need to alter the error message when we see a file format version
> outside of our range now that multiple are possible. This has a
> non-functional side-effect of altering a use of GRAPH_VERSION within
> write_commit_graph().
>
> By storing the file format version in 'struct commit_graph', we can
> alter the parsing code to depend on that version value. This involves
> changing where we look for the corrected commit date offset, but also
> which constants we use for jumping into the Generation Data Overflow
> chunk. The Commit Data chunk only has 30 bits available for the offset
> while the Generation Data chunk has 32 bits. This only makes a
> meaningful difference in very malformed repositories.
>
> Also, we need to be careful about how we enable corrected commit dates
> and generation numbers: the check should rely upon the
> read_generation_data value instead of a non-zero value in the Commit
> Data chunk. In generation_numbers_enabled(), the first_generation
> variable attempts to read the first topological level stored and check
> that it is nonzero. However, for a v2 commit-graph, this value is
> actually likely to be zero because the corrected commit date offset is
> probably zero.

I see references to OVERFLOW_V3 that come after OVERFLOW, but there
is no OVERFLOW_V2.  Intended, or should it be V2 to match the "file
format v2" of "generation number v2"?  It is getting awkward to have
two different version schemes ("gen v2" means corrected committer
timestamp, whose on-disk representation is different between "file
v1" and "file v2", and this OVERFLOW vs OVERFLOW_V3 is about the
difference between "file v1" and "file v2" if I am following the
series correctly).
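
To keep the two numbering schemes apart, the mapping as I read it from the series can be sketched as below. Both functions are hypothetical helpers, not code from the patches; the first mirrors the choice made in write_commit_graph(), and the second reports which bit acts as the overflow flag for each generation number version.

```c
#include <assert.h>

enum { FILE_FORMAT_V1 = 1, FILE_FORMAT_V2 = 2 };

/* Generation number v3 is the only one that needs file format v2. */
int file_format_for_generation_version(int generation_version)
{
	assert(generation_version >= 1 && generation_version <= 3);
	return generation_version == 3 ? FILE_FORMAT_V2 : FILE_FORMAT_V1;
}

/*
 * Bit position of the overflow flag in the stored offset, or -1 for
 * generation number v1 (topological levels), which has no overflow.
 */
int overflow_bit_position(int generation_version)
{
	assert(generation_version >= 1 && generation_version <= 3);
	switch (generation_version) {
	case 2:
		return 31; /* CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW */
	case 3:
		return 29; /* CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW_V3 */
	default:
		return -1;
	}
}
```

So "gen v2" and "gen v3" carry the same corrected commit dates, and the _V3 suffix on the constant follows the generation number version rather than the file format version.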

* Re: [PATCH 0/7] Commit-graph: Generation Number v2 Fixes, v3 implementation
  2022-02-24 21:42 ` [PATCH 0/7] Commit-graph: Generation Number v2 Fixes, v3 implementation Junio C Hamano
@ 2022-02-24 23:06   ` Junio C Hamano
  2022-02-25 13:55     ` Derrick Stolee
  0 siblings, 1 reply; 70+ messages in thread
From: Junio C Hamano @ 2022-02-24 23:06 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, abhishekkumar8222, Derrick Stolee

Junio C Hamano <gitster@pobox.com> writes:

>> I'm submitting these two things together so we can see them all at once, but
>> I'd be happy to split this into two series. The first four patches are
>> important bug fixes, so we can consider them as higher-priority.
>>
>> Thanks, -Stolee
>
> Thanks, will take a look.

Overall it was a pleasant read, even though my reading hiccupped in
a few places.  It does look like two separate topics, one of which
builds on the other.

* Re: [PATCH 2/7] commit-graph: fix ordering bug in generation numbers
  2022-02-24 22:15   ` Junio C Hamano
@ 2022-02-25 13:51     ` Derrick Stolee
  2022-02-25 17:35       ` Junio C Hamano
  0 siblings, 1 reply; 70+ messages in thread
From: Derrick Stolee @ 2022-02-25 13:51 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, me, abhishekkumar8222

On 2/24/2022 5:15 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Derrick Stolee <derrickstolee@github.com>
>>
>> When computing the generation numbers for a commit-graph, we compute
>> the corrected commit dates and then check if their offsets from the
>> actual dates is too large to fit in the 32-bit Generation Data chunk.
>> However, there is a problem with this approach: if we have parsed the
>> generation data from the previous commit-graph, then we continue the
>> loop because the corrected commit date is already computed.
>>
>> It is incorrect to add an increment to num_generation_data_overflows
>> here, because we might start double-counting commits that are computed
>> because of the depth-first search walk from a commit with an earlier
>> OID.
>>
>> Instead, iterate over the full commit list at the end, checking the
>> offsets to see how many grow beyond the maximum value.
> 
> Hmph, I can see how the new code correctly counts the commits that
> require offsets that are too large, but I am not sure why the fix is
> needed.  The overall loop structure is

It is very subtle, which is why it took me a while to debug this
issue once I managed to trigger it.

>     for each commit ctx->commits.list[i]:
>         continue if generation number has been computed for it already

This is the critical line in the current version. This includes
"continue if the generation number was loaded from the previous
commit-graph file." This means we under-count when building from
an existing commit-graph with overflows.

If we insert an increment here, then we risk double-counting. I
should have described this better.
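
To illustrate why a separate pass avoids both problems, here is a minimal sketch of the counting step on its own. struct toy_commit and count_overflows() are hypothetical stand-ins for the real writer context; the point is only that each commit in the list is visited exactly once, whether its generation was just computed or was loaded from an existing commit-graph.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-commit data used by the writer. */
struct toy_commit {
	uint64_t date;       /* committer date */
	uint64_t generation; /* corrected commit date, never below date */
};

/* Count commits whose offset does not fit in offset_max. */
size_t count_overflows(const struct toy_commit *commits, size_t nr,
		       uint64_t offset_max)
{
	size_t i, count = 0;

	for (i = 0; i < nr; i++) {
		assert(commits[i].generation >= commits[i].date);
		if (commits[i].generation - commits[i].date > offset_max)
			count++;
	}
	return count;
}
```

The same loop works for either limit, so it can be called with (1ULL << 31) - 1 or (1ULL << 29) - 1 as offset_max.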

> 	set up a commit-list for depth first search
> 	while (we are still digging) {
> 		for each parent {
> 			if generation for the parent is not known yet:
> 				push it down and redo
> 			else
> 				compute max of parents' generation number
> 		}
>                 if (all parents' generation number is known) {
> 			set the generation number for ourselves
> 			count if we needed an offset that is too big
> 		}
> 	}
>     }
> 
> The only case where we may double-count near the end of inner loop I
> can think of is when we end up computing generation for the same
> commit in the while () loop.  But isn't that "we dig the same thing
> twice" by itself something we want to fix, regardless of the
> double-counting issue?

By "we dig the same thing twice" I think you mean "we look across
every edge in the commit-graph, and some commits have multiple
direct children." There is no way around this, but we do skip
recalculating generation numbers for parents that are already
computed.
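
To make the walk concrete, here is a minimal sketch of the loop structure
quoted above, run over a toy graph of array-indexed commits. The names
sketch_node and sketch_compute_generations are hypothetical and do not
come from commit-graph.c; the real code walks 'struct commit' objects with
a commit_list stack. It assumes an acyclic graph with at most two parents
per commit.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_PARENTS 2
#define SKETCH_MAX_COMMITS 64

/* Hypothetical toy commit; parents are indexes into the same array. */
struct sketch_node {
	uint64_t date;                   /* committer date */
	uint64_t generation;             /* corrected commit date, 0 = unknown */
	int parents[SKETCH_MAX_PARENTS]; /* -1 marks an unused slot */
};

static void sketch_compute_generations(struct sketch_node *nodes, int nr)
{
	int stack[SKETCH_MAX_COMMITS];
	int i;

	assert(nr <= SKETCH_MAX_COMMITS);
	for (i = 0; i < nr; i++) {
		int top = 0;

		/* Already known: computed earlier or loaded from a file. */
		if (nodes[i].generation)
			continue;

		stack[top++] = i;
		while (top) {
			int cur = stack[top - 1];
			uint64_t max_parent = 0;
			int all_known = 1;
			int p;

			for (p = 0; p < SKETCH_MAX_PARENTS; p++) {
				int parent = nodes[cur].parents[p];

				if (parent < 0)
					continue;
				if (!nodes[parent].generation) {
					/* Dig into the parent, then redo 'cur'. */
					all_known = 0;
					stack[top++] = parent;
					break;
				}
				if (nodes[parent].generation > max_parent)
					max_parent = nodes[parent].generation;
			}

			if (all_known) {
				/* max(date, largest parent value + 1) */
				uint64_t gen = max_parent + 1;

				if (nodes[cur].date > gen)
					gen = nodes[cur].date;
				nodes[cur].generation = gen;
				top--;
			}
		}
	}
}
```

A commit is only pushed while its generation is still unknown, and it is
assigned exactly once when it is popped, which is why the BUG() check
suggested below would not be expected to fire.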

> IOW,
> 
>>  				if (current->date && current->date > max_corrected_commit_date)
>>  					max_corrected_commit_date = current->date - 1;
>>  				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
>> -
>> -				if (commit_graph_data_at(current)->generation - current->date > GENERATION_NUMBER_V2_OFFSET_MAX)
>> -					ctx->num_generation_data_overflows++;
>>  			}
>>  		}
>>  	}
> 
> here, before doing the assignment for the "current" commit's
> generation number, if we added
> 
> 		if (commit_graph_data_at(current)->generation !=
> 		    GENERATION_NUMBER_ZERO)
> 			BUG("why are we digging it twice?");
> 
> would it trigger?  If so, isn't that already a bug worth fixing?

This would not trigger, since 'current' did not have its
generation when adding to the stack and it could not possibly
have been added a second time when doing a depth-first search
from that commit.

> Perhaps avoiding the second round, perhaps like this, may be a
> better fix?
> 
> 	while (list) {
> 		struct commit *current = list->item;
> 		struct commit_list *parent;
> 		int all_parents_computed = 1;
> 		timestamp_t max_corrected_commit_date = 0;
> 
> +		if (commit_graph_data_at(current)->generation !=
> +		    GENERATION_NUMBER_ZERO) {
> +			pop_commit(&list);
> +			continue;
> +		}
> +
> 		for (parent = current->parents; parent; parent = parent->next) {
> 
> Or am I grossly misunderstanding why the original code is incorrect
> to have the counting at this place?

Hopefully I cleared up the issue earlier in my reply. Let me
know if this is still confusing.

Thanks,
-Stolee


* Re: [PATCH 4/7] commit-graph: fix generation number v2 overflow values
  2022-02-24 22:35   ` Junio C Hamano
@ 2022-02-25 13:53     ` Derrick Stolee
  2022-02-25 17:38       ` Junio C Hamano
  0 siblings, 1 reply; 70+ messages in thread
From: Derrick Stolee @ 2022-02-25 13:53 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, me, abhishekkumar8222

On 2/24/2022 5:35 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> -			graph_data->generation = get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
>> +			graph_data->generation = item->date + get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
> 
> Wow, that's embarrassing.
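
For readers following along, the one-line fix above can be sketched as a
small lookup helper. This is a simplified illustration, not the code in
commit-graph.c: sketch_corrected_date and SKETCH_OFFSET_OVERFLOW are
hypothetical names, the overflow chunk is modeled as a plain array of
host-order 64-bit values, and the use of the most-significant bit as the
overflow flag is an assumption based on the format description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed flag: the top bit of the 32-bit value marks an overflow entry. */
#define SKETCH_OFFSET_OVERFLOW (UINT32_C(1) << 31)

/*
 * Turn a stored 32-bit value into a corrected commit date.
 * 'overflow' models the Generation Data Overflow chunk, which holds
 * 64-bit offsets (not absolute dates).
 */
static uint64_t sketch_corrected_date(uint64_t commit_date, uint32_t stored,
				      const uint64_t *overflow,
				      size_t overflow_nr)
{
	uint64_t offset = stored;

	if (stored & SKETCH_OFFSET_OVERFLOW) {
		size_t pos = stored ^ SKETCH_OFFSET_OVERFLOW;

		assert(pos < overflow_nr);
		offset = overflow[pos];
	}
	/* Both branches yield an offset, so the date is added in both. */
	return commit_date + offset;
}
```

The bug fixed by the patch corresponds to returning the overflow value on
its own, without adding the commit date.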
> 
>> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
>> index 1afee1c2705..5e4b0216fa6 100755
>> --- a/t/t5318-commit-graph.sh
>> +++ b/t/t5318-commit-graph.sh
>> @@ -815,6 +815,19 @@ test_expect_success 'corrupt commit-graph write (missing tree)' '
>>  	)
>>  '
>>  
>> +# The remaining tests check timestamps that flow over
>> +# 32-bits. The graph_git_behavior checks can't take a
>> +# prereq, so just stop here if we are on a 32-bit machine.
>> +
>> +if ! test_have_prereq TIME_IS_64BIT
>> +then
>> +	test_done
>> +fi
>> +if ! test_have_prereq TIME_T_IS_64BIT
>> +then
>> +	test_done
>> +fi
> 
> The above is OK but is there a reason why we cannot do
> 
> 	if A || B
> 	then
> 		test_done
> 	fi
> 
> here (I am assuming not, but in case I am missing the reason why it
> has to be separate)?

Does not need to be separate. I just discovered the two different
prereqs for similar, but not exact, checks. I can swap this to an
or statement.
 
>> @@ -832,10 +845,10 @@ test_expect_success 'corrupt commit-graph write (missing tree)' '
>>  # The largest offset observed is 2 ^ 31, just large enough to overflow.
>>  #
>>  
>> -test_expect_success 'set up and verify repo with generation data overflow chunk' '
>> +test_expect_success TIME_IS_64BIT,TIME_T_IS_64BIT 'set up and verify repo with generation data overflow chunk' '
>>  	objdir=".git/objects" &&
>>  	UNIX_EPOCH_ZERO="@0 +0000" &&
>> -	FUTURE_DATE="@2147483646 +0000" &&
>> +	FUTURE_DATE="@4000000000 +0000" &&
> 
> OK. 16#EE6B2800 is too large to fit and will cause wrapping around with
> a signed 32-bit integer.

Right. I wanted it to be right on that boundary of needing the 32nd
bit but not being over that on its own. I did check that without
the prereqs this code fails on 32-bit systems due to parsing the
time.
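
The boundary can be checked with a few lines of arithmetic. This is only
an illustrative sketch with hypothetical helper names; it shows that the
new FUTURE_DATE value needs the 32nd bit (so it does not fit a signed
32-bit integer) while still fitting in 32 unsigned bits.

```c
#include <stdint.h>

/* Hypothetical helpers: does 'value' fit in 31 or 32 value bits? */
static int sketch_fits_signed_32(uint64_t value)
{
	return value <= (uint64_t)INT32_MAX;
}

static int sketch_fits_unsigned_32(uint64_t value)
{
	return value <= (uint64_t)UINT32_MAX;
}
```

With these, 2147483646 (the old value) passes the signed check, while
4000000000 fails it but still passes the unsigned one.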

Thanks,
-Stolee


* Re: [PATCH 6/7] commit-graph: parse file format v2
  2022-02-24 23:01   ` Junio C Hamano
@ 2022-02-25 13:54     ` Derrick Stolee
  0 siblings, 0 replies; 70+ messages in thread
From: Derrick Stolee @ 2022-02-25 13:54 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, me, abhishekkumar8222

On 2/24/2022 6:01 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Derrick Stolee <derrickstolee@github.com>
>>
>> The commit-graph file format v2 alters how it stores the corrected
>> commit date offsets within the Commit Data chunk instead of a separate
>> chunk. The idea is to significantly reduce the amount of data loaded
>> from disk while parsing the commit-graph.
>>
>> We need to alter the error message when we see a file format version
>> outside of our range now that multiple are possible. This has a
>> non-functional side-effect of altering a use of GRAPH_VERSION within
>> write_commit_graph().
>>
>> By storing the file format version in 'struct commit_graph', we can
>> alter the parsing code to depend on that version value. This involves
>> changing where we look for the corrected commit date offset, but also
>> which constants we use for jumping into the Generation Data Overflow
>> chunk. The Commit Data chunk only has 30 bits available for the offset
>> while the Generation Data chunk has 32 bits. This only makes a
>> meaningful difference in very malformed repositories.
>>
>> Also, we need to be careful about how we enable using corrected commit
>> dates and generation numbers to rely upon the read_generation_data value
>> instead of a non-zero value in the Commit Data chunk. In
>> generation_numbers_enabled(), the first_generation variable is
>> attempting to look for the first topological level stored to see that it
>> is nonzero. However, for a v2 commit-graph, this value is actually
>> likely to be zero because the corrected commit date offset is probably
>> zero.
> 
> I see references to OVERFLOW_V3 that comes after OVERFLOW, but there
> is no OVERFLOW_V2.  Intended, or should it be V2 to match the "file
> format v2" of "generation number v2"?  It is getting awkward to have
> two different version schemes ("gen v2" means corrected committer
> timestamp, whose on-disk representation is different between "file
> v1" and "file v2", and this OVERFLOW vs OVERFLOW_V3 is about the
> difference between "file v1" and "file v2" if I am following the
> series correctly).

You're right that it would be clearer to rename OVERFLOW to
OVERFLOW_V2. I'll add that to my next version when these patches
appear on their own.

Thanks,
-Stolee


* Re: [PATCH 0/7] Commit-graph: Generation Number v2 Fixes, v3 implementation
  2022-02-24 23:06   ` Junio C Hamano
@ 2022-02-25 13:55     ` Derrick Stolee
  0 siblings, 0 replies; 70+ messages in thread
From: Derrick Stolee @ 2022-02-25 13:55 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, me, abhishekkumar8222

On 2/24/2022 6:06 PM, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
> 
>>> I'm submitting these two things together so we can see them all at once, but
>>> I'd be happy to split this into two series. The first four patches are
>>> important bug fixes, so we can consider them as higher-priority.
>>>
>>> Thanks, -Stolee
>>
>> Thanks, will take a look.
> 
> Overall it was a pleasant read, even though my reading hiccupped in
> a few places.  It does look like two separate topics, one of which
> builds on the other.

Thanks. I'll split them. The next version will drop the last three
patches and I'll re-send them after the first four merge down.

Thanks,
-Stolee


* Re: [PATCH 2/7] commit-graph: fix ordering bug in generation numbers
  2022-02-25 13:51     ` Derrick Stolee
@ 2022-02-25 17:35       ` Junio C Hamano
  0 siblings, 0 replies; 70+ messages in thread
From: Junio C Hamano @ 2022-02-25 17:35 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, abhishekkumar8222

Derrick Stolee <derrickstolee@github.com> writes:

> It is very subtle, which is why it took me a while to debug this
> issue once I managed to trigger it.
>
>>     for each commit ctx->commits.list[i]:
>>         continue if generation number has been computed for it already
>
> This is the critical line in the current version. This includes
> "continue if the generation number was loaded from the previous
> commit-graph file." This means we under-count when building from
> an existing commit-graph with overflows.
>
> If we insert an increment here, then we risk double-counting. I
> should have described this better.

Ah, that obviously I missed.  Thanks.



* Re: [PATCH 4/7] commit-graph: fix generation number v2 overflow values
  2022-02-25 13:53     ` Derrick Stolee
@ 2022-02-25 17:38       ` Junio C Hamano
  0 siblings, 0 replies; 70+ messages in thread
From: Junio C Hamano @ 2022-02-25 17:38 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, abhishekkumar8222

Derrick Stolee <derrickstolee@github.com> writes:

>>> +# The remaining tests check timestamps that flow over
>>> +# 32-bits. The graph_git_behavior checks can't take a
>>> +# prereq, so just stop here if we are on a 32-bit machine.
>>> +
>>> +if ! test_have_prereq TIME_IS_64BIT
>>> +then
>>> +	test_done
>>> +fi
>>> +if ! test_have_prereq TIME_T_IS_64BIT
>>> +then
>>> +	test_done
>>> +fi
>> 
>> The above is OK but is there a reason why we cannot do
>> 
>> 	if A || B
>> 	then
>> 		test_done
>> 	fi
>> 
>> here (I am assuming not, but in case I am missing the reason why it
>> has to be separate)?
>
> Does not need to be separate. I just discovered the two different
> prereqs for similar, but not exact, checks. I can swap this to an
> or statement.

I do not think a single condition with single test_done is
necessarily better. I was just curious if there was anything subtle
going on.

Thanks.


* Re: [PATCH 5/7] commit-graph: document file format v2
  2022-02-24 20:38 ` [PATCH 5/7] commit-graph: document file format v2 Derrick Stolee via GitGitGadget
  2022-02-24 22:55   ` Junio C Hamano
@ 2022-02-25 22:31   ` Ævar Arnfjörð Bjarmason
  2022-02-28 13:44     ` Derrick Stolee
  1 sibling, 1 reply; 70+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-02-25 22:31 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, gitster, abhishekkumar8222, Derrick Stolee


On Thu, Feb 24 2022, Derrick Stolee via GitGitGadget wrote:

> The corrected commit date was first documented in 5a3b130ca (doc: add
> corrected commit date info, 2021-01-16) and it used an optional chunk to
> augment the commit-graph format without modifying the file format
> version.
>
> One major benefit to this approach is that corrected commit dates could
> be written without causing a backwards compatibility issue with Git
> versions that do not understand them. The topological level was still
> available in the CDAT chunk as it was before.
>
> However, this causes a different issue: more data needs to be loaded
> from disk when parsing commits from the commit-graph. In cases where
> there is no significant algorithmic gain from using corrected commit
> dates, commit walks take up to 20% longer because of this extra data.
>
> Create a new file format version for the commit-graph format that
> differs only in the CDAT chunk: it now stores corrected commit date
> offsets. This brings our data back to normal and will demonstrate
> performance gains in almost all cases.
>
> Signed-off-by: Derrick Stolee <derrickstolee@github.com>
> ---
>  .../technical/commit-graph-format.txt         | 22 ++++++++++++++-----
>  1 file changed, 17 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> index 87971c27dd7..2cb48993314 100644
> --- a/Documentation/technical/commit-graph-format.txt
> +++ b/Documentation/technical/commit-graph-format.txt
> @@ -36,7 +36,7 @@ HEADER:
>        The signature is: {'C', 'G', 'P', 'H'}
>  
>    1-byte version number:
> -      Currently, the only valid version is 1.
> +      This version number can be 1 or 2.
>  
>    1-byte Hash Version
>        We infer the hash length (H) from this value:
> @@ -85,13 +85,22 @@ CHUNK DATA:
>        position. If there are more than two parents, the second value
>        has its most-significant bit on and the other bits store an array
>        position into the Extra Edge List chunk.
> -    * The next 8 bytes store the topological level (generation number v1)
> -      of the commit and
> -      the commit time in seconds since EPOCH. The generation number
> -      uses the higher 30 bits of the first 4 bytes, while the commit
> +    * The next 8 bytes store the generation number information of the
> +      commit and the commit time in seconds since EPOCH. The generation
> +      number uses the higher 30 bits of the first 4 bytes, while the commit
>        time uses the 32 bits of the second 4 bytes, along with the lowest
>        2 bits of the lowest byte, storing the 33rd and 34th bit of the
>        commit time.
> +      - If the commit-graph file format is version 1, then the higher 30
> +	bits contain the topological level (generation number v1) for the
> +	commit.
> +      - If the commit-graph file format is version 2, then the higher 30
> +	bits contain the corrected commit date offset (generation number
> +	v2) for the commit, except if the offset cannot be stored within
> +	29 bits. If the offset is too large for 29 bits, then the value
> +	stored here has its most-significant bit on and the other bits
> +	store the position of the corrected commit date in the Generation
> +	Data Overflow chunk.
>  
>    Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
>      * This list of 4-byte values store corrected commit date offsets for the
> @@ -103,6 +112,9 @@ CHUNK DATA:
>      * Generation Data chunk is present only when commit-graph file is written
>        by compatible versions of Git and in case of split commit-graph chains,
>        the topmost layer also has Generation Data chunk.
> +    * This chunk does not exist if the commit-graph file format version is 2,
> +      because the corrected commit date offset data is stored in the Commit
> +      Data chunk.
>  
>    Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
>      * This list of 8-byte values stores the corrected commit date offsets

We talked a while ago now about how we do commit-graph format changes
and this is partially echoing those earlier questions[1] from 2019.

I fully understand why we're writing this amended CDAT chunk in a
different layout. Without a GDAT side-chunk to look things up in, the
data is more local, that part of the file is more compact, etc.

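
For reference, the 8-byte field described in the quoted documentation can
be sketched as a pack/unpack pair. This is a simplified illustration with
hypothetical names (sketch_pack_cdat and friends); it works on two
host-order 32-bit words, whereas the on-disk format stores them in network
byte order. Which value goes into the upper 30 bits (topological level or
corrected commit date offset) depends on the file format version.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pack a 30-bit generation value and a 34-bit commit time into two
 * 32-bit words: word 0 holds the generation in its upper 30 bits and
 * bits 33-34 of the time in its lowest 2 bits; word 1 holds the low
 * 32 bits of the time.
 */
static void sketch_pack_cdat(uint32_t gen30, uint64_t commit_time,
			     uint32_t out[2])
{
	assert(gen30 < (UINT32_C(1) << 30));
	assert(commit_time < (UINT64_C(1) << 34));
	out[0] = (gen30 << 2) | (uint32_t)((commit_time >> 32) & 0x3);
	out[1] = (uint32_t)(commit_time & UINT64_C(0xFFFFFFFF));
}

static uint32_t sketch_unpack_generation(const uint32_t in[2])
{
	return in[0] >> 2;
}

static uint64_t sketch_unpack_time(const uint32_t in[2])
{
	return ((uint64_t)(in[0] & 0x3) << 32) | in[1];
}
```

Because both format versions use the same bit positions, a reader only
needs the file format version to decide how to interpret the upper 30
bits.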
What I don't understand is why getting those performance improvements
requires the breaking version change & the writing of the incompatible
version number.

I.e. couldn't the differently formatted CDAT chunk be written instead to a new
chunk name (say "2DAT") instead? Per [1] we'd pay a small fixed cost for
a possibly empty chunk (I didn't re-do those numbers), but surely the
performance improvements will be about the same for that minuscule
overhead.

It will give you something you can't have here, which is optional
compatibility with older clients by writing both versions. That'll be a
~2x as large file on disk, but with the page cache & each client version
skipping to the data it needs caching characteristics & data locality
should work out to about the same thing.

Or maybe they won't. I just found it surprising when reviewing this to
not find an answer to why that approach wasn't
considered.

E.g. 76ffbca71a9 (commit-graph: write Bloom filters to commit graph
file, 2020-04-06) is a commit adding such new optional and
backwards-compatible data.

1. https://lore.kernel.org/git/87h8acivkh.fsf@evledraar.gmail.com/


* Re: [PATCH 5/7] commit-graph: document file format v2
  2022-02-25 22:31   ` Ævar Arnfjörð Bjarmason
@ 2022-02-28 13:44     ` Derrick Stolee
  2022-02-28 14:27       ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 70+ messages in thread
From: Derrick Stolee @ 2022-02-28 13:44 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Derrick Stolee via GitGitGadget
  Cc: git, me, gitster, abhishekkumar8222

On 2/25/2022 5:31 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Thu, Feb 24 2022, Derrick Stolee via GitGitGadget wrote:
> 
...
>>    Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
>>      * This list of 4-byte values store corrected commit date offsets for the
>> @@ -103,6 +112,9 @@ CHUNK DATA:
>>      * Generation Data chunk is present only when commit-graph file is written
>>        by compatible versions of Git and in case of split commit-graph chains,
>>        the topmost layer also has Generation Data chunk.
>> +    * This chunk does not exist if the commit-graph file format version is 2,
>> +      because the corrected commit date offset data is stored in the Commit
>> +      Data chunk.
>>  
>>    Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
>>      * This list of 8-byte values stores the corrected commit date offsets
> 
> We talked a while ago now about how we do commit-graph format changes
> and this is partially echoing those earlier questions[1] from 2019.
> 
> I fully understand why we're writing this amended CDAT chunk in a
> different layout. Without a GDAT side-chunk to look things up in, the
> data is more local, that part of the file is more compact, etc.
> 
> What I don't understand is why getting those performance improvements
> requires the breaking version change & the writing of the incompatible
> version number.
> 
> I.e. couldn't the differently formatted CDAT chunk be written instead to a new
> chunk name (say "2DAT") instead? Per [1] we'd pay a small fixed cost for
> a possibly empty chunk (I didn't re-do those numbers), but surely the
> performance improvements will be about the same for that minuscule
> overhead.

CDAT is a required chunk. It is part of the v1 spec that CDAT exists
and is correct. All other Git clients will error out when reading a
"v1" graph without such a chunk, and in a way that is less helpful to
users. Instead of clearly indicating "file version is too new" it will
say "commit-graph is missing the Commit Data chunk" which is not
helpful.

> It will give you something you can't have here, which is optional
> compatibility with older clients by writing both versions. That'll be a
> ~2x as large file on disk, but with the page cache & each client version
> skipping to the data it needs caching characteristics & data locality
> should work out to about the same thing.

Writing both is the only way that this could work without incrementing
the graph version number, but I'd rather just update the number and
avoid wasting the effort to write that extra data.

It seems you are hyper-focused on "we don't _need_ to update the version
number" and you are willing to recommend wasteful approaches in order to
support that stance.

So: you're right. We don't _need_ to update the version number. But this
is the best choice among the options available.

> Or maybe they won't. I just found it surprising when reviewing this to
> not find an answer to why that approach wasn't
> considered.

The point is to create a new format that can be chosen when deployed
in an environment where older Git versions will not exist (such as
a Git server). The new version is not chosen by default and instead
is opt-in through the commitGraph.generationVersion config option.

Perhaps in a year or two we would consider making this the new
default, but there is no rush to do so.

Thanks,
-Stolee


* [PATCH v2 0/4] Commit-graph: Generation Number v2 Fixes
  2022-02-24 20:38 [PATCH 0/7] Commit-graph: Generation Number v2 Fixes, v3 implementation Derrick Stolee via GitGitGadget
                   ` (7 preceding siblings ...)
  2022-02-24 21:42 ` [PATCH 0/7] Commit-graph: Generation Number v2 Fixes, v3 implementation Junio C Hamano
@ 2022-02-28 13:53 ` Derrick Stolee via GitGitGadget
  2022-02-28 13:53   ` [PATCH v2 1/4] test-read-graph: include extra post-parse info Derrick Stolee via GitGitGadget
                     ` (5 more replies)
  8 siblings, 6 replies; 70+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-28 13:53 UTC (permalink / raw)
  To: git; +Cc: me, gitster, abhishekkumar8222, avarab, Derrick Stolee

This patch series fixes some bugs in generation number v2. They were
discovered while building generation number v3, but that implementation will
be delayed until these fixes are merged.

In particular, Git has been ignoring corrected commit dates since shortly
after they were introduced. This is due to a bug I introduced when trying to
make split commit-graphs safer with mixed generation number versions. I also
found an issue with the offset overflows. I only noticed it after writing
generation number v3 with a smaller offset size, which actually triggered
the bug in the test suite.


Updates in v2
=============

 * Dropped generation v3 patches, saving them for later.
 * Updated a commit message to more clearly describe the problem with the
   old code.
 * Used an || instead of two if statements in test script.

Thanks, -Stolee

Derrick Stolee (4):
  test-read-graph: include extra post-parse info
  commit-graph: fix ordering bug in generation numbers
  commit-graph: start parsing generation v2 (again)
  commit-graph: fix generation number v2 overflow values

 commit-graph.c                | 15 +++++++++++----
 t/helper/test-read-graph.c    | 13 +++++++++++++
 t/t4216-log-bloom.sh          |  1 +
 t/t5318-commit-graph.sh       | 34 +++++++++++++++++++++++++++++-----
 t/t5324-split-commit-graph.sh | 10 ++++++++++
 5 files changed, 64 insertions(+), 9 deletions(-)


base-commit: dab1b7905d0b295f1acef9785bb2b9cbb0fdec84
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1163%2Fderrickstolee%2Fgen-v3-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1163/derrickstolee/gen-v3-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/1163

Range-diff vs v1:

 1:  2f89275314b = 1:  2f89275314b test-read-graph: include extra post-parse info
 2:  6e47ffed257 ! 2:  cbcbf10e699 commit-graph: fix ordering bug in generation numbers
     @@ Commit message
          actual dates is too large to fit in the 32-bit Generation Data chunk.
          However, there is a problem with this approach: if we have parsed the
          generation data from the previous commit-graph, then we continue the
     -    loop because the corrected commit date is already computed.
     +    loop because the corrected commit date is already computed. This causes
     +    an under-count in the number of overflow values.
      
          It is incorrect to add an increment to num_generation_data_overflows
     -    here, because we might start double-counting commits that are computed
     -    because of the depth-first search walk from a commit with an earlier
     -    OID.
     +    next to this 'continue' statement, because we might start
     +    double-counting commits that are computed because of the depth-first
     +    search walk from a commit with an earlier OID.
      
          Instead, iterate over the full commit list at the end, checking the
          offsets to see how many grow beyond the maximum value.
 3:  a3436b92a32 = 3:  5bc6a7660d8 commit-graph: start parsing generation v2 (again)
 4:  de7ab2f39d9 ! 4:  193217c71e0 commit-graph: fix generation number v2 overflow values
     @@ t/t5318-commit-graph.sh: test_expect_success 'corrupt commit-graph write (missin
      +# 32-bits. The graph_git_behavior checks can't take a
      +# prereq, so just stop here if we are on a 32-bit machine.
      +
     -+if ! test_have_prereq TIME_IS_64BIT
     -+then
     -+	test_done
     -+fi
     -+if ! test_have_prereq TIME_T_IS_64BIT
     ++if ! test_have_prereq TIME_IS_64BIT || ! test_have_prereq TIME_T_IS_64BIT
      +then
      +	test_done
      +fi
 5:  7f9b65bd225 < -:  ----------- commit-graph: document file format v2
 6:  28fe8824ba7 < -:  ----------- commit-graph: parse file format v2
 7:  ade697c4d34 < -:  ----------- commit-graph: write file format v2

-- 
gitgitgadget


* [PATCH v2 1/4] test-read-graph: include extra post-parse info
  2022-02-28 13:53 ` [PATCH v2 0/4] Commit-graph: Generation Number v2 Fixes Derrick Stolee via GitGitGadget
@ 2022-02-28 13:53   ` Derrick Stolee via GitGitGadget
  2022-02-28 15:22     ` Ævar Arnfjörð Bjarmason
  2022-02-28 13:53   ` [PATCH v2 2/4] commit-graph: fix ordering bug in generation numbers Derrick Stolee via GitGitGadget
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 70+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-28 13:53 UTC (permalink / raw)
  To: git
  Cc: me, gitster, abhishekkumar8222, avarab, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

It can be helpful to verify that the 'struct commit_graph' that results
from parsing a commit-graph is correctly structured. The existence of
different chunks is not enough to verify that all of the optional
features are correctly enabled.

Update 'test-tool read-graph' to output an "options:" line that includes
information for different parts of the struct commit_graph.

In particular, this change demonstrates that the read_generation_data
option is never being enabled, which will be fixed in a later change.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 t/helper/test-read-graph.c    | 13 +++++++++++++
 t/t4216-log-bloom.sh          |  1 +
 t/t5318-commit-graph.sh       |  1 +
 t/t5324-split-commit-graph.sh |  5 +++++
 4 files changed, 20 insertions(+)

diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index 75927b2c81d..c3b6b8d1734 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -3,6 +3,7 @@
 #include "commit-graph.h"
 #include "repository.h"
 #include "object-store.h"
+#include "bloom.h"
 
 int cmd__read_graph(int argc, const char **argv)
 {
@@ -45,6 +46,18 @@ int cmd__read_graph(int argc, const char **argv)
 		printf(" bloom_data");
 	printf("\n");
 
+	printf("options:");
+	if (graph->bloom_filter_settings)
+		printf(" bloom(%d,%d,%d)",
+		       graph->bloom_filter_settings->hash_version,
+		       graph->bloom_filter_settings->bits_per_entry,
+		       graph->bloom_filter_settings->num_hashes);
+	if (graph->read_generation_data)
+		printf(" read_generation_data");
+	if (graph->topo_levels)
+		printf(" topo_levels");
+	printf("\n");
+
 	UNLEAK(graph);
 
 	return 0;
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index cc3cebf6722..5ed6d2a21c1 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -48,6 +48,7 @@ graph_read_expect () {
 	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
 	num_commits: $1
 	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
+	options: bloom(1,10,7)
 	EOF
 	test-tool read-graph >actual &&
 	test_cmp expect actual
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index edb728f77c3..2b05026cf6d 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -104,6 +104,7 @@ graph_read_expect() {
 	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
 	num_commits: $1
 	chunks: oid_fanout oid_lookup commit_metadata$OPTIONAL
+	options:
 	EOF
 	test-tool read-graph >output &&
 	test_cmp expect output
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 847b8097109..778fa418de2 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -34,6 +34,7 @@ graph_read_expect() {
 	header: 43475048 1 $(test_oid oid_version) 4 $NUM_BASE
 	num_commits: $1
 	chunks: oid_fanout oid_lookup commit_metadata generation_data
+	options:
 	EOF
 	test-tool read-graph >output &&
 	test_cmp expect output
@@ -508,6 +509,7 @@ test_expect_success 'setup repo for mixed generation commit-graph-chain' '
 		header: 43475048 1 $(test_oid oid_version) 4 1
 		num_commits: $NUM_SECOND_LAYER_COMMITS
 		chunks: oid_fanout oid_lookup commit_metadata
+		options:
 		EOF
 		test_cmp expect output &&
 		git commit-graph verify &&
@@ -540,6 +542,7 @@ test_expect_success 'do not write generation data chunk if not present on existi
 		header: 43475048 1 $(test_oid oid_version) 4 2
 		num_commits: $NUM_THIRD_LAYER_COMMITS
 		chunks: oid_fanout oid_lookup commit_metadata
+		options:
 		EOF
 		test_cmp expect output &&
 		git commit-graph verify
@@ -581,6 +584,7 @@ test_expect_success 'do not write generation data chunk if the topmost remaining
 		header: 43475048 1 $(test_oid oid_version) 4 2
 		num_commits: $(($NUM_THIRD_LAYER_COMMITS + $NUM_FOURTH_LAYER_COMMITS))
 		chunks: oid_fanout oid_lookup commit_metadata
+		options:
 		EOF
 		test_cmp expect output &&
 		git commit-graph verify
@@ -620,6 +624,7 @@ test_expect_success 'write generation data chunk if topmost remaining layer has
 		header: 43475048 1 $(test_oid oid_version) 5 1
 		num_commits: $(($NUM_SECOND_LAYER_COMMITS + $NUM_THIRD_LAYER_COMMITS + $NUM_FOURTH_LAYER_COMMITS + $NUM_FIFTH_LAYER_COMMITS))
 		chunks: oid_fanout oid_lookup commit_metadata generation_data
+		options:
 		EOF
 		test_cmp expect output
 	)
-- 
gitgitgadget



* [PATCH v2 2/4] commit-graph: fix ordering bug in generation numbers
  2022-02-28 13:53 ` [PATCH v2 0/4] Commit-graph: Generation Number v2 Fixes Derrick Stolee via GitGitGadget
  2022-02-28 13:53   ` [PATCH v2 1/4] test-read-graph: include extra post-parse info Derrick Stolee via GitGitGadget
@ 2022-02-28 13:53   ` Derrick Stolee via GitGitGadget
  2022-02-28 15:25     ` Ævar Arnfjörð Bjarmason
  2022-02-28 13:53   ` [PATCH v2 3/4] commit-graph: start parsing generation v2 (again) Derrick Stolee via GitGitGadget
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 70+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-28 13:53 UTC (permalink / raw)
  To: git
  Cc: me, gitster, abhishekkumar8222, avarab, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

When computing the generation numbers for a commit-graph, we compute
the corrected commit dates and then check if their offsets from the
actual dates are too large to fit in the 32-bit Generation Data chunk.
However, there is a problem with this approach: if we have parsed the
generation data from the previous commit-graph, then we continue the
loop because the corrected commit date is already computed. This causes
an under-count in the number of overflow values.

It is incorrect to add an increment to num_generation_data_overflows
next to this 'continue' statement, because we might start
double-counting commits that are computed because of the depth-first
search walk from a commit with an earlier OID.

Instead, iterate over the full commit list at the end, checking the
offsets to see how many grow beyond the maximum value.
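
As a rough illustration of that final pass, here is a minimal,
standalone sketch. It is not the code from commit-graph.c: the
'struct commit_info' type and the count_offset_overflows() helper are
hypothetical, and OFFSET_MAX assumes that offsets must fit in 31 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-commit data the writer tracks. */
struct commit_info {
	uint64_t date;       /* committer date */
	uint64_t generation; /* corrected commit date, never below date */
};

/* Assumption: an offset must fit in 31 bits to avoid the overflow chunk. */
#define OFFSET_MAX ((UINT64_C(1) << 31) - 1)

/*
 * Count overflowing offsets in one pass over the full list, after all
 * corrected commit dates are known. Each commit is visited exactly
 * once, so nothing is skipped and nothing is double-counted.
 */
size_t count_offset_overflows(const struct commit_info *commits, size_t nr)
{
	size_t i, overflows = 0;

	for (i = 0; i < nr; i++) {
		assert(commits[i].generation >= commits[i].date);
		if (commits[i].generation - commits[i].date > OFFSET_MAX)
			overflows++;
	}
	return overflows;
}
```

Because the count no longer depends on which commits the depth-first
walk happened to compute, it does not matter whether a corrected commit
date came from a previous commit-graph or was computed in this run.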

Update a test in t5318 to use a larger time value, which will help
demonstrate this bug in more cases. It still won't hit all potential
cases until the next change, which reenables reading generation numbers.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 commit-graph.c          | 10 +++++++---
 t/t5318-commit-graph.sh |  4 ++--
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 265c010122e..a19bd96c2ee 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1556,12 +1556,16 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 				if (current->date && current->date > max_corrected_commit_date)
 					max_corrected_commit_date = current->date - 1;
 				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
-
-				if (commit_graph_data_at(current)->generation - current->date > GENERATION_NUMBER_V2_OFFSET_MAX)
-					ctx->num_generation_data_overflows++;
 			}
 		}
 	}
+
+	for (i = 0; i < ctx->commits.nr; i++) {
+		struct commit *c = ctx->commits.list[i];
+		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
+		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX)
+			ctx->num_generation_data_overflows++;
+	}
 	stop_progress(&ctx->progress);
 }
 
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 2b05026cf6d..f9bffe38013 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -467,10 +467,10 @@ test_expect_success 'warn on improper hash version' '
 	)
 '
 
-test_expect_success 'lower layers have overflow chunk' '
+test_expect_success TIME_IS_64BIT,TIME_T_IS_64BIT 'lower layers have overflow chunk' '
 	cd "$TRASH_DIRECTORY/full" &&
 	UNIX_EPOCH_ZERO="@0 +0000" &&
-	FUTURE_DATE="@2147483646 +0000" &&
+	FUTURE_DATE="@4147483646 +0000" &&
 	rm -f .git/objects/info/commit-graph &&
 	test_commit --date "$FUTURE_DATE" future-1 &&
 	test_commit --date "$UNIX_EPOCH_ZERO" old-1 &&
-- 
gitgitgadget



* [PATCH v2 3/4] commit-graph: start parsing generation v2 (again)
  2022-02-28 13:53 ` [PATCH v2 0/4] Commit-graph: Generation Number v2 Fixes Derrick Stolee via GitGitGadget
  2022-02-28 13:53   ` [PATCH v2 1/4] test-read-graph: include extra post-parse info Derrick Stolee via GitGitGadget
  2022-02-28 13:53   ` [PATCH v2 2/4] commit-graph: fix ordering bug in generation numbers Derrick Stolee via GitGitGadget
@ 2022-02-28 13:53   ` Derrick Stolee via GitGitGadget
  2022-02-28 15:30     ` Ævar Arnfjörð Bjarmason
  2022-02-28 13:53   ` [PATCH v2 4/4] commit-graph: fix generation number v2 overflow values Derrick Stolee via GitGitGadget
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 70+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-28 13:53 UTC (permalink / raw)
  To: git
  Cc: me, gitster, abhishekkumar8222, avarab, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The 'read_generation_data' member of 'struct commit_graph' was
introduced by 1fdc383c5 (commit-graph: use generation v2 only if entire
chain does, 2021-01-16). The intention was to avoid using corrected
commit dates if not all layers of a commit-graph had that data stored.
The logic in validate_mixed_generation_chain() at that point incorrectly
initialized read_generation_data to 1 if and only if the tip
commit-graph contained the Corrected Commit Date chunk.

This was "fixed" in 448a39e65 (commit-graph: validate layers for
generation data, 2021-02-02) to validate that read_generation_data was
either non-zero for all layers, or it would set read_generation_data to
zero for all layers.

The problem here is that read_generation_data is not initialized to be
non-zero anywhere!

This change initializes read_generation_data immediately after the chunk
is parsed, so each layer will have its value present as soon as
possible.

The read_generation_data member is used in fill_commit_graph_info() to
determine if we should use the corrected commit date or the topological
levels stored in the Commit Data chunk. Due to this bug, all previous
versions of Git were defaulting to topological levels in all cases!
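
The interaction between the parse step and the chain validation can be
sketched as below. This is a simplified model under stated
assumptions, not the real implementation: 'struct graph_layer',
mark_layer_after_parse() and validate_chain() are hypothetical names
that only mirror the members discussed above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down layer of a split commit-graph chain. */
struct graph_layer {
	const unsigned char *chunk_generation_data; /* NULL if chunk is absent */
	int read_generation_data;
	struct graph_layer *base_graph;             /* next lower layer, or NULL */
};

/* Parse step: record whether this layer carries the chunk. */
void mark_layer_after_parse(struct graph_layer *g)
{
	assert(g);
	g->read_generation_data = g->chunk_generation_data != NULL;
}

/* Chain step: keep the flag only if every layer in the chain has it. */
void validate_chain(struct graph_layer *tip)
{
	struct graph_layer *g;
	int all_layers_have_data = 1;

	for (g = tip; g; g = g->base_graph)
		if (!g->read_generation_data)
			all_layers_have_data = 0;

	if (!all_layers_have_data)
		for (g = tip; g; g = g->base_graph)
			g->read_generation_data = 0;
}
```

If the parse step never runs, every flag starts at zero and the chain
step can only leave it at zero, which matches the behavior described
above.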

This can be measured with some performance tests. Using the Linux kernel
as a testbed, I generated a complete commit-graph containing corrected
commit dates and tested the 'new' version against the previous, 'old'
version.

First, rev-list with --topo-order demonstrates a 26% improvement using
corrected commit dates:

hyperfine \
	-n "old" "$OLD_GIT rev-list --topo-order -1000 v3.6" \
	-n "new" "$NEW_GIT rev-list --topo-order -1000 v3.6" \
	--warmup=10

Benchmark 1: old
  Time (mean ± σ):      57.1 ms ±   3.1 ms
  Range (min … max):    52.9 ms …  62.0 ms    55 runs

Benchmark 2: new
  Time (mean ± σ):      45.5 ms ±   3.3 ms
  Range (min … max):    39.9 ms …  51.7 ms    59 runs

Summary
  'new' ran
    1.26 ± 0.11 times faster than 'old'

These performance improvements are due to the algorithmic improvements
given by walking fewer commits due to the higher cutoffs from corrected
commit dates.

However, this comes at a cost. The additional I/O cost of parsing the
corrected commit dates is visible in case of merge-base commands that do
not reduce the overall number of walked commits.

hyperfine \
        -n "old" "$OLD_GIT merge-base v4.8 v4.9" \
        -n "new" "$NEW_GIT merge-base v4.8 v4.9" \
        --warmup=10

Benchmark 1: old
  Time (mean ± σ):     110.4 ms ±   6.4 ms
  Range (min … max):    96.0 ms … 118.3 ms    25 runs

Benchmark 2: new
  Time (mean ± σ):     150.7 ms ±   1.1 ms
  Range (min … max):   149.3 ms … 153.4 ms    19 runs

Summary
  'old' ran
    1.36 ± 0.08 times faster than 'new'

Performance issues like this are what motivated 702110aac (commit-graph:
use config to specify generation type, 2021-02-25).

In the future, we could fix this performance problem by inserting the
corrected commit date offsets into the Commit Data chunk instead of
having that data in an extra chunk.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 commit-graph.c                |  3 +++
 t/t4216-log-bloom.sh          |  2 +-
 t/t5318-commit-graph.sh       | 14 ++++++++++++--
 t/t5324-split-commit-graph.sh |  9 +++++++--
 4 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index a19bd96c2ee..8e52bb09552 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -407,6 +407,9 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 			&graph->chunk_generation_data);
 		pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
 			&graph->chunk_generation_data_overflow);
+
+		if (graph->chunk_generation_data)
+			graph->read_generation_data = 1;
 	}
 
 	if (r->settings.commit_graph_read_changed_paths) {
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 5ed6d2a21c1..fa9d32facfb 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -48,7 +48,7 @@ graph_read_expect () {
 	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
 	num_commits: $1
 	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
-	options: bloom(1,10,7)
+	options: bloom(1,10,7) read_generation_data
 	EOF
 	test-tool read-graph >actual &&
 	test_cmp expect actual
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index f9bffe38013..1afee1c2705 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -100,11 +100,21 @@ graph_read_expect() {
 		OPTIONAL=" $2"
 		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
 	fi
+	GENERATION_VERSION=2
+	if test ! -z "$3"
+	then
+		GENERATION_VERSION=$3
+	fi
+	OPTIONS=
+	if test $GENERATION_VERSION -gt 1
+	then
+		OPTIONS=" read_generation_data"
+	fi
 	cat >expect <<- EOF
 	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
 	num_commits: $1
 	chunks: oid_fanout oid_lookup commit_metadata$OPTIONAL
-	options:
+	options:$OPTIONS
 	EOF
 	test-tool read-graph >output &&
 	test_cmp expect output
@@ -498,7 +508,7 @@ test_expect_success 'git commit-graph verify' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git rev-parse commits/8 | git -c commitGraph.generationVersion=1 commit-graph write --stdin-commits &&
 	git commit-graph verify >output &&
-	graph_read_expect 9 extra_edges
+	graph_read_expect 9 extra_edges 1
 '
 
 NUM_COMMITS=9
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 778fa418de2..669ddc645fa 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -30,11 +30,16 @@ graph_read_expect() {
 	then
 		NUM_BASE=$2
 	fi
+	OPTIONS=
+	if test -z "$3"
+	then
+		OPTIONS=" read_generation_data"
+	fi
 	cat >expect <<- EOF
 	header: 43475048 1 $(test_oid oid_version) 4 $NUM_BASE
 	num_commits: $1
 	chunks: oid_fanout oid_lookup commit_metadata generation_data
-	options:
+	options:$OPTIONS
 	EOF
 	test-tool read-graph >output &&
 	test_cmp expect output
@@ -624,7 +629,7 @@ test_expect_success 'write generation data chunk if topmost remaining layer has
 		header: 43475048 1 $(test_oid oid_version) 5 1
 		num_commits: $(($NUM_SECOND_LAYER_COMMITS + $NUM_THIRD_LAYER_COMMITS + $NUM_FOURTH_LAYER_COMMITS + $NUM_FIFTH_LAYER_COMMITS))
 		chunks: oid_fanout oid_lookup commit_metadata generation_data
-		options:
+		options: read_generation_data
 		EOF
 		test_cmp expect output
 	)
-- 
gitgitgadget



* [PATCH v2 4/4] commit-graph: fix generation number v2 overflow values
  2022-02-28 13:53 ` [PATCH v2 0/4] Commit-graph: Generation Number v2 Fixes Derrick Stolee via GitGitGadget
                     ` (2 preceding siblings ...)
  2022-02-28 13:53   ` [PATCH v2 3/4] commit-graph: start parsing generation v2 (again) Derrick Stolee via GitGitGadget
@ 2022-02-28 13:53   ` Derrick Stolee via GitGitGadget
  2022-02-28 15:40     ` Ævar Arnfjörð Bjarmason
  2022-03-01 17:23   ` [PATCH v2 0/4] Commit-graph: Generation Number v2 Fixes Ævar Arnfjörð Bjarmason
  2022-03-01 19:48   ` [PATCH v3 0/5] " Derrick Stolee via GitGitGadget
  5 siblings, 1 reply; 70+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-28 13:53 UTC (permalink / raw)
  To: git
  Cc: me, gitster, abhishekkumar8222, avarab, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The Generation Data Chunk was implemented and tested in e8b63005c
(commit-graph: implement generation data chunk, 2021-01-16), but the
test was carefully constructed to work on systems with 32-bit dates.
Since the corrected commit date offsets still required more than 31
bits, this triggered writing the generation_data_overflow chunk.

However, upon closer look, the
write_graph_chunk_generation_data_overflow() method writes the offsets
to the chunk (as dictated by the format) but fill_commit_graph_info()
treats the value in the chunk as if it is the full corrected commit date
(not an offset). For some reason, this does not cause an issue when
using the FUTURE_DATE specified in t5318-commit-graph.sh, but it does
show up as a failure in 'git commit-graph verify' if we increase that
FUTURE_DATE to be above four billion.

Fix this error and update the test to require 64-bit dates so we can
safely use this large value in our test.
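
To make the intended decoding concrete, here is a small standalone
sketch. The names are hypothetical (read_be64() stands in for
get_be64(), and corrected_commit_date() is not a function in
commit-graph.c), and the marker bit is assumed to be the top bit of the
32-bit value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the top bit of the 32-bit value marks an overflow entry. */
#define OFFSET_OVERFLOW_BIT (UINT32_C(1) << 31)

/* Read a big-endian 64-bit value from a byte buffer. */
uint64_t read_be64(const unsigned char *p)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Both branches add an offset to the commit date. The only difference
 * is where the offset lives: inline in the 32-bit value, or in the
 * 8-byte entry of the overflow chunk that the value points to.
 */
uint64_t corrected_commit_date(uint64_t date, uint32_t stored,
			       const unsigned char *overflow_chunk)
{
	if (stored & OFFSET_OVERFLOW_BIT) {
		uint32_t pos = stored ^ OFFSET_OVERFLOW_BIT;

		assert(overflow_chunk);
		return date + read_be64(overflow_chunk + 8 * (size_t)pos);
	}
	return date + stored;
}
```

Treating the overflow entry as a full date instead of an offset gives
the same answer only when the commit date is zero, which may be why the
smaller FUTURE_DATE did not expose the problem.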

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 commit-graph.c          |  2 +-
 t/t5318-commit-graph.sh | 17 +++++++++++++++--
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 8e52bb09552..b86a6a634fe 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -806,7 +806,7 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 				die(_("commit-graph requires overflow generation data but has none"));
 
 			offset_pos = offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW;
-			graph_data->generation = get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
+			graph_data->generation = item->date + get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
 		} else
 			graph_data->generation = item->date + offset;
 	} else
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 1afee1c2705..f4ffaad661d 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -815,6 +815,15 @@ test_expect_success 'corrupt commit-graph write (missing tree)' '
 	)
 '
 
+# The remaining tests check timestamps that flow over
+# 32-bits. The graph_git_behavior checks can't take a
+# prereq, so just stop here if we are on a 32-bit machine.
+
+if ! test_have_prereq TIME_IS_64BIT || ! test_have_prereq TIME_T_IS_64BIT
+then
+	test_done
+fi
+
 # We test the overflow-related code with the following repo history:
 #
 #               4:F - 5:N - 6:U
@@ -832,10 +841,10 @@ test_expect_success 'corrupt commit-graph write (missing tree)' '
 # The largest offset observed is 2 ^ 31, just large enough to overflow.
 #
 
-test_expect_success 'set up and verify repo with generation data overflow chunk' '
+test_expect_success TIME_IS_64BIT,TIME_T_IS_64BIT 'set up and verify repo with generation data overflow chunk' '
 	objdir=".git/objects" &&
 	UNIX_EPOCH_ZERO="@0 +0000" &&
-	FUTURE_DATE="@2147483646 +0000" &&
+	FUTURE_DATE="@4000000000 +0000" &&
 	test_oid_cache <<-EOF &&
 	oid_version sha1:1
 	oid_version sha256:2
@@ -867,4 +876,8 @@ test_expect_success 'set up and verify repo with generation data overflow chunk'
 
 graph_git_behavior 'generation data overflow chunk repo' repo left right
 
+# Do not add tests at the end of this file, unless they require 64-bit
+# timestamps, since this portion of the script is only executed when
+# time data types have 64 bits.
+
 test_done
-- 
gitgitgadget


* Re: [PATCH 5/7] commit-graph: document file format v2
  2022-02-28 13:44     ` Derrick Stolee
@ 2022-02-28 14:27       ` Ævar Arnfjörð Bjarmason
  2022-02-28 16:39         ` Derrick Stolee
  0 siblings, 1 reply; 70+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-02-28 14:27 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, gitster, abhishekkumar8222

On Mon, Feb 28 2022, Derrick Stolee wrote:

> On 2/25/2022 5:31 PM, Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Thu, Feb 24 2022, Derrick Stolee via GitGitGadget wrote:
>> 
> ...
>>>    Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
>>>      * This list of 4-byte values store corrected commit date offsets for the
>>> @@ -103,6 +112,9 @@ CHUNK DATA:
>>>      * Generation Data chunk is present only when commit-graph file is written
>>>        by compatible versions of Git and in case of split commit-graph chains,
>>>        the topmost layer also has Generation Data chunk.
>>> +    * This chunk does not exist if the commit-graph file format version is 2,
>>> +      because the corrected commit date offset data is stored in the Commit
>>> +      Data chunk.
>>>  
>>>    Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
>>>      * This list of 8-byte values stores the corrected commit date offsets
>> 
>> We talked a while ago now about how we do commit-graph format changes
>> and this is partially echoing those earlier questions[1] from 2019.
>> 
>> I fully understand why we're writing this amended CDAT chunk in a
>> different layout. Without a GDAT side-chunk to look things up in, the
>> data is more local, that part of the file is more compact etc.
>> 
>> What I don't understand is why getting those performance improvements
>> requires the breaking version change & the writing of the incompatible
>> version number.
>> 
>> I.e. couldn't the differently formatted CDAT chunk be written instead to a new
>> chunk name (say "2DAT") instead? Per [1] we'd pay a small fixed cost for
>> a possibly empty chunk (I didn't re-do those numbers), but surely the
>> performance improvements will be about the same for that minuscule
>> overhead.
>
> CDAT is a required chunk. It is part of the v1 spec that CDAT exists
> and is correct. All other Git clients will error out when reading a
> "v1" graph without such a chunk, and in a way that is less helpful to
> users. Instead of clearly indicating "file version is too new" it will
> say "commit-graph is missing the Commit Data chunk" which is not
> helpful.

Yes. That would be the worst of both worlds.

I thought the reference to the 2019-era post made it clear (which is
explicit about this aspect), but I'm talking about writing one of:

 A. An empty chunk
 B. Keeping a "stale" chunk around (as we re-write the graph)
 C. Duplicate writes of new/old chunks.

And not simply omitting the CDAT chunk. As you point out, that would give
you all the drawbacks of a version number change, with none of the benefits.

I haven't re-tested this now, but at the time doing any of (A..C) would
work smoothly for older clients, while giving newer ones improved data.

>> It will give you something you can't have here, which is optional
>> compatibility with older clients by writing both versions. That'll be a
>> ~2x as large file on disk, but with the page cache & each client version
>> skipping to the data it needs caching characteristics & data locality
>> should work out to about the same thing.
>
> Writing both is the only way that this could work without incrementing
> the graph version number, but I'd rather just update the number and
> avoid wasting the effort to write that extra data.

...

> It seems you are hyper-focused on "we don't _need_ to update the version
> number" and you are willing to recommend wasteful approaches in order to
> support that stance.

I'd say less hyper-focused, and more clarifying an IMO major unstated
trade-off of the proposed format change.

> So: you're right. We don't _need_ to update the version number. But this
> is the best choice among the options available.

...

>> Or maybe they won't. I just found it surprising when reviewing this to
>> not find an answer to why that approach wasn't
>> considered.
>
> The point is to create a new format that can be chosen when deployed
> in an environment where older Git versions will not exist (such as
> a Git server). The new version is not chosen by default and instead
> is opt-in through the commitGraph.generationVersion config option.
>
> Perhaps in a year or two we would consider making this the new
> default, but there is no rush to do so.

Looking into this a bit more I think that in either case this is less of
a big deal after my 43d35618055 (commit-graph write: don't die if the
existing graph is corrupt, 2019-03-25), which came out of some of those
discussions at the time of [1].

I.e. now a client that only understands version N-1 will warn when
loading it, whereas it's only if a pre-v2.22.0 client (which lacks that
commit) reads the repository that we'd hard die on it, correct?
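
To illustrate the distinction, here is a minimal sketch of a header
check that reports an unknown version to its caller instead of dying.
The names (check_header(), 'enum header_status', MAX_KNOWN_VERSION) are
hypothetical; the only layout assumed is the 8-byte header that the
tests print as "header: 43475048 1 ...", i.e. a 4-byte signature
followed by one byte each for version, hash version, chunk count and
base graph count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GRAPH_SIGNATURE UINT32_C(0x43475048) /* "CGPH" */
#define MAX_KNOWN_VERSION 1                  /* newest version this reader knows */

enum header_status {
	HEADER_OK,
	HEADER_TOO_SHORT,
	HEADER_BAD_SIGNATURE,
	HEADER_UNKNOWN_VERSION
};

/*
 * Report what is wrong with the header and let the caller pick the
 * policy: warn and continue without the commit-graph, or die.
 */
enum header_status check_header(const unsigned char *buf, size_t len)
{
	uint32_t sig;

	assert(buf);
	if (len < 8)
		return HEADER_TOO_SHORT;

	sig = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
	      ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
	if (sig != GRAPH_SIGNATURE)
		return HEADER_BAD_SIGNATURE;
	if (buf[4] < 1 || buf[4] > MAX_KNOWN_VERSION)
		return HEADER_UNKNOWN_VERSION;
	return HEADER_OK;
}
```

A caller that treats HEADER_UNKNOWN_VERSION as "warn and walk commits
without the graph" gets the soft failure; one that dies on any status
other than HEADER_OK gets the hard one.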

But speaking of hyper-focus. I think that arguably applies to you in
this case when considering the trade-offs of these sorts of format
changes :)

I.e. you're primarily considering cases of say a git server (presumably
running on GitHub) or another such deployment where it's easy to have
full control over all of your versions "in the wild".

And thus a three-phase rollout of something like a format change can be
done in a timely and predictable manner.

But git is used by *a lot* of people in a bunch of different
scenarios. E.g.:

 * A shared (hopefully read-only) NFS mounted by remote "unmanaged" clients.
 * A tarred-up directory including a .git, which may be transferred to
   a machine with a pre-v2.22.0 version.

Or even softer cases of failure, such as:

 * A cronjob causes an alert/incident somewhere because the server 
   operator started writing a new version, but forgot about a set
   of machines that are still on the old version.

I think that even if it's less conceptually clean it's worth considering
bending over backwards to be kinder to such use-cases, unless it's really
required for other reasons to break such in-the-wild use-cases.

Or in this case, it may be worth helping reviewers decide by separating
the performance improvement aspect from the changed interaction between
new graphs and older clients.

As a further nit on the proposed end-state here: Do I understand it
correctly that commitGraph.generationVersion=[1|2] (i.e. on current
"master") will always result in a file that's compatible with older
versions, since the only thing "v2" there controls now is to write the
optional GDAT and GDOV chunks?

Whereas going from commitGraph.generationVersion=2 to
commitGraph.generationVersion=3 in this series will impact older clients
as noted above, since we're bumping the version (of the file, to 2 if
the config is 3, which as Junio noted is a bit confusing).

I think if you're set on going down the path of bumping the top-level
version that deserves to be made much clearer in the added
documentation. Right now the only hint to that is a passing mention that
for v3:

    [it] will be incompatible with some old versions of Git

Which if we're opting for breaking format changes really should note
some of the caveats above, that pre-v2.22.0 hard-dies, and probably
describe "some old versions of Git" a bit more clearly.

It actually means, once this gets released, "the git version that was the
latest one you could download yesterday". Which a reader of the docs
probably won't expect when starting to play with this in a mixed-version
environment.

1. https://lore.kernel.org/git/87h8acivkh.fsf@evledraar.gmail.com/


* Re: [PATCH 3/7] commit-graph: start parsing generation v2 (again)
  2022-02-24 20:38 ` [PATCH 3/7] commit-graph: start parsing generation v2 (again) Derrick Stolee via GitGitGadget
@ 2022-02-28 15:18   ` Patrick Steinhardt
  2022-02-28 16:23     ` Derrick Stolee
  0 siblings, 1 reply; 70+ messages in thread
From: Patrick Steinhardt @ 2022-02-28 15:18 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, gitster, abhishekkumar8222, Derrick Stolee


On Thu, Feb 24, 2022 at 08:38:32PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <derrickstolee@github.com>
> 
> The 'read_generation_data' member of 'struct commit_graph' was
> introduced by 1fdc383c5 (commit-graph: use generation v2 only if entire
> chain does, 2021-01-16). The intention was to avoid using corrected
> commit dates if not all layers of a commit-graph had that data stored.
> The logic in validate_mixed_generation_chain() at that point incorrectly
> initialized read_generation_data to 1 if and only if the tip
> commit-graph contained the Corrected Commit Date chunk.
> 
> This was "fixed" in 448a39e65 (commit-graph: validate layers for
> generation data, 2021-02-02) to validate that read_generation_data was
> either non-zero for all layers, or it would set read_generation_data to
> zero for all layers.
> 
> The problem here is that read_generation_data is not initialized to be
> non-zero anywhere!
> 
> This change initializes read_generation_data immediately after the chunk
> is parsed, so each layer will have its value present as soon as
> possible.
> 
> The read_generation_data member is used in fill_commit_graph_info() to
> determine if we should use the corrected commit date or the topological
> levels stored in the Commit Data chunk. Due to this bug, all previous
> versions of Git were defaulting to topological levels in all cases!
> 
> This can be measured with some performance tests. Using the Linux kernel
> as a testbed, I generated a complete commit-graph containing corrected
> commit dates and tested the 'new' version against the previous, 'old'
> version.
> 
> First, rev-list with --topo-order demonstrates a 26% improvement using
> corrected commit dates:
> 
> hyperfine \
> 	-n "old" "$OLD_GIT rev-list --topo-order -1000 v3.6" \
> 	-n "new" "$NEW_GIT rev-list --topo-order -1000 v3.6" \
> 	--warmup=10
> 
> Benchmark 1: old
>   Time (mean ± σ):      57.1 ms ±   3.1 ms
>   Range (min … max):    52.9 ms …  62.0 ms    55 runs
> 
> Benchmark 2: new
>   Time (mean ± σ):      45.5 ms ±   3.3 ms
>   Range (min … max):    39.9 ms …  51.7 ms    59 runs
> 
> Summary
>   'new' ran
>     1.26 ± 0.11 times faster than 'old'
> 
> These performance improvements are due to the algorithmic improvements
> given by walking fewer commits due to the higher cutoffs from corrected
> commit dates.
> 
> However, this comes at a cost. The additional I/O cost of parsing the
> corrected commit dates is visible in case of merge-base commands that do
> not reduce the overall number of walked commits.
> 
> hyperfine \
>         -n "old" "$OLD_GIT merge-base v4.8 v4.9" \
>         -n "new" "$NEW_GIT merge-base v4.8 v4.9" \
>         --warmup=10
> 
> Benchmark 1: old
>   Time (mean ± σ):     110.4 ms ±   6.4 ms
>   Range (min … max):    96.0 ms … 118.3 ms    25 runs
> 
> Benchmark 2: new
>   Time (mean ± σ):     150.7 ms ±   1.1 ms
>   Range (min … max):   149.3 ms … 153.4 ms    19 runs
> 
> Summary
>   'old' ran
>     1.36 ± 0.08 times faster than 'new'
> 
> Performance issues like this are what motivated 702110aac (commit-graph:
> use config to specify generation type, 2021-02-25).
> 
> In the future, we could fix this performance problem by inserting the
> corrected commit date offsets into the Commit Data chunk instead of
> having that data in an extra chunk.
> 
> Signed-off-by: Derrick Stolee <derrickstolee@github.com>
> ---
>  commit-graph.c                |  3 +++
>  t/t4216-log-bloom.sh          |  2 +-
>  t/t5318-commit-graph.sh       | 14 ++++++++++++--
>  t/t5324-split-commit-graph.sh |  9 +++++++--
>  4 files changed, 23 insertions(+), 5 deletions(-)
> 
> diff --git a/commit-graph.c b/commit-graph.c
> index a19bd96c2ee..8e52bb09552 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -407,6 +407,9 @@ struct commit_graph *parse_commit_graph(struct repository *r,
>  			&graph->chunk_generation_data);
>  		pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
>  			&graph->chunk_generation_data_overflow);
> +
> +		if (graph->chunk_generation_data)
> +			graph->read_generation_data = 1;
>  	}
>  
>  	if (r->settings.commit_graph_read_changed_paths) {

I wanted to test your changes because they seem quite exciting in the
context of my work as well, but this commit seems to uncover a bug with
how we handle overflows. I originally triggered the bug when trying to
do a mirror-fetch, but as it turns out it seems to trigger now whenever the
commit-graph is being read:

    $ git commit-graph verify
    fatal: commit-graph requires overflow generation data but has none

    $ git commit-graph write --split
    Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
    fatal: commit-graph requires overflow generation data but has none

    $ git commit-graph write --split=replace
    Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
    fatal: commit-graph requires overflow generation data but has none

I initially assumed this may be a bug with how we previously wrote the
commit-graph, but removing all chains still reliably triggers it:

    $ rm -f objects/info/commit-graphs/*
    $ git commit-graph write --split
    Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
    fatal: commit-graph requires overflow generation data but has none

I haven't yet found the time to dig deeper into why this is happening.
While the repository is publicly accessible at [1], unfortunately the
bug seems to be triggered by a commit that's only kept alive by an
internal reference.

Patrick

[1]: https://gitlab.com/gitlab-com/www-gitlab-com.git



* Re: [PATCH v2 1/4] test-read-graph: include extra post-parse info
  2022-02-28 13:53   ` [PATCH v2 1/4] test-read-graph: include extra post-parse info Derrick Stolee via GitGitGadget
@ 2022-02-28 15:22     ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 70+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-02-28 15:22 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, gitster, abhishekkumar8222, Derrick Stolee


On Mon, Feb 28 2022, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <derrickstolee@github.com>
>
> It can be helpful to verify that the 'struct commit_graph' that results
> from parsing a commit-graph is correctly structured. The existence of
> different chunks is not enough to verify that all of the optional
> features are correctly enabled.
>
> Update 'test-tool read-graph' to output an "options:" line that includes
> information for different parts of the struct commit_graph.
>
> In particular, this change demonstrates that the read_generation_data
> option is never being enabled, which will be fixed in a later change.
>
> Signed-off-by: Derrick Stolee <derrickstolee@github.com>
> ---
>  t/helper/test-read-graph.c    | 13 +++++++++++++
>  t/t4216-log-bloom.sh          |  1 +
>  t/t5318-commit-graph.sh       |  1 +
>  t/t5324-split-commit-graph.sh |  5 +++++
>  4 files changed, 20 insertions(+)
>
> diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
> index 75927b2c81d..c3b6b8d1734 100644
> --- a/t/helper/test-read-graph.c
> +++ b/t/helper/test-read-graph.c
> @@ -3,6 +3,7 @@
>  #include "commit-graph.h"
>  #include "repository.h"
>  #include "object-store.h"
> +#include "bloom.h"
>  
>  int cmd__read_graph(int argc, const char **argv)
>  {
> @@ -45,6 +46,18 @@ int cmd__read_graph(int argc, const char **argv)
>  		printf(" bloom_data");
>  	printf("\n");
>  
> +	printf("options:");
> +	if (graph->bloom_filter_settings)
> +		printf(" bloom(%d,%d,%d)",

I think this is probably unportable, as other code (including in
commit-graph.c) uses "%"PRIu32 when printing uint32_t.

Does this work on our Linux32 job? I was going to quickly check the PR
CI, but it appears the run was skipped for some reason.

> +		       graph->bloom_filter_settings->hash_version,
> +		       graph->bloom_filter_settings->bits_per_entry,
> +		       graph->bloom_filter_settings->num_hashes);
> +	if (graph->read_generation_data)
> +		printf(" read_generation_data");
> +	if (graph->topo_levels)
> +		printf(" topo_levels");
> +	printf("\n");
> +
>  	UNLEAK(graph);
>  
>  	return 0;
> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> index cc3cebf6722..5ed6d2a21c1 100755
> --- a/t/t4216-log-bloom.sh
> +++ b/t/t4216-log-bloom.sh
> @@ -48,6 +48,7 @@ graph_read_expect () {
>  	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
>  	num_commits: $1
>  	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
> +	options: bloom(1,10,7)
>  	EOF
>  	test-tool read-graph >actual &&
>  	test_cmp expect actual
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index edb728f77c3..2b05026cf6d 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -104,6 +104,7 @@ graph_read_expect() {
>  	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
>  	num_commits: $1
>  	chunks: oid_fanout oid_lookup commit_metadata$OPTIONAL
> +	options:
>  	EOF
>  	test-tool read-graph >output &&
>  	test_cmp expect output
> diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> index 847b8097109..778fa418de2 100755
> --- a/t/t5324-split-commit-graph.sh
> +++ b/t/t5324-split-commit-graph.sh
> @@ -34,6 +34,7 @@ graph_read_expect() {
>  	header: 43475048 1 $(test_oid oid_version) 4 $NUM_BASE
>  	num_commits: $1
>  	chunks: oid_fanout oid_lookup commit_metadata generation_data
> +	options:
>  	EOF
>  	test-tool read-graph >output &&
>  	test_cmp expect output
> @@ -508,6 +509,7 @@ test_expect_success 'setup repo for mixed generation commit-graph-chain' '
>  		header: 43475048 1 $(test_oid oid_version) 4 1
>  		num_commits: $NUM_SECOND_LAYER_COMMITS
>  		chunks: oid_fanout oid_lookup commit_metadata
> +		options:
>  		EOF
>  		test_cmp expect output &&
>  		git commit-graph verify &&
> @@ -540,6 +542,7 @@ test_expect_success 'do not write generation data chunk if not present on existi
>  		header: 43475048 1 $(test_oid oid_version) 4 2
>  		num_commits: $NUM_THIRD_LAYER_COMMITS
>  		chunks: oid_fanout oid_lookup commit_metadata
> +		options:
>  		EOF
>  		test_cmp expect output &&
>  		git commit-graph verify
> @@ -581,6 +584,7 @@ test_expect_success 'do not write generation data chunk if the topmost remaining
>  		header: 43475048 1 $(test_oid oid_version) 4 2
>  		num_commits: $(($NUM_THIRD_LAYER_COMMITS + $NUM_FOURTH_LAYER_COMMITS))
>  		chunks: oid_fanout oid_lookup commit_metadata
> +		options:
>  		EOF
>  		test_cmp expect output &&
>  		git commit-graph verify
> @@ -620,6 +624,7 @@ test_expect_success 'write generation data chunk if topmost remaining layer has
>  		header: 43475048 1 $(test_oid oid_version) 5 1
>  		num_commits: $(($NUM_SECOND_LAYER_COMMITS + $NUM_THIRD_LAYER_COMMITS + $NUM_FOURTH_LAYER_COMMITS + $NUM_FIFTH_LAYER_COMMITS))
>  		chunks: oid_fanout oid_lookup commit_metadata generation_data
> +		options:
>  		EOF
>  		test_cmp expect output
>  	)

I think the rest of this all looks good and obviously correct.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v2 2/4] commit-graph: fix ordering bug in generation numbers
  2022-02-28 13:53   ` [PATCH v2 2/4] commit-graph: fix ordering bug in generation numbers Derrick Stolee via GitGitGadget
@ 2022-02-28 15:25     ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 70+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-02-28 15:25 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, gitster, abhishekkumar8222, Derrick Stolee


On Mon, Feb 28 2022, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <derrickstolee@github.com>
> [...]
> diff --git a/commit-graph.c b/commit-graph.c
> index 265c010122e..a19bd96c2ee 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1556,12 +1556,16 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  				if (current->date && current->date > max_corrected_commit_date)
>  					max_corrected_commit_date = current->date - 1;
>  				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
> -
> -				if (commit_graph_data_at(current)->generation - current->date > GENERATION_NUMBER_V2_OFFSET_MAX)
> -					ctx->num_generation_data_overflows++;
>  			}
>  		}
>  	}
> +
> +	for (i = 0; i < ctx->commits.nr; i++) {
> +		struct commit *c = ctx->commits.list[i];
> +		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
> +		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX)
> +			ctx->num_generation_data_overflows++;
> +	}
>  	stop_progress(&ctx->progress);
>  }
>  
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 2b05026cf6d..f9bffe38013 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -467,10 +467,10 @@ test_expect_success 'warn on improper hash version' '
>  	)
>  '
>  
> -test_expect_success 'lower layers have overflow chunk' '
> +test_expect_success TIME_IS_64BIT,TIME_T_IS_64BIT 'lower layers have overflow chunk' '
>  	cd "$TRASH_DIRECTORY/full" &&
>  	UNIX_EPOCH_ZERO="@0 +0000" &&
> -	FUTURE_DATE="@2147483646 +0000" &&
> +	FUTURE_DATE="@4147483646 +0000" &&
>  	rm -f .git/objects/info/commit-graph &&
>  	test_commit --date "$FUTURE_DATE" future-1 &&
>  	test_commit --date "$UNIX_EPOCH_ZERO" old-1 &&

Isn't it worth splitting up this test instead, so that we can test cases
where 32 bit timestamps overflow without these new prereqs?

Unless I'm missing something that would just be a matter of splitting
this test into a helper that takes that $FUTURE_DATE as an argument, then
running it for both timestamps, with TIME_IS_64BIT,TIME_T_IS_64BIT on
the 64 bit one.

Or maybe I don't get it, but it seems like we're throwing out some
carefully considered testing for 32 bit compatibility with the
proverbial bath water here....

Aside from that I wonder how this interacts with both:

    #define CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW (1ULL << 31)

And this existing code, where offset is timestamp_t, but
num_generation_data_overflows is an "int":

    offset = CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW | num_generation_data_overflows;

That proooobably does the right thing if int is say 32 bit, but
timestamp_t is 64 bit (does such an OS exist?), but maybe worth looking
at with a second pair of eyes...
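
A minimal sketch of that interaction, assuming timestamp_t is a 64-bit
unsigned type. All names below are hypothetical stand-ins, not the ones
in commit-graph.c:

```c
#include <stdint.h>

/* Stand-in for the overflow marker bit quoted above. */
#define OFFSET_OVERFLOW_SKETCH (1ULL << 31)

/* Assumption: timestamp_t is 64 bits wide and unsigned. */
typedef uint64_t timestamp_sketch_t;

/* Tag an index into the overflow chunk with the marker bit. */
static timestamp_sketch_t mark_overflow(int num_overflows)
{
	/*
	 * The usual arithmetic conversions turn the int operand into
	 * unsigned long long before the OR is evaluated.
	 */
	return OFFSET_OVERFLOW_SKETCH | num_overflows;
}

/* Recover the index by flipping the marker bit back off. */
static uint32_t overflow_index(timestamp_sketch_t offset)
{
	return (uint32_t)(offset ^ OFFSET_OVERFLOW_SKETCH);
}
```

For a non-negative counter the result is just the counter with bit 31
set. A negative int would be sign-extended by the conversion and set
every high bit, so the pattern relies on the count never going negative.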

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v2 3/4] commit-graph: start parsing generation v2 (again)
  2022-02-28 13:53   ` [PATCH v2 3/4] commit-graph: start parsing generation v2 (again) Derrick Stolee via GitGitGadget
@ 2022-02-28 15:30     ` Ævar Arnfjörð Bjarmason
  2022-02-28 16:43       ` Derrick Stolee
  0 siblings, 1 reply; 70+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-02-28 15:30 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, gitster, abhishekkumar8222, Derrick Stolee


On Mon, Feb 28 2022, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <derrickstolee@github.com>
> [...]
> +	GENERATION_VERSION=2
> +	if test ! -z "$3"

TIL this works somewhere :)

I thought it *might* be unportable behavior (but didn't check at
first), but it's not! We have a few such cases already.

But IMO much less puzzling would be at least:

    if ! test -z "$3"

Or in this case, more plainly:

    if test -n "$3"

> +	then
> +		GENERATION_VERSION=$3
> +	fi
> +	OPTIONS=
> +	if test $GENERATION_VERSION -gt 1
> +	then
> +		OPTIONS=" read_generation_data"
> +	fi

Or actually, since we don't use $GENERATION_VERSION further down, setting
it to a default etc. here seems a bit odd. Perhaps something closer to:

    if test $# -eq 3 && test $3 -gt 1

It's also possible to be more clever as e.g.:

    test "${3:-2}" -gt 1

But that hardly seems worth it...

>  NUM_COMMITS=9
> diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> index 778fa418de2..669ddc645fa 100755
> --- a/t/t5324-split-commit-graph.sh
> +++ b/t/t5324-split-commit-graph.sh
> @@ -30,11 +30,16 @@ graph_read_expect() {
>  	then
>  		NUM_BASE=$2
>  	fi
> +	OPTIONS=
> +	if test -z "$3"
> +	then
> +		OPTIONS=" read_generation_data"
> +	fi
>  	cat >expect <<- EOF
>  	header: 43475048 1 $(test_oid oid_version) 4 $NUM_BASE
>  	num_commits: $1
>  	chunks: oid_fanout oid_lookup commit_metadata generation_data
> -	options:
> +	options:$OPTIONS
>  	EOF
>  	test-tool read-graph >output &&
>  	test_cmp expect output

Not a new issue, but it would be nice to have the mostly copy/pasted
code in a lib-commit-graph.sh or something, but probably too distracting
for this small series...

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v2 4/4] commit-graph: fix generation number v2 overflow values
  2022-02-28 13:53   ` [PATCH v2 4/4] commit-graph: fix generation number v2 overflow values Derrick Stolee via GitGitGadget
@ 2022-02-28 15:40     ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 70+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-02-28 15:40 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, gitster, abhishekkumar8222, Derrick Stolee


On Mon, Feb 28 2022, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <derrickstolee@github.com>
>
> The Generation Data Chunk was implemented and tested in e8b63005c
> (commit-graph: implement generation data chunk, 2021-01-16), but the
> test was carefully constructed to work on systems with 32-bit dates.
> Since the corrected commit date offsets still required more than 31
> bits, this triggered writing the generation_data_overflow chunk.
>
> However, upon closer look, the
> write_graph_chunk_generation_data_overflow() method writes the offsets
> to the chunk (as dictated by the format) but fill_commit_graph_info()
> treats the value in the chunk as if it is the full corrected commit date
> (not an offset). For some reason, this does not cause an issue when
> using the FUTURE_DATE specified in t5318-commit-graph.sh, but it does
> show up as a failure in 'git commit-graph verify' if we increase that
> FUTURE_DATE to be above four billion.
>
> Fix this error and update the test to require 64-bit dates so we can
> safely use this large value in our test.

Hrm, so perhaps re my comment on 2/4 @2147483646 was never used? I'm not
sure I understand this.
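
To make the two readings of the overflow chunk concrete, here is a small
sketch of the arithmetic. The function names are hypothetical, and the
suggestion that commits dated at the epoch could hide the difference is
only a guess, not something confirmed in this thread:

```c
#include <stdint.h>

/* Pre-fix reading: the stored value is taken as the full corrected date. */
static uint64_t generation_as_full_date(uint64_t commit_date, uint64_t stored)
{
	(void)commit_date; /* the commit date is ignored in this reading */
	return stored;
}

/* Post-fix reading: the stored value is an offset from the commit date. */
static uint64_t generation_as_offset(uint64_t commit_date, uint64_t stored)
{
	return commit_date + stored;
}
```

For a commit whose date is 0 the two functions return the same value, so
a test whose overflowing commits are all dated at the epoch could not
tell them apart. For any non-zero date they disagree by exactly that
date.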

> diff --git a/commit-graph.c b/commit-graph.c
> index 8e52bb09552..b86a6a634fe 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -806,7 +806,7 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
>  				die(_("commit-graph requires overflow generation data but has none"));
>  
>  			offset_pos = offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW;
> -			graph_data->generation = get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
> +			graph_data->generation = item->date + get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
>  		} else
>  			graph_data->generation = item->date + offset;
>  	} else
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 1afee1c2705..f4ffaad661d 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -815,6 +815,15 @@ test_expect_success 'corrupt commit-graph write (missing tree)' '
>  	)
>  '

This goes back to my comment on 3/4 but:

> +# The remaining tests check timestamps that flow over
> +# 32-bits. The graph_git_behavior checks can't take a
> +# prereq, so just stop here if we are on a 32-bit machine.
> +
> +if ! test_have_prereq TIME_IS_64BIT || ! test_have_prereq TIME_T_IS_64BIT
> +then
> +	test_done
> +fi
> +

This...(continued)...

>  # We test the overflow-related code with the following repo history:
>  #
>  #               4:F - 5:N - 6:U
> @@ -832,10 +841,10 @@ test_expect_success 'corrupt commit-graph write (missing tree)' '
>  # The largest offset observed is 2 ^ 31, just large enough to overflow.
>  #
>  
> -test_expect_success 'set up and verify repo with generation data overflow chunk' '
> +test_expect_success TIME_IS_64BIT,TIME_T_IS_64BIT 'set up and verify repo with generation data overflow chunk' '
>  	objdir=".git/objects" &&
>  	UNIX_EPOCH_ZERO="@0 +0000" &&
> -	FUTURE_DATE="@2147483646 +0000" &&
> +	FUTURE_DATE="@4000000000 +0000" &&

Hrm, again this may be over my head, but why @4147483646 instead of
@2147483646 in the other test, yet @4000000000 here?


>  	test_oid_cache <<-EOF &&
>  	oid_version sha1:1
>  	oid_version sha256:2
> @@ -867,4 +876,8 @@ test_expect_success 'set up and verify repo with generation data overflow chunk'
>  
>  graph_git_behavior 'generation data overflow chunk repo' repo left right
>  
> +# Do not add tests at the end of this file, unless they require 64-bit
> +# timestamps, since this portion of the script is only executed when
> +# time data types have 64 bits.
> +
>  test_done

...and this would really be much nicer if we split this test up into its
own file, which would be obviously named to note the
issue. tXXXX-commit-graph-64bit-timestamp.sh or something.

As shown in my recent 0a2bfccb9c8 (t0051: use "skip_all" under !MINGW in
single-test file, 2022-02-04) you'll also get much nicer output from
"prove" in that case.
 

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 3/7] commit-graph: start parsing generation v2 (again)
  2022-02-28 15:18   ` Patrick Steinhardt
@ 2022-02-28 16:23     ` Derrick Stolee
  2022-02-28 16:59       ` Patrick Steinhardt
  0 siblings, 1 reply; 70+ messages in thread
From: Derrick Stolee @ 2022-02-28 16:23 UTC (permalink / raw)
  To: Patrick Steinhardt, Derrick Stolee via GitGitGadget
  Cc: git, me, gitster, abhishekkumar8222

On 2/28/2022 10:18 AM, Patrick Steinhardt wrote:
> On Thu, Feb 24, 2022 at 08:38:32PM +0000, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <derrickstolee@github.com>
...
>> diff --git a/commit-graph.c b/commit-graph.c
>> index a19bd96c2ee..8e52bb09552 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -407,6 +407,9 @@ struct commit_graph *parse_commit_graph(struct repository *r,
>>  			&graph->chunk_generation_data);
>>  		pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
>>  			&graph->chunk_generation_data_overflow);
>> +
>> +		if (graph->chunk_generation_data)
>> +			graph->read_generation_data = 1;
>>  	}
>>  
>>  	if (r->settings.commit_graph_read_changed_paths) {
> 
> I wanted to test your changes because they seem quite exciting in the
> context of my work as well, but this commit seems to uncover a bug with
> how we handle overflows. I originally triggered the bug when trying to
> do a mirror-fetch, but as it turns out it seems to trigger now whenever the
> commit-graph is being read:
> 
>     $ git commit-graph verify
>     fatal: commit-graph requires overflow generation data but has none
> 
>     $ git commit-graph write --split
>     Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
>     fatal: commit-graph requires overflow generation data but has none
> 
>     $ git commit-graph write --split=replace
>     Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
>     fatal: commit-graph requires overflow generation data but has none
> 
> I initially assumed this may be a bug with how we previously wrote the
> commit-graph, but removing all chains still reliably triggers it:
> 
>     $ rm -f objects/info/commit-graphs/*
>     $ git commit-graph write --split
>     Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
>     fatal: commit-graph requires overflow generation data but has none
> 
> I haven't yet found the time to dig deeper into why this is happening.
> While the repository is publicly accessible at [1], unfortunately the
> bug seems to be triggered by a commit that's only kept alive by an
> internal reference.
> 
> Patrick
> 
> [1]: https://gitlab.com/gitlab-com/www-gitlab-com.git

Thanks for including this information. Just to be clear: did you
include patch 4 in your tests, or not? Patch 4 includes a fix
related to overflow values, so it would be helpful to know if you
found a _different_ bug or if it is the same one.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 5/7] commit-graph: document file format v2
  2022-02-28 14:27       ` Ævar Arnfjörð Bjarmason
@ 2022-02-28 16:39         ` Derrick Stolee
  2022-02-28 21:14           ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 70+ messages in thread
From: Derrick Stolee @ 2022-02-28 16:39 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Derrick Stolee via GitGitGadget, git, me, gitster, abhishekkumar8222

On 2/28/2022 9:27 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Mon, Feb 28 2022, Derrick Stolee wrote:
> 
>> On 2/25/2022 5:31 PM, Ævar Arnfjörð Bjarmason wrote:

>>> Or maybe they won't. I just found it surprising when reviewing this to
>>> not find an answer to why that approach wasn't
>>> considered.
>>
>> The point is to create a new format that can be chosen when deployed
>> in an environment where older Git versions will not exist (such as
>> a Git server). The new version is not chosen by default and instead
>> is opt-in through the commitGraph.generationVersion config option.
>>
>> Perhaps in a year or two we would consider making this the new
>> default, but there is no rush to do so.
> 
> Looking into this a bit more I think that in either case this is less of
> a big deal after my 43d35618055 (commit-graph write: don't die if the
> existing graph is corrupt, 2019-03-25), which came out of some of those
> discussions at the time of [1].
> 
> I.e. now a client that only understands version N-1 will warn when
> loading it, whereas it's only if a pre-v2.22.0 client (which has that
> commit) reads the repository that we'd hard die on it, correct?
> 
> But speaking of hyper-focus. I think that arguably applies to you in
> this case when considering the trade-offs of these sorts of format
> changes :)
> 
> I.e. you're primarily considering cases of say a git server (presumably
> running on GitHub) or another such deployment where it's easy to have
> full control over all of your versions "in the wild".

I'm thinking of servers, yes, but also 99% of clients who only upgrade
(or _maybe_ downgrade to a recent, previous version occasionally).
 
> And thus a three-phase rollout of something like a format change can be
> done in a timely and predictable manner.
> 
> But git is used by *a lot* of people in a bunch of different
> scenarios. E.g.:
> 
>  * A shared (hopefully read-only) NFS mounted by remote "unmanaged" clients.
>  * A tarred-up directory including a .git, which may be transferred to
>    a machine with a pre-v2.22.0 version.
> 
> Or even softer cases of failure, such as:
> 
>  * A cronjob causes an alert/incident somewhere because the server 
>    operator started writing a new version, but forgot about a set
>    of machines that are still on the old version.

It is important to continue supporting these cases, and this change does
not cause any issues for them. However, this handful of corner cases
should not block progress in the main cases.

> I think that even if it's less conceptually clean it's worth considering
> bending over backwards to be kinder to such use-cases, unless it's really
> required for other reasons to break such in-the-wild use-cases.
> 
> Or in this case, if it's thought to be worth it to help reviewers decide
> by separating the performance improvement aspect from the changed
> interaction between new graphs and older clients.
> 
> As a further nit on the proposed end-state here: Do I understand it
> correctly that commitGraph.generationVersion=[1|2] (i.e. on current
> "master") will always result in a file that's compatible with older
> versions, since the only thing "v2" there controls now is to write the
> optional GDAT and GDOV chunks?
> 
> Whereas going from commitGraph.generationVersion=2 to
> commitGraph.generationVersion=3 in this series will impact older clients
> as noted above, since we're bumping the version (of the file, to 2 if
> the config is 3, which as Junio noted is a bit confusing).
> 
> I think if you're set on going down the path of bumping the top-level
> version that deserves to be made much clearer in the added
> documentation. Right now the only hint to that is a passing mention that
> for v3:
> 
>     [it] will be incompatible with some old versions of Git
> 
> Which if we're opting for breaking format changes really should note
> some of the caveats above, that pre-v2.22.0 hard-dies, and probably
> describe "some old versions of Git" a bit more clearly.
> 
> It actually means once this gets released "the git version that was the
> latest one you could download yesterday". Which a reader of the docs
> probably won't expect when starting to play with this in a mixed-version
> environment.
> 
> 1. https://lore.kernel.org/git/87h8acivkh.fsf@evledraar.gmail.com/

This documentation could be altered to be specific about versions,
but such a specific change makes assumptions of the version that will
include it. As of now, the generation number v2 fixes will _probably_
get in for 2.36 and the format change would have enough time to cook
for 2.37, so I'll update the docs to refer to that version explicitly.

The pre-2.22.0 change might be helpful to mention, but it could also be
noise to the reader. We can revisit this when these patches are
submitted again in another thread. There's also concern about third-
party tools like libgit2. I'd rather draw the line as "tread carefully
here" than "here is so much information that a reader might think it
is all they need to know".

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v2 3/4] commit-graph: start parsing generation v2 (again)
  2022-02-28 15:30     ` Ævar Arnfjörð Bjarmason
@ 2022-02-28 16:43       ` Derrick Stolee
  0 siblings, 0 replies; 70+ messages in thread
From: Derrick Stolee @ 2022-02-28 16:43 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Derrick Stolee via GitGitGadget
  Cc: git, me, gitster, abhishekkumar8222



On 2/28/2022 10:30 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Mon, Feb 28 2022, Derrick Stolee via GitGitGadget wrote:
> 
>> From: Derrick Stolee <derrickstolee@github.com>
>> [...]
>> +	GENERATION_VERSION=2
>> +	if test ! -z "$3"
> 
> TIL this works somewhere :)
> 
> I thought it *might* be unportable behavior (but didn't check at
> first), but it's not! We have a few such cases already.
> 
> But IMO much less puzzling would be at least:
> 
>     if ! test -z "$3"
> 
> Or in this case, more plainly:
> 
>     if test -n "$3"

Sure, that makes sense.

>> +	then
>> +		GENERATION_VERSION=$3
>> +	fi
>> +	OPTIONS=
>> +	if test $GENERATION_VERSION -gt 1
>> +	then
>> +		OPTIONS=" read_generation_data"
>> +	fi
> 
> Or actually, since we don't use $GENERATION_VERSION further down setting
> it to a default etc. here seems a bit odd. Perhaps something closer to:
> 
>     if test $# -eq 3 && test $3 -gt 1
> 
> It's also possible to be more clever as e.g.:
> 
>     test "${3:-2}" -gt 1
> 
> But that hardly seems worth it...

I prefer to use a variable so the code is self-documenting.

>>  NUM_COMMITS=9
>> diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
>> index 778fa418de2..669ddc645fa 100755
>> --- a/t/t5324-split-commit-graph.sh
>> +++ b/t/t5324-split-commit-graph.sh
>> @@ -30,11 +30,16 @@ graph_read_expect() {
>>  	then
>>  		NUM_BASE=$2
>>  	fi
>> +	OPTIONS=
>> +	if test -z "$3"
>> +	then
>> +		OPTIONS=" read_generation_data"
>> +	fi
>>  	cat >expect <<- EOF
>>  	header: 43475048 1 $(test_oid oid_version) 4 $NUM_BASE
>>  	num_commits: $1
>>  	chunks: oid_fanout oid_lookup commit_metadata generation_data
>> -	options:
>> +	options:$OPTIONS
>>  	EOF
>>  	test-tool read-graph >output &&
>>  	test_cmp expect output
> 
> Not a new issue, but it would be nice to have the mostly copy/pasted
> code in a lib-commit-graph.sh or something, but probably too distracting
> for this small series...

These cases are different enough in the needs of the test files
that they cannot be shared without significant complication.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 3/7] commit-graph: start parsing generation v2 (again)
  2022-02-28 16:23     ` Derrick Stolee
@ 2022-02-28 16:59       ` Patrick Steinhardt
  2022-02-28 18:44         ` Derrick Stolee
  0 siblings, 1 reply; 70+ messages in thread
From: Patrick Steinhardt @ 2022-02-28 16:59 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, gitster, abhishekkumar8222

On Mon, Feb 28, 2022 at 11:23:38AM -0500, Derrick Stolee wrote:
> On 2/28/2022 10:18 AM, Patrick Steinhardt wrote:
> > On Thu, Feb 24, 2022 at 08:38:32PM +0000, Derrick Stolee via GitGitGadget wrote:
> >> From: Derrick Stolee <derrickstolee@github.com>
> ...
> >> diff --git a/commit-graph.c b/commit-graph.c
> >> index a19bd96c2ee..8e52bb09552 100644
> >> --- a/commit-graph.c
> >> +++ b/commit-graph.c
> >> @@ -407,6 +407,9 @@ struct commit_graph *parse_commit_graph(struct repository *r,
> >>  			&graph->chunk_generation_data);
> >>  		pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
> >>  			&graph->chunk_generation_data_overflow);
> >> +
> >> +		if (graph->chunk_generation_data)
> >> +			graph->read_generation_data = 1;
> >>  	}
> >>  
> >>  	if (r->settings.commit_graph_read_changed_paths) {
> > 
> > I wanted to test your changes because they seem quite exciting in the
> > context of my work as well, but this commit seems to uncover a bug with
> > how we handle overflows. I originally triggered the bug when trying to
> > do a mirror-fetch, but as it turns out it seems to trigger now whenever the
> > commit-graph is being read:
> > 
> >     $ git commit-graph verify
> >     fatal: commit-graph requires overflow generation data but has none
> > 
> >     $ git commit-graph write --split
> >     Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
> >     fatal: commit-graph requires overflow generation data but has none
> > 
> >     $ git commit-graph write --split=replace
> >     Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
> >     fatal: commit-graph requires overflow generation data but has none
> > 
> > I initially assumed this may be a bug with how we previously wrote the
> > commit-graph, but removing all chains still reliably triggers it:
> > 
> >     $ rm -f objects/info/commit-graphs/*
> >     $ git commit-graph write --split
> >     Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
> >     fatal: commit-graph requires overflow generation data but has none
> > 
> > I haven't yet found the time to dig deeper into why this is happening.
> > While the repository is publicly accessible at [1], unfortunately the
> > bug seems to be triggered by a commit that's only kept alive by an
> > internal reference.
> > 
> > Patrick
> > 
> > [1]: https://gitlab.com/gitlab-com/www-gitlab-com.git
> 
> Thanks for including this information. Just to be clear: did you
> include patch 4 in your tests, or not? Patch 4 includes a fix
> related to overflow values, so it would be helpful to know if you
> found a _different_ bug or if it is the same one.
> 
> Thanks,
> -Stolee

I initially only applied the first three patches, but after having hit
the fatal error I also applied the rest of this series to have a look at
whether it is indeed fixed already by one of your later patches. The
error remains the same though.

Patrick

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 3/7] commit-graph: start parsing generation v2 (again)
  2022-02-28 16:59       ` Patrick Steinhardt
@ 2022-02-28 18:44         ` Derrick Stolee
  2022-03-01  9:46           ` Patrick Steinhardt
  0 siblings, 1 reply; 70+ messages in thread
From: Derrick Stolee @ 2022-02-28 18:44 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: Derrick Stolee via GitGitGadget, git, me, gitster, abhishekkumar8222

On 2/28/2022 11:59 AM, Patrick Steinhardt wrote:
> On Mon, Feb 28, 2022 at 11:23:38AM -0500, Derrick Stolee wrote:
>> On 2/28/2022 10:18 AM, Patrick Steinhardt wrote:
>>> I haven't yet found the time to dig deeper into why this is happening.
>>> While the repository is publicly accessible at [1], unfortunately the
>>> bug seems to be triggered by a commit that's only kept alive by an
>>> internal reference.
>>>
>>> Patrick
>>>
>>> [1]: https://gitlab.com/gitlab-com/www-gitlab-com.git
>>
>> Thanks for including this information. Just to be clear: did you
>> include patch 4 in your tests, or not? Patch 4 includes a fix
>> related to overflow values, so it would be helpful to know if you
>> found a _different_ bug or if it is the same one.
>>
>> Thanks,
>> -Stolee
> 
> I initially only applied the first three patches, but after having hit
> the fatal error I also applied the rest of this series to have a look at
> whether it is indeed fixed already by one of your later patches. The
> error remains the same though.

Thanks for this extra context. Is this a commit-graph that you wrote
with the first three patches and then you get an error when reading it?

Do you get the same error when deleting that file and rewriting it with
all patches included?

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 5/7] commit-graph: document file format v2
  2022-02-28 16:39         ` Derrick Stolee
@ 2022-02-28 21:14           ` Ævar Arnfjörð Bjarmason
  2022-03-01 14:19             ` Derrick Stolee
  0 siblings, 1 reply; 70+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-02-28 21:14 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, gitster, abhishekkumar8222


On Mon, Feb 28 2022, Derrick Stolee wrote:

> On 2/28/2022 9:27 AM, Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Mon, Feb 28 2022, Derrick Stolee wrote:
>> 
>>> On 2/25/2022 5:31 PM, Ævar Arnfjörð Bjarmason wrote:
>
>>>> Or maybe they won't. I just found it surprising when reviewing this to
>>>> not find an answer to why that approach wasn't
>>>> considered.
>>>
>>> The point is to create a new format that can be chosen when deployed
>>> in an environment where older Git versions will not exist (such as
>>> a Git server). The new version is not chosen by default and instead
>>> is opt-in through the commitGraph.generationVersion config option.
>>>
>>> Perhaps in a year or two we would consider making this the new
>>> default, but there is no rush to do so.
>> 
>> Looking into this a bit more I think that in either case this is less of
>> a big deal after my 43d35618055 (commit-graph write: don't die if the
>> existing graph is corrupt, 2019-03-25), which came out of some of those
>> discussions at the time of [1].
>> 
>> I.e. now a client that only understands version N-1 will warn when
>> loading it, whereas it's only if a pre-v2.22.0 client (which has that
>> commit) reads the repository that we'd hard die on it, correct?
>> 
>> But speaking of hyper-focus. I think that arguably applies to you in
>> this case when considering the trade-offs of these sorts of format
>> changes :)
>> 
>> I.e. you're primarily considering cases of say a git server (presumably
>> running on GitHub) or another such deployment where it's easy to have
>> full control over all of your versions "in the wild".
>
> I'm thinking of servers, yes, but also 99% of clients who only upgrade
> (or _maybe_ downgrade to a recent, previous version occasionally).

*nod*

>> And thus a three-phase rollout of something like a format change can be
>> done in a timely and predictable manner.
>> 
>> But git is used by *a lot* of people in a bunch of different
>> scenarios. E.g.:
>> 
>>  * A shared (hopefully read-only) NFS mounted by remote "unmanaged" clients.
>>  * A tarred-up directory including a .git, which may be transferred to
>>    a machine with a pre-v2.22.0 version.
>> 
>> Or even softer cases of failure, such as:
>> 
>>  * A cronjob causes an alert/incident somewhere because the server 
>>    operator started writing a new version, but forgot about a set
>>    of machines that are still on the old version.
>
> It is important to continue supporting these cases, and this change does
> not cause any issues for them.

The issues in those cases will range from warnings on older versions
when loading the graph to errors if it's pre-v2.22.0, with the
performance benefits of v3 being out of reach of v2-only clients.

I think arguably that's OK/worth it, but it's not "not [any] issues", no?

> However, this handful of corner cases should not block progress in the
> main cases.

What progress would be blocked?

I'm only talking about whether we choose to consider a "new graph" to be an:

    <existing version number>
    <existing chunk name (old content, possibly empty)>
    <new chunk name (new content)>

v.s.:

    <old/new version number>
    <existing chunk name old/new (incompatible) content>

I.e. the "progress" this series is about is in getting the data locality
with smaller data with the new content.
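
To make the two layouts concrete, here is a minimal Python sketch of a toy
reader, not Git's actual parser. It assumes the documented header layout
(4-byte "CGPH" signature, then one byte each for file version, hash version,
chunk count and base-graph count) followed by 12-byte chunk table entries.
The chunk ID "XNEW" and all helper names are hypothetical, and the only
outcome modelled for an unknown file version is "do not use the file";
whether that surfaces as a warning or a hard error depends on the Git
version, as discussed above.

```python
import struct

SIGNATURE = b"CGPH"
SUPPORTED_VERSIONS = {1}  # what this toy reader understands
KNOWN_CHUNKS = {b"OIDF", b"OIDL", b"CDAT", b"GDAT", b"GDOV", b"EDGE"}


def build_header(version, chunk_ids):
    """Build a synthetic header plus chunk table (offsets are dummies)."""
    out = SIGNATURE + struct.pack(">BBBB", version, 1, len(chunk_ids), 0)
    for i, cid in enumerate(chunk_ids):
        out += cid + struct.pack(">Q", 1000 + i)
    # Terminating table entry with a zero chunk ID.
    out += b"\0\0\0\0" + struct.pack(">Q", 1000 + len(chunk_ids))
    return out


def probe_graph(data):
    """Return (usable, chunks_used, chunks_ignored) for the toy reader."""
    if data[:4] != SIGNATURE:
        return (False, [], [])
    version, _hash_ver, num_chunks, _num_base = struct.unpack(">BBBB", data[4:8])
    if version not in SUPPORTED_VERSIONS:
        # Version bump: the whole file is off limits to this reader.
        return (False, [], [])
    used, ignored = [], []
    for i in range(num_chunks):
        cid = data[8 + 12 * i : 12 + 12 * i]
        # New chunk name: skipped, the rest of the file stays usable.
        (used if cid in KNOWN_CHUNKS else ignored).append(cid)
    return (True, used, ignored)


new_chunk = probe_graph(build_header(1, [b"OIDF", b"OIDL", b"CDAT", b"XNEW"]))
new_version = probe_graph(build_header(2, [b"OIDF", b"OIDL", b"CDAT"]))
print(new_chunk)
print(new_version)
```

With the first layout the toy reader keeps using the chunks it knows and
skips the rest; with the second it gives up on the file entirely.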

But that's also possible to get with a very low amount of fixed-overhead.

Per the referenced e-mail, an "empty" commit-graph file was ~1k bytes in
2019 (I haven't re-checked). In terms of wasted space it's minuscule &
<1/4 of one FS page on Linux.

I'm not just trying to rehash the same points, I *think* the version
bump is just an aesthetic choice & we're not getting any performance
difference out of that.

But I'm not sure from the "block progress" etc., so maybe I'm still
missing something...

>> I think that even if it's less conceptually clean it's worth considering
>> bending over backwards to be kinder to such use-cases, unless it's really
>> required for other reasons to break such in-the-wild use-cases.
>> 
>> Or in this case, if it's thought to be worth it to help reviewers decide
>> by separating the performance improvement aspect from the changed
>> interaction between new graphs and older clients.
>> 
>> As a further nit on the proposed end-state here: Do I understand it
>> correctly that commitGraph.generationVersion=[1|2] (i.e. on current
>> "master") will always result in a file that's compatible with older
>> versions, since the only thing "v2" there controls now is to write the
>> optional GDAT and GDOV chunks?
>> 
>> Whereas going from commitGraph.generationVersion=2 to
>> commitGraph.generationVersion=3 in this series will impact older clients
>> as noted above, since we're bumping the version (of the file, to 2 if
>> the config is 3, which as Junio noted is a bit confusing).
>> 
>> I think if you're set on going down the path of bumping the top-level
>> version that deserves to be made much clearer in the added
>> documentation. Right now the only hint to that is a passing mention that
>> for v3:
>> 
>>     [it] will be incompatible with some old versions of Git
>> 
>> Which if we're opting for breaking format changes really should note
>> some of the caveats above, that pre-v2.22.0 hard-dies, and probably
>> describe "some old versions of Git" a bit more clearly.
>> 
>> It actually means once this gets released "the git version that was the
>> latest one you could download yesterday". Which a reader of the docs
>> probably won't expect when starting to play with this in a
>> mixed-version environment.
>> 
>> 1. https://lore.kernel.org/git/87h8acivkh.fsf@evledraar.gmail.com/
>
> This documentation could be altered to be specific about versions,
> but such a specific change makes assumptions of the version that will
> include it. As of now, the generation number v2 fixes will _probably_
> get in for 2.36 and the format change would have enough time to cook
> for 2.37, so I'll update the docs to refer to that version explicitly.

...

> The pre-2.22.0 change might be helpful to mention, but it could also be
> noise to the reader. We can revisit this when these patches are
> submitted again in another thread. There's also concern about third-
> party tools like libgit2. I'd rather draw the line as "tread carefully
> here" than "here is so much information that a reader might think it
> is all they need to know".

In terms of concern about libgit2 or any other implementation (which I
haven't looked at) isn't "tread carefully" to do it with new chunks if
possible, which we've done before with BIDX/BDAT, v.s. a version bump we
haven't done?

I'd think it wouldn't be an issue either way for any reader of the
format, and libgit2 is more specialized & won't have someone on RHEL6 or
whatever trying to inspect a random repo.

It just seems like a win-win to have a performance improvement with
smooth backwards compatibility v.s. without, if that's possible.


* Re: [PATCH 3/7] commit-graph: start parsing generation v2 (again)
  2022-02-28 18:44         ` Derrick Stolee
@ 2022-03-01  9:46           ` Patrick Steinhardt
  2022-03-01 10:35             ` Patrick Steinhardt
  0 siblings, 1 reply; 70+ messages in thread
From: Patrick Steinhardt @ 2022-03-01  9:46 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, gitster, abhishekkumar8222

On Mon, Feb 28, 2022 at 01:44:01PM -0500, Derrick Stolee wrote:
> On 2/28/2022 11:59 AM, Patrick Steinhardt wrote:
> > On Mon, Feb 28, 2022 at 11:23:38AM -0500, Derrick Stolee wrote:
> >> On 2/28/2022 10:18 AM, Patrick Steinhardt wrote:
> >>> I haven't yet found the time to dig deeper into why this is happening.
> >>> While the repository is publicly accessible at [1], unfortunately the
> >>> bug seems to be triggered by a commit that's only kept alive by an
> >>> internal reference.
> >>>
> >>> Patrick
> >>>
> >>> [1]: https://gitlab.com/gitlab-com/www-gitlab-com.git
> >>
> >> Thanks for including this information. Just to be clear: did you
> >> include patch 4 in your tests, or not? Patch 4 includes a fix
> >> related to overflow values, so it would be helpful to know if you
> >> found a _different_ bug or if it is the same one.
> >>
> >> Thanks,
> >> -Stolee
> > 
> > I initially only applied the first three patches, but after having hit
> > the fatal error I also applied the rest of this series to have a look at
> > whether it is indeed fixed already by one of your later patches. The
> > error remains the same though.
> 
> Thanks for this extra context. Is this a commit-graph that you wrote
> with the first three patches and then you get an error when reading it?
> 
> Do you get the same error when deleting that file and rewriting it with
> all patches included?
> 
> Thanks,
> -Stolee

Yes, I do. I've applied all four patches from v2 on top of 715d08a9e5
(The eighth batch, 2022-02-25) and still get the same results:

    $ find objects/info/commit-graphs/
    objects/info/commit-graphs/
    objects/info/commit-graphs/graph-607e641165f3e83a82d5b14af4e611bf2a688f35.graph
    objects/info/commit-graphs/commit-graph-chain
    objects/info/commit-graphs/graph-5f357c7573c0075d42d82b28e660bc3eac01bfe8.graph
    objects/info/commit-graphs/graph-e0c12ead1b61c7c30720ae372e8a9f98d95dfb2d.graph
    objects/info/commit-graphs/graph-c96723b133c2d81106a01ecd7a8773bb2ef6c2e1.graph

     $ git commit-graph verify
    fatal: commit-graph requires overflow generation data but has none

     $ git commit-graph write
    Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
    Expanding reachable commits in commit graph: 2197197, done.
    Finding extra edges in commit graph: 100% (2197197/2197197), done.
    fatal: commit-graph requires overflow generation data but has none

     $ rm -rf objects/info/commit-graphs/

     $ git commit-graph write
    Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
    Expanding reachable commits in commit graph: 2197197, done.
    Finding extra edges in commit graph: 100% (2197197/2197197), done.
    fatal: commit-graph requires overflow generation data but has none

So even generating them completely anew doesn't seem to generate the
overflow generation data.

Patrick


* Re: [PATCH 3/7] commit-graph: start parsing generation v2 (again)
  2022-03-01  9:46           ` Patrick Steinhardt
@ 2022-03-01 10:35             ` Patrick Steinhardt
  2022-03-01 14:06               ` Derrick Stolee
  0 siblings, 1 reply; 70+ messages in thread
From: Patrick Steinhardt @ 2022-03-01 10:35 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, gitster, abhishekkumar8222

On Tue, Mar 01, 2022 at 10:46:14AM +0100, Patrick Steinhardt wrote:
> On Mon, Feb 28, 2022 at 01:44:01PM -0500, Derrick Stolee wrote:
> > On 2/28/2022 11:59 AM, Patrick Steinhardt wrote:
> > > On Mon, Feb 28, 2022 at 11:23:38AM -0500, Derrick Stolee wrote:
> > >> On 2/28/2022 10:18 AM, Patrick Steinhardt wrote:
> > >>> I haven't yet found the time to dig deeper into why this is happening.
> > >>> While the repository is publicly accessible at [1], unfortunately the
> > >>> bug seems to be triggered by a commit that's only kept alive by an
> > >>> internal reference.
> > >>>
> > >>> Patrick
> > >>>
> > >>> [1]: https://gitlab.com/gitlab-com/www-gitlab-com.git
> > >>
> > >> Thanks for including this information. Just to be clear: did you
> > >> include patch 4 in your tests, or not? Patch 4 includes a fix
> > >> related to overflow values, so it would be helpful to know if you
> > >> found a _different_ bug or if it is the same one.
> > >>
> > >> Thanks,
> > >> -Stolee
> > > 
> > > I initially only applied the first three patches, but after having hit
> > > the fatal error I also applied the rest of this series to have a look at
> > > whether it is indeed fixed already by one of your later patches. The
> > > error remains the same though.
> > 
> > Thanks for this extra context. Is this a commit-graph that you wrote
> > with the first three patches and then you get an error when reading it?
> > 
> > Do you get the same error when deleting that file and rewriting it with
> > all patches included?
> > 
> > Thanks,
> > -Stolee
> 
> Yes, I do. I've applied all four patches from v2 on top of 715d08a9e5
> (The eighth batch, 2022-02-25) and still get the same results:
> 
>     $ find objects/info/commit-graphs/
>     objects/info/commit-graphs/
>     objects/info/commit-graphs/graph-607e641165f3e83a82d5b14af4e611bf2a688f35.graph
>     objects/info/commit-graphs/commit-graph-chain
>     objects/info/commit-graphs/graph-5f357c7573c0075d42d82b28e660bc3eac01bfe8.graph
>     objects/info/commit-graphs/graph-e0c12ead1b61c7c30720ae372e8a9f98d95dfb2d.graph
>     objects/info/commit-graphs/graph-c96723b133c2d81106a01ecd7a8773bb2ef6c2e1.graph
> 
>      $ git commit-graph verify
>     fatal: commit-graph requires overflow generation data but has none
> 
>      $ git commit-graph write
>     Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
>     Expanding reachable commits in commit graph: 2197197, done.
>     Finding extra edges in commit graph: 100% (2197197/2197197), done.
>     fatal: commit-graph requires overflow generation data but has none
> 
>      $ rm -rf objects/info/commit-graphs/
> 
>      $ git commit-graph write
>     Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
>     Expanding reachable commits in commit graph: 2197197, done.
>     Finding extra edges in commit graph: 100% (2197197/2197197), done.
>     fatal: commit-graph requires overflow generation data but has none
> 
> So even generating them completely anew doesn't seem to generate the
> overflow generation data.
> 
> Patrick

I stand corrected. I forgot that the repository at hand was connected to
another one via `objects/info/alternates`. If I prune commit-graphs from
that alternate, too, then it works alright with your patches.
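
As a minimal sketch of why the alternate mattered here, the Python below
lists every commit-graph location visible from an object directory,
following `objects/info/alternates`. The helper names are hypothetical, and
the sketch assumes that relative alternates entries are resolved against
the object directory and that lines starting with '#' are comments. It only
lists paths; it does not delete anything.

```python
import os


def object_dirs(objdir):
    """Yield objdir and every object directory reachable via alternates."""
    seen = set()
    stack = [os.path.abspath(objdir)]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        yield current
        alt_file = os.path.join(current, "info", "alternates")
        if not os.path.isfile(alt_file):
            continue
        with open(alt_file) as fh:
            for line in fh:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                # Assumption: relative entries are relative to the object dir.
                stack.append(os.path.normpath(os.path.join(current, line)))


def commit_graph_paths(objdir):
    """Return existing commit-graph files/directories, alternates included."""
    found = []
    for d in object_dirs(objdir):
        for name in ("commit-graph", "commit-graphs"):
            path = os.path.join(d, "info", name)
            if os.path.exists(path):
                found.append(path)
    return found
```

Calling `commit_graph_paths("objects")` from inside a bare repository would
then also report the graphs of the alternate, which are the ones that were
missed in the first attempt above.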

This makes me wonder how such a bugfix should be handled though. As this
series is right now, users will be faced with repository corruption as
soon as they upgrade their Git version to one that contains this patch
series. This corruption needs manual action: they have to go into the
repository, delete the commit-graphs and then optionally create new
ones.

This is not a good user experience, and it's worse on the server-side
where we now have a timeframe where all commit-graphs are potentially
corrupt. This effectively leads to us being unable to serve those repos
at all until we have rewritten the commit-graphs because all commands
which make use of the commit-graph will now die:

    $ git log
    fatal: commit-graph requires overflow generation data but has none

So the question is whether this is a change that needs to be rolled out
over multiple releases. First we'd get in the bug fix such that we write
correct commit-graphs, and after this fix has been released we can also
release the fix that starts to actually parse the generation. This
ensures there's a grace period during which we can hopefully correct the
data on-disk such that users are not faced with failures.

The better alternative would probably be to just gracefully handle
commit-graphs which are corrupted in such a way. Can we maybe just
continue to not parse generations in case we find that the commit-graph
doesn't have overflow generation data?

This is more of a general issue though: commit-graphs are an auxiliary
cache that is not required for proper operation at all. If we fail to
parse it, then Git shouldn't die but instead fail gracefully and just
ignore it. Furthermore, if we notice that graphs are corrupt when we try to
write new ones, we may just delete the corrupt versions automatically
and generate completely new ones.

Patrick


* Re: [PATCH 3/7] commit-graph: start parsing generation v2 (again)
  2022-03-01 10:35             ` Patrick Steinhardt
@ 2022-03-01 14:06               ` Derrick Stolee
  2022-03-01 14:53                 ` Patrick Steinhardt
  0 siblings, 1 reply; 70+ messages in thread
From: Derrick Stolee @ 2022-03-01 14:06 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: Derrick Stolee via GitGitGadget, git, me, gitster, abhishekkumar8222

On 3/1/2022 5:35 AM, Patrick Steinhardt wrote:
> On Tue, Mar 01, 2022 at 10:46:14AM +0100, Patrick Steinhardt wrote:
>> On Mon, Feb 28, 2022 at 01:44:01PM -0500, Derrick Stolee wrote:
>>> On 2/28/2022 11:59 AM, Patrick Steinhardt wrote:
>>>> On Mon, Feb 28, 2022 at 11:23:38AM -0500, Derrick Stolee wrote:
>>>>> On 2/28/2022 10:18 AM, Patrick Steinhardt wrote:
>>>>>> I haven't yet found the time to dig deeper into why this is happening.
>>>>>> While the repository is publicly accessible at [1], unfortunately the
>>>>>> bug seems to be triggered by a commit that's only kept alive by an
>>>>>> internal reference.
>>>>>>
>>>>>> Patrick
>>>>>>
>>>>>> [1]: https://gitlab.com/gitlab-com/www-gitlab-com.git
>>>>>
>>>>> Thanks for including this information. Just to be clear: did you
>>>>> include patch 4 in your tests, or not? Patch 4 includes a fix
>>>>> related to overflow values, so it would be helpful to know if you
>>>>> found a _different_ bug or if it is the same one.
>>>>>
>>>>> Thanks,
>>>>> -Stolee
>>>>
>>>> I initially only applied the first three patches, but after having hit
>>>> the fatal error I also applied the rest of this series to have a look at
>>>> whether it is indeed fixed already by one of your later patches. The
>>>> error remains the same though.
>>>
>>> Thanks for this extra context. Is this a commit-graph that you wrote
>>> with the first three patches and then you get an error when reading it?
>>>
>>> Do you get the same error when deleting that file and rewriting it with
>>> all patches included?
>>>
>>> Thanks,
>>> -Stolee
>>
>> Yes, I do. I've applied all four patches from v2 on top of 715d08a9e5
>> (The eighth batch, 2022-02-25) and still get the same results:
>>
>>     $ find objects/info/commit-graphs/
>>     objects/info/commit-graphs/
>>     objects/info/commit-graphs/graph-607e641165f3e83a82d5b14af4e611bf2a688f35.graph
>>     objects/info/commit-graphs/commit-graph-chain
>>     objects/info/commit-graphs/graph-5f357c7573c0075d42d82b28e660bc3eac01bfe8.graph
>>     objects/info/commit-graphs/graph-e0c12ead1b61c7c30720ae372e8a9f98d95dfb2d.graph
>>     objects/info/commit-graphs/graph-c96723b133c2d81106a01ecd7a8773bb2ef6c2e1.graph
>>
>>      $ git commit-graph verify
>>     fatal: commit-graph requires overflow generation data but has none
>>
>>      $ git commit-graph write
>>     Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
>>     Expanding reachable commits in commit graph: 2197197, done.
>>     Finding extra edges in commit graph: 100% (2197197/2197197), done.
>>     fatal: commit-graph requires overflow generation data but has none
>>
>>      $ rm -rf objects/info/commit-graphs/
>>
>>      $ git commit-graph write
>>     Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
>>     Expanding reachable commits in commit graph: 2197197, done.
>>     Finding extra edges in commit graph: 100% (2197197/2197197), done.
>>     fatal: commit-graph requires overflow generation data but has none
>>
>> So even generating them completely anew doesn't seem to generate the
>> overflow generation data.
>>
>> Patrick
> 
> I stand corrected. I forgot that the repository at hand was connected to
> another one via `objects/info/alternates`. If I prune commit-graphs from
> that alternate, too, then it works alright with your patches.

OK, thanks. That clarifies the situation.

I ordered the patches such that the fix in patch 4 could be immediately
testable, which is not the case without patch 3. However, it does leave
this temporary state where information can be incorrect if only a subset
of the series is applied.

> This makes me wonder how such a bugfix should be handled though. As this
> series is right now, users will be faced with repository corruption as
> soon as they upgrade their Git version to one that contains this patch
> series. This corruption needs manual action: they have to go into the
> repository, delete the commit-graphs and then optionally create new
> ones.
> 
> This is not a good user experience, and it's worse on the server-side
> where we now have a timeframe where all commit-graphs are potentially
> corrupt. This effectively leads to us being unable to serve those repos
> at all until we have rewritten the commit-graphs because all commands
> which make use of the commit-graph will now die:
> 
>     $ git log
>     fatal: commit-graph requires overflow generation data but has none
> 
> So the question is whether this is a change that needs to be rolled out
> over multiple releases. First we'd get in the bug fix such that we write
> correct commit-graphs, and after this fix has been released we can also
> release the fix that starts to actually parse the generation. This
> ensures there's a grace period during which we can hopefully correct the
> data on-disk such that users are not faced with failures.

You are right that we need to be careful here, but I also think that
previous versions of Git always wrote the correct data. Here is my
thought process:

1. To get this bug, we need to have parsed the corrected commit date
   from an existing commit-graph in order to under-count the number
   of overflow values.

2. Before this series, Git versions were not parsing the corrected
   commit date, so they recompute the corrected commit date every
   time the commit-graph is written, getting the proper count of
   overflow values.

For these reasons, data written by previous versions of Git are
correct and can be trusted without a staged release.
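
A toy Python model of this reasoning, under stated assumptions: the
corrected commit date of a commit is the maximum of its own commit date and
one more than the largest corrected commit date among its parents; the
value stored per commit is the offset from the commit date; and offsets
above 2^31 - 1 need an entry in the overflow chunk. The `skip` parameter is
a hypothetical stand-in for "this commit's value was already parsed from an
existing file", which is one plausible reading of how the under-count in
point 1 arises. It is not Git's code.

```python
OFFSET_MAX = (1 << 31) - 1  # assumed limit: larger offsets need the overflow chunk


def corrected_commit_dates(commits):
    """commits maps name -> (commit_date, parents), listed parents-first."""
    corrected = {}
    for name, (date, parents) in commits.items():
        floor = max((corrected[p] + 1 for p in parents), default=0)
        corrected[name] = max(date, floor)
    return corrected


def count_overflows(commits, corrected, skip=()):
    """Count commits whose offset exceeds OFFSET_MAX, ignoring names in skip."""
    return sum(
        1
        for name, (date, _parents) in commits.items()
        if name not in skip and corrected[name] - date > OFFSET_MAX
    )


# Toy history: A has a badly skewed (far-future) date; B and C descend from it.
commits = {
    "A": (1 << 33, []),
    "B": (1000, ["A"]),
    "C": (2000, ["B"]),
    "D": (5000, []),
}
corrected = corrected_commit_dates(commits)

# Recomputing everything (point 2) counts both B and C.
full_count = count_overflows(commits, corrected)
# Skipping a commit whose value was already parsed (point 1) under-counts.
under_count = count_overflows(commits, corrected, skip={"B"})
print(full_count, under_count)
```

In this model a writer that recomputes every value always sees the full
count, which matches the argument that files written before this series
carry correct overflow data.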

Does this make sense? Or, do you experience a different result when
you build commit-graphs with a released Git version and then when
writing on top with all patches applied?

> The better alternative would probably be to just gracefully handle
> commit-graphs which are corrupted in such a way. Can we maybe just
> continue to not parse generations in case we find that the commit-graph
> doesn't have overflow generation data?
>
> This is more of a general issue though: commit-graphs are an auxiliary
> cache that is not required for proper operation at all. If we fail to
> parse it, then Git shouldn't die but instead fail gracefully and just
> ignore it. Furthermore, if we notice that graphs are corrupt when we try to
> write new ones, we may just delete the corrupt versions automatically
> and generate completely new ones.

You are right that we can be better about failures here and report
an error instead of calling die(). Especially in this case, we could just
revert to topological levels instead of throwing out the commit-graph
entirely.

This seems like something for another series, so we can be sure to
audit all cases of fatal errors when parsing the commit-graph so we
catch all of them and do the "best" thing in each case.
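
The fallback could look roughly like the Python sketch below. It is only an
illustration of the decision, with hypothetical names and flags; nothing
like it exists in the series under discussion.

```python
def pick_generation_source(has_gdat, needs_overflow, has_gdov):
    """Return (source, warning) for a reader that never dies on this case."""
    if not has_gdat:
        # No corrected commit dates stored at all: use topological levels.
        return ("topological-level", None)
    if needs_overflow and not has_gdov:
        # Today this is fatal; the tolerant variant warns and degrades.
        return (
            "topological-level",
            "commit-graph requires overflow generation data but has none; "
            "falling back to topological levels",
        )
    return ("corrected-commit-date", None)


source, warning = pick_generation_source(
    has_gdat=True, needs_overflow=True, has_gdov=False
)
print(source)
```

The point of returning a warning instead of raising is that the
commit-graph stays usable for everything that does not depend on the
missing overflow data.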

Thanks,
-Stolee


* Re: [PATCH 5/7] commit-graph: document file format v2
  2022-02-28 21:14           ` Ævar Arnfjörð Bjarmason
@ 2022-03-01 14:19             ` Derrick Stolee
  2022-03-01 14:29               ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 70+ messages in thread
From: Derrick Stolee @ 2022-03-01 14:19 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Derrick Stolee via GitGitGadget, git, me, gitster, abhishekkumar8222

On 2/28/2022 4:14 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Mon, Feb 28 2022, Derrick Stolee wrote:
> 
>> On 2/28/2022 9:27 AM, Ævar Arnfjörð Bjarmason wrote:
>>>
>>> On Mon, Feb 28 2022, Derrick Stolee wrote:
>>>
>>>> On 2/25/2022 5:31 PM, Ævar Arnfjörð Bjarmason wrote:
>>
>>>>> Or maybe they won't. I just found it surprising when reviewing this to
>>>>> not find an answer to why that approach wasn't
>>>>> considered.
>>>>
>>>> The point is to create a new format that can be chosen when deployed
>>>> in an environment where older Git versions will not exist (such as
>>>> a Git server). The new version is not chosen by default and instead
>>>> is opt-in through the commitGraph.generationVersion config option.
>>>>
>>>> Perhaps in a year or two we would consider making this the new
>>>> default, but there is no rush to do so.
>>>
>>> Looking into this a bit more I think that in either case this is less of
>>> a big deal after my 43d35618055 (commit-graph write: don't die if the
>>> existing graph is corrupt, 2019-03-25), which came out of some of those
>>> discussions at the time of [1].
>>>
>>> I.e. now a client that only understands version N-1 will warn when
>>> loading it, whereas it's only if a pre-v2.22.0 client (which has that
>>> commit) reads the repository that we'd hard die on it, correct?
>>>
>>> But speaking of hyper-focus. I think that arguably applies to you in
>>> this case when considering the trade-offs of these sorts of format
>>> changes :)
>>>
>>> I.e. you're primarily considering cases of say a git server (presumably
>>> running on GitHub) or another such deployment where it's easy to have
>>> full control over all of your versions "in the wild".
>>
>> I'm thinking of servers, yes, but also 99% of clients who only upgrade
>> (or _maybe_ downgrade to a recent, previous version occasionally).
> 
> *nod*
> 
>>> And thus a three-phase rollout of something like a format change can be
>>> done in a timely and predictable manner.
>>>
>>> But git is used by *a lot* of people in a bunch of different
>>> scenarios. E.g.:
>>>
>>>  * A shared (hopefully read-only) NFS mounted by remote "unmanaged" clients.
>>>  * A tarred-up directory including a .git, which may be transferred to
>>>    a machine with a pre-v2.22.0 version.
>>>
>>> Or even softer cases of failure, such as:
>>>
>>>  * A cronjob causes an alert/incident somewhere because the server 
>>>    operator started writing a new version, but forgot about a set
>>>    of machines that are still on the old version.
>>
>> It is important to continue supporting these cases, and this change does
>> not cause any issues for them.
> 
> The issues in those cases will range from warnings on older versions
> when loading the graph to errors if it's pre-v2.22.0, with the
> performance benefits of v3 being out of reach of v2-only clients.
> 
> I think that's arguably OK/worth it, but it's not "not [any] issues", no?

What I mean is that this change does not enable the new graph version
by default, so these users do not have any issues unless someone opts
in to the feature while in this mixed scenario.

>> However, this handful of corner cases should not block progress in the
>> main cases.
> 
> What progress would be blocked?
> 
> I'm only talking about whether we choose to consider a "new graph" to be an:
> 
>     <existing version number>
>     <existing chunk name (old content, possibly empty)>
>     <new chunk name (new content)>
> 
> v.s.:
> 
>     <old/new version number>
>     <existing chunk name old/new (incompatible) content>
> 
> I.e. the "progress" this series is about is in getting the data locality
> with smaller data with the new content.
> 
> But that's also possible to get with a very low amount of fixed-overhead.
> 
> Per the referenced e-mail, an "empty" commit-graph file was ~1k bytes in
> 2019 (I haven't re-checked). In terms of wasted space it's minuscule &
> <1/4 of one FS page on Linux.

If you're talking "empty" data, then you need to have an empty Commit
Data chunk _and_ an empty OID Lookup chunk in order to not have
breakage. So you'd need duplicate versions of these chunks for the
new "Commit Data 2" chunk. Then we'd need special-casing for all of this
during parsing, which is unnecessary complexity.

Finally, the end result becomes "older versions get slower without
any warning" instead of "older versions get a message about not
understanding the commit-graph file".
 
> I'm not just trying to rehash the same points, I *think* the version
> bump is just an aesthetic choice & we're not getting any performance
> difference out of that.
> 
> But I'm not sure from the "block progress" etc., so maybe I'm still
> missing something...

The fact that we have a Generation Data chunk instead of already
bumping the file format version number is already a concession to
this concern about backwards compatibility.

With the point above about empty Commit Data Chunks, the only way
to properly conserve backwards compatibility is to have a full
Commit Data Chunk as well as a second copy that contains the new
offsets instead of topological levels. This is wasteful.
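
For a rough sense of scale, here is a small Python calculation. It assumes
the documented chunk sizes of (hash length + 16) bytes per commit for the
Commit Data chunk and 4 bytes per commit for the Generation Data chunk,
SHA-1 hashes (20 bytes), and borrows the commit count from the
"Expanding reachable commits" output quoted elsewhere in this thread. The
function name is hypothetical.

```python
def chunk_sizes(num_commits, hash_len=20):
    """Return (commit_data_bytes, generation_data_bytes) for one graph."""
    commit_data = num_commits * (hash_len + 16)
    generation_data = num_commits * 4
    return commit_data, generation_data


cdat_bytes, gdat_bytes = chunk_sizes(2197197)
print(cdat_bytes, gdat_bytes)  # 79099092 8788788
```

Under these assumptions, dropping the Generation Data chunk saves about
8.8 MB for that repository, while keeping a second full copy of the Commit
Data chunk for compatibility would add about 79 MB.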

>>> I think that even if it's less conceptually clean it's worth considering
>>> bending over backwards to be kinder to such use-cases, unless it's really
>>> required for other reasons to break such in-the-wild use-cases.
>>>
>>> Or in this case, if it's thought to be worth it to help reviewers decide
>>> by separating the performance improvement aspect from the changed
>>> interaction between new graphs and older clients.
>>>
>>> As a further nit on the proposed end-state here: Do I understand it
>>> correctly that commitGraph.generationVersion=[1|2] (i.e. on current
>>> "master") will always result in a file that's compatible with older
>>> versions, since the only thing "v2" there controls now is to write the
>>> optional GDAT and GDOV chunks?
>>>
>>> Whereas going from commitGraph.generationVersion=2 to
>>> commitGraph.generationVersion=3 in this series will impact older clients
>>> as noted above, since we're bumping the version (of the file, to 2 if
>>> the config is 3, which as Junio noted is a bit confusing).
>>>
>>> I think if you're set on going down the path of bumping the top-level
>>> version that deserves to be made much clearer in the added
>>> documentation. Right now the only hint to that is a passing mention that
>>> for v3:
>>>
>>>     [it] will be incompatible with some old versions of Git
>>>
>>> Which if we're opting for breaking format changes really should note
>>> some of the caveats above, that pre-v2.22.0 hard-dies, and probably
>>> describe "some old versions of Git" a bit more clearly.
>>>
>>> It actually means once this gets released "the git version that was the
>>> latest one you could download yesterday". Which a reader of the docs
>>> probably won't expect when starting to play with this in a
>>> mixed-version environment.
>>>
>>> 1. https://lore.kernel.org/git/87h8acivkh.fsf@evledraar.gmail.com/
>>
>> This documentation could be altered to be specific about versions,
>> but such a specific change makes assumptions of the version that will
>> include it. As of now, the generation number v2 fixes will _probably_
>> get in for 2.36 and the format change would have enough time to cook
>> for 2.37, so I'll update the docs to refer to that version explicitly.
> 
> ...
> 
>> The pre-2.22.0 change might be helpful to mention, but it could also be
>> noise to the reader. We can revisit this when these patches are
>> submitted again in another thread. There's also concern about third-
>> party tools like libgit2. I'd rather draw the line as "tread carefully
>> here" than "here is so much information that a reader might think it
>> is all they need to know".
> 
> In terms of concern about libgit2 or any other implementation (which I
> haven't looked at) isn't "tread carefully" to do it with new chunks if
> possible, which we've done before with BIDX/BDAT, v.s. a version bump we
> haven't done?

New chunks adding new information is part of the design. Changing
the location of existing data is new here.

> I'd think it wouldn't be an issue either way for any reader of the
> format, and libgit2 is more specialized & won't have someone on RHEL6 or
> whatever trying to inspect a random repo.
> 
> It just seems like a win-win to have a performance improvement with
> smooth backwards compatibility v.s. without, if that's possible.

You are right that it is _possible_, but I don't think that the
side-effects are worth it. Those being:

* "Empty CDAT Chunk": Silently slowing down older clients.
* "Duplicate CDAT Chunk": Wasted data.

Finally, I want to reiterate that by making this opt-in, users make
the call about whether or not they are in a scenario where this
compatibility issue is appropriate for them. This includes waiting
to see if third-party tools like libgit2 are updated to understand
this version.

Thanks,
-Stolee


* Re: [PATCH 5/7] commit-graph: document file format v2
  2022-03-01 14:19             ` Derrick Stolee
@ 2022-03-01 14:29               ` Ævar Arnfjörð Bjarmason
  2022-03-01 15:59                 ` Derrick Stolee
  0 siblings, 1 reply; 70+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-03-01 14:29 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, gitster, abhishekkumar8222


On Tue, Mar 01 2022, Derrick Stolee wrote:

> On 2/28/2022 4:14 PM, Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Mon, Feb 28 2022, Derrick Stolee wrote:
>> 
>>> On 2/28/2022 9:27 AM, Ævar Arnfjörð Bjarmason wrote:
>>>>
>>>> On Mon, Feb 28 2022, Derrick Stolee wrote:
>>>>
>>>>> On 2/25/2022 5:31 PM, Ævar Arnfjörð Bjarmason wrote:
>>>
>>>>>> Or maybe they won't. I just found it surprising when reviewing this to
>>>>>> not find an answer to why that approach wasn't
>>>>>> considered.
>>>>>
>>>>> The point is to create a new format that can be chosen when deployed
>>>>> in an environment where older Git versions will not exist (such as
>>>>> a Git server). The new version is not chosen by default and instead
>>>>> is opt-in through the commitGraph.generationVersion config option.
>>>>>
>>>>> Perhaps in a year or two we would consider making this the new
>>>>> default, but there is no rush to do so.
>>>>
>>>> Looking into this a bit more I think that in either case this is less of
>>>> a big deal after my 43d35618055 (commit-graph write: don't die if the
>>>> existing graph is corrupt, 2019-03-25), which came out of some of those
>>>> discussions at the time of [1].
>>>>
>>>> I.e. now a client that only understands version N-1 will warn when
>>>> loading it, whereas it's only if a pre-v2.22.0 client (which has that
>>>> commit) reads the repository that we'd hard die on it, correct?
>>>>
>>>> But speaking of hyper-focus. I think that arguably applies to you in
>>>> this case when considering the trade-offs of these sorts of format
>>>> changes :)
>>>>
>>>> I.e. you're primarily considering cases of say a git server (presumably
>>>> running on GitHub) or another such deployment where it's easy to have
>>>> full control over all of your versions "in the wild".
>>>
>>> I'm thinking of servers, yes, but also 99% of clients who only upgrade
>>> (or _maybe_ downgrade to a recent, previous version occasionally).
>> 
>> *nod*
>> 
>>>> And thus a three-phase rollout of something like a format change can be
>>>> done in a timely and predictable manner.
>>>>
>>>> But git is used by *a lot* of people in a bunch of different
>>>> scenarios. E.g.:
>>>>
>>>>  * A shared (hopefully read-only) NFS mounted by remote "unmanaged" clients.
>>>>  * A tarred-up directory including a .git, which may be transferred to
>>>>    a machine with a pre-v2.22.0 version.
>>>>
>>>> Or even softer cases of failure, such as:
>>>>
>>>>  * A cronjob causes an alert/incident somewhere because the server 
>>>>    operator started writing a new version, but forgot about a set
>>>>    of machines that are still on the old version.
>>>
>>> It is important to continue supporting these cases, and this change does
>>> not cause any issues for them.
>> 
>> The issues in those cases will range from warnings on older versions
>> when loading the graph to errors if it's pre-v2.22.0, with the
>> performance benefits of v3 being out of reach of v2-only clients.
>> 
>> I think arguably that's OK/worth it, but it's "not [any] issues", no?
>
> What I mean is that this change does not enable the new graph version
> by default, so these users do not have any issues unless someone opts
> in to the feature while in this mixed scenario.

Indeed. FWIW I wasn't confused about that bit. I'm just commenting on
/how/ we do version upgrades, and if we can save users unnecessary
hassle relatively easily.

But I also think the writing is on the wall that you'll want to
(understandably) bump the default sooner rather than later, or if not for
this data then for other future chunks.

>>> However, this handful of corner cases should not block progress in the
>>> main cases.
>> 
>> What progress would be blocked?
>> 
>> I'm only talking about whether we choose to consider a "new graph" to be an:
>> 
>>     <existing version number>
>>     <existing chunk name (old content, possibly empty)>
>>     <new chunk name (new content)>
>> 
>> v.s.:
>> 
>>     <old/new version number>
>>     <existing chunk name old/new (incompatible) content>
>> 
>> I.e. the "progress" this series is about is in getting the data locality
>> with smaller data with the new content.
>> 
>> But that's also possible to get with a very low amount of fixed-overhead.
>> 
>> Per the referenced E-Mail an "empty" commit-graph file was ~1k bytes in
>> 2019, I haven't re-checked. In terms of wasted space it's minuscule &
>> <1/4 of one FS page on Linux.
>
> If you're talking "empty" data, then you need to have an empty Commit
> Data chunk _and_ an empty OID Lookup chunk in order to not have
> breakage. So you'd need duplicate versions of these chunks for the
> new "Commit Data 2" chunk. Then we need special-casing for all of this
> during parsing that is unnecessary complexity.

Why does it need to be special-cased? Don't we just call pair_chunk() on
the new chunk name, and if it doesn't exist fall back on the old
chunk. We'll then note what format we're parsing, just as this series
does.
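
As a rough illustration of that fallback (a toy sketch only, not code
from this series): the table-of-contents struct, find_chunk(),
pick_commit_data() and the "CDA2" chunk ID below are all made up for the
example, and none of this uses Git's real chunk-format API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified table-of-contents entry: a 4-byte ID plus data. */
struct toc_entry {
	uint32_t id;
	const unsigned char *data;
	size_t size;
};

#define CHUNK_ID(a, b, c, d) \
	(((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
	 ((uint32_t)(c) << 8) | (uint32_t)(d))

#define CHUNK_CDAT CHUNK_ID('C', 'D', 'A', 'T') /* existing Commit Data chunk */
#define CHUNK_CDA2 CHUNK_ID('C', 'D', 'A', '2') /* made-up ID for the new layout */

/* Return the entry with the given ID, or NULL if the file lacks it. */
static const struct toc_entry *find_chunk(const struct toc_entry *toc,
					  size_t nr, uint32_t id)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (toc[i].id == id)
			return &toc[i];
	return NULL;
}

/*
 * Prefer the new chunk and fall back to the old one. *format records
 * which layout was found: 2 = new, 1 = old, 0 = neither.
 */
static const struct toc_entry *pick_commit_data(const struct toc_entry *toc,
						size_t nr, int *format)
{
	const struct toc_entry *e = find_chunk(toc, nr, CHUNK_CDA2);

	if (e) {
		*format = 2;
		return e;
	}
	e = find_chunk(toc, nr, CHUNK_CDAT);
	*format = e ? 1 : 0;
	return e;
}
```

The caller would then branch on "format" when interpreting each entry,
which is the "note what format we're parsing" step.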

> Finally, the end result becomes "older versions get slower without
> any warning" instead of "older versions get a message about not
> understanding the commit-graph file".

Sure, IF you want to write such an empty chunk. The point is that you
now have the option.

And this is the same edge case we already dealt with for
GDAT/GDOV. I.e. older readers who didn't understand it would be slower.

We can still have a feature to make older clients intentionally
break/warn, it seems to me that if you'd want such a thing you'd want it
aside from the specific mechanism of this proposed upgrade.

Or you could dual-write the data for older clients, which I think
probably isn't worth the hassle.

I.e. if you're worried about silent slowdown, older clients happily
ignoring the BIDX and BDAT chunks are silently slower.

>> I'm not just trying to rehash the same points, I *think* the version
>> bump is just an aesthetic choice & we're not getting any performance
>> difference out of that.
>> 
>> But I'm not sure from the "block progress" etc., so maybe I'm still
>> missing something...
>
> The fact that we have a Generation Data chunk instead of already
> bumping the file format version number is already a concession to
> this concern about backwards compatibility.

Sure, but not taking that version bumping route in the past shouldn't
bias us towards doing it now, should it?

> With the point above about empty Commit Data Chunks, the only way
> to properly conserve backwards compatibility is to have a full
> Commit Data Chunk as well as a second copy that contains the new
> offsets instead of topological levels. This is wasteful.

Empty chunks would be a handful of bytes, and not produce those
errors/warnings, and AFAICT without any downsides.

But that's me assuming that the overlap between people for whom the
commit-graph is critical for performance and those using wildly
different versions is pretty much zero.

The reason I mentioned it at all initially in
https://lore.kernel.org/git/220228.86pmn73toq.gmgdl@evledraar.gmail.com/
was in reference to trying to understand the context of the performance
gains, i.e. whether they'd be equivalent with a new chunk or dual-write
data.

>>>> I think that even if it's less conceptually clean it's worth considering
>>>> bending over backwards to be kinder to such use-cases, unless it's really
>>>> required for other reasons to break such in-the-wild use-cases.
>>>>
>>>> Or in this case, if it's thought to be worth it to help reviewers decide
>>>> by separating the performance improvement aspect from the changed
>>>> interaction between new graphs and older clients.
>>>>
>>>> As a further nit on the proposed end-state here: Do I understand it
>>>> correctly that commitGraph.generationVersion=[1|2] (i.e. on current
>>>> "master") will always result in a file that's compatible with older
>>>> versions, since the only thing "v2" there controls now is to write the
>>>> optional GDAT and GDOV chunks?
>>>>
>>>> Whereas going from commitGraph.generationVersion=2 to
>>>> commitGraph.generationVersion=3 in this series will impact older clients
>>>> as noted above, since we're bumping the version (of the file, to 2 if
>>>> the config is 3, which as Junio noted is a bit confusing).
>>>>
>>>> I think if you're set on going down the path of bumping the top-level
>>>> version that deserves to be made much clearer in the added
>>>> documentation. Right now the only hint to that is a passing mention that
>>>> for v3:
>>>>
>>>>     [it] will be incompatible with some old versions of Git
>>>>
>>>> Which if we're opting for breaking format changes really should note
>>>> some of the caveats above, that pre-v2.22.0 hard-dies, and probably
>>>> describe "some old versions of Git" a bit more clearly.
>>>>
>>>> It actually means once this gets released "the git version that was the
>>>> latest one you could download yesterday". Which a reader of the docs
>>>> probably won't expect when starting to play with this in mixed-version
>>>> environment.
>>>>
>>>> 1. https://lore.kernel.org/git/87h8acivkh.fsf@evledraar.gmail.com/
>>>
>>> This documentation could be altered to be specific about versions,
>>> but such a specific change makes assumptions of the version that will
>>> include it. As of now, the generation number v2 fixes will _probably_
>>> get in for 2.36 and the format change would have enough time to cook
>>> for 2.37, so I'll update the docs to refer to that version explicitly.
>> 
>> ...
>> 
>>> The pre-2.22.0 change might be helpful to mention, but it could also be
>>> noise to the reader. We can revisit this when these patches are
>>> submitted again in another thread. There's also concern about third-
>>> party tools like libgit2. I'd rather draw the line as "tread carefully
>>> here" than "here is so much information that a reader might think it
>>> is all they need to know".
>> 
>> In terms of concern about libgit2 or any other implementation (which I
>> haven't looked at) isn't "tread carefully" to do it with new chunks if
>> possible, which we've done before with BIDX/BDAT, v.s. a version bump we
>> haven't done?
>
> New chunks adding new information is part of the design. Changing
> the location of existing data is new here.

We've never bumped the top-level version number, and hard dying on "git
status" or whatever was also part of the initial design :)

I think it's legitimate to ask/argue that these version number bumps are
something we should be reserving for truly incompatible format bumps,
v.s. "new indexes".

I.e. this case is similar to us having a SQL database with N tables, and
we'd like to add a new table or index.

We could have a central "schema_version", or we could just add a new
table in a backwards-compatible way. Older clients read older
data/tables, which are possibly empty.

The commit-graph is essentially such a key-value store when it comes to
top-level chunks, and we're already making decisions on what data to
load/use based on chunk existence.

>> I'd think it wouldn't be an issue either way for any reader of the
>> format, and libgit2 is more specialized & won't have someone on RHEL6 or
>> whatever trying to inspect a random repo.
>> 
>> It just seems like a win-win to have a performance improvement with
>> smooth backwards compatibility v.s. without, if that's possible.
>
> You are right that it is _possible_, but I don't think that the
> side-effects are worth it. Those being:
>
> * "Empty CDAT Chunk": Silently slowing down older clients.
> * "Duplicate CDAT Chunk": Wasted data.
>
> Finally, I want to reiterate that by making this opt-in, users make
> the call about whether or not they are in a scenario where this
> compatibility issue is appropriate for them. This includes waiting
> to see if third-party tools like libgit2 are updated to understand
> this version.

I think we're probably not getting any further with this back & forth.
It's certainly been interesting, and thanks a lot for your patience and
time.

I'll try to look at some of this once the "prep" patches land.

If you end up doing this via the version route I'm not going to
strongly object to it or anything, I was just trying to review this to
see if I understood the trade-offs & constraints involved. In particular
with reference to those changes in 2019 that I did to make
format/corruption/version transitions non-fatal.

Thanks!

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 3/7] commit-graph: start parsing generation v2 (again)
  2022-03-01 14:06               ` Derrick Stolee
@ 2022-03-01 14:53                 ` Patrick Steinhardt
  2022-03-01 15:25                   ` Derrick Stolee
  0 siblings, 1 reply; 70+ messages in thread
From: Patrick Steinhardt @ 2022-03-01 14:53 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, gitster, abhishekkumar8222

[-- Attachment #1: Type: text/plain, Size: 8014 bytes --]

On Tue, Mar 01, 2022 at 09:06:44AM -0500, Derrick Stolee wrote:
> On 3/1/2022 5:35 AM, Patrick Steinhardt wrote:
> > On Tue, Mar 01, 2022 at 10:46:14AM +0100, Patrick Steinhardt wrote:
> >> On Mon, Feb 28, 2022 at 01:44:01PM -0500, Derrick Stolee wrote:
> >>> On 2/28/2022 11:59 AM, Patrick Steinhardt wrote:
> >>>> On Mon, Feb 28, 2022 at 11:23:38AM -0500, Derrick Stolee wrote:
> >>>>> On 2/28/2022 10:18 AM, Patrick Steinhardt wrote:
> >>>>>> I haven't yet found the time to dig deeper into why this is happening.
> >>>>>> While the repository is publicly accessible at [1], unfortunately the
> >>>>>> bug seems to be triggered by a commit that's only kept alive by an
> >>>>>> internal reference.
> >>>>>>
> >>>>>> Patrick
> >>>>>>
> >>>>>> [1]: https://gitlab.com/gitlab-com/www-gitlab-com.git
> >>>>>
> >>>>> Thanks for including this information. Just to be clear: did you
> >>>>> include patch 4 in your tests, or not? Patch 4 includes a fix
> >>>>> related to overflow values, so it would be helpful to know if you
> >>>>> found a _different_ bug or if it is the same one.
> >>>>>
> >>>>> Thanks,
> >>>>> -Stolee
> >>>>
> >>>> I initially only applied the first three patches, but after having hit
> >>>> the fatal error I also applied the rest of this series to have a look at
> >>>> whether it is indeed fixed already by one of your later patches. The
> >>>> error remains the same though.
> >>>
> >>> Thanks for this extra context. Is this a commit-graph that you wrote
> >>> with the first three patches and then you get an error when reading it?
> >>>
> >>> Do you get the same error when deleting that file and rewriting it with
> >>> all patches included?
> >>>
> >>> Thanks,
> >>> -Stolee
> >>
> >> Yes, I do. I've applied all four patches from v2 on top of 715d08a9e5
> >> (The eighth batch, 2022-02-25) and still get the same results:
> >>
> >>     $ find objects/info/commit-graphs/
> >>     objects/info/commit-graphs/
> >>     objects/info/commit-graphs/graph-607e641165f3e83a82d5b14af4e611bf2a688f35.graph
> >>     objects/info/commit-graphs/commit-graph-chain
> >>     objects/info/commit-graphs/graph-5f357c7573c0075d42d82b28e660bc3eac01bfe8.graph
> >>     objects/info/commit-graphs/graph-e0c12ead1b61c7c30720ae372e8a9f98d95dfb2d.graph
> >>     objects/info/commit-graphs/graph-c96723b133c2d81106a01ecd7a8773bb2ef6c2e1.graph
> >>
> >>      $ git commit-graph verify
> >>     fatal: commit-graph requires overflow generation data but has none
> >>
> >>      $ git commit-graph write
> >>     Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
> >>     Expanding reachable commits in commit graph: 2197197, done.
> >>     Finding extra edges in commit graph: 100% (2197197/2197197), done.
> >>     fatal: commit-graph requires overflow generation data but has none
> >>
> >>      $ rm -rf objects/info/commit-graphs/
> >>
> >>      $ git commit-graph write
> >>     Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
> >>     Expanding reachable commits in commit graph: 2197197, done.
> >>     Finding extra edges in commit graph: 100% (2197197/2197197), done.
> >>     fatal: commit-graph requires overflow generation data but has none
> >>
> >> So even generating them completely anew doesn't seem to generate the
> >> overflow generation data.
> >>
> >> Patrick
> > 
> > I stand corrected. I forgot that the repository at hand was connected to
> > another one via `objects/info/alternates`. If I prune commit-graphs from
> > that alternate, too, then it works alright with your patches.
> 
> OK, thanks. That clarifies the situation.
> 
> I ordered the patches such that the fix in patch 4 could be immediately
> testable, which is not the case without patch 3. However, it does leave
> this temporary state where information can be incorrect if only a subset
> of the series is applied.
> 
> > This makes me wonder how such a bugfix should be handled though. As this
> > series is right now, users will be faced with repository corruption as
> > soon as they upgrade their Git version to one that contains this patch
> > series. This corruption needs manual action: they have to go into the
> > repository, delete the commit-graphs and then optionally create new
> > ones.
> > 
> > This is not a good user experience, and it's worse on the server-side
> > where we now have a timeframe where all commit-graphs are potentially
> > corrupt. This effectively leads to us being unable to serve those repos
> > at all until we have rewritten the commit-graphs because all commands
> > which make use of the commit-graph will now die:
> > 
> >     $ git log
> >     fatal: commit-graph requires overflow generation data but has none
> > 
> > So the question is whether this is a change that needs to be rolled out
> > over multiple releases. First we'd get in the bug fix such that we write
> > correct commit-graphs, and after this fix has been released we can also
> > release the fix that starts to actually parse the generation. This
> > ensures there's a grace period during which we can hopefully correct the
> > data on-disk such that users are not faced with failures.
> 
> You are right that we need to be careful here, but I also think that
> previous versions of Git always wrote the correct data. Here is my
> thought process:
> 
> 1. To get this bug, we need to have parsed the corrected commit date
>    from an existing commit-graph in order to under-count the number
>    of overflow values.
> 
> 2. Before this series, Git versions were not parsing the corrected
>    commit date, so they recompute the corrected commit date every
>    time the commit-graph is written, getting the proper count of
>    overflow values.
> 
> For these reasons, data written by previous versions of Git are
> correct and can be trusted without a staged release.
> 
> Does this make sense? Or, do you experience a different result when
> you build commit-graphs with a released Git version and then when
> writing on top with all patches applied?

Just to verify my understanding: you claim that the bug I was hitting
shouldn't be encountered in the wild when the release , but
only if one were to write a commit-graph with the intermediate state
until patch 3/4 of your patch series?

Hum. I have re-verified, and this indeed seems to play out. So I must've
accidentally run all my testing with the state generated without the
final patch which fixes the corruption. I do see lots of the following
warnings, but overall I can verify and write the commit-graph just fine:

    commit-graph generation for commit c80a42de8803e2d77818d0c82f88e748d7f9425f is 1623362063 < 1623362139

Thanks for your patience, and sorry for the noise :)

> > The better alternative would probably be to just gracefully handle
> > commit-graphs which are corrupted in such a way. Can we maybe just
> > continue to not parse generations in case we find that the commit-graph
> > doesn't have overflow generation data?
> >
> > This is more of a general issue though: commit-graphs are an auxiliary
> > cache that is not required for proper operation at all. If we fail to
> > parse it, then Git shouldn't die but instead fail gracefully and just ignore
> > it. Furthermore, if we notice that graphs are corrupt when we try to
> > write new ones, we may just delete the corrupt versions automatically
> > and generate completely new ones.
> 
> You are right that we can be better about failures here and report
> an error instead of a die(). Especially in this case, we could just
> revert to topological levels instead of throwing out the commit-graph
> entirely.
> 
> This seems like something for another series, so we can be sure to
> audit all cases of fatal errors when parsing the commit-graph so we
> catch all of them and do the "best" thing in each case.

I agree.
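
For what it's worth, the "revert to topological levels" idea could look
roughly like the sketch below. This is only an illustration under my own
assumptions: struct graph_view, choose_generation_source() and the
up-front scan are hypothetical, and real code would more likely check
lazily while filling in each commit. The only format detail relied on is
that a 32-bit generation data entry with its most-significant bit set
points into the overflow chunk.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* High bit of a 32-bit generation data entry: "look in the overflow chunk". */
#define OFFSET_OVERFLOW_BIT (UINT32_C(1) << 31)

/* Hypothetical, already-decoded view of the two generation chunks. */
struct graph_view {
	const uint32_t *gdat; /* one entry per commit, NULL if chunk absent */
	size_t nr;
	const uint64_t *gdov; /* overflow values, NULL if chunk absent */
	size_t gdov_nr;
};

enum gen_source { GEN_CORRECTED_DATE, GEN_TOPO_LEVEL };

/*
 * Decide which generation numbers to trust. Instead of dying when an
 * entry needs overflow data that is missing, warn and fall back to
 * topological levels.
 */
static enum gen_source choose_generation_source(const struct graph_view *g)
{
	size_t i;

	if (!g->gdat)
		return GEN_TOPO_LEVEL;

	for (i = 0; i < g->nr; i++) {
		uint32_t pos;

		if (!(g->gdat[i] & OFFSET_OVERFLOW_BIT))
			continue;
		pos = g->gdat[i] ^ OFFSET_OVERFLOW_BIT;
		if (!g->gdov || pos >= g->gdov_nr) {
			fprintf(stderr, "warning: missing overflow generation "
				"data, using topological levels\n");
			return GEN_TOPO_LEVEL;
		}
	}
	return GEN_CORRECTED_DATE;
}
```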

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 3/7] commit-graph: start parsing generation v2 (again)
  2022-03-01 14:53                 ` Patrick Steinhardt
@ 2022-03-01 15:25                   ` Derrick Stolee
  2022-03-02 13:57                     ` Patrick Steinhardt
  0 siblings, 1 reply; 70+ messages in thread
From: Derrick Stolee @ 2022-03-01 15:25 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: Derrick Stolee via GitGitGadget, git, me, gitster, abhishekkumar8222

On 3/1/2022 9:53 AM, Patrick Steinhardt wrote:
> On Tue, Mar 01, 2022 at 09:06:44AM -0500, Derrick Stolee wrote:
>> On 3/1/2022 5:35 AM, Patrick Steinhardt wrote:
>>> On Tue, Mar 01, 2022 at 10:46:14AM +0100, Patrick Steinhardt wrote:
>>>> On Mon, Feb 28, 2022 at 01:44:01PM -0500, Derrick Stolee wrote:
>>>>> On 2/28/2022 11:59 AM, Patrick Steinhardt wrote:
>>>>>> On Mon, Feb 28, 2022 at 11:23:38AM -0500, Derrick Stolee wrote:
>>>>>>> On 2/28/2022 10:18 AM, Patrick Steinhardt wrote:
>>>>>>>> [1]: https://gitlab.com/gitlab-com/www-gitlab-com.git
...
>>> So the question is whether this is a change that needs to be rolled out
>>> over multiple releases. First we'd get in the bug fix such that we write
>>> correct commit-graphs, and after this fix has been released we can also
>>> release the fix that starts to actually parse the generation. This
>>> ensures there's a grace period during which we can hopefully correct the
>>> data on-disk such that users are not faced with failures.
>>
>> You are right that we need to be careful here, but I also think that
>> previous versions of Git always wrote the correct data. Here is my
>> thought process:
>>
>> 1. To get this bug, we need to have parsed the corrected commit date
>>    from an existing commit-graph in order to under-count the number
>>    of overflow values.
>>
>> 2. Before this series, Git versions were not parsing the corrected
>>    commit date, so they recompute the corrected commit date every
>>    time the commit-graph is written, getting the proper count of
>>    overflow values.
>>
>> For these reasons, data written by previous versions of Git are
>> correct and can be trusted without a staged release.
>>
>> Does this make sense? Or, do you experience a different result when
>> you build commit-graphs with a released Git version and then when
>> writing on top with all patches applied?
> 
> Just to verify my understanding: you claim that the bug I was hitting
> shouldn't be encountered in the wild when the release , but
> only if one were to write a commit-graph with the intermediate state
> until patch 3/4 of your patch series?

That is my claim. And my testing of the repo at [1] has demonstrated
that it works correctly in these cases.
 
> Hum. I have re-verified, and this indeed seems to play out. So I must've
> accidentally run all my testing with the state generated without the
> final patch which fixes the corruption. I do see lots of the following
> warnings, but overall I can verify and write the commit-graph just fine:
> 
>     commit-graph generation for commit c80a42de8803e2d77818d0c82f88e748d7f9425f is 1623362063 < 1623362139

But I'm not able to generate these warnings from either version. I
tried generating different levels of a split commit-graph, but
could not reproduce it. If you have reproduction steps using current
'master' (or any released Git version) and the four patches here,
then I would love to get a full understanding of your errors.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 5/7] commit-graph: document file format v2
  2022-03-01 14:29               ` Ævar Arnfjörð Bjarmason
@ 2022-03-01 15:59                 ` Derrick Stolee
  0 siblings, 0 replies; 70+ messages in thread
From: Derrick Stolee @ 2022-03-01 15:59 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Derrick Stolee via GitGitGadget, git, me, gitster, abhishekkumar8222

On 3/1/2022 9:29 AM, Ævar Arnfjörð Bjarmason wrote:

I agree that this discussion has mostly run its course and I'll do my
best to summarize it in the commit messages of a future patch series.

I just wanted to focus on two things in the most-recent reply:

> On Tue, Mar 01 2022, Derrick Stolee wrote:
> 
>> On 2/28/2022 4:14 PM, Ævar Arnfjörð Bjarmason wrote:
>>>
>>> On Mon, Feb 28 2022, Derrick Stolee wrote:
>>> I think arguably that's OK/worth it, but it's "not [any] issues", no?
>>
>> What I mean is that this change does not enable the new graph version
>> by default, so these users do not have any issues unless someone opts
>> in to the feature while in this mixed scenario.
> 
> Indeed. FWIW I wasn't confused about that bit. I'm just commenting on
> /how/ we do version upgrades, and if we can save users unnecessary
> hassle relatively easily.
> 
> But I also think the writing is on the wall that you'll want to
> (understandably) bump the default sooner rather than later, or if not for
> this data then for other future chunks.

I think somewhere I said we wouldn't want to update this default for
at least a year after it ships, but I'm also happy to never update it
and let experts opt-in when they want.

>> Finally, the end result becomes "older versions get slower without
>> any warning" instead of "older versions get a message about not
>> understanding the commit-graph file".
> 
> Sure, IF you want to write such an empty chunk. The point is that you
> now have the option.
> 
> And this is the same edge case we already dealt with for
> GDAT/GDOV. I.e. older readers who didn't understand it would be slower.
> 
> We can still have a feature to make older clients intentionally
> break/warn, it seems to me that if you'd want such a thing you'd want it
> aside from the specific mechanism of this proposed upgrade.
> 
> Or you could dual-write the data for older clients, which I think
> probably isn't worth the hassle.
> 
> I.e. if you're worried about silent slowdown, older clients happily
> ignoring the BIDX and BDAT chunks are silently slower.

Older clients ignoring BIDX and BDAT chunks means they are silently
slower than newer clients, but they are still as fast as they were
yesterday. The empty chunk approach will make those older Git versions
slower than they were yesterday.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v2 0/4] Commit-graph: Generation Number v2 Fixes
  2022-02-28 13:53 ` [PATCH v2 0/4] Commit-graph: Generation Number v2 Fixes Derrick Stolee via GitGitGadget
                     ` (3 preceding siblings ...)
  2022-02-28 13:53   ` [PATCH v2 4/4] commit-graph: fix generation number v2 overflow values Derrick Stolee via GitGitGadget
@ 2022-03-01 17:23   ` Ævar Arnfjörð Bjarmason
  2022-03-01 19:48   ` [PATCH v3 0/5] " Derrick Stolee via GitGitGadget
  5 siblings, 0 replies; 70+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-03-01 17:23 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, gitster, abhishekkumar8222, Derrick Stolee

On Mon, Feb 28 2022, Derrick Stolee via GitGitGadget wrote:

> In particular, Git has been ignoring corrected commit dates since shortly
> after they were introduced. This is due to a bug I introduced when trying to
> make split commit-graphs safer with mixed generation number versions. I also
> noticed an issue with the offset overflows that I only noticed after writing
> generation number v3 using a smaller offset size, actually triggering the
> bug in the test suite.

I think sans existing small issues/fixes I noted this looks good.

Just a bit on the overall direction/design. And don't worry, not the
verison v.s. chunk all over again (except notes about how the two
versions eventually interact) :)

I.e. the post-state here looks good, but I wondered about the direction
of:

 * We have a commitGraph.readChangedPaths for *reading* BIDX/BDAT, on by default
 * We have a commitgraph.generationVersion which pre this series is 1, post-2.
 * >= 2 means look at the GDAT/GDOV chunk, and when splitting/rewriting etc.
   carry them forward.

So, I wonder:

A. Do we really need these "yes I'll read this thing in the file" settings at all?

   Isn't it sufficient to have core.commitGraph=false as an escape hatch, do we
   really need to be able to selectively ignore individual parts of the file on-disk?

B. For a "selective ignore" the commitGraph.readChangedPaths=BOOL makes sense, but
   given the follow-up series it seems odd to end up with a commitgraph.generationVersion=3
   which bumps the top-level version of the commit-graph.

I.e. commitgraph.generationVersion=2 *is* optional since the GDAT/GDOV
chunks can be ignored, but commitgraph.generationVersion=3 is *not* since
it'll also bump the format version to v2 (not v3!).

So yeah, like I said I'm being quiet about the top-level version
v.s. chunks here blah blah :)

But for the end-state you want (as I understand it) wouldn't this make
more sense:

1. Just say we have write settings + "core.commitGraph=false" escape
   hatch, no "selective read".

2. Make commitgraph.generationVersion=2 an alias for a more obviously named
   commitGraph.readGenerationData=true (optional). Maybe deprecate
   "commitGraph.readGenerationData" (say "error: just don't write it then")

3. Never have a commitgraph.generationVersion=3 setting, but instead
   add say a core.writeCommitGraphVersion=2.

   We'd thus not be conflating "commitgraph.generationVersion" which *is*
   optional and is on both the read *and* write side, with your WIP
   commitgraph.generationVersion=3 which also changes how writes happen,
   but is not at all optional for reads. You'll either support it or get a
   warning() when loading the graph.

I think this should all be OK, and I don't think it conflicts with your
stated preferences about the top-level version v.s. optional/redundant
chunks.

It makes things simpler since we'd always read data we find, with no
"maybe ignore it if it's there" settings.

And it avoids user confusion of e.g. getting an error about not
understanding a v2 graph after setting a commitgraph.generationVersion=3
setting (since one is 0-indexed...), and having "versions" in the same
config setting being first optional on read/write, to very much not
optional.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [PATCH v3 0/5] Commit-graph: Generation Number v2 Fixes
  2022-02-28 13:53 ` [PATCH v2 0/4] Commit-graph: Generation Number v2 Fixes Derrick Stolee via GitGitGadget
                     ` (4 preceding siblings ...)
  2022-03-01 17:23   ` [PATCH v2 0/4] Commit-graph: Generation Number v2 Fixes Ævar Arnfjörð Bjarmason
@ 2022-03-01 19:48   ` Derrick Stolee via GitGitGadget
  2022-03-01 19:48     ` [PATCH v3 1/5] test-read-graph: include extra post-parse info Derrick Stolee via GitGitGadget
                       ` (4 more replies)
  5 siblings, 5 replies; 70+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-03-01 19:48 UTC (permalink / raw)
  To: git; +Cc: me, gitster, abhishekkumar8222, avarab, Derrick Stolee

This patch series fixes some bugs in generation number v2. They were
discovered while building generation number v3, but that implementation will
be delayed until these fixes are merged.

In particular, Git has been ignoring corrected commit dates since shortly
after they were introduced. This is due to a bug I introduced when trying to
make split commit-graphs safer with mixed generation number versions. I also
noticed an issue with the offset overflows that I only noticed after writing
generation number v3 using a smaller offset size, actually triggering the
bug in the test suite.
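
To make the overflow issue concrete, here is a toy sketch (not the code
in commit-graph.c; struct toy_commit and both helpers are invented for
illustration). It assumes the usual definition of a corrected commit
date, i.e. the maximum of the commit's own date and one more than the
largest corrected date among its parents, and a 31-bit limit on the
stored offset. Overflows are counted in a separate pass over the full
commit list once every corrected date is known.

```c
#include <stddef.h>
#include <stdint.h>

/* Largest offset that fits without using the overflow chunk (31 bits). */
#define OFFSET_MAX ((UINT64_C(1) << 31) - 1)

/* Toy commit: parents are indices into the same array, -1 for "none". */
struct toy_commit {
	uint64_t date;           /* committer date */
	uint64_t corrected_date; /* filled in by compute_corrected_dates() */
	int parent[2];
};

/* Commits must be sorted so that parents come before their children. */
static void compute_corrected_dates(struct toy_commit *c, size_t nr)
{
	size_t i;
	int p;

	for (i = 0; i < nr; i++) {
		uint64_t v = c[i].date;

		for (p = 0; p < 2; p++) {
			uint64_t cand;

			if (c[i].parent[p] < 0)
				continue;
			cand = c[c[i].parent[p]].corrected_date + 1;
			if (cand > v)
				v = cand;
		}
		c[i].corrected_date = v;
	}
}

/* Count commits whose offset (corrected date - date) needs overflow storage. */
static size_t count_overflows(const struct toy_commit *c, size_t nr)
{
	size_t i, count = 0;

	for (i = 0; i < nr; i++)
		if (c[i].corrected_date - c[i].date > OFFSET_MAX)
			count++;
	return count;
}
```

A commit with an ordinary date whose parent has a date far in the future
gets a huge offset, which is the kind of case the 64-bit tests exercise.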


Updates in v3
=============

 * Used portable printf macros for uint32.
 * Added a new patch that creates lib-commit-graph.sh in preparation for new
   64-bit test script.
 * While copying this information, replaced 'test ! -z ...' with 'test -n
   ...'
 * Instead of editing existing overflow tests, created a new test script
   focused on 64-bit tests.


Updates in v2
=============

 * Dropped generation v3 patches, saving them for later.
 * Updated a commit message to more clearly describe the problem with the
   old code.
 * Used an || instead of two if statements in test script.

Thanks, -Stolee

Derrick Stolee (5):
  test-read-graph: include extra post-parse info
  t5318: extract helpers to lib-commit-graph.sh
  commit-graph: fix ordering bug in generation numbers
  commit-graph: start parsing generation v2 (again)
  commit-graph: fix generation number v2 overflow values

 commit-graph.c                     | 15 +++++--
 t/helper/test-read-graph.c         | 13 ++++++
 t/lib-commit-graph.sh              | 58 ++++++++++++++++++++++++++
 t/t4216-log-bloom.sh               |  1 +
 t/t5318-commit-graph.sh            | 55 +++----------------------
 t/t5324-split-commit-graph.sh      | 10 +++++
 t/t5328-commit-graph-64bit-time.sh | 66 ++++++++++++++++++++++++++++++
 7 files changed, 164 insertions(+), 54 deletions(-)
 create mode 100755 t/lib-commit-graph.sh
 create mode 100755 t/t5328-commit-graph-64bit-time.sh


base-commit: dab1b7905d0b295f1acef9785bb2b9cbb0fdec84
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1163%2Fderrickstolee%2Fgen-v3-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1163/derrickstolee/gen-v3-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/1163

Range-diff vs v2:

 1:  2f89275314b ! 1:  4eca8028c97 test-read-graph: include extra post-parse info
     @@ t/helper/test-read-graph.c: int cmd__read_graph(int argc, const char **argv)
       
      +	printf("options:");
      +	if (graph->bloom_filter_settings)
     -+		printf(" bloom(%d,%d,%d)",
     ++		printf(" bloom(%"PRIu32",%"PRIu32",%"PRIu32")",
      +		       graph->bloom_filter_settings->hash_version,
      +		       graph->bloom_filter_settings->bits_per_entry,
      +		       graph->bloom_filter_settings->num_hashes);
 -:  ----------- > 2:  a90cad8efa5 t5318: extract helpers to lib-commit-graph.sh
 2:  cbcbf10e699 ! 3:  562341b76b3 commit-graph: fix ordering bug in generation numbers
     @@ Commit message
          Instead, iterate over the full commit list at the end, checking the
          offsets to see how many grow beyond the maximum value.
      
     -    Update a test in t5318 to use a larger time value, which will help
     -    demonstrate this bug in more cases. It still won't hit all potential
     -    cases until the next change, which reenables reading generation numbers.
     +    Create a new t5328-commit-graph-64bit-time.sh test script to handle
     +    special cases of testing 64-bit timestamps. This helps demonstrate this
     +    bug in more cases. It still won't hit all potential cases until the next
     +    change, which reenables reading generation numbers. Use the skip_all
     +    trick from 0a2bfccb9c8 (t0051: use "skip_all" under !MINGW in
     +    single-test file, 2022-02-04) to make the output clean when run on a
     +    32-bit system.
      
     +    Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
          Signed-off-by: Derrick Stolee <derrickstolee@github.com>
      
       ## commit-graph.c ##
     @@ t/t5318-commit-graph.sh: test_expect_success 'warn on improper hash version' '
       	rm -f .git/objects/info/commit-graph &&
       	test_commit --date "$FUTURE_DATE" future-1 &&
       	test_commit --date "$UNIX_EPOCH_ZERO" old-1 &&
     +
     + ## t/t5328-commit-graph-64bit-time.sh (new) ##
     +@@
     ++#!/bin/sh
     ++
     ++test_description='commit graph with 64-bit timestamps'
     ++. ./test-lib.sh
     ++
     ++if ! test_have_prereq TIME_IS_64BIT || ! test_have_prereq TIME_T_IS_64BIT
     ++then
     ++	skip_all='skipping 64-bit timestamp tests'
     ++	test_done
     ++fi
     ++
     ++. "$TEST_DIRECTORY"/lib-commit-graph.sh
     ++
     ++UNIX_EPOCH_ZERO="@0 +0000"
     ++FUTURE_DATE="@4147483646 +0000"
     ++
     ++GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
     ++
     ++test_expect_success 'lower layers have overflow chunk' '
     ++	rm -f .git/objects/info/commit-graph &&
     ++	test_commit --date "$FUTURE_DATE" future-1 &&
     ++	test_commit --date "$UNIX_EPOCH_ZERO" old-1 &&
     ++	git commit-graph write --reachable &&
     ++	test_commit --date "$FUTURE_DATE" future-2 &&
     ++	test_commit --date "$UNIX_EPOCH_ZERO" old-2 &&
     ++	git commit-graph write --reachable --split=no-merge &&
     ++	test_commit extra &&
     ++	git commit-graph write --reachable --split=no-merge &&
     ++	git commit-graph write --reachable &&
     ++	graph_read_expect 5 "generation_data generation_data_overflow" &&
     ++	mv .git/objects/info/commit-graph commit-graph-upgraded &&
     ++	git commit-graph write --reachable &&
     ++	graph_read_expect 5 "generation_data generation_data_overflow" &&
     ++	test_cmp .git/objects/info/commit-graph commit-graph-upgraded
     ++'
     ++
     ++graph_git_behavior 'overflow' '' HEAD~2 HEAD
     ++
     ++test_done
 3:  5bc6a7660d8 ! 4:  041d96bf1d7 commit-graph: start parsing generation v2 (again)
     @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repository *r,
       
       	if (r->settings.commit_graph_read_changed_paths) {
      
     - ## t/t4216-log-bloom.sh ##
     -@@ t/t4216-log-bloom.sh: graph_read_expect () {
     - 	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
     - 	num_commits: $1
     - 	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
     --	options: bloom(1,10,7)
     -+	options: bloom(1,10,7) read_generation_data
     - 	EOF
     - 	test-tool read-graph >actual &&
     - 	test_cmp expect actual
     -
     - ## t/t5318-commit-graph.sh ##
     -@@ t/t5318-commit-graph.sh: graph_read_expect() {
     + ## t/lib-commit-graph.sh ##
     +@@ t/lib-commit-graph.sh: graph_read_expect() {
       		OPTIONAL=" $2"
       		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
       	fi
      +	GENERATION_VERSION=2
     -+	if test ! -z "$3"
     ++	if test -n "$3"
      +	then
      +		GENERATION_VERSION=$3
      +	fi
     @@ t/t5318-commit-graph.sh: graph_read_expect() {
       	EOF
       	test-tool read-graph >output &&
       	test_cmp expect output
     +
     + ## t/t4216-log-bloom.sh ##
     +@@ t/t4216-log-bloom.sh: graph_read_expect () {
     + 	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
     + 	num_commits: $1
     + 	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
     +-	options: bloom(1,10,7)
     ++	options: bloom(1,10,7) read_generation_data
     + 	EOF
     + 	test-tool read-graph >actual &&
     + 	test_cmp expect actual
     +
     + ## t/t5318-commit-graph.sh ##
      @@ t/t5318-commit-graph.sh: test_expect_success 'git commit-graph verify' '
       	cd "$TRASH_DIRECTORY/full" &&
       	git rev-parse commits/8 | git -c commitGraph.generationVersion=1 commit-graph write --stdin-commits &&
 4:  193217c71e0 ! 5:  e957baa9d77 commit-graph: fix generation number v2 overflow values
     @@ Commit message
          show up as a failure in 'git commit-graph verify' if we increase that
          FUTURE_DATE to be above four billion.
      
     -    Fix this error and update the test to require 64-bit dates so we can
     -    safely use this large value in our test.
     +    Fix this error and create a 64-bit timestamp version of the test so we
     +    can test these larger values.
      
          Signed-off-by: Derrick Stolee <derrickstolee@github.com>
      
     @@ commit-graph.c: static void fill_commit_graph_info(struct commit *item, struct c
       			graph_data->generation = item->date + offset;
       	} else
      
     - ## t/t5318-commit-graph.sh ##
     -@@ t/t5318-commit-graph.sh: test_expect_success 'corrupt commit-graph write (missing tree)' '
     - 	)
     - '
     + ## t/t5328-commit-graph-64bit-time.sh ##
     +@@ t/t5328-commit-graph-64bit-time.sh: test_expect_success 'lower layers have overflow chunk' '
       
     -+# The remaining tests check timestamps that flow over
     -+# 32-bits. The graph_git_behavior checks can't take a
     -+# prereq, so just stop here if we are on a 32-bit machine.
     -+
     -+if ! test_have_prereq TIME_IS_64BIT || ! test_have_prereq TIME_T_IS_64BIT
     -+then
     -+	test_done
     -+fi
     -+
     - # We test the overflow-related code with the following repo history:
     - #
     - #               4:F - 5:N - 6:U
     -@@ t/t5318-commit-graph.sh: test_expect_success 'corrupt commit-graph write (missing tree)' '
     - # The largest offset observed is 2 ^ 31, just large enough to overflow.
     - #
     - 
     --test_expect_success 'set up and verify repo with generation data overflow chunk' '
     -+test_expect_success TIME_IS_64BIT,TIME_T_IS_64BIT 'set up and verify repo with generation data overflow chunk' '
     - 	objdir=".git/objects" &&
     - 	UNIX_EPOCH_ZERO="@0 +0000" &&
     --	FUTURE_DATE="@2147483646 +0000" &&
     -+	FUTURE_DATE="@4000000000 +0000" &&
     - 	test_oid_cache <<-EOF &&
     - 	oid_version sha1:1
     - 	oid_version sha256:2
     -@@ t/t5318-commit-graph.sh: test_expect_success 'set up and verify repo with generation data overflow chunk'
     + graph_git_behavior 'overflow' '' HEAD~2 HEAD
       
     - graph_git_behavior 'generation data overflow chunk repo' repo left right
     - 
     -+# Do not add tests at the end of this file, unless they require 64-bit
     -+# timestamps, since this portion of the script is only executed when
     -+# time data types have 64 bits.
     ++test_expect_success 'set up and verify repo with generation data overflow chunk' '
     ++	mkdir repo &&
     ++	cd repo &&
     ++	git init &&
     ++	test_commit --date "$UNIX_EPOCH_ZERO" 1 &&
     ++	test_commit 2 &&
     ++	test_commit --date "$UNIX_EPOCH_ZERO" 3 &&
     ++	git commit-graph write --reachable &&
     ++	graph_read_expect 3 generation_data &&
     ++	test_commit --date "$FUTURE_DATE" 4 &&
     ++	test_commit 5 &&
     ++	test_commit --date "$UNIX_EPOCH_ZERO" 6 &&
     ++	git branch left &&
     ++	git reset --hard 3 &&
     ++	test_commit 7 &&
     ++	test_commit --date "$FUTURE_DATE" 8 &&
     ++	test_commit 9 &&
     ++	git branch right &&
     ++	git reset --hard 3 &&
     ++	test_merge M left right &&
     ++	git commit-graph write --reachable &&
     ++	graph_read_expect 10 "generation_data generation_data_overflow" &&
     ++	git commit-graph verify
     ++'
     ++
     ++graph_git_behavior 'overflow 2' repo left right
      +
       test_done

-- 
gitgitgadget


* [PATCH v3 1/5] test-read-graph: include extra post-parse info
  2022-03-01 19:48   ` [PATCH v3 0/5] " Derrick Stolee via GitGitGadget
@ 2022-03-01 19:48     ` Derrick Stolee via GitGitGadget
  2022-03-01 19:48     ` [PATCH v3 2/5] t5318: extract helpers to lib-commit-graph.sh Derrick Stolee via GitGitGadget
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 70+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-03-01 19:48 UTC (permalink / raw)
  To: git
  Cc: me, gitster, abhishekkumar8222, avarab, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

It can be helpful to verify that the 'struct commit_graph' that results
from parsing a commit-graph is correctly structured. The existence of
different chunks is not enough to verify that all of the optional
features are correctly enabled.

Update 'test-tool read-graph' to output an "options:" line that includes
information for different parts of the struct commit_graph.

In particular, this change demonstrates that the read_generation_data
option is never being enabled, which will be fixed in a later change.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 t/helper/test-read-graph.c    | 13 +++++++++++++
 t/t4216-log-bloom.sh          |  1 +
 t/t5318-commit-graph.sh       |  1 +
 t/t5324-split-commit-graph.sh |  5 +++++
 4 files changed, 20 insertions(+)

diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index 75927b2c81d..98b73bb8f25 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -3,6 +3,7 @@
 #include "commit-graph.h"
 #include "repository.h"
 #include "object-store.h"
+#include "bloom.h"
 
 int cmd__read_graph(int argc, const char **argv)
 {
@@ -45,6 +46,18 @@ int cmd__read_graph(int argc, const char **argv)
 		printf(" bloom_data");
 	printf("\n");
 
+	printf("options:");
+	if (graph->bloom_filter_settings)
+		printf(" bloom(%"PRIu32",%"PRIu32",%"PRIu32")",
+		       graph->bloom_filter_settings->hash_version,
+		       graph->bloom_filter_settings->bits_per_entry,
+		       graph->bloom_filter_settings->num_hashes);
+	if (graph->read_generation_data)
+		printf(" read_generation_data");
+	if (graph->topo_levels)
+		printf(" topo_levels");
+	printf("\n");
+
 	UNLEAK(graph);
 
 	return 0;
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index cc3cebf6722..5ed6d2a21c1 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -48,6 +48,7 @@ graph_read_expect () {
 	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
 	num_commits: $1
 	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
+	options: bloom(1,10,7)
 	EOF
 	test-tool read-graph >actual &&
 	test_cmp expect actual
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index edb728f77c3..2b05026cf6d 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -104,6 +104,7 @@ graph_read_expect() {
 	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
 	num_commits: $1
 	chunks: oid_fanout oid_lookup commit_metadata$OPTIONAL
+	options:
 	EOF
 	test-tool read-graph >output &&
 	test_cmp expect output
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 847b8097109..778fa418de2 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -34,6 +34,7 @@ graph_read_expect() {
 	header: 43475048 1 $(test_oid oid_version) 4 $NUM_BASE
 	num_commits: $1
 	chunks: oid_fanout oid_lookup commit_metadata generation_data
+	options:
 	EOF
 	test-tool read-graph >output &&
 	test_cmp expect output
@@ -508,6 +509,7 @@ test_expect_success 'setup repo for mixed generation commit-graph-chain' '
 		header: 43475048 1 $(test_oid oid_version) 4 1
 		num_commits: $NUM_SECOND_LAYER_COMMITS
 		chunks: oid_fanout oid_lookup commit_metadata
+		options:
 		EOF
 		test_cmp expect output &&
 		git commit-graph verify &&
@@ -540,6 +542,7 @@ test_expect_success 'do not write generation data chunk if not present on existi
 		header: 43475048 1 $(test_oid oid_version) 4 2
 		num_commits: $NUM_THIRD_LAYER_COMMITS
 		chunks: oid_fanout oid_lookup commit_metadata
+		options:
 		EOF
 		test_cmp expect output &&
 		git commit-graph verify
@@ -581,6 +584,7 @@ test_expect_success 'do not write generation data chunk if the topmost remaining
 		header: 43475048 1 $(test_oid oid_version) 4 2
 		num_commits: $(($NUM_THIRD_LAYER_COMMITS + $NUM_FOURTH_LAYER_COMMITS))
 		chunks: oid_fanout oid_lookup commit_metadata
+		options:
 		EOF
 		test_cmp expect output &&
 		git commit-graph verify
@@ -620,6 +624,7 @@ test_expect_success 'write generation data chunk if topmost remaining layer has
 		header: 43475048 1 $(test_oid oid_version) 5 1
 		num_commits: $(($NUM_SECOND_LAYER_COMMITS + $NUM_THIRD_LAYER_COMMITS + $NUM_FOURTH_LAYER_COMMITS + $NUM_FIFTH_LAYER_COMMITS))
 		chunks: oid_fanout oid_lookup commit_metadata generation_data
+		options:
 		EOF
 		test_cmp expect output
 	)
-- 
gitgitgadget



* [PATCH v3 2/5] t5318: extract helpers to lib-commit-graph.sh
  2022-03-01 19:48   ` [PATCH v3 0/5] " Derrick Stolee via GitGitGadget
  2022-03-01 19:48     ` [PATCH v3 1/5] test-read-graph: include extra post-parse info Derrick Stolee via GitGitGadget
@ 2022-03-01 19:48     ` Derrick Stolee via GitGitGadget
  2022-03-01 19:48     ` [PATCH v3 3/5] commit-graph: fix ordering bug in generation numbers Derrick Stolee via GitGitGadget
                       ` (2 subsequent siblings)
  4 siblings, 0 replies; 70+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-03-01 19:48 UTC (permalink / raw)
  To: git
  Cc: me, gitster, abhishekkumar8222, avarab, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The graph_git_behavior helper is useful for testing that certain Git
commands behave the same when using the commit-graph and when not using
the commit-graph. Extract it to a new lib-commit-graph.sh file for use
in new test scripts that will be split out from t5318.

While doing this extraction, also extract graph_read_expect and the
logic for priming the test_oid_cache.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 t/lib-commit-graph.sh   | 48 +++++++++++++++++++++++++++++++++++++++
 t/t5318-commit-graph.sh | 50 ++---------------------------------------
 2 files changed, 50 insertions(+), 48 deletions(-)
 create mode 100755 t/lib-commit-graph.sh

diff --git a/t/lib-commit-graph.sh b/t/lib-commit-graph.sh
new file mode 100755
index 00000000000..07e12b9d2fe
--- /dev/null
+++ b/t/lib-commit-graph.sh
@@ -0,0 +1,48 @@
+#!/bin/sh
+
+# Helper functions for testing commit-graphs.
+
+# Initialize OID cache with oid_version
+test_oid_cache <<-EOF
+oid_version sha1:1
+oid_version sha256:2
+EOF
+
+graph_git_two_modes() {
+	git -c core.commitGraph=true $1 >output &&
+	git -c core.commitGraph=false $1 >expect &&
+	test_cmp expect output
+}
+
+graph_git_behavior() {
+	MSG=$1
+	DIR=$2
+	BRANCH=$3
+	COMPARE=$4
+	test_expect_success "check normal git operations: $MSG" '
+		cd "$TRASH_DIRECTORY/$DIR" &&
+		graph_git_two_modes "log --oneline $BRANCH" &&
+		graph_git_two_modes "log --topo-order $BRANCH" &&
+		graph_git_two_modes "log --graph $COMPARE..$BRANCH" &&
+		graph_git_two_modes "branch -vv" &&
+		graph_git_two_modes "merge-base -a $BRANCH $COMPARE"
+	'
+}
+
+graph_read_expect() {
+	OPTIONAL=""
+	NUM_CHUNKS=3
+	if test -n "$2"
+	then
+		OPTIONAL=" $2"
+		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
+	fi
+	cat >expect <<- EOF
+	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
+	num_commits: $1
+	chunks: oid_fanout oid_lookup commit_metadata$OPTIONAL
+	options:
+	EOF
+	test-tool read-graph >output &&
+	test_cmp expect output
+}
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 2b05026cf6d..9e2b5884dae 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -29,12 +29,7 @@ test_expect_success 'setup full repo' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git init &&
 	git config core.commitGraph true &&
-	objdir=".git/objects" &&
-
-	test_oid_cache <<-EOF
-	oid_version sha1:1
-	oid_version sha256:2
-	EOF
+	objdir=".git/objects"
 '
 
 test_expect_success POSIXPERM 'tweak umask for modebit tests' '
@@ -69,47 +64,10 @@ test_expect_success 'create commits and repack' '
 	git repack
 '
 
-graph_git_two_modes() {
-	git -c core.commitGraph=true $1 >output &&
-	git -c core.commitGraph=false $1 >expect &&
-	test_cmp expect output
-}
-
-graph_git_behavior() {
-	MSG=$1
-	DIR=$2
-	BRANCH=$3
-	COMPARE=$4
-	test_expect_success "check normal git operations: $MSG" '
-		cd "$TRASH_DIRECTORY/$DIR" &&
-		graph_git_two_modes "log --oneline $BRANCH" &&
-		graph_git_two_modes "log --topo-order $BRANCH" &&
-		graph_git_two_modes "log --graph $COMPARE..$BRANCH" &&
-		graph_git_two_modes "branch -vv" &&
-		graph_git_two_modes "merge-base -a $BRANCH $COMPARE"
-	'
-}
+. "$TEST_DIRECTORY"/lib-commit-graph.sh
 
 graph_git_behavior 'no graph' full commits/3 commits/1
 
-graph_read_expect() {
-	OPTIONAL=""
-	NUM_CHUNKS=3
-	if test ! -z "$2"
-	then
-		OPTIONAL=" $2"
-		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
-	fi
-	cat >expect <<- EOF
-	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
-	num_commits: $1
-	chunks: oid_fanout oid_lookup commit_metadata$OPTIONAL
-	options:
-	EOF
-	test-tool read-graph >output &&
-	test_cmp expect output
-}
-
 test_expect_success 'exit with correct error on bad input to --stdin-commits' '
 	cd "$TRASH_DIRECTORY/full" &&
 	# invalid, non-hex OID
@@ -826,10 +784,6 @@ test_expect_success 'set up and verify repo with generation data overflow chunk'
 	objdir=".git/objects" &&
 	UNIX_EPOCH_ZERO="@0 +0000" &&
 	FUTURE_DATE="@2147483646 +0000" &&
-	test_oid_cache <<-EOF &&
-	oid_version sha1:1
-	oid_version sha256:2
-	EOF
 	cd "$TRASH_DIRECTORY" &&
 	mkdir repo &&
 	cd repo &&
-- 
gitgitgadget



* [PATCH v3 3/5] commit-graph: fix ordering bug in generation numbers
  2022-03-01 19:48   ` [PATCH v3 0/5] " Derrick Stolee via GitGitGadget
  2022-03-01 19:48     ` [PATCH v3 1/5] test-read-graph: include extra post-parse info Derrick Stolee via GitGitGadget
  2022-03-01 19:48     ` [PATCH v3 2/5] t5318: extract helpers to lib-commit-graph.sh Derrick Stolee via GitGitGadget
@ 2022-03-01 19:48     ` Derrick Stolee via GitGitGadget
  2022-03-01 20:13       ` Junio C Hamano
  2022-03-01 19:48     ` [PATCH v3 4/5] commit-graph: start parsing generation v2 (again) Derrick Stolee via GitGitGadget
  2022-03-01 19:48     ` [PATCH v3 5/5] commit-graph: fix generation number v2 overflow values Derrick Stolee via GitGitGadget
  4 siblings, 1 reply; 70+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-03-01 19:48 UTC (permalink / raw)
  To: git
  Cc: me, gitster, abhishekkumar8222, avarab, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

When computing the generation numbers for a commit-graph, we compute
the corrected commit dates and then check if their offsets from the
actual dates are too large to fit in the 32-bit Generation Data chunk.
However, there is a problem with this approach: if we have parsed the
generation data from the previous commit-graph, then we continue the
loop because the corrected commit date is already computed. This causes
an under-count in the number of overflow values.

It is incorrect to add an increment to num_generation_data_overflows
next to this 'continue' statement, because we might start
double-counting commits that are computed because of the depth-first
search walk from a commit with an earlier OID.

Instead, iterate over the full commit list at the end, checking the
offsets to see how many grow beyond the maximum value.
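
A minimal sketch of such a final pass, assuming hypothetical stand-in
names (toy_commit, OFFSET_MAX, count_overflows) instead of Git's real
struct commit and GENERATION_NUMBER_V2_OFFSET_MAX, whose value is
assumed here to be 2^31 - 1:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for Git's timestamp_t. */
typedef uint64_t timestamp_t;

/* Assumed limit: the largest offset that fits in 31 bits. */
#define OFFSET_MAX ((1ULL << 31) - 1)

/* Hypothetical stand-in for the per-commit data used by the writer. */
struct toy_commit {
	timestamp_t date;       /* committer date */
	timestamp_t generation; /* corrected commit date, never below date */
};

/*
 * Count the commits whose offset (generation - date) is too large to
 * store inline. Every commit is visited exactly once, whether its
 * generation was computed in this run or loaded from an existing file.
 */
static size_t count_overflows(const struct toy_commit *list, size_t nr)
{
	size_t i, overflows = 0;

	for (i = 0; i < nr; i++) {
		assert(list[i].generation >= list[i].date);
		if (list[i].generation - list[i].date > OFFSET_MAX)
			overflows++;
	}
	return overflows;
}
```

Since the count is taken in a single pass over the full list, it does
not depend on the order in which the depth-first walk reached each
commit, so a commit can be neither skipped nor counted twice.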

Create a new t5328-commit-graph-64bit-time.sh test script to handle
special cases of testing 64-bit timestamps. This helps demonstrate this
bug in more cases. It still won't hit all potential cases until the next
change, which reenables reading generation numbers. Use the skip_all
trick from 0a2bfccb9c8 (t0051: use "skip_all" under !MINGW in
single-test file, 2022-02-04) to make the output clean when run on a
32-bit system.

Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 commit-graph.c                     | 10 +++++---
 t/t5318-commit-graph.sh            |  4 +--
 t/t5328-commit-graph-64bit-time.sh | 39 ++++++++++++++++++++++++++++++
 3 files changed, 48 insertions(+), 5 deletions(-)
 create mode 100755 t/t5328-commit-graph-64bit-time.sh

diff --git a/commit-graph.c b/commit-graph.c
index 265c010122e..a19bd96c2ee 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1556,12 +1556,16 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 				if (current->date && current->date > max_corrected_commit_date)
 					max_corrected_commit_date = current->date - 1;
 				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
-
-				if (commit_graph_data_at(current)->generation - current->date > GENERATION_NUMBER_V2_OFFSET_MAX)
-					ctx->num_generation_data_overflows++;
 			}
 		}
 	}
+
+	for (i = 0; i < ctx->commits.nr; i++) {
+		struct commit *c = ctx->commits.list[i];
+		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
+		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX)
+			ctx->num_generation_data_overflows++;
+	}
 	stop_progress(&ctx->progress);
 }
 
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 9e2b5884dae..0ed7e9de8e6 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -425,10 +425,10 @@ test_expect_success 'warn on improper hash version' '
 	)
 '
 
-test_expect_success 'lower layers have overflow chunk' '
+test_expect_success TIME_IS_64BIT,TIME_T_IS_64BIT 'lower layers have overflow chunk' '
 	cd "$TRASH_DIRECTORY/full" &&
 	UNIX_EPOCH_ZERO="@0 +0000" &&
-	FUTURE_DATE="@2147483646 +0000" &&
+	FUTURE_DATE="@4147483646 +0000" &&
 	rm -f .git/objects/info/commit-graph &&
 	test_commit --date "$FUTURE_DATE" future-1 &&
 	test_commit --date "$UNIX_EPOCH_ZERO" old-1 &&
diff --git a/t/t5328-commit-graph-64bit-time.sh b/t/t5328-commit-graph-64bit-time.sh
new file mode 100755
index 00000000000..28114bcaf47
--- /dev/null
+++ b/t/t5328-commit-graph-64bit-time.sh
@@ -0,0 +1,39 @@
+#!/bin/sh
+
+test_description='commit graph with 64-bit timestamps'
+. ./test-lib.sh
+
+if ! test_have_prereq TIME_IS_64BIT || ! test_have_prereq TIME_T_IS_64BIT
+then
+	skip_all='skipping 64-bit timestamp tests'
+	test_done
+fi
+
+. "$TEST_DIRECTORY"/lib-commit-graph.sh
+
+UNIX_EPOCH_ZERO="@0 +0000"
+FUTURE_DATE="@4147483646 +0000"
+
+GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
+
+test_expect_success 'lower layers have overflow chunk' '
+	rm -f .git/objects/info/commit-graph &&
+	test_commit --date "$FUTURE_DATE" future-1 &&
+	test_commit --date "$UNIX_EPOCH_ZERO" old-1 &&
+	git commit-graph write --reachable &&
+	test_commit --date "$FUTURE_DATE" future-2 &&
+	test_commit --date "$UNIX_EPOCH_ZERO" old-2 &&
+	git commit-graph write --reachable --split=no-merge &&
+	test_commit extra &&
+	git commit-graph write --reachable --split=no-merge &&
+	git commit-graph write --reachable &&
+	graph_read_expect 5 "generation_data generation_data_overflow" &&
+	mv .git/objects/info/commit-graph commit-graph-upgraded &&
+	git commit-graph write --reachable &&
+	graph_read_expect 5 "generation_data generation_data_overflow" &&
+	test_cmp .git/objects/info/commit-graph commit-graph-upgraded
+'
+
+graph_git_behavior 'overflow' '' HEAD~2 HEAD
+
+test_done
-- 
gitgitgadget



* [PATCH v3 4/5] commit-graph: start parsing generation v2 (again)
  2022-03-01 19:48   ` [PATCH v3 0/5] " Derrick Stolee via GitGitGadget
                       ` (2 preceding siblings ...)
  2022-03-01 19:48     ` [PATCH v3 3/5] commit-graph: fix ordering bug in generation numbers Derrick Stolee via GitGitGadget
@ 2022-03-01 19:48     ` Derrick Stolee via GitGitGadget
  2022-03-01 19:48     ` [PATCH v3 5/5] commit-graph: fix generation number v2 overflow values Derrick Stolee via GitGitGadget
  4 siblings, 0 replies; 70+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-03-01 19:48 UTC (permalink / raw)
  To: git
  Cc: me, gitster, abhishekkumar8222, avarab, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The 'read_generation_data' member of 'struct commit_graph' was
introduced by 1fdc383c5 (commit-graph: use generation v2 only if entire
chain does, 2021-01-16). The intention was to avoid using corrected
commit dates if not all layers of a commit-graph had that data stored.
The logic in validate_mixed_generation_chain() at that point incorrectly
initialized read_generation_data to 1 if and only if the tip
commit-graph contained the Corrected Commit Date chunk.

This was "fixed" in 448a39e65 (commit-graph: validate layers for
generation data, 2021-02-02) to validate that read_generation_data was
either non-zero for all layers, or it would set read_generation_data to
zero for all layers.

The problem here is that read_generation_data is not initialized to be
non-zero anywhere!

This change initializes read_generation_data immediately after the chunk
is parsed, so each layer will have its value present as soon as
possible.
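
As a rough illustration of the intended interaction, here is a sketch
with hypothetical stand-in names (toy_layer, parse_layer,
validate_chain). It is not the code in commit-graph.c; it only models
the behavior described above: each layer sets its flag at parse time,
and the chain check then clears the flag everywhere unless every layer
has it set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one layer of a split commit-graph. */
struct toy_layer {
	int has_generation_data_chunk; /* chunk present in this file? */
	int read_generation_data;      /* use corrected commit dates? */
	struct toy_layer *base;        /* next lower layer, or NULL */
};

/* Set the flag as soon as the layer's chunks are known. */
static void parse_layer(struct toy_layer *g)
{
	g->read_generation_data = g->has_generation_data_chunk ? 1 : 0;
}

/* All layers keep the flag only if every layer has it set. */
static void validate_chain(struct toy_layer *tip)
{
	struct toy_layer *p;
	int all_have_data = 1;

	assert(tip != NULL);
	for (p = tip; p; p = p->base)
		if (!p->read_generation_data)
			all_have_data = 0;

	if (!all_have_data)
		for (p = tip; p; p = p->base)
			p->read_generation_data = 0;
}
```

Without the assignment in parse_layer(), the flag would start at zero
in every layer and the chain check could only ever keep it at zero,
which matches the symptom described above.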

The read_generation_data member is used in fill_commit_graph_info() to
determine if we should use the corrected commit date or the topological
levels stored in the Commit Data chunk. Due to this bug, all previous
versions of Git were defaulting to topological levels in all cases!

This can be measured with some performance tests. Using the Linux kernel
as a testbed, I generated a complete commit-graph containing corrected
commit dates and tested the 'new' version against the previous, 'old'
version.

First, rev-list with --topo-order demonstrates a 26% improvement using
corrected commit dates:

hyperfine \
	-n "old" "$OLD_GIT rev-list --topo-order -1000 v3.6" \
	-n "new" "$NEW_GIT rev-list --topo-order -1000 v3.6" \
	--warmup=10

Benchmark 1: old
  Time (mean ± σ):      57.1 ms ±   3.1 ms
  Range (min … max):    52.9 ms …  62.0 ms    55 runs

Benchmark 2: new
  Time (mean ± σ):      45.5 ms ±   3.3 ms
  Range (min … max):    39.9 ms …  51.7 ms    59 runs

Summary
  'new' ran
    1.26 ± 0.11 times faster than 'old'

These performance improvements are due to the algorithmic improvements
given by walking fewer commits due to the higher cutoffs from corrected
commit dates.

However, this comes at a cost. The additional I/O cost of parsing the
corrected commit dates is visible in case of merge-base commands that do
not reduce the overall number of walked commits.

hyperfine \
        -n "old" "$OLD_GIT merge-base v4.8 v4.9" \
        -n "new" "$NEW_GIT merge-base v4.8 v4.9" \
        --warmup=10

Benchmark 1: old
  Time (mean ± σ):     110.4 ms ±   6.4 ms
  Range (min … max):    96.0 ms … 118.3 ms    25 runs

Benchmark 2: new
  Time (mean ± σ):     150.7 ms ±   1.1 ms
  Range (min … max):   149.3 ms … 153.4 ms    19 runs

Summary
  'old' ran
    1.36 ± 0.08 times faster than 'new'

Performance issues like this are what motivated 702110aac (commit-graph:
use config to specify generation type, 2021-02-25).

In the future, we could fix this performance problem by inserting the
corrected commit date offsets into the Commit Data chunk instead of
having that data in an extra chunk.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 commit-graph.c                |  3 +++
 t/lib-commit-graph.sh         | 12 +++++++++++-
 t/t4216-log-bloom.sh          |  2 +-
 t/t5318-commit-graph.sh       |  2 +-
 t/t5324-split-commit-graph.sh |  9 +++++++--
 5 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index a19bd96c2ee..8e52bb09552 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -407,6 +407,9 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 			&graph->chunk_generation_data);
 		pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
 			&graph->chunk_generation_data_overflow);
+
+		if (graph->chunk_generation_data)
+			graph->read_generation_data = 1;
 	}
 
 	if (r->settings.commit_graph_read_changed_paths) {
diff --git a/t/lib-commit-graph.sh b/t/lib-commit-graph.sh
index 07e12b9d2fe..5d79e1a4e96 100755
--- a/t/lib-commit-graph.sh
+++ b/t/lib-commit-graph.sh
@@ -37,11 +37,21 @@ graph_read_expect() {
 		OPTIONAL=" $2"
 		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
 	fi
+	GENERATION_VERSION=2
+	if test -n "$3"
+	then
+		GENERATION_VERSION=$3
+	fi
+	OPTIONS=
+	if test $GENERATION_VERSION -gt 1
+	then
+		OPTIONS=" read_generation_data"
+	fi
 	cat >expect <<- EOF
 	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
 	num_commits: $1
 	chunks: oid_fanout oid_lookup commit_metadata$OPTIONAL
-	options:
+	options:$OPTIONS
 	EOF
 	test-tool read-graph >output &&
 	test_cmp expect output
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 5ed6d2a21c1..fa9d32facfb 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -48,7 +48,7 @@ graph_read_expect () {
 	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
 	num_commits: $1
 	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
-	options: bloom(1,10,7)
+	options: bloom(1,10,7) read_generation_data
 	EOF
 	test-tool read-graph >actual &&
 	test_cmp expect actual
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 0ed7e9de8e6..fbf0d64578c 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -456,7 +456,7 @@ test_expect_success 'git commit-graph verify' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git rev-parse commits/8 | git -c commitGraph.generationVersion=1 commit-graph write --stdin-commits &&
 	git commit-graph verify >output &&
-	graph_read_expect 9 extra_edges
+	graph_read_expect 9 extra_edges 1
 '
 
 NUM_COMMITS=9
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 778fa418de2..669ddc645fa 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -30,11 +30,16 @@ graph_read_expect() {
 	then
 		NUM_BASE=$2
 	fi
+	OPTIONS=
+	if test -z "$3"
+	then
+		OPTIONS=" read_generation_data"
+	fi
 	cat >expect <<- EOF
 	header: 43475048 1 $(test_oid oid_version) 4 $NUM_BASE
 	num_commits: $1
 	chunks: oid_fanout oid_lookup commit_metadata generation_data
-	options:
+	options:$OPTIONS
 	EOF
 	test-tool read-graph >output &&
 	test_cmp expect output
@@ -624,7 +629,7 @@ test_expect_success 'write generation data chunk if topmost remaining layer has
 		header: 43475048 1 $(test_oid oid_version) 5 1
 		num_commits: $(($NUM_SECOND_LAYER_COMMITS + $NUM_THIRD_LAYER_COMMITS + $NUM_FOURTH_LAYER_COMMITS + $NUM_FIFTH_LAYER_COMMITS))
 		chunks: oid_fanout oid_lookup commit_metadata generation_data
-		options:
+		options: read_generation_data
 		EOF
 		test_cmp expect output
 	)
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v3 5/5] commit-graph: fix generation number v2 overflow values
  2022-03-01 19:48   ` [PATCH v3 0/5] " Derrick Stolee via GitGitGadget
                       ` (3 preceding siblings ...)
  2022-03-01 19:48     ` [PATCH v3 4/5] commit-graph: start parsing generation v2 (again) Derrick Stolee via GitGitGadget
@ 2022-03-01 19:48     ` Derrick Stolee via GitGitGadget
  4 siblings, 0 replies; 70+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-03-01 19:48 UTC (permalink / raw)
  To: git
  Cc: me, gitster, abhishekkumar8222, avarab, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The Generation Data Chunk was implemented and tested in e8b63005c
(commit-graph: implement generation data chunk, 2021-01-16), but the
test was carefully constructed to work on systems with 32-bit dates.
Since the corrected commit date offsets still required more than 31
bits, this triggered writing the generation_data_overflow chunk.

However, upon closer look, the
write_graph_chunk_generation_data_overflow() method writes the offsets
to the chunk (as dictated by the format) but fill_commit_graph_info()
treats the value in the chunk as if it is the full corrected commit date
(not an offset). For some reason, this does not cause an issue when
using the FUTURE_DATE specified in t5318-commit-graph.sh, but it does
show up as a failure in 'git commit-graph verify' if we increase that
FUTURE_DATE to be above four billion.

Fix this error and create a 64-bit timestamp version of the test so we
can test these larger values.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
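The decoding rule this patch restores can be sketched as below. This is
a minimal illustration, not the code in commit-graph.c: it assumes the
overflow flag is the high bit of the 32-bit Generation Data entry (in
line with the 31-bit limit in the format), and the names read_be64(),
OFFSET_OVERFLOW_BIT and decode_corrected_date() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the high bit of a 32-bit GDAT entry marks an overflow. */
#define OFFSET_OVERFLOW_BIT (UINT32_C(1) << 31)

/* Read a big-endian 64-bit integer, in the spirit of get_be64(). */
static uint64_t read_be64(const unsigned char *p)
{
	uint64_t v = 0;
	for (size_t i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Turn a GDAT entry into a corrected commit date. Both branches add the
 * commit date, because the overflow chunk stores offsets, not full dates.
 */
static uint64_t decode_corrected_date(uint64_t commit_date,
				      uint32_t gdat_entry,
				      const unsigned char *overflow_chunk)
{
	if (gdat_entry & OFFSET_OVERFLOW_BIT) {
		/* The remaining 31 bits index the 8-byte overflow entries. */
		uint32_t pos = gdat_entry ^ OFFSET_OVERFLOW_BIT;

		/* The real reader dies here if the overflow chunk is missing. */
		assert(overflow_chunk != NULL);
		return commit_date + read_be64(overflow_chunk + 8 * (size_t)pos);
	}
	return commit_date + gdat_entry;
}
```

Before this fix, the overflow branch returned the 8-byte value without
adding the commit date, so the two branches disagreed about what the
stored number meant.
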
 commit-graph.c                     |  2 +-
 t/t5328-commit-graph-64bit-time.sh | 27 +++++++++++++++++++++++++++
 2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index 8e52bb09552..b86a6a634fe 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -806,7 +806,7 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 				die(_("commit-graph requires overflow generation data but has none"));
 
 			offset_pos = offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW;
-			graph_data->generation = get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
+			graph_data->generation = item->date + get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
 		} else
 			graph_data->generation = item->date + offset;
 	} else
diff --git a/t/t5328-commit-graph-64bit-time.sh b/t/t5328-commit-graph-64bit-time.sh
index 28114bcaf47..093f0c067af 100755
--- a/t/t5328-commit-graph-64bit-time.sh
+++ b/t/t5328-commit-graph-64bit-time.sh
@@ -36,4 +36,31 @@ test_expect_success 'lower layers have overflow chunk' '
 
 graph_git_behavior 'overflow' '' HEAD~2 HEAD
 
+test_expect_success 'set up and verify repo with generation data overflow chunk' '
+	mkdir repo &&
+	cd repo &&
+	git init &&
+	test_commit --date "$UNIX_EPOCH_ZERO" 1 &&
+	test_commit 2 &&
+	test_commit --date "$UNIX_EPOCH_ZERO" 3 &&
+	git commit-graph write --reachable &&
+	graph_read_expect 3 generation_data &&
+	test_commit --date "$FUTURE_DATE" 4 &&
+	test_commit 5 &&
+	test_commit --date "$UNIX_EPOCH_ZERO" 6 &&
+	git branch left &&
+	git reset --hard 3 &&
+	test_commit 7 &&
+	test_commit --date "$FUTURE_DATE" 8 &&
+	test_commit 9 &&
+	git branch right &&
+	git reset --hard 3 &&
+	test_merge M left right &&
+	git commit-graph write --reachable &&
+	graph_read_expect 10 "generation_data generation_data_overflow" &&
+	git commit-graph verify
+'
+
+graph_git_behavior 'overflow 2' repo left right
+
 test_done
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 3/5] commit-graph: fix ordering bug in generation numbers
  2022-03-01 19:48     ` [PATCH v3 3/5] commit-graph: fix ordering bug in generation numbers Derrick Stolee via GitGitGadget
@ 2022-03-01 20:13       ` Junio C Hamano
  2022-03-01 20:30         ` Junio C Hamano
  0 siblings, 1 reply; 70+ messages in thread
From: Junio C Hamano @ 2022-03-01 20:13 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, abhishekkumar8222, avarab, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <derrickstolee@github.com>
>
> When computing the generation numbers for a commit-graph, we compute
> the corrected commit dates and then check if their offsets from the
> actual dates is too large to fit in the 32-bit Generation Data chunk.
> However, there is a problem with this approach: if we have parsed the
> generation data from the previous commit-graph, then we continue the
> loop because the corrected commit date is already computed. This causes
> an under-count in the number of overflow values.
>
> It is incorrect to add an increment to num_generation_data_overflows
> next to this 'continue' statement, because we might start
> double-counting commits that are computed because of the depth-first
> search walk from a commit with an earlier OID.
>
> Instead, iterate over the full commit list at the end, checking the
> offsets to see how many grow beyond the maximum value.

OK.
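
For reference, the "iterate over the full commit list at the end" step
could look roughly like the following. This is only a sketch under
stated assumptions: the struct and function names are made up for
illustration, and the inline limit is assumed to be the largest value
that fits in 31 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: an offset is stored inline only if it fits in 31 bits. */
#define MAX_INLINE_OFFSET ((UINT64_C(1) << 31) - 1)

/* Hypothetical per-commit record holding both dates. */
struct commit_dates {
	uint64_t date;      /* committer date */
	uint64_t corrected; /* corrected commit date, already computed */
};

/*
 * Count overflows in one pass after all corrected dates are known, so
 * that values parsed from an existing file and values computed during
 * the walk are each counted exactly once.
 */
static size_t count_offset_overflows(const struct commit_dates *c, size_t nr)
{
	size_t overflows = 0;

	for (size_t i = 0; i < nr; i++) {
		/* A corrected date is never smaller than the commit date. */
		assert(c[i].corrected >= c[i].date);
		if (c[i].corrected - c[i].date > MAX_INLINE_OFFSET)
			overflows++;
	}
	return overflows;
}
```

Counting in a separate pass avoids both the under-count (skipping
already-parsed commits) and the double-count (incrementing next to the
'continue') described in the commit message.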

> Create a new t5328-commit-graph-64-bit-time.sh test script to handle
> special cases of testing 64-bit timestampes. This helps demonstrate this
> bug in more cases. It still won't hit all potential cases until the next
> change, which reenables reading generation numbers. Use the skip_all
> trick from 0a2bfccb9c8 (t0051: use "skip_all" under !MINGW in
> single-test file, 2022-02-04) to make the output clean when run on a
> 32-bit system.
>
> Hepled-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>

I can typofix this one locally if needed.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 3/5] commit-graph: fix ordering bug in generation numbers
  2022-03-01 20:13       ` Junio C Hamano
@ 2022-03-01 20:30         ` Junio C Hamano
  2022-03-02 14:13           ` Derrick Stolee
  0 siblings, 1 reply; 70+ messages in thread
From: Junio C Hamano @ 2022-03-01 20:30 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, abhishekkumar8222, avarab, Derrick Stolee

Junio C Hamano <gitster@pobox.com> writes:

>> Create a new t5328-commit-graph-64-bit-time.sh test script to handle
>> special cases of testing 64-bit timestampes. This helps demonstrate this
>> bug in more cases. It still won't hit all potential cases until the next
>> change, which reenables reading generation numbers. Use the skip_all
>> trick from 0a2bfccb9c8 (t0051: use "skip_all" under !MINGW in
>> single-test file, 2022-02-04) to make the output clean when run on a
>> 32-bit system.
>>
>> Hepled-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
>
> I can typofix this one locally if needed.

What I meant was s/timestampes/timestamps/ and s/Hepled/Helped/.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 3/7] commit-graph: start parsing generation v2 (again)
  2022-03-01 15:25                   ` Derrick Stolee
@ 2022-03-02 13:57                     ` Patrick Steinhardt
  2022-03-02 14:57                       ` Derrick Stolee
  0 siblings, 1 reply; 70+ messages in thread
From: Patrick Steinhardt @ 2022-03-02 13:57 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, gitster, abhishekkumar8222

On Tue, Mar 01, 2022 at 10:25:46AM -0500, Derrick Stolee wrote:
> On 3/1/2022 9:53 AM, Patrick Steinhardt wrote:
> > On Tue, Mar 01, 2022 at 09:06:44AM -0500, Derrick Stolee wrote:
> >> On 3/1/2022 5:35 AM, Patrick Steinhardt wrote:
> >>> On Tue, Mar 01, 2022 at 10:46:14AM +0100, Patrick Steinhardt wrote:
> >>>> On Mon, Feb 28, 2022 at 01:44:01PM -0500, Derrick Stolee wrote:
> >>>>> On 2/28/2022 11:59 AM, Patrick Steinhardt wrote:
> >>>>>> On Mon, Feb 28, 2022 at 11:23:38AM -0500, Derrick Stolee wrote:
> >>>>>>> On 2/28/2022 10:18 AM, Patrick Steinhardt wrote:
> >>>>>>>> [1]: https://gitlab.com/gitlab-com/www-gitlab-com.git
> ...
> >>> So the question is whether this is a change that needs to be rolled out
> >>> over multiple releases. First we'd get in the bug fix such that we write
> >>> correct commit-graphs, and after this fix has been released we can also
> >>> release the fix that starts to actually parse the generation. This
> >>> ensures there's a grace period during which we can hopefully correct the
> >>> data on-disk such that users are not faced with failures.
> >>
> >> You are right that we need to be careful here, but I also think that
> >> previous versions of Git always wrote the correct data. Here is my
> >> thought process:
> >>
> >> 1. To get this bug, we need to have parsed the corrected commit date
> >>    from an existing commit-graph in order to under-count the number
> >>    of overflow values.
> >>
> >> 2. Before this series, Git versions were not parsing the corrected
> >>    commit date, so they recompute the corrected commit date every
> >>    time the commit-graph is written, getting the proper count of
> >>    overflow values.
> >>
> >> For these reasons, data written by previous versions of Git are
> >> correct and can be trusted without a staged release.
> >>
> >> Does this make sense? Or, do you experience a different result when
> >> you build commit-graphs with a released Git version and then when
> >> writing on top with all patches applied?
> > 
> > Just to verify my understanding: you claim that the bug I was hitting
> > shouldn't be encountered in the wild when the release , but
> > only if one were to write a commit-graph with the intermediate state
> > until patch 3/4 of your patch series?
> 
> That is my claim. And my testing of the repo at [1] has demonstrated
> that it works correctly in these cases.
>  
> > Hum. I have re-verified, and this indeed seems to play out. So I must've
> > accidentally ran all my testing with the state generated without the
> > final patch which fixes the corruption. I do see lots of the following
> > warnings, but overall I can verify and write the commit-graph just fine:
> > 
> >     commit-graph generation for commit c80a42de8803e2d77818d0c82f88e748d7f9425f is 1623362063 < 1623362139
> 
> But I'm not able to generate these warnings from either version. I
> tried generating different levels of a split commit-graph, but
> could not reproduce it. If you have reproduction steps using current
> 'master' (or any released Git version) and the four patches here,
> then I would love to get a full understanding of your errors.
> 
> Thanks,
> -Stolee

I haven't yet been able to reproduce it with publicly available data,
but with the internal references I'm able to evoke the warnings
reliably. It only works when I have two repositories connected via
alternates, when generating the commit-graph in the linked-to repo
first, and then generating the commit-graph in the linking repo.

The following recipe allows me to reproduce, but relies on private data:

    $ git --version
    git version 2.35.1

    # The pool repository is the one we're linked to from the fork.
    $ cd "$pool"
    $ rm -rf objects/info/commit-graph objects/info/commit-graphs
    $ git commit-graph write --split

    $ cd "$fork"
    $ rm -rf objects/info/commit-graph objects/info/commit-graphs
    $ git commit-graph write --split

    $ git commit-graph verify --no-progress
    $ echo $?
    0

    # This is 715d08a9e51251ad8290b181b6ac3b9e1f9719d7 with your full v2
    # applied on top.
    $ ~/Development/git/bin-wrappers/git --version
    git version 2.35.1.358.g7ede1bea24

    $ ~/Development/git/bin-wrappers/git commit-graph verify --no-progress
    commit-graph generation for commit 06a91bac00ed11128becd48d5ae77eacd8f24c97 is 1623273624 < 1623273710
    commit-graph generation for commit 0ae91029f27238e8f8e109c6bb3907f864dda14f is 1622151146 < 1622151220
    commit-graph generation for commit 0d4582a33d8c8e3eb01adbf564f5e1deeb3b56a2 is 1631045222 < 1631045225
    commit-graph generation for commit 0daf8976439d7e0bb9710c5ee63b570580e0dc03 is 1620347739 < 1620347789
    commit-graph generation for commit 0e0ee8ffb3fa22cee7d28e21cbd6df26454932cf is 1623783297 < 1623783380
    commit-graph generation for commit 0f08ab3de6ec115ea8a956a1996cb9759e640e74 is 1621543278 < 1621543339
    commit-graph generation for commit 133ed0319b5a66ae0c2be76e5a887b880452b111 is 1620949864 < 1620949915
    commit-graph generation for commit 1341b3e6c63343ae94a8a473fa057126ddd4669a is 1637344364 < 1637344384
    commit-graph generation for commit 15bdfc501c2c9f23e9353bf6e6a5facd9c32a07a is 1623348103 < 1623348133
    ...
    $ echo $?
    1

When generating commit-graphs with your patches applied the `verify`
step works alright.
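
To make the warnings above easier to read, here is a small sketch of
the invariant they appear to report. It assumes the message format is
"stored generation < 1 + largest parent generation", and it uses the
documented definition of the corrected commit date (the maximum of the
commit date and one more than the largest corrected date among the
parents). All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest generation a commit may have, given its parents' generations. */
static uint64_t min_allowed_generation(const uint64_t *parent_generations,
				       size_t nr_parents)
{
	uint64_t lower_bound = 0;

	for (size_t i = 0; i < nr_parents; i++) {
		/* Keep the "+ 1" below from wrapping around. */
		assert(parent_generations[i] < UINT64_MAX);
		if (parent_generations[i] + 1 > lower_bound)
			lower_bound = parent_generations[i] + 1;
	}
	return lower_bound;
}

/* Corrected commit date: max(commit date, 1 + largest parent value). */
static uint64_t expected_corrected_date(uint64_t commit_date,
					const uint64_t *parent_generations,
					size_t nr_parents)
{
	uint64_t lower_bound = min_allowed_generation(parent_generations,
						      nr_parents);
	return commit_date > lower_bound ? commit_date : lower_bound;
}

/* Nonzero if the stored generation respects the parent ordering. */
static int generation_is_consistent(uint64_t stored,
				    const uint64_t *parent_generations,
				    size_t nr_parents)
{
	return stored >= min_allowed_generation(parent_generations, nr_parents);
}
```

Under that reading, the first warning says the stored value 1623273624
is below the bound 1623273710, that is, some parent carries the
generation 1623273709.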

I've also by accident stumbled over the original error again:

    fatal: commit-graph requires overflow generation data but has none

This time it's definitely not caused by generating commit-graphs with an
in-between state of your patch series because the data comes straight
from production with no changes to the commit-graphs performed by
myself. There we're running Git v2.33.1 with a couple of backported
patches (see [1]). While those patches cause us to make more use of the
commit-graph, none modify the way we generate them.

Of note is that the commit-graph contains references to commits which
don't exist in the ODB anymore.

Patrick

[1]: https://gitlab.com/gitlab-org/gitlab-git/-/commits/pks-v2.33.1.gl3

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 3/5] commit-graph: fix ordering bug in generation numbers
  2022-03-01 20:30         ` Junio C Hamano
@ 2022-03-02 14:13           ` Derrick Stolee
  0 siblings, 0 replies; 70+ messages in thread
From: Derrick Stolee @ 2022-03-02 14:13 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, me, abhishekkumar8222, avarab

On 3/1/2022 3:30 PM, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
> 
>>> Create a new t5328-commit-graph-64-bit-time.sh test script to handle
>>> special cases of testing 64-bit timestampes. This helps demonstrate this
>>> bug in more cases. It still won't hit all potential cases until the next
>>> change, which reenables reading generation numbers. Use the skip_all
>>> trick from 0a2bfccb9c8 (t0051: use "skip_all" under !MINGW in
>>> single-test file, 2022-02-04) to make the output clean when run on a
>>> 32-bit system.
>>>
>>> Hepled-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
>>
>> I can typofix this one locally if needed.
> 
> What I meant was s/timestampes/timestamps/ and s/Hepled/Helped/.

Thank you!
-Stolee

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 3/7] commit-graph: start parsing generation v2 (again)
  2022-03-02 13:57                     ` Patrick Steinhardt
@ 2022-03-02 14:57                       ` Derrick Stolee
  2022-03-02 18:15                         ` Junio C Hamano
  2022-03-03 11:19                         ` Patrick Steinhardt
  0 siblings, 2 replies; 70+ messages in thread
From: Derrick Stolee @ 2022-03-02 14:57 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: Derrick Stolee via GitGitGadget, git, me, gitster, abhishekkumar8222

On 3/2/2022 8:57 AM, Patrick Steinhardt wrote:
> On Tue, Mar 01, 2022 at 10:25:46AM -0500, Derrick Stolee wrote:
>> On 3/1/2022 9:53 AM, Patrick Steinhardt wrote:

>>> Hum. I have re-verified, and this indeed seems to play out. So I must've
>>> accidentally ran all my testing with the state generated without the
>>> final patch which fixes the corruption. I do see lots of the following
>>> warnings, but overall I can verify and write the commit-graph just fine:
>>>
>>>     commit-graph generation for commit c80a42de8803e2d77818d0c82f88e748d7f9425f is 1623362063 < 1623362139
>>
>> But I'm not able to generate these warnings from either version. I
>> tried generating different levels of a split commit-graph, but
>> could not reproduce it. If you have reproduction steps using current
>> 'master' (or any released Git version) and the four patches here,
>> then I would love to get a full understanding of your errors.
>>
>> Thanks,
>> -Stolee
> 
> I haven't yet been able to reproduce it with publicly available data,
> but with the internal references I'm able to evoke the warnings
> reliably. It only works when I have two repositories connected via
> alternates, when generating the commit-graph in the linked-to repo
> first, and then generating the commit-graph in the linking repo.
> 
> The following recipe allows me to reproduce, but relies on private data:
> 
>     $ git --version
>     git version 2.35.1
> 
>     # The pool repository is the one we're linked to from the fork.
>     $ cd "$pool"
>     $ rm -rf objects/info/commit-graph objects/info/commit-graphs
>     $ git commit-graph write --split
> 
>     $ cd "$fork"
>     $ rm -rf objects/info/commit-graph objects/info/commit-graphs
>     $ git commit-graph write --split
> 
>     $ git commit-graph verify --no-progress
>     $ echo $?
>     0
> 
>     # This is 715d08a9e51251ad8290b181b6ac3b9e1f9719d7 with your full v2
>     # applied on top.
>     $ ~/Development/git/bin-wrappers/git --version
>     git version 2.35.1.358.g7ede1bea24
> 
>     $ ~/Development/git/bin-wrappers/git commit-graph verify --no-progress
>     commit-graph generation for commit 06a91bac00ed11128becd48d5ae77eacd8f24c97 is 1623273624 < 1623273710
>     commit-graph generation for commit 0ae91029f27238e8f8e109c6bb3907f864dda14f is 1622151146 < 1622151220
>     commit-graph generation for commit 0d4582a33d8c8e3eb01adbf564f5e1deeb3b56a2 is 1631045222 < 1631045225
>     commit-graph generation for commit 0daf8976439d7e0bb9710c5ee63b570580e0dc03 is 1620347739 < 1620347789
>     commit-graph generation for commit 0e0ee8ffb3fa22cee7d28e21cbd6df26454932cf is 1623783297 < 1623783380
>     commit-graph generation for commit 0f08ab3de6ec115ea8a956a1996cb9759e640e74 is 1621543278 < 1621543339
>     commit-graph generation for commit 133ed0319b5a66ae0c2be76e5a887b880452b111 is 1620949864 < 1620949915
>     commit-graph generation for commit 1341b3e6c63343ae94a8a473fa057126ddd4669a is 1637344364 < 1637344384
>     commit-graph generation for commit 15bdfc501c2c9f23e9353bf6e6a5facd9c32a07a is 1623348103 < 1623348133
>     ...
>     $ echo $?
>     1
> 
> When generating commit-graphs with your patches applied the `verify`
> step works alright.
> 
> I've also by accident stumbled over the original error again:
> 
>     fatal: commit-graph requires overflow generation data but has none
> 
> This time it's definitely not caused by generating commit-graphs with an
> in-between state of your patch series because the data comes straight
> from production with no changes to the commit-graphs performed by
> myself. There we're running Git v2.33.1 with a couple of backported
> patches (see [1]). While those patches cause us to make more use of the
> commit-graph, none modify the way we generate them.
> 
> Of note is that the commit-graph contains references to commits which
> don't exist in the ODB anymore.
> 
> Patrick
> 
> [1]: https://gitlab.com/gitlab-org/gitlab-git/-/commits/pks-v2.33.1.gl3

Thank you for your diligence here, Patrick. I really appreciate the
work you're putting in to verify the situation.

Since our repro relies on private information, but is consistent, I
wonder if we should take the patch below, which starts to ignore the
older generation number v2 data and only writes freshly-computed
numbers.

Thanks,
-Stolee

--- 8< ---

From c53d8bd52bbcab3862e8a826ee75692edc7e4173 Mon Sep 17 00:00:00 2001
From: Derrick Stolee <derrickstolee@github.com>
Date: Wed, 2 Mar 2022 09:45:13 -0500
Subject: [PATCH v3 5/4] commit-graph: declare bankruptcy on GDAT chunks

The Generation Data (GDAT) and Generation Data Overflow (GDOV) chunks
store corrected commit date offsets, used for generation number v2.
Recent changes have demonstrated that previous versions of Git were
incorrectly parsing data from these chunks, but might have also been
writing them incorrectly.

I asserted [1] that the previous fixes were sufficient because the known
reasons for incorrectly writing generation number v2 data relied on
parsing the information incorrectly out of a commit-graph file, but the
previous versions of Git were not reading the generation number v2 data.

However, Patrick demonstrated [2] a case involving split commit-graphs
across an alternate boundary (and possibly some other special
conditions) where a commit-graph generated by a previous version of Git
had incorrect generation number v2 data, resulting in errors like the
following:

  commit-graph generation for commit <oid> is 1623273624 < 1623273710

[1] https://lore.kernel.org/git/f50e74f0-9ffa-f4f2-4663-269801495ed3@github.com/
[2] https://lore.kernel.org/git/Yh93vOkt2DkrGPh2@ncase/

Clearly, there is something else going on. The situation is not
completely understood, but the errors do not reproduce if the
commit-graphs are all generated by a Git version including these recent
fixes.

If we cannot trust the existing data in the GDAT and GDOV chunks, then
we can alter the format to change the chunk IDs for these chunks. This
causes the new version of Git to silently ignore the older chunks (and
disable generation number v2 in the process) while writing new
commit-graph files with correct data in the GDA2 and GDO2 chunks.

Update commit-graph-format.txt including a historical note about these
deprecated chunks.

Reported-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
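As a quick sanity check on the new constants: each chunk ID is the four
ASCII bytes of its name read as a big-endian 32-bit integer. The helper
below is a hypothetical illustration (Git itself just hard-codes the
hex values); it only shows how "GDA2" and "GDO2" map to the numbers in
the diff.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack a four-character chunk name into its big-endian 32-bit ID. */
static uint32_t chunk_id(const char *name)
{
	assert(name != NULL && strlen(name) == 4);
	return ((uint32_t)(unsigned char)name[0] << 24) |
	       ((uint32_t)(unsigned char)name[1] << 16) |
	       ((uint32_t)(unsigned char)name[2] << 8) |
	       (uint32_t)(unsigned char)name[3];
}
```

With this helper, chunk_id("GDA2") gives 0x47444132 and
chunk_id("GDO2") gives 0x47444f32, matching the new #define lines.
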
 Documentation/technical/commit-graph-format.txt | 12 ++++++++++--
 commit-graph.c                                  |  4 ++--
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index 87971c27dd7..484b185ba98 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -93,7 +93,7 @@ CHUNK DATA:
       2 bits of the lowest byte, storing the 33rd and 34th bit of the
       commit time.
 
-  Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
+  Generation Data (ID: {'G', 'D', 'A', '2' }) (N * 4 bytes) [Optional]
     * This list of 4-byte values store corrected commit date offsets for the
       commits, arranged in the same order as commit data chunk.
     * If the corrected commit date offset cannot be stored within 31 bits,
@@ -104,7 +104,7 @@ CHUNK DATA:
       by compatible versions of Git and in case of split commit-graph chains,
       the topmost layer also has Generation Data chunk.
 
-  Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
+  Generation Data Overflow (ID: {'G', 'D', 'O', '2' }) [Optional]
     * This list of 8-byte values stores the corrected commit date offsets
       for commits with corrected commit date offsets that cannot be
       stored within 31 bits.
@@ -156,3 +156,11 @@ CHUNK DATA:
 TRAILER:
 
 	H-byte HASH-checksum of all of the above.
+
+== Historical Notes:
+
+The Generation Data (GDA2) and Generation Data Overflow (GDO2) chunks have
+the number '2' in their chunk IDs because a previous version of Git wrote
+possibly erroneous data in these chunks with the IDs "GDAT" and "GDOV". By
+changing the IDs, newer versions of Git will silently ignore those older
+chunks and write the new information without trusting the incorrect data.
diff --git a/commit-graph.c b/commit-graph.c
index b86a6a634fe..fb2ced0bd6d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -39,8 +39,8 @@ void git_test_write_commit_graph_or_die(void)
 #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
 #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
 #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
-#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
-#define GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW 0x47444f56 /* "GDOV" */
+#define GRAPH_CHUNKID_GENERATION_DATA 0x47444132 /* "GDA2" */
+#define GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW 0x47444f32 /* "GDO2" */
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
 #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
 #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
-- 
2.35.1.138.gfc5de29e9e6





^ permalink raw reply related	[flat|nested] 70+ messages in thread

* Re: [PATCH 3/7] commit-graph: start parsing generation v2 (again)
  2022-03-02 14:57                       ` Derrick Stolee
@ 2022-03-02 18:15                         ` Junio C Hamano
  2022-03-02 18:46                           ` Derrick Stolee
  2022-03-03 11:19                         ` Patrick Steinhardt
  1 sibling, 1 reply; 70+ messages in thread
From: Junio C Hamano @ 2022-03-02 18:15 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Patrick Steinhardt, Derrick Stolee via GitGitGadget, git, me,
	abhishekkumar8222

Derrick Stolee <derrickstolee@github.com> writes:

> Since our repro relies on private information, but is consistent, I
> wonder if we should take the patch below, which starts to ignore the
> older generation number v2 data and only writes freshly-computed
> numbers.

;-)

> Clearly, there is something else going on. The situation is not
> completely understood, but the errors do not reproduce if the
> commit-graphs are all generated by a Git version including these recent
> fixes.

Do you mean "we know doing X and then Y and then Z on this
particular private data with older version of Git without those two
fixes will lead to a broken timestamp, but doing exactly the same
with the two fixes, the breakage does not reproduce"?  If so, that
is quite encouraging news.  Thanks for working well together.

> If we cannot trust the existing data in the GDAT and GDOV chunks, then
> we can alter the format to change the chunk IDs for these chunks. This
> causes the new version of Git to silently ignore the older chunks (and
> disable generation number v2 in the process) while writing new
> commit-graph files with correct data in the GDA2 and GDO2 chunks.
>
> Update commit-graph-format.txt including a historical note about these
> deprecated chunks.

Sensible.

> @@ -156,3 +156,11 @@ CHUNK DATA:
>  TRAILER:
>  
>  	H-byte HASH-checksum of all of the above.
> +
> +== Historical Notes:
> +
> +The Generation Data (GDA2) and Generation Data Overflow (GDO2) chunks have
> +the number '2' in their chunk IDs because a previous version of Git wrote
> +possibly erroneous data in these chunks with the IDs "GDAT" and "GDOV". By
> +changing the IDs, newer versions of Git will silently ignore those older
> +chunks and write the new information without trusting the incorrect data.

Good.  How does a new version of Git skip and ignore GDAT and GDOV
in existing files?  By not having any code to recognize what they
are?

I am wondering if there is some notion of "if you do not understand
what this chunk is, you are incapable of handling this file
correctly, so do not use it" kind of bit per chunks (similar to the
index extensions where ones that begin with [A-Z] are optional) that
may negatively affect this plan.

Thanks.

> diff --git a/commit-graph.c b/commit-graph.c
> index b86a6a634fe..fb2ced0bd6d 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -39,8 +39,8 @@ void git_test_write_commit_graph_or_die(void)
>  #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
>  #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
>  #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
> -#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
> -#define GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW 0x47444f56 /* "GDOV" */
> +#define GRAPH_CHUNKID_GENERATION_DATA 0x47444132 /* "GDA2" */
> +#define GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW 0x47444f32 /* "GDO2" */
>  #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
>  #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
>  #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 3/7] commit-graph: start parsing generation v2 (again)
  2022-03-02 18:15                         ` Junio C Hamano
@ 2022-03-02 18:46                           ` Derrick Stolee
  2022-03-02 22:42                             ` Junio C Hamano
  0 siblings, 1 reply; 70+ messages in thread
From: Derrick Stolee @ 2022-03-02 18:46 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Patrick Steinhardt, Derrick Stolee via GitGitGadget, git, me,
	abhishekkumar8222

On 3/2/2022 1:15 PM, Junio C Hamano wrote:
> Derrick Stolee <derrickstolee@github.com> writes:
> 
>> Since our repro relies on private information, but is consistent, I
>> wonder if we should take the patch below, which starts to ignore the
>> older generation number v2 data and only writes freshly-computed
>> numbers.
> 
> ;-)
> 
>> Clearly, there is something else going on. The situation is not
>> completely understood, but the errors do not reproduce if the
>> commit-graphs are all generated by a Git version including these recent
>> fixes.
> 
> Do you mean "we know doing X and then Y and then Z on this
> particular private data with older version of Git without those two
> fixes will lead to a broken timestamp, but doing exactly the same
> with the two fixes, the breakage does not reproduce"?  If so, that
> is quite encouraging news.  Thanks for working well together.

Yes, that is my understanding.

>> If we cannot trust the existing data in the GDAT and GDOV chunks, then
>> we can alter the format to change the chunk IDs for these chunks. This
>> causes the new version of Git to silently ignore the older chunks (and
>> disable generation number v2 in the process) while writing new
>> commit-graph files with correct data in the GDA2 and GDO2 chunks.
>>
>> Update commit-graph-format.txt including a historical note about these
>> deprecated chunks.
> 
> Sensible.
> 
>> @@ -156,3 +156,11 @@ CHUNK DATA:
>>  TRAILER:
>>  
>>  	H-byte HASH-checksum of all of the above.
>> +
>> +== Historical Notes:
>> +
>> +The Generation Data (GDA2) and Generation Data Overflow (GDO2) chunks have
>> +the number '2' in their chunk IDs because a previous version of Git wrote
>> +possibly erroneous data in these chunks with the IDs "GDAT" and "GDOV". By
>> +changing the IDs, newer versions of Git will silently ignore those older
>> +chunks and write the new information without trusting the incorrect data.
> 
> Good.  How does a new version of Git skip and ignore GDAT and GDOV
> in existing files?  By not having any code to recognize what they
> are?
> 
> I am wondering if there is some notion of "if you do not understand
> what this chunk is, you are incapable of handling this file
> correctly, so do not use it" kind of bit per chunks (similar to the
> index extensions where ones that begin with [A-Z] are optional) that
> may negatively affect this plan.

The chunk IDs do not have this special casing rule. This is a
bit unfortunate for certain cases like adding something that _must_
be understood. Here, it works to our benefit that GDAT and GDOV are
optional and can be safely ignored. Thus, clients with this patch
will ignore GDAT and GDOV and continue using topological levels
from the CDAT chunk. Older clients without this patch will ignore
the new GDA2 and GDO2 chunks and continue using topological levels.

For Git versions without this topic branch, this "continue using
topological levels" means no change of behavior at all.
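
To illustrate why renaming the chunks is enough, here is a minimal
sketch of a lookup by chunk ID. It is not the actual chunk-format code:
`chunk_id()`, `struct toc_entry` and `find_chunk()` are hypothetical
names, and the assumption is only that a reader asks for the IDs it
knows and never inspects the rest. The hex constants in the usage note
match the `#define`s in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack four ASCII characters into a big-endian 32-bit chunk ID. */
static uint32_t chunk_id(const char s[4])
{
	return ((uint32_t)(unsigned char)s[0] << 24) |
	       ((uint32_t)(unsigned char)s[1] << 16) |
	       ((uint32_t)(unsigned char)s[2] << 8) |
	       (uint32_t)(unsigned char)s[3];
}

/* Hypothetical table-of-contents entry: a chunk ID and its file offset. */
struct toc_entry {
	uint32_t id;
	uint64_t offset;
};

/*
 * Look up one chunk by ID. Returns 1 and stores its offset if present,
 * 0 otherwise. Entries whose IDs are never requested are never examined
 * beyond the ID comparison, which is what "silently ignored" means here.
 */
static int find_chunk(const struct toc_entry *toc, size_t nr,
		      uint32_t id, uint64_t *offset)
{
	size_t i;

	assert(toc || !nr);
	for (i = 0; i < nr; i++) {
		if (toc[i].id == id) {
			*offset = toc[i].offset;
			return 1;
		}
	}
	return 0;
}
```

With this model, `chunk_id("GDA2")` is 0x47444132 and `chunk_id("GDO2")`
is 0x47444f32. A file written by an older Git lists only "GDAT" and
"GDOV", so a lookup for "GDA2" returns 0 and the reader falls back to
topological levels.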

Thanks,
-Stolee

* Re: [PATCH 3/7] commit-graph: start parsing generation v2 (again)
  2022-03-02 18:46                           ` Derrick Stolee
@ 2022-03-02 22:42                             ` Junio C Hamano
  0 siblings, 0 replies; 70+ messages in thread
From: Junio C Hamano @ 2022-03-02 22:42 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Patrick Steinhardt, Derrick Stolee via GitGitGadget, git, me,
	abhishekkumar8222

Derrick Stolee <derrickstolee@github.com> writes:

>> I am wondering if there is some notion of "if you do not understand
>> what this chunk is, you are incapable of handling this file
>> correctly, so do not use it" kind of bit per chunks (similar to the
>> index extensions where ones that begin with [A-Z] are optional) that
>> may negatively affect this plan.
>
> The chunk IDs do not have this special casing rule. This is a
> bit unfortunate for certain cases like adding something that _must_
> be understood. Here, it works to our benefit that GDAT and GDOV are
> optional and can be safely ignored. Thus, clients with this patch
> will ignore GDAT and GDOV and continue using topological levels
> from the CDAT chunk. Older clients without this patch will ignore
> the new GDA2 and GDO2 chunks and continue using topological levels.
>
> For Git versions without this topic branch, this "continue using
> topological levels" means no change of behavior at all.

Excellent.  Thanks.

* Re: [PATCH 3/7] commit-graph: start parsing generation v2 (again)
  2022-03-02 14:57                       ` Derrick Stolee
  2022-03-02 18:15                         ` Junio C Hamano
@ 2022-03-03 11:19                         ` Patrick Steinhardt
  2022-03-03 16:00                           ` Derrick Stolee
  1 sibling, 1 reply; 70+ messages in thread
From: Patrick Steinhardt @ 2022-03-03 11:19 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, gitster, abhishekkumar8222

On Wed, Mar 02, 2022 at 09:57:17AM -0500, Derrick Stolee wrote:
> On 3/2/2022 8:57 AM, Patrick Steinhardt wrote:
> > On Tue, Mar 01, 2022 at 10:25:46AM -0500, Derrick Stolee wrote:
> >> On 3/1/2022 9:53 AM, Patrick Steinhardt wrote:
> 
> >>> Hum. I have re-verified, and this indeed seems to play out. So I must've
> >>> accidentally run all my testing with the state generated without the
> >>> final patch which fixes the corruption. I do see lots of the following
> >>> warnings, but overall I can verify and write the commit-graph just fine:
> >>>
> >>>     commit-graph generation for commit c80a42de8803e2d77818d0c82f88e748d7f9425f is 1623362063 < 1623362139
> >>
> >> But I'm not able to generate these warnings from either version. I
> >> tried generating different levels of a split commit-graph, but
> >> could not reproduce it. If you have reproduction steps using current
> >> 'master' (or any released Git version) and the four patches here,
> >> then I would love to get a full understanding of your errors.
> >>
> >> Thanks,
> >> -Stolee
> > 
> > I haven't yet been able to reproduce it with publicly available data,
> > but with the internal references I'm able to evoke the warnings
> > reliably. It only works when I have two repositories connected via
> > alternates, when generating the commit-graph in the linked-to repo
> > first, and then generating the commit-graph in the linking repo.
> > 
> > The following recipe allows me to reproduce, but relies on private data:
> > 
> >     $ git --version
> >     git version 2.35.1
> > 
> >     # The pool repository is the one we're linked to from the fork.
> >     $ cd "$pool"
> >     $ rm -rf objects/info/commit-graph objects/info/commit-graph
> >     $ git commit-graph write --split
> > 
> >     $ cd "$fork"
> >     $ rm -rf objects/info/commit-graph objects/info/commit-graph
> >     $ git commit-graph write --split
> > 
> >     $ git commit-graph verify --no-progress
> >     $ echo $?
> >     0
> > 
> >     # This is 715d08a9e51251ad8290b181b6ac3b9e1f9719d7 with your full v2
> >     # applied on top.
> >     $ ~/Development/git/bin-wrappers/git --version
> >     git version 2.35.1.358.g7ede1bea24
> > 
> >     $ ~/Development/git/bin-wrappers/git commit-graph verify --no-progress
> >     commit-graph generation for commit 06a91bac00ed11128becd48d5ae77eacd8f24c97 is 1623273624 < 1623273710
> >     commit-graph generation for commit 0ae91029f27238e8f8e109c6bb3907f864dda14f is 1622151146 < 1622151220
> >     commit-graph generation for commit 0d4582a33d8c8e3eb01adbf564f5e1deeb3b56a2 is 1631045222 < 1631045225
> >     commit-graph generation for commit 0daf8976439d7e0bb9710c5ee63b570580e0dc03 is 1620347739 < 1620347789
> >     commit-graph generation for commit 0e0ee8ffb3fa22cee7d28e21cbd6df26454932cf is 1623783297 < 1623783380
> >     commit-graph generation for commit 0f08ab3de6ec115ea8a956a1996cb9759e640e74 is 1621543278 < 1621543339
> >     commit-graph generation for commit 133ed0319b5a66ae0c2be76e5a887b880452b111 is 1620949864 < 1620949915
> >     commit-graph generation for commit 1341b3e6c63343ae94a8a473fa057126ddd4669a is 1637344364 < 1637344384
> >     commit-graph generation for commit 15bdfc501c2c9f23e9353bf6e6a5facd9c32a07a is 1623348103 < 1623348133
> >     ...
> >     $ echo $?
> >     1
> > 
> > When generating commit-graphs with your patches applied the `verify`
> > step works alright.
> > 
> > I've also by accident stumbled over the original error again:
> > 
> >     fatal: commit-graph requires overflow generation data but has none
> > 
> > This time it's definitely not caused by generating commit-graphs with an
> > in-between state of your patch series because the data comes straight
> > from production with no changes to the commit-graphs performed by
> > myself. There we're running Git v2.33.1 with a couple of backported
> > patches (see [1]). While those patches cause us to make more use of the
> > commit-graph, none modify the way we generate them.
> > 
> > Of note is that the commit-graph contains references to commits which
> > don't exist in the ODB anymore.
> > 
> > Patrick
> > 
> > [1]: https://gitlab.com/gitlab-org/gitlab-git/-/commits/pks-v2.33.1.gl3
> 
> Thank you for your diligence here, Patrick. I really appreciate the
> work you're putting in to verify the situation.
> 
> Since our repro relies on private information, but is consistent, I
> wonder if we should take the patch below, which starts to ignore the
> older generation number v2 data and only writes freshly-computed
> numbers.
> 
> Thanks,
> -Stolee

Thanks. With your patch below the `fatal:` error is gone, but I'm still
seeing the same errors with regards to the commit-graph generations.

So to summarize my findings:

    - This bug occurs when writing commit-graphs with v2.35.1, but
      reading them with your patches.

    - This bug occurs when I have two repositories connected via an
      alternates file. I haven't yet been able to reproduce it in a
      single repository that is not connected to a separate ODB.

    - This bug only occurs when I first generate the commit-graph in the
      repository I'm borrowing objects from.

    - This bug only occurs when I write commit-graphs with `--split` in
      both repositories. "Normal" commit-graphs don't have this issue,
      and neither can I see it with `--split=replace` or mixed-type
      commit-graphs.

Beware, the following explanation is based on my very basic
understanding of the commit-graph code and thus more likely to be wrong
than right:

With the old Git version, we've been mis-parsing the generation because
`read_generation_data` wasn't ever set. As a result it can happen that
the second split commit-graph we're generating computes its own
generation numbers from the wrong starting point because it uses the
mis-parsed generation numbers from the parent commit-graph.

With your patches, we start to correctly account for overflows and would
thus end up with a different value for the generation depending on where
we parse the commit from: if we parse it from the first commit-graph it
would be correct because it contains the "root" of the generation
numbers. But if we parse a commit from the second commit-graph we may
have a mismatch because the generation numbers in there may have been
derived from generation numbers mis-parsed from the first commit-graph.
And because these would be wrong in case there was an overflow it is
clear that the new corrected generation number may be wrong, as well.
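
To make the overflow handling concrete, here is a minimal sketch of how
one 4-byte generation data entry could be turned into a corrected commit
date. It assumes the layout described in commit-graph-format.txt: a set
most-significant bit means the remaining 31 bits index into the 8-byte
overflow list, otherwise the value is the offset itself.
`decode_generation()` and its parameters are hypothetical names, not the
functions in commit-graph.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag bit marking an entry that points into the overflow list. */
#define OFFSET_OVERFLOW_BIT (UINT32_C(1) << 31)

/*
 * Decode one generation data entry for a commit with the given date.
 * Returns 0 and stores commit_date + offset in *generation on success.
 * Returns -1 if the entry refers to an overflow list that is absent or
 * too short (the situation behind "requires overflow generation data
 * but has none").
 */
static int decode_generation(uint32_t entry, uint64_t commit_date,
			     const uint64_t *overflow, size_t overflow_nr,
			     uint64_t *generation)
{
	assert(generation);
	if (entry & OFFSET_OVERFLOW_BIT) {
		uint32_t pos = entry ^ OFFSET_OVERFLOW_BIT;

		if (!overflow || pos >= overflow_nr)
			return -1;
		*generation = commit_date + overflow[pos];
	} else {
		*generation = commit_date + entry;
	}
	return 0;
}
```

Any reader that handles the flag bit differently would arrive at a
different generation for the same entry, and a writer building a new
layer on top of such values would carry the difference forward.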

Patrick

> --- 8< ---
> 
> From c53d8bd52bbcab3862e8a826ee75692edc7e4173 Mon Sep 17 00:00:00 2001
> From: Derrick Stolee <derrickstolee@github.com>
> Date: Wed, 2 Mar 2022 09:45:13 -0500
> Subject: [PATCH v3 5/4] commit-graph: declare bankruptcy on GDAT chunks
> 
> The Generation Data (GDAT) and Generation Data Overflow (GDOV) chunks
> store corrected commit date offsets, used for generation number v2.
> Recent changes have demonstrated that previous versions of Git were
> incorrectly parsing data from these chunks, but might have also been
> writing them incorrectly.
> 
> I asserted [1] that the previous fixes were sufficient because the known
> reasons for incorrectly writing generation number v2 data relied on
> parsing the information incorrectly out of a commit-graph file, but the
> previous versions of Git were not reading the generation number v2 data.
> 
> However, Patrick demonstrated [2] a case where in split commit-graphs
> across an alternate boundary (and possibly some other special
> conditions) it was possible to have a commit-graph that was generated by
> a previous version of Git have incorrect generation number v2 data which
> results in errors like the following:
> 
>   commit-graph generation for commit <oid> is 1623273624 < 1623273710
> 
> [1] https://lore.kernel.org/git/f50e74f0-9ffa-f4f2-4663-269801495ed3@github.com/
> [2] https://lore.kernel.org/git/Yh93vOkt2DkrGPh2@ncase/
> 
> Clearly, there is something else going on. The situation is not
> completely understood, but the errors do not reproduce if the
> commit-graphs are all generated by a Git version including these recent
> fixes.
> 
> If we cannot trust the existing data in the GDAT and GDOV chunks, then
> we can alter the format to change the chunk IDs for these chunks. This
> causes the new version of Git to silently ignore the older chunks
> (disabling generation number v2 in the process) while writing new
> commit-graph files with correct data in the GDA2 and GDO2 chunks.
> 
> Update commit-graph-format.txt including a historical note about these
> deprecated chunks.
> 
> Reported-by: Patrick Steinhardt <ps@pks.im>
> Signed-off-by: Derrick Stolee <derrickstolee@github.com>
> ---
>  Documentation/technical/commit-graph-format.txt | 12 ++++++++++--
>  commit-graph.c                                  |  4 ++--
>  2 files changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> index 87971c27dd7..484b185ba98 100644
> --- a/Documentation/technical/commit-graph-format.txt
> +++ b/Documentation/technical/commit-graph-format.txt
> @@ -93,7 +93,7 @@ CHUNK DATA:
>        2 bits of the lowest byte, storing the 33rd and 34th bit of the
>        commit time.
>  
> -  Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
> +  Generation Data (ID: {'G', 'D', 'A', '2' }) (N * 4 bytes) [Optional]
>      * This list of 4-byte values store corrected commit date offsets for the
>        commits, arranged in the same order as commit data chunk.
>      * If the corrected commit date offset cannot be stored within 31 bits,
> @@ -104,7 +104,7 @@ CHUNK DATA:
>        by compatible versions of Git and in case of split commit-graph chains,
>        the topmost layer also has Generation Data chunk.
>  
> -  Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
> +  Generation Data Overflow (ID: {'G', 'D', 'O', '2' }) [Optional]
>      * This list of 8-byte values stores the corrected commit date offsets
>        for commits with corrected commit date offsets that cannot be
>        stored within 31 bits.
> @@ -156,3 +156,11 @@ CHUNK DATA:
>  TRAILER:
>  
>  	H-byte HASH-checksum of all of the above.
> +
> +== Historical Notes:
> +
> +The Generation Data (GDA2) and Generation Data Overflow (GDO2) chunks have
> +the number '2' in their chunk IDs because a previous version of Git wrote
> +possibly erroneous data in these chunks with the IDs "GDAT" and "GDOV". By
> +changing the IDs, newer versions of Git will silently ignore those older
> +chunks and write the new information without trusting the incorrect data.
> diff --git a/commit-graph.c b/commit-graph.c
> index b86a6a634fe..fb2ced0bd6d 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -39,8 +39,8 @@ void git_test_write_commit_graph_or_die(void)
>  #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
>  #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
>  #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
> -#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
> -#define GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW 0x47444f56 /* "GDOV" */
> +#define GRAPH_CHUNKID_GENERATION_DATA 0x47444132 /* "GDA2" */
> +#define GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW 0x47444f32 /* "GDO2" */
>  #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
>  #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
>  #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
> -- 
> 2.35.1.138.gfc5de29e9e6
> 
> 
> 
> 

* Re: [PATCH 3/7] commit-graph: start parsing generation v2 (again)
  2022-03-03 11:19                         ` Patrick Steinhardt
@ 2022-03-03 16:00                           ` Derrick Stolee
  2022-03-04 14:03                             ` Derrick Stolee
  0 siblings, 1 reply; 70+ messages in thread
From: Derrick Stolee @ 2022-03-03 16:00 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: Derrick Stolee via GitGitGadget, git, me, gitster, abhishekkumar8222

On 3/3/2022 6:19 AM, Patrick Steinhardt wrote:
> Thanks. With your patch below the `fatal:` error is gone, but I'm still
> seeing the same errors with regards to the commit-graph generations.

This is disappointing and unexpected. Thanks for verifying.

> So to summarize my findings:
> 
>     - This bug occurs when writing commit-graphs with v2.35.1, but
>       reading them with your patches.
> 
>     - This bug occurs when I have two repositories connected via an
>       alternates file. I haven't yet been able to reproduce it in a
>       single repository that is not connected to a separate ODB.

This is an interesting distinction. One that I didn't think would
matter, but I'll look into the code to see how that could affect
things.

>     - This bug only occurs when I first generate the commit-graph in the
>       repository I'm borrowing objects from.
> 
>     - This bug only occurs when I write commit-graphs with `--split` in
>       both repositories. "Normal" commit-graphs don't have this issue,
>       and neither can I see it with `--split=replace` or mixed-type
>       commit-graphs.
> 
> Beware, the following explanation is based on my very basic
> understanding of the commit-graph code and thus more likely to be wrong
> than right:
> 
> With the old Git version, we've been mis-parsing the generation because
> `read_generation_data` wasn't ever set. As a result it can happen that
> the second split commit-graph we're generating computes its own
> generation numbers from the wrong starting point because it uses the
> mis-parsed generation numbers from the parent commit-graph.
> 
> With your patches, we start to correctly account for overflows and would
> thus end up with a different value for the generation depending on where
> we parse the commit from: if we parse it from the first commit-graph it
> would be correct because it contains the "root" of the generation
> numbers. But if we parse a commit from the second commit-graph we may
> have a mismatch because the generation numbers in there may have been
> derived from generation numbers mis-parsed from the first commit-graph.
> And because these would be wrong in case there was an overflow it is
> clear that the new corrected generation number may be wrong, as well.

Hm. My expectation was that the older layers of the split commit-graph
would have read_generation_data disabled (because the new Git version
cannot read the GDAT chunk) and then the validate_mixed_generation_chain()
method would remove read_generation_data from all of the graphs in the
list.

Combining this with your thoughts on cross-alternate split commit-graphs,
this makes me think we should try this:

--- >8 ---

diff --git a/commit-graph.c b/commit-graph.c
index fb2ced0bd6..74c6534f56 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -609,8 +609,6 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
 	if (!g)
 		g = load_commit_graph_chain(r, odb);
 
-	validate_mixed_generation_chain(g);
-
 	return g;
 }
 
@@ -668,7 +666,13 @@ static int prepare_commit_graph(struct repository *r)
 	     !r->objects->commit_graph && odb;
 	     odb = odb->next)
 		prepare_commit_graph_one(r, odb);
-	return !!r->objects->commit_graph;
+
+	if (r->objects->commit_graph) {
+		validate_mixed_generation_chain(r->objects->commit_graph);
+		return 1;
+	}
+
+	return 0;
 }
 
 int generation_numbers_enabled(struct repository *r)


--- >8 ---

Notice that I'm moving the validate_mixed_generation_chain() call
out of read_commit_graph_one() and into prepare_commit_graph(). To
my understanding, this _should_ produce the same end state as the
old code, but it might be worth trying just as a quick check.
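
For readers following along, the all-or-nothing rule that
validate_mixed_generation_chain() is expected to enforce can be modeled
as below. This is a simplified sketch, not the code in commit-graph.c:
`struct graph_layer` is a stand-in holding only two fields, and the
`base_graph` link and the function name `validate_mixed_chain()` are
assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for one layer of a split commit-graph chain. */
struct graph_layer {
	int read_generation_data;       /* 1 if this layer offers gen v2 data */
	struct graph_layer *base_graph; /* next older layer, or NULL */
};

/*
 * If any layer in the chain lacks generation v2 data, disable it on
 * every layer so that all commits are compared with the same kind of
 * generation number.
 */
static void validate_mixed_chain(struct graph_layer *g)
{
	struct graph_layer *p;
	int all_have_v2 = 1;

	for (p = g; p; p = p->base_graph)
		if (!p->read_generation_data)
			all_have_v2 = 0;

	if (all_have_v2)
		return;

	for (p = g; p; p = p->base_graph)
		p->read_generation_data = 0;
}
```

The point of the diff above is only where this check runs: once on the
final chain, after every object directory has been visited.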

I will continue investigating and try to reproduce with this
additional constraint of working across an alternate.

Thanks,
-Stolee



* Re: [PATCH 3/7] commit-graph: start parsing generation v2 (again)
  2022-03-03 16:00                           ` Derrick Stolee
@ 2022-03-04 14:03                             ` Derrick Stolee
  2022-03-07 10:34                               ` Patrick Steinhardt
  0 siblings, 1 reply; 70+ messages in thread
From: Derrick Stolee @ 2022-03-04 14:03 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: Derrick Stolee via GitGitGadget, git, me, gitster, abhishekkumar8222

On 3/3/2022 11:00 AM, Derrick Stolee wrote:
> I will continue investigating and try to reproduce with this
> additional constraint of working across an alternate.

My attempts to reproduce this across an alternate have failed. I
ran the following test against Git without these patches, then
verified the result with the newer version of Git. (I have also
generated a few new layers on top with these patches, and they
correctly drop the GDA2 and GDO2 chunks when the lower layers "don't
have gen v2".)


test_description='commit-graph with offsets across alternates'
. ./test-lib.sh

if ! test_have_prereq TIME_IS_64BIT || ! test_have_prereq TIME_T_IS_64BIT
then
	skip_all='skipping 64-bit timestamp tests'
	test_done
fi


UNIX_EPOCH_ZERO="@0 +0000"
FUTURE_DATE="@4147483646 +0000"

GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0

test_expect_success 'generate alternate split commit-graph' '
	git init alternate &&
	(
		cd alternate &&
		test_commit --date "$UNIX_EPOCH_ZERO" 1 &&
		test_commit --date "$FUTURE_DATE" 2 &&
		git commit-graph write --reachable &&
		test_commit --date "$UNIX_EPOCH_ZERO" 3 &&
		test_commit --date "$FUTURE_DATE" 4 &&
		git commit-graph write --reachable --split=no-merge
	) &&
	git clone --shared alternate fork &&
	(
		cd fork &&
		test_commit --date "$UNIX_EPOCH_ZERO" 5 &&
		test_commit --date "$FUTURE_DATE" 6 &&
		git commit-graph write --reachable --split=no-merge &&
		test_commit --date "$UNIX_EPOCH_ZERO" 7 &&
		test_commit --date "$FUTURE_DATE" 8 &&
		git commit-graph write --reachable --split=no-merge
	)
'

test_done


Running this with -d lets me reliably see these layers being created
with GDAT and GDOV chunks. Running the 'git commit-graph verify'
command with the new code does not show those errors, even after
adding commits and another layer to the split commit-graph.

I look forward to any additional insights you might have here.

Thanks,
-Stolee

* Re: [PATCH 3/7] commit-graph: start parsing generation v2 (again)
  2022-03-04 14:03                             ` Derrick Stolee
@ 2022-03-07 10:34                               ` Patrick Steinhardt
  2022-03-07 13:45                                 ` Derrick Stolee
  0 siblings, 1 reply; 70+ messages in thread
From: Patrick Steinhardt @ 2022-03-07 10:34 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, gitster, abhishekkumar8222

On Fri, Mar 04, 2022 at 09:03:15AM -0500, Derrick Stolee wrote:
> On 3/3/2022 11:00 AM, Derrick Stolee wrote:
> > On 3/3/2022 6:19 AM, Patrick Steinhardt wrote:
> >> On Wed, Mar 02, 2022 at 09:57:17AM -0500, Derrick Stolee wrote:
> >>> On 3/2/2022 8:57 AM, Patrick Steinhardt wrote:
> >>>> On Tue, Mar 01, 2022 at 10:25:46AM -0500, Derrick Stolee wrote:
> >>>>> On 3/1/2022 9:53 AM, Patrick Steinhardt wrote:
> >>>
> >>>>>> Hum. I have re-verified, and this indeed seems to play out. So I must've
> >>>>>> accidentally ran all my testing with the state generated without the
> >>>>>> final patch which fixes the corruption. I do see lots of the following
> >>>>>> warnings, but overall I can verify and write the commit-graph just fine:
> >>>>>>
> >>>>>>     commit-graph generation for commit c80a42de8803e2d77818d0c82f88e748d7f9425f is 1623362063 < 1623362139
> >>>>>
> >>>>> But I'm not able to generate these warnings from either version. I
> >>>>> tried generating different levels of a split commit-graph, but
> >>>>> could not reproduce it. If you have reproduction steps using current
> >>>>> 'master' (or any released Git version) and the four patches here,
> >>>>> then I would love to get a full understanding of your errors.
> >>>>>
> >>>>> Thanks,
> >>>>> -Stolee
> >>>>
> >>>> I haven't yet been able to reproduce it with publicly available data,
> >>>> but with the internal references I'm able to evoke the warnings
> >>>> reliably. It only works when I have two repositories connected via
> >>>> alternates, when generating the commit-graph in the linked-to repo
> >>>> first, and then generating the commit-graph in the linking repo.
> >>>>
> >>>> The following recipe allows me to reproduce, but rely on private data:
> >>>>
> >>>>     $ git --version
> >>>>     git version 2.35.1
> >>>>
> >>>>     # The pool repository is the one we're linked to from the fork.
> >>>>     $ cd "$pool"
> >>>>     $ rm -rf objects/info/commit-graph objects/info/commit-graphs
> >>>>     $ git commit-graph write --split
> >>>>
> >>>>     $ cd "$fork"
> >>>>     $ rm -rf objects/info/commit-graph objects/info/commit-graphs
> >>>>     $ git commit-graph write --split
> >>>>
> >>>>     $ git commit-graph verify --no-progress
> >>>>     $ echo $?
> >>>>     0
> >>>>
> >>>>     # This is 715d08a9e51251ad8290b181b6ac3b9e1f9719d7 with your full v2
> >>>>     # applied on top.
> >>>>     $ ~/Development/git/bin-wrappers/git --version
> >>>>     git version 2.35.1.358.g7ede1bea24
> >>>>
> >>>>     $ ~/Development/git/bin-wrappers/git commit-graph verify --no-progress
> >>>>     commit-graph generation for commit 06a91bac00ed11128becd48d5ae77eacd8f24c97 is 1623273624 < 1623273710
> >>>>     commit-graph generation for commit 0ae91029f27238e8f8e109c6bb3907f864dda14f is 1622151146 < 1622151220
> >>>>     commit-graph generation for commit 0d4582a33d8c8e3eb01adbf564f5e1deeb3b56a2 is 1631045222 < 1631045225
> >>>>     commit-graph generation for commit 0daf8976439d7e0bb9710c5ee63b570580e0dc03 is 1620347739 < 1620347789
> >>>>     commit-graph generation for commit 0e0ee8ffb3fa22cee7d28e21cbd6df26454932cf is 1623783297 < 1623783380
> >>>>     commit-graph generation for commit 0f08ab3de6ec115ea8a956a1996cb9759e640e74 is 1621543278 < 1621543339
> >>>>     commit-graph generation for commit 133ed0319b5a66ae0c2be76e5a887b880452b111 is 1620949864 < 1620949915
> >>>>     commit-graph generation for commit 1341b3e6c63343ae94a8a473fa057126ddd4669a is 1637344364 < 1637344384
> >>>>     commit-graph generation for commit 15bdfc501c2c9f23e9353bf6e6a5facd9c32a07a is 1623348103 < 1623348133
> >>>>     ...
> >>>>     $ echo $?
> >>>>     1
> >>>>
> >>>> When generating commit-graphs with your patches applied the `verify`
> >>>> step works alright.
> >>>>
> >>>> I've also by accident stumbled over the original error again:
> >>>>
> >>>>     fatal: commit-graph requires overflow generation data but has none
> >>>>
> >>>> This time it's definitely not caused by generating commit-graphs with an
> >>>> in-between state of your patch series because the data comes straight
> >>>> from production with no changes to the commit-graphs performed by
> >>>> myself. There we're running Git v2.33.1 with a couple of backported
> >>>> patches (see [1]). While those patches cause us to make more use of the
> >>>> commit-graph, none modify the way we generate them.
> >>>>
> >>>> Of note is that the commit-graph contains references to commits which
> >>>> don't exist in the ODB anymore.
> >>>>
> >>>> Patrick
> >>>>
> >>>> [1]: https://gitlab.com/gitlab-org/gitlab-git/-/commits/pks-v2.33.1.gl3
> >>>
> >>> Thank you for your diligence here, Patrick. I really appreciate the
> >>> work you're putting in to verify the situation.
> >>>
> >>> Since our repro relies on private information, but is consistent, I
> >>> wonder if we should take the patch below, which starts to ignore the
> >>> older generation number v2 data and only writes freshly-computed
> >>> numbers.
> >>>
> >>> Thanks,
> >>> -Stolee
> >>
> >> Thanks. With your patch below the `fatal:` error is gone, but I'm still
> >> seeing the same errors with regards to the commit-graph generations.
> > 
> > This is disappointing and unexpected. Thanks for verifying.
> > 
> >> So to summarize my findings:
> >>
> >>     - This bug occurs when writing commit-graphs with v2.35.1, but
> >>       reading them with your patches.
> >>
> >>     - This bug occurs when I have two repositories connected via an
> >>       alternates file. I haven't yet been able to reproduce it in a
> >>       single repository that is not connected to a separate ODB.
> > 
> > This is an interesting distinction. One that I didn't think would
> > matter, but I'll look into the code to see how that could affect
> > things.
> > 
> >>     - This bug only occurs when I first generate the commit-graph in the
> >>       repository I'm borrowing objects from.
> >>
> >>     - This bug only occurs when I write commit-graphs with `--split` in
> >>       both repositories. "Normal" commit-graphs don't have this issue,
> >>       and neither can I see it with `--split=replace` or mixed-type
> >>       commit-graphs.
> >>
> >> Beware, the following explanation is based on my very basic
> >> understanding of the commit-graph code and thus more likely to be wrong
> >> than right:
> >>
> >> With the old Git version, we've been mis-parsing the generation because
> >> `read_generation_data` wasn't ever set. As a result it can happen that
> >> the second split commit-graph we're generating computes its own
> >> generation numbers from the wrong starting point because it uses the
> >> mis-parsed generation numbers from the parent commit-graph.
> >>
> >> With your patches, we start to correctly account for overflows and would
> >> thus end up with a different value for the generation depending on where
> >> we parse the commit from: if we parse it from the first commit-graph it
> >> would be correct because it contains the "root" of the generation
> >> numbers. But if we parse a commit from the second commit-graph we may
> >> have a mismatch because the generation numbers in there may have been
> >> derived from generation numbers mis-parsed from the first commit-graph.
> >> And because these would be wrong in case there was an overflow it is
> >> clear that the new corrected generation number may be wrong, as well.
> > 
> > Hm. My expectation was that the older layers of the split commit-graph
> > would have read_generation_data disabled (because the new Git version
> > cannot read the GDAT chunk) and then the validate_mixed_generation_chain()
> > method would remove read_generation_data from all of the graphs in the
> > list.
> > 
> > Combining this with your thoughts on cross-alternate split commit-graphs,
> > this makes me think we should try this:
> > 
> > --- >8 ---
> > 
> > diff --git a/commit-graph.c b/commit-graph.c
> > index fb2ced0bd6..74c6534f56 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -609,8 +609,6 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
> >  	if (!g)
> >  		g = load_commit_graph_chain(r, odb);
> >  
> > -	validate_mixed_generation_chain(g);
> > -
> >  	return g;
> >  }
> >  
> > @@ -668,7 +666,13 @@ static int prepare_commit_graph(struct repository *r)
> >  	     !r->objects->commit_graph && odb;
> >  	     odb = odb->next)
> >  		prepare_commit_graph_one(r, odb);
> > -	return !!r->objects->commit_graph;
> > +
> > +	if (r->objects->commit_graph) {
> > +		validate_mixed_generation_chain(r->objects->commit_graph);
> > +		return 1;
> > +	}
> > +
> > +	return 0;
> >  }
> >  
> >  int generation_numbers_enabled(struct repository *r)
> > 
> > 
> > --- >8 ---
> > 
> > Notice that I'm moving the validate_mixed_generation_chain() call
> > out of read_commit_graph_one() and into prepare_commit_graph(). To
> > my understanding, this _should_ have an equivalent end state as the
> > old code, but might be worth trying just as a quick check.
> > 
> > I will continue investigating and try to reproduce with this
> > additional constraint of working across an alternate.
> 
> My attempts to reproduce this across an alternate have failed. I
> tried running the following test against Git without these patches,
> then verify with the newer version of Git. (I also have generated
> a few new layers on top with these patches, and they correctly drop
> the GDA2 and GDO2 chunks when the lower layers "don't have gen v2".)
> 
> 
> test_description='commit-graph with offsets across alternates'
> . ./test-lib.sh
> 
> if ! test_have_prereq TIME_IS_64BIT || ! test_have_prereq TIME_T_IS_64BIT
> then
> 	skip_all='skipping 64-bit timestamp tests'
> 	test_done
> fi
> 
> 
> UNIX_EPOCH_ZERO="@0 +0000"
> FUTURE_DATE="@4147483646 +0000"
> 
> GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
> 
> test_expect_success 'generate alternate split commit-graph' '
> 	git init alternate &&
> 	(
> 		cd alternate &&
> 		test_commit --date "$UNIX_EPOCH_ZERO" 1 &&
> 		test_commit --date "$FUTURE_DATE" 2 &&
> 		git commit-graph write --reachable &&
> 		test_commit --date "$UNIX_EPOCH_ZERO" 3 &&
> 		test_commit --date "$FUTURE_DATE" 4 &&
> 		git commit-graph write --reachable --split=no-merge
> 	) &&
> 	git clone --shared alternate fork &&
> 	(
> 		cd fork &&
> 		test_commit --date "$UNIX_EPOCH_ZERO" 5 &&
> 		test_commit --date "$FUTURE_DATE" 6 &&
> 		git commit-graph write --reachable --split=no-merge &&
> 		test_commit --date "$UNIX_EPOCH_ZERO" 7 &&
> 		test_commit --date "$FUTURE_DATE" 8 &&
> 		git commit-graph write --reachable --split=no-merge
> 	)
> '
> 
> test_done
> 
> 
> My testing after running this with -d allows me to reliably see these
> layers being created with GDAT and GDOV chunks. Running the 'git
> commit-graph verify' command with the new code does not show those
> errors, even after adding commits and another layer to the split
> commit-graph.
> 
> I look forward to any additional insights you might have here.

I don't really know why, but I can no longer reproduce it. I think we
should just go with your patch 5/4 on top -- it does
fix the most important issue, which is the `die()` I saw on almost all
commands. The second part about the warnings I'm just not sure about,
but I don't think it should stop this patch series given my own
uncertainty.

Patrick

* Re: [PATCH 3/7] commit-graph: start parsing generation v2 (again)
  2022-03-07 10:34                               ` Patrick Steinhardt
@ 2022-03-07 13:45                                 ` Derrick Stolee
  2022-03-07 17:22                                   ` Junio C Hamano
  2022-03-10 13:58                                   ` Patrick Steinhardt
  0 siblings, 2 replies; 70+ messages in thread
From: Derrick Stolee @ 2022-03-07 13:45 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: Derrick Stolee via GitGitGadget, git, me, gitster, abhishekkumar8222

On 3/7/2022 5:34 AM, Patrick Steinhardt wrote:
> On Fri, Mar 04, 2022 at 09:03:15AM -0500, Derrick Stolee wrote:
>> On 3/3/2022 11:00 AM, Derrick Stolee wrote:
...
>>> I will continue investigating and try to reproduce with this
>>> additional constraint of working across an alternate.
>>
>> My attempts to reproduce this across an alternate have failed. I
>> tried running the following test against Git without these patches,
>> then verify with the newer version of Git. (I also have generated
>> a few new layers on top with these patches, and they correctly drop
>> the GDA2 and GDO2 chunks when the lower layers "don't have gen v2".)
>>
>>
>> test_description='commit-graph with offsets across alternates'
>> . ./test-lib.sh
>>
>> if ! test_have_prereq TIME_IS_64BIT || ! test_have_prereq TIME_T_IS_64BIT
>> then
>> 	skip_all='skipping 64-bit timestamp tests'
>> 	test_done
>> fi
>>
>>
>> UNIX_EPOCH_ZERO="@0 +0000"
>> FUTURE_DATE="@4147483646 +0000"
>>
>> GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
>>
>> test_expect_success 'generate alternate split commit-graph' '
>> 	git init alternate &&
>> 	(
>> 		cd alternate &&
>> 		test_commit --date "$UNIX_EPOCH_ZERO" 1 &&
>> 		test_commit --date "$FUTURE_DATE" 2 &&
>> 		git commit-graph write --reachable &&
>> 		test_commit --date "$UNIX_EPOCH_ZERO" 3 &&
>> 		test_commit --date "$FUTURE_DATE" 4 &&
>> 		git commit-graph write --reachable --split=no-merge
>> 	) &&
>> 	git clone --shared alternate fork &&
>> 	(
>> 		cd fork &&
>> 		test_commit --date "$UNIX_EPOCH_ZERO" 5 &&
>> 		test_commit --date "$FUTURE_DATE" 6 &&
>> 		git commit-graph write --reachable --split=no-merge &&
>> 		test_commit --date "$UNIX_EPOCH_ZERO" 7 &&
>> 		test_commit --date "$FUTURE_DATE" 8 &&
>> 		git commit-graph write --reachable --split=no-merge
>> 	)
>> '
>>
>> test_done
>>
>>
>> My testing after running this with -d allows me to reliably see these
>> layers being created with GDAT and GDOV chunks. Running the 'git
>> commit-graph verify' command with the new code does not show those
>> errors, even after adding commits and another layer to the split
>> commit-graph.
>>
>> I look forward to any additional insights you might have here.
> 
> I don't really know why, but now I've become unable to reproduce it
> again. I think we should just go with your patch 5/4 on top -- it does
> fix the most important issue, which is the `die()` I saw on almost all
> commands. The second part about the warnings I'm just not sure about,
> but I don't think it should stop this patch series given my own
> uncertainty.

Thanks for following up. I agree that with 5/4 we should be safe.

I'll remain available to quickly respond if anything else surprising
comes up in this area.

Thanks!
-Stolee

* Re: [PATCH 3/7] commit-graph: start parsing generation v2 (again)
  2022-03-07 13:45                                 ` Derrick Stolee
@ 2022-03-07 17:22                                   ` Junio C Hamano
  2022-03-10 13:58                                   ` Patrick Steinhardt
  1 sibling, 0 replies; 70+ messages in thread
From: Junio C Hamano @ 2022-03-07 17:22 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Patrick Steinhardt, Derrick Stolee via GitGitGadget, git, me,
	abhishekkumar8222

Derrick Stolee <derrickstolee@github.com> writes:

> Thanks for following up. I agree that with 5/4 we should be safe.
>
> I'll remain available to quickly respond if anything else surprising
> comes up in this area.

Thanks.  I just picked up the "bankruptcy" step, so we should be
good to go.


* Re: [PATCH 3/7] commit-graph: start parsing generation v2 (again)
  2022-03-07 13:45                                 ` Derrick Stolee
  2022-03-07 17:22                                   ` Junio C Hamano
@ 2022-03-10 13:58                                   ` Patrick Steinhardt
  2022-03-10 17:18                                     ` Derrick Stolee
  1 sibling, 1 reply; 70+ messages in thread
From: Patrick Steinhardt @ 2022-03-10 13:58 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, gitster, abhishekkumar8222

On Mon, Mar 07, 2022 at 08:45:07AM -0500, Derrick Stolee wrote:
> On 3/7/2022 5:34 AM, Patrick Steinhardt wrote:
> > On Fri, Mar 04, 2022 at 09:03:15AM -0500, Derrick Stolee wrote:
> >> On 3/3/2022 11:00 AM, Derrick Stolee wrote:
> ...
> >>> I will continue investigating and try to reproduce with this
> >>> additional constraint of working across an alternate.
> >>
> >> My attempts to reproduce this across an alternate have failed. I
> >> tried running the following test against Git without these patches,
> >> then verify with the newer version of Git. (I also have generated
> >> a few new layers on top with these patches, and they correctly drop
> >> the GDA2 and GDO2 chunks when the lower layers "don't have gen v2".)
> >>
> >>
> >> test_description='commit-graph with offsets across alternates'
> >> . ./test-lib.sh
> >>
> >> if ! test_have_prereq TIME_IS_64BIT || ! test_have_prereq TIME_T_IS_64BIT
> >> then
> >> 	skip_all='skipping 64-bit timestamp tests'
> >> 	test_done
> >> fi
> >>
> >>
> >> UNIX_EPOCH_ZERO="@0 +0000"
> >> FUTURE_DATE="@4147483646 +0000"
> >>
> >> GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
> >>
> >> test_expect_success 'generate alternate split commit-graph' '
> >> 	git init alternate &&
> >> 	(
> >> 		cd alternate &&
> >> 		test_commit --date "$UNIX_EPOCH_ZERO" 1 &&
> >> 		test_commit --date "$FUTURE_DATE" 2 &&
> >> 		git commit-graph write --reachable &&
> >> 		test_commit --date "$UNIX_EPOCH_ZERO" 3 &&
> >> 		test_commit --date "$FUTURE_DATE" 4 &&
> >> 		git commit-graph write --reachable --split=no-merge
> >> 	) &&
> >> 	git clone --shared alternate fork &&
> >> 	(
> >> 		cd fork &&
> >> 		test_commit --date "$UNIX_EPOCH_ZERO" 5 &&
> >> 		test_commit --date "$FUTURE_DATE" 6 &&
> >> 		git commit-graph write --reachable --split=no-merge &&
> >> 		test_commit --date "$UNIX_EPOCH_ZERO" 7 &&
> >> 		test_commit --date "$FUTURE_DATE" 8 &&
> >> 		git commit-graph write --reachable --split=no-merge
> >> 	)
> >> '
> >>
> >> test_done
> >>
> >>
> >> My testing after running this with -d allows me to reliably see these
> >> layers being created with GDAT and GDOV chunks. Running the 'git
> >> commit-graph verify' command with the new code does not show those
> >> errors, even after adding commits and another layer to the split
> >> commit-graph.
> >>
> >> I look forward to any additional insights you might have here.
> > 
> > I don't really know why, but now I've become unable to reproduce it
> > again. I think we should just go with your patch 5/4 on top -- it does
> > fix the most important issue, which is the `die()` I saw on almost all
> > commands. The second part about the warnings I'm just not sure about,
> > but I don't think it should stop this patch series given my own
> > uncertainty.
> 
> Thanks for following up. I agree that with 5/4 we should be safe.
> 
> I'll remain available to quickly respond if anything else surprising
> comes up in this area.
> 
> Thanks!
> -Stolee

There is another surprise I hit today in the context of generation
numbers. In production, I found the following bug:

    signal: aborted (core dumped): BUG: chunk-format.c:88: expected to write 8 bytes to chunk 47444f56, but wrote 168304 instead

47444f56 is the GENERATION_DATA_OVERFLOW chunk ID, and apparently the
precomputed size we intended to write did not match the data we actually
wrote to disk. I think this stems from a mismatch between how we
precompute the number of generation data overflows and how we actually
write the data to disk:

    - We precompute how many generation number overflows there are in
      `compute_generation_numbers()`. Here we only increment the number
      of overflows in case all parents of a given commit have a non-zero
      generation number and if the generation is bigger than OFFSET_MAX.
      Apparently we found only a single commit which matches this
      criterion, because we pass `sizeof(timestamp_t) * overflows` as
      the expected size, and `sizeof(timestamp_t) == 8`.

    - On the other hand, when we write generation numbers to disk in
      `write_graph_chunk_generation_data_overflow()`, we always write an
      entry in case a commit's offset is bigger than OFFSET_MAX. So we
      don't care about the parents here, and this seems to extend the
      number of commits which match this criterion to the 21038 commits
      we write into the file.

As a result, the sanity check that compares the amount of data actually
written against the expected size fails, because the two places count
overflows differently.
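
As a quick sanity check, the numbers in the BUG message can be decoded
by hand. The following is a minimal shell sketch, not anything taken
from Git itself; `decode_chunk_id` is a hypothetical helper name, and it
assumes an even number of hex digits. It turns the hex chunk ID into
ASCII (which should read GDOV) and divides the written size by the
8-byte timestamp_t entry size mentioned above:

```shell
#!/bin/sh
# Hypothetical helper (not part of Git): print the ASCII form of a
# chunk ID given as a string of hex digits, two digits per byte.
decode_chunk_id () {
	hex=$1
	out=
	while test -n "$hex"
	do
		rest=${hex#??}		# everything after the first two digits
		byte=${hex%"$rest"}	# the first two digits
		hex=$rest
		# hex byte -> octal escape -> character
		out="$out$(printf "\\$(printf '%03o' "$((0x$byte))")")"
	done
	printf '%s\n' "$out"
}

decode_chunk_id 47444f56

# Bytes written divided by sizeof(timestamp_t) == 8 gives the entry count.
echo "$((168304 / 8)) entries written, $((8 / 8)) expected"
```

The division comes out to the 21038 commits mentioned in the list above,
against a single expected entry.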

This time I don't have access to the repository myself; I only tried to
piece together what's happening from the bug message and the code.

Patrick

* Re: [PATCH 3/7] commit-graph: start parsing generation v2 (again)
  2022-03-10 13:58                                   ` Patrick Steinhardt
@ 2022-03-10 17:18                                     ` Derrick Stolee
  0 siblings, 0 replies; 70+ messages in thread
From: Derrick Stolee @ 2022-03-10 17:18 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: Derrick Stolee via GitGitGadget, git, me, gitster, abhishekkumar8222

On 3/10/2022 8:58 AM, Patrick Steinhardt wrote:
> On Mon, Mar 07, 2022 at 08:45:07AM -0500, Derrick Stolee wrote:
>> On 3/7/2022 5:34 AM, Patrick Steinhardt wrote:
>>> On Fri, Mar 04, 2022 at 09:03:15AM -0500, Derrick Stolee wrote:
>>>> On 3/3/2022 11:00 AM, Derrick Stolee wrote:
>> ...
>>>>> I will continue investigating and try to reproduce with this
>>>>> additional constraint of working across an alternate.
>>>>
>>>> My attempts to reproduce this across an alternate have failed. I
>>>> tried running the following test against Git without these patches,
>>>> then verify with the newer version of Git. (I also have generated
>>>> a few new layers on top with these patches, and they correctly drop
>>>> the GDA2 and GDO2 chunks when the lower layers "don't have gen v2".)
>>>>
>>>>
>>>> test_description='commit-graph with offsets across alternates'
>>>> . ./test-lib.sh
>>>>
>>>> if ! test_have_prereq TIME_IS_64BIT || ! test_have_prereq TIME_T_IS_64BIT
>>>> then
>>>> 	skip_all='skipping 64-bit timestamp tests'
>>>> 	test_done
>>>> fi
>>>>
>>>>
>>>> UNIX_EPOCH_ZERO="@0 +0000"
>>>> FUTURE_DATE="@4147483646 +0000"
>>>>
>>>> GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
>>>>
>>>> test_expect_success 'generate alternate split commit-graph' '
>>>> 	git init alternate &&
>>>> 	(
>>>> 		cd alternate &&
>>>> 		test_commit --date "$UNIX_EPOCH_ZERO" 1 &&
>>>> 		test_commit --date "$FUTURE_DATE" 2 &&
>>>> 		git commit-graph write --reachable &&
>>>> 		test_commit --date "$UNIX_EPOCH_ZERO" 3 &&
>>>> 		test_commit --date "$FUTURE_DATE" 4 &&
>>>> 		git commit-graph write --reachable --split=no-merge
>>>> 	) &&
>>>> 	git clone --shared alternate fork &&
>>>> 	(
>>>> 		cd fork &&
>>>> 		test_commit --date "$UNIX_EPOCH_ZERO" 5 &&
>>>> 		test_commit --date "$FUTURE_DATE" 6 &&
>>>> 		git commit-graph write --reachable --split=no-merge &&
>>>> 		test_commit --date "$UNIX_EPOCH_ZERO" 7 &&
>>>> 		test_commit --date "$FUTURE_DATE" 8 &&
>>>> 		git commit-graph write --reachable --split=no-merge
>>>> 	)
>>>> '
>>>>
>>>> test_done
>>>>
>>>>
>>>> My testing after running this with -d allows me to reliably see these
>>>> layers being created with GDAT and GDOV chunks. Running the 'git
>>>> commit-graph verify' command with the new code does not show those
>>>> errors, even after adding commits and another layer to the split
>>>> commit-graph.
>>>>
>>>> I look forward to any additional insights you might have here.
>>>
>>> I don't really know why, but now I've become unable to reproduce it
>>> again. I think we should just go with your patch 5/4 on top -- it does
>>> fix the most important issue, which is the `die()` I saw on almost all
>>> commands. The second part about the warnings I'm just not sure about,
>>> but I don't think it should stop this patch series given my own
>>> uncertainty.
>>
>> Thanks for following up. I agree that with 5/4 we should be safe.
>>
>> I'll remain available to quickly respond if anything else surprising
>> comes up in this area.
>>
>> Thanks!
>> -Stolee
> 
> There is another surprise I hit today in the context of generation
> numbers. In production, I found the following bug:

"In production" makes me think this is on a version of Git without
these patches. Am I right?

>     signal: aborted (core dumped): BUG: chunk-format.c:88: expected to write 8 bytes to chunk 47444f56, but wrote 168304 instead
> 
> 47444f56 is the GENERATION_DATA_OVERFLOW chunk ID, and seemingly the
> precomputed size we intended to write was mismatching the data we have
> actually been writing to disk. And I think this stems from a mismatch in
> how we precompute the number of generation data overflows compared to
> how we're actually writing the data to disk:

Yes, and this count was supposed to be fixed in patch v3 3/5.

The issue is that we skip the increment if the commit was already
parsed, so we would undercount. The fix was to do a loop at the
end focused on counting these overflows.
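
A toy model of that undercount, sketched in shell with awk rather than
taken from commit-graph.c: each line of the made-up `commits` list holds
an offset and a flag saying whether the compute pass skips the commit.
The function names, the sample data, and the OFFSET_MAX value of
(1 << 31) - 1 are all assumptions for illustration only.

```shell
#!/bin/sh
# Toy model, not Git code. Each line is "<offset> <already_computed>";
# already_computed=1 means the compute pass skips that commit.
OFFSET_MAX=2147483647	# assumed value, (1 << 31) - 1

commits='4147483646 1
0 1
4147483646 0
5 0
4147483646 1'

# Count overflows only for commits the compute pass actually visits.
# This models the undercounting behavior.
count_while_computing () {
	printf '%s\n' "$commits" |
	awk -v max="$OFFSET_MAX" '$2 == 0 && $1 > max { n++ } END { print n + 0 }'
}

# Count overflows for every commit in one final loop, which matches
# what the writer emits.
count_in_final_loop () {
	printf '%s\n' "$commits" |
	awk -v max="$OFFSET_MAX" '$1 > max { n++ } END { print n + 0 }'
}

echo "precomputed: $(count_while_computing) entries"
echo "written:     $(count_in_final_loop) entries"
```

With this sample data the first count comes out smaller than the second,
which is the same shape of mismatch that the BUG() in chunk-format.c
reports.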

If you have v3 applied and you still get this error, then we need
to look closely at this issue.

For the sake of your production systems, you might want to set
"commitGraph.generationVersion=1" in your config until you have the
fixes from this series.
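
One way to apply that setting per repository, as a minimal sketch:
`pin_generation_v1` is a hypothetical wrapper name, while `git config`
and the `commitGraph.generationVersion` key are the real ones named
above.

```shell
#!/bin/sh
# Hypothetical wrapper: set commitGraph.generationVersion to 1 in the
# local config of the repository at path $1.
pin_generation_v1 () {
	git -C "$1" config commitGraph.generationVersion 1
}
```

Calling `pin_generation_v1 /path/to/repo` sets the value in that
repository's local config; `git config --global` would be the
alternative for a whole machine. The config change alone does not
rewrite commit-graph files that are already on disk.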

Thanks,
-Stolee

Thread overview: 70+ messages
2022-02-24 20:38 [PATCH 0/7] Commit-graph: Generation Number v2 Fixes, v3 implementation Derrick Stolee via GitGitGadget
2022-02-24 20:38 ` [PATCH 1/7] test-read-graph: include extra post-parse info Derrick Stolee via GitGitGadget
2022-02-24 20:38 ` [PATCH 2/7] commit-graph: fix ordering bug in generation numbers Derrick Stolee via GitGitGadget
2022-02-24 22:15   ` Junio C Hamano
2022-02-25 13:51     ` Derrick Stolee
2022-02-25 17:35       ` Junio C Hamano
2022-02-24 20:38 ` [PATCH 3/7] commit-graph: start parsing generation v2 (again) Derrick Stolee via GitGitGadget
2022-02-28 15:18   ` Patrick Steinhardt
2022-02-28 16:23     ` Derrick Stolee
2022-02-28 16:59       ` Patrick Steinhardt
2022-02-28 18:44         ` Derrick Stolee
2022-03-01  9:46           ` Patrick Steinhardt
2022-03-01 10:35             ` Patrick Steinhardt
2022-03-01 14:06               ` Derrick Stolee
2022-03-01 14:53                 ` Patrick Steinhardt
2022-03-01 15:25                   ` Derrick Stolee
2022-03-02 13:57                     ` Patrick Steinhardt
2022-03-02 14:57                       ` Derrick Stolee
2022-03-02 18:15                         ` Junio C Hamano
2022-03-02 18:46                           ` Derrick Stolee
2022-03-02 22:42                             ` Junio C Hamano
2022-03-03 11:19                         ` Patrick Steinhardt
2022-03-03 16:00                           ` Derrick Stolee
2022-03-04 14:03                             ` Derrick Stolee
2022-03-07 10:34                               ` Patrick Steinhardt
2022-03-07 13:45                                 ` Derrick Stolee
2022-03-07 17:22                                   ` Junio C Hamano
2022-03-10 13:58                                   ` Patrick Steinhardt
2022-03-10 17:18                                     ` Derrick Stolee
2022-02-24 20:38 ` [PATCH 4/7] commit-graph: fix generation number v2 overflow values Derrick Stolee via GitGitGadget
2022-02-24 22:35   ` Junio C Hamano
2022-02-25 13:53     ` Derrick Stolee
2022-02-25 17:38       ` Junio C Hamano
2022-02-24 20:38 ` [PATCH 5/7] commit-graph: document file format v2 Derrick Stolee via GitGitGadget
2022-02-24 22:55   ` Junio C Hamano
2022-02-25 22:31   ` Ævar Arnfjörð Bjarmason
2022-02-28 13:44     ` Derrick Stolee
2022-02-28 14:27       ` Ævar Arnfjörð Bjarmason
2022-02-28 16:39         ` Derrick Stolee
2022-02-28 21:14           ` Ævar Arnfjörð Bjarmason
2022-03-01 14:19             ` Derrick Stolee
2022-03-01 14:29               ` Ævar Arnfjörð Bjarmason
2022-03-01 15:59                 ` Derrick Stolee
2022-02-24 20:38 ` [PATCH 6/7] commit-graph: parse " Derrick Stolee via GitGitGadget
2022-02-24 23:01   ` Junio C Hamano
2022-02-25 13:54     ` Derrick Stolee
2022-02-24 20:38 ` [PATCH 7/7] commit-graph: write " Derrick Stolee via GitGitGadget
2022-02-24 21:42 ` [PATCH 0/7] Commit-graph: Generation Number v2 Fixes, v3 implementation Junio C Hamano
2022-02-24 23:06   ` Junio C Hamano
2022-02-25 13:55     ` Derrick Stolee
2022-02-28 13:53 ` [PATCH v2 0/4] Commit-graph: Generation Number v2 Fixes Derrick Stolee via GitGitGadget
2022-02-28 13:53   ` [PATCH v2 1/4] test-read-graph: include extra post-parse info Derrick Stolee via GitGitGadget
2022-02-28 15:22     ` Ævar Arnfjörð Bjarmason
2022-02-28 13:53   ` [PATCH v2 2/4] commit-graph: fix ordering bug in generation numbers Derrick Stolee via GitGitGadget
2022-02-28 15:25     ` Ævar Arnfjörð Bjarmason
2022-02-28 13:53   ` [PATCH v2 3/4] commit-graph: start parsing generation v2 (again) Derrick Stolee via GitGitGadget
2022-02-28 15:30     ` Ævar Arnfjörð Bjarmason
2022-02-28 16:43       ` Derrick Stolee
2022-02-28 13:53   ` [PATCH v2 4/4] commit-graph: fix generation number v2 overflow values Derrick Stolee via GitGitGadget
2022-02-28 15:40     ` Ævar Arnfjörð Bjarmason
2022-03-01 17:23   ` [PATCH v2 0/4] Commit-graph: Generation Number v2 Fixes Ævar Arnfjörð Bjarmason
2022-03-01 19:48   ` [PATCH v3 0/5] " Derrick Stolee via GitGitGadget
2022-03-01 19:48     ` [PATCH v3 1/5] test-read-graph: include extra post-parse info Derrick Stolee via GitGitGadget
2022-03-01 19:48     ` [PATCH v3 2/5] t5318: extract helpers to lib-commit-graph.sh Derrick Stolee via GitGitGadget
2022-03-01 19:48     ` [PATCH v3 3/5] commit-graph: fix ordering bug in generation numbers Derrick Stolee via GitGitGadget
2022-03-01 20:13       ` Junio C Hamano
2022-03-01 20:30         ` Junio C Hamano
2022-03-02 14:13           ` Derrick Stolee
2022-03-01 19:48     ` [PATCH v3 4/5] commit-graph: start parsing generation v2 (again) Derrick Stolee via GitGitGadget
2022-03-01 19:48     ` [PATCH v3 5/5] commit-graph: fix generation number v2 overflow values Derrick Stolee via GitGitGadget
