git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/6] bloom: reuse existing Bloom filters when possible during upgrade
@ 2023-08-07 16:37 Taylor Blau
  2023-08-07 16:37 ` [RFC PATCH 1/6] bloom: annotate filters with hash version Taylor Blau
                   ` (6 more replies)
  0 siblings, 7 replies; 24+ messages in thread
From: Taylor Blau @ 2023-08-07 16:37 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jonathan Tan, Junio C Hamano

This series is based off of 'jt/path-filter-fix'.

These few patches implement an idea that we discussed in [1], where we
attempt to reuse existing Bloom filters during an upgrade from v1 to v2
Bloom filters while rewriting the commit-graph.

The core idea is that Bloom filters are reusable when there aren't any
non-ASCII paths in a commit's tree-diff against its first parent (or the
empty tree, if none exists). If we assume that a commit's tree-diff
meets those conditions, we can't conclude anything about whether either
tree contains non-ASCII characters, since they could be unmodified on
either side and thus excluded from the tree-diff.

But assuming the RHS (that there aren't any non-ASCII characters present
in the tree's path set) *does* give us that there aren't any such paths
present in the first-parent tree diff, either.

This series checks whether or not commits meet that criteria, and reuses
the existing Bloom filter (if one exists) when possible. In practice, we
end up visiting relatively few trees, since we mark trees we've already
visited.

On both linux.git and git.git, this series gives a significant speed-up
when upgrading Bloom filters from v1 to v2. On linux.git:

    Benchmark 1: GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths
      Time (mean ± σ):     124.873 s ±  0.316 s    [User: 124.081 s, System: 0.643 s]
      Range (min … max):   124.621 s … 125.227 s    3 runs

    Benchmark 2: GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths
      Time (mean ± σ):     79.271 s ±  0.163 s    [User: 74.611 s, System: 4.521 s]
      Range (min … max):   79.112 s … 79.437 s    3 runs

    Summary
      'GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths' ran
        1.58 ± 0.01 times faster than 'GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths'

On git.git (where we do have some non-ASCII paths), the change goes from
4.163 seconds to 3.348 seconds, for a 1.24x speed-up.

I'm sending this as an RFC, since we are in the middle of the -rc phase,
and 'jt/path-filter-fix' isn't expected[2] to merge into 'master' until
we're on the other side of 2.42.

The structure of this series is as follows:

  - The first three patches prepare to load the `BDAT` chunk, even when
    the graph's Bloom filter settings are incompatible with the value in
    `commitGraph.changedPathsVersion`.
  - The fourth patch begins loading `BDAT` chunks unconditionally.
  - The fifth patch is a clean-up.
  - The sixth and final patch implements the approach discussed above.

Thanks in advance for your thoughts and review :-).

[1]: https://lore.kernel.org/git/ZMKvsObx+uaKA8zF@nand.local/
[2]: https://lore.kernel.org/git/xmqqy1it6ykm.fsf@gitster.g/

Taylor Blau (6):
  bloom: annotate filters with hash version
  bloom: prepare to discard incompatible Bloom filters
  t/t4216-log-bloom.sh: harden `test_bloom_filters_not_used()`
  commit-graph.c: unconditionally load Bloom filters
  object.h: fix mis-aligned flag bits table
  commit-graph: reuse existing Bloom filters where possible

 bloom.c              | 117 +++++++++++++++++++++++++++++++++++++++++--
 bloom.h              |  22 +++++++-
 commit-graph.c       |  24 +++++----
 object.h             |   3 +-
 t/t4216-log-bloom.sh |  49 ++++++++++++++++--
 5 files changed, 195 insertions(+), 20 deletions(-)

-- 
2.41.0.407.g6d1c33951b

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC PATCH 1/6] bloom: annotate filters with hash version
  2023-08-07 16:37 [RFC PATCH 0/6] bloom: reuse existing Bloom filters when possible during upgrade Taylor Blau
@ 2023-08-07 16:37 ` Taylor Blau
  2023-08-11 21:46   ` Jonathan Tan
  2023-08-07 16:37 ` [RFC PATCH 2/6] bloom: prepare to discard incompatible Bloom filters Taylor Blau
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Taylor Blau @ 2023-08-07 16:37 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jonathan Tan, Junio C Hamano

In subsequent commits, we will want to load existing Bloom filters out
of a commit-graph, even when the hash version they were computed with
does not match the value of `commitGraph.changedPathVersion`.

In order to differentiate between the two, add a "filter" field to each
Bloom filter.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bloom.c | 11 ++++++++---
 bloom.h |  1 +
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/bloom.c b/bloom.c
index ebef5cfd2f..9b6a30f6f6 100644
--- a/bloom.c
+++ b/bloom.c
@@ -55,6 +55,7 @@ int load_bloom_filter_from_graph(struct commit_graph *g,
 	filter->data = (unsigned char *)(g->chunk_bloom_data +
 					sizeof(unsigned char) * start_index +
 					BLOOMDATA_CHUNK_HEADER_SIZE);
+	filter->version = g->bloom_filter_settings->hash_version;
 
 	return 1;
 }
@@ -240,11 +241,13 @@ static int pathmap_cmp(const void *hashmap_cmp_fn_data UNUSED,
 	return strcmp(e1->path, e2->path);
 }
 
-static void init_truncated_large_filter(struct bloom_filter *filter)
+static void init_truncated_large_filter(struct bloom_filter *filter,
+					int version)
 {
 	filter->data = xmalloc(1);
 	filter->data[0] = 0xFF;
 	filter->len = 1;
+	filter->version = version;
 }
 
 struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
@@ -329,13 +332,15 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 		}
 
 		if (hashmap_get_size(&pathmap) > settings->max_changed_paths) {
-			init_truncated_large_filter(filter);
+			init_truncated_large_filter(filter,
+						    settings->hash_version);
 			if (computed)
 				*computed |= BLOOM_TRUNC_LARGE;
 			goto cleanup;
 		}
 
 		filter->len = (hashmap_get_size(&pathmap) * settings->bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
+		filter->version = settings->hash_version;
 		if (!filter->len) {
 			if (computed)
 				*computed |= BLOOM_TRUNC_EMPTY;
@@ -355,7 +360,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 	} else {
 		for (i = 0; i < diff_queued_diff.nr; i++)
 			diff_free_filepair(diff_queued_diff.queue[i]);
-		init_truncated_large_filter(filter);
+		init_truncated_large_filter(filter, settings->hash_version);
 
 		if (computed)
 			*computed |= BLOOM_TRUNC_LARGE;
diff --git a/bloom.h b/bloom.h
index 138d57a86b..330a140520 100644
--- a/bloom.h
+++ b/bloom.h
@@ -55,6 +55,7 @@ struct bloom_filter_settings {
 struct bloom_filter {
 	unsigned char *data;
 	size_t len;
+	int version;
 };
 
 /*
-- 
2.41.0.407.g6d1c33951b


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH 2/6] bloom: prepare to discard incompatible Bloom filters
  2023-08-07 16:37 [RFC PATCH 0/6] bloom: reuse existing Bloom filters when possible during upgrade Taylor Blau
  2023-08-07 16:37 ` [RFC PATCH 1/6] bloom: annotate filters with hash version Taylor Blau
@ 2023-08-07 16:37 ` Taylor Blau
  2023-08-11 21:48   ` Jonathan Tan
  2023-08-24 22:20   ` Jonathan Tan
  2023-08-07 16:37 ` [RFC PATCH 3/6] t/t4216-log-bloom.sh: harden `test_bloom_filters_not_used()` Taylor Blau
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 24+ messages in thread
From: Taylor Blau @ 2023-08-07 16:37 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jonathan Tan, Junio C Hamano

Callers use the inline `get_bloom_filter()` implementation as a thin
wrapper around `get_or_compute_bloom_filter()`. The former calls the
latter with a value of "0" for `compute_if_not_present`, making
`get_bloom_filter()` the default read-only path for fetching an existing
Bloom filter.

Callers expect the value returned from `get_bloom_filter()` is usable,
that is that it's compatible with the configured value corresponding to
`commitGraph.changedPathsVersion`.

This is OK, since the commit-graph machinery only initializes its BDAT
chunk (thereby enabling it to service Bloom filter queries) when the
Bloom filter hash_version is compatible with our settings. So any value
returned by `get_bloom_filter()` is trivially useable.

However, subsequent commits will load the BDAT chunk even when the Bloom
filters are built with incompatible hash versions. Prepare to handle
this by teaching `get_bloom_filter()` to discard filters that are
incompatible with the configured hash version.

Callers who wish to read incompatible filters (e.g., for upgrading
filters from v1 to v2) may use the lower level routine,
`get_or_compute_bloom_filter()`.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bloom.c | 20 +++++++++++++++++++-
 bloom.h | 20 ++++++++++++++++++--
 2 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/bloom.c b/bloom.c
index 9b6a30f6f6..739fa093ba 100644
--- a/bloom.c
+++ b/bloom.c
@@ -250,6 +250,23 @@ static void init_truncated_large_filter(struct bloom_filter *filter,
 	filter->version = version;
 }
 
+struct bloom_filter *get_bloom_filter(struct repository *r, struct commit *c)
+{
+	struct bloom_filter *filter;
+	int hash_version;
+
+	filter = get_or_compute_bloom_filter(r, c, 0, NULL, NULL);
+	if (!filter)
+		return NULL;
+
+	prepare_repo_settings(r);
+	hash_version = r->settings.commit_graph_changed_paths_version;
+
+	if (!(hash_version == -1 || hash_version == filter->version))
+		return NULL; /* unusable filter */
+	return filter;
+}
+
 struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 						 struct commit *c,
 						 int compute_if_not_present,
@@ -275,7 +292,8 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 						     filter, graph_pos);
 	}
 
-	if (filter->data && filter->len)
+	if ((filter->data && filter->len) &&
+	    (!settings || settings->hash_version == filter->version))
 		return filter;
 	if (!compute_if_not_present)
 		return NULL;
diff --git a/bloom.h b/bloom.h
index 330a140520..2b1c124bb5 100644
--- a/bloom.h
+++ b/bloom.h
@@ -110,8 +110,24 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 						 const struct bloom_filter_settings *settings,
 						 enum bloom_filter_computed *computed);
 
-#define get_bloom_filter(r, c) get_or_compute_bloom_filter( \
-	(r), (c), 0, NULL, NULL)
+/*
+ * Find the Bloom filter associated with the given commit "c".
+ *
+ * If any of the following are true
+ *
+ *   - the repository does not have a commit-graph
+ *   - it has a commit-graph, but reading the commit-graph is disabled
+ *   - the given commit does not have a Bloom filter computed
+ *   - there is a Bloom filter for commit "c", but it cannot be read because
+ *     disabled
+ *
+ * , then `get_bloom_filter()` will return NULL. Otherwise, the corresponding
+ * Bloom filter will be returned.
+ *
+ * For callers who wish to inspect Bloom filters with incompatible hash
+ * versions, use get_or_compute_bloom_filter().
+ */
+struct bloom_filter *get_bloom_filter(struct repository *r, struct commit *c);
 
 int bloom_filter_contains(const struct bloom_filter *filter,
 			  const struct bloom_key *key,
-- 
2.41.0.407.g6d1c33951b


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH 3/6] t/t4216-log-bloom.sh: harden `test_bloom_filters_not_used()`
  2023-08-07 16:37 [RFC PATCH 0/6] bloom: reuse existing Bloom filters when possible during upgrade Taylor Blau
  2023-08-07 16:37 ` [RFC PATCH 1/6] bloom: annotate filters with hash version Taylor Blau
  2023-08-07 16:37 ` [RFC PATCH 2/6] bloom: prepare to discard incompatible Bloom filters Taylor Blau
@ 2023-08-07 16:37 ` Taylor Blau
  2023-08-07 16:37 ` [RFC PATCH 4/6] commit-graph.c: unconditionally load Bloom filters Taylor Blau
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Taylor Blau @ 2023-08-07 16:37 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jonathan Tan, Junio C Hamano

The existing implementation of test_bloom_filters_not_used() asserts
that the Bloom filter sub-system has not been initialized at all, by
checking for the absence of any data from it from trace2.

In the following commit, it will become possible to load Bloom filters
without using them (e.g., because `commitGraph.changedPathVersion` is
incompatible with the hash version with which the commit-graph's Bloom
filters were written).

When this is the case, it's possible to initialize the Bloom filter
sub-system, while still not using any Bloom filters. When this is the
case, check that the data dump from the Bloom sub-system is all zeros,
indicating that no filters were used.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/t4216-log-bloom.sh | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 775e59d864..a77caca789 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -81,7 +81,19 @@ test_bloom_filters_used () {
 test_bloom_filters_not_used () {
 	log_args=$1
 	setup "$log_args" &&
-	! grep -q "statistics:{\"filter_not_present\":" "$TRASH_DIRECTORY/trace.perf" &&
+
+	if grep -q "statistics:{\"filter_not_present\":" "$TRASH_DIRECTORY/trace.perf"
+	then
+		# if the Bloom filter system is initialized, ensure that no
+		# filters were used
+		data="statistics:{"
+		data="$data\"filter_not_present\":0,"
+		data="$data\"maybe\":0,"
+		data="$data\"definitely_not\":0,"
+		data="$data\"false_positive\":0}"
+
+		grep -q "$data" "$TRASH_DIRECTORY/trace.perf"
+	fi &&
 	test_cmp log_wo_bloom log_w_bloom
 }
 
-- 
2.41.0.407.g6d1c33951b


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH 4/6] commit-graph.c: unconditionally load Bloom filters
  2023-08-07 16:37 [RFC PATCH 0/6] bloom: reuse existing Bloom filters when possible during upgrade Taylor Blau
                   ` (2 preceding siblings ...)
  2023-08-07 16:37 ` [RFC PATCH 3/6] t/t4216-log-bloom.sh: harden `test_bloom_filters_not_used()` Taylor Blau
@ 2023-08-07 16:37 ` Taylor Blau
  2023-08-11 22:00   ` Jonathan Tan
  2023-08-07 16:37 ` [RFC PATCH 5/6] object.h: fix mis-aligned flag bits table Taylor Blau
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Taylor Blau @ 2023-08-07 16:37 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jonathan Tan, Junio C Hamano

In 9e4df4da07 (commit-graph: new filter ver. that fixes murmur3,
2023-08-01), we began ignoring the Bloom data ("BDAT") chunk for
commit-graphs whose Bloom filters were computed using a hash version
incompatible with the value of `commitGraph.changedPathVersion`.

Now that the Bloom API has been hardened to discard these incompatible
filters (with the exception of low-level APIs), we can safely load these
Bloom filters unconditionally.

We no longer want to return early from `graph_read_bloom_data()`, and
similarly do not want to set the bloom_settings' `hash_version` field as
a side-effect. The latter is because we want to wait until we know which
Bloom settings we're using (either the defaults, from the GIT_TEST
variables, or from the previous commit-graph layer) before deciding what
hash_version to use.

If we detect an existing BDAT chunk, we'll infer the rest of the
settings (e.g., number of hashes, bits per entry, and maximum number of
changed paths) from the earlier graph layer. The hash_version will be
inferred from the previous layer as well, unless one has already been
specified via configuration.

Once all of that is done, we normalize the value of the hash_version to
either "1" or "2".

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 commit-graph.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 38edb12669..60e5f9ada7 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -317,12 +317,6 @@ static int graph_read_bloom_data(const unsigned char *chunk_start,
 	uint32_t hash_version;
 	hash_version = get_be32(chunk_start);
 
-	if (*c->commit_graph_changed_paths_version == -1) {
-		*c->commit_graph_changed_paths_version = hash_version;
-	} else if (hash_version != *c->commit_graph_changed_paths_version) {
-		return 0;
-	}
-
 	g->chunk_bloom_data = chunk_start;
 	g->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
 	g->bloom_filter_settings->hash_version = hash_version;
@@ -2390,8 +2384,7 @@ int write_commit_graph(struct object_directory *odb,
 			r->settings.commit_graph_changed_paths_version);
 		return 0;
 	}
-	bloom_settings.hash_version = r->settings.commit_graph_changed_paths_version == 2
-		? 2 : 1;
+	bloom_settings.hash_version = r->settings.commit_graph_changed_paths_version;
 	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
 						      bloom_settings.bits_per_entry);
 	bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
@@ -2423,10 +2416,18 @@ int write_commit_graph(struct object_directory *odb,
 		/* We have changed-paths already. Keep them in the next graph */
 		if (g && g->bloom_filter_settings) {
 			ctx->changed_paths = 1;
-			ctx->bloom_settings = g->bloom_filter_settings;
+
+			/* don't propagate the hash_version unless unspecified */
+			if (bloom_settings.hash_version == -1)
+				bloom_settings.hash_version = g->bloom_filter_settings->hash_version;
+			bloom_settings.bits_per_entry = g->bloom_filter_settings->bits_per_entry;
+			bloom_settings.num_hashes = g->bloom_filter_settings->num_hashes;
+			bloom_settings.max_changed_paths = g->bloom_filter_settings->max_changed_paths;
 		}
 	}
 
+	bloom_settings.hash_version = bloom_settings.hash_version == 2 ? 2 : 1;
+
 	if (ctx->split) {
 		struct commit_graph *g = ctx->r->objects->commit_graph;
 
-- 
2.41.0.407.g6d1c33951b


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH 5/6] object.h: fix mis-aligned flag bits table
  2023-08-07 16:37 [RFC PATCH 0/6] bloom: reuse existing Bloom filters when possible during upgrade Taylor Blau
                   ` (3 preceding siblings ...)
  2023-08-07 16:37 ` [RFC PATCH 4/6] commit-graph.c: unconditionally load Bloom filters Taylor Blau
@ 2023-08-07 16:37 ` Taylor Blau
  2023-08-07 16:37 ` [RFC PATCH 6/6] commit-graph: reuse existing Bloom filters where possible Taylor Blau
  2023-08-11 22:13 ` [RFC PATCH 0/6] bloom: reuse existing Bloom filters when possible during upgrade Jonathan Tan
  6 siblings, 0 replies; 24+ messages in thread
From: Taylor Blau @ 2023-08-07 16:37 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jonathan Tan, Junio C Hamano

Bit position 23 is one column too far to the left.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 object.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/object.h b/object.h
index 114d45954d..db25714b4e 100644
--- a/object.h
+++ b/object.h
@@ -62,7 +62,7 @@ void object_array_init(struct object_array *array);
 
 /*
  * object flag allocation:
- * revision.h:               0---------10         15             23------27
+ * revision.h:               0---------10         15               23------27
  * fetch-pack.c:             01    67
  * negotiator/default.c:       2--5
  * walker.c:                 0-2
-- 
2.41.0.407.g6d1c33951b


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH 6/6] commit-graph: reuse existing Bloom filters where possible
  2023-08-07 16:37 [RFC PATCH 0/6] bloom: reuse existing Bloom filters when possible during upgrade Taylor Blau
                   ` (4 preceding siblings ...)
  2023-08-07 16:37 ` [RFC PATCH 5/6] object.h: fix mis-aligned flag bits table Taylor Blau
@ 2023-08-07 16:37 ` Taylor Blau
  2023-08-11 22:06   ` Jonathan Tan
  2023-08-11 22:13 ` [RFC PATCH 0/6] bloom: reuse existing Bloom filters when possible during upgrade Jonathan Tan
  6 siblings, 1 reply; 24+ messages in thread
From: Taylor Blau @ 2023-08-07 16:37 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jonathan Tan, Junio C Hamano

In 9e4df4da07 (commit-graph: new filter ver. that fixes murmur3,
2023-08-01), a bug was described where it's possible for Git to produce
non-murmur3 hashes when the platform's "char" type is signed, and there
are paths with characters whose highest bit is set (i.e. all characters
>= 0x80).

That patch allows the caller to control which version of Bloom filters
are read and written. However, even on platforms with a signed "char"
type, it is possible to reuse existing Bloom filters if and only if
there are no changed paths in any commit's first parent tree-diff whose
characters have their highest bit set.

When this is the case, we can reuse the existing filter without having
to compute a new one. This is done by marking trees which are known to
have (or not have) any such paths. When a commit's root tree is verified
to not have any such paths, we mark it as such and declare that the
commit's Bloom filter is reusable.

Note that this heuristic only goes in one direction. If neither a commit
nor its first parent have any paths in their trees with non-ASCII
characters, then we know for certain that a path with non-ASCII
characters will not appear in a tree-diff against that commit's first
parent. The reverse isn't necessarily true: just because the tree-diff
doesn't contain any such paths does not imply that no such paths exist
in either tree.

So we end up recomputing some Bloom filters that we don't strictly have
to (i.e. their bits are the same no matter which version of murmur3 we
use). But culling these out is impossible, since we'd have to perform
the full tree-diff, which is the same effort as computing the Bloom
filter from scratch.

But because we can cache our results in each tree's flag bits, we can
often avoid recomputing many filters, thereby reducing the time it takes
to run

    $ git commit-graph write --changed-paths --reachable

when upgrading from v1 to v2 Bloom filters.

To benchmark this, let's generate a commit-graph in linux.git with v1
changed-paths in generation order[^1]:

    $ git clone git@github.com:torvalds/linux.git
    $ cd linux
    $ git commit-graph write --reachable --changed-paths
    $ graph=".git/objects/info/commit-graph"
    $ mv $graph{,.bak}

Then let's time how long it takes to go from v1 to v2 filters (with and
without the upgrade path enabled), resetting the state of the
commit-graph each time:

    $ git config commitGraph.changedPathsVersion 2
    $ hyperfine -p 'cp -f $graph.bak $graph' -L v 0,1 \
        'GIT_TEST_UPGRADE_BLOOM_FILTERS={v} git.compile commit-graph write --reachable --changed-paths'

On linux.git (where there aren't any non-ASCII paths), the timings
indicate that this patch represents a speed-up over recomputing all
Bloom filters from scratch:

    Benchmark 1: GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths
      Time (mean ± σ):     124.873 s ±  0.316 s    [User: 124.081 s, System: 0.643 s]
      Range (min … max):   124.621 s … 125.227 s    3 runs

    Benchmark 2: GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths
      Time (mean ± σ):     79.271 s ±  0.163 s    [User: 74.611 s, System: 4.521 s]
      Range (min … max):   79.112 s … 79.437 s    3 runs

    Summary
      'GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths' ran
        1.58 ± 0.01 times faster than 'GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths'

On git.git, we do have some non-ASCII paths, giving us a more modest
improvement from 4.163 seconds to 3.348 seconds, for a 1.24x speed-up.
On my machine, the stats for git.git are:

  - 8,285 Bloom filters computed from scratch
  - 10 Bloom filters generated as empty
  - 4 Bloom filters generated as truncated due to too many changed paths
  - 65,114 Bloom filters were reused when transitioning from v1 to v2.

[^1]: Note that this is is important, since `--stdin-packs` or
  `--stdin-commits` orders commits in the commit-graph by their pack
  position (with `--stdin-packs`) or in the raw input (with
  `--stdin-commits`).

  Since we compute Bloom filters in the same order that commits appear
  in the graph, we must see a commit's (first) parent before we process
  the commit itself. This is only guaranteed to happen when sorting
  commits by their generation number.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bloom.c              | 90 ++++++++++++++++++++++++++++++++++++++++++--
 bloom.h              |  1 +
 commit-graph.c       |  5 +++
 object.h             |  1 +
 t/t4216-log-bloom.sh | 35 ++++++++++++++++-
 5 files changed, 127 insertions(+), 5 deletions(-)

diff --git a/bloom.c b/bloom.c
index 739fa093ba..24dd874e46 100644
--- a/bloom.c
+++ b/bloom.c
@@ -7,6 +7,9 @@
 #include "commit-graph.h"
 #include "commit.h"
 #include "commit-slab.h"
+#include "tree.h"
+#include "tree-walk.h"
+#include "config.h"
 
 define_commit_slab(bloom_filter_slab, struct bloom_filter);
 
@@ -250,6 +253,73 @@ static void init_truncated_large_filter(struct bloom_filter *filter,
 	filter->version = version;
 }
 
+#define VISITED   (1u<<21)
+#define HIGH_BITS (1u<<22)
+
+static int has_entries_with_high_bit(struct repository *r, struct tree *t)
+{
+	if (parse_tree(t))
+		return 1;
+
+	if (!(t->object.flags & VISITED)) {
+		struct tree_desc desc;
+		struct name_entry entry;
+
+		init_tree_desc(&desc, t->buffer, t->size);
+		while (tree_entry(&desc, &entry)) {
+			size_t i;
+			for (i = 0; i < entry.pathlen; i++) {
+				if (entry.path[i] & 0x80) {
+					t->object.flags |= HIGH_BITS;
+					goto done;
+				}
+			}
+
+			if (S_ISDIR(entry.mode)) {
+				struct tree *sub = lookup_tree(r, &entry.oid);
+				if (sub && has_entries_with_high_bit(r, sub)) {
+					t->object.flags |= HIGH_BITS;
+					goto done;
+				}
+			}
+
+		}
+
+done:
+		t->object.flags |= VISITED;
+	}
+
+	return !!(t->object.flags & HIGH_BITS);
+}
+
+static int commit_tree_has_high_bit_paths(struct repository *r,
+					  struct commit *c)
+{
+	struct tree *t;
+	if (repo_parse_commit(r, c))
+		return 1;
+	t = repo_get_commit_tree(r, c);
+	if (!t)
+		return 1;
+	return has_entries_with_high_bit(r, t);
+}
+
+static struct bloom_filter *upgrade_filter(struct repository *r, struct commit *c,
+					   struct bloom_filter *filter,
+					   int hash_version)
+{
+	struct commit_list *p = c->parents;
+	if (commit_tree_has_high_bit_paths(r, c))
+		return NULL;
+
+	if (p && commit_tree_has_high_bit_paths(r, p->item))
+		return NULL;
+
+	filter->version = hash_version;
+
+	return filter;
+}
+
 struct bloom_filter *get_bloom_filter(struct repository *r, struct commit *c)
 {
 	struct bloom_filter *filter;
@@ -292,9 +362,23 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 						     filter, graph_pos);
 	}
 
-	if ((filter->data && filter->len) &&
-	    (!settings || settings->hash_version == filter->version))
-		return filter;
+	if (filter->data && filter->len) {
+		struct bloom_filter *upgrade;
+		if (!settings || settings->hash_version == filter->version)
+			return filter;
+
+		/* version mismatch, see if we can upgrade */
+		if (compute_if_not_present &&
+		    git_env_bool("GIT_TEST_UPGRADE_BLOOM_FILTERS", 1)) {
+			upgrade = upgrade_filter(r, c, filter,
+						 settings->hash_version);
+			if (upgrade) {
+				if (computed)
+					*computed |= BLOOM_UPGRADED;
+				return upgrade;
+			}
+		}
+	}
 	if (!compute_if_not_present)
 		return NULL;
 
diff --git a/bloom.h b/bloom.h
index 2b1c124bb5..4462fc3908 100644
--- a/bloom.h
+++ b/bloom.h
@@ -102,6 +102,7 @@ enum bloom_filter_computed {
 	BLOOM_COMPUTED     = (1 << 1),
 	BLOOM_TRUNC_LARGE  = (1 << 2),
 	BLOOM_TRUNC_EMPTY  = (1 << 3),
+	BLOOM_UPGRADED     = (1 << 4),
 };
 
 struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
diff --git a/commit-graph.c b/commit-graph.c
index 60e5f9ada7..62e10c8f40 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1048,6 +1048,7 @@ struct write_commit_graph_context {
 	int count_bloom_filter_not_computed;
 	int count_bloom_filter_trunc_empty;
 	int count_bloom_filter_trunc_large;
+	int count_bloom_filter_upgraded;
 };
 
 static int write_graph_chunk_fanout(struct hashfile *f,
@@ -1654,6 +1655,8 @@ static void trace2_bloom_filter_write_statistics(struct write_commit_graph_conte
 			   ctx->count_bloom_filter_trunc_empty);
 	trace2_data_intmax("commit-graph", ctx->r, "filter-trunc-large",
 			   ctx->count_bloom_filter_trunc_large);
+	trace2_data_intmax("commit-graph", ctx->r, "filter-upgraded",
+			   ctx->count_bloom_filter_upgraded);
 }
 
 static void compute_bloom_filters(struct write_commit_graph_context *ctx)
@@ -1695,6 +1698,8 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 				ctx->count_bloom_filter_trunc_empty++;
 			if (computed & BLOOM_TRUNC_LARGE)
 				ctx->count_bloom_filter_trunc_large++;
+		} else if (computed & BLOOM_UPGRADED) {
+			ctx->count_bloom_filter_upgraded++;
 		} else if (computed & BLOOM_NOT_COMPUTED)
 			ctx->count_bloom_filter_not_computed++;
 		ctx->total_bloom_filter_data_size += filter
diff --git a/object.h b/object.h
index db25714b4e..2e5e08725f 100644
--- a/object.h
+++ b/object.h
@@ -75,6 +75,7 @@ void object_array_init(struct object_array *array);
  * commit-reach.c:                                  16-----19
  * sha1-name.c:                                              20
  * list-objects-filter.c:                                      21
+ * bloom.c:                                                    2122
  * builtin/fsck.c:           0--3
  * builtin/gc.c:             0
  * builtin/index-pack.c:                                     2021
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index a77caca789..48f8109a66 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -217,6 +217,10 @@ test_filter_trunc_large () {
 	grep "\"key\":\"filter-trunc-large\",\"value\":\"$1\"" $2
 }
 
+test_filter_upgraded () {
+	grep "\"key\":\"filter-upgraded\",\"value\":\"$1\"" $2
+}
+
 test_expect_success 'correctly report changes over limit' '
 	git init limits &&
 	(
@@ -543,10 +547,19 @@ test_expect_success 'when writing another commit graph, preserve existing versio
 test_expect_success 'when writing commit graph, do not reuse changed-path of another version' '
 	git init doublewrite &&
 	test_commit -C doublewrite c "$CENT" &&
+
 	git -C doublewrite config --add commitgraph.changedPathsVersion 1 &&
-	git -C doublewrite commit-graph write --reachable --changed-paths &&
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
+		git -C doublewrite commit-graph write --reachable --changed-paths &&
+	test_filter_computed 1 trace2.txt &&
+	test_filter_upgraded 0 trace2.txt &&
+
 	git -C doublewrite config --add commitgraph.changedPathsVersion 2 &&
-	git -C doublewrite commit-graph write --reachable --changed-paths &&
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
+		git -C doublewrite commit-graph write --reachable --changed-paths &&
+	test_filter_computed 1 trace2.txt &&
+	test_filter_upgraded 0 trace2.txt &&
+
 	(
 		cd doublewrite &&
 		echo "c01f" >expect &&
@@ -555,4 +568,22 @@ test_expect_success 'when writing commit graph, do not reuse changed-path of ano
 	)
 '
 
+test_expect_success 'when writing commit graph, reuse changed-path of another version where possible' '
+	git init upgrade &&
+
+	test_commit -C upgrade base no-high-bits &&
+
+	git -C upgrade config --add commitgraph.changedPathsVersion 1 &&
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
+		git -C upgrade commit-graph write --reachable --changed-paths &&
+	test_filter_computed 1 trace2.txt &&
+	test_filter_upgraded 0 trace2.txt &&
+
+	git -C upgrade config --add commitgraph.changedPathsVersion 2 &&
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
+		git -C upgrade commit-graph write --reachable --changed-paths &&
+	test_filter_computed 0 trace2.txt &&
+	test_filter_upgraded 1 trace2.txt
+'
+
 test_done
-- 
2.41.0.407.g6d1c33951b

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 1/6] bloom: annotate filters with hash version
  2023-08-07 16:37 ` [RFC PATCH 1/6] bloom: annotate filters with hash version Taylor Blau
@ 2023-08-11 21:46   ` Jonathan Tan
  2023-08-17 19:55     ` Taylor Blau
  0 siblings, 1 reply; 24+ messages in thread
From: Jonathan Tan @ 2023-08-11 21:46 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jonathan Tan, git, Derrick Stolee, Junio C Hamano

Taylor Blau <me@ttaylorr.com> writes:
> In subsequent commits, we will want to load existing Bloom filters out
> of a commit-graph, even when the hash version they were computed with
> does not match the value of `commitGraph.changedPathVersion`.
> 
> In order to differentiate between the two, add a "filter" field to each
> Bloom filter.

You mean "version", I think.

> @@ -55,6 +55,7 @@ struct bloom_filter_settings {
>  struct bloom_filter {
>  	unsigned char *data;
>  	size_t len;
> +	int version;
>  };

We might want to shrink the sizes of len (we have a changed path limit
so we know exactly how big Bloom filters can get) and version so that
this struct doesn't take up more space. But if other reviewers think
that this is OK, I'm fine with that.

Another thing that we might want to track is whether the Bloom filter is
a reference to an existing buffer (and thus does not need to be freed)
or a reference to a malloc-ed buffer that we must free. But both before
and after this patch set, a malloc-ed buffer is never overridden by a
reference-to-existing-buffer, so we should still be fine for now. (This
patch set does add a scenario in which a reference-to-existing buffer is
overridden by a malloc-ed buffer, but that's the only new scenario.)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/6] bloom: prepare to discard incompatible Bloom filters
  2023-08-07 16:37 ` [RFC PATCH 2/6] bloom: prepare to discard incompatible Bloom filters Taylor Blau
@ 2023-08-11 21:48   ` Jonathan Tan
  2023-08-21 20:23     ` Taylor Blau
  2023-08-24 22:20   ` Jonathan Tan
  1 sibling, 1 reply; 24+ messages in thread
From: Jonathan Tan @ 2023-08-11 21:48 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jonathan Tan, git, Derrick Stolee, Junio C Hamano

Taylor Blau <me@ttaylorr.com> writes:
> diff --git a/bloom.h b/bloom.h
> index 330a140520..2b1c124bb5 100644
> --- a/bloom.h
> +++ b/bloom.h
> @@ -110,8 +110,24 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
>  						 const struct bloom_filter_settings *settings,
>  						 enum bloom_filter_computed *computed);
>  
> -#define get_bloom_filter(r, c) get_or_compute_bloom_filter( \
> -	(r), (c), 0, NULL, NULL)
> +/*
> + * Find the Bloom filter associated with the given commit "c".
> + *
> + * If any of the following are true
> + *
> + *   - the repository does not have a commit-graph
> + *   - it has a commit-graph, but reading the commit-graph is disabled
> + *   - the given commit does not have a Bloom filter computed
> + *   - there is a Bloom filter for commit "c", but it cannot be read because
> + *     disabled

s/disabled/version incompatibility/, I think.

> + *
> + * , then `get_bloom_filter()` will return NULL. Otherwise, the corresponding
> + * Bloom filter will be returned.
> + *
> + * For callers who wish to inspect Bloom filters with incompatible hash
> + * versions, use get_or_compute_bloom_filter().
> + */
> +struct bloom_filter *get_bloom_filter(struct repository *r, struct commit *c);


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 4/6] commit-graph.c: unconditionally load Bloom filters
  2023-08-07 16:37 ` [RFC PATCH 4/6] commit-graph.c: unconditionally load Bloom filters Taylor Blau
@ 2023-08-11 22:00   ` Jonathan Tan
  2023-08-21 20:40     ` Taylor Blau
  0 siblings, 1 reply; 24+ messages in thread
From: Jonathan Tan @ 2023-08-11 22:00 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jonathan Tan, git, Derrick Stolee, Junio C Hamano

Taylor Blau <me@ttaylorr.com> writes:
> diff --git a/commit-graph.c b/commit-graph.c
> index 38edb12669..60e5f9ada7 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -317,12 +317,6 @@ static int graph_read_bloom_data(const unsigned char *chunk_start,
>  	uint32_t hash_version;
>  	hash_version = get_be32(chunk_start);
>  
> -	if (*c->commit_graph_changed_paths_version == -1) {
> -		*c->commit_graph_changed_paths_version = hash_version;
> -	} else if (hash_version != *c->commit_graph_changed_paths_version) {
> -		return 0;
> -	}

Lots of things to notice in this patch, but the summary is that this
patch looks correct.

Here, we remove (1) the autodetection of changed paths version and (2)
the storage of the result of that autodetection. (1) is restored in a
subsequent code hunk, but (2) is never restored.

Prior to this patch set, we needed (2) because we only wanted to load
Bloom filters that match a given version, so if we autodetected a
version, that version must be in effect for all future loads (which is
why we needed to store that result immediately). But with this patch
set, we support the loading of Bloom filters of varying versions, so
storing it immediately is no longer needed.

(Also, I don't think we need the commit_graph_changed_paths_version
variable anymore.)

> @@ -2390,8 +2384,7 @@ int write_commit_graph(struct object_directory *odb,
>  			r->settings.commit_graph_changed_paths_version);
>  		return 0;
>  	}
> -	bloom_settings.hash_version = r->settings.commit_graph_changed_paths_version == 2
> -		? 2 : 1;
> +	bloom_settings.hash_version = r->settings.commit_graph_changed_paths_version;

As stated in the commit message, normalization of the hash version is
performed later.

> @@ -2423,10 +2416,18 @@ int write_commit_graph(struct object_directory *odb,
>  		/* We have changed-paths already. Keep them in the next graph */
>  		if (g && g->bloom_filter_settings) {
>  			ctx->changed_paths = 1;
> -			ctx->bloom_settings = g->bloom_filter_settings;
> +
> +			/* don't propagate the hash_version unless unspecified */
> +			if (bloom_settings.hash_version == -1)
> +				bloom_settings.hash_version = g->bloom_filter_settings->hash_version;
> +			bloom_settings.bits_per_entry = g->bloom_filter_settings->bits_per_entry;
> +			bloom_settings.num_hashes = g->bloom_filter_settings->num_hashes;
> +			bloom_settings.max_changed_paths = g->bloom_filter_settings->max_changed_paths;

Here is where the autodetection is restored.

Prior to this patch set, this part of the code does not perform
autodetection - every hash_version in memory matches the version in
commit_graph_changed_paths_version, so we're just copying over the value
in that variable. But the nature of this part of the code has changed
due to this patch set: all the g->bloom_filter_settings in memory
may not have the same hash version, and may not match what the user
specified in config. To compensate, we are more selective in what we
propagate from g->bloom_filter_settings.

> +	bloom_settings.hash_version = bloom_settings.hash_version == 2 ? 2 : 1;

And here we restore the normalization.

Thanks - up to here looks good.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 6/6] commit-graph: reuse existing Bloom filters where possible
  2023-08-07 16:37 ` [RFC PATCH 6/6] commit-graph: reuse existing Bloom filters where possible Taylor Blau
@ 2023-08-11 22:06   ` Jonathan Tan
  0 siblings, 0 replies; 24+ messages in thread
From: Jonathan Tan @ 2023-08-11 22:06 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jonathan Tan, git, Derrick Stolee, Junio C Hamano

Taylor Blau <me@ttaylorr.com> writes:
> @@ -292,9 +362,23 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
>  						     filter, graph_pos);
>  	}
>  
> -	if ((filter->data && filter->len) &&
> -	    (!settings || settings->hash_version == filter->version))
> -		return filter;
> +	if (filter->data && filter->len) {
> +		struct bloom_filter *upgrade;
> +		if (!settings || settings->hash_version == filter->version)
> +			return filter;
> +
> +		/* version mismatch, see if we can upgrade */
> +		if (compute_if_not_present &&
> +		    git_env_bool("GIT_TEST_UPGRADE_BLOOM_FILTERS", 1)) {
> +			upgrade = upgrade_filter(r, c, filter,
> +						 settings->hash_version);
> +			if (upgrade) {
> +				if (computed)
> +					*computed |= BLOOM_UPGRADED;
> +				return upgrade;
> +			}
> +		}
> +	}
>  	if (!compute_if_not_present)
>  		return NULL;

There's a part in this function that's untouched by the diff that
computes and stores a new Bloom filter if it was not possible to upgrade
the filter (look at the part where filter->data is calloc-ed). That
part unconditionally overrides the existing filter->data, which may at
that point contain genuine data (a Bloom filter that does not match the
version we want). Right now this is fine because all overrides that we
do involve overriding references to existing buffers, so we do not need
to free anything.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/6] bloom: reuse existing Bloom filters when possible during upgrade
  2023-08-07 16:37 [RFC PATCH 0/6] bloom: reuse existing Bloom filters when possible during upgrade Taylor Blau
                   ` (5 preceding siblings ...)
  2023-08-07 16:37 ` [RFC PATCH 6/6] commit-graph: reuse existing Bloom filters where possible Taylor Blau
@ 2023-08-11 22:13 ` Jonathan Tan
  2023-08-21 20:46   ` Taylor Blau
  6 siblings, 1 reply; 24+ messages in thread
From: Jonathan Tan @ 2023-08-11 22:13 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jonathan Tan, git, Derrick Stolee, Junio C Hamano

Taylor Blau <me@ttaylorr.com> writes:
> On both linux.git and git.git, this series gives a significant speed-up
> when upgrading Bloom filters from v1 to v2. On linux.git:
> 
>     Benchmark 1: GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths
>       Time (mean ± σ):     124.873 s ±  0.316 s    [User: 124.081 s, System: 0.643 s]
>       Range (min … max):   124.621 s … 125.227 s    3 runs
> 
>     Benchmark 2: GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths
>       Time (mean ± σ):     79.271 s ±  0.163 s    [User: 74.611 s, System: 4.521 s]
>       Range (min … max):   79.112 s … 79.437 s    3 runs
> 
>     Summary
>       'GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths' ran
>         1.58 ± 0.01 times faster than 'GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths'
> 
> On git.git (where we do have some non-ASCII paths), the change goes from
> 4.163 seconds to 3.348 seconds, for a 1.24x speed-up.

My main concern is that this modifies the code somewhat pervasively
(tracking the version of Bloom filters and removing assumptions about
what Bloom filter versions are in memory) in return for what I think
is a small speedup, when considering that we will perform this
operation only once for a given repo. I'll defer to server operators
on this (or other people handling large numbers of repos), though.

Putting that concern aside, I've reviewed the code and assuming that
we're OK with modifying the code in this way, this patch set looks good
to me, and hopefully my review will be of some help.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 1/6] bloom: annotate filters with hash version
  2023-08-11 21:46   ` Jonathan Tan
@ 2023-08-17 19:55     ` Taylor Blau
  2023-08-21 20:21       ` Taylor Blau
  0 siblings, 1 reply; 24+ messages in thread
From: Taylor Blau @ 2023-08-17 19:55 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Derrick Stolee, Junio C Hamano

On Fri, Aug 11, 2023 at 02:46:51PM -0700, Jonathan Tan wrote:
> Taylor Blau <me@ttaylorr.com> writes:
> > In subsequent commits, we will want to load existing Bloom filters out
> > of a commit-graph, even when the hash version they were computed with
> > does not match the value of `commitGraph.changedPathVersion`.
> >
> > In order to differentiate between the two, add a "filter" field to each
> > Bloom filter.
>
> You mean "version", I think.

Oops, yes -- I'm not sure how my editor tab-completed "version" there,
but oh, well :-).

> > @@ -55,6 +55,7 @@ struct bloom_filter_settings {
> >  struct bloom_filter {
> >  	unsigned char *data;
> >  	size_t len;
> > +	int version;
> >  };
>
> We might want to shrink the sizes of len (we have a changed path limit
> so we know exactly how big Bloom filters can get) and version so that
> this struct doesn't take up more space. But if other reviewers think
> that this is OK, I'm fine with that.

I think that making len a size_t here is an appropriate choice. Even
though the maximum length of a Bloom filter is well below the 2^64-1
threshold, we are often looking at a memory-mapped region here, so
keeping track of it with a size_t / off_t seems reasonable to me.

> Another thing that we might want to track is whether the Bloom filter is
> a reference to an existing buffer (and thus does not need to be freed)
> or a reference to a malloc-ed buffer that we must free. But both before
> and after this patch set, a malloc-ed buffer is never overridden by a
> reference-to-existing-buffer, so we should still be fine for now. (This
> patch set does add a scenario in which a reference-to-existing buffer is
> overridden by a malloc-ed buffer, but that's the only new scenario.)

Yeah, I think there is some opportunity for clean-up here. I'll take a
look...

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 1/6] bloom: annotate filters with hash version
  2023-08-17 19:55     ` Taylor Blau
@ 2023-08-21 20:21       ` Taylor Blau
  0 siblings, 0 replies; 24+ messages in thread
From: Taylor Blau @ 2023-08-21 20:21 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Derrick Stolee, Junio C Hamano

On Thu, Aug 17, 2023 at 03:55:06PM -0400, Taylor Blau wrote:
> > Another thing that we might want to track is whether the Bloom filter is
> > a reference to an existing buffer (and thus does not need to be freed)
> > or a reference to a malloc-ed buffer that we must free. But both before
> > and after this patch set, a malloc-ed buffer is never overridden by a
> > reference-to-existing-buffer, so we should still be fine for now. (This
> > patch set does add a scenario in which a reference-to-existing buffer is
> > overridden by a malloc-ed buffer, but that's the only new scenario.)
>
> Yeah, I think there is some opportunity for clean-up here. I'll take a
> look...

This ended up being pretty reasonable. I'm not sure whether I should
include it here or not, since any leaks in the Bloom subsystem are
definitely not new as of this series.

But the patch is relatively straightforward anyway, so I think throwing
it on the end would be OK:

--- 8< ---
diff --git a/bloom.c b/bloom.c
index 24dd874e46..ff131893cd 100644
--- a/bloom.c
+++ b/bloom.c
@@ -59,6 +59,7 @@ int load_bloom_filter_from_graph(struct commit_graph *g,
 					sizeof(unsigned char) * start_index +
 					BLOOMDATA_CHUNK_HEADER_SIZE);
 	filter->version = g->bloom_filter_settings->hash_version;
+	filter->to_free = NULL;

 	return 1;
 }
@@ -231,6 +232,18 @@ void init_bloom_filters(void)
 	init_bloom_filter_slab(&bloom_filters);
 }

+static void free_one_bloom_filter(struct bloom_filter *filter)
+{
+	if (!filter)
+		return;
+	free(filter->to_free);
+}
+
+void deinit_bloom_filters(void)
+{
+	deep_clear_bloom_filter_slab(&bloom_filters, free_one_bloom_filter);
+}
+
 static int pathmap_cmp(const void *hashmap_cmp_fn_data UNUSED,
 		       const struct hashmap_entry *eptr,
 		       const struct hashmap_entry *entry_or_key,
@@ -247,7 +260,7 @@ static int pathmap_cmp(const void *hashmap_cmp_fn_data UNUSED,
 static void init_truncated_large_filter(struct bloom_filter *filter,
 					int version)
 {
-	filter->data = xmalloc(1);
+	filter->data = filter->to_free = xmalloc(1);
 	filter->data[0] = 0xFF;
 	filter->len = 1;
 	filter->version = version;
@@ -449,6 +462,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 			filter->len = 1;
 		}
 		CALLOC_ARRAY(filter->data, filter->len);
+		filter->to_free = filter->data;

 		hashmap_for_each_entry(&pathmap, &iter, e, entry) {
 			struct bloom_key key;
diff --git a/bloom.h b/bloom.h
index 4462fc3908..c1d74d63e6 100644
--- a/bloom.h
+++ b/bloom.h
@@ -56,6 +56,8 @@ struct bloom_filter {
 	unsigned char *data;
 	size_t len;
 	int version;
+
+	void *to_free;
 };

 /*
@@ -96,6 +98,7 @@ void add_key_to_filter(const struct bloom_key *key,
 		       const struct bloom_filter_settings *settings);

 void init_bloom_filters(void);
+void deinit_bloom_filters(void);

 enum bloom_filter_computed {
 	BLOOM_NOT_COMPUTED = (1 << 0),
diff --git a/commit-graph.c b/commit-graph.c
index 183ed90b6d..f22f2d350d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -2532,6 +2532,9 @@ int write_commit_graph(struct object_directory *odb,

 	res = write_commit_graph_file(ctx);

+	if (ctx->changed_paths)
+		deinit_bloom_filters();
+
 	if (ctx->split)
 		mark_commit_graphs(ctx);

--- >8 ---

Thanks,
Taylor

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/6] bloom: prepare to discard incompatible Bloom filters
  2023-08-11 21:48   ` Jonathan Tan
@ 2023-08-21 20:23     ` Taylor Blau
  0 siblings, 0 replies; 24+ messages in thread
From: Taylor Blau @ 2023-08-21 20:23 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Derrick Stolee, Junio C Hamano

On Fri, Aug 11, 2023 at 02:48:06PM -0700, Jonathan Tan wrote:
> Taylor Blau <me@ttaylorr.com> writes:
> > diff --git a/bloom.h b/bloom.h
> > index 330a140520..2b1c124bb5 100644
> > --- a/bloom.h
> > +++ b/bloom.h
> > @@ -110,8 +110,24 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
> >  						 const struct bloom_filter_settings *settings,
> >  						 enum bloom_filter_computed *computed);
> >
> > -#define get_bloom_filter(r, c) get_or_compute_bloom_filter( \
> > -	(r), (c), 0, NULL, NULL)
> > +/*
> > + * Find the Bloom filter associated with the given commit "c".
> > + *
> > + * If any of the following are true
> > + *
> > + *   - the repository does not have a commit-graph
> > + *   - it has a commit-graph, but reading the commit-graph is disabled
> > + *   - the given commit does not have a Bloom filter computed
> > + *   - there is a Bloom filter for commit "c", but it cannot be read because
> > + *     disabled
>
> s/disabled/version incompatibility/, I think.

Well spotted, thanks.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 4/6] commit-graph.c: unconditionally load Bloom filters
  2023-08-11 22:00   ` Jonathan Tan
@ 2023-08-21 20:40     ` Taylor Blau
  0 siblings, 0 replies; 24+ messages in thread
From: Taylor Blau @ 2023-08-21 20:40 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Derrick Stolee, Junio C Hamano

On Fri, Aug 11, 2023 at 03:00:30PM -0700, Jonathan Tan wrote:
> Taylor Blau <me@ttaylorr.com> writes:
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 38edb12669..60e5f9ada7 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -317,12 +317,6 @@ static int graph_read_bloom_data(const unsigned char *chunk_start,
> >  	uint32_t hash_version;
> >  	hash_version = get_be32(chunk_start);
> >
> > -	if (*c->commit_graph_changed_paths_version == -1) {
> > -		*c->commit_graph_changed_paths_version = hash_version;
> > -	} else if (hash_version != *c->commit_graph_changed_paths_version) {
> > -		return 0;
> > -	}
>
> Lots of things to notice in this patch, but the summary is that this
> patch looks correct.

Thanks for the detailed analysis here. This is definitely the patch that
I was most nervous about while writing, but I appreciate the second set
of eyes.

> (Also, I don't think we need the commit_graph_changed_paths_version
> variable anymore.)

Great catch, thanks. I cleaned that up in a separate commit after this
one, since I think that this patch is already intricate enough.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/6] bloom: reuse existing Bloom filters when possible during upgrade
  2023-08-11 22:13 ` [RFC PATCH 0/6] bloom: reuse existing Bloom filters when possible during upgrade Jonathan Tan
@ 2023-08-21 20:46   ` Taylor Blau
  0 siblings, 0 replies; 24+ messages in thread
From: Taylor Blau @ 2023-08-21 20:46 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Derrick Stolee, Junio C Hamano

On Fri, Aug 11, 2023 at 03:13:37PM -0700, Jonathan Tan wrote:
> Taylor Blau <me@ttaylorr.com> writes:
> > On both linux.git and git.git, this series gives a significant speed-up
> > when upgrading Bloom filters from v1 to v2. On linux.git:
> >
> >     Benchmark 1: GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit
> >       Time (mean ± σ):     124.873 s ±  0.316 s    [User: 124.081 s, System: 0.643 s]
> >       Range (min … max):   124.621 s … 125.227 s    3 runs
> >
> >     Benchmark 2: GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths
> >       Time (mean ± σ):     79.271 s ±  0.163 s    [User: 74.611 s, System: 4.521 s]
> >       Range (min … max):   79.112 s … 79.437 s    3 runs
> >
> >     Summary
> >       'GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths' ran
> >         1.58 ± 0.01 times faster than 'GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths'
> >
> > On git.git (where we do have some non-ASCII paths), the change goes from
> > 4.163 seconds to 3.348 seconds, for a 1.24x speed-up.
>
> My main concern is that this modifies the code somewhat pervasively
> (tracking the version of Bloom filters and removing assumptions about
> what Bloom filter versions are in memory) in return for what I think
> is a small speedup, when considering that we will perform this
> operation only once for a given repo. I'll defer to server operators
> on this (or other people handling large numbers of repos), though.

Yeah, I think that this is certainly on the riskier side of things. But
I have confidence in our test coverage here, and feel better with your
thorough review.

FWIW, I think that the speed-up is worthwhile (though I'm obviously
biased here). When we were rolling out changed-path Bloom filters to
repositories on GitHub.com, it was apparent that computing them from
scratch in a single shot was infeasible. This led to 809e0327f5
(builtin/commit-graph.c: introduce '--max-new-filters=<n>', 2020-09-18).

That would of course still save us here if we didn't have an upgrade
path, since each recomputed filter would count against the maximum
number we can generate in a single run.

We could introduce a configuration knob, but I'd rather avoid it if
possible. That can also be done on top in a later patch, which should be
easy enough to do if we hear of a compelling use-case for wanting to
disable the upgrade path here.

> Putting that concern aside, I've reviewed the code and assuming that
> we're OK with modifying the code in this way, this patch set looks good
> to me, and hopefully my review will be of some help.

Thanks very much -- it was more than helpful :-). I'll collect up this
and your previous patches (as we have discussed off-list) and tie
everything together in a single series for the maintainer.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/6] bloom: prepare to discard incompatible Bloom filters
  2023-08-07 16:37 ` [RFC PATCH 2/6] bloom: prepare to discard incompatible Bloom filters Taylor Blau
  2023-08-11 21:48   ` Jonathan Tan
@ 2023-08-24 22:20   ` Jonathan Tan
  2023-08-24 22:47     ` Taylor Blau
  1 sibling, 1 reply; 24+ messages in thread
From: Jonathan Tan @ 2023-08-24 22:20 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jonathan Tan, git, Derrick Stolee, Junio C Hamano

Taylor Blau <me@ttaylorr.com> writes:
> diff --git a/bloom.c b/bloom.c
> index 9b6a30f6f6..739fa093ba 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -250,6 +250,23 @@ static void init_truncated_large_filter(struct bloom_filter *filter,
>  	filter->version = version;
>  }
>  
> +struct bloom_filter *get_bloom_filter(struct repository *r, struct commit *c)
> +{
> +	struct bloom_filter *filter;
> +	int hash_version;
> +
> +	filter = get_or_compute_bloom_filter(r, c, 0, NULL, NULL);
> +	if (!filter)
> +		return NULL;
> +
> +	prepare_repo_settings(r);
> +	hash_version = r->settings.commit_graph_changed_paths_version;
> +
> +	if (!(hash_version == -1 || hash_version == filter->version))
> +		return NULL; /* unusable filter */
> +	return filter;
> +}

I missed this the last time, but this should match what fill_bloom_key()
does. Use get_bloom_filter_settings(), then compare filter->version to
version 2 if hash_version is 2, and to version 1 otherwise.

Everything else looks good.
 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/6] bloom: prepare to discard incompatible Bloom filters
  2023-08-24 22:20   ` Jonathan Tan
@ 2023-08-24 22:47     ` Taylor Blau
  2023-08-24 23:05       ` Jonathan Tan
  0 siblings, 1 reply; 24+ messages in thread
From: Taylor Blau @ 2023-08-24 22:47 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Derrick Stolee, Junio C Hamano

On Thu, Aug 24, 2023 at 03:20:51PM -0700, Jonathan Tan wrote:
> Taylor Blau <me@ttaylorr.com> writes:
> > diff --git a/bloom.c b/bloom.c
> > index 9b6a30f6f6..739fa093ba 100644
> > --- a/bloom.c
> > +++ b/bloom.c
> > @@ -250,6 +250,23 @@ static void init_truncated_large_filter(struct bloom_filter *filter,
> >  	filter->version = version;
> >  }
> >
> > +struct bloom_filter *get_bloom_filter(struct repository *r, struct commit *c)
> > +{
> > +	struct bloom_filter *filter;
> > +	int hash_version;
> > +
> > +	filter = get_or_compute_bloom_filter(r, c, 0, NULL, NULL);
> > +	if (!filter)
> > +		return NULL;
> > +
> > +	prepare_repo_settings(r);
> > +	hash_version = r->settings.commit_graph_changed_paths_version;
> > +
> > +	if (!(hash_version == -1 || hash_version == filter->version))
> > +		return NULL; /* unusable filter */
> > +	return filter;
> > +}
>
> I missed this the last time, but this should match what fill_bloom_key()
> does. Use get_bloom_filter_settings(),

I'm not sure I'm following. Are you saying that we should use
get_bloom_filter_settings() instead of reading the value from
r->settings directly?

> then compare filter->version to version 2 if hash_version is 2, and to
> version 1 otherwise.

Hmm. I think we're already doing the right thing here, no? IIUC, you're
saying to do something like:

    struct bloom_filter_settings *s = get_bloom_filter_settings(r);
    struct bloom_filter *f = get_or_compute_bloom_filter(r, c, ...);
    int hash_version;

    if (!f)
      return NULL;

    prepare_repo_settings(r);
    hash_version = r->settings.commit_graph_changed_paths_version;

    if (!(hash_version == -1 || hash_version == s->hash_version))
      return NULL; /* incompatible */
    return f;

?

If so, I think that we're already OK here, since s->hash_version is
always going to take the same value as f->version, since f->version is
assigned in bloom.c::load_bloom_filter_from_graph(), which does:

    filter->version = g->bloom_filter_settings->hash_version;

Or are we talking about two different things entirely? ;-)

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/6] bloom: prepare to discard incompatible Bloom filters
  2023-08-24 22:47     ` Taylor Blau
@ 2023-08-24 23:05       ` Jonathan Tan
  2023-08-25 19:00         ` Taylor Blau
  0 siblings, 1 reply; 24+ messages in thread
From: Jonathan Tan @ 2023-08-24 23:05 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jonathan Tan, git, Derrick Stolee, Junio C Hamano

Taylor Blau <me@ttaylorr.com> writes:
> I'm not sure I'm following. Are you saying that we should use
> get_bloom_filter_settings() instead of reading the value from
> r->settings directly?
> 
> > then compare filter->version to version 2 if hash_version is 2, and to
> > version 1 otherwise.
> 
> Hmm. I think we're already doing the right thing here, no? IIUC, you're
> saying to do something like:
> 
>     struct bloom_filter_settings *s = get_bloom_filter_settings(r);
>     struct bloom_filter *f = get_or_compute_bloom_filter(r, c, ...);
>     int hash_version;
> 
>     if (!f)
>       return NULL;
> 
>     prepare_repo_settings(r);
>     hash_version = r->settings.commit_graph_changed_paths_version;
> 
>     if (!(hash_version == -1 || hash_version == s->hash_version))
>       return NULL; /* incompatible */
>     return f;
> 
> ?
> 
> If so, I think that we're already OK here, since s->hash_version is
> always going to take the same value as f->version, since f->version is
> assigned in bloom.c::load_bloom_filter_from_graph(), which does:
> 
>     filter->version = g->bloom_filter_settings->hash_version;
> 
> Or are we talking about two different things entirely? ;-)
> 
> Thanks,
> Taylor

Ah, sorry for not being clear. What I meant is in how you compute
hash_version after we have found that filter is not NULL.

> +struct bloom_filter *get_bloom_filter(struct repository *r, struct commit *c)
> +{
> +	struct bloom_filter *filter;
> +	int hash_version;
> +
> +	filter = get_or_compute_bloom_filter(r, c, 0, NULL, NULL);
> +	if (!filter)
> +		return NULL;
> +
> +	prepare_repo_settings(r);

Up to here is fine.

> +	hash_version = r->settings.commit_graph_changed_paths_version;

Instead of doing this, do this (well, move the struct declaration to
the top):

  struct bloom_filter_settings *s = get_bloom_filter_settings(r);
  hash_version = s->hash_version == 2 ? 2 : 1;
  
> +	if (!(hash_version == -1 || hash_version == filter->version))

No need for the comparison to -1 here.

> +		return NULL; /* unusable filter */
> +	return filter;
> +}

This is fine.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/6] bloom: prepare to discard incompatible Bloom filters
  2023-08-24 23:05       ` Jonathan Tan
@ 2023-08-25 19:00         ` Taylor Blau
  2023-08-29 16:49           ` Jonathan Tan
  0 siblings, 1 reply; 24+ messages in thread
From: Taylor Blau @ 2023-08-25 19:00 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Derrick Stolee, Junio C Hamano

On Thu, Aug 24, 2023 at 04:05:27PM -0700, Jonathan Tan wrote:
> Up to here is fine.
>
> > +	hash_version = r->settings.commit_graph_changed_paths_version;
>
> Instead of doing this, do this (well, move the struct declaration to
> the top):
>
>   struct bloom_filter_settings *s = get_bloom_filter_settings(r);
>   hash_version = s->hash_version == 2 ? 2 : 1;

Do we need this normalization? We assign s->hash_version in
commit-graph.c::graph_read_bloom_data() by reading it from the start of
the BDAT chunk, so this should only ever be 1 or 2.

> > +	if (!(hash_version == -1 || hash_version == filter->version))
>
> No need for the comparison to -1 here.

I'm not sure I understand your suggestion. When we fetch the filter from
get_or_compute_bloom_filter(), we have filter->version set to the
hash_version from the containing graph's Bloom filter settings.

So (besides the normalization), I would think that:

    struct bloom_filter_settings *s = get_bloom_filter_settings(r);
    struct bloom_filter *f = get_bloom_filter(...);

    assert(s->hash_version == f->version);

would hold.

I think the check that we want to make is instead: is this Bloom
filter's version (or equivalently, the hash version indicated by that
graph's BDAT chunk) something that we can read? And I think "what we can
read" here is dictated by the commit_graph_changed_paths_version member
of our repository_settings, no?

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/6] bloom: prepare to discard incompatible Bloom filters
  2023-08-25 19:00         ` Taylor Blau
@ 2023-08-29 16:49           ` Jonathan Tan
  2023-08-29 19:14             ` Taylor Blau
  0 siblings, 1 reply; 24+ messages in thread
From: Jonathan Tan @ 2023-08-29 16:49 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jonathan Tan, git, Derrick Stolee, Junio C Hamano

Taylor Blau <me@ttaylorr.com> writes:
> On Thu, Aug 24, 2023 at 04:05:27PM -0700, Jonathan Tan wrote:
> > Up to here is fine.
> >
> > > +	hash_version = r->settings.commit_graph_changed_paths_version;
> >
> > Instead of doing this, do this (well, move the struct declaration to
> > the top):
> >
> >   struct bloom_filter_settings *s = get_bloom_filter_settings(r);
> >   hash_version = s->hash_version == 2 ? 2 : 1;
> 
> Do we need this normalization? We assign s->hash_version in
> commit-graph.c::graph_read_bloom_data() by reading it from the start of
> the BDAT chunk, so this should only ever be 1 or 2.

I'm not sure offhand if we do...I wrote it this way to match
fill_bloom_key(), but fill_bloom_key() was written in that way because
it was the clearest, not specifically because it needed to normalize.

> > > +	if (!(hash_version == -1 || hash_version == filter->version))
> >
> > No need for the comparison to -1 here.
> 
> I'm not sure I understand your suggestion. When we fetch the filter from
> get_or_compute_bloom_filter(), we have filter->version set to the
> hash_version from the containing graph's Bloom filter settings.
> 
> So (besides the normalization), I would think that:
> 
>     struct bloom_filter_settings *s = get_bloom_filter_settings(r);
>     struct bloom_filter *f = get_bloom_filter(...);
> 
>     assert(s->hash_version == f->version);
> 
> would hold.

My mention to avoid the comparison to -1 was just for completeness
- since we're normalizing the value of hash_version to 1 or 2, we no
longer need to compare it to -1.

As for whether s->hash_version is always equal to f->version, I think
that it may not be true if for some reason, there are multiple commit
graph files on disk, not all with the same Bloom filter version.

> I think the check that we want to make is instead: is this Bloom
> filter's version (or equivalently, the hash version indicated by that
> graph's BDAT chunk) something that we can read? 

I think it's not "something that we can read" (eventually, we can read
all versions, we just treat them differently) but "the version that
fill_bloom_key" will use. We don't want this function to produce a Bloom
filter of version X and then have the calling code subsequently use it
with a Bloom key of version Y.

> And I think "what we can
> read" here is dictated by the commit_graph_changed_paths_version member
> of our repository_settings, no?

I don't think commit_graph_changed_paths_version always dictates
something - it could be -1 (which you have probably seen, since you
check it against -1 in the current version of the patch). One of the
points of my suggestion is to avoid this field completely.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/6] bloom: prepare to discard incompatible Bloom filters
  2023-08-29 16:49           ` Jonathan Tan
@ 2023-08-29 19:14             ` Taylor Blau
  2023-08-29 22:04               ` Jonathan Tan
  0 siblings, 1 reply; 24+ messages in thread
From: Taylor Blau @ 2023-08-29 19:14 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Derrick Stolee, Junio C Hamano

On Tue, Aug 29, 2023 at 09:49:26AM -0700, Jonathan Tan wrote:
> > > > +	if (!(hash_version == -1 || hash_version == filter->version))
> > >
> > > No need for the comparison to -1 here.
> >
> > I'm not sure I understand your suggestion. When we fetch the filter from
> > get_or_compute_bloom_filter(), we have filter->version set to the
> > hash_version from the containing graph's Bloom filter settings.
> >
> > So (besides the normalization), I would think that:
> >
> >     struct bloom_filter_settings *s = get_bloom_filter_settings(r);
> >     struct bloom_filter *f = get_bloom_filter(...);
> >
> >     assert(s->hash_version == f->version);
> >
> > would hold.
>
> My mention to avoid the comparison to -1 was just for completeness
> - since we're normalizing the value of hash_version to 1 or 2, we no
> longer need to compare it to -1.
>
> As for whether s->hash_version is always equal to f->version, I think
> that it may not be true if for some reason, there are multiple commit
> graph files on disk, not all with the same Bloom filter version.

Apologies for not quite grokking this, but I am still somewhat confused.

Suppose we applied something like your suggestion on top of this patch,
like so:

--- 8< ---
diff --git a/bloom.c b/bloom.c
index 739fa093ba..ee81976342 100644
--- a/bloom.c
+++ b/bloom.c
@@ -253,16 +253,16 @@ static void init_truncated_large_filter(struct bloom_filter *filter,
 struct bloom_filter *get_bloom_filter(struct repository *r, struct commit *c)
 {
 	struct bloom_filter *filter;
-	int hash_version;
+	struct bloom_filter_settings *settings;

 	filter = get_or_compute_bloom_filter(r, c, 0, NULL, NULL);
 	if (!filter)
 		return NULL;

 	prepare_repo_settings(r);
-	hash_version = r->settings.commit_graph_changed_paths_version;
+	settings = get_bloom_filter_settings(r);

-	if (!(hash_version == -1 || hash_version == filter->version))
+	if (filter->version != (settings->hash_version == 2 ? 2 : 1))
 		return NULL; /* unusable filter */
 	return filter;
 }
--- >8 ---

We have a handful of failing tests, notably including t4216.1, which
tries to read settings->hash_version, but fails since settings is NULL.
I believe this happens when we have a computed Bloom filter prepared for
some commit, but have not yet written a commit graph containing that (or
any) Bloom filter.

But I think we're talking about different things here. The point of
get_bloom_filter() as a wrapper around get_or_compute_bloom_filter() is
to only return Bloom filters that are readable under the configuration
commitGraph.changedPathsVersion.

We have a handle on what version was used to compute Bloom filters in
each layer of the graph by reading its "version" field, which is written
via load_bloom_filter_from_graph().

So we want to return a non-NULL filter exactly when we (a) have a Bloom
filter computed for the given commit, and (b) its version matches the
version specified in commitGraph.chagnedPathsVersion.

Are you saying that we need to do the normalization ahead of time so
that we don't emit filters that have different hash versions when
working in a commit-graph chain where each layer may use different Bloom
filter settings? If so, then I think the change we'd need to make is
limited to:

--- 8< ---
diff --git a/bloom.c b/bloom.c
index 739fa093ba..d5cc23b0a8 100644
--- a/bloom.c
+++ b/bloom.c
@@ -260,9 +260,9 @@ struct bloom_filter *get_bloom_filter(struct repository *r, struct commit *c)
 		return NULL;

 	prepare_repo_settings(r);
-	hash_version = r->settings.commit_graph_changed_paths_version;
+	hash_version = r->settings.commit_graph_changed_paths_version == 2 ? 2 : 1;

-	if (!(hash_version == -1 || hash_version == filter->version))
+	if (hash_version == filter->version)
 		return NULL; /* unusable filter */
 	return filter;
 }
--- >8 ---

But that doesn't quite work, either, since it's breaking some assumption
from the caller and causing us not to write out any Bloom filters (I
haven't investigated further, but I'm not sure that this is the right
path in the first place...).

So I am not sure what you are suggesting, but I am sure that I'm missing
something.

Thanks,
Taylor

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/6] bloom: prepare to discard incompatible Bloom filters
  2023-08-29 19:14             ` Taylor Blau
@ 2023-08-29 22:04               ` Jonathan Tan
  0 siblings, 0 replies; 24+ messages in thread
From: Jonathan Tan @ 2023-08-29 22:04 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jonathan Tan, git, Derrick Stolee, Junio C Hamano

Here are answers to your questions, but I'll also reply to the latest
version stating that I'm OK with this patch set being merged (with my
reasons).

Taylor Blau <me@ttaylorr.com> writes:
> Apologies for not quite grokking this, but I am still somewhat confused.
> 
> Suppose we applied something like your suggestion on top of this patch,
> like so:
> 
> --- 8< ---
> diff --git a/bloom.c b/bloom.c
> index 739fa093ba..ee81976342 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -253,16 +253,16 @@ static void init_truncated_large_filter(struct bloom_filter *filter,
>  struct bloom_filter *get_bloom_filter(struct repository *r, struct commit *c)
>  {
>  	struct bloom_filter *filter;
> -	int hash_version;
> +	struct bloom_filter_settings *settings;
> 
>  	filter = get_or_compute_bloom_filter(r, c, 0, NULL, NULL);
>  	if (!filter)
>  		return NULL;
> 
>  	prepare_repo_settings(r);
> -	hash_version = r->settings.commit_graph_changed_paths_version;
> +	settings = get_bloom_filter_settings(r);
> 
> -	if (!(hash_version == -1 || hash_version == filter->version))
> +	if (filter->version != (settings->hash_version == 2 ? 2 : 1))
>  		return NULL; /* unusable filter */
>  	return filter;
>  }
> --- >8 ---
> 
> We have a handful of failing tests, notably including t4216.1, which
> tries to read settings->hash_version, but fails since settings is NULL.
> I believe this happens when we have a computed Bloom filter prepared for
> some commit, but have not yet written a commit graph containing that (or
> any) Bloom filter.

Ah, I discovered this independently too. I think it happens when we
write version 2 Bloom filters to a repo that has no Bloom filters
currently. So, r->settings.commit_graph_changed_paths_version is set to
2, but settings is NULL. I think the solution has to be a combination
of both (use commit_graph_changed_paths_version, but if it is -1, check
get_bloom_filter_settings()). But having said that, as I said above, we
can go with what you have currently.

> But I think we're talking about different things here. The point of
> get_bloom_filter() as a wrapper around get_or_compute_bloom_filter() is
> to only return Bloom filters that are readable under the configuration
> commitGraph.changedPathsVersion.

But this is my contention - sometimes commitGraph.changedPathsVersion is
-1, so we need to find out which version is readable from elsewhere.

> We have a handle on what version was used to compute Bloom filters in
> each layer of the graph by reading its "version" field, which is written
> via load_bloom_filter_from_graph().
> 
> So we want to return a non-NULL filter exactly when we (a) have a Bloom
> filter computed for the given commit, and (b) its version matches the
> version specified in commitGraph.chagnedPathsVersion.

Yes, except add "or autodetected from the first commit graph file that
we read if nothing is specified in commitGraph.changedPathsVersion" to
the end.

> Are you saying that we need to do the normalization ahead of time so
> that we don't emit filters that have different hash versions when
> working in a commit-graph chain where each layer may use different Bloom
> filter settings? 

By "emit" do you mean write filters to disk? If yes, I'm worried about
reading them, not writing them - for example, when running "git log"
with a path. I am worried precisely about the commit-graph chain in
which different layers have different Bloom settings, yes.

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2023-08-29 22:05 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-07 16:37 [RFC PATCH 0/6] bloom: reuse existing Bloom filters when possible during upgrade Taylor Blau
2023-08-07 16:37 ` [RFC PATCH 1/6] bloom: annotate filters with hash version Taylor Blau
2023-08-11 21:46   ` Jonathan Tan
2023-08-17 19:55     ` Taylor Blau
2023-08-21 20:21       ` Taylor Blau
2023-08-07 16:37 ` [RFC PATCH 2/6] bloom: prepare to discard incompatible Bloom filters Taylor Blau
2023-08-11 21:48   ` Jonathan Tan
2023-08-21 20:23     ` Taylor Blau
2023-08-24 22:20   ` Jonathan Tan
2023-08-24 22:47     ` Taylor Blau
2023-08-24 23:05       ` Jonathan Tan
2023-08-25 19:00         ` Taylor Blau
2023-08-29 16:49           ` Jonathan Tan
2023-08-29 19:14             ` Taylor Blau
2023-08-29 22:04               ` Jonathan Tan
2023-08-07 16:37 ` [RFC PATCH 3/6] t/t4216-log-bloom.sh: harden `test_bloom_filters_not_used()` Taylor Blau
2023-08-07 16:37 ` [RFC PATCH 4/6] commit-graph.c: unconditionally load Bloom filters Taylor Blau
2023-08-11 22:00   ` Jonathan Tan
2023-08-21 20:40     ` Taylor Blau
2023-08-07 16:37 ` [RFC PATCH 5/6] object.h: fix mis-aligned flag bits table Taylor Blau
2023-08-07 16:37 ` [RFC PATCH 6/6] commit-graph: reuse existing Bloom filters where possible Taylor Blau
2023-08-11 22:06   ` Jonathan Tan
2023-08-11 22:13 ` [RFC PATCH 0/6] bloom: reuse existing Bloom filters when possible during upgrade Jonathan Tan
2023-08-21 20:46   ` Taylor Blau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).