git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 00/12] more miscellaneous Bloom filter improvements, redux
@ 2020-09-09 15:22 Taylor Blau
  2020-09-09 15:22 ` [PATCH 01/12] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
                   ` (13 more replies)
  0 siblings, 14 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-09 15:22 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

Here is a rejiggered version of my series in [1], which accomplishes the
same without changing any of the on-disk commit-graph format.

As a reminder, the main goal of this series is to introduce a
'--max-new-filters' flag to 'git commit-graph write' to place a limit on
the number of new Bloom filters a writer is willing to compute from
scratch. The main difficulty is disambiguating between empty/too-large
filters and ones that haven't been computed yet. See "bloom: encode
out-of-bounds filters as non-empty" for the details.

The series is organized as follows:

  * Patches 1-4 are uninteresting preparatory steps.,
  * Patch 5 introduces the 'commitGraph.readChangedPaths' configuration.
  * Patches 6-8 are more preparation.
  * Patch 9 is from Stolee and fixes a bug where computing Bloom filters
    from scratch wouldn't stop at the limit of 512.
  * Patches 10-12 prepares for and then introduces '--max-new-filters'.

The first nine patches are basically unchanged from [1] where they were
thoroughly reviewed. The tenth patch is new, and the final two patches
are only touched up and simplified to work with this new approach, but
they have otherwise been reviewed.

Since the old thread was getting long, and this is a substantially new
approach, I'm sending this as "v1" of a new series, which hopefully
nobody minds.

[1]: https://lore.kernel.org/git/cover.1596480582.git.me@ttaylorr.com/

Derrick Stolee (1):
  bloom/diff: properly short-circuit on max_changes

Taylor Blau (11):
  commit-graph: introduce 'get_bloom_filter_settings()'
  t4216: use an '&&'-chain
  commit-graph: pass a 'struct repository *' in more places
  t/helper/test-read-graph.c: prepare repo settings
  commit-graph: respect 'commitGraph.readChangedPaths'
  commit-graph.c: store maximum changed paths
  bloom: split 'get_bloom_filter()' in two
  bloom: use provided 'struct bloom_filter_settings'
  bloom: encode out-of-bounds filters as non-empty
  commit-graph: rename 'split_commit_graph_opts'
  builtin/commit-graph.c: introduce '--max-new-filters=<n>'

 Documentation/config.txt                      |   2 +
 Documentation/config/commitgraph.txt          |   8 +
 Documentation/git-commit-graph.txt            |   6 +
 .../technical/commit-graph-format.txt         |   2 +-
 blame.c                                       |   8 +-
 bloom.c                                       |  53 +++--
 bloom.h                                       |  29 ++-
 builtin/commit-graph.c                        |  61 ++++--
 commit-graph.c                                | 148 ++++++++++----
 commit-graph.h                                |  17 +-
 diff.h                                        |   2 -
 fuzz-commit-graph.c                           |   5 +-
 line-log.c                                    |   2 +-
 repo-settings.c                               |   3 +
 repository.h                                  |   1 +
 revision.c                                    |   7 +-
 t/helper/test-bloom.c                         |   4 +-
 t/helper/test-read-graph.c                    |   3 +-
 t/t0095-bloom.sh                              |   4 +-
 t/t4216-log-bloom.sh                          | 181 ++++++++++++++++--
 t/t5324-split-commit-graph.sh                 |  13 ++
 tree-diff.c                                   |   5 +-
 22 files changed, 442 insertions(+), 122 deletions(-)
 create mode 100644 Documentation/config/commitgraph.txt

--
2.28.0.462.g4ff11cec37

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH 01/12] commit-graph: introduce 'get_bloom_filter_settings()'
  2020-09-09 15:22 [PATCH 00/12] more miscellaneous Bloom filter improvements, redux Taylor Blau
@ 2020-09-09 15:22 ` Taylor Blau
  2020-09-09 15:22 ` [PATCH 02/12] t4216: use an '&&'-chain Taylor Blau
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-09 15:22 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

Many places in the code often need a pointer to the commit-graph's
'struct bloom_filter_settings', in which case they often take the value
from the top-most commit-graph.

In the non-split case, this works as expected. In the split case,
however, things get a little tricky. Not all layers in a chain of
incremental commit-graphs are required to themselves have Bloom data,
and so whether or not some part of the code uses Bloom filters depends
entirely on whether or not the top-most level of the commit-graph chain
has Bloom filters.

This has been the behavior since Bloom filters were introduced, and has
been codified into the tests since a759bfa9ee (t4216: add end to end
tests for git log with Bloom filters, 2020-04-06). In fact, t4216.130
requires that Bloom filters are not used in exactly the case described
earlier.

There is no reason that this needs to be the case, since it is perfectly
valid for commits in an earlier layer to have Bloom filters when commits
in a newer layer do not.

Since Bloom settings are guaranteed in practice to be the same for any
layer in a chain that has Bloom data, it is sufficient to traverse the
'->base_graph' pointer until either (1) a non-null 'struct
bloom_filter_settings *' is found, or (2) until we are at the root of
the commit-graph chain.

Introduce a 'get_bloom_filter_settings()' function that does just this,
and use it instead of purely dereferencing the top-most graph's
'->bloom_filter_settings' pointer.

While we're at it, add an additional test in t5324 to guard against code
in the commit-graph writing machinery that doesn't correctly handle a
NULL 'struct bloom_filter *'.

Co-authored-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 blame.c                       |  6 ++++--
 bloom.c                       |  6 +++---
 commit-graph.c                | 11 +++++++++++
 commit-graph.h                |  2 ++
 revision.c                    |  5 +----
 t/t4216-log-bloom.sh          |  9 ++++++---
 t/t5324-split-commit-graph.sh | 13 +++++++++++++
 7 files changed, 40 insertions(+), 12 deletions(-)

diff --git a/blame.c b/blame.c
index 1be1cd82a2..903e23af23 100644
--- a/blame.c
+++ b/blame.c
@@ -2892,16 +2892,18 @@ void setup_blame_bloom_data(struct blame_scoreboard *sb,
 			    const char *path)
 {
 	struct blame_bloom_data *bd;
+	struct bloom_filter_settings *bs;
 
 	if (!sb->repo->objects->commit_graph)
 		return;
 
-	if (!sb->repo->objects->commit_graph->bloom_filter_settings)
+	bs = get_bloom_filter_settings(sb->repo);
+	if (!bs)
 		return;
 
 	bd = xmalloc(sizeof(struct blame_bloom_data));
 
-	bd->settings = sb->repo->objects->commit_graph->bloom_filter_settings;
+	bd->settings = bs;
 
 	bd->alloc = 4;
 	bd->nr = 0;
diff --git a/bloom.c b/bloom.c
index 1a573226e7..cd9380ac62 100644
--- a/bloom.c
+++ b/bloom.c
@@ -38,7 +38,7 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
 	while (graph_pos < g->num_commits_in_base)
 		g = g->base_graph;
 
-	/* The commit graph commit 'c' lives in doesn't carry bloom filters. */
+	/* The commit graph commit 'c' lives in doesn't carry Bloom filters. */
 	if (!g->chunk_bloom_indexes)
 		return 0;
 
@@ -195,8 +195,8 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	if (!filter->data) {
 		load_commit_graph_info(r, c);
 		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH &&
-			r->objects->commit_graph->chunk_bloom_indexes)
-			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
+			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
+				return filter;
 	}
 
 	if (filter->data)
diff --git a/commit-graph.c b/commit-graph.c
index 0ed003e218..6a36ed0b06 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -667,6 +667,17 @@ int generation_numbers_enabled(struct repository *r)
 	return !!first_generation;
 }
 
+struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
+{
+	struct commit_graph *g = r->objects->commit_graph;
+	while (g) {
+		if (g->bloom_filter_settings)
+			return g->bloom_filter_settings;
+		g = g->base_graph;
+	}
+	return NULL;
+}
+
 static void close_commit_graph_one(struct commit_graph *g)
 {
 	if (!g)
diff --git a/commit-graph.h b/commit-graph.h
index 09a97030dc..0677dd1031 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -87,6 +87,8 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size);
  */
 int generation_numbers_enabled(struct repository *r);
 
+struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
+
 enum commit_graph_write_flags {
 	COMMIT_GRAPH_WRITE_APPEND     = (1 << 0),
 	COMMIT_GRAPH_WRITE_PROGRESS   = (1 << 1),
diff --git a/revision.c b/revision.c
index 08c2ad23af..857274408c 100644
--- a/revision.c
+++ b/revision.c
@@ -680,10 +680,7 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 
 	repo_parse_commit(revs->repo, revs->commits->item);
 
-	if (!revs->repo->objects->commit_graph)
-		return;
-
-	revs->bloom_filter_settings = revs->repo->objects->commit_graph->bloom_filter_settings;
+	revs->bloom_filter_settings = get_bloom_filter_settings(revs->repo);
 	if (!revs->bloom_filter_settings)
 		return;
 
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 4bb9e9dbe2..715912ad0f 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -65,7 +65,7 @@ setup () {
 
 test_bloom_filters_used () {
 	log_args=$1
-	bloom_trace_prefix="statistics:{\"filter_not_present\":0,\"maybe\""
+	bloom_trace_prefix="statistics:{\"filter_not_present\":${2:-0},\"maybe\""
 	setup "$log_args" &&
 	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
 	test_cmp log_wo_bloom log_w_bloom &&
@@ -139,8 +139,11 @@ test_expect_success 'setup - add commit-graph to the chain without Bloom filters
 	test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain
 '
 
-test_expect_success 'Do not use Bloom filters if the latest graph does not have Bloom filters.' '
-	test_bloom_filters_not_used "-- A/B"
+test_expect_success 'use Bloom filters even if the latest graph does not have Bloom filters' '
+	# Ensure that the number of empty filters is equal to the number of
+	# filters in the latest graph layer to prove that they are loaded (and
+	# ignored).
+	test_bloom_filters_used "-- A/B" 3
 '
 
 test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 18216463c7..c334ee9155 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -427,4 +427,17 @@ done <<\EOF
 0600 -r--------
 EOF
 
+test_expect_success '--split=replace with partial Bloom data' '
+	rm -rf $graphdir $infodir/commit-graph &&
+	git reset --hard commits/3 &&
+	git rev-list -1 HEAD~2 >a &&
+	git rev-list -1 HEAD~1 >b &&
+	git commit-graph write --split=no-merge --stdin-commits --changed-paths <a &&
+	git commit-graph write --split=no-merge --stdin-commits <b &&
+	git commit-graph write --split=replace --stdin-commits --changed-paths <c &&
+	ls $graphdir/graph-*.graph >graph-files &&
+	test_line_count = 1 graph-files &&
+	verify_chain_files_exist $graphdir
+'
+
 test_done
-- 
2.28.0.462.g4ff11cec37


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH 02/12] t4216: use an '&&'-chain
  2020-09-09 15:22 [PATCH 00/12] more miscellaneous Bloom filter improvements, redux Taylor Blau
  2020-09-09 15:22 ` [PATCH 01/12] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
@ 2020-09-09 15:22 ` Taylor Blau
  2020-09-09 15:22 ` [PATCH 03/12] commit-graph: pass a 'struct repository *' in more places Taylor Blau
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-09 15:22 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

In a759bfa9ee (t4216: add end to end tests for git log with Bloom
filters, 2020-04-06), a 'rm' invocation was added without a
corresponding '&&' chain.

When 'trace.perf' already exists, everything works fine. However, the
function can be executed without 'trace.perf' on disk (eg., when the
subset of tests run is altered with '--run'), and so the bare 'rm'
complains about a missing file.

To remove some noise from the test log, invoke 'rm' with '-f', at which
point it is sensible to place the 'rm -f' in an '&&'-chain, which is
both (1) our usual style, and (2) avoids a broken chain in the future if
more commands are added at the beginning of the function.

Helped-by: Eric Sunshine <sunshine@sunshineco.com>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/t4216-log-bloom.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 715912ad0f..cd89c75002 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -58,7 +58,7 @@ sane_unset GIT_TRACE2_PERF_BRIEF
 sane_unset GIT_TRACE2_CONFIG_PARAMS
 
 setup () {
-	rm "$TRASH_DIRECTORY/trace.perf"
+	rm -f "$TRASH_DIRECTORY/trace.perf" &&
 	git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom &&
 	GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom
 }
-- 
2.28.0.462.g4ff11cec37


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH 03/12] commit-graph: pass a 'struct repository *' in more places
  2020-09-09 15:22 [PATCH 00/12] more miscellaneous Bloom filter improvements, redux Taylor Blau
  2020-09-09 15:22 ` [PATCH 01/12] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
  2020-09-09 15:22 ` [PATCH 02/12] t4216: use an '&&'-chain Taylor Blau
@ 2020-09-09 15:22 ` Taylor Blau
  2020-09-09 15:23 ` [PATCH 04/12] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-09 15:22 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

In a future commit, some commit-graph internals will want access to
'r->settings', but we only have the 'struct object_directory *'
corresponding to that repository.

Add an additional parameter to pass the repository around in more
places.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/commit-graph.c |  2 +-
 commit-graph.c         | 17 ++++++++++-------
 commit-graph.h         |  6 ++++--
 fuzz-commit-graph.c    |  5 +++--
 4 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 523501f217..ba5584463f 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -106,7 +106,7 @@ static int graph_verify(int argc, const char **argv)
 	FREE_AND_NULL(graph_name);
 
 	if (open_ok)
-		graph = load_commit_graph_one_fd_st(fd, &st, odb);
+		graph = load_commit_graph_one_fd_st(the_repository, fd, &st, odb);
 	else
 		graph = read_commit_graph_one(the_repository, odb);
 
diff --git a/commit-graph.c b/commit-graph.c
index 6a36ed0b06..72a838bd00 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -231,7 +231,8 @@ int open_commit_graph(const char *graph_file, int *fd, struct stat *st)
 	return 1;
 }
 
-struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st,
+struct commit_graph *load_commit_graph_one_fd_st(struct repository *r,
+						 int fd, struct stat *st,
 						 struct object_directory *odb)
 {
 	void *graph_map;
@@ -247,7 +248,7 @@ struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st,
 	}
 	graph_map = xmmap(NULL, graph_size, PROT_READ, MAP_PRIVATE, fd, 0);
 	close(fd);
-	ret = parse_commit_graph(graph_map, graph_size);
+	ret = parse_commit_graph(r, graph_map, graph_size);
 
 	if (ret)
 		ret->odb = odb;
@@ -287,7 +288,8 @@ static int verify_commit_graph_lite(struct commit_graph *g)
 	return 0;
 }
 
-struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size)
+struct commit_graph *parse_commit_graph(struct repository *r,
+					void *graph_map, size_t graph_size)
 {
 	const unsigned char *data, *chunk_lookup;
 	uint32_t i;
@@ -452,7 +454,8 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size)
 	return NULL;
 }
 
-static struct commit_graph *load_commit_graph_one(const char *graph_file,
+static struct commit_graph *load_commit_graph_one(struct repository *r,
+						  const char *graph_file,
 						  struct object_directory *odb)
 {
 
@@ -464,7 +467,7 @@ static struct commit_graph *load_commit_graph_one(const char *graph_file,
 	if (!open_ok)
 		return NULL;
 
-	g = load_commit_graph_one_fd_st(fd, &st, odb);
+	g = load_commit_graph_one_fd_st(r, fd, &st, odb);
 
 	if (g)
 		g->filename = xstrdup(graph_file);
@@ -476,7 +479,7 @@ static struct commit_graph *load_commit_graph_v1(struct repository *r,
 						 struct object_directory *odb)
 {
 	char *graph_name = get_commit_graph_filename(odb);
-	struct commit_graph *g = load_commit_graph_one(graph_name, odb);
+	struct commit_graph *g = load_commit_graph_one(r, graph_name, odb);
 	free(graph_name);
 
 	return g;
@@ -557,7 +560,7 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r,
 		valid = 0;
 		for (odb = r->objects->odb; odb; odb = odb->next) {
 			char *graph_name = get_split_graph_filename(odb, line.buf);
-			struct commit_graph *g = load_commit_graph_one(graph_name, odb);
+			struct commit_graph *g = load_commit_graph_one(r, graph_name, odb);
 
 			free(graph_name);
 
diff --git a/commit-graph.h b/commit-graph.h
index 0677dd1031..d9acb22bac 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -75,11 +75,13 @@ struct commit_graph {
 	struct bloom_filter_settings *bloom_filter_settings;
 };
 
-struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st,
+struct commit_graph *load_commit_graph_one_fd_st(struct repository *r,
+						 int fd, struct stat *st,
 						 struct object_directory *odb);
 struct commit_graph *read_commit_graph_one(struct repository *r,
 					   struct object_directory *odb);
-struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size);
+struct commit_graph *parse_commit_graph(struct repository *r,
+					void *graph_map, size_t graph_size);
 
 /*
  * Return 1 if and only if the repository has a commit-graph
diff --git a/fuzz-commit-graph.c b/fuzz-commit-graph.c
index 430817214d..e7cf6d5b0f 100644
--- a/fuzz-commit-graph.c
+++ b/fuzz-commit-graph.c
@@ -1,7 +1,8 @@
 #include "commit-graph.h"
 #include "repository.h"
 
-struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size);
+struct commit_graph *parse_commit_graph(struct repository *r,
+					void *graph_map, size_t graph_size);
 
 int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size);
 
@@ -10,7 +11,7 @@ int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
 	struct commit_graph *g;
 
 	initialize_the_repository();
-	g = parse_commit_graph((void *)data, size);
+	g = parse_commit_graph(the_repository, (void *)data, size);
 	repo_clear(the_repository);
 	free_commit_graph(g);
 
-- 
2.28.0.462.g4ff11cec37


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH 04/12] t/helper/test-read-graph.c: prepare repo settings
  2020-09-09 15:22 [PATCH 00/12] more miscellaneous Bloom filter improvements, redux Taylor Blau
                   ` (2 preceding siblings ...)
  2020-09-09 15:22 ` [PATCH 03/12] commit-graph: pass a 'struct repository *' in more places Taylor Blau
@ 2020-09-09 15:23 ` Taylor Blau
  2020-09-09 15:23 ` [PATCH 05/12] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-09 15:23 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

The read-graph test-tool is used by a number of the commit-graph test to
assert various properties about a commit-graph. Previously, this program
never ran 'prepare_repo_settings()'. There was no need to do so, since
none of the commit-graph machinery is affected by the repo settings.

In the next patch, the commit-graph machinery's behavior will become
dependent on the repo settings, and so loading them before running the
rest of the test tool is critical.

As such, teach the test tool to call 'prepare_repo_settings()'.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/helper/test-read-graph.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index 6d0c962438..5f585a1725 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -12,11 +12,12 @@ int cmd__read_graph(int argc, const char **argv)
 	setup_git_directory();
 	odb = the_repository->objects->odb;
 
+	prepare_repo_settings(the_repository);
+
 	graph = read_commit_graph_one(the_repository, odb);
 	if (!graph)
 		return 1;
 
-
 	printf("header: %08x %d %d %d %d\n",
 		ntohl(*(uint32_t*)graph->data),
 		*(unsigned char*)(graph->data + 4),
-- 
2.28.0.462.g4ff11cec37


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH 05/12] commit-graph: respect 'commitGraph.readChangedPaths'
  2020-09-09 15:22 [PATCH 00/12] more miscellaneous Bloom filter improvements, redux Taylor Blau
                   ` (3 preceding siblings ...)
  2020-09-09 15:23 ` [PATCH 04/12] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
@ 2020-09-09 15:23 ` Taylor Blau
  2020-09-09 15:23 ` [PATCH 06/12] commit-graph.c: store maximum changed paths Taylor Blau
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-09 15:23 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

Git uses the 'core.commitGraph' configuration value to control whether
or not the commit graph is used when parsing commits or performing a
traversal.

Now that commit-graphs can also contain a section for changed-path Bloom
filters, administrators that already have commit-graphs may find it
convenient to use those graphs without relying on their changed-path
Bloom filters. This can happen, for example, during a staged roll-out,
or in the event of an incident.

Introduce 'commitGraph.readChangedPaths' to control whether or not Bloom
filters are read. Note that this configuration is independent from both:

  - 'core.commitGraph', to allow flexibility in using all parts of a
    commit-graph _except_ for its Bloom filters.

  - The '--changed-paths' option for 'git commit-graph write', to allow
    reading and writing Bloom filters to be controlled independently.

When the variable is set, pretend as if no Bloom data was specified at
all. This avoids adding additional special-casing outside of the
commit-graph internals.

Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config.txt             | 2 ++
 Documentation/config/commitgraph.txt | 4 ++++
 commit-graph.c                       | 6 ++++--
 repo-settings.c                      | 3 +++
 repository.h                         | 1 +
 t/t4216-log-bloom.sh                 | 4 +++-
 6 files changed, 17 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/config/commitgraph.txt

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 3042d80978..770ae79b82 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -340,6 +340,8 @@ include::config/column.txt[]
 
 include::config/commit.txt[]
 
+include::config/commitgraph.txt[]
+
 include::config/credential.txt[]
 
 include::config/completion.txt[]
diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
new file mode 100644
index 0000000000..cff0797b54
--- /dev/null
+++ b/Documentation/config/commitgraph.txt
@@ -0,0 +1,4 @@
+commitGraph.readChangedPaths::
+	If true, then git will use the changed-path Bloom filters in the
+	commit-graph file (if it exists, and they are present). Defaults to
+	true. See linkgit:git-commit-graph[1] for more information.
diff --git a/commit-graph.c b/commit-graph.c
index 72a838bd00..ea54d108b9 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -327,6 +327,8 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 		return NULL;
 	}
 
+	prepare_repo_settings(r);
+
 	graph = alloc_commit_graph();
 
 	graph->hash_len = the_hash_algo->rawsz;
@@ -403,14 +405,14 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 		case GRAPH_CHUNKID_BLOOMINDEXES:
 			if (graph->chunk_bloom_indexes)
 				chunk_repeated = 1;
-			else
+			else if (r->settings.commit_graph_read_changed_paths)
 				graph->chunk_bloom_indexes = data + chunk_offset;
 			break;
 
 		case GRAPH_CHUNKID_BLOOMDATA:
 			if (graph->chunk_bloom_data)
 				chunk_repeated = 1;
-			else {
+			else if (r->settings.commit_graph_read_changed_paths) {
 				uint32_t hash_version;
 				graph->chunk_bloom_data = data + chunk_offset;
 				hash_version = get_be32(data + chunk_offset);
diff --git a/repo-settings.c b/repo-settings.c
index aa61a35338..88ccce2036 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -17,9 +17,12 @@ void prepare_repo_settings(struct repository *r)
 
 	if (!repo_config_get_bool(r, "core.commitgraph", &value))
 		r->settings.core_commit_graph = value;
+	if (!repo_config_get_bool(r, "commitgraph.readchangedpaths", &value))
+		r->settings.commit_graph_read_changed_paths = value;
 	if (!repo_config_get_bool(r, "gc.writecommitgraph", &value))
 		r->settings.gc_write_commit_graph = value;
 	UPDATE_DEFAULT_BOOL(r->settings.core_commit_graph, 1);
+	UPDATE_DEFAULT_BOOL(r->settings.commit_graph_read_changed_paths, 1);
 	UPDATE_DEFAULT_BOOL(r->settings.gc_write_commit_graph, 1);
 
 	if (!repo_config_get_int(r, "index.version", &value))
diff --git a/repository.h b/repository.h
index 628c834367..bacf843d46 100644
--- a/repository.h
+++ b/repository.h
@@ -30,6 +30,7 @@ struct repo_settings {
 	int initialized;
 
 	int core_commit_graph;
+	int commit_graph_read_changed_paths;
 	int gc_write_commit_graph;
 	int fetch_write_commit_graph;
 
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index cd89c75002..fc7693806c 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -95,7 +95,9 @@ do
 		      "--ancestry-path side..master"
 	do
 		test_expect_success "git log option: $option for path: $path" '
-			test_bloom_filters_used "$option -- $path"
+			test_bloom_filters_used "$option -- $path" &&
+			test_config commitgraph.readChangedPaths false &&
+			test_bloom_filters_not_used "$option -- $path"
 		'
 	done
 done
-- 
2.28.0.462.g4ff11cec37


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH 06/12] commit-graph.c: store maximum changed paths
  2020-09-09 15:22 [PATCH 00/12] more miscellaneous Bloom filter improvements, redux Taylor Blau
                   ` (4 preceding siblings ...)
  2020-09-09 15:23 ` [PATCH 05/12] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
@ 2020-09-09 15:23 ` Taylor Blau
  2020-09-09 15:23 ` [PATCH 07/12] bloom: split 'get_bloom_filter()' in two Taylor Blau
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-09 15:23 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

For now, we assume that there is a fixed constant describing the
maximum number of changed paths we are willing to store in a Bloom
filter.

Prepare for that to (at least partially) not be the case by making it a
member of the 'struct bloom_filter_settings'. This will be helpful in
the subsequent patches by reducing the size of test cases that exercise
storing too many changed paths, as well as preparing for an eventual
future in which this value might change.

This patch alone does not cause newly generated Bloom filters to use
a custom upper-bound on the maximum number of changed paths a single
Bloom filter can hold, that will occur in a later patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bloom.h              | 11 ++++++++++-
 commit-graph.c       |  3 +++
 t/t4216-log-bloom.sh |  4 ++--
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/bloom.h b/bloom.h
index d8fbb0fbf1..0b9b59a6fe 100644
--- a/bloom.h
+++ b/bloom.h
@@ -28,9 +28,18 @@ struct bloom_filter_settings {
 	 * that contain n*b bits.
 	 */
 	uint32_t bits_per_entry;
+
+	/*
+	 * The maximum number of changed paths per commit
+	 * before declaring a Bloom filter to be too-large.
+	 *
+	 * Not written to the commit-graph file.
+	 */
+	uint32_t max_changed_paths;
 };
 
-#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
+#define DEFAULT_BLOOM_MAX_CHANGES 512
+#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10, DEFAULT_BLOOM_MAX_CHANGES }
 #define BITS_PER_WORD 8
 #define BLOOMDATA_CHUNK_HEADER_SIZE 3 * sizeof(uint32_t)
 
diff --git a/commit-graph.c b/commit-graph.c
index ea54d108b9..55af498aa0 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1201,6 +1201,7 @@ static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
 	jw_object_intmax(&jw, "hash_version", ctx->bloom_settings->hash_version);
 	jw_object_intmax(&jw, "num_hashes", ctx->bloom_settings->num_hashes);
 	jw_object_intmax(&jw, "bits_per_entry", ctx->bloom_settings->bits_per_entry);
+	jw_object_intmax(&jw, "max_changed_paths", ctx->bloom_settings->max_changed_paths);
 	jw_end(&jw);
 
 	trace2_data_json("bloom", ctx->r, "settings", &jw);
@@ -1669,6 +1670,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 							      bloom_settings.bits_per_entry);
 		bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
 							  bloom_settings.num_hashes);
+		bloom_settings.max_changed_paths = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS",
+							  bloom_settings.max_changed_paths);
 		ctx->bloom_settings = &bloom_settings;
 	}
 
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index fc7693806c..47ddf2641f 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -174,11 +174,11 @@ test_expect_success 'persist filter settings' '
 		GIT_TEST_BLOOM_SETTINGS_NUM_HASHES=9 \
 		GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY=15 \
 		git commit-graph write --reachable --changed-paths &&
-	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2.txt &&
+	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15" trace2.txt &&
 	GIT_TRACE2_EVENT="$(pwd)/trace2-auto.txt" \
 		GIT_TRACE2_EVENT_NESTING=5 \
 		git commit-graph write --reachable --changed-paths &&
-	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2-auto.txt
+	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15" trace2-auto.txt
 '
 
 test_expect_success 'correctly report changes over limit' '
-- 
2.28.0.462.g4ff11cec37


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH 07/12] bloom: split 'get_bloom_filter()' in two
  2020-09-09 15:22 [PATCH 00/12] more miscellaneous Bloom filter improvements, redux Taylor Blau
                   ` (5 preceding siblings ...)
  2020-09-09 15:23 ` [PATCH 06/12] commit-graph.c: store maximum changed paths Taylor Blau
@ 2020-09-09 15:23 ` Taylor Blau
  2020-09-09 15:23 ` [PATCH 08/12] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-09 15:23 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

'get_bloom_filter' takes a flag to control whether it will compute a
Bloom filter if the requested one is missing. In the next patch, we'll
add yet another parameter to this method, which would force all but one
caller to specify an extra 'NULL' parameter at the end.

Instead of doing this, split 'get_bloom_filter' into two functions:
'get_bloom_filter' and 'get_or_compute_bloom_filter'. The former only
looks up a Bloom filter (and does not compute one if it's missing,
thus dropping the 'compute_if_not_present' flag). The latter does
compute missing Bloom filters, with an additional parameter to store
whether or not it needed to do so.

This simplifies many call-sites, since the majority of existing callers
to 'get_bloom_filter' do not want missing Bloom filters to be computed
(so they can drop the parameter entirely and use the simpler version of
the function).

While we're at it, instrument the new 'get_or_compute_bloom_filter()'
with counters in the 'write_commit_graph_context' struct which store
the number of filters that we did and didn't compute, as well as filters
that were truncated.

It would be nice to drop the 'compute_if_not_present' flag entirely,
since all remaining callers of 'get_or_compute_bloom_filter' pass it as
'1', but this will change in a future patch and hence cannot be removed.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 blame.c               |  2 +-
 bloom.c               | 16 +++++++++++++---
 bloom.h               | 16 +++++++++++++---
 commit-graph.c        | 42 +++++++++++++++++++++++++++++++++++++++---
 line-log.c            |  2 +-
 revision.c            |  2 +-
 t/helper/test-bloom.c |  3 ++-
 7 files changed, 70 insertions(+), 13 deletions(-)

diff --git a/blame.c b/blame.c
index 903e23af23..e5ba35dbd1 100644
--- a/blame.c
+++ b/blame.c
@@ -1276,7 +1276,7 @@ static int maybe_changed_path(struct repository *r,
 	if (commit_graph_generation(origin->commit) == GENERATION_NUMBER_INFINITY)
 		return 1;
 
-	filter = get_bloom_filter(r, origin->commit, 0);
+	filter = get_bloom_filter(r, origin->commit);
 
 	if (!filter)
 		return 1;
diff --git a/bloom.c b/bloom.c
index cd9380ac62..393c61b4bc 100644
--- a/bloom.c
+++ b/bloom.c
@@ -177,9 +177,10 @@ static int pathmap_cmp(const void *hashmap_cmp_fn_data,
 	return strcmp(e1->path, e2->path);
 }
 
-struct bloom_filter *get_bloom_filter(struct repository *r,
-				      struct commit *c,
-				      int compute_if_not_present)
+struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
+						 struct commit *c,
+						 int compute_if_not_present,
+						 enum bloom_filter_computed *computed)
 {
 	struct bloom_filter *filter;
 	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
@@ -187,6 +188,9 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	struct diff_options diffopt;
 	int max_changes = 512;
 
+	if (computed)
+		*computed = BLOOM_NOT_COMPUTED;
+
 	if (!bloom_filters.slab_size)
 		return NULL;
 
@@ -271,8 +275,14 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 			diff_free_filepair(diff_queued_diff.queue[i]);
 		filter->data = NULL;
 		filter->len = 0;
+
+		if (computed)
+			*computed |= BLOOM_TRUNC_LARGE;
 	}
 
+	if (computed)
+		*computed |= BLOOM_COMPUTED;
+
 	free(diff_queued_diff.queue);
 	DIFF_QUEUE_CLEAR(&diff_queued_diff);
 
diff --git a/bloom.h b/bloom.h
index 0b9b59a6fe..e2e035ad14 100644
--- a/bloom.h
+++ b/bloom.h
@@ -89,9 +89,19 @@ void add_key_to_filter(const struct bloom_key *key,
 
 void init_bloom_filters(void);
 
-struct bloom_filter *get_bloom_filter(struct repository *r,
-				      struct commit *c,
-				      int compute_if_not_present);
+enum bloom_filter_computed {
+	BLOOM_NOT_COMPUTED = (1 << 0),
+	BLOOM_COMPUTED     = (1 << 1),
+	BLOOM_TRUNC_LARGE  = (1 << 2),
+};
+
+struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
+						 struct commit *c,
+						 int compute_if_not_present,
+						 enum bloom_filter_computed *computed);
+
+#define get_bloom_filter(r, c) get_or_compute_bloom_filter( \
+	(r), (c), 0, NULL)
 
 int bloom_filter_contains(const struct bloom_filter *filter,
 			  const struct bloom_key *key,
diff --git a/commit-graph.c b/commit-graph.c
index 55af498aa0..b8b2c7ca65 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -971,6 +971,10 @@ struct write_commit_graph_context {
 	const struct split_commit_graph_opts *split_opts;
 	size_t total_bloom_filter_data_size;
 	const struct bloom_filter_settings *bloom_settings;
+
+	int count_bloom_filter_computed;
+	int count_bloom_filter_not_computed;
+	int count_bloom_filter_trunc_large;
 };
 
 static int write_graph_chunk_fanout(struct hashfile *f,
@@ -1182,7 +1186,7 @@ static int write_graph_chunk_bloom_indexes(struct hashfile *f,
 	uint32_t cur_pos = 0;
 
 	while (list < last) {
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
+		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
 		size_t len = filter ? filter->len : 0;
 		cur_pos += len;
 		display_progress(ctx->progress, ++ctx->progress_cnt);
@@ -1222,7 +1226,7 @@ static int write_graph_chunk_bloom_data(struct hashfile *f,
 	hashwrite_be32(f, ctx->bloom_settings->bits_per_entry);
 
 	while (list < last) {
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
+		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
 		size_t len = filter ? filter->len : 0;
 
 		display_progress(ctx->progress, ++ctx->progress_cnt);
@@ -1392,6 +1396,24 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 	stop_progress(&ctx->progress);
 }
 
+static void trace2_bloom_filter_write_statistics(struct write_commit_graph_context *ctx)
+{
+	struct json_writer jw = JSON_WRITER_INIT;
+
+	jw_object_begin(&jw, 0);
+	jw_object_intmax(&jw, "filter_computed",
+			 ctx->count_bloom_filter_computed);
+	jw_object_intmax(&jw, "filter_not_computed",
+			 ctx->count_bloom_filter_not_computed);
+	jw_object_intmax(&jw, "filter_trunc_large",
+			 ctx->count_bloom_filter_trunc_large);
+	jw_end(&jw);
+
+	trace2_data_json("commit-graph", the_repository, "bloom_statistics", &jw);
+
+	jw_release(&jw);
+}
+
 static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 {
 	int i;
@@ -1414,12 +1436,26 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 		QSORT(sorted_commits, ctx->commits.nr, commit_gen_cmp);
 
 	for (i = 0; i < ctx->commits.nr; i++) {
+		enum bloom_filter_computed computed = 0;
 		struct commit *c = sorted_commits[i];
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
+		struct bloom_filter *filter = get_or_compute_bloom_filter(
+			ctx->r,
+			c,
+			1,
+			&computed);
+		if (computed & BLOOM_COMPUTED) {
+			ctx->count_bloom_filter_computed++;
+			if (computed & BLOOM_TRUNC_LARGE)
+				ctx->count_bloom_filter_trunc_large++;
+		} else if (computed & BLOOM_NOT_COMPUTED)
+			ctx->count_bloom_filter_not_computed++;
 		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
 		display_progress(progress, i + 1);
 	}
 
+	if (trace2_is_enabled())
+		trace2_bloom_filter_write_statistics(ctx);
+
 	free(sorted_commits);
 	stop_progress(&progress);
 }
diff --git a/line-log.c b/line-log.c
index bf73ea95ac..68eeb425f8 100644
--- a/line-log.c
+++ b/line-log.c
@@ -1159,7 +1159,7 @@ static int bloom_filter_check(struct rev_info *rev,
 		return 1;
 
 	if (!rev->bloom_filter_settings ||
-	    !(filter = get_bloom_filter(rev->repo, commit, 0)))
+	    !(filter = get_bloom_filter(rev->repo, commit)))
 		return 1;
 
 	if (!range)
diff --git a/revision.c b/revision.c
index 857274408c..f4be5d1650 100644
--- a/revision.c
+++ b/revision.c
@@ -751,7 +751,7 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 	if (commit_graph_generation(commit) == GENERATION_NUMBER_INFINITY)
 		return -1;
 
-	filter = get_bloom_filter(revs->repo, commit, 0);
+	filter = get_bloom_filter(revs->repo, commit);
 
 	if (!filter) {
 		count_bloom_filter_not_present++;
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 5e77d56f59..9f7bb729fc 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -39,7 +39,8 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
 	struct bloom_filter *filter;
 	setup_git_directory();
 	c = lookup_commit(the_repository, commit_oid);
-	filter = get_bloom_filter(the_repository, c, 1);
+	filter = get_or_compute_bloom_filter(the_repository, c, 1,
+					     NULL);
 	print_bloom_filter(filter);
 }
 
-- 
2.28.0.462.g4ff11cec37


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH 08/12] bloom: use provided 'struct bloom_filter_settings'
  2020-09-09 15:22 [PATCH 00/12] more miscellaneous Bloom filter improvements, redux Taylor Blau
                   ` (6 preceding siblings ...)
  2020-09-09 15:23 ` [PATCH 07/12] bloom: split 'get_bloom_filter()' in two Taylor Blau
@ 2020-09-09 15:23 ` Taylor Blau
  2020-09-09 15:23 ` [PATCH 09/12] bloom/diff: properly short-circuit on max_changes Taylor Blau
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-09 15:23 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

When 'get_or_compute_bloom_filter()' needs to compute a Bloom filter
from scratch, it looks to the default 'struct bloom_filter_settings' in
order to determine the maximum number of changed paths, number of bits
per entry, and so on.

All of these values have so far been constant, and so there was no need
to pass in a pointer from the caller (eg., the one that is stored in the
'struct write_commit_graph_context').

Start passing in a 'struct bloom_filter_settings *' instead of using the
default values to respect graph-specific settings (eg., in the case of
setting 'GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS').

In order to have an initialized value for these settings, move its
initialization to earlier in the commit-graph write.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bloom.c               | 13 ++++++-------
 bloom.h               |  3 ++-
 commit-graph.c        | 21 ++++++++++-----------
 t/helper/test-bloom.c |  1 +
 4 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/bloom.c b/bloom.c
index 393c61b4bc..2d6aef9098 100644
--- a/bloom.c
+++ b/bloom.c
@@ -180,13 +180,12 @@ static int pathmap_cmp(const void *hashmap_cmp_fn_data,
 struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 						 struct commit *c,
 						 int compute_if_not_present,
+						 const struct bloom_filter_settings *settings,
 						 enum bloom_filter_computed *computed)
 {
 	struct bloom_filter *filter;
-	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 	int i;
 	struct diff_options diffopt;
-	int max_changes = 512;
 
 	if (computed)
 		*computed = BLOOM_NOT_COMPUTED;
@@ -211,7 +210,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 	repo_diff_setup(r, &diffopt);
 	diffopt.flags.recursive = 1;
 	diffopt.detect_rename = 0;
-	diffopt.max_changes = max_changes;
+	diffopt.max_changes = settings->max_changed_paths;
 	diff_setup_done(&diffopt);
 
 	/* ensure commit is parsed so we have parent information */
@@ -223,7 +222,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
 	diffcore_std(&diffopt);
 
-	if (diffopt.num_changes <= max_changes) {
+	if (diffopt.num_changes <= settings->max_changed_paths) {
 		struct hashmap pathmap;
 		struct pathmap_hash_entry *e;
 		struct hashmap_iter iter;
@@ -260,13 +259,13 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 			diff_free_filepair(diff_queued_diff.queue[i]);
 		}
 
-		filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
+		filter->len = (hashmap_get_size(&pathmap) * settings->bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
 		filter->data = xcalloc(filter->len, sizeof(unsigned char));
 
 		hashmap_for_each_entry(&pathmap, &iter, e, entry) {
 			struct bloom_key key;
-			fill_bloom_key(e->path, strlen(e->path), &key, &settings);
-			add_key_to_filter(&key, filter, &settings);
+			fill_bloom_key(e->path, strlen(e->path), &key, settings);
+			add_key_to_filter(&key, filter, settings);
 		}
 
 		hashmap_free_entries(&pathmap, struct pathmap_hash_entry, entry);
diff --git a/bloom.h b/bloom.h
index e2e035ad14..c6d77e8393 100644
--- a/bloom.h
+++ b/bloom.h
@@ -98,10 +98,11 @@ enum bloom_filter_computed {
 struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 						 struct commit *c,
 						 int compute_if_not_present,
+						 const struct bloom_filter_settings *settings,
 						 enum bloom_filter_computed *computed);
 
 #define get_bloom_filter(r, c) get_or_compute_bloom_filter( \
-	(r), (c), 0, NULL)
+	(r), (c), 0, NULL, NULL)
 
 int bloom_filter_contains(const struct bloom_filter *filter,
 			  const struct bloom_key *key,
diff --git a/commit-graph.c b/commit-graph.c
index b8b2c7ca65..500f29525a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1442,6 +1442,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 			ctx->r,
 			c,
 			1,
+			ctx->bloom_settings,
 			&computed);
 		if (computed & BLOOM_COMPUTED) {
 			ctx->count_bloom_filter_computed++;
@@ -1699,17 +1700,6 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	int num_chunks = 3;
 	uint64_t chunk_offset;
 	struct object_id file_hash;
-	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
-
-	if (!ctx->bloom_settings) {
-		bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
-							      bloom_settings.bits_per_entry);
-		bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
-							  bloom_settings.num_hashes);
-		bloom_settings.max_changed_paths = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS",
-							  bloom_settings.max_changed_paths);
-		ctx->bloom_settings = &bloom_settings;
-	}
 
 	if (ctx->split) {
 		struct strbuf tmp_file = STRBUF_INIT;
@@ -2155,6 +2145,7 @@ int write_commit_graph(struct object_directory *odb,
 	uint32_t i, count_distinct = 0;
 	int res = 0;
 	int replace = 0;
+	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 
 	if (!commit_graph_compatible(the_repository))
 		return 0;
@@ -2168,6 +2159,14 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->split_opts = split_opts;
 	ctx->total_bloom_filter_data_size = 0;
 
+	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
+						      bloom_settings.bits_per_entry);
+	bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
+						  bloom_settings.num_hashes);
+	bloom_settings.max_changed_paths = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS",
+							 bloom_settings.max_changed_paths);
+	ctx->bloom_settings = &bloom_settings;
+
 	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
 		ctx->changed_paths = 1;
 	if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 9f7bb729fc..46e97b04eb 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -40,6 +40,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
 	setup_git_directory();
 	c = lookup_commit(the_repository, commit_oid);
 	filter = get_or_compute_bloom_filter(the_repository, c, 1,
+					     &settings,
 					     NULL);
 	print_bloom_filter(filter);
 }
-- 
2.28.0.462.g4ff11cec37


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH 09/12] bloom/diff: properly short-circuit on max_changes
  2020-09-09 15:22 [PATCH 00/12] more miscellaneous Bloom filter improvements, redux Taylor Blau
                   ` (7 preceding siblings ...)
  2020-09-09 15:23 ` [PATCH 08/12] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
@ 2020-09-09 15:23 ` Taylor Blau
  2020-09-09 15:23 ` [PATCH 10/12] bloom: encode out-of-bounds filters as non-empty Taylor Blau
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-09 15:23 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

From: Derrick Stolee <dstolee@microsoft.com>

Commit e3696980 (diff: halt tree-diff early after max_changes,
2020-03-30) intended to create a mechanism to short-circuit a diff
calculation after a certain number of paths were modified. By
incrementing a "num_changes" counter throughout the recursive
ll_diff_tree_paths(), this was supposed to match the number of changes
that would be written into the changed-path Bloom filters.
Unfortunately, this was not implemented correctly and instead misses
simple cases like file modifications. This then does not stop very
large changed-path filters from being written (unless they add or remove
many files).

To start, change the implementation in ll_diff_tree_paths() to instead
use the global diff_queue_diff struct's 'nr' member as the count. This
is a way to simplify the logic instead of making more mistakes in the
complicated diff code.

This has a drawback: the diff_queue_diff struct only lists the paths
corresponding to blob changes, not their leading directories. Thus,
get_or_compute_bloom_filter() needs an additional check to see if the
hashmap with the leading directories becomes too large.

One reason why this was not caught by test cases was that the test in
t4216-log-bloom.sh that was supposed to check this "too many changes"
condition only checked this on the initial commit of a repository. The
old logic counted these values correctly. Update this test in a few
ways:

1. Use GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS to reduce the limit,
   allowing smaller commits to engage with this logic.

2. Create several interesting cases of edits, adds, removes, and mode
   changes (in the second commit). By testing both sides of the
   inequality with the *_MAX_CHANGED_PATHS variable, we can see that
   the count is exactly correct, so none of these changes are missed
   or over-counted.

3. Use the trace2 data value filter_found_large to verify that these
   commits are on the correct side of the limit.

Another way to verify the behavior is correct is through performance
tests. By testing on my local copies of the Git repository and the Linux
kernel repository, I could measure the effect of these short-circuits
when computing a fresh commit-graph file with changed-path Bloom filters
using the command

  GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=N time \
    git commit-graph write --reachable --changed-paths

and reporting the wall time and resulting commit-graph size.

For Git, the results are

|        |      N=1       |       N=10     |      N=512     |
|--------|----------------|----------------|----------------|
| HEAD~1 | 10.90s  9.18MB | 11.11s  9.34MB | 11.31s  9.35MB |
| HEAD   |  9.21s  8.62MB | 11.11s  9.29MB | 11.29s  9.34MB |

For Linux, the results are

|        |       N=1      |     N=20      |     N=512     |
|--------|----------------|---------------|---------------|
| HEAD~1 | 61.28s  64.3MB | 76.9s  72.6MB | 77.6s  72.6MB |
| HEAD   | 49.44s  56.3MB | 68.7s  65.9MB | 69.2s  65.9MB |

Naturally, the improvement becomes much less as the limit grows, as
fewer commits satisfy the short-circuit.

Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bloom.c              |  9 ++++-
 diff.h               |  2 -
 t/t4216-log-bloom.sh | 88 +++++++++++++++++++++++++++++++++++++++-----
 tree-diff.c          |  5 +--
 4 files changed, 88 insertions(+), 16 deletions(-)

diff --git a/bloom.c b/bloom.c
index 2d6aef9098..db9fb82437 100644
--- a/bloom.c
+++ b/bloom.c
@@ -222,7 +222,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
 	diffcore_std(&diffopt);
 
-	if (diffopt.num_changes <= settings->max_changed_paths) {
+	if (diff_queued_diff.nr <= settings->max_changed_paths) {
 		struct hashmap pathmap;
 		struct pathmap_hash_entry *e;
 		struct hashmap_iter iter;
@@ -259,6 +259,12 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 			diff_free_filepair(diff_queued_diff.queue[i]);
 		}
 
+		if (hashmap_get_size(&pathmap) > settings->max_changed_paths) {
+			if (computed)
+				*computed |= BLOOM_TRUNC_LARGE;
+			goto cleanup;
+		}
+
 		filter->len = (hashmap_get_size(&pathmap) * settings->bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
 		filter->data = xcalloc(filter->len, sizeof(unsigned char));
 
@@ -268,6 +274,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 			add_key_to_filter(&key, filter, settings);
 		}
 
+	cleanup:
 		hashmap_free_entries(&pathmap, struct pathmap_hash_entry, entry);
 	} else {
 		for (i = 0; i < diff_queued_diff.nr; i++)
diff --git a/diff.h b/diff.h
index e0c0af6286..1d32b71885 100644
--- a/diff.h
+++ b/diff.h
@@ -287,8 +287,6 @@ struct diff_options {
 
 	/* If non-zero, then stop computing after this many changes. */
 	int max_changes;
-	/* For internal use only. */
-	int num_changes;
 
 	int ita_invisible_in_index;
 /* white-space error highlighting */
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 47ddf2641f..f375e752cd 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -182,20 +182,90 @@ test_expect_success 'persist filter settings' '
 '
 
 test_expect_success 'correctly report changes over limit' '
-	git init 513changes &&
+	git init limits &&
 	(
-		cd 513changes &&
-		for i in $(test_seq 1 513)
+		cd limits &&
+		mkdir d &&
+		mkdir d/e &&
+
+		for i in $(test_seq 1 2)
 		do
-			echo $i >file$i.txt || return 1
+			printf $i >d/file$i.txt &&
+			printf $i >d/e/file$i.txt || return 1
 		done &&
-		git add . &&
+
+		mkdir mode &&
+		printf bash >mode/script.sh &&
+
+		mkdir foo &&
+		touch foo/bar &&
+		touch foo.txt &&
+
+		git add d foo foo.txt mode &&
 		git commit -m "files" &&
-		git commit-graph write --reachable --changed-paths &&
-		for i in $(test_seq 1 513)
+
+		# Commit has 7 file and 4 directory adds
+		GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=10 \
+			GIT_TRACE2_EVENT="$(pwd)/trace" \
+			git commit-graph write --reachable --changed-paths &&
+		grep "\"max_changed_paths\":10" trace &&
+		grep "\"filter_computed\":1" trace &&
+		grep "\"filter_trunc_large\":1" trace &&
+
+		for path in $(git ls-tree -r --name-only HEAD)
 		do
-			git -c core.commitGraph=false log -- file$i.txt >expect &&
-			git log -- file$i.txt >actual &&
+			git -c commitGraph.readChangedPaths=false log \
+				-- $path >expect &&
+			git log -- $path >actual &&
+			test_cmp expect actual || return 1
+		done &&
+
+		# Make a variety of path changes
+		printf new1 >d/e/file1.txt &&
+		printf new2 >d/file2.txt &&
+		rm d/e/file2.txt &&
+		rm -r foo &&
+		printf text >foo &&
+		mkdir f &&
+		printf new1 >f/file1.txt &&
+
+		# including a mode-only change (counts as modified)
+		git update-index --chmod=+x mode/script.sh &&
+
+		git add foo d f &&
+		git commit -m "complicated" &&
+
+		# start from scratch and rebuild
+		rm -f .git/objects/info/commit-graph &&
+		GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=10 \
+			GIT_TRACE2_EVENT="$(pwd)/trace-edit" \
+			git commit-graph write --reachable --changed-paths &&
+		grep "\"max_changed_paths\":10" trace-edit &&
+		grep "\"filter_computed\":2" trace-edit &&
+		grep "\"filter_trunc_large\":2" trace-edit &&
+
+		for path in $(git ls-tree -r --name-only HEAD)
+		do
+			git -c commitGraph.readChangedPaths=false log \
+				-- $path >expect &&
+			git log -- $path >actual &&
+			test_cmp expect actual || return 1
+		done &&
+
+		# start from scratch and rebuild
+		rm -f .git/objects/info/commit-graph &&
+		GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=11 \
+			GIT_TRACE2_EVENT="$(pwd)/trace-update" \
+			git commit-graph write --reachable --changed-paths &&
+		grep "\"max_changed_paths\":11" trace-update &&
+		grep "\"filter_computed\":2" trace-update &&
+		grep "\"filter_trunc_large\":0" trace-update &&
+
+		for path in $(git ls-tree -r --name-only HEAD)
+		do
+			git -c commitGraph.readChangedPaths=false log \
+				-- $path >expect &&
+			git log -- $path >actual &&
 			test_cmp expect actual || return 1
 		done
 	)
diff --git a/tree-diff.c b/tree-diff.c
index 6ebad1a46f..7cebbb327e 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -434,7 +434,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
 		if (diff_can_quit_early(opt))
 			break;
 
-		if (opt->max_changes && opt->num_changes > opt->max_changes)
+		if (opt->max_changes && diff_queued_diff.nr > opt->max_changes)
 			break;
 
 		if (opt->pathspec.nr) {
@@ -521,7 +521,6 @@ static struct combine_diff_path *ll_diff_tree_paths(
 
 			/* t↓ */
 			update_tree_entry(&t);
-			opt->num_changes++;
 		}
 
 		/* t > p[imin] */
@@ -539,7 +538,6 @@ static struct combine_diff_path *ll_diff_tree_paths(
 		skip_emit_tp:
 			/* ∀ pi=p[imin]  pi↓ */
 			update_tp_entries(tp, nparent);
-			opt->num_changes++;
 		}
 	}
 
@@ -557,7 +555,6 @@ struct combine_diff_path *diff_tree_paths(
 	const struct object_id **parents_oid, int nparent,
 	struct strbuf *base, struct diff_options *opt)
 {
-	opt->num_changes = 0;
 	p = ll_diff_tree_paths(p, oid, parents_oid, nparent, base, opt);
 
 	/*
-- 
2.28.0.462.g4ff11cec37


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH 10/12] bloom: encode out-of-bounds filters as non-empty
  2020-09-09 15:22 [PATCH 00/12] more miscellaneous Bloom filter improvements, redux Taylor Blau
                   ` (8 preceding siblings ...)
  2020-09-09 15:23 ` [PATCH 09/12] bloom/diff: properly short-circuit on max_changes Taylor Blau
@ 2020-09-09 15:23 ` Taylor Blau
  2020-09-10  3:35   ` Taylor Blau
  2020-09-09 15:23 ` [PATCH 11/12] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 75+ messages in thread
From: Taylor Blau @ 2020-09-09 15:23 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

When a changed-path Bloom filter has either zero, or more than a
certain number (commonly 512) of entries, the commit-graph machinery
encodes it as "missing". More specifically, it sets the indices adjacent
in the BIDX chunk as equal to each other to indicate a "length 0"
filter; that is, that the filter occupies zero bytes on disk.

This has heretofore been fine, since the commit-graph machinery has no
need to care about these filters with too few or too many changed paths.
Both cases act like no filter has been generated at all, and so there is
no need to store them.

In a subsequent commit, however, the commit-graph machinery will learn
to only compute Bloom filters for some commits in the current
commit-graph layer. This is a change from the current implementation
which computes Bloom filters for all commits that are in the layer being
written. Critically for this patch, only computing some of the Bloom
filters means adding a third state for length 0 Bloom filters: zero
entries, too many entries, or "hasn't been computed".

It will be important for that future patch to distinguish between "not
representable" (i.e., zero or too-many changed paths), and "hasn't been
computed". In particular, we don't want to waste time recomputing
filters that have already been computed.

To that end, change how we store Bloom filters in the "computed but not
representable" category:

  - Bloom filters with no entries are stored as a single byte with all
    bits low (i.e., all queries to that Bloom filter will return
    "definitely not")

  - Bloom filters with too many entries are stored as a single byte with
    all bits set high (i.e., all queries to that Bloom filter will
    return "maybe").

These rules are sufficient to not incur a behavior change by changing
the on-disk representation of these two classes. Likewise, no
specification changes are necessary for the commit-graph format, either:

  - Filters that were previously empty will be recomputed and stored
    according to the new rules, and

  - old clients reading filters generated by new clients will interpret
    the filters correctly and be none the wiser to how they were
    generated.

Clients will invoke the Bloom machinery in more cases than before, but
this can be addressed by returning a NULL filter when all bits are set
high. This can be addressed in a future patch.

Finally, note that this does increase the size of on-disk commit-graphs,
but far less than other proposals. In particular, this is generally more
efficient than storing a bitmap for which commits haven't computed their
Bloom filters. Storing a bitmap incurs a penalty of one bit per commit,
whereas storing explicit filters as above incurs a penalty of one byte
per too-large or too-small commit.

In practice, these boundary commits likely occupy a small proportion of
the overall number of commits, and so the size penalty is likely smaller
than storing a bitmap for all commits.

A test to exercise filters which contain too many changed path entries
will be introduced in a subsequent patch.

Suggested-by: SZEDER Gábor <szeder.dev@gmail.com>
Suggested-by: Jakub Narębski <jnareb@gmail.com>
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 .../technical/commit-graph-format.txt         |  2 +-
 bloom.c                                       | 10 +++++++--
 bloom.h                                       |  1 +
 commit-graph.c                                |  5 +++++
 t/t0095-bloom.sh                              |  4 ++--
 t/t4216-log-bloom.sh                          | 21 +++++++++++++++++--
 6 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index 6ddbceba15..6585f1948a 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -125,7 +125,7 @@ CHUNK DATA:
     * The rest of the chunk is the concatenation of all the computed Bloom
       filters for the commits in lexicographic order.
     * Note: Commits with no changes or more than 512 changes have Bloom filters
-      of length zero.
+      of length one, with either all bits set to zero or one respectively.
     * The BDAT chunk is present if and only if BIDX is present.
 
   Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
diff --git a/bloom.c b/bloom.c
index db9fb82437..194b6ab8ad 100644
--- a/bloom.c
+++ b/bloom.c
@@ -266,6 +266,11 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 		}
 
 		filter->len = (hashmap_get_size(&pathmap) * settings->bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
+		if (!filter->len) {
+			if (computed)
+				*computed |= BLOOM_TRUNC_SMALL;
+			filter->len = 1;
+		}
 		filter->data = xcalloc(filter->len, sizeof(unsigned char));
 
 		hashmap_for_each_entry(&pathmap, &iter, e, entry) {
@@ -279,8 +284,9 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 	} else {
 		for (i = 0; i < diff_queued_diff.nr; i++)
 			diff_free_filepair(diff_queued_diff.queue[i]);
-		filter->data = NULL;
-		filter->len = 0;
+		filter->data = xmalloc(1);
+		filter->data[0] = 0xFF;
+		filter->len = 1;
 
 		if (computed)
 			*computed |= BLOOM_TRUNC_LARGE;
diff --git a/bloom.h b/bloom.h
index c6d77e8393..70a8840896 100644
--- a/bloom.h
+++ b/bloom.h
@@ -93,6 +93,7 @@ enum bloom_filter_computed {
 	BLOOM_NOT_COMPUTED = (1 << 0),
 	BLOOM_COMPUTED     = (1 << 1),
 	BLOOM_TRUNC_LARGE  = (1 << 2),
+	BLOOM_TRUNC_SMALL  = (1 << 3),
 };
 
 struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
diff --git a/commit-graph.c b/commit-graph.c
index 500f29525a..d402743e61 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -974,6 +974,7 @@ struct write_commit_graph_context {
 
 	int count_bloom_filter_computed;
 	int count_bloom_filter_not_computed;
+	int count_bloom_filter_trunc_small;
 	int count_bloom_filter_trunc_large;
 };
 
@@ -1405,6 +1406,8 @@ static void trace2_bloom_filter_write_statistics(struct write_commit_graph_conte
 			 ctx->count_bloom_filter_computed);
 	jw_object_intmax(&jw, "filter_not_computed",
 			 ctx->count_bloom_filter_not_computed);
+	jw_object_intmax(&jw, "filter_trunc_small",
+			 ctx->count_bloom_filter_trunc_small);
 	jw_object_intmax(&jw, "filter_trunc_large",
 			 ctx->count_bloom_filter_trunc_large);
 	jw_end(&jw);
@@ -1446,6 +1449,8 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 			&computed);
 		if (computed & BLOOM_COMPUTED) {
 			ctx->count_bloom_filter_computed++;
+			if (computed & BLOOM_TRUNC_SMALL)
+				ctx->count_bloom_filter_trunc_small++;
 			if (computed & BLOOM_TRUNC_LARGE)
 				ctx->count_bloom_filter_trunc_large++;
 		} else if (computed & BLOOM_NOT_COMPUTED)
diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
index 232ba2c485..4d0e512bcb 100755
--- a/t/t0095-bloom.sh
+++ b/t/t0095-bloom.sh
@@ -71,8 +71,8 @@ test_expect_success 'get bloom filters for commit with no changes' '
 	git init &&
 	git commit --allow-empty -m "c0" &&
 	cat >expect <<-\EOF &&
-	Filter_Length:0
-	Filter_Data:
+	Filter_Length:1
+	Filter_Data:00|
 	EOF
 	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
 	test_cmp expect actual
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index f375e752cd..a56327ffd4 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -30,6 +30,7 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
 	rm file_to_be_deleted &&
 	git add . &&
 	git commit -m "file removed" &&
+	git commit --allow-empty -m "empty" &&
 	git commit-graph write --reachable --changed-paths &&
 
 	test_oid_cache <<-EOF
@@ -49,7 +50,7 @@ graph_read_expect () {
 }
 
 test_expect_success 'commit-graph write wrote out the bloom chunks' '
-	graph_read_expect 15
+	graph_read_expect 16
 '
 
 # Turn off any inherited trace2 settings for this test.
@@ -156,7 +157,7 @@ test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
 
 test_bloom_filters_used_when_some_filters_are_missing () {
 	log_args=$1
-	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"maybe\":6,\"definitely_not\":8"
+	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"maybe\":6,\"definitely_not\":9"
 	setup "$log_args" &&
 	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
 	test_cmp log_wo_bloom log_w_bloom
@@ -271,4 +272,20 @@ test_expect_success 'correctly report changes over limit' '
 	)
 '
 
+test_expect_success 'correctly report commits with no changed paths' '
+	git init small &&
+	test_when_finished "rm -fr small" &&
+	(
+		cd small &&
+
+		git commit --allow-empty -m "initial commit" &&
+
+		GIT_TRACE2_EVENT="$(pwd)/trace" \
+			git commit-graph write --reachable --changed-paths &&
+		grep "\"filter_computed\":1" trace &&
+		grep "\"filter_trunc_small\":1" trace &&
+		grep "\"filter_trunc_large\":0" trace
+	)
+'
+
 test_done
-- 
2.28.0.462.g4ff11cec37


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH 11/12] commit-graph: rename 'split_commit_graph_opts'
  2020-09-09 15:22 [PATCH 00/12] more miscellaneous Bloom filter improvements, redux Taylor Blau
                   ` (9 preceding siblings ...)
  2020-09-09 15:23 ` [PATCH 10/12] bloom: encode out-of-bounds filters as non-empty Taylor Blau
@ 2020-09-09 15:23 ` Taylor Blau
  2020-09-09 15:24 ` [PATCH 12/12] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-09 15:23 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

In the subsequent commit, additional options will be added to the
commit-graph API which have nothing to do with splitting.

Rename the 'split_commit_graph_opts' structure to the more-generic
'commit_graph_opts' to encompass both. Likewise, rename the 'flags'
member to instead be 'split_flags' to clarify that it only has to do
with the behavior implied by '--split'.

Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/commit-graph.c | 20 ++++++++++----------
 commit-graph.c         | 40 ++++++++++++++++++++--------------------
 commit-graph.h         |  8 ++++----
 3 files changed, 34 insertions(+), 34 deletions(-)

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index ba5584463f..f3243bd982 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -119,7 +119,7 @@ static int graph_verify(int argc, const char **argv)
 }
 
 extern int read_replace_refs;
-static struct split_commit_graph_opts split_opts;
+static struct commit_graph_opts write_opts;
 
 static int write_option_parse_split(const struct option *opt, const char *arg,
 				    int unset)
@@ -187,24 +187,24 @@ static int graph_write(int argc, const char **argv)
 		OPT_BOOL(0, "changed-paths", &opts.enable_changed_paths,
 			N_("enable computation for changed paths")),
 		OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")),
-		OPT_CALLBACK_F(0, "split", &split_opts.flags, NULL,
+		OPT_CALLBACK_F(0, "split", &write_opts.split_flags, NULL,
 			N_("allow writing an incremental commit-graph file"),
 			PARSE_OPT_OPTARG | PARSE_OPT_NONEG,
 			write_option_parse_split),
-		OPT_INTEGER(0, "max-commits", &split_opts.max_commits,
+		OPT_INTEGER(0, "max-commits", &write_opts.max_commits,
 			N_("maximum number of commits in a non-base split commit-graph")),
-		OPT_INTEGER(0, "size-multiple", &split_opts.size_multiple,
+		OPT_INTEGER(0, "size-multiple", &write_opts.size_multiple,
 			N_("maximum ratio between two levels of a split commit-graph")),
-		OPT_EXPIRY_DATE(0, "expire-time", &split_opts.expire_time,
+		OPT_EXPIRY_DATE(0, "expire-time", &write_opts.expire_time,
 			N_("only expire files older than a given date-time")),
 		OPT_END(),
 	};
 
 	opts.progress = isatty(2);
 	opts.enable_changed_paths = -1;
-	split_opts.size_multiple = 2;
-	split_opts.max_commits = 0;
-	split_opts.expire_time = 0;
+	write_opts.size_multiple = 2;
+	write_opts.max_commits = 0;
+	write_opts.expire_time = 0;
 
 	trace2_cmd_mode("write");
 
@@ -232,7 +232,7 @@ static int graph_write(int argc, const char **argv)
 	odb = find_odb(the_repository, opts.obj_dir);
 
 	if (opts.reachable) {
-		if (write_commit_graph_reachable(odb, flags, &split_opts))
+		if (write_commit_graph_reachable(odb, flags, &write_opts))
 			return 1;
 		return 0;
 	}
@@ -261,7 +261,7 @@ static int graph_write(int argc, const char **argv)
 			       opts.stdin_packs ? &pack_indexes : NULL,
 			       opts.stdin_commits ? &commits : NULL,
 			       flags,
-			       &split_opts))
+			       &write_opts))
 		result = 1;
 
 cleanup:
diff --git a/commit-graph.c b/commit-graph.c
index d402743e61..dcc27b74e3 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -968,7 +968,7 @@ struct write_commit_graph_context {
 		 changed_paths:1,
 		 order_by_pack:1;
 
-	const struct split_commit_graph_opts *split_opts;
+	const struct commit_graph_opts *opts;
 	size_t total_bloom_filter_data_size;
 	const struct bloom_filter_settings *bloom_settings;
 
@@ -1292,8 +1292,8 @@ static void close_reachable(struct write_commit_graph_context *ctx)
 {
 	int i;
 	struct commit *commit;
-	enum commit_graph_split_flags flags = ctx->split_opts ?
-		ctx->split_opts->flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
+	enum commit_graph_split_flags flags = ctx->opts ?
+		ctx->opts->split_flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
 
 	if (ctx->report_progress)
 		ctx->progress = start_delayed_progress(
@@ -1490,7 +1490,7 @@ static int add_ref_to_set(const char *refname,
 
 int write_commit_graph_reachable(struct object_directory *odb,
 				 enum commit_graph_write_flags flags,
-				 const struct split_commit_graph_opts *split_opts)
+				 const struct commit_graph_opts *opts)
 {
 	struct oidset commits = OIDSET_INIT;
 	struct refs_cb_data data;
@@ -1507,7 +1507,7 @@ int write_commit_graph_reachable(struct object_directory *odb,
 	stop_progress(&data.progress);
 
 	result = write_commit_graph(odb, NULL, &commits,
-				    flags, split_opts);
+				    flags, opts);
 
 	oidset_clear(&commits);
 	return result;
@@ -1622,8 +1622,8 @@ static uint32_t count_distinct_commits(struct write_commit_graph_context *ctx)
 static void copy_oids_to_commits(struct write_commit_graph_context *ctx)
 {
 	uint32_t i;
-	enum commit_graph_split_flags flags = ctx->split_opts ?
-		ctx->split_opts->flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
+	enum commit_graph_split_flags flags = ctx->opts ?
+		ctx->opts->split_flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
 
 	ctx->num_extra_edges = 0;
 	if (ctx->report_progress)
@@ -1908,13 +1908,13 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
 	int max_commits = 0;
 	int size_mult = 2;
 
-	if (ctx->split_opts) {
-		max_commits = ctx->split_opts->max_commits;
+	if (ctx->opts) {
+		max_commits = ctx->opts->max_commits;
 
-		if (ctx->split_opts->size_multiple)
-			size_mult = ctx->split_opts->size_multiple;
+		if (ctx->opts->size_multiple)
+			size_mult = ctx->opts->size_multiple;
 
-		flags = ctx->split_opts->flags;
+		flags = ctx->opts->split_flags;
 	}
 
 	g = ctx->r->objects->commit_graph;
@@ -2092,8 +2092,8 @@ static void expire_commit_graphs(struct write_commit_graph_context *ctx)
 	size_t dirnamelen;
 	timestamp_t expire_time = time(NULL);
 
-	if (ctx->split_opts && ctx->split_opts->expire_time)
-		expire_time = ctx->split_opts->expire_time;
+	if (ctx->opts && ctx->opts->expire_time)
+		expire_time = ctx->opts->expire_time;
 	if (!ctx->split) {
 		char *chain_file_name = get_chain_filename(ctx->odb);
 		unlink(chain_file_name);
@@ -2144,7 +2144,7 @@ int write_commit_graph(struct object_directory *odb,
 		       struct string_list *pack_indexes,
 		       struct oidset *commits,
 		       enum commit_graph_write_flags flags,
-		       const struct split_commit_graph_opts *split_opts)
+		       const struct commit_graph_opts *opts)
 {
 	struct write_commit_graph_context *ctx;
 	uint32_t i, count_distinct = 0;
@@ -2161,7 +2161,7 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->append = flags & COMMIT_GRAPH_WRITE_APPEND ? 1 : 0;
 	ctx->report_progress = flags & COMMIT_GRAPH_WRITE_PROGRESS ? 1 : 0;
 	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
-	ctx->split_opts = split_opts;
+	ctx->opts = opts;
 	ctx->total_bloom_filter_data_size = 0;
 
 	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
@@ -2209,15 +2209,15 @@ int write_commit_graph(struct object_directory *odb,
 			}
 		}
 
-		if (ctx->split_opts)
-			replace = ctx->split_opts->flags & COMMIT_GRAPH_SPLIT_REPLACE;
+		if (ctx->opts)
+			replace = ctx->opts->split_flags & COMMIT_GRAPH_SPLIT_REPLACE;
 	}
 
 	ctx->approx_nr_objects = approximate_object_count();
 	ctx->oids.alloc = ctx->approx_nr_objects / 32;
 
-	if (ctx->split && split_opts && ctx->oids.alloc > split_opts->max_commits)
-		ctx->oids.alloc = split_opts->max_commits;
+	if (ctx->split && opts && ctx->oids.alloc > opts->max_commits)
+		ctx->oids.alloc = opts->max_commits;
 
 	if (ctx->append) {
 		prepare_commit_graph_one(ctx->r, ctx->odb);
diff --git a/commit-graph.h b/commit-graph.h
index d9acb22bac..b7914b0a7a 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -105,11 +105,11 @@ enum commit_graph_split_flags {
 	COMMIT_GRAPH_SPLIT_REPLACE          = 2
 };
 
-struct split_commit_graph_opts {
+struct commit_graph_opts {
 	int size_multiple;
 	int max_commits;
 	timestamp_t expire_time;
-	enum commit_graph_split_flags flags;
+	enum commit_graph_split_flags split_flags;
 };
 
 /*
@@ -120,12 +120,12 @@ struct split_commit_graph_opts {
  */
 int write_commit_graph_reachable(struct object_directory *odb,
 				 enum commit_graph_write_flags flags,
-				 const struct split_commit_graph_opts *split_opts);
+				 const struct commit_graph_opts *opts);
 int write_commit_graph(struct object_directory *odb,
 		       struct string_list *pack_indexes,
 		       struct oidset *commits,
 		       enum commit_graph_write_flags flags,
-		       const struct split_commit_graph_opts *split_opts);
+		       const struct commit_graph_opts *opts);
 
 #define COMMIT_GRAPH_VERIFY_SHALLOW	(1 << 0)
 
-- 
2.28.0.462.g4ff11cec37


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH 12/12] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-09-09 15:22 [PATCH 00/12] more miscellaneous Bloom filter improvements, redux Taylor Blau
                   ` (10 preceding siblings ...)
  2020-09-09 15:23 ` [PATCH 11/12] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
@ 2020-09-09 15:24 ` Taylor Blau
  2020-09-11 17:52   ` Jeff King
  2020-09-16 18:06 ` [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux Taylor Blau
  2020-09-18  2:58 ` [PATCH v3 " Taylor Blau
  13 siblings, 1 reply; 75+ messages in thread
From: Taylor Blau @ 2020-09-09 15:24 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

Introduce a command-line flag and configuration variable to specify the
maximum number of new Bloom filters that a 'git commit-graph write' is
willing to compute from scratch.

Prior to this patch, a commit-graph write with '--changed-paths' would
compute Bloom filters for all selected commits which haven't already
been computed (i.e., by a previous commit-graph write with '--split'
such that a roll-up or replacement is performed).

This behavior can cause prohibitively-long commit-graph writes for a
variety of reasons:

  * There may be lots of filters whose diffs take a long time to
    generate (for example, they have close to the maximum number of
    changes, diffing itself takes a long time, etc).

  * Old-style commit-graphs (which encode filters with too many entries
    as not having been computed at all) cause us to waste time
    recomputing filters that appear to have not been computed only to
    discover that they are too-large.

This can make the upper-bound of the time it takes for 'git commit-graph
write --changed-paths' to be rather unpredictable.

To make this command behave more predictably, introduce
'--max-new-filters=<n>' to allow computing at most '<n>' Bloom filters
from scratch. This lets "computing" already-known filters proceed
quickly, while bounding the number of slow tasks that Git is willing to
do.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/commitgraph.txt |  4 +++
 Documentation/git-commit-graph.txt   |  6 ++++
 bloom.c                              |  7 ++--
 builtin/commit-graph.c               | 39 ++++++++++++++++++--
 commit-graph.c                       |  9 +++--
 commit-graph.h                       |  1 +
 t/t4216-log-bloom.sh                 | 53 ++++++++++++++++++++++++++++
 7 files changed, 110 insertions(+), 9 deletions(-)

diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
index cff0797b54..4582c39fc4 100644
--- a/Documentation/config/commitgraph.txt
+++ b/Documentation/config/commitgraph.txt
@@ -1,3 +1,7 @@
+commitGraph.maxNewFilters::
+	Specifies the default value for the `--max-new-filters` option of `git
+	commit-graph write` (c.f., linkgit:git-commit-graph[1]).
+
 commitGraph.readChangedPaths::
 	If true, then git will use the changed-path Bloom filters in the
 	commit-graph file (if it exists, and they are present). Defaults to
diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 17405c73a9..81a2e65903 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -67,6 +67,12 @@ this option is given, future commit-graph writes will automatically assume
 that this option was intended. Use `--no-changed-paths` to stop storing this
 data.
 +
+With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
+filters (if `--changed-paths` is specified). If `n` is `-1`, no limit is
+enforced. Commits whose filters are not calculated are stored as a
+length zero Bloom filter, and their bit is marked in the `BFXL` chunk.
+Overrides the `commitGraph.maxNewFilters` configuration.
++
 With the `--split[=<strategy>]` option, write the commit-graph as a
 chain of multiple commit-graph files stored in
 `<dir>/info/commit-graphs`. Commit-graph layers are merged based on the
diff --git a/bloom.c b/bloom.c
index 194b6ab8ad..022dd6e0f9 100644
--- a/bloom.c
+++ b/bloom.c
@@ -197,12 +197,11 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 
 	if (!filter->data) {
 		load_commit_graph_info(r, c);
-		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH &&
-			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
-				return filter;
+		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH)
+			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
 	}
 
-	if (filter->data)
+	if (filter->data && filter->len)
 		return filter;
 	if (!compute_if_not_present)
 		return NULL;
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index f3243bd982..e7a1539b08 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -13,7 +13,8 @@ static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph verify [--object-dir <objdir>] [--shallow] [--[no-]progress]"),
 	N_("git commit-graph write [--object-dir <objdir>] [--append] "
 	   "[--split[=<strategy>]] [--reachable|--stdin-packs|--stdin-commits] "
-	   "[--changed-paths] [--[no-]progress] <split options>"),
+	   "[--changed-paths] [--[no-]max-new-filters <n>] [--[no-]progress] "
+	   "<split options>"),
 	NULL
 };
 
@@ -25,7 +26,8 @@ static const char * const builtin_commit_graph_verify_usage[] = {
 static const char * const builtin_commit_graph_write_usage[] = {
 	N_("git commit-graph write [--object-dir <objdir>] [--append] "
 	   "[--split[=<strategy>]] [--reachable|--stdin-packs|--stdin-commits] "
-	   "[--changed-paths] [--[no-]progress] <split options>"),
+	   "[--changed-paths] [--[no-]max-new-filters <n>] [--[no-]progress] "
+	   "<split options>"),
 	NULL
 };
 
@@ -162,6 +164,23 @@ static int read_one_commit(struct oidset *commits, struct progress *progress,
 	return 0;
 }
 
+static int write_option_max_new_filters(const struct option *opt,
+					const char *arg,
+					int unset)
+{
+	int *to = opt->value;
+	if (unset)
+		*to = -1;
+	else {
+		const char *s;
+		*to = strtol(arg, (char **)&s, 10);
+		if (*s)
+			return error(_("%s expects a numerical value"),
+				     optname(opt, opt->flags));
+	}
+	return 0;
+}
+
 static int graph_write(int argc, const char **argv)
 {
 	struct string_list pack_indexes = STRING_LIST_INIT_NODUP;
@@ -197,6 +216,9 @@ static int graph_write(int argc, const char **argv)
 			N_("maximum ratio between two levels of a split commit-graph")),
 		OPT_EXPIRY_DATE(0, "expire-time", &write_opts.expire_time,
 			N_("only expire files older than a given date-time")),
+		OPT_CALLBACK_F(0, "max-new-filters", &write_opts.max_new_filters,
+			NULL, N_("maximum number of changed-path Bloom filters to compute"),
+			0, write_option_max_new_filters),
 		OPT_END(),
 	};
 
@@ -205,6 +227,7 @@ static int graph_write(int argc, const char **argv)
 	write_opts.size_multiple = 2;
 	write_opts.max_commits = 0;
 	write_opts.expire_time = 0;
+	write_opts.max_new_filters = -1;
 
 	trace2_cmd_mode("write");
 
@@ -270,6 +293,16 @@ static int graph_write(int argc, const char **argv)
 	return result;
 }
 
+static int git_commit_graph_config(const char *var, const char *value, void *cb)
+{
+	if (!strcmp(var, "commitgraph.maxnewfilters")) {
+		write_opts.max_new_filters = git_config_int(var, value);
+		return 0;
+	}
+
+	return git_default_config(var, value, cb);
+}
+
 int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 {
 	static struct option builtin_commit_graph_options[] = {
@@ -283,7 +316,7 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 		usage_with_options(builtin_commit_graph_usage,
 				   builtin_commit_graph_options);
 
-	git_config(git_default_config, NULL);
+	git_config(git_commit_graph_config, &opts);
 	argc = parse_options(argc, argv, prefix,
 			     builtin_commit_graph_options,
 			     builtin_commit_graph_usage,
diff --git a/commit-graph.c b/commit-graph.c
index dcc27b74e3..1d9f8cc7e9 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1422,6 +1422,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 	int i;
 	struct progress *progress = NULL;
 	struct commit **sorted_commits;
+	int max_new_filters;
 
 	init_bloom_filters();
 
@@ -1438,13 +1439,16 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 	else
 		QSORT(sorted_commits, ctx->commits.nr, commit_gen_cmp);
 
+	max_new_filters = ctx->opts && ctx->opts->max_new_filters >= 0 ?
+		ctx->opts->max_new_filters : ctx->commits.nr;
+
 	for (i = 0; i < ctx->commits.nr; i++) {
 		enum bloom_filter_computed computed = 0;
 		struct commit *c = sorted_commits[i];
 		struct bloom_filter *filter = get_or_compute_bloom_filter(
 			ctx->r,
 			c,
-			1,
+			ctx->count_bloom_filter_computed < max_new_filters,
 			ctx->bloom_settings,
 			&computed);
 		if (computed & BLOOM_COMPUTED) {
@@ -1455,7 +1459,8 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 				ctx->count_bloom_filter_trunc_large++;
 		} else if (computed & BLOOM_NOT_COMPUTED)
 			ctx->count_bloom_filter_not_computed++;
-		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
+		ctx->total_bloom_filter_data_size += filter
+			? sizeof(unsigned char) * filter->len : 0;
 		display_progress(progress, i + 1);
 	}
 
diff --git a/commit-graph.h b/commit-graph.h
index b7914b0a7a..a22bd86701 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -110,6 +110,7 @@ struct commit_graph_opts {
 	int max_commits;
 	timestamp_t expire_time;
 	enum commit_graph_split_flags split_flags;
+	int max_new_filters;
 };
 
 /*
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index a56327ffd4..24deb8104a 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -287,5 +287,58 @@ test_expect_success 'correctly report commits with no changed paths' '
 		grep "\"filter_trunc_large\":0" trace
 	)
 '
+test_bloom_filters_computed () {
+	commit_graph_args=$1
+	rm -f "$TRASH_DIRECTORY/trace.event" &&
+	GIT_TRACE2_EVENT="$TRASH_DIRECTORY/trace.event" git commit-graph write \
+		$commit_graph_args &&
+	grep "\"filter_not_computed\":$2" "$TRASH_DIRECTORY/trace.event" &&
+	grep "\"filter_trunc_large\":$3" "$TRASH_DIRECTORY/trace.event" &&
+	grep "\"filter_computed\":$4" "$TRASH_DIRECTORY/trace.event"
+}
+
+test_expect_success 'Bloom generation is limited by --max-new-filters' '
+	(
+		cd limits &&
+		test_commit c2 filter &&
+		test_commit c3 filter &&
+		test_commit c4 no-filter &&
+		test_bloom_filters_computed "--reachable --changed-paths --split=replace --max-new-filters=2" \
+			3 0 2
+	)
+'
+
+test_expect_success 'Bloom generation backfills previously-skipped filters' '
+	(
+		cd limits &&
+		test_bloom_filters_computed "--reachable --changed-paths --split=replace --max-new-filters=1" \
+			4 0 1
+	)
+'
+
+test_expect_success 'Bloom generation backfills empty commits' '
+	git init empty &&
+	test_when_finished "rm -fr empty" &&
+	(
+		cd empty &&
+		for i in $(test_seq 1 6)
+		do
+			git commit --allow-empty -m "$i"
+		done &&
+
+		# Generate Bloom filters for empty commits 1-6, two at a time.
+		test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
+			4 0 2 &&
+		test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
+			4 0 2 &&
+		test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
+			4 0 2 &&
+
+		# Finally, make sure that once all commits have filters, that
+		# none are subsequently recomputed.
+		test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
+			6 0 0
+	)
+'
 
 test_done
-- 
2.28.0.462.g4ff11cec37

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* Re: [PATCH 10/12] bloom: encode out-of-bounds filters as non-empty
  2020-09-09 15:23 ` [PATCH 10/12] bloom: encode out-of-bounds filters as non-empty Taylor Blau
@ 2020-09-10  3:35   ` Taylor Blau
  2020-09-10 15:45     ` Taylor Blau
  0 siblings, 1 reply; 75+ messages in thread
From: Taylor Blau @ 2020-09-10  3:35 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

On Wed, Sep 09, 2020 at 11:23:48AM -0400, Taylor Blau wrote:
> -		filter->data = NULL;
> -		filter->len = 0;
> +		filter->data = xmalloc(1);
> +		filter->data[0] = 0xFF;
> +		filter->len = 1;
>
>  		if (computed)
>  			*computed |= BLOOM_TRUNC_LARGE;

Oops, I missed the case that added by the previous patch where the
number of diff entries is smaller than the limit, but the hashmap
entries (after directories are added and such) crosses the threshold.

Specifically, this patch doesn't write the 0xFF filter like it should.
I'll send a different version of this patch tomorrow.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 10/12] bloom: encode out-of-bounds filters as non-empty
  2020-09-10  3:35   ` Taylor Blau
@ 2020-09-10 15:45     ` Taylor Blau
  2020-09-11 18:15       ` Derrick Stolee
  0 siblings, 1 reply; 75+ messages in thread
From: Taylor Blau @ 2020-09-10 15:45 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

On Wed, Sep 09, 2020 at 11:35:57PM -0400, Taylor Blau wrote:
> On Wed, Sep 09, 2020 at 11:23:48AM -0400, Taylor Blau wrote:
> > -		filter->data = NULL;
> > -		filter->len = 0;
> > +		filter->data = xmalloc(1);
> > +		filter->data[0] = 0xFF;
> > +		filter->len = 1;
> >
> >  		if (computed)
> >  			*computed |= BLOOM_TRUNC_LARGE;
>
> Oops, I missed the case that added by the previous patch where the
> number of diff entries is smaller than the limit, but the hashmap
> entries (after directories are added and such) crosses the threshold.
>
> Specifically, this patch doesn't write the 0xFF filter like it should.
> I'll send a different version of this patch tomorrow.

This one should do the trick. Let's use it instead.

--- >8 ---

Subject: [PATCH] bloom: encode out-of-bounds filters as non-empty

When a changed-path Bloom filter has either zero, or more than a
certain number (commonly 512) of entries, the commit-graph machinery
encodes it as "missing". More specifically, it sets the indices adjacent
in the BIDX chunk as equal to each other to indicate a "length 0"
filter; that is, that the filter occupies zero bytes on disk.

This has heretofore been fine, since the commit-graph machinery has no
need to care about these filters with too few or too many changed paths.
Both cases act like no filter has been generated at all, and so there is
no need to store them.

In a subsequent commit, however, the commit-graph machinery will learn
to only compute Bloom filters for some commits in the current
commit-graph layer. This is a change from the current implementation
which computes Bloom filters for all commits that are in the layer being
written. Critically for this patch, only computing some of the Bloom
filters means adding a third state for length 0 Bloom filters: zero
entries, too many entries, or "hasn't been computed".

It will be important for that future patch to distinguish between "not
representable" (i.e., zero or too-many changed paths), and "hasn't been
computed". In particular, we don't want to waste time recomputing
filters that have already been computed.

To that end, change how we store Bloom filters in the "computed but not
representable" category:

  - Bloom filters with no entries are stored as a single byte with all
    bits low (i.e., all queries to that Bloom filter will return
    "definitely not")

  - Bloom filters with too many entries are stored as a single byte with
    all bits set high (i.e., all queries to that Bloom filter will
    return "maybe").

These rules are sufficient to not incur a behavior change by changing
the on-disk representation of these two classes. Likewise, no
specification changes are necessary for the commit-graph format, either:

  - Filters that were previously empty will be recomputed and stored
    according to the new rules, and

  - old clients reading filters generated by new clients will interpret
    the filters correctly and be none the wiser to how they were
    generated.

Clients will invoke the Bloom machinery in more cases than before, but
this can be addressed by returning a NULL filter when all bits are set
high. This can be addressed in a future patch.

Finally, note that this does increase the size of on-disk commit-graphs,
but far less than other proposals. In particular, this is generally more
efficient than storing a bitmap for which commits haven't computed their
Bloom filters. Storing a bitmap incurs a penalty of one bit per commit,
whereas storing explicit filters as above incurs a penalty of one byte
per too-large or too-small commit.

In practice, these boundary commits likely occupy a small proportion of
the overall number of commits, and so the size penalty is likely smaller
than storing a bitmap for all commits.

A test to exercise filters which contain too many changed path entries
will be introduced in a subsequent patch.

Suggested-by: SZEDER Gábor <szeder.dev@gmail.com>
Suggested-by: Jakub Narębski <jnareb@gmail.com>
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 .../technical/commit-graph-format.txt         |  2 +-
 bloom.c                                       | 16 ++++++++++++--
 bloom.h                                       |  1 +
 commit-graph.c                                |  5 +++++
 t/t0095-bloom.sh                              |  8 +++----
 t/t4216-log-bloom.sh                          | 21 +++++++++++++++++--
 6 files changed, 44 insertions(+), 9 deletions(-)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index 6ddbceba15..6585f1948a 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -125,7 +125,7 @@ CHUNK DATA:
     * The rest of the chunk is the concatenation of all the computed Bloom
       filters for the commits in lexicographic order.
     * Note: Commits with no changes or more than 512 changes have Bloom filters
-      of length zero.
+      of length one, with either all bits set to zero or one respectively.
     * The BDAT chunk is present if and only if BIDX is present.

   Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
diff --git a/bloom.c b/bloom.c
index db9fb82437..d24747a1d5 100644
--- a/bloom.c
+++ b/bloom.c
@@ -177,6 +177,13 @@ static int pathmap_cmp(const void *hashmap_cmp_fn_data,
 	return strcmp(e1->path, e2->path);
 }

+static void init_truncated_large_filter(struct bloom_filter *filter)
+{
+	filter->data = xmalloc(1);
+	filter->data[0] = 0xFF;
+	filter->len = 1;
+}
+
 struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 						 struct commit *c,
 						 int compute_if_not_present,
@@ -260,12 +267,18 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 		}

 		if (hashmap_get_size(&pathmap) > settings->max_changed_paths) {
+			init_truncated_large_filter(filter);
 			if (computed)
 				*computed |= BLOOM_TRUNC_LARGE;
 			goto cleanup;
 		}

 		filter->len = (hashmap_get_size(&pathmap) * settings->bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
+		if (!filter->len) {
+			if (computed)
+				*computed |= BLOOM_TRUNC_SMALL;
+			filter->len = 1;
+		}
 		filter->data = xcalloc(filter->len, sizeof(unsigned char));

 		hashmap_for_each_entry(&pathmap, &iter, e, entry) {
@@ -279,8 +292,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 	} else {
 		for (i = 0; i < diff_queued_diff.nr; i++)
 			diff_free_filepair(diff_queued_diff.queue[i]);
-		filter->data = NULL;
-		filter->len = 0;
+		init_truncated_large_filter(filter);

 		if (computed)
 			*computed |= BLOOM_TRUNC_LARGE;
diff --git a/bloom.h b/bloom.h
index c6d77e8393..70a8840896 100644
--- a/bloom.h
+++ b/bloom.h
@@ -93,6 +93,7 @@ enum bloom_filter_computed {
 	BLOOM_NOT_COMPUTED = (1 << 0),
 	BLOOM_COMPUTED     = (1 << 1),
 	BLOOM_TRUNC_LARGE  = (1 << 2),
+	BLOOM_TRUNC_SMALL  = (1 << 3),
 };

 struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
diff --git a/commit-graph.c b/commit-graph.c
index 500f29525a..d402743e61 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -974,6 +974,7 @@ struct write_commit_graph_context {

 	int count_bloom_filter_computed;
 	int count_bloom_filter_not_computed;
+	int count_bloom_filter_trunc_small;
 	int count_bloom_filter_trunc_large;
 };

@@ -1405,6 +1406,8 @@ static void trace2_bloom_filter_write_statistics(struct write_commit_graph_conte
 			 ctx->count_bloom_filter_computed);
 	jw_object_intmax(&jw, "filter_not_computed",
 			 ctx->count_bloom_filter_not_computed);
+	jw_object_intmax(&jw, "filter_trunc_small",
+			 ctx->count_bloom_filter_trunc_small);
 	jw_object_intmax(&jw, "filter_trunc_large",
 			 ctx->count_bloom_filter_trunc_large);
 	jw_end(&jw);
@@ -1446,6 +1449,8 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 			&computed);
 		if (computed & BLOOM_COMPUTED) {
 			ctx->count_bloom_filter_computed++;
+			if (computed & BLOOM_TRUNC_SMALL)
+				ctx->count_bloom_filter_trunc_small++;
 			if (computed & BLOOM_TRUNC_LARGE)
 				ctx->count_bloom_filter_trunc_large++;
 		} else if (computed & BLOOM_NOT_COMPUTED)
diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
index 232ba2c485..7e4ab1795f 100755
--- a/t/t0095-bloom.sh
+++ b/t/t0095-bloom.sh
@@ -71,8 +71,8 @@ test_expect_success 'get bloom filters for commit with no changes' '
 	git init &&
 	git commit --allow-empty -m "c0" &&
 	cat >expect <<-\EOF &&
-	Filter_Length:0
-	Filter_Data:
+	Filter_Length:1
+	Filter_Data:00|
 	EOF
 	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
 	test_cmp expect actual
@@ -107,8 +107,8 @@ test_expect_success EXPENSIVE 'get bloom filter for commit with 513 changes' '
 	git add bigDir &&
 	git commit -m "commit with 513 changes" &&
 	cat >expect <<-\EOF &&
-	Filter_Length:0
-	Filter_Data:
+	Filter_Length:1
+	Filter_Data:ff|
 	EOF
 	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
 	test_cmp expect actual
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index f375e752cd..a56327ffd4 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -30,6 +30,7 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
 	rm file_to_be_deleted &&
 	git add . &&
 	git commit -m "file removed" &&
+	git commit --allow-empty -m "empty" &&
 	git commit-graph write --reachable --changed-paths &&

 	test_oid_cache <<-EOF
@@ -49,7 +50,7 @@ graph_read_expect () {
 }

 test_expect_success 'commit-graph write wrote out the bloom chunks' '
-	graph_read_expect 15
+	graph_read_expect 16
 '

 # Turn off any inherited trace2 settings for this test.
@@ -156,7 +157,7 @@ test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '

 test_bloom_filters_used_when_some_filters_are_missing () {
 	log_args=$1
-	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"maybe\":6,\"definitely_not\":8"
+	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"maybe\":6,\"definitely_not\":9"
 	setup "$log_args" &&
 	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
 	test_cmp log_wo_bloom log_w_bloom
@@ -271,4 +272,20 @@ test_expect_success 'correctly report changes over limit' '
 	)
 '

+test_expect_success 'correctly report commits with no changed paths' '
+	git init small &&
+	test_when_finished "rm -fr small" &&
+	(
+		cd small &&
+
+		git commit --allow-empty -m "initial commit" &&
+
+		GIT_TRACE2_EVENT="$(pwd)/trace" \
+			git commit-graph write --reachable --changed-paths &&
+		grep "\"filter_computed\":1" trace &&
+		grep "\"filter_trunc_small\":1" trace &&
+		grep "\"filter_trunc_large\":0" trace
+	)
+'
+
 test_done
--
2.27.0.2918.g22f69e1f31


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* Re: [PATCH 12/12] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-09-09 15:24 ` [PATCH 12/12] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
@ 2020-09-11 17:52   ` Jeff King
  2020-09-11 18:59     ` Taylor Blau
  2020-09-11 19:31     ` Junio C Hamano
  0 siblings, 2 replies; 75+ messages in thread
From: Jeff King @ 2020-09-11 17:52 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, szeder.dev

On Wed, Sep 09, 2020 at 11:24:00AM -0400, Taylor Blau wrote:

> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
> index 17405c73a9..81a2e65903 100644
> --- a/Documentation/git-commit-graph.txt
> +++ b/Documentation/git-commit-graph.txt
> @@ -67,6 +67,12 @@ this option is given, future commit-graph writes will automatically assume
>  that this option was intended. Use `--no-changed-paths` to stop storing this
>  data.
>  +
> +With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
> +filters (if `--changed-paths` is specified). If `n` is `-1`, no limit is
> +enforced. Commits whose filters are not calculated are stored as a
> +length zero Bloom filter, and their bit is marked in the `BFXL` chunk.
> +Overrides the `commitGraph.maxNewFilters` configuration.

The BFXL chunk doesn't exist anymore in this iteration, right?

I wondered about having a user-facing "-1" here. My gut feeling is that
we usually use "0" to mean "no limit" in other places, and it probably
make sense to be consistent. It does look like we use both, though, and
I'm having trouble formulating a grep pattern to find examples that
doesn't produce a lot of noise.

These are "0 is no limit":

  pack.windowMemory
  pack.deltaCacheSize
  git-daemon --max-connections

These are "-1 is no limit":

  git-grep --max-depth
  rev-list --max-parents (I think?)

So I dunno. It's a pretty minor thing, but I think it's good to aim for
consistency, and since this is user-facing we won't be able to change it
later.

-Peff

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 10/12] bloom: encode out-of-bounds filters as non-empty
  2020-09-10 15:45     ` Taylor Blau
@ 2020-09-11 18:15       ` Derrick Stolee
  0 siblings, 0 replies; 75+ messages in thread
From: Derrick Stolee @ 2020-09-11 18:15 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: dstolee, gitster, peff, szeder.dev

On 9/10/2020 11:45 AM, Taylor Blau wrote:
> On Wed, Sep 09, 2020 at 11:35:57PM -0400, Taylor Blau wrote:
>> On Wed, Sep 09, 2020 at 11:23:48AM -0400, Taylor Blau wrote:
>>> -		filter->data = NULL;
>>> -		filter->len = 0;
>>> +		filter->data = xmalloc(1);
>>> +		filter->data[0] = 0xFF;
>>> +		filter->len = 1;
>>>
>>>  		if (computed)
>>>  			*computed |= BLOOM_TRUNC_LARGE;
>>
>> Oops, I missed the case that added by the previous patch where the
>> number of diff entries is smaller than the limit, but the hashmap
>> entries (after directories are added and such) crosses the threshold.
>>
>> Specifically, this patch doesn't write the 0xFF filter like it should.
>> I'll send a different version of this patch tomorrow.
> 
> This one should do the trick. Let's use it instead.
> 
> --- >8 ---
> 
> Subject: [PATCH] bloom: encode out-of-bounds filters as non-empty
> 
> When a changed-path Bloom filter has either zero, or more than a
> certain number (commonly 512) of entries, the commit-graph machinery
> encodes it as "missing". More specifically, it sets the indices adjacent
> in the BIDX chunk as equal to each other to indicate a "length 0"
> filter; that is, that the filter occupies zero bytes on disk.
> 
> This has heretofore been fine, since the commit-graph machinery has no
> need to care about these filters with too few or too many changed paths.
> Both cases act like no filter has been generated at all, and so there is
> no need to store them.
> 
> In a subsequent commit, however, the commit-graph machinery will learn
> to only compute Bloom filters for some commits in the current
> commit-graph layer. This is a change from the current implementation
> which computes Bloom filters for all commits that are in the layer being
> written. Critically for this patch, only computing some of the Bloom
> filters means adding a third state for length 0 Bloom filters: zero
> entries, too many entries, or "hasn't been computed".
> 
> It will be important for that future patch to distinguish between "not
> representable" (i.e., zero or too-many changed paths), and "hasn't been
> computed". In particular, we don't want to waste time recomputing
> filters that have already been computed.
> 
> To that end, change how we store Bloom filters in the "computed but not
> representable" category:
> 
>   - Bloom filters with no entries are stored as a single byte with all
>     bits low (i.e., all queries to that Bloom filter will return
>     "definitely not")
> 
>   - Bloom filters with too many entries are stored as a single byte with
>     all bits set high (i.e., all queries to that Bloom filter will
>     return "maybe").
> 
> These rules are sufficient to not incur a behavior change by changing
> the on-disk representation of these two classes. Likewise, no
> specification changes are necessary for the commit-graph format, either:
> 
>   - Filters that were previously empty will be recomputed and stored
>     according to the new rules, and
> 
>   - old clients reading filters generated by new clients will interpret
>     the filters correctly and be none the wiser to how they were
>     generated.
> 
> Clients will invoke the Bloom machinery in more cases than before, but
> this can be addressed by returning a NULL filter when all bits are set
> high. This can be addressed in a future patch.
> 
> Finally, note that this does increase the size of on-disk commit-graphs,
> but far less than other proposals. In particular, this is generally more
> efficient than storing a bitmap for which commits haven't computed their
> Bloom filters. Storing a bitmap incurs a penalty of one bit per commit,
> whereas storing explicit filters as above incurs a penalty of one byte
> per too-large or too-small commit.
> 
> In practice, these boundary commits likely occupy a small proportion of
> the overall number of commits, and so the size penalty is likely smaller
> than storing a bitmap for all commits.
> 
> A test to exercise filters which contain too many changed path entries
> will be introduced in a subsequent patch.
> 
> Suggested-by: SZEDER Gábor <szeder.dev@gmail.com>
> Suggested-by: Jakub Narębski <jnareb@gmail.com>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  .../technical/commit-graph-format.txt         |  2 +-
>  bloom.c                                       | 16 ++++++++++++--
>  bloom.h                                       |  1 +
>  commit-graph.c                                |  5 +++++
>  t/t0095-bloom.sh                              |  8 +++----
>  t/t4216-log-bloom.sh                          | 21 +++++++++++++++++--
>  6 files changed, 44 insertions(+), 9 deletions(-)
> 
> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> index 6ddbceba15..6585f1948a 100644
> --- a/Documentation/technical/commit-graph-format.txt
> +++ b/Documentation/technical/commit-graph-format.txt
> @@ -125,7 +125,7 @@ CHUNK DATA:
>      * The rest of the chunk is the concatenation of all the computed Bloom
>        filters for the commits in lexicographic order.
>      * Note: Commits with no changes or more than 512 changes have Bloom filters
> -      of length zero.
> +      of length one, with either all bits set to zero or one respectively.
>      * The BDAT chunk is present if and only if BIDX is present.
> 
>    Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
> diff --git a/bloom.c b/bloom.c
> index db9fb82437..d24747a1d5 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -177,6 +177,13 @@ static int pathmap_cmp(const void *hashmap_cmp_fn_data,
>  	return strcmp(e1->path, e2->path);
>  }
> 
> +static void init_truncated_large_filter(struct bloom_filter *filter)
> +{
> +	filter->data = xmalloc(1);
> +	filter->data[0] = 0xFF;
> +	filter->len = 1;
> +}
> +

So this hunk is essentially the new bit in this version.

> diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
> index 232ba2c485..7e4ab1795f 100755
> --- a/t/t0095-bloom.sh
> +++ b/t/t0095-bloom.sh
> @@ -71,8 +71,8 @@ test_expect_success 'get bloom filters for commit with no changes' '
>  	git init &&
>  	git commit --allow-empty -m "c0" &&
>  	cat >expect <<-\EOF &&
> -	Filter_Length:0
> -	Filter_Data:
> +	Filter_Length:1
> +	Filter_Data:00|
>  	EOF
>  	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
>  	test_cmp expect actual
> @@ -107,8 +107,8 @@ test_expect_success EXPENSIVE 'get bloom filter for commit with 513 changes' '
>  	git add bigDir &&
>  	git commit -m "commit with 513 changes" &&
>  	cat >expect <<-\EOF &&
> -	Filter_Length:0
> -	Filter_Data:
> +	Filter_Length:1
> +	Filter_Data:ff|
>  	EOF
>  	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
>  	test_cmp expect actual

And these two tests now show the new behavior. Nice.

This version matches my expectations. Thanks.

-Stolee

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 12/12] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-09-11 17:52   ` Jeff King
@ 2020-09-11 18:59     ` Taylor Blau
  2020-09-11 19:25       ` Taylor Blau
  2020-09-11 19:47       ` Jeff King
  2020-09-11 19:31     ` Junio C Hamano
  1 sibling, 2 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-11 18:59 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git, dstolee, gitster, szeder.dev

On Fri, Sep 11, 2020 at 01:52:16PM -0400, Jeff King wrote:
> On Wed, Sep 09, 2020 at 11:24:00AM -0400, Taylor Blau wrote:
> > +With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
> > +filters (if `--changed-paths` is specified). If `n` is `-1`, no limit is
> > +enforced. Commits whose filters are not calculated are stored as a
> > +length zero Bloom filter, and their bit is marked in the `BFXL` chunk.
> > +Overrides the `commitGraph.maxNewFilters` configuration.
>
> The BFXL chunk doesn't exist anymore in this iteration, right?

Ack; I'll have to drop that.

> I wondered about having a user-facing "-1" here. My gut feeling is that
> we usually use "0" to mean "no limit" in other places, and it probably
> make sense to be consistent. It does look like we use both, though, and
> I'm having trouble formulating a grep pattern to find examples that
> doesn't produce a lot of noise.
>
> These are "0 is no limit":
>
>   pack.windowMemory
>   pack.deltaCacheSize
>   git-daemon --max-connections
>
> These are "-1 is no limit":
>
>   git-grep --max-depth
>   rev-list --max-parents (I think?)
>
> So I dunno. It's a pretty minor thing, but I think it's good to aim for
> consistency, and since this is user-facing we won't be able to change it
> later.

I think that we have to treat "-1" as the no-limit indicator, or
otherwise we'd have to specify some other way to say we don't want to
generate any filters. With this patch, users can write:

  $ git commit-graph write --changed-paths .. --max-new-filters=0

to generate a commit-graph without writing any new filters. This is
important to be able to do since we also have a
'commitGraph.maxNewFilters' configuration, which callers may want to
override.

You may wonder why you wouldn't just write '--no-changed-paths' instead.
Doing so would indeed generate no new filters, but it also wouldn't
write any already existing filters into a new graph which is important
when rolling up graph layers that already have incrementals, for example
with '--split'.

I'm happy to include all or none of this in a re-rolled commit message
if you think it's relevant, too.

> -Peff

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 12/12] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-09-11 18:59     ` Taylor Blau
@ 2020-09-11 19:25       ` Taylor Blau
  2020-09-14 20:12         ` Taylor Blau
  2020-09-11 19:47       ` Jeff King
  1 sibling, 1 reply; 75+ messages in thread
From: Taylor Blau @ 2020-09-11 19:25 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, dstolee, szeder.dev, Jeff King

On Fri, Sep 11, 2020 at 02:59:34PM -0400, Taylor Blau wrote:
> On Fri, Sep 11, 2020 at 01:52:16PM -0400, Jeff King wrote:
> > On Wed, Sep 09, 2020 at 11:24:00AM -0400, Taylor Blau wrote:
> > > +With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
> > > +filters (if `--changed-paths` is specified). If `n` is `-1`, no limit is
> > > +enforced. Commits whose filters are not calculated are stored as a
> > > +length zero Bloom filter, and their bit is marked in the `BFXL` chunk.
> > > +Overrides the `commitGraph.maxNewFilters` configuration.
> >
> > The BFXL chunk doesn't exist anymore in this iteration, right?
>
> Ack; I'll have to drop that.

Junio, I know that I've already sent one replacement patch. If you don't
mind, here's another (and if you do mind, I'm happy to re-roll the
series).

Thanks.

--- >8 ---

Subject: [PATCH] builtin/commit-graph.c: introduce '--max-new-filters=<n>'

Introduce a command-line flag and configuration variable to specify the
maximum number of new Bloom filters that a 'git commit-graph write' is
willing to compute from scratch.

Prior to this patch, a commit-graph write with '--changed-paths' would
compute Bloom filters for all selected commits which haven't already
been computed (i.e., by a previous commit-graph write with '--split'
such that a roll-up or replacement is performed).

This behavior can cause prohibitively-long commit-graph writes for a
variety of reasons:

  * There may be lots of filters whose diffs take a long time to
    generate (for example, they have close to the maximum number of
    changes, diffing itself takes a long time, etc).

  * Old-style commit-graphs (which encode filters with too many entries
    as not having been computed at all) cause us to waste time
    recomputing filters that appear to have not been computed only to
    discover that they are too-large.

This can make the upper-bound of the time it takes for 'git commit-graph
write --changed-paths' to be rather unpredictable.

To make this command behave more predictably, introduce
'--max-new-filters=<n>' to allow computing at most '<n>' Bloom filters
from scratch. This lets "computing" already-known filters proceed
quickly, while bounding the number of slow tasks that Git is willing to
do.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/commitgraph.txt |  4 +++
 Documentation/git-commit-graph.txt   |  6 ++++
 bloom.c                              |  7 ++--
 builtin/commit-graph.c               | 39 ++++++++++++++++++--
 commit-graph.c                       |  9 +++--
 commit-graph.h                       |  1 +
 t/t4216-log-bloom.sh                 | 53 ++++++++++++++++++++++++++++
 7 files changed, 110 insertions(+), 9 deletions(-)

diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
index cff0797b54..4582c39fc4 100644
--- a/Documentation/config/commitgraph.txt
+++ b/Documentation/config/commitgraph.txt
@@ -1,3 +1,7 @@
+commitGraph.maxNewFilters::
+	Specifies the default value for the `--max-new-filters` option of `git
+	commit-graph write` (c.f., linkgit:git-commit-graph[1]).
+
 commitGraph.readChangedPaths::
 	If true, then git will use the changed-path Bloom filters in the
 	commit-graph file (if it exists, and they are present). Defaults to
diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 17405c73a9..60df4e4bfa 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -67,6 +67,12 @@ this option is given, future commit-graph writes will automatically assume
 that this option was intended. Use `--no-changed-paths` to stop storing this
 data.
 +
+With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
+filters (if `--changed-paths` is specified). If `n` is `-1`, no limit is
+enforced. Commits whose filters are not calculated are stored as a
+length zero Bloom filter. Overrides the `commitGraph.maxNewFilters`
+configuration.
++
 With the `--split[=<strategy>]` option, write the commit-graph as a
 chain of multiple commit-graph files stored in
 `<dir>/info/commit-graphs`. Commit-graph layers are merged based on the
diff --git a/bloom.c b/bloom.c
index d24747a1d5..230a515831 100644
--- a/bloom.c
+++ b/bloom.c
@@ -204,12 +204,11 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,

 	if (!filter->data) {
 		load_commit_graph_info(r, c);
-		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH &&
-			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
-				return filter;
+		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH)
+			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
 	}

-	if (filter->data)
+	if (filter->data && filter->len)
 		return filter;
 	if (!compute_if_not_present)
 		return NULL;
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index f3243bd982..e7a1539b08 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -13,7 +13,8 @@ static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph verify [--object-dir <objdir>] [--shallow] [--[no-]progress]"),
 	N_("git commit-graph write [--object-dir <objdir>] [--append] "
 	   "[--split[=<strategy>]] [--reachable|--stdin-packs|--stdin-commits] "
-	   "[--changed-paths] [--[no-]progress] <split options>"),
+	   "[--changed-paths] [--[no-]max-new-filters <n>] [--[no-]progress] "
+	   "<split options>"),
 	NULL
 };

@@ -25,7 +26,8 @@ static const char * const builtin_commit_graph_verify_usage[] = {
 static const char * const builtin_commit_graph_write_usage[] = {
 	N_("git commit-graph write [--object-dir <objdir>] [--append] "
 	   "[--split[=<strategy>]] [--reachable|--stdin-packs|--stdin-commits] "
-	   "[--changed-paths] [--[no-]progress] <split options>"),
+	   "[--changed-paths] [--[no-]max-new-filters <n>] [--[no-]progress] "
+	   "<split options>"),
 	NULL
 };

@@ -162,6 +164,23 @@ static int read_one_commit(struct oidset *commits, struct progress *progress,
 	return 0;
 }

+static int write_option_max_new_filters(const struct option *opt,
+					const char *arg,
+					int unset)
+{
+	int *to = opt->value;
+	if (unset)
+		*to = -1;
+	else {
+		const char *s;
+		*to = strtol(arg, (char **)&s, 10);
+		if (*s)
+			return error(_("%s expects a numerical value"),
+				     optname(opt, opt->flags));
+	}
+	return 0;
+}
+
 static int graph_write(int argc, const char **argv)
 {
 	struct string_list pack_indexes = STRING_LIST_INIT_NODUP;
@@ -197,6 +216,9 @@ static int graph_write(int argc, const char **argv)
 			N_("maximum ratio between two levels of a split commit-graph")),
 		OPT_EXPIRY_DATE(0, "expire-time", &write_opts.expire_time,
 			N_("only expire files older than a given date-time")),
+		OPT_CALLBACK_F(0, "max-new-filters", &write_opts.max_new_filters,
+			NULL, N_("maximum number of changed-path Bloom filters to compute"),
+			0, write_option_max_new_filters),
 		OPT_END(),
 	};

@@ -205,6 +227,7 @@ static int graph_write(int argc, const char **argv)
 	write_opts.size_multiple = 2;
 	write_opts.max_commits = 0;
 	write_opts.expire_time = 0;
+	write_opts.max_new_filters = -1;

 	trace2_cmd_mode("write");

@@ -270,6 +293,16 @@ static int graph_write(int argc, const char **argv)
 	return result;
 }

+static int git_commit_graph_config(const char *var, const char *value, void *cb)
+{
+	if (!strcmp(var, "commitgraph.maxnewfilters")) {
+		write_opts.max_new_filters = git_config_int(var, value);
+		return 0;
+	}
+
+	return git_default_config(var, value, cb);
+}
+
 int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 {
 	static struct option builtin_commit_graph_options[] = {
@@ -283,7 +316,7 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 		usage_with_options(builtin_commit_graph_usage,
 				   builtin_commit_graph_options);

-	git_config(git_default_config, NULL);
+	git_config(git_commit_graph_config, &opts);
 	argc = parse_options(argc, argv, prefix,
 			     builtin_commit_graph_options,
 			     builtin_commit_graph_usage,
diff --git a/commit-graph.c b/commit-graph.c
index dcc27b74e3..1d9f8cc7e9 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1422,6 +1422,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 	int i;
 	struct progress *progress = NULL;
 	struct commit **sorted_commits;
+	int max_new_filters;

 	init_bloom_filters();

@@ -1438,13 +1439,16 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 	else
 		QSORT(sorted_commits, ctx->commits.nr, commit_gen_cmp);

+	max_new_filters = ctx->opts && ctx->opts->max_new_filters >= 0 ?
+		ctx->opts->max_new_filters : ctx->commits.nr;
+
 	for (i = 0; i < ctx->commits.nr; i++) {
 		enum bloom_filter_computed computed = 0;
 		struct commit *c = sorted_commits[i];
 		struct bloom_filter *filter = get_or_compute_bloom_filter(
 			ctx->r,
 			c,
-			1,
+			ctx->count_bloom_filter_computed < max_new_filters,
 			ctx->bloom_settings,
 			&computed);
 		if (computed & BLOOM_COMPUTED) {
@@ -1455,7 +1459,8 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 				ctx->count_bloom_filter_trunc_large++;
 		} else if (computed & BLOOM_NOT_COMPUTED)
 			ctx->count_bloom_filter_not_computed++;
-		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
+		ctx->total_bloom_filter_data_size += filter
+			? sizeof(unsigned char) * filter->len : 0;
 		display_progress(progress, i + 1);
 	}

diff --git a/commit-graph.h b/commit-graph.h
index b7914b0a7a..a22bd86701 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -110,6 +110,7 @@ struct commit_graph_opts {
 	int max_commits;
 	timestamp_t expire_time;
 	enum commit_graph_split_flags split_flags;
+	int max_new_filters;
 };

 /*
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index a56327ffd4..24deb8104a 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -287,5 +287,58 @@ test_expect_success 'correctly report commits with no changed paths' '
 		grep "\"filter_trunc_large\":0" trace
 	)
 '
+test_bloom_filters_computed () {
+	commit_graph_args=$1
+	rm -f "$TRASH_DIRECTORY/trace.event" &&
+	GIT_TRACE2_EVENT="$TRASH_DIRECTORY/trace.event" git commit-graph write \
+		$commit_graph_args &&
+	grep "\"filter_not_computed\":$2" "$TRASH_DIRECTORY/trace.event" &&
+	grep "\"filter_trunc_large\":$3" "$TRASH_DIRECTORY/trace.event" &&
+	grep "\"filter_computed\":$4" "$TRASH_DIRECTORY/trace.event"
+}
+
+test_expect_success 'Bloom generation is limited by --max-new-filters' '
+	(
+		cd limits &&
+		test_commit c2 filter &&
+		test_commit c3 filter &&
+		test_commit c4 no-filter &&
+		test_bloom_filters_computed "--reachable --changed-paths --split=replace --max-new-filters=2" \
+			3 0 2
+	)
+'
+
+test_expect_success 'Bloom generation backfills previously-skipped filters' '
+	(
+		cd limits &&
+		test_bloom_filters_computed "--reachable --changed-paths --split=replace --max-new-filters=1" \
+			4 0 1
+	)
+'
+
+test_expect_success 'Bloom generation backfills empty commits' '
+	git init empty &&
+	test_when_finished "rm -fr empty" &&
+	(
+		cd empty &&
+		for i in $(test_seq 1 6)
+		do
+			git commit --allow-empty -m "$i"
+		done &&
+
+		# Generate Bloom filters for empty commits 1-6, two at a time.
+		test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
+			4 0 2 &&
+		test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
+			4 0 2 &&
+		test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
+			4 0 2 &&
+
+		# Finally, make sure that once all commits have filters, that
+		# none are subsequently recomputed.
+		test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
+			6 0 0
+	)
+'

 test_done
--
2.27.0.2924.ga64bac9092.dirty


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* Re: [PATCH 12/12] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-09-11 17:52   ` Jeff King
  2020-09-11 18:59     ` Taylor Blau
@ 2020-09-11 19:31     ` Junio C Hamano
  1 sibling, 0 replies; 75+ messages in thread
From: Junio C Hamano @ 2020-09-11 19:31 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git, dstolee, szeder.dev

Jeff King <peff@peff.net> writes:

> I wondered about having a user-facing "-1" here. My gut feeling is that
> we usually use "0" to mean "no limit" in other places, and it probably
> make sense to be consistent. It does look like we use both, though, and
> I'm having trouble formulating a grep pattern to find examples that
> doesn't produce a lot of noise.
>
> These are "0 is no limit":
>
>   pack.windowMemory
>   pack.deltaCacheSize
>   git-daemon --max-connections
>
> These are "-1 is no limit":
>
>   git-grep --max-depth
>   rev-list --max-parents (I think?)

I am unsure if "limiting to the top-level" is depth 0 or depth 1,
but if it is depth 0, --max-depth=0 that does not recurse is
sensible and cannot be used as a signal for "unlimited".

Same for --max-parents=0 that would be a legit way to ask for "root
commits only".

I do not think the system fundamentally would not work with 0 bytes
of window memory or 0 connections, so "0 is unlimited" for them
sounds appropriate.  I would not be surprised if the reason why "0
is unlimited" fields did not choose to use "-1" as the "unlimited"
signal was because the internal type for these fields is unsigned.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 12/12] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-09-11 18:59     ` Taylor Blau
  2020-09-11 19:25       ` Taylor Blau
@ 2020-09-11 19:47       ` Jeff King
  1 sibling, 0 replies; 75+ messages in thread
From: Jeff King @ 2020-09-11 19:47 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, szeder.dev

On Fri, Sep 11, 2020 at 02:59:34PM -0400, Taylor Blau wrote:

> I think that we have to treat "-1" as the no-limit indicator, or
> otherwise we'd have to specify some other way to say we don't want to
> generate any filters. With this patch, users can write:
> 
>   $ git commit-graph write --changed-paths .. --max-new-filters=0
> 
> to generate a commit-graph without writing any new filters. This is
> important to be able to do since we also have a
> 'commitGraph.maxNewFilters' configuration, which callers may want to
> override.

OK, that makes sense. Consistency would be nice, but I agree it just
wouldn't work here (and we're not entirely consistent anyway, so it's
not that big a loss).

-Peff

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 12/12] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-09-11 19:25       ` Taylor Blau
@ 2020-09-14 20:12         ` Taylor Blau
  2020-09-14 20:31           ` Derrick Stolee
  0 siblings, 1 reply; 75+ messages in thread
From: Taylor Blau @ 2020-09-14 20:12 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, dstolee, szeder.dev, Jeff King

On Fri, Sep 11, 2020 at 03:25:55PM -0400, Taylor Blau wrote:
> On Fri, Sep 11, 2020 at 02:59:34PM -0400, Taylor Blau wrote:
> > On Fri, Sep 11, 2020 at 01:52:16PM -0400, Jeff King wrote:
> > > On Wed, Sep 09, 2020 at 11:24:00AM -0400, Taylor Blau wrote:
> > > > +With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
> > > > +filters (if `--changed-paths` is specified). If `n` is `-1`, no limit is
> > > > +enforced. Commits whose filters are not calculated are stored as a
> > > > +length zero Bloom filter, and their bit is marked in the `BFXL` chunk.
> > > > +Overrides the `commitGraph.maxNewFilters` configuration.
> > >
> > > The BFXL chunk doesn't exist anymore in this iteration, right?
> >
> > Ack; I'll have to drop that.
>
> Junio, I know that I've already sent one replacement patch. If you don't
> mind, here's another (and if you do mind, I'm happy to re-roll the
> series).

Just kidding. Let's use *this* version which fixes a bug reading the
commitGraph.maxNewFilters configuration. At this point, the
fix-ups are:

  - This patch (attached below the scisors) instead of 12/12, and

  - This [1] patch instead of 10/12.

[1]: https://lore.kernel.org/git/20200910154516.GA32117@nand.local/

Let me know if you'd rather have a full re-roll.

--- 8< ---

Subject: [PATCH] builtin/commit-graph.c: introduce '--max-new-filters=<n>'

Introduce a command-line flag and configuration variable to specify the
maximum number of new Bloom filters that a 'git commit-graph write' is
willing to compute from scratch.

Prior to this patch, a commit-graph write with '--changed-paths' would
compute Bloom filters for all selected commits which haven't already
been computed (i.e., by a previous commit-graph write with '--split'
such that a roll-up or replacement is performed).

This behavior can cause prohibitively-long commit-graph writes for a
variety of reasons:

  * There may be lots of filters whose diffs take a long time to
    generate (for example, they have close to the maximum number of
    changes, diffing itself takes a long time, etc).

  * Old-style commit-graphs (which encode filters with too many entries
    as not having been computed at all) cause us to waste time
    recomputing filters that appear to have not been computed only to
    discover that they are too-large.

This can make the upper-bound of the time it takes for 'git commit-graph
write --changed-paths' to be rather unpredictable.

To make this command behave more predictably, introduce
'--max-new-filters=<n>' to allow computing at most '<n>' Bloom filters
from scratch. This lets "computing" already-known filters proceed
quickly, while bounding the number of slow tasks that Git is willing to
do.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/commitgraph.txt |  4 ++
 Documentation/git-commit-graph.txt   |  6 +++
 bloom.c                              |  7 ++--
 builtin/commit-graph.c               | 41 ++++++++++++++++++++-
 commit-graph.c                       |  9 ++++-
 commit-graph.h                       |  1 +
 t/t4216-log-bloom.sh                 | 55 ++++++++++++++++++++++++++++
 7 files changed, 115 insertions(+), 8 deletions(-)

diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
index cff0797b54..4582c39fc4 100644
--- a/Documentation/config/commitgraph.txt
+++ b/Documentation/config/commitgraph.txt
@@ -1,3 +1,7 @@
+commitGraph.maxNewFilters::
+	Specifies the default value for the `--max-new-filters` option of `git
+	commit-graph write` (c.f., linkgit:git-commit-graph[1]).
+
 commitGraph.readChangedPaths::
 	If true, then git will use the changed-path Bloom filters in the
 	commit-graph file (if it exists, and they are present). Defaults to
diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 17405c73a9..60df4e4bfa 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -67,6 +67,12 @@ this option is given, future commit-graph writes will automatically assume
 that this option was intended. Use `--no-changed-paths` to stop storing this
 data.
 +
+With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
+filters (if `--changed-paths` is specified). If `n` is `-1`, no limit is
+enforced. Commits whose filters are not calculated are stored as a
+length zero Bloom filter. Overrides the `commitGraph.maxNewFilters`
+configuration.
++
 With the `--split[=<strategy>]` option, write the commit-graph as a
 chain of multiple commit-graph files stored in
 `<dir>/info/commit-graphs`. Commit-graph layers are merged based on the
diff --git a/bloom.c b/bloom.c
index d24747a1d5..230a515831 100644
--- a/bloom.c
+++ b/bloom.c
@@ -204,12 +204,11 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,

 	if (!filter->data) {
 		load_commit_graph_info(r, c);
-		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH &&
-			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
-				return filter;
+		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH)
+			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
 	}

-	if (filter->data)
+	if (filter->data && filter->len)
 		return filter;
 	if (!compute_if_not_present)
 		return NULL;
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index f3243bd982..988445abdf 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -13,7 +13,8 @@ static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph verify [--object-dir <objdir>] [--shallow] [--[no-]progress]"),
 	N_("git commit-graph write [--object-dir <objdir>] [--append] "
 	   "[--split[=<strategy>]] [--reachable|--stdin-packs|--stdin-commits] "
-	   "[--changed-paths] [--[no-]progress] <split options>"),
+	   "[--changed-paths] [--[no-]max-new-filters <n>] [--[no-]progress] "
+	   "<split options>"),
 	NULL
 };

@@ -25,7 +26,8 @@ static const char * const builtin_commit_graph_verify_usage[] = {
 static const char * const builtin_commit_graph_write_usage[] = {
 	N_("git commit-graph write [--object-dir <objdir>] [--append] "
 	   "[--split[=<strategy>]] [--reachable|--stdin-packs|--stdin-commits] "
-	   "[--changed-paths] [--[no-]progress] <split options>"),
+	   "[--changed-paths] [--[no-]max-new-filters <n>] [--[no-]progress] "
+	   "<split options>"),
 	NULL
 };

@@ -162,6 +164,35 @@ static int read_one_commit(struct oidset *commits, struct progress *progress,
 	return 0;
 }

+static int write_option_max_new_filters(const struct option *opt,
+					const char *arg,
+					int unset)
+{
+	int *to = opt->value;
+	if (unset)
+		*to = -1;
+	else {
+		const char *s;
+		*to = strtol(arg, (char **)&s, 10);
+		if (*s)
+			return error(_("%s expects a numerical value"),
+				     optname(opt, opt->flags));
+	}
+	return 0;
+}
+
+static int git_commit_graph_write_config(const char *var, const char *value,
+					 void *cb)
+{
+	if (!strcmp(var, "commitgraph.maxnewfilters"))
+		write_opts.max_new_filters = git_config_int(var, value);
+	/*
+	 * No need to fall-back to 'git_default_config', since this was already
+	 * called in 'cmd_commit_graph()'.
+	 */
+	return 0;
+}
+
 static int graph_write(int argc, const char **argv)
 {
 	struct string_list pack_indexes = STRING_LIST_INIT_NODUP;
@@ -197,6 +228,9 @@ static int graph_write(int argc, const char **argv)
 			N_("maximum ratio between two levels of a split commit-graph")),
 		OPT_EXPIRY_DATE(0, "expire-time", &write_opts.expire_time,
 			N_("only expire files older than a given date-time")),
+		OPT_CALLBACK_F(0, "max-new-filters", &write_opts.max_new_filters,
+			NULL, N_("maximum number of changed-path Bloom filters to compute"),
+			0, write_option_max_new_filters),
 		OPT_END(),
 	};

@@ -205,9 +239,12 @@ static int graph_write(int argc, const char **argv)
 	write_opts.size_multiple = 2;
 	write_opts.max_commits = 0;
 	write_opts.expire_time = 0;
+	write_opts.max_new_filters = -1;

 	trace2_cmd_mode("write");

+	git_config(git_commit_graph_write_config, &opts);
+
 	argc = parse_options(argc, argv, NULL,
 			     builtin_commit_graph_write_options,
 			     builtin_commit_graph_write_usage, 0);
diff --git a/commit-graph.c b/commit-graph.c
index dcc27b74e3..1d9f8cc7e9 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1422,6 +1422,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 	int i;
 	struct progress *progress = NULL;
 	struct commit **sorted_commits;
+	int max_new_filters;

 	init_bloom_filters();

@@ -1438,13 +1439,16 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 	else
 		QSORT(sorted_commits, ctx->commits.nr, commit_gen_cmp);

+	max_new_filters = ctx->opts && ctx->opts->max_new_filters >= 0 ?
+		ctx->opts->max_new_filters : ctx->commits.nr;
+
 	for (i = 0; i < ctx->commits.nr; i++) {
 		enum bloom_filter_computed computed = 0;
 		struct commit *c = sorted_commits[i];
 		struct bloom_filter *filter = get_or_compute_bloom_filter(
 			ctx->r,
 			c,
-			1,
+			ctx->count_bloom_filter_computed < max_new_filters,
 			ctx->bloom_settings,
 			&computed);
 		if (computed & BLOOM_COMPUTED) {
@@ -1455,7 +1459,8 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 				ctx->count_bloom_filter_trunc_large++;
 		} else if (computed & BLOOM_NOT_COMPUTED)
 			ctx->count_bloom_filter_not_computed++;
-		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
+		ctx->total_bloom_filter_data_size += filter
+			? sizeof(unsigned char) * filter->len : 0;
 		display_progress(progress, i + 1);
 	}

diff --git a/commit-graph.h b/commit-graph.h
index b7914b0a7a..a22bd86701 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -110,6 +110,7 @@ struct commit_graph_opts {
 	int max_commits;
 	timestamp_t expire_time;
 	enum commit_graph_split_flags split_flags;
+	int max_new_filters;
 };

 /*
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index a56327ffd4..3cb766301d 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -287,5 +287,60 @@ test_expect_success 'correctly report commits with no changed paths' '
 		grep "\"filter_trunc_large\":0" trace
 	)
 '
+test_bloom_filters_computed () {
+	commit_graph_args=$1
+	rm -f "$TRASH_DIRECTORY/trace.event" &&
+	GIT_TRACE2_EVENT="$TRASH_DIRECTORY/trace.event" git commit-graph write \
+		$commit_graph_args &&
+	grep "\"filter_not_computed\":$2" "$TRASH_DIRECTORY/trace.event" &&
+	grep "\"filter_trunc_large\":$3" "$TRASH_DIRECTORY/trace.event" &&
+	grep "\"filter_computed\":$4" "$TRASH_DIRECTORY/trace.event"
+}
+
+test_expect_success 'Bloom generation is limited by --max-new-filters' '
+	(
+		cd limits &&
+		test_commit c2 filter &&
+		test_commit c3 filter &&
+		test_commit c4 no-filter &&
+		test_bloom_filters_computed "--reachable --changed-paths --split=replace --max-new-filters=2" \
+			3 0 2
+	)
+'
+
+test_expect_success 'Bloom generation backfills previously-skipped filters' '
+	# Check specifying commitGraph.maxNewFilters over "git config" works.
+	test_config -C limits commitGraph.maxNewFilters 1 &&
+	(
+		cd limits &&
+		test_bloom_filters_computed "--reachable --changed-paths --split=replace" \
+			4 0 1
+	)
+'
+
+test_expect_success 'Bloom generation backfills empty commits' '
+	git init empty &&
+	test_when_finished "rm -fr empty" &&
+	(
+		cd empty &&
+		for i in $(test_seq 1 6)
+		do
+			git commit --allow-empty -m "$i"
+		done &&
+
+		# Generate Bloom filters for empty commits 1-6, two at a time.
+		test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
+			4 0 2 &&
+		test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
+			4 0 2 &&
+		test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
+			4 0 2 &&
+
+		# Finally, make sure that once all commits have filters, that
+		# none are subsequently recomputed.
+		test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
+			6 0 0
+	)
+'

 test_done
--
2.27.0.2924.ga64bac9092.dirty


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* Re: [PATCH 12/12] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-09-14 20:12         ` Taylor Blau
@ 2020-09-14 20:31           ` Derrick Stolee
  2020-09-14 20:36             ` Taylor Blau
  0 siblings, 1 reply; 75+ messages in thread
From: Derrick Stolee @ 2020-09-14 20:31 UTC (permalink / raw)
  To: Taylor Blau, Junio C Hamano; +Cc: git, dstolee, szeder.dev, Jeff King

On 9/14/2020 4:12 PM, Taylor Blau wrote:
> On Fri, Sep 11, 2020 at 03:25:55PM -0400, Taylor Blau wrote:
>> On Fri, Sep 11, 2020 at 02:59:34PM -0400, Taylor Blau wrote:
>>> On Fri, Sep 11, 2020 at 01:52:16PM -0400, Jeff King wrote:
>>>> On Wed, Sep 09, 2020 at 11:24:00AM -0400, Taylor Blau wrote:
>>>>> +With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
>>>>> +filters (if `--changed-paths` is specified). If `n` is `-1`, no limit is
>>>>> +enforced. Commits whose filters are not calculated are stored as a
>>>>> +length zero Bloom filter, and their bit is marked in the `BFXL` chunk.
>>>>> +Overrides the `commitGraph.maxNewFilters` configuration.
>>>>
>>>> The BFXL chunk doesn't exist anymore in this iteration, right?
>>>
>>> Ack; I'll have to drop that.
>>
>> Junio, I know that I've already sent one replacement patch. If you don't
>> mind, here's another (and if you do mind, I'm happy to re-roll the
>> series).
> 
> Just kidding. Let's use *this* version which fixes a bug reading the
> commitGraph.maxNewFilters configuration. At this point, the
> fix-ups are:
> 
>   - This patch (attached below the scisors) instead of 12/12, and
> 
>   - This [1] patch instead of 10/12.
> 
> [1]: https://lore.kernel.org/git/20200910154516.GA32117@nand.local/
> 
> Let me know if you'd rather have a full re-roll.

It's getting a bit difficult to track all of these "use this instead"
patches. But, I'm not the one applying them, so maybe that's not actually
a problem.

You might need a re-roll, anyway, as I have a few comments here:

> --- 8< ---
> 
> Subject: [PATCH] builtin/commit-graph.c: introduce '--max-new-filters=<n>'

You also introduce commitGraph.maxNewFitlers here, which is not
mentioned in the commit message anywhere. In fact, it might be
good to include it as a separate patch so its implementation and
tests can be isolated from the command-line functionality.

> +length zero Bloom filter. Overrides the `commitGraph.maxNewFilters`
> +configuration.

We have found it valuable to demonstrate these overrides in tests.
Let's inspect your tests for this.

> +test_bloom_filters_computed () {
> +	commit_graph_args=$1
> +	rm -f "$TRASH_DIRECTORY/trace.event" &&
> +	GIT_TRACE2_EVENT="$TRASH_DIRECTORY/trace.event" git commit-graph write \
> +		$commit_graph_args &&
> +	grep "\"filter_not_computed\":$2" "$TRASH_DIRECTORY/trace.event" &&
> +	grep "\"filter_trunc_large\":$3" "$TRASH_DIRECTORY/trace.event" &&
> +	grep "\"filter_computed\":$4" "$TRASH_DIRECTORY/trace.event"
> +}

If the arguments were moved to the last parameter, then we could do a few
interesting things here.

test_bloom_filters_computed () {
	NOT_COMPUTED="\"filter_not_computed\":$1" &&
	shift &&
	TRUNCATED="\"filter_trunc_large\":$1" &&
	shift &&
	COMPUTED="\"filter_computed\":$1" &&
	shift &&
	rm -f "$TRASH_DIRECTORY/trace.event" &&
	GIT_TRACE2_EVENT="$TRASH_DIRECTORY/trace.event" git commit-graph write $@ &&
	grep "$NOT_COMPUTED" "$TRASH_DIRECTORY/trace.event" &&
	grep "$TRUNCATED" "$TRASH_DIRECTORY/trace.event" &&
	grep "$COMPUTED" "$TRASH_DIRECTORY/trace.event"
}


(I have not tested this script. It might need some work.)
This would make your callers a bit cleaner-looking, for example:

test_expect_success 'Bloom generation is limited by --max-new-filters' '
	(
		cd limits &&
		test_commit c2 filter &&
		test_commit c3 filter &&
		test_commit c4 no-filter &&
		test_bloom_filters_computed 3 0 2 \
			--reachable --changed-paths --split=replace --max-new-filters=2
	)
'

At least, this looks nicer to me.

> +test_expect_success 'Bloom generation backfills previously-skipped filters' '
> +	# Check specifying commitGraph.maxNewFilters over "git config" works.
> +	test_config -C limits commitGraph.maxNewFilters 1 &&
> +	(
> +		cd limits &&
> +		test_bloom_filters_computed "--reachable --changed-paths --split=replace" \
> +			4 0 1
> +	)
> +'

Adding a case for `commitGraph.maxNewFilters=1` and `--max-new-filters=2` might
be interesting for the override rules.

> +
> +test_expect_success 'Bloom generation backfills empty commits' '
> +	git init empty &&
> +	test_when_finished "rm -fr empty" &&
> +	(
> +		cd empty &&
> +		for i in $(test_seq 1 6)
> +		do
> +			git commit --allow-empty -m "$i"
> +		done &&
> +
> +		# Generate Bloom filters for empty commits 1-6, two at a time.
> +		test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
> +			4 0 2 &&
> +		test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
> +			4 0 2 &&
> +		test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
> +			4 0 2 &&

I'm concerned that the max-new-filters limit (2) is a divisor
of the full number of commits (6). It might be good to add one
more commit here and test again with a limit of 2. That would
handle both "equal to limit" and "less than limit" cases.

> +		# Finally, make sure that once all commits have filters, that
> +		# none are subsequently recomputed.
> +		test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
> +			6 0 0
> +	)
> +'

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 12/12] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-09-14 20:31           ` Derrick Stolee
@ 2020-09-14 20:36             ` Taylor Blau
  2020-09-15  0:59               ` Derrick Stolee
  2020-09-15 21:49               ` Junio C Hamano
  0 siblings, 2 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-14 20:36 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Junio C Hamano, git, dstolee, szeder.dev, Jeff King

On Mon, Sep 14, 2020 at 04:31:03PM -0400, Derrick Stolee wrote:
> On 9/14/2020 4:12 PM, Taylor Blau wrote:
> >   - This patch (attached below the scisors) instead of 12/12, and
> >
> >   - This [1] patch instead of 10/12.
> >
> > [1]: https://lore.kernel.org/git/20200910154516.GA32117@nand.local/
> >
> > Let me know if you'd rather have a full re-roll.
>
> It's getting a bit difficult to track all of these "use this instead"
> patches. But, I'm not the one applying them, so maybe that's not actually
> a problem.

The above list is the only changes that I've made, so I'm happy if Junio
wants to follow what's written there, but I'm equally happy to send a
new reroll.

> You might need a re-roll, anyway, as I have a few comments here:

Let's take a look...

> You also introduce commitGraph.maxNewFitlers here, which is not
> mentioned in the commit message anywhere. In fact, it might be
> good to include it as a separate patch so its implementation and
> tests can be isolated from the command-line functionality.

I could go either way on both of these, to be honest. I don't think
there's anything interesting that isn't said in the documentation
changes introduced by that commit that is worth convering there, so I'm
not sue 'commitGraph.maxNewFilters' needs the additional call-out.

> > +length zero Bloom filter. Overrides the `commitGraph.maxNewFilters`
> > +configuration.
>
> We have found it valuable to demonstrate these overrides in tests.
> Let's inspect your tests for this.
>
> > +test_bloom_filters_computed () {
> > +	commit_graph_args=$1
> > +	rm -f "$TRASH_DIRECTORY/trace.event" &&
> > +	GIT_TRACE2_EVENT="$TRASH_DIRECTORY/trace.event" git commit-graph write \
> > +		$commit_graph_args &&
> > +	grep "\"filter_not_computed\":$2" "$TRASH_DIRECTORY/trace.event" &&
> > +	grep "\"filter_trunc_large\":$3" "$TRASH_DIRECTORY/trace.event" &&
> > +	grep "\"filter_computed\":$4" "$TRASH_DIRECTORY/trace.event"
> > +}
>
> If the arguments were moved to the last parameter, then we could do a few
> interesting things here.
>
> test_bloom_filters_computed () {
> 	NOT_COMPUTED="\"filter_not_computed\":$1" &&
> 	shift &&
> 	TRUNCATED="\"filter_trunc_large\":$1" &&
> 	shift &&
> 	COMPUTED="\"filter_computed\":$1" &&
> 	shift &&
> 	rm -f "$TRASH_DIRECTORY/trace.event" &&
> 	GIT_TRACE2_EVENT="$TRASH_DIRECTORY/trace.event" git commit-graph write $@ &&
> 	grep "$NOT_COMPUTED" "$TRASH_DIRECTORY/trace.event" &&
> 	grep "$TRUNCATED" "$TRASH_DIRECTORY/trace.event" &&
> 	grep "$COMPUTED" "$TRASH_DIRECTORY/trace.event"
> }
>
>
> (I have not tested this script. It might need some work.)
> This would make your callers a bit cleaner-looking, for example:
>
> test_expect_success 'Bloom generation is limited by --max-new-filters' '
> 	(
> 		cd limits &&
> 		test_commit c2 filter &&
> 		test_commit c3 filter &&
> 		test_commit c4 no-filter &&
> 		test_bloom_filters_computed 3 0 2 \
> 			--reachable --changed-paths --split=replace --max-new-filters=2
> 	)
> '
>
> At least, this looks nicer to me.

Yeah, but I think we're still stuck with the test_config below unless
you write "git $@" instead of "git commit-graph write $@". I don't think
that I have strong feelings about this unless you do.

> > +test_expect_success 'Bloom generation backfills previously-skipped filters' '
> > +	# Check specifying commitGraph.maxNewFilters over "git config" works.
> > +	test_config -C limits commitGraph.maxNewFilters 1 &&
> > +	(
> > +		cd limits &&
> > +		test_bloom_filters_computed "--reachable --changed-paths --split=replace" \
> > +			4 0 1
> > +	)
> > +'
>
> Adding a case for `commitGraph.maxNewFilters=1` and `--max-new-filters=2` might
> be interesting for the override rules.

Potentially. I'm equally happy to do it in a follow-up series. I worry
slightly about adding too many test-cases for somewhat trivial behavior.

> > +
> > +test_expect_success 'Bloom generation backfills empty commits' '
> > +	git init empty &&
> > +	test_when_finished "rm -fr empty" &&
> > +	(
> > +		cd empty &&
> > +		for i in $(test_seq 1 6)
> > +		do
> > +			git commit --allow-empty -m "$i"
> > +		done &&
> > +
> > +		# Generate Bloom filters for empty commits 1-6, two at a time.
> > +		test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
> > +			4 0 2 &&
> > +		test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
> > +			4 0 2 &&
> > +		test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
> > +			4 0 2 &&
>
> I'm concerned that the max-new-filters limit (2) is a divisor
> of the full number of commits (6). It might be good to add one
> more commit here and test again with a limit of 2. That would
> handle both "equal to limit" and "less than limit" cases.

That case is already covered in the test two above this one ("Bloom
generation is limited by --max-new-filters").

> Thanks,
> -Stolee

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 12/12] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-09-14 20:36             ` Taylor Blau
@ 2020-09-15  0:59               ` Derrick Stolee
  2020-09-15  4:31                 ` Taylor Blau
  2020-09-15 21:49               ` Junio C Hamano
  1 sibling, 1 reply; 75+ messages in thread
From: Derrick Stolee @ 2020-09-15  0:59 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Junio C Hamano, git, dstolee, szeder.dev, Jeff King

On 9/14/2020 4:36 PM, Taylor Blau wrote:
> On Mon, Sep 14, 2020 at 04:31:03PM -0400, Derrick Stolee wrote:
>> On 9/14/2020 4:12 PM, Taylor Blau wrote:
>>>   - This patch (attached below the scisors) instead of 12/12, and
>>>
>>>   - This [1] patch instead of 10/12.
>>>
>>> [1]: https://lore.kernel.org/git/20200910154516.GA32117@nand.local/
>>>
>>> Let me know if you'd rather have a full re-roll.
>>
>> It's getting a bit difficult to track all of these "use this instead"
>> patches. But, I'm not the one applying them, so maybe that's not actually
>> a problem.
> 
> The above list is the only changes that I've made, so I'm happy if Junio
> wants to follow what's written there, but I'm equally happy to send a
> new reroll.
> 
>> You might need a re-roll, anyway, as I have a few comments here:
> 
> Let's take a look...
> 
>> You also introduce commitGraph.maxNewFitlers here, which is not
>> mentioned in the commit message anywhere. In fact, it might be
>> good to include it as a separate patch so its implementation and
>> tests can be isolated from the command-line functionality.
> 
> I could go either way on both of these, to be honest. I don't think
> there's anything interesting that isn't said in the documentation
> changes introduced by that commit that is worth convering there, so I'm
> not sue 'commitGraph.maxNewFilters' needs the additional call-out.

This is fine. Adding an option along with the config version of it
is easy enough. Just a thought for future series.

I'm fine with the series as-is. My nits are just that.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 12/12] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-09-15  0:59               ` Derrick Stolee
@ 2020-09-15  4:31                 ` Taylor Blau
  0 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-15  4:31 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Taylor Blau, Junio C Hamano, git, dstolee, szeder.dev, Jeff King

On Mon, Sep 14, 2020 at 08:59:28PM -0400, Derrick Stolee wrote:
> This is fine. Adding an option along with the config version of it
> is easy enough. Just a thought for future series.
>
> I'm fine with the series as-is. My nits are just that.

Thanks for your review, and for all of your thoughts and help on this
series in general. I'm sorry if I seemed dismissive; I just wanted to
avoid holding up some important fixes besides the '--max-new-filters'
feature.

I'm not quite sure how to handle series like these. On the one hand, I'd
like to send a couple of small series to fix the important bugs quickly.
On the other hand, the small series can generate a lot of noise and
burden reviewers and the maintainer when the dependencies between those
series aren't straightforward.

So, I dunno. That's at least shedding a little bit of light on how this
series came to be / got so large, which is part of the reason that it
took so long.

Anyway, thank you again. I'm looking forward to seeing this merged
hopefully soon.

> Thanks,
> -Stolee

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 12/12] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-09-14 20:36             ` Taylor Blau
  2020-09-15  0:59               ` Derrick Stolee
@ 2020-09-15 21:49               ` Junio C Hamano
  2020-09-15 21:53                 ` Taylor Blau
  1 sibling, 1 reply; 75+ messages in thread
From: Junio C Hamano @ 2020-09-15 21:49 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Derrick Stolee, git, dstolee, szeder.dev, Jeff King

Taylor Blau <me@ttaylorr.com> writes:

>> It's getting a bit difficult to track all of these "use this instead"
>> patches. But, I'm not the one applying them, so maybe that's not actually
>> a problem.
>
> The above list is the only changes that I've made, so I'm happy if Junio
> wants to follow what's written there, but I'm equally happy to send a
> new reroll.

It's getting so unorganized to follow from sidelines.  Even
resending just the few steps that needs replacement, indicating
which ones are replaced with them, would be easier to manage (and
full replacement would be the easiest to handle).

Thanks.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 12/12] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-09-15 21:49               ` Junio C Hamano
@ 2020-09-15 21:53                 ` Taylor Blau
  0 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-15 21:53 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Taylor Blau, Derrick Stolee, git, dstolee, szeder.dev, Jeff King

On Tue, Sep 15, 2020 at 02:49:37PM -0700, Junio C Hamano wrote:
> It's getting so unorganized to follow from sidelines.  Even
> resending just the few steps that needs replacement, indicating
> which ones are replaced with them, would be easier to manage (and
> full replacement would be the easiest to handle).
>
> Thanks.

I'll send you a full re-roll, no problem.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux
  2020-09-09 15:22 [PATCH 00/12] more miscellaneous Bloom filter improvements, redux Taylor Blau
                   ` (11 preceding siblings ...)
  2020-09-09 15:24 ` [PATCH 12/12] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
@ 2020-09-16 18:06 ` Taylor Blau
  2020-09-16 18:06   ` [PATCH v2 01/13] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
                     ` (13 more replies)
  2020-09-18  2:58 ` [PATCH v3 " Taylor Blau
  13 siblings, 14 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-16 18:06 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

Here's another re-roll of mine and Stolee's series to add a series of
improvements to the Bloom filter machinery, culminating in the
command-line flag '--max-new-filters'.

Much is left the same from [1], with two notable exceptions:

  - I took much of Stolee's feedback from [2], since the suggestions
    were good and we are rerolling, anyway.

  - I dropped the trace2 calls that write JSON from within the
    commit-graph machinery in favor of pure "data" writes, which are
    easier to grep for. This caused some test refactoring and allowed us
    to ultimately drop the 'test_bloom_filters_computed' function.

Hopefully this is it for this series ;). I think that it's in good shape
now, and I couldn't find anything in my own inspection that I wanted to
change. So, if others feel good, too, I think we should focus on
incremental fixes on top of this.

[1]: https://lore.kernel.org/git/cover.1599664389.git.me@ttaylorr.com/
[2]: https://lore.kernel.org/git/134d64a0-abb6-bdc9-2c05-7aded01a906a@gmail.com/

Derrick Stolee (1):
  bloom/diff: properly short-circuit on max_changes

Taylor Blau (12):
  commit-graph: introduce 'get_bloom_filter_settings()'
  t4216: use an '&&'-chain
  commit-graph: pass a 'struct repository *' in more places
  t/helper/test-read-graph.c: prepare repo settings
  commit-graph: respect 'commitGraph.readChangedPaths'
  commit-graph.c: store maximum changed paths
  bloom: split 'get_bloom_filter()' in two
  bloom: use provided 'struct bloom_filter_settings'
  bloom: encode out-of-bounds filters as non-empty
  commit-graph: rename 'split_commit_graph_opts'
  builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  commit-graph: introduce 'commitGraph.maxNewFilters'

 Documentation/config.txt                      |   2 +
 Documentation/config/commitgraph.txt          |   8 +
 Documentation/git-commit-graph.txt            |   6 +
 .../technical/commit-graph-format.txt         |   2 +-
 blame.c                                       |   8 +-
 bloom.c                                       |  59 +++--
 bloom.h                                       |  29 ++-
 builtin/commit-graph.c                        |  63 ++++-
 commit-graph.c                                | 140 +++++++---
 commit-graph.h                                |  17 +-
 diff.h                                        |   2 -
 fuzz-commit-graph.c                           |   5 +-
 line-log.c                                    |   2 +-
 repo-settings.c                               |   3 +
 repository.h                                  |   1 +
 revision.c                                    |   7 +-
 t/helper/test-bloom.c                         |   4 +-
 t/helper/test-read-graph.c                    |   3 +-
 t/t0095-bloom.sh                              |   8 +-
 t/t4216-log-bloom.sh                          | 241 ++++++++++++++++--
 t/t5324-split-commit-graph.sh                 |  13 +
 tree-diff.c                                   |   5 +-
 22 files changed, 505 insertions(+), 123 deletions(-)
 create mode 100644 Documentation/config/commitgraph.txt

--
2.28.0.510.g86fdc5f89a

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH v2 01/13] commit-graph: introduce 'get_bloom_filter_settings()'
  2020-09-16 18:06 ` [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux Taylor Blau
@ 2020-09-16 18:06   ` Taylor Blau
  2020-09-16 18:07   ` [PATCH v2 02/13] t4216: use an '&&'-chain Taylor Blau
                     ` (12 subsequent siblings)
  13 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-16 18:06 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

Many places in the code often need a pointer to the commit-graph's
'struct bloom_filter_settings', in which case they often take the value
from the top-most commit-graph.

In the non-split case, this works as expected. In the split case,
however, things get a little tricky. Not all layers in a chain of
incremental commit-graphs are required to themselves have Bloom data,
and so whether or not some part of the code uses Bloom filters depends
entirely on whether or not the top-most level of the commit-graph chain
has Bloom filters.

This has been the behavior since Bloom filters were introduced, and has
been codified into the tests since a759bfa9ee (t4216: add end to end
tests for git log with Bloom filters, 2020-04-06). In fact, t4216.130
requires that Bloom filters are not used in exactly the case described
earlier.

There is no reason that this needs to be the case, since it is perfectly
valid for commits in an earlier layer to have Bloom filters when commits
in a newer layer do not.

Since Bloom settings are guaranteed in practice to be the same for any
layer in a chain that has Bloom data, it is sufficient to traverse the
'->base_graph' pointer until either (1) a non-null 'struct
bloom_filter_settings *' is found, or (2) until we are at the root of
the commit-graph chain.

Introduce a 'get_bloom_filter_settings()' function that does just this,
and use it instead of purely dereferencing the top-most graph's
'->bloom_filter_settings' pointer.

While we're at it, add an additional test in t5324 to guard against code
in the commit-graph writing machinery that doesn't correctly handle a
NULL 'struct bloom_filter *'.

Co-authored-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 blame.c                       |  6 ++++--
 bloom.c                       |  6 +++---
 commit-graph.c                | 11 +++++++++++
 commit-graph.h                |  2 ++
 revision.c                    |  5 +----
 t/t4216-log-bloom.sh          |  9 ++++++---
 t/t5324-split-commit-graph.sh | 13 +++++++++++++
 7 files changed, 40 insertions(+), 12 deletions(-)

diff --git a/blame.c b/blame.c
index 1be1cd82a2..903e23af23 100644
--- a/blame.c
+++ b/blame.c
@@ -2892,16 +2892,18 @@ void setup_blame_bloom_data(struct blame_scoreboard *sb,
 			    const char *path)
 {
 	struct blame_bloom_data *bd;
+	struct bloom_filter_settings *bs;
 
 	if (!sb->repo->objects->commit_graph)
 		return;
 
-	if (!sb->repo->objects->commit_graph->bloom_filter_settings)
+	bs = get_bloom_filter_settings(sb->repo);
+	if (!bs)
 		return;
 
 	bd = xmalloc(sizeof(struct blame_bloom_data));
 
-	bd->settings = sb->repo->objects->commit_graph->bloom_filter_settings;
+	bd->settings = bs;
 
 	bd->alloc = 4;
 	bd->nr = 0;
diff --git a/bloom.c b/bloom.c
index 1a573226e7..cd9380ac62 100644
--- a/bloom.c
+++ b/bloom.c
@@ -38,7 +38,7 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
 	while (graph_pos < g->num_commits_in_base)
 		g = g->base_graph;
 
-	/* The commit graph commit 'c' lives in doesn't carry bloom filters. */
+	/* The commit graph commit 'c' lives in doesn't carry Bloom filters. */
 	if (!g->chunk_bloom_indexes)
 		return 0;
 
@@ -195,8 +195,8 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	if (!filter->data) {
 		load_commit_graph_info(r, c);
 		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH &&
-			r->objects->commit_graph->chunk_bloom_indexes)
-			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
+			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
+				return filter;
 	}
 
 	if (filter->data)
diff --git a/commit-graph.c b/commit-graph.c
index 0ed003e218..6a36ed0b06 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -667,6 +667,17 @@ int generation_numbers_enabled(struct repository *r)
 	return !!first_generation;
 }
 
+struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
+{
+	struct commit_graph *g = r->objects->commit_graph;
+	while (g) {
+		if (g->bloom_filter_settings)
+			return g->bloom_filter_settings;
+		g = g->base_graph;
+	}
+	return NULL;
+}
+
 static void close_commit_graph_one(struct commit_graph *g)
 {
 	if (!g)
diff --git a/commit-graph.h b/commit-graph.h
index 09a97030dc..0677dd1031 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -87,6 +87,8 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size);
  */
 int generation_numbers_enabled(struct repository *r);
 
+struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
+
 enum commit_graph_write_flags {
 	COMMIT_GRAPH_WRITE_APPEND     = (1 << 0),
 	COMMIT_GRAPH_WRITE_PROGRESS   = (1 << 1),
diff --git a/revision.c b/revision.c
index 1239023f93..6aeb764821 100644
--- a/revision.c
+++ b/revision.c
@@ -681,10 +681,7 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 
 	repo_parse_commit(revs->repo, revs->commits->item);
 
-	if (!revs->repo->objects->commit_graph)
-		return;
-
-	revs->bloom_filter_settings = revs->repo->objects->commit_graph->bloom_filter_settings;
+	revs->bloom_filter_settings = get_bloom_filter_settings(revs->repo);
 	if (!revs->bloom_filter_settings)
 		return;
 
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 4bb9e9dbe2..715912ad0f 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -65,7 +65,7 @@ setup () {
 
 test_bloom_filters_used () {
 	log_args=$1
-	bloom_trace_prefix="statistics:{\"filter_not_present\":0,\"maybe\""
+	bloom_trace_prefix="statistics:{\"filter_not_present\":${2:-0},\"maybe\""
 	setup "$log_args" &&
 	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
 	test_cmp log_wo_bloom log_w_bloom &&
@@ -139,8 +139,11 @@ test_expect_success 'setup - add commit-graph to the chain without Bloom filters
 	test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain
 '
 
-test_expect_success 'Do not use Bloom filters if the latest graph does not have Bloom filters.' '
-	test_bloom_filters_not_used "-- A/B"
+test_expect_success 'use Bloom filters even if the latest graph does not have Bloom filters' '
+	# Ensure that the number of empty filters is equal to the number of
+	# filters in the latest graph layer to prove that they are loaded (and
+	# ignored).
+	test_bloom_filters_used "-- A/B" 3
 '
 
 test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 18216463c7..c334ee9155 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -427,4 +427,17 @@ done <<\EOF
 0600 -r--------
 EOF
 
+test_expect_success '--split=replace with partial Bloom data' '
+	rm -rf $graphdir $infodir/commit-graph &&
+	git reset --hard commits/3 &&
+	git rev-list -1 HEAD~2 >a &&
+	git rev-list -1 HEAD~1 >b &&
+	git commit-graph write --split=no-merge --stdin-commits --changed-paths <a &&
+	git commit-graph write --split=no-merge --stdin-commits <b &&
+	git commit-graph write --split=replace --stdin-commits --changed-paths <c &&
+	ls $graphdir/graph-*.graph >graph-files &&
+	test_line_count = 1 graph-files &&
+	verify_chain_files_exist $graphdir
+'
+
 test_done
-- 
2.28.0.510.g86fdc5f89a


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v2 02/13] t4216: use an '&&'-chain
  2020-09-16 18:06 ` [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux Taylor Blau
  2020-09-16 18:06   ` [PATCH v2 01/13] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
@ 2020-09-16 18:07   ` Taylor Blau
  2020-09-16 18:07   ` [PATCH v2 03/13] commit-graph: pass a 'struct repository *' in more places Taylor Blau
                     ` (11 subsequent siblings)
  13 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-16 18:07 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

In a759bfa9ee (t4216: add end to end tests for git log with Bloom
filters, 2020-04-06), a 'rm' invocation was added without a
corresponding '&&' chain.

When 'trace.perf' already exists, everything works fine. However, the
function can be executed without 'trace.perf' on disk (eg., when the
subset of tests run is altered with '--run'), and so the bare 'rm'
complains about a missing file.

To remove some noise from the test log, invoke 'rm' with '-f', at which
point it is sensible to place the 'rm -f' in an '&&'-chain, which is
both (1) our usual style, and (2) avoids a broken chain in the future if
more commands are added at the beginning of the function.

Helped-by: Eric Sunshine <sunshine@sunshineco.com>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/t4216-log-bloom.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 715912ad0f..cd89c75002 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -58,7 +58,7 @@ sane_unset GIT_TRACE2_PERF_BRIEF
 sane_unset GIT_TRACE2_CONFIG_PARAMS
 
 setup () {
-	rm "$TRASH_DIRECTORY/trace.perf"
+	rm -f "$TRASH_DIRECTORY/trace.perf" &&
 	git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom &&
 	GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom
 }
-- 
2.28.0.510.g86fdc5f89a


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v2 03/13] commit-graph: pass a 'struct repository *' in more places
  2020-09-16 18:06 ` [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux Taylor Blau
  2020-09-16 18:06   ` [PATCH v2 01/13] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
  2020-09-16 18:07   ` [PATCH v2 02/13] t4216: use an '&&'-chain Taylor Blau
@ 2020-09-16 18:07   ` Taylor Blau
  2020-09-16 18:07   ` [PATCH v2 04/13] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
                     ` (10 subsequent siblings)
  13 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-16 18:07 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

In a future commit, some commit-graph internals will want access to
'r->settings', but we only have the 'struct object_directory *'
corresponding to that repository.

Add an additional parameter to pass the repository around in more
places.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/commit-graph.c |  2 +-
 commit-graph.c         | 17 ++++++++++-------
 commit-graph.h         |  6 ++++--
 fuzz-commit-graph.c    |  5 +++--
 4 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 523501f217..ba5584463f 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -106,7 +106,7 @@ static int graph_verify(int argc, const char **argv)
 	FREE_AND_NULL(graph_name);
 
 	if (open_ok)
-		graph = load_commit_graph_one_fd_st(fd, &st, odb);
+		graph = load_commit_graph_one_fd_st(the_repository, fd, &st, odb);
 	else
 		graph = read_commit_graph_one(the_repository, odb);
 
diff --git a/commit-graph.c b/commit-graph.c
index 6a36ed0b06..72a838bd00 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -231,7 +231,8 @@ int open_commit_graph(const char *graph_file, int *fd, struct stat *st)
 	return 1;
 }
 
-struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st,
+struct commit_graph *load_commit_graph_one_fd_st(struct repository *r,
+						 int fd, struct stat *st,
 						 struct object_directory *odb)
 {
 	void *graph_map;
@@ -247,7 +248,7 @@ struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st,
 	}
 	graph_map = xmmap(NULL, graph_size, PROT_READ, MAP_PRIVATE, fd, 0);
 	close(fd);
-	ret = parse_commit_graph(graph_map, graph_size);
+	ret = parse_commit_graph(r, graph_map, graph_size);
 
 	if (ret)
 		ret->odb = odb;
@@ -287,7 +288,8 @@ static int verify_commit_graph_lite(struct commit_graph *g)
 	return 0;
 }
 
-struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size)
+struct commit_graph *parse_commit_graph(struct repository *r,
+					void *graph_map, size_t graph_size)
 {
 	const unsigned char *data, *chunk_lookup;
 	uint32_t i;
@@ -452,7 +454,8 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size)
 	return NULL;
 }
 
-static struct commit_graph *load_commit_graph_one(const char *graph_file,
+static struct commit_graph *load_commit_graph_one(struct repository *r,
+						  const char *graph_file,
 						  struct object_directory *odb)
 {
 
@@ -464,7 +467,7 @@ static struct commit_graph *load_commit_graph_one(const char *graph_file,
 	if (!open_ok)
 		return NULL;
 
-	g = load_commit_graph_one_fd_st(fd, &st, odb);
+	g = load_commit_graph_one_fd_st(r, fd, &st, odb);
 
 	if (g)
 		g->filename = xstrdup(graph_file);
@@ -476,7 +479,7 @@ static struct commit_graph *load_commit_graph_v1(struct repository *r,
 						 struct object_directory *odb)
 {
 	char *graph_name = get_commit_graph_filename(odb);
-	struct commit_graph *g = load_commit_graph_one(graph_name, odb);
+	struct commit_graph *g = load_commit_graph_one(r, graph_name, odb);
 	free(graph_name);
 
 	return g;
@@ -557,7 +560,7 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r,
 		valid = 0;
 		for (odb = r->objects->odb; odb; odb = odb->next) {
 			char *graph_name = get_split_graph_filename(odb, line.buf);
-			struct commit_graph *g = load_commit_graph_one(graph_name, odb);
+			struct commit_graph *g = load_commit_graph_one(r, graph_name, odb);
 
 			free(graph_name);
 
diff --git a/commit-graph.h b/commit-graph.h
index 0677dd1031..d9acb22bac 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -75,11 +75,13 @@ struct commit_graph {
 	struct bloom_filter_settings *bloom_filter_settings;
 };
 
-struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st,
+struct commit_graph *load_commit_graph_one_fd_st(struct repository *r,
+						 int fd, struct stat *st,
 						 struct object_directory *odb);
 struct commit_graph *read_commit_graph_one(struct repository *r,
 					   struct object_directory *odb);
-struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size);
+struct commit_graph *parse_commit_graph(struct repository *r,
+					void *graph_map, size_t graph_size);
 
 /*
  * Return 1 if and only if the repository has a commit-graph
diff --git a/fuzz-commit-graph.c b/fuzz-commit-graph.c
index 430817214d..e7cf6d5b0f 100644
--- a/fuzz-commit-graph.c
+++ b/fuzz-commit-graph.c
@@ -1,7 +1,8 @@
 #include "commit-graph.h"
 #include "repository.h"
 
-struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size);
+struct commit_graph *parse_commit_graph(struct repository *r,
+					void *graph_map, size_t graph_size);
 
 int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size);
 
@@ -10,7 +11,7 @@ int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
 	struct commit_graph *g;
 
 	initialize_the_repository();
-	g = parse_commit_graph((void *)data, size);
+	g = parse_commit_graph(the_repository, (void *)data, size);
 	repo_clear(the_repository);
 	free_commit_graph(g);
 
-- 
2.28.0.510.g86fdc5f89a


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v2 04/13] t/helper/test-read-graph.c: prepare repo settings
  2020-09-16 18:06 ` [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux Taylor Blau
                     ` (2 preceding siblings ...)
  2020-09-16 18:07   ` [PATCH v2 03/13] commit-graph: pass a 'struct repository *' in more places Taylor Blau
@ 2020-09-16 18:07   ` Taylor Blau
  2020-09-16 18:07   ` [PATCH v2 05/13] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
                     ` (9 subsequent siblings)
  13 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-16 18:07 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

The read-graph test-tool is used by a number of the commit-graph test to
assert various properties about a commit-graph. Previously, this program
never ran 'prepare_repo_settings()'. There was no need to do so, since
none of the commit-graph machinery is affected by the repo settings.

In the next patch, the commit-graph machinery's behavior will become
dependent on the repo settings, and so loading them before running the
rest of the test tool is critical.

As such, teach the test tool to call 'prepare_repo_settings()'.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/helper/test-read-graph.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index 6d0c962438..5f585a1725 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -12,11 +12,12 @@ int cmd__read_graph(int argc, const char **argv)
 	setup_git_directory();
 	odb = the_repository->objects->odb;
 
+	prepare_repo_settings(the_repository);
+
 	graph = read_commit_graph_one(the_repository, odb);
 	if (!graph)
 		return 1;
 
-
 	printf("header: %08x %d %d %d %d\n",
 		ntohl(*(uint32_t*)graph->data),
 		*(unsigned char*)(graph->data + 4),
-- 
2.28.0.510.g86fdc5f89a


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v2 05/13] commit-graph: respect 'commitGraph.readChangedPaths'
  2020-09-16 18:06 ` [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux Taylor Blau
                     ` (3 preceding siblings ...)
  2020-09-16 18:07   ` [PATCH v2 04/13] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
@ 2020-09-16 18:07   ` Taylor Blau
  2020-09-16 18:07   ` [PATCH v2 06/13] commit-graph.c: store maximum changed paths Taylor Blau
                     ` (8 subsequent siblings)
  13 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-16 18:07 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

Git uses the 'core.commitGraph' configuration value to control whether
or not the commit graph is used when parsing commits or performing a
traversal.

Now that commit-graphs can also contain a section for changed-path Bloom
filters, administrators that already have commit-graphs may find it
convenient to use those graphs without relying on their changed-path
Bloom filters. This can happen, for example, during a staged roll-out,
or in the event of an incident.

Introduce 'commitGraph.readChangedPaths' to control whether or not Bloom
filters are read. Note that this configuration is independent from both:

  - 'core.commitGraph', to allow flexibility in using all parts of a
    commit-graph _except_ for its Bloom filters.

  - The '--changed-paths' option for 'git commit-graph write', to allow
    reading and writing Bloom filters to be controlled independently.

When the variable is set, pretend as if no Bloom data was specified at
all. This avoids adding additional special-casing outside of the
commit-graph internals.

Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config.txt             | 2 ++
 Documentation/config/commitgraph.txt | 4 ++++
 commit-graph.c                       | 6 ++++--
 repo-settings.c                      | 3 +++
 repository.h                         | 1 +
 t/t4216-log-bloom.sh                 | 4 +++-
 6 files changed, 17 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/config/commitgraph.txt

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 3042d80978..770ae79b82 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -340,6 +340,8 @@ include::config/column.txt[]
 
 include::config/commit.txt[]
 
+include::config/commitgraph.txt[]
+
 include::config/credential.txt[]
 
 include::config/completion.txt[]
diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
new file mode 100644
index 0000000000..cff0797b54
--- /dev/null
+++ b/Documentation/config/commitgraph.txt
@@ -0,0 +1,4 @@
+commitGraph.readChangedPaths::
+	If true, then git will use the changed-path Bloom filters in the
+	commit-graph file (if it exists, and they are present). Defaults to
+	true. See linkgit:git-commit-graph[1] for more information.
diff --git a/commit-graph.c b/commit-graph.c
index 72a838bd00..ea54d108b9 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -327,6 +327,8 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 		return NULL;
 	}
 
+	prepare_repo_settings(r);
+
 	graph = alloc_commit_graph();
 
 	graph->hash_len = the_hash_algo->rawsz;
@@ -403,14 +405,14 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 		case GRAPH_CHUNKID_BLOOMINDEXES:
 			if (graph->chunk_bloom_indexes)
 				chunk_repeated = 1;
-			else
+			else if (r->settings.commit_graph_read_changed_paths)
 				graph->chunk_bloom_indexes = data + chunk_offset;
 			break;
 
 		case GRAPH_CHUNKID_BLOOMDATA:
 			if (graph->chunk_bloom_data)
 				chunk_repeated = 1;
-			else {
+			else if (r->settings.commit_graph_read_changed_paths) {
 				uint32_t hash_version;
 				graph->chunk_bloom_data = data + chunk_offset;
 				hash_version = get_be32(data + chunk_offset);
diff --git a/repo-settings.c b/repo-settings.c
index aa61a35338..88ccce2036 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -17,9 +17,12 @@ void prepare_repo_settings(struct repository *r)
 
 	if (!repo_config_get_bool(r, "core.commitgraph", &value))
 		r->settings.core_commit_graph = value;
+	if (!repo_config_get_bool(r, "commitgraph.readchangedpaths", &value))
+		r->settings.commit_graph_read_changed_paths = value;
 	if (!repo_config_get_bool(r, "gc.writecommitgraph", &value))
 		r->settings.gc_write_commit_graph = value;
 	UPDATE_DEFAULT_BOOL(r->settings.core_commit_graph, 1);
+	UPDATE_DEFAULT_BOOL(r->settings.commit_graph_read_changed_paths, 1);
 	UPDATE_DEFAULT_BOOL(r->settings.gc_write_commit_graph, 1);
 
 	if (!repo_config_get_int(r, "index.version", &value))
diff --git a/repository.h b/repository.h
index 628c834367..bacf843d46 100644
--- a/repository.h
+++ b/repository.h
@@ -30,6 +30,7 @@ struct repo_settings {
 	int initialized;
 
 	int core_commit_graph;
+	int commit_graph_read_changed_paths;
 	int gc_write_commit_graph;
 	int fetch_write_commit_graph;
 
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index cd89c75002..fc7693806c 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -95,7 +95,9 @@ do
 		      "--ancestry-path side..master"
 	do
 		test_expect_success "git log option: $option for path: $path" '
-			test_bloom_filters_used "$option -- $path"
+			test_bloom_filters_used "$option -- $path" &&
+			test_config commitgraph.readChangedPaths false &&
+			test_bloom_filters_not_used "$option -- $path"
 		'
 	done
 done
-- 
2.28.0.510.g86fdc5f89a


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v2 06/13] commit-graph.c: store maximum changed paths
  2020-09-16 18:06 ` [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux Taylor Blau
                     ` (4 preceding siblings ...)
  2020-09-16 18:07   ` [PATCH v2 05/13] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
@ 2020-09-16 18:07   ` Taylor Blau
  2020-09-16 18:07   ` [PATCH v2 07/13] bloom: split 'get_bloom_filter()' in two Taylor Blau
                     ` (7 subsequent siblings)
  13 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-16 18:07 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

For now, we assume that there is a fixed constant describing the
maximum number of changed paths we are willing to store in a Bloom
filter.

Prepare for that to (at least partially) not be the case by making it a
member of the 'struct bloom_filter_settings'. This will be helpful in
the subsequent patches by reducing the size of test cases that exercise
storing too many changed paths, as well as preparing for an eventual
future in which this value might change.

This patch alone does not cause newly generated Bloom filters to use
a custom upper-bound on the maximum number of changed paths a single
Bloom filter can hold, that will occur in a later patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bloom.h              | 11 ++++++++++-
 commit-graph.c       |  3 +++
 t/t4216-log-bloom.sh |  4 ++--
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/bloom.h b/bloom.h
index d8fbb0fbf1..0b9b59a6fe 100644
--- a/bloom.h
+++ b/bloom.h
@@ -28,9 +28,18 @@ struct bloom_filter_settings {
 	 * that contain n*b bits.
 	 */
 	uint32_t bits_per_entry;
+
+	/*
+	 * The maximum number of changed paths per commit
+	 * before declaring a Bloom filter to be too-large.
+	 *
+	 * Not written to the commit-graph file.
+	 */
+	uint32_t max_changed_paths;
 };
 
-#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
+#define DEFAULT_BLOOM_MAX_CHANGES 512
+#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10, DEFAULT_BLOOM_MAX_CHANGES }
 #define BITS_PER_WORD 8
 #define BLOOMDATA_CHUNK_HEADER_SIZE 3 * sizeof(uint32_t)
 
diff --git a/commit-graph.c b/commit-graph.c
index ea54d108b9..55af498aa0 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1201,6 +1201,7 @@ static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
 	jw_object_intmax(&jw, "hash_version", ctx->bloom_settings->hash_version);
 	jw_object_intmax(&jw, "num_hashes", ctx->bloom_settings->num_hashes);
 	jw_object_intmax(&jw, "bits_per_entry", ctx->bloom_settings->bits_per_entry);
+	jw_object_intmax(&jw, "max_changed_paths", ctx->bloom_settings->max_changed_paths);
 	jw_end(&jw);
 
 	trace2_data_json("bloom", ctx->r, "settings", &jw);
@@ -1669,6 +1670,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 							      bloom_settings.bits_per_entry);
 		bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
 							  bloom_settings.num_hashes);
+		bloom_settings.max_changed_paths = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS",
+							  bloom_settings.max_changed_paths);
 		ctx->bloom_settings = &bloom_settings;
 	}
 
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index fc7693806c..47ddf2641f 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -174,11 +174,11 @@ test_expect_success 'persist filter settings' '
 		GIT_TEST_BLOOM_SETTINGS_NUM_HASHES=9 \
 		GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY=15 \
 		git commit-graph write --reachable --changed-paths &&
-	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2.txt &&
+	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15" trace2.txt &&
 	GIT_TRACE2_EVENT="$(pwd)/trace2-auto.txt" \
 		GIT_TRACE2_EVENT_NESTING=5 \
 		git commit-graph write --reachable --changed-paths &&
-	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2-auto.txt
+	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15" trace2-auto.txt
 '
 
 test_expect_success 'correctly report changes over limit' '
-- 
2.28.0.510.g86fdc5f89a


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v2 07/13] bloom: split 'get_bloom_filter()' in two
  2020-09-16 18:06 ` [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux Taylor Blau
                     ` (5 preceding siblings ...)
  2020-09-16 18:07   ` [PATCH v2 06/13] commit-graph.c: store maximum changed paths Taylor Blau
@ 2020-09-16 18:07   ` Taylor Blau
  2020-09-16 18:07   ` [PATCH v2 08/13] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
                     ` (6 subsequent siblings)
  13 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-16 18:07 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

'get_bloom_filter' takes a flag to control whether it will compute a
Bloom filter if the requested one is missing. In the next patch, we'll
add yet another parameter to this method, which would force all but one
caller to specify an extra 'NULL' parameter at the end.

Instead of doing this, split 'get_bloom_filter' into two functions:
'get_bloom_filter' and 'get_or_compute_bloom_filter'. The former only
looks up a Bloom filter (and does not compute one if it's missing,
thus dropping the 'compute_if_not_present' flag). The latter does
compute missing Bloom filters, with an additional parameter to store
whether or not it needed to do so.

This simplifies many call-sites, since the majority of existing callers
to 'get_bloom_filter' do not want missing Bloom filters to be computed
(so they can drop the parameter entirely and use the simpler version of
the function).

While we're at it, instrument the new 'get_or_compute_bloom_filter()'
with counters in the 'write_commit_graph_context' struct which store
the number of filters that we did and didn't compute, as well as filters
that were truncated.

It would be nice to drop the 'compute_if_not_present' flag entirely,
since all remaining callers of 'get_or_compute_bloom_filter' pass it as
'1', but this will change in a future patch and hence cannot be removed.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 blame.c               |  2 +-
 bloom.c               | 16 +++++++++++++---
 bloom.h               | 16 +++++++++++++---
 commit-graph.c        | 34 +++++++++++++++++++++++++++++++---
 line-log.c            |  2 +-
 revision.c            |  2 +-
 t/helper/test-bloom.c |  3 ++-
 7 files changed, 62 insertions(+), 13 deletions(-)

diff --git a/blame.c b/blame.c
index 903e23af23..e5ba35dbd1 100644
--- a/blame.c
+++ b/blame.c
@@ -1276,7 +1276,7 @@ static int maybe_changed_path(struct repository *r,
 	if (commit_graph_generation(origin->commit) == GENERATION_NUMBER_INFINITY)
 		return 1;
 
-	filter = get_bloom_filter(r, origin->commit, 0);
+	filter = get_bloom_filter(r, origin->commit);
 
 	if (!filter)
 		return 1;
diff --git a/bloom.c b/bloom.c
index cd9380ac62..393c61b4bc 100644
--- a/bloom.c
+++ b/bloom.c
@@ -177,9 +177,10 @@ static int pathmap_cmp(const void *hashmap_cmp_fn_data,
 	return strcmp(e1->path, e2->path);
 }
 
-struct bloom_filter *get_bloom_filter(struct repository *r,
-				      struct commit *c,
-				      int compute_if_not_present)
+struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
+						 struct commit *c,
+						 int compute_if_not_present,
+						 enum bloom_filter_computed *computed)
 {
 	struct bloom_filter *filter;
 	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
@@ -187,6 +188,9 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	struct diff_options diffopt;
 	int max_changes = 512;
 
+	if (computed)
+		*computed = BLOOM_NOT_COMPUTED;
+
 	if (!bloom_filters.slab_size)
 		return NULL;
 
@@ -271,8 +275,14 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 			diff_free_filepair(diff_queued_diff.queue[i]);
 		filter->data = NULL;
 		filter->len = 0;
+
+		if (computed)
+			*computed |= BLOOM_TRUNC_LARGE;
 	}
 
+	if (computed)
+		*computed |= BLOOM_COMPUTED;
+
 	free(diff_queued_diff.queue);
 	DIFF_QUEUE_CLEAR(&diff_queued_diff);
 
diff --git a/bloom.h b/bloom.h
index 0b9b59a6fe..e2e035ad14 100644
--- a/bloom.h
+++ b/bloom.h
@@ -89,9 +89,19 @@ void add_key_to_filter(const struct bloom_key *key,
 
 void init_bloom_filters(void);
 
-struct bloom_filter *get_bloom_filter(struct repository *r,
-				      struct commit *c,
-				      int compute_if_not_present);
+enum bloom_filter_computed {
+	BLOOM_NOT_COMPUTED = (1 << 0),
+	BLOOM_COMPUTED     = (1 << 1),
+	BLOOM_TRUNC_LARGE  = (1 << 2),
+};
+
+struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
+						 struct commit *c,
+						 int compute_if_not_present,
+						 enum bloom_filter_computed *computed);
+
+#define get_bloom_filter(r, c) get_or_compute_bloom_filter( \
+	(r), (c), 0, NULL)
 
 int bloom_filter_contains(const struct bloom_filter *filter,
 			  const struct bloom_key *key,
diff --git a/commit-graph.c b/commit-graph.c
index 55af498aa0..45fcd62596 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -971,6 +971,10 @@ struct write_commit_graph_context {
 	const struct split_commit_graph_opts *split_opts;
 	size_t total_bloom_filter_data_size;
 	const struct bloom_filter_settings *bloom_settings;
+
+	int count_bloom_filter_computed;
+	int count_bloom_filter_not_computed;
+	int count_bloom_filter_trunc_large;
 };
 
 static int write_graph_chunk_fanout(struct hashfile *f,
@@ -1182,7 +1186,7 @@ static int write_graph_chunk_bloom_indexes(struct hashfile *f,
 	uint32_t cur_pos = 0;
 
 	while (list < last) {
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
+		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
 		size_t len = filter ? filter->len : 0;
 		cur_pos += len;
 		display_progress(ctx->progress, ++ctx->progress_cnt);
@@ -1222,7 +1226,7 @@ static int write_graph_chunk_bloom_data(struct hashfile *f,
 	hashwrite_be32(f, ctx->bloom_settings->bits_per_entry);
 
 	while (list < last) {
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
+		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
 		size_t len = filter ? filter->len : 0;
 
 		display_progress(ctx->progress, ++ctx->progress_cnt);
@@ -1392,6 +1396,16 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 	stop_progress(&ctx->progress);
 }
 
+static void trace2_bloom_filter_write_statistics(struct write_commit_graph_context *ctx)
+{
+	trace2_data_intmax("commit-graph", ctx->r, "filter-computed",
+			   ctx->count_bloom_filter_computed);
+	trace2_data_intmax("commit-graph", ctx->r, "filter-not-computed",
+			   ctx->count_bloom_filter_not_computed);
+	trace2_data_intmax("commit-graph", ctx->r, "filter-trunc-large",
+			   ctx->count_bloom_filter_trunc_large);
+}
+
 static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 {
 	int i;
@@ -1414,12 +1428,26 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 		QSORT(sorted_commits, ctx->commits.nr, commit_gen_cmp);
 
 	for (i = 0; i < ctx->commits.nr; i++) {
+		enum bloom_filter_computed computed = 0;
 		struct commit *c = sorted_commits[i];
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
+		struct bloom_filter *filter = get_or_compute_bloom_filter(
+			ctx->r,
+			c,
+			1,
+			&computed);
+		if (computed & BLOOM_COMPUTED) {
+			ctx->count_bloom_filter_computed++;
+			if (computed & BLOOM_TRUNC_LARGE)
+				ctx->count_bloom_filter_trunc_large++;
+		} else if (computed & BLOOM_NOT_COMPUTED)
+			ctx->count_bloom_filter_not_computed++;
 		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
 		display_progress(progress, i + 1);
 	}
 
+	if (trace2_is_enabled())
+		trace2_bloom_filter_write_statistics(ctx);
+
 	free(sorted_commits);
 	stop_progress(&progress);
 }
diff --git a/line-log.c b/line-log.c
index bf73ea95ac..68eeb425f8 100644
--- a/line-log.c
+++ b/line-log.c
@@ -1159,7 +1159,7 @@ static int bloom_filter_check(struct rev_info *rev,
 		return 1;
 
 	if (!rev->bloom_filter_settings ||
-	    !(filter = get_bloom_filter(rev->repo, commit, 0)))
+	    !(filter = get_bloom_filter(rev->repo, commit)))
 		return 1;
 
 	if (!range)
diff --git a/revision.c b/revision.c
index 6aeb764821..c53ab99e59 100644
--- a/revision.c
+++ b/revision.c
@@ -752,7 +752,7 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 	if (commit_graph_generation(commit) == GENERATION_NUMBER_INFINITY)
 		return -1;
 
-	filter = get_bloom_filter(revs->repo, commit, 0);
+	filter = get_bloom_filter(revs->repo, commit);
 
 	if (!filter) {
 		count_bloom_filter_not_present++;
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 5e77d56f59..9f7bb729fc 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -39,7 +39,8 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
 	struct bloom_filter *filter;
 	setup_git_directory();
 	c = lookup_commit(the_repository, commit_oid);
-	filter = get_bloom_filter(the_repository, c, 1);
+	filter = get_or_compute_bloom_filter(the_repository, c, 1,
+					     NULL);
 	print_bloom_filter(filter);
 }
 
-- 
2.28.0.510.g86fdc5f89a


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v2 08/13] bloom: use provided 'struct bloom_filter_settings'
  2020-09-16 18:06 ` [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux Taylor Blau
                     ` (6 preceding siblings ...)
  2020-09-16 18:07   ` [PATCH v2 07/13] bloom: split 'get_bloom_filter()' in two Taylor Blau
@ 2020-09-16 18:07   ` Taylor Blau
  2020-09-16 18:07   ` [PATCH v2 09/13] bloom/diff: properly short-circuit on max_changes Taylor Blau
                     ` (5 subsequent siblings)
  13 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-16 18:07 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

When 'get_or_compute_bloom_filter()' needs to compute a Bloom filter
from scratch, it looks to the default 'struct bloom_filter_settings' in
order to determine the maximum number of changed paths, number of bits
per entry, and so on.

All of these values have so far been constant, and so there was no need
to pass in a pointer from the caller (eg., the one that is stored in the
'struct write_commit_graph_context').

Start passing in a 'struct bloom_filter_settings *' instead of using the
default values to respect graph-specific settings (eg., in the case of
setting 'GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS').

In order to have an initialized value for these settings, move its
initialization to earlier in the commit-graph write.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bloom.c               | 13 ++++++-------
 bloom.h               |  3 ++-
 commit-graph.c        | 21 ++++++++++-----------
 t/helper/test-bloom.c |  1 +
 4 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/bloom.c b/bloom.c
index 393c61b4bc..2d6aef9098 100644
--- a/bloom.c
+++ b/bloom.c
@@ -180,13 +180,12 @@ static int pathmap_cmp(const void *hashmap_cmp_fn_data,
 struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 						 struct commit *c,
 						 int compute_if_not_present,
+						 const struct bloom_filter_settings *settings,
 						 enum bloom_filter_computed *computed)
 {
 	struct bloom_filter *filter;
-	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 	int i;
 	struct diff_options diffopt;
-	int max_changes = 512;
 
 	if (computed)
 		*computed = BLOOM_NOT_COMPUTED;
@@ -211,7 +210,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 	repo_diff_setup(r, &diffopt);
 	diffopt.flags.recursive = 1;
 	diffopt.detect_rename = 0;
-	diffopt.max_changes = max_changes;
+	diffopt.max_changes = settings->max_changed_paths;
 	diff_setup_done(&diffopt);
 
 	/* ensure commit is parsed so we have parent information */
@@ -223,7 +222,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
 	diffcore_std(&diffopt);
 
-	if (diffopt.num_changes <= max_changes) {
+	if (diffopt.num_changes <= settings->max_changed_paths) {
 		struct hashmap pathmap;
 		struct pathmap_hash_entry *e;
 		struct hashmap_iter iter;
@@ -260,13 +259,13 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 			diff_free_filepair(diff_queued_diff.queue[i]);
 		}
 
-		filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
+		filter->len = (hashmap_get_size(&pathmap) * settings->bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
 		filter->data = xcalloc(filter->len, sizeof(unsigned char));
 
 		hashmap_for_each_entry(&pathmap, &iter, e, entry) {
 			struct bloom_key key;
-			fill_bloom_key(e->path, strlen(e->path), &key, &settings);
-			add_key_to_filter(&key, filter, &settings);
+			fill_bloom_key(e->path, strlen(e->path), &key, settings);
+			add_key_to_filter(&key, filter, settings);
 		}
 
 		hashmap_free_entries(&pathmap, struct pathmap_hash_entry, entry);
diff --git a/bloom.h b/bloom.h
index e2e035ad14..c6d77e8393 100644
--- a/bloom.h
+++ b/bloom.h
@@ -98,10 +98,11 @@ enum bloom_filter_computed {
 struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 						 struct commit *c,
 						 int compute_if_not_present,
+						 const struct bloom_filter_settings *settings,
 						 enum bloom_filter_computed *computed);
 
 #define get_bloom_filter(r, c) get_or_compute_bloom_filter( \
-	(r), (c), 0, NULL)
+	(r), (c), 0, NULL, NULL)
 
 int bloom_filter_contains(const struct bloom_filter *filter,
 			  const struct bloom_key *key,
diff --git a/commit-graph.c b/commit-graph.c
index 45fcd62596..1ca754f19c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1434,6 +1434,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 			ctx->r,
 			c,
 			1,
+			ctx->bloom_settings,
 			&computed);
 		if (computed & BLOOM_COMPUTED) {
 			ctx->count_bloom_filter_computed++;
@@ -1691,17 +1692,6 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	int num_chunks = 3;
 	uint64_t chunk_offset;
 	struct object_id file_hash;
-	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
-
-	if (!ctx->bloom_settings) {
-		bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
-							      bloom_settings.bits_per_entry);
-		bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
-							  bloom_settings.num_hashes);
-		bloom_settings.max_changed_paths = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS",
-							  bloom_settings.max_changed_paths);
-		ctx->bloom_settings = &bloom_settings;
-	}
 
 	if (ctx->split) {
 		struct strbuf tmp_file = STRBUF_INIT;
@@ -2147,6 +2137,7 @@ int write_commit_graph(struct object_directory *odb,
 	uint32_t i, count_distinct = 0;
 	int res = 0;
 	int replace = 0;
+	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 
 	if (!commit_graph_compatible(the_repository))
 		return 0;
@@ -2160,6 +2151,14 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->split_opts = split_opts;
 	ctx->total_bloom_filter_data_size = 0;
 
+	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
+						      bloom_settings.bits_per_entry);
+	bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
+						  bloom_settings.num_hashes);
+	bloom_settings.max_changed_paths = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS",
+							 bloom_settings.max_changed_paths);
+	ctx->bloom_settings = &bloom_settings;
+
 	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
 		ctx->changed_paths = 1;
 	if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 9f7bb729fc..46e97b04eb 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -40,6 +40,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
 	setup_git_directory();
 	c = lookup_commit(the_repository, commit_oid);
 	filter = get_or_compute_bloom_filter(the_repository, c, 1,
+					     &settings,
 					     NULL);
 	print_bloom_filter(filter);
 }
-- 
2.28.0.510.g86fdc5f89a


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v2 09/13] bloom/diff: properly short-circuit on max_changes
  2020-09-16 18:06 ` [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux Taylor Blau
                     ` (7 preceding siblings ...)
  2020-09-16 18:07   ` [PATCH v2 08/13] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
@ 2020-09-16 18:07   ` Taylor Blau
  2020-09-16 18:07   ` [PATCH v2 10/13] bloom: encode out-of-bounds filters as non-empty Taylor Blau
                     ` (4 subsequent siblings)
  13 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-16 18:07 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

From: Derrick Stolee <dstolee@microsoft.com>

Commit e3696980 (diff: halt tree-diff early after max_changes,
2020-03-30) intended to create a mechanism to short-circuit a diff
calculation after a certain number of paths were modified. By
incrementing a "num_changes" counter throughout the recursive
ll_diff_tree_paths(), this was supposed to match the number of changes
that would be written into the changed-path Bloom filters.
Unfortunately, this was not implemented correctly and instead misses
simple cases like file modifications. This then does not stop very
large changed-path filters from being written (unless they add or remove
many files).

To start, change the implementation in ll_diff_tree_paths() to instead
use the global diff_queue_diff struct's 'nr' member as the count. This
is a way to simplify the logic instead of making more mistakes in the
complicated diff code.

This has a drawback: the diff_queue_diff struct only lists the paths
corresponding to blob changes, not their leading directories. Thus,
get_or_compute_bloom_filter() needs an additional check to see if the
hashmap with the leading directories becomes too large.

One reason why this was not caught by test cases was that the test in
t4216-log-bloom.sh that was supposed to check this "too many changes"
condition only checked this on the initial commit of a repository. The
old logic counted these values correctly. Update this test in a few
ways:

1. Use GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS to reduce the limit,
   allowing smaller commits to engage with this logic.

2. Create several interesting cases of edits, adds, removes, and mode
   changes (in the second commit). By testing both sides of the
   inequality with the *_MAX_CHANGED_PATHS variable, we can see that
   the count is exactly correct, so none of these changes are missed
   or over-counted.

3. Use the trace2 data value filter_found_large to verify that these
   commits are on the correct side of the limit.

Another way to verify the behavior is correct is through performance
tests. By testing on my local copies of the Git repository and the Linux
kernel repository, I could measure the effect of these short-circuits
when computing a fresh commit-graph file with changed-path Bloom filters
using the command

  GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=N time \
    git commit-graph write --reachable --changed-paths

and reporting the wall time and resulting commit-graph size.

For Git, the results are

|        |      N=1       |       N=10     |      N=512     |
|--------|----------------|----------------|----------------|
| HEAD~1 | 10.90s  9.18MB | 11.11s  9.34MB | 11.31s  9.35MB |
| HEAD   |  9.21s  8.62MB | 11.11s  9.29MB | 11.29s  9.34MB |

For Linux, the results are

|        |       N=1      |     N=20      |     N=512     |
|--------|----------------|---------------|---------------|
| HEAD~1 | 61.28s  64.3MB | 76.9s  72.6MB | 77.6s  72.6MB |
| HEAD   | 49.44s  56.3MB | 68.7s  65.9MB | 69.2s  65.9MB |

Naturally, the improvement becomes much less as the limit grows, as
fewer commits satisfy the short-circuit.

Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bloom.c              |   9 +++-
 diff.h               |   2 -
 t/t4216-log-bloom.sh | 100 +++++++++++++++++++++++++++++++++++++++----
 tree-diff.c          |   5 +--
 4 files changed, 100 insertions(+), 16 deletions(-)

diff --git a/bloom.c b/bloom.c
index 2d6aef9098..db9fb82437 100644
--- a/bloom.c
+++ b/bloom.c
@@ -222,7 +222,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
 	diffcore_std(&diffopt);
 
-	if (diffopt.num_changes <= settings->max_changed_paths) {
+	if (diff_queued_diff.nr <= settings->max_changed_paths) {
 		struct hashmap pathmap;
 		struct pathmap_hash_entry *e;
 		struct hashmap_iter iter;
@@ -259,6 +259,12 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 			diff_free_filepair(diff_queued_diff.queue[i]);
 		}
 
+		if (hashmap_get_size(&pathmap) > settings->max_changed_paths) {
+			if (computed)
+				*computed |= BLOOM_TRUNC_LARGE;
+			goto cleanup;
+		}
+
 		filter->len = (hashmap_get_size(&pathmap) * settings->bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
 		filter->data = xcalloc(filter->len, sizeof(unsigned char));
 
@@ -268,6 +274,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 			add_key_to_filter(&key, filter, settings);
 		}
 
+	cleanup:
 		hashmap_free_entries(&pathmap, struct pathmap_hash_entry, entry);
 	} else {
 		for (i = 0; i < diff_queued_diff.nr; i++)
diff --git a/diff.h b/diff.h
index e0c0af6286..1d32b71885 100644
--- a/diff.h
+++ b/diff.h
@@ -287,8 +287,6 @@ struct diff_options {
 
 	/* If non-zero, then stop computing after this many changes. */
 	int max_changes;
-	/* For internal use only. */
-	int num_changes;
 
 	int ita_invisible_in_index;
 /* white-space error highlighting */
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 47ddf2641f..1ac8f4c4eb 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -181,21 +181,103 @@ test_expect_success 'persist filter settings' '
 	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15" trace2-auto.txt
 '
 
+test_max_changed_paths () {
+	grep "\"max_changed_paths\":$1" $2
+}
+
+test_filter_computed () {
+	grep "\"key\":\"filter-computed\",\"value\":\"$1\"" $2
+}
+
+test_filter_trunc_large () {
+	grep "\"key\":\"filter-trunc-large\",\"value\":\"$1\"" $2
+}
+
 test_expect_success 'correctly report changes over limit' '
-	git init 513changes &&
+	git init limits &&
 	(
-		cd 513changes &&
-		for i in $(test_seq 1 513)
+		cd limits &&
+		mkdir d &&
+		mkdir d/e &&
+
+		for i in $(test_seq 1 2)
 		do
-			echo $i >file$i.txt || return 1
+			printf $i >d/file$i.txt &&
+			printf $i >d/e/file$i.txt || return 1
 		done &&
-		git add . &&
+
+		mkdir mode &&
+		printf bash >mode/script.sh &&
+
+		mkdir foo &&
+		touch foo/bar &&
+		touch foo.txt &&
+
+		git add d foo foo.txt mode &&
 		git commit -m "files" &&
-		git commit-graph write --reachable --changed-paths &&
-		for i in $(test_seq 1 513)
+
+		# Commit has 7 file and 4 directory adds
+		GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=10 \
+			GIT_TRACE2_EVENT="$(pwd)/trace" \
+			git commit-graph write --reachable --changed-paths &&
+		test_max_changed_paths 10 trace &&
+		test_filter_computed 1 trace &&
+		test_filter_trunc_large 1 trace &&
+
+		for path in $(git ls-tree -r --name-only HEAD)
 		do
-			git -c core.commitGraph=false log -- file$i.txt >expect &&
-			git log -- file$i.txt >actual &&
+			git -c commitGraph.readChangedPaths=false log \
+				-- $path >expect &&
+			git log -- $path >actual &&
+			test_cmp expect actual || return 1
+		done &&
+
+		# Make a variety of path changes
+		printf new1 >d/e/file1.txt &&
+		printf new2 >d/file2.txt &&
+		rm d/e/file2.txt &&
+		rm -r foo &&
+		printf text >foo &&
+		mkdir f &&
+		printf new1 >f/file1.txt &&
+
+		# including a mode-only change (counts as modified)
+		git update-index --chmod=+x mode/script.sh &&
+
+		git add foo d f &&
+		git commit -m "complicated" &&
+
+		# start from scratch and rebuild
+		rm -f .git/objects/info/commit-graph &&
+		GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=10 \
+			GIT_TRACE2_EVENT="$(pwd)/trace-edit" \
+			git commit-graph write --reachable --changed-paths &&
+		test_max_changed_paths 10 trace-edit &&
+		test_filter_computed 2 trace-edit &&
+		test_filter_trunc_large 2 trace-edit &&
+
+		for path in $(git ls-tree -r --name-only HEAD)
+		do
+			git -c commitGraph.readChangedPaths=false log \
+				-- $path >expect &&
+			git log -- $path >actual &&
+			test_cmp expect actual || return 1
+		done &&
+
+		# start from scratch and rebuild
+		rm -f .git/objects/info/commit-graph &&
+		GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=11 \
+			GIT_TRACE2_EVENT="$(pwd)/trace-update" \
+			git commit-graph write --reachable --changed-paths &&
+		test_max_changed_paths 11 trace-update &&
+		test_filter_computed 2 trace-update &&
+		test_filter_trunc_large 0 trace-update &&
+
+		for path in $(git ls-tree -r --name-only HEAD)
+		do
+			git -c commitGraph.readChangedPaths=false log \
+				-- $path >expect &&
+			git log -- $path >actual &&
 			test_cmp expect actual || return 1
 		done
 	)
diff --git a/tree-diff.c b/tree-diff.c
index 6ebad1a46f..7cebbb327e 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -434,7 +434,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
 		if (diff_can_quit_early(opt))
 			break;
 
-		if (opt->max_changes && opt->num_changes > opt->max_changes)
+		if (opt->max_changes && diff_queued_diff.nr > opt->max_changes)
 			break;
 
 		if (opt->pathspec.nr) {
@@ -521,7 +521,6 @@ static struct combine_diff_path *ll_diff_tree_paths(
 
 			/* t↓ */
 			update_tree_entry(&t);
-			opt->num_changes++;
 		}
 
 		/* t > p[imin] */
@@ -539,7 +538,6 @@ static struct combine_diff_path *ll_diff_tree_paths(
 		skip_emit_tp:
 			/* ∀ pi=p[imin]  pi↓ */
 			update_tp_entries(tp, nparent);
-			opt->num_changes++;
 		}
 	}
 
@@ -557,7 +555,6 @@ struct combine_diff_path *diff_tree_paths(
 	const struct object_id **parents_oid, int nparent,
 	struct strbuf *base, struct diff_options *opt)
 {
-	opt->num_changes = 0;
 	p = ll_diff_tree_paths(p, oid, parents_oid, nparent, base, opt);
 
 	/*
-- 
2.28.0.510.g86fdc5f89a


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v2 10/13] bloom: encode out-of-bounds filters as non-empty
  2020-09-16 18:06 ` [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux Taylor Blau
                     ` (8 preceding siblings ...)
  2020-09-16 18:07   ` [PATCH v2 09/13] bloom/diff: properly short-circuit on max_changes Taylor Blau
@ 2020-09-16 18:07   ` Taylor Blau
  2020-09-17 22:13     ` SZEDER Gábor
  2020-09-16 18:08   ` [PATCH v2 11/13] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
                     ` (3 subsequent siblings)
  13 siblings, 1 reply; 75+ messages in thread
From: Taylor Blau @ 2020-09-16 18:07 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

When a changed-path Bloom filter has either zero, or more than a
certain number (commonly 512) of entries, the commit-graph machinery
encodes it as "missing". More specifically, it sets the indices adjacent
in the BIDX chunk as equal to each other to indicate a "length 0"
filter; that is, that the filter occupies zero bytes on disk.

This has heretofore been fine, since the commit-graph machinery has no
need to care about these filters with too few or too many changed paths.
Both cases act like no filter has been generated at all, and so there is
no need to store them.

In a subsequent commit, however, the commit-graph machinery will learn
to only compute Bloom filters for some commits in the current
commit-graph layer. This is a change from the current implementation
which computes Bloom filters for all commits that are in the layer being
written. Critically for this patch, only computing some of the Bloom
filters means adding a third state for length 0 Bloom filters: zero
entries, too many entries, or "hasn't been computed".

It will be important for that future patch to distinguish between "not
representable" (i.e., zero or too-many changed paths), and "hasn't been
computed". In particular, we don't want to waste time recomputing
filters that have already been computed.

To that end, change how we store Bloom filters in the "computed but not
representable" category:

  - Bloom filters with no entries are stored as a single byte with all
    bits low (i.e., all queries to that Bloom filter will return
    "definitely not")

  - Bloom filters with too many entries are stored as a single byte with
    all bits set high (i.e., all queries to that Bloom filter will
    return "maybe").

These rules are sufficient to not incur a behavior change by changing
the on-disk representation of these two classes. Likewise, no
specification changes are necessary for the commit-graph format, either:

  - Filters that were previously empty will be recomputed and stored
    according to the new rules, and

  - old clients reading filters generated by new clients will interpret
    the filters correctly and be none the wiser to how they were
    generated.

Clients will invoke the Bloom machinery in more cases than before, but
this can be addressed by returning a NULL filter when all bits are set
high. This can be addressed in a future patch.

Finally, note that this does increase the size of on-disk commit-graphs,
but far less than other proposals. In particular, this is generally more
efficient than storing a bitmap for which commits haven't computed their
Bloom filters. Storing a bitmap incurs a penalty of one bit per commit,
whereas storing explicit filters as above incurs a penalty of one byte
per too-large or too-small commit.

In practice, these boundary commits likely occupy a small proportion of
the overall number of commits, and so the size penalty is likely smaller
than storing a bitmap for all commits.

A test to exercise filters which contain too many changed path entries
will be introduced in a subsequent patch.

Suggested-by: SZEDER Gábor <szeder.dev@gmail.com>
Suggested-by: Jakub Narębski <jnareb@gmail.com>
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 .../technical/commit-graph-format.txt         |  2 +-
 bloom.c                                       | 16 ++++++++--
 bloom.h                                       |  1 +
 commit-graph.c                                |  5 ++++
 t/t0095-bloom.sh                              |  8 ++---
 t/t4216-log-bloom.sh                          | 30 +++++++++++++++++--
 6 files changed, 53 insertions(+), 9 deletions(-)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index 6ddbceba15..6585f1948a 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -125,7 +125,7 @@ CHUNK DATA:
     * The rest of the chunk is the concatenation of all the computed Bloom
       filters for the commits in lexicographic order.
     * Note: Commits with no changes or more than 512 changes have Bloom filters
-      of length zero.
+      of length one, with either all bits set to zero or one respectively.
     * The BDAT chunk is present if and only if BIDX is present.
 
   Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
diff --git a/bloom.c b/bloom.c
index db9fb82437..d24747a1d5 100644
--- a/bloom.c
+++ b/bloom.c
@@ -177,6 +177,13 @@ static int pathmap_cmp(const void *hashmap_cmp_fn_data,
 	return strcmp(e1->path, e2->path);
 }
 
+static void init_truncated_large_filter(struct bloom_filter *filter)
+{
+	filter->data = xmalloc(1);
+	filter->data[0] = 0xFF;
+	filter->len = 1;
+}
+
 struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 						 struct commit *c,
 						 int compute_if_not_present,
@@ -260,12 +267,18 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 		}
 
 		if (hashmap_get_size(&pathmap) > settings->max_changed_paths) {
+			init_truncated_large_filter(filter);
 			if (computed)
 				*computed |= BLOOM_TRUNC_LARGE;
 			goto cleanup;
 		}
 
 		filter->len = (hashmap_get_size(&pathmap) * settings->bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
+		if (!filter->len) {
+			if (computed)
+				*computed |= BLOOM_TRUNC_SMALL;
+			filter->len = 1;
+		}
 		filter->data = xcalloc(filter->len, sizeof(unsigned char));
 
 		hashmap_for_each_entry(&pathmap, &iter, e, entry) {
@@ -279,8 +292,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 	} else {
 		for (i = 0; i < diff_queued_diff.nr; i++)
 			diff_free_filepair(diff_queued_diff.queue[i]);
-		filter->data = NULL;
-		filter->len = 0;
+		init_truncated_large_filter(filter);
 
 		if (computed)
 			*computed |= BLOOM_TRUNC_LARGE;
diff --git a/bloom.h b/bloom.h
index c6d77e8393..70a8840896 100644
--- a/bloom.h
+++ b/bloom.h
@@ -93,6 +93,7 @@ enum bloom_filter_computed {
 	BLOOM_NOT_COMPUTED = (1 << 0),
 	BLOOM_COMPUTED     = (1 << 1),
 	BLOOM_TRUNC_LARGE  = (1 << 2),
+	BLOOM_TRUNC_SMALL  = (1 << 3),
 };
 
 struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
diff --git a/commit-graph.c b/commit-graph.c
index 1ca754f19c..bd4247bca5 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -974,6 +974,7 @@ struct write_commit_graph_context {
 
 	int count_bloom_filter_computed;
 	int count_bloom_filter_not_computed;
+	int count_bloom_filter_trunc_small;
 	int count_bloom_filter_trunc_large;
 };
 
@@ -1402,6 +1403,8 @@ static void trace2_bloom_filter_write_statistics(struct write_commit_graph_conte
 			   ctx->count_bloom_filter_computed);
 	trace2_data_intmax("commit-graph", ctx->r, "filter-not-computed",
 			   ctx->count_bloom_filter_not_computed);
+	trace2_data_intmax("commit-graph", ctx->r, "filter-trunc-small",
+			   ctx->count_bloom_filter_trunc_small);
 	trace2_data_intmax("commit-graph", ctx->r, "filter-trunc-large",
 			   ctx->count_bloom_filter_trunc_large);
 }
@@ -1438,6 +1441,8 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 			&computed);
 		if (computed & BLOOM_COMPUTED) {
 			ctx->count_bloom_filter_computed++;
+			if (computed & BLOOM_TRUNC_SMALL)
+				ctx->count_bloom_filter_trunc_small++;
 			if (computed & BLOOM_TRUNC_LARGE)
 				ctx->count_bloom_filter_trunc_large++;
 		} else if (computed & BLOOM_NOT_COMPUTED)
diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
index 232ba2c485..7e4ab1795f 100755
--- a/t/t0095-bloom.sh
+++ b/t/t0095-bloom.sh
@@ -71,8 +71,8 @@ test_expect_success 'get bloom filters for commit with no changes' '
 	git init &&
 	git commit --allow-empty -m "c0" &&
 	cat >expect <<-\EOF &&
-	Filter_Length:0
-	Filter_Data:
+	Filter_Length:1
+	Filter_Data:00|
 	EOF
 	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
 	test_cmp expect actual
@@ -107,8 +107,8 @@ test_expect_success EXPENSIVE 'get bloom filter for commit with 513 changes' '
 	git add bigDir &&
 	git commit -m "commit with 513 changes" &&
 	cat >expect <<-\EOF &&
-	Filter_Length:0
-	Filter_Data:
+	Filter_Length:1
+	Filter_Data:ff|
 	EOF
 	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
 	test_cmp expect actual
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 1ac8f4c4eb..a0c9c9ea23 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -30,6 +30,7 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
 	rm file_to_be_deleted &&
 	git add . &&
 	git commit -m "file removed" &&
+	git commit --allow-empty -m "empty" &&
 	git commit-graph write --reachable --changed-paths &&
 
 	test_oid_cache <<-EOF
@@ -49,7 +50,7 @@ graph_read_expect () {
 }
 
 test_expect_success 'commit-graph write wrote out the bloom chunks' '
-	graph_read_expect 15
+	graph_read_expect 16
 '
 
 # Turn off any inherited trace2 settings for this test.
@@ -156,7 +157,7 @@ test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
 
 test_bloom_filters_used_when_some_filters_are_missing () {
 	log_args=$1
-	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"maybe\":6,\"definitely_not\":8"
+	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"maybe\":6,\"definitely_not\":9"
 	setup "$log_args" &&
 	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
 	test_cmp log_wo_bloom log_w_bloom
@@ -185,10 +186,18 @@ test_max_changed_paths () {
 	grep "\"max_changed_paths\":$1" $2
 }
 
+test_filter_not_computed () {
+	grep "\"key\":\"filter-not-computed\",\"value\":\"$1\"" $2
+}
+
 test_filter_computed () {
 	grep "\"key\":\"filter-computed\",\"value\":\"$1\"" $2
 }
 
+test_filter_trunc_small () {
+	grep "\"key\":\"filter-trunc-small\",\"value\":\"$1\"" $2
+}
+
 test_filter_trunc_large () {
 	grep "\"key\":\"filter-trunc-large\",\"value\":\"$1\"" $2
 }
@@ -283,4 +292,21 @@ test_expect_success 'correctly report changes over limit' '
 	)
 '
 
+test_expect_success 'correctly report commits with no changed paths' '
+	git init small &&
+	test_when_finished "rm -fr small" &&
+	(
+		cd small &&
+
+		git commit --allow-empty -m "initial commit" &&
+
+		GIT_TRACE2_EVENT="$(pwd)/trace.event" \
+			git commit-graph write --reachable --changed-paths &&
+		test_filter_computed 1 trace.event &&
+		test_filter_not_computed 0 trace.event &&
+		test_filter_trunc_small 1 trace.event &&
+		test_filter_trunc_large 0 trace.event
+	)
+'
+
 test_done
-- 
2.28.0.510.g86fdc5f89a


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v2 11/13] commit-graph: rename 'split_commit_graph_opts'
  2020-09-16 18:06 ` [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux Taylor Blau
                     ` (9 preceding siblings ...)
  2020-09-16 18:07   ` [PATCH v2 10/13] bloom: encode out-of-bounds filters as non-empty Taylor Blau
@ 2020-09-16 18:08   ` Taylor Blau
  2020-09-16 18:08   ` [PATCH v2 12/13] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
                     ` (2 subsequent siblings)
  13 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-16 18:08 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

In the subsequent commit, additional options will be added to the
commit-graph API which have nothing to do with splitting.

Rename the 'split_commit_graph_opts' structure to the more-generic
'commit_graph_opts' to encompass both. Likewise, rename the 'flags'
member to instead be 'split_flags' to clarify that it only has to do
with the behavior implied by '--split'.

Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/commit-graph.c | 20 ++++++++++----------
 commit-graph.c         | 40 ++++++++++++++++++++--------------------
 commit-graph.h         |  8 ++++----
 3 files changed, 34 insertions(+), 34 deletions(-)

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index ba5584463f..f3243bd982 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -119,7 +119,7 @@ static int graph_verify(int argc, const char **argv)
 }
 
 extern int read_replace_refs;
-static struct split_commit_graph_opts split_opts;
+static struct commit_graph_opts write_opts;
 
 static int write_option_parse_split(const struct option *opt, const char *arg,
 				    int unset)
@@ -187,24 +187,24 @@ static int graph_write(int argc, const char **argv)
 		OPT_BOOL(0, "changed-paths", &opts.enable_changed_paths,
 			N_("enable computation for changed paths")),
 		OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")),
-		OPT_CALLBACK_F(0, "split", &split_opts.flags, NULL,
+		OPT_CALLBACK_F(0, "split", &write_opts.split_flags, NULL,
 			N_("allow writing an incremental commit-graph file"),
 			PARSE_OPT_OPTARG | PARSE_OPT_NONEG,
 			write_option_parse_split),
-		OPT_INTEGER(0, "max-commits", &split_opts.max_commits,
+		OPT_INTEGER(0, "max-commits", &write_opts.max_commits,
 			N_("maximum number of commits in a non-base split commit-graph")),
-		OPT_INTEGER(0, "size-multiple", &split_opts.size_multiple,
+		OPT_INTEGER(0, "size-multiple", &write_opts.size_multiple,
 			N_("maximum ratio between two levels of a split commit-graph")),
-		OPT_EXPIRY_DATE(0, "expire-time", &split_opts.expire_time,
+		OPT_EXPIRY_DATE(0, "expire-time", &write_opts.expire_time,
 			N_("only expire files older than a given date-time")),
 		OPT_END(),
 	};
 
 	opts.progress = isatty(2);
 	opts.enable_changed_paths = -1;
-	split_opts.size_multiple = 2;
-	split_opts.max_commits = 0;
-	split_opts.expire_time = 0;
+	write_opts.size_multiple = 2;
+	write_opts.max_commits = 0;
+	write_opts.expire_time = 0;
 
 	trace2_cmd_mode("write");
 
@@ -232,7 +232,7 @@ static int graph_write(int argc, const char **argv)
 	odb = find_odb(the_repository, opts.obj_dir);
 
 	if (opts.reachable) {
-		if (write_commit_graph_reachable(odb, flags, &split_opts))
+		if (write_commit_graph_reachable(odb, flags, &write_opts))
 			return 1;
 		return 0;
 	}
@@ -261,7 +261,7 @@ static int graph_write(int argc, const char **argv)
 			       opts.stdin_packs ? &pack_indexes : NULL,
 			       opts.stdin_commits ? &commits : NULL,
 			       flags,
-			       &split_opts))
+			       &write_opts))
 		result = 1;
 
 cleanup:
diff --git a/commit-graph.c b/commit-graph.c
index bd4247bca5..913f78a9a1 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -968,7 +968,7 @@ struct write_commit_graph_context {
 		 changed_paths:1,
 		 order_by_pack:1;
 
-	const struct split_commit_graph_opts *split_opts;
+	const struct commit_graph_opts *opts;
 	size_t total_bloom_filter_data_size;
 	const struct bloom_filter_settings *bloom_settings;
 
@@ -1292,8 +1292,8 @@ static void close_reachable(struct write_commit_graph_context *ctx)
 {
 	int i;
 	struct commit *commit;
-	enum commit_graph_split_flags flags = ctx->split_opts ?
-		ctx->split_opts->flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
+	enum commit_graph_split_flags flags = ctx->opts ?
+		ctx->opts->split_flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
 
 	if (ctx->report_progress)
 		ctx->progress = start_delayed_progress(
@@ -1482,7 +1482,7 @@ static int add_ref_to_set(const char *refname,
 
 int write_commit_graph_reachable(struct object_directory *odb,
 				 enum commit_graph_write_flags flags,
-				 const struct split_commit_graph_opts *split_opts)
+				 const struct commit_graph_opts *opts)
 {
 	struct oidset commits = OIDSET_INIT;
 	struct refs_cb_data data;
@@ -1499,7 +1499,7 @@ int write_commit_graph_reachable(struct object_directory *odb,
 	stop_progress(&data.progress);
 
 	result = write_commit_graph(odb, NULL, &commits,
-				    flags, split_opts);
+				    flags, opts);
 
 	oidset_clear(&commits);
 	return result;
@@ -1614,8 +1614,8 @@ static uint32_t count_distinct_commits(struct write_commit_graph_context *ctx)
 static void copy_oids_to_commits(struct write_commit_graph_context *ctx)
 {
 	uint32_t i;
-	enum commit_graph_split_flags flags = ctx->split_opts ?
-		ctx->split_opts->flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
+	enum commit_graph_split_flags flags = ctx->opts ?
+		ctx->opts->split_flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
 
 	ctx->num_extra_edges = 0;
 	if (ctx->report_progress)
@@ -1900,13 +1900,13 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
 	int max_commits = 0;
 	int size_mult = 2;
 
-	if (ctx->split_opts) {
-		max_commits = ctx->split_opts->max_commits;
+	if (ctx->opts) {
+		max_commits = ctx->opts->max_commits;
 
-		if (ctx->split_opts->size_multiple)
-			size_mult = ctx->split_opts->size_multiple;
+		if (ctx->opts->size_multiple)
+			size_mult = ctx->opts->size_multiple;
 
-		flags = ctx->split_opts->flags;
+		flags = ctx->opts->split_flags;
 	}
 
 	g = ctx->r->objects->commit_graph;
@@ -2084,8 +2084,8 @@ static void expire_commit_graphs(struct write_commit_graph_context *ctx)
 	size_t dirnamelen;
 	timestamp_t expire_time = time(NULL);
 
-	if (ctx->split_opts && ctx->split_opts->expire_time)
-		expire_time = ctx->split_opts->expire_time;
+	if (ctx->opts && ctx->opts->expire_time)
+		expire_time = ctx->opts->expire_time;
 	if (!ctx->split) {
 		char *chain_file_name = get_chain_filename(ctx->odb);
 		unlink(chain_file_name);
@@ -2136,7 +2136,7 @@ int write_commit_graph(struct object_directory *odb,
 		       struct string_list *pack_indexes,
 		       struct oidset *commits,
 		       enum commit_graph_write_flags flags,
-		       const struct split_commit_graph_opts *split_opts)
+		       const struct commit_graph_opts *opts)
 {
 	struct write_commit_graph_context *ctx;
 	uint32_t i, count_distinct = 0;
@@ -2153,7 +2153,7 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->append = flags & COMMIT_GRAPH_WRITE_APPEND ? 1 : 0;
 	ctx->report_progress = flags & COMMIT_GRAPH_WRITE_PROGRESS ? 1 : 0;
 	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
-	ctx->split_opts = split_opts;
+	ctx->opts = opts;
 	ctx->total_bloom_filter_data_size = 0;
 
 	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
@@ -2201,15 +2201,15 @@ int write_commit_graph(struct object_directory *odb,
 			}
 		}
 
-		if (ctx->split_opts)
-			replace = ctx->split_opts->flags & COMMIT_GRAPH_SPLIT_REPLACE;
+		if (ctx->opts)
+			replace = ctx->opts->split_flags & COMMIT_GRAPH_SPLIT_REPLACE;
 	}
 
 	ctx->approx_nr_objects = approximate_object_count();
 	ctx->oids.alloc = ctx->approx_nr_objects / 32;
 
-	if (ctx->split && split_opts && ctx->oids.alloc > split_opts->max_commits)
-		ctx->oids.alloc = split_opts->max_commits;
+	if (ctx->split && opts && ctx->oids.alloc > opts->max_commits)
+		ctx->oids.alloc = opts->max_commits;
 
 	if (ctx->append) {
 		prepare_commit_graph_one(ctx->r, ctx->odb);
diff --git a/commit-graph.h b/commit-graph.h
index d9acb22bac..b7914b0a7a 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -105,11 +105,11 @@ enum commit_graph_split_flags {
 	COMMIT_GRAPH_SPLIT_REPLACE          = 2
 };
 
-struct split_commit_graph_opts {
+struct commit_graph_opts {
 	int size_multiple;
 	int max_commits;
 	timestamp_t expire_time;
-	enum commit_graph_split_flags flags;
+	enum commit_graph_split_flags split_flags;
 };
 
 /*
@@ -120,12 +120,12 @@ struct split_commit_graph_opts {
  */
 int write_commit_graph_reachable(struct object_directory *odb,
 				 enum commit_graph_write_flags flags,
-				 const struct split_commit_graph_opts *split_opts);
+				 const struct commit_graph_opts *opts);
 int write_commit_graph(struct object_directory *odb,
 		       struct string_list *pack_indexes,
 		       struct oidset *commits,
 		       enum commit_graph_write_flags flags,
-		       const struct split_commit_graph_opts *split_opts);
+		       const struct commit_graph_opts *opts);
 
 #define COMMIT_GRAPH_VERIFY_SHALLOW	(1 << 0)
 
-- 
2.28.0.510.g86fdc5f89a


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v2 12/13] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-09-16 18:06 ` [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux Taylor Blau
                     ` (10 preceding siblings ...)
  2020-09-16 18:08   ` [PATCH v2 11/13] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
@ 2020-09-16 18:08   ` Taylor Blau
  2020-09-18  9:23     ` SZEDER Gábor
  2020-09-16 18:08   ` [PATCH v2 13/13] commit-graph: introduce 'commitGraph.maxNewFilters' Taylor Blau
  2020-09-16 22:51   ` [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux Derrick Stolee
  13 siblings, 1 reply; 75+ messages in thread
From: Taylor Blau @ 2020-09-16 18:08 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

Introduce a command-line flag to specify the maximum number of new Bloom
filters that a 'git commit-graph write' is willing to compute from
scratch.

Prior to this patch, a commit-graph write with '--changed-paths' would
compute Bloom filters for all selected commits which haven't already
been computed (i.e., by a previous commit-graph write with '--split'
such that a roll-up or replacement is performed).

This behavior can cause prohibitively-long commit-graph writes for a
variety of reasons:

  * There may be lots of filters whose diffs take a long time to
    generate (for example, they have close to the maximum number of
    changes, diffing itself takes a long time, etc).

  * Old-style commit-graphs (which encode filters with too many entries
    as not having been computed at all) cause us to waste time
    recomputing filters that appear to have not been computed only to
    discover that they are too-large.

This can make the upper-bound of the time it takes for 'git commit-graph
write --changed-paths' to be rather unpredictable.

To make this command behave more predictably, introduce
'--max-new-filters=<n>' to allow computing at most '<n>' Bloom filters
from scratch. This lets "computing" already-known filters proceed
quickly, while bounding the number of slow tasks that Git is willing to
do.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-commit-graph.txt |  5 +++
 bloom.c                            |  7 ++-
 builtin/commit-graph.c             | 27 +++++++++++-
 commit-graph.c                     |  9 +++-
 commit-graph.h                     |  1 +
 t/t4216-log-bloom.sh               | 70 ++++++++++++++++++++++++++++++
 6 files changed, 111 insertions(+), 8 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 17405c73a9..8357846d30 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -67,6 +67,11 @@ this option is given, future commit-graph writes will automatically assume
 that this option was intended. Use `--no-changed-paths` to stop storing this
 data.
 +
+With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
+filters (if `--changed-paths` is specified). If `n` is `-1`, no limit is
+enforced. Commits whose filters are not calculated are stored as a
+length zero Bloom filter.
++
 With the `--split[=<strategy>]` option, write the commit-graph as a
 chain of multiple commit-graph files stored in
 `<dir>/info/commit-graphs`. Commit-graph layers are merged based on the
diff --git a/bloom.c b/bloom.c
index d24747a1d5..230a515831 100644
--- a/bloom.c
+++ b/bloom.c
@@ -204,12 +204,11 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 
 	if (!filter->data) {
 		load_commit_graph_info(r, c);
-		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH &&
-			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
-				return filter;
+		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH)
+			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
 	}
 
-	if (filter->data)
+	if (filter->data && filter->len)
 		return filter;
 	if (!compute_if_not_present)
 		return NULL;
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index f3243bd982..5df9b2ef80 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -13,7 +13,8 @@ static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph verify [--object-dir <objdir>] [--shallow] [--[no-]progress]"),
 	N_("git commit-graph write [--object-dir <objdir>] [--append] "
 	   "[--split[=<strategy>]] [--reachable|--stdin-packs|--stdin-commits] "
-	   "[--changed-paths] [--[no-]progress] <split options>"),
+	   "[--changed-paths] [--[no-]max-new-filters <n>] [--[no-]progress] "
+	   "<split options>"),
 	NULL
 };
 
@@ -25,7 +26,8 @@ static const char * const builtin_commit_graph_verify_usage[] = {
 static const char * const builtin_commit_graph_write_usage[] = {
 	N_("git commit-graph write [--object-dir <objdir>] [--append] "
 	   "[--split[=<strategy>]] [--reachable|--stdin-packs|--stdin-commits] "
-	   "[--changed-paths] [--[no-]progress] <split options>"),
+	   "[--changed-paths] [--[no-]max-new-filters <n>] [--[no-]progress] "
+	   "<split options>"),
 	NULL
 };
 
@@ -162,6 +164,23 @@ static int read_one_commit(struct oidset *commits, struct progress *progress,
 	return 0;
 }
 
+static int write_option_max_new_filters(const struct option *opt,
+					const char *arg,
+					int unset)
+{
+	int *to = opt->value;
+	if (unset)
+		*to = -1;
+	else {
+		const char *s;
+		*to = strtol(arg, (char **)&s, 10);
+		if (*s)
+			return error(_("%s expects a numerical value"),
+				     optname(opt, opt->flags));
+	}
+	return 0;
+}
+
 static int graph_write(int argc, const char **argv)
 {
 	struct string_list pack_indexes = STRING_LIST_INIT_NODUP;
@@ -197,6 +216,9 @@ static int graph_write(int argc, const char **argv)
 			N_("maximum ratio between two levels of a split commit-graph")),
 		OPT_EXPIRY_DATE(0, "expire-time", &write_opts.expire_time,
 			N_("only expire files older than a given date-time")),
+		OPT_CALLBACK_F(0, "max-new-filters", &write_opts.max_new_filters,
+			NULL, N_("maximum number of changed-path Bloom filters to compute"),
+			0, write_option_max_new_filters),
 		OPT_END(),
 	};
 
@@ -205,6 +227,7 @@ static int graph_write(int argc, const char **argv)
 	write_opts.size_multiple = 2;
 	write_opts.max_commits = 0;
 	write_opts.expire_time = 0;
+	write_opts.max_new_filters = -1;
 
 	trace2_cmd_mode("write");
 
diff --git a/commit-graph.c b/commit-graph.c
index 913f78a9a1..33af6c2430 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1414,6 +1414,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 	int i;
 	struct progress *progress = NULL;
 	struct commit **sorted_commits;
+	int max_new_filters;
 
 	init_bloom_filters();
 
@@ -1430,13 +1431,16 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 	else
 		QSORT(sorted_commits, ctx->commits.nr, commit_gen_cmp);
 
+	max_new_filters = ctx->opts && ctx->opts->max_new_filters >= 0 ?
+		ctx->opts->max_new_filters : ctx->commits.nr;
+
 	for (i = 0; i < ctx->commits.nr; i++) {
 		enum bloom_filter_computed computed = 0;
 		struct commit *c = sorted_commits[i];
 		struct bloom_filter *filter = get_or_compute_bloom_filter(
 			ctx->r,
 			c,
-			1,
+			ctx->count_bloom_filter_computed < max_new_filters,
 			ctx->bloom_settings,
 			&computed);
 		if (computed & BLOOM_COMPUTED) {
@@ -1447,7 +1451,8 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 				ctx->count_bloom_filter_trunc_large++;
 		} else if (computed & BLOOM_NOT_COMPUTED)
 			ctx->count_bloom_filter_not_computed++;
-		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
+		ctx->total_bloom_filter_data_size += filter
+			? sizeof(unsigned char) * filter->len : 0;
 		display_progress(progress, i + 1);
 	}
 
diff --git a/commit-graph.h b/commit-graph.h
index b7914b0a7a..a22bd86701 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -110,6 +110,7 @@ struct commit_graph_opts {
 	int max_commits;
 	timestamp_t expire_time;
 	enum commit_graph_split_flags split_flags;
+	int max_new_filters;
 };
 
 /*
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index a0c9c9ea23..9ce0c318e9 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -309,4 +309,74 @@ test_expect_success 'correctly report commits with no changed paths' '
 	)
 '
 
+test_expect_success 'Bloom generation is limited by --max-new-filters' '
+	(
+		cd limits &&
+		test_commit c2 filter &&
+		test_commit c3 filter &&
+		test_commit c4 no-filter &&
+
+		rm -f trace.event &&
+		GIT_TRACE2_EVENT="$(pwd)/trace.event" \
+			git commit-graph write --reachable --split=replace \
+				--changed-paths --max-new-filters=2 &&
+
+		test_filter_computed 2 trace.event &&
+		test_filter_not_computed 3 trace.event &&
+		test_filter_trunc_small 0 trace.event &&
+		test_filter_trunc_large 0 trace.event
+	)
+'
+
+test_expect_success 'Bloom generation backfills previously-skipped filters' '
+	(
+		cd limits &&
+
+		rm -f trace.event &&
+		GIT_TRACE2_EVENT="$(pwd)/trace.event" \
+			git commit-graph write --reachable --changed-paths \
+				--split=replace --max-new-filters=1 &&
+		test_filter_computed 1 trace.event &&
+		test_filter_not_computed 4 trace.event &&
+		test_filter_trunc_small 0 trace.event &&
+		test_filter_trunc_large 0 trace.event
+	)
+'
+
+test_expect_success 'Bloom generation backfills empty commits' '
+	git init empty &&
+	test_when_finished "rm -fr empty" &&
+	(
+		cd empty &&
+		for i in $(test_seq 1 6)
+		do
+			git commit --allow-empty -m "$i"
+		done &&
+
+		# Generate Bloom filters for empty commits 1-6, two at a time.
+		for i in $(test_seq 1 3)
+		do
+			rm -f trace.event &&
+			GIT_TRACE2_EVENT="$(pwd)/trace.event" \
+				git commit-graph write --reachable \
+					--changed-paths --max-new-filters=2 &&
+			test_filter_computed 2 trace.event &&
+			test_filter_not_computed 4 trace.event &&
+			test_filter_trunc_small 2 trace.event &&
+			test_filter_trunc_large 0 trace.event
+		done &&
+
+		# Finally, make sure that once all commits have filters, that
+		# none are subsequently recomputed.
+		rm -f trace.event &&
+		GIT_TRACE2_EVENT="$(pwd)/trace.event" \
+			git commit-graph write --reachable \
+				--changed-paths --max-new-filters=2 &&
+		test_filter_computed 0 trace.event &&
+		test_filter_not_computed 6 trace.event &&
+		test_filter_trunc_small 0 trace.event &&
+		test_filter_trunc_large 0 trace.event
+	)
+'
+
 test_done
-- 
2.28.0.510.g86fdc5f89a


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v2 13/13] commit-graph: introduce 'commitGraph.maxNewFilters'
  2020-09-16 18:06 ` [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux Taylor Blau
                     ` (11 preceding siblings ...)
  2020-09-16 18:08   ` [PATCH v2 12/13] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
@ 2020-09-16 18:08   ` Taylor Blau
  2020-09-16 22:51   ` [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux Derrick Stolee
  13 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-16 18:08 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

Introduce a configuration variable to specify a default value for the
recently-introduce '--max-new-filters' option of 'git commit-graph
write'.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/commitgraph.txt |  4 ++++
 Documentation/git-commit-graph.txt   |  3 ++-
 builtin/commit-graph.c               | 14 ++++++++++++++
 t/t4216-log-bloom.sh                 | 24 +++++++++++++++++++++++-
 4 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
index cff0797b54..4582c39fc4 100644
--- a/Documentation/config/commitgraph.txt
+++ b/Documentation/config/commitgraph.txt
@@ -1,3 +1,7 @@
+commitGraph.maxNewFilters::
+	Specifies the default value for the `--max-new-filters` option of `git
+	commit-graph write` (c.f., linkgit:git-commit-graph[1]).
+
 commitGraph.readChangedPaths::
 	If true, then git will use the changed-path Bloom filters in the
 	commit-graph file (if it exists, and they are present). Defaults to
diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 8357846d30..60df4e4bfa 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -70,7 +70,8 @@ data.
 With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
 filters (if `--changed-paths` is specified). If `n` is `-1`, no limit is
 enforced. Commits whose filters are not calculated are stored as a
-length zero Bloom filter.
+length zero Bloom filter. Overrides the `commitGraph.maxNewFilters`
+configuration.
 +
 With the `--split[=<strategy>]` option, write the commit-graph as a
 chain of multiple commit-graph files stored in
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 5df9b2ef80..988445abdf 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -181,6 +181,18 @@ static int write_option_max_new_filters(const struct option *opt,
 	return 0;
 }
 
+static int git_commit_graph_write_config(const char *var, const char *value,
+					 void *cb)
+{
+	if (!strcmp(var, "commitgraph.maxnewfilters"))
+		write_opts.max_new_filters = git_config_int(var, value);
+	/*
+	 * No need to fall-back to 'git_default_config', since this was already
+	 * called in 'cmd_commit_graph()'.
+	 */
+	return 0;
+}
+
 static int graph_write(int argc, const char **argv)
 {
 	struct string_list pack_indexes = STRING_LIST_INIT_NODUP;
@@ -231,6 +243,8 @@ static int graph_write(int argc, const char **argv)
 
 	trace2_cmd_mode("write");
 
+	git_config(git_commit_graph_write_config, &opts);
+
 	argc = parse_options(argc, argv, NULL,
 			     builtin_commit_graph_write_options,
 			     builtin_commit_graph_write_usage, 0);
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 9ce0c318e9..dc7d62c778 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -329,13 +329,15 @@ test_expect_success 'Bloom generation is limited by --max-new-filters' '
 '
 
 test_expect_success 'Bloom generation backfills previously-skipped filters' '
+	# Check specifying commitGraph.maxNewFilters over "git config" works.
+	test_config -C limits commitGraph.maxNewFilters 1 &&
 	(
 		cd limits &&
 
 		rm -f trace.event &&
 		GIT_TRACE2_EVENT="$(pwd)/trace.event" \
 			git commit-graph write --reachable --changed-paths \
-				--split=replace --max-new-filters=1 &&
+				--split=replace &&
 		test_filter_computed 1 trace.event &&
 		test_filter_not_computed 4 trace.event &&
 		test_filter_trunc_small 0 trace.event &&
@@ -343,6 +345,26 @@ test_expect_success 'Bloom generation backfills previously-skipped filters' '
 	)
 '
 
+test_expect_success '--max-new-filters overrides configuration' '
+	git init override &&
+	test_when_finished "rm -fr override" &&
+	test_config -C override commitGraph.maxNewFilters 2 &&
+	(
+		cd override &&
+		test_commit one &&
+		test_commit two &&
+
+		rm -f trace.event &&
+		GIT_TRACE2_EVENT="$(pwd)/trace.event" \
+			git commit-graph write --reachable --changed-paths \
+				--max-new-filters=1 &&
+		test_filter_computed 1 trace.event &&
+		test_filter_not_computed 1 trace.event &&
+		test_filter_trunc_small 0 trace.event &&
+		test_filter_trunc_large 0 trace.event
+	)
+'
+
 test_expect_success 'Bloom generation backfills empty commits' '
 	git init empty &&
 	test_when_finished "rm -fr empty" &&
-- 
2.28.0.510.g86fdc5f89a

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* Re: [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux
  2020-09-16 18:06 ` [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux Taylor Blau
                     ` (12 preceding siblings ...)
  2020-09-16 18:08   ` [PATCH v2 13/13] commit-graph: introduce 'commitGraph.maxNewFilters' Taylor Blau
@ 2020-09-16 22:51   ` Derrick Stolee
  2020-09-16 23:07     ` Junio C Hamano
  13 siblings, 1 reply; 75+ messages in thread
From: Derrick Stolee @ 2020-09-16 22:51 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: dstolee, gitster, peff, szeder.dev

On 9/16/20 2:06 PM, Taylor Blau wrote:
> Hopefully this is it for this series ;). I think that it's in good shape
> now, and I couldn't find anything in my own inspection that I wanted to
> change. So, if others feel good, too, I think we should focus on
> incremental fixes on top of this.

I am also happy with this version. The test updates are
rather nice.

It is worth highlighging the tracing changes, in case
anyone was planning to use that style of JSON data in
production. I think it is better to include this simpler
data model of one variable per event instead of rolling
our own JSON format.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux
  2020-09-16 22:51   ` [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux Derrick Stolee
@ 2020-09-16 23:07     ` Junio C Hamano
  2020-09-17  0:45       ` Taylor Blau
  0 siblings, 1 reply; 75+ messages in thread
From: Junio C Hamano @ 2020-09-16 23:07 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Taylor Blau, git, dstolee, peff, szeder.dev

Derrick Stolee <stolee@gmail.com> writes:

> On 9/16/20 2:06 PM, Taylor Blau wrote:
>> Hopefully this is it for this series ;). I think that it's in good shape
>> now, and I couldn't find anything in my own inspection that I wanted to
>> change. So, if others feel good, too, I think we should focus on
>> incremental fixes on top of this.
>
> I am also happy with this version. The test updates are
> rather nice.
>
> It is worth highlighging the tracing changes, in case
> anyone was planning to use that style of JSON data in
> production. I think it is better to include this simpler
> data model of one variable per event instead of rolling
> our own JSON format.

Yup, lets merge it down soonish.

Thanks.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux
  2020-09-16 23:07     ` Junio C Hamano
@ 2020-09-17  0:45       ` Taylor Blau
  2020-09-17  0:59         ` Junio C Hamano
  0 siblings, 1 reply; 75+ messages in thread
From: Taylor Blau @ 2020-09-17  0:45 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee, Taylor Blau, git, dstolee, peff, szeder.dev

On Wed, Sep 16, 2020 at 04:07:51PM -0700, Junio C Hamano wrote:
> Yup, lets merge it down soonish.

It's me, the bearer of bad news. I noticed an uninitialized read when
running the tests with SANITIZE=address,undefined. I *think* the fix is
as simple as a single replacement, but let me double check before you
merge this.

Sorry this topic has been such a disaster. Assuming the fix is isolated
to a single patch, do you want a new version of that patch, or the whole
series?

> Thanks.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux
  2020-09-17  0:45       ` Taylor Blau
@ 2020-09-17  0:59         ` Junio C Hamano
  2020-09-17  1:10           ` Taylor Blau
  0 siblings, 1 reply; 75+ messages in thread
From: Junio C Hamano @ 2020-09-17  0:59 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Derrick Stolee, git, dstolee, peff, szeder.dev

Taylor Blau <me@ttaylorr.com> writes:

> On Wed, Sep 16, 2020 at 04:07:51PM -0700, Junio C Hamano wrote:
>> Yup, lets merge it down soonish.
>
> It's me, the bearer of bad news. I noticed an uninitialized read when
> running the tests with SANITIZE=address,undefined. I *think* the fix is
> as simple as a single replacement, but let me double check before you
> merge this.
>
> Sorry this topic has been such a disaster. Assuming the fix is isolated
> to a single patch, do you want a new version of that patch, or the whole
> series?

Depends on the timing---if it takes less than 2 days, please expect
that the topic would still be in my cache and "squash this into
patch 07/13" would be sufficient.  More than 10 days, wholesale
replacement would be the easiest.  A single patch replacement in
between.

Thanks.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux
  2020-09-17  0:59         ` Junio C Hamano
@ 2020-09-17  1:10           ` Taylor Blau
  2020-09-17 13:34             ` Taylor Blau
  0 siblings, 1 reply; 75+ messages in thread
From: Taylor Blau @ 2020-09-17  1:10 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Taylor Blau, Derrick Stolee, git, dstolee, peff, szeder.dev

On Wed, Sep 16, 2020 at 05:59:26PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > On Wed, Sep 16, 2020 at 04:07:51PM -0700, Junio C Hamano wrote:
> >> Yup, lets merge it down soonish.
> >
> > It's me, the bearer of bad news. I noticed an uninitialized read when
> > running the tests with SANITIZE=address,undefined. I *think* the fix is
> > as simple as a single replacement, but let me double check before you
> > merge this.
> >
> > Sorry this topic has been such a disaster. Assuming the fix is isolated
> > to a single patch, do you want a new version of that patch, or the whole
> > series?
>
> Depends on the timing---if it takes less than 2 days, please expect
> that the topic would still be in my cache and "squash this into
> patch 07/13" would be sufficient.  More than 10 days, wholesale
> replacement would be the easiest.  A single patch replacement in
> between.

I should have the patch in your inbox by the end of tonight, depending
on how fast my workstation can run the ASan-enabled test suite 13 times
;).

> Thanks.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux
  2020-09-17  1:10           ` Taylor Blau
@ 2020-09-17 13:34             ` Taylor Blau
  2020-09-17 13:38               ` Derrick Stolee
  0 siblings, 1 reply; 75+ messages in thread
From: Taylor Blau @ 2020-09-17 13:34 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Derrick Stolee, git, dstolee, peff, szeder.dev

Junio,

On Wed, Sep 16, 2020 at 09:10:49PM -0400, Taylor Blau wrote:
> I should have the patch in your inbox by the end of tonight, depending
> on how fast my workstation can run the ASan-enabled test suite 13 times
> ;).

All finished. This is sufficient to fix the ASan-enabled test suite,
along with fixing a bug where we wouldn't respect the limit on changed
paths when loading an existing commit-graph. This has nothing to do with
the user-specified '--max-new-filters', nor does it mean that we're
storing the limit in the commit-graph file. Instead it's because we're
loading the bloom_filter_settings struct from the graph and
initializing it ourselves, instead of using the default values (which is
the case when we don't load a graph at all).

Anyway, let's use this instead of 6/13. Here's an inter-diff that shows
the fix and test change:

  diff --git a/commit-graph.c b/commit-graph.c
  index 33af6c2430..fc6c6fdc3e 100644
  --- a/commit-graph.c
  +++ b/commit-graph.c
  @@ -424,6 +424,7 @@ struct commit_graph *parse_commit_graph(struct repository *r,
          graph->bloom_filter_settings->hash_version = hash_version;
          graph->bloom_filter_settings->num_hashes = get_be32(data + chunk_offset + 4);
          graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
  +				graph->bloom_filter_settings->max_changed_paths = DEFAULT_BLOOM_MAX_CHANGES;
        }
        break;
      }
  diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
  index dc7d62c778..5bc1627568 100755
  --- a/t/t4216-log-bloom.sh
  +++ b/t/t4216-log-bloom.sh
  @@ -175,11 +175,11 @@ test_expect_success 'persist filter settings' '
      GIT_TEST_BLOOM_SETTINGS_NUM_HASHES=9 \
      GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY=15 \
      git commit-graph write --reachable --changed-paths &&
  -	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15" trace2.txt &&
  +	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15,\"max_changed_paths\":512" trace2.txt &&
    GIT_TRACE2_EVENT="$(pwd)/trace2-auto.txt" \
      GIT_TRACE2_EVENT_NESTING=5 \
      git commit-graph write --reachable --changed-paths &&
  -	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15" trace2-auto.txt
  +	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15,\"max_changed_paths\":512" trace2-auto.txt
   '

   test_max_changed_paths () {

...and here is the updated patch. Sorry again for all of the trouble.

--- >8 ---

Subject: [PATCH] commit-graph.c: store maximum changed paths

For now, we assume that there is a fixed constant describing the
maximum number of changed paths we are willing to store in a Bloom
filter.

Prepare for that to (at least partially) not be the case by making it a
member of the 'struct bloom_filter_settings'. This will be helpful in
the subsequent patches by reducing the size of test cases that exercise
storing too many changed paths, as well as preparing for an eventual
future in which this value might change.

This patch alone does not cause newly generated Bloom filters to use
a custom upper-bound on the maximum number of changed paths a single
Bloom filter can hold, that will occur in a later patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bloom.h              | 11 ++++++++++-
 commit-graph.c       |  4 ++++
 t/t4216-log-bloom.sh |  4 ++--
 3 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/bloom.h b/bloom.h
index d8fbb0fbf1..0b9b59a6fe 100644
--- a/bloom.h
+++ b/bloom.h
@@ -28,9 +28,18 @@ struct bloom_filter_settings {
 	 * that contain n*b bits.
 	 */
 	uint32_t bits_per_entry;
+
+	/*
+	 * The maximum number of changed paths per commit
+	 * before declaring a Bloom filter to be too-large.
+	 *
+	 * Not written to the commit-graph file.
+	 */
+	uint32_t max_changed_paths;
 };

-#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
+#define DEFAULT_BLOOM_MAX_CHANGES 512
+#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10, DEFAULT_BLOOM_MAX_CHANGES }
 #define BITS_PER_WORD 8
 #define BLOOMDATA_CHUNK_HEADER_SIZE 3 * sizeof(uint32_t)

diff --git a/commit-graph.c b/commit-graph.c
index ea54d108b9..ba6d4a4c6c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -424,6 +424,7 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 				graph->bloom_filter_settings->hash_version = hash_version;
 				graph->bloom_filter_settings->num_hashes = get_be32(data + chunk_offset + 4);
 				graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
+				graph->bloom_filter_settings->max_changed_paths = DEFAULT_BLOOM_MAX_CHANGES;
 			}
 			break;
 		}
@@ -1201,6 +1202,7 @@ static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
 	jw_object_intmax(&jw, "hash_version", ctx->bloom_settings->hash_version);
 	jw_object_intmax(&jw, "num_hashes", ctx->bloom_settings->num_hashes);
 	jw_object_intmax(&jw, "bits_per_entry", ctx->bloom_settings->bits_per_entry);
+	jw_object_intmax(&jw, "max_changed_paths", ctx->bloom_settings->max_changed_paths);
 	jw_end(&jw);

 	trace2_data_json("bloom", ctx->r, "settings", &jw);
@@ -1669,6 +1671,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 							      bloom_settings.bits_per_entry);
 		bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
 							  bloom_settings.num_hashes);
+		bloom_settings.max_changed_paths = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS",
+							  bloom_settings.max_changed_paths);
 		ctx->bloom_settings = &bloom_settings;
 	}

diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index fc7693806c..593571358d 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -174,11 +174,11 @@ test_expect_success 'persist filter settings' '
 		GIT_TEST_BLOOM_SETTINGS_NUM_HASHES=9 \
 		GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY=15 \
 		git commit-graph write --reachable --changed-paths &&
-	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2.txt &&
+	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15,\"max_changed_paths\":512" trace2.txt &&
 	GIT_TRACE2_EVENT="$(pwd)/trace2-auto.txt" \
 		GIT_TRACE2_EVENT_NESTING=5 \
 		git commit-graph write --reachable --changed-paths &&
-	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2-auto.txt
+	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15,\"max_changed_paths\":512" trace2-auto.txt
 '

 test_expect_success 'correctly report changes over limit' '
--
2.28.0.510.g86fdc5f89a


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* Re: [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux
  2020-09-17 13:34             ` Taylor Blau
@ 2020-09-17 13:38               ` Derrick Stolee
  0 siblings, 0 replies; 75+ messages in thread
From: Derrick Stolee @ 2020-09-17 13:38 UTC (permalink / raw)
  To: Taylor Blau, Junio C Hamano; +Cc: git, dstolee, peff, szeder.dev

On 9/17/2020 9:34 AM, Taylor Blau wrote:
> Junio,
> 
> On Wed, Sep 16, 2020 at 09:10:49PM -0400, Taylor Blau wrote:
>> I should have the patch in your inbox by the end of tonight, depending
>> on how fast my workstation can run the ASan-enabled test suite 13 times
>> ;).
> 
> All finished. This is sufficient to fix the ASan-enabled test suite,
> along with fixing a bug where we wouldn't respect the limit on changed
> paths when loading an existing commit-graph. This has nothing to do with
> the user-specified '--max-new-filters', nor does it mean that we're
> storing the limit in the commit-graph file. Instead it's because we're
> loading the bloom_filter_settings struct from the graph and
> initializing it ourselves, instead of using the default values (which is
> the case when we don't load a graph at all).
> 
> Anyway, let's use this instead of 6/13. Here's an inter-diff that shows
> the fix and test change:
> 
>   diff --git a/commit-graph.c b/commit-graph.c
>   index 33af6c2430..fc6c6fdc3e 100644
>   --- a/commit-graph.c
>   +++ b/commit-graph.c
>   @@ -424,6 +424,7 @@ struct commit_graph *parse_commit_graph(struct repository *r,
>           graph->bloom_filter_settings->hash_version = hash_version;
>           graph->bloom_filter_settings->num_hashes = get_be32(data + chunk_offset + 4);
>           graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
>   +				graph->bloom_filter_settings->max_changed_paths = DEFAULT_BLOOM_MAX_CHANGES;

This whitespace looks strange in the inter-diff...

>  				graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
> +				graph->bloom_filter_settings->max_changed_paths = DEFAULT_BLOOM_MAX_CHANGES;

...but it is correct in the patch itself.

> -	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2.txt &&
> +	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15,\"max_changed_paths\":512" trace2.txt &&
>  	GIT_TRACE2_EVENT="$(pwd)/trace2-auto.txt" \
>  		GIT_TRACE2_EVENT_NESTING=5 \
>  		git commit-graph write --reachable --changed-paths &&
> -	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2-auto.txt
> +	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15,\"max_changed_paths\":512" trace2-auto.txt

I appreciate the additional tests to guarantee this is set
correctly.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2 10/13] bloom: encode out-of-bounds filters as non-empty
  2020-09-16 18:07   ` [PATCH v2 10/13] bloom: encode out-of-bounds filters as non-empty Taylor Blau
@ 2020-09-17 22:13     ` SZEDER Gábor
  2020-09-17 23:13       ` Taylor Blau
  0 siblings, 1 reply; 75+ messages in thread
From: SZEDER Gábor @ 2020-09-17 22:13 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, peff

On Wed, Sep 16, 2020 at 02:07:59PM -0400, Taylor Blau wrote:
> When a changed-path Bloom filter has either zero, or more than a
> certain number (commonly 512) of entries, the commit-graph machinery
> encodes it as "missing". More specifically, it sets the indices adjacent
> in the BIDX chunk as equal to each other to indicate a "length 0"
> filter; that is, that the filter occupies zero bytes on disk.
> 
> This has heretofore been fine, since the commit-graph machinery has no
> need to care about these filters with too few or too many changed paths.
> Both cases act like no filter has been generated at all, and so there is
> no need to store them.
> 
> In a subsequent commit, however, the commit-graph machinery will learn
> to only compute Bloom filters for some commits in the current
> commit-graph layer. This is a change from the current implementation
> which computes Bloom filters for all commits that are in the layer being
> written. Critically for this patch, only computing some of the Bloom
> filters means adding a third state for length 0 Bloom filters: zero
> entries, too many entries, or "hasn't been computed".
> 
> It will be important for that future patch to distinguish between "not
> representable" (i.e., zero or too-many changed paths), and "hasn't been
> computed". In particular, we don't want to waste time recomputing
> filters that have already been computed.
> 
> To that end, change how we store Bloom filters in the "computed but not
> representable" category:
> 
>   - Bloom filters with no entries are stored as a single byte with all
>     bits low (i.e., all queries to that Bloom filter will return
>     "definitely not")
> 
>   - Bloom filters with too many entries are stored as a single byte with
>     all bits set high (i.e., all queries to that Bloom filter will
>     return "maybe").
> 
> These rules are sufficient to not incur a behavior change by changing
> the on-disk representation of these two classes. Likewise, no
> specification changes are necessary for the commit-graph format, either:
> 
>   - Filters that were previously empty will be recomputed and stored
>     according to the new rules, and
> 
>   - old clients reading filters generated by new clients will interpret
>     the filters correctly and be none the wiser to how they were
>     generated.
> 
> Clients will invoke the Bloom machinery in more cases than before, but
> this can be addressed by returning a NULL filter when all bits are set
> high. This can be addressed in a future patch.

OTOH, clients will invoke the tree-diff machinery in fewer cases than
before, because querying the Bloom filter of commits not modifying any
files will now return "definitely not".

> Finally, note that this does increase the size of on-disk commit-graphs,
> but far less than other proposals. In particular, this is generally more
> efficient than storing a bitmap for which commits haven't computed their
> Bloom filters. Storing a bitmap incurs a penalty of one bit per commit,
> whereas storing explicit filters as above incurs a penalty of one byte
> per too-large or too-small commit.

s/too-small/empty/

> In practice, these boundary commits likely occupy a small proportion of
> the overall number of commits, and so the size penalty is likely smaller
> than storing a bitmap for all commits.

                 |      Percentage of
                 |    commits modifying
                 |   0 path   |  >= 512 paths
  ---------------+------------+----------------
  android-base   |   13.20%   |   0.13%
  cmssw          |    0.15%   |   0.23%
  cpython        |    3.07%   |   0.01%
  elasticsearch  |    0.70%   |   1.00%
  gcc            |    0.00%   |   0.08%
  gecko-dev      |    0.14%   |   0.64%
  git            |    0.11%   |   0.02%
  glibc          |    0.02%   |   0.10%
  go             |    0.00%   |   0.07%
  homebrew-cask  |    0.40%   |   0.02%
  homebrew-core  |    0.01%   |   0.01%
  jdk            |    0.26%   |   5.64%
  linux          |    0.01%   |   0.51%
  llvm-project   |    0.12%   |   0.03%
  rails          |    0.10%   |   0.10%
  rust           |    0.07%   |   0.17%
  tensorflow     |    0.09%   |   1.02%
  webkit         |    0.05%   |   0.31%


> A test to exercise filters which contain too many changed path entries
> will be introduced in a subsequent patch.


> diff --git a/bloom.h b/bloom.h
> index c6d77e8393..70a8840896 100644
> --- a/bloom.h
> +++ b/bloom.h
> @@ -93,6 +93,7 @@ enum bloom_filter_computed {
>  	BLOOM_NOT_COMPUTED = (1 << 0),
>  	BLOOM_COMPUTED     = (1 << 1),
>  	BLOOM_TRUNC_LARGE  = (1 << 2),
> +	BLOOM_TRUNC_SMALL  = (1 << 3),

s/SMALL/EMPTY/

This "small" suffix in the constant, variable, and trace2 key names is
misleading, because we only mean empty commits.

>  };
>  
>  struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
> diff --git a/commit-graph.c b/commit-graph.c
> index 1ca754f19c..bd4247bca5 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -974,6 +974,7 @@ struct write_commit_graph_context {
>  
>  	int count_bloom_filter_computed;
>  	int count_bloom_filter_not_computed;
> +	int count_bloom_filter_trunc_small;
>  	int count_bloom_filter_trunc_large;
>  };
>  

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2 10/13] bloom: encode out-of-bounds filters as non-empty
  2020-09-17 22:13     ` SZEDER Gábor
@ 2020-09-17 23:13       ` Taylor Blau
  2020-09-18  0:52         ` Junio C Hamano
  0 siblings, 1 reply; 75+ messages in thread
From: Taylor Blau @ 2020-09-17 23:13 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Taylor Blau, git, dstolee, gitster, peff

On Fri, Sep 18, 2020 at 12:13:02AM +0200, SZEDER Gábor wrote:
> > Clients will invoke the Bloom machinery in more cases than before, but
> > this can be addressed by returning a NULL filter when all bits are set
> > high. This can be addressed in a future patch.
>
> OTOH, clients will invoke the tree-diff machinery in fewer cases than
> before, because querying the Bloom filter of commits not modifying any
> files will now return "definitely not".

Absolutely right.

> > Finally, note that this does increase the size of on-disk commit-graphs,
> > but far less than other proposals. In particular, this is generally more
> > efficient than storing a bitmap for which commits haven't computed their
> > Bloom filters. Storing a bitmap incurs a penalty of one bit per commit,
> > whereas storing explicit filters as above incurs a penalty of one byte
> > per too-large or too-small commit.
>
> s/too-small/empty/

Fair enough, although I'm not planning to alter this or any other patch
now that it's picked up unless there's a real show-stopper (this doesn't
seem like one).

> > In practice, these boundary commits likely occupy a small proportion of
> > the overall number of commits, and so the size penalty is likely smaller
> > than storing a bitmap for all commits.
>
>                  |      Percentage of
>                  |    commits modifying
>                  |   0 path   |  >= 512 paths
>   ---------------+------------+----------------
>   android-base   |   13.20%   |   0.13%
>   cmssw          |    0.15%   |   0.23%
>   cpython        |    3.07%   |   0.01%
>   elasticsearch  |    0.70%   |   1.00%
>   gcc            |    0.00%   |   0.08%
>   gecko-dev      |    0.14%   |   0.64%
>   git            |    0.11%   |   0.02%
>   glibc          |    0.02%   |   0.10%
>   go             |    0.00%   |   0.07%
>   homebrew-cask  |    0.40%   |   0.02%
>   homebrew-core  |    0.01%   |   0.01%
>   jdk            |    0.26%   |   5.64%
>   linux          |    0.01%   |   0.51%
>   llvm-project   |    0.12%   |   0.03%
>   rails          |    0.10%   |   0.10%
>   rust           |    0.07%   |   0.17%
>   tensorflow     |    0.09%   |   1.02%
>   webkit         |    0.05%   |   0.31%

This is very useful information to have! Without the total number of
commits, it's impossible to know whether or not this is a win over the
BFXL chunk. But, since the number of commits is probably "large" versus
the percentage of boundary commits which is "small", it's almost
certainly an advantage.

> > A test to exercise filters which contain too many changed path entries
> > will be introduced in a subsequent patch.
>
>
> > diff --git a/bloom.h b/bloom.h
> > index c6d77e8393..70a8840896 100644
> > --- a/bloom.h
> > +++ b/bloom.h
> > @@ -93,6 +93,7 @@ enum bloom_filter_computed {
> >  	BLOOM_NOT_COMPUTED = (1 << 0),
> >  	BLOOM_COMPUTED     = (1 << 1),
> >  	BLOOM_TRUNC_LARGE  = (1 << 2),
> > +	BLOOM_TRUNC_SMALL  = (1 << 3),
>
> s/SMALL/EMPTY/
>
> This "small" suffix in the constant, variable, and trace2 key names is
> misleading, because we only mean empty commits.

I could buy that it might be misleading; I only picked this since it was
the opposite of "large". You could imagine that BLOOM_TRUNC_X means
"truncated in the direction of 'x'", but to be honest I don't think that
this matters.

I understand the churn of coming back to this after the topic has
already been merged creates more hassle, but frankly this series has
already gone on for quite a while, and it has been holding up important
bug fixes that are unrelated to the main feature.

So, I think that if it's truly misleading, we could revisit this after
the topic is merged. But, I'm not planning on changing anything at this
point.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2 10/13] bloom: encode out-of-bounds filters as non-empty
  2020-09-17 23:13       ` Taylor Blau
@ 2020-09-18  0:52         ` Junio C Hamano
  2020-09-18  1:15           ` Taylor Blau
  0 siblings, 1 reply; 75+ messages in thread
From: Junio C Hamano @ 2020-09-18  0:52 UTC (permalink / raw)
  To: Taylor Blau; +Cc: SZEDER Gábor, git, dstolee, peff

Taylor Blau <me@ttaylorr.com> writes:

>> > In practice, these boundary commits likely occupy a small proportion of
>> > the overall number of commits, and so the size penalty is likely smaller
>> > than storing a bitmap for all commits.
>>
>>                  |      Percentage of
>>                  |    commits modifying
>>                  |   0 path   |  >= 512 paths
>>   ---------------+------------+----------------
>>   android-base   |   13.20%   |   0.13%
>>   cmssw          |    0.15%   |   0.23%
>>   cpython        |    3.07%   |   0.01%
>>   elasticsearch  |    0.70%   |   1.00%
>>   gcc            |    0.00%   |   0.08%
>>   gecko-dev      |    0.14%   |   0.64%
>>   git            |    0.11%   |   0.02%
>>   glibc          |    0.02%   |   0.10%
>>   go             |    0.00%   |   0.07%
>>   homebrew-cask  |    0.40%   |   0.02%
>>   homebrew-core  |    0.01%   |   0.01%
>>   jdk            |    0.26%   |   5.64%
>>   linux          |    0.01%   |   0.51%
>>   llvm-project   |    0.12%   |   0.03%
>>   rails          |    0.10%   |   0.10%
>>   rust           |    0.07%   |   0.17%
>>   tensorflow     |    0.09%   |   1.02%
>>   webkit         |    0.05%   |   0.31%
>
> This is very useful information to have! Without the total number of
> commits, it's impossible to know whether or not this is a win over the
> BFXL chunk. But, since the number of commits is probably "large" versus
> the percentage of boundary commits which is "small", it's almost
> certainly an advantage.

Do you want to include it in either the log message in one of the
commits, in code comment, or a technical doc?

> So, I think that if it's truly misleading, we could revisit this after
> the topic is merged. But, I'm not planning on changing anything at this
> point.

If you do not want to help us go the last-mile to completion, that
is sad, but I do not want to see a basically good patch stall like
that, so let's find somebody else who can do the helping ;-)

Here is what my trial rebase produced.  I'll queue it to 'seen' (if
you prefer I can send a full v3 patch, but I expect that you know
how to fetch from 'seen' and review locally) after checking if the
result passes the tests locally; an extra set of eyeballs to verify
the result is pretty much appreciated.

Thanks.


1:  ca6c060171 ! 1:  9b294a0c66 bloom: encode out-of-bounds filters as non-empty
    @@ Commit message
         efficient than storing a bitmap for which commits haven't computed their
         Bloom filters. Storing a bitmap incurs a penalty of one bit per commit,
         whereas storing explicit filters as above incurs a penalty of one byte
    -    per too-large or too-small commit.
    +    per too-large or empty commit.
     
         In practice, these boundary commits likely occupy a small proportion of
         the overall number of commits, and so the size penalty is likely smaller
    @@ bloom.c: struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
      		filter->len = (hashmap_get_size(&pathmap) * settings->bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
     +		if (!filter->len) {
     +			if (computed)
    -+				*computed |= BLOOM_TRUNC_SMALL;
    ++				*computed |= BLOOM_TRUNC_EMPTY;
     +			filter->len = 1;
     +		}
      		filter->data = xcalloc(filter->len, sizeof(unsigned char));
    @@ bloom.h: enum bloom_filter_computed {
      	BLOOM_NOT_COMPUTED = (1 << 0),
      	BLOOM_COMPUTED     = (1 << 1),
      	BLOOM_TRUNC_LARGE  = (1 << 2),
    -+	BLOOM_TRUNC_SMALL  = (1 << 3),
    ++	BLOOM_TRUNC_EMPTY  = (1 << 3),
      };
      
      struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
    @@ commit-graph.c: struct write_commit_graph_context {
      
      	int count_bloom_filter_computed;
      	int count_bloom_filter_not_computed;
    -+	int count_bloom_filter_trunc_small;
    ++	int count_bloom_filter_trunc_empty;
      	int count_bloom_filter_trunc_large;
      };
      
    @@ commit-graph.c: static void trace2_bloom_filter_write_statistics(struct write_co
      			   ctx->count_bloom_filter_computed);
      	trace2_data_intmax("commit-graph", ctx->r, "filter-not-computed",
      			   ctx->count_bloom_filter_not_computed);
    -+	trace2_data_intmax("commit-graph", ctx->r, "filter-trunc-small",
    -+			   ctx->count_bloom_filter_trunc_small);
    ++	trace2_data_intmax("commit-graph", ctx->r, "filter-trunc-empty",
    ++			   ctx->count_bloom_filter_trunc_empty);
      	trace2_data_intmax("commit-graph", ctx->r, "filter-trunc-large",
      			   ctx->count_bloom_filter_trunc_large);
      }
    @@ commit-graph.c: static void compute_bloom_filters(struct write_commit_graph_cont
      			&computed);
      		if (computed & BLOOM_COMPUTED) {
      			ctx->count_bloom_filter_computed++;
    -+			if (computed & BLOOM_TRUNC_SMALL)
    -+				ctx->count_bloom_filter_trunc_small++;
    ++			if (computed & BLOOM_TRUNC_EMPTY)
    ++				ctx->count_bloom_filter_trunc_empty++;
      			if (computed & BLOOM_TRUNC_LARGE)
      				ctx->count_bloom_filter_trunc_large++;
      		} else if (computed & BLOOM_NOT_COMPUTED)
    @@ t/t4216-log-bloom.sh: test_max_changed_paths () {
      	grep "\"key\":\"filter-computed\",\"value\":\"$1\"" $2
      }
      
    -+test_filter_trunc_small () {
    -+	grep "\"key\":\"filter-trunc-small\",\"value\":\"$1\"" $2
    ++test_filter_trunc_empty () {
    ++	grep "\"key\":\"filter-trunc-empty\",\"value\":\"$1\"" $2
     +}
     +
      test_filter_trunc_large () {
    @@ t/t4216-log-bloom.sh: test_expect_success 'correctly report changes over limit'
      '
      
     +test_expect_success 'correctly report commits with no changed paths' '
    -+	git init small &&
    -+	test_when_finished "rm -fr small" &&
    ++	git init empty &&
    ++	test_when_finished "rm -fr empty" &&
     +	(
    -+		cd small &&
    ++		cd empty &&
     +
     +		git commit --allow-empty -m "initial commit" &&
     +
    @@ t/t4216-log-bloom.sh: test_expect_success 'correctly report changes over limit'
     +			git commit-graph write --reachable --changed-paths &&
     +		test_filter_computed 1 trace.event &&
     +		test_filter_not_computed 0 trace.event &&
    -+		test_filter_trunc_small 1 trace.event &&
    ++		test_filter_trunc_empty 1 trace.event &&
     +		test_filter_trunc_large 0 trace.event
     +	)
     +'
2:  11db600d51 = 2:  1b4c861e68 commit-graph: rename 'split_commit_graph_opts'
3:  cf49598137 ! 3:  d6c1bd395e builtin/commit-graph.c: introduce '--max-new-filters=<n>'
    @@ t/t4216-log-bloom.sh: test_expect_success 'correctly report commits with no chan
     +
     +		test_filter_computed 2 trace.event &&
     +		test_filter_not_computed 3 trace.event &&
    -+		test_filter_trunc_small 0 trace.event &&
    ++		test_filter_trunc_empty 0 trace.event &&
     +		test_filter_trunc_large 0 trace.event
     +	)
     +'
    @@ t/t4216-log-bloom.sh: test_expect_success 'correctly report commits with no chan
     +				--split=replace --max-new-filters=1 &&
     +		test_filter_computed 1 trace.event &&
     +		test_filter_not_computed 4 trace.event &&
    -+		test_filter_trunc_small 0 trace.event &&
    ++		test_filter_trunc_empty 0 trace.event &&
     +		test_filter_trunc_large 0 trace.event
     +	)
     +'
    @@ t/t4216-log-bloom.sh: test_expect_success 'correctly report commits with no chan
     +					--changed-paths --max-new-filters=2 &&
     +			test_filter_computed 2 trace.event &&
     +			test_filter_not_computed 4 trace.event &&
    -+			test_filter_trunc_small 2 trace.event &&
    ++			test_filter_trunc_empty 2 trace.event &&
     +			test_filter_trunc_large 0 trace.event
     +		done &&
     +
    @@ t/t4216-log-bloom.sh: test_expect_success 'correctly report commits with no chan
     +				--changed-paths --max-new-filters=2 &&
     +		test_filter_computed 0 trace.event &&
     +		test_filter_not_computed 6 trace.event &&
    -+		test_filter_trunc_small 0 trace.event &&
    ++		test_filter_trunc_empty 0 trace.event &&
     +		test_filter_trunc_large 0 trace.event
     +	)
     +'
4:  1bc82cd008 ! 4:  b0d51fb04a commit-graph: introduce 'commitGraph.maxNewFilters'
    @@ t/t4216-log-bloom.sh: test_expect_success 'Bloom generation is limited by --max-
     +				--split=replace &&
      		test_filter_computed 1 trace.event &&
      		test_filter_not_computed 4 trace.event &&
    - 		test_filter_trunc_small 0 trace.event &&
    + 		test_filter_trunc_empty 0 trace.event &&
     @@ t/t4216-log-bloom.sh: test_expect_success 'Bloom generation backfills previously-skipped filters' '
      	)
      '
    @@ t/t4216-log-bloom.sh: test_expect_success 'Bloom generation backfills previously
     +				--max-new-filters=1 &&
     +		test_filter_computed 1 trace.event &&
     +		test_filter_not_computed 1 trace.event &&
    -+		test_filter_trunc_small 0 trace.event &&
    ++		test_filter_trunc_empty 0 trace.event &&
     +		test_filter_trunc_large 0 trace.event
     +	)
     +'

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2 10/13] bloom: encode out-of-bounds filters as non-empty
  2020-09-18  0:52         ` Junio C Hamano
@ 2020-09-18  1:15           ` Taylor Blau
  0 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-18  1:15 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, SZEDER Gábor, git, dstolee, peff

On Thu, Sep 17, 2020 at 05:52:11PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> >> > In practice, these boundary commits likely occupy a small proportion of
> >> > the overall number of commits, and so the size penalty is likely smaller
> >> > than storing a bitmap for all commits.
> >>
> >>                  |      Percentage of
> >>                  |    commits modifying
> >>                  |   0 path   |  >= 512 paths
> >>   ---------------+------------+----------------
> >>   android-base   |   13.20%   |   0.13%
> >>   cmssw          |    0.15%   |   0.23%
> >>   cpython        |    3.07%   |   0.01%
> >>   elasticsearch  |    0.70%   |   1.00%
> >>   gcc            |    0.00%   |   0.08%
> >>   gecko-dev      |    0.14%   |   0.64%
> >>   git            |    0.11%   |   0.02%
> >>   glibc          |    0.02%   |   0.10%
> >>   go             |    0.00%   |   0.07%
> >>   homebrew-cask  |    0.40%   |   0.02%
> >>   homebrew-core  |    0.01%   |   0.01%
> >>   jdk            |    0.26%   |   5.64%
> >>   linux          |    0.01%   |   0.51%
> >>   llvm-project   |    0.12%   |   0.03%
> >>   rails          |    0.10%   |   0.10%
> >>   rust           |    0.07%   |   0.17%
> >>   tensorflow     |    0.09%   |   1.02%
> >>   webkit         |    0.05%   |   0.31%
> >
> > This is very useful information to have! Without the total number of
> > commits, it's impossible to know whether or not this is a win over the
> > BFXL chunk. But, since the number of commits is probably "large" versus
> > the percentage of boundary commits which is "small", it's almost
> > certainly an advantage.
>
> Do you want to include it in either the log message in one of the
> commits, in code comment, or a technical doc?

Let's put it in the commit message. I think that it's useful enough that
people interested enough to dig through the commits would want it, but
too detailed to be as visible as in the technical documentation.

> > So, I think that if it's truly misleading, we could revisit this after
> > the topic is merged. But, I'm not planning on changing anything at this
> > point.
>
> If you do not want to help us go the last-mile to completion, that
> is sad, but I do not want to see a basically good patch stall like
> that, so let's find somebody else who can do the helping ;-)

I'm sorry to give off the impression that I do not want to help; that's
not the case. After reading Gàbor's email, I thought that the changes he
was suggesting were minor, and was trying to say "let's do this on top
and apply the important bug fixes first," not, "I am never going to
touch this again".

> Here is what my trial rebase produced.  I'll queue it to 'seen' (if
> you prefer I can send a full v3 patch, but I expect that you know
> how to fetch from 'seen' and review locally) after checking if the
> result passes the tests locally; an extra set of eyeballs to verify
> the result is pretty much appreciated.

It looks off to a great start; and thanks for taking the care to make up
for my laziness. Everything you did looks good to me. I touched up
"bloom: encode out-of-bounds filters as non-empty" locally to include
the table above (and some commentary around it), so I'll send a v3 to
the list once I have finished running the tests.

> Thanks.

Thank you, and sorry again.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH v3 00/13] more miscellaneous Bloom filter improvements, redux
  2020-09-09 15:22 [PATCH 00/12] more miscellaneous Bloom filter improvements, redux Taylor Blau
                   ` (12 preceding siblings ...)
  2020-09-16 18:06 ` [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux Taylor Blau
@ 2020-09-18  2:58 ` Taylor Blau
  2020-09-18  2:58   ` [PATCH v3 01/13] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
                     ` (13 more replies)
  13 siblings, 14 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-18  2:58 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

Hi again,

Here's a few more changes to the "Bloom filter improvements" topic,
sent as one brand-new re-roll in order to simplify queuing. It
incorporates:

  - Junio's changes from applying to 'seen' (namely, dropping references
    to "too-small" commits in favor of the much more clear "empty"
    commits).

  - On top, I applied Gàbor's data gathered in [1] to 10/13 (and added a
    little more detail on the absolute and relative size differences of
    the resulting commit-graph files built before/after that patch).

Thanks to everyone who has helped out along the way with this series
(Stolee, Gàbor, Junio, Jakub, and I am sure that I am forgetting some...).

Sorry to have given anyone the impression that I was abandoning this
topic; I'm definitely not ;-).

[1]: https://lore.kernel.org/git/20200917221302.GC23146@szeder.dev/

Derrick Stolee (1):
  bloom/diff: properly short-circuit on max_changes

Taylor Blau (12):
  commit-graph: introduce 'get_bloom_filter_settings()'
  t4216: use an '&&'-chain
  commit-graph: pass a 'struct repository *' in more places
  t/helper/test-read-graph.c: prepare repo settings
  commit-graph: respect 'commitGraph.readChangedPaths'
  commit-graph.c: store maximum changed paths
  bloom: split 'get_bloom_filter()' in two
  bloom: use provided 'struct bloom_filter_settings'
  bloom: encode out-of-bounds filters as non-empty
  commit-graph: rename 'split_commit_graph_opts'
  builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  commit-graph: introduce 'commitGraph.maxNewFilters'

 Documentation/config.txt                      |   2 +
 Documentation/config/commitgraph.txt          |   8 +
 Documentation/git-commit-graph.txt            |   6 +
 .../technical/commit-graph-format.txt         |   2 +-
 blame.c                                       |   8 +-
 bloom.c                                       |  59 +++--
 bloom.h                                       |  29 ++-
 builtin/commit-graph.c                        |  63 ++++-
 commit-graph.c                                | 141 +++++++---
 commit-graph.h                                |  17 +-
 diff.h                                        |   2 -
 fuzz-commit-graph.c                           |   5 +-
 line-log.c                                    |   2 +-
 repo-settings.c                               |   3 +
 repository.h                                  |   1 +
 revision.c                                    |   7 +-
 t/helper/test-bloom.c                         |   4 +-
 t/helper/test-read-graph.c                    |   3 +-
 t/t0095-bloom.sh                              |   8 +-
 t/t4216-log-bloom.sh                          | 242 ++++++++++++++++--
 t/t5324-split-commit-graph.sh                 |  13 +
 tree-diff.c                                   |   5 +-
 22 files changed, 507 insertions(+), 123 deletions(-)
 create mode 100644 Documentation/config/commitgraph.txt

--
2.28.0.510.g375ecf1f36

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH v3 01/13] commit-graph: introduce 'get_bloom_filter_settings()'
  2020-09-18  2:58 ` [PATCH v3 " Taylor Blau
@ 2020-09-18  2:58   ` Taylor Blau
  2020-09-18  2:58   ` [PATCH v3 02/13] t4216: use an '&&'-chain Taylor Blau
                     ` (12 subsequent siblings)
  13 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-18  2:58 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

Many places in the code often need a pointer to the commit-graph's
'struct bloom_filter_settings', in which case they often take the value
from the top-most commit-graph.

In the non-split case, this works as expected. In the split case,
however, things get a little tricky. Not all layers in a chain of
incremental commit-graphs are required to themselves have Bloom data,
and so whether or not some part of the code uses Bloom filters depends
entirely on whether or not the top-most level of the commit-graph chain
has Bloom filters.

This has been the behavior since Bloom filters were introduced, and has
been codified into the tests since a759bfa9ee (t4216: add end to end
tests for git log with Bloom filters, 2020-04-06). In fact, t4216.130
requires that Bloom filters are not used in exactly the case described
earlier.

There is no reason that this needs to be the case, since it is perfectly
valid for commits in an earlier layer to have Bloom filters when commits
in a newer layer do not.

Since Bloom settings are guaranteed in practice to be the same for any
layer in a chain that has Bloom data, it is sufficient to traverse the
'->base_graph' pointer until either (1) a non-null 'struct
bloom_filter_settings *' is found, or (2) until we are at the root of
the commit-graph chain.

Introduce a 'get_bloom_filter_settings()' function that does just this,
and use it instead of purely dereferencing the top-most graph's
'->bloom_filter_settings' pointer.

While we're at it, add an additional test in t5324 to guard against code
in the commit-graph writing machinery that doesn't correctly handle a
NULL 'struct bloom_filter *'.

Co-authored-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 blame.c                       |  6 ++++--
 bloom.c                       |  6 +++---
 commit-graph.c                | 11 +++++++++++
 commit-graph.h                |  2 ++
 revision.c                    |  5 +----
 t/t4216-log-bloom.sh          |  9 ++++++---
 t/t5324-split-commit-graph.sh | 13 +++++++++++++
 7 files changed, 40 insertions(+), 12 deletions(-)

diff --git a/blame.c b/blame.c
index 1be1cd82a2..903e23af23 100644
--- a/blame.c
+++ b/blame.c
@@ -2892,16 +2892,18 @@ void setup_blame_bloom_data(struct blame_scoreboard *sb,
 			    const char *path)
 {
 	struct blame_bloom_data *bd;
+	struct bloom_filter_settings *bs;
 
 	if (!sb->repo->objects->commit_graph)
 		return;
 
-	if (!sb->repo->objects->commit_graph->bloom_filter_settings)
+	bs = get_bloom_filter_settings(sb->repo);
+	if (!bs)
 		return;
 
 	bd = xmalloc(sizeof(struct blame_bloom_data));
 
-	bd->settings = sb->repo->objects->commit_graph->bloom_filter_settings;
+	bd->settings = bs;
 
 	bd->alloc = 4;
 	bd->nr = 0;
diff --git a/bloom.c b/bloom.c
index 1a573226e7..cd9380ac62 100644
--- a/bloom.c
+++ b/bloom.c
@@ -38,7 +38,7 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
 	while (graph_pos < g->num_commits_in_base)
 		g = g->base_graph;
 
-	/* The commit graph commit 'c' lives in doesn't carry bloom filters. */
+	/* The commit graph commit 'c' lives in doesn't carry Bloom filters. */
 	if (!g->chunk_bloom_indexes)
 		return 0;
 
@@ -195,8 +195,8 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	if (!filter->data) {
 		load_commit_graph_info(r, c);
 		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH &&
-			r->objects->commit_graph->chunk_bloom_indexes)
-			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
+			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
+				return filter;
 	}
 
 	if (filter->data)
diff --git a/commit-graph.c b/commit-graph.c
index 0ed003e218..6a36ed0b06 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -667,6 +667,17 @@ int generation_numbers_enabled(struct repository *r)
 	return !!first_generation;
 }
 
+struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
+{
+	struct commit_graph *g = r->objects->commit_graph;
+	while (g) {
+		if (g->bloom_filter_settings)
+			return g->bloom_filter_settings;
+		g = g->base_graph;
+	}
+	return NULL;
+}
+
 static void close_commit_graph_one(struct commit_graph *g)
 {
 	if (!g)
diff --git a/commit-graph.h b/commit-graph.h
index 09a97030dc..0677dd1031 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -87,6 +87,8 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size);
  */
 int generation_numbers_enabled(struct repository *r);
 
+struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
+
 enum commit_graph_write_flags {
 	COMMIT_GRAPH_WRITE_APPEND     = (1 << 0),
 	COMMIT_GRAPH_WRITE_PROGRESS   = (1 << 1),
diff --git a/revision.c b/revision.c
index 1239023f93..6aeb764821 100644
--- a/revision.c
+++ b/revision.c
@@ -681,10 +681,7 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 
 	repo_parse_commit(revs->repo, revs->commits->item);
 
-	if (!revs->repo->objects->commit_graph)
-		return;
-
-	revs->bloom_filter_settings = revs->repo->objects->commit_graph->bloom_filter_settings;
+	revs->bloom_filter_settings = get_bloom_filter_settings(revs->repo);
 	if (!revs->bloom_filter_settings)
 		return;
 
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 4bb9e9dbe2..715912ad0f 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -65,7 +65,7 @@ setup () {
 
 test_bloom_filters_used () {
 	log_args=$1
-	bloom_trace_prefix="statistics:{\"filter_not_present\":0,\"maybe\""
+	bloom_trace_prefix="statistics:{\"filter_not_present\":${2:-0},\"maybe\""
 	setup "$log_args" &&
 	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
 	test_cmp log_wo_bloom log_w_bloom &&
@@ -139,8 +139,11 @@ test_expect_success 'setup - add commit-graph to the chain without Bloom filters
 	test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain
 '
 
-test_expect_success 'Do not use Bloom filters if the latest graph does not have Bloom filters.' '
-	test_bloom_filters_not_used "-- A/B"
+test_expect_success 'use Bloom filters even if the latest graph does not have Bloom filters' '
+	# Ensure that the number of empty filters is equal to the number of
+	# filters in the latest graph layer to prove that they are loaded (and
+	# ignored).
+	test_bloom_filters_used "-- A/B" 3
 '
 
 test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 18216463c7..c334ee9155 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -427,4 +427,17 @@ done <<\EOF
 0600 -r--------
 EOF
 
+test_expect_success '--split=replace with partial Bloom data' '
+	rm -rf $graphdir $infodir/commit-graph &&
+	git reset --hard commits/3 &&
+	git rev-list -1 HEAD~2 >a &&
+	git rev-list -1 HEAD~1 >b &&
+	git commit-graph write --split=no-merge --stdin-commits --changed-paths <a &&
+	git commit-graph write --split=no-merge --stdin-commits <b &&
+	git commit-graph write --split=replace --stdin-commits --changed-paths <c &&
+	ls $graphdir/graph-*.graph >graph-files &&
+	test_line_count = 1 graph-files &&
+	verify_chain_files_exist $graphdir
+'
+
 test_done
-- 
2.28.0.510.g375ecf1f36


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v3 02/13] t4216: use an '&&'-chain
  2020-09-18  2:58 ` [PATCH v3 " Taylor Blau
  2020-09-18  2:58   ` [PATCH v3 01/13] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
@ 2020-09-18  2:58   ` Taylor Blau
  2020-09-18  2:59   ` [PATCH v3 03/13] commit-graph: pass a 'struct repository *' in more places Taylor Blau
                     ` (11 subsequent siblings)
  13 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-18  2:58 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

In a759bfa9ee (t4216: add end to end tests for git log with Bloom
filters, 2020-04-06), a 'rm' invocation was added without a
corresponding '&&' chain.

When 'trace.perf' already exists, everything works fine. However, the
function can be executed without 'trace.perf' on disk (eg., when the
subset of tests run is altered with '--run'), and so the bare 'rm'
complains about a missing file.

To remove some noise from the test log, invoke 'rm' with '-f', at which
point it is sensible to place the 'rm -f' in an '&&'-chain, which is
both (1) our usual style, and (2) avoids a broken chain in the future if
more commands are added at the beginning of the function.

Helped-by: Eric Sunshine <sunshine@sunshineco.com>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 t/t4216-log-bloom.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 715912ad0f..cd89c75002 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -58,7 +58,7 @@ sane_unset GIT_TRACE2_PERF_BRIEF
 sane_unset GIT_TRACE2_CONFIG_PARAMS
 
 setup () {
-	rm "$TRASH_DIRECTORY/trace.perf"
+	rm -f "$TRASH_DIRECTORY/trace.perf" &&
 	git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom &&
 	GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom
 }
-- 
2.28.0.510.g375ecf1f36


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v3 03/13] commit-graph: pass a 'struct repository *' in more places
  2020-09-18  2:58 ` [PATCH v3 " Taylor Blau
  2020-09-18  2:58   ` [PATCH v3 01/13] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
  2020-09-18  2:58   ` [PATCH v3 02/13] t4216: use an '&&'-chain Taylor Blau
@ 2020-09-18  2:59   ` Taylor Blau
  2020-09-18  2:59   ` [PATCH v3 04/13] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
                     ` (10 subsequent siblings)
  13 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-18  2:59 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

In a future commit, some commit-graph internals will want access to
'r->settings', but we only have the 'struct object_directory *'
corresponding to that repository.

Add an additional parameter to pass the repository around in more
places.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/commit-graph.c |  2 +-
 commit-graph.c         | 17 ++++++++++-------
 commit-graph.h         |  6 ++++--
 fuzz-commit-graph.c    |  5 +++--
 4 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 523501f217..ba5584463f 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -106,7 +106,7 @@ static int graph_verify(int argc, const char **argv)
 	FREE_AND_NULL(graph_name);
 
 	if (open_ok)
-		graph = load_commit_graph_one_fd_st(fd, &st, odb);
+		graph = load_commit_graph_one_fd_st(the_repository, fd, &st, odb);
 	else
 		graph = read_commit_graph_one(the_repository, odb);
 
diff --git a/commit-graph.c b/commit-graph.c
index 6a36ed0b06..72a838bd00 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -231,7 +231,8 @@ int open_commit_graph(const char *graph_file, int *fd, struct stat *st)
 	return 1;
 }
 
-struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st,
+struct commit_graph *load_commit_graph_one_fd_st(struct repository *r,
+						 int fd, struct stat *st,
 						 struct object_directory *odb)
 {
 	void *graph_map;
@@ -247,7 +248,7 @@ struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st,
 	}
 	graph_map = xmmap(NULL, graph_size, PROT_READ, MAP_PRIVATE, fd, 0);
 	close(fd);
-	ret = parse_commit_graph(graph_map, graph_size);
+	ret = parse_commit_graph(r, graph_map, graph_size);
 
 	if (ret)
 		ret->odb = odb;
@@ -287,7 +288,8 @@ static int verify_commit_graph_lite(struct commit_graph *g)
 	return 0;
 }
 
-struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size)
+struct commit_graph *parse_commit_graph(struct repository *r,
+					void *graph_map, size_t graph_size)
 {
 	const unsigned char *data, *chunk_lookup;
 	uint32_t i;
@@ -452,7 +454,8 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size)
 	return NULL;
 }
 
-static struct commit_graph *load_commit_graph_one(const char *graph_file,
+static struct commit_graph *load_commit_graph_one(struct repository *r,
+						  const char *graph_file,
 						  struct object_directory *odb)
 {
 
@@ -464,7 +467,7 @@ static struct commit_graph *load_commit_graph_one(const char *graph_file,
 	if (!open_ok)
 		return NULL;
 
-	g = load_commit_graph_one_fd_st(fd, &st, odb);
+	g = load_commit_graph_one_fd_st(r, fd, &st, odb);
 
 	if (g)
 		g->filename = xstrdup(graph_file);
@@ -476,7 +479,7 @@ static struct commit_graph *load_commit_graph_v1(struct repository *r,
 						 struct object_directory *odb)
 {
 	char *graph_name = get_commit_graph_filename(odb);
-	struct commit_graph *g = load_commit_graph_one(graph_name, odb);
+	struct commit_graph *g = load_commit_graph_one(r, graph_name, odb);
 	free(graph_name);
 
 	return g;
@@ -557,7 +560,7 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r,
 		valid = 0;
 		for (odb = r->objects->odb; odb; odb = odb->next) {
 			char *graph_name = get_split_graph_filename(odb, line.buf);
-			struct commit_graph *g = load_commit_graph_one(graph_name, odb);
+			struct commit_graph *g = load_commit_graph_one(r, graph_name, odb);
 
 			free(graph_name);
 
diff --git a/commit-graph.h b/commit-graph.h
index 0677dd1031..d9acb22bac 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -75,11 +75,13 @@ struct commit_graph {
 	struct bloom_filter_settings *bloom_filter_settings;
 };
 
-struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st,
+struct commit_graph *load_commit_graph_one_fd_st(struct repository *r,
+						 int fd, struct stat *st,
 						 struct object_directory *odb);
 struct commit_graph *read_commit_graph_one(struct repository *r,
 					   struct object_directory *odb);
-struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size);
+struct commit_graph *parse_commit_graph(struct repository *r,
+					void *graph_map, size_t graph_size);
 
 /*
  * Return 1 if and only if the repository has a commit-graph
diff --git a/fuzz-commit-graph.c b/fuzz-commit-graph.c
index 430817214d..e7cf6d5b0f 100644
--- a/fuzz-commit-graph.c
+++ b/fuzz-commit-graph.c
@@ -1,7 +1,8 @@
 #include "commit-graph.h"
 #include "repository.h"
 
-struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size);
+struct commit_graph *parse_commit_graph(struct repository *r,
+					void *graph_map, size_t graph_size);
 
 int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size);
 
@@ -10,7 +11,7 @@ int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
 	struct commit_graph *g;
 
 	initialize_the_repository();
-	g = parse_commit_graph((void *)data, size);
+	g = parse_commit_graph(the_repository, (void *)data, size);
 	repo_clear(the_repository);
 	free_commit_graph(g);
 
-- 
2.28.0.510.g375ecf1f36


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v3 04/13] t/helper/test-read-graph.c: prepare repo settings
  2020-09-18  2:58 ` [PATCH v3 " Taylor Blau
                     ` (2 preceding siblings ...)
  2020-09-18  2:59   ` [PATCH v3 03/13] commit-graph: pass a 'struct repository *' in more places Taylor Blau
@ 2020-09-18  2:59   ` Taylor Blau
  2020-09-18  2:59   ` [PATCH v3 05/13] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
                     ` (9 subsequent siblings)
  13 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-18  2:59 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

The read-graph test-tool is used by a number of the commit-graph test to
assert various properties about a commit-graph. Previously, this program
never ran 'prepare_repo_settings()'. There was no need to do so, since
none of the commit-graph machinery is affected by the repo settings.

In the next patch, the commit-graph machinery's behavior will become
dependent on the repo settings, and so loading them before running the
rest of the test tool is critical.

As such, teach the test tool to call 'prepare_repo_settings()'.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 t/helper/test-read-graph.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index 6d0c962438..5f585a1725 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -12,11 +12,12 @@ int cmd__read_graph(int argc, const char **argv)
 	setup_git_directory();
 	odb = the_repository->objects->odb;
 
+	prepare_repo_settings(the_repository);
+
 	graph = read_commit_graph_one(the_repository, odb);
 	if (!graph)
 		return 1;
 
-
 	printf("header: %08x %d %d %d %d\n",
 		ntohl(*(uint32_t*)graph->data),
 		*(unsigned char*)(graph->data + 4),
-- 
2.28.0.510.g375ecf1f36


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v3 05/13] commit-graph: respect 'commitGraph.readChangedPaths'
  2020-09-18  2:58 ` [PATCH v3 " Taylor Blau
                     ` (3 preceding siblings ...)
  2020-09-18  2:59   ` [PATCH v3 04/13] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
@ 2020-09-18  2:59   ` Taylor Blau
  2020-09-18  2:59   ` [PATCH v3 06/13] commit-graph.c: store maximum changed paths Taylor Blau
                     ` (8 subsequent siblings)
  13 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-18  2:59 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

Git uses the 'core.commitGraph' configuration value to control whether
or not the commit graph is used when parsing commits or performing a
traversal.

Now that commit-graphs can also contain a section for changed-path Bloom
filters, administrators that already have commit-graphs may find it
convenient to use those graphs without relying on their changed-path
Bloom filters. This can happen, for example, during a staged roll-out,
or in the event of an incident.

Introduce 'commitGraph.readChangedPaths' to control whether or not Bloom
filters are read. Note that this configuration is independent from both:

  - 'core.commitGraph', to allow flexibility in using all parts of a
    commit-graph _except_ for its Bloom filters.

  - The '--changed-paths' option for 'git commit-graph write', to allow
    reading and writing Bloom filters to be controlled independently.

When the variable is set, pretend as if no Bloom data was specified at
all. This avoids adding additional special-casing outside of the
commit-graph internals.

Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/config.txt             | 2 ++
 Documentation/config/commitgraph.txt | 4 ++++
 commit-graph.c                       | 6 ++++--
 repo-settings.c                      | 3 +++
 repository.h                         | 1 +
 t/t4216-log-bloom.sh                 | 4 +++-
 6 files changed, 17 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/config/commitgraph.txt

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 3042d80978..770ae79b82 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -340,6 +340,8 @@ include::config/column.txt[]
 
 include::config/commit.txt[]
 
+include::config/commitgraph.txt[]
+
 include::config/credential.txt[]
 
 include::config/completion.txt[]
diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
new file mode 100644
index 0000000000..cff0797b54
--- /dev/null
+++ b/Documentation/config/commitgraph.txt
@@ -0,0 +1,4 @@
+commitGraph.readChangedPaths::
+	If true, then git will use the changed-path Bloom filters in the
+	commit-graph file (if it exists, and they are present). Defaults to
+	true. See linkgit:git-commit-graph[1] for more information.
diff --git a/commit-graph.c b/commit-graph.c
index 72a838bd00..ea54d108b9 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -327,6 +327,8 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 		return NULL;
 	}
 
+	prepare_repo_settings(r);
+
 	graph = alloc_commit_graph();
 
 	graph->hash_len = the_hash_algo->rawsz;
@@ -403,14 +405,14 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 		case GRAPH_CHUNKID_BLOOMINDEXES:
 			if (graph->chunk_bloom_indexes)
 				chunk_repeated = 1;
-			else
+			else if (r->settings.commit_graph_read_changed_paths)
 				graph->chunk_bloom_indexes = data + chunk_offset;
 			break;
 
 		case GRAPH_CHUNKID_BLOOMDATA:
 			if (graph->chunk_bloom_data)
 				chunk_repeated = 1;
-			else {
+			else if (r->settings.commit_graph_read_changed_paths) {
 				uint32_t hash_version;
 				graph->chunk_bloom_data = data + chunk_offset;
 				hash_version = get_be32(data + chunk_offset);
diff --git a/repo-settings.c b/repo-settings.c
index aa61a35338..88ccce2036 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -17,9 +17,12 @@ void prepare_repo_settings(struct repository *r)
 
 	if (!repo_config_get_bool(r, "core.commitgraph", &value))
 		r->settings.core_commit_graph = value;
+	if (!repo_config_get_bool(r, "commitgraph.readchangedpaths", &value))
+		r->settings.commit_graph_read_changed_paths = value;
 	if (!repo_config_get_bool(r, "gc.writecommitgraph", &value))
 		r->settings.gc_write_commit_graph = value;
 	UPDATE_DEFAULT_BOOL(r->settings.core_commit_graph, 1);
+	UPDATE_DEFAULT_BOOL(r->settings.commit_graph_read_changed_paths, 1);
 	UPDATE_DEFAULT_BOOL(r->settings.gc_write_commit_graph, 1);
 
 	if (!repo_config_get_int(r, "index.version", &value))
diff --git a/repository.h b/repository.h
index 628c834367..bacf843d46 100644
--- a/repository.h
+++ b/repository.h
@@ -30,6 +30,7 @@ struct repo_settings {
 	int initialized;
 
 	int core_commit_graph;
+	int commit_graph_read_changed_paths;
 	int gc_write_commit_graph;
 	int fetch_write_commit_graph;
 
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index cd89c75002..fc7693806c 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -95,7 +95,9 @@ do
 		      "--ancestry-path side..master"
 	do
 		test_expect_success "git log option: $option for path: $path" '
-			test_bloom_filters_used "$option -- $path"
+			test_bloom_filters_used "$option -- $path" &&
+			test_config commitgraph.readChangedPaths false &&
+			test_bloom_filters_not_used "$option -- $path"
 		'
 	done
 done
-- 
2.28.0.510.g375ecf1f36


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v3 06/13] commit-graph.c: store maximum changed paths
  2020-09-18  2:58 ` [PATCH v3 " Taylor Blau
                     ` (4 preceding siblings ...)
  2020-09-18  2:59   ` [PATCH v3 05/13] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
@ 2020-09-18  2:59   ` Taylor Blau
  2020-09-18  2:59   ` [PATCH v3 07/13] bloom: split 'get_bloom_filter()' in two Taylor Blau
                     ` (7 subsequent siblings)
  13 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-18  2:59 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

For now, we assume that there is a fixed constant describing the
maximum number of changed paths we are willing to store in a Bloom
filter.

Prepare for that to (at least partially) not be the case by making it a
member of the 'struct bloom_filter_settings'. This will be helpful in
the subsequent patches by reducing the size of test cases that exercise
storing too many changed paths, as well as preparing for an eventual
future in which this value might change.

This patch alone does not cause newly generated Bloom filters to use
a custom upper-bound on the maximum number of changed paths a single
Bloom filter can hold, that will occur in a later patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 bloom.h              | 11 ++++++++++-
 commit-graph.c       |  4 ++++
 t/t4216-log-bloom.sh |  4 ++--
 3 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/bloom.h b/bloom.h
index d8fbb0fbf1..0b9b59a6fe 100644
--- a/bloom.h
+++ b/bloom.h
@@ -28,9 +28,18 @@ struct bloom_filter_settings {
 	 * that contain n*b bits.
 	 */
 	uint32_t bits_per_entry;
+
+	/*
+	 * The maximum number of changed paths per commit
+	 * before declaring a Bloom filter to be too-large.
+	 *
+	 * Not written to the commit-graph file.
+	 */
+	uint32_t max_changed_paths;
 };
 
-#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
+#define DEFAULT_BLOOM_MAX_CHANGES 512
+#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10, DEFAULT_BLOOM_MAX_CHANGES }
 #define BITS_PER_WORD 8
 #define BLOOMDATA_CHUNK_HEADER_SIZE 3 * sizeof(uint32_t)
 
diff --git a/commit-graph.c b/commit-graph.c
index ea54d108b9..ba6d4a4c6c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -424,6 +424,7 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 				graph->bloom_filter_settings->hash_version = hash_version;
 				graph->bloom_filter_settings->num_hashes = get_be32(data + chunk_offset + 4);
 				graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
+				graph->bloom_filter_settings->max_changed_paths = DEFAULT_BLOOM_MAX_CHANGES;
 			}
 			break;
 		}
@@ -1201,6 +1202,7 @@ static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
 	jw_object_intmax(&jw, "hash_version", ctx->bloom_settings->hash_version);
 	jw_object_intmax(&jw, "num_hashes", ctx->bloom_settings->num_hashes);
 	jw_object_intmax(&jw, "bits_per_entry", ctx->bloom_settings->bits_per_entry);
+	jw_object_intmax(&jw, "max_changed_paths", ctx->bloom_settings->max_changed_paths);
 	jw_end(&jw);
 
 	trace2_data_json("bloom", ctx->r, "settings", &jw);
@@ -1669,6 +1671,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 							      bloom_settings.bits_per_entry);
 		bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
 							  bloom_settings.num_hashes);
+		bloom_settings.max_changed_paths = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS",
+							  bloom_settings.max_changed_paths);
 		ctx->bloom_settings = &bloom_settings;
 	}
 
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index fc7693806c..593571358d 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -174,11 +174,11 @@ test_expect_success 'persist filter settings' '
 		GIT_TEST_BLOOM_SETTINGS_NUM_HASHES=9 \
 		GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY=15 \
 		git commit-graph write --reachable --changed-paths &&
-	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2.txt &&
+	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15,\"max_changed_paths\":512" trace2.txt &&
 	GIT_TRACE2_EVENT="$(pwd)/trace2-auto.txt" \
 		GIT_TRACE2_EVENT_NESTING=5 \
 		git commit-graph write --reachable --changed-paths &&
-	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2-auto.txt
+	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15,\"max_changed_paths\":512" trace2-auto.txt
 '
 
 test_expect_success 'correctly report changes over limit' '
-- 
2.28.0.510.g375ecf1f36


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v3 07/13] bloom: split 'get_bloom_filter()' in two
  2020-09-18  2:58 ` [PATCH v3 " Taylor Blau
                     ` (5 preceding siblings ...)
  2020-09-18  2:59   ` [PATCH v3 06/13] commit-graph.c: store maximum changed paths Taylor Blau
@ 2020-09-18  2:59   ` Taylor Blau
  2020-09-18  2:59   ` [PATCH v3 08/13] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
                     ` (6 subsequent siblings)
  13 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-18  2:59 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

'get_bloom_filter' takes a flag to control whether it will compute a
Bloom filter if the requested one is missing. In the next patch, we'll
add yet another parameter to this method, which would force all but one
caller to specify an extra 'NULL' parameter at the end.

Instead of doing this, split 'get_bloom_filter' into two functions:
'get_bloom_filter' and 'get_or_compute_bloom_filter'. The former only
looks up a Bloom filter (and does not compute one if it's missing,
thus dropping the 'compute_if_not_present' flag). The latter does
compute missing Bloom filters, with an additional parameter to store
whether or not it needed to do so.

This simplifies many call-sites, since the majority of existing callers
to 'get_bloom_filter' do not want missing Bloom filters to be computed
(so they can drop the parameter entirely and use the simpler version of
the function).

While we're at it, instrument the new 'get_or_compute_bloom_filter()'
with counters in the 'write_commit_graph_context' struct which store
the number of filters that we did and didn't compute, as well as filters
that were truncated.

It would be nice to drop the 'compute_if_not_present' flag entirely,
since all remaining callers of 'get_or_compute_bloom_filter' pass it as
'1', but this will change in a future patch and hence cannot be removed.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 blame.c               |  2 +-
 bloom.c               | 16 +++++++++++++---
 bloom.h               | 16 +++++++++++++---
 commit-graph.c        | 34 +++++++++++++++++++++++++++++++---
 line-log.c            |  2 +-
 revision.c            |  2 +-
 t/helper/test-bloom.c |  3 ++-
 7 files changed, 62 insertions(+), 13 deletions(-)

diff --git a/blame.c b/blame.c
index 903e23af23..e5ba35dbd1 100644
--- a/blame.c
+++ b/blame.c
@@ -1276,7 +1276,7 @@ static int maybe_changed_path(struct repository *r,
 	if (commit_graph_generation(origin->commit) == GENERATION_NUMBER_INFINITY)
 		return 1;
 
-	filter = get_bloom_filter(r, origin->commit, 0);
+	filter = get_bloom_filter(r, origin->commit);
 
 	if (!filter)
 		return 1;
diff --git a/bloom.c b/bloom.c
index cd9380ac62..393c61b4bc 100644
--- a/bloom.c
+++ b/bloom.c
@@ -177,9 +177,10 @@ static int pathmap_cmp(const void *hashmap_cmp_fn_data,
 	return strcmp(e1->path, e2->path);
 }
 
-struct bloom_filter *get_bloom_filter(struct repository *r,
-				      struct commit *c,
-				      int compute_if_not_present)
+struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
+						 struct commit *c,
+						 int compute_if_not_present,
+						 enum bloom_filter_computed *computed)
 {
 	struct bloom_filter *filter;
 	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
@@ -187,6 +188,9 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	struct diff_options diffopt;
 	int max_changes = 512;
 
+	if (computed)
+		*computed = BLOOM_NOT_COMPUTED;
+
 	if (!bloom_filters.slab_size)
 		return NULL;
 
@@ -271,8 +275,14 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 			diff_free_filepair(diff_queued_diff.queue[i]);
 		filter->data = NULL;
 		filter->len = 0;
+
+		if (computed)
+			*computed |= BLOOM_TRUNC_LARGE;
 	}
 
+	if (computed)
+		*computed |= BLOOM_COMPUTED;
+
 	free(diff_queued_diff.queue);
 	DIFF_QUEUE_CLEAR(&diff_queued_diff);
 
diff --git a/bloom.h b/bloom.h
index 0b9b59a6fe..e2e035ad14 100644
--- a/bloom.h
+++ b/bloom.h
@@ -89,9 +89,19 @@ void add_key_to_filter(const struct bloom_key *key,
 
 void init_bloom_filters(void);
 
-struct bloom_filter *get_bloom_filter(struct repository *r,
-				      struct commit *c,
-				      int compute_if_not_present);
+enum bloom_filter_computed {
+	BLOOM_NOT_COMPUTED = (1 << 0),
+	BLOOM_COMPUTED     = (1 << 1),
+	BLOOM_TRUNC_LARGE  = (1 << 2),
+};
+
+struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
+						 struct commit *c,
+						 int compute_if_not_present,
+						 enum bloom_filter_computed *computed);
+
+#define get_bloom_filter(r, c) get_or_compute_bloom_filter( \
+	(r), (c), 0, NULL)
 
 int bloom_filter_contains(const struct bloom_filter *filter,
 			  const struct bloom_key *key,
diff --git a/commit-graph.c b/commit-graph.c
index ba6d4a4c6c..ecdab89e93 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -972,6 +972,10 @@ struct write_commit_graph_context {
 	const struct split_commit_graph_opts *split_opts;
 	size_t total_bloom_filter_data_size;
 	const struct bloom_filter_settings *bloom_settings;
+
+	int count_bloom_filter_computed;
+	int count_bloom_filter_not_computed;
+	int count_bloom_filter_trunc_large;
 };
 
 static int write_graph_chunk_fanout(struct hashfile *f,
@@ -1183,7 +1187,7 @@ static int write_graph_chunk_bloom_indexes(struct hashfile *f,
 	uint32_t cur_pos = 0;
 
 	while (list < last) {
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
+		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
 		size_t len = filter ? filter->len : 0;
 		cur_pos += len;
 		display_progress(ctx->progress, ++ctx->progress_cnt);
@@ -1223,7 +1227,7 @@ static int write_graph_chunk_bloom_data(struct hashfile *f,
 	hashwrite_be32(f, ctx->bloom_settings->bits_per_entry);
 
 	while (list < last) {
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
+		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
 		size_t len = filter ? filter->len : 0;
 
 		display_progress(ctx->progress, ++ctx->progress_cnt);
@@ -1393,6 +1397,16 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 	stop_progress(&ctx->progress);
 }
 
+static void trace2_bloom_filter_write_statistics(struct write_commit_graph_context *ctx)
+{
+	trace2_data_intmax("commit-graph", ctx->r, "filter-computed",
+			   ctx->count_bloom_filter_computed);
+	trace2_data_intmax("commit-graph", ctx->r, "filter-not-computed",
+			   ctx->count_bloom_filter_not_computed);
+	trace2_data_intmax("commit-graph", ctx->r, "filter-trunc-large",
+			   ctx->count_bloom_filter_trunc_large);
+}
+
 static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 {
 	int i;
@@ -1415,12 +1429,26 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 		QSORT(sorted_commits, ctx->commits.nr, commit_gen_cmp);
 
 	for (i = 0; i < ctx->commits.nr; i++) {
+		enum bloom_filter_computed computed = 0;
 		struct commit *c = sorted_commits[i];
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
+		struct bloom_filter *filter = get_or_compute_bloom_filter(
+			ctx->r,
+			c,
+			1,
+			&computed);
+		if (computed & BLOOM_COMPUTED) {
+			ctx->count_bloom_filter_computed++;
+			if (computed & BLOOM_TRUNC_LARGE)
+				ctx->count_bloom_filter_trunc_large++;
+		} else if (computed & BLOOM_NOT_COMPUTED)
+			ctx->count_bloom_filter_not_computed++;
 		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
 		display_progress(progress, i + 1);
 	}
 
+	if (trace2_is_enabled())
+		trace2_bloom_filter_write_statistics(ctx);
+
 	free(sorted_commits);
 	stop_progress(&progress);
 }
diff --git a/line-log.c b/line-log.c
index bf73ea95ac..68eeb425f8 100644
--- a/line-log.c
+++ b/line-log.c
@@ -1159,7 +1159,7 @@ static int bloom_filter_check(struct rev_info *rev,
 		return 1;
 
 	if (!rev->bloom_filter_settings ||
-	    !(filter = get_bloom_filter(rev->repo, commit, 0)))
+	    !(filter = get_bloom_filter(rev->repo, commit)))
 		return 1;
 
 	if (!range)
diff --git a/revision.c b/revision.c
index 6aeb764821..c53ab99e59 100644
--- a/revision.c
+++ b/revision.c
@@ -752,7 +752,7 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 	if (commit_graph_generation(commit) == GENERATION_NUMBER_INFINITY)
 		return -1;
 
-	filter = get_bloom_filter(revs->repo, commit, 0);
+	filter = get_bloom_filter(revs->repo, commit);
 
 	if (!filter) {
 		count_bloom_filter_not_present++;
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 5e77d56f59..9f7bb729fc 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -39,7 +39,8 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
 	struct bloom_filter *filter;
 	setup_git_directory();
 	c = lookup_commit(the_repository, commit_oid);
-	filter = get_bloom_filter(the_repository, c, 1);
+	filter = get_or_compute_bloom_filter(the_repository, c, 1,
+					     NULL);
 	print_bloom_filter(filter);
 }
 
-- 
2.28.0.510.g375ecf1f36


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v3 08/13] bloom: use provided 'struct bloom_filter_settings'
  2020-09-18  2:58 ` [PATCH v3 " Taylor Blau
                     ` (6 preceding siblings ...)
  2020-09-18  2:59   ` [PATCH v3 07/13] bloom: split 'get_bloom_filter()' in two Taylor Blau
@ 2020-09-18  2:59   ` Taylor Blau
  2020-09-18 16:27     ` SZEDER Gábor
  2020-09-18  2:59   ` [PATCH v3 09/13] bloom/diff: properly short-circuit on max_changes Taylor Blau
                     ` (5 subsequent siblings)
  13 siblings, 1 reply; 75+ messages in thread
From: Taylor Blau @ 2020-09-18  2:59 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

When 'get_or_compute_bloom_filter()' needs to compute a Bloom filter
from scratch, it looks to the default 'struct bloom_filter_settings' in
order to determine the maximum number of changed paths, number of bits
per entry, and so on.

All of these values have so far been constant, and so there was no need
to pass in a pointer from the caller (eg., the one that is stored in the
'struct write_commit_graph_context').

Start passing in a 'struct bloom_filter_settings *' instead of using the
default values to respect graph-specific settings (eg., in the case of
setting 'GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS').

In order to have an initialized value for these settings, move its
initialization to earlier in the commit-graph write.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 bloom.c               | 13 ++++++-------
 bloom.h               |  3 ++-
 commit-graph.c        | 21 ++++++++++-----------
 t/helper/test-bloom.c |  1 +
 4 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/bloom.c b/bloom.c
index 393c61b4bc..2d6aef9098 100644
--- a/bloom.c
+++ b/bloom.c
@@ -180,13 +180,12 @@ static int pathmap_cmp(const void *hashmap_cmp_fn_data,
 struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 						 struct commit *c,
 						 int compute_if_not_present,
+						 const struct bloom_filter_settings *settings,
 						 enum bloom_filter_computed *computed)
 {
 	struct bloom_filter *filter;
-	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 	int i;
 	struct diff_options diffopt;
-	int max_changes = 512;
 
 	if (computed)
 		*computed = BLOOM_NOT_COMPUTED;
@@ -211,7 +210,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 	repo_diff_setup(r, &diffopt);
 	diffopt.flags.recursive = 1;
 	diffopt.detect_rename = 0;
-	diffopt.max_changes = max_changes;
+	diffopt.max_changes = settings->max_changed_paths;
 	diff_setup_done(&diffopt);
 
 	/* ensure commit is parsed so we have parent information */
@@ -223,7 +222,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
 	diffcore_std(&diffopt);
 
-	if (diffopt.num_changes <= max_changes) {
+	if (diffopt.num_changes <= settings->max_changed_paths) {
 		struct hashmap pathmap;
 		struct pathmap_hash_entry *e;
 		struct hashmap_iter iter;
@@ -260,13 +259,13 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 			diff_free_filepair(diff_queued_diff.queue[i]);
 		}
 
-		filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
+		filter->len = (hashmap_get_size(&pathmap) * settings->bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
 		filter->data = xcalloc(filter->len, sizeof(unsigned char));
 
 		hashmap_for_each_entry(&pathmap, &iter, e, entry) {
 			struct bloom_key key;
-			fill_bloom_key(e->path, strlen(e->path), &key, &settings);
-			add_key_to_filter(&key, filter, &settings);
+			fill_bloom_key(e->path, strlen(e->path), &key, settings);
+			add_key_to_filter(&key, filter, settings);
 		}
 
 		hashmap_free_entries(&pathmap, struct pathmap_hash_entry, entry);
diff --git a/bloom.h b/bloom.h
index e2e035ad14..c6d77e8393 100644
--- a/bloom.h
+++ b/bloom.h
@@ -98,10 +98,11 @@ enum bloom_filter_computed {
 struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 						 struct commit *c,
 						 int compute_if_not_present,
+						 const struct bloom_filter_settings *settings,
 						 enum bloom_filter_computed *computed);
 
 #define get_bloom_filter(r, c) get_or_compute_bloom_filter( \
-	(r), (c), 0, NULL)
+	(r), (c), 0, NULL, NULL)
 
 int bloom_filter_contains(const struct bloom_filter *filter,
 			  const struct bloom_key *key,
diff --git a/commit-graph.c b/commit-graph.c
index ecdab89e93..50519eb968 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1435,6 +1435,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 			ctx->r,
 			c,
 			1,
+			ctx->bloom_settings,
 			&computed);
 		if (computed & BLOOM_COMPUTED) {
 			ctx->count_bloom_filter_computed++;
@@ -1692,17 +1693,6 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	int num_chunks = 3;
 	uint64_t chunk_offset;
 	struct object_id file_hash;
-	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
-
-	if (!ctx->bloom_settings) {
-		bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
-							      bloom_settings.bits_per_entry);
-		bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
-							  bloom_settings.num_hashes);
-		bloom_settings.max_changed_paths = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS",
-							  bloom_settings.max_changed_paths);
-		ctx->bloom_settings = &bloom_settings;
-	}
 
 	if (ctx->split) {
 		struct strbuf tmp_file = STRBUF_INIT;
@@ -2148,6 +2138,7 @@ int write_commit_graph(struct object_directory *odb,
 	uint32_t i, count_distinct = 0;
 	int res = 0;
 	int replace = 0;
+	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 
 	if (!commit_graph_compatible(the_repository))
 		return 0;
@@ -2161,6 +2152,14 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->split_opts = split_opts;
 	ctx->total_bloom_filter_data_size = 0;
 
+	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
+						      bloom_settings.bits_per_entry);
+	bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
+						  bloom_settings.num_hashes);
+	bloom_settings.max_changed_paths = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS",
+							 bloom_settings.max_changed_paths);
+	ctx->bloom_settings = &bloom_settings;
+
 	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
 		ctx->changed_paths = 1;
 	if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 9f7bb729fc..46e97b04eb 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -40,6 +40,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
 	setup_git_directory();
 	c = lookup_commit(the_repository, commit_oid);
 	filter = get_or_compute_bloom_filter(the_repository, c, 1,
+					     &settings,
 					     NULL);
 	print_bloom_filter(filter);
 }
-- 
2.28.0.510.g375ecf1f36


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v3 09/13] bloom/diff: properly short-circuit on max_changes
  2020-09-18  2:58 ` [PATCH v3 " Taylor Blau
                     ` (7 preceding siblings ...)
  2020-09-18  2:59   ` [PATCH v3 08/13] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
@ 2020-09-18  2:59   ` Taylor Blau
  2020-09-18  2:59   ` [PATCH v3 10/13] bloom: encode out-of-bounds filters as non-empty Taylor Blau
                     ` (4 subsequent siblings)
  13 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-18  2:59 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

From: Derrick Stolee <dstolee@microsoft.com>

Commit e3696980 (diff: halt tree-diff early after max_changes,
2020-03-30) intended to create a mechanism to short-circuit a diff
calculation after a certain number of paths were modified. By
incrementing a "num_changes" counter throughout the recursive
ll_diff_tree_paths(), this was supposed to match the number of changes
that would be written into the changed-path Bloom filters.
Unfortunately, this was not implemented correctly and instead misses
simple cases like file modifications. This then does not stop very
large changed-path filters from being written (unless they add or remove
many files).

To start, change the implementation in ll_diff_tree_paths() to instead
use the global diff_queue_diff struct's 'nr' member as the count. This
is a way to simplify the logic instead of making more mistakes in the
complicated diff code.

This has a drawback: the diff_queue_diff struct only lists the paths
corresponding to blob changes, not their leading directories. Thus,
get_or_compute_bloom_filter() needs an additional check to see if the
hashmap with the leading directories becomes too large.

One reason why this was not caught by test cases was that the test in
t4216-log-bloom.sh that was supposed to check this "too many changes"
condition only checked this on the initial commit of a repository. The
old logic counted these values correctly. Update this test in a few
ways:

1. Use GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS to reduce the limit,
   allowing smaller commits to engage with this logic.

2. Create several interesting cases of edits, adds, removes, and mode
   changes (in the second commit). By testing both sides of the
   inequality with the *_MAX_CHANGED_PATHS variable, we can see that
   the count is exactly correct, so none of these changes are missed
   or over-counted.

3. Use the trace2 data value filter_found_large to verify that these
   commits are on the correct side of the limit.

Another way to verify the behavior is correct is through performance
tests. By testing on my local copies of the Git repository and the Linux
kernel repository, I could measure the effect of these short-circuits
when computing a fresh commit-graph file with changed-path Bloom filters
using the command

  GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=N time \
    git commit-graph write --reachable --changed-paths

and reporting the wall time and resulting commit-graph size.

For Git, the results are

|        |      N=1       |       N=10     |      N=512     |
|--------|----------------|----------------|----------------|
| HEAD~1 | 10.90s  9.18MB | 11.11s  9.34MB | 11.31s  9.35MB |
| HEAD   |  9.21s  8.62MB | 11.11s  9.29MB | 11.29s  9.34MB |

For Linux, the results are

|        |       N=1      |     N=20      |     N=512     |
|--------|----------------|---------------|---------------|
| HEAD~1 | 61.28s  64.3MB | 76.9s  72.6MB | 77.6s  72.6MB |
| HEAD   | 49.44s  56.3MB | 68.7s  65.9MB | 69.2s  65.9MB |

Naturally, the improvement becomes much less as the limit grows, as
fewer commits satisfy the short-circuit.

Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 bloom.c              |   9 +++-
 diff.h               |   2 -
 t/t4216-log-bloom.sh | 100 +++++++++++++++++++++++++++++++++++++++----
 tree-diff.c          |   5 +--
 4 files changed, 100 insertions(+), 16 deletions(-)

diff --git a/bloom.c b/bloom.c
index 2d6aef9098..db9fb82437 100644
--- a/bloom.c
+++ b/bloom.c
@@ -222,7 +222,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
 	diffcore_std(&diffopt);
 
-	if (diffopt.num_changes <= settings->max_changed_paths) {
+	if (diff_queued_diff.nr <= settings->max_changed_paths) {
 		struct hashmap pathmap;
 		struct pathmap_hash_entry *e;
 		struct hashmap_iter iter;
@@ -259,6 +259,12 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 			diff_free_filepair(diff_queued_diff.queue[i]);
 		}
 
+		if (hashmap_get_size(&pathmap) > settings->max_changed_paths) {
+			if (computed)
+				*computed |= BLOOM_TRUNC_LARGE;
+			goto cleanup;
+		}
+
 		filter->len = (hashmap_get_size(&pathmap) * settings->bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
 		filter->data = xcalloc(filter->len, sizeof(unsigned char));
 
@@ -268,6 +274,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 			add_key_to_filter(&key, filter, settings);
 		}
 
+	cleanup:
 		hashmap_free_entries(&pathmap, struct pathmap_hash_entry, entry);
 	} else {
 		for (i = 0; i < diff_queued_diff.nr; i++)
diff --git a/diff.h b/diff.h
index e0c0af6286..1d32b71885 100644
--- a/diff.h
+++ b/diff.h
@@ -287,8 +287,6 @@ struct diff_options {
 
 	/* If non-zero, then stop computing after this many changes. */
 	int max_changes;
-	/* For internal use only. */
-	int num_changes;
 
 	int ita_invisible_in_index;
 /* white-space error highlighting */
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 593571358d..123b46a0db 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -181,21 +181,103 @@ test_expect_success 'persist filter settings' '
 	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15,\"max_changed_paths\":512" trace2-auto.txt
 '
 
+test_max_changed_paths () {
+	grep "\"max_changed_paths\":$1" $2
+}
+
+test_filter_computed () {
+	grep "\"key\":\"filter-computed\",\"value\":\"$1\"" $2
+}
+
+test_filter_trunc_large () {
+	grep "\"key\":\"filter-trunc-large\",\"value\":\"$1\"" $2
+}
+
 test_expect_success 'correctly report changes over limit' '
-	git init 513changes &&
+	git init limits &&
 	(
-		cd 513changes &&
-		for i in $(test_seq 1 513)
+		cd limits &&
+		mkdir d &&
+		mkdir d/e &&
+
+		for i in $(test_seq 1 2)
 		do
-			echo $i >file$i.txt || return 1
+			printf $i >d/file$i.txt &&
+			printf $i >d/e/file$i.txt || return 1
 		done &&
-		git add . &&
+
+		mkdir mode &&
+		printf bash >mode/script.sh &&
+
+		mkdir foo &&
+		touch foo/bar &&
+		touch foo.txt &&
+
+		git add d foo foo.txt mode &&
 		git commit -m "files" &&
-		git commit-graph write --reachable --changed-paths &&
-		for i in $(test_seq 1 513)
+
+		# Commit has 7 file and 4 directory adds
+		GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=10 \
+			GIT_TRACE2_EVENT="$(pwd)/trace" \
+			git commit-graph write --reachable --changed-paths &&
+		test_max_changed_paths 10 trace &&
+		test_filter_computed 1 trace &&
+		test_filter_trunc_large 1 trace &&
+
+		for path in $(git ls-tree -r --name-only HEAD)
 		do
-			git -c core.commitGraph=false log -- file$i.txt >expect &&
-			git log -- file$i.txt >actual &&
+			git -c commitGraph.readChangedPaths=false log \
+				-- $path >expect &&
+			git log -- $path >actual &&
+			test_cmp expect actual || return 1
+		done &&
+
+		# Make a variety of path changes
+		printf new1 >d/e/file1.txt &&
+		printf new2 >d/file2.txt &&
+		rm d/e/file2.txt &&
+		rm -r foo &&
+		printf text >foo &&
+		mkdir f &&
+		printf new1 >f/file1.txt &&
+
+		# including a mode-only change (counts as modified)
+		git update-index --chmod=+x mode/script.sh &&
+
+		git add foo d f &&
+		git commit -m "complicated" &&
+
+		# start from scratch and rebuild
+		rm -f .git/objects/info/commit-graph &&
+		GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=10 \
+			GIT_TRACE2_EVENT="$(pwd)/trace-edit" \
+			git commit-graph write --reachable --changed-paths &&
+		test_max_changed_paths 10 trace-edit &&
+		test_filter_computed 2 trace-edit &&
+		test_filter_trunc_large 2 trace-edit &&
+
+		for path in $(git ls-tree -r --name-only HEAD)
+		do
+			git -c commitGraph.readChangedPaths=false log \
+				-- $path >expect &&
+			git log -- $path >actual &&
+			test_cmp expect actual || return 1
+		done &&
+
+		# start from scratch and rebuild
+		rm -f .git/objects/info/commit-graph &&
+		GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=11 \
+			GIT_TRACE2_EVENT="$(pwd)/trace-update" \
+			git commit-graph write --reachable --changed-paths &&
+		test_max_changed_paths 11 trace-update &&
+		test_filter_computed 2 trace-update &&
+		test_filter_trunc_large 0 trace-update &&
+
+		for path in $(git ls-tree -r --name-only HEAD)
+		do
+			git -c commitGraph.readChangedPaths=false log \
+				-- $path >expect &&
+			git log -- $path >actual &&
 			test_cmp expect actual || return 1
 		done
 	)
diff --git a/tree-diff.c b/tree-diff.c
index 6ebad1a46f..7cebbb327e 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -434,7 +434,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
 		if (diff_can_quit_early(opt))
 			break;
 
-		if (opt->max_changes && opt->num_changes > opt->max_changes)
+		if (opt->max_changes && diff_queued_diff.nr > opt->max_changes)
 			break;
 
 		if (opt->pathspec.nr) {
@@ -521,7 +521,6 @@ static struct combine_diff_path *ll_diff_tree_paths(
 
 			/* t↓ */
 			update_tree_entry(&t);
-			opt->num_changes++;
 		}
 
 		/* t > p[imin] */
@@ -539,7 +538,6 @@ static struct combine_diff_path *ll_diff_tree_paths(
 		skip_emit_tp:
 			/* ∀ pi=p[imin]  pi↓ */
 			update_tp_entries(tp, nparent);
-			opt->num_changes++;
 		}
 	}
 
@@ -557,7 +555,6 @@ struct combine_diff_path *diff_tree_paths(
 	const struct object_id **parents_oid, int nparent,
 	struct strbuf *base, struct diff_options *opt)
 {
-	opt->num_changes = 0;
 	p = ll_diff_tree_paths(p, oid, parents_oid, nparent, base, opt);
 
 	/*
-- 
2.28.0.510.g375ecf1f36


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v3 10/13] bloom: encode out-of-bounds filters as non-empty
  2020-09-18  2:58 ` [PATCH v3 " Taylor Blau
                     ` (8 preceding siblings ...)
  2020-09-18  2:59   ` [PATCH v3 09/13] bloom/diff: properly short-circuit on max_changes Taylor Blau
@ 2020-09-18  2:59   ` Taylor Blau
  2020-09-18  2:59   ` [PATCH v3 11/13] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
                     ` (3 subsequent siblings)
  13 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-18  2:59 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

When a changed-path Bloom filter has either zero, or more than a
certain number (commonly 512) of entries, the commit-graph machinery
encodes it as "missing". More specifically, it sets the indices adjacent
in the BIDX chunk as equal to each other to indicate a "length 0"
filter; that is, that the filter occupies zero bytes on disk.

This has heretofore been fine, since the commit-graph machinery has no
need to care about these filters with too few or too many changed paths.
Both cases act like no filter has been generated at all, and so there is
no need to store them.

In a subsequent commit, however, the commit-graph machinery will learn
to only compute Bloom filters for some commits in the current
commit-graph layer. This is a change from the current implementation
which computes Bloom filters for all commits that are in the layer being
written. Critically for this patch, only computing some of the Bloom
filters means adding a third state for length 0 Bloom filters: zero
entries, too many entries, or "hasn't been computed".

It will be important for that future patch to distinguish between "not
representable" (i.e., zero or too-many changed paths), and "hasn't been
computed". In particular, we don't want to waste time recomputing
filters that have already been computed.

To that end, change how we store Bloom filters in the "computed but not
representable" category:

  - Bloom filters with no entries are stored as a single byte with all
    bits low (i.e., all queries to that Bloom filter will return
    "definitely not")

  - Bloom filters with too many entries are stored as a single byte with
    all bits set high (i.e., all queries to that Bloom filter will
    return "maybe").

These rules are sufficient to not incur a behavior change by changing
the on-disk representation of these two classes. Likewise, no
specification changes are necessary for the commit-graph format, either:

  - Filters that were previously empty will be recomputed and stored
    according to the new rules, and

  - old clients reading filters generated by new clients will interpret
    the filters correctly and be none the wiser to how they were
    generated.

Clients will invoke the Bloom machinery in more cases than before, but
this can be addressed by returning a NULL filter when all bits are set
high. This can be addressed in a future patch.

Note that this does increase the size of on-disk commit-graphs, but far
less than other proposals. In particular, this is generally more
efficient than storing a bitmap for which commits haven't computed their
Bloom filters. Storing a bitmap incurs a penalty of one bit per commit,
whereas storing explicit filters as above incurs a penalty of one byte
per too-large or empty commit.

In practice, these boundary commits likely occupy a small proportion of
the overall number of commits, and so the size penalty is likely smaller
than storing a bitmap for all commits.

See, for example, these relative proportions of such boundary commits
(collected by SZEDER Gábor):

                  |     Percentage of     |    commit-graph   |           |
                  |   commits modifying   |     file size     |           |
                  ├────────┬──────────────┼───────────────────┤    pct.   |
                  | 0 path | >= 512 paths | before  |  after  |   change  |
 ┌────────────────┼────────┼──────────────┼─────────┼─────────┼───────────┤
 | android-base   | 13.20% |        0.13% | 37.468M | 37.534M | +0.1741 % |
 | cmssw          |  0.15% |        0.23% | 17.118M | 17.119M | +0.0091 % |
 | cpython        |  3.07% |        0.01% |  7.967M |  7.971M | +0.0423 % |
 | elasticsearch  |  0.70% |        1.00% |  8.833M |  8.835M | +0.0128 % |
 | gcc            |  0.00% |        0.08% | 16.073M | 16.074M | +0.0030 % |
 | gecko-dev      |  0.14% |        0.64% | 59.868M | 59.874M | +0.0105 % |
 | git            |  0.11% |        0.02% |  3.895M |  3.895M | +0.0020 % |
 | glibc          |  0.02% |        0.10% |  3.555M |  3.555M | +0.0021 % |
 | go             |  0.00% |        0.07% |  3.186M |  3.186M | +0.0018 % |
 | homebrew-cask  |  0.40% |        0.02% |  7.035M |  7.035M | +0.0065 % |
 | homebrew-core  |  0.01% |        0.01% | 11.611M | 11.611M | +0.0002 % |
 | jdk            |  0.26% |        5.64% |  5.537M |  5.540M | +0.0590 % |
 | linux          |  0.01% |        0.51% | 63.735M | 63.740M | +0.0073 % |
 | llvm-project   |  0.12% |        0.03% | 25.515M | 25.516M | +0.0050 % |
 | rails          |  0.10% |        0.10% |  6.252M |  6.252M | +0.0027 % |
 | rust           |  0.07% |        0.17% |  9.364M |  9.364M | +0.0033 % |
 | tensorflow     |  0.09% |        1.02% |  7.009M |  7.010M | +0.0158 % |
 | webkit         |  0.05% |        0.31% | 17.405M | 17.406M | +0.0047 % |

(where the above increase is determined by computing a non-split
commit-graph before and after this patch).

Given that these projects are all "large" by commit count, the storage
cost by writing these filters explicitly is negligible. In the most
extreme example, android-base (which has 494,848 commits at the time of
writing) would have its commit-graph increase by a modest 68.4 KB.

Finally, a test to exercise filters which contain too many changed path
entries will be introduced in a subsequent patch.

Suggested-by: SZEDER Gábor <szeder.dev@gmail.com>
Suggested-by: Jakub Narębski <jnareb@gmail.com>
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Helped-by: SZEDER Gábor <szeder.dev@gmail.com>
Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 .../technical/commit-graph-format.txt         |  2 +-
 bloom.c                                       | 16 ++++++++--
 bloom.h                                       |  1 +
 commit-graph.c                                |  5 +++
 t/t0095-bloom.sh                              |  8 ++---
 t/t4216-log-bloom.sh                          | 31 +++++++++++++++++--
 6 files changed, 54 insertions(+), 9 deletions(-)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index 6ddbceba15..6585f1948a 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -125,7 +125,7 @@ CHUNK DATA:
     * The rest of the chunk is the concatenation of all the computed Bloom
       filters for the commits in lexicographic order.
     * Note: Commits with no changes or more than 512 changes have Bloom filters
-      of length zero.
+      of length one, with either all bits set to zero or one respectively.
     * The BDAT chunk is present if and only if BIDX is present.
 
   Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
diff --git a/bloom.c b/bloom.c
index db9fb82437..d234551ce0 100644
--- a/bloom.c
+++ b/bloom.c
@@ -177,6 +177,13 @@ static int pathmap_cmp(const void *hashmap_cmp_fn_data,
 	return strcmp(e1->path, e2->path);
 }
 
+static void init_truncated_large_filter(struct bloom_filter *filter)
+{
+	filter->data = xmalloc(1);
+	filter->data[0] = 0xFF;
+	filter->len = 1;
+}
+
 struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 						 struct commit *c,
 						 int compute_if_not_present,
@@ -260,12 +267,18 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 		}
 
 		if (hashmap_get_size(&pathmap) > settings->max_changed_paths) {
+			init_truncated_large_filter(filter);
 			if (computed)
 				*computed |= BLOOM_TRUNC_LARGE;
 			goto cleanup;
 		}
 
 		filter->len = (hashmap_get_size(&pathmap) * settings->bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
+		if (!filter->len) {
+			if (computed)
+				*computed |= BLOOM_TRUNC_EMPTY;
+			filter->len = 1;
+		}
 		filter->data = xcalloc(filter->len, sizeof(unsigned char));
 
 		hashmap_for_each_entry(&pathmap, &iter, e, entry) {
@@ -279,8 +292,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 	} else {
 		for (i = 0; i < diff_queued_diff.nr; i++)
 			diff_free_filepair(diff_queued_diff.queue[i]);
-		filter->data = NULL;
-		filter->len = 0;
+		init_truncated_large_filter(filter);
 
 		if (computed)
 			*computed |= BLOOM_TRUNC_LARGE;
diff --git a/bloom.h b/bloom.h
index c6d77e8393..adde6dfe21 100644
--- a/bloom.h
+++ b/bloom.h
@@ -93,6 +93,7 @@ enum bloom_filter_computed {
 	BLOOM_NOT_COMPUTED = (1 << 0),
 	BLOOM_COMPUTED     = (1 << 1),
 	BLOOM_TRUNC_LARGE  = (1 << 2),
+	BLOOM_TRUNC_EMPTY  = (1 << 3),
 };
 
 struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
diff --git a/commit-graph.c b/commit-graph.c
index 50519eb968..a71663c28c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -975,6 +975,7 @@ struct write_commit_graph_context {
 
 	int count_bloom_filter_computed;
 	int count_bloom_filter_not_computed;
+	int count_bloom_filter_trunc_empty;
 	int count_bloom_filter_trunc_large;
 };
 
@@ -1403,6 +1404,8 @@ static void trace2_bloom_filter_write_statistics(struct write_commit_graph_conte
 			   ctx->count_bloom_filter_computed);
 	trace2_data_intmax("commit-graph", ctx->r, "filter-not-computed",
 			   ctx->count_bloom_filter_not_computed);
+	trace2_data_intmax("commit-graph", ctx->r, "filter-trunc-empty",
+			   ctx->count_bloom_filter_trunc_empty);
 	trace2_data_intmax("commit-graph", ctx->r, "filter-trunc-large",
 			   ctx->count_bloom_filter_trunc_large);
 }
@@ -1439,6 +1442,8 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 			&computed);
 		if (computed & BLOOM_COMPUTED) {
 			ctx->count_bloom_filter_computed++;
+			if (computed & BLOOM_TRUNC_EMPTY)
+				ctx->count_bloom_filter_trunc_empty++;
 			if (computed & BLOOM_TRUNC_LARGE)
 				ctx->count_bloom_filter_trunc_large++;
 		} else if (computed & BLOOM_NOT_COMPUTED)
diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
index 232ba2c485..7e4ab1795f 100755
--- a/t/t0095-bloom.sh
+++ b/t/t0095-bloom.sh
@@ -71,8 +71,8 @@ test_expect_success 'get bloom filters for commit with no changes' '
 	git init &&
 	git commit --allow-empty -m "c0" &&
 	cat >expect <<-\EOF &&
-	Filter_Length:0
-	Filter_Data:
+	Filter_Length:1
+	Filter_Data:00|
 	EOF
 	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
 	test_cmp expect actual
@@ -107,8 +107,8 @@ test_expect_success EXPENSIVE 'get bloom filter for commit with 513 changes' '
 	git add bigDir &&
 	git commit -m "commit with 513 changes" &&
 	cat >expect <<-\EOF &&
-	Filter_Length:0
-	Filter_Data:
+	Filter_Length:1
+	Filter_Data:ff|
 	EOF
 	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
 	test_cmp expect actual
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 123b46a0db..15bbb626a1 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -30,6 +30,7 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
 	rm file_to_be_deleted &&
 	git add . &&
 	git commit -m "file removed" &&
+	git commit --allow-empty -m "empty" &&
 	git commit-graph write --reachable --changed-paths &&
 
 	test_oid_cache <<-EOF
@@ -37,6 +38,7 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
 	oid_version sha256:2
 	EOF
 '
+
 graph_read_expect () {
 	NUM_CHUNKS=5
 	cat >expect <<- EOF
@@ -49,7 +51,7 @@ graph_read_expect () {
 }
 
 test_expect_success 'commit-graph write wrote out the bloom chunks' '
-	graph_read_expect 15
+	graph_read_expect 16
 '
 
 # Turn off any inherited trace2 settings for this test.
@@ -156,7 +158,7 @@ test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
 
 test_bloom_filters_used_when_some_filters_are_missing () {
 	log_args=$1
-	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"maybe\":6,\"definitely_not\":8"
+	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"maybe\":6,\"definitely_not\":9"
 	setup "$log_args" &&
 	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
 	test_cmp log_wo_bloom log_w_bloom
@@ -185,10 +187,18 @@ test_max_changed_paths () {
 	grep "\"max_changed_paths\":$1" $2
 }
 
+test_filter_not_computed () {
+	grep "\"key\":\"filter-not-computed\",\"value\":\"$1\"" $2
+}
+
 test_filter_computed () {
 	grep "\"key\":\"filter-computed\",\"value\":\"$1\"" $2
 }
 
+test_filter_trunc_empty () {
+	grep "\"key\":\"filter-trunc-empty\",\"value\":\"$1\"" $2
+}
+
 test_filter_trunc_large () {
 	grep "\"key\":\"filter-trunc-large\",\"value\":\"$1\"" $2
 }
@@ -283,4 +293,21 @@ test_expect_success 'correctly report changes over limit' '
 	)
 '
 
+test_expect_success 'correctly report commits with no changed paths' '
+	git init empty &&
+	test_when_finished "rm -fr empty" &&
+	(
+		cd empty &&
+
+		git commit --allow-empty -m "initial commit" &&
+
+		GIT_TRACE2_EVENT="$(pwd)/trace.event" \
+			git commit-graph write --reachable --changed-paths &&
+		test_filter_computed 1 trace.event &&
+		test_filter_not_computed 0 trace.event &&
+		test_filter_trunc_empty 1 trace.event &&
+		test_filter_trunc_large 0 trace.event
+	)
+'
+
 test_done
-- 
2.28.0.510.g375ecf1f36


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v3 11/13] commit-graph: rename 'split_commit_graph_opts'
  2020-09-18  2:58 ` [PATCH v3 " Taylor Blau
                     ` (9 preceding siblings ...)
  2020-09-18  2:59   ` [PATCH v3 10/13] bloom: encode out-of-bounds filters as non-empty Taylor Blau
@ 2020-09-18  2:59   ` Taylor Blau
  2020-09-18  2:59   ` [PATCH v3 12/13] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
                     ` (2 subsequent siblings)
  13 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-18  2:59 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

In the subsequent commit, additional options will be added to the
commit-graph API which have nothing to do with splitting.

Rename the 'split_commit_graph_opts' structure to the more-generic
'commit_graph_opts' to encompass both. Likewise, rename the 'flags'
member to instead be 'split_flags' to clarify that it only has to do
with the behavior implied by '--split'.

Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/commit-graph.c | 20 ++++++++++----------
 commit-graph.c         | 40 ++++++++++++++++++++--------------------
 commit-graph.h         |  8 ++++----
 3 files changed, 34 insertions(+), 34 deletions(-)

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index ba5584463f..f3243bd982 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -119,7 +119,7 @@ static int graph_verify(int argc, const char **argv)
 }
 
 extern int read_replace_refs;
-static struct split_commit_graph_opts split_opts;
+static struct commit_graph_opts write_opts;
 
 static int write_option_parse_split(const struct option *opt, const char *arg,
 				    int unset)
@@ -187,24 +187,24 @@ static int graph_write(int argc, const char **argv)
 		OPT_BOOL(0, "changed-paths", &opts.enable_changed_paths,
 			N_("enable computation for changed paths")),
 		OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")),
-		OPT_CALLBACK_F(0, "split", &split_opts.flags, NULL,
+		OPT_CALLBACK_F(0, "split", &write_opts.split_flags, NULL,
 			N_("allow writing an incremental commit-graph file"),
 			PARSE_OPT_OPTARG | PARSE_OPT_NONEG,
 			write_option_parse_split),
-		OPT_INTEGER(0, "max-commits", &split_opts.max_commits,
+		OPT_INTEGER(0, "max-commits", &write_opts.max_commits,
 			N_("maximum number of commits in a non-base split commit-graph")),
-		OPT_INTEGER(0, "size-multiple", &split_opts.size_multiple,
+		OPT_INTEGER(0, "size-multiple", &write_opts.size_multiple,
 			N_("maximum ratio between two levels of a split commit-graph")),
-		OPT_EXPIRY_DATE(0, "expire-time", &split_opts.expire_time,
+		OPT_EXPIRY_DATE(0, "expire-time", &write_opts.expire_time,
 			N_("only expire files older than a given date-time")),
 		OPT_END(),
 	};
 
 	opts.progress = isatty(2);
 	opts.enable_changed_paths = -1;
-	split_opts.size_multiple = 2;
-	split_opts.max_commits = 0;
-	split_opts.expire_time = 0;
+	write_opts.size_multiple = 2;
+	write_opts.max_commits = 0;
+	write_opts.expire_time = 0;
 
 	trace2_cmd_mode("write");
 
@@ -232,7 +232,7 @@ static int graph_write(int argc, const char **argv)
 	odb = find_odb(the_repository, opts.obj_dir);
 
 	if (opts.reachable) {
-		if (write_commit_graph_reachable(odb, flags, &split_opts))
+		if (write_commit_graph_reachable(odb, flags, &write_opts))
 			return 1;
 		return 0;
 	}
@@ -261,7 +261,7 @@ static int graph_write(int argc, const char **argv)
 			       opts.stdin_packs ? &pack_indexes : NULL,
 			       opts.stdin_commits ? &commits : NULL,
 			       flags,
-			       &split_opts))
+			       &write_opts))
 		result = 1;
 
 cleanup:
diff --git a/commit-graph.c b/commit-graph.c
index a71663c28c..d2a2f78776 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -969,7 +969,7 @@ struct write_commit_graph_context {
 		 changed_paths:1,
 		 order_by_pack:1;
 
-	const struct split_commit_graph_opts *split_opts;
+	const struct commit_graph_opts *opts;
 	size_t total_bloom_filter_data_size;
 	const struct bloom_filter_settings *bloom_settings;
 
@@ -1293,8 +1293,8 @@ static void close_reachable(struct write_commit_graph_context *ctx)
 {
 	int i;
 	struct commit *commit;
-	enum commit_graph_split_flags flags = ctx->split_opts ?
-		ctx->split_opts->flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
+	enum commit_graph_split_flags flags = ctx->opts ?
+		ctx->opts->split_flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
 
 	if (ctx->report_progress)
 		ctx->progress = start_delayed_progress(
@@ -1483,7 +1483,7 @@ static int add_ref_to_set(const char *refname,
 
 int write_commit_graph_reachable(struct object_directory *odb,
 				 enum commit_graph_write_flags flags,
-				 const struct split_commit_graph_opts *split_opts)
+				 const struct commit_graph_opts *opts)
 {
 	struct oidset commits = OIDSET_INIT;
 	struct refs_cb_data data;
@@ -1500,7 +1500,7 @@ int write_commit_graph_reachable(struct object_directory *odb,
 	stop_progress(&data.progress);
 
 	result = write_commit_graph(odb, NULL, &commits,
-				    flags, split_opts);
+				    flags, opts);
 
 	oidset_clear(&commits);
 	return result;
@@ -1615,8 +1615,8 @@ static uint32_t count_distinct_commits(struct write_commit_graph_context *ctx)
 static void copy_oids_to_commits(struct write_commit_graph_context *ctx)
 {
 	uint32_t i;
-	enum commit_graph_split_flags flags = ctx->split_opts ?
-		ctx->split_opts->flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
+	enum commit_graph_split_flags flags = ctx->opts ?
+		ctx->opts->split_flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
 
 	ctx->num_extra_edges = 0;
 	if (ctx->report_progress)
@@ -1901,13 +1901,13 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
 	int max_commits = 0;
 	int size_mult = 2;
 
-	if (ctx->split_opts) {
-		max_commits = ctx->split_opts->max_commits;
+	if (ctx->opts) {
+		max_commits = ctx->opts->max_commits;
 
-		if (ctx->split_opts->size_multiple)
-			size_mult = ctx->split_opts->size_multiple;
+		if (ctx->opts->size_multiple)
+			size_mult = ctx->opts->size_multiple;
 
-		flags = ctx->split_opts->flags;
+		flags = ctx->opts->split_flags;
 	}
 
 	g = ctx->r->objects->commit_graph;
@@ -2085,8 +2085,8 @@ static void expire_commit_graphs(struct write_commit_graph_context *ctx)
 	size_t dirnamelen;
 	timestamp_t expire_time = time(NULL);
 
-	if (ctx->split_opts && ctx->split_opts->expire_time)
-		expire_time = ctx->split_opts->expire_time;
+	if (ctx->opts && ctx->opts->expire_time)
+		expire_time = ctx->opts->expire_time;
 	if (!ctx->split) {
 		char *chain_file_name = get_chain_filename(ctx->odb);
 		unlink(chain_file_name);
@@ -2137,7 +2137,7 @@ int write_commit_graph(struct object_directory *odb,
 		       struct string_list *pack_indexes,
 		       struct oidset *commits,
 		       enum commit_graph_write_flags flags,
-		       const struct split_commit_graph_opts *split_opts)
+		       const struct commit_graph_opts *opts)
 {
 	struct write_commit_graph_context *ctx;
 	uint32_t i, count_distinct = 0;
@@ -2154,7 +2154,7 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->append = flags & COMMIT_GRAPH_WRITE_APPEND ? 1 : 0;
 	ctx->report_progress = flags & COMMIT_GRAPH_WRITE_PROGRESS ? 1 : 0;
 	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
-	ctx->split_opts = split_opts;
+	ctx->opts = opts;
 	ctx->total_bloom_filter_data_size = 0;
 
 	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
@@ -2202,15 +2202,15 @@ int write_commit_graph(struct object_directory *odb,
 			}
 		}
 
-		if (ctx->split_opts)
-			replace = ctx->split_opts->flags & COMMIT_GRAPH_SPLIT_REPLACE;
+		if (ctx->opts)
+			replace = ctx->opts->split_flags & COMMIT_GRAPH_SPLIT_REPLACE;
 	}
 
 	ctx->approx_nr_objects = approximate_object_count();
 	ctx->oids.alloc = ctx->approx_nr_objects / 32;
 
-	if (ctx->split && split_opts && ctx->oids.alloc > split_opts->max_commits)
-		ctx->oids.alloc = split_opts->max_commits;
+	if (ctx->split && opts && ctx->oids.alloc > opts->max_commits)
+		ctx->oids.alloc = opts->max_commits;
 
 	if (ctx->append) {
 		prepare_commit_graph_one(ctx->r, ctx->odb);
diff --git a/commit-graph.h b/commit-graph.h
index d9acb22bac..b7914b0a7a 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -105,11 +105,11 @@ enum commit_graph_split_flags {
 	COMMIT_GRAPH_SPLIT_REPLACE          = 2
 };
 
-struct split_commit_graph_opts {
+struct commit_graph_opts {
 	int size_multiple;
 	int max_commits;
 	timestamp_t expire_time;
-	enum commit_graph_split_flags flags;
+	enum commit_graph_split_flags split_flags;
 };
 
 /*
@@ -120,12 +120,12 @@ struct split_commit_graph_opts {
  */
 int write_commit_graph_reachable(struct object_directory *odb,
 				 enum commit_graph_write_flags flags,
-				 const struct split_commit_graph_opts *split_opts);
+				 const struct commit_graph_opts *opts);
 int write_commit_graph(struct object_directory *odb,
 		       struct string_list *pack_indexes,
 		       struct oidset *commits,
 		       enum commit_graph_write_flags flags,
-		       const struct split_commit_graph_opts *split_opts);
+		       const struct commit_graph_opts *opts);
 
 #define COMMIT_GRAPH_VERIFY_SHALLOW	(1 << 0)
 
-- 
2.28.0.510.g375ecf1f36


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v3 12/13] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-09-18  2:58 ` [PATCH v3 " Taylor Blau
                     ` (10 preceding siblings ...)
  2020-09-18  2:59   ` [PATCH v3 11/13] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
@ 2020-09-18  2:59   ` Taylor Blau
  2020-09-18  2:59   ` [PATCH v3 13/13] commit-graph: introduce 'commitGraph.maxNewFilters' Taylor Blau
  2020-09-18 13:31   ` [PATCH v3 00/13] more miscellaneous Bloom filter improvements, redux Taylor Blau
  13 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-18  2:59 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

Introduce a command-line flag to specify the maximum number of new Bloom
filters that a 'git commit-graph write' is willing to compute from
scratch.

Prior to this patch, a commit-graph write with '--changed-paths' would
compute Bloom filters for all selected commits which haven't already
been computed (i.e., by a previous commit-graph write with '--split'
such that a roll-up or replacement is performed).

This behavior can cause prohibitively-long commit-graph writes for a
variety of reasons:

  * There may be lots of filters whose diffs take a long time to
    generate (for example, they have close to the maximum number of
    changes, diffing itself takes a long time, etc).

  * Old-style commit-graphs (which encode filters with too many entries
    as not having been computed at all) cause us to waste time
    recomputing filters that appear to have not been computed only to
    discover that they are too-large.

This can make the upper-bound of the time it takes for 'git commit-graph
write --changed-paths' to be rather unpredictable.

To make this command behave more predictably, introduce
'--max-new-filters=<n>' to allow computing at most '<n>' Bloom filters
from scratch. This lets "computing" already-known filters proceed
quickly, while bounding the number of slow tasks that Git is willing to
do.

Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/git-commit-graph.txt |  5 +++
 bloom.c                            |  7 ++-
 builtin/commit-graph.c             | 27 +++++++++++-
 commit-graph.c                     |  9 +++-
 commit-graph.h                     |  1 +
 t/t4216-log-bloom.sh               | 70 ++++++++++++++++++++++++++++++
 6 files changed, 111 insertions(+), 8 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 17405c73a9..8357846d30 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -67,6 +67,11 @@ this option is given, future commit-graph writes will automatically assume
 that this option was intended. Use `--no-changed-paths` to stop storing this
 data.
 +
+With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
+filters (if `--changed-paths` is specified). If `n` is `-1`, no limit is
+enforced. Commits whose filters are not calculated are stored as a
+length zero Bloom filter.
++
 With the `--split[=<strategy>]` option, write the commit-graph as a
 chain of multiple commit-graph files stored in
 `<dir>/info/commit-graphs`. Commit-graph layers are merged based on the
diff --git a/bloom.c b/bloom.c
index d234551ce0..68c73200a5 100644
--- a/bloom.c
+++ b/bloom.c
@@ -204,12 +204,11 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 
 	if (!filter->data) {
 		load_commit_graph_info(r, c);
-		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH &&
-			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
-				return filter;
+		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH)
+			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
 	}
 
-	if (filter->data)
+	if (filter->data && filter->len)
 		return filter;
 	if (!compute_if_not_present)
 		return NULL;
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index f3243bd982..5df9b2ef80 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -13,7 +13,8 @@ static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph verify [--object-dir <objdir>] [--shallow] [--[no-]progress]"),
 	N_("git commit-graph write [--object-dir <objdir>] [--append] "
 	   "[--split[=<strategy>]] [--reachable|--stdin-packs|--stdin-commits] "
-	   "[--changed-paths] [--[no-]progress] <split options>"),
+	   "[--changed-paths] [--[no-]max-new-filters <n>] [--[no-]progress] "
+	   "<split options>"),
 	NULL
 };
 
@@ -25,7 +26,8 @@ static const char * const builtin_commit_graph_verify_usage[] = {
 static const char * const builtin_commit_graph_write_usage[] = {
 	N_("git commit-graph write [--object-dir <objdir>] [--append] "
 	   "[--split[=<strategy>]] [--reachable|--stdin-packs|--stdin-commits] "
-	   "[--changed-paths] [--[no-]progress] <split options>"),
+	   "[--changed-paths] [--[no-]max-new-filters <n>] [--[no-]progress] "
+	   "<split options>"),
 	NULL
 };
 
@@ -162,6 +164,23 @@ static int read_one_commit(struct oidset *commits, struct progress *progress,
 	return 0;
 }
 
+static int write_option_max_new_filters(const struct option *opt,
+					const char *arg,
+					int unset)
+{
+	int *to = opt->value;
+	if (unset)
+		*to = -1;
+	else {
+		const char *s;
+		*to = strtol(arg, (char **)&s, 10);
+		if (*s)
+			return error(_("%s expects a numerical value"),
+				     optname(opt, opt->flags));
+	}
+	return 0;
+}
+
 static int graph_write(int argc, const char **argv)
 {
 	struct string_list pack_indexes = STRING_LIST_INIT_NODUP;
@@ -197,6 +216,9 @@ static int graph_write(int argc, const char **argv)
 			N_("maximum ratio between two levels of a split commit-graph")),
 		OPT_EXPIRY_DATE(0, "expire-time", &write_opts.expire_time,
 			N_("only expire files older than a given date-time")),
+		OPT_CALLBACK_F(0, "max-new-filters", &write_opts.max_new_filters,
+			NULL, N_("maximum number of changed-path Bloom filters to compute"),
+			0, write_option_max_new_filters),
 		OPT_END(),
 	};
 
@@ -205,6 +227,7 @@ static int graph_write(int argc, const char **argv)
 	write_opts.size_multiple = 2;
 	write_opts.max_commits = 0;
 	write_opts.expire_time = 0;
+	write_opts.max_new_filters = -1;
 
 	trace2_cmd_mode("write");
 
diff --git a/commit-graph.c b/commit-graph.c
index d2a2f78776..d10e730f29 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1415,6 +1415,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 	int i;
 	struct progress *progress = NULL;
 	struct commit **sorted_commits;
+	int max_new_filters;
 
 	init_bloom_filters();
 
@@ -1431,13 +1432,16 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 	else
 		QSORT(sorted_commits, ctx->commits.nr, commit_gen_cmp);
 
+	max_new_filters = ctx->opts && ctx->opts->max_new_filters >= 0 ?
+		ctx->opts->max_new_filters : ctx->commits.nr;
+
 	for (i = 0; i < ctx->commits.nr; i++) {
 		enum bloom_filter_computed computed = 0;
 		struct commit *c = sorted_commits[i];
 		struct bloom_filter *filter = get_or_compute_bloom_filter(
 			ctx->r,
 			c,
-			1,
+			ctx->count_bloom_filter_computed < max_new_filters,
 			ctx->bloom_settings,
 			&computed);
 		if (computed & BLOOM_COMPUTED) {
@@ -1448,7 +1452,8 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 				ctx->count_bloom_filter_trunc_large++;
 		} else if (computed & BLOOM_NOT_COMPUTED)
 			ctx->count_bloom_filter_not_computed++;
-		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
+		ctx->total_bloom_filter_data_size += filter
+			? sizeof(unsigned char) * filter->len : 0;
 		display_progress(progress, i + 1);
 	}
 
diff --git a/commit-graph.h b/commit-graph.h
index b7914b0a7a..a22bd86701 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -110,6 +110,7 @@ struct commit_graph_opts {
 	int max_commits;
 	timestamp_t expire_time;
 	enum commit_graph_split_flags split_flags;
+	int max_new_filters;
 };
 
 /*
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 15bbb626a1..af0975c841 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -310,4 +310,74 @@ test_expect_success 'correctly report commits with no changed paths' '
 	)
 '
 
+test_expect_success 'Bloom generation is limited by --max-new-filters' '
+	(
+		cd limits &&
+		test_commit c2 filter &&
+		test_commit c3 filter &&
+		test_commit c4 no-filter &&
+
+		rm -f trace.event &&
+		GIT_TRACE2_EVENT="$(pwd)/trace.event" \
+			git commit-graph write --reachable --split=replace \
+				--changed-paths --max-new-filters=2 &&
+
+		test_filter_computed 2 trace.event &&
+		test_filter_not_computed 3 trace.event &&
+		test_filter_trunc_empty 0 trace.event &&
+		test_filter_trunc_large 0 trace.event
+	)
+'
+
+test_expect_success 'Bloom generation backfills previously-skipped filters' '
+	(
+		cd limits &&
+
+		rm -f trace.event &&
+		GIT_TRACE2_EVENT="$(pwd)/trace.event" \
+			git commit-graph write --reachable --changed-paths \
+				--split=replace --max-new-filters=1 &&
+		test_filter_computed 1 trace.event &&
+		test_filter_not_computed 4 trace.event &&
+		test_filter_trunc_empty 0 trace.event &&
+		test_filter_trunc_large 0 trace.event
+	)
+'
+
+test_expect_success 'Bloom generation backfills empty commits' '
+	git init empty &&
+	test_when_finished "rm -fr empty" &&
+	(
+		cd empty &&
+		for i in $(test_seq 1 6)
+		do
+			git commit --allow-empty -m "$i"
+		done &&
+
+		# Generate Bloom filters for empty commits 1-6, two at a time.
+		for i in $(test_seq 1 3)
+		do
+			rm -f trace.event &&
+			GIT_TRACE2_EVENT="$(pwd)/trace.event" \
+				git commit-graph write --reachable \
+					--changed-paths --max-new-filters=2 &&
+			test_filter_computed 2 trace.event &&
+			test_filter_not_computed 4 trace.event &&
+			test_filter_trunc_empty 2 trace.event &&
+			test_filter_trunc_large 0 trace.event
+		done &&
+
+		# Finally, make sure that once all commits have filters, that
+		# none are subsequently recomputed.
+		rm -f trace.event &&
+		GIT_TRACE2_EVENT="$(pwd)/trace.event" \
+			git commit-graph write --reachable \
+				--changed-paths --max-new-filters=2 &&
+		test_filter_computed 0 trace.event &&
+		test_filter_not_computed 6 trace.event &&
+		test_filter_trunc_empty 0 trace.event &&
+		test_filter_trunc_large 0 trace.event
+	)
+'
+
 test_done
-- 
2.28.0.510.g375ecf1f36


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v3 13/13] commit-graph: introduce 'commitGraph.maxNewFilters'
  2020-09-18  2:58 ` [PATCH v3 " Taylor Blau
                     ` (11 preceding siblings ...)
  2020-09-18  2:59   ` [PATCH v3 12/13] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
@ 2020-09-18  2:59   ` Taylor Blau
  2020-09-18 13:29     ` Taylor Blau
  2020-09-18 13:31   ` [PATCH v3 00/13] more miscellaneous Bloom filter improvements, redux Taylor Blau
  13 siblings, 1 reply; 75+ messages in thread
From: Taylor Blau @ 2020-09-18  2:59 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

Introduce a configuration variable to specify a default value for the
recently-introduce '--max-new-filters' option of 'git commit-graph
write'.

Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/config/commitgraph.txt |  4 ++++
 Documentation/git-commit-graph.txt   |  3 ++-
 builtin/commit-graph.c               | 14 ++++++++++++++
 t/t4216-log-bloom.sh                 | 24 +++++++++++++++++++++++-
 4 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
index cff0797b54..4582c39fc4 100644
--- a/Documentation/config/commitgraph.txt
+++ b/Documentation/config/commitgraph.txt
@@ -1,3 +1,7 @@
+commitGraph.maxNewFilters::
+	Specifies the default value for the `--max-new-filters` option of `git
+	commit-graph write` (c.f., linkgit:git-commit-graph[1]).
+
 commitGraph.readChangedPaths::
 	If true, then git will use the changed-path Bloom filters in the
 	commit-graph file (if it exists, and they are present). Defaults to
diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 8357846d30..60df4e4bfa 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -70,7 +70,8 @@ data.
 With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
 filters (if `--changed-paths` is specified). If `n` is `-1`, no limit is
 enforced. Commits whose filters are not calculated are stored as a
-length zero Bloom filter.
+length zero Bloom filter. Overrides the `commitGraph.maxNewFilters`
+configuration.
 +
 With the `--split[=<strategy>]` option, write the commit-graph as a
 chain of multiple commit-graph files stored in
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 5df9b2ef80..988445abdf 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -181,6 +181,18 @@ static int write_option_max_new_filters(const struct option *opt,
 	return 0;
 }
 
+static int git_commit_graph_write_config(const char *var, const char *value,
+					 void *cb)
+{
+	if (!strcmp(var, "commitgraph.maxnewfilters"))
+		write_opts.max_new_filters = git_config_int(var, value);
+	/*
+	 * No need to fall-back to 'git_default_config', since this was already
+	 * called in 'cmd_commit_graph()'.
+	 */
+	return 0;
+}
+
 static int graph_write(int argc, const char **argv)
 {
 	struct string_list pack_indexes = STRING_LIST_INIT_NODUP;
@@ -231,6 +243,8 @@ static int graph_write(int argc, const char **argv)
 
 	trace2_cmd_mode("write");
 
+	git_config(git_commit_graph_write_config, &opts);
+
 	argc = parse_options(argc, argv, NULL,
 			     builtin_commit_graph_write_options,
 			     builtin_commit_graph_write_usage, 0);
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index af0975c841..d11040ce41 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -330,13 +330,15 @@ test_expect_success 'Bloom generation is limited by --max-new-filters' '
 '
 
 test_expect_success 'Bloom generation backfills previously-skipped filters' '
+	# Check specifying commitGraph.maxNewFilters over "git config" works.
+	test_config -C limits commitGraph.maxNewFilters 1 &&
 	(
 		cd limits &&
 
 		rm -f trace.event &&
 		GIT_TRACE2_EVENT="$(pwd)/trace.event" \
 			git commit-graph write --reachable --changed-paths \
-				--split=replace --max-new-filters=1 &&
+				--split=replace &&
 		test_filter_computed 1 trace.event &&
 		test_filter_not_computed 4 trace.event &&
 		test_filter_trunc_empty 0 trace.event &&
@@ -344,6 +346,26 @@ test_expect_success 'Bloom generation backfills previously-skipped filters' '
 	)
 '
 
+test_expect_success '--max-new-filters overrides configuration' '
+	git init override &&
+	test_when_finished "rm -fr override" &&
+	test_config -C override commitGraph.maxNewFilters 2 &&
+	(
+		cd override &&
+		test_commit one &&
+		test_commit two &&
+
+		rm -f trace.event &&
+		GIT_TRACE2_EVENT="$(pwd)/trace.event" \
+			git commit-graph write --reachable --changed-paths \
+				--max-new-filters=1 &&
+		test_filter_computed 1 trace.event &&
+		test_filter_not_computed 1 trace.event &&
+		test_filter_trunc_empty 0 trace.event &&
+		test_filter_trunc_large 0 trace.event
+	)
+'
+
 test_expect_success 'Bloom generation backfills empty commits' '
 	git init empty &&
 	test_when_finished "rm -fr empty" &&
-- 
2.28.0.510.g375ecf1f36

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* Re: [PATCH v2 12/13] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-09-16 18:08   ` [PATCH v2 12/13] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
@ 2020-09-18  9:23     ` SZEDER Gábor
  2020-09-18 13:27       ` Taylor Blau
  0 siblings, 1 reply; 75+ messages in thread
From: SZEDER Gábor @ 2020-09-18  9:23 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, peff

On Wed, Sep 16, 2020 at 02:08:10PM -0400, Taylor Blau wrote:
> Introduce a command-line flag to specify the maximum number of new Bloom
> filters that a 'git commit-graph write' is willing to compute from
> scratch.
> 
> Prior to this patch, a commit-graph write with '--changed-paths' would
> compute Bloom filters for all selected commits which haven't already
> been computed (i.e., by a previous commit-graph write with '--split'
> such that a roll-up or replacement is performed).
> 
> This behavior can cause prohibitively-long commit-graph writes for a
> variety of reasons:
> 
>   * There may be lots of filters whose diffs take a long time to
>     generate (for example, they have close to the maximum number of
>     changes, diffing itself takes a long time, etc).
> 
>   * Old-style commit-graphs (which encode filters with too many entries
>     as not having been computed at all) cause us to waste time
>     recomputing filters that appear to have not been computed only to
>     discover that they are too-large.
> 
> This can make the upper-bound of the time it takes for 'git commit-graph
> write --changed-paths' to be rather unpredictable.
> 
> To make this command behave more predictably, introduce
> '--max-new-filters=<n>' to allow computing at most '<n>' Bloom filters
> from scratch. This lets "computing" already-known filters proceed
> quickly, while bounding the number of slow tasks that Git is willing to
> do.
> 
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  Documentation/git-commit-graph.txt |  5 +++
>  bloom.c                            |  7 ++-
>  builtin/commit-graph.c             | 27 +++++++++++-
>  commit-graph.c                     |  9 +++-
>  commit-graph.h                     |  1 +
>  t/t4216-log-bloom.sh               | 70 ++++++++++++++++++++++++++++++
>  6 files changed, 111 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
> index 17405c73a9..8357846d30 100644
> --- a/Documentation/git-commit-graph.txt
> +++ b/Documentation/git-commit-graph.txt
> @@ -67,6 +67,11 @@ this option is given, future commit-graph writes will automatically assume
>  that this option was intended. Use `--no-changed-paths` to stop storing this
>  data.
>  +
> +With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
> +filters (if `--changed-paths` is specified).

If '--max-new-filters=42' is specified without '--changed-paths', then
it is silently ignored instead of erroring out.  Is that intentional?

> If `n` is `-1`, no limit is
> +enforced. Commits whose filters are not calculated are stored as a
> +length zero Bloom filter.

This last sentence goes into implementation details of the file
format, so I don't think it should be included here.

What should be documented here instead is what happens later with
those commits for which no modified path Bloom filters were computed.
Will they ever be computed?  If yes, which commands will compute them
(even implicitly), and thus will run slower?  What command(s) should
the users run if they want to compute the missing filters?

Let's suppose that running

  git commit-graph write --reachable --changed-paths --max-new-filters=<L> --split

writes a commit-graph layer containing N1 commits, and N1 > L, so N1-L
commits won't have an associated modified path Bloom filter.  After
creating and fetching new commits, this command is executed again to
add information about N2 commits to the commit-graph, and N2 < L.

  - If this second invocation writes a new commit-graph layer, then it
    will only compute Bloom filters for commits in this new layer,
    even though it still has some "Bloom-computing-capacity" left.
    (This will happen even if this second write were invoked without
    '--max-new-filters'.)

  - If the merge strategy decides that, instead of writing a new
    layer, the new N2 commits should be merged with the layer
    containing the previous N1 commits, then it will compute Bloom
    filters for some or all of those N1 commits without filters,
    depending on whether N2-L is smaller or larger than N1-L.

Is this how it is supposed to work?  We can't tell, because neither
the commit messages nor the documentation updates talk about this.

> ++
>  With the `--split[=<strategy>]` option, write the commit-graph as a
>  chain of multiple commit-graph files stored in
>  `<dir>/info/commit-graphs`. Commit-graph layers are merged based on the

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2 12/13] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-09-18  9:23     ` SZEDER Gábor
@ 2020-09-18 13:27       ` Taylor Blau
  0 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-18 13:27 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Taylor Blau, git, dstolee, gitster, peff

On Fri, Sep 18, 2020 at 11:23:24AM +0200, SZEDER Gábor wrote:
> On Wed, Sep 16, 2020 at 02:08:10PM -0400, Taylor Blau wrote:
> > +With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
> > +filters (if `--changed-paths` is specified).
>
> If '--max-new-filters=42' is specified without '--changed-paths', then
> it is silently ignored instead of erroring out.  Is that intentional?

Yeah, that is intentional. The rationale is that if you're specifying "I
won't compute any more than N filters" but don't say "compute and write
changed-path Bloom filters", then you'll compute 0 < N filters, which is
fine.

I considered making this a warning, but ultimately decided against it.
In particular, I had users who might set 'commitGraph.maxNewFilters' to
some value, but only sometimes write a commit-graph with
'--changed-paths'.

> > If `n` is `-1`, no limit is
> > +enforced. Commits whose filters are not calculated are stored as a
> > +length zero Bloom filter.
>
> This last sentence goes into implementation details of the file
> format, so I don't think it should be included here.

Good idea, I'll drop that.

> What should be documented here instead is what happens later with
> those commits for which no modified path Bloom filters were computed.
> Will they ever be computed?  If yes, which commands will compute them
> (even implicitly), and thus will run slower?  What command(s) should
> the users run if they want to compute the missing filters?
>
> Let's suppose that running
>
>   git commit-graph write --reachable --changed-paths --max-new-filters=<L> --split
>
> writes a commit-graph layer containing N1 commits, and N1 > L, so N1-L
> commits won't have an associated modified path Bloom filter.  After
> creating and fetching new commits, this command is executed again to
> add information about N2 commits to the commit-graph, and N2 < L.
>
>   - If this second invocation writes a new commit-graph layer, then it
>     will only compute Bloom filters for commits in this new layer,
>     even though it still has some "Bloom-computing-capacity" left.
>     (This will happen even if this second write were invoked without
>     '--max-new-filters'.)
>
>   - If the merge strategy decides that, instead of writing a new
>     layer, the new N2 commits should be merged with the layer
>     containing the previous N1 commits, then it will compute Bloom
>     filters for some or all of those N1 commits without filters,
>     depending on whether N2-L is smaller or larger than N1-L.
>
> Is this how it is supposed to work?  We can't tell, because neither
> the commit messages nor the documentation updates talk about this.

That's a wonderful question, and yes: that's how it's supposed to work.
Implicit in all of this is that these commit-graph options are only
acting over the new layers that we are going to write. So, if we're
going to roll-up older layers, we'll start with their uncomputed
filters before computing new filters in later layers.

This is the same thing as running 'git commit-graph write --reachable
--changed-paths --split' (omitting '--max-new-filters' is not a typo
there). Because the commit-graph command only acts over new layers, old
layers do not suddenly get Bloom filters computed and those filters
rewritten. If a user wants that to happen, they can pass
'--split=replace', which will force all existing graphs to reside in a
new, single layer chain.

Let's use the following version of the patch which I think will clarify
your questions:

--- >8 ---

Subject: [PATCH] builtin/commit-graph.c: introduce '--max-new-filters=<n>'

Introduce a command-line flag to specify the maximum number of new Bloom
filters that a 'git commit-graph write' is willing to compute from
scratch.

Prior to this patch, a commit-graph write with '--changed-paths' would
compute Bloom filters for all selected commits which haven't already
been computed (i.e., by a previous commit-graph write with '--split'
such that a roll-up or replacement is performed).

This behavior can cause prohibitively-long commit-graph writes for a
variety of reasons:

  * There may be lots of filters whose diffs take a long time to
    generate (for example, they have close to the maximum number of
    changes, diffing itself takes a long time, etc).

  * Old-style commit-graphs (which encode filters with too many entries
    as not having been computed at all) cause us to waste time
    recomputing filters that appear to have not been computed only to
    discover that they are too-large.

This can make the upper-bound of the time it takes for 'git commit-graph
write --changed-paths' to be rather unpredictable.

To make this command behave more predictably, introduce
'--max-new-filters=<n>' to allow computing at most '<n>' Bloom filters
from scratch. This lets "computing" already-known filters proceed
quickly, while bounding the number of slow tasks that Git is willing to
do.

Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/git-commit-graph.txt |  6 +++
 bloom.c                            |  7 ++-
 builtin/commit-graph.c             | 27 +++++++++++-
 commit-graph.c                     |  9 +++-
 commit-graph.h                     |  1 +
 t/t4216-log-bloom.sh               | 70 ++++++++++++++++++++++++++++++
 6 files changed, 112 insertions(+), 8 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 17405c73a9..8c75855782 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -67,6 +67,12 @@ this option is given, future commit-graph writes will automatically assume
 that this option was intended. Use `--no-changed-paths` to stop storing this
 data.
 +
+With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
+filters (if `--changed-paths` is specified). If `n` is `-1`, no limit is
+enforced. Only commits present in the new layer count against this
+limit. To retroactively compute Bloom filters over earlier layers, it is
+advised to use `--split=replace`.
++
 With the `--split[=<strategy>]` option, write the commit-graph as a
 chain of multiple commit-graph files stored in
 `<dir>/info/commit-graphs`. Commit-graph layers are merged based on the
diff --git a/bloom.c b/bloom.c
index d234551ce0..68c73200a5 100644
--- a/bloom.c
+++ b/bloom.c
@@ -204,12 +204,11 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,

 	if (!filter->data) {
 		load_commit_graph_info(r, c);
-		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH &&
-			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
-				return filter;
+		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH)
+			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
 	}

-	if (filter->data)
+	if (filter->data && filter->len)
 		return filter;
 	if (!compute_if_not_present)
 		return NULL;
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index f3243bd982..5df9b2ef80 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -13,7 +13,8 @@ static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph verify [--object-dir <objdir>] [--shallow] [--[no-]progress]"),
 	N_("git commit-graph write [--object-dir <objdir>] [--append] "
 	   "[--split[=<strategy>]] [--reachable|--stdin-packs|--stdin-commits] "
-	   "[--changed-paths] [--[no-]progress] <split options>"),
+	   "[--changed-paths] [--[no-]max-new-filters <n>] [--[no-]progress] "
+	   "<split options>"),
 	NULL
 };

@@ -25,7 +26,8 @@ static const char * const builtin_commit_graph_verify_usage[] = {
 static const char * const builtin_commit_graph_write_usage[] = {
 	N_("git commit-graph write [--object-dir <objdir>] [--append] "
 	   "[--split[=<strategy>]] [--reachable|--stdin-packs|--stdin-commits] "
-	   "[--changed-paths] [--[no-]progress] <split options>"),
+	   "[--changed-paths] [--[no-]max-new-filters <n>] [--[no-]progress] "
+	   "<split options>"),
 	NULL
 };

@@ -162,6 +164,23 @@ static int read_one_commit(struct oidset *commits, struct progress *progress,
 	return 0;
 }

+static int write_option_max_new_filters(const struct option *opt,
+					const char *arg,
+					int unset)
+{
+	int *to = opt->value;
+	if (unset)
+		*to = -1;
+	else {
+		const char *s;
+		*to = strtol(arg, (char **)&s, 10);
+		if (*s)
+			return error(_("%s expects a numerical value"),
+				     optname(opt, opt->flags));
+	}
+	return 0;
+}
+
 static int graph_write(int argc, const char **argv)
 {
 	struct string_list pack_indexes = STRING_LIST_INIT_NODUP;
@@ -197,6 +216,9 @@ static int graph_write(int argc, const char **argv)
 			N_("maximum ratio between two levels of a split commit-graph")),
 		OPT_EXPIRY_DATE(0, "expire-time", &write_opts.expire_time,
 			N_("only expire files older than a given date-time")),
+		OPT_CALLBACK_F(0, "max-new-filters", &write_opts.max_new_filters,
+			NULL, N_("maximum number of changed-path Bloom filters to compute"),
+			0, write_option_max_new_filters),
 		OPT_END(),
 	};

@@ -205,6 +227,7 @@ static int graph_write(int argc, const char **argv)
 	write_opts.size_multiple = 2;
 	write_opts.max_commits = 0;
 	write_opts.expire_time = 0;
+	write_opts.max_new_filters = -1;

 	trace2_cmd_mode("write");

diff --git a/commit-graph.c b/commit-graph.c
index d2a2f78776..d10e730f29 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1415,6 +1415,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 	int i;
 	struct progress *progress = NULL;
 	struct commit **sorted_commits;
+	int max_new_filters;

 	init_bloom_filters();

@@ -1431,13 +1432,16 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 	else
 		QSORT(sorted_commits, ctx->commits.nr, commit_gen_cmp);

+	max_new_filters = ctx->opts && ctx->opts->max_new_filters >= 0 ?
+		ctx->opts->max_new_filters : ctx->commits.nr;
+
 	for (i = 0; i < ctx->commits.nr; i++) {
 		enum bloom_filter_computed computed = 0;
 		struct commit *c = sorted_commits[i];
 		struct bloom_filter *filter = get_or_compute_bloom_filter(
 			ctx->r,
 			c,
-			1,
+			ctx->count_bloom_filter_computed < max_new_filters,
 			ctx->bloom_settings,
 			&computed);
 		if (computed & BLOOM_COMPUTED) {
@@ -1448,7 +1452,8 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 				ctx->count_bloom_filter_trunc_large++;
 		} else if (computed & BLOOM_NOT_COMPUTED)
 			ctx->count_bloom_filter_not_computed++;
-		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
+		ctx->total_bloom_filter_data_size += filter
+			? sizeof(unsigned char) * filter->len : 0;
 		display_progress(progress, i + 1);
 	}

diff --git a/commit-graph.h b/commit-graph.h
index b7914b0a7a..a22bd86701 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -110,6 +110,7 @@ struct commit_graph_opts {
 	int max_commits;
 	timestamp_t expire_time;
 	enum commit_graph_split_flags split_flags;
+	int max_new_filters;
 };

 /*
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 15bbb626a1..af0975c841 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -310,4 +310,74 @@ test_expect_success 'correctly report commits with no changed paths' '
 	)
 '

+test_expect_success 'Bloom generation is limited by --max-new-filters' '
+	(
+		cd limits &&
+		test_commit c2 filter &&
+		test_commit c3 filter &&
+		test_commit c4 no-filter &&
+
+		rm -f trace.event &&
+		GIT_TRACE2_EVENT="$(pwd)/trace.event" \
+			git commit-graph write --reachable --split=replace \
+				--changed-paths --max-new-filters=2 &&
+
+		test_filter_computed 2 trace.event &&
+		test_filter_not_computed 3 trace.event &&
+		test_filter_trunc_empty 0 trace.event &&
+		test_filter_trunc_large 0 trace.event
+	)
+'
+
+test_expect_success 'Bloom generation backfills previously-skipped filters' '
+	(
+		cd limits &&
+
+		rm -f trace.event &&
+		GIT_TRACE2_EVENT="$(pwd)/trace.event" \
+			git commit-graph write --reachable --changed-paths \
+				--split=replace --max-new-filters=1 &&
+		test_filter_computed 1 trace.event &&
+		test_filter_not_computed 4 trace.event &&
+		test_filter_trunc_empty 0 trace.event &&
+		test_filter_trunc_large 0 trace.event
+	)
+'
+
+test_expect_success 'Bloom generation backfills empty commits' '
+	git init empty &&
+	test_when_finished "rm -fr empty" &&
+	(
+		cd empty &&
+		for i in $(test_seq 1 6)
+		do
+			git commit --allow-empty -m "$i"
+		done &&
+
+		# Generate Bloom filters for empty commits 1-6, two at a time.
+		for i in $(test_seq 1 3)
+		do
+			rm -f trace.event &&
+			GIT_TRACE2_EVENT="$(pwd)/trace.event" \
+				git commit-graph write --reachable \
+					--changed-paths --max-new-filters=2 &&
+			test_filter_computed 2 trace.event &&
+			test_filter_not_computed 4 trace.event &&
+			test_filter_trunc_empty 2 trace.event &&
+			test_filter_trunc_large 0 trace.event
+		done &&
+
+		# Finally, make sure that once all commits have filters, that
+		# none are subsequently recomputed.
+		rm -f trace.event &&
+		GIT_TRACE2_EVENT="$(pwd)/trace.event" \
+			git commit-graph write --reachable \
+				--changed-paths --max-new-filters=2 &&
+		test_filter_computed 0 trace.event &&
+		test_filter_not_computed 6 trace.event &&
+		test_filter_trunc_empty 0 trace.event &&
+		test_filter_trunc_large 0 trace.event
+	)
+'
+
 test_done
--
2.28.0.510.g375ecf1f36


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* Re: [PATCH v3 13/13] commit-graph: introduce 'commitGraph.maxNewFilters'
  2020-09-18  2:59   ` [PATCH v3 13/13] commit-graph: introduce 'commitGraph.maxNewFilters' Taylor Blau
@ 2020-09-18 13:29     ` Taylor Blau
  2020-09-18 17:43       ` Junio C Hamano
  0 siblings, 1 reply; 75+ messages in thread
From: Taylor Blau @ 2020-09-18 13:29 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

On Thu, Sep 17, 2020 at 10:59:57PM -0400, Taylor Blau wrote:
> Introduce a configuration variable to specify a default value for the
> recently-introduce '--max-new-filters' option of 'git commit-graph
> write'.

The replacement in [1] causes a conflict when applying this patch after
it, so here is a suggested resolution:

[1]: https://lore.kernel.org/git/cover.1600397826.git.me@ttaylorr.com/

--- >8 ---

Subject: [PATCH] commit-graph: introduce 'commitGraph.maxNewFilters'

Introduce a configuration variable to specify a default value for the
recently-introduce '--max-new-filters' option of 'git commit-graph
write'.

Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/config/commitgraph.txt |  4 ++++
 Documentation/git-commit-graph.txt   |  3 ++-
 builtin/commit-graph.c               | 14 ++++++++++++++
 t/t4216-log-bloom.sh                 | 24 +++++++++++++++++++++++-
 4 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
index cff0797b54..4582c39fc4 100644
--- a/Documentation/config/commitgraph.txt
+++ b/Documentation/config/commitgraph.txt
@@ -1,3 +1,7 @@
+commitGraph.maxNewFilters::
+	Specifies the default value for the `--max-new-filters` option of `git
+	commit-graph write` (c.f., linkgit:git-commit-graph[1]).
+
 commitGraph.readChangedPaths::
 	If true, then git will use the changed-path Bloom filters in the
 	commit-graph file (if it exists, and they are present). Defaults to
diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 8c75855782..b6ddbb4700 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -71,7 +71,8 @@ With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
 filters (if `--changed-paths` is specified). If `n` is `-1`, no limit is
 enforced. Only commits present in the new layer count against this
 limit. To retroactively compute Bloom filters over earlier layers, it is
-advised to use `--split=replace`.
+advised to use `--split=replace`. Overrides the
+`commitGraph.maxNewFilters` configuration.
 +
 With the `--split[=<strategy>]` option, write the commit-graph as a
 chain of multiple commit-graph files stored in
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 5df9b2ef80..988445abdf 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -181,6 +181,18 @@ static int write_option_max_new_filters(const struct option *opt,
 	return 0;
 }

+static int git_commit_graph_write_config(const char *var, const char *value,
+					 void *cb)
+{
+	if (!strcmp(var, "commitgraph.maxnewfilters"))
+		write_opts.max_new_filters = git_config_int(var, value);
+	/*
+	 * No need to fall-back to 'git_default_config', since this was already
+	 * called in 'cmd_commit_graph()'.
+	 */
+	return 0;
+}
+
 static int graph_write(int argc, const char **argv)
 {
 	struct string_list pack_indexes = STRING_LIST_INIT_NODUP;
@@ -231,6 +243,8 @@ static int graph_write(int argc, const char **argv)

 	trace2_cmd_mode("write");

+	git_config(git_commit_graph_write_config, &opts);
+
 	argc = parse_options(argc, argv, NULL,
 			     builtin_commit_graph_write_options,
 			     builtin_commit_graph_write_usage, 0);
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index af0975c841..d11040ce41 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -330,13 +330,15 @@ test_expect_success 'Bloom generation is limited by --max-new-filters' '
 '

 test_expect_success 'Bloom generation backfills previously-skipped filters' '
+	# Check specifying commitGraph.maxNewFilters over "git config" works.
+	test_config -C limits commitGraph.maxNewFilters 1 &&
 	(
 		cd limits &&

 		rm -f trace.event &&
 		GIT_TRACE2_EVENT="$(pwd)/trace.event" \
 			git commit-graph write --reachable --changed-paths \
-				--split=replace --max-new-filters=1 &&
+				--split=replace &&
 		test_filter_computed 1 trace.event &&
 		test_filter_not_computed 4 trace.event &&
 		test_filter_trunc_empty 0 trace.event &&
@@ -344,6 +346,26 @@ test_expect_success 'Bloom generation backfills previously-skipped filters' '
 	)
 '

+test_expect_success '--max-new-filters overrides configuration' '
+	git init override &&
+	test_when_finished "rm -fr override" &&
+	test_config -C override commitGraph.maxNewFilters 2 &&
+	(
+		cd override &&
+		test_commit one &&
+		test_commit two &&
+
+		rm -f trace.event &&
+		GIT_TRACE2_EVENT="$(pwd)/trace.event" \
+			git commit-graph write --reachable --changed-paths \
+				--max-new-filters=1 &&
+		test_filter_computed 1 trace.event &&
+		test_filter_not_computed 1 trace.event &&
+		test_filter_trunc_empty 0 trace.event &&
+		test_filter_trunc_large 0 trace.event
+	)
+'
+
 test_expect_success 'Bloom generation backfills empty commits' '
 	git init empty &&
 	test_when_finished "rm -fr empty" &&
--
2.28.0.510.g375ecf1f36


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* Re: [PATCH v3 00/13] more miscellaneous Bloom filter improvements, redux
  2020-09-18  2:58 ` [PATCH v3 " Taylor Blau
                     ` (12 preceding siblings ...)
  2020-09-18  2:59   ` [PATCH v3 13/13] commit-graph: introduce 'commitGraph.maxNewFilters' Taylor Blau
@ 2020-09-18 13:31   ` Taylor Blau
  2020-09-18 13:34     ` Taylor Blau
  13 siblings, 1 reply; 75+ messages in thread
From: Taylor Blau @ 2020-09-18 13:31 UTC (permalink / raw)
  To: gister, git; +Cc: dstolee, peff, szeder.dev

Junio,

Two replacements for this version, if it ends up being the one you
queue. Gàbor suggested some helpful changes on 12/13, which in turn
cause a conflict when applying 13/13.

When queueing, please take:

  - The patch in [1] instead of v3's original 12/13, and
  - The patch in [2] as a suggested resolution when applying v3's
    original 13/13 on top.

[1]: https://lore.kernel.org/git/20200918132727.GB1600256@nand.local/
[2]: https://lore.kernel.org/git/20200918132937.GA1601745@nand.local/

Thanks.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v3 00/13] more miscellaneous Bloom filter improvements, redux
  2020-09-18 13:31   ` [PATCH v3 00/13] more miscellaneous Bloom filter improvements, redux Taylor Blau
@ 2020-09-18 13:34     ` Taylor Blau
  0 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-18 13:34 UTC (permalink / raw)
  To: gister, git; +Cc: dstolee, peff, szeder.dev

On Fri, Sep 18, 2020 at 09:31:40AM -0400, Taylor Blau wrote:
> Junio,
>
> Two replacements for this version, if it ends up being the one you
> queue. Gàbor suggested some helpful changes on 12/13, which in turn
> cause a conflict when applying 13/13.

I should mention, the changes are purely an alteration to the new
documentation introduced by this series. Here's a range-diff:

12:  4549f0f747 ! 12:  1c3f6b5c96 builtin/commit-graph.c: introduce '--max-new-filters=<n>'
    @@ Documentation/git-commit-graph.txt: this option is given, future commit-graph wr
      +
     +With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
     +filters (if `--changed-paths` is specified). If `n` is `-1`, no limit is
    -+enforced. Commits whose filters are not calculated are stored as a
    -+length zero Bloom filter.
    ++enforced. Only commits present in the new layer count against this
    ++limit. To retroactively compute Bloom filters over earlier layers, it is
    ++advised to use `--split=replace`.
     ++
      With the `--split[=<strategy>]` option, write the commit-graph as a
      chain of multiple commit-graph files stored in
13:  375ecf1f36 ! 13:  a7330ee850 commit-graph: introduce 'commitGraph.maxNewFilters'
    @@ Documentation/config/commitgraph.txt
      	commit-graph file (if it exists, and they are present). Defaults to

      ## Documentation/git-commit-graph.txt ##
    -@@ Documentation/git-commit-graph.txt: data.
    - With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
    +@@ Documentation/git-commit-graph.txt: With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
      filters (if `--changed-paths` is specified). If `n` is `-1`, no limit is
    - enforced. Commits whose filters are not calculated are stored as a
    --length zero Bloom filter.
    -+length zero Bloom filter. Overrides the `commitGraph.maxNewFilters`
    -+configuration.
    + enforced. Only commits present in the new layer count against this
    + limit. To retroactively compute Bloom filters over earlier layers, it is
    +-advised to use `--split=replace`.
    ++advised to use `--split=replace`. Overrides the
    ++`commitGraph.maxNewFilters` configuration.
      +
      With the `--split[=<strategy>]` option, write the commit-graph as a
      chain of multiple commit-graph files stored in

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v3 08/13] bloom: use provided 'struct bloom_filter_settings'
  2020-09-18  2:59   ` [PATCH v3 08/13] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
@ 2020-09-18 16:27     ` SZEDER Gábor
  2020-09-18 16:32       ` Taylor Blau
  0 siblings, 1 reply; 75+ messages in thread
From: SZEDER Gábor @ 2020-09-18 16:27 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, peff

On Thu, Sep 17, 2020 at 10:59:27PM -0400, Taylor Blau wrote:
> When 'get_or_compute_bloom_filter()' needs to compute a Bloom filter
> from scratch, it looks to the default 'struct bloom_filter_settings' in
> order to determine the maximum number of changed paths, number of bits
> per entry, and so on.
> 
> All of these values have so far been constant, and so there was no need
> to pass in a pointer from the caller (eg., the one that is stored in the
> 'struct write_commit_graph_context').
> 
> Start passing in a 'struct bloom_filter_settings *' instead of using the
> default values to respect graph-specific settings (eg., in the case of
> setting 'GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS').

I think this description misses the most important aspect of this
patch: it fixes, even if only partially, the half-broken fix in
0087a87ba8 (commit-graph: persist existence of changed-paths,
2020-07-01).

That commit, among other things, tried to make sure that we use the
same Bloom filter settings to compute new Bloom filters that have been
used in already existing filters.  However, it only read those
settings from the header of the existing BDAT chunk and wrote them to
the header of the new BDAT chunk (and printed them to trace2 output).
Unfortunately, it didn't actually use those settings to compute Bloom
filters, because it left get_bloom_filter() unchanged to use the
hardcoded default Bloom filter settings.  This can result in bogus
commits-graphs and, in turn, pathspec-limited revision walks omitting
commits that do modify the specified path.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v3 08/13] bloom: use provided 'struct bloom_filter_settings'
  2020-09-18 16:27     ` SZEDER Gábor
@ 2020-09-18 16:32       ` Taylor Blau
  0 siblings, 0 replies; 75+ messages in thread
From: Taylor Blau @ 2020-09-18 16:32 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: git, dstolee, gitster, peff

On Fri, Sep 18, 2020 at 06:27:40PM +0200, SZEDER Gábor wrote:
> On Thu, Sep 17, 2020 at 10:59:27PM -0400, Taylor Blau wrote:
> > When 'get_or_compute_bloom_filter()' needs to compute a Bloom filter
> > from scratch, it looks to the default 'struct bloom_filter_settings' in
> > order to determine the maximum number of changed paths, number of bits
> > per entry, and so on.
> >
> > All of these values have so far been constant, and so there was no need
> > to pass in a pointer from the caller (eg., the one that is stored in the
> > 'struct write_commit_graph_context').
> >
> > Start passing in a 'struct bloom_filter_settings *' instead of using the
> > default values to respect graph-specific settings (eg., in the case of
> > setting 'GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS').
>
> I think this description misses the most important aspect of this
> patch: it fixes, even if only partially, the half-broken fix in
> 0087a87ba8 (commit-graph: persist existence of changed-paths,
> 2020-07-01).

Could you suggest an amendment? I understand what you are saying below,
but I'm not sure how or if you want it incorporated into this patch.

If you feel this is critical to change (I have no opinion either way),
then I'm happy to send a re-roll once review has stabilized. When do you
plan on finishing your read-through of this v3?

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v3 13/13] commit-graph: introduce 'commitGraph.maxNewFilters'
  2020-09-18 13:29     ` Taylor Blau
@ 2020-09-18 17:43       ` Junio C Hamano
  0 siblings, 0 replies; 75+ messages in thread
From: Junio C Hamano @ 2020-09-18 17:43 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, peff, szeder.dev

Taylor Blau <me@ttaylorr.com> writes:

> On Thu, Sep 17, 2020 at 10:59:57PM -0400, Taylor Blau wrote:
>> Introduce a configuration variable to specify a default value for the
>> recently-introduce '--max-new-filters' option of 'git commit-graph
>> write'.
>
> The replacement in [1] causes a conflict when applying this patch after
> it, so here is a suggested resolution:
>
> [1]: https://lore.kernel.org/git/cover.1600397826.git.me@ttaylorr.com/

I think you meant <20200918132727.GB1600256@nand.local> not the
cover letter.  Here is what I ended up with:

    $ git checkout tb/bloom-improvements
    $ git diff @{yesterday}
    diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
    index 60df4e4bfa..de6b6de230 100644
    --- a/Documentation/git-commit-graph.txt
    +++ b/Documentation/git-commit-graph.txt
    @@ -69,8 +69,9 @@ data.
     +
     With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
     filters (if `--changed-paths` is specified). If `n` is `-1`, no limit is
    -enforced. Commits whose filters are not calculated are stored as a
    -length zero Bloom filter. Overrides the `commitGraph.maxNewFilters`
    +enforced. Only commits present in the new layer count against this
    +limit. To retroactively compute Bloom filters over earlier layers, it is
    +advised to use `--split=replace`.  Overrides the `commitGraph.maxNewFilters`
     configuration.
     +
     With the `--split[=<strategy>]` option, write the commit-graph as a

Thanks.

^ permalink raw reply	[flat|nested] 75+ messages in thread

end of thread, other threads:[~2020-09-18 17:43 UTC | newest]

Thread overview: 75+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-09 15:22 [PATCH 00/12] more miscellaneous Bloom filter improvements, redux Taylor Blau
2020-09-09 15:22 ` [PATCH 01/12] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-09-09 15:22 ` [PATCH 02/12] t4216: use an '&&'-chain Taylor Blau
2020-09-09 15:22 ` [PATCH 03/12] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-09-09 15:23 ` [PATCH 04/12] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-09-09 15:23 ` [PATCH 05/12] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
2020-09-09 15:23 ` [PATCH 06/12] commit-graph.c: store maximum changed paths Taylor Blau
2020-09-09 15:23 ` [PATCH 07/12] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-09-09 15:23 ` [PATCH 08/12] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
2020-09-09 15:23 ` [PATCH 09/12] bloom/diff: properly short-circuit on max_changes Taylor Blau
2020-09-09 15:23 ` [PATCH 10/12] bloom: encode out-of-bounds filters as non-empty Taylor Blau
2020-09-10  3:35   ` Taylor Blau
2020-09-10 15:45     ` Taylor Blau
2020-09-11 18:15       ` Derrick Stolee
2020-09-09 15:23 ` [PATCH 11/12] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-09-09 15:24 ` [PATCH 12/12] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-09-11 17:52   ` Jeff King
2020-09-11 18:59     ` Taylor Blau
2020-09-11 19:25       ` Taylor Blau
2020-09-14 20:12         ` Taylor Blau
2020-09-14 20:31           ` Derrick Stolee
2020-09-14 20:36             ` Taylor Blau
2020-09-15  0:59               ` Derrick Stolee
2020-09-15  4:31                 ` Taylor Blau
2020-09-15 21:49               ` Junio C Hamano
2020-09-15 21:53                 ` Taylor Blau
2020-09-11 19:47       ` Jeff King
2020-09-11 19:31     ` Junio C Hamano
2020-09-16 18:06 ` [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux Taylor Blau
2020-09-16 18:06   ` [PATCH v2 01/13] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-09-16 18:07   ` [PATCH v2 02/13] t4216: use an '&&'-chain Taylor Blau
2020-09-16 18:07   ` [PATCH v2 03/13] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-09-16 18:07   ` [PATCH v2 04/13] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-09-16 18:07   ` [PATCH v2 05/13] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
2020-09-16 18:07   ` [PATCH v2 06/13] commit-graph.c: store maximum changed paths Taylor Blau
2020-09-16 18:07   ` [PATCH v2 07/13] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-09-16 18:07   ` [PATCH v2 08/13] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
2020-09-16 18:07   ` [PATCH v2 09/13] bloom/diff: properly short-circuit on max_changes Taylor Blau
2020-09-16 18:07   ` [PATCH v2 10/13] bloom: encode out-of-bounds filters as non-empty Taylor Blau
2020-09-17 22:13     ` SZEDER Gábor
2020-09-17 23:13       ` Taylor Blau
2020-09-18  0:52         ` Junio C Hamano
2020-09-18  1:15           ` Taylor Blau
2020-09-16 18:08   ` [PATCH v2 11/13] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-09-16 18:08   ` [PATCH v2 12/13] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-09-18  9:23     ` SZEDER Gábor
2020-09-18 13:27       ` Taylor Blau
2020-09-16 18:08   ` [PATCH v2 13/13] commit-graph: introduce 'commitGraph.maxNewFilters' Taylor Blau
2020-09-16 22:51   ` [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux Derrick Stolee
2020-09-16 23:07     ` Junio C Hamano
2020-09-17  0:45       ` Taylor Blau
2020-09-17  0:59         ` Junio C Hamano
2020-09-17  1:10           ` Taylor Blau
2020-09-17 13:34             ` Taylor Blau
2020-09-17 13:38               ` Derrick Stolee
2020-09-18  2:58 ` [PATCH v3 " Taylor Blau
2020-09-18  2:58   ` [PATCH v3 01/13] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-09-18  2:58   ` [PATCH v3 02/13] t4216: use an '&&'-chain Taylor Blau
2020-09-18  2:59   ` [PATCH v3 03/13] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-09-18  2:59   ` [PATCH v3 04/13] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-09-18  2:59   ` [PATCH v3 05/13] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
2020-09-18  2:59   ` [PATCH v3 06/13] commit-graph.c: store maximum changed paths Taylor Blau
2020-09-18  2:59   ` [PATCH v3 07/13] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-09-18  2:59   ` [PATCH v3 08/13] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
2020-09-18 16:27     ` SZEDER Gábor
2020-09-18 16:32       ` Taylor Blau
2020-09-18  2:59   ` [PATCH v3 09/13] bloom/diff: properly short-circuit on max_changes Taylor Blau
2020-09-18  2:59   ` [PATCH v3 10/13] bloom: encode out-of-bounds filters as non-empty Taylor Blau
2020-09-18  2:59   ` [PATCH v3 11/13] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-09-18  2:59   ` [PATCH v3 12/13] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-09-18  2:59   ` [PATCH v3 13/13] commit-graph: introduce 'commitGraph.maxNewFilters' Taylor Blau
2020-09-18 13:29     ` Taylor Blau
2020-09-18 17:43       ` Junio C Hamano
2020-09-18 13:31   ` [PATCH v3 00/13] more miscellaneous Bloom filter improvements, redux Taylor Blau
2020-09-18 13:34     ` Taylor Blau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).