git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] Integrate changed-path Bloom filters with 'git blame'
@ 2020-04-11  1:02 Derrick Stolee via GitGitGadget
  2020-04-11  1:02 ` [PATCH 1/3] revision: complicated pathspecs disable filters Derrick Stolee via GitGitGadget
                   ` (4 more replies)
  0 siblings, 5 replies; 48+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-11  1:02 UTC (permalink / raw)
  To: git; +Cc: me, jnareb, garimasigit, Derrick Stolee

If the changed-path Bloom filters are relatively stable, then I propose
trying to build upon them as a way to discover any deficiencies. Also, it's
good to use them when we can.

The goal I set out to achieve was to use Bloom filters as much as possible
in git blame. I think I have achieved most of that, except I did not
integrate it with the -C mode. In that case, the blob-diff computation takes
a majority of the computation time, so short-circuiting the tree diff using
Bloom filters. Also, it's just really complicated. If someone else thinks
there is an easy win there, then please go ahead and give it a try with the
extra logic presented here in PATCH 3.

While I was playing around with Bloom filters and git blame, I purposefully
got it working with some scenarios but not all. Then I tried to trigger a
failing build in the blame tests using GIT_TEST_COMMIT_GRAPH=1 and 
GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=1. But the tests all succeeded!

Examining the data, I see that the commit-graph didn't have the Bloom filter
chunks at all. This is because we are not setting the flag to write them in
the right spot. The GIT_TEST_COMMIT_GRAPH=1 variable triggers a commit-graph
write during git commit, so we need to update the code there instead of just
inspecting the variable in git commit-graph write. (This is PATCH 2.)

By updating this variable, I saw some test failures in other tests regarding
non-exact pathspecs. I fixed these in PATCH 1 so we keep clean builds.

I based this change on [1] but it would apply cleanly (and logically) on
gs/commit-graph-path-filter

Thanks, -Stolee

[1] 
https://lore.kernel.org/git/pull.601.v2.git.1586437211842.gitgitgadget@gmail.com/

Derrick Stolee (3):
  revision: complicated pathspecs disable filters
  commit: write commit-graph with bloom filters
  blame: use changed-path Bloom filters

 blame.c          | 139 ++++++++++++++++++++++++++++++++++++++++++++---
 blame.h          |   6 ++
 builtin/blame.c  |  10 ++++
 builtin/commit.c |   4 +-
 revision.c       |   9 +++
 5 files changed, 158 insertions(+), 10 deletions(-)


base-commit: f4df00a0dd448edce0e854a97f63598fefe27d27
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-609%2Fderrickstolee%2Fbloom-blame-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-609/derrickstolee/bloom-blame-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/609
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 1/3] revision: complicated pathspecs disable filters
  2020-04-11  1:02 [PATCH 0/3] Integrate changed-path Bloom filters with 'git blame' Derrick Stolee via GitGitGadget
@ 2020-04-11  1:02 ` Derrick Stolee via GitGitGadget
  2020-04-11 21:40   ` Junio C Hamano
  2020-04-12 22:22   ` Taylor Blau
  2020-04-11  1:03 ` [PATCH 2/3] commit: write commit-graph with bloom filters Derrick Stolee via GitGitGadget
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 48+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-11  1:02 UTC (permalink / raw)
  To: git; +Cc: me, jnareb, garimasigit, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The changed-path Bloom filters work only when we can compute an
explicit Bloom filter key in advance. When a pathspec is given
that allows case-insensitive checks or wildcard matching, we
must disable the Bloom filter performance checks.

By checking the pathspec in prepare_to_use_bloom_filters(), we
avoid setting up the Bloom filter data and thus revert to the
usual logic.

Before this change, the following tests would fail*:

	t6004-rev-list-path-optim.sh (Tests 6-7)
	t6130-pathspec-noglob.sh (Tests 3-6)
	t6131-pathspec-icase.sh (Tests 3-5)

*These tests would fail when using GIT_TEST_COMMIT_GRAPH and
GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS except that the latter
environment variable was not set up correctly to write the changed-
path Bloom filters in the test suite. That will be fixed in the
next change.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/revision.c b/revision.c
index 2b06ee739c8..e37b5b06108 100644
--- a/revision.c
+++ b/revision.c
@@ -661,6 +661,15 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 	if (!revs->commits)
 	    return;
 
+	if (revs->prune_data.has_wildcard)
+		return;
+	if (revs->prune_data.nr > 1)
+		return;
+	if (revs->prune_data.magic ||
+	    (revs->prune_data.nr &&
+	     revs->prune_data.items[0].magic))
+		return;
+
 	repo_parse_commit(revs->repo, revs->commits->item);
 
 	if (!revs->repo->objects->commit_graph)
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 2/3] commit: write commit-graph with bloom filters
  2020-04-11  1:02 [PATCH 0/3] Integrate changed-path Bloom filters with 'git blame' Derrick Stolee via GitGitGadget
  2020-04-11  1:02 ` [PATCH 1/3] revision: complicated pathspecs disable filters Derrick Stolee via GitGitGadget
@ 2020-04-11  1:03 ` Derrick Stolee via GitGitGadget
  2020-04-11 21:57   ` Junio C Hamano
  2020-04-11  1:03 ` [PATCH 3/3] blame: use changed-path Bloom filters Derrick Stolee via GitGitGadget
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 48+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-11  1:03 UTC (permalink / raw)
  To: git; +Cc: me, jnareb, garimasigit, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The GIT_TEST_COMMIT_GRAPH environment variable updates the commit-
graph file whenever "git commit" is run, ensuring that we always
have an updated commit-graph throughout the test suite. The
GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS environment variable was
introduced to write the changed-path Bloom filters whenever "git
commit-graph write" is run. However, the GIT_TEST_COMMIT_GRAPH
trick doesn't launch a separate process and instead writes it
directly.

Update the "git commit" builtin to write changed-path Bloom filters
when both GIT_TEST_COMMIT_GRAPH and GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS
are enabled.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/commit.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/builtin/commit.c b/builtin/commit.c
index d3e7781e658..d2fdfdc4363 100644
--- a/builtin/commit.c
+++ b/builtin/commit.c
@@ -1701,7 +1701,9 @@ int cmd_commit(int argc, const char **argv, const char *prefix)
 		      "not exceeded, and then \"git restore --staged :/\" to recover."));
 
 	if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) &&
-	    write_commit_graph_reachable(the_repository->objects->odb, 0, NULL))
+	    write_commit_graph_reachable(the_repository->objects->odb,
+					 git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0) ? COMMIT_GRAPH_WRITE_BLOOM_FILTERS : 0,
+					 NULL))
 		return 1;
 
 	repo_rerere(the_repository, 0);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 3/3] blame: use changed-path Bloom filters
  2020-04-11  1:02 [PATCH 0/3] Integrate changed-path Bloom filters with 'git blame' Derrick Stolee via GitGitGadget
  2020-04-11  1:02 ` [PATCH 1/3] revision: complicated pathspecs disable filters Derrick Stolee via GitGitGadget
  2020-04-11  1:03 ` [PATCH 2/3] commit: write commit-graph with bloom filters Derrick Stolee via GitGitGadget
@ 2020-04-11  1:03 ` Derrick Stolee via GitGitGadget
  2020-04-11 22:03   ` Junio C Hamano
  2020-04-11 21:30 ` [PATCH 0/3] Integrate changed-path Bloom filters with 'git blame' Junio C Hamano
  2020-04-13 14:45 ` [PATCH v2 0/4] " Derrick Stolee via GitGitGadget
  4 siblings, 1 reply; 48+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-11  1:03 UTC (permalink / raw)
  To: git; +Cc: me, jnareb, garimasigit, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The changed-path Bloom filters help reduce the amount of tree
parsing required during history queries. Before calculating a
diff, we can ask the filter if a path changed between a commit
and its first parent. If the filter says "no" then we can move
on without parsing trees. If the filter says "maybe" then we
parse trees to discover if the answer is actually "yes" or "no".

When computing a blame, there is a section in find_origin() that
computes a diff between a commit and one of its parents. When this
is the first parent, we can check the Bloom filters before calling
diff_tree_oid().

In order to make this work with the blame machinery, we need to
initialize a struct bloom_key with the initial path. But also, we
need to add more keys to a list if a rename is detected. We then
check to see if _any_ of these keys answer "maybe" in the diff.

During development, I purposefully left out this "add a new key
when a rename is detected" to see if the test suite would catch
my error. That is how I discovered the issues with
GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS from the previous change.
With that change, we can feel some confidence in the coverage of
this change.

If a user requests copy detection using "git blame -C", then there
are more places where the set of "important" files can expand. I
do not know enough about how this happens in the blame machinery.
Thus, the Bloom filter integration is explicitly disabled in this
mode. A later change could expand the bloom_key data with an
appropriate call (or calls) to add_bloom_key().

If we did not disable this mode, then the following tests would
fail:

	t8003-blame-corner-cases.sh
	t8011-blame-split-file.sh

Generally, this is a performance enhancement and should not
change the behavior of 'git blame' in any way. If a repo has a
commit-graph file with computed changed-path Bloom filters, then
they should notice improved performance for their 'git blame'
commands.

Here are some example timings that I found by blaming some paths
in the Linux kernel repository:

 git blame arch/x86/kernel/topology.c >/dev/null

 Before: 0.83s
  After: 0.24s

 git blame kernel/time/time.c >/dev/null

 Before: 0.72s
  After: 0.24s

 git blame tools/perf/ui/stdio/hist.c >/dev/null

 Before: 0.27s
  After: 0.11s

I specifically looked for "deep" paths that were also edited many
times. As a counterpoint, the MAINTAINERS file was edited many
times but is located in the root tree. This means that the cost of
computing a diff relative to the pathspec is very small. Here are
the timings for that command:

 git blame MAINTAINERS >/dev/null

 Before: 20.1s
  After: 18.0s

These timings are the best of five. The worst-case runs were on the
order of 2.5 minutes for both cases. Note that the MAINTAINERS file
has 18,740 lines across 17,000+ commits. This happens to be one of
the cases where this change provides the least improvement.

The lack of improvement for the MAINTAINERS file and the relatively
modest improvement for the other examples can be easily explained.
The blame machinery needs to compute line-level diffs to determine
which lines were changed by each commit. That makes up a large
proportion of the computation time, and this change does not
attempt to improve on that section of the algorithm. The
MAINTAINERS file is large and changed often, so it takes time to
determine which lines were updated by which commit. In contrast,
the code files are much smaller, and it takes longer to comute
the line-by-line diff for a single patch on the Linux mailing
lists.

Outside of the "-C" integration, I believe there is little more to
gain from the changed-path Bloom filters for 'git blame' after this
patch.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 blame.c         | 139 ++++++++++++++++++++++++++++++++++++++++++++----
 blame.h         |   6 +++
 builtin/blame.c |  10 ++++
 3 files changed, 146 insertions(+), 9 deletions(-)

diff --git a/blame.c b/blame.c
index 29770e5c81c..9fbf79e47c3 100644
--- a/blame.c
+++ b/blame.c
@@ -9,6 +9,8 @@
 #include "blame.h"
 #include "alloc.h"
 #include "commit-slab.h"
+#include "bloom.h"
+#include "commit-graph.h"
 
 define_commit_slab(blame_suspects, struct blame_origin *);
 static struct blame_suspects blame_suspects;
@@ -1246,13 +1248,75 @@ static int fill_blob_sha1_and_mode(struct repository *r,
 	return -1;
 }
 
+struct blame_bloom_data {
+	/*
+	 * Changed-path Bloom filter keys. These can help prevent
+	 * computing diffs against first parents, but we need to
+	 * expand the list as code is moved or files are renamed.
+	 */
+	struct bloom_filter_settings *settings;
+	struct bloom_key **keys;
+	int nr;
+	int alloc;
+};
+
+static int bloom_count_queries = 0;
+static int bloom_count_no = 0;
+static int maybe_changed_path(struct repository *r,
+			      struct commit *parent,
+			      struct blame_origin *origin,
+			      struct blame_bloom_data *bd)
+{
+	int i;
+	struct bloom_filter *filter;
+
+	if (!bd)
+		return 1;
+
+	if (origin->commit->generation == GENERATION_NUMBER_INFINITY)
+		return 1;
+
+	filter = get_bloom_filter(r, origin->commit, 0);
+
+	if (!filter)
+		return 1;
+
+	bloom_count_queries++;
+	for (i = 0; i < bd->nr; i++) {
+		if (bloom_filter_contains(filter,
+					  bd->keys[i],
+					  bd->settings))
+			return 1;
+	}
+
+	bloom_count_no++;
+	return 0;
+}
+
+static void add_bloom_key(struct blame_bloom_data *bd,
+			  const char *path)
+{
+	if (!bd)
+		return;
+
+	if (bd->nr >= bd->alloc) {
+		bd->alloc *= 2;
+		REALLOC_ARRAY(bd->keys, bd->alloc);
+	}
+
+	bd->keys[bd->nr] = xmalloc(sizeof(struct bloom_key));
+	fill_bloom_key(path, strlen(path), bd->keys[bd->nr], bd->settings);
+	bd->nr++;
+}
+
 /*
  * We have an origin -- check if the same path exists in the
  * parent and return an origin structure to represent it.
  */
 static struct blame_origin *find_origin(struct repository *r,
 					struct commit *parent,
-					struct blame_origin *origin)
+					struct blame_origin *origin,
+					struct blame_bloom_data *bd)
 {
 	struct blame_origin *porigin;
 	struct diff_options diff_opts;
@@ -1286,10 +1350,19 @@ static struct blame_origin *find_origin(struct repository *r,
 
 	if (is_null_oid(&origin->commit->object.oid))
 		do_diff_cache(get_commit_tree_oid(parent), &diff_opts);
-	else
-		diff_tree_oid(get_commit_tree_oid(parent),
-			      get_commit_tree_oid(origin->commit),
-			      "", &diff_opts);
+	else {
+		int compute_diff = 1;
+		if (origin->commit->parents &&
+		    !oidcmp(&parent->object.oid,
+			    &origin->commit->parents->item->object.oid))
+			compute_diff = maybe_changed_path(r, parent,
+							  origin, bd);
+
+		if (compute_diff)
+			diff_tree_oid(get_commit_tree_oid(parent),
+				      get_commit_tree_oid(origin->commit),
+				      "", &diff_opts);
+	}
 	diffcore_std(&diff_opts);
 
 	if (!diff_queued_diff.nr) {
@@ -1341,7 +1414,8 @@ static struct blame_origin *find_origin(struct repository *r,
  */
 static struct blame_origin *find_rename(struct repository *r,
 					struct commit *parent,
-					struct blame_origin *origin)
+					struct blame_origin *origin,
+					struct blame_bloom_data *bd)
 {
 	struct blame_origin *porigin = NULL;
 	struct diff_options diff_opts;
@@ -1366,6 +1440,7 @@ static struct blame_origin *find_rename(struct repository *r,
 		struct diff_filepair *p = diff_queued_diff.queue[i];
 		if ((p->status == 'R' || p->status == 'C') &&
 		    !strcmp(p->two->path, origin->path)) {
+			add_bloom_key(bd, p->one->path);
 			porigin = get_origin(parent, p->one->path);
 			oidcpy(&porigin->blob_oid, &p->one->oid);
 			porigin->mode = p->one->mode;
@@ -2332,6 +2407,11 @@ static void distribute_blame(struct blame_scoreboard *sb, struct blame_entry *bl
 
 #define MAXSG 16
 
+typedef struct blame_origin *(*blame_find_alg)(struct repository *,
+					       struct commit *,
+					       struct blame_origin *,
+					       struct blame_bloom_data *);
+
 static void pass_blame(struct blame_scoreboard *sb, struct blame_origin *origin, int opt)
 {
 	struct rev_info *revs = sb->revs;
@@ -2356,8 +2436,7 @@ static void pass_blame(struct blame_scoreboard *sb, struct blame_origin *origin,
 	 * common cases, then we look for renames in the second pass.
 	 */
 	for (pass = 0; pass < 2 - sb->no_whole_file_rename; pass++) {
-		struct blame_origin *(*find)(struct repository *, struct commit *, struct blame_origin *);
-		find = pass ? find_rename : find_origin;
+		blame_find_alg find = pass ? find_rename : find_origin;
 
 		for (i = 0, sg = first_scapegoat(revs, commit, sb->reverse);
 		     i < num_sg && sg;
@@ -2369,7 +2448,7 @@ static void pass_blame(struct blame_scoreboard *sb, struct blame_origin *origin,
 				continue;
 			if (parse_commit(p))
 				continue;
-			porigin = find(sb->repo, p, origin);
+			porigin = find(sb->repo, p, origin, sb->bloom_data);
 			if (!porigin)
 				continue;
 			if (oideq(&porigin->blob_oid, &origin->blob_oid)) {
@@ -2809,3 +2888,45 @@ struct blame_entry *blame_entry_prepend(struct blame_entry *head,
 	blame_origin_incref(o);
 	return new_head;
 }
+
+void setup_blame_bloom_data(struct blame_scoreboard *sb,
+			    const char *path)
+{
+	struct blame_bloom_data *bd;
+
+	if (!sb->repo->objects->commit_graph)
+		return;
+
+	if (!sb->repo->objects->commit_graph->bloom_filter_settings)
+		return;
+
+	bd = xmalloc(sizeof(struct blame_bloom_data));
+
+	bd->settings = sb->repo->objects->commit_graph->bloom_filter_settings;
+
+	bd->alloc = 4;
+	bd->nr = 0;
+	ALLOC_ARRAY(bd->keys, bd->alloc);
+
+	add_bloom_key(bd, path);
+
+	sb->bloom_data = bd;
+}
+
+void cleanup_scoreboard(struct blame_scoreboard *sb)
+{
+	if (sb->bloom_data) {
+		int i;
+		for (i = 0; i < sb->bloom_data->nr; i++) {
+			free(sb->bloom_data->keys[i]->hashes);
+			free(sb->bloom_data->keys[i]);
+		}
+		free(sb->bloom_data->keys);
+		FREE_AND_NULL(sb->bloom_data);
+
+		trace2_data_intmax("blame", sb->repo,
+				   "bloom/queries", bloom_count_queries);
+		trace2_data_intmax("blame", sb->repo,
+				   "bloom/response-no", bloom_count_no);
+	}
+}
diff --git a/blame.h b/blame.h
index 089b181ff27..b6bbee41472 100644
--- a/blame.h
+++ b/blame.h
@@ -100,6 +100,8 @@ struct blame_entry {
 	int unblamable;
 };
 
+struct blame_bloom_data;
+
 /*
  * The current state of the blame assignment.
  */
@@ -156,6 +158,7 @@ struct blame_scoreboard {
 	void(*found_guilty_entry)(struct blame_entry *, void *);
 
 	void *found_guilty_entry_data;
+	struct blame_bloom_data *bloom_data;
 };
 
 /*
@@ -180,6 +183,9 @@ void init_scoreboard(struct blame_scoreboard *sb);
 void setup_scoreboard(struct blame_scoreboard *sb,
 		      const char *path,
 		      struct blame_origin **orig);
+void setup_blame_bloom_data(struct blame_scoreboard *sb,
+			    const char *path);
+void cleanup_scoreboard(struct blame_scoreboard *sb);
 
 struct blame_entry *blame_entry_prepend(struct blame_entry *head,
 					long start, long end,
diff --git a/builtin/blame.c b/builtin/blame.c
index bf1cecdf3f9..3c13634f279 100644
--- a/builtin/blame.c
+++ b/builtin/blame.c
@@ -1061,6 +1061,14 @@ int cmd_blame(int argc, const char **argv, const char *prefix)
 	string_list_clear(&ignore_revs_file_list, 0);
 	string_list_clear(&ignore_rev_list, 0);
 	setup_scoreboard(&sb, path, &o);
+
+	/*
+	 * Changed-path Bloom filters are disabled when looking
+	 * for copies.
+	 */
+	if (!(opt & PICKAXE_BLAME_COPY))
+		setup_blame_bloom_data(&sb, path);
+
 	lno = sb.num_lines;
 
 	if (lno && !range_list.nr)
@@ -1164,5 +1172,7 @@ int cmd_blame(int argc, const char **argv, const char *prefix)
 		printf("num get patch: %d\n", sb.num_get_patch);
 		printf("num commits: %d\n", sb.num_commits);
 	}
+
+	cleanup_scoreboard(&sb);
 	return 0;
 }
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/3] Integrate changed-path Bloom filters with 'git blame'
  2020-04-11  1:02 [PATCH 0/3] Integrate changed-path Bloom filters with 'git blame' Derrick Stolee via GitGitGadget
                   ` (2 preceding siblings ...)
  2020-04-11  1:03 ` [PATCH 3/3] blame: use changed-path Bloom filters Derrick Stolee via GitGitGadget
@ 2020-04-11 21:30 ` Junio C Hamano
  2020-04-13 14:45 ` [PATCH v2 0/4] " Derrick Stolee via GitGitGadget
  4 siblings, 0 replies; 48+ messages in thread
From: Junio C Hamano @ 2020-04-11 21:30 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, jnareb, garimasigit, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> The goal I set out to achieve was to use Bloom filters as much as possible
> in git blame.

Exciting.  Uncomfortably exciting.

Thanks.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/3] revision: complicated pathspecs disable filters
  2020-04-11  1:02 ` [PATCH 1/3] revision: complicated pathspecs disable filters Derrick Stolee via GitGitGadget
@ 2020-04-11 21:40   ` Junio C Hamano
  2020-04-13 11:49     ` Derrick Stolee
  2020-04-12 22:22   ` Taylor Blau
  1 sibling, 1 reply; 48+ messages in thread
From: Junio C Hamano @ 2020-04-11 21:40 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, jnareb, garimasigit, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> The changed-path Bloom filters work only when we can compute an
> explicit Bloom filter key in advance. When a pathspec is given
> that allows case-insensitive checks or wildcard matching, we
> must disable the Bloom filter performance checks.

How often do we want to do case-insensitive path matching, I wonder.

As somebody who never touches case-insensitive filesystem, I would
be a bad judge, and I would imagine that I would be looking for a
pathspec "[Mm]akefile" rather than ":(icase)makefile" myself if
there are projects that may have either/both, so I do not mind
disabling the bloom filter for case insensitivity myself.

But if users may use icase pathspec very often, it may be worth
considering to build the bloom filter after downcasing the paths,
perhaps?  Given that many projects extract their source code to a
case insensitive filesystem, I would imagine that downcasing paths
would map two originally different paths into the same thing only
rarely, if ever, so there may not be much downside to do so.

> By checking the pathspec in prepare_to_use_bloom_filters(), we
> avoid setting up the Bloom filter data and thus revert to the
> usual logic.
>
> Before this change, the following tests would fail*:
>
> 	t6004-rev-list-path-optim.sh (Tests 6-7)
> 	t6130-pathspec-noglob.sh (Tests 3-6)
> 	t6131-pathspec-icase.sh (Tests 3-5)
>
> *These tests would fail when using GIT_TEST_COMMIT_GRAPH and
> GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS except that the latter
> environment variable was not set up correctly to write the changed-
> path Bloom filters in the test suite. That will be fixed in the
> next change.

Thanks.  This is exciting.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/3] commit: write commit-graph with bloom filters
  2020-04-11  1:03 ` [PATCH 2/3] commit: write commit-graph with bloom filters Derrick Stolee via GitGitGadget
@ 2020-04-11 21:57   ` Junio C Hamano
  2020-04-12 20:51     ` Taylor Blau
  0 siblings, 1 reply; 48+ messages in thread
From: Junio C Hamano @ 2020-04-11 21:57 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, jnareb, garimasigit, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> diff --git a/builtin/commit.c b/builtin/commit.c
> index d3e7781e658..d2fdfdc4363 100644
> --- a/builtin/commit.c
> +++ b/builtin/commit.c
> @@ -1701,7 +1701,9 @@ int cmd_commit(int argc, const char **argv, const char *prefix)
>  		      "not exceeded, and then \"git restore --staged :/\" to recover."));
>  
>  	if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) &&
> -	    write_commit_graph_reachable(the_repository->objects->odb, 0, NULL))
> +	    write_commit_graph_reachable(the_repository->objects->odb,
> +					 git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0) ? COMMIT_GRAPH_WRITE_BLOOM_FILTERS : 0,
> +					 NULL))

Yuck.  That is doubly mouthful.

I wonder how much value we are getting by having this call here in
the first place.  This function being cmd_commit(), by definition we
won't be updating the graph file unless the test script does "git
commit".  We won't update the graph file upon "git rebase", "git
merge", "git pull", "git fetch", etc., so it is not like with this
hack, the test coverage for various traversal using the graph file
would magically be perfect.  We'd still need an explicit call to
"commit-graph write" at strategic places in the test scripts, no?

How are we testing with and without bitmaps, and do we have a kludge
like this one for the feature, or do we use a different mechanism
to help writing tests easier to gain better coverage?

In any case, I can see why we want a separate knob specific to the
CHANGED_PATHS until the path filter part becomes as stable as the
rest of the graph file, but after some time, we should remove that
knob, and at that point, we always use the path filter whenever we
use commit-graph, so that we won't have to suffer from the ugliness.

Wait.  I wonder if we can tweak the arrangement a bit so that this
layer does not need to know any more than GIT_TEST_COMMIT_GRAPH?

For example, in commit-graph.c::write_commit_graph(), can we test
the CHANGED_PATHS environment variable and flip the .changed_paths
bit, no matter what the 'flags' argument to the function says?

Thanks.

>  		return 1;
>  
>  	repo_rerere(the_repository, 0);

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/3] blame: use changed-path Bloom filters
  2020-04-11  1:03 ` [PATCH 3/3] blame: use changed-path Bloom filters Derrick Stolee via GitGitGadget
@ 2020-04-11 22:03   ` Junio C Hamano
  2020-04-12  7:39     ` Eric Sunshine
  0 siblings, 1 reply; 48+ messages in thread
From: Junio C Hamano @ 2020-04-11 22:03 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, jnareb, garimasigit, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
> Subject: Re: [PATCH 3/3] blame: use changed-path Bloom filters

Ahh, I almost forgot we spell Bloom with capital B, so I should go
back and amend the title of [2/3].

> When computing a blame, there is a section in find_origin() that
> computes a diff between a commit and one of its parents. When this
> is the first parent, we can check the Bloom filters before calling
> diff_tree_oid().
>
> In order to make this work with the blame machinery, we need to
> initialize a struct bloom_key with the initial path. But also, we
> need to add more keys to a list if a rename is detected. We then
> check to see if _any_ of these keys answer "maybe" in the diff.

Yes.  I think after gaining experience with this technique, we may
be able to speed up the "git log --follow" while correcting its
semantics at the same time.  The prospect is unnervingly exciting.

> Generally, this is a performance enhancement and should not
> change the behavior of 'git blame' in any way.

Absolutely.

> The lack of improvement for the MAINTAINERS file and the relatively
> modest improvement for the other examples can be easily explained.
> The blame machinery needs to compute line-level diffs to determine
> which lines were changed by each commit. That makes up a large
> proportion of the computation time, and this change does not
> attempt to improve on that section of the algorithm. The
> MAINTAINERS file is large and changed often, so it takes time to
> determine which lines were updated by which commit. In contrast,
> the code files are much smaller, and it takes longer to comute
> the line-by-line diff for a single patch on the Linux mailing
> lists.

Yup, tree-diff for a deeper path would benefit the most, and your
numbers were indeed impressive.

> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  blame.c         | 139 ++++++++++++++++++++++++++++++++++++++++++++----
>  blame.h         |   6 +++
>  builtin/blame.c |  10 ++++
>  3 files changed, 146 insertions(+), 9 deletions(-)

I am kind-a surprised how little additional code it takes to do
this.  Good job.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/3] blame: use changed-path Bloom filters
  2020-04-11 22:03   ` Junio C Hamano
@ 2020-04-12  7:39     ` Eric Sunshine
  0 siblings, 0 replies; 48+ messages in thread
From: Eric Sunshine @ 2020-04-12  7:39 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, Git List, Taylor Blau,
	Jakub Narebski, garimasigit, Derrick Stolee

On Sat, Apr 11, 2020 at 6:03 PM Junio C Hamano <gitster@pobox.com> wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> > The lack of improvement for the MAINTAINERS file and the relatively
> > modest improvement for the other examples can be easily explained.
> > The blame machinery needs to compute line-level diffs to determine
> > which lines were changed by each commit. That makes up a large
> > proportion of the computation time, and this change does not
> > attempt to improve on that section of the algorithm. The
> > MAINTAINERS file is large and changed often, so it takes time to
> > determine which lines were updated by which commit. In contrast,
> > the code files are much smaller, and it takes longer to comute
> > the line-by-line diff for a single patch on the Linux mailing
> > lists.

s/comute/compute/

(Sorry for replying to Junio's reply -- the original is no longer in
my mailbox.)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/3] commit: write commit-graph with bloom filters
  2020-04-11 21:57   ` Junio C Hamano
@ 2020-04-12 20:51     ` Taylor Blau
  2020-04-13 12:08       ` Derrick Stolee
  0 siblings, 1 reply; 48+ messages in thread
From: Taylor Blau @ 2020-04-12 20:51 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, git, me, jnareb, garimasigit,
	Derrick Stolee

On Sat, Apr 11, 2020 at 02:57:18PM -0700, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > diff --git a/builtin/commit.c b/builtin/commit.c
> > index d3e7781e658..d2fdfdc4363 100644
> > --- a/builtin/commit.c
> > +++ b/builtin/commit.c
> > @@ -1701,7 +1701,9 @@ int cmd_commit(int argc, const char **argv, const char *prefix)
> >  		      "not exceeded, and then \"git restore --staged :/\" to recover."));
> >
> >  	if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) &&
> > -	    write_commit_graph_reachable(the_repository->objects->odb, 0, NULL))
> > +	    write_commit_graph_reachable(the_repository->objects->odb,
> > +					 git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0) ? COMMIT_GRAPH_WRITE_BLOOM_FILTERS : 0,
> > +					 NULL))
>
> Yuck.  That is doubly mouthful.

Yeah, this is quite a mouthful indeed. I think the most conservative fix
would be something like:

  if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0)) {
    enum commit_graph_write_flags flags = 0;
    if (git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0))
      flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;

    if (write_commit_graph_reachable(the_repository->objects->odb,
                                     flags, NULL)) {
      /* ... */
    }
  }

and then ripping this knob out (as Junio suggests below, which I think
is a wise suggestion) out would be straightforward.

> I wonder how much value we are getting by having this call here in
> the first place.  This function being cmd_commit(), by definition we
> won't be updating the graph file unless the test script does "git
> commit".  We won't update the graph file upon "git rebase", "git
> merge", "git pull", "git fetch", etc., so it is not like with this
> hack, the test coverage for various traversal using the graph file
> would magically be perfect.  We'd still need an explicit call to
> "commit-graph write" at strategic places in the test scripts, no?

Yeah. It's definitely not a silver bullet for the reasons you mention,
but I think that it is helping out in some common cases. For example,
if we moved this check out to 'test_commit', then we'd be relying on
tests to use that instead of 'git commit'. On the other hand, this
catches either, since they both wind up in this builtin.

I guess you could push this check down much further to when we're about
to write a new object, and--if that new object is a commit--update the
commit-graph. That sounds awfully slow (and feels to me like a hack, but
I can't justify whether it's more or less of a hack than we already
have), but I think it would be OK, since this isn't the normal way to
run tests.

> How are we testing with and without bitmaps, and do we have a kludge
> like this one for the feature, or do we use a different mechanism
> to help writing tests easier to gain better coverage?

As far as I know after a brief search, we have no similar such mechanism
for bitmaps, so commit-graphs are a unique instance.

> In any case, I can see why we want a separate knob specific to the
> CHANGED_PATHS until the path filter part becomes as stable as the
> rest of the graph file, but after some time, we should remove that
> knob, and at that point, we always use the path filter whenever we
> use commit-graph, so that we won't have to suffer from the ugliness.
>
> Wait.  I wonder if we can tweak the arrangement a bit so that this
> layer does not need to know any more than GIT_TEST_COMMIT_GRAPH?
>
> For example, in commit-graph.c::write_commit_graph(), can we test
> the CHANGED_PATHS environment variable and flip the .changed_paths
> bit, no matter what the 'flags' argument to the function says?

It may be cleaner to have 'GIT_TEST_COMMIT_GRAPH' specify the flags that
it wants as its value, but the additional coupling makes me somewhat
uncomfortable.

> Thanks.
>
> >  		return 1;
> >
> >  	repo_rerere(the_repository, 0);

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/3] revision: complicated pathspecs disable filters
  2020-04-11  1:02 ` [PATCH 1/3] revision: complicated pathspecs disable filters Derrick Stolee via GitGitGadget
  2020-04-11 21:40   ` Junio C Hamano
@ 2020-04-12 22:22   ` Taylor Blau
  2020-04-12 22:30     ` Junio C Hamano
  1 sibling, 1 reply; 48+ messages in thread
From: Taylor Blau @ 2020-04-12 22:22 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, jnareb, garimasigit, Derrick Stolee

Hi Stolee,

On Sat, Apr 11, 2020 at 01:02:59AM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The changed-path Bloom filters work only when we can compute an
> explicit Bloom filter key in advance. When a pathspec is given
> that allows case-insensitive checks or wildcard matching, we
> must disable the Bloom filter performance checks.
>
> By checking the pathspec in prepare_to_use_bloom_filters(), we
> avoid setting up the Bloom filter data and thus revert to the
> usual logic.

All makes sense to me, and this seems like the only reasonable thing
*to* do in this situation. That's fine, since we're not regressing,
we're just not using Bloom filters.

> Before this change, the following tests would fail*:
>
> 	t6004-rev-list-path-optim.sh (Tests 6-7)
> 	t6130-pathspec-noglob.sh (Tests 3-6)
> 	t6131-pathspec-icase.sh (Tests 3-5)
>
> *These tests would fail when using GIT_TEST_COMMIT_GRAPH and
> GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS except that the latter
> environment variable was not set up correctly to write the changed-
> path Bloom filters in the test suite. That will be fixed in the
> next change.

Nicely done.

> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  revision.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/revision.c b/revision.c
> index 2b06ee739c8..e37b5b06108 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -661,6 +661,15 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>  	if (!revs->commits)
>  	    return;
>

I certainly wouldn't complain about a comment here explaining these
three checks, but I suppose that the rationale is only a 'git blame'
away (and I guess that is faster now after this series ;-)).

> +	if (revs->prune_data.has_wildcard)
> +		return;
> +	if (revs->prune_data.nr > 1)
> +		return;
> +	if (revs->prune_data.magic ||
> +	    (revs->prune_data.nr &&
> +	     revs->prune_data.items[0].magic))
> +		return;
> +
>  	repo_parse_commit(revs->repo, revs->commits->item);
>
>  	if (!revs->repo->objects->commit_graph)
> --
> gitgitgadget

  Reviewed-by: Taylor Blau <me@ttaylorr.com>

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/3] revision: complicated pathspecs disable filters
  2020-04-12 22:22   ` Taylor Blau
@ 2020-04-12 22:30     ` Junio C Hamano
  2020-04-13  0:07       ` Taylor Blau
  0 siblings, 1 reply; 48+ messages in thread
From: Junio C Hamano @ 2020-04-12 22:30 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Derrick Stolee via GitGitGadget, git, jnareb, garimasigit,
	Derrick Stolee

Taylor Blau <me@ttaylorr.com> writes:

> I certainly wouldn't complain about a comment here explaining these
> three checks, but I suppose that the rationale is only a 'git blame'
> away (and I guess that is faster now after this series ;-)).
>
>> +	if (revs->prune_data.has_wildcard)
>> +		return;
>> +	if (revs->prune_data.nr > 1)
>> +		return;
>> +	if (revs->prune_data.magic ||
>> +	    (revs->prune_data.nr &&
>> +	     revs->prune_data.items[0].magic))

This says "any magic", but it is overly pessimistic to disable the
optimization for "literal" magic.  That magic is the one that lets
well written scripts to say "I have in a '$variable' that the user
gave me as a pathname (not pathspec), and it may have a wildcard
letter in it, but please treat it as a literal string" by prefixing
":(literal)" before that user-supplied data, so it is punishing well
disciplined folks.

>> +		return;
>> +
>>  	repo_parse_commit(revs->repo, revs->commits->item);
>>
>>  	if (!revs->repo->objects->commit_graph)
>> --
>> gitgitgadget
>
>   Reviewed-by: Taylor Blau <me@ttaylorr.com>
>
> Thanks,
> Taylor

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/3] revision: complicated pathspecs disable filters
  2020-04-12 22:30     ` Junio C Hamano
@ 2020-04-13  0:07       ` Taylor Blau
  2020-04-13 11:54         ` Derrick Stolee
  0 siblings, 1 reply; 48+ messages in thread
From: Taylor Blau @ 2020-04-13  0:07 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Taylor Blau, Derrick Stolee via GitGitGadget, git, jnareb,
	garimasigit, Derrick Stolee

On Sun, Apr 12, 2020 at 03:30:07PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > I certainly wouldn't complain about a comment here explaining these
> > three checks, but I suppose that the rationale is only a 'git blame'
> > away (and I guess that is faster now after this series ;-)).
> >
> >> +	if (revs->prune_data.has_wildcard)
> >> +		return;
> >> +	if (revs->prune_data.nr > 1)
> >> +		return;
> >> +	if (revs->prune_data.magic ||
> >> +	    (revs->prune_data.nr &&
> >> +	     revs->prune_data.items[0].magic))
>
> This says "any magic", but it is overly pessimistic to disable the
> optimization for "literal" magic.  That magic is the one that lets
> well written scripts to say "I have in a '$variable' that the user
> gave me as a pathname (not pathspec), and it may have a wildcard
> letter in it, but please treat it as a literal string" by prefixing
> ":(literal)" before that user-supplied data, so it is punishing well
> disciplined folks.

I hadn't thought of that, but it makes sense to me. How about something
like this squashed into this patch? I moved the if-chain that Stolee
introduced out to its own function, at least since they seem
well-contained and related to one another. I figure that this simplifies
the implementation of 'prepare_to_use_bloom_filter' by giving the reader
less to think about up-front.

diff --git a/revision.c b/revision.c
index 534c0bf996..15bf4ccff5 100644
--- a/revision.c
+++ b/revision.c
@@ -654,6 +654,18 @@ static void trace2_bloom_filter_statistics_atexit(void)
        jw_release(&jw);
 }

+static int has_bloom_key(struct pathspec *spec)
+{
+       if (spec->has_wildcard)
+               return 0;
+       if (spec->nr > 1)
+               return 0;
+       if ((spec->magic & ~PATHSPEC_LITERAL) ||
+           (spec->nr && spec->items[0].magic & ~PATHSPEC_LITERAL))
+               return 0;
+       return 1;
+}
+
 static void prepare_to_use_bloom_filter(struct rev_info *revs)
 {
        struct pathspec_item *pi;
@@ -665,13 +677,7 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
        if (!revs->commits)
            return;

-       if (revs->prune_data.has_wildcard)
-               return;
-       if (revs->prune_data.nr > 1)
-               return;
-       if (revs->prune_data.magic ||
-           (revs->prune_data.nr &&
-            revs->prune_data.items[0].magic))
+       if (!has_bloom_key(&revs->prune_data))
                return;

        repo_parse_commit(revs->repo, revs->commits->item);

> >> +		return;
> >> +
> >>  	repo_parse_commit(revs->repo, revs->commits->item);
> >>
> >>  	if (!revs->repo->objects->commit_graph)
> >> --
> >> gitgitgadget
> >
> >   Reviewed-by: Taylor Blau <me@ttaylorr.com>
> >
> > Thanks,
> > Taylor

Thanks,
Taylor

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/3] revision: complicated pathspecs disable filters
  2020-04-11 21:40   ` Junio C Hamano
@ 2020-04-13 11:49     ` Derrick Stolee
  2020-04-14 18:25       ` Junio C Hamano
  0 siblings, 1 reply; 48+ messages in thread
From: Derrick Stolee @ 2020-04-13 11:49 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, me, jnareb, garimasigit, Derrick Stolee

On 4/11/2020 5:40 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> The changed-path Bloom filters work only when we can compute an
>> explicit Bloom filter key in advance. When a pathspec is given
>> that allows case-insensitive checks or wildcard matching, we
>> must disable the Bloom filter performance checks.
> 
> How often do we want to do case-insensitive path matching, I wonder.
> 
> As somebody who never touches case-insensitive filesystem, I would
> be a bad judge, and I would imagine that I would be looking for a
> pathspec "[Mm]akefile" rather than ":(icase)makefile" myself if
> there are projects that may have either/both, so I do not mind
> disabling the bloom filter for case insensitivity myself.
> 
> But if users may use icase pathspec very often, it may be worth
> considering to build the bloom filter after downcasing the paths,
> perhaps?  Given that many projects extract their source code to a
> case insensitive filesystem, I would imagine that downcasing paths
> would map two originally different paths into the same thing only
> rarely, if ever, so there may not be much downside to do so.

This behavior could be extended later, and carefully. My initial
thought was that the case check would happen on every commit. If
the :(icase) check only happens at the walk tip(s), then we could
compute a single Bloom key at the start.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/3] revision: complicated pathspecs disable filters
  2020-04-13  0:07       ` Taylor Blau
@ 2020-04-13 11:54         ` Derrick Stolee
  0 siblings, 0 replies; 48+ messages in thread
From: Derrick Stolee @ 2020-04-13 11:54 UTC (permalink / raw)
  To: Taylor Blau, Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, git, jnareb, garimasigit,
	Derrick Stolee

On 4/12/2020 8:07 PM, Taylor Blau wrote:
> On Sun, Apr 12, 2020 at 03:30:07PM -0700, Junio C Hamano wrote:
>> Taylor Blau <me@ttaylorr.com> writes:
>>
>>> I certainly wouldn't complain about a comment here explaining these
>>> three checks, but I suppose that the rationale is only a 'git blame'
>>> away (and I guess that is faster now after this series ;-)).
>>>
>>>> +	if (revs->prune_data.has_wildcard)
>>>> +		return;
>>>> +	if (revs->prune_data.nr > 1)
>>>> +		return;
>>>> +	if (revs->prune_data.magic ||
>>>> +	    (revs->prune_data.nr &&
>>>> +	     revs->prune_data.items[0].magic))
>>
>> This says "any magic", but it is overly pessimistic to disable the
>> optimization for "literal" magic.  That magic is the one that lets
>> well written scripts to say "I have in a '$variable' that the user
>> gave me as a pathname (not pathspec), and it may have a wildcard
>> letter in it, but please treat it as a literal string" by prefixing
>> ":(literal)" before that user-supplied data, so it is punishing well
>> disciplined folks.

This is a good point. I'm unfamiliar with these advanced pathspec
tricks.

> I hadn't thought of that, but it makes sense to me. How about something
> like this squashed into this patch? I moved the if-chain that Stolee
> introduced out to its own function, at least since they seem
> well-contained and related to one another. I figure that this simplifies
> the implementation of 'prepare_to_use_bloom_filter' by giving the reader
> less to think about up-front.
> 
> diff --git a/revision.c b/revision.c
> index 534c0bf996..15bf4ccff5 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -654,6 +654,18 @@ static void trace2_bloom_filter_statistics_atexit(void)
>         jw_release(&jw);
>  }
> 
> +static int has_bloom_key(struct pathspec *spec)
> +{
> +       if (spec->has_wildcard)
> +               return 0;
> +       if (spec->nr > 1)
> +               return 0;
> +       if ((spec->magic & ~PATHSPEC_LITERAL) ||
> +           (spec->nr && spec->items[0].magic & ~PATHSPEC_LITERAL))
> +               return 0;
> +       return 1;
> +}
> +

Perhaps flip this on its head?

+static int forbids_bloom_key(struct pathspec *spec)
+{
+       if (spec->has_wildcard)
+               return 1;
+       if (spec->nr > 1)
+               return 1;
+       if (spec->magic & ~PATHSPEC_LITERAL)
+		return 1;
+	if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
+               return 1;
+       return 0;
+}
+

>  static void prepare_to_use_bloom_filter(struct rev_info *revs)
>  {
>         struct pathspec_item *pi;
> @@ -665,13 +677,7 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>         if (!revs->commits)
>             return;
> 
> -       if (revs->prune_data.has_wildcard)
> -               return;
> -       if (revs->prune_data.nr > 1)
> -               return;
> -       if (revs->prune_data.magic ||
> -           (revs->prune_data.nr &&
> -            revs->prune_data.items[0].magic))
> +       if (!has_bloom_key(&revs->prune_data))
>                 return;

Then this would be "if (forbids_bloom_key(&revs->prune_data))"

Generally, I like pulling this stuff out as a method to isolate and
label its purpose. If we wanted to allow certain :(icase) things
later, then we know what to modify in order to "allow" it.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/3] commit: write commit-graph with bloom filters
  2020-04-12 20:51     ` Taylor Blau
@ 2020-04-13 12:08       ` Derrick Stolee
  2020-04-13 22:11         ` Junio C Hamano
  0 siblings, 1 reply; 48+ messages in thread
From: Derrick Stolee @ 2020-04-13 12:08 UTC (permalink / raw)
  To: Taylor Blau, Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, git, jnareb, garimasigit,
	Derrick Stolee

On 4/12/2020 4:51 PM, Taylor Blau wrote:
> On Sat, Apr 11, 2020 at 02:57:18PM -0700, Junio C Hamano wrote:
>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>>> diff --git a/builtin/commit.c b/builtin/commit.c
>>> index d3e7781e658..d2fdfdc4363 100644
>>> --- a/builtin/commit.c
>>> +++ b/builtin/commit.c
>>> @@ -1701,7 +1701,9 @@ int cmd_commit(int argc, const char **argv, const char *prefix)
>>>  		      "not exceeded, and then \"git restore --staged :/\" to recover."));
>>>
>>>  	if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) &&
>>> -	    write_commit_graph_reachable(the_repository->objects->odb, 0, NULL))
>>> +	    write_commit_graph_reachable(the_repository->objects->odb,
>>> +					 git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0) ? COMMIT_GRAPH_WRITE_BLOOM_FILTERS : 0,
>>> +					 NULL))
>>
>> Yuck.  That is doubly mouthful.
> 
> Yeah, this is quite a mouthful indeed. I think the most conservative fix
> would be something like:
> 
>   if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0)) {
>     enum commit_graph_write_flags flags = 0;
>     if (git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0))
>       flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
> 
>     if (write_commit_graph_reachable(the_repository->objects->odb,
>                                      flags, NULL)) {
>       /* ... */
>     }
>   }
> 
> and then ripping this knob out (as Junio suggests below, which I think
> is a wise suggestion) out would be straightforward.
> 
>> I wonder how much value we are getting by having this call here in
>> the first place.  This function being cmd_commit(), by definition we
>> won't be updating the graph file unless the test script does "git
>> commit".  We won't update the graph file upon "git rebase", "git
>> merge", "git pull", "git fetch", etc., so it is not like with this
>> hack, the test coverage for various traversal using the graph file
>> would magically be perfect.  We'd still need an explicit call to
>> "commit-graph write" at strategic places in the test scripts, no?
> 
> Yeah. It's definitely not a silver bullet for the reasons you mention,
> but I think that it is helping out in some common cases. For example,
> if we moved this check out to 'test_commit', then we'd be relying on
> tests to use that instead of 'git commit'. On the other hand, this
> catches either, since they both wind up in this builtin.
> 
> I guess you could push this check down much further to when we're about
> to write a new object, and--if that new object is a commit--update the
> commit-graph. That sounds awfully slow (and feels to me like a hack, but
> I can't justify whether it's more or less of a hack than we already
> have), but I think it would be OK, since this isn't the normal way to
> run tests.

If we keep this simple, or extract the process to a
"write_commit_graph_for_tests()" macro inside builtin.h, then we could
insert a commit-graph write in more places.

However, I think there is value in testing the "not every commit is
included in the commit-graph" case, which our current setup does quite
nicely. The one exception is that 'git merge' could benefit from this
snippet, so we cap off the commit-graph whenever a test script
"constructs" a repository to match a data shape.

>> How are we testing with and without bitmaps, and do we have a kludge
>> like this one for the feature, or do we use a different mechanism
>> to help writing tests easier to gain better coverage?
> 
> As far as I know after a brief search, we have no similar such mechanism
> for bitmaps, so commit-graphs are a unique instance.

I would not be against a similar mechanism for bitmaps, but this is
really important for commit-graphs. The difference lies in which users
have commit-graphs versus bitmaps. Now that commit-graphs are on by
default and written by 'git gc', client users get commit-graphs unless
they opt-out. We need to test as many scenarios as possible to avoid
causing them disruption.

Who uses bitmaps? Primarily Git servers. The scenarios run on those
repos are much more restricted, and those scenarios are tested with
bitmaps explicitly.

>> In any case, I can see why we want a separate knob specific to the
>> CHANGED_PATHS until the path filter part becomes as stable as the
>> rest of the graph file, but after some time, we should remove that
>> knob, and at that point, we always use the path filter whenever we
>> use commit-graph, so that we won't have to suffer from the ugliness.
>>
>> Wait.  I wonder if we can tweak the arrangement a bit so that this
>> layer does not need to know any more than GIT_TEST_COMMIT_GRAPH?
>>
>> For example, in commit-graph.c::write_commit_graph(), can we test
>> the CHANGED_PATHS environment variable and flip the .changed_paths
>> bit, no matter what the 'flags' argument to the function says?
> 
> It may be cleaner to have 'GIT_TEST_COMMIT_GRAPH' specify the flags that
> it wants as its value, but the additional coupling makes me somewhat
> uncomfortable.

It would be easy to insert it here: 

diff --git a/commit-graph.c b/commit-graph.c
index 77668629e2..3e8f36ce5c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1968,6 +1968,8 @@ int write_commit_graph(struct object_directory *odb,
        ctx->check_oids = flags & COMMIT_GRAPH_WRITE_CHECK_OIDS ? 1 : 0;
        ctx->split_opts = split_opts;
        ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
+       if (git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0))
+               ctx->changed_paths = 1;
        ctx->total_bloom_filter_data_size = 0;
 
        if (ctx->split) {

The problem here is that if GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=1 and
someone runs "git commit-graph write --no-changed-paths" then the
negation of that option is ignored. But this is a GIT_TEST_* variable,
and any test that requires that check could disable the enviroment
variable first.

Thanks,
-Stolee

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 0/4] Integrate changed-path Bloom filters with 'git blame'
  2020-04-11  1:02 [PATCH 0/3] Integrate changed-path Bloom filters with 'git blame' Derrick Stolee via GitGitGadget
                   ` (3 preceding siblings ...)
  2020-04-11 21:30 ` [PATCH 0/3] Integrate changed-path Bloom filters with 'git blame' Junio C Hamano
@ 2020-04-13 14:45 ` Derrick Stolee via GitGitGadget
  2020-04-13 14:45   ` [PATCH v2 1/4] revision: complicated pathspecs disable filters Derrick Stolee via GitGitGadget
                     ` (5 more replies)
  4 siblings, 6 replies; 48+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-13 14:45 UTC (permalink / raw)
  To: git; +Cc: me, jnareb, garimasigit, Derrick Stolee

If the changed-path Bloom filters are relatively stable, then I propose
trying to build upon them as a way to discover any deficiencies. Also, it's
good to use them when we can.

The goal I set out to achieve was to use Bloom filters as much as possible
in git blame. I think I have achieved most of that, except I did not
integrate it with the -C mode. In that case, the blob-diff computation takes
a majority of the computation time, so short-circuiting the tree diff using
Bloom filters. Also, it's just really complicated. If someone else thinks
there is an easy win there, then please go ahead and give it a try with the
extra logic presented here in PATCH 3.

While I was playing around with Bloom filters and git blame, I purposefully
got it working with some scenarios but not all. Then I tried to trigger a
failing build in the blame tests using GIT_TEST_COMMIT_GRAPH=1 and 
GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=1. But the tests all succeeded!

Examining the data, I see that the commit-graph didn't have the Bloom filter
chunks at all. This is because we are not setting the flag to write them in
the right spot. The GIT_TEST_COMMIT_GRAPH=1 variable triggers a commit-graph
write during git commit, so we need to update the code there instead of just
inspecting the variable in git commit-graph write. (This is PATCH 2.)

By updating this variable, I saw some test failures in other tests regarding
non-exact pathspecs. I fixed these in PATCH 1 so we keep clean builds.

I based this change on [1] but it would apply cleanly (and logically) on
gs/commit-graph-path-filter

Updates in v2:

 * Added PATCH 3 to write commit-graph files during 'git merge' if
   GIT_TEST_COMMIT_GRAPH is enabled.
   
   
 * Updated PATCH 1 with the simplification recommended by Taylor.
   
   
 * Fixed the lower-case "bloom" in the commit message.
   
   

Thanks, -Stolee

[1] 
https://lore.kernel.org/git/pull.601.v2.git.1586437211842.gitgitgadget@gmail.com/

Derrick Stolee (4):
  revision: complicated pathspecs disable filters
  commit: write commit-graph with Bloom filters
  commit-graph: write commit-graph in more tests
  blame: use changed-path Bloom filters

 blame.c          | 139 ++++++++++++++++++++++++++++++++++++++++++++---
 blame.h          |   6 ++
 builtin/blame.c  |  10 ++++
 builtin/commit.c |   4 +-
 builtin/merge.c  |   7 ++-
 commit-graph.c   |   9 +++
 commit-graph.h   |   2 +
 revision.c       |  19 ++++++-
 8 files changed, 181 insertions(+), 15 deletions(-)


base-commit: f4df00a0dd448edce0e854a97f63598fefe27d27
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-609%2Fderrickstolee%2Fbloom-blame-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-609/derrickstolee/bloom-blame-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/609

Range-diff vs v1:

 1:  9cc31c289aa ! 1:  adc03eee4ac revision: complicated pathspecs disable filters
     @@ Commit message
          path Bloom filters in the test suite. That will be fixed in the
          next change.
      
     +    Helped-by: Taylor Blau <me@ttaylorr.com>
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## revision.c ##
     +@@ revision.c: static void trace2_bloom_filter_statistics_atexit(void)
     + 	jw_release(&jw);
     + }
     + 
     ++static int forbid_bloom_filters(struct pathspec *spec)
     ++{
     ++	if (spec->has_wildcard)
     ++		return 1;
     ++	if (spec->nr > 1)
     ++		return 1;
     ++	if (spec->magic & ~PATHSPEC_LITERAL)
     ++		return 1;
     ++	if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
     ++		return 1;
     ++
     ++	return 0;
     ++}
     ++
     + static void prepare_to_use_bloom_filter(struct rev_info *revs)
     + {
     + 	struct pathspec_item *pi;
      @@ revision.c: static void prepare_to_use_bloom_filter(struct rev_info *revs)
     - 	if (!revs->commits)
     - 	    return;
     + 	int len;
       
     -+	if (revs->prune_data.has_wildcard)
     -+		return;
     -+	if (revs->prune_data.nr > 1)
     -+		return;
     -+	if (revs->prune_data.magic ||
     -+	    (revs->prune_data.nr &&
     -+	     revs->prune_data.items[0].magic))
     + 	if (!revs->commits)
     +-	    return;
      +		return;
      +
     ++	if (forbid_bloom_filters(&revs->prune_data))
     ++		return;
     + 
       	repo_parse_commit(revs->repo, revs->commits->item);
       
     - 	if (!revs->repo->objects->commit_graph)
 2:  bb5ce39d028 < -:  ----------- commit: write commit-graph with bloom filters
 -:  ----------- > 2:  7e8f1aed113 commit: write commit-graph with Bloom filters
 -:  ----------- > 3:  824f8ad067b commit-graph: write commit-graph in more tests
 3:  431fde68031 = 4:  4ae196d6355 blame: use changed-path Bloom filters

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH v2 1/4] revision: complicated pathspecs disable filters
  2020-04-13 14:45 ` [PATCH v2 0/4] " Derrick Stolee via GitGitGadget
@ 2020-04-13 14:45   ` Derrick Stolee via GitGitGadget
  2020-04-13 16:09     ` Taylor Blau
  2020-04-13 14:45   ` [PATCH v2 2/4] commit: write commit-graph with Bloom filters Derrick Stolee via GitGitGadget
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 48+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-13 14:45 UTC (permalink / raw)
  To: git; +Cc: me, jnareb, garimasigit, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The changed-path Bloom filters work only when we can compute an
explicit Bloom filter key in advance. When a pathspec is given
that allows case-insensitive checks or wildcard matching, we
must disable the Bloom filter performance checks.

By checking the pathspec in prepare_to_use_bloom_filters(), we
avoid setting up the Bloom filter data and thus revert to the
usual logic.

Before this change, the following tests would fail*:

	t6004-rev-list-path-optim.sh (Tests 6-7)
	t6130-pathspec-noglob.sh (Tests 3-6)
	t6131-pathspec-icase.sh (Tests 3-5)

*These tests would fail when using GIT_TEST_COMMIT_GRAPH and
GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS except that the latter
environment variable was not set up correctly to write the changed-
path Bloom filters in the test suite. That will be fixed in the
next change.

Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/revision.c b/revision.c
index 2b06ee739c8..f78c636e4d0 100644
--- a/revision.c
+++ b/revision.c
@@ -650,6 +650,20 @@ static void trace2_bloom_filter_statistics_atexit(void)
 	jw_release(&jw);
 }
 
+static int forbid_bloom_filters(struct pathspec *spec)
+{
+	if (spec->has_wildcard)
+		return 1;
+	if (spec->nr > 1)
+		return 1;
+	if (spec->magic & ~PATHSPEC_LITERAL)
+		return 1;
+	if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
+		return 1;
+
+	return 0;
+}
+
 static void prepare_to_use_bloom_filter(struct rev_info *revs)
 {
 	struct pathspec_item *pi;
@@ -659,7 +673,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 	int len;
 
 	if (!revs->commits)
-	    return;
+		return;
+
+	if (forbid_bloom_filters(&revs->prune_data))
+		return;
 
 	repo_parse_commit(revs->repo, revs->commits->item);
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 2/4] commit: write commit-graph with Bloom filters
  2020-04-13 14:45 ` [PATCH v2 0/4] " Derrick Stolee via GitGitGadget
  2020-04-13 14:45   ` [PATCH v2 1/4] revision: complicated pathspecs disable filters Derrick Stolee via GitGitGadget
@ 2020-04-13 14:45   ` Derrick Stolee via GitGitGadget
  2020-04-13 16:12     ` Taylor Blau
  2020-04-13 14:45   ` [PATCH v2 3/4] commit-graph: write commit-graph in more tests Derrick Stolee via GitGitGadget
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 48+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-13 14:45 UTC (permalink / raw)
  To: git; +Cc: me, jnareb, garimasigit, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The GIT_TEST_COMMIT_GRAPH environment variable updates the commit-
graph file whenever "git commit" is run, ensuring that we always
have an updated commit-graph throughout the test suite. The
GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS environment variable was
introduced to write the changed-path Bloom filters whenever "git
commit-graph write" is run. However, the GIT_TEST_COMMIT_GRAPH
trick doesn't launch a separate process and instead writes it
directly.

Update the "git commit" builtin to write changed-path Bloom filters
when both GIT_TEST_COMMIT_GRAPH and GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS
are enabled.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 77668629e27..3e8f36ce5c8 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1968,6 +1968,8 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->check_oids = flags & COMMIT_GRAPH_WRITE_CHECK_OIDS ? 1 : 0;
 	ctx->split_opts = split_opts;
 	ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
+	if (git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0))
+		ctx->changed_paths = 1;
 	ctx->total_bloom_filter_data_size = 0;
 
 	if (ctx->split) {
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 3/4] commit-graph: write commit-graph in more tests
  2020-04-13 14:45 ` [PATCH v2 0/4] " Derrick Stolee via GitGitGadget
  2020-04-13 14:45   ` [PATCH v2 1/4] revision: complicated pathspecs disable filters Derrick Stolee via GitGitGadget
  2020-04-13 14:45   ` [PATCH v2 2/4] commit: write commit-graph with Bloom filters Derrick Stolee via GitGitGadget
@ 2020-04-13 14:45   ` Derrick Stolee via GitGitGadget
  2020-04-13 14:45   ` [PATCH v2 4/4] blame: use changed-path Bloom filters Derrick Stolee via GitGitGadget
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 48+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-13 14:45 UTC (permalink / raw)
  To: git; +Cc: me, jnareb, garimasigit, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The GIT_TEST_COMMIT_GRAPH test environment variable triggers
commit-graph writes during each "git commit" process. To expand
the number of tests that have commits in the commit-graph file,
add a helper method that computes the commit-graph and place
that helper inside "git commit" and "git merge".

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/commit.c | 4 +---
 builtin/merge.c  | 7 +++++--
 commit-graph.c   | 7 +++++++
 commit-graph.h   | 2 ++
 4 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/builtin/commit.c b/builtin/commit.c
index d3e7781e658..27d4ff6b790 100644
--- a/builtin/commit.c
+++ b/builtin/commit.c
@@ -1700,9 +1700,7 @@ int cmd_commit(int argc, const char **argv, const char *prefix)
 		      "new_index file. Check that disk is not full and quota is\n"
 		      "not exceeded, and then \"git restore --staged :/\" to recover."));
 
-	if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) &&
-	    write_commit_graph_reachable(the_repository->objects->odb, 0, NULL))
-		return 1;
+	git_test_write_commit_graph_or_die();
 
 	repo_rerere(the_repository, 0);
 	run_command_v_opt(argv_gc_auto, RUN_GIT_CMD);
diff --git a/builtin/merge.c b/builtin/merge.c
index d127d2225f8..db11af5b1cd 100644
--- a/builtin/merge.c
+++ b/builtin/merge.c
@@ -40,6 +40,7 @@
 #include "branch.h"
 #include "commit-reach.h"
 #include "wt-status.h"
+#include "commit-graph.h"
 
 #define DEFAULT_TWOHEAD (1<<0)
 #define DEFAULT_OCTOPUS (1<<1)
@@ -1673,9 +1674,11 @@ int cmd_merge(int argc, const char **argv, const char *prefix)
 				   head_commit);
 	}
 
-	if (squash)
+	if (squash) {
 		finish(head_commit, remoteheads, NULL, NULL);
-	else
+
+		git_test_write_commit_graph_or_die();
+	} else
 		write_merge_state(remoteheads);
 
 	if (merge_was_ok)
diff --git a/commit-graph.c b/commit-graph.c
index 3e8f36ce5c8..5d64cb5f09c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -19,6 +19,13 @@
 #include "bloom.h"
 #include "commit-slab.h"
 
+void git_test_write_commit_graph_or_die(void)
+{
+	if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) &&
+	    write_commit_graph_reachable(the_repository->objects->odb, 0, NULL))
+		die("failed to write commit-graph under GIT_TEST_COMMIT_GRAPH");
+}
+
 #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
 #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
 #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
diff --git a/commit-graph.h b/commit-graph.h
index 8655d064c14..8f852a9bee2 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -11,6 +11,8 @@
 #define GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD"
 #define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
 
+void git_test_write_commit_graph_or_die(void);
+
 struct commit;
 struct bloom_filter_settings;
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 4/4] blame: use changed-path Bloom filters
  2020-04-13 14:45 ` [PATCH v2 0/4] " Derrick Stolee via GitGitGadget
                     ` (2 preceding siblings ...)
  2020-04-13 14:45   ` [PATCH v2 3/4] commit-graph: write commit-graph in more tests Derrick Stolee via GitGitGadget
@ 2020-04-13 14:45   ` Derrick Stolee via GitGitGadget
  2020-04-13 16:21   ` [PATCH v2 0/4] Integrate changed-path Bloom filters with 'git blame' Taylor Blau
  2020-04-16 20:14   ` [PATCH v3 0/3] " Derrick Stolee via GitGitGadget
  5 siblings, 0 replies; 48+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-13 14:45 UTC (permalink / raw)
  To: git; +Cc: me, jnareb, garimasigit, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The changed-path Bloom filters help reduce the amount of tree
parsing required during history queries. Before calculating a
diff, we can ask the filter if a path changed between a commit
and its first parent. If the filter says "no" then we can move
on without parsing trees. If the filter says "maybe" then we
parse trees to discover if the answer is actually "yes" or "no".

When computing a blame, there is a section in find_origin() that
computes a diff between a commit and one of its parents. When this
is the first parent, we can check the Bloom filters before calling
diff_tree_oid().

In order to make this work with the blame machinery, we need to
initialize a struct bloom_key with the initial path. But also, we
need to add more keys to a list if a rename is detected. We then
check to see if _any_ of these keys answer "maybe" in the diff.

During development, I purposefully left out this "add a new key
when a rename is detected" to see if the test suite would catch
my error. That is how I discovered the issues with
GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS from the previous change.
With that change, we can feel some confidence in the coverage of
this change.

If a user requests copy detection using "git blame -C", then there
are more places where the set of "important" files can expand. I
do not know enough about how this happens in the blame machinery.
Thus, the Bloom filter integration is explicitly disabled in this
mode. A later change could expand the bloom_key data with an
appropriate call (or calls) to add_bloom_key().

If we did not disable this mode, then the following tests would
fail:

	t8003-blame-corner-cases.sh
	t8011-blame-split-file.sh

Generally, this is a performance enhancement and should not
change the behavior of 'git blame' in any way. If a repo has a
commit-graph file with computed changed-path Bloom filters, then
they should notice improved performance for their 'git blame'
commands.

Here are some example timings that I found by blaming some paths
in the Linux kernel repository:

 git blame arch/x86/kernel/topology.c >/dev/null

 Before: 0.83s
  After: 0.24s

 git blame kernel/time/time.c >/dev/null

 Before: 0.72s
  After: 0.24s

 git blame tools/perf/ui/stdio/hist.c >/dev/null

 Before: 0.27s
  After: 0.11s

I specifically looked for "deep" paths that were also edited many
times. As a counterpoint, the MAINTAINERS file was edited many
times but is located in the root tree. This means that the cost of
computing a diff relative to the pathspec is very small. Here are
the timings for that command:

 git blame MAINTAINERS >/dev/null

 Before: 20.1s
  After: 18.0s

These timings are the best of five. The worst-case runs were on the
order of 2.5 minutes for both cases. Note that the MAINTAINERS file
has 18,740 lines across 17,000+ commits. This happens to be one of
the cases where this change provides the least improvement.

The lack of improvement for the MAINTAINERS file and the relatively
modest improvement for the other examples can be easily explained.
The blame machinery needs to compute line-level diffs to determine
which lines were changed by each commit. That makes up a large
proportion of the computation time, and this change does not
attempt to improve on that section of the algorithm. The
MAINTAINERS file is large and changed often, so it takes time to
determine which lines were updated by which commit. In contrast,
the code files are much smaller, and it takes longer to comute
the line-by-line diff for a single patch on the Linux mailing
lists.

Outside of the "-C" integration, I believe there is little more to
gain from the changed-path Bloom filters for 'git blame' after this
patch.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 blame.c         | 139 ++++++++++++++++++++++++++++++++++++++++++++----
 blame.h         |   6 +++
 builtin/blame.c |  10 ++++
 3 files changed, 146 insertions(+), 9 deletions(-)

diff --git a/blame.c b/blame.c
index 29770e5c81c..9fbf79e47c3 100644
--- a/blame.c
+++ b/blame.c
@@ -9,6 +9,8 @@
 #include "blame.h"
 #include "alloc.h"
 #include "commit-slab.h"
+#include "bloom.h"
+#include "commit-graph.h"
 
 define_commit_slab(blame_suspects, struct blame_origin *);
 static struct blame_suspects blame_suspects;
@@ -1246,13 +1248,75 @@ static int fill_blob_sha1_and_mode(struct repository *r,
 	return -1;
 }
 
+struct blame_bloom_data {
+	/*
+	 * Changed-path Bloom filter keys. These can help prevent
+	 * computing diffs against first parents, but we need to
+	 * expand the list as code is moved or files are renamed.
+	 */
+	struct bloom_filter_settings *settings;
+	struct bloom_key **keys;
+	int nr;
+	int alloc;
+};
+
+static int bloom_count_queries = 0;
+static int bloom_count_no = 0;
+static int maybe_changed_path(struct repository *r,
+			      struct commit *parent,
+			      struct blame_origin *origin,
+			      struct blame_bloom_data *bd)
+{
+	int i;
+	struct bloom_filter *filter;
+
+	if (!bd)
+		return 1;
+
+	if (origin->commit->generation == GENERATION_NUMBER_INFINITY)
+		return 1;
+
+	filter = get_bloom_filter(r, origin->commit, 0);
+
+	if (!filter)
+		return 1;
+
+	bloom_count_queries++;
+	for (i = 0; i < bd->nr; i++) {
+		if (bloom_filter_contains(filter,
+					  bd->keys[i],
+					  bd->settings))
+			return 1;
+	}
+
+	bloom_count_no++;
+	return 0;
+}
+
+static void add_bloom_key(struct blame_bloom_data *bd,
+			  const char *path)
+{
+	if (!bd)
+		return;
+
+	if (bd->nr >= bd->alloc) {
+		bd->alloc *= 2;
+		REALLOC_ARRAY(bd->keys, bd->alloc);
+	}
+
+	bd->keys[bd->nr] = xmalloc(sizeof(struct bloom_key));
+	fill_bloom_key(path, strlen(path), bd->keys[bd->nr], bd->settings);
+	bd->nr++;
+}
+
 /*
  * We have an origin -- check if the same path exists in the
  * parent and return an origin structure to represent it.
  */
 static struct blame_origin *find_origin(struct repository *r,
 					struct commit *parent,
-					struct blame_origin *origin)
+					struct blame_origin *origin,
+					struct blame_bloom_data *bd)
 {
 	struct blame_origin *porigin;
 	struct diff_options diff_opts;
@@ -1286,10 +1350,19 @@ static struct blame_origin *find_origin(struct repository *r,
 
 	if (is_null_oid(&origin->commit->object.oid))
 		do_diff_cache(get_commit_tree_oid(parent), &diff_opts);
-	else
-		diff_tree_oid(get_commit_tree_oid(parent),
-			      get_commit_tree_oid(origin->commit),
-			      "", &diff_opts);
+	else {
+		int compute_diff = 1;
+		if (origin->commit->parents &&
+		    !oidcmp(&parent->object.oid,
+			    &origin->commit->parents->item->object.oid))
+			compute_diff = maybe_changed_path(r, parent,
+							  origin, bd);
+
+		if (compute_diff)
+			diff_tree_oid(get_commit_tree_oid(parent),
+				      get_commit_tree_oid(origin->commit),
+				      "", &diff_opts);
+	}
 	diffcore_std(&diff_opts);
 
 	if (!diff_queued_diff.nr) {
@@ -1341,7 +1414,8 @@ static struct blame_origin *find_origin(struct repository *r,
  */
 static struct blame_origin *find_rename(struct repository *r,
 					struct commit *parent,
-					struct blame_origin *origin)
+					struct blame_origin *origin,
+					struct blame_bloom_data *bd)
 {
 	struct blame_origin *porigin = NULL;
 	struct diff_options diff_opts;
@@ -1366,6 +1440,7 @@ static struct blame_origin *find_rename(struct repository *r,
 		struct diff_filepair *p = diff_queued_diff.queue[i];
 		if ((p->status == 'R' || p->status == 'C') &&
 		    !strcmp(p->two->path, origin->path)) {
+			add_bloom_key(bd, p->one->path);
 			porigin = get_origin(parent, p->one->path);
 			oidcpy(&porigin->blob_oid, &p->one->oid);
 			porigin->mode = p->one->mode;
@@ -2332,6 +2407,11 @@ static void distribute_blame(struct blame_scoreboard *sb, struct blame_entry *bl
 
 #define MAXSG 16
 
+typedef struct blame_origin *(*blame_find_alg)(struct repository *,
+					       struct commit *,
+					       struct blame_origin *,
+					       struct blame_bloom_data *);
+
 static void pass_blame(struct blame_scoreboard *sb, struct blame_origin *origin, int opt)
 {
 	struct rev_info *revs = sb->revs;
@@ -2356,8 +2436,7 @@ static void pass_blame(struct blame_scoreboard *sb, struct blame_origin *origin,
 	 * common cases, then we look for renames in the second pass.
 	 */
 	for (pass = 0; pass < 2 - sb->no_whole_file_rename; pass++) {
-		struct blame_origin *(*find)(struct repository *, struct commit *, struct blame_origin *);
-		find = pass ? find_rename : find_origin;
+		blame_find_alg find = pass ? find_rename : find_origin;
 
 		for (i = 0, sg = first_scapegoat(revs, commit, sb->reverse);
 		     i < num_sg && sg;
@@ -2369,7 +2448,7 @@ static void pass_blame(struct blame_scoreboard *sb, struct blame_origin *origin,
 				continue;
 			if (parse_commit(p))
 				continue;
-			porigin = find(sb->repo, p, origin);
+			porigin = find(sb->repo, p, origin, sb->bloom_data);
 			if (!porigin)
 				continue;
 			if (oideq(&porigin->blob_oid, &origin->blob_oid)) {
@@ -2809,3 +2888,45 @@ struct blame_entry *blame_entry_prepend(struct blame_entry *head,
 	blame_origin_incref(o);
 	return new_head;
 }
+
+void setup_blame_bloom_data(struct blame_scoreboard *sb,
+			    const char *path)
+{
+	struct blame_bloom_data *bd;
+
+	if (!sb->repo->objects->commit_graph)
+		return;
+
+	if (!sb->repo->objects->commit_graph->bloom_filter_settings)
+		return;
+
+	bd = xmalloc(sizeof(struct blame_bloom_data));
+
+	bd->settings = sb->repo->objects->commit_graph->bloom_filter_settings;
+
+	bd->alloc = 4;
+	bd->nr = 0;
+	ALLOC_ARRAY(bd->keys, bd->alloc);
+
+	add_bloom_key(bd, path);
+
+	sb->bloom_data = bd;
+}
+
+void cleanup_scoreboard(struct blame_scoreboard *sb)
+{
+	if (sb->bloom_data) {
+		int i;
+		for (i = 0; i < sb->bloom_data->nr; i++) {
+			free(sb->bloom_data->keys[i]->hashes);
+			free(sb->bloom_data->keys[i]);
+		}
+		free(sb->bloom_data->keys);
+		FREE_AND_NULL(sb->bloom_data);
+
+		trace2_data_intmax("blame", sb->repo,
+				   "bloom/queries", bloom_count_queries);
+		trace2_data_intmax("blame", sb->repo,
+				   "bloom/response-no", bloom_count_no);
+	}
+}
diff --git a/blame.h b/blame.h
index 089b181ff27..b6bbee41472 100644
--- a/blame.h
+++ b/blame.h
@@ -100,6 +100,8 @@ struct blame_entry {
 	int unblamable;
 };
 
+struct blame_bloom_data;
+
 /*
  * The current state of the blame assignment.
  */
@@ -156,6 +158,7 @@ struct blame_scoreboard {
 	void(*found_guilty_entry)(struct blame_entry *, void *);
 
 	void *found_guilty_entry_data;
+	struct blame_bloom_data *bloom_data;
 };
 
 /*
@@ -180,6 +183,9 @@ void init_scoreboard(struct blame_scoreboard *sb);
 void setup_scoreboard(struct blame_scoreboard *sb,
 		      const char *path,
 		      struct blame_origin **orig);
+void setup_blame_bloom_data(struct blame_scoreboard *sb,
+			    const char *path);
+void cleanup_scoreboard(struct blame_scoreboard *sb);
 
 struct blame_entry *blame_entry_prepend(struct blame_entry *head,
 					long start, long end,
diff --git a/builtin/blame.c b/builtin/blame.c
index bf1cecdf3f9..3c13634f279 100644
--- a/builtin/blame.c
+++ b/builtin/blame.c
@@ -1061,6 +1061,14 @@ int cmd_blame(int argc, const char **argv, const char *prefix)
 	string_list_clear(&ignore_revs_file_list, 0);
 	string_list_clear(&ignore_rev_list, 0);
 	setup_scoreboard(&sb, path, &o);
+
+	/*
+	 * Changed-path Bloom filters are disabled when looking
+	 * for copies.
+	 */
+	if (!(opt & PICKAXE_BLAME_COPY))
+		setup_blame_bloom_data(&sb, path);
+
 	lno = sb.num_lines;
 
 	if (lno && !range_list.nr)
@@ -1164,5 +1172,7 @@ int cmd_blame(int argc, const char **argv, const char *prefix)
 		printf("num get patch: %d\n", sb.num_get_patch);
 		printf("num commits: %d\n", sb.num_commits);
 	}
+
+	cleanup_scoreboard(&sb);
 	return 0;
 }
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 1/4] revision: complicated pathspecs disable filters
  2020-04-13 14:45   ` [PATCH v2 1/4] revision: complicated pathspecs disable filters Derrick Stolee via GitGitGadget
@ 2020-04-13 16:09     ` Taylor Blau
  2020-04-13 22:18       ` Junio C Hamano
  0 siblings, 1 reply; 48+ messages in thread
From: Taylor Blau @ 2020-04-13 16:09 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, jnareb, garimasigit, Derrick Stolee

On Mon, Apr 13, 2020 at 02:45:23PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The changed-path Bloom filters work only when we can compute an
> explicit Bloom filter key in advance. When a pathspec is given
> that allows case-insensitive checks or wildcard matching, we
> must disable the Bloom filter performance checks.
>
> By checking the pathspec in prepare_to_use_bloom_filters(), we
> avoid setting up the Bloom filter data and thus revert to the
> usual logic.
>
> Before this change, the following tests would fail*:
>
> 	t6004-rev-list-path-optim.sh (Tests 6-7)
> 	t6130-pathspec-noglob.sh (Tests 3-6)
> 	t6131-pathspec-icase.sh (Tests 3-5)
>
> *These tests would fail when using GIT_TEST_COMMIT_GRAPH and
> GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS except that the latter
> environment variable was not set up correctly to write the changed-
> path Bloom filters in the test suite. That will be fixed in the
> next change.
>
> Helped-by: Taylor Blau <me@ttaylorr.com>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  revision.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/revision.c b/revision.c
> index 2b06ee739c8..f78c636e4d0 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -650,6 +650,20 @@ static void trace2_bloom_filter_statistics_atexit(void)
>  	jw_release(&jw);
>  }
>
> +static int forbid_bloom_filters(struct pathspec *spec)
> +{
> +	if (spec->has_wildcard)
> +		return 1;
> +	if (spec->nr > 1)
> +		return 1;
> +	if (spec->magic & ~PATHSPEC_LITERAL)
> +		return 1;
> +	if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
> +		return 1;
> +
> +	return 0;
> +}
> +
>  static void prepare_to_use_bloom_filter(struct rev_info *revs)
>  {
>  	struct pathspec_item *pi;
> @@ -659,7 +673,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>  	int len;
>
>  	if (!revs->commits)
> -	    return;
> +		return;
> +
> +	if (forbid_bloom_filters(&revs->prune_data))
> +		return;
>
>  	repo_parse_commit(revs->repo, revs->commits->item);
>
> --
> gitgitgadget
>

Nicely done, this looks good to me. Thanks.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 2/4] commit: write commit-graph with Bloom filters
  2020-04-13 14:45   ` [PATCH v2 2/4] commit: write commit-graph with Bloom filters Derrick Stolee via GitGitGadget
@ 2020-04-13 16:12     ` Taylor Blau
  2020-04-13 22:21       ` Junio C Hamano
  0 siblings, 1 reply; 48+ messages in thread
From: Taylor Blau @ 2020-04-13 16:12 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, jnareb, garimasigit, Derrick Stolee

On Mon, Apr 13, 2020 at 02:45:24PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The GIT_TEST_COMMIT_GRAPH environment variable updates the commit-
> graph file whenever "git commit" is run, ensuring that we always
> have an updated commit-graph throughout the test suite. The
> GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS environment variable was
> introduced to write the changed-path Bloom filters whenever "git
> commit-graph write" is run. However, the GIT_TEST_COMMIT_GRAPH
> trick doesn't launch a separate process and instead writes it
> directly.
>
> Update the "git commit" builtin to write changed-path Bloom filters
> when both GIT_TEST_COMMIT_GRAPH and GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS
> are enabled.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 77668629e27..3e8f36ce5c8 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1968,6 +1968,8 @@ int write_commit_graph(struct object_directory *odb,
>  	ctx->check_oids = flags & COMMIT_GRAPH_WRITE_CHECK_OIDS ? 1 : 0;
>  	ctx->split_opts = split_opts;
>  	ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
> +	if (git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0))
> +		ctx->changed_paths = 1;

Hmm. I'm not crazy about a library function looking at 'GIT_TEST_*'
environment variables and potentially ignoring its arguments, but given
the discussion we had in v1, I don't feel strongly enough to recommend
that you change this.

For what it's worth, I think that 'write_commit_graph' could behave more
purely if callers checked 'GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS' and set
'flags' appropriately from the outside, but I can understand also why
it's preferred to do it in this style, too.

>  	ctx->total_bloom_filter_data_size = 0;
>
>  	if (ctx->split) {
> --
> gitgitgadget
>

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 0/4] Integrate changed-path Bloom filters with 'git blame'
  2020-04-13 14:45 ` [PATCH v2 0/4] " Derrick Stolee via GitGitGadget
                     ` (3 preceding siblings ...)
  2020-04-13 14:45   ` [PATCH v2 4/4] blame: use changed-path Bloom filters Derrick Stolee via GitGitGadget
@ 2020-04-13 16:21   ` Taylor Blau
  2020-04-16 20:14   ` [PATCH v3 0/3] " Derrick Stolee via GitGitGadget
  5 siblings, 0 replies; 48+ messages in thread
From: Taylor Blau @ 2020-04-13 16:21 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, jnareb, garimasigit, Derrick Stolee

On Mon, Apr 13, 2020 at 02:45:22PM +0000, Derrick Stolee via GitGitGadget wrote:
> Updates in v2:
>
>  * Added PATCH 3 to write commit-graph files during 'git merge' if
>    GIT_TEST_COMMIT_GRAPH is enabled.
>
>
>  * Updated PATCH 1 with the simplification recommended by Taylor.
>
>
>  * Fixed the lower-case "bloom" in the commit message.

This updated version looks great to me. Thanks for looking at all of my
suggestions :-).

  Reviewed-by: Taylor Blau <me@ttaylorr.com>

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/3] commit: write commit-graph with bloom filters
  2020-04-13 12:08       ` Derrick Stolee
@ 2020-04-13 22:11         ` Junio C Hamano
  0 siblings, 0 replies; 48+ messages in thread
From: Junio C Hamano @ 2020-04-13 22:11 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Taylor Blau, Derrick Stolee via GitGitGadget, git, jnareb,
	garimasigit, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> If we keep this simple, or extract the process to a
> "write_commit_graph_for_tests()" macro inside builtin.h, then we could
> insert a commit-graph write in more places.

As long as the check necessary is cheap enough to realize that we
are in production mode, we should be able to keep the run-time
overhead to the minimum.  Sprinkling such a call all over the place,
however, might add to the overhead of reading code, though.

> However, I think there is value in testing the "not every commit is
> included in the commit-graph" case, which our current setup does quite
> nicely.

Yeah, that is also a good point.

> The one exception is that 'git merge' could benefit from this
> snippet, so we cap off the commit-graph whenever a test script
> "constructs" a repository to match a data shape.

Sounds good.

> The problem here is that if GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=1 and
> someone runs "git commit-graph write --no-changed-paths" then the
> negation of that option is ignored. But this is a GIT_TEST_* variable,
> and any test that requires that check could disable the enviroment
> variable first.

Yeah, that sounds good.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 1/4] revision: complicated pathspecs disable filters
  2020-04-13 16:09     ` Taylor Blau
@ 2020-04-13 22:18       ` Junio C Hamano
  0 siblings, 0 replies; 48+ messages in thread
From: Junio C Hamano @ 2020-04-13 22:18 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Derrick Stolee via GitGitGadget, git, jnareb, garimasigit,
	Derrick Stolee

Taylor Blau <me@ttaylorr.com> writes:

>> +static int forbid_bloom_filters(struct pathspec *spec)
>> +{
>> +	if (spec->has_wildcard)
>> +		return 1;
>> +	if (spec->nr > 1)
>> +		return 1;
>> +	if (spec->magic & ~PATHSPEC_LITERAL)
>> +		return 1;
>> +	if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
>> +		return 1;
>> +
>> +	return 0;
>> +}
>> +
>>  static void prepare_to_use_bloom_filter(struct rev_info *revs)
>>  {
>>  	struct pathspec_item *pi;
>> @@ -659,7 +673,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>>  	int len;
>>
>>  	if (!revs->commits)
>> -	    return;
>> +		return;
>> +
>> +	if (forbid_bloom_filters(&revs->prune_data))
>> +		return;
>>
>>  	repo_parse_commit(revs->repo, revs->commits->item);
>>
>> --
>> gitgitgadget
>>
>
> Nicely done, this looks good to me. Thanks.

Likewise.  Very exciting.

Will queue.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 2/4] commit: write commit-graph with Bloom filters
  2020-04-13 16:12     ` Taylor Blau
@ 2020-04-13 22:21       ` Junio C Hamano
  2020-04-14 15:04         ` Derrick Stolee
  0 siblings, 1 reply; 48+ messages in thread
From: Junio C Hamano @ 2020-04-13 22:21 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Derrick Stolee via GitGitGadget, git, jnareb, garimasigit,
	Derrick Stolee

Taylor Blau <me@ttaylorr.com> writes:

> Hmm. I'm not crazy about a library function looking at 'GIT_TEST_*'
> environment variables and potentially ignoring its arguments, but given
> the discussion we had in v1, I don't feel strongly enough to recommend
> that you change this.
>
> For what it's worth, I think that 'write_commit_graph' could behave more
> purely if callers checked 'GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS' and set
> 'flags' appropriately from the outside,...

Yeah, I agree that it would be a lot cleaner if that is easy to
arrange.  I have a suspicion that Derrick already tried and the
resulting code looked messier and was discarded?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 2/4] commit: write commit-graph with Bloom filters
  2020-04-13 22:21       ` Junio C Hamano
@ 2020-04-14 15:04         ` Derrick Stolee
  2020-04-14 17:26           ` Junio C Hamano
  0 siblings, 1 reply; 48+ messages in thread
From: Derrick Stolee @ 2020-04-14 15:04 UTC (permalink / raw)
  To: Junio C Hamano, Taylor Blau
  Cc: Derrick Stolee via GitGitGadget, git, jnareb, garimasigit,
	Derrick Stolee

On 4/13/2020 6:21 PM, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
> 
>> Hmm. I'm not crazy about a library function looking at 'GIT_TEST_*'
>> environment variables and potentially ignoring its arguments, but given
>> the discussion we had in v1, I don't feel strongly enough to recommend
>> that you change this.
>>
>> For what it's worth, I think that 'write_commit_graph' could behave more
>> purely if callers checked 'GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS' and set
>> 'flags' appropriately from the outside,...
> 
> Yeah, I agree that it would be a lot cleaner if that is easy to
> arrange.  I have a suspicion that Derrick already tried and the
> resulting code looked messier and was discarded?

Perhaps I could fix both concerns by

1. Use a macro instead of a library call.

2. Check the _CHANGED_PATHS variable in the macro.

The macro would look like this:


#define git_test_write_commit_graph_or_die() \
	if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0)) { \
		int flags = 0; \
		if (git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0)) \
			flags = COMMIT_GRAPH_WRITE_BLOOM_FILTERS; \
		if (write_commit_graph_reachable( \
			the_repository->objects->odb, flags, NULL)) \
			die("failed to write commit-graph under GIT_TEST_COMMIT_GRAPH"); \
	}

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 2/4] commit: write commit-graph with Bloom filters
  2020-04-14 15:04         ` Derrick Stolee
@ 2020-04-14 17:26           ` Junio C Hamano
  2020-04-14 17:40             ` Derrick Stolee
  0 siblings, 1 reply; 48+ messages in thread
From: Junio C Hamano @ 2020-04-14 17:26 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Taylor Blau, Derrick Stolee via GitGitGadget, git, jnareb,
	garimasigit, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> On 4/13/2020 6:21 PM, Junio C Hamano wrote:
>> Taylor Blau <me@ttaylorr.com> writes:
>> 
>>> Hmm. I'm not crazy about a library function looking at 'GIT_TEST_*'
>>> environment variables and potentially ignoring its arguments, but given
>>> the discussion we had in v1, I don't feel strongly enough to recommend
>>> that you change this.
>>>
>>> For what it's worth, I think that 'write_commit_graph' could behave more
>>> purely if callers checked 'GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS' and set
>>> 'flags' appropriately from the outside,...
>> 
>> Yeah, I agree that it would be a lot cleaner if that is easy to
>> arrange.  I have a suspicion that Derrick already tried and the
>> resulting code looked messier and was discarded?
>
> Perhaps I could fix both concerns by
>
> 1. Use a macro instead of a library call.
>
> 2. Check the _CHANGED_PATHS variable in the macro.

I am not sure how use of a macro "fixes" purity, though.  And what
is the other concern?

How widely would this "if we are testing, write out the graph file"
call be sprinkled over the codebase?  I am hoping that it won't be
"everywhere", but only at strategic places (like "just once before
we leave a subcommand that creates one or bunch of commits").  And
how often would they be called?  I am also hoping that it won't be
"inside a tight loop".  In short, I am wondering if we can promise
our codebase that 

 - git_test_write_commit_graph_or_die() calls won't be an eyesore
   (and/or distraction) for developers too much.

 - git_env_bool() call won't have performance impact.


> The macro would look like this:
>
>
> #define git_test_write_commit_graph_or_die() \
> 	if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0)) { \
> 		int flags = 0; \
> 		if (git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0)) \
> 			flags = COMMIT_GRAPH_WRITE_BLOOM_FILTERS; \
> 		if (write_commit_graph_reachable( \
> 			the_repository->objects->odb, flags, NULL)) \
> 			die("failed to write commit-graph under GIT_TEST_COMMIT_GRAPH"); \
> 	}
>
> Thanks,
> -Stolee

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 2/4] commit: write commit-graph with Bloom filters
  2020-04-14 17:26           ` Junio C Hamano
@ 2020-04-14 17:40             ` Derrick Stolee
  2020-04-15  0:17               ` Taylor Blau
  0 siblings, 1 reply; 48+ messages in thread
From: Derrick Stolee @ 2020-04-14 17:40 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Taylor Blau, Derrick Stolee via GitGitGadget, git, jnareb,
	garimasigit, Derrick Stolee

On 4/14/2020 1:26 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> 
>> On 4/13/2020 6:21 PM, Junio C Hamano wrote:
>>> Taylor Blau <me@ttaylorr.com> writes:
>>>
>>>> Hmm. I'm not crazy about a library function looking at 'GIT_TEST_*'
>>>> environment variables and potentially ignoring its arguments, but given
>>>> the discussion we had in v1, I don't feel strongly enough to recommend
>>>> that you change this.
>>>>
>>>> For what it's worth, I think that 'write_commit_graph' could behave more
>>>> purely if callers checked 'GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS' and set
>>>> 'flags' appropriately from the outside,...
>>>
>>> Yeah, I agree that it would be a lot cleaner if that is easy to
>>> arrange.  I have a suspicion that Derrick already tried and the
>>> resulting code looked messier and was discarded?
>>
>> Perhaps I could fix both concerns by
>>
>> 1. Use a macro instead of a library call.
>>
>> 2. Check the _CHANGED_PATHS variable in the macro.
> 
> I am not sure how use of a macro "fixes" purity, though.  And what
> is the other concern?

The concern was (1) checking the environment and (2) die()ing in the
library.

> How widely would this "if we are testing, write out the graph file"
> call be sprinkled over the codebase?  I am hoping that it won't be
> "everywhere", but only at strategic places (like "just once before
> we leave a subcommand that creates one or bunch of commits").  And
> how often would they be called?  I am also hoping that it won't be
> "inside a tight loop".  In short, I am wondering if we can promise
> our codebase that 
> 
>  - git_test_write_commit_graph_or_die() calls won't be an eyesore
>    (and/or distraction) for developers too much.
> 
>  - git_env_bool() call won't have performance impact.

I could add a comment to the header file to say "this is only for
improving coverage of optional features in the test suite. Do not
call this method unless you know what you are doing."

My intention right now is to only include this in 'git commit'
and 'git merge'. Earlier discussion included thoughts about
'git rebase' and similar, but those are more rarely used when
constructing an "example repo" in the test scripts.

-Stolee

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/3] revision: complicated pathspecs disable filters
  2020-04-13 11:49     ` Derrick Stolee
@ 2020-04-14 18:25       ` Junio C Hamano
  2020-04-15 13:27         ` Derrick Stolee
  0 siblings, 1 reply; 48+ messages in thread
From: Junio C Hamano @ 2020-04-14 18:25 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, jnareb, garimasigit,
	Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

>> But if users may use icase pathspec very often, it may be worth
>> considering to build the bloom filter after downcasing the paths,
>> perhaps?  Given that many projects extract their source code to a
>> case insensitive filesystem, I would imagine that downcasing paths
>> would map two originally different paths into the same thing only
>> rarely, if ever, so there may not be much downside to do so.
>
> This behavior could be extended later, and carefully. My initial
> thought was that the case check would happen on every commit. If
> the :(icase) check only happens at the walk tip(s), then we could
> compute a single Bloom key at the start.

Sorry, I am not sure what you mean.

Do you mean that we notice that the user wants to match 'foo' case
insensitively, and tell the logic that uses changed-path records in
the graph file that commits that cannot possibly have touched any or
the paths 'foo', 'foO', 'fOo', ... (all 8 case permutations) are not
interesting?

I guess that would work, but I was wondering if it is simpler
without much downside if the changed-path records in the graph file
are prepared on paths after they are normalized to a single case.
That would lose information (e.g. you no longer can say "commits
that touch the path 'foo' is interesting, but those that touch the
path 'Foo' are not"), but makes the side that queries much simpler
(i.e. you do not have to prepare all 8 case permutations---you only
ask about 'foo').

And because the Bloom filter is used only for performance to cull
commits that can never possibly match, allowing a false positive
that would be discarded by actually running tree-diff anyway, the
only potential downside happens when the project has too many paths
that are different only in cases by increased collisions and by
reducing our chances to skip running tree-diff (and never affects
correctness).  

But this is not the "could be extended later" kind of behaviour, I
am afraid.  It is baked in the data stored in the graph file.

It all depends on how often people want :(icase) pathspec matches in
the history, I suspect.  My point was that we need to declare that
:(icase) won't matter in real life (hence we won't optimize our data
to support that use case), before the way in which the data stored
in the graph file is computed is cast in stone.

Thanks.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 2/4] commit: write commit-graph with Bloom filters
  2020-04-14 17:40             ` Derrick Stolee
@ 2020-04-15  0:17               ` Taylor Blau
  0 siblings, 0 replies; 48+ messages in thread
From: Taylor Blau @ 2020-04-15  0:17 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Junio C Hamano, Taylor Blau, Derrick Stolee via GitGitGadget,
	git, jnareb, garimasigit, Derrick Stolee

On Tue, Apr 14, 2020 at 01:40:53PM -0400, Derrick Stolee wrote:
> On 4/14/2020 1:26 PM, Junio C Hamano wrote:
> > Derrick Stolee <stolee@gmail.com> writes:
> >
> >> On 4/13/2020 6:21 PM, Junio C Hamano wrote:
> >>> Taylor Blau <me@ttaylorr.com> writes:
> >>>
> >>>> Hmm. I'm not crazy about a library function looking at 'GIT_TEST_*'
> >>>> environment variables and potentially ignoring its arguments, but given
> >>>> the discussion we had in v1, I don't feel strongly enough to recommend
> >>>> that you change this.
> >>>>
> >>>> For what it's worth, I think that 'write_commit_graph' could behave more
> >>>> purely if callers checked 'GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS' and set
> >>>> 'flags' appropriately from the outside,...
> >>>
> >>> Yeah, I agree that it would be a lot cleaner if that is easy to
> >>> arrange.  I have a suspicion that Derrick already tried and the
> >>> resulting code looked messier and was discarded?
> >>
> >> Perhaps I could fix both concerns by
> >>
> >> 1. Use a macro instead of a library call.
> >>
> >> 2. Check the _CHANGED_PATHS variable in the macro.
> >
> > I am not sure how use of a macro "fixes" purity, though.  And what
> > is the other concern?
>
> The concern was (1) checking the environment and (2) die()ing in the
> library.

I think that we might as well use a plain old function here instead of a
macro, especially since the macro is just expanding out to what that
function would be if we had written it that way.

For what it's worth, my concern was more that I don't mind dying, but
I'd rather do so in a calling function, and that by the time we're
calling 'write_commit_graph', that we should be using return values
instead of 'die()' or 'error()'.

> > How widely would this "if we are testing, write out the graph file"
> > call be sprinkled over the codebase?  I am hoping that it won't be
> > "everywhere", but only at strategic places (like "just once before
> > we leave a subcommand that creates one or bunch of commits").  And
> > how often would they be called?  I am also hoping that it won't be
> > "inside a tight loop".  In short, I am wondering if we can promise
> > our codebase that
> >
> >  - git_test_write_commit_graph_or_die() calls won't be an eyesore
> >    (and/or distraction) for developers too much.
> >
> >  - git_env_bool() call won't have performance impact.
>
> I could add a comment to the header file to say "this is only for
> improving coverage of optional features in the test suite. Do not
> call this method unless you know what you are doing."

Works for me.

> My intention right now is to only include this in 'git commit'
> and 'git merge'. Earlier discussion included thoughts about
> 'git rebase' and similar, but those are more rarely used when
> constructing an "example repo" in the test scripts.

Ditto. I think that there's benefit in having it in some places but not
all (as Stolee pointed out earlier in the thread), and that trying to
have it everywhere is a fool's errand.

> -Stolee

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/3] revision: complicated pathspecs disable filters
  2020-04-14 18:25       ` Junio C Hamano
@ 2020-04-15 13:27         ` Derrick Stolee
  2020-04-15 18:37           ` Derrick Stolee
  0 siblings, 1 reply; 48+ messages in thread
From: Derrick Stolee @ 2020-04-15 13:27 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, git, me, jnareb, garimasigit,
	Derrick Stolee

On 4/14/2020 2:25 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> 
>>> But if users may use icase pathspec very often, it may be worth
>>> considering to build the bloom filter after downcasing the paths,
>>> perhaps?  Given that many projects extract their source code to a
>>> case insensitive filesystem, I would imagine that downcasing paths
>>> would map two originally different paths into the same thing only
>>> rarely, if ever, so there may not be much downside to do so.
>>
>> This behavior could be extended later, and carefully. My initial
>> thought was that the case check would happen on every commit. If
>> the :(icase) check only happens at the walk tip(s), then we could
>> compute a single Bloom key at the start.
> 
> Sorry, I am not sure what you mean.

That's my fault. There are a couple of things I misunderstood here.

1. Thinking about "git blame" we would need to "collapse" a pathspec
   to a specific file before starting history. But blame doesn't
   allow ":(icase)" anyway.

2. With that context of "git blame" in my head, I was thinking
   (incorrectly) that "git log" would collapse the pathspec based on
   what file(s) match the pattern at HEAD. The tests in
   t6131-pathspec-icase.sh clearly show that this is wrong. In fact,
   if we apply the following diff to this patch, then we can get failures
   with the changed-path filters:

diff --git a/revision.c b/revision.c
index f78c636e4d..a02be25feb 100644
--- a/revision.c
+++ b/revision.c
@@ -652,13 +652,14 @@ static void trace2_bloom_filter_statistics_atexit(void)
 
 static int forbid_bloom_filters(struct pathspec *spec)
 {
+       int allowed_flags = PATHSPEC_LITERAL | PATHSPEC_ICASE;
        if (spec->has_wildcard)
                return 1;
        if (spec->nr > 1)
                return 1;
-       if (spec->magic & ~PATHSPEC_LITERAL)
+       if (spec->magic & ~allowed_flags)
                return 1;
-       if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
+       if (spec->nr && (spec->items[0].magic & ~allowed_flags))
                return 1;
 
        return 0;

> Do you mean that we notice that the user wants to match 'foo' case
> insensitively, and tell the logic that uses changed-path records in
> the graph file that commits that cannot possibly have touched any or
> the paths 'foo', 'foO', 'fOo', ... (all 8 case permutations) are not
> interesting?
> 
> I guess that would work, but I was wondering if it is simpler
> without much downside if the changed-path records in the graph file
> are prepared on paths after they are normalized to a single case.
> That would lose information (e.g. you no longer can say "commits
> that touch the path 'foo' is interesting, but those that touch the
> path 'Foo' are not"), but makes the side that queries much simpler
> (i.e. you do not have to prepare all 8 case permutations---you only
> ask about 'foo').
> 
> And because the Bloom filter is used only for performance to cull
> commits that can never possibly match, allowing a false positive
> that would be discarded by actually running tree-diff anyway, the
> only potential downside happens when the project has too many paths
> that are different only in cases by increased collisions and by
> reducing our chances to skip running tree-diff (and never affects
> correctness).  
> 
> But this is not the "could be extended later" kind of behaviour, I
> am afraid.  It is baked in the data stored in the graph file.

Since the feature is not released, we still have time to update the
format if we so desired. With the current format, we would need to
disable the filters when using an :(icase) pathspec as the current
patch does.

I'm not against the idea. Logically, collapsing case before hashing
the Bloom keys should not increase the probabilities of false
positives except in the situations where we have case conflicts.
There is a small cost in the pre-hashing step to change the case of
the paths, but that should be much lower than the cost of the hash
itself and the tree parsing to find the changed paths.

> It all depends on how often people want :(icase) pathspec matches in
> the history, I suspect.  My point was that we need to declare that
> :(icase) won't matter in real life (hence we won't optimize our data
> to support that use case), before the way in which the data stored
> in the graph file is computed is cast in stone.

My earlier statement can be summarized as "we could make this happen"
and you ask here "is it worth doing?"

I will play around with how complicated the change would be while
the community considers the "is it worth doing?" question.

Thanks,
-Stolee

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/3] revision: complicated pathspecs disable filters
  2020-04-15 13:27         ` Derrick Stolee
@ 2020-04-15 18:37           ` Derrick Stolee
  2020-04-15 19:32             ` Junio C Hamano
                               ` (2 more replies)
  0 siblings, 3 replies; 48+ messages in thread
From: Derrick Stolee @ 2020-04-15 18:37 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, git, me, jnareb, garimasigit,
	Derrick Stolee

On 4/15/2020 9:27 AM, Derrick Stolee wrote:
>
> I'm not against the idea. Logically, collapsing case before hashing
> the Bloom keys should not increase the probabilities of false
> positives except in the situations where we have case conflicts.
> There is a small cost in the pre-hashing step to change the case of
> the paths, but that should be much lower than the cost of the hash
> itself and the tree parsing to find the changed paths.
> 
>> It all depends on how often people want :(icase) pathspec matches in
>> the history, I suspect.  My point was that we need to declare that
>> :(icase) won't matter in real life (hence we won't optimize our data
>> to support that use case), before the way in which the data stored
>> in the graph file is computed is cast in stone.
> 
> My earlier statement can be summarized as "we could make this happen"
> and you ask here "is it worth doing?"
> 
> I will play around with how complicated the change would be while
> the community considers the "is it worth doing?" question.

I played a bit, and it wasn't terrible to cover everything by
adjusting fill_bloom_key(). I also added a test that we will want
anyways.

There is more work to be done if this is the direction we want to
go, including double-checking that this doesn't cause significant
performance degradation.

The point of me sending this patch is to make the proposed
direction very clear how it would work. I'm still not sure how I
feel about it.

Thanks,
-Stolee

-->8--
From 89beb9598daabb19e3c896bbceeb0fc1b9ccc6ca Mon Sep 17 00:00:00 2001
From: Derrick Stolee <dstolee@microsoft.com>
Date: Wed, 15 Apr 2020 18:04:25 +0000
Subject: [PATCH] bloom: compute all Bloom hashes from lowercase

The changed-path Bloom filters currently hash path strings using
the exact string for the path. This makes it difficult* to use the
filters when restricting to case-insensitive pathspecs.

* I say "difficult" because it is possible to generate all 2^n
  options for the case of a path and test them all, but this is
  a bad idea and should not be done. "Impossible" is an appropriate
  alternative.

THIS IS A BREAKING CHANGE. Commit-graph files with changed-path
Bloom filters computed by a previous commit will not be compatible
with the filters computed in this commit, nor will we get correct
results when testing across these incompatible versions. Normally,
this would be a completely unacceptable change, but the filters
have not been released and hence are still possible to update
before release.

TODO: If we decide to move in this direction, then the following
steps should be done (and some of them should be done anyway):

* We need to document the Bloom filter format to specify exactly
  how we compute the filter data. The details should be careful
  enough that someone can reproduce the exact file format without
  looking at the C code.

* That document would include the tolower() transformation that is
  being done here.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 bloom.c                   | 16 ++++++++++++++--
 revision.c                |  5 +++--
 t/t6131-pathspec-icase.sh | 18 ++++++++++++++++++
 3 files changed, 35 insertions(+), 4 deletions(-)

diff --git a/bloom.c b/bloom.c
index dd9bab9bbd..c919c60f12 100644
--- a/bloom.c
+++ b/bloom.c
@@ -130,8 +130,20 @@ void fill_bloom_key(const char *data,
 	int i;
 	const uint32_t seed0 = 0x293ae76f;
 	const uint32_t seed1 = 0x7e646e2c;
-	const uint32_t hash0 = murmur3_seeded(seed0, data, len);
-	const uint32_t hash1 = murmur3_seeded(seed1, data, len);
+	uint32_t hash0, hash1;
+	static struct strbuf icase_data = STRBUF_INIT;
+	char *cur;
+
+	strbuf_setlen(&icase_data, 0);
+	strbuf_addstr(&icase_data, data);
+
+	for (cur = icase_data.buf; cur && *cur; cur++) {
+		if (isupper(*cur))
+			*cur = tolower(*cur);
+	}
+
+	hash0 = murmur3_seeded(seed0, icase_data.buf, len);
+	hash1 = murmur3_seeded(seed1, icase_data.buf, len);
 
 	key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
 	for (i = 0; i < settings->num_hashes; i++)
diff --git a/revision.c b/revision.c
index f78c636e4d..a02be25feb 100644
--- a/revision.c
+++ b/revision.c
@@ -652,13 +652,14 @@ static void trace2_bloom_filter_statistics_atexit(void)
 
 static int forbid_bloom_filters(struct pathspec *spec)
 {
+	int allowed_flags = PATHSPEC_LITERAL | PATHSPEC_ICASE;
 	if (spec->has_wildcard)
 		return 1;
 	if (spec->nr > 1)
 		return 1;
-	if (spec->magic & ~PATHSPEC_LITERAL)
+	if (spec->magic & ~allowed_flags)
 		return 1;
-	if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
+	if (spec->nr && (spec->items[0].magic & ~allowed_flags))
 		return 1;
 
 	return 0;
diff --git a/t/t6131-pathspec-icase.sh b/t/t6131-pathspec-icase.sh
index 39fc3f6769..ee0766bd91 100755
--- a/t/t6131-pathspec-icase.sh
+++ b/t/t6131-pathspec-icase.sh
@@ -106,4 +106,22 @@ test_expect_success '"git diff" can take magic :(icase) pathspec' '
 	test_cmp expect actual
 '
 
+test_expect_success '"git log" takes the magic :(icase) pathspec' '
+	cat >expect <<-\EOF &&
+	FOO/BAR
+	FOO/bAr
+	FOO/bar
+	fOo/BAR
+	fOo/bAr
+	fOo/bar
+	foo/BAR
+	foo/bAr
+	foo/bar
+	EOF
+	git log --pretty=%s HEAD -- ":(icase)foo/bar" >actual &&
+	test_cmp expect actual &&
+	git log --pretty=%s HEAD -- ":(icase)foo" >actual &&
+	test_cmp expect actual
+'
+
 test_done



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/3] revision: complicated pathspecs disable filters
  2020-04-15 18:37           ` Derrick Stolee
@ 2020-04-15 19:32             ` Junio C Hamano
  2020-04-15 19:39               ` Junio C Hamano
  2020-04-15 21:25             ` Junio C Hamano
  2020-04-15 22:18             ` Jakub Narębski
  2 siblings, 1 reply; 48+ messages in thread
From: Junio C Hamano @ 2020-04-15 19:32 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, jnareb, garimasigit,
	Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> diff --git a/bloom.c b/bloom.c
> index dd9bab9bbd..c919c60f12 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -130,8 +130,20 @@ void fill_bloom_key(const char *data,
>  	int i;
>  	const uint32_t seed0 = 0x293ae76f;
>  	const uint32_t seed1 = 0x7e646e2c;
> -	const uint32_t hash0 = murmur3_seeded(seed0, data, len);
> -	const uint32_t hash1 = murmur3_seeded(seed1, data, len);
> +	uint32_t hash0, hash1;
> +	static struct strbuf icase_data = STRBUF_INIT;
> +	char *cur;
> +
> +	strbuf_setlen(&icase_data, 0);
> +	strbuf_addstr(&icase_data, data);

Perhaps 

	strbuf_reset(&icase_data);
	strbuf_add(&icase_data, data, len);

Or do we know bloom keys are always NUL-terminated string?

I am not sure how reusable bloom.c::fill_bloom_key() is designed to
be.  If it is for the sole use of the changed-paths, then it is OK
to assume that our data is NUL-terminated string and that keys wants
to be always computed after downcasing (assuming that icase search
is something we want to optimize for, which I find is a bit iffy).

If not, obviously we would want these two things done on the
caller's side (or perhaps a new helper function whose callers can
make these assumption, fill_bloom_path(), may want to be inserted
between fill_bloom_key() and its callers).

> +	for (cur = icase_data.buf; cur && *cur; cur++) {
> +		if (isupper(*cur))
> +			*cur = tolower(*cur);
> +	}
> +
> +	hash0 = murmur3_seeded(seed0, icase_data.buf, len);
> +	hash1 = murmur3_seeded(seed1, icase_data.buf, len);
>  
>  	key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
>  	for (i = 0; i < settings->num_hashes; i++)

In any case, the update to the above function seems fairly isolated.
Nicely done.

> diff --git a/revision.c b/revision.c
> index f78c636e4d..a02be25feb 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -652,13 +652,14 @@ static void trace2_bloom_filter_statistics_atexit(void)
>  
>  static int forbid_bloom_filters(struct pathspec *spec)
>  {
> +	int allowed_flags = PATHSPEC_LITERAL | PATHSPEC_ICASE;
>  	if (spec->has_wildcard)
>  		return 1;
>  	if (spec->nr > 1)
>  		return 1;
> -	if (spec->magic & ~PATHSPEC_LITERAL)
> +	if (spec->magic & ~allowed_flags)
>  		return 1;
> -	if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
> +	if (spec->nr && (spec->items[0].magic & ~allowed_flags))
>  		return 1;
>  
>  	return 0;

And thanks to the refactoring done to invent this helper function,
the side that uses the Bloom data is quite straight-forward.

As you are, I am on the fence.  

I do not think :(icase) pathspec is something we want to optimize
for, but I still like this new hash function primarily because I
suspect that it will increase the number of paths that you can cram
into the filter without getting their hashes collided (hence getting
false positive), under the assumption that real projects won't try
to store too many pair of paths that are only different in their
case, and if that is the case, it would help performance.  So if we
were to benchmark, we'd also pay attention to that side, in addition
to the obvious downside of having to allocate and downcase.

Thanks.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/3] revision: complicated pathspecs disable filters
  2020-04-15 19:32             ` Junio C Hamano
@ 2020-04-15 19:39               ` Junio C Hamano
  0 siblings, 0 replies; 48+ messages in thread
From: Junio C Hamano @ 2020-04-15 19:39 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, jnareb, garimasigit,
	Derrick Stolee

Junio C Hamano <gitster@pobox.com> writes:

> As you are, I am on the fence.  
>
> I do not think :(icase) pathspec is something we want to optimize
> for, but I still like this new hash function primarily because I
> suspect that it will increase the number of paths that you can cram
> into the filter without getting their hashes collided (hence getting
> false positive), under the assumption that real projects won't try
> to store too many pair of paths that are only different in their
> case...

Sorry, but no, I do not think there is such upside.  It may have
effects on the actual hash values to downcase paths that are
originally camelCased, but reducing the entropy of input paths that
way shouldn't have effect on the overall distribution and rate of
collision in any meaningful way (otherwise the chosen underlying
hash function would be broken).  So, sorry for the noise.




^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/3] revision: complicated pathspecs disable filters
  2020-04-15 18:37           ` Derrick Stolee
  2020-04-15 19:32             ` Junio C Hamano
@ 2020-04-15 21:25             ` Junio C Hamano
  2020-04-16  0:56               ` Taylor Blau
  2020-04-15 22:18             ` Jakub Narębski
  2 siblings, 1 reply; 48+ messages in thread
From: Junio C Hamano @ 2020-04-15 21:25 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, jnareb, garimasigit,
	Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> THIS IS A BREAKING CHANGE. Commit-graph files with changed-path
> Bloom filters computed by a previous commit will not be compatible
> with the filters computed in this commit, nor will we get correct
> results when testing across these incompatible versions. Normally,
> this would be a completely unacceptable change, but the filters
> have not been released and hence are still possible to update
> before release.

Sure, it hasn't even hit 'next' yet.  

But I think we are both sort-of leaning towards concluding that it
does not help all that much.  So I think it is OK.

> TODO: If we decide to move in this direction, then the following
> steps should be done (and some of them should be done anyway):

Even if we decide not to do this "downcase before hashing" thing, we
should document how exactly we compute, I think.

And if we decide do change our mind later, it is not the end of the
world.  We should be able to use a different chunk type to store the
filters computed over downcased paths.

> * We need to document the Bloom filter format to specify exactly
>   how we compute the filter data. The details should be careful
>   enough that someone can reproduce the exact file format without
>   looking at the C code.
>
> * That document would include the tolower() transformation that is
>   being done here.

As the tree-diff comparison done internally while running "git
blame" does not take an end-user specified pathspec in any
meaningful way, this does not matter in practice, but there is
another possible complication we would want to consider when we
extend the support to "git log" and friends---negative pathspecs
(e.g. "git log ':(exclude)COPYING'").  A commit that cannot possibly
have touched the COPYING file would be eligible for output without
actually running tree-diff between it and its parent (as long as the
trees of the two commits are different, we know it must be shown).

Thanks.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/3] revision: complicated pathspecs disable filters
  2020-04-15 18:37           ` Derrick Stolee
  2020-04-15 19:32             ` Junio C Hamano
  2020-04-15 21:25             ` Junio C Hamano
@ 2020-04-15 22:18             ` Jakub Narębski
  2020-04-16  0:52               ` Taylor Blau
  2 siblings, 1 reply; 48+ messages in thread
From: Jakub Narębski @ 2020-04-15 22:18 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Junio C Hamano, Derrick Stolee via GitGitGadget, git,
	Taylor Blau, Garima Singh, Derrick Stolee

On Wed, 15 Apr 2020 at 20:37, Derrick Stolee <stolee@gmail.com> wrote:
[...]
> -->8--
> From 89beb9598daabb19e3c896bbceeb0fc1b9ccc6ca Mon Sep 17 00:00:00 2001
> From: Derrick Stolee <dstolee@microsoft.com>
> Date: Wed, 15 Apr 2020 18:04:25 +0000
> Subject: [PATCH] bloom: compute all Bloom hashes from lowercase
>
> The changed-path Bloom filters currently hash path strings using
> the exact string for the path. This makes it difficult* to use the
> filters when restricting to case-insensitive pathspecs.
>
> * I say "difficult" because it is possible to generate all 2^n
>   options for the case of a path and test them all, but this is
>   a bad idea and should not be done. "Impossible" is an appropriate
>   alternative.
>
> THIS IS A BREAKING CHANGE. Commit-graph files with changed-path
> Bloom filters computed by a previous commit will not be compatible
> with the filters computed in this commit, nor will we get correct
> results when testing across these incompatible versions. Normally,
> this would be a completely unacceptable change, but the filters
> have not been released and hence are still possible to update
> before release.
>
> TODO: If we decide to move in this direction, then the following
> steps should be done (and some of them should be done anyway):
>
> * We need to document the Bloom filter format to specify exactly
>   how we compute the filter data. The details should be careful
>   enough that someone can reproduce the exact file format without
>   looking at the C code.
>
> * That document would include the tolower() transformation that is
>   being done here.

Why not modify the BDAT chunk to include version of
case folding transformation or other collation algorithm
(other transformation).that is done prior to computing
the Bloom filter key? Though that might be unnecessary
flexibility...

For example the value of 0x00 in such field of BDAT
chunk header would mean no transformation, while
the value of 0x01 would mean per-character tolower()
or Unicode equivalent of it.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/3] revision: complicated pathspecs disable filters
  2020-04-15 22:18             ` Jakub Narębski
@ 2020-04-16  0:52               ` Taylor Blau
  2020-04-16 13:26                 ` Derrick Stolee
  0 siblings, 1 reply; 48+ messages in thread
From: Taylor Blau @ 2020-04-16  0:52 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Derrick Stolee, Junio C Hamano, Derrick Stolee via GitGitGadget,
	git, Taylor Blau, Garima Singh, Derrick Stolee

On Thu, Apr 16, 2020 at 12:18:33AM +0200, Jakub Narębski wrote:
> On Wed, 15 Apr 2020 at 20:37, Derrick Stolee <stolee@gmail.com> wrote:
> [...]
> > -->8--
> > From 89beb9598daabb19e3c896bbceeb0fc1b9ccc6ca Mon Sep 17 00:00:00 2001
> > From: Derrick Stolee <dstolee@microsoft.com>
> > Date: Wed, 15 Apr 2020 18:04:25 +0000
> > Subject: [PATCH] bloom: compute all Bloom hashes from lowercase
> >
> > The changed-path Bloom filters currently hash path strings using
> > the exact string for the path. This makes it difficult* to use the
> > filters when restricting to case-insensitive pathspecs.
> >
> > * I say "difficult" because it is possible to generate all 2^n
> >   options for the case of a path and test them all, but this is
> >   a bad idea and should not be done. "Impossible" is an appropriate
> >   alternative.
> >
> > THIS IS A BREAKING CHANGE. Commit-graph files with changed-path
> > Bloom filters computed by a previous commit will not be compatible
> > with the filters computed in this commit, nor will we get correct
> > results when testing across these incompatible versions. Normally,
> > this would be a completely unacceptable change, but the filters
> > have not been released and hence are still possible to update
> > before release.
> >
> > TODO: If we decide to move in this direction, then the following
> > steps should be done (and some of them should be done anyway):
> >
> > * We need to document the Bloom filter format to specify exactly
> >   how we compute the filter data. The details should be careful
> >   enough that someone can reproduce the exact file format without
> >   looking at the C code.
> >
> > * That document would include the tolower() transformation that is
> >   being done here.
>
> Why not modify the BDAT chunk to include version of
> case folding transformation or other collation algorithm
> (other transformation).that is done prior to computing
> the Bloom filter key? Though that might be unnecessary
> flexibility...

If this ends up being something that we want to do, I agree with
Stolee's reasoning that this should be a breaking change. If we were,
say, several months into having Bloom filters in a release and decided
at that point to make the change, then: sure, supporting both by writing
a bit in the BDAT chunk makes sense.

But, we're many months away from that state yet, and so I don't think
the cost of rebuilding what few commit-graphs exist with bloom filters
in them today to support both ordinary and lower-cased paths in the
filter.

Anyway, I'm still not sold on this idea in general (nor do I understand
it that others are), so I'll respond in more detail in another part of
the thread...

> For example the value of 0x00 in such field of BDAT
> chunk header would mean no transformation, while
> the value of 0x01 would mean per-character tolower()
> or Unicode equivalent of it.
>
> Best,
> --
> Jakub Narębski

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/3] revision: complicated pathspecs disable filters
  2020-04-15 21:25             ` Junio C Hamano
@ 2020-04-16  0:56               ` Taylor Blau
  0 siblings, 0 replies; 48+ messages in thread
From: Taylor Blau @ 2020-04-16  0:56 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee, Derrick Stolee via GitGitGadget, git, me, jnareb,
	garimasigit, Derrick Stolee

On Wed, Apr 15, 2020 at 02:25:48PM -0700, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
> > THIS IS A BREAKING CHANGE. Commit-graph files with changed-path
> > Bloom filters computed by a previous commit will not be compatible
> > with the filters computed in this commit, nor will we get correct
> > results when testing across these incompatible versions. Normally,
> > this would be a completely unacceptable change, but the filters
> > have not been released and hence are still possible to update
> > before release.
>
> Sure, it hasn't even hit 'next' yet.
>
> But I think we are both sort-of leaning towards concluding that it
> does not help all that much.  So I think it is OK.

Yeah, I think that the only thing that it is potentially helping is the
:(icase) magic. In a repository that doesn't have colliding paths in
case-insensitive filesystems (i.e., having both 'foo' and 'FOO' in the
same tree), this doesn't seem to be obviously hurting anything.

But, it is at least semi-harmful to repositories that do have such
trees, since we now can't answer "probably yes" or "definitely not" for
colliding paths unless both produce the same fingerprint. That seems
like a clear downside.

Now, how common is that against people who would benefit from
changed-path Bloom filters in their commit-graphs? I don't know one way
or the other. But, the upside seems to be minimal compared to the
potential cost, so I think that it may be better to just leave this one
alone.

> > TODO: If we decide to move in this direction, then the following
> > steps should be done (and some of them should be done anyway):
>
> Even if we decide not to do this "downcase before hashing" thing, we
> should document how exactly we compute, I think.
>
> And if we decide do change our mind later, it is not the end of the
> world.  We should be able to use a different chunk type to store the
> filters computed over downcased paths.
>
> > * We need to document the Bloom filter format to specify exactly
> >   how we compute the filter data. The details should be careful
> >   enough that someone can reproduce the exact file format without
> >   looking at the C code.
> >
> > * That document would include the tolower() transformation that is
> >   being done here.
>
> As the tree-diff comparison done internally while running "git
> blame" does not take an end-user specified pathspec in any
> meaningful way, this does not matter in practice, but there is
> another possible complication we would want to consider when we
> extend the support to "git log" and friends---negative pathspecs
> (e.g. "git log ':(exclude)COPYING'").  A commit that cannot possibly
> have touched the COPYING file would be eligible for output without
> actually running tree-diff between it and its parent (as long as the
> trees of the two commits are different, we know it must be shown).
>
> Thanks.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/3] revision: complicated pathspecs disable filters
  2020-04-16  0:52               ` Taylor Blau
@ 2020-04-16 13:26                 ` Derrick Stolee
  2020-04-16 16:33                   ` Taylor Blau
  0 siblings, 1 reply; 48+ messages in thread
From: Derrick Stolee @ 2020-04-16 13:26 UTC (permalink / raw)
  To: Taylor Blau, Jakub Narębski
  Cc: Junio C Hamano, Derrick Stolee via GitGitGadget, git,
	Garima Singh, Derrick Stolee

On 4/15/2020 8:52 PM, Taylor Blau wrote:
> On Thu, Apr 16, 2020 at 12:18:33AM +0200, Jakub Narębski wrote:
>> On Wed, 15 Apr 2020 at 20:37, Derrick Stolee <stolee@gmail.com> wrote:
>> [...]
>>> -->8--
>>> From 89beb9598daabb19e3c896bbceeb0fc1b9ccc6ca Mon Sep 17 00:00:00 2001
>>> From: Derrick Stolee <dstolee@microsoft.com>
>>> Date: Wed, 15 Apr 2020 18:04:25 +0000
>>> Subject: [PATCH] bloom: compute all Bloom hashes from lowercase
>>>
>>> The changed-path Bloom filters currently hash path strings using
>>> the exact string for the path. This makes it difficult* to use the
>>> filters when restricting to case-insensitive pathspecs.
>>>
>>> * I say "difficult" because it is possible to generate all 2^n
>>>   options for the case of a path and test them all, but this is
>>>   a bad idea and should not be done. "Impossible" is an appropriate
>>>   alternative.
>>>
>>> THIS IS A BREAKING CHANGE. Commit-graph files with changed-path
>>> Bloom filters computed by a previous commit will not be compatible
>>> with the filters computed in this commit, nor will we get correct
>>> results when testing across these incompatible versions. Normally,
>>> this would be a completely unacceptable change, but the filters
>>> have not been released and hence are still possible to update
>>> before release.
>>>
>>> TODO: If we decide to move in this direction, then the following
>>> steps should be done (and some of them should be done anyway):
>>>
>>> * We need to document the Bloom filter format to specify exactly
>>>   how we compute the filter data. The details should be careful
>>>   enough that someone can reproduce the exact file format without
>>>   looking at the C code.
>>>
>>> * That document would include the tolower() transformation that is
>>>   being done here.
>>
>> Why not modify the BDAT chunk to include version of
>> case folding transformation or other collation algorithm
>> (other transformation).that is done prior to computing
>> the Bloom filter key? Though that might be unnecessary
>> flexibility...
> 
> If this ends up being something that we want to do, I agree with
> Stolee's reasoning that this should be a breaking change. If we were,
> say, several months into having Bloom filters in a release and decided
> at that point to make the change, then: sure, supporting both by writing
> a bit in the BDAT chunk makes sense.
> 
> But, we're many months away from that state yet, and so I don't think
> the cost of rebuilding what few commit-graphs exist with bloom filters
> in them today to support both ordinary and lower-cased paths in the
> filter.
> 
> Anyway, I'm still not sold on this idea in general (nor do I understand
> it that others are), so I'll respond in more detail in another part of
> the thread...

I agree that this is not a good direction to go. I created the patch
because I was curious how difficult it would be, and it is good to have
a record of the possible direction. However, it complicates the file
format and will have unpredictable effects on the entropy (or on the
performance of history for case-colliding paths).

It is good that we have the capability to extend the filter data in
the future if we really need to.

I'll make a TODO item for myself to try writing that detailed Bloom
filter format documentation as a follow-up. In the meantime, I'll try
to close this out by responding to the feedback we have so far.

Thanks,
-Stolee



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/3] revision: complicated pathspecs disable filters
  2020-04-16 13:26                 ` Derrick Stolee
@ 2020-04-16 16:33                   ` Taylor Blau
  2020-04-16 18:02                     ` Junio C Hamano
  0 siblings, 1 reply; 48+ messages in thread
From: Taylor Blau @ 2020-04-16 16:33 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Taylor Blau, Jakub Narębski, Junio C Hamano,
	Derrick Stolee via GitGitGadget, git, Garima Singh,
	Derrick Stolee

On Thu, Apr 16, 2020 at 09:26:49AM -0400, Derrick Stolee wrote:
> On 4/15/2020 8:52 PM, Taylor Blau wrote:
> > On Thu, Apr 16, 2020 at 12:18:33AM +0200, Jakub Narębski wrote:
> >> On Wed, 15 Apr 2020 at 20:37, Derrick Stolee <stolee@gmail.com> wrote:
> >> [...]
> >>> -->8--
> >>> From 89beb9598daabb19e3c896bbceeb0fc1b9ccc6ca Mon Sep 17 00:00:00 2001
> >>> From: Derrick Stolee <dstolee@microsoft.com>
> >>> Date: Wed, 15 Apr 2020 18:04:25 +0000
> >>> Subject: [PATCH] bloom: compute all Bloom hashes from lowercase
> >>>
> >>> The changed-path Bloom filters currently hash path strings using
> >>> the exact string for the path. This makes it difficult* to use the
> >>> filters when restricting to case-insensitive pathspecs.
> >>>
> >>> * I say "difficult" because it is possible to generate all 2^n
> >>>   options for the case of a path and test them all, but this is
> >>>   a bad idea and should not be done. "Impossible" is an appropriate
> >>>   alternative.
> >>>
> >>> THIS IS A BREAKING CHANGE. Commit-graph files with changed-path
> >>> Bloom filters computed by a previous commit will not be compatible
> >>> with the filters computed in this commit, nor will we get correct
> >>> results when testing across these incompatible versions. Normally,
> >>> this would be a completely unacceptable change, but the filters
> >>> have not been released and hence are still possible to update
> >>> before release.
> >>>
> >>> TODO: If we decide to move in this direction, then the following
> >>> steps should be done (and some of them should be done anyway):
> >>>
> >>> * We need to document the Bloom filter format to specify exactly
> >>>   how we compute the filter data. The details should be careful
> >>>   enough that someone can reproduce the exact file format without
> >>>   looking at the C code.
> >>>
> >>> * That document would include the tolower() transformation that is
> >>>   being done here.
> >>
> >> Why not modify the BDAT chunk to include version of
> >> case folding transformation or other collation algorithm
> >> (other transformation).that is done prior to computing
> >> the Bloom filter key? Though that might be unnecessary
> >> flexibility...
> >
> > If this ends up being something that we want to do, I agree with
> > Stolee's reasoning that this should be a breaking change. If we were,
> > say, several months into having Bloom filters in a release and decided
> > at that point to make the change, then: sure, supporting both by writing
> > a bit in the BDAT chunk makes sense.
> >
> > But, we're many months away from that state yet, and so I don't think
> > the cost of rebuilding what few commit-graphs exist with bloom filters
> > in them today to support both ordinary and lower-cased paths in the
> > filter.
> >
> > Anyway, I'm still not sold on this idea in general (nor do I understand
> > it that others are), so I'll respond in more detail in another part of
> > the thread...
>
> I agree that this is not a good direction to go. I created the patch
> because I was curious how difficult it would be, and it is good to have
> a record of the possible direction. However, it complicates the file
> format and will have unpredictable effects on the entropy (or on the
> performance of history for case-colliding paths).
>
> It is good that we have the capability to extend the filter data in
> the future if we really need to.
>
> I'll make a TODO item for myself to try writing that detailed Bloom
> filter format documentation as a follow-up. In the meantime, I'll try
> to close this out by responding to the feedback we have so far.

Sounds good, and thanks for investigating.

> Thanks,
> -Stolee

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/3] revision: complicated pathspecs disable filters
  2020-04-16 16:33                   ` Taylor Blau
@ 2020-04-16 18:02                     ` Junio C Hamano
  0 siblings, 0 replies; 48+ messages in thread
From: Junio C Hamano @ 2020-04-16 18:02 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Derrick Stolee, Jakub Narębski,
	Derrick Stolee via GitGitGadget, git, Garima Singh,
	Derrick Stolee

Taylor Blau <me@ttaylorr.com> writes:

>> It is good that we have the capability to extend the filter data in
>> the future if we really need to.
>>
>> I'll make a TODO item for myself to try writing that detailed Bloom
>> filter format documentation as a follow-up. In the meantime, I'll try
>> to close this out by responding to the feedback we have so far.
>
> Sounds good, and thanks for investigating.

Yeah, thanks, all.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH v3 0/3] Integrate changed-path Bloom filters with 'git blame'
  2020-04-13 14:45 ` [PATCH v2 0/4] " Derrick Stolee via GitGitGadget
                     ` (4 preceding siblings ...)
  2020-04-13 16:21   ` [PATCH v2 0/4] Integrate changed-path Bloom filters with 'git blame' Taylor Blau
@ 2020-04-16 20:14   ` Derrick Stolee via GitGitGadget
  2020-04-16 20:14     ` [PATCH v3 1/3] revision: complicated pathspecs disable filters Derrick Stolee via GitGitGadget
                       ` (2 more replies)
  5 siblings, 3 replies; 48+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-16 20:14 UTC (permalink / raw)
  To: git; +Cc: me, jnareb, garimasigit, Derrick Stolee

If the changed-path Bloom filters are relatively stable, then I propose
trying to build upon them as a way to discover any deficiencies. Also, it's
good to use them when we can.

The goal I set out to achieve was to use Bloom filters as much as possible
in git blame. I think I have achieved most of that, except I did not
integrate it with the -C mode. In that case, the blob-diff computation takes
a majority of the computation time, so short-circuiting the tree diff using
Bloom filters. Also, it's just really complicated. If someone else thinks
there is an easy win there, then please go ahead and give it a try with the
extra logic presented here in PATCH 3.

While I was playing around with Bloom filters and git blame, I purposefully
got it working with some scenarios but not all. Then I tried to trigger a
failing build in the blame tests using GIT_TEST_COMMIT_GRAPH=1 and 
GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=1. But the tests all succeeded!

Examining the data, I see that the commit-graph didn't have the Bloom filter
chunks at all. This is because we are not setting the flag to write them in
the right spot. The GIT_TEST_COMMIT_GRAPH=1 variable triggers a commit-graph
write during git commit, so we need to update the code there instead of just
inspecting the variable in git commit-graph write. (This is PATCH 2.)

By updating this variable, I saw some test failures in other tests regarding
non-exact pathspecs. I fixed these in PATCH 1 so we keep clean builds.

I based this change on [1] but it would apply cleanly (and logically) on
gs/commit-graph-path-filter

Updates in v2:

 * Added PATCH 3 to write commit-graph files during 'git merge' if
   GIT_TEST_COMMIT_GRAPH is enabled.
   
   
 * Updated PATCH 1 with the simplification recommended by Taylor.
   
   
 * Fixed the lower-case "bloom" in the commit message.
   
   

Thanks, -Stolee

[1] 
https://lore.kernel.org/git/pull.601.v2.git.1586437211842.gitgitgadget@gmail.com/

Derrick Stolee (3):
  revision: complicated pathspecs disable filters
  tests: write commit-graph with Bloom filters
  blame: use changed-path Bloom filters

 blame.c          | 139 ++++++++++++++++++++++++++++++++++++++++++++---
 blame.h          |   6 ++
 builtin/blame.c  |  10 ++++
 builtin/commit.c |   4 +-
 builtin/merge.c  |   7 ++-
 commit-graph.c   |  14 +++++
 commit-graph.h   |   9 +++
 revision.c       |  19 ++++++-
 8 files changed, 193 insertions(+), 15 deletions(-)


base-commit: f4df00a0dd448edce0e854a97f63598fefe27d27
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-609%2Fderrickstolee%2Fbloom-blame-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-609/derrickstolee/bloom-blame-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/609

Range-diff vs v2:

 1:  adc03eee4ac = 1:  adc03eee4ac revision: complicated pathspecs disable filters
 2:  7e8f1aed113 < -:  ----------- commit: write commit-graph with Bloom filters
 3:  824f8ad067b ! 2:  4073c8fe42f commit-graph: write commit-graph in more tests
     @@ Metadata
      Author: Derrick Stolee <dstolee@microsoft.com>
      
       ## Commit message ##
     -    commit-graph: write commit-graph in more tests
     +    tests: write commit-graph with Bloom filters
      
     -    The GIT_TEST_COMMIT_GRAPH test environment variable triggers
     -    commit-graph writes during each "git commit" process. To expand
     -    the number of tests that have commits in the commit-graph file,
     -    add a helper method that computes the commit-graph and place
     +    The GIT_TEST_COMMIT_GRAPH environment variable updates the commit-
     +    graph file whenever "git commit" is run, ensuring that we always
     +    have an updated commit-graph throughout the test suite. The
     +    GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS environment variable was
     +    introduced to write the changed-path Bloom filters whenever "git
     +    commit-graph write" is run. However, the GIT_TEST_COMMIT_GRAPH
     +    trick doesn't launch a separate process and instead writes it
     +    directly.
     +
     +    To expand the number of tests that have commits in the commit-graph
     +    file, add a helper method that computes the commit-graph and place
          that helper inside "git commit" and "git merge".
      
     +    In the helper method, check GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS
     +    to ensure we are writing changed-path Bloom filters whenever
     +    possible.
     +
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## builtin/commit.c ##
     @@ commit-graph.c
       
      +void git_test_write_commit_graph_or_die(void)
      +{
     -+	if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) &&
     -+	    write_commit_graph_reachable(the_repository->objects->odb, 0, NULL))
     ++	int flags = 0;
     ++	if (!git_env_bool(GIT_TEST_COMMIT_GRAPH, 0))
     ++		return;
     ++
     ++	if (git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0))
     ++		flags = COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
     ++	
     ++	if (write_commit_graph_reachable(the_repository->objects->odb,
     ++					 flags, NULL))
      +		die("failed to write commit-graph under GIT_TEST_COMMIT_GRAPH");
      +}
      +
     @@ commit-graph.h
       #define GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD"
       #define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
       
     ++/*
     ++ * This method is only used to enhance coverage of the commit-graph
     ++ * feature in the test suite with the GIT_TEST_COMMIT_GRAPH and
     ++ * GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS environment variables. Do not
     ++ * call this method oustide of a builtin, and only if you know what
     ++ * you are doing!
     ++ */
      +void git_test_write_commit_graph_or_die(void);
      +
       struct commit;
 4:  4ae196d6355 = 3:  463d6bf5033 blame: use changed-path Bloom filters

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH v3 1/3] revision: complicated pathspecs disable filters
  2020-04-16 20:14   ` [PATCH v3 0/3] " Derrick Stolee via GitGitGadget
@ 2020-04-16 20:14     ` Derrick Stolee via GitGitGadget
  2020-06-07 20:33       ` SZEDER Gábor
  2020-04-16 20:14     ` [PATCH v3 2/3] tests: write commit-graph with Bloom filters Derrick Stolee via GitGitGadget
  2020-04-16 20:14     ` [PATCH v3 3/3] blame: use changed-path " Derrick Stolee via GitGitGadget
  2 siblings, 1 reply; 48+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-16 20:14 UTC (permalink / raw)
  To: git; +Cc: me, jnareb, garimasigit, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The changed-path Bloom filters work only when we can compute an
explicit Bloom filter key in advance. When a pathspec is given
that allows case-insensitive checks or wildcard matching, we
must disable the Bloom filter performance checks.

By checking the pathspec in prepare_to_use_bloom_filters(), we
avoid setting up the Bloom filter data and thus revert to the
usual logic.

Before this change, the following tests would fail*:

	t6004-rev-list-path-optim.sh (Tests 6-7)
	t6130-pathspec-noglob.sh (Tests 3-6)
	t6131-pathspec-icase.sh (Tests 3-5)

*These tests would fail when using GIT_TEST_COMMIT_GRAPH and
GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS except that the latter
environment variable was not set up correctly to write the changed-
path Bloom filters in the test suite. That will be fixed in the
next change.

Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/revision.c b/revision.c
index 2b06ee739c8..f78c636e4d0 100644
--- a/revision.c
+++ b/revision.c
@@ -650,6 +650,20 @@ static void trace2_bloom_filter_statistics_atexit(void)
 	jw_release(&jw);
 }
 
+static int forbid_bloom_filters(struct pathspec *spec)
+{
+	if (spec->has_wildcard)
+		return 1;
+	if (spec->nr > 1)
+		return 1;
+	if (spec->magic & ~PATHSPEC_LITERAL)
+		return 1;
+	if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
+		return 1;
+
+	return 0;
+}
+
 static void prepare_to_use_bloom_filter(struct rev_info *revs)
 {
 	struct pathspec_item *pi;
@@ -659,7 +673,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 	int len;
 
 	if (!revs->commits)
-	    return;
+		return;
+
+	if (forbid_bloom_filters(&revs->prune_data))
+		return;
 
 	repo_parse_commit(revs->repo, revs->commits->item);
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v3 2/3] tests: write commit-graph with Bloom filters
  2020-04-16 20:14   ` [PATCH v3 0/3] " Derrick Stolee via GitGitGadget
  2020-04-16 20:14     ` [PATCH v3 1/3] revision: complicated pathspecs disable filters Derrick Stolee via GitGitGadget
@ 2020-04-16 20:14     ` Derrick Stolee via GitGitGadget
  2020-04-16 20:14     ` [PATCH v3 3/3] blame: use changed-path " Derrick Stolee via GitGitGadget
  2 siblings, 0 replies; 48+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-16 20:14 UTC (permalink / raw)
  To: git; +Cc: me, jnareb, garimasigit, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The GIT_TEST_COMMIT_GRAPH environment variable updates the commit-
graph file whenever "git commit" is run, ensuring that we always
have an updated commit-graph throughout the test suite. The
GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS environment variable was
introduced to write the changed-path Bloom filters whenever "git
commit-graph write" is run. However, the GIT_TEST_COMMIT_GRAPH
trick doesn't launch a separate process and instead writes it
directly.

To expand the number of tests that have commits in the commit-graph
file, add a helper method that computes the commit-graph and place
that helper inside "git commit" and "git merge".

In the helper method, check GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS
to ensure we are writing changed-path Bloom filters whenever
possible.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/commit.c |  4 +---
 builtin/merge.c  |  7 +++++--
 commit-graph.c   | 14 ++++++++++++++
 commit-graph.h   |  9 +++++++++
 4 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/builtin/commit.c b/builtin/commit.c
index d3e7781e658..27d4ff6b790 100644
--- a/builtin/commit.c
+++ b/builtin/commit.c
@@ -1700,9 +1700,7 @@ int cmd_commit(int argc, const char **argv, const char *prefix)
 		      "new_index file. Check that disk is not full and quota is\n"
 		      "not exceeded, and then \"git restore --staged :/\" to recover."));
 
-	if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) &&
-	    write_commit_graph_reachable(the_repository->objects->odb, 0, NULL))
-		return 1;
+	git_test_write_commit_graph_or_die();
 
 	repo_rerere(the_repository, 0);
 	run_command_v_opt(argv_gc_auto, RUN_GIT_CMD);
diff --git a/builtin/merge.c b/builtin/merge.c
index d127d2225f8..db11af5b1cd 100644
--- a/builtin/merge.c
+++ b/builtin/merge.c
@@ -40,6 +40,7 @@
 #include "branch.h"
 #include "commit-reach.h"
 #include "wt-status.h"
+#include "commit-graph.h"
 
 #define DEFAULT_TWOHEAD (1<<0)
 #define DEFAULT_OCTOPUS (1<<1)
@@ -1673,9 +1674,11 @@ int cmd_merge(int argc, const char **argv, const char *prefix)
 				   head_commit);
 	}
 
-	if (squash)
+	if (squash) {
 		finish(head_commit, remoteheads, NULL, NULL);
-	else
+
+		git_test_write_commit_graph_or_die();
+	} else
 		write_merge_state(remoteheads);
 
 	if (merge_was_ok)
diff --git a/commit-graph.c b/commit-graph.c
index 77668629e27..0809f34f8d3 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -19,6 +19,20 @@
 #include "bloom.h"
 #include "commit-slab.h"
 
+void git_test_write_commit_graph_or_die(void)
+{
+	int flags = 0;
+	if (!git_env_bool(GIT_TEST_COMMIT_GRAPH, 0))
+		return;
+
+	if (git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0))
+		flags = COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
+	
+	if (write_commit_graph_reachable(the_repository->objects->odb,
+					 flags, NULL))
+		die("failed to write commit-graph under GIT_TEST_COMMIT_GRAPH");
+}
+
 #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
 #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
 #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
diff --git a/commit-graph.h b/commit-graph.h
index 8655d064c14..39484482cc1 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -11,6 +11,15 @@
 #define GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD"
 #define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
 
+/*
+ * This method is only used to enhance coverage of the commit-graph
+ * feature in the test suite with the GIT_TEST_COMMIT_GRAPH and
+ * GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS environment variables. Do not
+ * call this method oustide of a builtin, and only if you know what
+ * you are doing!
+ */
+void git_test_write_commit_graph_or_die(void);
+
 struct commit;
 struct bloom_filter_settings;
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v3 3/3] blame: use changed-path Bloom filters
  2020-04-16 20:14   ` [PATCH v3 0/3] " Derrick Stolee via GitGitGadget
  2020-04-16 20:14     ` [PATCH v3 1/3] revision: complicated pathspecs disable filters Derrick Stolee via GitGitGadget
  2020-04-16 20:14     ` [PATCH v3 2/3] tests: write commit-graph with Bloom filters Derrick Stolee via GitGitGadget
@ 2020-04-16 20:14     ` Derrick Stolee via GitGitGadget
  2 siblings, 0 replies; 48+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-16 20:14 UTC (permalink / raw)
  To: git; +Cc: me, jnareb, garimasigit, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The changed-path Bloom filters help reduce the amount of tree
parsing required during history queries. Before calculating a
diff, we can ask the filter if a path changed between a commit
and its first parent. If the filter says "no" then we can move
on without parsing trees. If the filter says "maybe" then we
parse trees to discover if the answer is actually "yes" or "no".

When computing a blame, there is a section in find_origin() that
computes a diff between a commit and one of its parents. When this
is the first parent, we can check the Bloom filters before calling
diff_tree_oid().

In order to make this work with the blame machinery, we need to
initialize a struct bloom_key with the initial path. But also, we
need to add more keys to a list if a rename is detected. We then
check to see if _any_ of these keys answer "maybe" in the diff.

During development, I purposefully left out this "add a new key
when a rename is detected" to see if the test suite would catch
my error. That is how I discovered the issues with
GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS from the previous change.
With that change, we can feel some confidence in the coverage of
this change.

If a user requests copy detection using "git blame -C", then there
are more places where the set of "important" files can expand. I
do not know enough about how this happens in the blame machinery.
Thus, the Bloom filter integration is explicitly disabled in this
mode. A later change could expand the bloom_key data with an
appropriate call (or calls) to add_bloom_key().

If we did not disable this mode, then the following tests would
fail:

	t8003-blame-corner-cases.sh
	t8011-blame-split-file.sh

Generally, this is a performance enhancement and should not
change the behavior of 'git blame' in any way. If a repo has a
commit-graph file with computed changed-path Bloom filters, then
they should notice improved performance for their 'git blame'
commands.

Here are some example timings that I found by blaming some paths
in the Linux kernel repository:

 git blame arch/x86/kernel/topology.c >/dev/null

 Before: 0.83s
  After: 0.24s

 git blame kernel/time/time.c >/dev/null

 Before: 0.72s
  After: 0.24s

 git blame tools/perf/ui/stdio/hist.c >/dev/null

 Before: 0.27s
  After: 0.11s

I specifically looked for "deep" paths that were also edited many
times. As a counterpoint, the MAINTAINERS file was edited many
times but is located in the root tree. This means that the cost of
computing a diff relative to the pathspec is very small. Here are
the timings for that command:

 git blame MAINTAINERS >/dev/null

 Before: 20.1s
  After: 18.0s

These timings are the best of five. The worst-case runs were on the
order of 2.5 minutes for both cases. Note that the MAINTAINERS file
has 18,740 lines across 17,000+ commits. This happens to be one of
the cases where this change provides the least improvement.

The lack of improvement for the MAINTAINERS file and the relatively
modest improvement for the other examples can be easily explained.
The blame machinery needs to compute line-level diffs to determine
which lines were changed by each commit. That makes up a large
proportion of the computation time, and this change does not
attempt to improve on that section of the algorithm. The
MAINTAINERS file is large and changed often, so it takes time to
determine which lines were updated by which commit. In contrast,
the code files are much smaller, and it takes longer to comute
the line-by-line diff for a single patch on the Linux mailing
lists.

Outside of the "-C" integration, I believe there is little more to
gain from the changed-path Bloom filters for 'git blame' after this
patch.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 blame.c         | 139 ++++++++++++++++++++++++++++++++++++++++++++----
 blame.h         |   6 +++
 builtin/blame.c |  10 ++++
 3 files changed, 146 insertions(+), 9 deletions(-)

diff --git a/blame.c b/blame.c
index 29770e5c81c..9fbf79e47c3 100644
--- a/blame.c
+++ b/blame.c
@@ -9,6 +9,8 @@
 #include "blame.h"
 #include "alloc.h"
 #include "commit-slab.h"
+#include "bloom.h"
+#include "commit-graph.h"
 
 define_commit_slab(blame_suspects, struct blame_origin *);
 static struct blame_suspects blame_suspects;
@@ -1246,13 +1248,75 @@ static int fill_blob_sha1_and_mode(struct repository *r,
 	return -1;
 }
 
+struct blame_bloom_data {
+	/*
+	 * Changed-path Bloom filter keys. These can help prevent
+	 * computing diffs against first parents, but we need to
+	 * expand the list as code is moved or files are renamed.
+	 */
+	struct bloom_filter_settings *settings;
+	struct bloom_key **keys;
+	int nr;
+	int alloc;
+};
+
+static int bloom_count_queries = 0;
+static int bloom_count_no = 0;
+static int maybe_changed_path(struct repository *r,
+			      struct commit *parent,
+			      struct blame_origin *origin,
+			      struct blame_bloom_data *bd)
+{
+	int i;
+	struct bloom_filter *filter;
+
+	if (!bd)
+		return 1;
+
+	if (origin->commit->generation == GENERATION_NUMBER_INFINITY)
+		return 1;
+
+	filter = get_bloom_filter(r, origin->commit, 0);
+
+	if (!filter)
+		return 1;
+
+	bloom_count_queries++;
+	for (i = 0; i < bd->nr; i++) {
+		if (bloom_filter_contains(filter,
+					  bd->keys[i],
+					  bd->settings))
+			return 1;
+	}
+
+	bloom_count_no++;
+	return 0;
+}
+
+static void add_bloom_key(struct blame_bloom_data *bd,
+			  const char *path)
+{
+	if (!bd)
+		return;
+
+	if (bd->nr >= bd->alloc) {
+		bd->alloc *= 2;
+		REALLOC_ARRAY(bd->keys, bd->alloc);
+	}
+
+	bd->keys[bd->nr] = xmalloc(sizeof(struct bloom_key));
+	fill_bloom_key(path, strlen(path), bd->keys[bd->nr], bd->settings);
+	bd->nr++;
+}
+
 /*
  * We have an origin -- check if the same path exists in the
  * parent and return an origin structure to represent it.
  */
 static struct blame_origin *find_origin(struct repository *r,
 					struct commit *parent,
-					struct blame_origin *origin)
+					struct blame_origin *origin,
+					struct blame_bloom_data *bd)
 {
 	struct blame_origin *porigin;
 	struct diff_options diff_opts;
@@ -1286,10 +1350,19 @@ static struct blame_origin *find_origin(struct repository *r,
 
 	if (is_null_oid(&origin->commit->object.oid))
 		do_diff_cache(get_commit_tree_oid(parent), &diff_opts);
-	else
-		diff_tree_oid(get_commit_tree_oid(parent),
-			      get_commit_tree_oid(origin->commit),
-			      "", &diff_opts);
+	else {
+		int compute_diff = 1;
+		if (origin->commit->parents &&
+		    !oidcmp(&parent->object.oid,
+			    &origin->commit->parents->item->object.oid))
+			compute_diff = maybe_changed_path(r, parent,
+							  origin, bd);
+
+		if (compute_diff)
+			diff_tree_oid(get_commit_tree_oid(parent),
+				      get_commit_tree_oid(origin->commit),
+				      "", &diff_opts);
+	}
 	diffcore_std(&diff_opts);
 
 	if (!diff_queued_diff.nr) {
@@ -1341,7 +1414,8 @@ static struct blame_origin *find_origin(struct repository *r,
  */
 static struct blame_origin *find_rename(struct repository *r,
 					struct commit *parent,
-					struct blame_origin *origin)
+					struct blame_origin *origin,
+					struct blame_bloom_data *bd)
 {
 	struct blame_origin *porigin = NULL;
 	struct diff_options diff_opts;
@@ -1366,6 +1440,7 @@ static struct blame_origin *find_rename(struct repository *r,
 		struct diff_filepair *p = diff_queued_diff.queue[i];
 		if ((p->status == 'R' || p->status == 'C') &&
 		    !strcmp(p->two->path, origin->path)) {
+			add_bloom_key(bd, p->one->path);
 			porigin = get_origin(parent, p->one->path);
 			oidcpy(&porigin->blob_oid, &p->one->oid);
 			porigin->mode = p->one->mode;
@@ -2332,6 +2407,11 @@ static void distribute_blame(struct blame_scoreboard *sb, struct blame_entry *bl
 
 #define MAXSG 16
 
+typedef struct blame_origin *(*blame_find_alg)(struct repository *,
+					       struct commit *,
+					       struct blame_origin *,
+					       struct blame_bloom_data *);
+
 static void pass_blame(struct blame_scoreboard *sb, struct blame_origin *origin, int opt)
 {
 	struct rev_info *revs = sb->revs;
@@ -2356,8 +2436,7 @@ static void pass_blame(struct blame_scoreboard *sb, struct blame_origin *origin,
 	 * common cases, then we look for renames in the second pass.
 	 */
 	for (pass = 0; pass < 2 - sb->no_whole_file_rename; pass++) {
-		struct blame_origin *(*find)(struct repository *, struct commit *, struct blame_origin *);
-		find = pass ? find_rename : find_origin;
+		blame_find_alg find = pass ? find_rename : find_origin;
 
 		for (i = 0, sg = first_scapegoat(revs, commit, sb->reverse);
 		     i < num_sg && sg;
@@ -2369,7 +2448,7 @@ static void pass_blame(struct blame_scoreboard *sb, struct blame_origin *origin,
 				continue;
 			if (parse_commit(p))
 				continue;
-			porigin = find(sb->repo, p, origin);
+			porigin = find(sb->repo, p, origin, sb->bloom_data);
 			if (!porigin)
 				continue;
 			if (oideq(&porigin->blob_oid, &origin->blob_oid)) {
@@ -2809,3 +2888,45 @@ struct blame_entry *blame_entry_prepend(struct blame_entry *head,
 	blame_origin_incref(o);
 	return new_head;
 }
+
+void setup_blame_bloom_data(struct blame_scoreboard *sb,
+			    const char *path)
+{
+	struct blame_bloom_data *bd;
+
+	if (!sb->repo->objects->commit_graph)
+		return;
+
+	if (!sb->repo->objects->commit_graph->bloom_filter_settings)
+		return;
+
+	bd = xmalloc(sizeof(struct blame_bloom_data));
+
+	bd->settings = sb->repo->objects->commit_graph->bloom_filter_settings;
+
+	bd->alloc = 4;
+	bd->nr = 0;
+	ALLOC_ARRAY(bd->keys, bd->alloc);
+
+	add_bloom_key(bd, path);
+
+	sb->bloom_data = bd;
+}
+
+void cleanup_scoreboard(struct blame_scoreboard *sb)
+{
+	if (sb->bloom_data) {
+		int i;
+		for (i = 0; i < sb->bloom_data->nr; i++) {
+			free(sb->bloom_data->keys[i]->hashes);
+			free(sb->bloom_data->keys[i]);
+		}
+		free(sb->bloom_data->keys);
+		FREE_AND_NULL(sb->bloom_data);
+
+		trace2_data_intmax("blame", sb->repo,
+				   "bloom/queries", bloom_count_queries);
+		trace2_data_intmax("blame", sb->repo,
+				   "bloom/response-no", bloom_count_no);
+	}
+}
diff --git a/blame.h b/blame.h
index 089b181ff27..b6bbee41472 100644
--- a/blame.h
+++ b/blame.h
@@ -100,6 +100,8 @@ struct blame_entry {
 	int unblamable;
 };
 
+struct blame_bloom_data;
+
 /*
  * The current state of the blame assignment.
  */
@@ -156,6 +158,7 @@ struct blame_scoreboard {
 	void(*found_guilty_entry)(struct blame_entry *, void *);
 
 	void *found_guilty_entry_data;
+	struct blame_bloom_data *bloom_data;
 };
 
 /*
@@ -180,6 +183,9 @@ void init_scoreboard(struct blame_scoreboard *sb);
 void setup_scoreboard(struct blame_scoreboard *sb,
 		      const char *path,
 		      struct blame_origin **orig);
+void setup_blame_bloom_data(struct blame_scoreboard *sb,
+			    const char *path);
+void cleanup_scoreboard(struct blame_scoreboard *sb);
 
 struct blame_entry *blame_entry_prepend(struct blame_entry *head,
 					long start, long end,
diff --git a/builtin/blame.c b/builtin/blame.c
index bf1cecdf3f9..3c13634f279 100644
--- a/builtin/blame.c
+++ b/builtin/blame.c
@@ -1061,6 +1061,14 @@ int cmd_blame(int argc, const char **argv, const char *prefix)
 	string_list_clear(&ignore_revs_file_list, 0);
 	string_list_clear(&ignore_rev_list, 0);
 	setup_scoreboard(&sb, path, &o);
+
+	/*
+	 * Changed-path Bloom filters are disabled when looking
+	 * for copies.
+	 */
+	if (!(opt & PICKAXE_BLAME_COPY))
+		setup_blame_bloom_data(&sb, path);
+
 	lno = sb.num_lines;
 
 	if (lno && !range_list.nr)
@@ -1164,5 +1172,7 @@ int cmd_blame(int argc, const char **argv, const char *prefix)
 		printf("num get patch: %d\n", sb.num_get_patch);
 		printf("num commits: %d\n", sb.num_commits);
 	}
+
+	cleanup_scoreboard(&sb);
 	return 0;
 }
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 1/3] revision: complicated pathspecs disable filters
  2020-04-16 20:14     ` [PATCH v3 1/3] revision: complicated pathspecs disable filters Derrick Stolee via GitGitGadget
@ 2020-06-07 20:33       ` SZEDER Gábor
  0 siblings, 0 replies; 48+ messages in thread
From: SZEDER Gábor @ 2020-06-07 20:33 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, jnareb, garimasigit, Derrick Stolee

On Thu, Apr 16, 2020 at 08:14:02PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> The changed-path Bloom filters work only when we can compute an
> explicit Bloom filter key in advance. When a pathspec is given
> that allows case-insensitive checks or wildcard matching, we
> must disable the Bloom filter performance checks.
> 
> By checking the pathspec in prepare_to_use_bloom_filters(), we
> avoid setting up the Bloom filter data and thus revert to the
> usual logic.
> 
> Before this change, the following tests would fail*:
> 
> 	t6004-rev-list-path-optim.sh (Tests 6-7)
> 	t6130-pathspec-noglob.sh (Tests 3-6)
> 	t6131-pathspec-icase.sh (Tests 3-5)
> 
> *These tests would fail when using GIT_TEST_COMMIT_GRAPH and
> GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS except that the latter
> environment variable was not set up correctly to write the changed-
> path Bloom filters in the test suite. That will be fixed in the
> next change.
> 
> Helped-by: Taylor Blau <me@ttaylorr.com>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  revision.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/revision.c b/revision.c
> index 2b06ee739c8..f78c636e4d0 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -650,6 +650,20 @@ static void trace2_bloom_filter_statistics_atexit(void)
>  	jw_release(&jw);
>  }
>  
> +static int forbid_bloom_filters(struct pathspec *spec)
> +{
> +	if (spec->has_wildcard)
> +		return 1;
> +	if (spec->nr > 1)
> +		return 1;
> +	if (spec->magic & ~PATHSPEC_LITERAL)

Nit: spec->magic is the bitwise OR combination of all
spec->items[i].magic, so checking the latter below is unnecessary.

> +		return 1;
> +	if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
> +		return 1;
> +
> +	return 0;
> +}
> +
>  static void prepare_to_use_bloom_filter(struct rev_info *revs)
>  {
>  	struct pathspec_item *pi;
> @@ -659,7 +673,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>  	int len;
>  
>  	if (!revs->commits)
> -	    return;
> +		return;
> +
> +	if (forbid_bloom_filters(&revs->prune_data))
> +		return;
>  
>  	repo_parse_commit(revs->repo, revs->commits->item);
>  
> -- 
> gitgitgadget
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2020-06-07 20:33 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-11  1:02 [PATCH 0/3] Integrate changed-path Bloom filters with 'git blame' Derrick Stolee via GitGitGadget
2020-04-11  1:02 ` [PATCH 1/3] revision: complicated pathspecs disable filters Derrick Stolee via GitGitGadget
2020-04-11 21:40   ` Junio C Hamano
2020-04-13 11:49     ` Derrick Stolee
2020-04-14 18:25       ` Junio C Hamano
2020-04-15 13:27         ` Derrick Stolee
2020-04-15 18:37           ` Derrick Stolee
2020-04-15 19:32             ` Junio C Hamano
2020-04-15 19:39               ` Junio C Hamano
2020-04-15 21:25             ` Junio C Hamano
2020-04-16  0:56               ` Taylor Blau
2020-04-15 22:18             ` Jakub Narębski
2020-04-16  0:52               ` Taylor Blau
2020-04-16 13:26                 ` Derrick Stolee
2020-04-16 16:33                   ` Taylor Blau
2020-04-16 18:02                     ` Junio C Hamano
2020-04-12 22:22   ` Taylor Blau
2020-04-12 22:30     ` Junio C Hamano
2020-04-13  0:07       ` Taylor Blau
2020-04-13 11:54         ` Derrick Stolee
2020-04-11  1:03 ` [PATCH 2/3] commit: write commit-graph with bloom filters Derrick Stolee via GitGitGadget
2020-04-11 21:57   ` Junio C Hamano
2020-04-12 20:51     ` Taylor Blau
2020-04-13 12:08       ` Derrick Stolee
2020-04-13 22:11         ` Junio C Hamano
2020-04-11  1:03 ` [PATCH 3/3] blame: use changed-path Bloom filters Derrick Stolee via GitGitGadget
2020-04-11 22:03   ` Junio C Hamano
2020-04-12  7:39     ` Eric Sunshine
2020-04-11 21:30 ` [PATCH 0/3] Integrate changed-path Bloom filters with 'git blame' Junio C Hamano
2020-04-13 14:45 ` [PATCH v2 0/4] " Derrick Stolee via GitGitGadget
2020-04-13 14:45   ` [PATCH v2 1/4] revision: complicated pathspecs disable filters Derrick Stolee via GitGitGadget
2020-04-13 16:09     ` Taylor Blau
2020-04-13 22:18       ` Junio C Hamano
2020-04-13 14:45   ` [PATCH v2 2/4] commit: write commit-graph with Bloom filters Derrick Stolee via GitGitGadget
2020-04-13 16:12     ` Taylor Blau
2020-04-13 22:21       ` Junio C Hamano
2020-04-14 15:04         ` Derrick Stolee
2020-04-14 17:26           ` Junio C Hamano
2020-04-14 17:40             ` Derrick Stolee
2020-04-15  0:17               ` Taylor Blau
2020-04-13 14:45   ` [PATCH v2 3/4] commit-graph: write commit-graph in more tests Derrick Stolee via GitGitGadget
2020-04-13 14:45   ` [PATCH v2 4/4] blame: use changed-path Bloom filters Derrick Stolee via GitGitGadget
2020-04-13 16:21   ` [PATCH v2 0/4] Integrate changed-path Bloom filters with 'git blame' Taylor Blau
2020-04-16 20:14   ` [PATCH v3 0/3] " Derrick Stolee via GitGitGadget
2020-04-16 20:14     ` [PATCH v3 1/3] revision: complicated pathspecs disable filters Derrick Stolee via GitGitGadget
2020-06-07 20:33       ` SZEDER Gábor
2020-04-16 20:14     ` [PATCH v3 2/3] tests: write commit-graph with Bloom filters Derrick Stolee via GitGitGadget
2020-04-16 20:14     ` [PATCH v3 3/3] blame: use changed-path " Derrick Stolee via GitGitGadget

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).