Git Mailing List Archive on lore.kernel.org
* [PATCH 0/8] More commit-graph/Bloom filter improvements
@ 2020-06-15 20:14 Derrick Stolee via GitGitGadget
  2020-06-15 20:14 ` [PATCH 1/8] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
                   ` (9 more replies)
  0 siblings, 10 replies; 71+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-06-15 20:14 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, Derrick Stolee

This builds on sg/commit-graph-cleanups, which took several patches from
Szeder's series [1] and applied them almost directly to a more-recent
version of Git [2].

[1] https://lore.kernel.org/git/20200529085038.26008-1-szeder.dev@gmail.com/
[2] https://lore.kernel.org/git/pull.650.git.1591362032.gitgitgadget@gmail.com/

This series adds a few extra improvements, several of which are rooted in
Szeder's original series. I maintained his authorship and sign-off, even
though the patches did not apply or cherry-pick at all.

 1. commit-graph: place bloom_settings in context
 2. commit-graph: unify the signatures of all write_graph_chunk_*()
    functions
 3. commit-graph: simplify chunk writes into loop
 4. commit-graph: check chunk sizes after writing
 5. commit-graph: check all leading directories in changed path Bloom
    filters

Patch 1 is a new preparation patch to then apply Szeder's ideas in the next
four. Some are refactoring or defensive programming, but Patch 5 presents a
meaningful performance improvement. By creating bloom_keys for each leading
directory in a path, we can greatly improve the false-positive rate.

 6. bloom: enforce a minimum size of 8 bytes

Patch 6 is based on a comment of Szeder's: since we are using 1-byte
alignment in the filters, some small filters do not fit the theoretical
analysis that calculated the expected false-positive rate. By increasing the
minimum (non-zero) filter size, we can gain significant performance benefits
while increasing the file size by a small amount.
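
As a rough illustration of the size clamp described for Patch 6, the
following is a minimal sketch and not the code from bloom.c. The helper
name filter_len_bytes() and both constants are assumptions made for this
example; only the 8-byte minimum comes from the patch title.

```c
#include <stddef.h>

#define SKETCH_BITS_PER_BYTE 8
#define SKETCH_MIN_FILTER_BYTES 8 /* minimum taken from the patch title */

/*
 * Hypothetical helper: number of bytes needed for a filter that holds
 * nr_paths entries at bits_per_entry bits each, rounded up to whole
 * bytes. An empty filter stays empty; any non-empty filter is raised
 * to the minimum size.
 */
static size_t filter_len_bytes(size_t nr_paths, size_t bits_per_entry)
{
	size_t len = (nr_paths * bits_per_entry + SKETCH_BITS_PER_BYTE - 1)
		     / SKETCH_BITS_PER_BYTE;

	if (len && len < SKETCH_MIN_FILTER_BYTES)
		len = SKETCH_MIN_FILTER_BYTES;
	return len;
}
```

With an assumed 10 bits per entry, a commit touching one path would get 8
bytes instead of 2 under this sketch, while larger filters are unaffected.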

 7. commit-graph: change test to die on parse, not load
 8. commit-graph: persist existence of changed-paths

The final two patches handle the unresolved usability issue: if a user
writes a commit-graph with --changed-paths, the next write will probably
clear them out. Think about gc.writeCommitGraph or fetch.writeCommitGraph,
which do not allow for the --changed-paths option directly. Another idea is
to add a config option, but I will leave that to others [3].

[3] https://github.com/gitgitgadget/git/pull/633

Here is an analysis of the range-diff between this series and Szeder's PoC
submission.

These patches either are part of sg/commit-graph-cleanups or were discarded
as unnecessary.

 1:  7a8dbfba53a <  -:  ----------- tree-walk.c: don't match submodule entries for 'submod/anything'
 2:  df25e984c58 <  -:  ----------- commit-graph: fix parsing the Chunk Lookup table
 3:  598f7f9a978 <  -:  ----------- commit-graph-format.txt: all multi-byte numbers are in network byte order
 4:  b29e5d39ed6 <  -:  ----------- commit-slab: add a function to deep free entries on the slab
 5:  18f4db7bfb9 <  -:  ----------- diff.h: drop diff_tree_oid() & friends' return value
 6:  bf336f109e6 <  -:  ----------- commit-graph: clean up #includes
 7:  b7f0f831bcf <  -:  ----------- commit-graph: simplify parse_commit_graph() #1
 8:  f2752000052 <  -:  ----------- commit-graph: simplify parse_commit_graph() #2
 9:  4e184b8743c <  -:  ----------- commit-graph: simplify write_commit_graph_file() #1
10:  344dd337da5 <  -:  ----------- commit-graph: simplify write_commit_graph_file() #2
11:  56e3c4f57b3 <  -:  ----------- commit-graph: allocate the 'struct chunk_info' array dinamically

This first patch enables the next refactoring patch.

 -:  ----------- >  1:  c966969071b commit-graph: place bloom_settings in context

This patch is recognized as similar, but all differences are due to
whitespace corrections and the new write_graph_chunk_*() methods.

12:  28fb1b5bdfe !  2:  65eb15221c8 commit-graph: unify the signatures of all write_graph_chunk_*() functions
    @@ Commit message
         This opens up the possibility for further cleanups and foolproofing in
         the following two patches.

         Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
    +    Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

      ## commit-graph.c ##
     @@ commit-graph.c: struct write_commit_graph_context {
    -     const struct split_commit_graph_opts *split_opts;
    +     struct bloom_filter_settings bloom_settings;
      };

     -static void write_graph_chunk_fanout(struct hashfile *f,
    +-                     struct write_commit_graph_context *ctx)
     +static int write_graph_chunk_fanout(struct hashfile *f,
    -                      struct write_commit_graph_context *ctx)
    ++                    struct write_commit_graph_context *ctx)
      {
          int i, count = 0;
    +     struct commit **list = ctx->commits.list;
     @@ commit-graph.c: static void write_graph_chunk_fanout(struct hashfile *f,

              hashwrite_be32(f, count);
          }
    ++
     +    return 0;
      }

     -static void write_graph_chunk_oids(struct hashfile *f, int hash_len,
    +-                   struct write_commit_graph_context *ctx)
     +static int write_graph_chunk_oids(struct hashfile *f,
    -                    struct write_commit_graph_context *ctx)
    ++                  struct write_commit_graph_context *ctx)
      {
          struct commit **list = ctx->commits.list;
          int count;
          for (count = 0; count < ctx->commits.nr; count++, list++) {
              display_progress(ctx->progress, ++ctx->progress_cnt);
     -        hashwrite(f, (*list)->object.oid.hash, (int)hash_len);
    -+        hashwrite(f, (*list)->object.oid.hash, the_hash_algo->rawsz);
    ++        hashwrite(f, (*list)->object.oid.hash, (int)the_hash_algo->rawsz);
          }
    ++
     +    return 0;
      }

    @@ commit-graph.c: static const unsigned char *commit_to_sha1(size_t index, void *t
      }

     -static void write_graph_chunk_data(struct hashfile *f, int hash_len,
    +-                   struct write_commit_graph_context *ctx)
     +static int write_graph_chunk_data(struct hashfile *f,
    -                    struct write_commit_graph_context *ctx)
    ++                  struct write_commit_graph_context *ctx)
      {
          struct commit **list = ctx->commits.list;
    +     struct commit **last = ctx->commits.list + ctx->commits.nr;
     @@ commit-graph.c: static void write_graph_chunk_data(struct hashfile *f, int hash_len,
                  die(_("unable to parse commit %s"),
                      oid_to_hex(&(*list)->object.oid));
    @@ commit-graph.c: static void write_graph_chunk_data(struct hashfile *f, int hash_

              list++;
          }
    ++
     +    return 0;
      }

     -static void write_graph_chunk_extra_edges(struct hashfile *f,
    +-                      struct write_commit_graph_context *ctx)
     +static int write_graph_chunk_extra_edges(struct hashfile *f,
    -                       struct write_commit_graph_context *ctx)
    ++                     struct write_commit_graph_context *ctx)
      {
          struct commit **list = ctx->commits.list;
    +     struct commit **last = ctx->commits.list + ctx->commits.nr;
     @@ commit-graph.c: static void write_graph_chunk_extra_edges(struct hashfile *f,

              list++;
          }
    ++
    ++    return 0;
    + }
    + 
    +-static void write_graph_chunk_bloom_indexes(struct hashfile *f,
    +-                        struct write_commit_graph_context *ctx)
    ++static int write_graph_chunk_bloom_indexes(struct hashfile *f,
    ++                       struct write_commit_graph_context *ctx)
    + {
    +     struct commit **list = ctx->commits.list;
    +     struct commit **last = ctx->commits.list + ctx->commits.nr;
    +@@ commit-graph.c: static void write_graph_chunk_bloom_indexes(struct hashfile *f,
    +     }
    + 
    +     stop_progress(&progress);
    ++    return 0;
    + }
    + 
    +-static void write_graph_chunk_bloom_data(struct hashfile *f,
    +-                     struct write_commit_graph_context *ctx)
    ++static int write_graph_chunk_bloom_data(struct hashfile *f,
    ++                    struct write_commit_graph_context *ctx)
    + {
    +     struct commit **list = ctx->commits.list;
    +     struct commit **last = ctx->commits.list + ctx->commits.nr;
    +@@ commit-graph.c: static void write_graph_chunk_bloom_data(struct hashfile *f,
    +     }
    + 
    +     stop_progress(&progress);
     +    return 0;
      }

      static int oid_compare(const void *_a, const void *_b)
     @@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_context *ctx)
    -             chunks_nr * ctx->commits.nr);
    +             num_chunks * ctx->commits.nr);
          }
          write_graph_chunk_fanout(f, ctx);
     -    write_graph_chunk_oids(f, hashsz, ctx);
    @@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_con
     +    write_graph_chunk_data(f, ctx);
          if (ctx->num_extra_edges)
              write_graph_chunk_extra_edges(f, ctx);
    -     if (ctx->num_commit_graphs_after > 1 &&
    +     if (ctx->changed_paths) {

These patches follow the same intent, but are significantly different
because they are updated with split commit-graphs and the existing
changed-path Bloom filters.

13:  1e1e59e2592 <  -:  ----------- commit-graph: simplify write_commit_graph_file() #3
 -:  ----------- >  3:  3d24b9802df commit-graph: simplify chunk writes into loop
14:  6f0d912e4b8 <  -:  ----------- commit-graph: check chunk sizes after writing
 -:  ----------- >  4:  bdca834e6da commit-graph: check chunk sizes after writing
24:  dc96f0d9822 <  -:  ----------- commit-graph: check all leading directories in modified path Bloom filters
 -:  ----------- >  5:  9975fc96f12 commit-graph: check all leading directories in changed path Bloom filters

These three patches are a few valuable improvements of my own design:

 -:  ----------- >  6:  2a5f1e17528 bloom: enforce a minimum size of 8 bytes
 -:  ----------- >  7:  60bbc15d24a commit-graph: change test to die on parse, not load
 -:  ----------- >  8:  db5b8fe8439 commit-graph: persist existence of changed-paths

At this point, we have updated the existing changed-path Bloom filter
implementation to be on even terms with Szeder's modified-path Bloom filter
implementation.

The next batch of patches contains Szeder's implementation. These implement a
completely different file format, so they are not intended as ways to move
forward. If there is a significant improvement to be found by using this
file format instead of the established one (comparing the old implementation
with these patches), then we could consider swapping the optional chunks for
those that he proposes.

While I had the motivation and energy to defend the current implementation
by applying Szeder's (excellent) ideas to the existing format, I do not
intend to go through the effort of comparing the file formats explicitly at
this point. I would be interested to read a performance analysis, if someone
were to provide one now.

15:  0ab955aac32 <  -:  ----------- commit-graph-format.txt: document the modified path Bloom filter chunks
16:  4c128d51dfe <  -:  ----------- Add a generic and minimal Bloom filter implementation
17:  41f02bc38f7 <  -:  ----------- Import a streaming-capable Murmur3 hash function implementation
18:  e5fd1da48d4 <  -:  ----------- commit-graph: write "empty" Modified Path Bloom Filter Index chunk
19:  2dd882ec601 <  -:  ----------- commit-graph: add commit slab for modified path Bloom filters
20:  f30e495c2b0 <  -:  ----------- commit-graph: fill the Modified Path Bloom Filter Index chunk
21:  e904cb58301 <  -:  ----------- commit-graph: load and use the Modified Path Bloom Filter Index chunk
22:  c71647ca374 <  -:  ----------- commit-graph: write the Modified Path Bloom Filters chunk
23:  50898d42291 <  -:  ----------- commit-graph: load and use the Modified Path Bloom Filters chunk
25:  7cbf1bc6b66 <  -:  ----------- commit-graph: check embedded modified path Bloom filters with a mask
26:  3951fdedf6a <  -:  ----------- commit-graph: deduplicate modified path Bloom filters
27:  5aba19a2766 <  -:  ----------- commit-graph: load modified path Bloom filters for merge commits
28:  93fc6af1d2f <  -:  ----------- commit-graph: write Modified Path Bloom Filter Merge Index chunk
29:  f87b37bf08e <  -:  ----------- commit-graph: extract init and free write_commit_graph_context
30:  943b0d9554c <  -:  ----------- commit-graph: move write_commit_graph_reachable below write_commit_graph
31:  47b26ea61aa <  -:  ----------- t7007-show: make the first test compatible with the next patch
32:  9201b71071c <  -:  ----------- PoC commit-graph: use revision walk machinery for '--reachable'
33:  5c72d97e5e9 <  -:  ----------- commit-graph: write modified path Bloom filters in "history order"

This patch is likely worth investigating again:

34:  8b40ec4cd30 <  -:  ----------- commit-graph: use modified path Bloom filters with wildcards, if possible

Thanks, -Stolee

Derrick Stolee (4):
  commit-graph: place bloom_settings in context
  bloom: enforce a minimum size of 8 bytes
  commit-graph: change test to die on parse, not load
  commit-graph: persist existence of changed-paths

SZEDER Gábor (4):
  commit-graph: unify the signatures of all write_graph_chunk_*()
    functions
  commit-graph: simplify chunk writes into loop
  commit-graph: check chunk sizes after writing
  commit-graph: check all leading directories in changed path Bloom
    filters

 Documentation/git-commit-graph.txt |   5 +-
 bloom.c                            |   4 ++
 builtin/commit-graph.c             |   5 +-
 commit-graph.c                     | 112 ++++++++++++++++++++---------
 commit-graph.h                     |   3 +-
 revision.c                         |  35 ++++++---
 revision.h                         |   6 +-
 t/t4216-log-bloom.sh               |   4 +-
 t/t5318-commit-graph.sh            |   2 +-
 9 files changed, 124 insertions(+), 52 deletions(-)


base-commit: 7fbfe07ab4d4e58c0971dac73001b89f180a0af3
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-659%2Fderrickstolee%2Fbloom-2-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-659/derrickstolee/bloom-2-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/659
-- 
gitgitgadget


* [PATCH 1/8] commit-graph: place bloom_settings in context
  2020-06-15 20:14 [PATCH 0/8] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
@ 2020-06-15 20:14 ` Derrick Stolee via GitGitGadget
  2020-06-18 20:30   ` René Scharfe
  2020-06-15 20:14 ` [PATCH 2/8] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 71+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-06-15 20:14 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Place an instance of struct bloom_settings into the struct
write_commit_graph_context. This allows simplifying the function
prototype of write_graph_chunk_bloom_data(). This will allow us
to combine the function prototypes and use function pointers to
simplify write_commit_graph_file().

Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 887837e8826..05b7035d8d5 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -882,6 +882,7 @@ struct write_commit_graph_context {
 
 	const struct split_commit_graph_opts *split_opts;
 	size_t total_bloom_filter_data_size;
+	struct bloom_filter_settings bloom_settings;
 };
 
 static void write_graph_chunk_fanout(struct hashfile *f,
@@ -1103,8 +1104,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
 }
 
 static void write_graph_chunk_bloom_data(struct hashfile *f,
-					 struct write_commit_graph_context *ctx,
-					 const struct bloom_filter_settings *settings)
+					 struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
@@ -1116,9 +1116,9 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
 			_("Writing changed paths Bloom filters data"),
 			ctx->commits.nr);
 
-	hashwrite_be32(f, settings->hash_version);
-	hashwrite_be32(f, settings->num_hashes);
-	hashwrite_be32(f, settings->bits_per_entry);
+	hashwrite_be32(f, ctx->bloom_settings.hash_version);
+	hashwrite_be32(f, ctx->bloom_settings.num_hashes);
+	hashwrite_be32(f, ctx->bloom_settings.bits_per_entry);
 
 	while (list < last) {
 		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
@@ -1541,6 +1541,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	struct object_id file_hash;
 	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 
+	ctx->bloom_settings = bloom_settings;
+
 	if (ctx->split) {
 		struct strbuf tmp_file = STRBUF_INIT;
 
@@ -1642,7 +1644,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		write_graph_chunk_extra_edges(f, ctx);
 	if (ctx->changed_paths) {
 		write_graph_chunk_bloom_indexes(f, ctx);
-		write_graph_chunk_bloom_data(f, ctx, &bloom_settings);
+		write_graph_chunk_bloom_data(f, ctx);
 	}
 	if (ctx->num_commit_graphs_after > 1 &&
 	    write_graph_chunk_base(f, ctx)) {
-- 
gitgitgadget



* [PATCH 2/8] commit-graph: unify the signatures of all write_graph_chunk_*() functions
  2020-06-15 20:14 [PATCH 0/8] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
  2020-06-15 20:14 ` [PATCH 1/8] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
@ 2020-06-15 20:14 ` SZEDER Gábor via GitGitGadget
  2020-06-18 20:30   ` René Scharfe
  2020-06-15 20:14 ` [PATCH 3/8] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 71+ messages in thread
From: SZEDER Gábor via GitGitGadget @ 2020-06-15 20:14 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, Derrick Stolee, SZEDER Gábor

From: SZEDER Gábor <szeder.dev@gmail.com>

Update the write_graph_chunk_*() helper functions to have the same
signature:

  - Return an int error code from all these functions.
    write_graph_chunk_base() already has an int error code, now the
    others will have one, too, but since they don't indicate any
    error, they will always return 0.

  - Drop the hash size parameter of write_graph_chunk_oids() and
    write_graph_chunk_data(); its value can be read directly from
    'the_hash_algo' inside these functions as well.

This opens up the possibility for further cleanups and foolproofing in
the following two patches.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 42 ++++++++++++++++++++++++++----------------
 1 file changed, 26 insertions(+), 16 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 05b7035d8d5..3bae1e52ed0 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -885,8 +885,8 @@ struct write_commit_graph_context {
 	struct bloom_filter_settings bloom_settings;
 };
 
-static void write_graph_chunk_fanout(struct hashfile *f,
-				     struct write_commit_graph_context *ctx)
+static int write_graph_chunk_fanout(struct hashfile *f,
+				    struct write_commit_graph_context *ctx)
 {
 	int i, count = 0;
 	struct commit **list = ctx->commits.list;
@@ -907,17 +907,21 @@ static void write_graph_chunk_fanout(struct hashfile *f,
 
 		hashwrite_be32(f, count);
 	}
+
+	return 0;
 }
 
-static void write_graph_chunk_oids(struct hashfile *f, int hash_len,
-				   struct write_commit_graph_context *ctx)
+static int write_graph_chunk_oids(struct hashfile *f,
+				  struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
 	int count;
 	for (count = 0; count < ctx->commits.nr; count++, list++) {
 		display_progress(ctx->progress, ++ctx->progress_cnt);
-		hashwrite(f, (*list)->object.oid.hash, (int)hash_len);
+		hashwrite(f, (*list)->object.oid.hash, (int)the_hash_algo->rawsz);
 	}
+
+	return 0;
 }
 
 static const unsigned char *commit_to_sha1(size_t index, void *table)
@@ -926,8 +930,8 @@ static const unsigned char *commit_to_sha1(size_t index, void *table)
 	return commits[index]->object.oid.hash;
 }
 
-static void write_graph_chunk_data(struct hashfile *f, int hash_len,
-				   struct write_commit_graph_context *ctx)
+static int write_graph_chunk_data(struct hashfile *f,
+				  struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
@@ -944,7 +948,7 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 			die(_("unable to parse commit %s"),
 				oid_to_hex(&(*list)->object.oid));
 		tree = get_commit_tree_oid(*list);
-		hashwrite(f, tree->hash, hash_len);
+		hashwrite(f, tree->hash, the_hash_algo->rawsz);
 
 		parent = (*list)->parents;
 
@@ -1024,10 +1028,12 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 
 		list++;
 	}
+
+	return 0;
 }
 
-static void write_graph_chunk_extra_edges(struct hashfile *f,
-					  struct write_commit_graph_context *ctx)
+static int write_graph_chunk_extra_edges(struct hashfile *f,
+					 struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
@@ -1076,10 +1082,12 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
 
 		list++;
 	}
+
+	return 0;
 }
 
-static void write_graph_chunk_bloom_indexes(struct hashfile *f,
-					    struct write_commit_graph_context *ctx)
+static int write_graph_chunk_bloom_indexes(struct hashfile *f,
+					   struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
@@ -1101,10 +1109,11 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
 	}
 
 	stop_progress(&progress);
+	return 0;
 }
 
-static void write_graph_chunk_bloom_data(struct hashfile *f,
-					 struct write_commit_graph_context *ctx)
+static int write_graph_chunk_bloom_data(struct hashfile *f,
+					struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
@@ -1128,6 +1137,7 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
 	}
 
 	stop_progress(&progress);
+	return 0;
 }
 
 static int oid_compare(const void *_a, const void *_b)
@@ -1638,8 +1648,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 			num_chunks * ctx->commits.nr);
 	}
 	write_graph_chunk_fanout(f, ctx);
-	write_graph_chunk_oids(f, hashsz, ctx);
-	write_graph_chunk_data(f, hashsz, ctx);
+	write_graph_chunk_oids(f, ctx);
+	write_graph_chunk_data(f, ctx);
 	if (ctx->num_extra_edges)
 		write_graph_chunk_extra_edges(f, ctx);
 	if (ctx->changed_paths) {
-- 
gitgitgadget



* [PATCH 3/8] commit-graph: simplify chunk writes into loop
  2020-06-15 20:14 [PATCH 0/8] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
  2020-06-15 20:14 ` [PATCH 1/8] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
  2020-06-15 20:14 ` [PATCH 2/8] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
@ 2020-06-15 20:14 ` SZEDER Gábor via GitGitGadget
  2020-06-18 20:30   ` René Scharfe
  2020-06-15 20:14 ` [PATCH 4/8] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 71+ messages in thread
From: SZEDER Gábor via GitGitGadget @ 2020-06-15 20:14 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, Derrick Stolee, SZEDER Gábor

From: SZEDER Gábor <szeder.dev@gmail.com>

In write_commit_graph_file() we now have one block of code filling the
array of 'struct chunk_info' with the IDs and sizes of chunks to be
written, and another block of code calling the functions responsible
for writing individual chunks.  In case of optional chunks like Extra
Edge List and Base Graphs List there is also a condition checking
whether that chunk is necessary/desired, and that same condition is
repeated in both blocks of code. Other, newer chunks have similar
optional conditions.

Eliminate these repeated conditions by storing the function pointers
responsible for writing individual chunks in the 'struct chunk_info'
array as well, and calling them in a loop to write the commit-graph
file.  This will open up the possibility for a bit of foolproofing in
the following patch.
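
The table-driven idea in the paragraph above can be sketched in isolation
roughly as follows. This is a simplified stand-in, not the patch itself:
the sketch_* types, the chunk IDs, and the two writer functions are
hypothetical, and the "file" only counts bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for git's hashfile and write context. */
struct sketch_file { uint64_t written; };
struct sketch_ctx { int have_extra_edges; };

typedef int (*sketch_write_fn)(struct sketch_file *f, struct sketch_ctx *ctx);

struct sketch_chunk {
	uint32_t id;
	sketch_write_fn write_fn;
};

static int sketch_write_fanout(struct sketch_file *f, struct sketch_ctx *ctx)
{
	(void)ctx;
	f->written += 1024; /* pretend to write 256 four-byte entries */
	return 0;
}

static int sketch_write_extra_edges(struct sketch_file *f, struct sketch_ctx *ctx)
{
	(void)ctx;
	f->written += 4; /* pretend to write a single edge */
	return 0;
}

/* The "is this chunk wanted?" condition is evaluated only here. */
static size_t sketch_plan_chunks(struct sketch_chunk *chunks,
				 const struct sketch_ctx *ctx)
{
	size_t n = 0;

	chunks[n].id = 1;
	chunks[n].write_fn = sketch_write_fanout;
	n++;
	if (ctx->have_extra_edges) {
		chunks[n].id = 2;
		chunks[n].write_fn = sketch_write_extra_edges;
		n++;
	}
	return n;
}

/* The write loop no longer repeats any per-chunk condition. */
static int sketch_write_chunks(struct sketch_file *f, struct sketch_ctx *ctx,
			       const struct sketch_chunk *chunks, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (chunks[i].write_fn(f, ctx))
			return -1;
	return 0;
}
```

Because the planning step is the only place that knows which chunks are
optional, adding a new chunk in this sketch means touching one block
instead of two.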

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 31 +++++++++++++++++++------------
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 3bae1e52ed0..78e023be664 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1532,9 +1532,13 @@ static int write_graph_chunk_base(struct hashfile *f,
 	return 0;
 }
 
+typedef int (*chunk_write_fn)(struct hashfile *f,
+			      struct write_commit_graph_context *ctx);
+
 struct chunk_info {
 	uint32_t id;
 	uint64_t size;
+	chunk_write_fn write_fn;
 };
 
 static int write_commit_graph_file(struct write_commit_graph_context *ctx)
@@ -1591,27 +1595,34 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 
 	chunks[0].id = GRAPH_CHUNKID_OIDFANOUT;
 	chunks[0].size = GRAPH_FANOUT_SIZE;
+	chunks[0].write_fn = write_graph_chunk_fanout;
 	chunks[1].id = GRAPH_CHUNKID_OIDLOOKUP;
 	chunks[1].size = hashsz * ctx->commits.nr;
+	chunks[1].write_fn = write_graph_chunk_oids;
 	chunks[2].id = GRAPH_CHUNKID_DATA;
 	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
+	chunks[2].write_fn = write_graph_chunk_data;
 	if (ctx->num_extra_edges) {
 		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
 		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
+		chunks[num_chunks].write_fn = write_graph_chunk_extra_edges;
 		num_chunks++;
 	}
 	if (ctx->changed_paths) {
 		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMINDEXES;
 		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
+		chunks[num_chunks].write_fn = write_graph_chunk_bloom_indexes;
 		num_chunks++;
 		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMDATA;
 		chunks[num_chunks].size = sizeof(uint32_t) * 3
 					  + ctx->total_bloom_filter_data_size;
+		chunks[num_chunks].write_fn = write_graph_chunk_bloom_data;
 		num_chunks++;
 	}
 	if (ctx->num_commit_graphs_after > 1) {
 		chunks[num_chunks].id = GRAPH_CHUNKID_BASE;
 		chunks[num_chunks].size = hashsz * (ctx->num_commit_graphs_after - 1);
+		chunks[num_chunks].write_fn = write_graph_chunk_base;
 		num_chunks++;
 	}
 
@@ -1647,19 +1658,15 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 			progress_title.buf,
 			num_chunks * ctx->commits.nr);
 	}
-	write_graph_chunk_fanout(f, ctx);
-	write_graph_chunk_oids(f, ctx);
-	write_graph_chunk_data(f, ctx);
-	if (ctx->num_extra_edges)
-		write_graph_chunk_extra_edges(f, ctx);
-	if (ctx->changed_paths) {
-		write_graph_chunk_bloom_indexes(f, ctx);
-		write_graph_chunk_bloom_data(f, ctx);
-	}
-	if (ctx->num_commit_graphs_after > 1 &&
-	    write_graph_chunk_base(f, ctx)) {
-		return -1;
+
+	for (i = 0; i < num_chunks; i++) {
+		if (chunks[i].write_fn(f, ctx)) {
+			error(_("failed writing chunk with id %"PRIx32""),
+			      chunks[i].id);
+			return -1;
+		}
 	}
+
 	stop_progress(&ctx->progress);
 	strbuf_release(&progress_title);
 
-- 
gitgitgadget



* [PATCH 4/8] commit-graph: check chunk sizes after writing
  2020-06-15 20:14 [PATCH 0/8] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
                   ` (2 preceding siblings ...)
  2020-06-15 20:14 ` [PATCH 3/8] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
@ 2020-06-15 20:14 ` SZEDER Gábor via GitGitGadget
  2020-06-15 20:14 ` [PATCH 5/8] commit-graph: check all leading directories in changed path Bloom filters SZEDER Gábor via GitGitGadget
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 71+ messages in thread
From: SZEDER Gábor via GitGitGadget @ 2020-06-15 20:14 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, Derrick Stolee, SZEDER Gábor

From: SZEDER Gábor <szeder.dev@gmail.com>

In my experience while experimenting with new commit-graph chunks,
early versions of the corresponding new write_commit_graph_my_chunk()
functions are, sadly but not surprisingly, often buggy, and write more
or less data than they are supposed to, especially if the chunk size
is not directly proportional to the number of commits.  This then
causes all kinds of issues when reading such a bogus commit-graph
file, raising the question of whether the writing or the reading part
happens to be buggy this time.

Let's catch such issues early, already when writing the commit-graph
file, and check that each write_graph_chunk_*() function wrote the
amount of data that it was expected to, and what has been encoded in
the Chunk Lookup table.  Now that all commit-graph chunks are written
in a loop we can do this check in a single place for all chunks, and
any chunks added in the future will get checked as well.
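
A minimal standalone sketch of this check, assuming a plain byte offset
in place of the hashfile and hypothetical writer functions, could look
like the following. Unlike the patch, which calls BUG(), the sketch
reports the index of the offending chunk so that it can be exercised in
isolation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical writer type: advances *offset by the bytes it "writes". */
typedef void (*checked_write_fn)(uint64_t *offset);

struct checked_chunk {
	uint64_t size; /* size announced in the lookup table */
	checked_write_fn write_fn;
};

static void checked_write_8(uint64_t *offset) { *offset += 8; }
static void checked_write_12(uint64_t *offset) { *offset += 12; }

/*
 * Run every writer and compare the bytes actually written with the
 * announced size. Returns -1 if all chunks match, otherwise the index
 * of the first chunk that wrote too much or too little.
 */
static int check_chunk_sizes(const struct checked_chunk *chunks, size_t n,
			     uint64_t *offset)
{
	uint64_t chunk_offset = *offset;

	for (size_t i = 0; i < n; i++) {
		chunks[i].write_fn(offset);
		if (*offset - chunk_offset != chunks[i].size)
			return (int)i;
		chunk_offset = *offset;
	}
	return -1;
}
```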

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 78e023be664..5c8f210cada 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1659,12 +1659,21 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 			num_chunks * ctx->commits.nr);
 	}
 
+	chunk_offset = f->total + f->offset;
 	for (i = 0; i < num_chunks; i++) {
+		uint64_t end_offset;
+
 		if (chunks[i].write_fn(f, ctx)) {
 			error(_("failed writing chunk with id %"PRIx32""),
 			      chunks[i].id);
 			return -1;
 		}
+
+		end_offset = f->total + f->offset;
+		if (end_offset - chunk_offset != chunks[i].size)
+			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
+			    chunks[i].size, chunks[i].id, end_offset - chunk_offset);
+		chunk_offset = end_offset;
 	}
 
 	stop_progress(&ctx->progress);
-- 
gitgitgadget



* [PATCH 5/8] commit-graph: check all leading directories in changed path Bloom filters
  2020-06-15 20:14 [PATCH 0/8] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
                   ` (3 preceding siblings ...)
  2020-06-15 20:14 ` [PATCH 4/8] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
@ 2020-06-15 20:14 ` SZEDER Gábor via GitGitGadget
  2020-06-18 20:31   ` René Scharfe
  2020-06-19 17:17   ` Taylor Blau
  2020-06-15 20:14 ` [PATCH 6/8] bloom: enforce a minimum size of 8 bytes Derrick Stolee via GitGitGadget
                   ` (4 subsequent siblings)
  9 siblings, 2 replies; 71+ messages in thread
From: SZEDER Gábor via GitGitGadget @ 2020-06-15 20:14 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, Derrick Stolee, SZEDER Gábor

From: SZEDER Gábor <szeder.dev@gmail.com>

The file 'dir/subdir/file' can only be modified if its leading
directories 'dir' and 'dir/subdir' are modified as well.

So when querying modified path Bloom filters for commits that modify a
path with multiple path components, check not only the full path but
all of its leading directories as well.  Take care to check these
paths in "deepest first" order, because the full path is the one least
likely to be modified, which lets the Bloom filter queries short
circuit sooner.
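
As a minimal sketch of the "deepest first" ordering described above,
the prefixes to query could be enumerated as follows.  This is a
hypothetical helper for illustration, not the code in this patch, and
it assumes the path has no leading or trailing slash.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Git): store the lengths of the
 * prefixes of `path` that should be queried, deepest first.  The first
 * entry is the full path; each following entry ends just before a '/'.
 * Returns the number of lengths written (at most `max`).
 */
size_t leading_prefix_lengths(const char *path, size_t *lens, size_t max)
{
	size_t len = strlen(path);
	size_t n = 0;
	size_t i;

	if (!len || !max)
		return 0;

	/* The full path is the deepest prefix. */
	lens[n++] = len;

	/* Walk backwards; each '/' marks the end of a leading directory. */
	for (i = len - 1; i > 0 && n < max; i--) {
		if (path[i] == '/')
			lens[n++] = i;
	}
	return n;
}
```

For 'dir/subdir/file' this yields the lengths 15, 10 and 3, that is
'dir/subdir/file', 'dir/subdir' and 'dir'.  A caller would build one
Bloom key per prefix and stop at the first key the filter rejects.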

This can significantly reduce the average false positive rate, by
about an order of magnitude or three(!), and can further speed up
pathspec-limited revision walks.  The table below compares the average
false positive rate and runtime of

  git rev-list HEAD -- "$path"

before and after this change for 5000+ randomly* selected paths from
each repository:

                    Average false           Average        Average
                    positive rate           runtime        runtime
                  before     after     before     after   difference
  ------------------------------------------------------------------
  git             3.220%   0.7853%     0.0558s   0.0387s   -30.6%
  linux           2.453%   0.0296%     0.1046s   0.0766s   -26.8%
  tensorflow      2.536%   0.6977%     0.0594s   0.0420s   -29.2%

*Path selection was done with the following pipeline:

	git ls-tree -r --name-only HEAD | sort -R | head -n 5000

The improvements in runtime are much smaller than the improvements in
average false positive rate, as we are clearly reaching diminishing
returns here.  However, all these timings depend on tree objects being
reasonably fast to access (warm caches).  If we had a partial
clone and the tree objects had to be fetched from a promisor remote,
e.g.:

  $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
  $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
        commit-graph write --reachable
  $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
  $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
        rev-list HEAD -- "$path"

then checking all leading path components can reduce the runtime from
over an hour to a few seconds (and this is with the clone and the
promisor on the same machine).

This adjusts the tracing values in t4216-log-bloom.sh, which provides a
concrete way to notice the improvement.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c           | 35 ++++++++++++++++++++++++++---------
 revision.h           |  6 ++++--
 t/t4216-log-bloom.sh |  2 +-
 3 files changed, 31 insertions(+), 12 deletions(-)

diff --git a/revision.c b/revision.c
index c644c660917..027ae3982b4 100644
--- a/revision.c
+++ b/revision.c
@@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 {
 	struct pathspec_item *pi;
 	char *path_alloc = NULL;
-	const char *path;
+	const char *path, *p;
 	int last_index;
-	int len;
+	size_t len;
+	int path_component_nr = 0, j;
 
 	if (!revs->commits)
 		return;
@@ -705,8 +706,22 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 
 	len = strlen(path);
 
-	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
-	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
+	p = path;
+	do {
+		p = strchrnul(p + 1, '/');
+		path_component_nr++;
+	} while (p - path < len);
+
+	revs->bloom_keys_nr = path_component_nr;
+	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
+
+	p = path;
+	for (j = 0; j < revs->bloom_keys_nr; j++) {
+		p = strchrnul(p + 1, '/');
+
+		fill_bloom_key(path, p - path, &revs->bloom_keys[j],
+			       revs->bloom_filter_settings);
+	}
 
 	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
 		atexit(trace2_bloom_filter_statistics_atexit);
@@ -720,7 +735,7 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 						 struct commit *commit)
 {
 	struct bloom_filter *filter;
-	int result;
+	int result = 1, j;
 
 	if (!revs->repo->objects->commit_graph)
 		return -1;
@@ -740,9 +755,11 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 		return -1;
 	}
 
-	result = bloom_filter_contains(filter,
-				       revs->bloom_key,
-				       revs->bloom_filter_settings);
+	for (j = 0; result && j < revs->bloom_keys_nr; j++) {
+		result = bloom_filter_contains(filter,
+					       &revs->bloom_keys[j],
+					       revs->bloom_filter_settings);
+	}
 
 	if (result)
 		count_bloom_filter_maybe++;
@@ -782,7 +799,7 @@ static int rev_compare_tree(struct rev_info *revs,
 			return REV_TREE_SAME;
 	}
 
-	if (revs->bloom_key && !nth_parent) {
+	if (revs->bloom_keys_nr && !nth_parent) {
 		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
 
 		if (bloom_ret == 0)
diff --git a/revision.h b/revision.h
index 7c026fe41fc..abbfb4ab59a 100644
--- a/revision.h
+++ b/revision.h
@@ -295,8 +295,10 @@ struct rev_info {
 	struct topo_walk_info *topo_walk_info;
 
 	/* Commit graph bloom filter fields */
-	/* The bloom filter key for the pathspec */
-	struct bloom_key *bloom_key;
+	/* The bloom filter key(s) for the pathspec */
+	struct bloom_key *bloom_keys;
+	int bloom_keys_nr;
+
 	/*
 	 * The bloom filter settings used to generate the key.
 	 * This is loaded from the commit-graph being used.
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index c7011f33e2c..c13b97d3bda 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -142,7 +142,7 @@ test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
 
 test_bloom_filters_used_when_some_filters_are_missing () {
 	log_args=$1
-	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":8,\"definitely_not\":6"
+	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":8"
 	setup "$log_args" &&
 	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
 	test_cmp log_wo_bloom log_w_bloom
-- 
gitgitgadget



* [PATCH 6/8] bloom: enforce a minimum size of 8 bytes
  2020-06-15 20:14 [PATCH 0/8] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
                   ` (4 preceding siblings ...)
  2020-06-15 20:14 ` [PATCH 5/8] commit-graph: check all leading directories in changed path Bloom filters SZEDER Gábor via GitGitGadget
@ 2020-06-15 20:14 ` Derrick Stolee via GitGitGadget
  2020-06-15 20:14 ` [PATCH 7/8] commit-graph: change test to die on parse, not load Derrick Stolee via GitGitGadget
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 71+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-06-15 20:14 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The original design of changed-path Bloom filters included an 8-byte
block size for filter lengths. This was changed mid-way through the
submission process, and now the length stored in the commit-graph has
one-byte granularity.

This can cause some issues for very small filters. The analysis for
false positive rates assumes large filters, so rounding errors become
less important at that scale. When there are only a few paths changed,
a filter that is only a few bytes in size could have very different
behavior. In fact, this is evidenced in the Git repository due to the
code organization and careful patch creation that leads to many commits
with very small filters. These small filters frequently have
false-positive rates in the 8-10% range or higher.

The previous change improved the false-positive rate using multiple
Bloom keys when the path has multiple directory components. However,
that does not help at all for files at root. It is typical to have
several commits that change only the README at root, and those commits
would be likely to have these artificially high false-positive rates.

Correct this issue by creating a minimum filter size of 8 bytes. This
requires the very small commits (with fewer than six changes, including
non-root directories) to have a larger filter. In principle, this
violates the bits_per_entry value of struct bloom_filter_settings.
However, it does not actually create a functional problem.
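
The size rule can be sketched as a standalone function.  The names
below are hypothetical, and the worked numbers assume a bits_per_entry
of 10, which is an assumption about the default settings rather than
something stated in this message.

```c
#include <stddef.h>

#define BITS_PER_BYTE 8
#define MIN_FILTER_BYTES 8	/* minimum size of a non-empty filter */

/*
 * Hypothetical helper: number of bytes for a filter holding `nr_paths`
 * entries at `bits_per_entry` bits each, rounded up to whole bytes and
 * then raised to the minimum size unless the filter is empty.
 */
size_t filter_len_bytes(size_t nr_paths, size_t bits_per_entry)
{
	size_t len = (nr_paths * bits_per_entry + BITS_PER_BYTE - 1) /
		     BITS_PER_BYTE;

	if (len && len < MIN_FILTER_BYTES)
		len = MIN_FILTER_BYTES;
	return len;
}
```

Under that assumption, five changed paths need 50 bits, which rounds up
to 7 bytes and is raised to 8, while six paths already need 8 bytes.
This matches the "fewer than six changes" threshold above.  An empty
filter keeps its length of zero.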

As for compatibility, this only affects new versions writing filters for
commits that do not yet have a filter. Old versions will write the
smaller filters and this version will persist and properly read that
data. Now, the new files will be generated slightly larger.

               Bytes before   Bytes after  Difference
  --------------------------------------------------
  git             4,021,078    4,275,311   +6.32%
  linux          72,212,101   73,909,286   +2.35%
  tensorflow      7,596,359    7,691,646   +1.25%

This has a measurable improvement in the false-positive rate and the
end-to-end run time for these repos. The table below compares the average
false-positive rate and runtime of

  git rev-list HEAD -- "$path"

before and after this change for 5000+ randomly* selected paths from
each repository:

                    Average false           Average        Average
                    positive rate           runtime        runtime
                  before     after     before     after   difference
  ------------------------------------------------------------------
  git             0.786%     0.227%    0.0387s    0.0289s -25.5%
  linux           0.0296%    0.0174%   0.0766s    0.0706s  -7.8%
  tensorflow      0.6977%    0.0268%   0.0420s    0.0384s  -8.5%

*Path selection was done with the following pipeline:

        git ls-tree -r --name-only HEAD | sort -R | head -n 5000

These relatively-small increases in file size appear to be a fair price
to pay for these performance improvements.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 bloom.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/bloom.c b/bloom.c
index c38d1cff0c6..875e3853c2c 100644
--- a/bloom.c
+++ b/bloom.c
@@ -258,6 +258,10 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 		}
 
 		filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
+
+		if (filter->len && filter->len < 8)
+			filter->len = 8;
+
 		filter->data = xcalloc(filter->len, sizeof(unsigned char));
 
 		hashmap_for_each_entry(&pathmap, &iter, e, entry) {
-- 
gitgitgadget



* [PATCH 7/8] commit-graph: change test to die on parse, not load
  2020-06-15 20:14 [PATCH 0/8] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
                   ` (5 preceding siblings ...)
  2020-06-15 20:14 ` [PATCH 6/8] bloom: enforce a minimum size of 8 bytes Derrick Stolee via GitGitGadget
@ 2020-06-15 20:14 ` Derrick Stolee via GitGitGadget
  2020-06-15 20:14 ` [PATCH 8/8] commit-graph: persist existence of changed-paths Derrick Stolee via GitGitGadget
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 71+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-06-15 20:14 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

43d3561 (commit-graph write: don't die if the existing graph is corrupt,
2019-03-25) introduced the GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD environment
variable. This was created to verify that commit-graph was not loaded
when writing a new non-incremental commit-graph.

An upcoming change wants to load a commit-graph in some valuable cases,
but we want to maintain that we don't trust the commit-graph data when
writing our new file. Instead of dying on load, die if we ever
try to parse a commit from the commit-graph. This functionally verifies
the same intended behavior, but allows a more advanced feature in the
next change.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c          | 12 ++++++++----
 commit-graph.h          |  2 +-
 t/t5318-commit-graph.sh |  2 +-
 3 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 5c8f210cada..3a64e3b382d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -564,10 +564,6 @@ static int prepare_commit_graph(struct repository *r)
 		return !!r->objects->commit_graph;
 	r->objects->commit_graph_attempted = 1;
 
-	if (git_env_bool(GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD, 0))
-		die("dying as requested by the '%s' variable on commit-graph load!",
-		    GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD);
-
 	prepare_repo_settings(r);
 
 	if (!git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) &&
@@ -790,6 +786,14 @@ static int parse_commit_in_graph_one(struct repository *r,
 
 int parse_commit_in_graph(struct repository *r, struct commit *item)
 {
+	static int checked_env = 0;
+
+	if (!checked_env &&
+	    git_env_bool(GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE, 0))
+		die("dying as requested by the '%s' variable on commit-graph parse!",
+		    GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE);
+	checked_env = 1;
+
 	if (!prepare_commit_graph(r))
 		return 0;
 	return parse_commit_in_graph_one(r, r->objects->commit_graph, item);
diff --git a/commit-graph.h b/commit-graph.h
index 881c9b46e57..f0fb13e3f28 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -5,7 +5,7 @@
 #include "object-store.h"
 
 #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
-#define GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD"
+#define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
 #define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
 
 /*
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 1073f9e3cf2..5ec01abdaa9 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -436,7 +436,7 @@ corrupt_graph_verify() {
 		cp $objdir/info/commit-graph commit-graph-pre-write-test
 	fi &&
 	git status --short &&
-	GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD=true git commit-graph write &&
+	GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE=true git commit-graph write &&
 	git commit-graph verify
 }
 
-- 
gitgitgadget



* [PATCH 8/8] commit-graph: persist existence of changed-paths
  2020-06-15 20:14 [PATCH 0/8] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
                   ` (6 preceding siblings ...)
  2020-06-15 20:14 ` [PATCH 7/8] commit-graph: change test to die on parse, not load Derrick Stolee via GitGitGadget
@ 2020-06-15 20:14 ` Derrick Stolee via GitGitGadget
  2020-06-17 21:21 ` [PATCH 0/8] More commit-graph/Bloom filter improvements Junio C Hamano
  2020-06-23 17:46 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
  9 siblings, 0 replies; 71+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-06-15 20:14 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The changed-path Bloom filters were released in v2.27.0, but have a
significant drawback. A user can opt in to writing the changed-path
filters using the "--changed-paths" option to "git commit-graph write"
but the next write will drop the filters unless that option is
specified.

This becomes even more important when considering the interaction with
gc.writeCommitGraph (on by default) or fetch.writeCommitGraph (part of
features.experimental). These config options trigger commit-graph writes
that the user did not signal, and hence there is no --changed-paths
option available.

Allow a user who opts in to the changed-path filters to persist the
property of "my commit-graph has changed-path filters" automatically. A
user can drop filters using the --no-changed-paths option.
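
The resulting decision is a tri-state: explicitly on, explicitly off,
or unspecified, in which case the existing commit-graph decides.  A
minimal sketch with hypothetical names follows.  It leaves out the
GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS override that the real code also
consults.

```c
/* Hypothetical names for the three states of the command-line option. */
enum changed_paths_option {
	CHANGED_PATHS_UNSET = -1,	/* neither option was given */
	CHANGED_PATHS_OFF = 0,		/* --no-changed-paths */
	CHANGED_PATHS_ON = 1		/* --changed-paths */
};

/*
 * Decide whether the next commit-graph write should include
 * changed-path filters.  An explicit option always wins; otherwise the
 * property is inherited from the existing commit-graph.
 */
int should_write_changed_paths(int option, int existing_graph_has_filters)
{
	if (option == CHANGED_PATHS_ON)
		return 1;
	if (option == CHANGED_PATHS_OFF)
		return 0;
	return existing_graph_has_filters ? 1 : 0;
}
```

With the option unset, a repository that already has filters keeps
them on the next write, and one without filters stays without them.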

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-commit-graph.txt |  5 ++++-
 builtin/commit-graph.c             |  5 ++++-
 commit-graph.c                     | 12 +++++++++++-
 commit-graph.h                     |  1 +
 t/t4216-log-bloom.sh               |  2 +-
 5 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index f4b13c005b8..369b222b08b 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -60,7 +60,10 @@ existing commit-graph file.
 With the `--changed-paths` option, compute and write information about the
 paths changed between a commit and it's first parent. This operation can
 take a while on large repositories. It provides significant performance gains
-for getting history of a directory or a file with `git log -- <path>`.
+for getting history of a directory or a file with `git log -- <path>`. If
+this option is given, future commit-graph writes will automatically assume
+that this option was intended. Use `--no-changed-paths` to stop storing this
+data.
 +
 With the `--split` option, write the commit-graph as a chain of multiple
 commit-graph files stored in `<dir>/info/commit-graphs`. The new commits
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 59009837dc9..ff7b177c337 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -151,6 +151,7 @@ static int graph_write(int argc, const char **argv)
 	};
 
 	opts.progress = isatty(2);
+	opts.enable_changed_paths = -1;
 	split_opts.size_multiple = 2;
 	split_opts.max_commits = 0;
 	split_opts.expire_time = 0;
@@ -171,7 +172,9 @@ static int graph_write(int argc, const char **argv)
 		flags |= COMMIT_GRAPH_WRITE_SPLIT;
 	if (opts.progress)
 		flags |= COMMIT_GRAPH_WRITE_PROGRESS;
-	if (opts.enable_changed_paths ||
+	if (!opts.enable_changed_paths)
+		flags |= COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS;
+	if (opts.enable_changed_paths == 1 ||
 	    git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0))
 		flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
 
diff --git a/commit-graph.c b/commit-graph.c
index 3a64e3b382d..04eea725232 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1996,9 +1996,19 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
 	ctx->check_oids = flags & COMMIT_GRAPH_WRITE_CHECK_OIDS ? 1 : 0;
 	ctx->split_opts = split_opts;
-	ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
 	ctx->total_bloom_filter_data_size = 0;
 
+	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
+		ctx->changed_paths = 1;
+	else if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
+		prepare_commit_graph_one(ctx->r, ctx->odb);
+
+		/* We have changed-paths already. Keep them in the next graph */
+		if (ctx->r->objects->commit_graph &&
+		    ctx->r->objects->commit_graph->chunk_bloom_data)
+			ctx->changed_paths = 1;
+	}
+
 	if (ctx->split) {
 		struct commit_graph *g;
 		prepare_commit_graph(ctx->r);
diff --git a/commit-graph.h b/commit-graph.h
index f0fb13e3f28..45b1e5bca39 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -96,6 +96,7 @@ enum commit_graph_write_flags {
 	/* Make sure that each OID in the input is a valid commit OID. */
 	COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3),
 	COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4),
+	COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS = (1 << 5),
 };
 
 struct split_commit_graph_opts {
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index c13b97d3bda..30c8d9562e8 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -126,7 +126,7 @@ test_expect_success 'setup - add commit-graph to the chain without Bloom filters
 	test_commit c14 A/anotherFile2 &&
 	test_commit c15 A/B/anotherFile2 &&
 	test_commit c16 A/B/C/anotherFile2 &&
-	GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0 git commit-graph write --reachable --split &&
+	git commit-graph write --reachable --split --no-changed-paths &&
 	test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain
 '
 
-- 
gitgitgadget


* Re: [PATCH 0/8] More commit-graph/Bloom filter improvements
  2020-06-15 20:14 [PATCH 0/8] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
                   ` (7 preceding siblings ...)
  2020-06-15 20:14 ` [PATCH 8/8] commit-graph: persist existence of changed-paths Derrick Stolee via GitGitGadget
@ 2020-06-17 21:21 ` Junio C Hamano
  2020-06-18  1:46   ` Derrick Stolee
  2020-06-23 17:46 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
  9 siblings, 1 reply; 71+ messages in thread
From: Junio C Hamano @ 2020-06-17 21:21 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, me, szeder.dev, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This builds on sg/commit-graph-cleanups,...

How ready is that topic, do you think?  I'd rather not to see too
many patches piled on top of what is not even in 'next', but I do
not remember it reviewed seriously (I did take a look or two at it
myself before queuing the series, but that does not quite count).

Will queue to extend the topic for now.

Thanks.


* Re: [PATCH 0/8] More commit-graph/Bloom filter improvements
  2020-06-17 21:21 ` [PATCH 0/8] More commit-graph/Bloom filter improvements Junio C Hamano
@ 2020-06-18  1:46   ` Derrick Stolee
  0 siblings, 0 replies; 71+ messages in thread
From: Derrick Stolee @ 2020-06-18  1:46 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, me, szeder.dev, Derrick Stolee

On 6/17/2020 5:21 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> This builds on sg/commit-graph-cleanups,...
> 
> How ready is that topic, do you think?  I'd rather not to see too
> many patches piled on top of what is not even in 'next', but I do
> not remember it reviewed seriously (I did take a look or two at it
> myself before queuing the series, but that does not quite count).

That topic was my attempt to apply the "easy and obvious" changes
from Szeder's proof-of-concept. In some sense, you could consider
them authored by Szeder and reviewed by me. But also, I did need to
tweak some things, so some review from others would be helpful.

> Will queue to extend the topic for now.

I'll refrain from pushing any more in this direction until more
review comes along. I found myself with a need to do something
productive and familiar, so I started doing commit-graph
performance stuff.

Thanks,
-Stolee


* Re: [PATCH 1/8] commit-graph: place bloom_settings in context
  2020-06-15 20:14 ` [PATCH 1/8] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
@ 2020-06-18 20:30   ` René Scharfe
  2020-06-19 12:58     ` Derrick Stolee
  0 siblings, 1 reply; 71+ messages in thread
From: René Scharfe @ 2020-06-18 20:30 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git; +Cc: me, szeder.dev, Derrick Stolee

Am 15.06.20 um 22:14 schrieb Derrick Stolee via GitGitGadget:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> Place an instance of struct bloom_settings into the struct
> write_commit_graph_context. This allows simplifying the function
> prototype of write_graph_chunk_bloom_data(). This will allow us
> to combine the function prototypes and use function pointers to
> simplify write_commit_graph_file().
>
> Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 887837e8826..05b7035d8d5 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -882,6 +882,7 @@ struct write_commit_graph_context {
>
>  	const struct split_commit_graph_opts *split_opts;
>  	size_t total_bloom_filter_data_size;
> +	struct bloom_filter_settings bloom_settings;

That structure is quite busy already, so adding one more member wouldn't
matter much.

Passing so many things to lots of functions makes it harder to argue
about them, though, as all of them effectively become part of their
signature, and you have to look at their implementation to see which
pseudo-parameters they actually use.  It's like a God object.

>  };
>
>  static void write_graph_chunk_fanout(struct hashfile *f,
> @@ -1103,8 +1104,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
>  }
>
>  static void write_graph_chunk_bloom_data(struct hashfile *f,
> -					 struct write_commit_graph_context *ctx,
> -					 const struct bloom_filter_settings *settings)
> +					 struct write_commit_graph_context *ctx)
>  {
>  	struct commit **list = ctx->commits.list;
>  	struct commit **last = ctx->commits.list + ctx->commits.nr;
> @@ -1116,9 +1116,9 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
>  			_("Writing changed paths Bloom filters data"),
>  			ctx->commits.nr);
>
> -	hashwrite_be32(f, settings->hash_version);
> -	hashwrite_be32(f, settings->num_hashes);
> -	hashwrite_be32(f, settings->bits_per_entry);
> +	hashwrite_be32(f, ctx->bloom_settings.hash_version);
> +	hashwrite_be32(f, ctx->bloom_settings.num_hashes);
> +	hashwrite_be32(f, ctx->bloom_settings.bits_per_entry);
>
>  	while (list < last) {
>  		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
> @@ -1541,6 +1541,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  	struct object_id file_hash;
>  	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
>
> +	ctx->bloom_settings = bloom_settings;

So we use the defaults, no customization?  Then you could simply move
the declaration of bloom_settings from write_commit_graph_file() to
write_graph_chunk_bloom_data().  Glancing at pu I don't see additional
uses there, so no need to put it into the context (yet?).

René


* Re: [PATCH 2/8] commit-graph: unify the signatures of all write_graph_chunk_*() functions
  2020-06-15 20:14 ` [PATCH 2/8] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
@ 2020-06-18 20:30   ` René Scharfe
  0 siblings, 0 replies; 71+ messages in thread
From: René Scharfe @ 2020-06-18 20:30 UTC (permalink / raw)
  To: SZEDER Gábor via GitGitGadget, git; +Cc: me, szeder.dev, Derrick Stolee

Am 15.06.20 um 22:14 schrieb SZEDER Gábor via GitGitGadget:
> -static void write_graph_chunk_oids(struct hashfile *f, int hash_len,
> -				   struct write_commit_graph_context *ctx)
> +static int write_graph_chunk_oids(struct hashfile *f,
> +				  struct write_commit_graph_context *ctx)
>  {
>  	struct commit **list = ctx->commits.list;
>  	int count;
>  	for (count = 0; count < ctx->commits.nr; count++, list++) {
>  		display_progress(ctx->progress, ++ctx->progress_cnt);
> -		hashwrite(f, (*list)->object.oid.hash, (int)hash_len);
> +		hashwrite(f, (*list)->object.oid.hash, (int)the_hash_algo->rawsz);

Before the cast was forcing an int into an int (huh?), now it forces a
size_t into an int, but hashwrite() expects an unsigned int.  Do we
really need that cast?

>  	}
> +
> +	return 0;
>  }
>
>  static const unsigned char *commit_to_sha1(size_t index, void *table)
> @@ -926,8 +930,8 @@ static const unsigned char *commit_to_sha1(size_t index, void *table)
>  	return commits[index]->object.oid.hash;
>  }
>
> -static void write_graph_chunk_data(struct hashfile *f, int hash_len,
> -				   struct write_commit_graph_context *ctx)
> +static int write_graph_chunk_data(struct hashfile *f,
> +				  struct write_commit_graph_context *ctx)
>  {
>  	struct commit **list = ctx->commits.list;
>  	struct commit **last = ctx->commits.list + ctx->commits.nr;
> @@ -944,7 +948,7 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
>  			die(_("unable to parse commit %s"),
>  				oid_to_hex(&(*list)->object.oid));
>  		tree = get_commit_tree_oid(*list);
> -		hashwrite(f, tree->hash, hash_len);
> +		hashwrite(f, tree->hash, the_hash_algo->rawsz);

... and here's the answer: No, we don't need to cast.

René


* Re: [PATCH 3/8] commit-graph: simplify chunk writes into loop
  2020-06-15 20:14 ` [PATCH 3/8] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
@ 2020-06-18 20:30   ` René Scharfe
  0 siblings, 0 replies; 71+ messages in thread
From: René Scharfe @ 2020-06-18 20:30 UTC (permalink / raw)
  To: SZEDER Gábor via GitGitGadget, git; +Cc: me, szeder.dev, Derrick Stolee

Am 15.06.20 um 22:14 schrieb SZEDER Gábor via GitGitGadget:
> From: SZEDER Gábor <szeder.dev@gmail.com>
>
> In write_commit_graph_file() we now have one block of code filling the
> array of 'struct chunk_info' with the IDs and sizes of chunks to be
> written, and another block of code calling the functions responsible
> for writing individual chunks.  In case of optional chunks like Extra
> Edge List and Base Graphs List there is also a condition checking
> whether that chunk is necessary/desired, and that same condition is
> repeated in both blocks of code. Other, newer chunks have similar
> optional conditions.
>
> Eliminate these repeated conditions by storing the function pointers
> responsible for writing individual chunks in the 'struct chunk_info'
> array as well, and calling them in a loop to write the commit-graph
> file.  This will open up the possibility for a bit of foolproofing in
> the following patch.

OK.  An alternative would be a switch in the loop that calls the right
function based on the chunk id.  That would not require uniform
interfaces for all write functions; patch 2 would not be necessary.

>
> Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c | 31 +++++++++++++++++++------------
>  1 file changed, 19 insertions(+), 12 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 3bae1e52ed0..78e023be664 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1532,9 +1532,13 @@ static int write_graph_chunk_base(struct hashfile *f,
>  	return 0;
>  }
>
> +typedef int (*chunk_write_fn)(struct hashfile *f,
> +			      struct write_commit_graph_context *ctx);
> +
>  struct chunk_info {
>  	uint32_t id;
>  	uint64_t size;
> +	chunk_write_fn write_fn;
>  };
>
>  static int write_commit_graph_file(struct write_commit_graph_context *ctx)
> @@ -1591,27 +1595,34 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>
>  	chunks[0].id = GRAPH_CHUNKID_OIDFANOUT;
>  	chunks[0].size = GRAPH_FANOUT_SIZE;
> +	chunks[0].write_fn = write_graph_chunk_fanout;
>  	chunks[1].id = GRAPH_CHUNKID_OIDLOOKUP;
>  	chunks[1].size = hashsz * ctx->commits.nr;
> +	chunks[1].write_fn = write_graph_chunk_oids;
>  	chunks[2].id = GRAPH_CHUNKID_DATA;
>  	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
> +	chunks[2].write_fn = write_graph_chunk_data;
>  	if (ctx->num_extra_edges) {
>  		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
>  		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
> +		chunks[num_chunks].write_fn = write_graph_chunk_extra_edges;
>  		num_chunks++;
>  	}
>  	if (ctx->changed_paths) {
>  		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMINDEXES;
>  		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
> +		chunks[num_chunks].write_fn = write_graph_chunk_bloom_indexes;
>  		num_chunks++;
>  		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMDATA;
>  		chunks[num_chunks].size = sizeof(uint32_t) * 3
>  					  + ctx->total_bloom_filter_data_size;
> +		chunks[num_chunks].write_fn = write_graph_chunk_bloom_data;
>  		num_chunks++;
>  	}
>  	if (ctx->num_commit_graphs_after > 1) {
>  		chunks[num_chunks].id = GRAPH_CHUNKID_BASE;
>  		chunks[num_chunks].size = hashsz * (ctx->num_commit_graphs_after - 1);
> +		chunks[num_chunks].write_fn = write_graph_chunk_base;
>  		num_chunks++;
>  	}
>
> @@ -1647,19 +1658,15 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  			progress_title.buf,
>  			num_chunks * ctx->commits.nr);
>  	}
> -	write_graph_chunk_fanout(f, ctx);
> -	write_graph_chunk_oids(f, ctx);
> -	write_graph_chunk_data(f, ctx);
> -	if (ctx->num_extra_edges)
> -		write_graph_chunk_extra_edges(f, ctx);
> -	if (ctx->changed_paths) {
> -		write_graph_chunk_bloom_indexes(f, ctx);
> -		write_graph_chunk_bloom_data(f, ctx);
> -	}
> -	if (ctx->num_commit_graphs_after > 1 &&
> -	    write_graph_chunk_base(f, ctx)) {
> -		return -1;
> +
> +	for (i = 0; i < num_chunks; i++) {
> +		if (chunks[i].write_fn(f, ctx)) {
> +			error(_("failed writing chunk with id %"PRIx32""),
> +			      chunks[i].id);

This error message is new and not mentioned in the commit message.
write_graph_chunk_base() seems to be the only write function that can
return anything other than 0, and it already reports an error in that
case.  So do we really want the one here as well?
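
To make the question concrete, here is a minimal standalone sketch of the
table-driven pattern, not the commit-graph code itself.  The names
struct chunk, write_ok(), write_fail() and write_all_chunks() are made up
for illustration, and the writers take a plain counter instead of a
hashfile and a context.  It assumes that a failing writer reports its own
error, so the loop only propagates the return value.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for chunk_write_fn; not Git's signature. */
typedef int (*chunk_write_fn)(int *written);

struct chunk {
	uint32_t id;
	chunk_write_fn write_fn;
};

int write_ok(int *written)
{
	(*written)++;
	return 0;
}

int write_fail(int *written)
{
	(void)written;
	/* The writer explains its own failure before returning. */
	fprintf(stderr, "error: unable to write chunk\n");
	return -1;
}

/* Call each writer in order; stop at the first failure and pass it on. */
int write_all_chunks(const struct chunk *chunks, size_t nr, int *written)
{
	size_t i;

	assert(chunks || !nr);
	for (i = 0; i < nr; i++) {
		if (chunks[i].write_fn(written))
			return -1;
	}
	return 0;
}
```

With this shape, a second generic message in the loop is only needed if
some writer can fail silently.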

René


* Re: [PATCH 5/8] commit-graph: check all leading directories in changed path Bloom filters
  2020-06-15 20:14 ` [PATCH 5/8] commit-graph: check all leading directories in changed path Bloom filters SZEDER Gábor via GitGitGadget
@ 2020-06-18 20:31   ` René Scharfe
  2020-06-19  9:14     ` René Scharfe
  2020-06-19 17:17   ` Taylor Blau
  1 sibling, 1 reply; 71+ messages in thread
From: René Scharfe @ 2020-06-18 20:31 UTC (permalink / raw)
  To: SZEDER Gábor via GitGitGadget, git; +Cc: me, szeder.dev, Derrick Stolee

Am 15.06.20 um 22:14 schrieb SZEDER Gábor via GitGitGadget:
> From: SZEDER Gábor <szeder.dev@gmail.com>
>
> The file 'dir/subdir/file' can only be modified if its leading
> directories 'dir' and 'dir/subdir' are modified as well.
>
> So when checking modified path Bloom filters looking for commits
> modifying a path with multiple path components, then check not only
> the full path in the Bloom filters, but all its leading directories as
> well.  Take care to check these paths in "deepest first" order,
> because it's the full path that is least likely to be modified, and
> the Bloom filter queries can short circuit sooner.
>
> This can significantly reduce the average false positive rate, by
> about an order of magnitude or three(!), and can further speed up
> pathspec-limited revision walks.  The table below compares the average
> false positive rate and runtime of
>
>   git rev-list HEAD -- "$path"
>
> before and after this change for 5000+ randomly* selected paths from
> each repository:
>
>                     Average false           Average        Average
>                     positive rate           runtime        runtime
>                   before     after     before     after   difference
>   ------------------------------------------------------------------
>   git             3.220%   0.7853%     0.0558s   0.0387s   -30.6%
>   linux           2.453%   0.0296%     0.1046s   0.0766s   -26.8%
>   tensorflow      2.536%   0.6977%     0.0594s   0.0420s   -29.2%

Nice!

>
> *Path selection was done with the following pipeline:
>
> 	git ls-tree -r --name-only HEAD | sort -R | head -n 5000
>
> The improvements in runtime are much smaller than the improvements in
> average false positive rate, as we are clearly reaching diminishing
> returns here.  However, all these timings assume that accessing
> tree objects is reasonably fast (warm caches).  If we had a partial
> clone and the tree objects had to be fetched from a promisor remote,
> e.g.:
>
>   $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
>   $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
>         commit-graph write --reachable
>   $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
>   $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
>         rev-list HEAD -- "$path"
>
> then checking all leading path components can reduce the runtime from
> over an hour to a few seconds (and this is with the clone and the
> promisor on the same machine).
>
> This adjusts the tracing values in t4216-log-bloom.sh, which provides a
> concrete way to notice the improvement.
>
> Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  revision.c           | 35 ++++++++++++++++++++++++++---------
>  revision.h           |  6 ++++--
>  t/t4216-log-bloom.sh |  2 +-
>  3 files changed, 31 insertions(+), 12 deletions(-)
>
> diff --git a/revision.c b/revision.c
> index c644c660917..027ae3982b4 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>  {
>  	struct pathspec_item *pi;
>  	char *path_alloc = NULL;
> -	const char *path;
> +	const char *path, *p;
>  	int last_index;
> -	int len;
> +	size_t len;
> +	int path_component_nr = 0, j;
>
>  	if (!revs->commits)
>  		return;
> @@ -705,8 +706,22 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>
>  	len = strlen(path);
>
> -	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
> -	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
> +	p = path;
> +	do {
> +		p = strchrnul(p + 1, '/');
> +		path_component_nr++;
> +	} while (p - path < len);

Hmm, that "+ 1" makes me a bit nervous.  Can we be sure that path is not
an empty string?

And shouldn't we use is_dir_sep() or find_last_dir_sep() instead of
hard-coding a slash?
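
A standalone sketch of the hazard, under the assumption that the path may
be empty: with "", "p + 1" already points past the terminating NUL.  The
helper names find_sep_or_end() and count_path_components() are
hypothetical; the former stands in for strchrnul(), which is a GNU
extension, so that the example builds anywhere.

```c
#include <assert.h>
#include <stddef.h>

/* Portable stand-in for strchrnul(s, '/'): next '/' or the NUL. */
const char *find_sep_or_end(const char *s)
{
	while (*s && *s != '/')
		s++;
	return s;
}

/*
 * Count the components of a slash-separated path without ever
 * stepping past the terminating NUL.  An empty path has none.
 */
size_t count_path_components(const char *path)
{
	const char *p = path;
	size_t nr = 0;

	assert(path);
	if (!*p)
		return 0; /* "p + 1" would point past the NUL here */

	for (;;) {
		/* *p is not NUL here, so p + 1 is at most the NUL. */
		p = find_sep_or_end(p + 1);
		nr++;
		if (!*p)
			break;
	}
	return nr;
}
```

For "dir/subdir/file" this counts three components, matching the loop in
the patch, while the empty string is rejected up front.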

> +
> +	revs->bloom_keys_nr = path_component_nr;
> +	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
> +
> +	p = path;
> +	for (j = 0; j < revs->bloom_keys_nr; j++) {
> +		p = strchrnul(p + 1, '/');

Same here, of course.

Also note that this puts shorter sub-strings first.


> +
> +		fill_bloom_key(path, p - path, &revs->bloom_keys[j],
> +			       revs->bloom_filter_settings);
> +	}
>
>  	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
>  		atexit(trace2_bloom_filter_statistics_atexit);
> @@ -720,7 +735,7 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
>  						 struct commit *commit)
>  {
>  	struct bloom_filter *filter;
> -	int result;
> +	int result = 1, j;
>
>  	if (!revs->repo->objects->commit_graph)
>  		return -1;
> @@ -740,9 +755,11 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
>  		return -1;
>  	}
>
> -	result = bloom_filter_contains(filter,
> -				       revs->bloom_key,
> -				       revs->bloom_filter_settings);
> +	for (j = 0; result && j < revs->bloom_keys_nr; j++) {
> +		result = bloom_filter_contains(filter,
> +					       &revs->bloom_keys[j],
> +					       revs->bloom_filter_settings);
> +	}

This checks shorter sub-strings first, contradicting the "deepest first"
strategy mentioned in the commit message.

This can easily be fixed by inverting the traversal of one of the loops,
of course.  Or perhaps do something like this?

	struct strbuf path = STRBUF_INIT;
	size_t alloc = 0;

	revs->bloom_keys = NULL;
	revs->bloom_keys_nr = 0;
	strbuf_add(&path, pi->match, pi->len);
	strbuf_trim_trailing_dir_sep(&path);
	for (;;) {
		const char *sep;
		ALLOC_GROW(revs->bloom_keys, revs->bloom_keys_nr + 1, alloc);
		fill_bloom_key(path.buf, path.len,
			       &revs->bloom_keys[revs->bloom_keys_nr++],
			       revs->bloom_filter_settings);
		sep = find_last_dir_sep(path.buf);
		if (!sep)
			break;
		strbuf_setlen(&path, sep - path.buf);
	}
	strbuf_release(&path);

The find_last_dir_sep() calls scan the first part of the string over and
over, which is a bit silly.  A strbuf_trim_trailing_path_component()
could start at the end and scan backwards if that turns out to be an
actual problem.

ALLOC_GROW wastes memory on revs->bloom_keys, and reallocating instead
of allocating the right size from the start has a cost as well, but I'd
expect this to be dwarfed by the actual revision walk.
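
The backward scan could look roughly like the following standalone
sketch.  leading_path_lens() is a hypothetical helper, not an existing
Git function, and it assumes the path has already been normalized to use
'/' as its only separator.  It records prefix lengths deepest first, so
no copy of the path is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Store the lengths of `path` and of each of its leading directories
 * in `lens`, longest (deepest) first, scanning backwards exactly once.
 * Returns the number of lengths stored, at most `max`.
 */
size_t leading_path_lens(const char *path, size_t *lens, size_t max)
{
	size_t len, nr = 0;

	assert(path && lens);
	len = strlen(path);
	while (len && nr < max) {
		lens[nr++] = len;
		/* Step back to just after the previous '/'. */
		while (len && path[len - 1] != '/')
			len--;
		/* Drop the separator itself (and any run of them). */
		while (len && path[len - 1] == '/')
			len--;
	}
	return nr;
}
```

Each (path, lens[i]) pair could then be handed to a call like
fill_bloom_key(path, lens[i], ...), and an empty path yields no keys at
all.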

>
>  	if (result)
>  		count_bloom_filter_maybe++;
> @@ -782,7 +799,7 @@ static int rev_compare_tree(struct rev_info *revs,
>  			return REV_TREE_SAME;
>  	}
>
> -	if (revs->bloom_key && !nth_parent) {
> +	if (revs->bloom_keys_nr && !nth_parent) {
>  		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
>
>  		if (bloom_ret == 0)
> diff --git a/revision.h b/revision.h
> index 7c026fe41fc..abbfb4ab59a 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -295,8 +295,10 @@ struct rev_info {
>  	struct topo_walk_info *topo_walk_info;
>
>  	/* Commit graph bloom filter fields */
> -	/* The bloom filter key for the pathspec */
> -	struct bloom_key *bloom_key;
> +	/* The bloom filter key(s) for the pathspec */
> +	struct bloom_key *bloom_keys;
> +	int bloom_keys_nr;
> +
>  	/*
>  	 * The bloom filter settings used to generate the key.
>  	 * This is loaded from the commit-graph being used.
> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> index c7011f33e2c..c13b97d3bda 100755
> --- a/t/t4216-log-bloom.sh
> +++ b/t/t4216-log-bloom.sh
> @@ -142,7 +142,7 @@ test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
>
>  test_bloom_filters_used_when_some_filters_are_missing () {
>  	log_args=$1
> -	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":8,\"definitely_not\":6"
> +	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":8"
>  	setup "$log_args" &&
>  	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
>  	test_cmp log_wo_bloom log_w_bloom
>



* Re: [PATCH 5/8] commit-graph: check all leading directories in changed path Bloom filters
  2020-06-18 20:31   ` René Scharfe
@ 2020-06-19  9:14     ` René Scharfe
  0 siblings, 0 replies; 71+ messages in thread
From: René Scharfe @ 2020-06-19  9:14 UTC (permalink / raw)
  To: SZEDER Gábor via GitGitGadget, git; +Cc: me, szeder.dev, Derrick Stolee

Am 18.06.20 um 22:31 schrieb René Scharfe:
> Am 15.06.20 um 22:14 schrieb SZEDER Gábor via GitGitGadget:
>> --- a/revision.c
>> +++ b/revision.c
>> @@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>>  {
>>  	struct pathspec_item *pi;
>>  	char *path_alloc = NULL;
>> -	const char *path;
>> +	const char *path, *p;
>>  	int last_index;
>> -	int len;
>> +	size_t len;
>> +	int path_component_nr = 0, j;
>>
>>  	if (!revs->commits)
>>  		return;
>> @@ -705,8 +706,22 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>>
>>  	len = strlen(path);
>>
>> -	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
>> -	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
>> +	p = path;
>> +	do {
>> +		p = strchrnul(p + 1, '/');
>> +		path_component_nr++;
>> +	} while (p - path < len);

> And shouldn't we use is_dir_sep() or find_last_dir_sep() instead of
> hard-coding a slash?

Not necessarily.  Paths should be normalized to use one specific
separator, probably slash, both when building and querying the Bloom
filter.  Otherwise a filter that knows e.g. "foo/bar" could confidently
claim that "foo\bar" does not match.  If this is done in a previous
step then using a literal slash here would be correct.
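
As an illustration only, such a previous step might look like the sketch
below.  normalize_dir_seps() is a hypothetical name, and the sketch
assumes backslash is the only alternative separator, as on Windows.

```c
#include <assert.h>

/*
 * Rewrite every backslash to a forward slash in place, so that the
 * same spelling is hashed when the filter is built and when it is
 * queried.  Assumes backslash is never a legitimate filename byte,
 * which holds on Windows but not on POSIX systems.
 */
void normalize_dir_seps(char *path)
{
	assert(path);
	for (; *path; path++) {
		if (*path == '\\')
			*path = '/';
	}
}
```

After this runs on both the write side and the query side, splitting on a
literal '/' sees the same components in both places.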

René


* Re: [PATCH 1/8] commit-graph: place bloom_settings in context
  2020-06-18 20:30   ` René Scharfe
@ 2020-06-19 12:58     ` Derrick Stolee
  0 siblings, 0 replies; 71+ messages in thread
From: Derrick Stolee @ 2020-06-19 12:58 UTC (permalink / raw)
  To: René Scharfe, Derrick Stolee via GitGitGadget, git
  Cc: me, szeder.dev, Derrick Stolee

On 6/18/2020 4:30 PM, René Scharfe wrote:
> Am 15.06.20 um 22:14 schrieb Derrick Stolee via GitGitGadget:
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> Place an instance of struct bloom_settings into the struct
>> write_commit_graph_context. This allows simplifying the function
>> prototype of write_graph_chunk_bloom_data(). This will allow us
>> to combine the function prototypes and use function pointers to
>> simplify write_commit_graph_file().
>>
>> Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>  commit-graph.c | 14 ++++++++------
>>  1 file changed, 8 insertions(+), 6 deletions(-)
>>
>> diff --git a/commit-graph.c b/commit-graph.c
>> index 887837e8826..05b7035d8d5 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -882,6 +882,7 @@ struct write_commit_graph_context {
>>
>>  	const struct split_commit_graph_opts *split_opts;
>>  	size_t total_bloom_filter_data_size;
>> +	struct bloom_filter_settings bloom_settings;
> 
> That structure is quite busy already, so adding one more member wouldn't
> matter much.
> 
> Passing so many things to lots of functions makes it harder to argue
> about them, though, as all of them effectively become part of their
> signature, and you have to look at their implementation to see which
> pseudo-parameters they actually use.  It's like a God object.

Correct. The write_commit_graph_context _is_ a God object for the
commit-graph write. The good news is that it is limited only to
commit-graph.c and the write operations therein. Hopefully, the
code organization benefits enough from this structure to justify
the massive struct.

In contrast, it's still smaller and more contained than
"struct rev_info"!

>>  };
>>
>>  static void write_graph_chunk_fanout(struct hashfile *f,
>> @@ -1103,8 +1104,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
>>  }
>>
>>  static void write_graph_chunk_bloom_data(struct hashfile *f,
>> -					 struct write_commit_graph_context *ctx,
>> -					 const struct bloom_filter_settings *settings)
>> +					 struct write_commit_graph_context *ctx)
>>  {
>>  	struct commit **list = ctx->commits.list;
>>  	struct commit **last = ctx->commits.list + ctx->commits.nr;
>> @@ -1116,9 +1116,9 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
>>  			_("Writing changed paths Bloom filters data"),
>>  			ctx->commits.nr);
>>
>> -	hashwrite_be32(f, settings->hash_version);
>> -	hashwrite_be32(f, settings->num_hashes);
>> -	hashwrite_be32(f, settings->bits_per_entry);
>> +	hashwrite_be32(f, ctx->bloom_settings.hash_version);
>> +	hashwrite_be32(f, ctx->bloom_settings.num_hashes);
>> +	hashwrite_be32(f, ctx->bloom_settings.bits_per_entry);
>>
>>  	while (list < last) {
>>  		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
>> @@ -1541,6 +1541,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>>  	struct object_id file_hash;
>>  	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
>>
>> +	ctx->bloom_settings = bloom_settings;
> 
> So we use the defaults, no customization?  Then you could simply move
> the declaration of bloom_settings from write_commit_graph_file() to
> write_graph_chunk_bloom_data().  Glancing at pu I don't see additional
> uses there, so no need to put it into the context (yet?).

It certainly is not customized by a user (yet). However, you do make an
excellent point that I need to be more careful here! Patch 8
(commit-graph: persist existence of changed-paths) needs to load the
bloom_filter_settings from the existing commit-graph so that we stay
compatible with a future version that customizes the settings inside the
commit-graph file!

This means that in v2 I'll move patches 7 & 8 to be after patch 1 and
add a test to verify the filter settings are preserved (after manually
changing the data in the file).
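
For illustration, reading the settings back could be sketched as below.
The field order follows the three hashwrite_be32() calls in the diff
above (hash_version, num_hashes, bits_per_entry).  struct settings,
read_be32() and parse_settings() are hypothetical names, and the sketch
works on a plain byte buffer rather than on the real commit-graph
structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the three fields written at the start of the Bloom data chunk. */
struct settings {
	uint32_t hash_version;
	uint32_t num_hashes;
	uint32_t bits_per_entry;
};

/* Read one big-endian 32-bit value. */
uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Fill `out` from the first 12 bytes of the chunk.  Returns 0 on
 * success, -1 if the chunk is too short to hold the header.
 */
int parse_settings(const unsigned char *chunk, size_t len, struct settings *out)
{
	assert(chunk && out);
	if (len < 12)
		return -1;
	out->hash_version = read_be32(chunk);
	out->num_hashes = read_be32(chunk + 4);
	out->bits_per_entry = read_be32(chunk + 8);
	return 0;
}
```

A test along the lines described above would overwrite one of these
fields in the file and check that a later write keeps the altered value.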

Thanks!
-Stolee


* Re: [PATCH 5/8] commit-graph: check all leading directories in changed path Bloom filters
  2020-06-15 20:14 ` [PATCH 5/8] commit-graph: check all leading directories in changed path Bloom filters SZEDER Gábor via GitGitGadget
  2020-06-18 20:31   ` René Scharfe
@ 2020-06-19 17:17   ` Taylor Blau
  2020-06-19 17:19     ` Taylor Blau
  2020-06-23 13:47     ` Derrick Stolee
  1 sibling, 2 replies; 71+ messages in thread
From: Taylor Blau @ 2020-06-19 17:17 UTC (permalink / raw)
  To: SZEDER Gábor via GitGitGadget; +Cc: git, me, szeder.dev, Derrick Stolee

Hi Stolee,

On Mon, Jun 15, 2020 at 08:14:50PM +0000, SZEDER Gábor via GitGitGadget wrote:
> From: SZEDER Gábor <szeder.dev@gmail.com>
>
> The file 'dir/subdir/file' can only be modified if its leading
> directories 'dir' and 'dir/subdir' are modified as well.
>
> So when checking modified path Bloom filters looking for commits
> modifying a path with multiple path components, then check not only
> the full path in the Bloom filters, but all its leading directories as
> well.  Take care to check these paths in "deepest first" order,
> because it's the full path that is least likely to be modified, and
> the Bloom filter queries can short circuit sooner.
>
> This can significantly reduce the average false positive rate, by
> about an order of magnitude or three(!), and can further speed up
> pathspec-limited revision walks.  The table below compares the average
> false positive rate and runtime of
>
>   git rev-list HEAD -- "$path"
>
> before and after this change for 5000+ randomly* selected paths from
> each repository:
>
>                     Average false           Average        Average
>                     positive rate           runtime        runtime
>                   before     after     before     after   difference
>   ------------------------------------------------------------------
>   git             3.220%   0.7853%     0.0558s   0.0387s   -30.6%
>   linux           2.453%   0.0296%     0.1046s   0.0766s   -26.8%
>   tensorflow      2.536%   0.6977%     0.0594s   0.0420s   -29.2%
>
> *Path selection was done with the following pipeline:
>
> 	git ls-tree -r --name-only HEAD | sort -R | head -n 5000
>
> The improvements in runtime are much smaller than the improvements in
> average false positive rate, as we are clearly reaching diminishing
> returns here.  However, all these timings assume that accessing
> tree objects is reasonably fast (warm caches).  If we had a partial
> clone and the tree objects had to be fetched from a promisor remote,
> e.g.:
>
>   $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
>   $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
>         commit-graph write --reachable
>   $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
>   $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
>         rev-list HEAD -- "$path"
>
> then checking all leading path components can reduce the runtime from
> over an hour to a few seconds (and this is with the clone and the
> promisor on the same machine).
>
> This adjusts the tracing values in t4216-log-bloom.sh, which provides a
> concrete way to notice the improvement.
>
> Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  revision.c           | 35 ++++++++++++++++++++++++++---------
>  revision.h           |  6 ++++--
>  t/t4216-log-bloom.sh |  2 +-
>  3 files changed, 31 insertions(+), 12 deletions(-)
>
> diff --git a/revision.c b/revision.c
> index c644c660917..027ae3982b4 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>  {
>  	struct pathspec_item *pi;
>  	char *path_alloc = NULL;
> -	const char *path;
> +	const char *path, *p;
>  	int last_index;
> -	int len;
> +	size_t len;
> +	int path_component_nr = 0, j;
>
>  	if (!revs->commits)
>  		return;
> @@ -705,8 +706,22 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>
>  	len = strlen(path);
>
> -	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
> -	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
> +	p = path;
> +	do {
> +		p = strchrnul(p + 1, '/');
> +		path_component_nr++;
> +	} while (p - path < len);
> +
> +	revs->bloom_keys_nr = path_component_nr;
> +	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
> +
> +	p = path;
> +	for (j = 0; j < revs->bloom_keys_nr; j++) {
> +		p = strchrnul(p + 1, '/');
> +
> +		fill_bloom_key(path, p - path, &revs->bloom_keys[j],
> +			       revs->bloom_filter_settings);
> +	}
>

Somewhat related to our off-list discussion yesterday, there is a bug in
both 2.27 and this patch which produces incorrect results when (1)
Bloom filters are enabled, and (2) we are doing a revision walk from
root with the pathspec '.'.

What appears to be going on is that our normalization takes '.' -> '',
and then we form a Bloom key based on the empty string, which will
return 'definitely not' when querying the Bloom filter some of the time,
which should never happen. This is a consequence of never inserting the
empty key into the Bloom filter upon generation.
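
The effect can be reproduced with a toy filter.  This is a standalone
sketch and not Git's implementation: the hash (FNV-1a), the two probes,
the 64-bit filter size and all toy_* names are assumptions made for the
example.  It shows that a key which was never inserted can come back as
"definitely not", so the empty path has to bypass the filter and be
treated as "maybe".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BITS 64

struct toy_filter {
	uint64_t bits;
};

/* FNV-1a, with a seed mixed in so that two probes can be derived. */
uint32_t toy_hash(const char *key, size_t len, uint32_t seed)
{
	uint32_t h = 2166136261u ^ seed;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= (unsigned char)key[i];
		h *= 16777619u;
	}
	return h;
}

void toy_add(struct toy_filter *f, const char *key)
{
	size_t len = strlen(key);

	f->bits |= (uint64_t)1 << (toy_hash(key, len, 0) % TOY_BITS);
	f->bits |= (uint64_t)1 << (toy_hash(key, len, 1) % TOY_BITS);
}

/* Returns 0 for "definitely not", 1 for "maybe". */
int toy_contains(const struct toy_filter *f, const char *key)
{
	size_t len = strlen(key);

	return ((f->bits >> (toy_hash(key, len, 0) % TOY_BITS)) & 1) &&
	       ((f->bits >> (toy_hash(key, len, 1) % TOY_BITS)) & 1);
}

/*
 * Query wrapper: an empty path stands for the whole tree and was never
 * added as a key, so the filter cannot answer for it.  Report "maybe".
 */
int toy_maybe_changed(const struct toy_filter *f, const char *path)
{
	assert(f && path);
	if (!*path)
		return 1;
	return toy_contains(f, path);
}
```

Querying the raw filter for "" after adding only "a" returns 0 here,
which is the wrong answer for a pathspec that matches everything; the
wrapper avoids asking the question at all.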

As a result, I have patched this in GitHub's fork (which is currently
based on 2.27 and doesn't have these patches yet) by doing an early
return when 'strlen(path) == 0'. Since it looks like these patches are
going to land, here is some clean-up and a fix for the bug that you
should feel free to test with and apply on top:

--- >8 ---

diff --git a/revision.c b/revision.c
index 8bd383b1dd..123e72698d 100644
--- a/revision.c
+++ b/revision.c
@@ -670,10 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 {
        struct pathspec_item *pi;
        char *path_alloc = NULL;
-       const char *path, *p;
+       char *path, *p;
        int last_index;
        size_t len;
-       int path_component_nr = 0, j;
+       int path_component_nr = 1, j;

        if (!revs->commits)
                return;
@@ -698,29 +698,33 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)

        /* remove single trailing slash from path, if needed */
        if (pi->match[last_index] == '/') {
-           path_alloc = xstrdup(pi->match);
-           path_alloc[last_index] = '\0';
-           path = path_alloc;
-       } else
-           path = pi->match;
+               path_alloc = xstrdup(pi->match);
+               path_alloc[last_index] = '\0';
+               path = path_alloc;
+       } else {
+               path = pi->match;
+               len = pi->len;
+       }

-       len = strlen(path);
+       if (!len)
+               return;

-       p = path;
        do {
-               p = strchrnul(p + 1, '/');
-               path_component_nr++;
-       } while (p - path < len);
+               if (is_dir_sep(*p)) {
+                       *p = '\0';
+                       path_component_nr++;
+               }
+       } while (*p++);

        revs->bloom_keys_nr = path_component_nr;
        ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);

        p = path;
        for (j = 0; j < revs->bloom_keys_nr; j++) {
-               p = strchrnul(p + 1, '/');
-
-               fill_bloom_key(path, p - path, &revs->bloom_keys[j],
+               size_t plen = strlen(p);
+               fill_bloom_key(p, plen, &revs->bloom_keys[j],
                               revs->bloom_filter_settings);
+               p += plen;
        }

        if (trace2_is_enabled() && !bloom_filter_atexit_registered) {

>  	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
>  		atexit(trace2_bloom_filter_statistics_atexit);
> @@ -720,7 +735,7 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
>  						 struct commit *commit)
>  {
>  	struct bloom_filter *filter;
> -	int result;
> +	int result = 1, j;
>
>  	if (!revs->repo->objects->commit_graph)
>  		return -1;
> @@ -740,9 +755,11 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
>  		return -1;
>  	}
>
> -	result = bloom_filter_contains(filter,
> -				       revs->bloom_key,
> -				       revs->bloom_filter_settings);
> +	for (j = 0; result && j < revs->bloom_keys_nr; j++) {
> +		result = bloom_filter_contains(filter,
> +					       &revs->bloom_keys[j],
> +					       revs->bloom_filter_settings);
> +	}
>
>  	if (result)
>  		count_bloom_filter_maybe++;
> @@ -782,7 +799,7 @@ static int rev_compare_tree(struct rev_info *revs,
>  			return REV_TREE_SAME;
>  	}
>
> -	if (revs->bloom_key && !nth_parent) {
> +	if (revs->bloom_keys_nr && !nth_parent) {
>  		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
>
>  		if (bloom_ret == 0)
> diff --git a/revision.h b/revision.h
> index 7c026fe41fc..abbfb4ab59a 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -295,8 +295,10 @@ struct rev_info {
>  	struct topo_walk_info *topo_walk_info;
>
>  	/* Commit graph bloom filter fields */
> -	/* The bloom filter key for the pathspec */
> -	struct bloom_key *bloom_key;
> +	/* The bloom filter key(s) for the pathspec */
> +	struct bloom_key *bloom_keys;
> +	int bloom_keys_nr;
> +
>  	/*
>  	 * The bloom filter settings used to generate the key.
>  	 * This is loaded from the commit-graph being used.
> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> index c7011f33e2c..c13b97d3bda 100755
> --- a/t/t4216-log-bloom.sh
> +++ b/t/t4216-log-bloom.sh
> @@ -142,7 +142,7 @@ test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
>
>  test_bloom_filters_used_when_some_filters_are_missing () {
>  	log_args=$1
> -	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":8,\"definitely_not\":6"
> +	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":8"
>  	setup "$log_args" &&
>  	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
>  	test_cmp log_wo_bloom log_w_bloom
> --
> gitgitgadget
>
Thanks,
Taylor


* Re: [PATCH 5/8] commit-graph: check all leading directories in changed path Bloom filters
  2020-06-19 17:17   ` Taylor Blau
@ 2020-06-19 17:19     ` Taylor Blau
  2020-06-23 13:47     ` Derrick Stolee
  1 sibling, 0 replies; 71+ messages in thread
From: Taylor Blau @ 2020-06-19 17:19 UTC (permalink / raw)
  To: Taylor Blau
  Cc: SZEDER Gábor via GitGitGadget, git, szeder.dev, Derrick Stolee

On Fri, Jun 19, 2020 at 11:17:17AM -0600, Taylor Blau wrote:
> Hi Stolee,
>
> On Mon, Jun 15, 2020 at 08:14:50PM +0000, SZEDER Gábor via GitGitGadget wrote:
> > From: SZEDER Gábor <szeder.dev@gmail.com>
> >
> > The file 'dir/subdir/file' can only be modified if its leading
> > directories 'dir' and 'dir/subdir' are modified as well.
> >
> > So when checking modified path Bloom filters looking for commits
> > modifying a path with multiple path components, then check not only
> > the full path in the Bloom filters, but all its leading directories as
> > well.  Take care to check these paths in "deepest first" order,
> > because it's the full path that is least likely to be modified, and
> > the Bloom filter queries can short circuit sooner.
> >
> > This can significantly reduce the average false positive rate, by
> > about an order of magnitude or three(!), and can further speed up
> > pathspec-limited revision walks.  The table below compares the average
> > false positive rate and runtime of
> >
> >   git rev-list HEAD -- "$path"
> >
> > before and after this change for 5000+ randomly* selected paths from
> > each repository:
> >
> >                     Average false           Average        Average
> >                     positive rate           runtime        runtime
> >                   before     after     before     after   difference
> >   ------------------------------------------------------------------
> >   git             3.220%   0.7853%     0.0558s   0.0387s   -30.6%
> >   linux           2.453%   0.0296%     0.1046s   0.0766s   -26.8%
> >   tensorflow      2.536%   0.6977%     0.0594s   0.0420s   -29.2%
> >
> > *Path selection was done with the following pipeline:
> >
> > 	git ls-tree -r --name-only HEAD | sort -R | head -n 5000
> >
> > The improvements in runtime are much smaller than the improvements in
> > average false positive rate, as we are clearly reaching diminishing
> > returns here.  However, all these timings assume that accessing
> > tree objects is reasonably fast (warm caches).  If we had a partial
> > clone and the tree objects had to be fetched from a promisor remote,
> > e.g.:
> >
> >   $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
> >   $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
> >         commit-graph write --reachable
> >   $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
> >   $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
> >         rev-list HEAD -- "$path"
> >
> > then checking all leading path components can reduce the runtime from
> > over an hour to a few seconds (and this is with the clone and the
> > promisor on the same machine).
> >
> > This adjusts the tracing values in t4216-log-bloom.sh, which provides a
> > concrete way to notice the improvement.
> >
> > Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
> > Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> > ---
> >  revision.c           | 35 ++++++++++++++++++++++++++---------
> >  revision.h           |  6 ++++--
> >  t/t4216-log-bloom.sh |  2 +-
> >  3 files changed, 31 insertions(+), 12 deletions(-)
> >
> > diff --git a/revision.c b/revision.c
> > index c644c660917..027ae3982b4 100644
> > --- a/revision.c
> > +++ b/revision.c
> > @@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
> >  {
> >  	struct pathspec_item *pi;
> >  	char *path_alloc = NULL;
> > -	const char *path;
> > +	const char *path, *p;
> >  	int last_index;
> > -	int len;
> > +	size_t len;
> > +	int path_component_nr = 0, j;
> >
> >  	if (!revs->commits)
> >  		return;
> > @@ -705,8 +706,22 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
> >
> >  	len = strlen(path);
> >
> > -	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
> > -	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
> > +	p = path;
> > +	do {
> > +		p = strchrnul(p + 1, '/');
> > +		path_component_nr++;
> > +	} while (p - path < len);
> > +
> > +	revs->bloom_keys_nr = path_component_nr;
> > +	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
> > +
> > +	p = path;
> > +	for (j = 0; j < revs->bloom_keys_nr; j++) {
> > +		p = strchrnul(p + 1, '/');
> > +
> > +		fill_bloom_key(path, p - path, &revs->bloom_keys[j],
> > +			       revs->bloom_filter_settings);
> > +	}
> >
>
> Somewhat related to our off-list discussion yesterday, there is a bug in
> both 2.27 and this patch which produces incorrect results when (1)
> Bloom filters are enabled, and (2) we are doing a revision walk from
> root with the pathspec '.'.
>
> What appears to be going on is that our normalization takes '.' -> '',
> and then we form a Bloom key based on the empty string, which will
> return 'definitely not' when querying the Bloom filter some of the time,
> which should never happen. This is a consequence of never inserting the
> empty key into the Bloom filter upon generation.
>
> As a result, I have patched this in GitHub's fork (which is currently
> based on 2.27 and doesn't have these patches yet) by doing an early
> return when 'strlen(path) == 0'. Since it looks like these patches are
> going to land, here is some clean-up and a fix for the bug that you
> should feel free to test with and apply on top:
>
> --- >8 ---
>
> diff --git a/revision.c b/revision.c
> index 8bd383b1dd..123e72698d 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -670,10 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>  {
>         struct pathspec_item *pi;
>         char *path_alloc = NULL;
> -       const char *path, *p;
> +       char *path, *p;
>         int last_index;
>         size_t len;
> -       int path_component_nr = 0, j;
> +       int path_component_nr = 1, j;
>
>         if (!revs->commits)
>                 return;
> @@ -698,29 +698,33 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>
>         /* remove single trailing slash from path, if needed */
>         if (pi->match[last_index] == '/') {
> -           path_alloc = xstrdup(pi->match);
> -           path_alloc[last_index] = '\0';
> -           path = path_alloc;
> -       } else
> -           path = pi->match;
> +               path_alloc = xstrdup(pi->match);
> +               path_alloc[last_index] = '\0';
> +               path = path_alloc;
> +       } else {
> +               path = pi->match;
> +               len = pi->len;
> +       }
>
> -       len = strlen(path);
> +       if (!len)
> +               return;

I should note that _this_ is the critical fix; applying only this hunk
should be enough to fix the bug.

Everything else is purely style clean-ups on top (ranging from the four
spaces used instead of a tab, to some string processing niceties that I
_think_ should address Rene's concern, although I'm not sure if an
actual bug is lurking there or not...)

> -       p = path;
>         do {
> -               p = strchrnul(p + 1, '/');
> -               path_component_nr++;
> -       } while (p - path < len);
> +               if (is_dir_sep(*p)) {
> +                       *p = '\0';
> +                       path_component_nr++;
> +               }
> +       } while (*p++);
>
>         revs->bloom_keys_nr = path_component_nr;
>         ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
>
>         p = path;
>         for (j = 0; j < revs->bloom_keys_nr; j++) {
> -               p = strchrnul(p + 1, '/');
> -
> -               fill_bloom_key(path, p - path, &revs->bloom_keys[j],
> +               size_t plen = strlen(p);
> +               fill_bloom_key(p, plen, &revs->bloom_keys[j],
>                                revs->bloom_filter_settings);
> +               p += plen;
>         }
>
>         if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
>
> >  	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
> >  		atexit(trace2_bloom_filter_statistics_atexit);
> > @@ -720,7 +735,7 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
> >  						 struct commit *commit)
> >  {
> >  	struct bloom_filter *filter;
> > -	int result;
> > +	int result = 1, j;
> >
> >  	if (!revs->repo->objects->commit_graph)
> >  		return -1;
> > @@ -740,9 +755,11 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
> >  		return -1;
> >  	}
> >
> > -	result = bloom_filter_contains(filter,
> > -				       revs->bloom_key,
> > -				       revs->bloom_filter_settings);
> > +	for (j = 0; result && j < revs->bloom_keys_nr; j++) {
> > +		result = bloom_filter_contains(filter,
> > +					       &revs->bloom_keys[j],
> > +					       revs->bloom_filter_settings);
> > +	}
> >
> >  	if (result)
> >  		count_bloom_filter_maybe++;
> > @@ -782,7 +799,7 @@ static int rev_compare_tree(struct rev_info *revs,
> >  			return REV_TREE_SAME;
> >  	}
> >
> > -	if (revs->bloom_key && !nth_parent) {
> > +	if (revs->bloom_keys_nr && !nth_parent) {
> >  		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
> >
> >  		if (bloom_ret == 0)
> > diff --git a/revision.h b/revision.h
> > index 7c026fe41fc..abbfb4ab59a 100644
> > --- a/revision.h
> > +++ b/revision.h
> > @@ -295,8 +295,10 @@ struct rev_info {
> >  	struct topo_walk_info *topo_walk_info;
> >
> >  	/* Commit graph bloom filter fields */
> > -	/* The bloom filter key for the pathspec */
> > -	struct bloom_key *bloom_key;
> > +	/* The bloom filter key(s) for the pathspec */
> > +	struct bloom_key *bloom_keys;
> > +	int bloom_keys_nr;
> > +
> >  	/*
> >  	 * The bloom filter settings used to generate the key.
> >  	 * This is loaded from the commit-graph being used.
> > diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> > index c7011f33e2c..c13b97d3bda 100755
> > --- a/t/t4216-log-bloom.sh
> > +++ b/t/t4216-log-bloom.sh
> > @@ -142,7 +142,7 @@ test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
> >
> >  test_bloom_filters_used_when_some_filters_are_missing () {
> >  	log_args=$1
> > -	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":8,\"definitely_not\":6"
> > +	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":8"
> >  	setup "$log_args" &&
> >  	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
> >  	test_cmp log_wo_bloom log_w_bloom
> > --
> > gitgitgadget
> >
> Thanks,
> Taylor
Thanks,
Taylor

* Re: [PATCH 5/8] commit-graph: check all leading directories in changed path Bloom filters
  2020-06-19 17:17   ` Taylor Blau
  2020-06-19 17:19     ` Taylor Blau
@ 2020-06-23 13:47     ` Derrick Stolee
  1 sibling, 0 replies; 71+ messages in thread
From: Derrick Stolee @ 2020-06-23 13:47 UTC (permalink / raw)
  To: Taylor Blau, SZEDER Gábor via GitGitGadget
  Cc: git, szeder.dev, Derrick Stolee

On 6/19/2020 1:17 PM, Taylor Blau wrote:
> Hi Stolee,
> 
> On Mon, Jun 15, 2020 at 08:14:50PM +0000, SZEDER Gábor via GitGitGadget wrote:
>> From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <szeder.dev@gmail.com>
>>
>> The file 'dir/subdir/file' can only be modified if its leading
>> directories 'dir' and 'dir/subdir' are modified as well.
>>
>> So when checking modified path Bloom filters looking for commits
>> modifying a path with multiple path components, then check not only
>> the full path in the Bloom filters, but all its leading directories as
>> well.  Take care to check these paths in "deepest first" order,
>> because it's the full path that is least likely to be modified, and
>> the Bloom filter queries can short circuit sooner.
>>
>> This can significantly reduce the average false positive rate, by
>> about an order of magnitude or three(!), and can further speed up
>> pathspec-limited revision walks.  The table below compares the average
>> false positive rate and runtime of
>>
>>   git rev-list HEAD -- "$path"
>>
>> before and after this change for 5000+ randomly* selected paths from
>> each repository:
>>
>>                     Average false           Average        Average
>>                     positive rate           runtime        runtime
>>                   before     after     before     after   difference
>>   ------------------------------------------------------------------
>>   git             3.220%   0.7853%     0.0558s   0.0387s   -30.6%
>>   linux           2.453%   0.0296%     0.1046s   0.0766s   -26.8%
>>   tensorflow      2.536%   0.6977%     0.0594s   0.0420s   -29.2%
>>
>> *Path selection was done with the following pipeline:
>>
>> 	git ls-tree -r --name-only HEAD | sort -R | head -n 5000
>>
>> The improvements in runtime are much smaller than the improvements in
>> average false positive rate, as we are clearly reaching diminishing
>> returns here.  However, all these timings depend on access to tree
>> objects being reasonably fast (warm caches).  If we had a partial
>> clone and the tree objects had to be fetched from a promisor remote,
>> e.g.:
>>
>>   $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
>>   $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
>>         commit-graph write --reachable
>>   $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
>>   $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
>>         rev-list HEAD -- "$path"
>>
>> then checking all leading path components can reduce the runtime from
>> over an hour to a few seconds (and this is with the clone and the
>> promisor on the same machine).
>>
>> This adjusts the tracing values in t4216-log-bloom.sh, which provides a
>> concrete way to notice the improvement.
>>
>> Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>  revision.c           | 35 ++++++++++++++++++++++++++---------
>>  revision.h           |  6 ++++--
>>  t/t4216-log-bloom.sh |  2 +-
>>  3 files changed, 31 insertions(+), 12 deletions(-)
>>
>> diff --git a/revision.c b/revision.c
>> index c644c660917..027ae3982b4 100644
>> --- a/revision.c
>> +++ b/revision.c
>> @@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>>  {
>>  	struct pathspec_item *pi;
>>  	char *path_alloc = NULL;
>> -	const char *path;
>> +	const char *path, *p;
>>  	int last_index;
>> -	int len;
>> +	size_t len;
>> +	int path_component_nr = 0, j;
>>
>>  	if (!revs->commits)
>>  		return;
>> @@ -705,8 +706,22 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>>
>>  	len = strlen(path);
>>
>> -	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
>> -	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
>> +	p = path;
>> +	do {
>> +		p = strchrnul(p + 1, '/');
>> +		path_component_nr++;
>> +	} while (p - path < len);
>> +
>> +	revs->bloom_keys_nr = path_component_nr;
>> +	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
>> +
>> +	p = path;
>> +	for (j = 0; j < revs->bloom_keys_nr; j++) {
>> +		p = strchrnul(p + 1, '/');
>> +
>> +		fill_bloom_key(path, p - path, &revs->bloom_keys[j],
>> +			       revs->bloom_filter_settings);
>> +	}
>>
> 
> Somewhat related to our off-list discussion yesterday, there is a bug in
> both 2.27 and this patch which produces incorrect results when (1)
> Bloom filters are enabled, and (2) we are doing a revision walk from
> root with the pathspec '.'.
> 
> What appears to be going on is that our normalization takes '.' -> '',
> and then we form a Bloom key based on the empty string, which will
> return 'definitely not' when querying the Bloom filter some of the time,
> which should never happen. This is a consequence of never inserting the
> empty key into the Bloom filter upon generation.
> 
> As a result, I have patched this in GitHub's fork (which is currently
> based on 2.27 and doesn't have these patches yet) by doing an early
> return when 'strlen(path) == 0'. Since it looks like these patches are
> going to land, here is some clean-up and a fix for the bug that you
> should feel free to test with and apply on top:
> 
> --- >8 ---
> 
> diff --git a/revision.c b/revision.c
> index 8bd383b1dd..123e72698d 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -670,10 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>  {
>         struct pathspec_item *pi;
>         char *path_alloc = NULL;
> -       const char *path, *p;
> +       char *path, *p;
>         int last_index;
>         size_t len;
> -       int path_component_nr = 0, j;
> +       int path_component_nr = 1, j;
> 
>         if (!revs->commits)
>                 return;
> @@ -698,29 +698,33 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
> 
>         /* remove single trailing slash from path, if needed */
>         if (pi->match[last_index] == '/') {
> -           path_alloc = xstrdup(pi->match);
> -           path_alloc[last_index] = '\0';
> -           path = path_alloc;
> -       } else
> -           path = pi->match;
> +               path_alloc = xstrdup(pi->match);
> +               path_alloc[last_index] = '\0';
> +               path = path_alloc;
> +       } else {
> +               path = pi->match;
> +               len = pi->len;
> +       }
> 
> -       len = strlen(path);
> +       if (!len)
> +               return;
> 
> -       p = path;
>         do {
> -               p = strchrnul(p + 1, '/');
> -               path_component_nr++;
> -       } while (p - path < len);
> +               if (is_dir_sep(*p)) {
> +                       *p = '\0';
> +                       path_component_nr++;
> +               }
> +       } while (*p++);
> 
>         revs->bloom_keys_nr = path_component_nr;
>         ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
> 
>         p = path;
>         for (j = 0; j < revs->bloom_keys_nr; j++) {
> -               p = strchrnul(p + 1, '/');
> -
> -               fill_bloom_key(path, p - path, &revs->bloom_keys[j],
> +               size_t plen = strlen(p);
> +               fill_bloom_key(p, plen, &revs->bloom_keys[j],
>                                revs->bloom_filter_settings);
> +               p += plen;

I don't think this is correct at all. Looking at it, it seems
that it would take a path "A/B/C" and add keys for "A", "B", and
"C" instead of "A", "A/B", and "A/B/C".

Looking more closely, there are a few issues that make it clear
why you didn't see a failing test:

1. You use "while (*p++)" instead of "while (*++p)" so the scan
   terminates after the first directory split. (So only "A" is
   added, which won't fail, but will be slower than intended.)

Changing that, we see the next problem:

2. You use "p += plen" instead of "p += plen + 1". This causes
   the filters to add "A", "", and "" (because we don't skip the
   terminating character). This _is_ incorrect and would result
   in test failures.

Changing that, we then see "A", "B", and "C" are added as keys.
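
To make the intended key set concrete, here is a minimal sketch of the
prefix enumeration, assuming the trailing slash has already been
stripped and '/' is the only separator. The helper name
leading_prefix_lengths() is hypothetical and not part of Git; it only
reports the lengths that would be passed to fill_bloom_key() together
with the unmodified path pointer:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of Git): store the length of every
 * leading directory of `path`, followed by the length of the full
 * path. All keys share the same start pointer; only the length varies.
 * Returns the number of lengths written to `out` (at most `max`).
 */
static size_t leading_prefix_lengths(const char *path, size_t *out, size_t max)
{
	size_t i, n = 0;

	assert(path != NULL && out != NULL);

	if (!*path)
		return 0;	/* an empty path produces no keys */

	/* Start at 1: a separator at index 0 has no directory before it. */
	for (i = 1; path[i]; i++) {
		if (path[i] == '/' && n < max)
			out[n++] = i;	/* prefix ends just before this '/' */
	}
	if (n < max)
		out[n++] = i;		/* the full path */
	return n;
}
```

For "A/B/C" this stores 1, 3 and 5, so the keys are "A", "A/B" and
"A/B/C", all starting at the same base pointer rather than at each
component.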

I'm going to take the style fixes that you presented and apply
them in one commit (Reported-by you), and then put the
if (!len) fix in a separate patch.

I'll update the scan loops to use is_dir_sep() accordingly
in this patch.

Thanks,
-Stolee

* [PATCH v2 00/11] More commit-graph/Bloom filter improvements
  2020-06-15 20:14 [PATCH 0/8] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
                   ` (8 preceding siblings ...)
  2020-06-17 21:21 ` [PATCH 0/8] More commit-graph/Bloom filter improvements Junio C Hamano
@ 2020-06-23 17:46 ` Derrick Stolee via GitGitGadget
  2020-06-23 17:47   ` [PATCH v2 01/11] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
                     ` (12 more replies)
  9 siblings, 13 replies; 71+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-06-23 17:46 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee

This builds on sg/commit-graph-cleanups, which took several patches from
Szeder's series [1] and applied them almost directly to a more-recent
version of Git [2].

[1] https://lore.kernel.org/git/20200529085038.26008-1-szeder.dev@gmail.com/
[2] https://lore.kernel.org/git/pull.650.git.1591362032.gitgitgadget@gmail.com/

This series adds a few extra improvements, several of which are rooted in
Szeder's original series. I maintained his authorship and sign-off, even
though the patches did not apply or cherry-pick at all.

(In v2, I have removed the range-diff comparison to Szeder's series, so look
at the v1 cover letter for that.)

The patches have been significantly reordered. René pointed out (and Szeder
discovered in the old thread) that we are not re-using the
bloom_filter_settings from the existing commit-graph when writing a new one.

 1. commit-graph: place bloom_settings in context
 2. commit-graph: change test to die on parse, not load

These are mostly the same, except we now use a pointer to the settings in
the commit-graph write context.

 3. bloom: get_bloom_filter() cleanups

This new patch is a subtle change in behavior that will become relevant in
the very next patch. In fact, if we swap patch 3 and 4, then
t4216-log-bloom.sh fails with a segfault due to a NULL filter.

 4. commit-graph: persist existence of changed-paths

This patch is now updated to use the existing changed-path filter settings.
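
The "reuse the existing settings, otherwise fall back to the defaults"
behavior can be sketched roughly as follows. The struct and function
names are illustrative stand-ins rather than Git's own, and the default
values are the ones the new test greps for (hash_version 1, num_hashes
7, bits_per_entry 10):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for struct bloom_filter_settings. */
struct toy_settings {
	uint32_t hash_version;
	uint32_t num_hashes;
	uint32_t bits_per_entry;
};

/* Static storage, so the pointer stays valid after any caller returns. */
static const struct toy_settings toy_default_settings = { 1, 7, 10 };

/* Illustrative stand-in for the commit-graph write context. */
struct toy_write_ctx {
	const struct toy_settings *settings;
};

/*
 * Keep the settings read from an existing graph when there are any
 * (from_existing_graph may be NULL); otherwise use the defaults.
 */
static void toy_choose_settings(struct toy_write_ctx *ctx,
				const struct toy_settings *from_existing_graph)
{
	assert(ctx != NULL);
	ctx->settings = from_existing_graph ? from_existing_graph
					    : &toy_default_settings;
}
```

Holding a pointer to const settings in the context is what makes the
swap cheap: the writer reads through ctx->settings and never needs to
know where the values came from.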

 5. commit-graph: unify the signatures of all write_graph_chunk_*()
    functions
 6. commit-graph: simplify chunk writes into loop
 7. commit-graph: check chunk sizes after writing

These are all the same as before.

 8. revision.c: fix whitespace

This patch is the cleanup part of Taylor's patch.

 9. revision: empty pathspecs should not use Bloom filters

Here is Taylor's fix for empty pathspecs.

 10. commit-graph: check all leading directories in changed path Bloom
     filters
 11. bloom: enforce a minimum size of 8 bytes

Finally, we get these performance patches. Patch 10 is updated to have the
better logic around directory separators and empty paths. Also, the list of
Bloom keys is ordered with the deepest path first. That has some tiny
performance benefits for deep paths since we can short-circuit the multi-key
checks more often. That code path is much faster than the tree parsing, so
it is hard to measure any change.
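
As a rough illustration of the deepest-first ordering and the
short-circuit, here is a toy model. Everything in it (the toy_* names,
the 256-bit filter, the seeded FNV-1a hash, the probe counter) is an
assumption made for the sketch; Git's real filters use different
hashing and sizing:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BITS 256u
#define TOY_HASHES 3u

struct toy_filter {
	unsigned char bits[TOY_BITS / 8];
};

/* Seeded FNV-1a reduced to a bit index; a stand-in, not Git's hash. */
static uint32_t toy_hash(const char *key, size_t len, uint32_t seed)
{
	uint32_t h = 2166136261u ^ seed;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= (unsigned char)key[i];
		h *= 16777619u;
	}
	return h % TOY_BITS;
}

static void toy_add(struct toy_filter *f, const char *key, size_t len)
{
	uint32_t k;

	for (k = 0; k < TOY_HASHES; k++) {
		uint32_t bit = toy_hash(key, len, k);
		f->bits[bit / 8] |= (unsigned char)(1u << (bit % 8));
	}
}

static int toy_maybe_contains(const struct toy_filter *f,
			      const char *key, size_t len)
{
	uint32_t k;

	for (k = 0; k < TOY_HASHES; k++) {
		uint32_t bit = toy_hash(key, len, k);
		if (!(f->bits[bit / 8] & (1u << (bit % 8))))
			return 0;	/* definitely not present */
	}
	return 1;			/* maybe present */
}

/*
 * Probe the full path first, then each leading directory from deepest
 * to shallowest, stopping at the first "definitely not".
 * Returns 1 for "maybe", 0 for "definitely not", and -1 for an empty
 * path, which the filter cannot answer. *probes counts the queries.
 */
static int path_maybe_changed(const struct toy_filter *f,
			      const char *path, int *probes)
{
	size_t len;

	assert(f != NULL && path != NULL && probes != NULL);
	*probes = 0;
	len = strlen(path);
	if (!len)
		return -1;

	for (;;) {
		(*probes)++;
		if (!toy_maybe_contains(f, path, len))
			return 0;
		/* Shrink to the next shallower leading directory. */
		while (len > 0 && path[len - 1] != '/')
			len--;
		if (len <= 1)
			return 1;	/* no directory left to check */
		len--;			/* drop the '/' itself */
	}
}
```

With a filter that holds "A", "A/B" and "A/B/C", a query for "A/B/C"
needs three probes; with an empty filter the first probe on the full
path already answers "definitely not".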

Thanks, -Stolee

Derrick Stolee (6):
  commit-graph: place bloom_settings in context
  commit-graph: change test to die on parse, not load
  bloom: get_bloom_filter() cleanups
  commit-graph: persist existence of changed-paths
  revision.c: fix whitespace
  bloom: enforce a minimum size of 8 bytes

SZEDER Gábor (4):
  commit-graph: unify the signatures of all write_graph_chunk_*()
    functions
  commit-graph: simplify chunk writes into loop
  commit-graph: check chunk sizes after writing
  commit-graph: check all leading directories in changed path Bloom
    filters

Taylor Blau (1):
  revision: empty pathspecs should not use Bloom filters

 Documentation/git-commit-graph.txt |   5 +-
 bloom.c                            |  19 ++--
 builtin/commit-graph.c             |   5 +-
 commit-graph.c                     | 136 +++++++++++++++++++++--------
 commit-graph.h                     |   3 +-
 revision.c                         |  53 ++++++++---
 revision.h                         |   6 +-
 t/t4216-log-bloom.sh               |  35 +++++++-
 t/t5318-commit-graph.sh            |   2 +-
 9 files changed, 200 insertions(+), 64 deletions(-)


base-commit: 7fbfe07ab4d4e58c0971dac73001b89f180a0af3
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-659%2Fderrickstolee%2Fbloom-2-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-659/derrickstolee/bloom-2-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/659

Range-diff vs v1:

  1:  c966969071 !  1:  57002040bc commit-graph: place bloom_settings in context
     @@ Commit message
          to combine the function prototypes and use function pointers to
          simplify write_commit_graph_file().
      
     +    By using a pointer, we can later replace the settings to match those
     +    that exist in the current commit-graph, in case a future Git version
     +    allows customization of these parameters.
     +
          Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
     @@ commit-graph.c: struct write_commit_graph_context {
       
       	const struct split_commit_graph_opts *split_opts;
       	size_t total_bloom_filter_data_size;
     -+	struct bloom_filter_settings bloom_settings;
     ++	const struct bloom_filter_settings *bloom_settings;
       };
       
       static void write_graph_chunk_fanout(struct hashfile *f,
     @@ commit-graph.c: static void write_graph_chunk_bloom_data(struct hashfile *f,
      -	hashwrite_be32(f, settings->hash_version);
      -	hashwrite_be32(f, settings->num_hashes);
      -	hashwrite_be32(f, settings->bits_per_entry);
     -+	hashwrite_be32(f, ctx->bloom_settings.hash_version);
     -+	hashwrite_be32(f, ctx->bloom_settings.num_hashes);
     -+	hashwrite_be32(f, ctx->bloom_settings.bits_per_entry);
     ++	hashwrite_be32(f, ctx->bloom_settings->hash_version);
     ++	hashwrite_be32(f, ctx->bloom_settings->num_hashes);
     ++	hashwrite_be32(f, ctx->bloom_settings->bits_per_entry);
       
       	while (list < last) {
       		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
     @@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_con
       	struct object_id file_hash;
       	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
       
     -+	ctx->bloom_settings = bloom_settings;
     ++	ctx->bloom_settings = &bloom_settings;
      +
       	if (ctx->split) {
       		struct strbuf tmp_file = STRBUF_INIT;
  7:  60bbc15d24 =  2:  6b63f9bd8a commit-graph: change test to die on parse, not load
  -:  ---------- >  3:  492deaf916 bloom: get_bloom_filter() cleanups
  8:  db5b8fe843 !  4:  8727b25468 commit-graph: persist existence of changed-paths
     @@ Commit message
          property of "my commit-graph has changed-path filters" automatically. A
          user can drop filters using the --no-changed-paths option.
      
     +    In the process, we need to be extremely careful to match the Bloom
     +    filter settings as specified by the commit-graph. This will allow future
     +    versions of Git to customize these settings, and the version with this
     +    change will persist those settings as commit-graphs are rewritten on
     +    top.
     +
     +    Use the trace2 API to signal the settings used during the write, and
     +    check that output in a test after manually adjusting the correct bytes
     +    in the commit-graph file.
     +
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## Documentation/git-commit-graph.txt ##
     @@ builtin/commit-graph.c: static int graph_write(int argc, const char **argv)
       
      
       ## commit-graph.c ##
     +@@
     + #include "progress.h"
     + #include "bloom.h"
     + #include "commit-slab.h"
     ++#include "json-writer.h"
     ++#include "trace2.h"
     + 
     + void git_test_write_commit_graph_or_die(void)
     + {
     +@@ commit-graph.c: static void write_graph_chunk_bloom_indexes(struct hashfile *f,
     + 	stop_progress(&progress);
     + }
     + 
     ++static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
     ++{
     ++	struct json_writer jw = JSON_WRITER_INIT;
     ++
     ++	jw_object_begin(&jw, 0);
     ++	jw_object_intmax(&jw, "hash_version", ctx->bloom_settings->hash_version);
     ++	jw_object_intmax(&jw, "num_hashes", ctx->bloom_settings->num_hashes);
     ++	jw_object_intmax(&jw, "bits_per_entry", ctx->bloom_settings->bits_per_entry);
     ++	jw_end(&jw);
     ++
     ++	trace2_data_json("bloom", ctx->r, "settings", &jw);
     ++
     ++	jw_release(&jw);
     ++}
     ++
     + static void write_graph_chunk_bloom_data(struct hashfile *f,
     + 					 struct write_commit_graph_context *ctx)
     + {
     +@@ commit-graph.c: static void write_graph_chunk_bloom_data(struct hashfile *f,
     + 	struct progress *progress = NULL;
     + 	int i = 0;
     + 
     ++	trace2_bloom_filter_settings(ctx);
     ++
     + 	if (ctx->report_progress)
     + 		progress = start_delayed_progress(
     + 			_("Writing changed paths Bloom filters data"),
     +@@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_context *ctx)
     + 	struct object_id file_hash;
     + 	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
     + 
     +-	ctx->bloom_settings = &bloom_settings;
     ++	if (!ctx->bloom_settings)
     ++		ctx->bloom_settings = &bloom_settings;
     + 
     + 	if (ctx->split) {
     + 		struct strbuf tmp_file = STRBUF_INIT;
      @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
       	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
       	ctx->check_oids = flags & COMMIT_GRAPH_WRITE_CHECK_OIDS ? 1 : 0;
     @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
       
      +	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
      +		ctx->changed_paths = 1;
     -+	else if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
     ++	if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
     ++		struct commit_graph *g;
      +		prepare_commit_graph_one(ctx->r, ctx->odb);
      +
     ++		g = ctx->r->objects->commit_graph;
     ++
      +		/* We have changed-paths already. Keep them in the next graph */
     -+		if (ctx->r->objects->commit_graph &&
     -+		    ctx->r->objects->commit_graph->chunk_bloom_data)
     ++		if (g && g->chunk_bloom_data) {
      +			ctx->changed_paths = 1;
     ++			ctx->bloom_settings = g->bloom_filter_settings;
     ++		}
      +	}
      +
       	if (ctx->split) {
     @@ t/t4216-log-bloom.sh: test_expect_success 'setup - add commit-graph to the chain
       	test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain
       '
       
     +@@ t/t4216-log-bloom.sh: test_expect_success 'Use Bloom filters if they exist in the latest but not all c
     + 	test_bloom_filters_used_when_some_filters_are_missing "-- A/B"
     + '
     + 
     ++BASE_BDAT_OFFSET=2240
     ++BASE_K_BYTE_OFFSET=$((BASE_BDAT_OFFSET + 10))
     ++BASE_LEN_BYTE_OFFSET=$((BASE_BDAT_OFFSET + 14))
     ++
     ++corrupt_graph() {
     ++	pos=$1
     ++	data="${2:-\0}"
     ++	grepstr=$3
     ++	orig_size=$(wc -c < .git/objects/info/commit-graph) &&
     ++	zero_pos=${4:-${orig_size}} &&
     ++	printf "$data" | dd of=".git/objects/info/commit-graph" bs=1 seek="$pos" conv=notrunc &&
     ++	dd of=".git/objects/info/commit-graph" bs=1 seek="$zero_pos" if=/dev/null
     ++}
     ++
     ++test_expect_success 'persist filter settings' '
     ++	test_when_finished rm -rf .git/objects/info/commit-graph* &&
     ++	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths &&
     ++	grep "{\"hash_version\":1,\"num_hashes\":7,\"bits_per_entry\":10}" trace2.txt &&
     ++	cp .git/objects/info/commit-graph commit-graph-before &&
     ++	corrupt_graph $BASE_K_BYTE_OFFSET "\09" &&
     ++	corrupt_graph $BASE_LEN_BYTE_OFFSET "\0F" &&
     ++	cp .git/objects/info/commit-graph commit-graph-after &&
     ++	test_commit c18 A/corrupt &&
     ++	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths &&
     ++	grep "{\"hash_version\":1,\"num_hashes\":57,\"bits_per_entry\":70}" trace2.txt
     ++'
     ++
     + test_done
     + \ No newline at end of file
  2:  65eb15221c !  5:  244668fec4 commit-graph: unify the signatures of all write_graph_chunk_*() functions
     @@ Commit message
      
       ## commit-graph.c ##
      @@ commit-graph.c: struct write_commit_graph_context {
     - 	struct bloom_filter_settings bloom_settings;
     + 	const struct bloom_filter_settings *bloom_settings;
       };
       
      -static void write_graph_chunk_fanout(struct hashfile *f,
     @@ commit-graph.c: static void write_graph_chunk_bloom_indexes(struct hashfile *f,
      +	return 0;
       }
       
     + static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
     +@@ commit-graph.c: static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
     + 	jw_release(&jw);
     + }
     + 
      -static void write_graph_chunk_bloom_data(struct hashfile *f,
      -					 struct write_commit_graph_context *ctx)
      +static int write_graph_chunk_bloom_data(struct hashfile *f,
  3:  3d24b9802d =  6:  8b959f2f37 commit-graph: simplify chunk writes into loop
  4:  bdca834e6d =  7:  3eb10933dc commit-graph: check chunk sizes after writing
  -:  ---------- >  8:  0bcfc1f051 revision.c: fix whitespace
  -:  ---------- >  9:  719c7091a7 revision: empty pathspecs should not use Bloom filters
  5:  9975fc96f1 ! 10:  9c2076b4ce commit-graph: check all leading directories in changed path Bloom filters
     @@ Commit message
          This adjusts the tracing values in t4216-log-bloom.sh, which provides a
          concrete way to notice the improvement.
      
     +    Helped-by: Taylor Blau <me@ttaylorr.com>
     +    Helped-by: René Scharfe <l.s.r@web.de>
          Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
     @@ revision.c: static void prepare_to_use_bloom_filter(struct rev_info *revs)
       	int last_index;
      -	int len;
      +	size_t len;
     -+	int path_component_nr = 0, j;
     ++	int path_component_nr = 1;
       
       	if (!revs->commits)
       		return;
      @@ revision.c: static void prepare_to_use_bloom_filter(struct rev_info *revs)
     - 
     - 	len = strlen(path);
     + 		return;
     + 	}
       
      -	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
      -	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
      +	p = path;
     -+	do {
     -+		p = strchrnul(p + 1, '/');
     -+		path_component_nr++;
     -+	} while (p - path < len);
     ++	while (*p) {
     ++		if (is_dir_sep(*p))
     ++			path_component_nr++;
     ++		p++;
     ++	}
      +
      +	revs->bloom_keys_nr = path_component_nr;
      +	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
      +
     -+	p = path;
     -+	for (j = 0; j < revs->bloom_keys_nr; j++) {
     -+		p = strchrnul(p + 1, '/');
     ++	fill_bloom_key(path, len, &revs->bloom_keys[0],
     ++		       revs->bloom_filter_settings);
     ++	path_component_nr = 1;
      +
     -+		fill_bloom_key(path, p - path, &revs->bloom_keys[j],
     -+			       revs->bloom_filter_settings);
     ++	p = path + len - 1;
     ++	while (p > path) {
     ++		if (is_dir_sep(*p))
     ++			fill_bloom_key(path, p - path,
     ++				       &revs->bloom_keys[path_component_nr++],
     ++				       revs->bloom_filter_settings);
     ++		p--;
      +	}
       
       	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
  6:  2a5f1e1752 = 11:  1022c0ad21 bloom: enforce a minimum size of 8 bytes

-- 
gitgitgadget

* [PATCH v2 01/11] commit-graph: place bloom_settings in context
  2020-06-23 17:46 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
@ 2020-06-23 17:47   ` Derrick Stolee via GitGitGadget
  2020-06-23 17:47   ` [PATCH v2 02/11] commit-graph: change test to die on parse, not load Derrick Stolee via GitGitGadget
                     ` (11 subsequent siblings)
  12 siblings, 0 replies; 71+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-06-23 17:47 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Place a pointer to a struct bloom_filter_settings into the struct
write_commit_graph_context. This allows simplifying the function
prototype of write_graph_chunk_bloom_data(). This will allow us
to combine the function prototypes and use function pointers to
simplify write_commit_graph_file().

By using a pointer, we can later replace the settings to match those
that exist in the current commit-graph, in case a future Git version
allows customization of these parameters.

Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 887837e882..d0fedcd9b1 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -882,6 +882,7 @@ struct write_commit_graph_context {
 
 	const struct split_commit_graph_opts *split_opts;
 	size_t total_bloom_filter_data_size;
+	const struct bloom_filter_settings *bloom_settings;
 };
 
 static void write_graph_chunk_fanout(struct hashfile *f,
@@ -1103,8 +1104,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
 }
 
 static void write_graph_chunk_bloom_data(struct hashfile *f,
-					 struct write_commit_graph_context *ctx,
-					 const struct bloom_filter_settings *settings)
+					 struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
@@ -1116,9 +1116,9 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
 			_("Writing changed paths Bloom filters data"),
 			ctx->commits.nr);
 
-	hashwrite_be32(f, settings->hash_version);
-	hashwrite_be32(f, settings->num_hashes);
-	hashwrite_be32(f, settings->bits_per_entry);
+	hashwrite_be32(f, ctx->bloom_settings->hash_version);
+	hashwrite_be32(f, ctx->bloom_settings->num_hashes);
+	hashwrite_be32(f, ctx->bloom_settings->bits_per_entry);
 
 	while (list < last) {
 		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
@@ -1541,6 +1541,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	struct object_id file_hash;
 	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 
+	ctx->bloom_settings = &bloom_settings;
+
 	if (ctx->split) {
 		struct strbuf tmp_file = STRBUF_INIT;
 
@@ -1642,7 +1644,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		write_graph_chunk_extra_edges(f, ctx);
 	if (ctx->changed_paths) {
 		write_graph_chunk_bloom_indexes(f, ctx);
-		write_graph_chunk_bloom_data(f, ctx, &bloom_settings);
+		write_graph_chunk_bloom_data(f, ctx);
 	}
 	if (ctx->num_commit_graphs_after > 1 &&
 	    write_graph_chunk_base(f, ctx)) {
-- 
gitgitgadget



* [PATCH v2 02/11] commit-graph: change test to die on parse, not load
  2020-06-23 17:46 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
  2020-06-23 17:47   ` [PATCH v2 01/11] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
@ 2020-06-23 17:47   ` Derrick Stolee via GitGitGadget
  2020-06-23 17:47   ` [PATCH v2 03/11] bloom: get_bloom_filter() cleanups Derrick Stolee via GitGitGadget
                     ` (10 subsequent siblings)
  12 siblings, 0 replies; 71+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-06-23 17:47 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

43d3561 (commit-graph write: don't die if the existing graph is corrupt,
2019-03-25) introduced the GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD environment
variable. This was created to verify that commit-graph was not loaded
when writing a new non-incremental commit-graph.

An upcoming change wants to load a commit-graph in some valuable cases,
but we want to maintain that we don't trust the commit-graph data when
writing our new file. Instead of dying on load, die if we ever
try to parse a commit from the commit-graph. This functionally verifies
the same intended behavior, but allows a more advanced feature in the
next change.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
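To illustrate the load-versus-parse distinction outside of Git, here
is a rough sketch under simplifying assumptions: sketch_graph,
sketch_load() and sketch_parse() are hypothetical, the truthiness
check is much cruder than git_env_bool(), and a return value of -1
replaces the call to die().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature of a commit-graph: only tracks whether it was loaded. */
struct sketch_graph {
	int loaded;
};

/* Simplified truthiness check; Git's git_env_bool() accepts more spellings. */
static int sketch_env_is_true(const char *value)
{
	return value && (!strcmp(value, "true") || !strcmp(value, "1"));
}

/* Loading is always permitted; no environment check happens here. */
static void sketch_load(struct sketch_graph *g)
{
	assert(g != NULL);
	g->loaded = 1;
}

/* Returns -1 where the real code would die(), 0 after a normal parse. */
static int sketch_parse(struct sketch_graph *g, const char *die_on_parse)
{
	if (sketch_env_is_true(die_on_parse))
		return -1;
	if (!g->loaded)
		sketch_load(g);
	return 0;
}
```

A caller would pass the result of getenv() for the test variable as
the second argument. With the variable set, loading still succeeds
and only the attempt to parse is refused.
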
 commit-graph.c          | 12 ++++++++----
 commit-graph.h          |  2 +-
 t/t5318-commit-graph.sh |  2 +-
 3 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index d0fedcd9b1..6a28d4a5a6 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -564,10 +564,6 @@ static int prepare_commit_graph(struct repository *r)
 		return !!r->objects->commit_graph;
 	r->objects->commit_graph_attempted = 1;
 
-	if (git_env_bool(GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD, 0))
-		die("dying as requested by the '%s' variable on commit-graph load!",
-		    GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD);
-
 	prepare_repo_settings(r);
 
 	if (!git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) &&
@@ -790,6 +786,14 @@ static int parse_commit_in_graph_one(struct repository *r,
 
 int parse_commit_in_graph(struct repository *r, struct commit *item)
 {
+	static int checked_env = 0;
+
+	if (!checked_env &&
+	    git_env_bool(GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE, 0))
+		die("dying as requested by the '%s' variable on commit-graph parse!",
+		    GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE);
+	checked_env = 1;
+
 	if (!prepare_commit_graph(r))
 		return 0;
 	return parse_commit_in_graph_one(r, r->objects->commit_graph, item);
diff --git a/commit-graph.h b/commit-graph.h
index 881c9b46e5..f0fb13e3f2 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -5,7 +5,7 @@
 #include "object-store.h"
 
 #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
-#define GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD"
+#define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
 #define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
 
 /*
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 1073f9e3cf..5ec01abdaa 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -436,7 +436,7 @@ corrupt_graph_verify() {
 		cp $objdir/info/commit-graph commit-graph-pre-write-test
 	fi &&
 	git status --short &&
-	GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD=true git commit-graph write &&
+	GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE=true git commit-graph write &&
 	git commit-graph verify
 }
 
-- 
gitgitgadget



* [PATCH v2 03/11] bloom: get_bloom_filter() cleanups
  2020-06-23 17:46 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
  2020-06-23 17:47   ` [PATCH v2 01/11] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
  2020-06-23 17:47   ` [PATCH v2 02/11] commit-graph: change test to die on parse, not load Derrick Stolee via GitGitGadget
@ 2020-06-23 17:47   ` Derrick Stolee via GitGitGadget
  2020-06-25  7:24     ` René Scharfe
  2020-06-23 17:47   ` [PATCH v2 04/11] commit-graph: persist existence of changed-paths Derrick Stolee via GitGitGadget
                     ` (9 subsequent siblings)
  12 siblings, 1 reply; 71+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-06-23 17:47 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The get_bloom_filter() method is a bit complicated in some parts where
it does not need to be. In particular, it needs to return a NULL filter
only when compute_if_not_present is zero AND the filter data cannot be
loaded from a commit-graph file. This currently happens by accident
because the commit-graph does not load changed-path Bloom filters from
an existing commit-graph when writing a new one. This will change in a
later patch.

Also clean up some style issues while we are here.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
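The intended return behaviour can be written down as a small
standalone function. This is only a sketch of the decision order
described above: sketch_filter and sketch_get_filter() are
hypothetical, the graph lookup is reduced to a pointer argument, and
the diff-based computation is replaced by a placeholder byte.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bloom_filter: only the data pointer. */
struct sketch_filter {
	const unsigned char *data;
};

/* Placeholder for the result of the (expensive) diff-based computation. */
static const unsigned char sketch_computed[1] = { 0xff };

/*
 * graph_data is non-NULL when the commit-graph holds a filter for this
 * commit. NULL is returned only if nothing is cached, nothing can be
 * loaded, and the caller did not ask for computation.
 */
static struct sketch_filter *sketch_get_filter(struct sketch_filter *slot,
					       const unsigned char *graph_data,
					       int compute_if_not_present)
{
	assert(slot != NULL);
	if (!slot->data && graph_data) {
		slot->data = graph_data; /* loaded from the commit-graph */
		return slot;
	}
	if (slot->data)
		return slot;
	if (!compute_if_not_present)
		return NULL;
	slot->data = sketch_computed; /* computed on demand */
	return slot;
}
```
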
 bloom.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/bloom.c b/bloom.c
index c38d1cff0c..7291506564 100644
--- a/bloom.c
+++ b/bloom.c
@@ -186,7 +186,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	struct diff_options diffopt;
 	int max_changes = 512;
 
-	if (bloom_filters.slab_size == 0)
+	if (!bloom_filters.slab_size)
 		return NULL;
 
 	filter = bloom_filter_slab_at(&bloom_filters, c);
@@ -194,16 +194,15 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	if (!filter->data) {
 		load_commit_graph_info(r, c);
 		if (c->graph_pos != COMMIT_NOT_FROM_GRAPH &&
-			r->objects->commit_graph->chunk_bloom_indexes) {
-			if (load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
-				return filter;
-			else
-				return NULL;
-		}
+		    r->objects->commit_graph->chunk_bloom_indexes &&
+		    load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
+			return filter;
 	}
 
-	if (filter->data || !compute_if_not_present)
+	if (filter->data)
 		return filter;
+	if (!filter->data && !compute_if_not_present)
+		return NULL;
 
 	repo_diff_setup(r, &diffopt);
 	diffopt.flags.recursive = 1;
-- 
gitgitgadget



* [PATCH v2 04/11] commit-graph: persist existence of changed-paths
  2020-06-23 17:46 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
                     ` (2 preceding siblings ...)
  2020-06-23 17:47   ` [PATCH v2 03/11] bloom: get_bloom_filter() cleanups Derrick Stolee via GitGitGadget
@ 2020-06-23 17:47   ` Derrick Stolee via GitGitGadget
  2020-06-23 17:47   ` [PATCH v2 05/11] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
                     ` (8 subsequent siblings)
  12 siblings, 0 replies; 71+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-06-23 17:47 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The changed-path Bloom filters were released in v2.27.0, but have a
significant drawback. A user can opt in to writing the changed-path
filters using the "--changed-paths" option to "git commit-graph write"
but the next write will drop the filters unless that option is
specified.

This becomes even more important when considering the interaction with
gc.writeCommitGraph (on by default) or fetch.writeCommitGraph (part of
feature.experimental). These config options trigger commit-graph writes
that the user did not signal, and hence there is no --changed-paths
option available.

Allow a user who opts in to the changed-path filters to persist the
property of "my commit-graph has changed-path filters" automatically. A
user can drop filters using the --no-changed-paths option.

In the process, we need to be extremely careful to match the Bloom
filter settings as specified by the commit-graph. This will allow future
versions of Git to customize these settings, and the version with this
change will persist those settings as commit-graphs are rewritten on
top.

Use the trace2 API to signal the settings used during the write, and
check that output in a test after manually adjusting the correct bytes
in the commit-graph file.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
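The tri-state handling may be easier to follow as a standalone
sketch. The names below (SKETCH_WRITE_FILTERS, sketch_keep_filters()
and so on) are invented for illustration, and the
GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS override is deliberately left
out. The option value is -1 when neither flag was given, 0 for
--no-changed-paths and 1 for --changed-paths.

```c
#include <assert.h>

/* Hypothetical flag bits mirroring the two write flags in the patch. */
enum {
	SKETCH_WRITE_FILTERS = 1 << 0,
	SKETCH_NO_WRITE_FILTERS = 1 << 1
};

/* opt: -1 = not specified, 0 = --no-changed-paths, 1 = --changed-paths. */
static unsigned sketch_flags_from_option(int opt)
{
	unsigned flags = 0;

	assert(opt >= -1 && opt <= 1);
	if (!opt)
		flags |= SKETCH_NO_WRITE_FILTERS;
	if (opt == 1)
		flags |= SKETCH_WRITE_FILTERS;
	return flags;
}

/* Decide whether the next graph carries filters, given the existing one. */
static int sketch_keep_filters(unsigned flags, int existing_has_filters)
{
	int changed_paths = 0;

	if (flags & SKETCH_WRITE_FILTERS)
		changed_paths = 1;
	if (!(flags & SKETCH_NO_WRITE_FILTERS) && existing_has_filters)
		changed_paths = 1;
	return changed_paths;
}
```

Only an explicit --no-changed-paths (opt == 0) drops filters that an
existing commit-graph already has; leaving the option unset keeps
whatever state is on disk.
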
 Documentation/git-commit-graph.txt |  5 +++-
 builtin/commit-graph.c             |  5 +++-
 commit-graph.c                     | 38 ++++++++++++++++++++++++++++--
 commit-graph.h                     |  1 +
 t/t4216-log-bloom.sh               | 29 ++++++++++++++++++++++-
 5 files changed, 73 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index f4b13c005b..369b222b08 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -60,7 +60,10 @@ existing commit-graph file.
 With the `--changed-paths` option, compute and write information about the
 paths changed between a commit and it's first parent. This operation can
 take a while on large repositories. It provides significant performance gains
-for getting history of a directory or a file with `git log -- <path>`.
+for getting history of a directory or a file with `git log -- <path>`. If
+this option is given, future commit-graph writes will automatically assume
+that this option was intended. Use `--no-changed-paths` to stop storing this
+data.
 +
 With the `--split` option, write the commit-graph as a chain of multiple
 commit-graph files stored in `<dir>/info/commit-graphs`. The new commits
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 59009837dc..ff7b177c33 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -151,6 +151,7 @@ static int graph_write(int argc, const char **argv)
 	};
 
 	opts.progress = isatty(2);
+	opts.enable_changed_paths = -1;
 	split_opts.size_multiple = 2;
 	split_opts.max_commits = 0;
 	split_opts.expire_time = 0;
@@ -171,7 +172,9 @@ static int graph_write(int argc, const char **argv)
 		flags |= COMMIT_GRAPH_WRITE_SPLIT;
 	if (opts.progress)
 		flags |= COMMIT_GRAPH_WRITE_PROGRESS;
-	if (opts.enable_changed_paths ||
+	if (!opts.enable_changed_paths)
+		flags |= COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS;
+	if (opts.enable_changed_paths == 1 ||
 	    git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0))
 		flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
 
diff --git a/commit-graph.c b/commit-graph.c
index 6a28d4a5a6..908f094271 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -16,6 +16,8 @@
 #include "progress.h"
 #include "bloom.h"
 #include "commit-slab.h"
+#include "json-writer.h"
+#include "trace2.h"
 
 void git_test_write_commit_graph_or_die(void)
 {
@@ -1107,6 +1109,21 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
 	stop_progress(&progress);
 }
 
+static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
+{
+	struct json_writer jw = JSON_WRITER_INIT;
+
+	jw_object_begin(&jw, 0);
+	jw_object_intmax(&jw, "hash_version", ctx->bloom_settings->hash_version);
+	jw_object_intmax(&jw, "num_hashes", ctx->bloom_settings->num_hashes);
+	jw_object_intmax(&jw, "bits_per_entry", ctx->bloom_settings->bits_per_entry);
+	jw_end(&jw);
+
+	trace2_data_json("bloom", ctx->r, "settings", &jw);
+
+	jw_release(&jw);
+}
+
 static void write_graph_chunk_bloom_data(struct hashfile *f,
 					 struct write_commit_graph_context *ctx)
 {
@@ -1115,6 +1132,8 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
 	struct progress *progress = NULL;
 	int i = 0;
 
+	trace2_bloom_filter_settings(ctx);
+
 	if (ctx->report_progress)
 		progress = start_delayed_progress(
 			_("Writing changed paths Bloom filters data"),
@@ -1545,7 +1564,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	struct object_id file_hash;
 	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 
-	ctx->bloom_settings = &bloom_settings;
+	if (!ctx->bloom_settings)
+		ctx->bloom_settings = &bloom_settings;
 
 	if (ctx->split) {
 		struct strbuf tmp_file = STRBUF_INIT;
@@ -1970,9 +1990,23 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
 	ctx->check_oids = flags & COMMIT_GRAPH_WRITE_CHECK_OIDS ? 1 : 0;
 	ctx->split_opts = split_opts;
-	ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
 	ctx->total_bloom_filter_data_size = 0;
 
+	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
+		ctx->changed_paths = 1;
+	if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
+		struct commit_graph *g;
+		prepare_commit_graph_one(ctx->r, ctx->odb);
+
+		g = ctx->r->objects->commit_graph;
+
+		/* We have changed-paths already. Keep them in the next graph */
+		if (g && g->chunk_bloom_data) {
+			ctx->changed_paths = 1;
+			ctx->bloom_settings = g->bloom_filter_settings;
+		}
+	}
+
 	if (ctx->split) {
 		struct commit_graph *g;
 		prepare_commit_graph(ctx->r);
diff --git a/commit-graph.h b/commit-graph.h
index f0fb13e3f2..45b1e5bca3 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -96,6 +96,7 @@ enum commit_graph_write_flags {
 	/* Make sure that each OID in the input is a valid commit OID. */
 	COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3),
 	COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4),
+	COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS = (1 << 5),
 };
 
 struct split_commit_graph_opts {
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index c7011f33e2..426de10041 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -126,7 +126,7 @@ test_expect_success 'setup - add commit-graph to the chain without Bloom filters
 	test_commit c14 A/anotherFile2 &&
 	test_commit c15 A/B/anotherFile2 &&
 	test_commit c16 A/B/C/anotherFile2 &&
-	GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0 git commit-graph write --reachable --split &&
+	git commit-graph write --reachable --split --no-changed-paths &&
 	test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain
 '
 
@@ -152,4 +152,31 @@ test_expect_success 'Use Bloom filters if they exist in the latest but not all c
 	test_bloom_filters_used_when_some_filters_are_missing "-- A/B"
 '
 
+BASE_BDAT_OFFSET=2240
+BASE_K_BYTE_OFFSET=$((BASE_BDAT_OFFSET + 10))
+BASE_LEN_BYTE_OFFSET=$((BASE_BDAT_OFFSET + 14))
+
+corrupt_graph() {
+	pos=$1
+	data="${2:-\0}"
+	grepstr=$3
+	orig_size=$(wc -c < .git/objects/info/commit-graph) &&
+	zero_pos=${4:-${orig_size}} &&
+	printf "$data" | dd of=".git/objects/info/commit-graph" bs=1 seek="$pos" conv=notrunc &&
+	dd of=".git/objects/info/commit-graph" bs=1 seek="$zero_pos" if=/dev/null
+}
+
+test_expect_success 'persist filter settings' '
+	test_when_finished rm -rf .git/objects/info/commit-graph* &&
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths &&
+	grep "{\"hash_version\":1,\"num_hashes\":7,\"bits_per_entry\":10}" trace2.txt &&
+	cp .git/objects/info/commit-graph commit-graph-before &&
+	corrupt_graph $BASE_K_BYTE_OFFSET "\09" &&
+	corrupt_graph $BASE_LEN_BYTE_OFFSET "\0F" &&
+	cp .git/objects/info/commit-graph commit-graph-after &&
+	test_commit c18 A/corrupt &&
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths &&
+	grep "{\"hash_version\":1,\"num_hashes\":57,\"bits_per_entry\":70}" trace2.txt
+'
+
 test_done
\ No newline at end of file
-- 
gitgitgadget



* [PATCH v2 05/11] commit-graph: unify the signatures of all write_graph_chunk_*() functions
  2020-06-23 17:46 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
                     ` (3 preceding siblings ...)
  2020-06-23 17:47   ` [PATCH v2 04/11] commit-graph: persist existence of changed-paths Derrick Stolee via GitGitGadget
@ 2020-06-23 17:47   ` SZEDER Gábor via GitGitGadget
  2020-06-25  7:25     ` René Scharfe
  2020-06-23 17:47   ` [PATCH v2 06/11] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
                     ` (7 subsequent siblings)
  12 siblings, 1 reply; 71+ messages in thread
From: SZEDER Gábor via GitGitGadget @ 2020-06-23 17:47 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee, SZEDER Gábor

From: SZEDER Gábor <szeder.dev@gmail.com>

Update the write_graph_chunk_*() helper functions to have the same
signature:

  - Return an int error code from all these functions.
    write_graph_chunk_base() already has an int error code, now the
    others will have one, too, but since they don't indicate any
    error, they will always return 0.

  - Drop the hash size parameter of write_graph_chunk_oids() and
    write_graph_chunk_data(); its value can be read directly from
    'the_hash_algo' inside these functions as well.

This opens up the possibility for further cleanups and foolproofing in
the following two patches.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 42 ++++++++++++++++++++++++++----------------
 1 file changed, 26 insertions(+), 16 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 908f094271..f33bfe49b3 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -891,8 +891,8 @@ struct write_commit_graph_context {
 	const struct bloom_filter_settings *bloom_settings;
 };
 
-static void write_graph_chunk_fanout(struct hashfile *f,
-				     struct write_commit_graph_context *ctx)
+static int write_graph_chunk_fanout(struct hashfile *f,
+				    struct write_commit_graph_context *ctx)
 {
 	int i, count = 0;
 	struct commit **list = ctx->commits.list;
@@ -913,17 +913,21 @@ static void write_graph_chunk_fanout(struct hashfile *f,
 
 		hashwrite_be32(f, count);
 	}
+
+	return 0;
 }
 
-static void write_graph_chunk_oids(struct hashfile *f, int hash_len,
-				   struct write_commit_graph_context *ctx)
+static int write_graph_chunk_oids(struct hashfile *f,
+				  struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
 	int count;
 	for (count = 0; count < ctx->commits.nr; count++, list++) {
 		display_progress(ctx->progress, ++ctx->progress_cnt);
-		hashwrite(f, (*list)->object.oid.hash, (int)hash_len);
+		hashwrite(f, (*list)->object.oid.hash, (int)the_hash_algo->rawsz);
 	}
+
+	return 0;
 }
 
 static const unsigned char *commit_to_sha1(size_t index, void *table)
@@ -932,8 +936,8 @@ static const unsigned char *commit_to_sha1(size_t index, void *table)
 	return commits[index]->object.oid.hash;
 }
 
-static void write_graph_chunk_data(struct hashfile *f, int hash_len,
-				   struct write_commit_graph_context *ctx)
+static int write_graph_chunk_data(struct hashfile *f,
+				  struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
@@ -950,7 +954,7 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 			die(_("unable to parse commit %s"),
 				oid_to_hex(&(*list)->object.oid));
 		tree = get_commit_tree_oid(*list);
-		hashwrite(f, tree->hash, hash_len);
+		hashwrite(f, tree->hash, the_hash_algo->rawsz);
 
 		parent = (*list)->parents;
 
@@ -1030,10 +1034,12 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 
 		list++;
 	}
+
+	return 0;
 }
 
-static void write_graph_chunk_extra_edges(struct hashfile *f,
-					  struct write_commit_graph_context *ctx)
+static int write_graph_chunk_extra_edges(struct hashfile *f,
+					 struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
@@ -1082,10 +1088,12 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
 
 		list++;
 	}
+
+	return 0;
 }
 
-static void write_graph_chunk_bloom_indexes(struct hashfile *f,
-					    struct write_commit_graph_context *ctx)
+static int write_graph_chunk_bloom_indexes(struct hashfile *f,
+					   struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
@@ -1107,6 +1115,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
 	}
 
 	stop_progress(&progress);
+	return 0;
 }
 
 static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
@@ -1124,8 +1133,8 @@ static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
 	jw_release(&jw);
 }
 
-static void write_graph_chunk_bloom_data(struct hashfile *f,
-					 struct write_commit_graph_context *ctx)
+static int write_graph_chunk_bloom_data(struct hashfile *f,
+					struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
@@ -1151,6 +1160,7 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
 	}
 
 	stop_progress(&progress);
+	return 0;
 }
 
 static int oid_compare(const void *_a, const void *_b)
@@ -1662,8 +1672,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 			num_chunks * ctx->commits.nr);
 	}
 	write_graph_chunk_fanout(f, ctx);
-	write_graph_chunk_oids(f, hashsz, ctx);
-	write_graph_chunk_data(f, hashsz, ctx);
+	write_graph_chunk_oids(f, ctx);
+	write_graph_chunk_data(f, ctx);
 	if (ctx->num_extra_edges)
 		write_graph_chunk_extra_edges(f, ctx);
 	if (ctx->changed_paths) {
-- 
gitgitgadget



* [PATCH v2 06/11] commit-graph: simplify chunk writes into loop
  2020-06-23 17:46 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
                     ` (4 preceding siblings ...)
  2020-06-23 17:47   ` [PATCH v2 05/11] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
@ 2020-06-23 17:47   ` SZEDER Gábor via GitGitGadget
  2020-06-25  7:25     ` René Scharfe
  2020-06-23 17:47   ` [PATCH v2 07/11] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
                     ` (6 subsequent siblings)
  12 siblings, 1 reply; 71+ messages in thread
From: SZEDER Gábor via GitGitGadget @ 2020-06-23 17:47 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee, SZEDER Gábor

From: SZEDER Gábor <szeder.dev@gmail.com>

In write_commit_graph_file() we now have one block of code filling the
array of 'struct chunk_info' with the IDs and sizes of chunks to be
written, and another block of code calling the functions responsible
for writing individual chunks.  In case of optional chunks like Extra
Edge List and Base Graphs List there is also a condition checking
whether that chunk is necessary/desired, and that same condition is
repeated in both blocks of code. Other, newer chunks have similar
optional conditions.

Eliminate these repeated conditions by storing the function pointers
responsible for writing individual chunks in the 'struct chunk_info'
array as well, and calling them in a loop to write the commit-graph
file.  This will open up the possibility for a bit of foolproofing in
the following patch.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
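As a rough illustration of the table-driven approach, the following
standalone sketch uses made-up types and chunk IDs (sketch_chunk,
sketch_out, IDs 1 and 2) rather than the real hashfile API. The
condition for the optional chunk is evaluated once, while the table
is filled; the write loop itself has no conditions left.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical output sink that only counts bytes. */
struct sketch_out {
	size_t written;
};

typedef int (*sketch_write_fn)(struct sketch_out *out);

struct sketch_chunk {
	uint32_t id;
	sketch_write_fn write_fn;
};

static int sketch_write_four(struct sketch_out *out)
{
	out->written += 4;
	return 0;
}

/* A writer that always reports an error, to exercise the failure path. */
static int sketch_write_fail(struct sketch_out *out)
{
	(void)out;
	return -1;
}

/* The optional chunk's condition is checked here and nowhere else. */
static size_t sketch_fill_table(struct sketch_chunk *chunks, int want_optional)
{
	size_t nr = 0;

	chunks[nr].id = 1;
	chunks[nr].write_fn = sketch_write_four;
	nr++;
	if (want_optional) {
		chunks[nr].id = 2;
		chunks[nr].write_fn = sketch_write_four;
		nr++;
	}
	return nr;
}

/* Returns 0 on success; on failure returns -1 and reports the chunk id. */
static int sketch_write_all(const struct sketch_chunk *chunks, size_t nr,
			    struct sketch_out *out, uint32_t *failed_id)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		assert(chunks[i].write_fn != NULL);
		if (chunks[i].write_fn(out)) {
			*failed_id = chunks[i].id;
			return -1;
		}
	}
	return 0;
}
```
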
 commit-graph.c | 31 +++++++++++++++++++------------
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index f33bfe49b3..086fc2d070 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1555,9 +1555,13 @@ static int write_graph_chunk_base(struct hashfile *f,
 	return 0;
 }
 
+typedef int (*chunk_write_fn)(struct hashfile *f,
+			      struct write_commit_graph_context *ctx);
+
 struct chunk_info {
 	uint32_t id;
 	uint64_t size;
+	chunk_write_fn write_fn;
 };
 
 static int write_commit_graph_file(struct write_commit_graph_context *ctx)
@@ -1615,27 +1619,34 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 
 	chunks[0].id = GRAPH_CHUNKID_OIDFANOUT;
 	chunks[0].size = GRAPH_FANOUT_SIZE;
+	chunks[0].write_fn = write_graph_chunk_fanout;
 	chunks[1].id = GRAPH_CHUNKID_OIDLOOKUP;
 	chunks[1].size = hashsz * ctx->commits.nr;
+	chunks[1].write_fn = write_graph_chunk_oids;
 	chunks[2].id = GRAPH_CHUNKID_DATA;
 	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
+	chunks[2].write_fn = write_graph_chunk_data;
 	if (ctx->num_extra_edges) {
 		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
 		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
+		chunks[num_chunks].write_fn = write_graph_chunk_extra_edges;
 		num_chunks++;
 	}
 	if (ctx->changed_paths) {
 		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMINDEXES;
 		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
+		chunks[num_chunks].write_fn = write_graph_chunk_bloom_indexes;
 		num_chunks++;
 		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMDATA;
 		chunks[num_chunks].size = sizeof(uint32_t) * 3
 					  + ctx->total_bloom_filter_data_size;
+		chunks[num_chunks].write_fn = write_graph_chunk_bloom_data;
 		num_chunks++;
 	}
 	if (ctx->num_commit_graphs_after > 1) {
 		chunks[num_chunks].id = GRAPH_CHUNKID_BASE;
 		chunks[num_chunks].size = hashsz * (ctx->num_commit_graphs_after - 1);
+		chunks[num_chunks].write_fn = write_graph_chunk_base;
 		num_chunks++;
 	}
 
@@ -1671,19 +1682,15 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 			progress_title.buf,
 			num_chunks * ctx->commits.nr);
 	}
-	write_graph_chunk_fanout(f, ctx);
-	write_graph_chunk_oids(f, ctx);
-	write_graph_chunk_data(f, ctx);
-	if (ctx->num_extra_edges)
-		write_graph_chunk_extra_edges(f, ctx);
-	if (ctx->changed_paths) {
-		write_graph_chunk_bloom_indexes(f, ctx);
-		write_graph_chunk_bloom_data(f, ctx);
-	}
-	if (ctx->num_commit_graphs_after > 1 &&
-	    write_graph_chunk_base(f, ctx)) {
-		return -1;
+
+	for (i = 0; i < num_chunks; i++) {
+		if (chunks[i].write_fn(f, ctx)) {
+			error(_("failed writing chunk with id %"PRIx32""),
+			      chunks[i].id);
+			return -1;
+		}
 	}
+
 	stop_progress(&ctx->progress);
 	strbuf_release(&progress_title);
 
-- 
gitgitgadget



* [PATCH v2 07/11] commit-graph: check chunk sizes after writing
  2020-06-23 17:46 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
                     ` (5 preceding siblings ...)
  2020-06-23 17:47   ` [PATCH v2 06/11] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
@ 2020-06-23 17:47   ` SZEDER Gábor via GitGitGadget
  2020-06-25  7:25     ` René Scharfe
  2020-06-23 17:47   ` [PATCH v2 08/11] revision.c: fix whitespace Derrick Stolee via GitGitGadget
                     ` (5 subsequent siblings)
  12 siblings, 1 reply; 71+ messages in thread
From: SZEDER Gábor via GitGitGadget @ 2020-06-23 17:47 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee, SZEDER Gábor

From: SZEDER Gábor <szeder.dev@gmail.com>

In my experience while experimenting with new commit-graph chunks,
early versions of the corresponding new write_commit_graph_my_chunk()
functions are, sadly but not surprisingly, often buggy, and write more
or less data than they are supposed to, especially if the chunk size
is not directly proportional to the number of commits.  This then
causes all kinds of issues when reading such a bogus commit-graph
file, raising the question of whether the writing or the reading part
happens to be buggy this time.

Let's catch such issues early, already when writing the commit-graph
file, and check that each write_graph_chunk_*() function wrote the
amount of data that it was expected to, and what has been encoded in
the Chunk Lookup table.  Now that all commit-graph chunks are written
in a loop we can do this check in a single place for all chunks, and
any chunks added in the future will get checked as well.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
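The bookkeeping behind the check can be sketched on its own. The
types and writers below are hypothetical (sketch_file only mimics the
total and offset counters of struct hashfile), and the function
returns the index of the first offending chunk instead of calling
BUG(). A writer that reports an error is treated as an offender too.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct hashfile: bytes flushed plus bytes buffered. */
struct sketch_file {
	uint64_t total;
	uint64_t offset;
};

typedef int (*sketch_write_fn)(struct sketch_file *f);

struct sketch_chunk {
	uint64_t size; /* size promised in the chunk lookup table */
	sketch_write_fn write_fn;
};

/* Two toy writers that "write" by advancing the buffered offset. */
static int sketch_write_8(struct sketch_file *f)
{
	f->offset += 8;
	return 0;
}

static int sketch_write_12(struct sketch_file *f)
{
	f->offset += 12;
	return 0;
}

/* Returns -1 if every chunk wrote its promised size, else the bad index. */
static long sketch_first_bad_chunk(const struct sketch_chunk *chunks,
				   size_t nr, struct sketch_file *f)
{
	uint64_t chunk_offset = f->total + f->offset;
	size_t i;

	for (i = 0; i < nr; i++) {
		uint64_t end_offset;

		assert(chunks[i].write_fn != NULL);
		if (chunks[i].write_fn(f))
			return (long)i;
		end_offset = f->total + f->offset;
		if (end_offset - chunk_offset != chunks[i].size)
			return (long)i;
		chunk_offset = end_offset;
	}
	return -1;
}
```

Because the position is sampled before and after each writer, the
check does not need to know anything about what a chunk contains.
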
 commit-graph.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 086fc2d070..1de6800d74 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1683,12 +1683,21 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 			num_chunks * ctx->commits.nr);
 	}
 
+	chunk_offset = f->total + f->offset;
 	for (i = 0; i < num_chunks; i++) {
+		uint64_t end_offset;
+
 		if (chunks[i].write_fn(f, ctx)) {
 			error(_("failed writing chunk with id %"PRIx32""),
 			      chunks[i].id);
 			return -1;
 		}
+
+		end_offset = f->total + f->offset;
+		if (end_offset - chunk_offset != chunks[i].size)
+			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
+			    chunks[i].size, chunks[i].id, end_offset - chunk_offset);
+		chunk_offset = end_offset;
 	}
 
 	stop_progress(&ctx->progress);
-- 
gitgitgadget



* [PATCH v2 08/11] revision.c: fix whitespace
  2020-06-23 17:46 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
                     ` (6 preceding siblings ...)
  2020-06-23 17:47   ` [PATCH v2 07/11] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
@ 2020-06-23 17:47   ` Derrick Stolee via GitGitGadget
  2020-06-23 17:47   ` [PATCH v2 09/11] revision: empty pathspecs should not use Bloom filters Taylor Blau via GitGitGadget
                     ` (4 subsequent siblings)
  12 siblings, 0 replies; 71+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-06-23 17:47 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Here, four spaces were used instead of tab characters.

Reported-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/revision.c b/revision.c
index c644c66091..ed59084f50 100644
--- a/revision.c
+++ b/revision.c
@@ -697,11 +697,11 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 
 	/* remove single trailing slash from path, if needed */
 	if (pi->match[last_index] == '/') {
-	    path_alloc = xstrdup(pi->match);
-	    path_alloc[last_index] = '\0';
-	    path = path_alloc;
+		path_alloc = xstrdup(pi->match);
+		path_alloc[last_index] = '\0';
+		path = path_alloc;
 	} else
-	    path = pi->match;
+		path = pi->match;
 
 	len = strlen(path);
 
-- 
gitgitgadget



* [PATCH v2 09/11] revision: empty pathspecs should not use Bloom filters
  2020-06-23 17:46 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
                     ` (7 preceding siblings ...)
  2020-06-23 17:47   ` [PATCH v2 08/11] revision.c: fix whitespace Derrick Stolee via GitGitGadget
@ 2020-06-23 17:47   ` Taylor Blau via GitGitGadget
  2020-06-23 17:47   ` [PATCH v2 10/11] commit-graph: check all leading directories in changed path " SZEDER Gábor via GitGitGadget
                     ` (3 subsequent siblings)
  12 siblings, 0 replies; 71+ messages in thread
From: Taylor Blau via GitGitGadget @ 2020-06-23 17:47 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee, Taylor Blau

From: Taylor Blau <me@ttaylorr.com>

The prepare_to_use_bloom_filter() method was not intended to be called
on an empty pathspec. However, 'git log -- .' and 'git log' are subtly
different: the latter reports all commits while the former will simplify
commits that do not change the root tree.

This means that the path used to construct the bloom_key might be empty,
and that value is not added to the Bloom filter during construction.
That means that the results are likely incorrect!

To resolve the issue, be careful about the length of the path and stop
filling Bloom filters. To be completely sure we do not use them, drop
the pointer to the bloom_filter_settings from the commit-graph. That
allows our test to look at the trace2 logs to verify no Bloom filter
statistics are reported.
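
The guard can be illustrated outside of revision.c with a minimal
sketch. The helper name bloom_path_or_null() is hypothetical and not
part of this patch; it only mirrors the "strip one trailing slash, then
refuse an empty path" logic described above:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not part of revision.c: strip one trailing
 * slash from a pathspec match and decide whether a Bloom key can be
 * built from the result. Returns a newly allocated string that the
 * caller must free(), or NULL when the path is empty and Bloom
 * filters must not be used.
 */
static char *bloom_path_or_null(const char *match)
{
	size_t len;
	char *path;

	assert(match != NULL);
	len = strlen(match);

	/* remove a single trailing slash, if present */
	if (len && match[len - 1] == '/')
		len--;

	/* an empty path was never added to any filter */
	if (!len)
		return NULL;

	path = malloc(len + 1);
	if (!path)
		abort();
	memcpy(path, match, len);
	path[len] = '\0';
	return path;
}
```

A NULL return corresponds to the early return in the patch, where
revs->bloom_filter_settings is cleared so that no filter is consulted.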

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c           | 4 ++++
 t/t4216-log-bloom.sh | 4 ++++
 2 files changed, 8 insertions(+)

diff --git a/revision.c b/revision.c
index ed59084f50..b53377cd52 100644
--- a/revision.c
+++ b/revision.c
@@ -704,6 +704,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 		path = pi->match;
 
 	len = strlen(path);
+	if (!len) {
+		revs->bloom_filter_settings = NULL;
+		return;
+	}
 
 	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
 	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 426de10041..f890cc4737 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -112,6 +112,10 @@ test_expect_success 'git log -- multiple path specs does not use Bloom filters'
 	test_bloom_filters_not_used "-- file4 A/file1"
 '
 
+test_expect_success 'git log -- "." pathspec at root does not use Bloom filters' '
+	test_bloom_filters_not_used "-- ."
+'
+
 test_expect_success 'git log with wildcard that resolves to a single path uses Bloom filters' '
 	test_bloom_filters_used "-- *4" &&
 	test_bloom_filters_used "-- *renamed"
-- 
gitgitgadget



* [PATCH v2 10/11] commit-graph: check all leading directories in changed path Bloom filters
  2020-06-23 17:46 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
                     ` (8 preceding siblings ...)
  2020-06-23 17:47   ` [PATCH v2 09/11] revision: empty pathspecs should not use Bloom filters Taylor Blau via GitGitGadget
@ 2020-06-23 17:47   ` SZEDER Gábor via GitGitGadget
  2020-06-25  7:25     ` René Scharfe
  2020-06-23 17:47   ` [PATCH v2 11/11] bloom: enforce a minimum size of 8 bytes Derrick Stolee via GitGitGadget
                     ` (2 subsequent siblings)
  12 siblings, 1 reply; 71+ messages in thread
From: SZEDER Gábor via GitGitGadget @ 2020-06-23 17:47 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee, SZEDER Gábor

From: SZEDER Gábor <szeder.dev@gmail.com>

The file 'dir/subdir/file' can only be modified if its leading
directories 'dir' and 'dir/subdir' are modified as well.

When looking for commits that modify a path with multiple path
components, check not only the full path in the modified path Bloom
filters, but all of its leading directories as
well.  Take care to check these paths in "deepest first" order,
because it's the full path that is least likely to be modified, and
the Bloom filter queries can short circuit sooner.
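
As a minimal sketch of the "deepest first" ordering, the hypothetical
helper below (leading_prefix_lengths() is not a function in Git) only
records the length of each prefix that would get its own Bloom key. It
treats '/' as the sole separator, whereas the patch uses is_dir_sep():

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: record the lengths of the prefixes of `path`
 * that would each get a Bloom key, deepest first. lens[0] is the full
 * path; each following entry ends just before a '/' found while
 * scanning backwards. Returns the number of entries written, which is
 * at most `max`.
 */
static size_t leading_prefix_lengths(const char *path, size_t *lens,
				     size_t max)
{
	size_t len, n = 0;
	const char *p;

	assert(path != NULL && lens != NULL);
	len = strlen(path);
	if (!len || !max)
		return 0;

	/* the full path is the least likely to match, so it goes first */
	lens[n++] = len;

	/* walk backwards so that deeper directories come earlier */
	for (p = path + len - 1; p > path && n < max; p--)
		if (*p == '/')
			lens[n++] = (size_t)(p - path);

	return n;
}
```

For 'dir/subdir/file' this yields the lengths of 'dir/subdir/file',
'dir/subdir' and 'dir', in that order, so a query loop can stop at the
first key that the filter reports as "definitely not".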

This can significantly reduce the average false positive rate, by
about an order of magnitude or three(!), and can further speed up
pathspec-limited revision walks.  The table below compares the average
false positive rate and runtime of

  git rev-list HEAD -- "$path"

before and after this change for 5000+ randomly* selected paths from
each repository:

                    Average false           Average        Average
                    positive rate           runtime        runtime
                  before     after     before     after   difference
  ------------------------------------------------------------------
  git             3.220%   0.7853%     0.0558s   0.0387s   -30.6%
  linux           2.453%   0.0296%     0.1046s   0.0766s   -26.8%
  tensorflow      2.536%   0.6977%     0.0594s   0.0420s   -29.2%

*Path selection was done with the following pipeline:

	git ls-tree -r --name-only HEAD | sort -R | head -n 5000

The improvements in runtime are much smaller than the improvements in
average false positive rate, as we are clearly reaching diminishing
returns here.  However, all these timings depend on access to tree
objects being reasonably fast (warm caches).  If we had a partial
clone and the tree objects had to be fetched from a promisor remote,
e.g.:

  $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
  $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
        commit-graph write --reachable
  $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
  $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
        rev-list HEAD -- "$path"

then checking all leading path components can reduce the runtime from
over an hour to a few seconds (and this is with the clone and the
promisor on the same machine).

This adjusts the tracing values in t4216-log-bloom.sh, which provides a
concrete way to notice the improvement.

Helped-by: Taylor Blau <me@ttaylorr.com>
Helped-by: René Scharfe <l.s.r@web.de>
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c           | 41 ++++++++++++++++++++++++++++++++---------
 revision.h           |  6 ++++--
 t/t4216-log-bloom.sh |  2 +-
 3 files changed, 37 insertions(+), 12 deletions(-)

diff --git a/revision.c b/revision.c
index b53377cd52..077888ee51 100644
--- a/revision.c
+++ b/revision.c
@@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 {
 	struct pathspec_item *pi;
 	char *path_alloc = NULL;
-	const char *path;
+	const char *path, *p;
 	int last_index;
-	int len;
+	size_t len;
+	int path_component_nr = 1;
 
 	if (!revs->commits)
 		return;
@@ -709,8 +710,28 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 		return;
 	}
 
-	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
-	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
+	p = path;
+	while (*p) {
+		if (is_dir_sep(*p))
+			path_component_nr++;
+		p++;
+	}
+
+	revs->bloom_keys_nr = path_component_nr;
+	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
+
+	fill_bloom_key(path, len, &revs->bloom_keys[0],
+		       revs->bloom_filter_settings);
+	path_component_nr = 1;
+
+	p = path + len - 1;
+	while (p > path) {
+		if (is_dir_sep(*p))
+			fill_bloom_key(path, p - path,
+				       &revs->bloom_keys[path_component_nr++],
+				       revs->bloom_filter_settings);
+		p--;
+	}
 
 	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
 		atexit(trace2_bloom_filter_statistics_atexit);
@@ -724,7 +745,7 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 						 struct commit *commit)
 {
 	struct bloom_filter *filter;
-	int result;
+	int result = 1, j;
 
 	if (!revs->repo->objects->commit_graph)
 		return -1;
@@ -744,9 +765,11 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 		return -1;
 	}
 
-	result = bloom_filter_contains(filter,
-				       revs->bloom_key,
-				       revs->bloom_filter_settings);
+	for (j = 0; result && j < revs->bloom_keys_nr; j++) {
+		result = bloom_filter_contains(filter,
+					       &revs->bloom_keys[j],
+					       revs->bloom_filter_settings);
+	}
 
 	if (result)
 		count_bloom_filter_maybe++;
@@ -786,7 +809,7 @@ static int rev_compare_tree(struct rev_info *revs,
 			return REV_TREE_SAME;
 	}
 
-	if (revs->bloom_key && !nth_parent) {
+	if (revs->bloom_keys_nr && !nth_parent) {
 		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
 
 		if (bloom_ret == 0)
diff --git a/revision.h b/revision.h
index 7c026fe41f..abbfb4ab59 100644
--- a/revision.h
+++ b/revision.h
@@ -295,8 +295,10 @@ struct rev_info {
 	struct topo_walk_info *topo_walk_info;
 
 	/* Commit graph bloom filter fields */
-	/* The bloom filter key for the pathspec */
-	struct bloom_key *bloom_key;
+	/* The bloom filter key(s) for the pathspec */
+	struct bloom_key *bloom_keys;
+	int bloom_keys_nr;
+
 	/*
 	 * The bloom filter settings used to generate the key.
 	 * This is loaded from the commit-graph being used.
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index f890cc4737..84f95972ca 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -146,7 +146,7 @@ test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
 
 test_bloom_filters_used_when_some_filters_are_missing () {
 	log_args=$1
-	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":8,\"definitely_not\":6"
+	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":8"
 	setup "$log_args" &&
 	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
 	test_cmp log_wo_bloom log_w_bloom
-- 
gitgitgadget



* [PATCH v2 11/11] bloom: enforce a minimum size of 8 bytes
  2020-06-23 17:46 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
                     ` (9 preceding siblings ...)
  2020-06-23 17:47   ` [PATCH v2 10/11] commit-graph: check all leading directories in changed path " SZEDER Gábor via GitGitGadget
@ 2020-06-23 17:47   ` Derrick Stolee via GitGitGadget
  2020-06-24 23:11   ` [PATCH v2 00/11] More commit-graph/Bloom filter improvements Junio C Hamano
  2020-06-26 12:30   ` [PATCH v3 00/10] " Derrick Stolee via GitGitGadget
  12 siblings, 0 replies; 71+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-06-23 17:47 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The original design of changed-path Bloom filters included an 8-byte
block size for filter lengths. This was changed mid-way through the
submission process, and now the length stored in the commit-graph has
one-byte granularity.

This can cause some issues for very small filters. The analysis for
false positive rates assumes large filters, so rounding errors become
less important at that scale. When there are only a few paths changed,
a filter that has size only a few bytes could have very different
behavior. In fact, this is evidenced in the Git repository due to the
code organization and careful patch creation that leads to many commits
with very small filters. These small filters frequently have
false-positive rates in the 8-10% range or higher.
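
For reference, the textbook large-filter approximation for a Bloom
filter with m bits, n entries and k hash functions is shown below. This
is an assumption about which analysis is meant; the patch itself does
not spell it out:

```latex
p \approx \left(1 - e^{-kn/m}\right)^{k},
\qquad
\frac{m}{n} = 10,\; k = 7
\;\Rightarrow\;
p \approx \left(1 - e^{-0.7}\right)^{7} \approx 0.0082
```

With the settings that appear in this thread's test output (num_hashes
7, bits_per_entry 10), the approximation predicts a rate of roughly
0.8%, about a tenth of what is observed for the smallest filters.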

The previous change improved the false-positive rate using multiple
Bloom keys when the path has multiple directory components. However,
that does not help at all for files at root. It is typical to have
several commits that change only the README at root, and those commits
would be likely to have these artificially high false-positive rates.

Correct this issue by creating a minimum filter size of 8 bytes. This
requires the very small commits (with fewer than six changes, including
non-root directories) to have a larger filter. In principle, this
violates the bits_per_entry value of struct bloom_filter_settings.
However, it does not actually create a functional problem.
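
The size computation can be sketched in isolation as follows. The
function name bloom_filter_len() and the MIN_FILTER_BYTES macro are
hypothetical (the patch hard-codes 8), and BITS_PER_WORD is assumed to
be 8 to match the one-byte granularity described above:

```c
#include <assert.h>
#include <stddef.h>

#define BITS_PER_WORD 8      /* assumed: one-byte granularity */
#define MIN_FILTER_BYTES 8   /* hypothetical name; the patch hard-codes 8 */

/*
 * Hypothetical helper: number of bytes for a filter holding nr_paths
 * entries. An empty filter stays empty; any other filter is rounded
 * up to whole bytes and then raised to the minimum size.
 */
static size_t bloom_filter_len(size_t nr_paths, size_t bits_per_entry)
{
	size_t len;

	assert(bits_per_entry > 0);
	len = (nr_paths * bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;

	if (len && len < MIN_FILTER_BYTES)
		len = MIN_FILTER_BYTES;
	return len;
}
```

With 10 bits per entry, five entries need 50 bits (7 bytes) and are
raised to 8 bytes, while six entries already need 8 bytes, which is
where the "fewer than six changes" threshold comes from.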

As for compatibility, this only affects new versions writing filters for
commits that do not yet have a filter. Old versions will write the
smaller filters and this version will persist and properly read that
data. Now, the new files will be generated slightly larger.

               Bytes before   Bytes after  Difference
  --------------------------------------------------
  git             4,021,078    4,275,311   +6.32%
  linux          72,212,101   73,909,286   +2.35%
  tensorflow      7,596,359    7,691,646   +1.25%

This has a measurable improvement in the false-positive rate and the
end-to-end run time for these repos. The table below compares the average
false-positive rate and runtime of

  git rev-list HEAD -- "$path"

before and after this change for 5000+ randomly* selected paths from
each repository:

                    Average false           Average        Average
                    positive rate           runtime        runtime
                  before     after     before     after   difference
  ------------------------------------------------------------------
  git             0.786%    0.227%     0.0387s   0.0289s   -25.5%
  linux           0.0296%   0.0174%    0.0766s   0.0706s    -7.8%
  tensorflow      0.6977%   0.0268%    0.0420s   0.0384s    -8.5%

*Path selection was done with the following pipeline:

        git ls-tree -r --name-only HEAD | sort -R | head -n 5000

These relatively-small increases in file size appear to be a fair price
to pay for these performance improvements.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 bloom.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/bloom.c b/bloom.c
index 7291506564..e9dc15976c 100644
--- a/bloom.c
+++ b/bloom.c
@@ -257,6 +257,10 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 		}
 
 		filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
+
+		if (filter->len && filter->len < 8)
+			filter->len = 8;
+
 		filter->data = xcalloc(filter->len, sizeof(unsigned char));
 
 		hashmap_for_each_entry(&pathmap, &iter, e, entry) {
-- 
gitgitgadget


* Re: [PATCH v2 00/11] More commit-graph/Bloom filter improvements
  2020-06-23 17:46 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
                     ` (10 preceding siblings ...)
  2020-06-23 17:47   ` [PATCH v2 11/11] bloom: enforce a minimum size of 8 bytes Derrick Stolee via GitGitGadget
@ 2020-06-24 23:11   ` Junio C Hamano
  2020-06-24 23:32     ` Derrick Stolee
  2020-06-26 12:30   ` [PATCH v3 00/10] " Derrick Stolee via GitGitGadget
  12 siblings, 1 reply; 71+ messages in thread
From: Junio C Hamano @ 2020-06-24 23:11 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, szeder.dev, l.s.r, Derrick Stolee

This does not seem to play well with what is in flight.  Tests seem
to pass with topics up to es/config-hooks merged but not with this
topic merged on top.

    1b5d3d8260 Merge branch 'ds/commit-graph-bloom-updates' into seen
    32169c595c Merge branch 'es/config-hooks' into seen
    ...

$ sh t4216-log-bloom.sh -i -v

ends like so:

ok 133 - Use Bloom filters if they exist in the latest but not all commit graphs in the chain.

expecting success of 4216.134 'persist filter settings':
        test_when_finished rm -rf .git/objects/info/commit-graph* &&
        GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths &&
        grep "{\"hash_version\":1,\"num_hashes\":7,\"bits_per_entry\":10}" trace2.txt &&
        cp .git/objects/info/commit-graph commit-graph-before &&
        corrupt_graph $BASE_K_BYTE_OFFSET "\09" &&
        corrupt_graph $BASE_LEN_BYTE_OFFSET "\0F" &&
        cp .git/objects/info/commit-graph commit-graph-after &&
        test_commit c18 A/corrupt &&
        GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths &&
        grep "{\"hash_version\":1,\"num_hashes\":57,\"bits_per_entry\":70}" trace2.txt

not ok 134 - persist filter settings
# ...

Thanks.


* Re: [PATCH v2 00/11] More commit-graph/Bloom filter improvements
  2020-06-24 23:11   ` [PATCH v2 00/11] More commit-graph/Bloom filter improvements Junio C Hamano
@ 2020-06-24 23:32     ` Derrick Stolee
  2020-06-25  0:38       ` Junio C Hamano
  0 siblings, 1 reply; 71+ messages in thread
From: Derrick Stolee @ 2020-06-24 23:32 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, me, szeder.dev, l.s.r, Derrick Stolee

On 6/24/2020 7:11 PM, Junio C Hamano wrote:
> This does not seem to play well with what is in flight.  Tests seem
> to pass with topics up to es/config-hooks merged but not with this
> topic merged on top.
> 
>     1b5d3d8260 Merge branch 'ds/commit-graph-bloom-updates' into seen
>     32169c595c Merge branch 'es/config-hooks' into seen
>     ...
> 
> $ sh t4216-log-bloom.sh -i -v
> 
> ends like so:
> 
> ok 133 - Use Bloom filters if they exist in the latest but not all commit graphs in the chain.
> 
> expecting success of 4216.134 'persist filter settings':
>         test_when_finished rm -rf .git/objects/info/commit-graph* &&
>         GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths &&
>         grep "{\"hash_version\":1,\"num_hashes\":7,\"bits_per_entry\":10}" trace2.txt &&
>         cp .git/objects/info/commit-graph commit-graph-before &&
>         corrupt_graph $BASE_K_BYTE_OFFSET "\09" &&
>         corrupt_graph $BASE_LEN_BYTE_OFFSET "\0F" &&
>         cp .git/objects/info/commit-graph commit-graph-after &&
>         test_commit c18 A/corrupt &&
>         GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths &&
>         grep "{\"hash_version\":1,\"num_hashes\":57,\"bits_per_entry\":70}" trace2.txt
> 
> not ok 134 - persist filter settings
> # ...
> 
> Thanks.

Thanks for letting me know. I'll investigate carefully with the
rest of the 'seen' branch. This test is a bit fragile due to
computed values for which bytes to replace, so anything that
could have changed the length or order of chunks would lead to
a failure here.

Sorry for the disruption.

-Stolee



* Re: [PATCH v2 00/11] More commit-graph/Bloom filter improvements
  2020-06-24 23:32     ` Derrick Stolee
@ 2020-06-25  0:38       ` Junio C Hamano
  2020-06-25 13:38         ` Derrick Stolee
  0 siblings, 1 reply; 71+ messages in thread
From: Junio C Hamano @ 2020-06-25  0:38 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, szeder.dev, l.s.r,
	Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> On 6/24/2020 7:11 PM, Junio C Hamano wrote:
>> This does not seem to play well with what is in flight.  Tests seem
>> to pass with topics up to es/config-hooks merged but not with this
>> topic merged on top.
>> 
>>     1b5d3d8260 Merge branch 'ds/commit-graph-bloom-updates' into seen
>>     32169c595c Merge branch 'es/config-hooks' into seen
>>     ...
>> 
>> $ sh t4216-log-bloom.sh -i -v
>> 
>> ends like so:
>> 
>> ok 133 - Use Bloom filters if they exist in the latest but not all commit graphs in the chain.
>> 
>> expecting success of 4216.134 'persist filter settings':
>>         test_when_finished rm -rf .git/objects/info/commit-graph* &&
>>         GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths &&
>>         grep "{\"hash_version\":1,\"num_hashes\":7,\"bits_per_entry\":10}" trace2.txt &&
>>         cp .git/objects/info/commit-graph commit-graph-before &&
>>         corrupt_graph $BASE_K_BYTE_OFFSET "\09" &&
>>         corrupt_graph $BASE_LEN_BYTE_OFFSET "\0F" &&
>>         cp .git/objects/info/commit-graph commit-graph-after &&
>>         test_commit c18 A/corrupt &&
>>         GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths &&
>>         grep "{\"hash_version\":1,\"num_hashes\":57,\"bits_per_entry\":70}" trace2.txt
>> 
>> not ok 134 - persist filter settings
>> # ...
>> 
>> Thanks.
>
> Thanks for letting me know. I'll investigate carefully with the
> rest of the 'seen' branch. This test is a bit fragile due to
> computed values for which bytes to replace, so anything that
> could have changed the length or order of chunks would lead to
> a failure here.
>
> Sorry for the disruption.

Oh, not at all.  Thanks for helping.


* Re: [PATCH v2 03/11] bloom: get_bloom_filter() cleanups
  2020-06-23 17:47   ` [PATCH v2 03/11] bloom: get_bloom_filter() cleanups Derrick Stolee via GitGitGadget
@ 2020-06-25  7:24     ` René Scharfe
  0 siblings, 0 replies; 71+ messages in thread
From: René Scharfe @ 2020-06-25  7:24 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git; +Cc: me, szeder.dev, Derrick Stolee

Am 23.06.20 um 19:47 schrieb Derrick Stolee via GitGitGadget:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The get_bloom_filter() method is a bit complicated in some parts where
> it does not need to be. In particular, it needs to return a NULL filter
> only when compute_if_not_present is zero AND the filter data cannot be
> loaded from a commit-graph file. This currently happens by accident
> because the commit-graph does not load changed-path Bloom filters from
> an existing commit-graph when writing a new one. This will change in a
> later patch.

So this is actually a logic fix, not just a cleanup as the subject says?

>
> Also clean up some style issues while we are here.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  bloom.c | 15 +++++++--------
>  1 file changed, 7 insertions(+), 8 deletions(-)
>
> diff --git a/bloom.c b/bloom.c
> index c38d1cff0c..7291506564 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -186,7 +186,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  	struct diff_options diffopt;
>  	int max_changes = 512;
>
> -	if (bloom_filters.slab_size == 0)
> +	if (!bloom_filters.slab_size)
>  		return NULL;
>
>  	filter = bloom_filter_slab_at(&bloom_filters, c);
> @@ -194,16 +194,15 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  	if (!filter->data) {
>  		load_commit_graph_info(r, c);
>  		if (c->graph_pos != COMMIT_NOT_FROM_GRAPH &&
> -			r->objects->commit_graph->chunk_bloom_indexes) {
> -			if (load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
> -				return filter;
> -			else
> -				return NULL;

... and the fix is that this else branch should not be taken if
compute_if_not_present is set.

> -		}
> +		    r->objects->commit_graph->chunk_bloom_indexes &&
> +		    load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
> +			return filter;

You could even drop this return as well and have the check below handle the
successful load case.

>  	}
>
> -	if (filter->data || !compute_if_not_present)
> +	if (filter->data)
>  		return filter;
> +	if (!filter->data && !compute_if_not_present)
            ^^^^^^^^^^^^^
The first condition is always true, as the check two lines above makes sure.
Removing it would be cleaner IMHO.

> +		return NULL;
>
>  	repo_diff_setup(r, &diffopt);
>  	diffopt.flags.recursive = 1;
>


* Re: [PATCH v2 05/11] commit-graph: unify the signatures of all write_graph_chunk_*() functions
  2020-06-23 17:47   ` [PATCH v2 05/11] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
@ 2020-06-25  7:25     ` René Scharfe
  0 siblings, 0 replies; 71+ messages in thread
From: René Scharfe @ 2020-06-25  7:25 UTC (permalink / raw)
  To: SZEDER Gábor via GitGitGadget, git; +Cc: me, szeder.dev, Derrick Stolee

Am 23.06.20 um 19:47 schrieb SZEDER Gábor via GitGitGadget:
> From: SZEDER Gábor <szeder.dev@gmail.com>
>
> Update the write_graph_chunk_*() helper functions to have the same
> signature:
>
>   - Return an int error code from all these functions.
>     write_graph_chunk_base() already has an int error code, now the
>     others will have one, too, but since they don't indicate any
>     error, they will always return 0.
>
>   - Drop the hash size parameter of write_graph_chunk_oids() and
>     write_graph_chunk_data(); its value can be read directly from
>     'the_hash_algo' inside these functions as well.
>
> This opens up the possibility for further cleanups and foolproofing in
> the following two patches.
>
> Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c | 42 ++++++++++++++++++++++++++----------------
>  1 file changed, 26 insertions(+), 16 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 908f094271..f33bfe49b3 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -891,8 +891,8 @@ struct write_commit_graph_context {
>  	const struct bloom_filter_settings *bloom_settings;
>  };
>
> -static void write_graph_chunk_fanout(struct hashfile *f,
> -				     struct write_commit_graph_context *ctx)
> +static int write_graph_chunk_fanout(struct hashfile *f,
> +				    struct write_commit_graph_context *ctx)
>  {
>  	int i, count = 0;
>  	struct commit **list = ctx->commits.list;
> @@ -913,17 +913,21 @@ static void write_graph_chunk_fanout(struct hashfile *f,
>
>  		hashwrite_be32(f, count);
>  	}
> +
> +	return 0;
>  }
>
> -static void write_graph_chunk_oids(struct hashfile *f, int hash_len,
> -				   struct write_commit_graph_context *ctx)
> +static int write_graph_chunk_oids(struct hashfile *f,
> +				  struct write_commit_graph_context *ctx)
>  {
>  	struct commit **list = ctx->commits.list;
>  	int count;
>  	for (count = 0; count < ctx->commits.nr; count++, list++) {
>  		display_progress(ctx->progress, ++ctx->progress_cnt);
> -		hashwrite(f, (*list)->object.oid.hash, (int)hash_len);
> +		hashwrite(f, (*list)->object.oid.hash, (int)the_hash_algo->rawsz);
                                                       ^^^^^
This cast is confusing...

>  	}
> +
> +	return 0;
>  }
>
>  static const unsigned char *commit_to_sha1(size_t index, void *table)
> @@ -932,8 +936,8 @@ static const unsigned char *commit_to_sha1(size_t index, void *table)
>  	return commits[index]->object.oid.hash;
>  }
>
> -static void write_graph_chunk_data(struct hashfile *f, int hash_len,
> -				   struct write_commit_graph_context *ctx)
> +static int write_graph_chunk_data(struct hashfile *f,
> +				  struct write_commit_graph_context *ctx)
>  {
>  	struct commit **list = ctx->commits.list;
>  	struct commit **last = ctx->commits.list + ctx->commits.nr;
> @@ -950,7 +954,7 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
>  			die(_("unable to parse commit %s"),
>  				oid_to_hex(&(*list)->object.oid));
>  		tree = get_commit_tree_oid(*list);
> -		hashwrite(f, tree->hash, hash_len);
> +		hashwrite(f, tree->hash, the_hash_algo->rawsz);

... and obviously not needed, as this example shows.

>
>  		parent = (*list)->parents;
>
> @@ -1030,10 +1034,12 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
>
>  		list++;
>  	}
> +
> +	return 0;
>  }
>
> -static void write_graph_chunk_extra_edges(struct hashfile *f,
> -					  struct write_commit_graph_context *ctx)
> +static int write_graph_chunk_extra_edges(struct hashfile *f,
> +					 struct write_commit_graph_context *ctx)
>  {
>  	struct commit **list = ctx->commits.list;
>  	struct commit **last = ctx->commits.list + ctx->commits.nr;
> @@ -1082,10 +1088,12 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
>
>  		list++;
>  	}
> +
> +	return 0;
>  }
>
> -static void write_graph_chunk_bloom_indexes(struct hashfile *f,
> -					    struct write_commit_graph_context *ctx)
> +static int write_graph_chunk_bloom_indexes(struct hashfile *f,
> +					   struct write_commit_graph_context *ctx)
>  {
>  	struct commit **list = ctx->commits.list;
>  	struct commit **last = ctx->commits.list + ctx->commits.nr;
> @@ -1107,6 +1115,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
>  	}
>
>  	stop_progress(&progress);
> +	return 0;
>  }
>
>  static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
> @@ -1124,8 +1133,8 @@ static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
>  	jw_release(&jw);
>  }
>
> -static void write_graph_chunk_bloom_data(struct hashfile *f,
> -					 struct write_commit_graph_context *ctx)
> +static int write_graph_chunk_bloom_data(struct hashfile *f,
> +					struct write_commit_graph_context *ctx)
>  {
>  	struct commit **list = ctx->commits.list;
>  	struct commit **last = ctx->commits.list + ctx->commits.nr;
> @@ -1151,6 +1160,7 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
>  	}
>
>  	stop_progress(&progress);
> +	return 0;
>  }
>
>  static int oid_compare(const void *_a, const void *_b)
> @@ -1662,8 +1672,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  			num_chunks * ctx->commits.nr);
>  	}
>  	write_graph_chunk_fanout(f, ctx);
> -	write_graph_chunk_oids(f, hashsz, ctx);
> -	write_graph_chunk_data(f, hashsz, ctx);
> +	write_graph_chunk_oids(f, ctx);
> +	write_graph_chunk_data(f, ctx);
>  	if (ctx->num_extra_edges)
>  		write_graph_chunk_extra_edges(f, ctx);
>  	if (ctx->changed_paths) {
>


* Re: [PATCH v2 07/11] commit-graph: check chunk sizes after writing
  2020-06-23 17:47   ` [PATCH v2 07/11] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
@ 2020-06-25  7:25     ` René Scharfe
  2020-06-25 15:02       ` Derrick Stolee
  0 siblings, 1 reply; 71+ messages in thread
From: René Scharfe @ 2020-06-25  7:25 UTC (permalink / raw)
  To: SZEDER Gábor via GitGitGadget, git; +Cc: me, szeder.dev, Derrick Stolee

Am 23.06.20 um 19:47 schrieb SZEDER Gábor via GitGitGadget:
> From: SZEDER Gábor <szeder.dev@gmail.com>
>
> In my experience while experimenting with new commit-graph chunks,
> early versions of the corresponding new write_commit_graph_my_chunk()
> functions are, sadly but not surprisingly, often buggy, and write more
> or less data than they are supposed to, especially if the chunk size
> is not directly proportional to the number of commits.  This then
> causes all kinds of issues when reading such a bogus commit-graph
> file, raising the question of whether the writing or the reading part
> happens to be buggy this time.
>
> Let's catch such issues early, already when writing the commit-graph
> file, and check that each write_graph_chunk_*() function wrote the
> amount of data that it was expected to, and what has been encoded in
> the Chunk Lookup table.  Now that all commit-graph chunks are written
> in a loop we can do this check in a single place for all chunks, and
> any chunks added in the future will get checked as well.
>
> Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 086fc2d070..1de6800d74 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1683,12 +1683,21 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  			num_chunks * ctx->commits.nr);
>  	}
>
> +	chunk_offset = f->total + f->offset;
>  	for (i = 0; i < num_chunks; i++) {
> +		uint64_t end_offset;
> +

Hmm, the added code looks complicated because it keeps state outside the
loop, but it could be replaced by this:

		uint64_t start_offset = f->total + f->offset;

>  		if (chunks[i].write_fn(f, ctx)) {
>  			error(_("failed writing chunk with id %"PRIx32""),
>  			      chunks[i].id);
>  			return -1;
>  		}
> +
> +		end_offset = f->total + f->offset;
> +		if (end_offset - chunk_offset != chunks[i].size)
> +			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
> +			    chunks[i].size, chunks[i].id, end_offset - chunk_offset);
> +		chunk_offset = end_offset;

... and that:

		if (f->total + f->offset != start_offset + chunks[i].size)
			BUG(...);
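
Put together, the per-iteration variant might look like the following
self-contained sketch. struct fake_hashfile, struct fake_chunk and the
toy writers are stand-ins invented for illustration, not Git's types,
and the size mismatch returns -1 where the patch would call BUG():

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct hashfile; only the two counters matter here. */
struct fake_hashfile {
	uint64_t total;   /* bytes already flushed */
	uint64_t offset;  /* bytes pending in the buffer */
};

/* Stand-in for struct chunk_info. */
struct fake_chunk {
	uint64_t size;                             /* size promised up front */
	int (*write_fn)(struct fake_hashfile *f);  /* writer for this chunk */
};

static int write_chunks_checked(struct fake_hashfile *f,
				const struct fake_chunk *chunks, size_t nr)
{
	size_t i;

	assert(f != NULL && chunks != NULL);
	for (i = 0; i < nr; i++) {
		/* all state lives inside the loop body */
		uint64_t start_offset = f->total + f->offset;

		if (chunks[i].write_fn(f))
			return -1;
		if (f->total + f->offset != start_offset + chunks[i].size)
			return -1; /* the patch would call BUG() here */
	}
	return 0;
}

/* Toy writers used only for demonstration. */
static int write_16_bytes(struct fake_hashfile *f)
{
	f->offset += 16;
	return 0;
}

static int write_4_bytes(struct fake_hashfile *f)
{
	f->offset += 4;
	return 0;
}
```

A chunk table that promises 8 bytes for write_4_bytes() is rejected,
while one that promises 4 bytes passes.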

>  	}
>
>  	stop_progress(&ctx->progress);
>

* Re: [PATCH v2 06/11] commit-graph: simplify chunk writes into loop
  2020-06-23 17:47   ` [PATCH v2 06/11] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
@ 2020-06-25  7:25     ` René Scharfe
  2020-06-25 14:59       ` Derrick Stolee
  0 siblings, 1 reply; 71+ messages in thread
From: René Scharfe @ 2020-06-25  7:25 UTC (permalink / raw)
  To: SZEDER Gábor via GitGitGadget, git; +Cc: me, szeder.dev, Derrick Stolee

On 23.06.20 at 19:47, SZEDER Gábor via GitGitGadget wrote:
> From: SZEDER Gábor <szeder.dev@gmail.com>
>
> In write_commit_graph_file() we now have one block of code filling the
> array of 'struct chunk_info' with the IDs and sizes of chunks to be
> written, and another block of code calling the functions responsible
> for writing individual chunks.  In case of optional chunks like Extra
> Edge List and Base Graphs List there is also a condition checking
> whether that chunk is necessary/desired, and that same condition is
> repeated in both blocks of code. Other, newer chunks have similar
> optional conditions.
>
> Eliminate these repeated conditions by storing the function pointers
> responsible for writing individual chunks in the 'struct chunk_info'
> array as well, and calling them in a loop to write the commit-graph
> file.  This will open up the possibility for a bit of foolproofing in
> the following patch.

You can do that without storing function pointers by selecting the
function to use based on the chunk ID -- like parse_commit_graph() does
on the read side.  Advantage: You don't need to press all write
functions into the same mold and can keep their individual signatures.
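
A rough sketch of that alternative, for illustration only. The chunk IDs,
writer functions, and their signatures below are hypothetical stand-ins,
not the actual commit-graph.c code; the point is only that a switch on the
chunk ID lets each writer keep its own argument list.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical chunk IDs; not the real GRAPH_CHUNKID_* values. */
enum { CHUNK_FANOUT = 1, CHUNK_OIDS = 2, CHUNK_BASE = 3 };

struct chunk_info {
	uint32_t id;
	uint64_t size;
};

/* Stand-in writers, each with its own signature. */
static void write_fanout(FILE *out)
{
	fputs("fanout\n", out);
}

static void write_oids(FILE *out, int nr_commits)
{
	fprintf(out, "oids:%d\n", nr_commits);
}

static int write_base(FILE *out, int nr_graphs)
{
	if (nr_graphs < 2)
		return -1; /* nothing to reference: report failure */
	fprintf(out, "base:%d\n", nr_graphs - 1);
	return 0;
}

/* Pick the writer from the chunk ID instead of a stored function pointer. */
static int write_chunks(FILE *out, const struct chunk_info *chunks,
			int num_chunks, int nr_commits, int nr_graphs)
{
	int i;

	assert(chunks != NULL);
	for (i = 0; i < num_chunks; i++) {
		switch (chunks[i].id) {
		case CHUNK_FANOUT:
			write_fanout(out);
			break;
		case CHUNK_OIDS:
			write_oids(out, nr_commits);
			break;
		case CHUNK_BASE:
			if (write_base(out, nr_graphs))
				return -1;
			break;
		default:
			return -1; /* unknown chunk ID */
		}
	}
	return 0;
}
```

The cost is that every new chunk needs both a table entry and a new case
label, which is the trade-off against the function-pointer version.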

>
> Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c | 31 +++++++++++++++++++------------
>  1 file changed, 19 insertions(+), 12 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index f33bfe49b3..086fc2d070 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1555,9 +1555,13 @@ static int write_graph_chunk_base(struct hashfile *f,
>  	return 0;
>  }
>
> +typedef int (*chunk_write_fn)(struct hashfile *f,
> +			      struct write_commit_graph_context *ctx);
> +
>  struct chunk_info {
>  	uint32_t id;
>  	uint64_t size;
> +	chunk_write_fn write_fn;
>  };
>
>  static int write_commit_graph_file(struct write_commit_graph_context *ctx)
> @@ -1615,27 +1619,34 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>
>  	chunks[0].id = GRAPH_CHUNKID_OIDFANOUT;
>  	chunks[0].size = GRAPH_FANOUT_SIZE;
> +	chunks[0].write_fn = write_graph_chunk_fanout;
>  	chunks[1].id = GRAPH_CHUNKID_OIDLOOKUP;
>  	chunks[1].size = hashsz * ctx->commits.nr;
> +	chunks[1].write_fn = write_graph_chunk_oids;
>  	chunks[2].id = GRAPH_CHUNKID_DATA;
>  	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
> +	chunks[2].write_fn = write_graph_chunk_data;
>  	if (ctx->num_extra_edges) {
>  		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
>  		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
> +		chunks[num_chunks].write_fn = write_graph_chunk_extra_edges;
>  		num_chunks++;
>  	}
>  	if (ctx->changed_paths) {
>  		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMINDEXES;
>  		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
> +		chunks[num_chunks].write_fn = write_graph_chunk_bloom_indexes;
>  		num_chunks++;
>  		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMDATA;
>  		chunks[num_chunks].size = sizeof(uint32_t) * 3
>  					  + ctx->total_bloom_filter_data_size;
> +		chunks[num_chunks].write_fn = write_graph_chunk_bloom_data;
>  		num_chunks++;
>  	}
>  	if (ctx->num_commit_graphs_after > 1) {
>  		chunks[num_chunks].id = GRAPH_CHUNKID_BASE;
>  		chunks[num_chunks].size = hashsz * (ctx->num_commit_graphs_after - 1);
> +		chunks[num_chunks].write_fn = write_graph_chunk_base;
>  		num_chunks++;
>  	}
>
> @@ -1671,19 +1682,15 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  			progress_title.buf,
>  			num_chunks * ctx->commits.nr);
>  	}
> -	write_graph_chunk_fanout(f, ctx);
> -	write_graph_chunk_oids(f, ctx);
> -	write_graph_chunk_data(f, ctx);
> -	if (ctx->num_extra_edges)
> -		write_graph_chunk_extra_edges(f, ctx);
> -	if (ctx->changed_paths) {
> -		write_graph_chunk_bloom_indexes(f, ctx);
> -		write_graph_chunk_bloom_data(f, ctx);
> -	}
> -	if (ctx->num_commit_graphs_after > 1 &&
> -	    write_graph_chunk_base(f, ctx)) {
> -		return -1;
> +
> +	for (i = 0; i < num_chunks; i++) {
> +		if (chunks[i].write_fn(f, ctx)) {
> +			error(_("failed writing chunk with id %"PRIx32""),
> +			      chunks[i].id);

Of all the write functions only write_graph_chunk_base() can return
non-zero and it already prints an error message in that case ("failed to
write correct number of base graph ids").  Why add this one?

> +			return -1;
> +		}
>  	}
> +
>  	stop_progress(&ctx->progress);
>  	strbuf_release(&progress_title);
>
>

* Re: [PATCH v2 10/11] commit-graph: check all leading directories in changed path Bloom filters
  2020-06-23 17:47   ` [PATCH v2 10/11] commit-graph: check all leading directories in changed path " SZEDER Gábor via GitGitGadget
@ 2020-06-25  7:25     ` René Scharfe
  2020-06-25 15:05       ` Derrick Stolee
  0 siblings, 1 reply; 71+ messages in thread
From: René Scharfe @ 2020-06-25  7:25 UTC (permalink / raw)
  To: SZEDER Gábor via GitGitGadget, git; +Cc: me, szeder.dev, Derrick Stolee

On 23.06.20 at 19:47, SZEDER Gábor via GitGitGadget wrote:
> From: SZEDER Gábor <szeder.dev@gmail.com>
>
> The file 'dir/subdir/file' can only be modified if its leading
> directories 'dir' and 'dir/subdir' are modified as well.
>
> So when checking modified path Bloom filters looking for commits
> modifying a path with multiple path components, then check not only
> the full path in the Bloom filters, but all its leading directories as
> well.  Take care to check these paths in "deepest first" order,
> because it's the full path that is least likely to be modified, and
> the Bloom filter queries can short circuit sooner.
>
> This can significantly reduce the average false positive rate, by
> about an order of magnitude or three(!), and can further speed up
> pathspec-limited revision walks.  The table below compares the average
> false positive rate and runtime of
>
>   git rev-list HEAD -- "$path"
>
> before and after this change for 5000+ randomly* selected paths from
> each repository:
>
>                     Average false           Average        Average
>                     positive rate           runtime        runtime
>                   before     after     before     after   difference
>   ------------------------------------------------------------------
>   git             3.220%   0.7853%     0.0558s   0.0387s   -30.6%
>   linux           2.453%   0.0296%     0.1046s   0.0766s   -26.8%
>   tensorflow      2.536%   0.6977%     0.0594s   0.0420s   -29.2%
>
> *Path selection was done with the following pipeline:
>
> 	git ls-tree -r --name-only HEAD | sort -R | head -n 5000
>
> The improvements in runtime are much smaller than the improvements in
> average false positive rate, as we are clearly reaching diminishing
> returns here.  However, all these timings depend on access to
> tree objects being reasonably fast (warm caches).  If we had a partial
> clone and the tree objects had to be fetched from a promisor remote,
> e.g.:
>
>   $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
>   $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
>         commit-graph write --reachable
>   $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
>   $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
>         rev-list HEAD -- "$path"
>
> then checking all leading path components can reduce the runtime from
> over an hour to a few seconds (and this is with the clone and the
> promisor on the same machine).
>
> This adjusts the tracing values in t4216-log-bloom.sh, which provides a
> concrete way to notice the improvement.
>
> Helped-by: Taylor Blau <me@ttaylorr.com>
> Helped-by: René Scharfe <l.s.r@web.de>
> Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  revision.c           | 41 ++++++++++++++++++++++++++++++++---------
>  revision.h           |  6 ++++--
>  t/t4216-log-bloom.sh |  2 +-
>  3 files changed, 37 insertions(+), 12 deletions(-)
>
> diff --git a/revision.c b/revision.c
> index b53377cd52..077888ee51 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>  {
>  	struct pathspec_item *pi;
>  	char *path_alloc = NULL;
> -	const char *path;
> +	const char *path, *p;
>  	int last_index;
> -	int len;
> +	size_t len;
> +	int path_component_nr = 1;
>
>  	if (!revs->commits)
>  		return;
> @@ -709,8 +710,28 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>  		return;
>  	}
>
> -	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
> -	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
> +	p = path;
> +	while (*p) {
> +		if (is_dir_sep(*p))
> +			path_component_nr++;
> +		p++;
> +	}
> +
> +	revs->bloom_keys_nr = path_component_nr;
> +	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
> +
> +	fill_bloom_key(path, len, &revs->bloom_keys[0],
> +		       revs->bloom_filter_settings);
> +	path_component_nr = 1;
> +
> +	p = path + len - 1;

len cannot be 0 at this point, as patch 9 made sure, so this is safe.
Good.

> +	while (p > path) {
> +		if (is_dir_sep(*p))
> +			fill_bloom_key(path, p - path,
> +				       &revs->bloom_keys[path_component_nr++],
> +				       revs->bloom_filter_settings);
> +		p--;
> +	}

This walks the directory hierarchy upwards and adds Bloom filter keys for
shorter and shorter paths ("deepest first").  Good.

And it supports all directory separators.  On Windows that would be
slash (/) and backslash (\).  I assume paths are normalized to use
only slashes when bloom filters are written, correct?  Then the lookup
side needs to normalize a given path to only use slashes as well,
otherwise paths with backslashes cannot be found.  This part seems to
be missing.

>
>  	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
>  		atexit(trace2_bloom_filter_statistics_atexit);
> @@ -724,7 +745,7 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
>  						 struct commit *commit)
>  {
>  	struct bloom_filter *filter;
> -	int result;
> +	int result = 1, j;
>
>  	if (!revs->repo->objects->commit_graph)
>  		return -1;
> @@ -744,9 +765,11 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
>  		return -1;
>  	}
>
> -	result = bloom_filter_contains(filter,
> -				       revs->bloom_key,
> -				       revs->bloom_filter_settings);
> +	for (j = 0; result && j < revs->bloom_keys_nr; j++) {
> +		result = bloom_filter_contains(filter,
> +					       &revs->bloom_keys[j],
> +					       revs->bloom_filter_settings);
> +	}
>
>  	if (result)
>  		count_bloom_filter_maybe++;
> @@ -786,7 +809,7 @@ static int rev_compare_tree(struct rev_info *revs,
>  			return REV_TREE_SAME;
>  	}
>
> -	if (revs->bloom_key && !nth_parent) {
> +	if (revs->bloom_keys_nr && !nth_parent) {
>  		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
>
>  		if (bloom_ret == 0)
> diff --git a/revision.h b/revision.h
> index 7c026fe41f..abbfb4ab59 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -295,8 +295,10 @@ struct rev_info {
>  	struct topo_walk_info *topo_walk_info;
>
>  	/* Commit graph bloom filter fields */
> -	/* The bloom filter key for the pathspec */
> -	struct bloom_key *bloom_key;
> +	/* The bloom filter key(s) for the pathspec */
> +	struct bloom_key *bloom_keys;
> +	int bloom_keys_nr;
> +
>  	/*
>  	 * The bloom filter settings used to generate the key.
>  	 * This is loaded from the commit-graph being used.
> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> index f890cc4737..84f95972ca 100755
> --- a/t/t4216-log-bloom.sh
> +++ b/t/t4216-log-bloom.sh
> @@ -146,7 +146,7 @@ test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
>
>  test_bloom_filters_used_when_some_filters_are_missing () {
>  	log_args=$1
> -	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":8,\"definitely_not\":6"
> +	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":8"
>  	setup "$log_args" &&
>  	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
>  	test_cmp log_wo_bloom log_w_bloom
>

* Re: [PATCH v2 00/11] More commit-graph/Bloom filter improvements
  2020-06-25  0:38       ` Junio C Hamano
@ 2020-06-25 13:38         ` Derrick Stolee
  2020-06-25 16:34           ` Junio C Hamano
  0 siblings, 1 reply; 71+ messages in thread
From: Derrick Stolee @ 2020-06-25 13:38 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, git, me, szeder.dev, l.s.r,
	Derrick Stolee

On 6/24/2020 8:38 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> 
>> On 6/24/2020 7:11 PM, Junio C Hamano wrote:
>>> This does not seem to play well with what is in flight.  Tests seem
>>> to pass with topics up to es/config-hooks merged but not with this
>>> topic merged on top.
>>>
>>>     1b5d3d8260 Merge branch 'ds/commit-graph-bloom-updates' into seen
>>>     32169c595c Merge branch 'es/config-hooks' into seen
>>>     ...
>>>
>>> $ sh t4216-log-bloom.sh -i -v
>>>
>>> ends like so:
>>>
>>> ok 133 - Use Bloom filters if they exist in the latest but not all commit graphs in the chain.
>>>
>>> expecting success of 4216.134 'persist filter settings':
>>>         test_when_finished rm -rf .git/objects/info/commit-graph* &&
>>>         GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths &&
>>>         grep "{\"hash_version\":1,\"num_hashes\":7,\"bits_per_entry\":10}" trace2.txt &&
>>>         cp .git/objects/info/commit-graph commit-graph-before &&
>>>         corrupt_graph $BASE_K_BYTE_OFFSET "\09" &&
>>>         corrupt_graph $BASE_LEN_BYTE_OFFSET "\0F" &&
>>>         cp .git/objects/info/commit-graph commit-graph-after &&
>>>         test_commit c18 A/corrupt &&
>>>         GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths &&
>>>         grep "{\"hash_version\":1,\"num_hashes\":57,\"bits_per_entry\":70}" trace2.txt
>>>
>>> not ok 134 - persist filter settings
>>> # ...
>>>
>>> Thanks.
>>
>> Thanks for letting me know. I'll investigate carefully with the
>> rest of the 'seen' branch. This test is a bit fragile due to
>> computed values for which bytes to replace, so anything that
>> could have changed the length or order of chunks would lead to
>> a failure here.
>>
>> Sorry for the disruption.
> 
> Oh, not at all.  Thanks for helping

I'll squash the patch into my v3, but here it is now to make 'seen'
pass tests again.

The _real_ reason for the failure was that some changes in trace2
pushed the events out of the nesting limits. I also think it is a
good idea to make the test less brittle. Adding GIT_TEST_* variables
will also help anyone who wants to adjust the Bloom filter settings
for testing.
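
As a rough illustration of the override pattern (a sketch, not Git's
implementation): the helpers below are hypothetical stand-ins built on
getenv() and strtoul(), and the struct is a simplified placeholder for the
Bloom filter settings. Git's own git_env_ulong() may treat malformed values
differently; this sketch simply falls back to the default.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Simplified placeholder for the Bloom filter settings. */
struct bloom_settings_sketch {
	unsigned long num_hashes;
	unsigned long bits_per_entry;
};

/* Parse a decimal value; return fallback if it is missing or malformed. */
unsigned long parse_ulong_or(const char *value, unsigned long fallback)
{
	char *end;
	unsigned long result;

	if (!value || *value < '0' || *value > '9')
		return fallback; /* unset, empty, signed, or non-numeric */
	errno = 0;
	result = strtoul(value, &end, 10);
	if (errno || *end)
		return fallback; /* overflow or trailing garbage */
	return result;
}

/* Hypothetical stand-in for an "environment or default" lookup. */
unsigned long env_ulong_or(const char *name, unsigned long fallback)
{
	return parse_ulong_or(getenv(name), fallback);
}

/* Only override the defaults when the test variables are set. */
void apply_test_overrides(struct bloom_settings_sketch *s)
{
	s->bits_per_entry = env_ulong_or(
		"GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY", s->bits_per_entry);
	s->num_hashes = env_ulong_or(
		"GIT_TEST_BLOOM_SETTINGS_NUM_HASHES", s->num_hashes);
}
```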

Question: Should these be GIT_BLOOM_SETTINGS_* instead of GIT_TEST_...?
I ask because this _could_ be a way to allow user customization,
without making it as public as a config option. Or, should I just do
the work and add config settings in this series?

Thanks,
-Stolee

-- >8 --
From 9245d31f0431eceec60f0b7a90900d2825787530 Mon Sep 17 00:00:00 2001
From: Derrick Stolee <dstolee@microsoft.com>
Date: Thu, 25 Jun 2020 09:19:13 -0400
Subject: [PATCH] fixup! commit-graph: persist existence of changed-paths

The previous version of this test was too sensitive to subtle changes
in the commit-graph file size. This version instead uses two environment
variables to customize the Bloom filter settings on the first write,
then rewrites the commit-graph without those variables. This
demonstrates that we persist the settings correctly.

The issue with the 'seen' branch is due to es/trace-log-progress
adding trace2 regions in the progress indicators. This pushed the
trace2 data that the test was expecting outside the nesting limit.
Set GIT_TRACE2_EVENT_NESTING to ensure we still record those items.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c       |  9 +++++++--
 t/t4216-log-bloom.sh | 34 +++++++++++-----------------------
 2 files changed, 18 insertions(+), 25 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 1de6800d74..026ec63d38 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1576,10 +1576,15 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	int num_chunks = 3;
 	uint64_t chunk_offset;
 	struct object_id file_hash;
-	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
+	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 
-	if (!ctx->bloom_settings)
+	if (!ctx->bloom_settings) {
+		bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
+							      bloom_settings.bits_per_entry);
+		bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
+							  bloom_settings.num_hashes);
 		ctx->bloom_settings = &bloom_settings;
+	}
 
 	if (ctx->split) {
 		struct strbuf tmp_file = STRBUF_INIT;
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 84f95972ca..d7dd717347 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -156,31 +156,19 @@ test_expect_success 'Use Bloom filters if they exist in the latest but not all c
 	test_bloom_filters_used_when_some_filters_are_missing "-- A/B"
 '
 
-BASE_BDAT_OFFSET=2240
-BASE_K_BYTE_OFFSET=$((BASE_BDAT_OFFSET + 10))
-BASE_LEN_BYTE_OFFSET=$((BASE_BDAT_OFFSET + 14))
-
-corrupt_graph() {
-	pos=$1
-	data="${2:-\0}"
-	grepstr=$3
-	orig_size=$(wc -c < .git/objects/info/commit-graph) &&
-	zero_pos=${4:-${orig_size}} &&
-	printf "$data" | dd of=".git/objects/info/commit-graph" bs=1 seek="$pos" conv=notrunc &&
-	dd of=".git/objects/info/commit-graph" bs=1 seek="$zero_pos" if=/dev/null
-}
-
 test_expect_success 'persist filter settings' '
 	test_when_finished rm -rf .git/objects/info/commit-graph* &&
-	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths &&
-	grep "{\"hash_version\":1,\"num_hashes\":7,\"bits_per_entry\":10}" trace2.txt &&
-	cp .git/objects/info/commit-graph commit-graph-before &&
-	corrupt_graph $BASE_K_BYTE_OFFSET "\09" &&
-	corrupt_graph $BASE_LEN_BYTE_OFFSET "\0F" &&
-	cp .git/objects/info/commit-graph commit-graph-after &&
-	test_commit c18 A/corrupt &&
-	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths &&
-	grep "{\"hash_version\":1,\"num_hashes\":57,\"bits_per_entry\":70}" trace2.txt
+	rm -rf .git/objects/info/commit-graph* &&
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
+		GIT_TRACE2_EVENT_NESTING=5 \
+		GIT_TEST_BLOOM_SETTINGS_NUM_HASHES=9 \
+		GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY=15 \
+		git commit-graph write --reachable --changed-paths &&
+	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2.txt &&
+	GIT_TRACE2_EVENT="$(pwd)/trace2-auto.txt" \
+		GIT_TRACE2_EVENT_NESTING=5 \
+		git commit-graph write --reachable --changed-paths &&
+	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2-auto.txt
 '
 
 test_done
\ No newline at end of file
-- 
2.27.0




* Re: [PATCH v2 06/11] commit-graph: simplify chunk writes into loop
  2020-06-25  7:25     ` René Scharfe
@ 2020-06-25 14:59       ` Derrick Stolee
  0 siblings, 0 replies; 71+ messages in thread
From: Derrick Stolee @ 2020-06-25 14:59 UTC (permalink / raw)
  To: René Scharfe, SZEDER Gábor via GitGitGadget, git
  Cc: me, szeder.dev, Derrick Stolee

On 6/25/2020 3:25 AM, René Scharfe wrote:
> On 23.06.20 at 19:47, SZEDER Gábor via GitGitGadget wrote:
>> From: SZEDER Gábor <szeder.dev@gmail.com>
>>
>> In write_commit_graph_file() we now have one block of code filling the
>> array of 'struct chunk_info' with the IDs and sizes of chunks to be
>> written, and another block of code calling the functions responsible
>> for writing individual chunks.  In case of optional chunks like Extra
>> Edge List and Base Graphs List there is also a condition checking
>> whether that chunk is necessary/desired, and that same condition is
>> repeated in both blocks of code. Other, newer chunks have similar
>> optional conditions.
>>
>> Eliminate these repeated conditions by storing the function pointers
>> responsible for writing individual chunks in the 'struct chunk_info'
>> array as well, and calling them in a loop to write the commit-graph
>> file.  This will open up the possibility for a bit of foolproofing in
>> the following patch.
> 
> You can do that without storing function pointers by selecting the
> function to use based on the chunk ID -- like parse_commit_graph() does
> on the read side.  Advantage: You don't need to press all write
> functions into the same mold and can keep their individual signatures.

I do think that the loop without a switch statement is valuable.
It keeps the updates needed for a new chunk localized to the
section that calculates the offset values.

>>
>> Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>  commit-graph.c | 31 +++++++++++++++++++------------
>>  1 file changed, 19 insertions(+), 12 deletions(-)
>>
>> diff --git a/commit-graph.c b/commit-graph.c
>> index f33bfe49b3..086fc2d070 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -1555,9 +1555,13 @@ static int write_graph_chunk_base(struct hashfile *f,
>>  	return 0;
>>  }
>>
>> +typedef int (*chunk_write_fn)(struct hashfile *f,
>> +			      struct write_commit_graph_context *ctx);
>> +
>>  struct chunk_info {
>>  	uint32_t id;
>>  	uint64_t size;
>> +	chunk_write_fn write_fn;
>>  };
>>
>>  static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>> @@ -1615,27 +1619,34 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>>
>>  	chunks[0].id = GRAPH_CHUNKID_OIDFANOUT;
>>  	chunks[0].size = GRAPH_FANOUT_SIZE;
>> +	chunks[0].write_fn = write_graph_chunk_fanout;
>>  	chunks[1].id = GRAPH_CHUNKID_OIDLOOKUP;
>>  	chunks[1].size = hashsz * ctx->commits.nr;
>> +	chunks[1].write_fn = write_graph_chunk_oids;
>>  	chunks[2].id = GRAPH_CHUNKID_DATA;
>>  	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
>> +	chunks[2].write_fn = write_graph_chunk_data;
>>  	if (ctx->num_extra_edges) {
>>  		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
>>  		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
>> +		chunks[num_chunks].write_fn = write_graph_chunk_extra_edges;
>>  		num_chunks++;
>>  	}
>>  	if (ctx->changed_paths) {
>>  		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMINDEXES;
>>  		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
>> +		chunks[num_chunks].write_fn = write_graph_chunk_bloom_indexes;
>>  		num_chunks++;
>>  		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMDATA;
>>  		chunks[num_chunks].size = sizeof(uint32_t) * 3
>>  					  + ctx->total_bloom_filter_data_size;
>> +		chunks[num_chunks].write_fn = write_graph_chunk_bloom_data;
>>  		num_chunks++;
>>  	}
>>  	if (ctx->num_commit_graphs_after > 1) {
>>  		chunks[num_chunks].id = GRAPH_CHUNKID_BASE;
>>  		chunks[num_chunks].size = hashsz * (ctx->num_commit_graphs_after - 1);
>> +		chunks[num_chunks].write_fn = write_graph_chunk_base;
>>  		num_chunks++;
>>  	}
>>
>> @@ -1671,19 +1682,15 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>>  			progress_title.buf,
>>  			num_chunks * ctx->commits.nr);
>>  	}
>> -	write_graph_chunk_fanout(f, ctx);
>> -	write_graph_chunk_oids(f, ctx);
>> -	write_graph_chunk_data(f, ctx);
>> -	if (ctx->num_extra_edges)
>> -		write_graph_chunk_extra_edges(f, ctx);
>> -	if (ctx->changed_paths) {
>> -		write_graph_chunk_bloom_indexes(f, ctx);
>> -		write_graph_chunk_bloom_data(f, ctx);
>> -	}
>> -	if (ctx->num_commit_graphs_after > 1 &&
>> -	    write_graph_chunk_base(f, ctx)) {
>> -		return -1;
>> +
>> +	for (i = 0; i < num_chunks; i++) {
>> +		if (chunks[i].write_fn(f, ctx)) {
>> +			error(_("failed writing chunk with id %"PRIx32""),
>> +			      chunks[i].id);
> 
> Of all the write functions only write_graph_chunk_base() can return
> non-zero and it already prints an error message in that case ("failed to
> write correct number of base graph ids").  Why add this one?

Ok, we can require the chunk methods to write an error() message with
appropriate context and simply return -1 here.

Thanks,
-Stolee

* Re: [PATCH v2 07/11] commit-graph: check chunk sizes after writing
  2020-06-25  7:25     ` René Scharfe
@ 2020-06-25 15:02       ` Derrick Stolee
  0 siblings, 0 replies; 71+ messages in thread
From: Derrick Stolee @ 2020-06-25 15:02 UTC (permalink / raw)
  To: René Scharfe, SZEDER Gábor via GitGitGadget, git
  Cc: me, szeder.dev, Derrick Stolee

On 6/25/2020 3:25 AM, René Scharfe wrote:
> On 23.06.20 at 19:47, SZEDER Gábor via GitGitGadget wrote:
>> From: SZEDER Gábor <szeder.dev@gmail.com>
>>
>> In my experience while experimenting with new commit-graph chunks,
>> early versions of the corresponding new write_commit_graph_my_chunk()
>> functions are, sadly but not surprisingly, often buggy, and write more
>> or less data than they are supposed to, especially if the chunk size
>> is not directly proportional to the number of commits.  This then
>> causes all kinds of issues when reading such a bogus commit-graph
>> file, raising the question of whether the writing or the reading part
>> happens to be buggy this time.
>>
>> Let's catch such issues early, already when writing the commit-graph
>> file, and check that each write_graph_chunk_*() function wrote the
>> amount of data that it was expected to, and what has been encoded in
>> the Chunk Lookup table.  Now that all commit-graph chunks are written
>> in a loop we can do this check in a single place for all chunks, and
>> any chunks added in the future will get checked as well.
>>
>> Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>  commit-graph.c | 9 +++++++++
>>  1 file changed, 9 insertions(+)
>>
>> diff --git a/commit-graph.c b/commit-graph.c
>> index 086fc2d070..1de6800d74 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -1683,12 +1683,21 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>>  			num_chunks * ctx->commits.nr);
>>  	}
>>
>> +	chunk_offset = f->total + f->offset;
>>  	for (i = 0; i < num_chunks; i++) {
>> +		uint64_t end_offset;
>> +
> 
> Hmm, the added code looks complicated because it keeps state outside the
> loop, but it could be replaced by this:
> 
> 		uint64_t start_offset = f->total + f->offset;
> 
>>  		if (chunks[i].write_fn(f, ctx)) {
>>  			error(_("failed writing chunk with id %"PRIx32""),
>>  			      chunks[i].id);
>>  			return -1;
>>  		}
>> +
>> +		end_offset = f->total + f->offset;
>> +		if (end_offset - chunk_offset != chunks[i].size)
>> +			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
>> +			    chunks[i].size, chunks[i].id, end_offset - chunk_offset);
>> +		chunk_offset = end_offset;
> 
> ... and that:
> 
> 		if (f->total + f->offset != start_offset + chunks[i].size)
> 			BUG(...);

Thanks! I agree this approach is simpler and less prone to
bugs since we are using the local state.
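
For reference, a self-contained sketch of the loop with the per-iteration
check. The struct fake_hashfile type, the two toy writers, and the return
code used in place of BUG() are assumptions made so the example can stand
alone; only the start_offset comparison mirrors the suggestion above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for struct hashfile: just the two byte counters. */
struct fake_hashfile {
	uint64_t total;  /* bytes already flushed */
	uint64_t offset; /* bytes pending in the buffer */
};

typedef int (*chunk_write_fn)(struct fake_hashfile *f);

struct chunk_info {
	uint32_t id;
	uint64_t size;
	chunk_write_fn write_fn;
};

/* Toy writers: one writes the promised 8 bytes, one writes too few. */
static int write_eight(struct fake_hashfile *f) { f->offset += 8; return 0; }
static int write_short(struct fake_hashfile *f) { f->offset += 3; return 0; }

/*
 * Returns 0 on success, -1 if a writer failed, and -2 on a size
 * mismatch (where the real code would call BUG()).
 */
static int write_all(struct fake_hashfile *f,
		     const struct chunk_info *chunks, int num_chunks)
{
	int i;

	for (i = 0; i < num_chunks; i++) {
		/* All state needed for the check lives inside the loop. */
		uint64_t start_offset = f->total + f->offset;

		assert(chunks[i].write_fn != NULL);
		if (chunks[i].write_fn(f))
			return -1;

		if (f->total + f->offset != start_offset + chunks[i].size) {
			fprintf(stderr, "chunk %x: unexpected size\n",
				(unsigned)chunks[i].id);
			return -2;
		}
	}
	return 0;
}
```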

-Stolee

* Re: [PATCH v2 10/11] commit-graph: check all leading directories in changed path Bloom filters
  2020-06-25  7:25     ` René Scharfe
@ 2020-06-25 15:05       ` Derrick Stolee
  2020-06-26  6:34         ` SZEDER Gábor
  0 siblings, 1 reply; 71+ messages in thread
From: Derrick Stolee @ 2020-06-25 15:05 UTC (permalink / raw)
  To: René Scharfe, SZEDER Gábor via GitGitGadget, git
  Cc: me, szeder.dev, Derrick Stolee

On 6/25/2020 3:25 AM, René Scharfe wrote:
> On 23.06.20 at 19:47, SZEDER Gábor via GitGitGadget wrote:
>> From: SZEDER Gábor <szeder.dev@gmail.com>
>>
>> The file 'dir/subdir/file' can only be modified if its leading
>> directories 'dir' and 'dir/subdir' are modified as well.
>>
>> So when checking modified path Bloom filters looking for commits
>> modifying a path with multiple path components, then check not only
>> the full path in the Bloom filters, but all its leading directories as
>> well.  Take care to check these paths in "deepest first" order,
>> because it's the full path that is least likely to be modified, and
>> the Bloom filter queries can short circuit sooner.
>>
>> This can significantly reduce the average false positive rate, by
>> about an order of magnitude or three(!), and can further speed up
>> pathspec-limited revision walks.  The table below compares the average
>> false positive rate and runtime of
>>
>>   git rev-list HEAD -- "$path"
>>
>> before and after this change for 5000+ randomly* selected paths from
>> each repository:
>>
>>                     Average false           Average        Average
>>                     positive rate           runtime        runtime
>>                   before     after     before     after   difference
>>   ------------------------------------------------------------------
>>   git             3.220%   0.7853%     0.0558s   0.0387s   -30.6%
>>   linux           2.453%   0.0296%     0.1046s   0.0766s   -26.8%
>>   tensorflow      2.536%   0.6977%     0.0594s   0.0420s   -29.2%
>>
>> *Path selection was done with the following pipeline:
>>
>> 	git ls-tree -r --name-only HEAD | sort -R | head -n 5000
>>
>> The improvements in runtime are much smaller than the improvements in
>> average false positive rate, as we are clearly reaching diminishing
>> returns here.  However, all these timings depend on access to
>> tree objects being reasonably fast (warm caches).  If we had a partial
>> clone and the tree objects had to be fetched from a promisor remote,
>> e.g.:
>>
>>   $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
>>   $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
>>         commit-graph write --reachable
>>   $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
>>   $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
>>         rev-list HEAD -- "$path"
>>
>> then checking all leading path components can reduce the runtime from
>> over an hour to a few seconds (and this is with the clone and the
>> promisor on the same machine).
>>
>> This adjusts the tracing values in t4216-log-bloom.sh, which provides a
>> concrete way to notice the improvement.
>>
>> Helped-by: Taylor Blau <me@ttaylorr.com>
>> Helped-by: René Scharfe <l.s.r@web.de>
>> Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>  revision.c           | 41 ++++++++++++++++++++++++++++++++---------
>>  revision.h           |  6 ++++--
>>  t/t4216-log-bloom.sh |  2 +-
>>  3 files changed, 37 insertions(+), 12 deletions(-)
>>
>> diff --git a/revision.c b/revision.c
>> index b53377cd52..077888ee51 100644
>> --- a/revision.c
>> +++ b/revision.c
>> @@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>>  {
>>  	struct pathspec_item *pi;
>>  	char *path_alloc = NULL;
>> -	const char *path;
>> +	const char *path, *p;
>>  	int last_index;
>> -	int len;
>> +	size_t len;
>> +	int path_component_nr = 1;
>>
>>  	if (!revs->commits)
>>  		return;
>> @@ -709,8 +710,28 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>>  		return;
>>  	}
>>
>> -	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
>> -	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
>> +	p = path;
>> +	while (*p) {
>> +		if (is_dir_sep(*p))
>> +			path_component_nr++;
>> +		p++;
>> +	}
>> +
>> +	revs->bloom_keys_nr = path_component_nr;
>> +	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
>> +
>> +	fill_bloom_key(path, len, &revs->bloom_keys[0],
>> +		       revs->bloom_filter_settings);
>> +	path_component_nr = 1;
>> +
>> +	p = path + len - 1;
> 
> len cannot be 0 at this point, as patch 9 made sure, so this is safe.
> Good.
> 
>> +	while (p > path) {
>> +		if (is_dir_sep(*p))
>> +			fill_bloom_key(path, p - path,
>> +				       &revs->bloom_keys[path_component_nr++],
>> +				       revs->bloom_filter_settings);
>> +		p--;
>> +	}
> 
> This walks the directory hierarchy upwards and fills Bloom keys for
> shorter and shorter paths ("deepest first").  Good.
> 
> And it supports all directory separators.  On Windows that would be
> slash (/) and backslash (\).  I assume paths are normalized to use
> only slashes when bloom filters are written, correct?  Then the lookup
> side needs to normalize a given path to only use slashes as well,
> otherwise paths with backslashes cannot be found.  This part seems to
> be missing.

Yes, that's a good point. We _require_ the paths to be normalized
here to be Unix-style paths or else the Bloom filter keys are
incorrect. Thankfully, they are. Let's make that clear in-code by
using '/' instead of is_dir_sep().

Thanks,
-Stolee
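
The backwards walk discussed above can be tried in isolation. What
follows is only a minimal sketch: leading_prefix_lengths() is a
hypothetical helper, not a function in Git, and it records lengths
instead of calling fill_bloom_key(). It assumes the path is non-empty
and already uses '/' as its only separator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Git): store the length of the full
 * path and of each leading directory in out[], deepest first. Only '/'
 * counts as a separator, so the path must already be normalized.
 * Returns the number of lengths stored.
 */
size_t leading_prefix_lengths(const char *path, size_t *out, size_t out_cap)
{
	size_t len = strlen(path);
	size_t nr = 0;
	const char *p;

	assert(len > 0);	/* empty pathspecs are assumed to be filtered out earlier */

	if (nr < out_cap)
		out[nr++] = len;	/* the full path is always the first key */

	/* Walk backwards; every '/' found ends one leading directory. */
	for (p = path + len - 1; p > path; p--)
		if (*p == '/' && nr < out_cap)
			out[nr++] = (size_t)(p - path);

	return nr;
}
```

For "a/b/c" this yields the lengths 5, 3 and 1, that is "a/b/c", "a/b"
and "a". A backslash is never treated as a separator here, which is the
behavior the switch from is_dir_sep() to a plain '/' comparison makes
explicit.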


* Re: [PATCH v2 00/11] More commit-graph/Bloom filter improvements
  2020-06-25 13:38         ` Derrick Stolee
@ 2020-06-25 16:34           ` Junio C Hamano
  0 siblings, 0 replies; 71+ messages in thread
From: Junio C Hamano @ 2020-06-25 16:34 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, szeder.dev, l.s.r,
	Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> Question: Should these be GIT_BLOOM_SETTINGS_* instead of GIT_TEST_...?
> I ask because this _could_ be a way to allow user customization,
> without making it as public as a config option. Or, should I just do
> the work and add config settings in this series?

Other than when testing and/or debugging, what are the expected
reasons and situations an individual would want to use customized
settings?  Once a decision is made to use one customized setting for
a repository, does it make sense to use any other setting for the
same repository, or would it be very handy to be able to use different
settings on a whim?

My gut feeling is that it should be added as per-repo configuration
but only after a use case is found, and GIT_TEST_* would be the way
to go.

Thanks.
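
For readers unfamiliar with this kind of test-only override, here is a
minimal sketch of "read a number from the environment, otherwise keep
the default". parse_ulong_or() and env_ulong_or() are hypothetical
names; this is not Git's git_env_ulong(), which may treat malformed
values differently.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse text as a base-10 unsigned long. Returns
 * fallback when text is NULL, empty, not purely numeric, or out of range.
 */
unsigned long parse_ulong_or(const char *text, unsigned long fallback)
{
	char *end;
	unsigned long value;

	/* Require a leading digit so that "", "-1" and " 5" are rejected. */
	if (!text || text[0] < '0' || text[0] > '9')
		return fallback;

	errno = 0;
	value = strtoul(text, &end, 10);
	if (errno || *end != '\0')
		return fallback;
	return value;
}

/* Hypothetical helper: look the variable up and fall back if it is unset. */
unsigned long env_ulong_or(const char *name, unsigned long fallback)
{
	return parse_ulong_or(getenv(name), fallback);
}
```

With something like this, an unset variable leaves the compiled-in
default untouched, so only a test that exports the variable sees
different settings.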


* Re: [PATCH v2 10/11] commit-graph: check all leading directories in changed path Bloom filters
  2020-06-25 15:05       ` Derrick Stolee
@ 2020-06-26  6:34         ` SZEDER Gábor
  2020-06-26 14:42           ` Derrick Stolee
  0 siblings, 1 reply; 71+ messages in thread
From: SZEDER Gábor @ 2020-06-26  6:34 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: René Scharfe, SZEDER Gábor via GitGitGadget, git, me,
	Derrick Stolee

On Thu, Jun 25, 2020 at 11:05:04AM -0400, Derrick Stolee wrote:

> >> +	while (p > path) {
> >> +		if (is_dir_sep(*p))
> >> +			fill_bloom_key(path, p - path,
> >> +				       &revs->bloom_keys[path_component_nr++],
> >> +				       revs->bloom_filter_settings);
> >> +		p--;
> >> +	}
> > 
> > This walks the directory hierarchy upwards and fills Bloom keys for
> > shorter and shorter paths ("deepest first").  Good.
> > 
> > And it supports all directory separators.  On Windows that would be
> > slash (/) and backslash (\).  I assume paths are normalized to use
> > only slashes when bloom filters are written, correct?  Then the lookup
> > side needs to normalize a given path to only use slashes as well,
> > otherwise paths with backslashes cannot be found.  This part seems to
> > be missing.
> 
> Yes, that's a good point. We _require_ the paths to be normalized
> here to be Unix-style paths or else the Bloom filter keys are
> incorrect. Thankfully, they are.

Unfortunately, they aren't always...

Path normalization is done in normalize_path_copy_len(), whose
description says, among other things:

   * Performs the following normalizations on src, storing the result in dst:
   * - Ensures that components are separated by '/' (Windows only)

and the code indeed does:

        if (is_dir_sep(c)) {
                *dst++ = '/';

Now, while parsing pathspecs this function is called via:

  parse_pathspec()
    init_pathspec_item()
      prefix_path_gently()
        normalize_path_copy_len()

Unfortunately, init_pathspec_item() has this chain of conditions:

        /* Create match string which will be used for pathspec matching */
        if (pathspec_prefix >= 0) {
                match = xstrdup(copyfrom);
                prefixlen = pathspec_prefix;
        } else if (magic & PATHSPEC_FROMTOP) {
                match = xstrdup(copyfrom);
                prefixlen = 0;
        } else {
                match = prefix_path_gently(prefix, prefixlen,
                                           &prefixlen, copyfrom);
                if (!match) {
                        const char *hint_path = get_git_work_tree();
                        if (!hint_path)
                                hint_path = get_git_dir();
                        die(_("%s: '%s' is outside repository at '%s'"), elt,
                            copyfrom, absolute_path(hint_path));
                }
        }

which means that it doesn't always call prefix_path_gently(), which,
in turn, means that 'pathspec_item->match' might remain un-normalized
in case of some unusual pathspecs.

The first condition is supposed to handle the case when one Git
process passes pathspecs to another, and is supposed to be "internal
use only"; see 233c3e6c59 (parse_pathspec: preserve prefix length via
PATHSPEC_PREFIX_ORIGIN, 2013-07-14).  I haven't even tried to grok what
that might entail.

The second condition handles pathspecs explicitly relative to the root
of the work tree, i.e. ':/path'.  Adding a printf() to show the
original path and the resulting 'pathspec_item->match' does confirm
that no normalization is performed:

  expecting success of 9999.1 'test': 
          mkdir -p dir &&
          >dir/file &&
          git add ":/dir/file" &&
          git add ":(top)dir/file" &&
          test_might_fail git add ":/dir//file" &&
          git add ":(top)dir//file"
  
  orig:  ':/dir/file'
  match: 'dir/file'
  orig:  ':(top)dir/file'
  match: 'dir/file'
  orig:  ':/dir//file'
  match: 'dir//file'
  fatal: oops in prep_exclude
  orig:  ':(top)dir//file'
  match: 'dir//file'
  fatal: oops in prep_exclude
  not ok 1 - test

This is, of course, bad for Bloom filters, because the repeated
slashes are hashed as well and commits will be omitted from the output
of pathspec-limited revision walks, but apparently it also affects
other parts of Git.

And the else branch handles the rest, which, I believe, is by far the
most common case.
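
To make the 'dir//file' problem concrete: the two spellings differ
byte for byte, so they hash to different Bloom keys unless the
repeated slash is collapsed first. The sketch below is only an
illustration of that one normalization step. collapse_slashes() is a
hypothetical helper, far simpler than normalize_path_copy_len(), and
it is not a proposed fix.

```c
#include <stddef.h>

/*
 * Hypothetical helper: copy src to dst while collapsing runs of '/'
 * into a single '/'. dst must have room for strlen(src) + 1 bytes.
 * Returns the length of the result.
 */
size_t collapse_slashes(char *dst, const char *src)
{
	size_t n = 0;

	for (; *src; src++) {
		if (*src == '/' && n > 0 && dst[n - 1] == '/')
			continue;	/* skip the repeated separator */
		dst[n++] = *src;
	}
	dst[n] = '\0';
	return n;
}
```

After this step 'dir//file' and 'dir/file' are the same eight bytes.
Whether the lookup side should normalize like this, or should rather
not use Bloom filters for such pathspecs, is a separate question.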

> Let's make that clear in-code by
> using '/' instead of is_dir_sep().
> 
> Thanks,
> -Stolee


* [PATCH v3 00/10] More commit-graph/Bloom filter improvements
  2020-06-23 17:46 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
                     ` (11 preceding siblings ...)
  2020-06-24 23:11   ` [PATCH v2 00/11] More commit-graph/Bloom filter improvements Junio C Hamano
@ 2020-06-26 12:30   ` Derrick Stolee via GitGitGadget
  2020-06-26 12:30     ` [PATCH v3 01/10] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
                       ` (10 more replies)
  12 siblings, 11 replies; 71+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-06-26 12:30 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee

This builds on sg/commit-graph-cleanups, which took several patches from
Szeder's series [1] and applied them almost directly to a more-recent
version of Git [2].

[1] https://lore.kernel.org/git/20200529085038.26008-1-szeder.dev@gmail.com/
[2] 
https://lore.kernel.org/git/pull.650.git.1591362032.gitgitgadget@gmail.com/

This series adds a few extra improvements, several of which are rooted in
Szeder's original series. I maintained his authorship and sign-off, even
though the patches did not apply or cherry-pick at all.

(In v2, I have removed the range-diff comparison to Szeder's series, so look
at the v1 cover letter for that.)

The patches have been significantly reordered. René pointed out (and Szeder
discovered in the old thread) that we are not re-using the
bloom_filter_settings from the existing commit-graph when writing a new one.

 1. commit-graph: place bloom_settings in context
 2. commit-graph: change test to die on parse, not load

These are mostly the same, except we now use a pointer to the settings in
the commit-graph write context.

 3. bloom: fix logic in get_bloom_filter()

This new patch is a subtle change in behavior that will become relevant in
the very next patch. In fact, if we swap patches 3 and 4, then
t4216-log-bloom.sh fails with a segfault due to a NULL filter.

 4. commit-graph: persist existence of changed-paths

This patch is now updated to use the existing changed-path filter settings.

 5. commit-graph: unify the signatures of all write_graph_chunk_*()
    functions
 6. commit-graph: simplify chunk writes into loop
 7. commit-graph: check chunk sizes after writing

These are all the same as before.

 8. revision.c: fix whitespace

This patch is the cleanup part of Taylor's patch.

 9. revision: empty pathspecs should not use Bloom filters

Here is Taylor's fix for empty pathspecs.

 10. commit-graph: check all leading directories in changed path Bloom
     filters

Finally, we get the performance patch; the v2 patch "bloom: enforce a
minimum size of 8 bytes" is dropped from this version, as the range-diff
below shows. Patch 10 is updated to have the
better logic around directory separators and empty paths. Also, the list of
Bloom keys is ordered with the deepest path first. That has some tiny
performance benefits for deep paths since we can short-circuit the multi-key
checks more often. That code path is much faster than the tree parsing, so
it is hard to measure any change.
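
The multi-key check with its short-circuit can be sketched with a toy
filter. Everything named toy_* below is hypothetical: the hash is
FNV-1a as a stand-in, and the 256-bit size and layout are arbitrary, so
this is not Git's on-disk format. The sketch also assumes that the
writer adds each changed path together with all of its leading
directories.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BITS 256
#define TOY_HASHES 7

struct toy_filter {
	unsigned char data[TOY_BITS / 8];
};

/* FNV-1a with a seed; a stand-in only, Git uses a different hash. */
static uint32_t toy_hash(const char *s, size_t len, uint32_t seed)
{
	uint32_t h = 2166136261u ^ seed;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= (unsigned char)s[i];
		h *= 16777619u;
	}
	return h;
}

/* Double hashing: the i-th bit position is h0 + i * h1, modulo the size. */
static uint32_t toy_bit(const char *s, size_t len, int i)
{
	uint32_t h0 = toy_hash(s, len, 0x1234u);
	uint32_t h1 = toy_hash(s, len, 0xabcdu);

	return (h0 + (uint32_t)i * h1) % TOY_BITS;
}

void toy_add(struct toy_filter *f, const char *s, size_t len)
{
	for (int i = 0; i < TOY_HASHES; i++) {
		uint32_t bit = toy_bit(s, len, i);
		f->data[bit / 8] |= (unsigned char)(1u << (bit % 8));
	}
}

int toy_contains(const struct toy_filter *f, const char *s, size_t len)
{
	for (int i = 0; i < TOY_HASHES; i++) {
		uint32_t bit = toy_bit(s, len, i);
		if (!(f->data[bit / 8] & (1u << (bit % 8))))
			return 0;	/* definitely not present */
	}
	return 1;			/* maybe present */
}

/*
 * A path can only have changed if the path itself and every leading
 * directory are "maybe present". Keys are checked deepest first, and
 * the first miss ends the check.
 */
int toy_maybe_changed(const struct toy_filter *f, const char *path)
{
	size_t len = strlen(path);
	const char *p;

	if (!len)
		return 1;	/* nothing to check, so we cannot rule it out */
	if (!toy_contains(f, path, len))
		return 0;
	for (p = path + len - 1; p > path; p--)
		if (*p == '/' && !toy_contains(f, path, (size_t)(p - path)))
			return 0;
	return 1;
}
```

A false positive on the full path alone is no longer enough to force a
tree diff: the leading directories have to collide as well, which is
where the lower false-positive rate comes from.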

Updates in V3:

 * Responded to René's feedback.
 * Fixed the test in Patch 4 to use GIT_TEST_ variables and extend the
   GIT_TRACE2 depth to work with 'seen' branch.

Thanks, -Stolee

Derrick Stolee (5):
  commit-graph: place bloom_settings in context
  commit-graph: change test to die on parse, not load
  bloom: fix logic in get_bloom_filter()
  commit-graph: persist existence of changed-paths
  revision.c: fix whitespace

SZEDER Gábor (4):
  commit-graph: unify the signatures of all write_graph_chunk_*()
    functions
  commit-graph: simplify chunk writes into loop
  commit-graph: check chunk sizes after writing
  commit-graph: check all leading directories in changed path Bloom
    filters

Taylor Blau (1):
  revision: empty pathspecs should not use Bloom filters

 Documentation/git-commit-graph.txt |   5 +-
 bloom.c                            |  14 ++-
 builtin/commit-graph.c             |   5 +-
 commit-graph.c                     | 138 +++++++++++++++++++++--------
 commit-graph.h                     |   3 +-
 revision.c                         |  58 +++++++++---
 revision.h                         |   6 +-
 t/t4216-log-bloom.sh               |  23 ++++-
 t/t5318-commit-graph.sh            |   2 +-
 9 files changed, 189 insertions(+), 65 deletions(-)


base-commit: 7fbfe07ab4d4e58c0971dac73001b89f180a0af3
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-659%2Fderrickstolee%2Fbloom-2-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-659/derrickstolee/bloom-2-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/659

Range-diff vs v2:

  1:  57002040bc =  1:  57002040bc commit-graph: place bloom_settings in context
  2:  6b63f9bd8a =  2:  6b63f9bd8a commit-graph: change test to die on parse, not load
  3:  492deaf916 !  3:  2f809499ab bloom: get_bloom_filter() cleanups
     @@ Metadata
      Author: Derrick Stolee <dstolee@microsoft.com>
      
       ## Commit message ##
     -    bloom: get_bloom_filter() cleanups
     +    bloom: fix logic in get_bloom_filter()
      
          The get_bloom_filter() method is a bit complicated in some parts where
          it does not need to be. In particular, it needs to return a NULL filter
     @@ Commit message
      
          Also clean up some style issues while we are here.
      
     +    Helped-by: René Scharfe <l.s.r@web.de>
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## bloom.c ##
     @@ bloom.c: struct bloom_filter *get_bloom_filter(struct repository *r,
      -			else
      -				return NULL;
      -		}
     -+		    r->objects->commit_graph->chunk_bloom_indexes &&
     -+		    load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
     -+			return filter;
     ++		    r->objects->commit_graph->chunk_bloom_indexes)
     ++			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
       	}
       
      -	if (filter->data || !compute_if_not_present)
      +	if (filter->data)
       		return filter;
     -+	if (!filter->data && !compute_if_not_present)
     ++	if (!compute_if_not_present)
      +		return NULL;
       
       	repo_diff_setup(r, &diffopt);
  4:  8727b25468 !  4:  33e22d05cb commit-graph: persist existence of changed-paths
     @@ commit-graph.c: static void write_graph_chunk_bloom_data(struct hashfile *f,
       		progress = start_delayed_progress(
       			_("Writing changed paths Bloom filters data"),
      @@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_context *ctx)
     + 	int num_chunks = 3;
     + 	uint64_t chunk_offset;
       	struct object_id file_hash;
     - 	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
     +-	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
     ++	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
       
      -	ctx->bloom_settings = &bloom_settings;
     -+	if (!ctx->bloom_settings)
     ++	if (!ctx->bloom_settings) {
     ++		bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
     ++							      bloom_settings.bits_per_entry);
     ++		bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
     ++							  bloom_settings.num_hashes);
      +		ctx->bloom_settings = &bloom_settings;
     ++	}
       
       	if (ctx->split) {
       		struct strbuf tmp_file = STRBUF_INIT;
     @@ t/t4216-log-bloom.sh: test_expect_success 'Use Bloom filters if they exist in th
       	test_bloom_filters_used_when_some_filters_are_missing "-- A/B"
       '
       
     -+BASE_BDAT_OFFSET=2240
     -+BASE_K_BYTE_OFFSET=$((BASE_BDAT_OFFSET + 10))
     -+BASE_LEN_BYTE_OFFSET=$((BASE_BDAT_OFFSET + 14))
     -+
     -+corrupt_graph() {
     -+	pos=$1
     -+	data="${2:-\0}"
     -+	grepstr=$3
     -+	orig_size=$(wc -c < .git/objects/info/commit-graph) &&
     -+	zero_pos=${4:-${orig_size}} &&
     -+	printf "$data" | dd of=".git/objects/info/commit-graph" bs=1 seek="$pos" conv=notrunc &&
     -+	dd of=".git/objects/info/commit-graph" bs=1 seek="$zero_pos" if=/dev/null
     -+}
     -+
      +test_expect_success 'persist filter settings' '
      +	test_when_finished rm -rf .git/objects/info/commit-graph* &&
     -+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths &&
     -+	grep "{\"hash_version\":1,\"num_hashes\":7,\"bits_per_entry\":10}" trace2.txt &&
     -+	cp .git/objects/info/commit-graph commit-graph-before &&
     -+	corrupt_graph $BASE_K_BYTE_OFFSET "\09" &&
     -+	corrupt_graph $BASE_LEN_BYTE_OFFSET "\0F" &&
     -+	cp .git/objects/info/commit-graph commit-graph-after &&
     -+	test_commit c18 A/corrupt &&
     -+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths &&
     -+	grep "{\"hash_version\":1,\"num_hashes\":57,\"bits_per_entry\":70}" trace2.txt
     ++	rm -rf .git/objects/info/commit-graph* &&
     ++	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
     ++		GIT_TRACE2_EVENT_NESTING=5 \
     ++		GIT_TEST_BLOOM_SETTINGS_NUM_HASHES=9 \
     ++		GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY=15 \
     ++		git commit-graph write --reachable --changed-paths &&
     ++	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2.txt &&
     ++	GIT_TRACE2_EVENT="$(pwd)/trace2-auto.txt" \
     ++		GIT_TRACE2_EVENT_NESTING=5 \
     ++		git commit-graph write --reachable --changed-paths &&
     ++	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2-auto.txt
      +'
      +
       test_done
  5:  244668fec4 !  5:  81c45d5260 commit-graph: unify the signatures of all write_graph_chunk_*() functions
     @@ Commit message
          This opens up the possibility for further cleanups and foolproofing in
          the following two patches.
      
     +    Helped-by: René Scharfe <l.s.r@web.de>
          Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
     @@ commit-graph.c: static void write_graph_chunk_fanout(struct hashfile *f,
       	for (count = 0; count < ctx->commits.nr; count++, list++) {
       		display_progress(ctx->progress, ++ctx->progress_cnt);
      -		hashwrite(f, (*list)->object.oid.hash, (int)hash_len);
     -+		hashwrite(f, (*list)->object.oid.hash, (int)the_hash_algo->rawsz);
     ++		hashwrite(f, (*list)->object.oid.hash, the_hash_algo->rawsz);
       	}
      +
      +	return 0;
  6:  8b959f2f37 !  6:  8828dcd906 commit-graph: simplify chunk writes into loop
     @@ Commit message
          file.  This will open up the possibility for a bit of foolproofing in
          the following patch.
      
     +    Helped-by: René Scharfe <l.s.r@web.de>
          Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
     @@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_con
      -		return -1;
      +
      +	for (i = 0; i < num_chunks; i++) {
     -+		if (chunks[i].write_fn(f, ctx)) {
     -+			error(_("failed writing chunk with id %"PRIx32""),
     -+			      chunks[i].id);
     ++		if (chunks[i].write_fn(f, ctx))
      +			return -1;
     -+		}
       	}
      +
       	stop_progress(&ctx->progress);
  7:  3eb10933dc !  7:  ddbf297755 commit-graph: check chunk sizes after writing
     @@ Commit message
          in a loop we can do this check in a single place for all chunks, and
          any chunks added in the future will get checked as well.
      
     +    Helped-by: René Scharfe <l.s.r@web.de>
          Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## commit-graph.c ##
      @@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_context *ctx)
     - 			num_chunks * ctx->commits.nr);
       	}
       
     -+	chunk_offset = f->total + f->offset;
       	for (i = 0; i < num_chunks; i++) {
     -+		uint64_t end_offset;
     ++		uint64_t start_offset = f->total + f->offset;
      +
     - 		if (chunks[i].write_fn(f, ctx)) {
     - 			error(_("failed writing chunk with id %"PRIx32""),
     - 			      chunks[i].id);
     + 		if (chunks[i].write_fn(f, ctx))
       			return -1;
     - 		}
      +
     -+		end_offset = f->total + f->offset;
     -+		if (end_offset - chunk_offset != chunks[i].size)
     ++		if (f->total + f->offset != start_offset + chunks[i].size)
      +			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
     -+			    chunks[i].size, chunks[i].id, end_offset - chunk_offset);
     -+		chunk_offset = end_offset;
     ++			    chunks[i].size, chunks[i].id,
     ++			    f->total + f->offset - start_offset);
       	}
       
       	stop_progress(&ctx->progress);
  8:  0bcfc1f051 =  8:  8b63706141 revision.c: fix whitespace
  9:  719c7091a7 =  9:  7d6163305a revision: empty pathspecs should not use Bloom filters
 10:  9c2076b4ce ! 10:  40061233ca commit-graph: check all leading directories in changed path Bloom filters
     @@ revision.c: static void prepare_to_use_bloom_filter(struct rev_info *revs)
      -	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
      +	p = path;
      +	while (*p) {
     -+		if (is_dir_sep(*p))
     ++		/*
     ++		 * At this point, the path is normalized to use Unix-style
     ++		 * path separators. This is required due to how the
     ++		 * changed-path Bloom filters store the paths.
     ++		 */
     ++		if (*p == '/')
      +			path_component_nr++;
      +		p++;
      +	}
     @@ revision.c: static void prepare_to_use_bloom_filter(struct rev_info *revs)
      +
      +	p = path + len - 1;
      +	while (p > path) {
     -+		if (is_dir_sep(*p))
     ++		if (*p == '/')
      +			fill_bloom_key(path, p - path,
      +				       &revs->bloom_keys[path_component_nr++],
      +				       revs->bloom_filter_settings);
 11:  1022c0ad21 <  -:  ---------- bloom: enforce a minimum size of 8 bytes

-- 
gitgitgadget


* [PATCH v3 01/10] commit-graph: place bloom_settings in context
  2020-06-26 12:30   ` [PATCH v3 00/10] " Derrick Stolee via GitGitGadget
@ 2020-06-26 12:30     ` Derrick Stolee via GitGitGadget
  2020-06-26 12:30     ` [PATCH v3 02/10] commit-graph: change test to die on parse, not load Derrick Stolee via GitGitGadget
                       ` (9 subsequent siblings)
  10 siblings, 0 replies; 71+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-06-26 12:30 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Place a pointer to a struct bloom_filter_settings into struct
write_commit_graph_context. This simplifies the function prototype of
write_graph_chunk_bloom_data(), which in turn will allow us to unify
the function prototypes and use function pointers to simplify
write_commit_graph_file().

By using a pointer, we can later replace the settings to match those
that exist in the current commit-graph, in case a future Git version
allows customization of these parameters.

Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 887837e882..d0fedcd9b1 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -882,6 +882,7 @@ struct write_commit_graph_context {
 
 	const struct split_commit_graph_opts *split_opts;
 	size_t total_bloom_filter_data_size;
+	const struct bloom_filter_settings *bloom_settings;
 };
 
 static void write_graph_chunk_fanout(struct hashfile *f,
@@ -1103,8 +1104,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
 }
 
 static void write_graph_chunk_bloom_data(struct hashfile *f,
-					 struct write_commit_graph_context *ctx,
-					 const struct bloom_filter_settings *settings)
+					 struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
@@ -1116,9 +1116,9 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
 			_("Writing changed paths Bloom filters data"),
 			ctx->commits.nr);
 
-	hashwrite_be32(f, settings->hash_version);
-	hashwrite_be32(f, settings->num_hashes);
-	hashwrite_be32(f, settings->bits_per_entry);
+	hashwrite_be32(f, ctx->bloom_settings->hash_version);
+	hashwrite_be32(f, ctx->bloom_settings->num_hashes);
+	hashwrite_be32(f, ctx->bloom_settings->bits_per_entry);
 
 	while (list < last) {
 		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
@@ -1541,6 +1541,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	struct object_id file_hash;
 	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 
+	ctx->bloom_settings = &bloom_settings;
+
 	if (ctx->split) {
 		struct strbuf tmp_file = STRBUF_INIT;
 
@@ -1642,7 +1644,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		write_graph_chunk_extra_edges(f, ctx);
 	if (ctx->changed_paths) {
 		write_graph_chunk_bloom_indexes(f, ctx);
-		write_graph_chunk_bloom_data(f, ctx, &bloom_settings);
+		write_graph_chunk_bloom_data(f, ctx);
 	}
 	if (ctx->num_commit_graphs_after > 1 &&
 	    write_graph_chunk_base(f, ctx)) {
-- 
gitgitgadget
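
As a standalone illustration of the shape this patch gives the writer,
the sketch below keeps a pointer to the settings in a context struct
and emits the three header words in the order used above
(hash_version, num_hashes, bits_per_entry). All *_sketch names are
hypothetical, and a plain byte buffer stands in for the hashfile.

```c
#include <stddef.h>
#include <stdint.h>

struct settings_sketch {
	uint32_t hash_version;
	uint32_t num_hashes;
	uint32_t bits_per_entry;
};

struct write_ctx_sketch {
	/* A pointer, so it can later be aimed at settings read from a file. */
	const struct settings_sketch *settings;
};

/* Store v in network byte order (big endian). */
static void put_be32(unsigned char *out, uint32_t v)
{
	out[0] = (unsigned char)(v >> 24);
	out[1] = (unsigned char)(v >> 16);
	out[2] = (unsigned char)(v >> 8);
	out[3] = (unsigned char)v;
}

/* Write the 12-byte settings header; returns the number of bytes written. */
size_t write_header_sketch(unsigned char *out, const struct write_ctx_sketch *ctx)
{
	put_be32(out, ctx->settings->hash_version);
	put_be32(out + 4, ctx->settings->num_hashes);
	put_be32(out + 8, ctx->settings->bits_per_entry);
	return 12;
}
```

Because the writer only sees the context, the caller decides which
settings apply, and no extra parameter has to be threaded through each
chunk-writing function.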



* [PATCH v3 02/10] commit-graph: change test to die on parse, not load
  2020-06-26 12:30   ` [PATCH v3 00/10] " Derrick Stolee via GitGitGadget
  2020-06-26 12:30     ` [PATCH v3 01/10] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
@ 2020-06-26 12:30     ` Derrick Stolee via GitGitGadget
  2020-06-26 12:30     ` [PATCH v3 03/10] bloom: fix logic in get_bloom_filter() Derrick Stolee via GitGitGadget
                       ` (8 subsequent siblings)
  10 siblings, 0 replies; 71+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-06-26 12:30 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

43d3561 (commit-graph write: don't die if the existing graph is corrupt,
2019-03-25) introduced the GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD environment
variable. This was created to verify that commit-graph was not loaded
when writing a new non-incremental commit-graph.

An upcoming change wants to load a commit-graph in some valuable cases,
but we want to maintain that we don't trust the commit-graph data when
writing our new file. Instead of dying on load, die if we ever
try to parse a commit from the commit-graph. This functionally verifies
the same intended behavior, but allows a more advanced feature in the
next change.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c          | 12 ++++++++----
 commit-graph.h          |  2 +-
 t/t5318-commit-graph.sh |  2 +-
 3 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index d0fedcd9b1..6a28d4a5a6 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -564,10 +564,6 @@ static int prepare_commit_graph(struct repository *r)
 		return !!r->objects->commit_graph;
 	r->objects->commit_graph_attempted = 1;
 
-	if (git_env_bool(GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD, 0))
-		die("dying as requested by the '%s' variable on commit-graph load!",
-		    GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD);
-
 	prepare_repo_settings(r);
 
 	if (!git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) &&
@@ -790,6 +786,14 @@ static int parse_commit_in_graph_one(struct repository *r,
 
 int parse_commit_in_graph(struct repository *r, struct commit *item)
 {
+	static int checked_env = 0;
+
+	if (!checked_env &&
+	    git_env_bool(GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE, 0))
+		die("dying as requested by the '%s' variable on commit-graph parse!",
+		    GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE);
+	checked_env = 1;
+
 	if (!prepare_commit_graph(r))
 		return 0;
 	return parse_commit_in_graph_one(r, r->objects->commit_graph, item);
diff --git a/commit-graph.h b/commit-graph.h
index 881c9b46e5..f0fb13e3f2 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -5,7 +5,7 @@
 #include "object-store.h"
 
 #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
-#define GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD"
+#define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
 #define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
 
 /*
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 1073f9e3cf..5ec01abdaa 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -436,7 +436,7 @@ corrupt_graph_verify() {
 		cp $objdir/info/commit-graph commit-graph-pre-write-test
 	fi &&
 	git status --short &&
-	GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD=true git commit-graph write &&
+	GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE=true git commit-graph write &&
 	git commit-graph verify
 }
 
-- 
gitgitgadget



* [PATCH v3 03/10] bloom: fix logic in get_bloom_filter()
  2020-06-26 12:30   ` [PATCH v3 00/10] " Derrick Stolee via GitGitGadget
  2020-06-26 12:30     ` [PATCH v3 01/10] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
  2020-06-26 12:30     ` [PATCH v3 02/10] commit-graph: change test to die on parse, not load Derrick Stolee via GitGitGadget
@ 2020-06-26 12:30     ` Derrick Stolee via GitGitGadget
  2020-06-27 16:33       ` SZEDER Gábor
  2020-06-26 12:30     ` [PATCH v3 04/10] commit-graph: persist existence of changed-paths Derrick Stolee via GitGitGadget
                       ` (7 subsequent siblings)
  10 siblings, 1 reply; 71+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-06-26 12:30 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The get_bloom_filter() method is a bit complicated in some parts where
it does not need to be. In particular, it needs to return a NULL filter
only when compute_if_not_present is zero AND the filter data cannot be
loaded from a commit-graph file. This currently happens by accident
because the commit-graph does not load changed-path Bloom filters from
an existing commit-graph when writing a new one. This will change in a
later patch.

Also clean up some style issues while we are here.

Helped-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 bloom.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/bloom.c b/bloom.c
index c38d1cff0c..2af5389795 100644
--- a/bloom.c
+++ b/bloom.c
@@ -186,7 +186,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	struct diff_options diffopt;
 	int max_changes = 512;
 
-	if (bloom_filters.slab_size == 0)
+	if (!bloom_filters.slab_size)
 		return NULL;
 
 	filter = bloom_filter_slab_at(&bloom_filters, c);
@@ -194,16 +194,14 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	if (!filter->data) {
 		load_commit_graph_info(r, c);
 		if (c->graph_pos != COMMIT_NOT_FROM_GRAPH &&
-			r->objects->commit_graph->chunk_bloom_indexes) {
-			if (load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
-				return filter;
-			else
-				return NULL;
-		}
+		    r->objects->commit_graph->chunk_bloom_indexes)
+			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
 	}
 
-	if (filter->data || !compute_if_not_present)
+	if (filter->data)
 		return filter;
+	if (!compute_if_not_present)
+		return NULL;
 
 	repo_diff_setup(r, &diffopt);
 	diffopt.flags.recursive = 1;
-- 
gitgitgadget
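
The control flow that results from this patch can be modeled outside
of Git. In the sketch below every name is hypothetical: two callbacks
stand in for load_bloom_filter_from_graph() and for the diff-based
computation, and the slab slot is a plain struct.

```c
#include <stddef.h>

struct filter_sketch {
	const unsigned char *data;	/* NULL until loaded or computed */
	size_t len;
};

typedef void (*fill_fn)(struct filter_sketch *);

static const unsigned char sample_bits[] = { 0xff };

/* Stand-in for a successful load or computation. */
void fill_sample(struct filter_sketch *f)
{
	f->data = sample_bits;
	f->len = sizeof(sample_bits);
}

/* Stand-in for a commit-graph that has no filter for this commit. */
void fill_nothing(struct filter_sketch *f)
{
	(void)f;
}

struct filter_sketch *get_filter_sketch(struct filter_sketch *slot,
					fill_fn load_from_graph,
					fill_fn compute,
					int compute_if_not_present)
{
	/* Try the commit-graph first; this may leave slot->data NULL. */
	if (!slot->data && load_from_graph)
		load_from_graph(slot);

	if (slot->data)
		return slot;
	if (!compute_if_not_present)
		return NULL;	/* no data, and not allowed to compute it */

	compute(slot);
	return slot;
}
```

The point of the reordering is visible in the two early returns: a
failed load no longer returns NULL right away, and NULL is returned
only when there is no data and computing is not allowed.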



* [PATCH v3 04/10] commit-graph: persist existence of changed-paths
  2020-06-26 12:30   ` [PATCH v3 00/10] " Derrick Stolee via GitGitGadget
                       ` (2 preceding siblings ...)
  2020-06-26 12:30     ` [PATCH v3 03/10] bloom: fix logic in get_bloom_filter() Derrick Stolee via GitGitGadget
@ 2020-06-26 12:30     ` Derrick Stolee via GitGitGadget
  2020-06-26 12:30     ` [PATCH v3 05/10] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
                       ` (6 subsequent siblings)
  10 siblings, 0 replies; 71+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-06-26 12:30 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The changed-path Bloom filters were released in v2.27.0, but have a
significant drawback. A user can opt in to writing the changed-path
filters using the "--changed-paths" option to "git commit-graph write",
but the next write will drop the filters unless that option is
specified again.

This becomes even more important when considering the interaction with
gc.writeCommitGraph (on by default) or fetch.writeCommitGraph (part of
features.experimental). These config options trigger commit-graph writes
that the user did not signal, and hence there is no --changed-paths
option available.

Allow a user who opts in to the changed-path filters to persist the
property of "my commit-graph has changed-path filters" automatically. A
user can drop filters using the --no-changed-paths option.

In the process, we need to be extremely careful to match the Bloom
filter settings as specified by the commit-graph. This will allow future
versions of Git to customize these settings, and the version with this
change will persist those settings as commit-graphs are rewritten on
top.
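
A minimal sketch of the opt-in persistence described above, assuming a
tri-state option value (-1 when neither option is given, 0 for
--no-changed-paths, 1 for --changed-paths). The toy_* names are made up
for illustration, and the GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS override
is deliberately left out.

```c
#include <assert.h>

/* Hypothetical flag bits standing in for the two write flags. */
enum toy_flags {
	TOY_WRITE_BLOOM = 1 << 0,
	TOY_NO_WRITE_BLOOM = 1 << 1
};

/* opt: -1 = option not given, 0 = --no-changed-paths, 1 = --changed-paths */
static unsigned toy_flags_from_option(int opt)
{
	unsigned flags = 0;

	if (!opt)
		flags |= TOY_NO_WRITE_BLOOM;
	if (opt == 1)
		flags |= TOY_WRITE_BLOOM;
	return flags;
}

/* Decide whether the next commit-graph should contain changed-path filters. */
static int toy_should_write_filters(unsigned flags, int existing_has_filters)
{
	int changed_paths = 0;

	if (flags & TOY_WRITE_BLOOM)
		changed_paths = 1;
	/* Existing filters persist unless the user explicitly opted out. */
	if (!(flags & TOY_NO_WRITE_BLOOM) && existing_has_filters)
		changed_paths = 1;
	return changed_paths;
}
```

With the option unset, the outcome depends only on whether the existing
commit-graph already has filters, which is the persistence behavior.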

Use the trace2 API to signal the settings used during the write, and
check that output in a test that overrides the default settings through
the GIT_TEST_BLOOM_SETTINGS_* environment variables.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-commit-graph.txt |  5 +++-
 builtin/commit-graph.c             |  5 +++-
 commit-graph.c                     | 45 ++++++++++++++++++++++++++++--
 commit-graph.h                     |  1 +
 t/t4216-log-bloom.sh               | 17 ++++++++++-
 5 files changed, 67 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index f4b13c005b..369b222b08 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -60,7 +60,10 @@ existing commit-graph file.
 With the `--changed-paths` option, compute and write information about the
 paths changed between a commit and it's first parent. This operation can
 take a while on large repositories. It provides significant performance gains
-for getting history of a directory or a file with `git log -- <path>`.
+for getting history of a directory or a file with `git log -- <path>`. If
+this option is given, future commit-graph writes will automatically assume
+that this option was intended. Use `--no-changed-paths` to stop storing this
+data.
 +
 With the `--split` option, write the commit-graph as a chain of multiple
 commit-graph files stored in `<dir>/info/commit-graphs`. The new commits
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 59009837dc..ff7b177c33 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -151,6 +151,7 @@ static int graph_write(int argc, const char **argv)
 	};
 
 	opts.progress = isatty(2);
+	opts.enable_changed_paths = -1;
 	split_opts.size_multiple = 2;
 	split_opts.max_commits = 0;
 	split_opts.expire_time = 0;
@@ -171,7 +172,9 @@ static int graph_write(int argc, const char **argv)
 		flags |= COMMIT_GRAPH_WRITE_SPLIT;
 	if (opts.progress)
 		flags |= COMMIT_GRAPH_WRITE_PROGRESS;
-	if (opts.enable_changed_paths ||
+	if (!opts.enable_changed_paths)
+		flags |= COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS;
+	if (opts.enable_changed_paths == 1 ||
 	    git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0))
 		flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
 
diff --git a/commit-graph.c b/commit-graph.c
index 6a28d4a5a6..11088fc11f 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -16,6 +16,8 @@
 #include "progress.h"
 #include "bloom.h"
 #include "commit-slab.h"
+#include "json-writer.h"
+#include "trace2.h"
 
 void git_test_write_commit_graph_or_die(void)
 {
@@ -1107,6 +1109,21 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
 	stop_progress(&progress);
 }
 
+static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
+{
+	struct json_writer jw = JSON_WRITER_INIT;
+
+	jw_object_begin(&jw, 0);
+	jw_object_intmax(&jw, "hash_version", ctx->bloom_settings->hash_version);
+	jw_object_intmax(&jw, "num_hashes", ctx->bloom_settings->num_hashes);
+	jw_object_intmax(&jw, "bits_per_entry", ctx->bloom_settings->bits_per_entry);
+	jw_end(&jw);
+
+	trace2_data_json("bloom", ctx->r, "settings", &jw);
+
+	jw_release(&jw);
+}
+
 static void write_graph_chunk_bloom_data(struct hashfile *f,
 					 struct write_commit_graph_context *ctx)
 {
@@ -1115,6 +1132,8 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
 	struct progress *progress = NULL;
 	int i = 0;
 
+	trace2_bloom_filter_settings(ctx);
+
 	if (ctx->report_progress)
 		progress = start_delayed_progress(
 			_("Writing changed paths Bloom filters data"),
@@ -1543,9 +1562,15 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	int num_chunks = 3;
 	uint64_t chunk_offset;
 	struct object_id file_hash;
-	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
+	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 
-	ctx->bloom_settings = &bloom_settings;
+	if (!ctx->bloom_settings) {
+		bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
+							      bloom_settings.bits_per_entry);
+		bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
+							  bloom_settings.num_hashes);
+		ctx->bloom_settings = &bloom_settings;
+	}
 
 	if (ctx->split) {
 		struct strbuf tmp_file = STRBUF_INIT;
@@ -1970,9 +1995,23 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
 	ctx->check_oids = flags & COMMIT_GRAPH_WRITE_CHECK_OIDS ? 1 : 0;
 	ctx->split_opts = split_opts;
-	ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
 	ctx->total_bloom_filter_data_size = 0;
 
+	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
+		ctx->changed_paths = 1;
+	if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
+		struct commit_graph *g;
+		prepare_commit_graph_one(ctx->r, ctx->odb);
+
+		g = ctx->r->objects->commit_graph;
+
+		/* We have changed-paths already. Keep them in the next graph */
+		if (g && g->chunk_bloom_data) {
+			ctx->changed_paths = 1;
+			ctx->bloom_settings = g->bloom_filter_settings;
+		}
+	}
+
 	if (ctx->split) {
 		struct commit_graph *g;
 		prepare_commit_graph(ctx->r);
diff --git a/commit-graph.h b/commit-graph.h
index f0fb13e3f2..45b1e5bca3 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -96,6 +96,7 @@ enum commit_graph_write_flags {
 	/* Make sure that each OID in the input is a valid commit OID. */
 	COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3),
 	COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4),
+	COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS = (1 << 5),
 };
 
 struct split_commit_graph_opts {
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index c7011f33e2..73ed51b595 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -126,7 +126,7 @@ test_expect_success 'setup - add commit-graph to the chain without Bloom filters
 	test_commit c14 A/anotherFile2 &&
 	test_commit c15 A/B/anotherFile2 &&
 	test_commit c16 A/B/C/anotherFile2 &&
-	GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0 git commit-graph write --reachable --split &&
+	git commit-graph write --reachable --split --no-changed-paths &&
 	test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain
 '
 
@@ -152,4 +152,19 @@ test_expect_success 'Use Bloom filters if they exist in the latest but not all c
 	test_bloom_filters_used_when_some_filters_are_missing "-- A/B"
 '
 
+test_expect_success 'persist filter settings' '
+	test_when_finished rm -rf .git/objects/info/commit-graph* &&
+	rm -rf .git/objects/info/commit-graph* &&
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
+		GIT_TRACE2_EVENT_NESTING=5 \
+		GIT_TEST_BLOOM_SETTINGS_NUM_HASHES=9 \
+		GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY=15 \
+		git commit-graph write --reachable --changed-paths &&
+	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2.txt &&
+	GIT_TRACE2_EVENT="$(pwd)/trace2-auto.txt" \
+		GIT_TRACE2_EVENT_NESTING=5 \
+		git commit-graph write --reachable --changed-paths &&
+	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2-auto.txt
+'
+
 test_done
\ No newline at end of file
-- 
gitgitgadget



* [PATCH v3 05/10] commit-graph: unify the signatures of all write_graph_chunk_*() functions
  2020-06-26 12:30   ` [PATCH v3 00/10] " Derrick Stolee via GitGitGadget
                       ` (3 preceding siblings ...)
  2020-06-26 12:30     ` [PATCH v3 04/10] commit-graph: persist existence of changed-paths Derrick Stolee via GitGitGadget
@ 2020-06-26 12:30     ` SZEDER Gábor via GitGitGadget
  2020-06-26 12:30     ` [PATCH v3 06/10] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
                       ` (5 subsequent siblings)
  10 siblings, 0 replies; 71+ messages in thread
From: SZEDER Gábor via GitGitGadget @ 2020-06-26 12:30 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee, SZEDER Gábor

From: SZEDER Gábor <szeder.dev@gmail.com>

Update the write_graph_chunk_*() helper functions to have the same
signature:

  - Return an int error code from all these functions.
    write_graph_chunk_base() already has an int error code, now the
    others will have one, too, but since they don't indicate any
    error, they will always return 0.

  - Drop the hash size parameter of write_graph_chunk_oids() and
    write_graph_chunk_data(); its value can be read directly from
    'the_hash_algo' inside these functions as well.

This opens up the possibility for further cleanups and foolproofing in
the following two patches.

Helped-by: René Scharfe <l.s.r@web.de>
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 42 ++++++++++++++++++++++++++----------------
 1 file changed, 26 insertions(+), 16 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 11088fc11f..d51682998d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -891,8 +891,8 @@ struct write_commit_graph_context {
 	const struct bloom_filter_settings *bloom_settings;
 };
 
-static void write_graph_chunk_fanout(struct hashfile *f,
-				     struct write_commit_graph_context *ctx)
+static int write_graph_chunk_fanout(struct hashfile *f,
+				    struct write_commit_graph_context *ctx)
 {
 	int i, count = 0;
 	struct commit **list = ctx->commits.list;
@@ -913,17 +913,21 @@ static void write_graph_chunk_fanout(struct hashfile *f,
 
 		hashwrite_be32(f, count);
 	}
+
+	return 0;
 }
 
-static void write_graph_chunk_oids(struct hashfile *f, int hash_len,
-				   struct write_commit_graph_context *ctx)
+static int write_graph_chunk_oids(struct hashfile *f,
+				  struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
 	int count;
 	for (count = 0; count < ctx->commits.nr; count++, list++) {
 		display_progress(ctx->progress, ++ctx->progress_cnt);
-		hashwrite(f, (*list)->object.oid.hash, (int)hash_len);
+		hashwrite(f, (*list)->object.oid.hash, the_hash_algo->rawsz);
 	}
+
+	return 0;
 }
 
 static const unsigned char *commit_to_sha1(size_t index, void *table)
@@ -932,8 +936,8 @@ static const unsigned char *commit_to_sha1(size_t index, void *table)
 	return commits[index]->object.oid.hash;
 }
 
-static void write_graph_chunk_data(struct hashfile *f, int hash_len,
-				   struct write_commit_graph_context *ctx)
+static int write_graph_chunk_data(struct hashfile *f,
+				  struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
@@ -950,7 +954,7 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 			die(_("unable to parse commit %s"),
 				oid_to_hex(&(*list)->object.oid));
 		tree = get_commit_tree_oid(*list);
-		hashwrite(f, tree->hash, hash_len);
+		hashwrite(f, tree->hash, the_hash_algo->rawsz);
 
 		parent = (*list)->parents;
 
@@ -1030,10 +1034,12 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 
 		list++;
 	}
+
+	return 0;
 }
 
-static void write_graph_chunk_extra_edges(struct hashfile *f,
-					  struct write_commit_graph_context *ctx)
+static int write_graph_chunk_extra_edges(struct hashfile *f,
+					 struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
@@ -1082,10 +1088,12 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
 
 		list++;
 	}
+
+	return 0;
 }
 
-static void write_graph_chunk_bloom_indexes(struct hashfile *f,
-					    struct write_commit_graph_context *ctx)
+static int write_graph_chunk_bloom_indexes(struct hashfile *f,
+					   struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
@@ -1107,6 +1115,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
 	}
 
 	stop_progress(&progress);
+	return 0;
 }
 
 static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
@@ -1124,8 +1133,8 @@ static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
 	jw_release(&jw);
 }
 
-static void write_graph_chunk_bloom_data(struct hashfile *f,
-					 struct write_commit_graph_context *ctx)
+static int write_graph_chunk_bloom_data(struct hashfile *f,
+					struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
@@ -1151,6 +1160,7 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
 	}
 
 	stop_progress(&progress);
+	return 0;
 }
 
 static int oid_compare(const void *_a, const void *_b)
@@ -1667,8 +1677,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 			num_chunks * ctx->commits.nr);
 	}
 	write_graph_chunk_fanout(f, ctx);
-	write_graph_chunk_oids(f, hashsz, ctx);
-	write_graph_chunk_data(f, hashsz, ctx);
+	write_graph_chunk_oids(f, ctx);
+	write_graph_chunk_data(f, ctx);
 	if (ctx->num_extra_edges)
 		write_graph_chunk_extra_edges(f, ctx);
 	if (ctx->changed_paths) {
-- 
gitgitgadget



* [PATCH v3 06/10] commit-graph: simplify chunk writes into loop
  2020-06-26 12:30   ` [PATCH v3 00/10] " Derrick Stolee via GitGitGadget
                       ` (4 preceding siblings ...)
  2020-06-26 12:30     ` [PATCH v3 05/10] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
@ 2020-06-26 12:30     ` SZEDER Gábor via GitGitGadget
  2020-06-26 12:30     ` [PATCH v3 07/10] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
                       ` (4 subsequent siblings)
  10 siblings, 0 replies; 71+ messages in thread
From: SZEDER Gábor via GitGitGadget @ 2020-06-26 12:30 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee, SZEDER Gábor

From: SZEDER Gábor <szeder.dev@gmail.com>

In write_commit_graph_file() we now have one block of code filling the
array of 'struct chunk_info' with the IDs and sizes of chunks to be
written, and another block of code calling the functions responsible
for writing individual chunks.  In case of optional chunks like Extra
Edge List and Base Graphs List there is also a condition checking
whether that chunk is necessary/desired, and that same condition is
repeated in both blocks of code. Other, newer chunks have similar
optional conditions.

Eliminate these repeated conditions by storing the function pointers
responsible for writing individual chunks in the 'struct chunk_info'
array as well, and calling them in a loop to write the commit-graph
file.  This will open up the possibility for a bit of foolproofing in
the following patch.
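
To illustrate the idea outside of commit-graph.c, here is a simplified
sketch with made-up toy_* names: the condition for an optional chunk is
evaluated once while filling the table, and the write loop no longer
needs to know about it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_ctx {
	int has_extra;		/* whether the optional chunk is wanted */
	uint64_t written;	/* bytes "written" so far */
};

typedef int (*toy_write_fn)(struct toy_ctx *ctx);

struct toy_chunk {
	uint32_t id;
	uint64_t size;
	toy_write_fn write_fn;
};

static int toy_write_required(struct toy_ctx *ctx)
{
	ctx->written += 4;
	return 0;
}

static int toy_write_extra(struct toy_ctx *ctx)
{
	ctx->written += 8;
	return 0;
}

/* Decide once which chunks exist; the caller provides room for 2 entries. */
static size_t toy_plan_chunks(const struct toy_ctx *ctx, struct toy_chunk *chunks)
{
	size_t nr = 0;

	assert(ctx && chunks);
	chunks[nr].id = 1;
	chunks[nr].size = 4;
	chunks[nr].write_fn = toy_write_required;
	nr++;
	if (ctx->has_extra) {
		chunks[nr].id = 2;
		chunks[nr].size = 8;
		chunks[nr].write_fn = toy_write_extra;
		nr++;
	}
	return nr;
}

/* Write every planned chunk; stop at the first writer reporting an error. */
static int toy_write_chunks(struct toy_ctx *ctx,
			    const struct toy_chunk *chunks, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (chunks[i].write_fn(ctx))
			return -1;
	return 0;
}
```

Adding a new optional chunk then means touching only the planning step.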

Helped-by: René Scharfe <l.s.r@web.de>
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index d51682998d..e43ee58ea6 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1555,9 +1555,13 @@ static int write_graph_chunk_base(struct hashfile *f,
 	return 0;
 }
 
+typedef int (*chunk_write_fn)(struct hashfile *f,
+			      struct write_commit_graph_context *ctx);
+
 struct chunk_info {
 	uint32_t id;
 	uint64_t size;
+	chunk_write_fn write_fn;
 };
 
 static int write_commit_graph_file(struct write_commit_graph_context *ctx)
@@ -1620,27 +1624,34 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 
 	chunks[0].id = GRAPH_CHUNKID_OIDFANOUT;
 	chunks[0].size = GRAPH_FANOUT_SIZE;
+	chunks[0].write_fn = write_graph_chunk_fanout;
 	chunks[1].id = GRAPH_CHUNKID_OIDLOOKUP;
 	chunks[1].size = hashsz * ctx->commits.nr;
+	chunks[1].write_fn = write_graph_chunk_oids;
 	chunks[2].id = GRAPH_CHUNKID_DATA;
 	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
+	chunks[2].write_fn = write_graph_chunk_data;
 	if (ctx->num_extra_edges) {
 		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
 		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
+		chunks[num_chunks].write_fn = write_graph_chunk_extra_edges;
 		num_chunks++;
 	}
 	if (ctx->changed_paths) {
 		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMINDEXES;
 		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
+		chunks[num_chunks].write_fn = write_graph_chunk_bloom_indexes;
 		num_chunks++;
 		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMDATA;
 		chunks[num_chunks].size = sizeof(uint32_t) * 3
 					  + ctx->total_bloom_filter_data_size;
+		chunks[num_chunks].write_fn = write_graph_chunk_bloom_data;
 		num_chunks++;
 	}
 	if (ctx->num_commit_graphs_after > 1) {
 		chunks[num_chunks].id = GRAPH_CHUNKID_BASE;
 		chunks[num_chunks].size = hashsz * (ctx->num_commit_graphs_after - 1);
+		chunks[num_chunks].write_fn = write_graph_chunk_base;
 		num_chunks++;
 	}
 
@@ -1676,19 +1687,12 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 			progress_title.buf,
 			num_chunks * ctx->commits.nr);
 	}
-	write_graph_chunk_fanout(f, ctx);
-	write_graph_chunk_oids(f, ctx);
-	write_graph_chunk_data(f, ctx);
-	if (ctx->num_extra_edges)
-		write_graph_chunk_extra_edges(f, ctx);
-	if (ctx->changed_paths) {
-		write_graph_chunk_bloom_indexes(f, ctx);
-		write_graph_chunk_bloom_data(f, ctx);
-	}
-	if (ctx->num_commit_graphs_after > 1 &&
-	    write_graph_chunk_base(f, ctx)) {
-		return -1;
+
+	for (i = 0; i < num_chunks; i++) {
+		if (chunks[i].write_fn(f, ctx))
+			return -1;
 	}
+
 	stop_progress(&ctx->progress);
 	strbuf_release(&progress_title);
 
-- 
gitgitgadget



* [PATCH v3 07/10] commit-graph: check chunk sizes after writing
  2020-06-26 12:30   ` [PATCH v3 00/10] " Derrick Stolee via GitGitGadget
                       ` (5 preceding siblings ...)
  2020-06-26 12:30     ` [PATCH v3 06/10] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
@ 2020-06-26 12:30     ` SZEDER Gábor via GitGitGadget
  2020-06-26 12:30     ` [PATCH v3 08/10] revision.c: fix whitespace Derrick Stolee via GitGitGadget
                       ` (3 subsequent siblings)
  10 siblings, 0 replies; 71+ messages in thread
From: SZEDER Gábor via GitGitGadget @ 2020-06-26 12:30 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee, SZEDER Gábor

From: SZEDER Gábor <szeder.dev@gmail.com>

In my experience while experimenting with new commit-graph chunks,
early versions of the corresponding new write_commit_graph_my_chunk()
functions are, sadly but not surprisingly, often buggy, and write more
or less data than they are supposed to, especially if the chunk size
is not directly proportional to the number of commits.  This then
causes all kinds of issues when reading such a bogus commit-graph
file, raising the question of whether the writing or the reading part
happens to be buggy this time.

Let's catch such issues early, already when writing the commit-graph
file, and check that each write_graph_chunk_*() function wrote exactly
the amount of data that it was expected to, which is the size already
encoded in the Chunk Lookup table.  Now that all commit-graph chunks
are written in a loop we can do this check in a single place for all
chunks, and
any chunks added in the future will get checked as well.
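
The check can be sketched in isolation as follows. This is a toy model
with hypothetical names; it assumes that 'total' counts bytes already
flushed and 'offset' counts bytes still sitting in the buffer, so that
their sum is the current position in the file.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_BUFSZ 8

struct toy_file {
	uint64_t total;		/* bytes already flushed */
	uint64_t offset;	/* bytes currently buffered */
};

typedef void (*toy_writer)(struct toy_file *f);

static uint64_t toy_position(const struct toy_file *f)
{
	return f->total + f->offset;
}

/* Pretend to write len bytes, flushing whenever the buffer fills up. */
static void toy_write(struct toy_file *f, uint64_t len)
{
	f->offset += len;
	while (f->offset >= TOY_BUFSZ) {
		f->offset -= TOY_BUFSZ;
		f->total += TOY_BUFSZ;
	}
}

static void toy_good_writer(struct toy_file *f)
{
	toy_write(f, 16);
}

static void toy_short_writer(struct toy_file *f)
{
	toy_write(f, 12);	/* buggy: writes 4 bytes too few */
}

/* Returns 0 if fn wrote exactly 'expected' bytes, -1 otherwise. */
static int toy_checked_write(struct toy_file *f, toy_writer fn,
			     uint64_t expected)
{
	uint64_t start = toy_position(f);

	fn(f);
	return toy_position(f) == start + expected ? 0 : -1;
}
```

Comparing positions before and after the call keeps the check
independent of how the writer buffers its output.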

Helped-by: René Scharfe <l.s.r@web.de>
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index e43ee58ea6..a0766a86f5 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1689,8 +1689,15 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	}
 
 	for (i = 0; i < num_chunks; i++) {
+		uint64_t start_offset = f->total + f->offset;
+
 		if (chunks[i].write_fn(f, ctx))
 			return -1;
+
+		if (f->total + f->offset != start_offset + chunks[i].size)
+			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
+			    chunks[i].size, chunks[i].id,
+			    f->total + f->offset - start_offset);
 	}
 
 	stop_progress(&ctx->progress);
-- 
gitgitgadget



* [PATCH v3 08/10] revision.c: fix whitespace
  2020-06-26 12:30   ` [PATCH v3 00/10] " Derrick Stolee via GitGitGadget
                       ` (6 preceding siblings ...)
  2020-06-26 12:30     ` [PATCH v3 07/10] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
@ 2020-06-26 12:30     ` Derrick Stolee via GitGitGadget
  2020-06-26 12:30     ` [PATCH v3 09/10] revision: empty pathspecs should not use Bloom filters Taylor Blau via GitGitGadget
                       ` (2 subsequent siblings)
  10 siblings, 0 replies; 71+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-06-26 12:30 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Here, four spaces were used instead of tab characters.

Reported-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/revision.c b/revision.c
index c644c66091..ed59084f50 100644
--- a/revision.c
+++ b/revision.c
@@ -697,11 +697,11 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 
 	/* remove single trailing slash from path, if needed */
 	if (pi->match[last_index] == '/') {
-	    path_alloc = xstrdup(pi->match);
-	    path_alloc[last_index] = '\0';
-	    path = path_alloc;
+		path_alloc = xstrdup(pi->match);
+		path_alloc[last_index] = '\0';
+		path = path_alloc;
 	} else
-	    path = pi->match;
+		path = pi->match;
 
 	len = strlen(path);
 
-- 
gitgitgadget



* [PATCH v3 09/10] revision: empty pathspecs should not use Bloom filters
  2020-06-26 12:30   ` [PATCH v3 00/10] " Derrick Stolee via GitGitGadget
                       ` (7 preceding siblings ...)
  2020-06-26 12:30     ` [PATCH v3 08/10] revision.c: fix whitespace Derrick Stolee via GitGitGadget
@ 2020-06-26 12:30     ` Taylor Blau via GitGitGadget
  2020-06-26 12:30     ` [PATCH v3 10/10] commit-graph: check all leading directories in changed path " SZEDER Gábor via GitGitGadget
  2020-07-01 13:27     ` [PATCH v4 00/10] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
  10 siblings, 0 replies; 71+ messages in thread
From: Taylor Blau via GitGitGadget @ 2020-06-26 12:30 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee, Taylor Blau

From: Taylor Blau <me@ttaylorr.com>

The prepare_to_use_bloom_filter() method was not intended to be called
on an empty pathspec. However, 'git log -- .' and 'git log' are subtly
different: the latter reports all commits while the former will simplify
commits that do not change the root tree.

This means that the path used to construct the bloom_key might be empty,
and that value is not added to the Bloom filter during construction.
That means that the results are likely incorrect!

To resolve the issue, check the length of the path and skip filling the
Bloom key when it is empty. To be completely sure we do not use Bloom
filters, also clear the rev_info's pointer to the bloom_filter_settings
loaded from the commit-graph. That
allows our test to look at the trace2 logs to verify no Bloom filter
statistics are reported.
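
A small sketch of the length check, under the assumption that a
pathspec of '.' at the root has already been normalized to an empty
match string by the time it is inspected. The helper name is
hypothetical and the function only models the length computation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Returns the length of the path that would be hashed into a Bloom key,
 * after removing a single trailing slash. A result of 0 means that
 * Bloom filters must not be used for this pathspec.
 */
static size_t toy_bloom_path_len(const char *match)
{
	size_t len;

	assert(match != NULL);
	len = strlen(match);
	if (len && match[len - 1] == '/')
		len--;
	return len;
}
```

A caller would bail out of the Bloom filter setup whenever this returns
zero, and fall back to the regular tree comparison.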

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c           | 4 ++++
 t/t4216-log-bloom.sh | 4 ++++
 2 files changed, 8 insertions(+)

diff --git a/revision.c b/revision.c
index ed59084f50..b53377cd52 100644
--- a/revision.c
+++ b/revision.c
@@ -704,6 +704,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 		path = pi->match;
 
 	len = strlen(path);
+	if (!len) {
+		revs->bloom_filter_settings = NULL;
+		return;
+	}
 
 	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
 	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 73ed51b595..e3e4badd4c 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -112,6 +112,10 @@ test_expect_success 'git log -- multiple path specs does not use Bloom filters'
 	test_bloom_filters_not_used "-- file4 A/file1"
 '
 
+test_expect_success 'git log -- "." pathspec at root does not use Bloom filters' '
+	test_bloom_filters_not_used "-- ."
+'
+
 test_expect_success 'git log with wildcard that resolves to a single path uses Bloom filters' '
 	test_bloom_filters_used "-- *4" &&
 	test_bloom_filters_used "-- *renamed"
-- 
gitgitgadget



* [PATCH v3 10/10] commit-graph: check all leading directories in changed path Bloom filters
  2020-06-26 12:30   ` [PATCH v3 00/10] " Derrick Stolee via GitGitGadget
                       ` (8 preceding siblings ...)
  2020-06-26 12:30     ` [PATCH v3 09/10] revision: empty pathspecs should not use Bloom filters Taylor Blau via GitGitGadget
@ 2020-06-26 12:30     ` SZEDER Gábor via GitGitGadget
  2020-07-01 13:27     ` [PATCH v4 00/10] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
  10 siblings, 0 replies; 71+ messages in thread
From: SZEDER Gábor via GitGitGadget @ 2020-06-26 12:30 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee, SZEDER Gábor

From: SZEDER Gábor <szeder.dev@gmail.com>

The file 'dir/subdir/file' can only be modified if its leading
directories 'dir' and 'dir/subdir' are modified as well.

So when querying the changed-path Bloom filters for commits that modify
a path with multiple path components, check not only the full path in
the Bloom filters, but all its leading directories as
well.  Take care to check these paths in "deepest first" order,
because it's the full path that is least likely to be modified, and
the Bloom filter queries can short circuit sooner.
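
The "deepest first" query order can be sketched like this. It is only
an illustration: the toy_* names are hypothetical, and an exact set of
strings stands in for a real (probabilistic) Bloom filter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Store the length of the full path, then of each leading directory,
 * deepest first. Returns the number of lengths stored (at most max).
 */
static size_t toy_prefix_lengths(const char *path, size_t *lens, size_t max)
{
	size_t len = strlen(path);
	size_t nr = 0;
	const char *p;

	if (!len || !max)
		return 0;
	lens[nr++] = len;
	for (p = path + len - 1; p > path && nr < max; p--)
		if (*p == '/')
			lens[nr++] = (size_t)(p - path);
	return nr;
}

/* Toy stand-in for a Bloom filter query: exact membership in a string set. */
static int toy_contains(const char *const *set, size_t set_nr,
			const char *path, size_t len)
{
	size_t i;

	for (i = 0; i < set_nr; i++)
		if (strlen(set[i]) == len && !memcmp(set[i], path, len))
			return 1;
	return 0;
}

/* Returns 1 for "maybe changed", 0 for "definitely not changed". */
static int toy_maybe_changed(const char *const *set, size_t set_nr,
			     const char *path)
{
	size_t lens[16];
	size_t nr = toy_prefix_lengths(path, lens, 16);
	size_t j;
	int result = 1;

	/* Stop at the first prefix that is definitely absent. */
	for (j = 0; result && j < nr; j++)
		result = toy_contains(set, set_nr, path, lens[j]);
	return result;
}
```

For 'dir/subdir/file' the lengths come out as 15, 10 and 3, so the full
path is queried first. A lone hit on the full path without its leading
directories is rejected, which is how the extra queries weed out false
positives.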

This can significantly reduce the average false positive rate, by
about an order of magnitude or three(!), and can further speed up
pathspec-limited revision walks.  The table below compares the average
false positive rate and runtime of

  git rev-list HEAD -- "$path"

before and after this change for 5000+ randomly* selected paths from
each repository:

                    Average false           Average        Average
                    positive rate           runtime        runtime
                  before     after     before     after   difference
  ------------------------------------------------------------------
  git             3.220%   0.7853%     0.0558s   0.0387s   -30.6%
  linux           2.453%   0.0296%     0.1046s   0.0766s   -26.8%
  tensorflow      2.536%   0.6977%     0.0594s   0.0420s   -29.2%

*Path selection was done with the following pipeline:

	git ls-tree -r --name-only HEAD | sort -R | head -n 5000

The improvements in runtime are much smaller than the improvements in
average false positive rate, as we are clearly reaching diminishing
returns here.  However, all these timings depend on access to tree
objects being reasonably fast (warm caches).  If we had a partial
clone and the tree objects had to be fetched from a promisor remote,
e.g.:

  $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
  $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
        commit-graph write --reachable
  $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
  $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
        rev-list HEAD -- "$path"

then checking all leading path components can reduce the runtime from
over an hour to a few seconds (and this is with the clone and the
promisor on the same machine).

This adjusts the tracing values in t4216-log-bloom.sh, which provides a
concrete way to notice the improvement.

Helped-by: Taylor Blau <me@ttaylorr.com>
Helped-by: René Scharfe <l.s.r@web.de>
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c           | 46 +++++++++++++++++++++++++++++++++++---------
 revision.h           |  6 ++++--
 t/t4216-log-bloom.sh |  2 +-
 3 files changed, 42 insertions(+), 12 deletions(-)

diff --git a/revision.c b/revision.c
index b53377cd52..b40bc5b51b 100644
--- a/revision.c
+++ b/revision.c
@@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 {
 	struct pathspec_item *pi;
 	char *path_alloc = NULL;
-	const char *path;
+	const char *path, *p;
 	int last_index;
-	int len;
+	size_t len;
+	int path_component_nr = 1;
 
 	if (!revs->commits)
 		return;
@@ -709,8 +710,33 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 		return;
 	}
 
-	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
-	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
+	p = path;
+	while (*p) {
+		/*
+		 * At this point, the path is normalized to use Unix-style
+		 * path separators. This is required due to how the
+		 * changed-path Bloom filters store the paths.
+		 */
+		if (*p == '/')
+			path_component_nr++;
+		p++;
+	}
+
+	revs->bloom_keys_nr = path_component_nr;
+	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
+
+	fill_bloom_key(path, len, &revs->bloom_keys[0],
+		       revs->bloom_filter_settings);
+	path_component_nr = 1;
+
+	p = path + len - 1;
+	while (p > path) {
+		if (*p == '/')
+			fill_bloom_key(path, p - path,
+				       &revs->bloom_keys[path_component_nr++],
+				       revs->bloom_filter_settings);
+		p--;
+	}
 
 	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
 		atexit(trace2_bloom_filter_statistics_atexit);
@@ -724,7 +750,7 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 						 struct commit *commit)
 {
 	struct bloom_filter *filter;
-	int result;
+	int result = 1, j;
 
 	if (!revs->repo->objects->commit_graph)
 		return -1;
@@ -744,9 +770,11 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 		return -1;
 	}
 
-	result = bloom_filter_contains(filter,
-				       revs->bloom_key,
-				       revs->bloom_filter_settings);
+	for (j = 0; result && j < revs->bloom_keys_nr; j++) {
+		result = bloom_filter_contains(filter,
+					       &revs->bloom_keys[j],
+					       revs->bloom_filter_settings);
+	}
 
 	if (result)
 		count_bloom_filter_maybe++;
@@ -786,7 +814,7 @@ static int rev_compare_tree(struct rev_info *revs,
 			return REV_TREE_SAME;
 	}
 
-	if (revs->bloom_key && !nth_parent) {
+	if (revs->bloom_keys_nr && !nth_parent) {
 		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
 
 		if (bloom_ret == 0)
diff --git a/revision.h b/revision.h
index 7c026fe41f..abbfb4ab59 100644
--- a/revision.h
+++ b/revision.h
@@ -295,8 +295,10 @@ struct rev_info {
 	struct topo_walk_info *topo_walk_info;
 
 	/* Commit graph bloom filter fields */
-	/* The bloom filter key for the pathspec */
-	struct bloom_key *bloom_key;
+	/* The bloom filter key(s) for the pathspec */
+	struct bloom_key *bloom_keys;
+	int bloom_keys_nr;
+
 	/*
 	 * The bloom filter settings used to generate the key.
 	 * This is loaded from the commit-graph being used.
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index e3e4badd4c..d7dd717347 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -146,7 +146,7 @@ test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
 
 test_bloom_filters_used_when_some_filters_are_missing () {
 	log_args=$1
-	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":8,\"definitely_not\":6"
+	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":8"
 	setup "$log_args" &&
 	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
 	test_cmp log_wo_bloom log_w_bloom
-- 
gitgitgadget


* Re: [PATCH v2 10/11] commit-graph: check all leading directories in changed path Bloom filters
  2020-06-26  6:34         ` SZEDER Gábor
@ 2020-06-26 14:42           ` Derrick Stolee
  0 siblings, 0 replies; 71+ messages in thread
From: Derrick Stolee @ 2020-06-26 14:42 UTC (permalink / raw)
  To: SZEDER Gábor
  Cc: René Scharfe, SZEDER Gábor via GitGitGadget, git, me,
	Derrick Stolee

On 6/26/2020 2:34 AM, SZEDER Gábor wrote:
> On Thu, Jun 25, 2020 at 11:05:04AM -0400, Derrick Stolee wrote:
> 
>>>> +	while (p > path) {
>>>> +		if (is_dir_sep(*p))
>>>> +			fill_bloom_key(path, p - path,
>>>> +				       &revs->bloom_keys[path_component_nr++],
>>>> +				       revs->bloom_filter_settings);
>>>> +		p--;
>>>> +	}
>>>
>>> This walks the directory hierarchy upwards and adds bloom filters for
>>> shorter and shorter paths, ("deepest first").  Good.
>>>
>>> And it supports all directory separators.  On Windows that would be
>>> slash (/) and backslash (\).  I assume paths are normalized to use
>>> only slashes when bloom filters are written, correct?  Then the lookup
>>> side needs to normalize a given path to only use slashes as well,
>>> otherwise paths with backslashes cannot be found.  This part seems to
>>> be missing.
>>
>> Yes, that's a good point. We _require_ the paths to be normalized
>> here to be Unix-style paths or else the Bloom filter keys are
>> incorrect. Thankfully, they are.
> 
> Unfortunately, they aren't always...
> 
> Path normalization is done in normalize_path_copy_len(), whose
> description says, among other things:
> 
>    * Performs the following normalizations on src, storing the result in dst:
>    * - Ensures that components are separated by '/' (Windows only)
> 
> and the code indeed does:
> 
>         if (is_dir_sep(c)) {
>                 *dst++ = '/';
> 
> Now, while parsing pathspecs this function is called via:
> 
>   parse_pathspec()
>     init_pathspec_item()
>       prefix_path_gently()
>         normalize_path_copy_len()
> 
> Unfortunately, init_pathspec_item() has this chain of conditions:
> 
>         /* Create match string which will be used for pathspec matching */
>         if (pathspec_prefix >= 0) {
>                 match = xstrdup(copyfrom);
>                 prefixlen = pathspec_prefix;
>         } else if (magic & PATHSPEC_FROMTOP) {
>                 match = xstrdup(copyfrom);
>                 prefixlen = 0;
>         } else {
>                 match = prefix_path_gently(prefix, prefixlen,
>                                            &prefixlen, copyfrom);
>                 if (!match) {
>                         const char *hint_path = get_git_work_tree();
>                         if (!hint_path)
>                                 hint_path = get_git_dir();
>                         die(_("%s: '%s' is outside repository at '%s'"), elt,
>                             copyfrom, absolute_path(hint_path));
>                 }
>         }
> 
> which means that it doesn't always call prefix_path_gently(), which,
> in turn, means that 'pathspec_item->match' might remain un-normalized
> in case of some unusual pathspecs.
> 
> The first condition is supposed to handle the case when one Git
> process passes pathspecs to another, and is supposed to be "internal
> use only"; see 233c3e6c59 (parse_pathspec: preserve prefix length via
> PATHSPEC_PREFIX_ORIGIN, 2013-07-14), I haven't even tried to grok what
> that might entail.
> 
> The second condition handles pathspecs explicitly relative to the root
> of the work tree, i.e. ':/path'.  Adding a printf() to show the
> original path and the resulting 'pathspec_item->match' does confirm
> that no normalization is performed:
> 
>   expecting success of 9999.1 'test': 
>           mkdir -p dir &&
>           >dir/file &&
>           git add ":/dir/file" &&
>           git add ":(top)dir/file" &&
>           test_might_fail git add ":/dir//file" &&
>           git add ":(top)dir//file"
>   
>   orig:  ':/dir/file'
>   match: 'dir/file'
>   orig:  ':(top)dir/file'
>   match: 'dir/file'
>   orig:  ':/dir//file'
>   match: 'dir//file'
>   fatal: oops in prep_exclude
>   orig:  ':(top)dir//file'
>   match: 'dir//file'
>   fatal: oops in prep_exclude
>   not ok 1 - test
> 
> This is, of course, bad for Bloom filters, because the repeated
> slashes are hashed as well and commits will be omitted from the output
> of pathspec-limited revision walks, but apparently it also affects
> other parts of Git.
> 
> And the else branch handles the rest, which, I believe, is by far the
> most common case.

Thanks for this analysis. Clearly, there is already a bug here
when the input data is not pristine. I didn't see this message
when I submitted my v3, but normalizing the path data before
computing filters can (hopefully) be done as a small patch
before or after my v3 PATCH 10 without much conflict.
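
As a rough illustration of the kind of normalization that would be
needed for the 'dir//file' case above, here is a minimal sketch. It is
not Git's normalize_path_copy_len(); collapse_dir_seps() is a
hypothetical helper invented for this example, and it only collapses
repeated '/' characters (it does not handle backslashes, "." or ".."):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of Git: copy src to dst while collapsing
 * every run of '/' into a single '/'. dst must have room for
 * strlen(src) + 1 bytes. Returns the length of the result.
 */
static size_t collapse_dir_seps(const char *src, char *dst)
{
	size_t n = 0;
	int prev_was_sep = 0;

	assert(src && dst);
	for (; *src; src++) {
		int is_sep = (*src == '/');

		if (is_sep && prev_was_sep)
			continue; /* skip the repeated separator */
		dst[n++] = *src;
		prev_was_sep = is_sep;
	}
	dst[n] = '\0';
	return n;
}
```

With such a step applied before hashing, 'dir//file' and 'dir/file'
would produce the same Bloom key, which is the property the lookup
side depends on.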

Thanks,
-Stolee


* Re: [PATCH v3 03/10] bloom: fix logic in get_bloom_filter()
  2020-06-26 12:30     ` [PATCH v3 03/10] bloom: fix logic in get_bloom_filter() Derrick Stolee via GitGitGadget
@ 2020-06-27 16:33       ` SZEDER Gábor
  2020-06-29 13:02         ` Derrick Stolee
  0 siblings, 1 reply; 71+ messages in thread
From: SZEDER Gábor @ 2020-06-27 16:33 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, me, l.s.r, Derrick Stolee

On Fri, Jun 26, 2020 at 12:30:29PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> The get_bloom_filter() method is a bit complicated in some parts where
> it does not need to be. In particular, it needs to return a NULL filter
> only when compute_if_not_present is zero AND the filter data cannot be
> loaded from a commit-graph file. This currently happens by accident
> because the commit-graph does not load changed-path Bloom filters from
> an existing commit-graph when writing a new one. This will change in a
> later patch.
> 
> Also clean up some style issues while we are here.
> 
> Helped-by: René Scharfe <l.s.r@web.de>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  bloom.c | 14 ++++++--------
>  1 file changed, 6 insertions(+), 8 deletions(-)
> 
> diff --git a/bloom.c b/bloom.c
> index c38d1cff0c..2af5389795 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -186,7 +186,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  	struct diff_options diffopt;
>  	int max_changes = 512;
>  
> -	if (bloom_filters.slab_size == 0)
> +	if (!bloom_filters.slab_size)
>  		return NULL;
>
>  	filter = bloom_filter_slab_at(&bloom_filters, c);
> @@ -194,16 +194,14 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  	if (!filter->data) {
>  		load_commit_graph_info(r, c);
>  		if (c->graph_pos != COMMIT_NOT_FROM_GRAPH &&
> -			r->objects->commit_graph->chunk_bloom_indexes) {
> -			if (load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
> -				return filter;
> -			else
> -				return NULL;
> -		}
> +		    r->objects->commit_graph->chunk_bloom_indexes)
> +			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
>  	}
>  
> -	if (filter->data || !compute_if_not_present)
> +	if (filter->data)
>  		return filter;
> +	if (!compute_if_not_present)
> +		return NULL;

Some callers of get_bloom_filter() invoke it with
compute_if_not_present=0, but are not prepared to handle a NULL return
value and dereference it right away:

  write_graph_chunk_bloom_indexes():

                struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
                cur_pos += filter->len;

  write_graph_chunk_bloom_data():

                struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
                display_progress(progress, ++i);
                hashwrite(f, filter->data, filter->len * sizeof(unsigned char));

I don't know whether this was an issue before, but I didn't really
try.  Unfortunately, starting with this patch this causes
segmentation faults basically in all real repositories I use for
testing.

  expecting success of 9999.1 'test': 
          for i in $(test_seq 1 513)
          do
                  >file-$i || return 1
          done &&
          git add file-* &&
          git commit -q -m one &&
  
          git commit-graph write --reachable --changed-paths
  
  Segmentation fault
  not ok 1 - test


  Program received signal SIGSEGV, Segmentation fault.
  0x0000000000515848 in write_graph_chunk_bloom_indexes (f=0x9fe650, 
      ctx=0x9d2000) at commit-graph.c:1101
  1101                    cur_pos += filter->len;
  (gdb) print filter
  $1 = (struct bloom_filter *) 0x0



>  	repo_diff_setup(r, &diffopt);
>  	diffopt.flags.recursive = 1;
> -- 
> gitgitgadget
> 


* Re: [PATCH v3 03/10] bloom: fix logic in get_bloom_filter()
  2020-06-27 16:33       ` SZEDER Gábor
@ 2020-06-29 13:02         ` Derrick Stolee
  0 siblings, 0 replies; 71+ messages in thread
From: Derrick Stolee @ 2020-06-29 13:02 UTC (permalink / raw)
  To: SZEDER Gábor, Derrick Stolee via GitGitGadget
  Cc: git, me, l.s.r, Derrick Stolee

On 6/27/2020 12:33 PM, SZEDER Gábor wrote:
> On Fri, Jun 26, 2020 at 12:30:29PM +0000, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> The get_bloom_filter() method is a bit complicated in some parts where
>> it does not need to be. In particular, it needs to return a NULL filter
>> only when compute_if_not_present is zero AND the filter data cannot be
>> loaded from a commit-graph file. This currently happens by accident
>> because the commit-graph does not load changed-path Bloom filters from
>> an existing commit-graph when writing a new one. This will change in a
>> later patch.
>>
>> Also clean up some style issues while we are here.
>>
>> Helped-by: René Scharfe <l.s.r@web.de>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>  bloom.c | 14 ++++++--------
>>  1 file changed, 6 insertions(+), 8 deletions(-)
>>
>> diff --git a/bloom.c b/bloom.c
>> index c38d1cff0c..2af5389795 100644
>> --- a/bloom.c
>> +++ b/bloom.c
>> @@ -186,7 +186,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>>  	struct diff_options diffopt;
>>  	int max_changes = 512;
>>  
>> -	if (bloom_filters.slab_size == 0)
>> +	if (!bloom_filters.slab_size)
>>  		return NULL;
>>
>>  	filter = bloom_filter_slab_at(&bloom_filters, c);
>> @@ -194,16 +194,14 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>>  	if (!filter->data) {
>>  		load_commit_graph_info(r, c);
>>  		if (c->graph_pos != COMMIT_NOT_FROM_GRAPH &&
>> -			r->objects->commit_graph->chunk_bloom_indexes) {
>> -			if (load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
>> -				return filter;
>> -			else
>> -				return NULL;
>> -		}
>> +		    r->objects->commit_graph->chunk_bloom_indexes)
>> +			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
>>  	}
>>  
>> -	if (filter->data || !compute_if_not_present)
>> +	if (filter->data)
>>  		return filter;
>> +	if (!compute_if_not_present)
>> +		return NULL;
> 
> Some callers of get_bloom_filter() invoke it with
> compute_if_not_present=0, but are not prepared to handle a NULL return
> value and dereference it right away:
> 
>   write_graph_chunk_bloom_indexes():
> 
>                 struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
>                 cur_pos += filter->len;
> 
>   write_graph_chunk_bloom_data():
> 
>                 struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
>                 display_progress(progress, ++i);
>                 hashwrite(f, filter->data, filter->len * sizeof(unsigned char));

In theory, these _should_ be safe, because we already computed
the filters in an earlier step, right? We should have generated
the filter and populated it in the slab.

> I don't know whether this was an issue before, but I didn't really
> try.  Unfortunately, starting with this patch this causes
> segmentation faults basically in all real repositories I use for
> testing.
> 
>   expecting success of 9999.1 'test': 
>           for i in $(test_seq 1 513)
>           do
>                   >file-$i || return 1
>           done &&
>           git add file-* &&
>           git commit -q -m one &&
>   
>           git commit-graph write --reachable --changed-paths
>   
>   Segmentation fault
>   not ok 1 - test

However, you are demonstrating a failure that doesn't appear
in our test suite. I was able to reproduce it.

I can confirm that this patch causes a SIGSEGV when writing
the commit-graph in the Git repository, too.

So, what is wrong with my earlier assumption? There are
two problems.

The first is that an empty filter (no changes
with respect to the first parent) has a NULL
filter->data, so we return NULL instead of a
correctly-empty filter (with len zero).

The second, which is what you are hitting here, is the
max number of changes limit. That case also returns a NULL
filter, because we mark the filter as "TOO LARGE" to store
and store it as a zero-length filter.

The following fixup corrects the bug and adds a test
similar to yours, but with extra care around ensuring the
revision walk still works appropriately for that large
commit.
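
The core idea of the writer-side change can be sketched in isolation.
This is only an illustration under simplifying assumptions: the struct
below is a stand-in for Git's struct bloom_filter, and both helper
names are hypothetical. It shows a NULL filter being treated as a
zero-length one while the cumulative index offsets are computed:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for Git's struct bloom_filter (illustration only). */
struct sketch_bloom_filter {
	unsigned char *data;
	size_t len;
};

/* Hypothetical helper: treat a missing (NULL) filter as zero-length. */
static size_t sketch_filter_len(const struct sketch_bloom_filter *filter)
{
	return filter ? filter->len : 0;
}

/*
 * Compute the cumulative end offsets that an index chunk would record,
 * one per filter, tolerating NULL entries. Returns the total size.
 */
static size_t sketch_fill_offsets(const struct sketch_bloom_filter **filters,
				  size_t nr, size_t *offsets)
{
	size_t cur_pos = 0;
	size_t i;

	assert(filters && offsets);
	for (i = 0; i < nr; i++) {
		cur_pos += sketch_filter_len(filters[i]);
		offsets[i] = cur_pos;
	}
	return cur_pos;
}
```

A NULL entry simply repeats the previous offset, so a reader of the
index sees a zero-length filter for that commit.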

In the next version, I will include more in the commit
message about these side-effect changes, especially around
the stats for zero-length filters. The trace2 message will
no longer differentiate between zero-length filters and
NULL filters.

Thanks,
-Stolee

-- >8 --

From f9867adc5de8a072f41b91fd6cd87edfcc92e05e Mon Sep 17 00:00:00 2001
From: Derrick Stolee <dstolee@microsoft.com>
Date: Mon, 29 Jun 2020 08:52:33 -0400
Subject: [PATCH] fixup! bloom: fix logic in get_bloom_filter()

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c       |  8 ++++++--
 revision.c           |  7 -------
 t/t4216-log-bloom.sh | 24 ++++++++++++++++++++++--
 3 files changed, 28 insertions(+), 11 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index a0766a86f5..6752916c1a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1108,7 +1108,8 @@ static int write_graph_chunk_bloom_indexes(struct hashfile *f,
 
 	while (list < last) {
 		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
-		cur_pos += filter->len;
+		size_t len = filter ? filter->len : 0;
+		cur_pos += len;
 		display_progress(progress, ++i);
 		hashwrite_be32(f, cur_pos);
 		list++;
@@ -1154,8 +1155,11 @@ static int write_graph_chunk_bloom_data(struct hashfile *f,
 
 	while (list < last) {
 		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
+		size_t len = filter ? filter->len : 0;
 		display_progress(progress, ++i);
-		hashwrite(f, filter->data, filter->len * sizeof(unsigned char));
+
+		if (len)
+			hashwrite(f, filter->data, len * sizeof(unsigned char));
 		list++;
 	}
 
diff --git a/revision.c b/revision.c
index b40bc5b51b..b9118001f9 100644
--- a/revision.c
+++ b/revision.c
@@ -633,7 +633,6 @@ static unsigned int count_bloom_filter_maybe;
 static unsigned int count_bloom_filter_definitely_not;
 static unsigned int count_bloom_filter_false_positive;
 static unsigned int count_bloom_filter_not_present;
-static unsigned int count_bloom_filter_length_zero;
 
 static void trace2_bloom_filter_statistics_atexit(void)
 {
@@ -641,7 +640,6 @@ static void trace2_bloom_filter_statistics_atexit(void)
 
 	jw_object_begin(&jw, 0);
 	jw_object_intmax(&jw, "filter_not_present", count_bloom_filter_not_present);
-	jw_object_intmax(&jw, "zero_length_filter", count_bloom_filter_length_zero);
 	jw_object_intmax(&jw, "maybe", count_bloom_filter_maybe);
 	jw_object_intmax(&jw, "definitely_not", count_bloom_filter_definitely_not);
 	jw_object_intmax(&jw, "false_positive", count_bloom_filter_false_positive);
@@ -765,11 +763,6 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 		return -1;
 	}
 
-	if (!filter->len) {
-		count_bloom_filter_length_zero++;
-		return -1;
-	}
-
 	for (j = 0; result && j < revs->bloom_keys_nr; j++) {
 		result = bloom_filter_contains(filter,
 					       &revs->bloom_keys[j],
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index d7dd717347..4892364e74 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -60,7 +60,7 @@ setup () {
 
 test_bloom_filters_used () {
 	log_args=$1
-	bloom_trace_prefix="statistics:{\"filter_not_present\":0,\"zero_length_filter\":0,\"maybe\""
+	bloom_trace_prefix="statistics:{\"filter_not_present\":0,\"maybe\""
 	setup "$log_args" &&
 	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
 	test_cmp log_wo_bloom log_w_bloom &&
@@ -146,7 +146,7 @@ test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
 
 test_bloom_filters_used_when_some_filters_are_missing () {
 	log_args=$1
-	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":8"
+	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"maybe\":6,\"definitely_not\":8"
 	setup "$log_args" &&
 	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
 	test_cmp log_wo_bloom log_w_bloom
@@ -171,4 +171,24 @@ test_expect_success 'persist filter settings' '
 	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2-auto.txt
 '
 
+test_expect_success 'correctly report changes over limit' '
+	git init 513changes &&
+	(
+		cd 513changes &&
+		for i in $(test_seq 1 513)
+		do
+			echo $i >file$i.txt || return 1
+		done &&
+		git add . &&
+		git commit -m "files" &&
+		git commit-graph write --reachable --changed-paths &&
+		for i in $(test_seq 1 513)
+		do
+			git -c core.commitGraph=false log -- file$i.txt >expect &&
+			git log -- file$i.txt >actual &&
+			test_cmp expect actual || return 1
+		done
+	)
+'
+
 test_done
\ No newline at end of file
-- 
2.27.0.203.gf402ea6816



* [PATCH v4 00/10] More commit-graph/Bloom filter improvements
  2020-06-26 12:30   ` [PATCH v3 00/10] " Derrick Stolee via GitGitGadget
                       ` (9 preceding siblings ...)
  2020-06-26 12:30     ` [PATCH v3 10/10] commit-graph: check all leading directories in changed path " SZEDER Gábor via GitGitGadget
@ 2020-07-01 13:27     ` Derrick Stolee via GitGitGadget
  2020-07-01 13:27       ` [PATCH v4 01/10] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
                         ` (9 more replies)
  10 siblings, 10 replies; 71+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-01 13:27 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee

This builds on sg/commit-graph-cleanups, which took several patches from
Szeder's series [1] and applied them almost directly to a more-recent
version of Git [2].

[1] https://lore.kernel.org/git/20200529085038.26008-1-szeder.dev@gmail.com/
[2] https://lore.kernel.org/git/pull.650.git.1591362032.gitgitgadget@gmail.com/

This series adds a few extra improvements, several of which are rooted in
Szeder's original series. I maintained his authorship and sign-off, even
though the patches did not apply or cherry-pick at all.

(In v2, I have removed the range-diff comparison to Szeder's series, so look
at the v1 cover letter for that.)

The patches have been significantly reordered. René pointed out (and Szeder
discovered in the old thread) that we are not re-using the
bloom_filter_settings from the existing commit-graph when writing a new one.

 1. commit-graph: place bloom_settings in context
 2. commit-graph: change test to die on parse, not load

These are mostly the same, except we now use a pointer to the settings in
the commit-graph write context.

 3. bloom: get_bloom_filter() cleanups

This new patch is a subtle change in behavior that will become relevant in
the very next patch. In fact, if we swap patch 3 and 4, then
t4216-log-bloom.sh fails with a segfault due to a NULL filter.

 4. commit-graph: persist existence of changed-paths

This patch is now updated to use the existing changed-path filter settings.

 5. commit-graph: unify the signatures of all write_graph_chunk_*()
    functions
 6. commit-graph: simplify chunk writes into loop
 7. commit-graph: check chunk sizes after writing

These are all the same as before.

 8. revision.c: fix whitespace

This patch is the cleanup part of Taylor's patch.

 9. revision: empty pathspecs should not use Bloom filters

Here is Taylor's fix for empty pathspecs.

 10. commit-graph: check all leading directories in changed path Bloom
     filters
 11. bloom: enforce a minimum size of 8 bytes

Finally, we get these performance patches. Patch 10 is updated to have the
better logic around directory separators and empty paths. Also, the list of
Bloom keys is ordered with the deepest path first. That has some tiny
performance benefits for deep paths since we can short-circuit the multi-key
checks more often. That code path is much faster than the tree parsing, so
it is hard to measure any change.
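
To make the "deepest path first" ordering concrete, here is a small
sketch of how the prefix lengths could be enumerated. It is not the
code from Patch 10; leading_prefix_lens() is a hypothetical helper, and
it assumes the path is already normalized to use '/' separators:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: record the length of the full path followed by the
 * length of each leading directory, deepest first. For "a/b/c" this yields
 * 5 ("a/b/c"), 3 ("a/b"), 1 ("a"). Assumes a normalized path using '/'.
 * Returns the number of lengths written (at most max).
 */
static size_t leading_prefix_lens(const char *path, size_t *lens, size_t max)
{
	size_t len, nr = 0;
	const char *p;

	assert(path && lens);
	len = strlen(path);
	if (!len || !max)
		return 0;

	lens[nr++] = len;
	/* Walk backwards so deeper directories are recorded first. */
	for (p = path + len - 1; p > path && nr < max; p--)
		if (*p == '/')
			lens[nr++] = (size_t)(p - path);
	return nr;
}
```

Each recorded length would correspond to one Bloom key. A lookup can
stop at the first key the filter reports as "definitely not", since a
change to the file implies a change to every leading directory.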

Updates in V3:

 * Responded to René's feedback.
 * Fixed the test in Patch 4 to use GIT_TEST_ variables and extend the
   GIT_TRACE2 depth to work with 'seen' branch.

Updates in V4:

 * Fixed the bug with "too large" commits. Test is added. The fixup! I sent
   earlier doesn't actually squash cleanly, so I resolved the conflicts
   during the rebase.

Thanks, -Stolee

Derrick Stolee (5):
  commit-graph: place bloom_settings in context
  commit-graph: change test to die on parse, not load
  bloom: fix logic in get_bloom_filter()
  commit-graph: persist existence of changed-paths
  revision.c: fix whitespace

SZEDER Gábor (4):
  commit-graph: unify the signatures of all write_graph_chunk_*()
    functions
  commit-graph: simplify chunk writes into loop
  commit-graph: check chunk sizes after writing
  commit-graph: check all leading directories in changed path Bloom
    filters

Taylor Blau (1):
  revision: empty pathspecs should not use Bloom filters

 Documentation/git-commit-graph.txt |   5 +-
 bloom.c                            |  14 ++-
 builtin/commit-graph.c             |   5 +-
 commit-graph.c                     | 146 +++++++++++++++++++++--------
 commit-graph.h                     |   3 +-
 revision.c                         |  63 +++++++++----
 revision.h                         |   6 +-
 t/t4216-log-bloom.sh               |  45 ++++++++-
 t/t5318-commit-graph.sh            |   2 +-
 9 files changed, 215 insertions(+), 74 deletions(-)


base-commit: 7fbfe07ab4d4e58c0971dac73001b89f180a0af3
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-659%2Fderrickstolee%2Fbloom-2-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-659/derrickstolee/bloom-2-v4
Pull-Request: https://github.com/gitgitgadget/git/pull/659

Range-diff vs v3:

  1:  57002040bc =  1:  57002040bc commit-graph: place bloom_settings in context
  2:  6b63f9bd8a =  2:  6b63f9bd8a commit-graph: change test to die on parse, not load
  3:  2f809499ab !  3:  3c532ebabc bloom: fix logic in get_bloom_filter()
     @@ Commit message
      
          Also clean up some style issues while we are here.
      
     +    One side-effect of returning a NULL filter is that the filters that are
     +    reported as "too large" will now be reported as NULL instead of length
     +    zero. This case was not properly covered before, so add a test. Further,
     +    remove the counting of the zero-length filters from revision.c and the
     +    trace2 logs.
     +
          Helped-by: René Scharfe <l.s.r@web.de>
     +    Helped-by: SZEDER Gábor <szeder.dev@gmail.com>
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## bloom.c ##
     @@ bloom.c: struct bloom_filter *get_bloom_filter(struct repository *r,
       
       	repo_diff_setup(r, &diffopt);
       	diffopt.flags.recursive = 1;
     +
     + ## commit-graph.c ##
     +@@ commit-graph.c: static void write_graph_chunk_bloom_indexes(struct hashfile *f,
     + 
     + 	while (list < last) {
     + 		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
     +-		cur_pos += filter->len;
     ++		size_t len = filter ? filter->len : 0;
     ++		cur_pos += len;
     + 		display_progress(progress, ++i);
     + 		hashwrite_be32(f, cur_pos);
     + 		list++;
     +@@ commit-graph.c: static void write_graph_chunk_bloom_data(struct hashfile *f,
     + 
     + 	while (list < last) {
     + 		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
     ++		size_t len = filter ? filter->len : 0;
     + 		display_progress(progress, ++i);
     +-		hashwrite(f, filter->data, filter->len * sizeof(unsigned char));
     ++
     ++		if (len)
     ++			hashwrite(f, filter->data, len * sizeof(unsigned char));
     + 		list++;
     + 	}
     + 
     +
     + ## revision.c ##
     +@@ revision.c: static unsigned int count_bloom_filter_maybe;
     + static unsigned int count_bloom_filter_definitely_not;
     + static unsigned int count_bloom_filter_false_positive;
     + static unsigned int count_bloom_filter_not_present;
     +-static unsigned int count_bloom_filter_length_zero;
     + 
     + static void trace2_bloom_filter_statistics_atexit(void)
     + {
     +@@ revision.c: static void trace2_bloom_filter_statistics_atexit(void)
     + 
     + 	jw_object_begin(&jw, 0);
     + 	jw_object_intmax(&jw, "filter_not_present", count_bloom_filter_not_present);
     +-	jw_object_intmax(&jw, "zero_length_filter", count_bloom_filter_length_zero);
     + 	jw_object_intmax(&jw, "maybe", count_bloom_filter_maybe);
     + 	jw_object_intmax(&jw, "definitely_not", count_bloom_filter_definitely_not);
     + 	jw_object_intmax(&jw, "false_positive", count_bloom_filter_false_positive);
     +@@ revision.c: static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
     + 		return -1;
     + 	}
     + 
     +-	if (!filter->len) {
     +-		count_bloom_filter_length_zero++;
     +-		return -1;
     +-	}
     +-
     + 	result = bloom_filter_contains(filter,
     + 				       revs->bloom_key,
     + 				       revs->bloom_filter_settings);
     +
     + ## t/t4216-log-bloom.sh ##
     +@@ t/t4216-log-bloom.sh: setup () {
     + 
     + test_bloom_filters_used () {
     + 	log_args=$1
     +-	bloom_trace_prefix="statistics:{\"filter_not_present\":0,\"zero_length_filter\":0,\"maybe\""
     ++	bloom_trace_prefix="statistics:{\"filter_not_present\":0,\"maybe\""
     + 	setup "$log_args" &&
     + 	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
     + 	test_cmp log_wo_bloom log_w_bloom &&
     +@@ t/t4216-log-bloom.sh: test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
     + 
     + test_bloom_filters_used_when_some_filters_are_missing () {
     + 	log_args=$1
     +-	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":8,\"definitely_not\":6"
     ++	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"maybe\":8,\"definitely_not\":6"
     + 	setup "$log_args" &&
     + 	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
     + 	test_cmp log_wo_bloom log_w_bloom
     +@@ t/t4216-log-bloom.sh: test_expect_success 'Use Bloom filters if they exist in the latest but not all c
     + 	test_bloom_filters_used_when_some_filters_are_missing "-- A/B"
     + '
     + 
     ++test_expect_success 'correctly report changes over limit' '
     ++	git init 513changes &&
     ++	(
     ++		cd 513changes &&
     ++		for i in $(test_seq 1 513)
     ++		do
     ++			echo $i >file$i.txt || return 1
     ++		done &&
     ++		git add . &&
     ++		git commit -m "files" &&
     ++		git commit-graph write --reachable --changed-paths &&
     ++		for i in $(test_seq 1 513)
     ++		do
     ++			git -c core.commitGraph=false log -- file$i.txt >expect &&
     ++			git log -- file$i.txt >actual &&
     ++			test_cmp expect actual || return 1
     ++		done
     ++	)
     ++'
     ++
     + test_done
     + \ No newline at end of file
  4:  33e22d05cb !  4:  f1e3a8516e commit-graph: persist existence of changed-paths
     @@ t/t4216-log-bloom.sh: test_expect_success 'Use Bloom filters if they exist in th
      +	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2-auto.txt
      +'
      +
     - test_done
     - \ No newline at end of file
     + test_expect_success 'correctly report changes over limit' '
     + 	git init 513changes &&
     + 	(
  5:  81c45d5260 =  5:  c079921473 commit-graph: unify the signatures of all write_graph_chunk_*() functions
  6:  8828dcd906 =  6:  5ed0ce20a4 commit-graph: simplify chunk writes into loop
  7:  ddbf297755 =  7:  b982c9bf80 commit-graph: check chunk sizes after writing
  8:  8b63706141 =  8:  af750d8887 revision.c: fix whitespace
  9:  7d6163305a =  9:  a95de3cceb revision: empty pathspecs should not use Bloom filters
 10:  40061233ca ! 10:  9c4a00ab08 commit-graph: check all leading directories in changed path Bloom filters
     @@ t/t4216-log-bloom.sh: test_expect_success 'setup - add commit-graph to the chain
       
       test_bloom_filters_used_when_some_filters_are_missing () {
       	log_args=$1
     --	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":8,\"definitely_not\":6"
     -+	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":8"
     +-	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"maybe\":8,\"definitely_not\":6"
     ++	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"maybe\":6,\"definitely_not\":8"
       	setup "$log_args" &&
       	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
       	test_cmp log_wo_bloom log_w_bloom

-- 
gitgitgadget


* [PATCH v4 01/10] commit-graph: place bloom_settings in context
  2020-07-01 13:27     ` [PATCH v4 00/10] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
@ 2020-07-01 13:27       ` Derrick Stolee via GitGitGadget
  2020-07-01 13:27       ` [PATCH v4 02/10] commit-graph: change test to die on parse, not load Derrick Stolee via GitGitGadget
                         ` (8 subsequent siblings)
  9 siblings, 0 replies; 71+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-01 13:27 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Place an instance of struct bloom_settings into the struct
write_commit_graph_context. This allows simplifying the function
prototype of write_graph_chunk_bloom_data(). This will allow us
to combine the function prototypes and use function pointers to
simplify write_commit_graph_file().

By using a pointer, we can later replace the settings to match those
that exist in the current commit-graph, in case a future Git version
allows customization of these parameters.

Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 887837e882..d0fedcd9b1 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -882,6 +882,7 @@ struct write_commit_graph_context {
 
 	const struct split_commit_graph_opts *split_opts;
 	size_t total_bloom_filter_data_size;
+	const struct bloom_filter_settings *bloom_settings;
 };
 
 static void write_graph_chunk_fanout(struct hashfile *f,
@@ -1103,8 +1104,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
 }
 
 static void write_graph_chunk_bloom_data(struct hashfile *f,
-					 struct write_commit_graph_context *ctx,
-					 const struct bloom_filter_settings *settings)
+					 struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
@@ -1116,9 +1116,9 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
 			_("Writing changed paths Bloom filters data"),
 			ctx->commits.nr);
 
-	hashwrite_be32(f, settings->hash_version);
-	hashwrite_be32(f, settings->num_hashes);
-	hashwrite_be32(f, settings->bits_per_entry);
+	hashwrite_be32(f, ctx->bloom_settings->hash_version);
+	hashwrite_be32(f, ctx->bloom_settings->num_hashes);
+	hashwrite_be32(f, ctx->bloom_settings->bits_per_entry);
 
 	while (list < last) {
 		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
@@ -1541,6 +1541,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	struct object_id file_hash;
 	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 
+	ctx->bloom_settings = &bloom_settings;
+
 	if (ctx->split) {
 		struct strbuf tmp_file = STRBUF_INIT;
 
@@ -1642,7 +1644,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		write_graph_chunk_extra_edges(f, ctx);
 	if (ctx->changed_paths) {
 		write_graph_chunk_bloom_indexes(f, ctx);
-		write_graph_chunk_bloom_data(f, ctx, &bloom_settings);
+		write_graph_chunk_bloom_data(f, ctx);
 	}
 	if (ctx->num_commit_graphs_after > 1 &&
 	    write_graph_chunk_base(f, ctx)) {
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 71+ messages in thread
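
As an illustration of patch 01 above, here is a minimal, self-contained
sketch of a writer that reads its Bloom settings through a pointer held in
the write context rather than through an extra function argument. The
names write_context_sketch, put_be32 and write_bloom_header are
hypothetical stand-ins, not Git's own code, and the structs carry only the
fields needed here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for Git's settings struct; only the fields used here. */
struct bloom_filter_settings {
	uint32_t hash_version;
	uint32_t num_hashes;
	uint32_t bits_per_entry;
};

/* Hypothetical, reduced write context. */
struct write_context_sketch {
	/* A pointer, so it can later be aimed at an existing graph's settings. */
	const struct bloom_filter_settings *bloom_settings;
};

/* Store a 32-bit value in big-endian byte order. */
static void put_be32(unsigned char *out, uint32_t v)
{
	out[0] = (unsigned char)(v >> 24);
	out[1] = (unsigned char)(v >> 16);
	out[2] = (unsigned char)(v >> 8);
	out[3] = (unsigned char)v;
}

/* Write the three-word settings header; returns the number of bytes written. */
static size_t write_bloom_header(unsigned char *out,
				 const struct write_context_sketch *ctx)
{
	put_be32(out, ctx->bloom_settings->hash_version);
	put_be32(out + 4, ctx->bloom_settings->num_hashes);
	put_be32(out + 8, ctx->bloom_settings->bits_per_entry);
	return 12;
}
```

Because the writer takes only the context, it can share a prototype with
the other chunk writers, which is what the later patches in the series
rely on.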

* [PATCH v4 02/10] commit-graph: change test to die on parse, not load
  2020-07-01 13:27     ` [PATCH v4 00/10] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
  2020-07-01 13:27       ` [PATCH v4 01/10] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
@ 2020-07-01 13:27       ` Derrick Stolee via GitGitGadget
  2020-07-01 13:27       ` [PATCH v4 03/10] bloom: fix logic in get_bloom_filter() Derrick Stolee via GitGitGadget
                         ` (7 subsequent siblings)
  9 siblings, 0 replies; 71+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-01 13:27 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

43d3561 (commit-graph write: don't die if the existing graph is corrupt,
2019-03-25) introduced the GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD environment
variable. This was created to verify that commit-graph was not loaded
when writing a new non-incremental commit-graph.

An upcoming change wants to load a commit-graph in some valuable cases,
but we want to maintain that we don't trust the commit-graph data when
writing our new file. Instead of dying on load, die if we ever try to
parse a commit from the commit-graph. This functionally verifies the
same intended behavior, but allows a more advanced feature in the next
change.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c          | 12 ++++++++----
 commit-graph.h          |  2 +-
 t/t5318-commit-graph.sh |  2 +-
 3 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index d0fedcd9b1..6a28d4a5a6 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -564,10 +564,6 @@ static int prepare_commit_graph(struct repository *r)
 		return !!r->objects->commit_graph;
 	r->objects->commit_graph_attempted = 1;
 
-	if (git_env_bool(GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD, 0))
-		die("dying as requested by the '%s' variable on commit-graph load!",
-		    GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD);
-
 	prepare_repo_settings(r);
 
 	if (!git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) &&
@@ -790,6 +786,14 @@ static int parse_commit_in_graph_one(struct repository *r,
 
 int parse_commit_in_graph(struct repository *r, struct commit *item)
 {
+	static int checked_env = 0;
+
+	if (!checked_env &&
+	    git_env_bool(GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE, 0))
+		die("dying as requested by the '%s' variable on commit-graph parse!",
+		    GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE);
+	checked_env = 1;
+
 	if (!prepare_commit_graph(r))
 		return 0;
 	return parse_commit_in_graph_one(r, r->objects->commit_graph, item);
diff --git a/commit-graph.h b/commit-graph.h
index 881c9b46e5..f0fb13e3f2 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -5,7 +5,7 @@
 #include "object-store.h"
 
 #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
-#define GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD"
+#define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
 #define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
 
 /*
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 1073f9e3cf..5ec01abdaa 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -436,7 +436,7 @@ corrupt_graph_verify() {
 		cp $objdir/info/commit-graph commit-graph-pre-write-test
 	fi &&
 	git status --short &&
-	GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD=true git commit-graph write &&
+	GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE=true git commit-graph write &&
 	git commit-graph verify
 }
 
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 71+ messages in thread
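
The check-once pattern used in patch 02 above can be sketched in isolation
as follows. This is only a sketch under stated assumptions: is_truthy() is
a hypothetical, much simpler replacement for git_env_bool(), and the
"already checked" state is passed in explicitly instead of living in a
function-local static, so the behavior is easy to exercise.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical simplification: only "true" and "1" count as enabled. */
static int is_truthy(const char *value)
{
	return value && (!strcmp(value, "true") || !strcmp(value, "1"));
}

/*
 * Returns 1 when the caller should abort, 0 otherwise. The value is
 * consulted only while *checked_env is still zero; once the flag is
 * set, later calls return 0 without looking at env_value.
 */
static int die_requested_on_parse(int *checked_env, const char *env_value)
{
	int requested = !*checked_env && is_truthy(env_value);

	*checked_env = 1;
	return requested;
}
```

A caller would pass the result of
getenv("GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE") as env_value and die when the
function returns 1, so the environment is inspected at most once per
process.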

* [PATCH v4 03/10] bloom: fix logic in get_bloom_filter()
  2020-07-01 13:27     ` [PATCH v4 00/10] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
  2020-07-01 13:27       ` [PATCH v4 01/10] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
  2020-07-01 13:27       ` [PATCH v4 02/10] commit-graph: change test to die on parse, not load Derrick Stolee via GitGitGadget
@ 2020-07-01 13:27       ` Derrick Stolee via GitGitGadget
  2020-07-01 13:27       ` [PATCH v4 04/10] commit-graph: persist existence of changed-paths Derrick Stolee via GitGitGadget
                         ` (6 subsequent siblings)
  9 siblings, 0 replies; 71+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-01 13:27 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The get_bloom_filter() method is a bit complicated in some parts where
it does not need to be. In particular, it needs to return a NULL filter
only when compute_if_not_present is zero AND the filter data cannot be
loaded from a commit-graph file. This currently happens by accident
because the commit-graph does not load changed-path Bloom filters from
an existing commit-graph when writing a new one. This will change in a
later patch.

Also clean up some style issues while we are here.

One side-effect of returning a NULL filter is that the filters that are
reported as "too large" will now be reported as NULL instead of length
zero. This case was not properly covered before, so add a test. Further,
remove the counting of the zero-length filters from revision.c and the
trace2 logs.

Helped-by: René Scharfe <l.s.r@web.de>
Helped-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 bloom.c              | 14 ++++++--------
 commit-graph.c       |  8 ++++++--
 revision.c           |  7 -------
 t/t4216-log-bloom.sh | 24 ++++++++++++++++++++++--
 4 files changed, 34 insertions(+), 19 deletions(-)

diff --git a/bloom.c b/bloom.c
index c38d1cff0c..2af5389795 100644
--- a/bloom.c
+++ b/bloom.c
@@ -186,7 +186,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	struct diff_options diffopt;
 	int max_changes = 512;
 
-	if (bloom_filters.slab_size == 0)
+	if (!bloom_filters.slab_size)
 		return NULL;
 
 	filter = bloom_filter_slab_at(&bloom_filters, c);
@@ -194,16 +194,14 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	if (!filter->data) {
 		load_commit_graph_info(r, c);
 		if (c->graph_pos != COMMIT_NOT_FROM_GRAPH &&
-			r->objects->commit_graph->chunk_bloom_indexes) {
-			if (load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
-				return filter;
-			else
-				return NULL;
-		}
+		    r->objects->commit_graph->chunk_bloom_indexes)
+			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
 	}
 
-	if (filter->data || !compute_if_not_present)
+	if (filter->data)
 		return filter;
+	if (!compute_if_not_present)
+		return NULL;
 
 	repo_diff_setup(r, &diffopt);
 	diffopt.flags.recursive = 1;
diff --git a/commit-graph.c b/commit-graph.c
index 6a28d4a5a6..50ce039a53 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1098,7 +1098,8 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
 
 	while (list < last) {
 		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
-		cur_pos += filter->len;
+		size_t len = filter ? filter->len : 0;
+		cur_pos += len;
 		display_progress(progress, ++i);
 		hashwrite_be32(f, cur_pos);
 		list++;
@@ -1126,8 +1127,11 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
 
 	while (list < last) {
 		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
+		size_t len = filter ? filter->len : 0;
 		display_progress(progress, ++i);
-		hashwrite(f, filter->data, filter->len * sizeof(unsigned char));
+
+		if (len)
+			hashwrite(f, filter->data, len * sizeof(unsigned char));
 		list++;
 	}
 
diff --git a/revision.c b/revision.c
index c644c66091..7339750af1 100644
--- a/revision.c
+++ b/revision.c
@@ -633,7 +633,6 @@ static unsigned int count_bloom_filter_maybe;
 static unsigned int count_bloom_filter_definitely_not;
 static unsigned int count_bloom_filter_false_positive;
 static unsigned int count_bloom_filter_not_present;
-static unsigned int count_bloom_filter_length_zero;
 
 static void trace2_bloom_filter_statistics_atexit(void)
 {
@@ -641,7 +640,6 @@ static void trace2_bloom_filter_statistics_atexit(void)
 
 	jw_object_begin(&jw, 0);
 	jw_object_intmax(&jw, "filter_not_present", count_bloom_filter_not_present);
-	jw_object_intmax(&jw, "zero_length_filter", count_bloom_filter_length_zero);
 	jw_object_intmax(&jw, "maybe", count_bloom_filter_maybe);
 	jw_object_intmax(&jw, "definitely_not", count_bloom_filter_definitely_not);
 	jw_object_intmax(&jw, "false_positive", count_bloom_filter_false_positive);
@@ -735,11 +733,6 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 		return -1;
 	}
 
-	if (!filter->len) {
-		count_bloom_filter_length_zero++;
-		return -1;
-	}
-
 	result = bloom_filter_contains(filter,
 				       revs->bloom_key,
 				       revs->bloom_filter_settings);
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index c7011f33e2..2761208e74 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -60,7 +60,7 @@ setup () {
 
 test_bloom_filters_used () {
 	log_args=$1
-	bloom_trace_prefix="statistics:{\"filter_not_present\":0,\"zero_length_filter\":0,\"maybe\""
+	bloom_trace_prefix="statistics:{\"filter_not_present\":0,\"maybe\""
 	setup "$log_args" &&
 	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
 	test_cmp log_wo_bloom log_w_bloom &&
@@ -142,7 +142,7 @@ test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
 
 test_bloom_filters_used_when_some_filters_are_missing () {
 	log_args=$1
-	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":8,\"definitely_not\":6"
+	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"maybe\":8,\"definitely_not\":6"
 	setup "$log_args" &&
 	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
 	test_cmp log_wo_bloom log_w_bloom
@@ -152,4 +152,24 @@ test_expect_success 'Use Bloom filters if they exist in the latest but not all c
 	test_bloom_filters_used_when_some_filters_are_missing "-- A/B"
 '
 
+test_expect_success 'correctly report changes over limit' '
+	git init 513changes &&
+	(
+		cd 513changes &&
+		for i in $(test_seq 1 513)
+		do
+			echo $i >file$i.txt || return 1
+		done &&
+		git add . &&
+		git commit -m "files" &&
+		git commit-graph write --reachable --changed-paths &&
+		for i in $(test_seq 1 513)
+		do
+			git -c core.commitGraph=false log -- file$i.txt >expect &&
+			git log -- file$i.txt >actual &&
+			test_cmp expect actual || return 1
+		done
+	)
+'
+
 test_done
\ No newline at end of file
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 71+ messages in thread
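
The control flow that patch 03 above gives get_bloom_filter() can be
sketched with a reduced model. Everything here is hypothetical and for
illustration only: get_filter_sketch() takes the "data available in the
commit-graph" case as plain arguments, and the real diff-based computation
is replaced by a one-byte placeholder.

```c
#include <stddef.h>

/* Reduced stand-in for Git's struct bloom_filter. */
struct bloom_filter {
	const unsigned char *data;
	size_t len;
};

/* Placeholder for a freshly computed filter; the real code runs a diff. */
static const unsigned char computed_placeholder[1] = { 0xff };

static struct bloom_filter *get_filter_sketch(struct bloom_filter *slot,
					      const unsigned char *graph_data,
					      size_t graph_len,
					      int compute_if_not_present)
{
	/* First try to load from the (modeled) commit-graph. */
	if (!slot->data && graph_data) {
		slot->data = graph_data;
		slot->len = graph_len;
	}

	if (slot->data)
		return slot;
	/* NULL only when nothing was loaded AND computing is not allowed. */
	if (!compute_if_not_present)
		return NULL;

	slot->data = computed_placeholder;
	slot->len = sizeof(computed_placeholder);
	return slot;
}

/* Callers that write chunks treat a NULL filter as length zero. */
static size_t filter_len_or_zero(const struct bloom_filter *filter)
{
	return filter ? filter->len : 0;
}
```

The helper filter_len_or_zero() mirrors the "filter ? filter->len : 0"
guard that the patch adds to both Bloom chunk writers.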

* [PATCH v4 04/10] commit-graph: persist existence of changed-paths
  2020-07-01 13:27     ` [PATCH v4 00/10] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
                         ` (2 preceding siblings ...)
  2020-07-01 13:27       ` [PATCH v4 03/10] bloom: fix logic in get_bloom_filter() Derrick Stolee via GitGitGadget
@ 2020-07-01 13:27       ` Derrick Stolee via GitGitGadget
  2020-07-01 13:27       ` [PATCH v4 05/10] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
                         ` (5 subsequent siblings)
  9 siblings, 0 replies; 71+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-01 13:27 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The changed-path Bloom filters were released in v2.27.0, but have a
significant drawback. A user can opt in to writing the changed-path
filters using the "--changed-paths" option to "git commit-graph write",
but the next write will drop the filters unless that option is
specified again.

This becomes even more important when considering the interaction with
gc.writeCommitGraph (on by default) or fetch.writeCommitGraph (part of
features.experimental). These config options trigger commit-graph writes
that the user did not signal, and hence there is no --changed-paths
option available.

Allow a user who opts in to the changed-path filters to persist the
property of "my commit-graph has changed-path filters" automatically. A
user can drop filters using the --no-changed-paths option.

In the process, we need to be extremely careful to match the Bloom
filter settings as specified by the commit-graph. This will allow future
versions of Git to customize these settings, and the version with this
change will persist those settings as commit-graphs are rewritten on
top.

Use the trace2 API to signal the settings used during the write, and
check that output in a test after manually adjusting the correct bytes
in the commit-graph file.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-commit-graph.txt |  5 +++-
 builtin/commit-graph.c             |  5 +++-
 commit-graph.c                     | 45 ++++++++++++++++++++++++++++--
 commit-graph.h                     |  1 +
 t/t4216-log-bloom.sh               | 17 ++++++++++-
 5 files changed, 67 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index f4b13c005b..369b222b08 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -60,7 +60,10 @@ existing commit-graph file.
 With the `--changed-paths` option, compute and write information about the
 paths changed between a commit and it's first parent. This operation can
 take a while on large repositories. It provides significant performance gains
-for getting history of a directory or a file with `git log -- <path>`.
+for getting history of a directory or a file with `git log -- <path>`. If
+this option is given, future commit-graph writes will automatically assume
+that this option was intended. Use `--no-changed-paths` to stop storing this
+data.
 +
 With the `--split` option, write the commit-graph as a chain of multiple
 commit-graph files stored in `<dir>/info/commit-graphs`. The new commits
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 59009837dc..ff7b177c33 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -151,6 +151,7 @@ static int graph_write(int argc, const char **argv)
 	};
 
 	opts.progress = isatty(2);
+	opts.enable_changed_paths = -1;
 	split_opts.size_multiple = 2;
 	split_opts.max_commits = 0;
 	split_opts.expire_time = 0;
@@ -171,7 +172,9 @@ static int graph_write(int argc, const char **argv)
 		flags |= COMMIT_GRAPH_WRITE_SPLIT;
 	if (opts.progress)
 		flags |= COMMIT_GRAPH_WRITE_PROGRESS;
-	if (opts.enable_changed_paths ||
+	if (!opts.enable_changed_paths)
+		flags |= COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS;
+	if (opts.enable_changed_paths == 1 ||
 	    git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0))
 		flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
 
diff --git a/commit-graph.c b/commit-graph.c
index 50ce039a53..6762704324 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -16,6 +16,8 @@
 #include "progress.h"
 #include "bloom.h"
 #include "commit-slab.h"
+#include "json-writer.h"
+#include "trace2.h"
 
 void git_test_write_commit_graph_or_die(void)
 {
@@ -1108,6 +1110,21 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
 	stop_progress(&progress);
 }
 
+static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
+{
+	struct json_writer jw = JSON_WRITER_INIT;
+
+	jw_object_begin(&jw, 0);
+	jw_object_intmax(&jw, "hash_version", ctx->bloom_settings->hash_version);
+	jw_object_intmax(&jw, "num_hashes", ctx->bloom_settings->num_hashes);
+	jw_object_intmax(&jw, "bits_per_entry", ctx->bloom_settings->bits_per_entry);
+	jw_end(&jw);
+
+	trace2_data_json("bloom", ctx->r, "settings", &jw);
+
+	jw_release(&jw);
+}
+
 static void write_graph_chunk_bloom_data(struct hashfile *f,
 					 struct write_commit_graph_context *ctx)
 {
@@ -1116,6 +1133,8 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
 	struct progress *progress = NULL;
 	int i = 0;
 
+	trace2_bloom_filter_settings(ctx);
+
 	if (ctx->report_progress)
 		progress = start_delayed_progress(
 			_("Writing changed paths Bloom filters data"),
@@ -1547,9 +1566,15 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	int num_chunks = 3;
 	uint64_t chunk_offset;
 	struct object_id file_hash;
-	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
+	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 
-	ctx->bloom_settings = &bloom_settings;
+	if (!ctx->bloom_settings) {
+		bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
+							      bloom_settings.bits_per_entry);
+		bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
+							  bloom_settings.num_hashes);
+		ctx->bloom_settings = &bloom_settings;
+	}
 
 	if (ctx->split) {
 		struct strbuf tmp_file = STRBUF_INIT;
@@ -1974,9 +1999,23 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
 	ctx->check_oids = flags & COMMIT_GRAPH_WRITE_CHECK_OIDS ? 1 : 0;
 	ctx->split_opts = split_opts;
-	ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
 	ctx->total_bloom_filter_data_size = 0;
 
+	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
+		ctx->changed_paths = 1;
+	if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
+		struct commit_graph *g;
+		prepare_commit_graph_one(ctx->r, ctx->odb);
+
+		g = ctx->r->objects->commit_graph;
+
+		/* We have changed-paths already. Keep them in the next graph */
+		if (g && g->chunk_bloom_data) {
+			ctx->changed_paths = 1;
+			ctx->bloom_settings = g->bloom_filter_settings;
+		}
+	}
+
 	if (ctx->split) {
 		struct commit_graph *g;
 		prepare_commit_graph(ctx->r);
diff --git a/commit-graph.h b/commit-graph.h
index f0fb13e3f2..45b1e5bca3 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -96,6 +96,7 @@ enum commit_graph_write_flags {
 	/* Make sure that each OID in the input is a valid commit OID. */
 	COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3),
 	COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4),
+	COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS = (1 << 5),
 };
 
 struct split_commit_graph_opts {
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 2761208e74..52ad998f9e 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -126,7 +126,7 @@ test_expect_success 'setup - add commit-graph to the chain without Bloom filters
 	test_commit c14 A/anotherFile2 &&
 	test_commit c15 A/B/anotherFile2 &&
 	test_commit c16 A/B/C/anotherFile2 &&
-	GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0 git commit-graph write --reachable --split &&
+	git commit-graph write --reachable --split --no-changed-paths &&
 	test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain
 '
 
@@ -152,6 +152,21 @@ test_expect_success 'Use Bloom filters if they exist in the latest but not all c
 	test_bloom_filters_used_when_some_filters_are_missing "-- A/B"
 '
 
+test_expect_success 'persist filter settings' '
+	test_when_finished rm -rf .git/objects/info/commit-graph* &&
+	rm -rf .git/objects/info/commit-graph* &&
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
+		GIT_TRACE2_EVENT_NESTING=5 \
+		GIT_TEST_BLOOM_SETTINGS_NUM_HASHES=9 \
+		GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY=15 \
+		git commit-graph write --reachable --changed-paths &&
+	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2.txt &&
+	GIT_TRACE2_EVENT="$(pwd)/trace2-auto.txt" \
+		GIT_TRACE2_EVENT_NESTING=5 \
+		git commit-graph write --reachable --changed-paths &&
+	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2-auto.txt
+'
+
 test_expect_success 'correctly report changes over limit' '
 	git init 513changes &&
 	(
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 71+ messages in thread
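
The tri-state option handling in patch 04 above (-1 for "not specified", 0
for --no-changed-paths, 1 for --changed-paths) can be sketched as two
small functions. The function names and the existing_graph_has_filters
argument are hypothetical; the flag values follow the enum in the diff,
and the branches mirror the conditions in the patch.

```c
enum sketch_write_flags {
	SKETCH_WRITE_BLOOM_FILTERS = (1 << 4),
	SKETCH_NO_WRITE_BLOOM_FILTERS = (1 << 5),
};

/*
 * enable_changed_paths: -1 = option not given, 0 = --no-changed-paths,
 * 1 = --changed-paths. test_env_enabled models the
 * GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS override.
 */
static unsigned flags_from_option(int enable_changed_paths, int test_env_enabled)
{
	unsigned flags = 0;

	if (!enable_changed_paths)
		flags |= SKETCH_NO_WRITE_BLOOM_FILTERS;
	if (enable_changed_paths == 1 || test_env_enabled)
		flags |= SKETCH_WRITE_BLOOM_FILTERS;
	return flags;
}

/* Decide whether the new graph gets changed-path filters. */
static int should_write_changed_paths(unsigned flags,
				      int existing_graph_has_filters)
{
	int changed_paths = 0;

	if (flags & SKETCH_WRITE_BLOOM_FILTERS)
		changed_paths = 1;
	/* Unless explicitly disabled, keep filters the old graph already had. */
	if (!(flags & SKETCH_NO_WRITE_BLOOM_FILTERS) && existing_graph_has_filters)
		changed_paths = 1;
	return changed_paths;
}
```

With the option left unspecified, the outcome depends only on whether the
existing commit-graph already carries filters, which is the persistence
behavior the commit message describes.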

* [PATCH v4 05/10] commit-graph: unify the signatures of all write_graph_chunk_*() functions
  2020-07-01 13:27     ` [PATCH v4 00/10] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
                         ` (3 preceding siblings ...)
  2020-07-01 13:27       ` [PATCH v4 04/10] commit-graph: persist existence of changed-paths Derrick Stolee via GitGitGadget
@ 2020-07-01 13:27       ` SZEDER Gábor via GitGitGadget
  2020-07-01 13:27       ` [PATCH v4 06/10] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
                         ` (4 subsequent siblings)
  9 siblings, 0 replies; 71+ messages in thread
From: SZEDER Gábor via GitGitGadget @ 2020-07-01 13:27 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee, SZEDER Gábor

From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <szeder.dev@gmail.com>

Update the write_graph_chunk_*() helper functions to have the same
signature:

  - Return an int error code from all these functions.
    write_graph_chunk_base() already has an int error code, now the
    others will have one, too, but since they don't indicate any
    error, they will always return 0.

  - Drop the hash size parameter of write_graph_chunk_oids() and
    write_graph_chunk_data(); its value can be read directly from
    'the_hash_algo' inside these functions as well.

This opens up the possibility for further cleanups and foolproofing in
the following two patches.

Helped-by: René Scharfe <l.s.r@web.de>
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 42 ++++++++++++++++++++++++++----------------
 1 file changed, 26 insertions(+), 16 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 6762704324..1a6d26f864 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -891,8 +891,8 @@ struct write_commit_graph_context {
 	const struct bloom_filter_settings *bloom_settings;
 };
 
-static void write_graph_chunk_fanout(struct hashfile *f,
-				     struct write_commit_graph_context *ctx)
+static int write_graph_chunk_fanout(struct hashfile *f,
+				    struct write_commit_graph_context *ctx)
 {
 	int i, count = 0;
 	struct commit **list = ctx->commits.list;
@@ -913,17 +913,21 @@ static void write_graph_chunk_fanout(struct hashfile *f,
 
 		hashwrite_be32(f, count);
 	}
+
+	return 0;
 }
 
-static void write_graph_chunk_oids(struct hashfile *f, int hash_len,
-				   struct write_commit_graph_context *ctx)
+static int write_graph_chunk_oids(struct hashfile *f,
+				  struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
 	int count;
 	for (count = 0; count < ctx->commits.nr; count++, list++) {
 		display_progress(ctx->progress, ++ctx->progress_cnt);
-		hashwrite(f, (*list)->object.oid.hash, (int)hash_len);
+		hashwrite(f, (*list)->object.oid.hash, the_hash_algo->rawsz);
 	}
+
+	return 0;
 }
 
 static const unsigned char *commit_to_sha1(size_t index, void *table)
@@ -932,8 +936,8 @@ static const unsigned char *commit_to_sha1(size_t index, void *table)
 	return commits[index]->object.oid.hash;
 }
 
-static void write_graph_chunk_data(struct hashfile *f, int hash_len,
-				   struct write_commit_graph_context *ctx)
+static int write_graph_chunk_data(struct hashfile *f,
+				  struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
@@ -950,7 +954,7 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 			die(_("unable to parse commit %s"),
 				oid_to_hex(&(*list)->object.oid));
 		tree = get_commit_tree_oid(*list);
-		hashwrite(f, tree->hash, hash_len);
+		hashwrite(f, tree->hash, the_hash_algo->rawsz);
 
 		parent = (*list)->parents;
 
@@ -1030,10 +1034,12 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 
 		list++;
 	}
+
+	return 0;
 }
 
-static void write_graph_chunk_extra_edges(struct hashfile *f,
-					  struct write_commit_graph_context *ctx)
+static int write_graph_chunk_extra_edges(struct hashfile *f,
+					 struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
@@ -1082,10 +1088,12 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
 
 		list++;
 	}
+
+	return 0;
 }
 
-static void write_graph_chunk_bloom_indexes(struct hashfile *f,
-					    struct write_commit_graph_context *ctx)
+static int write_graph_chunk_bloom_indexes(struct hashfile *f,
+					   struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
@@ -1108,6 +1116,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
 	}
 
 	stop_progress(&progress);
+	return 0;
 }
 
 static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
@@ -1125,8 +1134,8 @@ static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
 	jw_release(&jw);
 }
 
-static void write_graph_chunk_bloom_data(struct hashfile *f,
-					 struct write_commit_graph_context *ctx)
+static int write_graph_chunk_bloom_data(struct hashfile *f,
+					struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
@@ -1155,6 +1164,7 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
 	}
 
 	stop_progress(&progress);
+	return 0;
 }
 
 static int oid_compare(const void *_a, const void *_b)
@@ -1671,8 +1681,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 			num_chunks * ctx->commits.nr);
 	}
 	write_graph_chunk_fanout(f, ctx);
-	write_graph_chunk_oids(f, hashsz, ctx);
-	write_graph_chunk_data(f, hashsz, ctx);
+	write_graph_chunk_oids(f, ctx);
+	write_graph_chunk_data(f, ctx);
 	if (ctx->num_extra_edges)
 		write_graph_chunk_extra_edges(f, ctx);
 	if (ctx->changed_paths) {
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH v4 06/10] commit-graph: simplify chunk writes into loop
  2020-07-01 13:27     ` [PATCH v4 00/10] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
                         ` (4 preceding siblings ...)
  2020-07-01 13:27       ` [PATCH v4 05/10] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
@ 2020-07-01 13:27       ` SZEDER Gábor via GitGitGadget
  2020-07-01 13:27       ` [PATCH v4 07/10] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
                         ` (3 subsequent siblings)
  9 siblings, 0 replies; 71+ messages in thread
From: SZEDER Gábor via GitGitGadget @ 2020-07-01 13:27 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee, SZEDER Gábor

From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <szeder.dev@gmail.com>

In write_commit_graph_file() we now have one block of code filling the
array of 'struct chunk_info' with the IDs and sizes of chunks to be
written, and another block of code calling the functions responsible
for writing individual chunks.  In the case of optional chunks like
Extra Edge List and Base Graphs List there is also a condition checking
whether that chunk is necessary/desired, and that same condition is
repeated in both blocks of code. Other, newer chunks have similar
optional conditions.

Eliminate these repeated conditions by storing the function pointers
responsible for writing individual chunks in the 'struct chunk_info'
array as well, and calling them in a loop to write the commit-graph
file.  This will open up the possibility for a bit of foolproofing in
the following patch.

Helped-by: René Scharfe <l.s.r@web.de>
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 1a6d26f864..2b26a9dad3 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1559,9 +1559,13 @@ static int write_graph_chunk_base(struct hashfile *f,
 	return 0;
 }
 
+typedef int (*chunk_write_fn)(struct hashfile *f,
+			      struct write_commit_graph_context *ctx);
+
 struct chunk_info {
 	uint32_t id;
 	uint64_t size;
+	chunk_write_fn write_fn;
 };
 
 static int write_commit_graph_file(struct write_commit_graph_context *ctx)
@@ -1624,27 +1628,34 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 
 	chunks[0].id = GRAPH_CHUNKID_OIDFANOUT;
 	chunks[0].size = GRAPH_FANOUT_SIZE;
+	chunks[0].write_fn = write_graph_chunk_fanout;
 	chunks[1].id = GRAPH_CHUNKID_OIDLOOKUP;
 	chunks[1].size = hashsz * ctx->commits.nr;
+	chunks[1].write_fn = write_graph_chunk_oids;
 	chunks[2].id = GRAPH_CHUNKID_DATA;
 	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
+	chunks[2].write_fn = write_graph_chunk_data;
 	if (ctx->num_extra_edges) {
 		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
 		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
+		chunks[num_chunks].write_fn = write_graph_chunk_extra_edges;
 		num_chunks++;
 	}
 	if (ctx->changed_paths) {
 		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMINDEXES;
 		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
+		chunks[num_chunks].write_fn = write_graph_chunk_bloom_indexes;
 		num_chunks++;
 		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMDATA;
 		chunks[num_chunks].size = sizeof(uint32_t) * 3
 					  + ctx->total_bloom_filter_data_size;
+		chunks[num_chunks].write_fn = write_graph_chunk_bloom_data;
 		num_chunks++;
 	}
 	if (ctx->num_commit_graphs_after > 1) {
 		chunks[num_chunks].id = GRAPH_CHUNKID_BASE;
 		chunks[num_chunks].size = hashsz * (ctx->num_commit_graphs_after - 1);
+		chunks[num_chunks].write_fn = write_graph_chunk_base;
 		num_chunks++;
 	}
 
@@ -1680,19 +1691,12 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 			progress_title.buf,
 			num_chunks * ctx->commits.nr);
 	}
-	write_graph_chunk_fanout(f, ctx);
-	write_graph_chunk_oids(f, ctx);
-	write_graph_chunk_data(f, ctx);
-	if (ctx->num_extra_edges)
-		write_graph_chunk_extra_edges(f, ctx);
-	if (ctx->changed_paths) {
-		write_graph_chunk_bloom_indexes(f, ctx);
-		write_graph_chunk_bloom_data(f, ctx);
-	}
-	if (ctx->num_commit_graphs_after > 1 &&
-	    write_graph_chunk_base(f, ctx)) {
-		return -1;
+
+	for (i = 0; i < num_chunks; i++) {
+		if (chunks[i].write_fn(f, ctx))
+			return -1;
 	}
+
 	stop_progress(&ctx->progress);
 	strbuf_release(&progress_title);
 
-- 
gitgitgadget



* [PATCH v4 07/10] commit-graph: check chunk sizes after writing
  2020-07-01 13:27     ` [PATCH v4 00/10] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
                         ` (5 preceding siblings ...)
  2020-07-01 13:27       ` [PATCH v4 06/10] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
@ 2020-07-01 13:27       ` SZEDER Gábor via GitGitGadget
  2020-07-01 13:27       ` [PATCH v4 08/10] revision.c: fix whitespace Derrick Stolee via GitGitGadget
                         ` (2 subsequent siblings)
  9 siblings, 0 replies; 71+ messages in thread
From: SZEDER Gábor via GitGitGadget @ 2020-07-01 13:27 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee, SZEDER Gábor

From: SZEDER Gábor <szeder.dev@gmail.com>

In my experience while experimenting with new commit-graph chunks,
early versions of the corresponding new write_commit_graph_my_chunk()
functions are, sadly but not surprisingly, often buggy, and write more
or less data than they are supposed to, especially if the chunk size
is not directly proportional to the number of commits.  This then
causes all kinds of issues when reading such a bogus commit-graph
file, raising the question of whether the writing or the reading part
happens to be buggy this time.

Let's catch such issues early, already when writing the commit-graph
file, and check that each write_graph_chunk_*() function wrote the
amount of data that it was expected to, and what has been encoded in
the Chunk Lookup table.  Now that all commit-graph chunks are written
in a loop we can do this check in a single place for all chunks, and
any chunks added in the future will get checked as well.
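
A minimal sketch of such a check, under simplifying assumptions: the
"sketch_" names are hypothetical, the sink only counts bytes instead
of being a hashfile, and a mismatch is reported through a return
value where the patch calls BUG().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical output sink that only counts bytes. */
struct sketch_sink {
	uint64_t total;
};

typedef int (*sketch_sized_write_fn)(struct sketch_sink *out);

struct sketch_sized_chunk {
	uint64_t size;			/* size promised in the lookup table */
	sketch_sized_write_fn write_fn;
};

/* Toy writers: one emits 8 bytes, the other 6. */
int sketch_write_8(struct sketch_sink *out)
{
	out->total += 8;
	return 0;
}

int sketch_write_6(struct sketch_sink *out)
{
	out->total += 6;
	return 0;
}

/*
 * Returns 0 when every writer succeeds and writes exactly its
 * promised size, -1 when a writer fails, and -2 on a size mismatch.
 * On a mismatch, *bad is set to the index of the offending chunk.
 */
int sketch_write_checked(struct sketch_sink *out,
			 const struct sketch_sized_chunk *chunks,
			 size_t nr, size_t *bad)
{
	size_t i;

	assert(out && chunks && bad);
	for (i = 0; i < nr; i++) {
		uint64_t start = out->total;

		if (chunks[i].write_fn(out))
			return -1;
		if (out->total != start + chunks[i].size) {
			*bad = i;
			return -2;
		}
	}
	return 0;
}
```

The comparison sits in the loop, not in any individual writer, so
a writer added to the table later is covered without further changes.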

Helped-by: René Scharfe <l.s.r@web.de>
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 2b26a9dad3..6752916c1a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1693,8 +1693,15 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	}
 
 	for (i = 0; i < num_chunks; i++) {
+		uint64_t start_offset = f->total + f->offset;
+
 		if (chunks[i].write_fn(f, ctx))
 			return -1;
+
+		if (f->total + f->offset != start_offset + chunks[i].size)
+			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
+			    chunks[i].size, chunks[i].id,
+			    f->total + f->offset - start_offset);
 	}
 
 	stop_progress(&ctx->progress);
-- 
gitgitgadget



* [PATCH v4 08/10] revision.c: fix whitespace
  2020-07-01 13:27     ` [PATCH v4 00/10] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
                         ` (6 preceding siblings ...)
  2020-07-01 13:27       ` [PATCH v4 07/10] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
@ 2020-07-01 13:27       ` Derrick Stolee via GitGitGadget
  2020-07-01 13:27       ` [PATCH v4 09/10] revision: empty pathspecs should not use Bloom filters Taylor Blau via GitGitGadget
  2020-07-01 13:27       ` [PATCH v4 10/10] commit-graph: check all leading directories in changed path " SZEDER Gábor via GitGitGadget
  9 siblings, 0 replies; 71+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-01 13:27 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Here, four spaces were used instead of tab characters.

Reported-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/revision.c b/revision.c
index 7339750af1..ddf09ab0aa 100644
--- a/revision.c
+++ b/revision.c
@@ -695,11 +695,11 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 
 	/* remove single trailing slash from path, if needed */
 	if (pi->match[last_index] == '/') {
-	    path_alloc = xstrdup(pi->match);
-	    path_alloc[last_index] = '\0';
-	    path = path_alloc;
+		path_alloc = xstrdup(pi->match);
+		path_alloc[last_index] = '\0';
+		path = path_alloc;
 	} else
-	    path = pi->match;
+		path = pi->match;
 
 	len = strlen(path);
 
-- 
gitgitgadget



* [PATCH v4 09/10] revision: empty pathspecs should not use Bloom filters
  2020-07-01 13:27     ` [PATCH v4 00/10] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
                         ` (7 preceding siblings ...)
  2020-07-01 13:27       ` [PATCH v4 08/10] revision.c: fix whitespace Derrick Stolee via GitGitGadget
@ 2020-07-01 13:27       ` Taylor Blau via GitGitGadget
  2020-07-01 13:27       ` [PATCH v4 10/10] commit-graph: check all leading directories in changed path " SZEDER Gábor via GitGitGadget
  9 siblings, 0 replies; 71+ messages in thread
From: Taylor Blau via GitGitGadget @ 2020-07-01 13:27 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee, Taylor Blau

From: Taylor Blau <me@ttaylorr.com>

The prepare_to_use_bloom_filter() method was not intended to be called
on an empty pathspec. However, 'git log -- .' and 'git log' are subtly
different: the latter reports all commits while the former will simplify
commits that do not change the root tree.

This means that the path used to construct the bloom_key might be empty,
and that value is not added to the Bloom filter during construction.
That means that the results are likely incorrect!

To resolve the issue, check the length of the path and return early,
without filling a bloom_key, when it is empty. To be completely sure
we do not use Bloom filters in that case, also clear the rev_info's
pointer to the bloom_filter_settings loaded from the commit-graph. That
allows our test to look at the trace2 logs to verify no Bloom filter
statistics are reported.
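
The guard can be illustrated with a small standalone sketch. The
"sketch_" helpers below are hypothetical and are not part of
revision.c; they only model "ignore one trailing slash, then refuse
to build a key for an empty path".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Length of 'match' once a single trailing slash is ignored.
 * A result of 0 means there is nothing to build a Bloom key from.
 */
size_t sketch_bloom_path_len(const char *match)
{
	size_t len;

	assert(match);
	len = strlen(match);
	if (len && match[len - 1] == '/')
		len--;
	return len;
}

/* Non-zero when a Bloom key can be built for 'match'. */
int sketch_can_use_bloom(const char *match)
{
	return sketch_bloom_path_len(match) != 0;
}
```

With these helpers, an empty match string is rejected while a path
such as "A/file1" is accepted with its full length.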

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c           | 4 ++++
 t/t4216-log-bloom.sh | 4 ++++
 2 files changed, 8 insertions(+)

diff --git a/revision.c b/revision.c
index ddf09ab0aa..667ca36e1c 100644
--- a/revision.c
+++ b/revision.c
@@ -702,6 +702,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 		path = pi->match;
 
 	len = strlen(path);
+	if (!len) {
+		revs->bloom_filter_settings = NULL;
+		return;
+	}
 
 	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
 	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 52ad998f9e..29338f36bf 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -112,6 +112,10 @@ test_expect_success 'git log -- multiple path specs does not use Bloom filters'
 	test_bloom_filters_not_used "-- file4 A/file1"
 '
 
+test_expect_success 'git log -- "." pathspec at root does not use Bloom filters' '
+	test_bloom_filters_not_used "-- ."
+'
+
 test_expect_success 'git log with wildcard that resolves to a single path uses Bloom filters' '
 	test_bloom_filters_used "-- *4" &&
 	test_bloom_filters_used "-- *renamed"
-- 
gitgitgadget



* [PATCH v4 10/10] commit-graph: check all leading directories in changed path Bloom filters
  2020-07-01 13:27     ` [PATCH v4 00/10] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
                         ` (8 preceding siblings ...)
  2020-07-01 13:27       ` [PATCH v4 09/10] revision: empty pathspecs should not use Bloom filters Taylor Blau via GitGitGadget
@ 2020-07-01 13:27       ` SZEDER Gábor via GitGitGadget
  9 siblings, 0 replies; 71+ messages in thread
From: SZEDER Gábor via GitGitGadget @ 2020-07-01 13:27 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, l.s.r, Derrick Stolee, SZEDER Gábor

From: SZEDER Gábor <szeder.dev@gmail.com>

The file 'dir/subdir/file' can only be modified if its leading
directories 'dir' and 'dir/subdir' are modified as well.

So when checking modified path Bloom filters for commits modifying a
path with multiple path components, check not only the full path in
the Bloom filters, but all of its leading directories as well.  Take
care to check these paths in "deepest first" order: the full path is
the least likely to be modified, so the Bloom filter queries can
short-circuit sooner.
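
To make the "deepest first" order concrete, here is a small sketch
that only computes the prefix lengths to query.  The function name is
hypothetical, and it assumes the path uses '/' separators and has
already had any trailing slash removed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Store in 'lens' the lengths of the prefixes of 'path' to query,
 * deepest first: the full path, then each leading directory from
 * the innermost to the outermost.  At most 'cap' entries are
 * written.  Returns the number of entries stored.
 */
size_t sketch_prefix_lengths(const char *path, size_t *lens, size_t cap)
{
	size_t len, nr = 0;
	const char *p;

	assert(path && lens);
	len = strlen(path);
	if (!len || !cap)
		return 0;

	lens[nr++] = len;
	/* Walk backwards; every '/' ends one leading directory. */
	for (p = path + len - 1; p > path && nr < cap; p--) {
		if (*p == '/')
			lens[nr++] = (size_t)(p - path);
	}
	return nr;
}
```

For 'dir/subdir/file' this stores the lengths of 'dir/subdir/file',
'dir/subdir' and 'dir', in that order.  A caller would build one key
per prefix and stop querying at the first "definitely not" answer.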

This can significantly reduce the average false positive rate, by
about an order of magnitude or three(!), and can further speed up
pathspec-limited revision walks.  The table below compares the average
false positive rate and runtime of

  git rev-list HEAD -- "$path"

before and after this change for 5000+ randomly* selected paths from
each repository:

                    Average false           Average        Average
                    positive rate           runtime        runtime
                  before     after     before     after   difference
  ------------------------------------------------------------------
  git             3.220%   0.7853%     0.0558s   0.0387s   -30.6%
  linux           2.453%   0.0296%     0.1046s   0.0766s   -26.8%
  tensorflow      2.536%   0.6977%     0.0594s   0.0420s   -29.2%

*Path selection was done with the following pipeline:

	git ls-tree -r --name-only HEAD | sort -R | head -n 5000

The improvements in runtime are much smaller than the improvements in
average false positive rate, as we are clearly reaching diminishing
returns here.  However, all these timings depend on access to tree
objects being reasonably fast (warm caches).  If we had a partial
clone and the tree objects had to be fetched from a promisor remote,
e.g.:

  $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
  $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
        commit-graph write --reachable
  $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
  $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
        rev-list HEAD -- "$path"

then checking all leading path components can reduce the runtime from
over an hour to a few seconds (and this is with the clone and the
promisor on the same machine).

This adjusts the tracing values in t4216-log-bloom.sh, which provides a
concrete way to notice the improvement.

Helped-by: Taylor Blau <me@ttaylorr.com>
Helped-by: René Scharfe <l.s.r@web.de>
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c           | 46 +++++++++++++++++++++++++++++++++++---------
 revision.h           |  6 ++++--
 t/t4216-log-bloom.sh |  2 +-
 3 files changed, 42 insertions(+), 12 deletions(-)

diff --git a/revision.c b/revision.c
index 667ca36e1c..b9118001f9 100644
--- a/revision.c
+++ b/revision.c
@@ -668,9 +668,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 {
 	struct pathspec_item *pi;
 	char *path_alloc = NULL;
-	const char *path;
+	const char *path, *p;
 	int last_index;
-	int len;
+	size_t len;
+	int path_component_nr = 1;
 
 	if (!revs->commits)
 		return;
@@ -707,8 +708,33 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 		return;
 	}
 
-	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
-	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
+	p = path;
+	while (*p) {
+		/*
+		 * At this point, the path is normalized to use Unix-style
+		 * path separators. This is required due to how the
+		 * changed-path Bloom filters store the paths.
+		 */
+		if (*p == '/')
+			path_component_nr++;
+		p++;
+	}
+
+	revs->bloom_keys_nr = path_component_nr;
+	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
+
+	fill_bloom_key(path, len, &revs->bloom_keys[0],
+		       revs->bloom_filter_settings);
+	path_component_nr = 1;
+
+	p = path + len - 1;
+	while (p > path) {
+		if (*p == '/')
+			fill_bloom_key(path, p - path,
+				       &revs->bloom_keys[path_component_nr++],
+				       revs->bloom_filter_settings);
+		p--;
+	}
 
 	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
 		atexit(trace2_bloom_filter_statistics_atexit);
@@ -722,7 +748,7 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 						 struct commit *commit)
 {
 	struct bloom_filter *filter;
-	int result;
+	int result = 1, j;
 
 	if (!revs->repo->objects->commit_graph)
 		return -1;
@@ -737,9 +763,11 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 		return -1;
 	}
 
-	result = bloom_filter_contains(filter,
-				       revs->bloom_key,
-				       revs->bloom_filter_settings);
+	for (j = 0; result && j < revs->bloom_keys_nr; j++) {
+		result = bloom_filter_contains(filter,
+					       &revs->bloom_keys[j],
+					       revs->bloom_filter_settings);
+	}
 
 	if (result)
 		count_bloom_filter_maybe++;
@@ -779,7 +807,7 @@ static int rev_compare_tree(struct rev_info *revs,
 			return REV_TREE_SAME;
 	}
 
-	if (revs->bloom_key && !nth_parent) {
+	if (revs->bloom_keys_nr && !nth_parent) {
 		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
 
 		if (bloom_ret == 0)
diff --git a/revision.h b/revision.h
index 7c026fe41f..abbfb4ab59 100644
--- a/revision.h
+++ b/revision.h
@@ -295,8 +295,10 @@ struct rev_info {
 	struct topo_walk_info *topo_walk_info;
 
 	/* Commit graph bloom filter fields */
-	/* The bloom filter key for the pathspec */
-	struct bloom_key *bloom_key;
+	/* The bloom filter key(s) for the pathspec */
+	struct bloom_key *bloom_keys;
+	int bloom_keys_nr;
+
 	/*
 	 * The bloom filter settings used to generate the key.
 	 * This is loaded from the commit-graph being used.
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 29338f36bf..4892364e74 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -146,7 +146,7 @@ test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
 
 test_bloom_filters_used_when_some_filters_are_missing () {
 	log_args=$1
-	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"maybe\":8,\"definitely_not\":6"
+	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"maybe\":6,\"definitely_not\":8"
 	setup "$log_args" &&
 	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
 	test_cmp log_wo_bloom log_w_bloom
-- 
gitgitgadget


Thread overview: 71+ messages
2020-06-15 20:14 [PATCH 0/8] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
2020-06-15 20:14 ` [PATCH 1/8] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
2020-06-18 20:30   ` René Scharfe
2020-06-19 12:58     ` Derrick Stolee
2020-06-15 20:14 ` [PATCH 2/8] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
2020-06-18 20:30   ` René Scharfe
2020-06-15 20:14 ` [PATCH 3/8] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
2020-06-18 20:30   ` René Scharfe
2020-06-15 20:14 ` [PATCH 4/8] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
2020-06-15 20:14 ` [PATCH 5/8] commit-graph: check all leading directories in changed path Bloom filters SZEDER Gábor via GitGitGadget
2020-06-18 20:31   ` René Scharfe
2020-06-19  9:14     ` René Scharfe
2020-06-19 17:17   ` Taylor Blau
2020-06-19 17:19     ` Taylor Blau
2020-06-23 13:47     ` Derrick Stolee
2020-06-15 20:14 ` [PATCH 6/8] bloom: enforce a minimum size of 8 bytes Derrick Stolee via GitGitGadget
2020-06-15 20:14 ` [PATCH 7/8] commit-graph: change test to die on parse, not load Derrick Stolee via GitGitGadget
2020-06-15 20:14 ` [PATCH 8/8] commit-graph: persist existence of changed-paths Derrick Stolee via GitGitGadget
2020-06-17 21:21 ` [PATCH 0/8] More commit-graph/Bloom filter improvements Junio C Hamano
2020-06-18  1:46   ` Derrick Stolee
2020-06-23 17:46 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 01/11] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 02/11] commit-graph: change test to die on parse, not load Derrick Stolee via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 03/11] bloom: get_bloom_filter() cleanups Derrick Stolee via GitGitGadget
2020-06-25  7:24     ` René Scharfe
2020-06-23 17:47   ` [PATCH v2 04/11] commit-graph: persist existence of changed-paths Derrick Stolee via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 05/11] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
2020-06-25  7:25     ` René Scharfe
2020-06-23 17:47   ` [PATCH v2 06/11] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
2020-06-25  7:25     ` René Scharfe
2020-06-25 14:59       ` Derrick Stolee
2020-06-23 17:47   ` [PATCH v2 07/11] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
2020-06-25  7:25     ` René Scharfe
2020-06-25 15:02       ` Derrick Stolee
2020-06-23 17:47   ` [PATCH v2 08/11] revision.c: fix whitespace Derrick Stolee via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 09/11] revision: empty pathspecs should not use Bloom filters Taylor Blau via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 10/11] commit-graph: check all leading directories in changed path " SZEDER Gábor via GitGitGadget
2020-06-25  7:25     ` René Scharfe
2020-06-25 15:05       ` Derrick Stolee
2020-06-26  6:34         ` SZEDER Gábor
2020-06-26 14:42           ` Derrick Stolee
2020-06-23 17:47   ` [PATCH v2 11/11] bloom: enforce a minimum size of 8 bytes Derrick Stolee via GitGitGadget
2020-06-24 23:11   ` [PATCH v2 00/11] More commit-graph/Bloom filter improvements Junio C Hamano
2020-06-24 23:32     ` Derrick Stolee
2020-06-25  0:38       ` Junio C Hamano
2020-06-25 13:38         ` Derrick Stolee
2020-06-25 16:34           ` Junio C Hamano
2020-06-26 12:30   ` [PATCH v3 00/10] " Derrick Stolee via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 01/10] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 02/10] commit-graph: change test to die on parse, not load Derrick Stolee via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 03/10] bloom: fix logic in get_bloom_filter() Derrick Stolee via GitGitGadget
2020-06-27 16:33       ` SZEDER Gábor
2020-06-29 13:02         ` Derrick Stolee
2020-06-26 12:30     ` [PATCH v3 04/10] commit-graph: persist existence of changed-paths Derrick Stolee via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 05/10] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 06/10] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 07/10] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 08/10] revision.c: fix whitespace Derrick Stolee via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 09/10] revision: empty pathspecs should not use Bloom filters Taylor Blau via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 10/10] commit-graph: check all leading directories in changed path " SZEDER Gábor via GitGitGadget
2020-07-01 13:27     ` [PATCH v4 00/10] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 01/10] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 02/10] commit-graph: change test to die on parse, not load Derrick Stolee via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 03/10] bloom: fix logic in get_bloom_filter() Derrick Stolee via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 04/10] commit-graph: persist existence of changed-paths Derrick Stolee via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 05/10] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 06/10] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 07/10] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 08/10] revision.c: fix whitespace Derrick Stolee via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 09/10] revision: empty pathspecs should not use Bloom filters Taylor Blau via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 10/10] commit-graph: check all leading directories in changed path " SZEDER Gábor via GitGitGadget
