git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/8] More commit-graph/Bloom filter improvements
@ 2020-06-15 20:14 Derrick Stolee via GitGitGadget
  2020-06-15 20:14 ` [PATCH 1/8] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
                   ` (9 more replies)
  0 siblings, 10 replies; 76+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-06-15 20:14 UTC (permalink / raw)
  To: git; +Cc: me, szeder.dev, Derrick Stolee

This builds on sg/commit-graph-cleanups, which took several patches from
Szeder's series [1] and applied them almost directly to a more-recent
version of Git [2].

[1] https://lore.kernel.org/git/20200529085038.26008-1-szeder.dev@gmail.com/
[2] 
https://lore.kernel.org/git/pull.650.git.1591362032.gitgitgadget@gmail.com/

This series adds a few extra improvements, several of which are rooted in
Szeder's original series. I maintained his authorship and sign-off, even
though the patches did not apply or cherry-pick at all.

 1. commit-graph: place bloom_settings in context
 2. commit-graph: unify the signatures of all write_graph_chunk_*()
    functions
 3. commit-graph: simplify chunk writes into loop
 4. commit-graph: check chunk sizes after writing
 5. commit-graph: check all leading directories in changed path Bloom
    filters

Patch 1 is a new preparation patch to then apply Szeder's ideas in the next
four. Some are refactoring or defensive programming, but Patch 5 presents a
meaningful performance improvement. By creating bloom_keys for each leading
directory in a path, we can greatly improve the false-positive rate.

 6. bloom: enforce a minimum size of 8 bytes

Patch 6 is based on a comment of Szeder's that since we are using 1-byte
alignment in the filters, that some small filters do not fit the theoretical
analysis that calculated the expected false-positive rate. By increasing the
minimum (non-zero) filter size, we can gain significant performance benefits
while increasing the file size a small amount.

 7. commit-graph: change test to die on parse, not load
 8. commit-graph: persist existence of changed-paths

The final two patches handle the unresolved usability issue: if a user
writes a commit-graph with --changed-paths, the next write will probably
clear them out. Think about gc.writeCommitGraph or fetch.writeCommitGraph,
which do not allow for the --changed-paths option directly. Another idea is
to add a config option, but I will leave that to others [3].

[3] https://github.com/gitgitgadget/git/pull/633

Here is an analysis of the range-diff between this series and Szeder's PoC
submission.

These patches either are part of sg/commit-graph-cleanups or were discarded
as unnecessary.

 1:  7a8dbfba53a <  -:  ----------- tree-walk.c: don't match submodule entries for 'submod/anything'
 2:  df25e984c58 <  -:  ----------- commit-graph: fix parsing the Chunk Lookup table
 3:  598f7f9a978 <  -:  ----------- commit-graph-format.txt: all multi-byte numbers are in network byte order
 4:  b29e5d39ed6 <  -:  ----------- commit-slab: add a function to deep free entries on the slab
 5:  18f4db7bfb9 <  -:  ----------- diff.h: drop diff_tree_oid() & friends' return value
 6:  bf336f109e6 <  -:  ----------- commit-graph: clean up #includes
 7:  b7f0f831bcf <  -:  ----------- commit-graph: simplify parse_commit_graph() #1
 8:  f2752000052 <  -:  ----------- commit-graph: simplify parse_commit_graph() #2
 9:  4e184b8743c <  -:  ----------- commit-graph: simplify write_commit_graph_file() #1
10:  344dd337da5 <  -:  ----------- commit-graph: simplify write_commit_graph_file() #2
11:  56e3c4f57b3 <  -:  ----------- commit-graph: allocate the 'struct chunk_info' array dinamically

This first patch enables the next refactoring patch.

 -:  ----------- >  1:  c966969071b commit-graph: place bloom_settings in context

This patch is recognized as similar, but all differences are due to
whitespace corrections and the new write_graph_chunk_*() methods.

12:  28fb1b5bdfe !  2:  65eb15221c8 commit-graph: unify the signatures of all write_graph_chunk_*() functions
    @@ Commit message
         This opens up the possibility for further cleanups and foolproofing in
         the following two patches.

         Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
    +    Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

      ## commit-graph.c ##
     @@ commit-graph.c: struct write_commit_graph_context {
    -     const struct split_commit_graph_opts *split_opts;
    +     struct bloom_filter_settings bloom_settings;
      };

     -static void write_graph_chunk_fanout(struct hashfile *f,
    +-                     struct write_commit_graph_context *ctx)
     +static int write_graph_chunk_fanout(struct hashfile *f,
    -                      struct write_commit_graph_context *ctx)
    ++                    struct write_commit_graph_context *ctx)
      {
          int i, count = 0;
    +     struct commit **list = ctx->commits.list;
     @@ commit-graph.c: static void write_graph_chunk_fanout(struct hashfile *f,

              hashwrite_be32(f, count);
          }
    ++
     +    return 0;
      }

     -static void write_graph_chunk_oids(struct hashfile *f, int hash_len,
    +-                   struct write_commit_graph_context *ctx)
     +static int write_graph_chunk_oids(struct hashfile *f,
    -                    struct write_commit_graph_context *ctx)
    ++                  struct write_commit_graph_context *ctx)
      {
          struct commit **list = ctx->commits.list;
          int count;
          for (count = 0; count < ctx->commits.nr; count++, list++) {
              display_progress(ctx->progress, ++ctx->progress_cnt);
     -        hashwrite(f, (*list)->object.oid.hash, (int)hash_len);
    -+        hashwrite(f, (*list)->object.oid.hash, the_hash_algo->rawsz);
    ++        hashwrite(f, (*list)->object.oid.hash, (int)the_hash_algo->rawsz);
          }
    ++
     +    return 0;
      }

    @@ commit-graph.c: static const unsigned char *commit_to_sha1(size_t index, void *t
      }

     -static void write_graph_chunk_data(struct hashfile *f, int hash_len,
    +-                   struct write_commit_graph_context *ctx)
     +static int write_graph_chunk_data(struct hashfile *f,
    -                    struct write_commit_graph_context *ctx)
    ++                  struct write_commit_graph_context *ctx)
      {
          struct commit **list = ctx->commits.list;
    +     struct commit **last = ctx->commits.list + ctx->commits.nr;
     @@ commit-graph.c: static void write_graph_chunk_data(struct hashfile *f, int hash_len,
                  die(_("unable to parse commit %s"),
                      oid_to_hex(&(*list)->object.oid));
    @@ commit-graph.c: static void write_graph_chunk_data(struct hashfile *f, int hash_

              list++;
          }
    ++
     +    return 0;
      }

     -static void write_graph_chunk_extra_edges(struct hashfile *f,
    +-                      struct write_commit_graph_context *ctx)
     +static int write_graph_chunk_extra_edges(struct hashfile *f,
    -                       struct write_commit_graph_context *ctx)
    ++                     struct write_commit_graph_context *ctx)
      {
          struct commit **list = ctx->commits.list;
    +     struct commit **last = ctx->commits.list + ctx->commits.nr;
     @@ commit-graph.c: static void write_graph_chunk_extra_edges(struct hashfile *f,

              list++;
          }
    ++
    ++    return 0;
    + }
    + 
    +-static void write_graph_chunk_bloom_indexes(struct hashfile *f,
    +-                        struct write_commit_graph_context *ctx)
    ++static int write_graph_chunk_bloom_indexes(struct hashfile *f,
    ++                       struct write_commit_graph_context *ctx)
    + {
    +     struct commit **list = ctx->commits.list;
    +     struct commit **last = ctx->commits.list + ctx->commits.nr;
    +@@ commit-graph.c: static void write_graph_chunk_bloom_indexes(struct hashfile *f,
    +     }
    + 
    +     stop_progress(&progress);
    ++    return 0;
    + }
    + 
    +-static void write_graph_chunk_bloom_data(struct hashfile *f,
    +-                     struct write_commit_graph_context *ctx)
    ++static int write_graph_chunk_bloom_data(struct hashfile *f,
    ++                    struct write_commit_graph_context *ctx)
    + {
    +     struct commit **list = ctx->commits.list;
    +     struct commit **last = ctx->commits.list + ctx->commits.nr;
    +@@ commit-graph.c: static void write_graph_chunk_bloom_data(struct hashfile *f,
    +     }
    + 
    +     stop_progress(&progress);
     +    return 0;
      }

      static int oid_compare(const void *_a, const void *_b)
     @@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_context *ctx)
    -             chunks_nr * ctx->commits.nr);
    +             num_chunks * ctx->commits.nr);
          }
          write_graph_chunk_fanout(f, ctx);
     -    write_graph_chunk_oids(f, hashsz, ctx);
    @@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_con
     +    write_graph_chunk_data(f, ctx);
          if (ctx->num_extra_edges)
              write_graph_chunk_extra_edges(f, ctx);
    -     if (ctx->num_commit_graphs_after > 1 &&
    +     if (ctx->changed_paths) {

These patches follow the same intent, but are significantly different
because they are updated with split commit-graphs and the existing
changed-path Bloom filters.

13:  1e1e59e2592 <  -:  ----------- commit-graph: simplify write_commit_graph_file() #3
 -:  ----------- >  3:  3d24b9802df commit-graph: simplify chunk writes into loop
14:  6f0d912e4b8 <  -:  ----------- commit-graph: check chunk sizes after writing
 -:  ----------- >  4:  bdca834e6da commit-graph: check chunk sizes after writing
24:  dc96f0d9822 <  -:  ----------- commit-graph: check all leading directories in modified path Bloom filters
 -:  ----------- >  5:  9975fc96f12 commit-graph: check all leading directories in changed path Bloom filters

These three patches are a few valuable improvements of my own design:

 -:  ----------- >  6:  2a5f1e17528 bloom: enforce a minimum size of 8 bytes
 -:  ----------- >  7:  60bbc15d24a commit-graph: change test to die on parse, not load
 -:  ----------- >  8:  db5b8fe8439 commit-graph: persist existence of changed-paths

At this point, we have updated the existing changed-path Bloom filter
implementation to be on even terms with Szeder's modified-path Bloom filter
implementation.

The next batch of patches contain Szeder's implementation. These implement a
completely different file format, so they are not intended as ways to move
forward. If there is a significant improvement to be found by using this
file format instead of the established one (comparing the old implementation
with these patches), then we could consider swapping the optional chunks for
those that he proposes.

While I had the motivation and energy to defend the current implementation
by applying Szeder's (excellent) ideas to the existing format, I do not have
intent to go through the effort to compare the file formats explicitly at
this point. I would be interested to read a performance analysis, if someone
were to provide one now.

15:  0ab955aac32 <  -:  ----------- commit-graph-format.txt: document the modified path Bloom filter chunks
16:  4c128d51dfe <  -:  ----------- Add a generic and minimal Bloom filter implementation
17:  41f02bc38f7 <  -:  ----------- Import a streaming-capable Murmur3 hash function implementation
18:  e5fd1da48d4 <  -:  ----------- commit-graph: write "empty" Modified Path Bloom Filter Index chunk
19:  2dd882ec601 <  -:  ----------- commit-graph: add commit slab for modified path Bloom filters
20:  f30e495c2b0 <  -:  ----------- commit-graph: fill the Modified Path Bloom Filter Index chunk
21:  e904cb58301 <  -:  ----------- commit-graph: load and use the Modified Path Bloom Filter Index chunk
22:  c71647ca374 <  -:  ----------- commit-graph: write the Modified Path Bloom Filters chunk
23:  50898d42291 <  -:  ----------- commit-graph: load and use the Modified Path Bloom Filters chunk
25:  7cbf1bc6b66 <  -:  ----------- commit-graph: check embedded modified path Bloom filters with a mask
26:  3951fdedf6a <  -:  ----------- commit-graph: deduplicate modified path Bloom filters
27:  5aba19a2766 <  -:  ----------- commit-graph: load modified path Bloom filters for merge commits
28:  93fc6af1d2f <  -:  ----------- commit-graph: write Modified Path Bloom Filter Merge Index chunk
29:  f87b37bf08e <  -:  ----------- commit-graph: extract init and free write_commit_graph_context
30:  943b0d9554c <  -:  ----------- commit-graph: move write_commit_graph_reachable below write_commit_graph
31:  47b26ea61aa <  -:  ----------- t7007-show: make the first test compatible with the next patch
32:  9201b71071c <  -:  ----------- PoC commit-graph: use revision walk machinery for '--reachable'
33:  5c72d97e5e9 <  -:  ----------- commit-graph: write modified path Bloom filters in "history order"

This patch is likely worth investigating again:

34:  8b40ec4cd30 <  -:  ----------- commit-graph: use modified path Bloom filters with wildcards, if possible

Thanks, -Stolee

Derrick Stolee (4):
  commit-graph: place bloom_settings in context
  bloom: enforce a minimum size of 8 bytes
  commit-graph: change test to die on parse, not load
  commit-graph: persist existence of changed-paths

SZEDER Gábor (4):
  commit-graph: unify the signatures of all write_graph_chunk_*()
    functions
  commit-graph: simplify chunk writes into loop
  commit-graph: check chunk sizes after writing
  commit-graph: check all leading directories in changed path Bloom
    filters

 Documentation/git-commit-graph.txt |   5 +-
 bloom.c                            |   4 ++
 builtin/commit-graph.c             |   5 +-
 commit-graph.c                     | 112 ++++++++++++++++++++---------
 commit-graph.h                     |   3 +-
 revision.c                         |  35 ++++++---
 revision.h                         |   6 +-
 t/t4216-log-bloom.sh               |   4 +-
 t/t5318-commit-graph.sh            |   2 +-
 9 files changed, 124 insertions(+), 52 deletions(-)


base-commit: 7fbfe07ab4d4e58c0971dac73001b89f180a0af3
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-659%2Fderrickstolee%2Fbloom-2-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-659/derrickstolee/bloom-2-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/659
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 76+ messages in thread

end of thread, other threads:[~2020-10-16 13:53 UTC | newest]

Thread overview: 76+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-15 20:14 [PATCH 0/8] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
2020-06-15 20:14 ` [PATCH 1/8] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
2020-06-18 20:30   ` René Scharfe
2020-06-19 12:58     ` Derrick Stolee
2020-06-15 20:14 ` [PATCH 2/8] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
2020-06-18 20:30   ` René Scharfe
2020-06-15 20:14 ` [PATCH 3/8] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
2020-06-18 20:30   ` René Scharfe
2020-06-15 20:14 ` [PATCH 4/8] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
2020-06-15 20:14 ` [PATCH 5/8] commit-graph: check all leading directories in changed path Bloom filters SZEDER Gábor via GitGitGadget
2020-06-18 20:31   ` René Scharfe
2020-06-19  9:14     ` René Scharfe
2020-06-19 17:17   ` Taylor Blau
2020-06-19 17:19     ` Taylor Blau
2020-06-23 13:47     ` Derrick Stolee
2020-06-15 20:14 ` [PATCH 6/8] bloom: enforce a minimum size of 8 bytes Derrick Stolee via GitGitGadget
2020-06-15 20:14 ` [PATCH 7/8] commit-graph: change test to die on parse, not load Derrick Stolee via GitGitGadget
2020-06-15 20:14 ` [PATCH 8/8] commit-graph: persist existence of changed-paths Derrick Stolee via GitGitGadget
2020-06-17 21:21 ` [PATCH 0/8] More commit-graph/Bloom filter improvements Junio C Hamano
2020-06-18  1:46   ` Derrick Stolee
2020-06-23 17:46 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 01/11] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 02/11] commit-graph: change test to die on parse, not load Derrick Stolee via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 03/11] bloom: get_bloom_filter() cleanups Derrick Stolee via GitGitGadget
2020-06-25  7:24     ` René Scharfe
2020-06-23 17:47   ` [PATCH v2 04/11] commit-graph: persist existence of changed-paths Derrick Stolee via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 05/11] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
2020-06-25  7:25     ` René Scharfe
2020-06-23 17:47   ` [PATCH v2 06/11] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
2020-06-25  7:25     ` René Scharfe
2020-06-25 14:59       ` Derrick Stolee
2020-06-23 17:47   ` [PATCH v2 07/11] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
2020-06-25  7:25     ` René Scharfe
2020-06-25 15:02       ` Derrick Stolee
2020-06-23 17:47   ` [PATCH v2 08/11] revision.c: fix whitespace Derrick Stolee via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 09/11] revision: empty pathspecs should not use Bloom filters Taylor Blau via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 10/11] commit-graph: check all leading directories in changed path " SZEDER Gábor via GitGitGadget
2020-06-25  7:25     ` René Scharfe
2020-06-25 15:05       ` Derrick Stolee
2020-06-26  6:34         ` SZEDER Gábor
2020-06-26 14:42           ` Derrick Stolee
2020-06-23 17:47   ` [PATCH v2 11/11] bloom: enforce a minimum size of 8 bytes Derrick Stolee via GitGitGadget
2020-06-24 23:11   ` [PATCH v2 00/11] More commit-graph/Bloom filter improvements Junio C Hamano
2020-06-24 23:32     ` Derrick Stolee
2020-06-25  0:38       ` Junio C Hamano
2020-06-25 13:38         ` Derrick Stolee
2020-06-25 16:34           ` Junio C Hamano
2020-06-26 12:30   ` [PATCH v3 00/10] " Derrick Stolee via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 01/10] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 02/10] commit-graph: change test to die on parse, not load Derrick Stolee via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 03/10] bloom: fix logic in get_bloom_filter() Derrick Stolee via GitGitGadget
2020-06-27 16:33       ` SZEDER Gábor
2020-06-29 13:02         ` Derrick Stolee
2020-06-26 12:30     ` [PATCH v3 04/10] commit-graph: persist existence of changed-paths Derrick Stolee via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 05/10] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 06/10] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 07/10] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 08/10] revision.c: fix whitespace Derrick Stolee via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 09/10] revision: empty pathspecs should not use Bloom filters Taylor Blau via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 10/10] commit-graph: check all leading directories in changed path " SZEDER Gábor via GitGitGadget
2020-07-01 13:27     ` [PATCH v4 00/10] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 01/10] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 02/10] commit-graph: change test to die on parse, not load Derrick Stolee via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 03/10] bloom: fix logic in get_bloom_filter() Derrick Stolee via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 04/10] commit-graph: persist existence of changed-paths Derrick Stolee via GitGitGadget
2020-10-15 13:21         ` SZEDER Gábor
2020-10-15 21:41           ` Taylor Blau
2020-10-16  2:18             ` Derrick Stolee
2020-10-16  3:18               ` Taylor Blau
2020-10-16 13:52                 ` Derrick Stolee
2020-07-01 13:27       ` [PATCH v4 05/10] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 06/10] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 07/10] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 08/10] revision.c: fix whitespace Derrick Stolee via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 09/10] revision: empty pathspecs should not use Bloom filters Taylor Blau via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 10/10] commit-graph: check all leading directories in changed path " SZEDER Gábor via GitGitGadget

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).