[RFC PATCH 0/6] bloom: reuse existing Bloom filters when possible during upgrade

From: Taylor Blau @ 2023-08-07 16:37 UTC
  To: git; +Cc: Derrick Stolee, Jonathan Tan, Junio C Hamano

This series is based on 'jt/path-filter-fix'.

These few patches implement an idea that we discussed in [1], where we
attempt to reuse existing Bloom filters during an upgrade from v1 to v2
Bloom filters while rewriting the commit-graph.

The core idea is that a Bloom filter is reusable when there aren't any
non-ASCII paths in a commit's tree-diff against its first parent (or the
empty tree, if none exists). Knowing that a commit's tree-diff meets
that condition tells us nothing about whether either tree contains
non-ASCII paths, since such paths could be unmodified on both sides and
thus excluded from the tree-diff.

But the implication does hold in the other direction: if there aren't
any non-ASCII characters present in the tree's path set, then there
can't be any such paths present in the first-parent tree-diff, either.
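
To make the condition concrete, here is a minimal sketch of the
per-path check, treating "non-ASCII" as "has a byte with the high bit
set". Both helper names are hypothetical and are not taken from the
patches; the series may well structure this differently.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from the series): returns 1 if any byte of
 * the NUL-terminated path has its high bit set, i.e. is non-ASCII.
 */
static int path_has_high_bit(const char *path)
{
	assert(path != NULL);
	for (; *path; path++)
		if ((unsigned char)*path & 0x80)
			return 1;
	return 0;
}

/*
 * Hypothetical helper: a filter computed over `paths` is considered
 * reusable only when none of the `nr` paths contains a non-ASCII byte.
 */
static int paths_allow_reuse(const char *const *paths, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (path_has_high_bit(paths[i]))
			return 0;
	return 1;
}
```

Note the cast to `unsigned char` before masking, which keeps the test
independent of whether plain `char` is signed on the platform.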

This series checks whether or not commits meet that criterion, and
reuses the existing Bloom filter (if one exists) when possible. In
practice, we end up visiting relatively few trees, since we mark each
tree we visit and never walk it again.
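
As a rough illustration of why marking trees keeps the walk cheap,
here is a self-contained toy model. The `toy_*` types and the counter
are hypothetical stand-ins, not Git's object structures or flag bits;
the only point is that a subtree shared between commits gets scanned
once and its answer is reused afterwards.

```c
#include <assert.h>
#include <stddef.h>

struct toy_tree;

struct toy_entry {
	const char *name;
	struct toy_tree *subtree;	/* NULL for non-tree entries */
};

struct toy_tree {
	const struct toy_entry *entries;
	size_t nr;
	unsigned visited : 1;		/* set once the tree has been scanned */
	unsigned has_high_bit : 1;	/* cached result of that scan */
};

/* Counts how many trees were actually scanned (for illustration only). */
static unsigned long toy_trees_scanned;

static int name_has_high_bit(const char *name)
{
	for (; *name; name++)
		if ((unsigned char)*name & 0x80)
			return 1;
	return 0;
}

/*
 * Returns 1 if the tree, or any tree reachable from it, has an entry
 * name containing a non-ASCII byte. Each tree is scanned at most once.
 */
static int toy_tree_has_high_bit(struct toy_tree *t)
{
	size_t i;

	assert(t != NULL);
	if (t->visited)
		return t->has_high_bit;
	t->visited = 1;
	toy_trees_scanned++;

	for (i = 0; i < t->nr; i++) {
		const struct toy_entry *e = &t->entries[i];

		if (name_has_high_bit(e->name) ||
		    (e->subtree && toy_tree_has_high_bit(e->subtree))) {
			t->has_high_bit = 1;
			break;
		}
	}
	return t->has_high_bit;
}
```

With two root trees that share one subtree, checking both roots scans
three trees rather than four, and asking about either root again scans
none.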

On both linux.git and git.git, this series gives a significant speed-up
when upgrading Bloom filters from v1 to v2. On linux.git:

    Benchmark 1: GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths
      Time (mean ± σ):     124.873 s ±  0.316 s    [User: 124.081 s, System: 0.643 s]
      Range (min … max):   124.621 s … 125.227 s    3 runs

    Benchmark 2: GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths
      Time (mean ± σ):     79.271 s ±  0.163 s    [User: 74.611 s, System: 4.521 s]
      Range (min … max):   79.112 s … 79.437 s    3 runs

    Summary
      'GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths' ran
        1.58 ± 0.01 times faster than 'GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths'

On git.git (where we do have some non-ASCII paths), the same upgrade
goes from 4.163 seconds to 3.348 seconds, for a 1.24x speed-up.

I'm sending this as an RFC, since we are in the middle of the -rc phase,
and 'jt/path-filter-fix' isn't expected[2] to merge into 'master' until
we're on the other side of 2.42.

The structure of this series is as follows:

  - The first three patches prepare to load the `BDAT` chunk, even when
    the graph's Bloom filter settings are incompatible with the value in
    `commitGraph.changedPathsVersion`.
  - The fourth patch begins loading `BDAT` chunks unconditionally.
  - The fifth patch is a clean-up.
  - The sixth and final patch implements the approach discussed above.

Thanks in advance for your thoughts and review :-).

[1]: https://lore.kernel.org/git/ZMKvsObx+uaKA8zF@nand.local/
[2]: https://lore.kernel.org/git/xmqqy1it6ykm.fsf@gitster.g/

Taylor Blau (6):
  bloom: annotate filters with hash version
  bloom: prepare to discard incompatible Bloom filters
  t/t4216-log-bloom.sh: harden `test_bloom_filters_not_used()`
  commit-graph.c: unconditionally load Bloom filters
  object.h: fix mis-aligned flag bits table
  commit-graph: reuse existing Bloom filters where possible

 bloom.c              | 117 +++++++++++++++++++++++++++++++++++++++++--
 bloom.h              |  22 +++++++-
 commit-graph.c       |  24 +++++----
 object.h             |   3 +-
 t/t4216-log-bloom.sh |  49 ++++++++++++++++--
 5 files changed, 195 insertions(+), 20 deletions(-)

-- 
2.41.0.407.g6d1c33951b
