Git Mailing List Archive on lore.kernel.org
 help / color / Atom feed
From: "SZEDER Gábor" <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: "Junio C Hamano" <gitster@pobox.com>,
	"Garima Singh" <garima.singh@microsoft.com>,
	"Derrick Stolee" <stolee@gmail.com>,
	"Jakub Narebski" <jnareb@gmail.com>, "Jeff King" <peff@peff.net>,
	"Taylor Blau" <me@ttaylorr.com>,
	"SZEDER Gábor" <szeder.dev@gmail.com>
Subject: [PoC PATCH 00/34] An alternative modified path Bloom filters implementation
Date: Fri, 29 May 2020 10:50:04 +0200
Message-ID: <20200529085038.26008-1-szeder.dev@gmail.com> (raw)

Sigh...  but better late than never, right?

I experimented quite a bit with modified path Bloom filters a year and
more ago, and got quite far...  but my disappointment in the
inadequacy of all double hashing schemes, the arrival of split
commit-graphs, and, well, life in general has put the whole thing on
the back burner, and I haven't touched it for a couple of releases.

Now I finally managed to take a closer look at the current changed
paths Bloom filters implementation, and saw that it has some of the
same issues that I had stumbled upon and that it missed some
optimization opportunities.  Unfortunately, fixing those issues and
performing those optimizations do require a thorough format change.

So here is my proof of concept version, in all its incompleteness,
with the following benefits:

  - Better understanding of the problem it tries to optimize.
  - Better understanding of the issues with many small Bloom filters.
  - Better hashing scheme (though it should be better still).
  - Orders of magnitude lower average false positive rate.
  - Consequently, faster pathspec-limited revision walks.
  - Faster processing of the tree-diff output and lower memory usage
    while computing Bloom filters (from scratch...).
  - Optional support for storing Bloom filters for all parents of
    merge commits.
  - Deduplicates Bloom filters.
  - Supports multiple pathspecs right from the start.
  - Supports some wildcards in pathspecs.
  - Handles as many commits as the commit-graph format can.
  - It has the right name :)  The diff machinery and all its frontends
    report "modified" paths with the letter 'M', not "changed".
  - More cleanups, more bugfixes.
  - Consistent output with and without modified path Bloom filters for
    over 80k random paths in 16 repositories, even with submodules in
    them.  Well, at least on my machine, if nowhere else...

Alas, the drawbacks are significant:

  - No tests whatsoever.
  - Computes all modified path Bloom filters from scratch when
    writing, no matter what.
  - Doesn't work with split commit-graphs.
  - Basically if anything works besides 'git commit-graph write
    --reachable' it's a miracle.
  - Not a single test.
  - Many BUG()s, which should rather be graceful errors...  though I
    have to admit that at this point they are indeed bugs.
  - Many TODOs, both in commit messages and code, some incomplete
    commit messages, crappy subject lines, even missing signoffs.
  - Some ridiculously long variable, function, macro and config
    variable names.
  - It's based on v2.25.0 (no technical reason, but that's the version
    I used to run the baseline benchmarks the last time, which takes
    days...)
  - I'm pretty sure that there are more...
  - Oh, did I mention that there are no tests?


The first 14 patches are preparatory fixes and cleanups:

  01/34 tree-walk.c: don't match submodule entries for 'submod/anything'

This fix or something similar is necessary to have consistent output
with and without modified path Bloom filters for paths crossing
submodule boundary.

  02/34 commit-graph: fix parsing the Chunk Lookup table

The minimal (though not the best) fix for a bug which, I think, is as
old as the commit-graph.  I don't know how to test this.

  03/34 commit-graph-format.txt: all multi-byte numbers are in network byte order
  04/34 commit-slab: add a function to deep free entries on the slab
  05/34 diff.h: drop diff_tree_oid() & friends' return value
  06/34 commit-graph: clean up #includes

A couple of minor cleanups.

  07/34 commit-graph: simplify parse_commit_graph() #1
  08/34 commit-graph: simplify parse_commit_graph() #2

These two would be the right, though not minimal fix for the parsing
bug above.

  09/34 commit-graph: simplify write_commit_graph_file() #1
  10/34 commit-graph: simplify write_commit_graph_file() #2
  11/34 commit-graph: allocate the 'struct chunk_info' array dinamically

I think these three cleanup patches are a better alternative of
3be7efcafc (commit-graph: define and use MAX_NUM_CHUNKS, 2020-03-30),
because...

  12/34 commit-graph: unify the signatures of all write_graph_chunk_*() functions
  13/34 commit-graph: simplify write_commit_graph_file() #3
  14/34 commit-graph: check chunk sizes after writing

... they laid the ground work for this patch.

  15/34 commit-graph-format.txt: document the modified path Bloom filter chunks

This is the most important one, specifying and _justifying_ the new
chunk formats.

Do grab a cup or pint of your favourite beverage and get comfy before
reading this one.  You have been warned.

  16/34 Add a generic and minimal Bloom filter implementation
  17/34 Import a streaming-capable Murmur3 hash function implementation
  18/34 commit-graph: write "empty" Modified Path Bloom Filter Index chunk
  19/34 commit-graph: add commit slab for modified path Bloom filters
  20/34 commit-graph: fill the Modified Path Bloom Filter Index chunk

This shows a more efficient approach to process the tree-diff output
into Bloom filters.

  21/34 commit-graph: load and use the Modified Path Bloom Filter Index chunk
  22/34 commit-graph: write the Modified Path Bloom Filters chunk
  23/34 commit-graph: load and use the Modified Path Bloom Filters chunk
  24/34 commit-graph: check all leading directories in modified path Bloom filters

This was a good lightbulb moment.  It is essential to try to maintain
reasonable performance in repositories where the vast majority of
changes are concentrated to a single directory.

  25/34 commit-graph: check embedded modified path Bloom filters with a mask
  26/34 commit-graph: deduplicate modified path Bloom filters
  27/34 commit-graph: load modified path Bloom filters for merge commits
  28/34 commit-graph: write Modified Path Bloom Filter Merge Index chunk
  29/34 commit-graph: extract init and free write_commit_graph_context
  30/34 commit-graph: move write_commit_graph_reachable below write_commit_graph
  31/34 t7007-show: make the first test compatible with the next patch
  32/34 PoC commit-graph: use revision walk machinery for '--reachable'

Once upon a time I thought this was the greatest idea ever, but as
time goes by I get more and more concerned that this is a really dumb
idea, though don't yet know why.

  33/34 commit-graph: write modified path Bloom filters in "history order"
  34/34 commit-graph: use modified path Bloom filters with wildcards, if possible

Finally a cherry on top.



SZEDER Gábor (34):
  tree-walk.c: don't match submodule entries for 'submod/anything'
  commit-graph: fix parsing the Chunk Lookup table
  commit-graph-format.txt: all multi-byte numbers are in network byte
    order
  commit-slab: add a function to deep free entries on the slab
  diff.h: drop diff_tree_oid() & friends' return value
  commit-graph: clean up #includes
  commit-graph: simplify parse_commit_graph() #1
  commit-graph: simplify parse_commit_graph() #2
  commit-graph: simplify write_commit_graph_file() #1
  commit-graph: simplify write_commit_graph_file() #2
  commit-graph: allocate the 'struct chunk_info' array dinamically
  commit-graph: unify the signatures of all write_graph_chunk_*()
    functions
  commit-graph: simplify write_commit_graph_file() #3
  commit-graph: check chunk sizes after writing
  commit-graph-format.txt: document the modified path Bloom filter
    chunks
  Add a generic and minimal Bloom filter implementation
  Import a streaming-capable Murmur3 hash function implementation
  commit-graph: write "empty" Modified Path Bloom Filter Index chunk
  commit-graph: add commit slab for modified path Bloom filters
  commit-graph: fill the Modified Path Bloom Filter Index chunk
  commit-graph: load and use the Modified Path Bloom Filter Index chunk
  commit-graph: write the Modified Path Bloom Filters chunk
  commit-graph: load and use the Modified Path Bloom Filters chunk
  commit-graph: check all leading directories in modified path Bloom
    filters
  commit-graph: check embedded modified path Bloom filters with a mask
  commit-graph: deduplicate modified path Bloom filters
  commit-graph: load modified path Bloom filters for merge commits
  commit-graph: write Modified Path Bloom Filter Merge Index chunk
  commit-graph: extract init and free write_commit_graph_context
  commit-graph: move write_commit_graph_reachable below
    write_commit_graph
  t7007-show: make the first test compatible with the next patch
  PoC commit-graph: use revision walk machinery for '--reachable'
  commit-graph: write modified path Bloom filters in "history order"
  commit-graph: use modified path Bloom filters with wildcards, if
    possible

 Documentation/config/core.txt                 |   19 +
 .../technical/commit-graph-format.txt         |  127 +-
 Makefile                                      |    2 +
 bloom-filter.c                                |   91 ++
 bloom-filter.h                                |   47 +
 commit-graph.c                                | 1239 +++++++++++++++--
 commit-graph.h                                |   24 +-
 commit-slab-decl.h                            |    1 +
 commit-slab-impl.h                            |   13 +
 commit-slab.h                                 |   10 +
 compat/PMurHash.c                             |  291 ++++
 compat/PMurHash.h                             |   62 +
 diff.h                                        |   10 +-
 pathspec.c                                    |   10 +
 pathspec.h                                    |   13 +
 revision.c                                    |   27 +-
 shallow.c                                     |   14 +-
 t/t4010-diff-pathspec.sh                      |    4 +-
 t/t5318-commit-graph.sh                       |    3 +-
 t/t7007-show.sh                               |    7 +-
 tree-diff.c                                   |   21 +-
 tree-walk.c                                   |    9 +-
 22 files changed, 1872 insertions(+), 172 deletions(-)
 create mode 100644 bloom-filter.c
 create mode 100644 bloom-filter.h
 create mode 100644 compat/PMurHash.c
 create mode 100644 compat/PMurHash.h

-- 
2.27.0.rc1.431.g5c813f95dc


             reply index

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-29  8:50 SZEDER Gábor [this message]
2020-05-28 22:09 ` Johannes Schindelin
2020-05-29  8:50 ` [PATCH 01/34] tree-walk.c: don't match submodule entries for 'submod/anything' SZEDER Gábor
2020-05-29  8:50 ` [PATCH 02/34] commit-graph: fix parsing the Chunk Lookup table SZEDER Gábor
2020-05-29  8:50 ` [PATCH 03/34] commit-graph-format.txt: all multi-byte numbers are in network byte order SZEDER Gábor
2020-05-29  8:50 ` [PATCH 04/34] commit-slab: add a function to deep free entries on the slab SZEDER Gábor
2020-06-04 16:43   ` Derrick Stolee
2020-06-27 15:56     ` SZEDER Gábor
2020-06-29 11:30       ` Derrick Stolee
2020-05-29  8:50 ` [PATCH 05/34] diff.h: drop diff_tree_oid() & friends' return value SZEDER Gábor
2020-05-29  8:50 ` [PATCH 06/34] commit-graph: clean up #includes SZEDER Gábor
2020-05-29  8:50 ` [PATCH 07/34] commit-graph: simplify parse_commit_graph() #1 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 08/34] commit-graph: simplify parse_commit_graph() #2 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 09/34] commit-graph: simplify write_commit_graph_file() #1 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 10/34] commit-graph: simplify write_commit_graph_file() #2 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 11/34] commit-graph: allocate the 'struct chunk_info' array dinamically SZEDER Gábor
2020-05-29  8:50 ` [PATCH 12/34] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor
2020-05-29  8:50 ` [PATCH 13/34] commit-graph: simplify write_commit_graph_file() #3 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 14/34] commit-graph: check chunk sizes after writing SZEDER Gábor
2020-05-29  8:50 ` [PATCH 15/34] commit-graph-format.txt: document the modified path Bloom filter chunks SZEDER Gábor
2020-06-02 17:59   ` SZEDER Gábor
2020-05-29  8:50 ` [PATCH 16/34] Add a generic and minimal Bloom filter implementation SZEDER Gábor
2020-05-29  8:50 ` [PATCH 17/34] Import a streaming-capable Murmur3 hash function implementation SZEDER Gábor
2020-05-29  8:50 ` [PATCH 18/34] commit-graph: write "empty" Modified Path Bloom Filter Index chunk SZEDER Gábor
2020-05-29  8:50 ` [PATCH 19/34] commit-graph: add commit slab for modified path Bloom filters SZEDER Gábor
2020-05-29  8:50 ` [PATCH 20/34] commit-graph: fill the Modified Path Bloom Filter Index chunk SZEDER Gábor
2020-05-29  8:50 ` [PATCH 21/34] commit-graph: load and use " SZEDER Gábor
2020-05-29  8:50 ` [PATCH 22/34] commit-graph: write the Modified Path Bloom Filters chunk SZEDER Gábor
2020-05-29  8:50 ` [PATCH 23/34] commit-graph: load and use " SZEDER Gábor
2020-05-29  8:50 ` [PATCH 24/34] commit-graph: check all leading directories in modified path Bloom filters SZEDER Gábor
2020-05-29  8:50 ` [PATCH 25/34] commit-graph: check embedded modified path Bloom filters with a mask SZEDER Gábor
2020-05-29  8:50 ` [PATCH 26/34] commit-graph: deduplicate modified path Bloom filters SZEDER Gábor
2020-05-29  8:50 ` [PATCH 27/34] commit-graph: load modified path Bloom filters for merge commits SZEDER Gábor
2020-05-29  8:50 ` [PATCH 28/34] commit-graph: write Modified Path Bloom Filter Merge Index chunk SZEDER Gábor
2020-05-29  8:50 ` [PATCH 29/34] commit-graph: extract init and free write_commit_graph_context SZEDER Gábor
2020-05-29  8:50 ` [PATCH 30/34] commit-graph: move write_commit_graph_reachable below write_commit_graph SZEDER Gábor
2020-05-29  8:50 ` [PATCH 31/34] t7007-show: make the first test compatible with the next patch SZEDER Gábor
2020-05-29  8:50 ` [PATCH 32/34] PoC commit-graph: use revision walk machinery for '--reachable' SZEDER Gábor
2020-05-29  8:50 ` [PATCH 33/34] commit-graph: write modified path Bloom filters in "history order" SZEDER Gábor
2020-05-29  8:50 ` [PATCH 34/34] commit-graph: use modified path Bloom filters with wildcards, if possible SZEDER Gábor
2020-05-29 13:59 ` [PoC PATCH 00/34] An alternative modified path Bloom filters implementation Derrick Stolee
2020-06-01 23:25 ` Taylor Blau
2020-06-02 17:08   ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200529085038.26008-1-szeder.dev@gmail.com \
    --to=szeder.dev@gmail.com \
    --cc=garima.singh@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jnareb@gmail.com \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Git Mailing List Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/git/0 git/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 git git/ https://lore.kernel.org/git \
		git@vger.kernel.org
	public-inbox-index git

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.git


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git