From: Derrick Stolee <stolee@gmail.com>
To: "SZEDER Gábor" <szeder.dev@gmail.com>, git@vger.kernel.org
Cc: "Jeff King" <peff@peff.net>, "Junio C Hamano" <gitster@pobox.com>,
"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
"Stefan Beller" <sbeller@google.com>,
"Duy Nguyen" <pclouds@gmail.com>,
"Jonathan Tan" <jonathantanmy@google.com>
Subject: Re: [PATCH 0/4] Bloom filter experiment
Date: Mon, 15 Oct 2018 10:39:45 -0400 [thread overview]
Message-ID: <61559c5b-546e-d61b-d2e1-68de692f5972@gmail.com> (raw)
In-Reply-To: <20181009193445.21908-1-szeder.dev@gmail.com>
On 10/9/2018 3:34 PM, SZEDER Gábor wrote:
> To keep the ball rolling, here is my proof of concept in a somewhat
> cleaned-up form, with still plenty of rough edges.
Peff, Szeder, and Jonathan,
Thanks for giving me the kick in the pants to finally write a proof of
concept for my personal take on how this should work. My implementation
borrows things from both Szeder and Jonathan's series. You can find my
commits for all of the versions on GitHub (it's a bit too messy to share
as a patch series right now, I think):
Repo: https://github.com/derrickstolee/git
Branches: bloom/* (includes bloom/stolee, bloom/peff, bloom/szeder, and
bloom/tan for the respective implementations, and bloom/base as the
common ancestor)
My implementation uses the following scheme:
1. Bloom filters are computed and stored on a commit-by-commit basis.
2. The filters are sized according to the number of changes in each
commit, with a minimum of one 64-bit word.
3. The filters are stored in the commit-graph using two new optional
chunks: one stores a single 32-bit integer for each commit that provides
the end of its Bloom filter in the second "data" chunk. The data chunk
also stores the magic constants (hash version, num hash keys, and num
bits per entry).
4. We fill the Bloom filters as (const char *data, int len) pairs as
"struct bloom_filter"s in a commit slab.
5. In order to evaluate containment, we need the struct bloom_filter,
but also struct bloom_settings (stores the magic constants in one
place), and struct bloom_key (stores the _k_ hash values). This allows
us to hash a path once and test the same path against many Bloom filters.
6. When we compute the Bloom filters, we don't store a filter for
commits whose first-parent diff has more than 512 paths.
7. When we compute the commit-graph, we can re-use the pre-existing
filters without needing to recompute diffs. (Caveat: the current
implementation will re-compute diffs for the commits with diffs that
were too large.)
You can build the Bloom filters in my implementation this way:
GIT_TEST_BLOOM_FILTERS=1 ./git commit-graph write --reachable
> You can play around with it like this:
>
> $ GIT_USE_POC_BLOOM_FILTER=$((8*1024*1024*8)) git commit-graph write
> Computing commit graph generation numbers: 100% (52801/52801), done.
> Computing bloom filter: 100% (52801/52801), done.
> # Yeah, I even added progress indicator! :)
> $ GIT_TRACE_BLOOM_FILTER=2 GIT_USE_POC_BLOOM_FILTER=y git rev-list --count --full-history HEAD -- t/valgrind/valgrind.sh
> 886
> 20:40:24.783699 revision.c:486 bloom filter total queries: 66095 definitely not: 64953 maybe: 1142 false positives: 256 fp ratio: 0.003873
Jonathan used this same test, so will I. Here is a summary table:
| Implementation | Queries | Maybe | FP # | FP % |
|----------------|---------|-------|------|-------|
| Szeder | 66095 | 1142 | 256 | 0.38% |
| Jonathan | 66459 | 107 | 89 | 0.16% |
| Stolee | 53025 | 492 | 479 | 0.90% |
(Note that we must have used different starting points, which is why my
"Queries" is so much smaller.)
The increase in false-positive percentage is expected in my
implementation. I'm using the correct filter sizes to hit a <1% FP
ratio. This could be lowered by changing the settings, and the size
would dynamically grow. For my Git repo (which contains
git-for-windows/git and microsoft/git) this implementation grows the
commmit-graph file from 5.8 MB to 7.3 MB (1.5 MB total, compared to
Szeder's 8MB filter). For 105,260 commits, that rounds out to less than
20 bytes per commit (compared to Jonathan's 256 bytes per commit).
Related stats for my Linux repo: 781,756 commits, commit-graph grows
from 43.8 to 55.6 MB (~12 MB additional, ~16 bytes per commit).
I haven't done a side-by-side performance test for these
implementations, but it would be interesting to do so.
Despite writing a lot of code in a short amount of time, there is a lot
of work to be done before this is submittable:
1. There are three different environment variables right now. It would
be better to have one GIT_TEST_ variable and rely on existing tracing
for logs (trace2 values would be good here).
2. We need config values for writing and consuming bloom filters, but
also to override the default settings.
3. My bloom.c/bloom.h is too coupled to the commit-graph. I want to
harden that interface to let Bloom filters live as their own thing, but
that the commit-graph could load a bloom filter from the file instead of
from the slab.
4. Tests, tests, and more tests.
We'll see how much time I have to do this polish, but I think the
benefit is proven.
Thanks,
-Stolee
next prev parent reply other threads:[~2018-10-15 14:39 UTC|newest]
Thread overview: 78+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-10-03 13:23 We should add a "git gc --auto" after "git clone" due to commit graph Ævar Arnfjörð Bjarmason
2018-10-03 13:36 ` SZEDER Gábor
2018-10-03 13:42 ` Derrick Stolee
2018-10-03 14:18 ` Ævar Arnfjörð Bjarmason
2018-10-03 14:01 ` Ævar Arnfjörð Bjarmason
2018-10-03 14:17 ` SZEDER Gábor
2018-10-03 14:22 ` Ævar Arnfjörð Bjarmason
2018-10-03 14:53 ` SZEDER Gábor
2018-10-03 15:19 ` Ævar Arnfjörð Bjarmason
2018-10-03 16:59 ` SZEDER Gábor
2018-10-05 6:09 ` Junio C Hamano
2018-10-10 22:07 ` SZEDER Gábor
2018-10-10 23:01 ` Ævar Arnfjörð Bjarmason
2018-10-03 19:08 ` Stefan Beller
2018-10-03 19:21 ` Jeff King
2018-10-03 20:35 ` Ævar Arnfjörð Bjarmason
2018-10-03 17:47 ` Stefan Beller
2018-10-03 18:47 ` Ævar Arnfjörð Bjarmason
2018-10-03 18:51 ` Jeff King
2018-10-03 18:59 ` Derrick Stolee
2018-10-03 19:18 ` Jeff King
2018-10-08 16:41 ` SZEDER Gábor
2018-10-08 16:57 ` Derrick Stolee
2018-10-08 18:10 ` SZEDER Gábor
2018-10-08 18:29 ` Derrick Stolee
2018-10-09 3:08 ` Jeff King
2018-10-09 13:48 ` Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph) Derrick Stolee
2018-10-09 18:45 ` Ævar Arnfjörð Bjarmason
2018-10-09 18:46 ` Jeff King
2018-10-09 19:03 ` Derrick Stolee
2018-10-09 21:14 ` Jeff King
2018-10-09 23:12 ` Bloom Filters Jeff King
2018-10-09 23:13 ` [PoC -- do not apply 1/3] initial tree-bitmap proof of concept Jeff King
2018-10-09 23:14 ` [PoC -- do not apply 2/3] test-tree-bitmap: add "dump" mode Jeff King
2018-10-10 0:48 ` Junio C Hamano
2018-10-11 3:13 ` Jeff King
2018-10-09 23:14 ` [PoC -- do not apply 3/3] test-tree-bitmap: replace ewah with custom rle encoding Jeff King
2018-10-10 0:58 ` Junio C Hamano
2018-10-11 3:20 ` Jeff King
2018-10-11 12:33 ` Bloom Filters Derrick Stolee
2018-10-11 13:43 ` Jeff King
2018-10-09 21:30 ` We should add a "git gc --auto" after "git clone" due to commit graph SZEDER Gábor
2018-10-09 19:34 ` [PATCH 0/4] Bloom filter experiment SZEDER Gábor
2018-10-09 19:34 ` [PATCH 1/4] Add a (very) barebones Bloom filter implementation SZEDER Gábor
2018-10-09 19:34 ` [PATCH 2/4] commit-graph: write a Bloom filter containing changed paths for each commit SZEDER Gábor
2018-10-09 21:06 ` Jeff King
2018-10-09 21:37 ` SZEDER Gábor
2018-10-09 19:34 ` [PATCH 3/4] revision.c: use the Bloom filter to speed up path-limited revision walks SZEDER Gábor
2018-10-09 19:34 ` [PATCH 4/4] revision.c: add GIT_TRACE_BLOOM_FILTER for a bit of statistics SZEDER Gábor
2018-10-09 19:47 ` [PATCH 0/4] Bloom filter experiment Derrick Stolee
2018-10-11 1:21 ` [PATCH 0/2] Per-commit filter proof of concept Jonathan Tan
2018-10-11 1:21 ` [PATCH 1/2] One filter per commit Jonathan Tan
2018-10-11 12:49 ` Derrick Stolee
2018-10-11 19:11 ` [PATCH] Per-commit and per-parent filters for 2 parents Jonathan Tan
2018-10-11 1:21 ` [PATCH 2/2] Only make bloom filter for first parent Jonathan Tan
2018-10-11 7:37 ` [PATCH 0/2] Per-commit filter proof of concept Ævar Arnfjörð Bjarmason
2018-10-15 14:39 ` Derrick Stolee [this message]
2018-10-16 4:45 ` [PATCH 0/4] Bloom filter experiment Junio C Hamano
2018-10-16 11:13 ` Derrick Stolee
2018-10-16 12:57 ` Ævar Arnfjörð Bjarmason
2018-10-16 13:03 ` Derrick Stolee
2018-10-18 2:00 ` Junio C Hamano
2018-10-16 23:41 ` Jonathan Tan
2018-10-08 23:02 ` We should add a "git gc --auto" after "git clone" due to commit graph Junio C Hamano
2018-10-03 14:32 ` Duy Nguyen
2018-10-03 16:45 ` Duy Nguyen
2018-10-04 21:42 ` [RFC PATCH] " Ævar Arnfjörð Bjarmason
2018-10-05 12:05 ` Derrick Stolee
2018-10-05 13:05 ` Ævar Arnfjörð Bjarmason
2018-10-05 13:45 ` Derrick Stolee
2018-10-05 14:04 ` Ævar Arnfjörð Bjarmason
2018-10-05 19:21 ` Jeff King
2018-10-05 19:41 ` Derrick Stolee
2018-10-05 19:47 ` Jeff King
2018-10-05 20:00 ` Derrick Stolee
2018-10-05 20:02 ` Jeff King
2018-10-05 20:01 ` Ævar Arnfjörð Bjarmason
2018-10-05 20:09 ` Jeff King
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=61559c5b-546e-d61b-d2e1-68de692f5972@gmail.com \
--to=stolee@gmail.com \
--cc=avarab@gmail.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=jonathantanmy@google.com \
--cc=pclouds@gmail.com \
--cc=peff@peff.net \
--cc=sbeller@google.com \
--cc=szeder.dev@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).