Git Mailing List Archive on lore.kernel.org
 help / color / Atom feed
From: "SZEDER Gábor" <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
	Garima Singh <garima.singh@microsoft.com>,
	Derrick Stolee <stolee@gmail.com>,
	Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
	Taylor Blau <me@ttaylorr.com>
Subject: Re: [PATCH 15/34] commit-graph-format.txt: document the modified path Bloom filter chunks
Date: Tue, 2 Jun 2020 19:59:49 +0200
Message-ID: <20200602175949.GA2898@szeder.dev> (raw)
In-Reply-To: <20200529085038.26008-16-szeder.dev@gmail.com>

On Fri, May 29, 2020 at 10:50:19AM +0200, SZEDER Gábor wrote:
> Modified Path Bloom Filters

> Each modified path Bloom filter consists of:
> 
>   - 4 bytes specifying the number of bits in the Bloom filter's bit
>     array.
> 
>     For practical purposes 32 bit is more than sufficient to store the
>     number of bits in the Bloom filter's array.  When using k = 7
>     hashes, i.e. 10 bits per path, then we could store over 400
>     million paths in a single Bloom filter; with k = 32 hashes we'd
>     use 46 bit per path, which is still over 93 million paths.

> Alternatives considered
> -----------------------
> 
> Here are some alternatives that I've considered but discarded and
> ideas that I haven't (yet) followed through:

>   - Varints.  Using 4 bytes for the size of each Bloom filter in the
>     Modified Path Bloom Filters chunk is a lot more than necessary.
>     How much space can be saved by using varints?

Not that much, at least compared to the whole commit-graph file.

Since 63 bit modified path Bloom filters are embedded in the index
entries, it's pointless to store smaller Bloom filters, so the size of
non-embedded Bloom filters can be defined as

  nr_bits = 64 + decode_varint(...)

A one byte varint can encode values up to 127, while a two bytes
varint can encode values up to 16511.  So the max nr_bits is 191 and
16575, respectively, which means max 19 or 1657 paths with 7 hashes,
i.e. 10 bits per path.  The table below shows the percentage of
commits with embedded modified path Bloom filters and commits with
non-embedded Bloom filters with one byte and two bytes varints, and
how much space is saved.

                    Percentage of commits   |
                    modifying <= N paths    |   varint   commit-graph
                  compared to first parent  | size diff    file size
                  <=6      <=19     <=1657  |  (bytes)     (ls -lh)
  ------------------------------------------+-------------------------------
  elasticsearch  18.32%   65.56%    99.79%  |   -90158       9.5M      -0.9%
  jdk            26.62%   70.46%    97.94%  |   -90262        16M      -0.6%
  webkit         38.42%   82.24%    99.92%  |  -298824        19M      -1.6%
  android-base   42.32%   88.02%    99.98%  |  -132327        30M      -0.5%
  llvm-project   53.60%   93.86%    99.99%  |  -344239        24M      -1.4%
  gecko-dev      54.54%   87.15%    99.87%  |  -625029        63M      -1.0%
  tensorflow     55.17%   90.76%    99.52%  |   -83758       9.0M      -0.9%
  rails          58.71%   95.13%    99.99%  |   -53153       6.0M      -0.9%
  rust           59.29%   88.03%    99.96%  |   -90677       8.2M      -1.1%
  glibc          61.59%   91.11%    99.96%  |   -38422       3.8M      -1.0%
  gcc            63.80%   95.39%    99.97%  |  -179463        18M      -1.0%
  go             65.31%   95.19%    99.99%  |   -39109       3.2M      -1.2%
  cmssw          67.58%   93.02%    99.91%  |  -154440        23M      -0.7%
  linux          72.79%   95.27%    99.78%  |  -527045        78M      -0.7%
  cpython        81.91%   97.78%   100.00%  |   -40678       7.8M      -0.5%
  git            90.28%   98.56%   100.00%  |   -13406       3.9M      -0.4%
  homebrew-cask  98.61%   99.58%    99.99%  |    -2992       6.6M      -0.1%
  homebrew-core  99.81%   99.93%   100.00%  |     -810        11M      -0.1%

>     How much is the runtime overhead of decoding those varints?

Not enough to be above noise level.  ("a lot of paths",
MurmurHash3, k = 7, enhanced double hashing, 32 bit uint arithmetic)

                                       |   Average time spent
                   Average runtime     |  loading and querying
                                       |      Bloom filters
                  uint32     varint    |   uint32     varint
  -------------------------------------+-----------------------------
  android-base    0.1456s   0.144172s  |  0.01550s   0.01571s    1.3%
  cmssw           0.0313s   0.032046s  |  0.00383s   0.00395s    3.1%
  cpython         0.0810s   0.081649s  |  0.00765s   0.00785s    2.6%
  elasticsearch   0.0136s   0.013899s  |  0.00246s   0.00258s    4.8%
  gcc             0.2114s   0.211080s  |  0.02259s   0.02261s      0%
  gecko-dev       0.4815s   0.474903s  |  0.06004s   0.06028s    0.3%
  git             0.0310s   0.031156s  |  0.00192s   0.00195s    1.5%
  glibc           0.0282s   0.029032s  |  0.00320s   0.00342s    6.8%
  go              0.0403s   0.039408s  |  0.00428s   0.00414s   -3.3%
  jdk             0.0068s   0.006666s  |  0.00117s   0.00113s   -3.5%
  linux           0.0873s   0.087438s  |  0.01109s   0.01133s    2.1%
  llvm-project    0.4135s   0.418390s  |  0.04716s   0.04834s    2.5%
  rails           0.0391s   0.038997s  |  0.00449s   0.00448s   -0.3%
  rust            0.0439s   0.044569s  |  0.00444s   0.00461s    3.8%
  tensorflow      0.0487s   0.049166s  |  0.00618s   0.00634s    2.5%
  webkit          0.2420s   0.241807s  |  0.03353s   0.03379s    0.7%

>     How can we ensure that varint decoding doesn't read past the end
>     of the mmapped memory region?

With a decode_varint() variant that takes the max number of bytes to
read as an additional parameter, and returns 0 if the varint is too
long.


  reply index

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-29  8:50 [PoC PATCH 00/34] An alternative modified path Bloom filters implementation SZEDER Gábor
2020-05-28 22:09 ` Johannes Schindelin
2020-05-29  8:50 ` [PATCH 01/34] tree-walk.c: don't match submodule entries for 'submod/anything' SZEDER Gábor
2020-05-29  8:50 ` [PATCH 02/34] commit-graph: fix parsing the Chunk Lookup table SZEDER Gábor
2020-05-29  8:50 ` [PATCH 03/34] commit-graph-format.txt: all multi-byte numbers are in network byte order SZEDER Gábor
2020-05-29  8:50 ` [PATCH 04/34] commit-slab: add a function to deep free entries on the slab SZEDER Gábor
2020-06-04 16:43   ` Derrick Stolee
2020-06-27 15:56     ` SZEDER Gábor
2020-06-29 11:30       ` Derrick Stolee
2020-05-29  8:50 ` [PATCH 05/34] diff.h: drop diff_tree_oid() & friends' return value SZEDER Gábor
2020-05-29  8:50 ` [PATCH 06/34] commit-graph: clean up #includes SZEDER Gábor
2020-05-29  8:50 ` [PATCH 07/34] commit-graph: simplify parse_commit_graph() #1 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 08/34] commit-graph: simplify parse_commit_graph() #2 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 09/34] commit-graph: simplify write_commit_graph_file() #1 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 10/34] commit-graph: simplify write_commit_graph_file() #2 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 11/34] commit-graph: allocate the 'struct chunk_info' array dinamically SZEDER Gábor
2020-05-29  8:50 ` [PATCH 12/34] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor
2020-05-29  8:50 ` [PATCH 13/34] commit-graph: simplify write_commit_graph_file() #3 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 14/34] commit-graph: check chunk sizes after writing SZEDER Gábor
2020-05-29  8:50 ` [PATCH 15/34] commit-graph-format.txt: document the modified path Bloom filter chunks SZEDER Gábor
2020-06-02 17:59   ` SZEDER Gábor [this message]
2020-05-29  8:50 ` [PATCH 16/34] Add a generic and minimal Bloom filter implementation SZEDER Gábor
2020-05-29  8:50 ` [PATCH 17/34] Import a streaming-capable Murmur3 hash function implementation SZEDER Gábor
2020-05-29  8:50 ` [PATCH 18/34] commit-graph: write "empty" Modified Path Bloom Filter Index chunk SZEDER Gábor
2020-05-29  8:50 ` [PATCH 19/34] commit-graph: add commit slab for modified path Bloom filters SZEDER Gábor
2020-05-29  8:50 ` [PATCH 20/34] commit-graph: fill the Modified Path Bloom Filter Index chunk SZEDER Gábor
2020-05-29  8:50 ` [PATCH 21/34] commit-graph: load and use " SZEDER Gábor
2020-05-29  8:50 ` [PATCH 22/34] commit-graph: write the Modified Path Bloom Filters chunk SZEDER Gábor
2020-05-29  8:50 ` [PATCH 23/34] commit-graph: load and use " SZEDER Gábor
2020-05-29  8:50 ` [PATCH 24/34] commit-graph: check all leading directories in modified path Bloom filters SZEDER Gábor
2020-05-29  8:50 ` [PATCH 25/34] commit-graph: check embedded modified path Bloom filters with a mask SZEDER Gábor
2020-05-29  8:50 ` [PATCH 26/34] commit-graph: deduplicate modified path Bloom filters SZEDER Gábor
2020-05-29  8:50 ` [PATCH 27/34] commit-graph: load modified path Bloom filters for merge commits SZEDER Gábor
2020-05-29  8:50 ` [PATCH 28/34] commit-graph: write Modified Path Bloom Filter Merge Index chunk SZEDER Gábor
2020-05-29  8:50 ` [PATCH 29/34] commit-graph: extract init and free write_commit_graph_context SZEDER Gábor
2020-05-29  8:50 ` [PATCH 30/34] commit-graph: move write_commit_graph_reachable below write_commit_graph SZEDER Gábor
2020-05-29  8:50 ` [PATCH 31/34] t7007-show: make the first test compatible with the next patch SZEDER Gábor
2020-05-29  8:50 ` [PATCH 32/34] PoC commit-graph: use revision walk machinery for '--reachable' SZEDER Gábor
2020-05-29  8:50 ` [PATCH 33/34] commit-graph: write modified path Bloom filters in "history order" SZEDER Gábor
2020-05-29  8:50 ` [PATCH 34/34] commit-graph: use modified path Bloom filters with wildcards, if possible SZEDER Gábor
2020-05-29 13:59 ` [PoC PATCH 00/34] An alternative modified path Bloom filters implementation Derrick Stolee
2020-06-01 23:25 ` Taylor Blau
2020-06-02 17:08   ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200602175949.GA2898@szeder.dev \
    --to=szeder.dev@gmail.com \
    --cc=garima.singh@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jnareb@gmail.com \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Git Mailing List Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/git/0 git/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 git git/ https://lore.kernel.org/git \
		git@vger.kernel.org
	public-inbox-index git

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.git


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git