All of lore.kernel.org
 help / color / mirror / Atom feed
From: "SZEDER Gábor" <szeder.dev@gmail.com>
To: Taylor Blau <me@ttaylorr.com>
Cc: git@vger.kernel.org, dstolee@microsoft.com, gitster@pobox.com,
	peff@peff.net
Subject: Re: [PATCH v2 10/13] bloom: encode out-of-bounds filters as non-empty
Date: Fri, 18 Sep 2020 00:13:02 +0200	[thread overview]
Message-ID: <20200917221302.GC23146@szeder.dev> (raw)
In-Reply-To: <4653b5b4bcd254a3791797214b46722b4062dc18.1600279373.git.me@ttaylorr.com>

On Wed, Sep 16, 2020 at 02:07:59PM -0400, Taylor Blau wrote:
> When a changed-path Bloom filter has either zero, or more than a
> certain number (commonly 512) of entries, the commit-graph machinery
> encodes it as "missing". More specifically, it sets the indices adjacent
> in the BIDX chunk as equal to each other to indicate a "length 0"
> filter; that is, that the filter occupies zero bytes on disk.
> 
> This has heretofore been fine, since the commit-graph machinery has no
> need to care about these filters with too few or too many changed paths.
> Both cases act like no filter has been generated at all, and so there is
> no need to store them.
> 
> In a subsequent commit, however, the commit-graph machinery will learn
> to only compute Bloom filters for some commits in the current
> commit-graph layer. This is a change from the current implementation
> which computes Bloom filters for all commits that are in the layer being
> written. Critically for this patch, only computing some of the Bloom
> filters means adding a third state for length 0 Bloom filters: zero
> entries, too many entries, or "hasn't been computed".
> 
> It will be important for that future patch to distinguish between "not
> representable" (i.e., zero or too-many changed paths), and "hasn't been
> computed". In particular, we don't want to waste time recomputing
> filters that have already been computed.
> 
> To that end, change how we store Bloom filters in the "computed but not
> representable" category:
> 
>   - Bloom filters with no entries are stored as a single byte with all
>     bits low (i.e., all queries to that Bloom filter will return
>     "definitely not")
> 
>   - Bloom filters with too many entries are stored as a single byte with
>     all bits set high (i.e., all queries to that Bloom filter will
>     return "maybe").
> 
> These rules are sufficient to not incur a behavior change by changing
> the on-disk representation of these two classes. Likewise, no
> specification changes are necessary for the commit-graph format, either:
> 
>   - Filters that were previously empty will be recomputed and stored
>     according to the new rules, and
> 
>   - old clients reading filters generated by new clients will interpret
>     the filters correctly and be none the wiser to how they were
>     generated.
> 
> Clients will invoke the Bloom machinery in more cases than before, but
> this can be addressed by returning a NULL filter when all bits are set
> high. This can be addressed in a future patch.

OTOH, clients will invoke the tree-diff machinery in fewer cases than
before, because querying the Bloom filter of commits not modifying any
files will now return "definitely not".

> Finally, note that this does increase the size of on-disk commit-graphs,
> but far less than other proposals. In particular, this is generally more
> efficient than storing a bitmap for which commits haven't computed their
> Bloom filters. Storing a bitmap incurs a penalty of one bit per commit,
> whereas storing explicit filters as above incurs a penalty of one byte
> per too-large or too-small commit.

s/too-small/empty/

> In practice, these boundary commits likely occupy a small proportion of
> the overall number of commits, and so the size penalty is likely smaller
> than storing a bitmap for all commits.

                 |      Percentage of
                 |    commits modifying
                 |   0 path   |  >= 512 paths
  ---------------+------------+----------------
  android-base   |   13.20%   |   0.13%
  cmssw          |    0.15%   |   0.23%
  cpython        |    3.07%   |   0.01%
  elasticsearch  |    0.70%   |   1.00%
  gcc            |    0.00%   |   0.08%
  gecko-dev      |    0.14%   |   0.64%
  git            |    0.11%   |   0.02%
  glibc          |    0.02%   |   0.10%
  go             |    0.00%   |   0.07%
  homebrew-cask  |    0.40%   |   0.02%
  homebrew-core  |    0.01%   |   0.01%
  jdk            |    0.26%   |   5.64%
  linux          |    0.01%   |   0.51%
  llvm-project   |    0.12%   |   0.03%
  rails          |    0.10%   |   0.10%
  rust           |    0.07%   |   0.17%
  tensorflow     |    0.09%   |   1.02%
  webkit         |    0.05%   |   0.31%


> A test to exercise filters which contain too many changed path entries
> will be introduced in a subsequent patch.


> diff --git a/bloom.h b/bloom.h
> index c6d77e8393..70a8840896 100644
> --- a/bloom.h
> +++ b/bloom.h
> @@ -93,6 +93,7 @@ enum bloom_filter_computed {
>  	BLOOM_NOT_COMPUTED = (1 << 0),
>  	BLOOM_COMPUTED     = (1 << 1),
>  	BLOOM_TRUNC_LARGE  = (1 << 2),
> +	BLOOM_TRUNC_SMALL  = (1 << 3),

s/SMALL/EMPTY/

This "small" suffix in the constant, variable, and trace2 key names is
misleading, because we only mean empty commits.

>  };
>  
>  struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
> diff --git a/commit-graph.c b/commit-graph.c
> index 1ca754f19c..bd4247bca5 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -974,6 +974,7 @@ struct write_commit_graph_context {
>  
>  	int count_bloom_filter_computed;
>  	int count_bloom_filter_not_computed;
> +	int count_bloom_filter_trunc_small;
>  	int count_bloom_filter_trunc_large;
>  };
>  

  reply	other threads:[~2020-09-17 22:13 UTC|newest]

Thread overview: 75+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-09 15:22 [PATCH 00/12] more miscellaneous Bloom filter improvements, redux Taylor Blau
2020-09-09 15:22 ` [PATCH 01/12] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-09-09 15:22 ` [PATCH 02/12] t4216: use an '&&'-chain Taylor Blau
2020-09-09 15:22 ` [PATCH 03/12] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-09-09 15:23 ` [PATCH 04/12] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-09-09 15:23 ` [PATCH 05/12] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
2020-09-09 15:23 ` [PATCH 06/12] commit-graph.c: store maximum changed paths Taylor Blau
2020-09-09 15:23 ` [PATCH 07/12] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-09-09 15:23 ` [PATCH 08/12] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
2020-09-09 15:23 ` [PATCH 09/12] bloom/diff: properly short-circuit on max_changes Taylor Blau
2020-09-09 15:23 ` [PATCH 10/12] bloom: encode out-of-bounds filters as non-empty Taylor Blau
2020-09-10  3:35   ` Taylor Blau
2020-09-10 15:45     ` Taylor Blau
2020-09-11 18:15       ` Derrick Stolee
2020-09-09 15:23 ` [PATCH 11/12] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-09-09 15:24 ` [PATCH 12/12] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-09-11 17:52   ` Jeff King
2020-09-11 18:59     ` Taylor Blau
2020-09-11 19:25       ` Taylor Blau
2020-09-14 20:12         ` Taylor Blau
2020-09-14 20:31           ` Derrick Stolee
2020-09-14 20:36             ` Taylor Blau
2020-09-15  0:59               ` Derrick Stolee
2020-09-15  4:31                 ` Taylor Blau
2020-09-15 21:49               ` Junio C Hamano
2020-09-15 21:53                 ` Taylor Blau
2020-09-11 19:47       ` Jeff King
2020-09-11 19:31     ` Junio C Hamano
2020-09-16 18:06 ` [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux Taylor Blau
2020-09-16 18:06   ` [PATCH v2 01/13] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-09-16 18:07   ` [PATCH v2 02/13] t4216: use an '&&'-chain Taylor Blau
2020-09-16 18:07   ` [PATCH v2 03/13] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-09-16 18:07   ` [PATCH v2 04/13] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-09-16 18:07   ` [PATCH v2 05/13] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
2020-09-16 18:07   ` [PATCH v2 06/13] commit-graph.c: store maximum changed paths Taylor Blau
2020-09-16 18:07   ` [PATCH v2 07/13] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-09-16 18:07   ` [PATCH v2 08/13] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
2020-09-16 18:07   ` [PATCH v2 09/13] bloom/diff: properly short-circuit on max_changes Taylor Blau
2020-09-16 18:07   ` [PATCH v2 10/13] bloom: encode out-of-bounds filters as non-empty Taylor Blau
2020-09-17 22:13     ` SZEDER Gábor [this message]
2020-09-17 23:13       ` Taylor Blau
2020-09-18  0:52         ` Junio C Hamano
2020-09-18  1:15           ` Taylor Blau
2020-09-16 18:08   ` [PATCH v2 11/13] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-09-16 18:08   ` [PATCH v2 12/13] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-09-18  9:23     ` SZEDER Gábor
2020-09-18 13:27       ` Taylor Blau
2020-09-16 18:08   ` [PATCH v2 13/13] commit-graph: introduce 'commitGraph.maxNewFilters' Taylor Blau
2020-09-16 22:51   ` [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux Derrick Stolee
2020-09-16 23:07     ` Junio C Hamano
2020-09-17  0:45       ` Taylor Blau
2020-09-17  0:59         ` Junio C Hamano
2020-09-17  1:10           ` Taylor Blau
2020-09-17 13:34             ` Taylor Blau
2020-09-17 13:38               ` Derrick Stolee
2020-09-18  2:58 ` [PATCH v3 " Taylor Blau
2020-09-18  2:58   ` [PATCH v3 01/13] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-09-18  2:58   ` [PATCH v3 02/13] t4216: use an '&&'-chain Taylor Blau
2020-09-18  2:59   ` [PATCH v3 03/13] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-09-18  2:59   ` [PATCH v3 04/13] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-09-18  2:59   ` [PATCH v3 05/13] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
2020-09-18  2:59   ` [PATCH v3 06/13] commit-graph.c: store maximum changed paths Taylor Blau
2020-09-18  2:59   ` [PATCH v3 07/13] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-09-18  2:59   ` [PATCH v3 08/13] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
2020-09-18 16:27     ` SZEDER Gábor
2020-09-18 16:32       ` Taylor Blau
2020-09-18  2:59   ` [PATCH v3 09/13] bloom/diff: properly short-circuit on max_changes Taylor Blau
2020-09-18  2:59   ` [PATCH v3 10/13] bloom: encode out-of-bounds filters as non-empty Taylor Blau
2020-09-18  2:59   ` [PATCH v3 11/13] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-09-18  2:59   ` [PATCH v3 12/13] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-09-18  2:59   ` [PATCH v3 13/13] commit-graph: introduce 'commitGraph.maxNewFilters' Taylor Blau
2020-09-18 13:29     ` Taylor Blau
2020-09-18 17:43       ` Junio C Hamano
2020-09-18 13:31   ` [PATCH v3 00/13] more miscellaneous Bloom filter improvements, redux Taylor Blau
2020-09-18 13:34     ` Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200917221302.GC23146@szeder.dev \
    --to=szeder.dev@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.