All of lore.kernel.org
 help / color / mirror / Atom feed
From: Taylor Blau <me@ttaylorr.com>
To: "SZEDER Gábor" <szeder.dev@gmail.com>
Cc: Taylor Blau <me@ttaylorr.com>,
	git@vger.kernel.org, dstolee@microsoft.com, gitster@pobox.com,
	peff@peff.net
Subject: Re: [PATCH v2 10/13] bloom: encode out-of-bounds filters as non-empty
Date: Thu, 17 Sep 2020 19:13:44 -0400	[thread overview]
Message-ID: <20200917231344.GA1591704@nand.local> (raw)
In-Reply-To: <20200917221302.GC23146@szeder.dev>

On Fri, Sep 18, 2020 at 12:13:02AM +0200, SZEDER Gábor wrote:
> > Clients will invoke the Bloom machinery in more cases than before, but
> > this can be addressed by returning a NULL filter when all bits are set
> > high. This can be addressed in a future patch.
>
> OTOH, clients will invoke the tree-diff machinery in fewer cases than
> before, because querying the Bloom filter of commits not modifying any
> files will now return "definitely not".

Absolutely right.

> > Finally, note that this does increase the size of on-disk commit-graphs,
> > but far less than other proposals. In particular, this is generally more
> > efficient than storing a bitmap for which commits haven't computed their
> > Bloom filters. Storing a bitmap incurs a penalty of one bit per commit,
> > whereas storing explicit filters as above incurs a penalty of one byte
> > per too-large or too-small commit.
>
> s/too-small/empty/

Fair enough, although I'm not planning to alter this or any other patch
now that it's picked up unless there's a real show-stopper (this doesn't
seem like one).

> > In practice, these boundary commits likely occupy a small proportion of
> > the overall number of commits, and so the size penalty is likely smaller
> > than storing a bitmap for all commits.
>
>                  |      Percentage of
>                  |    commits modifying
>                  |   0 path   |  >= 512 paths
>   ---------------+------------+----------------
>   android-base   |   13.20%   |   0.13%
>   cmssw          |    0.15%   |   0.23%
>   cpython        |    3.07%   |   0.01%
>   elasticsearch  |    0.70%   |   1.00%
>   gcc            |    0.00%   |   0.08%
>   gecko-dev      |    0.14%   |   0.64%
>   git            |    0.11%   |   0.02%
>   glibc          |    0.02%   |   0.10%
>   go             |    0.00%   |   0.07%
>   homebrew-cask  |    0.40%   |   0.02%
>   homebrew-core  |    0.01%   |   0.01%
>   jdk            |    0.26%   |   5.64%
>   linux          |    0.01%   |   0.51%
>   llvm-project   |    0.12%   |   0.03%
>   rails          |    0.10%   |   0.10%
>   rust           |    0.07%   |   0.17%
>   tensorflow     |    0.09%   |   1.02%
>   webkit         |    0.05%   |   0.31%

This is very useful information to have! Without the total number of
commits, it's impossible to know whether or not this is a win over the
BFXL chunk. But, since the number of commits is probably "large" versus
the percentage of boundary commits which is "small", it's almost
certainly an advantage.

> > A test to exercise filters which contain too many changed path entries
> > will be introduced in a subsequent patch.
>
>
> > diff --git a/bloom.h b/bloom.h
> > index c6d77e8393..70a8840896 100644
> > --- a/bloom.h
> > +++ b/bloom.h
> > @@ -93,6 +93,7 @@ enum bloom_filter_computed {
> >  	BLOOM_NOT_COMPUTED = (1 << 0),
> >  	BLOOM_COMPUTED     = (1 << 1),
> >  	BLOOM_TRUNC_LARGE  = (1 << 2),
> > +	BLOOM_TRUNC_SMALL  = (1 << 3),
>
> s/SMALL/EMPTY/
>
> This "small" suffix in the constant, variable, and trace2 key names is
> misleading, because we only mean empty commits.

I could buy that it might be misleading; I only picked this since it was
the opposite of "large". You could imagine that BLOOM_TRUNC_X means
"truncated in the direction of 'x'", but to be honest I don't think that
this matters.

I understand the churn of coming back to this after the topic has
already been merged creates more hassle, but frankly this series has
already gone on for quite a while, and it has been holding up important
bug fixes that are unrelated to the main feature.

So, I think that if it's truly misleading, we could revisit this after
the topic is merged. But, I'm not planning on changing anything at this
point.

Thanks,
Taylor

  reply	other threads:[~2020-09-17 23:13 UTC|newest]

Thread overview: 75+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-09 15:22 [PATCH 00/12] more miscellaneous Bloom filter improvements, redux Taylor Blau
2020-09-09 15:22 ` [PATCH 01/12] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-09-09 15:22 ` [PATCH 02/12] t4216: use an '&&'-chain Taylor Blau
2020-09-09 15:22 ` [PATCH 03/12] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-09-09 15:23 ` [PATCH 04/12] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-09-09 15:23 ` [PATCH 05/12] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
2020-09-09 15:23 ` [PATCH 06/12] commit-graph.c: store maximum changed paths Taylor Blau
2020-09-09 15:23 ` [PATCH 07/12] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-09-09 15:23 ` [PATCH 08/12] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
2020-09-09 15:23 ` [PATCH 09/12] bloom/diff: properly short-circuit on max_changes Taylor Blau
2020-09-09 15:23 ` [PATCH 10/12] bloom: encode out-of-bounds filters as non-empty Taylor Blau
2020-09-10  3:35   ` Taylor Blau
2020-09-10 15:45     ` Taylor Blau
2020-09-11 18:15       ` Derrick Stolee
2020-09-09 15:23 ` [PATCH 11/12] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-09-09 15:24 ` [PATCH 12/12] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-09-11 17:52   ` Jeff King
2020-09-11 18:59     ` Taylor Blau
2020-09-11 19:25       ` Taylor Blau
2020-09-14 20:12         ` Taylor Blau
2020-09-14 20:31           ` Derrick Stolee
2020-09-14 20:36             ` Taylor Blau
2020-09-15  0:59               ` Derrick Stolee
2020-09-15  4:31                 ` Taylor Blau
2020-09-15 21:49               ` Junio C Hamano
2020-09-15 21:53                 ` Taylor Blau
2020-09-11 19:47       ` Jeff King
2020-09-11 19:31     ` Junio C Hamano
2020-09-16 18:06 ` [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux Taylor Blau
2020-09-16 18:06   ` [PATCH v2 01/13] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-09-16 18:07   ` [PATCH v2 02/13] t4216: use an '&&'-chain Taylor Blau
2020-09-16 18:07   ` [PATCH v2 03/13] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-09-16 18:07   ` [PATCH v2 04/13] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-09-16 18:07   ` [PATCH v2 05/13] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
2020-09-16 18:07   ` [PATCH v2 06/13] commit-graph.c: store maximum changed paths Taylor Blau
2020-09-16 18:07   ` [PATCH v2 07/13] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-09-16 18:07   ` [PATCH v2 08/13] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
2020-09-16 18:07   ` [PATCH v2 09/13] bloom/diff: properly short-circuit on max_changes Taylor Blau
2020-09-16 18:07   ` [PATCH v2 10/13] bloom: encode out-of-bounds filters as non-empty Taylor Blau
2020-09-17 22:13     ` SZEDER Gábor
2020-09-17 23:13       ` Taylor Blau [this message]
2020-09-18  0:52         ` Junio C Hamano
2020-09-18  1:15           ` Taylor Blau
2020-09-16 18:08   ` [PATCH v2 11/13] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-09-16 18:08   ` [PATCH v2 12/13] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-09-18  9:23     ` SZEDER Gábor
2020-09-18 13:27       ` Taylor Blau
2020-09-16 18:08   ` [PATCH v2 13/13] commit-graph: introduce 'commitGraph.maxNewFilters' Taylor Blau
2020-09-16 22:51   ` [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux Derrick Stolee
2020-09-16 23:07     ` Junio C Hamano
2020-09-17  0:45       ` Taylor Blau
2020-09-17  0:59         ` Junio C Hamano
2020-09-17  1:10           ` Taylor Blau
2020-09-17 13:34             ` Taylor Blau
2020-09-17 13:38               ` Derrick Stolee
2020-09-18  2:58 ` [PATCH v3 " Taylor Blau
2020-09-18  2:58   ` [PATCH v3 01/13] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-09-18  2:58   ` [PATCH v3 02/13] t4216: use an '&&'-chain Taylor Blau
2020-09-18  2:59   ` [PATCH v3 03/13] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-09-18  2:59   ` [PATCH v3 04/13] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-09-18  2:59   ` [PATCH v3 05/13] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
2020-09-18  2:59   ` [PATCH v3 06/13] commit-graph.c: store maximum changed paths Taylor Blau
2020-09-18  2:59   ` [PATCH v3 07/13] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-09-18  2:59   ` [PATCH v3 08/13] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
2020-09-18 16:27     ` SZEDER Gábor
2020-09-18 16:32       ` Taylor Blau
2020-09-18  2:59   ` [PATCH v3 09/13] bloom/diff: properly short-circuit on max_changes Taylor Blau
2020-09-18  2:59   ` [PATCH v3 10/13] bloom: encode out-of-bounds filters as non-empty Taylor Blau
2020-09-18  2:59   ` [PATCH v3 11/13] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-09-18  2:59   ` [PATCH v3 12/13] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-09-18  2:59   ` [PATCH v3 13/13] commit-graph: introduce 'commitGraph.maxNewFilters' Taylor Blau
2020-09-18 13:29     ` Taylor Blau
2020-09-18 17:43       ` Junio C Hamano
2020-09-18 13:31   ` [PATCH v3 00/13] more miscellaneous Bloom filter improvements, redux Taylor Blau
2020-09-18 13:34     ` Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200917231344.GA1591704@nand.local \
    --to=me@ttaylorr.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.