Re: [PoC PATCH 00/34] An alternative modified path Bloom filters implementation

* Re: [PoC PATCH 00/34] An alternative modified path Bloom filters implementation
  2020-05-29  8:50 [PoC PATCH 00/34] An alternative modified path Bloom filters implementation SZEDER Gábor
@ 2020-05-28 22:09 ` Johannes Schindelin
  2020-05-29  8:50 ` [PATCH 01/34] tree-walk.c: don't match submodule entries for 'submod/anything' SZEDER Gábor
                   ` (35 subsequent siblings)
  36 siblings, 0 replies; 43+ messages in thread
From: Johannes Schindelin @ 2020-05-28 22:09 UTC (permalink / raw)
  To: SZEDER Gábor
  Cc: git, Junio C Hamano, Garima Singh, Derrick Stolee,
	Jakub Narebski, Jeff King, Taylor Blau

[-- Attachment #1: Type: text/plain, Size: 4505 bytes --]

Hi Gábor,

On Fri, 29 May 2020, SZEDER Gábor wrote:

> Sigh...  but better late than never, right?

Well, incremental patches on top of what is _already_ in Git are _even_
better.

> I experimented quite a bit with modified path Bloom filters a year and
> more ago, and got quite far...  but my disappointment in the
> inadequacy of all double hashing schemes, the arrival of split
> commit-graphs, and, well, life in general has put the whole thing on
> the back burner, and I haven't touched it for a couple of releases.
>
> Now I finally managed to take a closer look at the current changed
> paths Bloom filters implementation, and saw that it has some of the
> same issues that I had stumbled upon and that it missed some
> optimization opportunities.  Unfortunately, fixing those issues and
> performing those optimizations do require a thorough format change.

Thank you for this background information.

Your claim that there are opportunities to optimize, as well as your claim
that it requires a thorough format change, strike me as best backed up
with evidence by porting your optimizations on top of what is already in
`master` and which has been reviewed _already_ (as opposed to your new
patch series, which essentially threatens to throw away all the time
people spent on reviewing Garima's patches).

> So here is my proof of concept version, in all its incompleteness,
> with the following benefits:
>
>   - Better understanding of the problem it tries to optimize.
>   - Better understanding of the issues with many small Bloom filters.
>   - Better hashing scheme (though it should be better still).
>   - Orders of magnitude lower average false positive rate.
>   - Consequently, faster pathspec-limited revision walks.
>   - Faster processing of the tree-diff output and lower memory usage
>     while computing Bloom filters (from scratch...).
>   - Optional support for storing Bloom filters for all parents of
>     merge commits.
>   - Deduplicates Bloom filters.
>   - Supports multiple pathspecs right from the start.
>   - Supports some wildcards in pathspecs.
>   - Handles as many commits as the commit-graph format can.
>   - It has the right name :)  The diff machinery and all its frontends
>     report "modified" paths with the letter 'M', not "changed".
>   - More cleanups, more bugfixes.
>   - Consistent output with and without modified path Bloom filters for
>     over 80k random paths in 16 repositories, even with submodules in
>     them.  Well, at least on my machine, if nowhere else...

It strikes me as an obvious idea to make all those improvements in an
incremental manner.

> Alas, the drawbacks are significant:
>
>   - No tests whatsoever.
>   - Computes all modified path Bloom filters from scratch when
>     writing, no matter what.
>   - Doesn't work with split commit-graphs.
>   - Basically if anything works besides 'git commit-graph write
>     --reachable' it's a miracle.
>   - Not a single test.

You said that already in the first bullet point. But yes, I get it, there
are no tests.

>   - Many BUG()s, which should rather be graceful errors...  though I
>     have to admit that at this point they are indeed bugs.
>   - Many TODOs, both in commit messages and code, some incomplete
>     commit messages, crappy subject lines, even missing signoffs.
>   - Some ridiculously long variable, function, macro and config
>     variable names.
>   - It's based on v2.25.0 (no technical reason, but that's the version
>     I used to run the baseline benchmarks the last time, which takes
>     days...)
>   - I'm pretty sure that there are more...
>   - Oh, did I mention that there are no tests?

Ah, your patch series has no tests.

Please do understand that it can be perceived as quite frustrating that
an alternative is presented _this_ late, when already such a lot of effort
has gone into not only iterating several times on the patch series but
also into reviewing all of them.

I don't really think that I want to spend the time to review this
alternative (not that I am an expert in the space) because it would imply
that I help invalidating the effort that went into the current
implementation.

All this is to say: I appreciate your efforts to improve the path-based
Bloom filters, at the same time I wish that those efforts would happen in
a collaborative manner instead of simply dismissing other people's work.

Ciao,
Johannes

^ permalink raw reply	[flat|nested] 43+ messages in thread