Hi Gábor, On Fri, 29 May 2020, SZEDER Gábor wrote: > Sigh... but better late than never, right? Well, incremental patches on top of what is _already_ in Git are _even_ better. > I experimented quite a bit with modified path Bloom filters a year and > more ago, and got quite far... but my disappointment in the > inadequacy of all double hashing schemes, the arrival of split > commit-graphs, and, well, life in general has put the whole thing on > the back burner, and I haven't touched it for a couple of releases. > > Now I finally managed to take a closer look at the current changed > paths Bloom filters implementation, and saw that it has some of the > same issues that I had stumbled upon and that it missed some > optimization opportunities. Unfortunately, fixing those issues and > performing those optimizations do require a thorough format change. Thank you for this background information. Your claim that there are opportunities to optimize, as well as your claim that it requires a thorough format change, strike me as best backed up with evidence by porting your optimizations on top of what is already in `master` and which has been reviewed _already_ (as opposed to your new patch series, which essentially threatens to throw away all the time people spent on reviewing Garima's patches). > So here is my proof of concept version, in all its incompleteness, > with the following benefits: > > - Better understanding of the problem it tries to optimize. > - Better understanding of the issues with many small Bloom filters. > - Better hashing scheme (though it should be better still). > - Orders of magnitude lower average false positive rate. > - Consequently, faster pathspec-limited revision walks. > - Faster processing of the tree-diff output and lower memory usage > while computing Bloom filters (from scratch...). > - Optional support for storing Bloom filters for all parents of > merge commits. > - Deduplicates Bloom filters. > - Supports multiple pathspecs right from the start. > - Supports some wildcards in pathspecs. > - Handles as many commits as the commit-graph format can. > - It has the right name :) The diff machinery and all its frontends > report "modified" paths with the letter 'M', not "changed". > - More cleanups, more bugfixes. > - Consistent output with and without modified path Bloom filters for > over 80k random paths in 16 repositories, even with submodules in > them. Well, at least on my machine, if nowhere else... It strikes me as an obvious idea to make all those improvements in an incremental manner. > Alas, the drawbacks are significant: > > - No tests whatsoever. > - Computes all modified path Bloom filters from scratch when > writing, no matter what. > - Doesn't work with split commit-graphs. > - Basically if anything works besides 'git commit-graph write > --reachable' it's a miracle. > - Not a single test. You said that already in the first bullet point. But yes, I get it, there are no tests. > - Many BUG()s, which should rather be graceful errors... though I > have to admit that at this point they are indeed bugs. > - Many TODOs, both in commit messages and code, some incomplete > commit messages, crappy subject lines, even missing signoffs. > - Some ridiculously long variable, function, macro and config > variable names. > - It's based on v2.25.0 (no technical reason, but that's the version > I used to run the baseline benchmarks the last time, which takes > days...) > - I'm pretty sure that there are more... > - Oh, did I mention that there are no tests? Ah, your patch series has no tests. Please do understand that it can be perceived as quite frustrating that an alternative is presented _this_ late, when already such a lot of effort has gone into not only iterating several times on the patch series but also into reviewing all of them. I don't really think that I want to spend the time to review this alternative (not that I am an expert in the space) because it would imply that I help invalidating the effort that went into the current implementation. All this is to say: I appreciate your efforts to improve the path-based Bloom filters, at the same time I wish that those efforts would happen in a collaborative manner instead of simply dismissing other people's work. Ciao, Johannes