Git Mailing List Archive on
 help / color / Atom feed
From: Derrick Stolee <>
To: Jeff King <>
Cc: "SZEDER Gábor" <>,
	"Ævar Arnfjörð Bjarmason" <>,
	"Stefan Beller" <>, git <>,
	"Duy Nguyen" <>
Subject: Re: Bloom Filters
Date: Thu, 11 Oct 2018 08:33:58 -0400
Message-ID: <> (raw)
In-Reply-To: <>

On 10/9/2018 7:12 PM, Jeff King wrote:
> On Tue, Oct 09, 2018 at 05:14:50PM -0400, Jeff King wrote:
>> Hmph. It really sounds like we could do better with a custom RLE
>> solution. But that makes me feel like I'm missing something, because
>> surely I can't invent something better than the state of the art in a
>> simple thought experiment, right?
>> I know what I'm proposing would be quite bad for random access, but my
>> impression is that EWAH is the same. For the scale of bitmaps we're
>> talking about, I think linear/streaming access through the bitmap would
>> be OK.
> Thinking on it more, what I was missing is that for truly dense random
> bitmaps, this will perform much worse. Because it will use a byte to say
> "there's one 1", rather than a bit.
> But I think it does OK in practice for the very sparse bitmaps we tend
> to see in this application.  I was able to generate a complete output
> that can reproduce "log --name-status -t" for linux.git in 32MB. But:
>    - 15MB of that is commit sha1s, which will be stored elsewhere in a
>      "real" system
>    - 5MB of that is path list (which should shrink by a factor of 10 with
>      prefix compression, and is really a function of a tree size less
>      than history depth)
> So the per-commit cost is not too bad. That's still not counting merges,
> though, which would add another 10-15% (or maybe more; their bitmaps are
> less sparse).
> I don't know if this is a fruitful path at all or not. I was mostly just
> satisfying my own curiosity on the bitmap encoding question. But I'll
> post the patches, just to show my work. The first one is the same
> initial proof of concept I showed earlier.
>    [1/3]: initial tree-bitmap proof of concept
>    [2/3]: test-tree-bitmap: add "dump" mode
>    [3/3]: test-tree-bitmap: replace ewah with custom rle encoding
>   Makefile                    |   1 +
>   t/helper/test-tree-bitmap.c | 344 ++++++++++++++++++++++++++++++++++++
>   2 files changed, 345 insertions(+)
>   create mode 100644 t/helper/test-tree-bitmap.c
I'm trying to test this out myself, and am having trouble reverse 
engineering how I'm supposed to test it.

Looks like running "t/helper/test-tree-bitmap gen" will output a lot of 
binary data. Where should I store that? Does any file work?

Is this series just for the storage costs, assuming that we would 
replace all TREESAME checks with a query into this database? Or do you 
have a way to test how much this would improve a "git log -- <path>" query?


  reply index

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-10-03 19:18 We should add a "git gc --auto" after "git clone" due to commit graph Jeff King
2018-10-08 16:41 ` SZEDER Gábor
2018-10-08 16:57   ` Derrick Stolee
2018-10-08 18:10     ` SZEDER Gábor
2018-10-08 18:29       ` Derrick Stolee
2018-10-09  3:08         ` Jeff King
2018-10-09 13:48           ` Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph) Derrick Stolee
2018-10-09 18:46             ` Jeff King
2018-10-09 19:03               ` Derrick Stolee
2018-10-09 21:14                 ` Jeff King
2018-10-09 23:12                   ` Bloom Filters Jeff King
2018-10-11 12:33                     ` Derrick Stolee [this message]
2018-10-11 13:43                       ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Git Mailing List Archive on

Archives are clonable:
	git clone --mirror git/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 git git/ \
	public-inbox-index git

Example config snippet for mirrors

Newsgroup available over NNTP:

AGPL code for this site: git clone