All of lore.kernel.org
 help / color / mirror / Atom feed
* bloom filters
@ 2014-05-17  2:28 Loic Dachary
  0 siblings, 0 replies; 4+ messages in thread
From: Loic Dachary @ 2014-05-17  2:28 UTC (permalink / raw)
  To: Sahid Ferdjaoui; +Cc: Ceph Development

[-- Attachment #1: Type: text/plain, Size: 482 bytes --]

Hi Sahid,

Here are the files implementing the bloom filter we discussed tonight

  https://github.com/ceph/ceph/blob/master/src/common/bloom_filter.hpp
  https://github.com/ceph/ceph/blob/master/src/common/bloom_filter.cc

and the associated unit tests

  https://github.com/ceph/ceph/blob/master/src/test/common/test_bloom_filter.cc

Improving code coverage would be nice way for you to learn the code base ;-)

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Bloom Filters
  2018-10-11 12:33                     ` Derrick Stolee
@ 2018-10-11 13:43                       ` Jeff King
  0 siblings, 0 replies; 4+ messages in thread
From: Jeff King @ 2018-10-11 13:43 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, Ævar Arnfjörð Bjarmason,
	Stefan Beller, git, Duy Nguyen

On Thu, Oct 11, 2018 at 08:33:58AM -0400, Derrick Stolee wrote:

> > I don't know if this is a fruitful path at all or not. I was mostly just
> > satisfying my own curiosity on the bitmap encoding question. But I'll
> > post the patches, just to show my work. The first one is the same
> > initial proof of concept I showed earlier.
> > 
> >    [1/3]: initial tree-bitmap proof of concept
> >    [2/3]: test-tree-bitmap: add "dump" mode
> >    [3/3]: test-tree-bitmap: replace ewah with custom rle encoding
> > 
> >   Makefile                    |   1 +
> >   t/helper/test-tree-bitmap.c | 344 ++++++++++++++++++++++++++++++++++++
> >   2 files changed, 345 insertions(+)
> >   create mode 100644 t/helper/test-tree-bitmap.c
> I'm trying to test this out myself, and am having trouble reverse
> engineering how I'm supposed to test it.
> 
> Looks like running "t/helper/test-tree-bitmap gen" will output a lot of
> binary data. Where should I store that? Does any file work?

Yeah, you can do:

  # optionally run with GIT_TRACE=1 to see some per-bitmap stats
  test-tree-bitmap gen >out

  # this should be roughly the same as:
  #  git rev-list --all |
  #  git diff-tree --stdin -t --name-only
  test-tree-bitmap dump <out

> Is this series just for the storage costs, assuming that we would replace
> all TREESAME checks with a query into this database? Or do you have a way to
> test how much this would improve a "git log -- <path>" query?

Right, I was just looking at storage cost here. It's not integrated with
the diff code at all. I left some hypothetical numbers elsewhere in the
thread.

-Peff

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Bloom Filters
  2018-10-09 23:12                   ` Bloom Filters Jeff King
@ 2018-10-11 12:33                     ` Derrick Stolee
  2018-10-11 13:43                       ` Jeff King
  0 siblings, 1 reply; 4+ messages in thread
From: Derrick Stolee @ 2018-10-11 12:33 UTC (permalink / raw)
  To: Jeff King
  Cc: SZEDER Gábor, Ævar Arnfjörð Bjarmason,
	Stefan Beller, git, Duy Nguyen

On 10/9/2018 7:12 PM, Jeff King wrote:
> On Tue, Oct 09, 2018 at 05:14:50PM -0400, Jeff King wrote:
>
>> Hmph. It really sounds like we could do better with a custom RLE
>> solution. But that makes me feel like I'm missing something, because
>> surely I can't invent something better than the state of the art in a
>> simple thought experiment, right?
>>
>> I know what I'm proposing would be quite bad for random access, but my
>> impression is that EWAH is the same. For the scale of bitmaps we're
>> talking about, I think linear/streaming access through the bitmap would
>> be OK.
> Thinking on it more, what I was missing is that for truly dense random
> bitmaps, this will perform much worse. Because it will use a byte to say
> "there's one 1", rather than a bit.
>
> But I think it does OK in practice for the very sparse bitmaps we tend
> to see in this application.  I was able to generate a complete output
> that can reproduce "log --name-status -t" for linux.git in 32MB. But:
>
>    - 15MB of that is commit sha1s, which will be stored elsewhere in a
>      "real" system
>
>    - 5MB of that is path list (which should shrink by a factor of 10 with
>      prefix compression, and is really a function of a tree size less
>      than history depth)
>
> So the per-commit cost is not too bad. That's still not counting merges,
> though, which would add another 10-15% (or maybe more; their bitmaps are
> less sparse).
>
> I don't know if this is a fruitful path at all or not. I was mostly just
> satisfying my own curiosity on the bitmap encoding question. But I'll
> post the patches, just to show my work. The first one is the same
> initial proof of concept I showed earlier.
>
>    [1/3]: initial tree-bitmap proof of concept
>    [2/3]: test-tree-bitmap: add "dump" mode
>    [3/3]: test-tree-bitmap: replace ewah with custom rle encoding
>
>   Makefile                    |   1 +
>   t/helper/test-tree-bitmap.c | 344 ++++++++++++++++++++++++++++++++++++
>   2 files changed, 345 insertions(+)
>   create mode 100644 t/helper/test-tree-bitmap.c
I'm trying to test this out myself, and am having trouble reverse 
engineering how I'm supposed to test it.

Looks like running "t/helper/test-tree-bitmap gen" will output a lot of 
binary data. Where should I store that? Does any file work?

Is this series just for the storage costs, assuming that we would 
replace all TREESAME checks with a query into this database? Or do you 
have a way to test how much this would improve a "git log -- <path>" query?

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Bloom Filters
  2018-10-09 21:14                 ` Jeff King
@ 2018-10-09 23:12                   ` Jeff King
  2018-10-11 12:33                     ` Derrick Stolee
  0 siblings, 1 reply; 4+ messages in thread
From: Jeff King @ 2018-10-09 23:12 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, Ævar Arnfjörð Bjarmason,
	Stefan Beller, git, Duy Nguyen

On Tue, Oct 09, 2018 at 05:14:50PM -0400, Jeff King wrote:

> Hmph. It really sounds like we could do better with a custom RLE
> solution. But that makes me feel like I'm missing something, because
> surely I can't invent something better than the state of the art in a
> simple thought experiment, right?
> 
> I know what I'm proposing would be quite bad for random access, but my
> impression is that EWAH is the same. For the scale of bitmaps we're
> talking about, I think linear/streaming access through the bitmap would
> be OK.

Thinking on it more, what I was missing is that for truly dense random
bitmaps, this will perform much worse. Because it will use a byte to say
"there's one 1", rather than a bit.

But I think it does OK in practice for the very sparse bitmaps we tend
to see in this application.  I was able to generate a complete output
that can reproduce "log --name-status -t" for linux.git in 32MB. But:

  - 15MB of that is commit sha1s, which will be stored elsewhere in a
    "real" system

  - 5MB of that is path list (which should shrink by a factor of 10 with
    prefix compression, and is really a function of a tree size less
    than history depth)

So the per-commit cost is not too bad. That's still not counting merges,
though, which would add another 10-15% (or maybe more; their bitmaps are
less sparse).

I don't know if this is a fruitful path at all or not. I was mostly just
satisfying my own curiosity on the bitmap encoding question. But I'll
post the patches, just to show my work. The first one is the same
initial proof of concept I showed earlier.

  [1/3]: initial tree-bitmap proof of concept
  [2/3]: test-tree-bitmap: add "dump" mode
  [3/3]: test-tree-bitmap: replace ewah with custom rle encoding

 Makefile                    |   1 +
 t/helper/test-tree-bitmap.c | 344 ++++++++++++++++++++++++++++++++++++
 2 files changed, 345 insertions(+)
 create mode 100644 t/helper/test-tree-bitmap.c

-Peff

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2018-10-11 13:44 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-17  2:28 bloom filters Loic Dachary
2018-10-03 19:18 We should add a "git gc --auto" after "git clone" due to commit graph Jeff King
2018-10-08 16:41 ` SZEDER Gábor
2018-10-08 16:57   ` Derrick Stolee
2018-10-08 18:10     ` SZEDER Gábor
2018-10-08 18:29       ` Derrick Stolee
2018-10-09  3:08         ` Jeff King
2018-10-09 13:48           ` Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph) Derrick Stolee
2018-10-09 18:46             ` Jeff King
2018-10-09 19:03               ` Derrick Stolee
2018-10-09 21:14                 ` Jeff King
2018-10-09 23:12                   ` Bloom Filters Jeff King
2018-10-11 12:33                     ` Derrick Stolee
2018-10-11 13:43                       ` Jeff King

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.