bloom filters

All of lore.kernel.org
 help / color / mirror / Atom feed

* bloom filters
@ 2014-05-17  2:28 Loic Dachary
  0 siblings, 0 replies; 4+ messages in thread
From: Loic Dachary @ 2014-05-17  2:28 UTC (permalink / raw)
  To: Sahid Ferdjaoui; +Cc: Ceph Development

[-- Attachment #1: Type: text/plain, Size: 482 bytes --]

Hi Sahid,

Here are the files implementing the bloom filter we discussed tonight

  https://github.com/ceph/ceph/blob/master/src/common/bloom_filter.hpp
  https://github.com/ceph/ceph/blob/master/src/common/bloom_filter.cc

and the associated unit tests

  https://github.com/ceph/ceph/blob/master/src/test/common/test_bloom_filter.cc

Improving code coverage would be nice way for you to learn the code base ;-)

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: We should add a "git gc --auto" after "git clone" due to commit graph
@ 2018-10-03 19:18 Jeff King
  2018-10-08 16:41 ` SZEDER Gábor
  0 siblings, 1 reply; 4+ messages in thread
From: Jeff King @ 2018-10-03 19:18 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Ævar Arnfjörð Bjarmason, Stefan Beller,
	SZEDER Gábor, git, Duy Nguyen

On Wed, Oct 03, 2018 at 02:59:34PM -0400, Derrick Stolee wrote:

> > They don't help yet, and there's no good reason to enable bitmaps for
> > clients. I have a few patches that use bitmaps for things like
> > ahead/behind and --contains checks, but the utility of those may be
> > lessened quite a bit by Stolee's commit-graph work.  And if it isn't,
> > I'm mildly in favor of replacing the existing .bitmap format with
> > something better integrated with commit-graphs (which would give us an
> > opportunity to clean up some of the rough edges).
> 
> If the commit-graph doesn't improve enough on those applications, then we
> could consider adding a commit-to-commit reachability bitmap inside the
> commit-graph. ;)

That unfortunately wouldn't be enough for us to ditch the existing
.bitmap files, since we need full object reachability for some cases
(including packing). And commit-to-commit reachability is a trivial
subset of that. I'm not sure if it would be better to just leave
.bitmaps in place as a server-side thing, and grow a new thing for
commit-to-commit reachability (since it would presumably be easier).

I'm still excited about the prospect of a bloom filter for paths which
each commit touches. I think that's the next big frontier in getting
things like "git log -- path" to a reasonable run-time.

-Peff

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-03 19:18 We should add a "git gc --auto" after "git clone" due to commit graph Jeff King
@ 2018-10-08 16:41 ` SZEDER Gábor
  2018-10-08 16:57   ` Derrick Stolee
  0 siblings, 1 reply; 4+ messages in thread
From: SZEDER Gábor @ 2018-10-08 16:41 UTC (permalink / raw)
  To: Jeff King
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Stefan Beller, git, Duy Nguyen

On Wed, Oct 03, 2018 at 03:18:05PM -0400, Jeff King wrote:
> I'm still excited about the prospect of a bloom filter for paths which
> each commit touches. I think that's the next big frontier in getting
> things like "git log -- path" to a reasonable run-time.

There is certainly potential there.  With a (very) rough PoC
experiment, a 8MB bloom filter, and a carefully choosen path I can
achieve a nice, almost 25x speedup:

  $ time git rev-list --count HEAD -- t/valgrind/valgrind.sh
  6

  real    0m1.563s
  user    0m1.519s
  sys     0m0.045s

  $ time GIT_USE_POC_BLOOM_FILTER=y ~/src/git/git rev-list --count HEAD -- t/valgrind/valgrind.sh
  6

  real    0m0.063s
  user    0m0.043s
  sys     0m0.020s

  bloom filter total queries: 16269 definitely not: 16195 maybe: 74 false positives: 64 fp ratio: 0.003934

But I'm afraid it will take a while until I get around to turn it into
something presentable...


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-08 16:41 ` SZEDER Gábor
@ 2018-10-08 16:57   ` Derrick Stolee
  2018-10-08 18:10     ` SZEDER Gábor
  0 siblings, 1 reply; 4+ messages in thread
From: Derrick Stolee @ 2018-10-08 16:57 UTC (permalink / raw)
  To: SZEDER Gábor, Jeff King
  Cc: Ævar Arnfjörð Bjarmason, Stefan Beller, git, Duy Nguyen

On 10/8/2018 12:41 PM, SZEDER Gábor wrote:
> On Wed, Oct 03, 2018 at 03:18:05PM -0400, Jeff King wrote:
>> I'm still excited about the prospect of a bloom filter for paths which
>> each commit touches. I think that's the next big frontier in getting
>> things like "git log -- path" to a reasonable run-time.
> There is certainly potential there.  With a (very) rough PoC
> experiment, a 8MB bloom filter, and a carefully choosen path I can
> achieve a nice, almost 25x speedup:
>
>    $ time git rev-list --count HEAD -- t/valgrind/valgrind.sh
>    6
>
>    real    0m1.563s
>    user    0m1.519s
>    sys     0m0.045s
>
>    $ time GIT_USE_POC_BLOOM_FILTER=y ~/src/git/git rev-list --count HEAD -- t/valgrind/valgrind.sh
>    6
>
>    real    0m0.063s
>    user    0m0.043s
>    sys     0m0.020s
>
>    bloom filter total queries: 16269 definitely not: 16195 maybe: 74 false positives: 64 fp ratio: 0.003934
Nice! These numbers make sense to me, in terms of how many TREESAME 
queries we actually need to perform for such a query.
> But I'm afraid it will take a while until I get around to turn it into
> something presentable...
Do you have the code pushed somewhere public where one could take a 
look? I could provide some early feedback.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-08 16:57   ` Derrick Stolee
@ 2018-10-08 18:10     ` SZEDER Gábor
  2018-10-08 18:29       ` Derrick Stolee
  0 siblings, 1 reply; 4+ messages in thread
From: SZEDER Gábor @ 2018-10-08 18:10 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Jeff King, Ævar Arnfjörð Bjarmason, Stefan Beller,
	git, Duy Nguyen

On Mon, Oct 08, 2018 at 12:57:34PM -0400, Derrick Stolee wrote:
> On 10/8/2018 12:41 PM, SZEDER Gábor wrote:
> >On Wed, Oct 03, 2018 at 03:18:05PM -0400, Jeff King wrote:
> >>I'm still excited about the prospect of a bloom filter for paths which
> >>each commit touches. I think that's the next big frontier in getting
> >>things like "git log -- path" to a reasonable run-time.
> >There is certainly potential there.  With a (very) rough PoC
> >experiment, a 8MB bloom filter, and a carefully choosen path I can
> >achieve a nice, almost 25x speedup:
> >
> >   $ time git rev-list --count HEAD -- t/valgrind/valgrind.sh
> >   6
> >
> >   real    0m1.563s
> >   user    0m1.519s
> >   sys     0m0.045s
> >
> >   $ time GIT_USE_POC_BLOOM_FILTER=y ~/src/git/git rev-list --count HEAD -- t/valgrind/valgrind.sh
> >   6
> >
> >   real    0m0.063s
> >   user    0m0.043s
> >   sys     0m0.020s
> >
> >   bloom filter total queries: 16269 definitely not: 16195 maybe: 74 false positives: 64 fp ratio: 0.003934

> Nice! These numbers make sense to me, in terms of how many TREESAME queries
> we actually need to perform for such a query.

Yeah...  because you didn't notice that I deliberately cheated :)

As it turned out, it's not just about the number of diff queries that
we can spare, but, for the speedup _ratio_, it's more about how
expensive those diff queries are.

git.git has a rather flat hierarchy, and 't/' is the 372th entry in
the current root tree object, while 'valgrind/' is the 923th entry,
and the diff machinery spends considerable time wading through the
previous entries.  Notice the "carefully chosen path" remark in my
previous email; I think this particular path has the highest number of
preceeding tree entries, and, in addition, 't/' changes rather
frequently, so the diff machinery often has to scan two relatively big
tree objects.  Had I chosen 'Documentation/RelNotes/1.5.0.1.txt'
instead, i.e. another path two directories deep, but whose leading
path components are both near the beginning of the tree objects, the
speedup would be much less impressive: 0.282s vs. 0.049s, i.e. "only"
~5.7x instead of ~24.8x.

> >But I'm afraid it will take a while until I get around to turn it into
> >something presentable...
> Do you have the code pushed somewhere public where one could take a look? I
> Do you have the code pushed somewhere public where one could take a 
> look? I could provide some early feedback.

Nah, definitely not...  I know full well how embarassingly broken this
implementation is, I don't need others to tell me that ;)

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-08 18:10     ` SZEDER Gábor
@ 2018-10-08 18:29       ` Derrick Stolee
  2018-10-09  3:08         ` Jeff King
  0 siblings, 1 reply; 4+ messages in thread
From: Derrick Stolee @ 2018-10-08 18:29 UTC (permalink / raw)
  To: SZEDER Gábor
  Cc: Jeff King, Ævar Arnfjörð Bjarmason, Stefan Beller,
	git, Duy Nguyen

On 10/8/2018 2:10 PM, SZEDER Gábor wrote:
> On Mon, Oct 08, 2018 at 12:57:34PM -0400, Derrick Stolee wrote:
>> Nice! These numbers make sense to me, in terms of how many TREESAME queries
>> we actually need to perform for such a query.
> Yeah...  because you didn't notice that I deliberately cheated :)
>
> As it turned out, it's not just about the number of diff queries that
> we can spare, but, for the speedup _ratio_, it's more about how
> expensive those diff queries are.
>
> git.git has a rather flat hierarchy, and 't/' is the 372th entry in
> the current root tree object, while 'valgrind/' is the 923th entry,
> and the diff machinery spends considerable time wading through the
> previous entries.  Notice the "carefully chosen path" remark in my
> previous email; I think this particular path has the highest number of
> preceeding tree entries, and, in addition, 't/' changes rather
> frequently, so the diff machinery often has to scan two relatively big
> tree objects.  Had I chosen 'Documentation/RelNotes/1.5.0.1.txt'
> instead, i.e. another path two directories deep, but whose leading
> path components are both near the beginning of the tree objects, the
> speedup would be much less impressive: 0.282s vs. 0.049s, i.e. "only"
> ~5.7x instead of ~24.8x.

This is expected. The performance ratio is better when the path is any 
of the following:

1. A very deep path (need to walk multiple trees to answer TREESAME)

2. An entry is late in a very wide tree (need to spend extra time 
parsing tree object)

3. The path doesn't change very often (need to inspect many TREESAME 
pairs before finding enough interesting commits)

4. Some sub-path changes often (so the TREESAME comparison needs to 
parse beyond that sub-path often)

Our standard examples (Git and Linux repos) don't have many paths that 
have these properties. But: they do exist. In other projects, this is 
actually typical. Think about Java projects that frequently have ~5 
levels of folders before actually touching a code file.

When I was implementing the Bloom filter feature for Azure Repos, I ran 
performance tests on the Linux repo using a random sampling of paths. 
The typical speedup was 5x while some outliers were in the 25x range.

>
>>> But I'm afraid it will take a while until I get around to turn it into
>>> something presentable...
>> Do you have the code pushed somewhere public where one could take a look? I
>> Do you have the code pushed somewhere public where one could take a
>> look? I could provide some early feedback.
> Nah, definitely not...  I know full well how embarassingly broken this
> implementation is, I don't need others to tell me that ;)
There are two questions that I was hoping to answer by looking at your code:

1. How do you store your Bloom filter? Is it connected to the 
commit-graph and split on a commit-by-commit basis (storing "$path" as a 
key), or is it one huge Bloom filter (storing "$commitid:$path" as key)?

2. Where does your Bloom filter check plug into the TREESAME logic? I 
haven't investigated this part, but hopefully it isn't too complicated.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-08 18:29       ` Derrick Stolee
@ 2018-10-09  3:08         ` Jeff King
  2018-10-09 13:48           ` Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph) Derrick Stolee
  0 siblings, 1 reply; 4+ messages in thread
From: Jeff King @ 2018-10-09  3:08 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, Ævar Arnfjörð Bjarmason,
	Stefan Beller, git, Duy Nguyen

On Mon, Oct 08, 2018 at 02:29:47PM -0400, Derrick Stolee wrote:

> > > > But I'm afraid it will take a while until I get around to turn it into
> > > > something presentable...
> > > Do you have the code pushed somewhere public where one could take a look? I
> > > Do you have the code pushed somewhere public where one could take a
> > > look? I could provide some early feedback.
> > Nah, definitely not...  I know full well how embarassingly broken this
> > implementation is, I don't need others to tell me that ;)
> There are two questions that I was hoping to answer by looking at your code:
> 
> 1. How do you store your Bloom filter? Is it connected to the commit-graph
> and split on a commit-by-commit basis (storing "$path" as a key), or is it
> one huge Bloom filter (storing "$commitid:$path" as key)?

I guess you've probably thought all of this through for your
implementation, but let me pontificate.

I'd have done it as one fixed-size filter per commit. Then you should be
able to hash the path keys once, and apply the result as a bitwise query
to each individual commit (I'm assuming that it's constant-time to
access the filter for each, as an index into an mmap'd array, with the
offset coming from a commit-graph entry we'd be able to look up anyway).

I think it would also be easier to deal with maintenance, since each
filter is independent (IIRC, you cannot delete from a bloom filter
without re-adding all of the other keys).

The obvious downside is that it's O(commits) storage instead of O(1).

Now let me ponder a bit further afield. A bloom filter lets us answer
the question "did this commit (probably) touch these paths?". But it
does not let us answer "which paths did this commit touch?".

That second one is less useful than you might think, because we almost
always care about not just the names of the paths, but their actual
object ids. Think about a --raw diff, or even traversing for
reachability (where if we knew the tree-diff cheaply, we could avoid
asking "have we seen this yet?" on most of the tree entries). The names
alone can make that a bit faster, but in the worst case you still have
to walk the whole tree to find their entries.

But there's also a related question: how do we match pathspec patterns?
For a changed path like "foo/bar/baz", I imagine a bloom filter would
mark all of "foo", "foo/bar", and "foo/bar/baz". But what about "*.c"? I
don't think a bloom filter can answer that.

At least not by itself. If we imagine that the commit-graph also had an
alphabetized list of every path in every tree, then it's easy: apply the
glob to that list once to get a set of concrete paths, and then query
the bloom filters for those. And that list actually isn't too big. The
complete set of paths in linux.git is only about 300k gzipped (I think
that's the most relevant measure, since it's an obvious win to avoid
repeating shared prefixes of long paths).

Imagine we have that list. Is a bloom filter still the best data
structure for each commit? At the point that we have the complete
universe of paths, we could give each commit a bitmap of changed paths.
That lets us ask "did this commit touch these paths" (collect the bits
from the list of paths, then check for 1's), as well as "what paths did
we touch" (collect the 1 bits, and then index the path list).  Those
bitmaps should compress very well via EWAH or similar (most of them
would be huge stretches of 0's punctuated by short runs of 1's).

So that seems promising to me (or at least not an obvious dead-end). I
do think maintenance gets to be a headache, though. Adding new paths
potentially means reordering the bitmaps, which means O(commits) work to
"incrementally" update the structure. (Unless you always add the new
paths at the end, but then you lose fast lookups in the list; that might
be an acceptable tradeoff).

And finally, there's one more radical option: could we actually store a
real per-commit tree-diff cache? I.e., imagine that each commit had the
equivalent of a --raw diff easily accessible, including object ids. That
would allow:

  - fast pathspec matches, including globs

  - fast --raw output (and faster -p output, since we can skip the tree
    entirely)

  - fast reachability traversals (we only need to bother to look at the
    objects for changed entries)

where "fast" is basically O(size of commit's changes), rather than
O(size of whole tree). This was one of the big ideas of packv4 that
never materialized. You can _almost_ do it with packv2, since after all,
we end up storing many trees as deltas. But those deltas are byte-wise
so it's hard for a reader to convert them directly into a pure-tree
diff (they also don't mention the "deleted" data, so it's really only
half a diff).

So let's imagine we'd store such a cache external to the regular object
data (i.e., as a commit-graph entry). The "log --raw" diff of linux.git
has 1.7M entries. The paths should easily compress to a single 32-bit
integer (e.g., as an index into a big path list). The oids are 20 bytes.
Add a few bytes for modes. That's about 80MB. Big, but not impossibly
so. Maybe pushing it for true gigantic repos, though.

Those numbers are ignoring merges, too. The meaning of "did this commit
touch that path" is a lot trickier for a merge commit, and I think may
depend on context. I'm not sure how even a bloom filter solution would
handle that (I was assuming we'd mostly punt and let merges fall back to
opening up the trees).

Phew. That was a lot. I don't want to derail any useful work either of
you is doing. These are just things I've been thinking over (or even in
some cases experimenting with), and I think it's worth laying all the
options on the table. I won't be surprised if you'd considered and
rejected any of these alternate approaches, but I'd be curious to hear
the counter-arguments. :)

-Peff

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph)
  2018-10-09  3:08         ` Jeff King
@ 2018-10-09 13:48           ` Derrick Stolee
  2018-10-09 18:46             ` Jeff King
  0 siblings, 1 reply; 4+ messages in thread
From: Derrick Stolee @ 2018-10-09 13:48 UTC (permalink / raw)
  To: Jeff King
  Cc: SZEDER Gábor, Ævar Arnfjörð Bjarmason,
	Stefan Beller, git, Duy Nguyen

(Changing title to reflect the new topic.)

On 10/8/2018 11:08 PM, Jeff King wrote:
> On Mon, Oct 08, 2018 at 02:29:47PM -0400, Derrick Stolee wrote:
>
>> There are two questions that I was hoping to answer by looking at 
>> your code:
>> 1. How do you store your Bloom filter? Is it connected to the commit-graph
>> and split on a commit-by-commit basis (storing "$path" as a key), or is it
>> one huge Bloom filter (storing "$commitid:$path" as key)?
> I guess you've probably thought all of this through for your
> implementation, but let me pontificate.
>
> I'd have done it as one fixed-size filter per commit. Then you should be
> able to hash the path keys once, and apply the result as a bitwise query
> to each individual commit (I'm assuming that it's constant-time to
> access the filter for each, as an index into an mmap'd array, with the
> offset coming from a commit-graph entry we'd be able to look up anyway).

You're right that we want to hash the path a constant number of times. 
Add in that we want to re-use information already serialized when 
updating the file, and we see that having a commit-by-commit Bloom 
filter is a good idea. Using (commit, path) pairs requires lots of 
re-hashing, repeated work when extending the filter, and poor locality 
when evaluating membership of a single key.

The nice thing is that you can generate k 32-bit hash values based on 
two 32-bit hash values that are "independent enough" (see [1]). We used 
Murmur3 with two different seed values to generate hashes a & b, then 
used the arithmetic progression a, a + b, a + 2b, ..., a + (k-1)b as our 
k hash values. These can be computed up front and then dropped into any 
size filter using a simple modulo operation. This allows flexible sizes 
in our filters.

I don't think fixed size filters are a good idea, and instead would opt 
for flex-sized filters with a maximum size. The typical parameters use 7 
hash functions and a filter size of (at least) 10 bits per entry. For 
most commits (say 60-70%), 256 bits (32 bytes) is enough. Using a 
maximum of 512 bytes covers 99% of commits. We will want these bounds to 
be configurable via config. If we had a fixed size, then we either make 
it too small (and don't have sufficient coverage of commits) or too 
large (and waste a lot of space on the commits that change very little).

We can store these flex-sized filters in the commit-graph using two 
columns of data (two new optional chunks):

* Bloom filter data: stores the binary data of each commit's Bloom 
filter, concatenated together in the same order as the commits appear in 
the commit-graph.

* Bloom filter positions: The ith position of this column stores the 
start of the (i+1)th Bloom filter (The 0th filter starts at byte 0). A 
Bloom filter of size 0 is intended to mean "we didn't store this filter 
because it would be too large". We can compute the lengths of the filter 
by inspecting adjacent values.

In order to be flexible, we will want to encode some basic information 
into the Bloom filter data chunk, such as a tuple of (hash version, num 
hash bits, num bits per entry). This allows us to change the parameters 
in config but still be able to read a serialized filter. Here I assume 
that all filters share the same parameters. The "hash version" here is 
different than the_hash_algo, because we don't care about cryptographic 
security, only a uniform distrubution (hence, Murmur3 is a good, fast 
option).

[1] 
https://web.archive.org/web/20090131053735/http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/esa06.pdf

>
> I think it would also be easier to deal with maintenance, since each
> filter is independent (IIRC, you cannot delete from a bloom filter
> without re-adding all of the other keys).
>
> The obvious downside is that it's O(commits) storage instead of O(1).
It would always be O(changes), as the Bloom filter needs to grow in size 
as the number of entries grows.
> Now let me ponder a bit further afield. A bloom filter lets us answer
> the question "did this commit (probably) touch these paths?". But it
> does not let us answer "which paths did this commit touch?".
>
> That second one is less useful than you might think, because we almost
> always care about not just the names of the paths, but their actual
> object ids. Think about a --raw diff, or even traversing for
> reachability (where if we knew the tree-diff cheaply, we could avoid
> asking "have we seen this yet?" on most of the tree entries). The names
> alone can make that a bit faster, but in the worst case you still have
> to walk the whole tree to find their entries.
>
> But there's also a related question: how do we match pathspec patterns?
> For a changed path like "foo/bar/baz", I imagine a bloom filter would
> mark all of "foo", "foo/bar", and "foo/bar/baz". But what about "*.c"? I
> don't think a bloom filter can answer that.

The filter needs to store every path that would be considered "not 
TREESAME". It can't store wildcards, so you would need to evaluate the 
wildcard and test all of those paths individually (not a good idea).

> At least not by itself. If we imagine that the commit-graph also had an
> alphabetized list of every path in every tree, then it's easy: apply the
> glob to that list once to get a set of concrete paths, and then query
> the bloom filters for those. And that list actually isn't too big. The
> complete set of paths in linux.git is only about 300k gzipped (I think
> that's the most relevant measure, since it's an obvious win to avoid
> repeating shared prefixes of long paths).
As you mention below, we would actually want a list of "every path that 
has ever appeared in the repo".
> Imagine we have that list. Is a bloom filter still the best data
> structure for each commit? At the point that we have the complete
> universe of paths, we could give each commit a bitmap of changed paths.
> That lets us ask "did this commit touch these paths" (collect the bits
> from the list of paths, then check for 1's), as well as "what paths did
> we touch" (collect the 1 bits, and then index the path list).  Those
> bitmaps should compress very well via EWAH or similar (most of them
> would be huge stretches of 0's punctuated by short runs of 1's).

I'm not convinced we would frequently have runs of 1's, and the bitmap 
would not compress much better than simply listing the positions. For 
example, a path "foo/bar" that resolves to a tree would only start a run 
if the next changes are the initial section of entries in that tree 
(sorted lexicographically) such as "foo/bar/a, foo/bar/b". If we deepen 
into a tree, then we will break the run of 1's unless we changed every 
path deeper than that tree.

> So that seems promising to me (or at least not an obvious dead-end). I
> do think maintenance gets to be a headache, though. Adding new paths
> potentially means reordering the bitmaps, which means O(commits) work to
> "incrementally" update the structure. (Unless you always add the new
> paths at the end, but then you lose fast lookups in the list; that might
> be an acceptable tradeoff).
>
> And finally, there's one more radical option: could we actually store a
> real per-commit tree-diff cache? I.e., imagine that each commit had the
> equivalent of a --raw diff easily accessible, including object ids. That
> would allow:
>
>    - fast pathspec matches, including globs
>
>    - fast --raw output (and faster -p output, since we can skip the tree
>      entirely)
>
>    - fast reachability traversals (we only need to bother to look at the
>      objects for changed entries)
>
> where "fast" is basically O(size of commit's changes), rather than
> O(size of whole tree). This was one of the big ideas of packv4 that
> never materialized. You can _almost_ do it with packv2, since after all,
> we end up storing many trees as deltas. But those deltas are byte-wise
> so it's hard for a reader to convert them directly into a pure-tree
> diff (they also don't mention the "deleted" data, so it's really only
> half a diff).
>
> So let's imagine we'd store such a cache external to the regular object
> data (i.e., as a commit-graph entry). The "log --raw" diff of linux.git
> has 1.7M entries. The paths should easily compress to a single 32-bit
> integer (e.g., as an index into a big path list). The oids are 20 bytes.
> Add a few bytes for modes. That's about 80MB. Big, but not impossibly
> so. Maybe pushing it for true gigantic repos, though.

Above, I mentioned my gut reaction that storing a "changed path bitmap" 
per commit would not compress well. That puts that implementation very 
close to the one you suggest here (except we also store the OID changes).

I just want to compare your 80MB here to ~4MB it would take to store 
those changed paths in Bloom filters (10 bits per entry -> ~2MB, but 
adding some slop for the commit-by-commit storage).

> Those numbers are ignoring merges, too. The meaning of "did this commit
> touch that path" is a lot trickier for a merge commit, and I think may
> depend on context. I'm not sure how even a bloom filter solution would
> handle that (I was assuming we'd mostly punt and let merges fall back to
> opening up the trees).

My solution here is to always store the list of paths changed against 
the first parent. If we evaluate TREESAME against our first parent while 
computing simplified file history, then we continue along first-parent 
history. It is possible to store filters for every parent, but I don't 
recommend it. The merge commit will typically have many more change 
paths against the second parent, since the second parent is usually 
bringing in a small change done by few to catch up to the work done in 
parallel by many. Those diffs will frequently run over our limit.

> Phew. That was a lot. I don't want to derail any useful work either of
> you is doing. These are just things I've been thinking over (or even in
> some cases experimenting with), and I think it's worth laying all the
> options on the table. I won't be surprised if you'd considered and
> rejected any of these alternate approaches, but I'd be curious to hear
> the counter-arguments. :)

This is a good discussion to have, since the commit-graph feature is 
getting to a stable place. We still have ongoing algorithm work with 
generation numbers, but this Bloom filter discussion (and 
implementation) can happen in parallel.

Thanks,

-Stolee

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph)
  2018-10-09 13:48           ` Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph) Derrick Stolee
@ 2018-10-09 18:46             ` Jeff King
  2018-10-09 19:03               ` Derrick Stolee
  0 siblings, 1 reply; 4+ messages in thread
From: Jeff King @ 2018-10-09 18:46 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, Ævar Arnfjörð Bjarmason,
	Stefan Beller, git, Duy Nguyen

On Tue, Oct 09, 2018 at 09:48:20AM -0400, Derrick Stolee wrote:

> [I snipped all of the parts about bloom filters that seemed entirely
>  reasonable to me ;) ]

> > Imagine we have that list. Is a bloom filter still the best data
> > structure for each commit? At the point that we have the complete
> > universe of paths, we could give each commit a bitmap of changed paths.
> > That lets us ask "did this commit touch these paths" (collect the bits
> > from the list of paths, then check for 1's), as well as "what paths did
> > we touch" (collect the 1 bits, and then index the path list).  Those
> > bitmaps should compress very well via EWAH or similar (most of them
> > would be huge stretches of 0's punctuated by short runs of 1's).
> 
> I'm not convinced we would frequently have runs of 1's, and the bitmap would
> not compress much better than simply listing the positions. For example, a
> path "foo/bar" that resolves to a tree would only start a run if the next
> changes are the initial section of entries in that tree (sorted
> lexicographically) such as "foo/bar/a, foo/bar/b". If we deepen into a tree,
> then we will break the run of 1's unless we changed every path deeper than
> that tree.

Yeah, I doubt we'd really have runs of 1's (by short, I just mean 1 or
2). I agree that listing the positions could work, though I sort of
assumed that was more or less what a decent compressed bitmap would
turn into. E.g., if bit N is set, we should be able to say "N-1
zeroes, 1 one" in about the same size as we could say "position N".

EWAH seems pretty awful in that regard. Or at least its serialized
format is (or maybe it's our implementation that is bad).

The patch below generates a bitmap for each commit in a repository (it
doesn't output the total list of paths; I've left that as an exercise
for the reader). On linux.git, the result is 57MB. But when I look at
the individual bitmap sizes (via GIT_TRACE), they're quite silly.
Storing a single set bit takes 28 bytes in serialized form!

There are only around 120k unique paths (including prefix trees).
Naively using run-length encoding and varints, our worst case should be
something like 18-20 bits to say "120k zeroes, then a 1, then all
zeroes".  And the average case should be better (you don't even need to
say "120k", but some smaller number).

I wonder if Roaring does better here.

Gzipping the resulting bitmaps drops the total size to about 7.5MB.
That's not a particularly important number, but I think it shows that
the built-in ewah compression is far from ideal.

Just listing the positions with a series of varints would generally be
fine, since we expect sparse bitmaps. I just hoped that a good
RLE scheme would degrade to roughly that for the sparse case, but also
perform well for more dense cases.

So at any rate, I do think it would not be out of the question to store
bitmaps like this. I'm much more worried about the maintenance cost of
adding new entries incrementally. I think it's only feasible if we give
up sorting, and then I wonder what other problems that might cause.

-Peff

-- >8 --
diff --git a/Makefile b/Makefile
index 13e1c52478..f6e823f2d6 100644
--- a/Makefile
+++ b/Makefile
@@ -751,6 +751,7 @@ TEST_PROGRAMS_NEED_X += test-parse-options
 TEST_PROGRAMS_NEED_X += test-pkt-line
 TEST_PROGRAMS_NEED_X += test-svn-fe
 TEST_PROGRAMS_NEED_X += test-tool
+TEST_PROGRAMS_NEED_X += test-tree-bitmap
 
 TEST_PROGRAMS = $(patsubst %,t/helper/%$X,$(TEST_PROGRAMS_NEED_X))
 
diff --git a/t/helper/test-tree-bitmap.c b/t/helper/test-tree-bitmap.c
new file mode 100644
index 0000000000..bc5cf0e514
--- /dev/null
+++ b/t/helper/test-tree-bitmap.c
@@ -0,0 +1,167 @@
+#include "cache.h"
+#include "revision.h"
+#include "diffcore.h"
+#include "argv-array.h"
+#include "ewah/ewok.h"
+
+/* map of pathnames to bit positions */
+struct pathmap_entry {
+	struct hashmap_entry ent;
+	unsigned pos;
+	char path[FLEX_ARRAY];
+};
+
+static int pathmap_entry_hashcmp(const void *unused_cmp_data,
+				 const void *entry,
+				 const void *entry_or_key,
+				 const void *keydata)
+{
+	const struct pathmap_entry *a = entry;
+	const struct pathmap_entry *b = entry_or_key;
+	const char *key = keydata;
+
+	return strcmp(a->path, key ? key : b->path);
+}
+
+static int pathmap_entry_strcmp(const void *va, const void *vb)
+{
+	struct pathmap_entry *a = *(struct pathmap_entry **)va;
+	struct pathmap_entry *b = *(struct pathmap_entry **)vb;
+	return strcmp(a->path, b->path);
+}
+
+struct walk_paths_data {
+	struct hashmap *paths;
+	struct commit *commit;
+};
+
+static void walk_paths(diff_format_fn_t fn, struct hashmap *paths)
+{
+	struct argv_array argv = ARGV_ARRAY_INIT;
+	struct rev_info revs;
+	struct walk_paths_data data;
+	struct commit *commit;
+
+	argv_array_pushl(&argv, "rev-list",
+			 "--all", "-t", "--no-renames",
+			 NULL);
+	init_revisions(&revs, NULL);
+	setup_revisions(argv.argc, argv.argv, &revs, NULL);
+	revs.diffopt.output_format = DIFF_FORMAT_CALLBACK;
+	revs.diffopt.format_callback = fn;
+	revs.diffopt.format_callback_data = &data;
+
+	data.paths = paths;
+
+	prepare_revision_walk(&revs);
+	while ((commit = get_revision(&revs))) {
+		data.commit = commit;
+		diff_tree_combined_merge(commit, 0, &revs);
+	}
+
+	reset_revision_walk();
+	argv_array_clear(&argv);
+}
+
+static void collect_commit_paths(struct diff_queue_struct *q,
+				 struct diff_options *opts,
+				 void *vdata)
+{
+	struct walk_paths_data *data = vdata;
+	int i;
+
+	for (i = 0; i < q->nr; i++) {
+		struct diff_filepair *p = q->queue[i];
+		const char *path = p->one->path;
+		struct pathmap_entry *entry;
+		struct hashmap_entry lookup;
+
+		hashmap_entry_init(&lookup, strhash(path));
+		entry = hashmap_get(data->paths, &lookup, path);
+		if (entry)
+			continue; /* already present */
+
+		FLEX_ALLOC_STR(entry, path, path);
+		entry->ent = lookup;
+		hashmap_put(data->paths, entry);
+	}
+}
+
+/* assign a bit position to all possible paths */
+static void collect_paths(struct hashmap *paths)
+{
+	struct pathmap_entry **sorted;
+	size_t i, n;
+	struct hashmap_iter iter;
+	struct pathmap_entry *entry;
+
+	/* grab all unique paths */
+	hashmap_init(paths, pathmap_entry_hashcmp, NULL, 0);
+	walk_paths(collect_commit_paths, paths);
+
+	/* and assign them bits in sorted order */
+	n = hashmap_get_size(paths);
+	ALLOC_ARRAY(sorted, n);
+	i = 0;
+	for (entry = hashmap_iter_first(paths, &iter);
+	     entry;
+	     entry = hashmap_iter_next(&iter)) {
+		assert(i < n);
+		sorted[i++] = entry;
+	}
+	QSORT(sorted, i, pathmap_entry_strcmp);
+	for (i = 0; i < n; i++)
+		sorted[i]->pos = i;
+	free(sorted);
+}
+
+/* generate the bitmap for a single commit */
+static void generate_bitmap(struct diff_queue_struct *q,
+			    struct diff_options *opts,
+			    void *vdata)
+{
+	struct walk_paths_data *data = vdata;
+	struct bitmap *bitmap = bitmap_new();
+	struct ewah_bitmap *ewah;
+	struct strbuf out = STRBUF_INIT;
+	size_t i;
+
+	for (i = 0; i < q->nr; i++) {
+		struct diff_filepair *p = q->queue[i];
+		const char *path = p->one->path;
+		struct pathmap_entry *entry;
+		struct hashmap_entry lookup;
+
+		hashmap_entry_init(&lookup, strhash(path));
+		entry = hashmap_get(data->paths, &lookup, path);
+		if (!entry)
+			BUG("mysterious path appeared: %s", path);
+
+		bitmap_set(bitmap, entry->pos);
+	}
+
+	ewah = bitmap_to_ewah(bitmap);
+	ewah_serialize_strbuf(ewah, &out);
+	fwrite(out.buf, 1, out.len, stdout);
+
+	trace_printf("bitmap %s %u %u",
+		     oid_to_hex(&data->commit->object.oid),
+		     (unsigned)q->nr,
+		     (unsigned)out.len);
+
+	strbuf_release(&out);
+	ewah_free(ewah);
+	bitmap_free(bitmap);
+}
+
+int cmd_main(int argc, const char **argv)
+{
+	struct hashmap paths;
+
+	setup_git_directory();
+	collect_paths(&paths);
+
+	walk_paths(generate_bitmap, &paths);
+
+	return 0;
+}

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph)
  2018-10-09 18:46             ` Jeff King
@ 2018-10-09 19:03               ` Derrick Stolee
  2018-10-09 21:14                 ` Jeff King
  0 siblings, 1 reply; 4+ messages in thread
From: Derrick Stolee @ 2018-10-09 19:03 UTC (permalink / raw)
  To: Jeff King
  Cc: SZEDER Gábor, Ævar Arnfjörð Bjarmason,
	Stefan Beller, git, Duy Nguyen

On 10/9/2018 2:46 PM, Jeff King wrote:
> On Tue, Oct 09, 2018 at 09:48:20AM -0400, Derrick Stolee wrote:
>
>> [I snipped all of the parts about bloom filters that seemed entirely
>>   reasonable to me ;) ]
>>> Imagine we have that list. Is a bloom filter still the best data
>>> structure for each commit? At the point that we have the complete
>>> universe of paths, we could give each commit a bitmap of changed paths.
>>> That lets us ask "did this commit touch these paths" (collect the bits
>>> from the list of paths, then check for 1's), as well as "what paths did
>>> we touch" (collect the 1 bits, and then index the path list).  Those
>>> bitmaps should compress very well via EWAH or similar (most of them
>>> would be huge stretches of 0's punctuated by short runs of 1's).
>> I'm not convinced we would frequently have runs of 1's, and the bitmap would
>> not compress much better than simply listing the positions. For example, a
>> path "foo/bar" that resolves to a tree would only start a run if the next
>> changes are the initial section of entries in that tree (sorted
>> lexicographically) such as "foo/bar/a, foo/bar/b". If we deepen into a tree,
>> then we will break the run of 1's unless we changed every path deeper than
>> that tree.
> Yeah, I doubt we'd really have runs of 1's (by short, I just mean 1 or
> 2). I agree that listing the positions could work, though I sort of
> assumed that was more or less what a decent compressed bitmap would
> turn into. E.g., if bit N is set, we should be able to say "N-1
> zeroes, 1 one" in about the same size as we could say "position N".
>
> EWAH seems pretty awful in that regard. Or at least its serialized
> format is (or maybe it's our implementation that is bad).
>
> The patch below generates a bitmap for each commit in a repository (it
> doesn't output the total list of paths; I've left that as an exercise
> for the reader). On linux.git, the result is 57MB. But when I look at
> the individual bitmap sizes (via GIT_TRACE), they're quite silly.
> Storing a single set bit takes 28 bytes in serialized form!
>
> There are only around 120k unique paths (including prefix trees).
> Naively using run-length encoding and varints, our worst case should be
> something like 18-20 bits to say "120k zeroes, then a 1, then all
> zeroes".  And the average case should be better (you don't even need to
> say "120k", but some smaller number).
>
> I wonder if Roaring does better here.

In these sparse cases, usually Roaring will organize the data as "array 
chunks" which are simply lists of the values. The thing that makes this 
still compressible is that we store two bytes per entry, as the entries 
are grouped by a common most-significant two bytes. SInce you say ~120k 
unique paths, the Roaring bitmap would have two or three chunks per 
bitmap (and those chunks could be empty). The overhead to store the 
chunk positions, types, and lengths does come at a cost, but it's more 
like 32 bytes _per commit_.

> Gzipping the resulting bitmaps drops the total size to about 7.5MB.
> That's not a particularly important number, but I think it shows that
> the built-in ewah compression is far from ideal.
>
> Just listing the positions with a series of varints would generally be
> fine, since we expect sparse bitmaps. I just hoped that a good
> RLE scheme would degrade to roughly that for the sparse case, but also
> perform well for more dense cases.
>
> So at any rate, I do think it would not be out of the question to store
> bitmaps like this. I'm much more worried about the maintenance cost of
> adding new entries incrementally. I think it's only feasible if we give
> up sorting, and then I wonder what other problems that might cause.
The patch below gives me a starting point to try the Bloom filter 
approach and see what the numbers are like. You did all the "git" stuff 
like computing the changed paths, so thanks!
>
> -- >8 --
> diff --git a/Makefile b/Makefile
> index 13e1c52478..f6e823f2d6 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -751,6 +751,7 @@ TEST_PROGRAMS_NEED_X += test-parse-options
>   TEST_PROGRAMS_NEED_X += test-pkt-line
>   TEST_PROGRAMS_NEED_X += test-svn-fe
>   TEST_PROGRAMS_NEED_X += test-tool
> +TEST_PROGRAMS_NEED_X += test-tree-bitmap
>   
>   TEST_PROGRAMS = $(patsubst %,t/helper/%$X,$(TEST_PROGRAMS_NEED_X))
>   
> diff --git a/t/helper/test-tree-bitmap.c b/t/helper/test-tree-bitmap.c
> new file mode 100644
> index 0000000000..bc5cf0e514
> --- /dev/null
> +++ b/t/helper/test-tree-bitmap.c
> @@ -0,0 +1,167 @@
> +#include "cache.h"
> +#include "revision.h"
> +#include "diffcore.h"
> +#include "argv-array.h"
> +#include "ewah/ewok.h"
> +
> +/* map of pathnames to bit positions */
> +struct pathmap_entry {
> +	struct hashmap_entry ent;
> +	unsigned pos;
> +	char path[FLEX_ARRAY];
> +};
> +
> +static int pathmap_entry_hashcmp(const void *unused_cmp_data,
> +				 const void *entry,
> +				 const void *entry_or_key,
> +				 const void *keydata)
> +{
> +	const struct pathmap_entry *a = entry;
> +	const struct pathmap_entry *b = entry_or_key;
> +	const char *key = keydata;
> +
> +	return strcmp(a->path, key ? key : b->path);
> +}
> +
> +static int pathmap_entry_strcmp(const void *va, const void *vb)
> +{
> +	struct pathmap_entry *a = *(struct pathmap_entry **)va;
> +	struct pathmap_entry *b = *(struct pathmap_entry **)vb;
> +	return strcmp(a->path, b->path);
> +}
> +
> +struct walk_paths_data {
> +	struct hashmap *paths;
> +	struct commit *commit;
> +};
> +
> +static void walk_paths(diff_format_fn_t fn, struct hashmap *paths)
> +{
> +	struct argv_array argv = ARGV_ARRAY_INIT;
> +	struct rev_info revs;
> +	struct walk_paths_data data;
> +	struct commit *commit;
> +
> +	argv_array_pushl(&argv, "rev-list",
> +			 "--all", "-t", "--no-renames",
> +			 NULL);
> +	init_revisions(&revs, NULL);
> +	setup_revisions(argv.argc, argv.argv, &revs, NULL);
> +	revs.diffopt.output_format = DIFF_FORMAT_CALLBACK;
> +	revs.diffopt.format_callback = fn;
> +	revs.diffopt.format_callback_data = &data;
> +
> +	data.paths = paths;
> +
> +	prepare_revision_walk(&revs);
> +	while ((commit = get_revision(&revs))) {
> +		data.commit = commit;
> +		diff_tree_combined_merge(commit, 0, &revs);
> +	}
> +
> +	reset_revision_walk();
> +	argv_array_clear(&argv);
> +}
> +
> +static void collect_commit_paths(struct diff_queue_struct *q,
> +				 struct diff_options *opts,
> +				 void *vdata)
> +{
> +	struct walk_paths_data *data = vdata;
> +	int i;
> +
> +	for (i = 0; i < q->nr; i++) {
> +		struct diff_filepair *p = q->queue[i];
> +		const char *path = p->one->path;
> +		struct pathmap_entry *entry;
> +		struct hashmap_entry lookup;
> +
> +		hashmap_entry_init(&lookup, strhash(path));
> +		entry = hashmap_get(data->paths, &lookup, path);
> +		if (entry)
> +			continue; /* already present */
> +
> +		FLEX_ALLOC_STR(entry, path, path);
> +		entry->ent = lookup;
> +		hashmap_put(data->paths, entry);
> +	}
> +}
> +
> +/* assign a bit position to all possible paths */
> +static void collect_paths(struct hashmap *paths)
> +{
> +	struct pathmap_entry **sorted;
> +	size_t i, n;
> +	struct hashmap_iter iter;
> +	struct pathmap_entry *entry;
> +
> +	/* grab all unique paths */
> +	hashmap_init(paths, pathmap_entry_hashcmp, NULL, 0);
> +	walk_paths(collect_commit_paths, paths);
> +
> +	/* and assign them bits in sorted order */
> +	n = hashmap_get_size(paths);
> +	ALLOC_ARRAY(sorted, n);
> +	i = 0;
> +	for (entry = hashmap_iter_first(paths, &iter);
> +	     entry;
> +	     entry = hashmap_iter_next(&iter)) {
> +		assert(i < n);
> +		sorted[i++] = entry;
> +	}
> +	QSORT(sorted, i, pathmap_entry_strcmp);
> +	for (i = 0; i < n; i++)
> +		sorted[i]->pos = i;
> +	free(sorted);
> +}
> +
> +/* generate the bitmap for a single commit */
> +static void generate_bitmap(struct diff_queue_struct *q,
> +			    struct diff_options *opts,
> +			    void *vdata)
> +{
> +	struct walk_paths_data *data = vdata;
> +	struct bitmap *bitmap = bitmap_new();
> +	struct ewah_bitmap *ewah;
> +	struct strbuf out = STRBUF_INIT;
> +	size_t i;
> +
> +	for (i = 0; i < q->nr; i++) {
> +		struct diff_filepair *p = q->queue[i];
> +		const char *path = p->one->path;
> +		struct pathmap_entry *entry;
> +		struct hashmap_entry lookup;
> +
> +		hashmap_entry_init(&lookup, strhash(path));
> +		entry = hashmap_get(data->paths, &lookup, path);
> +		if (!entry)
> +			BUG("mysterious path appeared: %s", path);
> +
> +		bitmap_set(bitmap, entry->pos);
> +	}
> +
> +	ewah = bitmap_to_ewah(bitmap);
> +	ewah_serialize_strbuf(ewah, &out);
> +	fwrite(out.buf, 1, out.len, stdout);
> +
> +	trace_printf("bitmap %s %u %u",
> +		     oid_to_hex(&data->commit->object.oid),
> +		     (unsigned)q->nr,
> +		     (unsigned)out.len);
> +
> +	strbuf_release(&out);
> +	ewah_free(ewah);
> +	bitmap_free(bitmap);
> +}
> +
> +int cmd_main(int argc, const char **argv)
> +{
> +	struct hashmap paths;
> +
> +	setup_git_directory();
> +	collect_paths(&paths);
> +
> +	walk_paths(generate_bitmap, &paths);
> +
> +	return 0;
> +}

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph)
  2018-10-09 19:03               ` Derrick Stolee
@ 2018-10-09 21:14                 ` Jeff King
  2018-10-09 23:12                   ` Bloom Filters Jeff King
  0 siblings, 1 reply; 4+ messages in thread
From: Jeff King @ 2018-10-09 21:14 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, Ævar Arnfjörð Bjarmason,
	Stefan Beller, git, Duy Nguyen

On Tue, Oct 09, 2018 at 03:03:08PM -0400, Derrick Stolee wrote:

> > I wonder if Roaring does better here.
> 
> In these sparse cases, usually Roaring will organize the data as "array
> chunks" which are simply lists of the values. The thing that makes this
> still compressible is that we store two bytes per entry, as the entries are
> grouped by a common most-significant two bytes. SInce you say ~120k unique
> paths, the Roaring bitmap would have two or three chunks per bitmap (and
> those chunks could be empty). The overhead to store the chunk positions,
> types, and lengths does come at a cost, but it's more like 32 bytes _per
> commit_.

Hmph. It really sounds like we could do better with a custom RLE
solution. But that makes me feel like I'm missing something, because
surely I can't invent something better than the state of the art in a
simple thought experiment, right?

I know what I'm proposing would be quite bad for random access, but my
impression is that EWAH is the same. For the scale of bitmaps we're
talking about, I think linear/streaming access through the bitmap would
be OK.

> > So at any rate, I do think it would not be out of the question to store
> > bitmaps like this. I'm much more worried about the maintenance cost of
> > adding new entries incrementally. I think it's only feasible if we give
> > up sorting, and then I wonder what other problems that might cause.
> The patch below gives me a starting point to try the Bloom filter approach
> and see what the numbers are like. You did all the "git" stuff like
> computing the changed paths, so thanks!

Great, I hope it can be useful. I almost wrote it as perl consuming the
output of "log --format=%h --name-only", but realized I didn't have a
perl ewah implementation handy.

You'll probably want to tweak this part:

> > +	prepare_revision_walk(&revs);
> > +	while ((commit = get_revision(&revs))) {
> > +		data.commit = commit;
> > +		diff_tree_combined_merge(commit, 0, &revs);
> > +	}

...to handle merges in a particular way. This will actually ignore
merges totally. You could add "-m" to the revision arguments to get a
per-parent diff, but of course you'd see those in your callback
individually. If you want to do _just_ the first parent diff, I think
you'll have to pick it apart manually, like:

  while ((commit = get_revision(&revs))) {
	struct object_id *parent_oid;

	/* ignore non-first parents, but handle root commits like --root */
	if (commit->parents)
		parent = &commit->parents->item->object.oid;
	else
		parent = the_hash_algo->empty_tree;

	diff_tree_oid(parent, &commit->oid, ...);
  }

-Peff

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Bloom Filters
  2018-10-09 21:14                 ` Jeff King
@ 2018-10-09 23:12                   ` Jeff King
  2018-10-11 12:33                     ` Derrick Stolee
  0 siblings, 1 reply; 4+ messages in thread
From: Jeff King @ 2018-10-09 23:12 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, Ævar Arnfjörð Bjarmason,
	Stefan Beller, git, Duy Nguyen

On Tue, Oct 09, 2018 at 05:14:50PM -0400, Jeff King wrote:

> Hmph. It really sounds like we could do better with a custom RLE
> solution. But that makes me feel like I'm missing something, because
> surely I can't invent something better than the state of the art in a
> simple thought experiment, right?
> 
> I know what I'm proposing would be quite bad for random access, but my
> impression is that EWAH is the same. For the scale of bitmaps we're
> talking about, I think linear/streaming access through the bitmap would
> be OK.

Thinking on it more, what I was missing is that for truly dense random
bitmaps, this will perform much worse. Because it will use a byte to say
"there's one 1", rather than a bit.

But I think it does OK in practice for the very sparse bitmaps we tend
to see in this application.  I was able to generate a complete output
that can reproduce "log --name-status -t" for linux.git in 32MB. But:

  - 15MB of that is commit sha1s, which will be stored elsewhere in a
    "real" system

  - 5MB of that is path list (which should shrink by a factor of 10 with
    prefix compression, and is really a function of a tree size less
    than history depth)

So the per-commit cost is not too bad. That's still not counting merges,
though, which would add another 10-15% (or maybe more; their bitmaps are
less sparse).

I don't know if this is a fruitful path at all or not. I was mostly just
satisfying my own curiosity on the bitmap encoding question. But I'll
post the patches, just to show my work. The first one is the same
initial proof of concept I showed earlier.

  [1/3]: initial tree-bitmap proof of concept
  [2/3]: test-tree-bitmap: add "dump" mode
  [3/3]: test-tree-bitmap: replace ewah with custom rle encoding

 Makefile                    |   1 +
 t/helper/test-tree-bitmap.c | 344 ++++++++++++++++++++++++++++++++++++
 2 files changed, 345 insertions(+)
 create mode 100644 t/helper/test-tree-bitmap.c

-Peff

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Bloom Filters
  2018-10-09 23:12                   ` Bloom Filters Jeff King
@ 2018-10-11 12:33                     ` Derrick Stolee
  2018-10-11 13:43                       ` Jeff King
  0 siblings, 1 reply; 4+ messages in thread
From: Derrick Stolee @ 2018-10-11 12:33 UTC (permalink / raw)
  To: Jeff King
  Cc: SZEDER Gábor, Ævar Arnfjörð Bjarmason,
	Stefan Beller, git, Duy Nguyen

On 10/9/2018 7:12 PM, Jeff King wrote:
> On Tue, Oct 09, 2018 at 05:14:50PM -0400, Jeff King wrote:
>
>> Hmph. It really sounds like we could do better with a custom RLE
>> solution. But that makes me feel like I'm missing something, because
>> surely I can't invent something better than the state of the art in a
>> simple thought experiment, right?
>>
>> I know what I'm proposing would be quite bad for random access, but my
>> impression is that EWAH is the same. For the scale of bitmaps we're
>> talking about, I think linear/streaming access through the bitmap would
>> be OK.
> Thinking on it more, what I was missing is that for truly dense random
> bitmaps, this will perform much worse. Because it will use a byte to say
> "there's one 1", rather than a bit.
>
> But I think it does OK in practice for the very sparse bitmaps we tend
> to see in this application.  I was able to generate a complete output
> that can reproduce "log --name-status -t" for linux.git in 32MB. But:
>
>    - 15MB of that is commit sha1s, which will be stored elsewhere in a
>      "real" system
>
>    - 5MB of that is path list (which should shrink by a factor of 10 with
>      prefix compression, and is really a function of a tree size less
>      than history depth)
>
> So the per-commit cost is not too bad. That's still not counting merges,
> though, which would add another 10-15% (or maybe more; their bitmaps are
> less sparse).
>
> I don't know if this is a fruitful path at all or not. I was mostly just
> satisfying my own curiosity on the bitmap encoding question. But I'll
> post the patches, just to show my work. The first one is the same
> initial proof of concept I showed earlier.
>
>    [1/3]: initial tree-bitmap proof of concept
>    [2/3]: test-tree-bitmap: add "dump" mode
>    [3/3]: test-tree-bitmap: replace ewah with custom rle encoding
>
>   Makefile                    |   1 +
>   t/helper/test-tree-bitmap.c | 344 ++++++++++++++++++++++++++++++++++++
>   2 files changed, 345 insertions(+)
>   create mode 100644 t/helper/test-tree-bitmap.c
I'm trying to test this out myself, and am having trouble reverse 
engineering how I'm supposed to test it.

Looks like running "t/helper/test-tree-bitmap gen" will output a lot of 
binary data. Where should I store that? Does any file work?

Is this series just for the storage costs, assuming that we would 
replace all TREESAME checks with a query into this database? Or do you 
have a way to test how much this would improve a "git log -- <path>" query?

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Bloom Filters
  2018-10-11 12:33                     ` Derrick Stolee
@ 2018-10-11 13:43                       ` Jeff King
  0 siblings, 0 replies; 4+ messages in thread
From: Jeff King @ 2018-10-11 13:43 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, Ævar Arnfjörð Bjarmason,
	Stefan Beller, git, Duy Nguyen

On Thu, Oct 11, 2018 at 08:33:58AM -0400, Derrick Stolee wrote:

> > I don't know if this is a fruitful path at all or not. I was mostly just
> > satisfying my own curiosity on the bitmap encoding question. But I'll
> > post the patches, just to show my work. The first one is the same
> > initial proof of concept I showed earlier.
> > 
> >    [1/3]: initial tree-bitmap proof of concept
> >    [2/3]: test-tree-bitmap: add "dump" mode
> >    [3/3]: test-tree-bitmap: replace ewah with custom rle encoding
> > 
> >   Makefile                    |   1 +
> >   t/helper/test-tree-bitmap.c | 344 ++++++++++++++++++++++++++++++++++++
> >   2 files changed, 345 insertions(+)
> >   create mode 100644 t/helper/test-tree-bitmap.c
> I'm trying to test this out myself, and am having trouble reverse
> engineering how I'm supposed to test it.
> 
> Looks like running "t/helper/test-tree-bitmap gen" will output a lot of
> binary data. Where should I store that? Does any file work?

Yeah, you can do:

  # optionally run with GIT_TRACE=1 to see some per-bitmap stats
  test-tree-bitmap gen >out

  # this should be roughly the same as:
  #  git rev-list --all |
  #  git diff-tree --stdin -t --name-only
  test-tree-bitmap dump <out

> Is this series just for the storage costs, assuming that we would replace
> all TREESAME checks with a query into this database? Or do you have a way to
> test how much this would improve a "git log -- <path>" query?

Right, I was just looking at storage cost here. It's not integrated with
the diff code at all. I left some hypothetical numbers elsewhere in the
thread.

-Peff

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2018-10-11 13:44 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-17  2:28 bloom filters Loic Dachary
2018-10-03 19:18 We should add a "git gc --auto" after "git clone" due to commit graph Jeff King
2018-10-08 16:41 ` SZEDER Gábor
2018-10-08 16:57   ` Derrick Stolee
2018-10-08 18:10     ` SZEDER Gábor
2018-10-08 18:29       ` Derrick Stolee
2018-10-09  3:08         ` Jeff King
2018-10-09 13:48           ` Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph) Derrick Stolee
2018-10-09 18:46             ` Jeff King
2018-10-09 19:03               ` Derrick Stolee
2018-10-09 21:14                 ` Jeff King
2018-10-09 23:12                   ` Bloom Filters Jeff King
2018-10-11 12:33                     ` Derrick Stolee
2018-10-11 13:43                       ` Jeff King

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.