Hi Gábor,

On Fri, 29 May 2020, SZEDER Gábor wrote:

> Sigh... but better late than never, right?

Well, incremental patches on top of what is _already_ in Git are _even_ better.

> I experimented quite a bit with modified path Bloom filters a year and
> more ago, and got quite far... but my disappointment in the
> inadequacy of all double hashing schemes, the arrival of split
> commit-graphs, and, well, life in general has put the whole thing on
> the back burner, and I haven't touched it for a couple of releases.
>
> Now I finally managed to take a closer look at the current changed
> paths Bloom filters implementation, and saw that it has some of the
> same issues that I had stumbled upon and that it missed some
> optimization opportunities.  Unfortunately, fixing those issues and
> performing those optimizations do require a thorough format change.

Thank you for this background information.

Your claim that there are opportunities to optimize, as well as your claim that a thorough format change is required, strike me as best backed up with evidence by porting your optimizations on top of what is already in `master` and has been reviewed _already_ (as opposed to your new patch series, which essentially threatens to throw away all the time people spent on reviewing Garima's patches).

> So here is my proof of concept version, in all its incompleteness,
> with the following benefits:
>
>   - Better understanding of the problem it tries to optimize.
>   - Better understanding of the issues with many small Bloom filters.
>   - Better hashing scheme (though it should be better still).
>   - Orders of magnitude lower average false positive rate.
>   - Consequently, faster pathspec-limited revision walks.
>   - Faster processing of the tree-diff output and lower memory usage
>     while computing Bloom filters (from scratch...).
>   - Optional support for storing Bloom filters for all parents of
>     merge commits.
>   - Deduplicates Bloom filters.
>   - Supports multiple pathspecs right from the start.
>   - Supports some wildcards in pathspecs.
>   - Handles as many commits as the commit-graph format can.
>   - It has the right name :)  The diff machinery and all its frontends
>     report "modified" paths with the letter 'M', not "changed".
>   - More cleanups, more bugfixes.
>   - Consistent output with and without modified path Bloom filters for
>     over 80k random paths in 16 repositories, even with submodules in
>     them.  Well, at least on my machine, if nowhere else...

It strikes me as an obvious idea to make all those improvements in an incremental manner.

> Alas, the drawbacks are significant:
>
>   - No tests whatsoever.
>   - Computes all modified path Bloom filters from scratch when
>     writing, no matter what.
>   - Doesn't work with split commit-graphs.
>   - Basically if anything works besides 'git commit-graph write
>     --reachable' it's a miracle.
>   - Not a single test.

You said that already in the first bullet point. But yes, I get it: there are no tests.

>   - Many BUG()s, which should rather be graceful errors... though I
>     have to admit that at this point they are indeed bugs.
>   - Many TODOs, both in commit messages and code, some incomplete
>     commit messages, crappy subject lines, even missing signoffs.
>   - Some ridiculously long variable, function, macro and config
>     variable names.
>   - It's based on v2.25.0 (no technical reason, but that's the version
>     I used to run the baseline benchmarks the last time, which takes
>     days...)
>   - I'm pretty sure that there are more...
>   - Oh, did I mention that there are no tests?

Ah, your patch series has no tests.

Please do understand that it can be perceived as quite frustrating that an alternative is presented _this_ late, when such a lot of effort has already gone not only into iterating several times on the patch series, but also into reviewing all of those iterations.
I don't really think that I want to spend the time to review this alternative (not that I am an expert in the space), because it would mean helping to invalidate the effort that went into the current implementation.

All this is to say: I appreciate your efforts to improve the path-based Bloom filters, but at the same time I wish that those efforts would happen in a collaborative manner instead of simply dismissing other people's work.

Ciao,
Johannes
Sigh... but better late than never, right?

I experimented quite a bit with modified path Bloom filters a year and more ago, and got quite far... but my disappointment in the inadequacy of all double hashing schemes, the arrival of split commit-graphs, and, well, life in general has put the whole thing on the back burner, and I haven't touched it for a couple of releases.

Now I finally managed to take a closer look at the current changed paths Bloom filters implementation, and saw that it has some of the same issues that I had stumbled upon and that it missed some optimization opportunities.  Unfortunately, fixing those issues and performing those optimizations do require a thorough format change.

So here is my proof of concept version, in all its incompleteness, with the following benefits:

  - Better understanding of the problem it tries to optimize.
  - Better understanding of the issues with many small Bloom filters.
  - Better hashing scheme (though it should be better still).
  - Orders of magnitude lower average false positive rate.
  - Consequently, faster pathspec-limited revision walks.
  - Faster processing of the tree-diff output and lower memory usage
    while computing Bloom filters (from scratch...).
  - Optional support for storing Bloom filters for all parents of
    merge commits.
  - Deduplicates Bloom filters.
  - Supports multiple pathspecs right from the start.
  - Supports some wildcards in pathspecs.
  - Handles as many commits as the commit-graph format can.
  - It has the right name :)  The diff machinery and all its frontends
    report "modified" paths with the letter 'M', not "changed".
  - More cleanups, more bugfixes.
  - Consistent output with and without modified path Bloom filters for
    over 80k random paths in 16 repositories, even with submodules in
    them.  Well, at least on my machine, if nowhere else...

Alas, the drawbacks are significant:

  - No tests whatsoever.
  - Computes all modified path Bloom filters from scratch when
    writing, no matter what.
  - Doesn't work with split commit-graphs.
  - Basically if anything works besides 'git commit-graph write
    --reachable' it's a miracle.
  - Not a single test.
  - Many BUG()s, which should rather be graceful errors... though I
    have to admit that at this point they are indeed bugs.
  - Many TODOs, both in commit messages and code, some incomplete
    commit messages, crappy subject lines, even missing signoffs.
  - Some ridiculously long variable, function, macro and config
    variable names.
  - It's based on v2.25.0 (no technical reason, but that's the version
    I used to run the baseline benchmarks the last time, which takes
    days...)
  - I'm pretty sure that there are more...
  - Oh, did I mention that there are no tests?

The first 14 patches are preparatory fixes and cleanups:

  01/34 tree-walk.c: don't match submodule entries for 'submod/anything'

    This fix or something similar is necessary to have consistent
    output with and without modified path Bloom filters for paths
    crossing a submodule boundary.

  02/34 commit-graph: fix parsing the Chunk Lookup table

    The minimal (though not the best) fix for a bug which, I think, is
    as old as the commit-graph.  I don't know how to test this.

  03/34 commit-graph-format.txt: all multi-byte numbers are in network byte order
  04/34 commit-slab: add a function to deep free entries on the slab
  05/34 diff.h: drop diff_tree_oid() & friends' return value
  06/34 commit-graph: clean up #includes

    A couple of minor cleanups.

  07/34 commit-graph: simplify parse_commit_graph() #1
  08/34 commit-graph: simplify parse_commit_graph() #2

    These two would be the right, though not minimal, fix for the
    parsing bug above.

  09/34 commit-graph: simplify write_commit_graph_file() #1
  10/34 commit-graph: simplify write_commit_graph_file() #2
  11/34 commit-graph: allocate the 'struct chunk_info' array dynamically

    I think these three cleanup patches are a better alternative to
    3be7efcafc (commit-graph: define and use MAX_NUM_CHUNKS,
    2020-03-30), because...
  12/34 commit-graph: unify the signatures of all write_graph_chunk_*() functions
  13/34 commit-graph: simplify write_commit_graph_file() #3
  14/34 commit-graph: check chunk sizes after writing

    ... they laid the groundwork for this patch.

  15/34 commit-graph-format.txt: document the modified path Bloom filter chunks

    This is the most important one, specifying and _justifying_ the
    new chunk formats.  Do grab a cup or pint of your favourite
    beverage and get comfy before reading this one.  You have been
    warned.

  16/34 Add a generic and minimal Bloom filter implementation
  17/34 Import a streaming-capable Murmur3 hash function implementation
  18/34 commit-graph: write "empty" Modified Path Bloom Filter Index chunk
  19/34 commit-graph: add commit slab for modified path Bloom filters
  20/34 commit-graph: fill the Modified Path Bloom Filter Index chunk

    This shows a more efficient approach to processing the tree-diff
    output into Bloom filters.

  21/34 commit-graph: load and use the Modified Path Bloom Filter Index chunk
  22/34 commit-graph: write the Modified Path Bloom Filters chunk
  23/34 commit-graph: load and use the Modified Path Bloom Filters chunk
  24/34 commit-graph: check all leading directories in modified path Bloom filters

    This was a good lightbulb moment.  It is essential to try to
    maintain reasonable performance in repositories where the vast
    majority of changes are concentrated in a single directory.
  25/34 commit-graph: check embedded modified path Bloom filters with a mask
  26/34 commit-graph: deduplicate modified path Bloom filters
  27/34 commit-graph: load modified path Bloom filters for merge commits
  28/34 commit-graph: write Modified Path Bloom Filter Merge Index chunk
  29/34 commit-graph: extract init and free write_commit_graph_context
  30/34 commit-graph: move write_commit_graph_reachable below write_commit_graph
  31/34 t7007-show: make the first test compatible with the next patch
  32/34 PoC commit-graph: use revision walk machinery for '--reachable'

    Once upon a time I thought this was the greatest idea ever, but as
    time goes by I get more and more concerned that this is a really
    dumb idea, though I don't yet know why.

  33/34 commit-graph: write modified path Bloom filters in "history order"
  34/34 commit-graph: use modified path Bloom filters with wildcards, if possible

    Finally, a cherry on top.

SZEDER Gábor (34):
  tree-walk.c: don't match submodule entries for 'submod/anything'
  commit-graph: fix parsing the Chunk Lookup table
  commit-graph-format.txt: all multi-byte numbers are in network byte order
  commit-slab: add a function to deep free entries on the slab
  diff.h: drop diff_tree_oid() & friends' return value
  commit-graph: clean up #includes
  commit-graph: simplify parse_commit_graph() #1
  commit-graph: simplify parse_commit_graph() #2
  commit-graph: simplify write_commit_graph_file() #1
  commit-graph: simplify write_commit_graph_file() #2
  commit-graph: allocate the 'struct chunk_info' array dynamically
  commit-graph: unify the signatures of all write_graph_chunk_*() functions
  commit-graph: simplify write_commit_graph_file() #3
  commit-graph: check chunk sizes after writing
  commit-graph-format.txt: document the modified path Bloom filter chunks
  Add a generic and minimal Bloom filter implementation
  Import a streaming-capable Murmur3 hash function implementation
  commit-graph: write "empty" Modified Path Bloom Filter Index chunk
  commit-graph: add commit slab for modified path Bloom filters
  commit-graph: fill the Modified Path Bloom Filter Index chunk
  commit-graph: load and use the Modified Path Bloom Filter Index chunk
  commit-graph: write the Modified Path Bloom Filters chunk
  commit-graph: load and use the Modified Path Bloom Filters chunk
  commit-graph: check all leading directories in modified path Bloom filters
  commit-graph: check embedded modified path Bloom filters with a mask
  commit-graph: deduplicate modified path Bloom filters
  commit-graph: load modified path Bloom filters for merge commits
  commit-graph: write Modified Path Bloom Filter Merge Index chunk
  commit-graph: extract init and free write_commit_graph_context
  commit-graph: move write_commit_graph_reachable below write_commit_graph
  t7007-show: make the first test compatible with the next patch
  PoC commit-graph: use revision walk machinery for '--reachable'
  commit-graph: write modified path Bloom filters in "history order"
  commit-graph: use modified path Bloom filters with wildcards, if possible

 Documentation/config/core.txt                 |   19 +
 .../technical/commit-graph-format.txt         |  127 +-
 Makefile                                      |    2 +
 bloom-filter.c                                |   91 ++
 bloom-filter.h                                |   47 +
 commit-graph.c                                | 1239 +++++++++++++++--
 commit-graph.h                                |   24 +-
 commit-slab-decl.h                            |    1 +
 commit-slab-impl.h                            |   13 +
 commit-slab.h                                 |   10 +
 compat/PMurHash.c                             |  291 ++++
 compat/PMurHash.h                             |   62 +
 diff.h                                        |   10 +-
 pathspec.c                                    |   10 +
 pathspec.h                                    |   13 +
 revision.c                                    |   27 +-
 shallow.c                                     |   14 +-
 t/t4010-diff-pathspec.sh                      |    4 +-
 t/t5318-commit-graph.sh                       |    3 +-
 t/t7007-show.sh                               |    7 +-
 tree-diff.c                                   |   21 +-
 tree-walk.c                                   |    9 +-
 22 files changed, 1872 insertions(+), 172 deletions(-)
 create mode 100644 bloom-filter.c
 create mode 100644 bloom-filter.h
 create mode 100644 compat/PMurHash.c
 create mode 100644 compat/PMurHash.h

-- 
2.27.0.rc1.431.g5c813f95dc
Submodules should be handled the same as regular directories with respect to the presence of a trailing slash, i.e. commands like:

  git diff rev1 rev2 -- $path
  git rev-list HEAD -- $path

should produce the same output whether $path is 'submod' or 'submod/'.  This has been fixed in commit 74b4f7f277 (tree-walk.c: ignore trailing slash on submodule in tree_entry_interesting(), 2014-01-23).

Unfortunately, that commit had the unintended side effect of handling 'submod/anything' the same as 'submod' and 'submod/' as well, e.g.:

  $ git log --oneline --name-only -- sha1collisiondetection/whatever
  4125f78222 sha1dc: update from upstream
  sha1collisiondetection
  07a20f569b Makefile: fix unaligned loads in sha1dc with UBSan
  sha1collisiondetection
  23e37f8e9d sha1dc: update from upstream
  sha1collisiondetection
  86cfd61e6b sha1dc: optionally use sha1collisiondetection as a submodule
  sha1collisiondetection

Fix this by rejecting submodules as partial pathnames when their trailing slash is followed by anything.
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 t/t4010-diff-pathspec.sh | 4 +++-
 tree-walk.c              | 9 ++++++++-
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/t/t4010-diff-pathspec.sh b/t/t4010-diff-pathspec.sh
index e5ca359edf..65cc703c65 100755
--- a/t/t4010-diff-pathspec.sh
+++ b/t/t4010-diff-pathspec.sh
@@ -125,7 +125,9 @@ test_expect_success 'setup submodules' '
 test_expect_success 'diff-tree ignores trailing slash on submodule path' '
 	git diff --name-only HEAD^ HEAD submod >expect &&
 	git diff --name-only HEAD^ HEAD submod/ >actual &&
-	test_cmp expect actual
+	test_cmp expect actual &&
+	git diff --name-only HEAD^ HEAD -- submod/whatever >actual &&
+	test_must_be_empty actual
 '
 
 test_expect_success 'diff multiple wildcard pathspecs' '
diff --git a/tree-walk.c b/tree-walk.c
index d5a8e096a6..7fbb150312 100644
--- a/tree-walk.c
+++ b/tree-walk.c
@@ -849,7 +849,14 @@ static int match_entry(const struct pathspec_item *item,
 	if (matchlen > pathlen) {
 		if (match[pathlen] != '/')
 			return 0;
-		if (!S_ISDIR(entry->mode) && !S_ISGITLINK(entry->mode))
+		/*
+		 * Reject non-directories as partial pathnames, except
+		 * when match is a submodule with a trailing slash and
+		 * nothing else (to handle 'submod/' and 'submod'
+		 * uniformly).
+		 */
+		if (!S_ISDIR(entry->mode) &&
+		    (!S_ISGITLINK(entry->mode) || matchlen > pathlen + 1))
 			return 0;
 	}

-- 
2.27.0.rc1.431.g5c813f95dc
The commit-graph file format specifies that the chunks may be in any order.  However, if the OID Lookup chunk happens to be the last one in the file, then any command attempting to access the commit-graph data will fail with:

  fatal: invalid commit position. commit-graph is likely corrupt

In this case the error is wrong: the commit-graph file does conform to the specification, but the parsing of the Chunk Lookup table is a bit buggy, and leaves the field holding the number of commits in the commit-graph zero-initialized.

The number of commits in the commit-graph is determined while parsing the Chunk Lookup table, by dividing the size of the OID Lookup chunk by the hash size.  However, the Chunk Lookup table doesn't actually store the size of the chunks, but their starting offsets.  Consequently, the size of a chunk can only be calculated by subtracting the starting offset of that chunk from the offset of the subsequent chunk, or, in case of the last chunk, from the offset recorded in the terminating label.

This is currently implemented in a bit complicated way: as we iterate over the entries of the Chunk Lookup table, we check the ID of each chunk and store its starting offset, then we check the ID of the last seen chunk and calculate its size using its previously saved offset if necessary (at the moment it's only necessary for the OID Lookup chunk).  Alas, while parsing the Chunk Lookup table we only iterate through the "real" chunks, but never look at the terminating label, and thus don't even check whether it's necessary to calculate the size of the last chunk.  Consequently, if the OID Lookup chunk is the last one, then we don't calculate its size and in turn don't run the piece of code determining the number of commits in the commit-graph, leaving the field holding that number unchanged (i.e. zero-initialized), eventually triggering the sanity check in load_oid_from_graph().
Fix this by iterating through all entries in the Chunk Lookup table, including the terminating label.

Note that this is the minimal fix, suitable for the maintenance track.  A better fix would be to simplify how the chunk sizes are calculated, but that is a more invasive change, less suitable for 'maint', so that will be done in later patches.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index b205e65ed1..12df74c925 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -222,7 +222,7 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
 	last_chunk_id = 0;
 	last_chunk_offset = 8;
 	chunk_lookup = data + 8;
-	for (i = 0; i < graph->num_chunks; i++) {
+	for (i = 0; i <= graph->num_chunks; i++) {
 		uint32_t chunk_id;
 		uint64_t chunk_offset;
 		int chunk_repeated = 0;

-- 
2.27.0.rc1.431.g5c813f95dc
The commit-graph format specifies that "All 4-byte numbers are in network order", but the commit-graph contains 8-byte integers as well (file offsets in the Chunk Lookup table), and their byte order is unspecified.  Clarify that all multi-byte integers are in network byte order.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 Documentation/technical/commit-graph-format.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index a4f17441ae..efd0f99d8b 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -29,7 +29,7 @@ the body into "chunks" and provide a binary lookup table at the beginning
 of the body. The header includes certain values, such as number of chunks
 and hash type.
 
-All 4-byte numbers are in network order.
+All multi-byte numbers are in network byte order.
 
 HEADER:

-- 
2.27.0.rc1.431.g5c813f95dc
clear_##slabname() frees only the memory allocated for a commit slab itself, but entries in the commit slab might own additional memory outside the slab that should be freed as well.  We already have (at least) one such commit slab, and this patch series is about to add one more.

To free all additional memory owned by entries on the commit slab, the user of such a slab could iterate over all commits it knows about, peek whether there is a valid entry associated with each commit, and free the additional memory, if any.  Or it could rely on intimate knowledge about the internals of the commit slab implementation, and could itself iterate directly through all entries in the slab and free the additional memory.  Or it could just leak the additional memory...

Introduce deep_clear_##slabname() to allow releasing memory owned by commit slab entries by invoking the 'void free_fn(elemtype *ptr)' function specified as parameter for each entry in the slab.  Use it in get_shallow_commits() in 'shallow.c' to replace an open-coded iteration over a commit slab's entries.
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-slab-decl.h |  1 +
 commit-slab-impl.h | 13 +++++++++++++
 commit-slab.h      | 10 ++++++++++
 shallow.c          | 14 +++++---------
 4 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/commit-slab-decl.h b/commit-slab-decl.h
index adc7b46c83..286164b7e2 100644
--- a/commit-slab-decl.h
+++ b/commit-slab-decl.h
@@ -32,6 +32,7 @@ struct slabname { \
 void init_ ##slabname## _with_stride(struct slabname *s, unsigned stride); \
 void init_ ##slabname(struct slabname *s); \
 void clear_ ##slabname(struct slabname *s); \
+void deep_clear_ ##slabname(struct slabname *s, void (*free_fn)(elemtype *ptr)); \
 elemtype *slabname## _at_peek(struct slabname *s, const struct commit *c, int add_if_missing); \
 elemtype *slabname## _at(struct slabname *s, const struct commit *c); \
 elemtype *slabname## _peek(struct slabname *s, const struct commit *c)
diff --git a/commit-slab-impl.h b/commit-slab-impl.h
index 5c0eb91a5d..557738df27 100644
--- a/commit-slab-impl.h
+++ b/commit-slab-impl.h
@@ -38,6 +38,19 @@ scope void clear_ ##slabname(struct slabname *s) \
 	FREE_AND_NULL(s->slab); \
 } \
 \
+scope void deep_clear_ ##slabname(struct slabname *s, void (*free_fn)(elemtype *)) \
+{ \
+	unsigned int i; \
+	for (i = 0; i < s->slab_count; i++) { \
+		unsigned int j; \
+		if (!s->slab[i]) \
+			continue; \
+		for (j = 0; j < s->slab_size; j++) \
+			free_fn(&s->slab[i][j * s->stride]); \
+	} \
+	clear_ ##slabname(s); \
+} \
+\
 scope elemtype *slabname## _at_peek(struct slabname *s, \
 				    const struct commit *c, \
 				    int add_if_missing) \
diff --git a/commit-slab.h b/commit-slab.h
index 69bf0c807c..7bd472171e 100644
--- a/commit-slab.h
+++ b/commit-slab.h
@@ -42,6 +42,16 @@
  *
  *    Call this function before the slab falls out of scope to avoid
  *    leaking memory.
+ *
+ * - void deep_clear_indegree(struct indegree *, void (*free_fn)(int*))
+ *
+ *    Empties the slab, similar to clear_indegree(), but in addition it
+ *    calls the given 'free_fn' for each slab entry to release any
+ *    additional memory that might be owned by the entry (but not the
+ *    entry itself!).
+ *    Note that 'free_fn' might be called even for entries for which no
+ *    indegree_at() call has been made; in this case 'free_fn' is invoked
+ *    with a pointer to a zero-initialized location.
  */
 
 #define define_commit_slab(slabname, elemtype) \
diff --git a/shallow.c b/shallow.c
index 7fd04afed1..d02ba494ee 100644
--- a/shallow.c
+++ b/shallow.c
@@ -84,6 +84,10 @@ int is_repository_shallow(struct repository *r)
  * supports a "valid" flag.
  */
 define_commit_slab(commit_depth, int *);
+void free_depth_in_slab(int **ptr)
+{
+	FREE_AND_NULL(*ptr);
+}
 struct commit_list *get_shallow_commits(struct object_array *heads, int depth,
 					int shallow_flag, int not_shallow_flag)
 {
@@ -150,15 +154,7 @@ struct commit_list *get_shallow_commits(struct object_array *heads, int depth,
 			}
 		}
 	}
-	for (i = 0; i < depths.slab_count; i++) {
-		int j;
-
-		if (!depths.slab[i])
-			continue;
-		for (j = 0; j < depths.slab_size; j++)
-			free(depths.slab[i][j]);
-	}
-	clear_commit_depth(&depths);
+	deep_clear_commit_depth(&depths, free_depth_in_slab);
 	return result;
 }

-- 
2.27.0.rc1.431.g5c813f95dc
ll_diff_tree_oid() has only ever returned 0 [1], so its return value is basically useless.  Its only caller diff_tree_oid() has only ever returned the return value of ll_diff_tree_oid() as-is [2], so its return value is just as useless.  Most of diff_tree_oid()'s callers simply ignore its return value, except:

  - diff_root_tree_oid() is a thin wrapper around diff_tree_oid() and
    returns with its return value, but all of diff_root_tree_oid()'s
    callers ignore its return value.

  - rev_compare_tree() and rev_same_tree_as_empty() do look at the
    return value in a condition, but, since the return value is always
    0, the former's < 0 condition is never fulfilled, while the
    latter's >= 0 condition is always fulfilled.

So let's drop the return value of ll_diff_tree_oid(), diff_tree_oid() and diff_root_tree_oid(), and drop those conditions from rev_compare_tree() and rev_same_tree_as_empty() as well.

[1] ll_diff_tree_oid() and its ancestors have been returning only 0 ever since it was introduced as diff_tree() in 9174026cfe (Add "diff-tree" program to show which files have changed between two trees., 2005-04-09).

[2] diff_tree_oid() traces back to diff-tree.c:main() in 9174026cfe as well.
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 diff.h      | 10 +++++-----
 revision.c  |  9 +++------
 tree-diff.c | 21 +++++++++------------
 3 files changed, 17 insertions(+), 23 deletions(-)

diff --git a/diff.h b/diff.h
index 6febe7e365..a1459de824 100644
--- a/diff.h
+++ b/diff.h
@@ -426,11 +426,11 @@ struct combine_diff_path *diff_tree_paths(
 	struct combine_diff_path *p, const struct object_id *oid,
 	const struct object_id **parents_oid, int nparent,
 	struct strbuf *base, struct diff_options *opt);
-int diff_tree_oid(const struct object_id *old_oid,
-		  const struct object_id *new_oid,
-		  const char *base, struct diff_options *opt);
-int diff_root_tree_oid(const struct object_id *new_oid, const char *base,
-		       struct diff_options *opt);
+void diff_tree_oid(const struct object_id *old_oid,
+		   const struct object_id *new_oid,
+		   const char *base, struct diff_options *opt);
+void diff_root_tree_oid(const struct object_id *new_oid, const char *base,
+			struct diff_options *opt);
 
 struct combine_diff_path {
 	struct combine_diff_path *next;
diff --git a/revision.c b/revision.c
index 8136929e23..9dac1262ef 100644
--- a/revision.c
+++ b/revision.c
@@ -655,15 +655,12 @@ static int rev_compare_tree(struct rev_info *revs,
 
 	tree_difference = REV_TREE_SAME;
 	revs->pruning.flags.has_changes = 0;
-	if (diff_tree_oid(&t1->object.oid, &t2->object.oid, "",
-			   &revs->pruning) < 0)
-		return REV_TREE_DIFFERENT;
+	diff_tree_oid(&t1->object.oid, &t2->object.oid, "", &revs->pruning);
 	return tree_difference;
 }
 
 static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit)
 {
-	int retval;
 	struct tree *t1 = get_commit_tree(commit);
 
 	if (!t1)
@@ -671,9 +668,9 @@ static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit)
 
 	tree_difference = REV_TREE_SAME;
 	revs->pruning.flags.has_changes = 0;
-	retval = diff_tree_oid(NULL, &t1->object.oid, "", &revs->pruning);
+	diff_tree_oid(NULL, &t1->object.oid, "", &revs->pruning);
 
-	return retval >= 0 && (tree_difference == REV_TREE_SAME);
+	return tree_difference == REV_TREE_SAME;
 }
 
 struct treesame_state {
diff --git a/tree-diff.c b/tree-diff.c
index 33ded7f8b3..9775e4bcaa 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -29,7 +29,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
 	struct combine_diff_path *p, const struct object_id *oid,
 	const struct object_id **parents_oid, int nparent,
 	struct strbuf *base, struct diff_options *opt);
-static int ll_diff_tree_oid(const struct object_id *old_oid,
+static void ll_diff_tree_oid(const struct object_id *old_oid,
 			    const struct object_id *new_oid,
 			    struct strbuf *base, struct diff_options *opt);
 
@@ -673,7 +673,7 @@ static void try_to_follow_renames(const struct object_id *old_oid,
 	q->nr = 1;
 }
 
-static int ll_diff_tree_oid(const struct object_id *old_oid,
+static void ll_diff_tree_oid(const struct object_id *old_oid,
 			    const struct object_id *new_oid,
 			    struct strbuf *base, struct diff_options *opt)
 {
@@ -691,29 +691,26 @@ static void ll_diff_tree_oid(const struct object_id *old_oid,
 	}
 
 	opt->pathchange = pathchange_old;
-	return 0;
 }
 
-int diff_tree_oid(const struct object_id *old_oid,
-		  const struct object_id *new_oid,
-		  const char *base_str, struct diff_options *opt)
+void diff_tree_oid(const struct object_id *old_oid,
+		   const struct object_id *new_oid,
+		   const char *base_str, struct diff_options *opt)
 {
 	struct strbuf base;
-	int retval;
 
 	strbuf_init(&base, PATH_MAX);
 	strbuf_addstr(&base, base_str);
 
-	retval = ll_diff_tree_oid(old_oid, new_oid, &base, opt);
+	ll_diff_tree_oid(old_oid, new_oid, &base, opt);
 	if (!*base_str && opt->flags.follow_renames && diff_might_be_rename())
 		try_to_follow_renames(old_oid, new_oid, &base, opt);
 
 	strbuf_release(&base);
-
-	return retval;
 }
 
-int diff_root_tree_oid(const struct object_id *new_oid, const char *base, struct diff_options *opt)
+void diff_root_tree_oid(const struct object_id *new_oid, const char *base,
+			struct diff_options *opt)
 {
-	return diff_tree_oid(NULL, new_oid, base, opt);
+	diff_tree_oid(NULL, new_oid, base, opt);
 }

-- 
2.27.0.rc1.431.g5c813f95dc
Our CodingGuidelines says that it's sufficient to include one of 'git-compat-util.h' and 'cache.h', but both 'commit-graph.c' and 'commit-graph.h' include both.  Let's include only 'git-compat-util.h' to lose a bunch of unnecessary dependencies, but include 'hash.h', because 'commit-graph.h' does require the definition of 'struct object_id'.

'commit-graph.h' explicitly includes 'repository.h' and 'string-list.h', but only needs the declaration of a few structs from them.  Drop these includes and forward-declare the necessary structs instead.

'commit-graph.c' includes 'dir.h', but doesn't actually use anything from there, so let's drop that #include as well.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 4 +---
 commit-graph.h | 7 ++++---
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 12df74c925..f5ba8a9578 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1,7 +1,5 @@
-#include "cache.h"
-#include "config.h"
-#include "dir.h"
 #include "git-compat-util.h"
+#include "config.h"
 #include "lockfile.h"
 #include "pack.h"
 #include "packfile.h"
diff --git a/commit-graph.h b/commit-graph.h
index 7f5c933fa2..847cd25cfc 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -2,14 +2,15 @@
 #define COMMIT_GRAPH_H
 
 #include "git-compat-util.h"
-#include "repository.h"
-#include "string-list.h"
-#include "cache.h"
+#include "hash.h"
 
 #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
 #define GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD"
 
 struct commit;
+struct repository;
+struct raw_object_store;
+struct string_list;
 
 char *get_commit_graph_filename(const char *obj_dir);
 int open_commit_graph(const char *graph_file, int *fd, struct stat *st);

-- 
2.27.0.rc1.431.g5c813f95dc
While we iterate over all entries of the Chunk Lookup table we make sure that we don't attempt to read past the end of the mmap-ed commit-graph file, and check in each iteration that the chunk ID and offset we are about to read are still within the mmap-ed memory region.  However, these per-iteration checks are not really necessary, because the number of chunks in the commit-graph file is already known before this loop from the just-parsed commit-graph header.

So let's check that the commit-graph file is large enough for all entries in the Chunk Lookup table before we start iterating over those entries, and drop those per-iteration checks.  While at it, take into account the size of everything that is necessary to have a valid commit-graph file, i.e. the size of the header, the size of the mandatory OID Fanout chunk, and the size of the signature in the trailer as well.

Note that this necessitates changing the error message as well, and, consequently, we have to update the 'detect incorrect chunk count' test in 't5318-commit-graph.sh' too.
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> --- commit-graph.c | 16 +++++++++------- t/t5318-commit-graph.sh | 3 ++- 2 files changed, 11 insertions(+), 8 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index f5ba8a9578..69facf3bf5 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -217,6 +217,15 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd, graph->data = graph_map; graph->data_len = graph_size; + if (graph_size < GRAPH_HEADER_SIZE + + (graph->num_chunks + 1) * GRAPH_CHUNKLOOKUP_WIDTH + + GRAPH_FANOUT_SIZE + the_hash_algo->rawsz) { + error(_("commit-graph file is too small to hold %u chunks"), + graph->num_chunks); + free(graph); + return NULL; + } + last_chunk_id = 0; last_chunk_offset = 8; chunk_lookup = data + 8; @@ -225,13 +234,6 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd, uint64_t chunk_offset; int chunk_repeated = 0; - if (data + graph_size - chunk_lookup < - GRAPH_CHUNKLOOKUP_WIDTH) { - error(_("commit-graph chunk lookup table entry missing; file may be incomplete")); - free(graph); - return NULL; - } - chunk_id = get_be32(chunk_lookup + 0); chunk_offset = get_be64(chunk_lookup + 4); diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index 3f03de6018..b9a5101c09 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -572,7 +572,8 @@ test_expect_success 'detect invalid checksum hash' ' test_expect_success 'detect incorrect chunk count' ' corrupt_graph_and_verify $GRAPH_BYTE_CHUNK_COUNT "\377" \ - "chunk lookup table entry missing" $GRAPH_CHUNK_LOOKUP_OFFSET + "commit-graph file is too small to hold [0-9]* chunks" \ + $GRAPH_CHUNK_LOOKUP_OFFSET ' test_expect_success 'git fsck (checks commit-graph)' ' -- 2.27.0.rc1.431.g5c813f95dc
The Chunk Lookup table stores the chunks' starting offset in the
commit-graph file, not their sizes.  Consequently, the size of a chunk
can only be calculated by subtracting its offset from the offset of
the subsequent chunk (or that of the terminating label).

This is currently implemented in a somewhat complicated way: as we
iterate over the entries of the Chunk Lookup table, we check the id of
each chunk and store its starting offset, then we check the id of the
last seen chunk and calculate its size using its previously saved
offset.  At the moment there is only one chunk for which we calculate
its size, but this patch series will add more, and the repeated chunk
id checks are not that pretty.

Instead let's read ahead the offset of the next chunk on each
iteration, so we can calculate the size of each chunk right away,
right where we store its starting offset.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 26 ++++++++++----------------
 1 file changed, 10 insertions(+), 16 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 69facf3bf5..f64b1c01a8 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -175,8 +175,7 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
 	const unsigned char *data, *chunk_lookup;
 	uint32_t i;
 	struct commit_graph *graph;
-	uint64_t last_chunk_offset;
-	uint32_t last_chunk_id;
+	uint64_t next_chunk_offset;
 	uint32_t graph_signature;
 	unsigned char graph_version, hash_version;
 
@@ -226,16 +225,17 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
 		return NULL;
 	}
 
-	last_chunk_id = 0;
-	last_chunk_offset = 8;
 	chunk_lookup = data + 8;
-	for (i = 0; i <= graph->num_chunks; i++) {
+	next_chunk_offset = get_be64(chunk_lookup + 4);
+	for (i = 0; i < graph->num_chunks; i++) {
 		uint32_t chunk_id;
 		uint64_t chunk_offset;
 		int chunk_repeated = 0;
 
 		chunk_id = get_be32(chunk_lookup + 0);
-		chunk_offset = get_be64(chunk_lookup + 4);
+		chunk_offset = next_chunk_offset;
+		next_chunk_offset = get_be64(chunk_lookup + 4 +
+					     GRAPH_CHUNKLOOKUP_WIDTH);
 
 		chunk_lookup += GRAPH_CHUNKLOOKUP_WIDTH;
 
@@ -257,8 +257,11 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
 		case GRAPH_CHUNKID_OIDLOOKUP:
 			if (graph->chunk_oid_lookup)
 				chunk_repeated = 1;
-			else
+			else {
 				graph->chunk_oid_lookup = data + chunk_offset;
+				graph->num_commits = (next_chunk_offset - chunk_offset)
+						     / graph->hash_len;
+			}
 			break;
 
 		case GRAPH_CHUNKID_DATA:
@@ -287,15 +290,6 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
 			free(graph);
 			return NULL;
 		}
-
-		if (last_chunk_id == GRAPH_CHUNKID_OIDLOOKUP)
-		{
-			graph->num_commits = (chunk_offset - last_chunk_offset)
-				/ graph->hash_len;
-		}
-
-		last_chunk_id = chunk_id;
-		last_chunk_offset = chunk_offset;
 	}
 
 	hashcpy(graph->oid.hash, graph->data + graph->data_len - graph->hash_len);
-- 
2.27.0.rc1.431.g5c813f95dc
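The read-ahead bookkeeping above boils down to a simple invariant: a chunk's size is the distance between its own Chunk Lookup offset and the next entry's offset, where the last "next entry" is the terminating label. A minimal standalone sketch of that invariant, with made-up entry values (the struct and values are illustrative, not git's actual in-memory representation):

```c
#include <stddef.h>
#include <stdint.h>

/* One Chunk Lookup table entry: a chunk ID and its starting offset. */
struct lookup_entry {
	uint32_t id;     /* 0 marks the terminating label */
	uint64_t offset; /* offset from the beginning of the file */
};

/*
 * The size of chunk i is not stored anywhere in the file; it is the
 * distance to the next entry's offset.  For the last chunk that next
 * entry is the terminating label, whose offset points just past the
 * last chunk.
 */
static uint64_t chunk_size(const struct lookup_entry *table, size_t i)
{
	return table[i + 1].offset - table[i].offset;
}
```

This is why the loop in the patch reads `next_chunk_offset` one entry ahead: the current chunk's size is available in the same iteration that records its starting offset.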
In write_commit_graph_file() one block of code fills the array of
chunk IDs, another block of code fills the array of chunk offsets,
then the chunk IDs and offsets are written to the Chunk Lookup table,
and finally a third block of code writes the actual chunks.  In case
of optional chunks like Extra Edge List and Base Graphs List there is
also a condition checking whether that chunk is necessary/desired, and
that same condition is repeated in all three of those blocks of code.
This patch series is about to add more optional chunks, so there would
be even more repeated conditions.

Those chunk offsets are relative to the beginning of the file, so they
inherently depend on the size of the Chunk Lookup table, which in turn
depends on the number of chunks that are to be written to the
commit-graph file.  IOW at the time we set the first chunk's ID we
can't yet know its offset, because we don't yet know how many chunks
there are.

Simplify this by initially filling an array of chunk sizes, not
offsets, and calculating the offsets based on the chunk sizes only
later, while we are writing the Chunk Lookup table.  This way we can
fill the arrays of chunk IDs and sizes in one go, eliminating one set
of repeated conditions.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 33 +++++++++++++--------------------
 1 file changed, 13 insertions(+), 20 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index f64b1c01a8..416f4b2468 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1351,10 +1351,11 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	struct hashfile *f;
 	struct lock_file lk = LOCK_INIT;
 	uint32_t chunk_ids[6];
-	uint64_t chunk_offsets[6];
+	uint64_t chunk_sizes[6];
 	const unsigned hashsz = the_hash_algo->rawsz;
 	struct strbuf progress_title = STRBUF_INIT;
 	int num_chunks = 3;
+	uint64_t chunk_offset;
 	struct object_id file_hash;
 
 	if (ctx->split) {
@@ -1394,35 +1395,24 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	}
 
 	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
+	chunk_sizes[0] = GRAPH_FANOUT_SIZE;
 	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
+	chunk_sizes[1] = hashsz * ctx->commits.nr;
 	chunk_ids[2] = GRAPH_CHUNKID_DATA;
+	chunk_sizes[2] = (hashsz + 16) * ctx->commits.nr;
 	if (ctx->num_extra_edges) {
 		chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
+		chunk_sizes[3] = 4 * ctx->num_extra_edges;
 		num_chunks++;
 	}
 	if (ctx->num_commit_graphs_after > 1) {
 		chunk_ids[num_chunks] = GRAPH_CHUNKID_BASE;
+		chunk_sizes[4] = hashsz * (ctx->num_commit_graphs_after - 1);
 		num_chunks++;
 	}
 
 	chunk_ids[num_chunks] = 0;
-
-	chunk_offsets[0] = 8 + (num_chunks + 1) * GRAPH_CHUNKLOOKUP_WIDTH;
-	chunk_offsets[1] = chunk_offsets[0] + GRAPH_FANOUT_SIZE;
-	chunk_offsets[2] = chunk_offsets[1] + hashsz * ctx->commits.nr;
-	chunk_offsets[3] = chunk_offsets[2] + (hashsz + 16) * ctx->commits.nr;
-
-	num_chunks = 3;
-	if (ctx->num_extra_edges) {
-		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
-						4 * ctx->num_extra_edges;
-		num_chunks++;
-	}
-	if (ctx->num_commit_graphs_after > 1) {
-		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
-						hashsz * (ctx->num_commit_graphs_after - 1);
-		num_chunks++;
-	}
+	chunk_sizes[num_chunks] = 0;
 
 	hashwrite_be32(f, GRAPH_SIGNATURE);
 
@@ -1431,13 +1421,16 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	hashwrite_u8(f, num_chunks);
 	hashwrite_u8(f, ctx->num_commit_graphs_after - 1);
 
+	chunk_offset = 8 + (num_chunks + 1) * GRAPH_CHUNKLOOKUP_WIDTH;
 	for (i = 0; i <= num_chunks; i++) {
 		uint32_t chunk_write[3];
 
 		chunk_write[0] = htonl(chunk_ids[i]);
-		chunk_write[1] = htonl(chunk_offsets[i] >> 32);
-		chunk_write[2] = htonl(chunk_offsets[i] & 0xffffffff);
+		chunk_write[1] = htonl(chunk_offset >> 32);
+		chunk_write[2] = htonl(chunk_offset & 0xffffffff);
 		hashwrite(f, chunk_write, 12);
+
+		chunk_offset += chunk_sizes[i];
 	}
 
 	if (ctx->report_progress) {
-- 
2.27.0.rc1.431.g5c813f95dc
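The write side is the mirror image of the read side: given the per-chunk sizes, the offsets stored in the Chunk Lookup table fall out of a running sum that starts right after the header and the lookup table itself. A small sketch of that computation, outside of git's code (function name and the all-at-once output array are illustrative; the patch computes the sum incrementally while writing):

```c
#include <stddef.h>
#include <stdint.h>

#define GRAPH_HEADER_SIZE 8
#define GRAPH_CHUNKLOOKUP_WIDTH 12 /* 4-byte chunk ID + 8-byte offset */

/*
 * Turn per-chunk sizes into the absolute offsets stored in the Chunk
 * Lookup table: the first chunk starts right after the header and the
 * lookup table (which has nr entries plus the terminating label), and
 * each subsequent chunk starts where the previous one ended.
 * sizes[nr] is the terminating label's "size" (0), so offsets[nr]
 * ends up pointing just past the last chunk.
 */
static void compute_offsets(const uint64_t *sizes, size_t nr,
			    uint64_t *offsets)
{
	uint64_t off = GRAPH_HEADER_SIZE +
		       (nr + 1) * GRAPH_CHUNKLOOKUP_WIDTH;
	size_t i;

	for (i = 0; i <= nr; i++) {
		offsets[i] = off;
		off += sizes[i];
	}
}
```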
Unify the 'chunk_ids' and 'chunk_sizes' arrays into an array of
'struct chunk_info'.  This will allow more cleanups in the following
patches.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 416f4b2468..d95c739c10 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1344,14 +1344,18 @@ static int write_graph_chunk_base(struct hashfile *f,
 	return 0;
 }
 
+struct chunk_info {
+	uint32_t id;
+	uint64_t size;
+};
+
 static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 {
 	uint32_t i;
 	int fd;
 	struct hashfile *f;
 	struct lock_file lk = LOCK_INIT;
-	uint32_t chunk_ids[6];
-	uint64_t chunk_sizes[6];
+	struct chunk_info chunks[6];
 	const unsigned hashsz = the_hash_algo->rawsz;
 	struct strbuf progress_title = STRBUF_INIT;
 	int num_chunks = 3;
@@ -1394,25 +1398,25 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	}
 
-	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
-	chunk_sizes[0] = GRAPH_FANOUT_SIZE;
-	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
-	chunk_sizes[1] = hashsz * ctx->commits.nr;
-	chunk_ids[2] = GRAPH_CHUNKID_DATA;
-	chunk_sizes[2] = (hashsz + 16) * ctx->commits.nr;
+	chunks[0].id = GRAPH_CHUNKID_OIDFANOUT;
+	chunks[0].size = GRAPH_FANOUT_SIZE;
+	chunks[1].id = GRAPH_CHUNKID_OIDLOOKUP;
+	chunks[1].size = hashsz * ctx->commits.nr;
+	chunks[2].id = GRAPH_CHUNKID_DATA;
+	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
 	if (ctx->num_extra_edges) {
-		chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
-		chunk_sizes[3] = 4 * ctx->num_extra_edges;
+		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
+		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
 		num_chunks++;
 	}
 	if (ctx->num_commit_graphs_after > 1) {
-		chunk_ids[num_chunks] = GRAPH_CHUNKID_BASE;
-		chunk_sizes[4] = hashsz * (ctx->num_commit_graphs_after - 1);
+		chunks[num_chunks].id = GRAPH_CHUNKID_BASE;
+		chunks[num_chunks].size = hashsz * (ctx->num_commit_graphs_after - 1);
 		num_chunks++;
 	}
 
-	chunk_ids[num_chunks] = 0;
-	chunk_sizes[num_chunks] = 0;
+	chunks[num_chunks].id = 0;
+	chunks[num_chunks].size = 0;
 
 	hashwrite_be32(f, GRAPH_SIGNATURE);
 
@@ -1425,12 +1429,12 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	for (i = 0; i <= num_chunks; i++) {
 		uint32_t chunk_write[3];
 
-		chunk_write[0] = htonl(chunk_ids[i]);
+		chunk_write[0] = htonl(chunks[i].id);
 		chunk_write[1] = htonl(chunk_offset >> 32);
 		chunk_write[2] = htonl(chunk_offset & 0xffffffff);
 		hashwrite(f, chunk_write, 12);
 
-		chunk_offset += chunk_sizes[i];
+		chunk_offset += chunks[i].size;
 	}
 
 	if (ctx->report_progress) {
-- 
2.27.0.rc1.431.g5c813f95dc
While writing a new commit-graph file, the size of the 'struct
chunk_info' array is hardcoded to be the maximum number of possible
chunks plus one for the terminating label.  This array size must be
adjusted each time a new chunk is added, which can easily be
forgotten.

So let's make the constant-sized 'struct chunk_info' array an
ALLOC_GROW()-able array instead.  There are a couple of cases where
writing a new commit-graph file returns with error; those spots were
adjusted to free the dynamically allocated 'struct chunk_info' array.

Arguably the several ALLOC_GROW() call sites look unusual, because we
usually call ALLOC_GROW() in loops, once in each iteration...
However, this way anyone who adds a new chunk will notice that they
have to cargo-cult the ALLOC_GROW() call as well.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 74 ++++++++++++++++++++++++++++++++------------------
 1 file changed, 47 insertions(+), 27 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index d95c739c10..c8ba17e457 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1355,12 +1355,13 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	int fd;
 	struct hashfile *f;
 	struct lock_file lk = LOCK_INIT;
-	struct chunk_info chunks[6];
+	struct chunk_info *chunks = NULL;
+	int chunks_nr = 0, chunks_alloc = 0;
 	const unsigned hashsz = the_hash_algo->rawsz;
 	struct strbuf progress_title = STRBUF_INIT;
-	int num_chunks = 3;
 	uint64_t chunk_offset;
 	struct object_id file_hash;
+	int ret = 0;
 
 	if (ctx->split) {
 		struct strbuf tmp_file = STRBUF_INIT;
@@ -1398,35 +1399,48 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	}
 
-	chunks[0].id = GRAPH_CHUNKID_OIDFANOUT;
-	chunks[0].size = GRAPH_FANOUT_SIZE;
-	chunks[1].id = GRAPH_CHUNKID_OIDLOOKUP;
-	chunks[1].size = hashsz * ctx->commits.nr;
-	chunks[2].id = GRAPH_CHUNKID_DATA;
-	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
+	ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc);
+	chunks[chunks_nr].id = GRAPH_CHUNKID_OIDFANOUT;
+	chunks[chunks_nr].size = GRAPH_FANOUT_SIZE;
+	chunks_nr++;
+	ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc);
+	chunks[chunks_nr].id = GRAPH_CHUNKID_OIDLOOKUP;
+	chunks[chunks_nr].size = hashsz * ctx->commits.nr;
+	chunks_nr++;
+	ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc);
+	chunks[chunks_nr].id = GRAPH_CHUNKID_DATA;
+	chunks[chunks_nr].size = (hashsz + 16) * ctx->commits.nr;
+	chunks_nr++;
 	if (ctx->num_extra_edges) {
-		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
-		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
-		num_chunks++;
+		ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc);
+		chunks[chunks_nr].id = GRAPH_CHUNKID_EXTRAEDGES;
+		chunks[chunks_nr].size = 4 * ctx->num_extra_edges;
+		chunks_nr++;
 	}
 	if (ctx->num_commit_graphs_after > 1) {
-		chunks[num_chunks].id = GRAPH_CHUNKID_BASE;
-		chunks[num_chunks].size = hashsz * (ctx->num_commit_graphs_after - 1);
-		num_chunks++;
+		ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc);
+		chunks[chunks_nr].id = GRAPH_CHUNKID_BASE;
+		chunks[chunks_nr].size = hashsz * (ctx->num_commit_graphs_after - 1);
+		chunks_nr++;
 	}
 
-	chunks[num_chunks].id = 0;
-	chunks[num_chunks].size = 0;
+	ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc);
+	chunks[chunks_nr].id = 0;
+	chunks[chunks_nr].size = 0;
+	/*
+	 * Do not increase 'chunks_nr', so it still reflects the number of
+	 * actual chunks, without the Chunk Lookup table's terminating label.
+	 */
 
 	hashwrite_be32(f, GRAPH_SIGNATURE);
 
 	hashwrite_u8(f, GRAPH_VERSION);
 	hashwrite_u8(f, oid_version());
-	hashwrite_u8(f, num_chunks);
+	hashwrite_u8(f, chunks_nr);
 	hashwrite_u8(f, ctx->num_commit_graphs_after - 1);
 
-	chunk_offset = 8 + (num_chunks + 1) * GRAPH_CHUNKLOOKUP_WIDTH;
-	for (i = 0; i <= num_chunks; i++) {
+	chunk_offset = 8 + (chunks_nr + 1) * GRAPH_CHUNKLOOKUP_WIDTH;
+	for (i = 0; i <= chunks_nr; i++) {
 		uint32_t chunk_write[3];
 
 		chunk_write[0] = htonl(chunks[i].id);
@@ -1441,11 +1455,11 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		strbuf_addf(&progress_title,
 			    Q_("Writing out commit graph in %d pass",
 			       "Writing out commit graph in %d passes",
-			       num_chunks),
-			    num_chunks);
+			       chunks_nr),
+			    chunks_nr);
 		ctx->progress = start_delayed_progress(
 			progress_title.buf,
-			num_chunks * ctx->commits.nr);
+			chunks_nr * ctx->commits.nr);
 	}
 	write_graph_chunk_fanout(f, ctx);
 	write_graph_chunk_oids(f, hashsz, ctx);
@@ -1454,7 +1468,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		write_graph_chunk_extra_edges(f, ctx);
 	if (ctx->num_commit_graphs_after > 1 &&
 	    write_graph_chunk_base(f, ctx)) {
-		return -1;
+		ret = -1;
+		goto cleanup;
 	}
 	stop_progress(&ctx->progress);
 	strbuf_release(&progress_title);
@@ -1481,7 +1496,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 
 		if (!chainf) {
 			error(_("unable to open commit-graph chain file"));
-			return -1;
+			ret = -1;
+			goto cleanup;
 		}
 
 		if (ctx->base_graph_name) {
@@ -1493,7 +1509,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 
 			if (result) {
 				error(_("failed to rename base commit-graph file"));
-				return -1;
+				ret = -1;
+				goto cleanup;
 			}
 		}
 	} else {
@@ -1513,13 +1530,16 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 
 		if (result) {
 			error(_("failed to rename temporary commit-graph file"));
-			return -1;
+			ret = -1;
+			goto cleanup;
 		}
 	}
 
 	commit_lock_file(&lk);
 
-	return 0;
+cleanup:
+	free(chunks);
+	return ret;
 }
 
 static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
-- 
2.27.0.rc1.431.g5c813f95dc
Update the write_graph_chunk_*() helper functions to have the same
signature:

  - Return an int error code from all these functions.
    write_graph_chunk_base() already has an int error code, now the
    others will have one, too, but since they don't indicate any
    error, they will always return 0.

  - Drop the hash size parameter of write_graph_chunk_oids() and
    write_graph_chunk_data(); its value can be read directly from
    'the_hash_algo' inside these functions as well.

This opens up the possibility for further cleanups and foolproofing
in the following two patches.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index c8ba17e457..33e70bdfea 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -793,7 +793,7 @@ struct write_commit_graph_context {
 	const struct split_commit_graph_opts *split_opts;
 };
 
-static void write_graph_chunk_fanout(struct hashfile *f,
+static int write_graph_chunk_fanout(struct hashfile *f,
 				     struct write_commit_graph_context *ctx)
 {
 	int i, count = 0;
@@ -815,17 +815,19 @@ static void write_graph_chunk_fanout(struct hashfile *f,
 
 		hashwrite_be32(f, count);
 	}
+	return 0;
 }
 
-static void write_graph_chunk_oids(struct hashfile *f, int hash_len,
+static int write_graph_chunk_oids(struct hashfile *f,
 				   struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
 	int count;
 	for (count = 0; count < ctx->commits.nr; count++, list++) {
 		display_progress(ctx->progress, ++ctx->progress_cnt);
-		hashwrite(f, (*list)->object.oid.hash, (int)hash_len);
+		hashwrite(f, (*list)->object.oid.hash, the_hash_algo->rawsz);
 	}
+	return 0;
 }
 
 static const unsigned char *commit_to_sha1(size_t index, void *table)
@@ -834,7 +836,7 @@ static const unsigned char *commit_to_sha1(size_t index, void *table)
 	return commits[index]->object.oid.hash;
 }
 
-static void write_graph_chunk_data(struct hashfile *f, int hash_len,
+static int write_graph_chunk_data(struct hashfile *f,
 				  struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
@@ -852,7 +854,7 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 			die(_("unable to parse commit %s"),
 			    oid_to_hex(&(*list)->object.oid));
 		tree = get_commit_tree_oid(*list);
-		hashwrite(f, tree->hash, hash_len);
+		hashwrite(f, tree->hash, the_hash_algo->rawsz);
 
 		parent = (*list)->parents;
@@ -932,9 +934,10 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 
 		list++;
 	}
+	return 0;
 }
 
-static void write_graph_chunk_extra_edges(struct hashfile *f,
+static int write_graph_chunk_extra_edges(struct hashfile *f,
 					  struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
@@ -984,6 +987,7 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
 
 		list++;
 	}
+	return 0;
 }
 
 static int oid_compare(const void *_a, const void *_b)
@@ -1462,8 +1466,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 			chunks_nr * ctx->commits.nr);
 	}
 	write_graph_chunk_fanout(f, ctx);
-	write_graph_chunk_oids(f, hashsz, ctx);
-	write_graph_chunk_data(f, hashsz, ctx);
+	write_graph_chunk_oids(f, ctx);
+	write_graph_chunk_data(f, ctx);
 	if (ctx->num_extra_edges)
 		write_graph_chunk_extra_edges(f, ctx);
 	if (ctx->num_commit_graphs_after > 1 &&
-- 
2.27.0.rc1.431.g5c813f95dc
In write_commit_graph_file() we now have one block of code filling the
array of 'struct chunk_info' with the IDs and sizes of chunks to be
written, and another block of code calling the functions responsible
for writing individual chunks.  In case of optional chunks like Extra
Edge List and Base Graphs List there is also a condition checking
whether that chunk is necessary/desired, and that same condition is
repeated in both blocks of code.  This patch series is about to add
more optional chunks, so there would be even more repeated conditions.

Eliminate these repeated conditions by storing the function pointers
responsible for writing individual chunks in the 'struct chunk_info'
array as well, and calling them in a loop to write the commit-graph
file.  This will open up the possibility for a bit of foolproofing in
the following patch.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 33e70bdfea..fe26245fad 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1348,9 +1348,12 @@ static int write_graph_chunk_base(struct hashfile *f,
 	return 0;
 }
 
+typedef int (*chunk_write_fn)(struct hashfile *f,
+			      struct write_commit_graph_context *ctx);
 struct chunk_info {
 	uint32_t id;
 	uint64_t size;
+	chunk_write_fn write_fn;
 };
 
 static int write_commit_graph_file(struct write_commit_graph_context *ctx)
@@ -1406,31 +1409,37 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc);
 	chunks[chunks_nr].id = GRAPH_CHUNKID_OIDFANOUT;
 	chunks[chunks_nr].size = GRAPH_FANOUT_SIZE;
+	chunks[chunks_nr].write_fn = write_graph_chunk_fanout;
 	chunks_nr++;
 	ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc);
 	chunks[chunks_nr].id = GRAPH_CHUNKID_OIDLOOKUP;
 	chunks[chunks_nr].size = hashsz * ctx->commits.nr;
+	chunks[chunks_nr].write_fn = write_graph_chunk_oids;
 	chunks_nr++;
 	ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc);
 	chunks[chunks_nr].id = GRAPH_CHUNKID_DATA;
 	chunks[chunks_nr].size = (hashsz + 16) * ctx->commits.nr;
+	chunks[chunks_nr].write_fn = write_graph_chunk_data;
 	chunks_nr++;
 	if (ctx->num_extra_edges) {
 		ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc);
 		chunks[chunks_nr].id = GRAPH_CHUNKID_EXTRAEDGES;
 		chunks[chunks_nr].size = 4 * ctx->num_extra_edges;
+		chunks[chunks_nr].write_fn = write_graph_chunk_extra_edges;
 		chunks_nr++;
 	}
 	if (ctx->num_commit_graphs_after > 1) {
 		ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc);
 		chunks[chunks_nr].id = GRAPH_CHUNKID_BASE;
 		chunks[chunks_nr].size = hashsz * (ctx->num_commit_graphs_after - 1);
+		chunks[chunks_nr].write_fn = write_graph_chunk_base;
 		chunks_nr++;
 	}
 
 	ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc);
 	chunks[chunks_nr].id = 0;
 	chunks[chunks_nr].size = 0;
+	chunks[chunks_nr].write_fn = NULL;
 	/*
 	 * Do not increase 'chunks_nr', so it still reflects the number of
 	 * actual chunks, without the Chunk Lookup table's terminating label.
@@ -1465,16 +1474,11 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 			progress_title.buf,
 			chunks_nr * ctx->commits.nr);
 	}
-	write_graph_chunk_fanout(f, ctx);
-	write_graph_chunk_oids(f, ctx);
-	write_graph_chunk_data(f, ctx);
-	if (ctx->num_extra_edges)
-		write_graph_chunk_extra_edges(f, ctx);
-	if (ctx->num_commit_graphs_after > 1 &&
-	    write_graph_chunk_base(f, ctx)) {
-		ret = -1;
-		goto cleanup;
-	}
+	for (i = 0; i < chunks_nr; i++)
+		if (chunks[i].write_fn(f, ctx)) {
+			ret = -1;
+			goto cleanup;
+		}
 
 	stop_progress(&ctx->progress);
 	strbuf_release(&progress_title);
-- 
2.27.0.rc1.431.g5c813f95dc
In my experience while experimenting with new commit-graph chunks,
early versions of the corresponding new write_commit_graph_my_chunk()
functions are, sadly but not surprisingly, often buggy, and write more
or less data than they are supposed to, especially if the chunk size
is not directly proportional to the number of commits.  This then
causes all kinds of issues when reading such a bogus commit-graph
file, raising the question of whether the writing or the reading part
happens to be buggy this time.

Let's catch such issues early, already when writing the commit-graph
file, and check that each write_graph_chunk_*() function wrote the
amount of data that it was expected to, i.e. what has been encoded in
the Chunk Lookup table.  Now that all commit-graph chunks are written
in a loop we can do this check in a single place for all chunks, and
any chunks added in the future will get checked as well.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index fe26245fad..e3adffd58b 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1474,11 +1474,22 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 			progress_title.buf,
 			chunks_nr * ctx->commits.nr);
 	}
-	for (i = 0; i < chunks_nr; i++)
+	chunk_offset = f->total + f->offset;
+	for (i = 0; i < chunks_nr; i++) {
+		uint64_t end_offset;
+
 		if (chunks[i].write_fn(f, ctx)) {
 			ret = -1;
 			goto cleanup;
 		}
+
+		end_offset = f->total + f->offset;
+		if (end_offset - chunk_offset != chunks[i].size)
+			BUG("expected to write %lu bytes to chunk %08x, but wrote %lu instead",
+			    chunks[i].size, chunks[i].id,
+			    end_offset - chunk_offset);
+		chunk_offset = end_offset;
+	}
 
 	stop_progress(&ctx->progress);
 	strbuf_release(&progress_title);
-- 
2.27.0.rc1.431.g5c813f95dc
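The table-driven write-and-verify pattern from the last two patches can be illustrated outside of git with a byte-counting stand-in for git's 'struct hashfile' (all names below are illustrative, not git's): each chunk carries a declared size and a writer function, and the loop checks that every writer emitted exactly the number of bytes recorded for it in the lookup table.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte-counting stand-in for git's 'struct hashfile'. */
struct out {
	uint64_t written;
};

static void out_write(struct out *f, size_t len)
{
	f->written += len;
}

typedef int (*chunk_write_fn)(struct out *f);

struct chunk_info {
	uint32_t id;
	uint64_t size; /* size promised in the Chunk Lookup table */
	chunk_write_fn write_fn;
};

/* A well-behaved chunk writer... */
static int write_good_chunk(struct out *f)
{
	out_write(f, 1024);
	return 0;
}

/* ...and a buggy one that writes less than it promised. */
static int write_short_chunk(struct out *f)
{
	out_write(f, 1000);
	return 0;
}

/*
 * Write all chunks in a loop, verifying after each one that the
 * writer emitted exactly chunks[i].size bytes.  Returns -1 on a write
 * error and -2 on a size mismatch (where the real code calls BUG()).
 */
static int write_chunks(struct out *f, const struct chunk_info *chunks,
			size_t nr)
{
	uint64_t start = f->written;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (chunks[i].write_fn(f))
			return -1;
		if (f->written - start != chunks[i].size)
			return -2;
		start = f->written;
	}
	return 0;
}
```

The design point is that the size promised in the Chunk Lookup table and the bytes actually written are cross-checked in one place, so a buggy writer is caught at write time instead of surfacing as a mysterious read-side failure.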
During a pathspec-limited revision walk, e.g. 'git log -- dir/file',
Git spends a significant part of its runtime in the tree-diff
machinery, checking whether the given path was modified between
subsequent commits.  By using Bloom filters to store the paths
modified by each commit we can quickly tell that a commit didn't
modify the given path without invoking tree-diff, thus considerably
reducing this diffing overhead, along with the runtime and memory
usage.

This patch extends the commit-graph file format with additional chunks
to store those modified path Bloom filters.  The rest of this log
message takes a closer look at the problem and explains why these new
chunks look like they do.

In the following the terms "file" and "path" are not used
interchangeably.  If a commit modified 'dir/subdir/foo.c', then it
modified that one file, but it modified three paths (namely 'dir',
'dir/subdir', and 'dir/subdir/foo.c').

Furthermore, unless otherwise noted, "a lot of paths" means 5000
randomly selected paths that have existed at one time or another
during the history of the corresponding repository's default branch
[1], except for git it means all 5089 paths that have ever existed
during the history of v2.25.0, and for homebrew-cask and homebrew-core
all 7307 and 6535 paths, respectively, that have ever existed in the
history of their default branches some time ago.

About pathspec-limited revision walks
-------------------------------------

So, during a pathspec-limited revision walk Git spends a significant
part of its runtime in the tree-diff machinery, checking whether the
given path was modified between subsequent commits.  The table below
shows the average runtime of 'git rev-list HEAD -- $path' for "a lot
of paths" in a couple of repositories, and how much of that runtime is
spent in diff_tree_oid().
It also shows the potential average speedup, should we be able to
reduce the tree-diff overhead to zero without introducing some other
overhead (spoiler alert: we won't be able to achieve that, of course,
but will still achieve an even higher average speedup in several
cases).

                  Average    Average              Potential
                  runtime    diff time            speedup
  -------------------------------------------------------------
  android-base    0.8780s    0.7320s    83.37%     6.14x
  cmssw           0.3143s    0.2800s    88.95%     9.19x
  cpython         0.7453s    0.6602s    88.58%     8.76x
  elasticsearch   0.1492s    0.1351s    90.55%    10.58x
  gcc             7.1852s    6.9432s    96.63%    29.69x
  gecko-dev       4.6113s    4.0964s    88.83%     8.96x
  git             0.6180s    0.5911s    95.65%    22.97x
  glibc           0.5618s    0.5313s    94.57%    18.42x
  go              0.4913s    0.4469s    90.96%    11.07x
  jdk             0.0482s    0.0431s    89.42%     9.45x
  linux           0.7043s    0.6163s    87.50%     8.00x
  llvm-project    2.6844s    2.2607s    84.22%     6.34x
  rails           0.2784s    0.2372s    85.20%     6.76x
  rust            0.7757s    0.7349s    94.74%    19.01x
  tensorflow      0.6258s    0.5735s    91.64%    11.96x
  webkit          1.9137s    1.6580s    86.64%     7.48x

Notice that the average time spent diffing in the Linux and Git
repositories is quite close (0.6163s vs 0.5911s), although the Linux
repository contains about 15x more commits and 25x more files; more on
this later.

Instrumenting the tree-diff machinery and gathering some stats about
the number of diff calls, tree object comparisons and decoded tree
entries revealed some interesting, perhaps even counterintuitive
things:

  - The number of scanned tree entries can be more important than the
    number of tree comparisons.

    Here are four paths from two repositories, two each in the same
    frequently-modified directory, with one near the beginning and one
    at the end of that directory.  When listing the commits modifying
    these paths, i.e. 'git rev-list HEAD -- $path', the number of tree
    comparisons is (nearly) the same, but the number of scanned tree
    entries differs by over two orders of magnitude.
                                      Average    Nr of scanned  Nr of tree
    Repository     $path              diff time  tree entries   comparisons
    -----------------------------------------------------------------------
    git            .clang-format         0.188s          40115        18055
    git            zlib.c                0.729s       10705983        17120
    homebrew-core  Formula/a2ps.rb       8.390s        2122937       326758
    homebrew-core  Formula/zzz.rb       80.495s     1235547836       326758

    This is also noticeable when looking at the average number of
    scanned tree entries and tree comparisons when running
    'git rev-list HEAD -- $path' for "a lot of paths":

                    Average    Average nr    Average      Average
                    diff time  of scanned    nr of tree   nr of diff
                               tree entries  comparisons  calls
    ----------------------------------------------------------------
    jdk             0.0431s        204466        5520        3702
    elasticsearch   0.1351s        720393       17212       13457
    rails           0.2372s       1299221       43369       33670
    cmssw           0.2800s       1938067       24798       23739
    go              0.4469s       3207155       76432       42346
    glibc           0.5313s       6555231       40442       29208
    tensorflow      0.5735s       3914851       97227       47262
    git             0.5911s       7480371       21789       18168
    linux           0.6163s       4004067       79514       58145
    cpython         0.6602s       4695754       88408       70439
    android-base    0.7320s       4312043      122280       96376
    rust            0.7349s       5084683       68110       29847
    webkit          1.6580s       9395349      303105      218906
    llvm-project    2.2607s      11773792      469801      330250
    gecko-dev       4.0964s      42396246      415843      387162
    gcc             6.9432s      84853083      286904      174218

    Note that although we have a lot fewer tree comparisons in the git
    repository than in the linux repository, it has almost double the
    amount of scanned tree entries, because the git repository has
    some frequently modified biggish directories (notably its root
    dir, 't' or 'Documentation').  As a result the average time spent
    diffing in the two repositories is roughly the same, although the
    linux repository is much larger.  Or that we spend the most time
    diffing in the gcc repository, because it has by far the most
    scanned tree entries, even though it has fewer tree comparisons
    than any of the next three slowest repositories.

  - The number of path components in the pathspec, i.e. its depth in
    the directory tree, seems to be irrelevant.
    When checking whether a path somewhere deep in the directory tree
    has been modified between a commit and its parent, the tree-diff
    machinery can short-circuit, and it returns as soon as it finds
    the first leading directory that hasn't been modified.  And more
    often than not it can short-circuit already while comparing the
    root trees, as all the <1.5 values in the "Average tree
    comparisons per diff call" column show.

                    Average    Average tree    Average      Average
                    pathspec   comparisons     nr of diff   nr of tree
                    depth      per diff call   calls        comparisons
    ----------------------------------------------------------------
    android-base    5.78       1.26             96376       122280
    cmssw           4.28       1.04             23739        24798
    cpython         3.72       1.25             70439        88408
    elasticsearch   9.13       1.28             13457        17212
    gcc             5.13       1.64            174218       286904
    gecko-dev       6.25       1.07            387162       415843
    git             2.30       1.19             18168        21789
    glibc           4.17       1.38             29208        40442
    go              4.60       1.80             42346        76432
    jdk             8.45       1.49              3702         5520
    linux           4.49       1.36             58145        79514
    llvm-project    5.45       1.42            330250       469801
    rails           5.43       1.28             33670        43369
    rust            4.70       2.28             29847        68110
    tensorflow      5.56       2.06             47262        97227
    webkit          6.33       1.38            218906       303105

    Note the 2.xy average tree comparisons per diff call in the rust
    and tensorflow repositories.  In the rust repository over 98% of
    the paths are in the 'src' directory and over 93% of commits
    modify a file under that directory, while in the tensorflow
    repository over 92% of paths are in the 'tensorflow' directory and
    over 90% of commits modify a file under that directory.
    Consequently, the tree-diff machinery can only rarely
    short-circuit while comparing the root trees of two subsequent
    commits, but has to dive down and compare the contents of the
    'src' or 'tensorflow' directories most of the time as well.  I
    suspect that we would get a similar ~2 value in the homebrew-core
    repository as well, because over 99.7% of all commits modify the
    'Formula' directory, which contains over 90% of paths in the
    repository.
Bloom filters intro
-------------------

Quoting the (quite good) Wikipedia article, "A Bloom filter is a
space-efficient probabilistic data structure [...] that is used to
test whether an element is a member of a set.  False positive matches
are possible, but false negatives are not – in other words, a query
returns either 'possibly in set' or 'definitely not in set'".

A Bloom filter is a bit array, with all bits initially set to 0.  To
add an element to a Bloom filter, the element is hashed using 'k'
independent hash functions, the resulting hashes are turned into 'k'
array positions, and the bits at those positions in the filter are
set to 1.  To query for an element, the element is hashed using the
same 'k' hash functions, the resulting hashes are turned into 'k'
array positions, and the values of the bits at those positions are
checked.  If all those bits are set, then the element is 'possibly in
set'.  If even one of those bits is unset, then the element is
'definitely not in set'.

The properties of Bloom filters most relevant to us are:

  - A Bloom filter doesn't store the elements themselves, it only
    sets a couple of bits based on the elements' hashes.
    Consequently, a Bloom filter can't tell which elements it
    contains; it can't even tell the number of elements in there.

  - A bit in the Bloom filter's bit array might be set by more than
    one element.  This is where the probabilistic nature and the
    possibility of false positives come from: it is possible that
    some elements in the set happen to have set all 'k' bits that
    would indicate the presence of a particular element, even though
    that element itself is not part of the set.

  - Elements can't be removed from a Bloom filter: the 'k' bits
    indicating the presence of an element can't simply be unset,
    because that might unset bits that were set by other elements as
    well.
    There are enhanced Bloom filter variants that allow removal of
    elements, like counting Bloom filters, but they come with
    additional complexity and require considerably more space.

  - Bloom filters can't be resized, because turning the hashes into
    positions in the bit array critically depends on the size of the
    bit array (usually pos[i] = hash[i] % bit_array_size).

  - Bloom filters of the same size can be bitwise OR-ed to create the
    union of the sets of elements in the filters.

  - Playing long enough with probabilities, and making some
    reasonable assumptions and approximations, it can be shown (see
    the Wikipedia article) that for a desired false positive
    probability 'p' there is an optimal number 'k' of hash functions:

        k = -log₂(p)

    (that's a base 2 logarithm there)

    For 1% false positive probability k = 6.64, but of course there
    can only be an integral number of hash functions, so 7 it is.
    Note that this value is independent both of the size of the Bloom
    filter and of the number of elements in there.  Conversely, when
    using a properly sized bit array, the false positive probability
    falls exponentially as 'k' increases.

  - To store 'n' elements in a Bloom filter with a desired false
    positive probability using the optimal number 'k' of bits/hash
    functions, we need a bit array with approximate size 'm':

        m ≈ n * k / ln(2) ≈ n * k * 10 / 7

    When using 7 hash functions to aim for < 1% false positive
    probability this simplifies down to:

        m ≈ n * 7 * 10 / 7 = 10 * n

    i.e. approx. 10 bits per element.

  - In general, the more elements, IOW the more set bits there are in
    a Bloom filter of a certain size, the higher the probability of
    false positives.  Similarly, the larger the Bloom filter's bit
    array for a certain number of elements, the lower the probability
    of false positives.

  - A Bloom filter with all bits set appears to contain "everything",
    because all queries will return 'possibly in set'.
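The add/query mechanics and the sizing rule above can be sketched in a
few lines of Python.  This is purely illustrative and not Git's
implementation: it derives the 'k' independent-ish hashes by seeding
SHA-1 with the hash index, whereas the series below uses fast
non-cryptographic hashes (MurmurHash3, xxHash).

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch (illustration only, not Git's code)."""

    def __init__(self, nr_bits, k):
        self.nr_bits = nr_bits
        self.k = k
        self.bits = 0  # the bit array, kept as a Python int

    def _positions(self, element):
        # k independent-ish hashes for illustration: seed SHA-1 with
        # the hash index, then reduce modulo the bit array size.
        for i in range(self.k):
            h = hashlib.sha1(b"%d:%s" % (i, element)).digest()
            yield int.from_bytes(h[:4], "little") % self.nr_bits

    def add(self, element):
        for pos in self._positions(element):
            self.bits |= 1 << pos

    def might_contain(self, element):
        # 'possibly in set' only if all k bits are set;
        # a single unset bit means 'definitely not in set'.
        return all(self.bits & (1 << pos) for pos in self._positions(element))

# ~10 bits per element for < 1% false positives with k = 7:
bf = BloomFilter(nr_bits=10 * 3, k=7)
for path in (b"Makefile", b"t/t0000-basic.sh", b"Documentation/git.txt"):
    bf.add(path)
assert bf.might_contain(b"Makefile")  # an added element is always found
```

Note that a query for a path that was never added usually, but not
always, returns False; that residual chance is exactly the false
positive rate discussed throughout this message.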
Modified Path Bloom Filters
---------------------------

We'll use Bloom filters to store modified paths and will store them
in the Modified Path Bloom Filters chunk.  Yes, plural: we'll use a
lot of small Bloom filters, each one storing all paths modified by
one commit compared to one of its parents (see the Alternatives
section near the end for reasons for not using one big Bloom filter).
Then, as the pathspec-limited revision walk iterates over all
commits, it will query each modified path Bloom filter to see whether
the given pathspec is in there; if the answer is 'definitely not in
set', then we can spare a tree-diff call and skip the commit right
away, but if it's 'possibly in set', then we must do the tree-diff to
make sure.

Each modified path Bloom filter consists of:

  - 4 bytes specifying the number of bits in the Bloom filter's bit
    array.

    For practical purposes 32 bits are more than sufficient to store
    the number of bits in the Bloom filter's array.  When using k = 7
    hashes, i.e. 10 bits per path, we could store over 400 million
    paths in a single Bloom filter; with k = 32 hashes we'd use 46
    bits per path, which still allows for over 93 million paths.

  - The actual bit array.

    See the next and the Hashing scheme sections for details about
    how this bit array is used.

All modified path Bloom filters will use the same number 'k' of
hashes per path, so we won't have to store that in each Bloom filter.
I suspect that having modified path Bloom filters with different
numbers of hashes wouldn't bring enough benefit to justify storing
that value in each and every filter.

The order of modified path Bloom filters in this chunk is unspecified
on purpose, so implementations can experiment with writing them in
history or topological order, which may bring performance benefits
through better locality.
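Reading one filter record from this chunk could look roughly like the
sketch below.  The 4-byte bit count followed by the bit array is taken
from the description above; the big-endian byte order of the length
field is my assumption (the commit-graph format generally uses network
byte order), and `parse_filter` is a made-up helper name.

```python
import struct

def parse_filter(chunk, offset):
    """Sketch: parse one modified path Bloom filter at 'offset' in the
    Modified Path Bloom Filters chunk.  Layout per the description:
    a 4-byte bit count, then the bit array rounded up to whole bytes.
    Byte order of the count is an assumption."""
    (nr_bits,) = struct.unpack_from(">I", chunk, offset)
    nr_bytes = (nr_bits + 7) // 8        # round bit count up to whole bytes
    start = offset + 4
    bit_array = chunk[start:start + nr_bytes]
    return nr_bits, bit_array

# a filter with a 16-bit array whose bytes are 0x12 0x34:
chunk = struct.pack(">I", 16) + bytes([0x12, 0x34])
nr_bits, bits = parse_filter(chunk, 0)
assert nr_bits == 16 and bits == bytes([0x12, 0x34])
```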
[Somewhere in the middle of this section the length of this commit
message surpassed the length of 72441af7c4 (tree-diff: rework
diff_tree() to generate diffs for multiparent cases as well,
2014-04-07), the mother of all commit messages from the great Kirill
Smelkov, yay! :)]

Modified Path Bloom Filter Index
--------------------------------

Since modified path Bloom filters vary in size, we can't have an
array of modified path Bloom filters indexed by the position of the
commit OID in the OID Lookup chunk, but we'll need a level of
indirection between the commit OID position and the position of the
associated modified path Bloom filter.

So the Modified Path Bloom Filter Index chunk will contain an array
of offsets to quickly map commit OID positions to offsets in the
Modified Path Bloom Filters chunk, i.e. its Nth entry contains the
offset for the commit whose OID is Nth in the OID Lookup chunk.

Since the commit-graph file can store information about 2^31-1
commits, a mere 4 byte offset per commit is clearly insufficient.
However, surely no one wants to have lots of gigabyte-sized modified
path Bloom filters, so a standard 8 byte file offset will be
underutilized.  This allows us to use a few bits from those 8 bytes
for special purposes.  For now we'll have two special purpose bits:

  - The most significant bit indicates that the entry is not an
    offset into the Modified Path Bloom Filters chunk, but an
    "embedded" modified path Bloom filter containing all paths
    modified by the commit compared to its first parent; see the next
    section.

  - The second most significant bit indicates that modified path
    Bloom filters are stored not only for the first parent of a merge
    commit, but for all its parents, and the entry is neither an
    offset nor an "embedded" modified path Bloom filter, but an array
    index into the Modified Path Bloom Filter Merge Index chunk; see
    the "Merges" section below.

  - [TODO: perhaps we might want to reserve one or two more bits for
    special purposes?]
Make these offsets relative to the start of the Modified Path Bloom
Filters chunk, so they depend neither on the number of chunks nor on
the combined size of any other chunks written before the Modified
Path Bloom Filters chunk, thus allowing implementations to calculate
all these offsets without knowing anything about those other chunks.
Furthermore, if the offsets were relative to the beginning of the
file, then some huge chunks could make the offsets grow too large and
mess up those special purpose bits (though, arguably, an exabyte-sized
commit-graph file is just too unlikely to worry about...).

An "all bits set" index entry indicates that no modified path Bloom
filter is stored for the corresponding commit.  A reader
implementation can either special-case such an entry, or interpret it
as an embedded modified path Bloom filter that replies 'possibly in
set' to all queries; the result is the same: it has to resort to
running tree-diff for that commit.

These embedded modified path Bloom filters have implications for the
low-level Bloom filter format in the Modified Path Bloom Filters
chunk as well, namely:

  - Their array contains only 63 bits, not 64, i.e. not 8 full bytes.
    Therefore, for simplicity, we'll store the size of the bit array
    as the number of bits, not bytes.

  - Assigning special purposes to the most significant bits of the
    index entries is convenient when the entry is an offset into the
    Modified Path Bloom Filters chunk.  OTOH, it makes indexing into
    the Bloom filter's array awkward if we try to treat it like an
    ordinary array, i.e. one whose 0th element comes first, because
    we'd have to account for the special purpose bits.  Therefore,
    we'll store the bit array in a sort of big endian byte order,
    i.e. the most significant byte first and the 0th byte last.

In addition, this chunk will start with a small header storing a bit
of metadata that applies to all modified path Bloom filters in all
related chunks.
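Decoding one 8-byte index entry per the special-purpose-bit layout
described above could be sketched as follows.  The constant and
function names are made up for illustration; only the bit assignments
(MSB = embedded filter, second MSB = merge-index entry, all bits set =
no filter) come from the description.

```python
MPBF_EMBEDDED    = 1 << 63          # MSB: entry is an embedded 63-bit filter
MPBF_MERGE_INDEX = 1 << 62          # 2nd MSB: entry indexes the Merge Index chunk
MPBF_NO_FILTER   = (1 << 64) - 1    # all bits set: no filter for this commit

def decode_index_entry(entry):
    """Sketch of classifying one Modified Path Bloom Filter Index entry
    according to its special purpose bits (names are hypothetical)."""
    if entry == MPBF_NO_FILTER:
        return ("none", None)
    if entry & MPBF_EMBEDDED:
        # low 63 bits are the embedded Bloom filter's bit array
        return ("embedded", entry & ~MPBF_EMBEDDED)
    if entry & MPBF_MERGE_INDEX:
        return ("merge-index", entry & ~MPBF_MERGE_INDEX)
    # otherwise: a plain offset into the Modified Path Bloom Filters chunk
    return ("offset", entry)

assert decode_index_entry((1 << 64) - 1) == ("none", None)
assert decode_index_entry((1 << 63) | 0b1010) == ("embedded", 0b1010)
assert decode_index_entry(4096) == ("offset", 4096)
```

Checking the all-bits-set case first matters: such an entry also has
both special purpose bits set, so it must not be misread as an
embedded filter.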
The reason for storing this header in the Modified Path Bloom Filter
Index chunk is that only this chunk is essential to use modified path
Bloom filters (if all commits modify so few files that their modified
path Bloom filters can all be embedded into the index chunk, then the
Modified Path Bloom Filters chunk will contain nothing and thus can
be omitted; e.g. the two homebrew repositories come quite close to
this, as shown below).  This header contains:

  - One byte storing the number 'k' of hashes per path.

    A single byte is more than sufficient: as 'k' increases the false
    positive probability falls exponentially and quickly becomes
    unnecessarily low (i.e. with k = 31 even a full commit-graph
    containing 2^31-1 commits is expected to have only a single false
    positive).

  - [TODO: What else?  We might want to have a version byte, though
    if a format change becomes necessary, then we could simply rename
    the chunks just as well...]

Embedded Modified Path Bloom Filters
------------------------------------

The ideal size of a Bloom filter storing a set of 6 or 7 elements
with 7 hash functions is approximately 60 or 70 bits, respectively.
So in the space of an 8 byte Modified Path Bloom Filter Index entry
we could comfortably store 6 paths, by using one bit to indicate that
the entry is not a file offset but an "embedded" modified path Bloom
filter, and the remaining 63 bits as the Bloom filter's bit array.

As the table below shows, a significant portion or even the vast
majority of commits modify no more than 6 paths, so we can embed a
lot of modified path Bloom filters into the Modified Path Bloom
Filter Index chunk.  (Also note the unusually large percentage of
empty diffs in the android-base and cpython repositories.)
   Percentage of commits modifying <=N (or =N) paths compared to
   their first parents:

                     0    <=1      <=2      <=3      <=4       <=5       <=6       <=7      <=8
                         (=1)     (=2)     (=3)     (=4)      (=5)      (=6)      (=7)     (=8)
   --------------------------------------------------------------------------------------------
   elasticsearch  0.70%   4.00%    5.67%    8.50%   13.17%    16.55%    18.32%    21.18%   27.47%
                         (3.30%)  (1.67%)  (2.83%)  (4.67%)   (3.37%)   (1.77%)   (2.86%)  (6.29%)
   jdk            0.26%   3.47%   10.60%   13.99%   15.62%    19.02%    26.62%    34.30%   40.57%
                         (3.20%)  (7.14%)  (3.39%)  (1.63%)   (3.40%)   (7.60%)   (7.68%)  (6.27%)
   webkit         0.05%   0.07%    0.77%    2.14%    9.15%    26.66%    38.42%    47.34%   52.83%
                         (0.02%)  (0.71%)  (1.37%)  (7.01%)  (17.51%)  (11.76%)   (8.92%)  (5.49%)
   android-base  13.20%  13.62%   14.23%   18.55%   20.91%    35.18%    42.32%    50.82%   62.05%
                         (0.42%)  (0.62%)  (4.32%)  (2.36%)  (14.28%)   (7.14%)   (8.49%) (11.23%)
   llvm-project   0.12%   0.12%    0.94%    6.45%   25.24%    46.68%    53.60%    60.97%   67.33%
                         (0.00%)  (0.81%)  (5.51%) (18.79%)  (21.44%)   (6.92%)   (7.37%)  (6.36%)
   gecko-dev      0.14%   0.96%    1.88%   15.44%   32.42%    46.12%    54.54%    61.37%   66.65%
                         (0.82%)  (0.92%) (13.56%) (16.98%)  (13.70%)   (8.42%)   (6.82%)  (5.28%)
   tensorflow     0.09%   1.26%    2.72%    5.00%   26.30%    42.36%    55.17%    63.11%   69.70%
                         (1.17%)  (1.46%)  (2.28%) (21.27%)  (16.07%)  (12.81%)   (7.94%)  (6.59%)
   rails          0.10%   2.09%    5.79%   16.03%   35.57%    51.47%    58.71%    65.15%   72.96%
                         (1.99%)  (3.70%) (10.23%) (19.54%)  (15.90%)   (7.24%)   (6.44%)  (7.82%)
   rust           0.07%   2.20%    5.11%   22.81%   42.35%    52.50%    59.29%    65.33%   70.02%
                         (2.13%)  (2.91%) (17.70%) (19.54%)  (10.15%)   (6.79%)   (6.04%)  (4.69%)
   glibc          0.02%   7.33%   14.03%   30.86%   42.22%    52.53%    61.59%    68.50%   73.68%
                         (7.31%)  (6.70%) (16.83%) (11.36%)  (10.32%)   (9.06%)   (6.91%)  (5.19%)
   gcc            0.00%   0.24%   10.92%   26.61%   39.85%    54.97%    63.80%    69.37%   76.52%
                         (0.24%) (10.68%) (15.69%) (13.24%)  (15.13%)   (8.82%)   (5.57%)  (7.14%)
   go             0.00%   0.96%    9.09%   19.35%   39.97%    53.16%    65.31%    72.10%   77.36%
                         (0.95%)  (8.13%) (10.26%) (20.63%)  (13.19%)  (12.15%)   (6.79%)  (5.26%)
   cmssw          0.15%   0.19%    0.20%    2.43%   45.35%    56.49%    67.58%    73.17%   77.99%
                         (0.03%)  (0.01%)  (2.23%) (42.92%)  (11.15%)  (11.08%)   (5.59%)  (4.83%)
   linux          0.01%   0.66%    3.97%   23.49%   46.15%    62.97%    72.79%    79.16%   83.57%
                         (0.65%)  (3.30%) (19.52%) (22.66%)  (16.82%)   (9.82%)   (6.37%)  (4.41%)
   cpython        3.07%   4.97%   27.73%   59.54%   70.34%    77.48%    81.91%    86.76%   89.34%
                         (1.91%) (22.75%) (31.82%) (10.80%)   (7.13%)   (4.43%)   (4.85%)  (2.59%)
   git            0.11%  27.54%   55.92%   73.79%   81.90%    87.05%    90.28%    92.29%   93.82%
                        (27.43%) (28.38%) (17.87%)  (8.11%)   (5.15%)   (3.23%)   (2.01%)  (1.53%)
   homebrew-cask  0.40%   0.94%   95.41%   97.42%   98.11%    98.40%    98.61%    98.79%   98.93%
                         (0.54%) (94.46%)  (2.01%)  (0.70%)   (0.29%)   (0.21%)   (0.18%)  (0.14%)
   homebrew-core  0.01%   0.07%   98.81%   99.35%   99.56%    99.75%    99.81%    99.84%   99.86%
                         (0.07%) (98.74%)  (0.53%)  (0.22%)   (0.19%)   (0.06%)   (0.03%)  (0.02%)

This saves space, because the Modified Path Bloom Filters chunk will
contain far fewer Bloom filters (albeit those would be rather small
filters).  It reduces the probability of false positives, because all
commits modifying 1-6 paths will have larger than strictly necessary
modified path Bloom filters.  Finally, it makes the Bloom filter
query ever so slightly faster, partly because there is no redirection
into the Modified Path Bloom Filters chunk, and partly because we can
check all bit positions in a 63 bit modified path Bloom filter using
a 64 bit mask at once, instead of checking those bit positions one by
one, and we can use the same mask to check all embedded Bloom
filters.

The number of paths that can be stored in a 63 bit Bloom filter,
depending on the number of hashes per path:

   k     |   3  |   4  |  5  |  6  |  7  |  8  | 9 - 11 | 12 - 14 | 15 - 22 | 23 - 44
   ------+------+------+-----+-----+-----+-----+--------+---------+---------+--------
   paths | <=14 | <=11 | <=8 | <=7 | <=6 | <=5 |  <=4   |   <=3   |   <=2   |  <=1

Hashing scheme
--------------

We need to map each modified path to 'k' independent bit positions in
the Bloom filter's bit array.
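The single-mask query for embedded filters mentioned above can be
sketched like this: the 'k' bit positions of a path are OR-ed into one
64-bit mask once, and that same mask is then tested against every
embedded filter with a single AND.  The helper names and the example
bit positions are made up for illustration.

```python
def path_mask(positions):
    """Build one 64-bit mask from the k bit positions of a path; the
    same mask can then be tested against every embedded filter."""
    mask = 0
    for pos in positions:     # positions are in 0..62
        mask |= 1 << pos
    return mask

def embedded_filter_maybe_contains(entry, mask):
    """entry: an index entry with its MSB set; the low 63 bits are the
    embedded Bloom filter's bit array."""
    filter_bits = entry & ((1 << 63) - 1)
    # all k bits checked at once instead of one by one:
    return (filter_bits & mask) == mask

# a path hashed to bit positions 3, 17 and 42 (made-up values):
mask = path_mask([3, 17, 42])
entry = (1 << 63) | mask | (1 << 5)   # a filter containing those bits and more
assert embedded_filter_maybe_contains(entry, mask)
assert not embedded_filter_maybe_contains(1 << 63, mask)  # empty filter
```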
We want the hash function and hashing scheme that result in the
lowest false positive rate and have the lowest probability of
"colliding" paths (see below), while still being fast to compute and
widely available for alternative Git implementations.

At this point we don't care about the runtime of pathspec-limited
revision walks.  A lower false positive rate inherently leads to a
lower runtime, though on common setups we are reaching diminishing
returns as the false positive rate gets lower and lower... e.g. in
the webkit repository we'll reach an average false positive rate of
~0.001% with only k = 7 hashes per path.  However, on unusual
setups/configurations accessing tree objects might be considerably
more expensive than accessing commit objects, especially when using
only commit info stored in the commit-graph.  E.g. consider a future
where we can distribute commit-graphs with modified path Bloom
filters to partial clones containing only commit objects for most of
the history: any tree-diff will be really expensive, because the tree
objects must be fetched from the promisor.  In such a setup every
avoidable tree-diff call counts, and a low false positive rate is
king.

For now I went with the 32 bit MurmurHash3 used in an enhanced double
hashing scheme with 32 bit unsigned integer arithmetic, though, as I
will show below, it seems that this is not the best option.

So, to map each modified path to 'k' bit positions in the Bloom
filter's array we first need 'k' independent hashes.  In general,
hashing a path 'k' times with the same hash function but using 'k'
different seeds produces hashes that are independent enough.  In
practice, to reduce the overhead of hashing, especially for larger
'k' values, some variant of double hashing is often used to generate
the 'k' independent-ish hashes from the results of only two hash
function calls with different seeds.
In general:

    h1 = h(seed1, path)
    h2 = h(seed2, path)

    for (i = 0; i < k; i++)
        pos[i] = (h1 + i * h2 + f(i)) % nr_bits

Depending on how the f(i) term is defined there are a few named
variants:

  - Double hashing: f(i) = 0:

        pos[i] = (h1 + i * h2) % nr_bits

  - Improved double hashing: f(i) adds a simple quadratic term:

        pos[i] = (h1 + i * h2 + i^2) % nr_bits

  - Enhanced double hashing: f(i) adds a not-so-simple cubic term:

        pos[i] = (h1 + i * h2 + (i^3 - i) / 6) % nr_bits

    This cubic term is equal to the following sequence of numbers,
    starting from i = 0: 0, 0, 1, 4, 10, 20, 35, 56, 84, 120, etc.
    These are almost the tetrahedral numbers, except that the
    mathematical definition starts with 1, 4, 10..., while the OEIS
    A000292 sequence starts with 0, 1, 4, 10...

    I'm puzzled by this term being 0 not only for i = 0 but for i = 1
    as well, because it means that if (h2 % nr_bits == 0) (see
    below), then both pos[0] = h1 and pos[1] = h1, IOW in that case
    the path has one less bit set.  We'll take a look at whether
    starting the sequence with a single 0 makes a difference (it
    does) and how that affects the false positive rate (it slightly
    increases it overall); this will be labeled "enhanced double
    hashing variant #1" below.

    This sequence is supposed to be just as good as a simple i^3
    term, but it can easily be calculated incrementally without
    multiplication, using only addition.  We'll take a look at how a
    simple cubic term affects the false positive rate (it increases
    it); this will be labeled "enhanced double hashing variant #2"
    below.

These hashing schemes usually work fairly well in practice, but in
practice Bloom filters are primarily used to check whether an element
is part of _one_ _big_ set, and, consequently, most of the wisdom and
experience out there applies to big Bloom filters.  We, however, will
repeatedly check whether a particular element (i.e. path) is part of
a large number of mostly very small sets, and our use case does
challenge those best practices.
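The three variants above can be captured in one small sketch, with the
32-bit unsigned overflow emulated by a mask.  The h1/h2 values below
are made up; the point is to show how the f(i) term behaves in the
degenerate h2 % nr_bits == 0 case discussed later.

```python
M32 = (1 << 32) - 1  # emulate 32-bit unsigned integer overflow

def positions(h1, h2, k, nr_bits, f):
    """Derive k bit positions from two 32-bit hashes using the
    double-hashing family described above; f selects the variant."""
    return [((h1 + i * h2 + f(i)) & M32) % nr_bits for i in range(k)]

double   = lambda i: 0                  # plain double hashing
improved = lambda i: i * i              # improved double hashing
enhanced = lambda i: (i**3 - i) // 6    # 0, 0, 1, 4, 10, 20, 35, ...

# If h2 % nr_bits == 0, plain double hashing degenerates to a single
# bit position, while the cubic term still spreads the bits:
h1, h2, k, nr_bits = 12345, 63 * 1000, 7, 63
assert len(set(positions(h1, h2, k, nr_bits, double))) == 1
assert len(set(positions(h1, h2, k, nr_bits, enhanced))) > 1
```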
In our use case it is important when two paths "collide", i.e. when
one path in itself sets all the bit positions that (falsely) indicate
the presence of another path in a modified path Bloom filter, so
let's have a look at that.  And let's hope against hope that I don't
mess up the math...

So, if we have k truly independent hashes for each path, then:

(1) The probability of two paths mapping to the same k separate bit
    positions in a Bloom filter of size nr_bits is one over the
    number of ways to choose those positions, i.e.:

        1 / (nr_bits! / ((nr_bits - k)! * k!))

(2) The probability of a path setting only a single bit position is
    1 / nr_bits^(k-1), and the probability of a path setting one
    particular bit position is k / nr_bits (assuming that it does set
    k separate bits), so the probability of a path setting a bit
    position that happens to be the only bit position set by another
    path is k / nr_bits^k.

In case of our 63 bit embedded modified path Bloom filters and k = 7
the probabilities of these two cases are about 1.81 * 10^(-9) and
1.77 * 10^(-12), respectively.  I think that these are low enough not
to worry about.

However, the hashes in the various double hashing schemes are far
from independent.  When starting out with unsigned 32 bit hashes
(like we'll do) and using 64 bit arithmetic, there is no integer
overflow, because neither i nor f(i) is large enough for that.  Then,
due to the distributivity of the modulo operation (and because
i < nr_bits), all double hashing schemes are equivalent to:

    pos[i] = (h1 % nr_bits + i * (h2 % nr_bits) + f(i) % nr_bits) % nr_bits

For the above mentioned two colliding cases this means:

(1) if

        (h(seed1, "foo") % nr_bits == h(seed1, "bar") % nr_bits &&
         h(seed2, "foo") % nr_bits == h(seed2, "bar") % nr_bits)

    then all double hashing variants will work as if both paths had
    been hashed to the same h1 and h2 values, and, consequently, both
    paths will be mapped to the same bit positions.  This has the
    probability of 1 / nr_bits^2.
(2) if (h2 % nr_bits == 0), then the i * h2 term basically amounts to
    nothing and can be simplified away, and all bit positions depend
    only on h1; this has the probability of 1 / nr_bits.

    In case of double hashing the f(i) term is 0 as well, meaning
    that the path maps to a single bit position.  The probability of
    a path setting a bit position that happens to be the only bit
    position set by another path is k / nr_bits^2.  The quadratic or
    cubic f(i) terms in improved or enhanced double hashing ensure
    that a path is mapped to multiple bit positions even in this
    case, though those bit positions are not nearly as random as one
    would like.

In case of 63 bit Bloom filters and k = 7, the probabilities of these
two cases are 1 / 3969 and 1 / 567, respectively.

When using 32 bit unsigned integer arithmetic an integer overflow is
definitely possible, so the double hashing formula becomes:

    pos[i] = (h1 + i * h2 + f(i)) % 2^32 % nr_bits

If nr_bits is a power of two, then the "% 2^32" term can be
simplified away, and we end up with the same formula as with 64 bit
arithmetic, and the probabilities of cases (1) and (2) above remain
the same.  If nr_bits is not a power of two, then... well, I don't
offhand know how to approach that formally :)  Anyway:

(1) This type of collision seems to occur if that two-liner condition
    above is true and for both paths there is an overflow for the
    same values of i, which has the approximate probability of
    1 / ((k - 1) * nr_bits^2).

(2) This type of collision seems to occur if (h2 % nr_bits == 0) and
    there is either no integer overflow for any value of i or an
    overflow for all values of i, which has the approximate
    probability of k / ((k - 1) * nr_bits^2).

In case of 63 bit Bloom filters and k = 7 the probabilities of these
two cases are 1 / 23814 and 1 / 3402, respectively.

There are several other colliding cases, e.g. with double hashing
variants it's more probable that a path maps only to 2, 3, etc. bit
positions instead of 'k' than with 'k' truly independent hashes,
though I haven't looked into how those probabilities work out.
Anyway, all of the above already shows that:

  - These colliding cases are several orders of magnitude more likely
    with any double hashing variant than with k truly independent
    hash functions.

  - When using some sort of double hashing, these colliding cases can
    happen with a high enough probability that we can't just ignore
    them.

  - These colliding cases happen much more frequently with double
    hashing than with improved or enhanced double hashing.

  - Using 64 vs. 32 bit arithmetic while calculating the various
    double hashing schemes makes a difference, and suggests that
    32 bit arithmetic has a lower false positive probability.

  - Bloom filters whose size is a power of two might have a higher
    false positive probability.

All this is important, because there are repositories out there that
modify the same path in the majority of commits, e.g.:

  - homebrew-core: contains 6535 paths, and over 99.5% of commits
    modify the 'Formula' directory and have an embedded modified path
    Bloom filter.

  - homebrew-cask: contains 7307 paths, and over 95.5% of commits
    modify the 'Casks' directory and have an embedded modified path
    Bloom filter.

  - rust: contains 58k+ paths, and almost 93% of commits modify the
    'src' directory, though only over 54% of commits modify that
    directory and have an embedded modified path Bloom filter.

  - tensorflow: contains 47k+ paths, and almost 91% of commits modify
    the 'tensorflow' directory, though only about 51% of commits
    modify that directory and have an embedded modified path Bloom
    filter.

  - go: contains 22k+ paths, and almost 84% of commits modify the
    'src' directory, though only over 50% of commits modify that
    directory and have an embedded modified path Bloom filter.

So if we were, e.g., to look for commits modifying a path in the
homebrew-core repository which happens to map to the same bit
positions in a 63 bit Bloom filter as the 'Formula' directory, then
Boom!  We would get an over 99.5% false positive rate, effectively
rendering modified path Bloom filters useless for that particular
path.

The table below shows the number of paths that happen to collide with
the repository's frequently modified directory in embedded modified
path Bloom filters, using different hash functions and hashing
schemes with 32 bit unsigned integer arithmetic and with 64 bit
arithmetic (the latter in parentheses):

                 |  Double hashing     |     Enhanced      |     7 seeds
                 |                     |  double hashing   |
                 |  Murmur3 |  xxHash  | Murmur3 | xxHash  | Murmur3 | xxHash
   --------------+----------+----------+---------+---------+---------+--------
   homebrew-core |   4 (15) |   5 (17) |  0 (0)  |  0 (2)  |    0    |    0
   homebrew-cask |   2 (21) |   6 (14) |  0 (2)  |  1 (1)  |    0    |    0
   rust          | 18 (143) | 38 (144) |  1 (15) |  6 (24) |    0    |    0
   tensorflow    | 20 (110) | 18 (105) |  4 (15) |  0 (12) |    0    |    0
   go            |   9 (66) |  12 (62) |  0 (1)  |  3 (8)  |    0    |    0
   --------------+----------+----------+---------+---------+---------+--------
   all           | 53 (355) | 79 (342) |  5 (33) | 10 (47) |    0    |    0

The effect of embedded modified path Bloom filters on these colliding
cases can be both beneficial and harmful:

  - We use a larger than necessary Bloom filter to hold 1-6 paths,
    which lowers the probability of these cases considerably.

    This is especially important for the homebrew repositories.
    E.g. in homebrew-core over 98.6% of commits modify a single file
    in the 'Formula' directory, i.e. two paths in total.  To store 2
    paths using 7 hashes per path we would need a Bloom filter with a
    20 bit array, which we would round up to 24 bits to use full
    bytes, i.e. those 98.6% of commits would have merely 24 bit Bloom
    filters.  This makes those colliding cases all the more probable:
    1 / 3456 for case (1) and 1 / 493.7 for case (2) with 32 bit
    arithmetic.
  - We use Bloom filters of the same size to hold 1-6 paths, so if a
    path were to run into these cases, then more Bloom filter queries
    would return false positives than when using appropriately sized
    Bloom filters.

Now, there is a simple trick that can, to some extent, alleviate
these collision issues: check all leading directories of the
pathspecs, i.e. while looking for commits modifying
'dir/subdir/file', query the modified path Bloom filters not only for
the full path, but for its leading directories 'dir/subdir' and 'dir'
as well.  This way we only get a false positive if all bit positions
of all leading directories are set as well, which can significantly
reduce the probability of a pathspec running afoul of these colliding
cases, and, in general, can reduce the false positive rate by an
order of magnitude or three as well (see later in this patch series).

Checking all leading directories is not a silver bullet, though,
because it can only help if the pathspec actually has leading
directories, and the deeper the pathspec is in the directory tree,
i.e. the more leading directories it has, the lower the false
positive rate becomes.

  - So this doesn't help pathspecs in the root tree, because they
    obviously don't have any leading directories.

  - Furthermore, this doesn't help pathspecs immediately below such a
    frequently modified directory, because their only leading
    directory is modified in the majority of commits.  This means
    that it can't help in the homebrew repositories, because 85% or
    90% of their paths are directly under their frequently modified
    directories.

The only thing that can help even in these cases is hashing the paths
k times using k different seeds.

Phew.  Let's see some actual benchmarks, shall we?

The following tables in this section show the average false positive
rate of various hash functions and hashing schemes while listing the
histories of "a lot of paths", i.e. 'git rev-list HEAD -- $path'.
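Enumerating the leading directories to query alongside the full
pathspec can be sketched as follows; the helper name is made up for
illustration.

```python
def leading_directories(pathspec):
    """All leading directories of a pathspec, to be queried in
    addition to the full path to cut down false positives (sketch)."""
    parts = pathspec.split("/")[:-1]   # drop the final path component
    return ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]

assert leading_directories("dir/subdir/file") == ["dir", "dir/subdir"]
assert leading_directories("Makefile") == []   # root-level paths have none
```

A query then reports a (possible) modification only if the full path
and every one of these directories hit in the filter.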
A '*' denotes the lowest value in each row.  All cases use k = 7
hashes per path, use the same basically random seeds of immense
historical significance, store up to 6 paths in embedded modified
path Bloom filters, and check all leading directories of the given
pathspec.

The table below compares the average false positive rate of 64 bit
and 32 bit unsigned integer arithmetic using double hashing and
enhanced double hashing with the MurmurHash3 hash function:

                       Double hashing     | Enhanced double hashing
                      32 bit     64 bit   |   32 bit      64 bit
   -------------------------------------+------------------------
   android-base    0.008539%  0.015709% | *0.004155%  0.004961%
   cmssw           0.006022%  0.013334% | *0.003953%  0.004109%
   cpython         0.036840%  0.079816% | *0.016607%  0.019330%
   elasticsearch   0.004069%  0.005297% |  0.003249% *0.003101%
   gcc             0.016982%  0.034096% | *0.008919%  0.011152%
   gecko-dev       0.001058%  0.002691% | *0.000725%  0.000829%
   git             0.144256%  0.331346% | *0.069921%  0.079405%
   glibc           0.026480%  0.053838% | *0.016389%  0.017902%
   go              0.021930%  0.050348% | *0.012616%  0.014178%
   homebrew-cask   0.097523%  0.508175% | *0.009096%  0.042034%
   homebrew-core   0.120860%  0.556014% | *0.005360%  0.026810%
   jdk             0.006085%  0.007526% |  0.006431% *0.005911%
   linux           0.010908%  0.019081% | *0.007494%  0.007896%
   llvm-project    0.006417%  0.009327% | *0.003913%  0.004050%
   rails           0.024997%  0.046829% |  0.013134% *0.012361%
   rust            0.038579%  0.056852% | *0.025509%  0.027068%
   tensorflow      0.013732%  0.023307% | *0.008243%  0.008848%
   webkit          0.002212%  0.002950% | *0.001007%  0.001065%
   -------------------------------------+------------------------
   all             0.028395%  0.110075% | *0.006085%  0.006675%
   w/o homebrew    0.010940%  0.021286% | *0.005968%  0.011023%

The 64 bit unsigned arithmetic does indeed fare worse in almost every
case, and significantly worse in the two homebrew repositories.
So it seems that if using some form of double hashing, then 32 bit
unsigned integer arithmetic is the way to go, even though several
programming languages lack support for unsigned types (though thanks
to the distributivity of the modulo operation, they can simply and
cheaply emulate it using 64 bit arithmetic and a '& (2^32-1)' mask).

The table below compares the average false positive rate of various
hashing schemes using 32 bit unsigned integer arithmetic and the
MurmurHash3 hash function:

                            Enhanced    Enhanced    Enhanced
                             double      double      double     Improved
                            hashing     hashing     hashing      double      Double
                 7 seeds   (original)  variant 1   variant 2    hashing     hashing
  ---------------------------------------------------------------------------------
  android-base   0.004214%  *0.004155%   0.004853%   0.005193%   0.008252%   0.008539%
  cmssw         *0.003344%   0.003953%   0.003546%   0.004732%   0.004226%   0.006022%
  cpython        0.016120%   0.016607%  *0.015896%   0.020259%   0.020311%   0.036840%
  elasticsearch *0.003167%   0.003249%  *0.003164%   0.003531%   0.003553%   0.004069%
  gcc            0.010281%  *0.008919%   0.010359%   0.010642%   0.012166%   0.016982%
  gecko-dev      0.000804%   0.000725%  *0.000646%  *0.000648%   0.000822%   0.001058%
  git           *0.063025%   0.069921%   0.067643%   0.080744%   0.091423%   0.144256%
  glibc         *0.016278%   0.016389%   0.018660%   0.021094%   0.025596%   0.026480%
  go             0.013009%  *0.012616%   0.013301%   0.014316%   0.016345%   0.021930%
  homebrew-cask *0.007199%   0.009096%   0.009070%   0.010722%   0.016617%   0.097523%
  homebrew-core *0.003041%   0.005360%   0.005277%   0.005148%   0.016150%   0.120860%
  jdk            0.005873%   0.006431%  *0.005326%   0.005955%   0.006984%   0.006085%
  linux          0.007764%  *0.007494%   0.007714%   0.008709%   0.009429%   0.010908%
  llvm-project  *0.003367%   0.003913%   0.003730%   0.004064%   0.004809%   0.006417%
  rails         *0.012708%   0.013134%   0.013929%   0.016316%   0.014358%   0.024997%
  rust          *0.024245%   0.025509%   0.025045%   0.028204%   0.032491%   0.038579%
  tensorflow    *0.007907%   0.008243%   0.008670%   0.009913%   0.010115%   0.013732%
  webkit        *0.000999%  *0.001007%   0.001142%   0.001113%   0.001274%   0.002212%
  ---------------------------------------------------------------------------------
  all           *0.005646%   0.006085%   0.006226%   0.006928%   0.009264%   0.028395%
  w/o homebrew  *0.005888%   0.005968%   0.006152%   0.006899%   0.007807%   0.010940%

Double hashing and improved double hashing have higher average false
positive rates than enhanced double hashing; note in particular the
significantly higher false positive rate of double hashing in the two
homebrew repositories.  Enhanced double hashing is only slightly worse
than 7 different seeds, at least in this particular case (i.e.
MurmurHash3 and these specific seeds).  Comparing these hashing
schemes using a different hash function (xxHash or FNV1a) shows a
similar trend; for brevity I won't include those tables here.

The table below compares the average false positive rate of different
32 bit hash functions when used in the enhanced double hashing scheme
with k = 7 and 32 bit unsigned int arithmetic, or with 7 different
seeds:

                     Enhanced double hashing       |             7 seeds             | 7 uint32
                  Murmur3     xxHash      FNV1a    |  Murmur3     xxHash     FNV1a   |  SHA256
  -----------------------------------------------+---------------------------------+-----------
  android-base   0.004155%  0.005556%  0.006113% |  0.004214% *0.004101%  0.005918% |  0.004168%
  cmssw          0.003953%  0.003775%  0.005087% |  0.003344%  0.003677%  0.004353% | *0.003322%
  cpython        0.016607%  0.015919%  0.021957% |  0.016120% *0.015238%  0.023649% |  0.017632%
  elasticsearch  0.003249%  0.003589%  0.004616% |  0.003167%  0.003360%  0.003586% | *0.003027%
  gcc           *0.008919%  0.009376%  0.010555% |  0.010281%  0.009157%  0.011882% |  0.010338%
  gecko-dev      0.000725%  0.000721%  0.000932% |  0.000804% *0.000611%  0.000911% |  0.000737%
  git            0.069921%  0.063449%  0.083097% | *0.063025%  0.069137%  0.087132% |  0.063763%
  glibc          0.016389%  0.017477%  0.024321% |  0.016278% *0.016062%  0.022641% |  0.017241%
  go             0.012616%  0.012449%  0.016728% |  0.013009% *0.011692%  0.017214% |  0.012489%
  homebrew-cask  0.009096%  0.025104%  0.011073% |  0.007199%  0.007343%  0.009037% | *0.007050%
  homebrew-core  0.005360%  0.007940%  0.005450% | *0.003041%  0.003832%  0.004888% |  0.003884%
  jdk            0.006431% *0.005451%  0.007428% |  0.005873%  0.005738%  0.006654% |  0.005749%
  linux         *0.007494%  0.008266%  0.009917% |  0.007764%  0.007723%  0.009256% |  0.007939%
  llvm-project   0.003913%  0.003727%  0.004584% |  0.003367% *0.003247%  0.005115% |  0.003664%
  rails          0.013134%  0.011277%  0.014103% |  0.012708% *0.010389%  0.015928% |  0.012506%
  rust           0.025509%  0.024596%  0.032983% | *0.024245%  0.025697%  0.031593% |  0.024794%
  tensorflow     0.008243%  0.008743%  0.011499% | *0.007907%  0.008089%  0.011922% |  0.008033%
  webkit         0.001007%  0.001155%  0.001401% |  0.000999% *0.000962%  0.001291% |  0.001054%
  -----------------------------------------------+---------------------------------+----------
  all            0.006085%  0.007327%  0.007518% |  0.005646% *0.005554%  0.007591% |  0.005857%
  w/o homebrew   0.005968%  0.005979%  0.007545% |  0.005888% *0.005660%  0.007855% |  0.006039%

MurmurHash3 and xxHash are neck and neck, be it enhanced double
hashing or 7 different seeds, at least when we ignore the unusually
high false positive rate of enhanced double hashing with xxHash in the
homebrew-cask repository (it stumbles upon one of those colliding
cases discussed above).  FNV1a has a decidedly higher average false
positive rate than any of the others.

I was curious to see whether using 7 unsigned integers from SHA256
offers any benefits (being a cryptographic hash function, it should
provide some high quality hash values), but apparently it doesn't fare
any better than MurmurHash3 and xxHash.  This leads me to believe that
both MurmurHash3 and xxHash are as good as it gets, and I would not
expect any other hash function to achieve notably lower false positive
rates.

Now let's see how these hash functions and hashing schemes fare when
writing commit-graph files with '--reachable' and with modified path
Bloom filters from scratch at the end of this patch series.
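For reference, here is a compact sketch of the 32 bit MurmurHash3
function used in the benchmarks above -- my own transcription of the
public domain reference algorithm, not code from this series:

```python
def murmur3_32(data, seed):
    """32 bit MurmurHash3 (x86 variant) of a byte string."""
    c1, c2, mask = 0xcc9e2d51, 0x1b873593, 0xFFFFFFFF

    def rotl(x, r):
        return ((x << r) | (x >> (32 - r))) & mask

    h = seed & mask
    nblocks = len(data) // 4
    # body: process the input in 4 byte little-endian blocks
    for i in range(0, nblocks * 4, 4):
        k = int.from_bytes(data[i:i + 4], 'little')
        k = rotl((k * c1) & mask, 15)
        k = (k * c2) & mask
        h = (rotl(h ^ k, 13) * 5 + 0xe6546b64) & mask
    # tail: up to 3 remaining bytes (little-endian accumulation)
    k = 0
    for b in reversed(data[nblocks * 4:]):
        k = (k << 8) | b
    if k:  # an all-zero tail mixes in nothing, so skipping it is safe
        k = rotl((k * c1) & mask, 15)
        h ^= (k * c2) & mask
    # finalization mix
    h ^= len(data)
    h = ((h ^ (h >> 16)) * 0x85ebca6b) & mask
    h = ((h ^ (h >> 13)) * 0xc2b2ae35) & mask
    return (h ^ (h >> 16)) & mask
```

The well-known test vectors murmur3_32(b"", 0) == 0 and
murmur3_32(b"", 1) == 0x514E28B7 can be used to sanity-check a
transcription like this one.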
Hash functions tend to praise themselves for how fast they can process
huge chunks of data, but we'll use them to hash many tiny strings...

  Total runtime of writing a commit-graph file with modified path
  Bloom filters from scratch
  (Time spent hashing)

                     Enhanced double hashing      |            7 seeds              | 7 uint32
                  Murmur3     xxHash      FNV1a   |  Murmur3    xxHash      FNV1a   |  SHA256
  -----------------------------------------------+---------------------------------+----------
  android-base    40.880s    40.368s    40.569s  |  41.375s    40.557s    41.843s  |  42.706s
                 (0.627s)   (0.484s)   (0.777s)  | (1.467s)   (0.886s)   (2.138s)  | (3.682s)
  cmssw           25.691s    25.224s    25.645s  |  26.715s    25.998s    27.762s  |  29.527s
                 (0.894s)   (0.650s)   (1.179s)  | (2.083s)   (1.235s)   (3.259s)  | (5.077s)
  cpython          8.951s     8.929s     9.067s  |   9.072s     9.015s     9.008s  |   9.275s
                 (0.057s)   (0.042s)   (0.045s)  | (0.112s)   (0.067s)   (0.102s)  | (0.324s)
  elasticsearch   14.470s    14.320s    14.703s  |  14.983s    14.588s    15.948s  |  16.720s
                 (0.407s)   (0.300s)   (0.622s)  | (0.991s)   (0.594s)   (1.760s)  | (2.332s)
  gcc             36.917s    36.724s    37.251s  |  37.971s    37.449s    38.178s  |  38.332s
                 (0.313s)   (0.230s)   (0.346s)  | (0.694s)   (0.418s)   (0.919s)  | (1.796s)
  gecko-dev       97.729s    96.791s    97.233s  |  99.215s    97.158s   101.403s  | 105.332s
                 (1.730s)   (1.267s)   (2.099s)  | (3.902s)   (2.344s)   (5.773s)  | (9.553s)
  git              5.245s     5.401s     5.412s  |   5.518s     5.457s     5.474s  |   5.494s
                 (0.022s)   (0.017s)   (0.018s)  | (0.045s)   (0.027s)   (0.040s)  | (0.124s)
  glibc            4.146s     4.156s     4.187s  |   4.267s     4.201s     4.278s  |   4.495s
                 (0.060s)   (0.045s)   (0.057s)  | (0.128s)   (0.079s)   (0.144s)  | (0.331s)
  go               3.565s     3.563s     3.564s  |   3.631s     3.582s     3.607s  |   3.727s
                 (0.040s)   (0.030s)   (0.035s)  | (0.084s)   (0.051s)   (0.084s)  | (0.221s)
  homebrew-cask   29.818s    29.936s    30.279s  |  30.316s    30.435s    29.942s  |  29.939s
                 (0.025s)   (0.020s)   (0.019s)  | (0.049s)   (0.032s)   (0.038s)  | (0.153s)
  homebrew-core   55.478s    55.534s    56.248s  |  56.551s    56.100s    56.445s  |  55.641s
                 (0.031s)   (0.024s)   (0.023s)  | (0.062s)   (0.038s)   (0.047s)  | (0.195s)
  jdk             19.418s    19.151s    20.246s  |  21.270s    20.037s    24.073s  |  25.916s
                 (1.260s)   (0.878s)   (2.105s)  | (3.154s)   (1.833s)   (6.133s)  | (7.732s)
  linux          100.837s   100.130s   101.244s  | 103.775s   101.645s   103.856s  | 109.365s
                 (2.027s)   (1.498s)   (2.030s)  | (4.319s)   (2.682s)   (5.316s)  | (11.075s)
  llvm-project    31.188s    31.392s    31.442s  |  31.895s    31.479s    31.984s  |  32.863s
                 (0.334s)   (0.251s)   (0.345s)  | (0.724s)   (0.437s)   (0.895s)  | (1.794s)
  rails            5.607s     5.639s     5.694s  |   5.742s     5.720s     5.865s  |   6.087s
                 (0.084s)   (0.063s)   (0.095s)  | (0.189s)   (0.116s)   (0.252s)  | (0.467s)
  rust            13.250s    13.206s    13.422s  |  13.701s    13.426s    13.680s  |  13.949s
                 (0.163s)   (0.122s)   (0.169s)  | (0.359s)   (0.217s)   (0.440s)  | (0.880s)
  tensorflow      11.808s    11.608s    11.915s  |  12.379s    11.966s    12.860s  |  13.588s
                 (0.368s)   (0.267s)   (0.497s)  | (0.860s)   (0.517s)   (1.369s)  | (2.081s)
  webkit          30.469s    30.735s    30.945s  |  32.005s    31.004s    32.401s  |  33.303s
                 (0.501s)   (0.376s)   (0.733s)  | (1.212s)   (0.725s)   (2.084s)  | (3.044s)

So xxHash is indeed the fastest, even in our use case, and MurmurHash3
comes in second.  The time spent hashing with 7 seeds tends to be
around twice as much as with enhanced double hashing when using the
same hash function.  However, the time spent hashing is only a
fraction of the total runtime, and while its effect on the total
runtime is measurable both in best-of-five and average, it tends to be
smaller than run-to-run noise.

As for the availability of hash functions:

  - MurmurHash3 is widely available; both the reference implementation
    and a streaming-capable ANSI C implementation are in the public
    domain.

  - xxHash is allegedly available in a variety of programming
    languages, as shown on www.xxhash.com (supposedly, but I haven't
    been able to load that page in months... some cloudflare host
    error persists).

  - SHA256 is widely available, and it must be part of every Git
    implementation in the near future anyway, but it's slower than the
    others, and, more importantly, it doesn't scale for k > 8.
  - FNV1a is so simple that anyone can implement a variant that
    incrementally computes two hashes up to the next directory
    separator in one go in about 20 lines of code (though note that
    the above benchmarks didn't use such an implementation).  Alas,
    because of its higher false positive rate it's out anyway.

Conclusion: we should seriously consider using MurmurHash3 (or xxHash)
and hashing each path k times with k different seeds instead of any
double hashing scheme.


Merges
------

It's not clear whether it's worth computing, storing and using
modified path Bloom filters for all parents of merge commits.

  - The number of paths modified between a merge commit and its
    second..nth parents is in general considerably larger than between
    any commit and its first parent.  E.g. a lot of files are modified
    in Git's master branch while a topic cooks for a few weeks, and
    many more in Linux when a branch started from the previous release
    is merged near the end of the next merge window.

                           Average number of modified paths
                  Percentage       compared to:
                   of merge      first       second
                   commits       parent      parent
    --------------------------------------------
    android-base     73.6%        14.1      1553.6
    cmssw            11.0%        16.1       977.4
    cpython          11.7%         5.1       933.7
    elasticsearch     8.4%        40.1       246.5
    gecko-dev         3.5%        23.5       602.0
    git              25.3%         4.0       394.8
    homebrew-cask     9.6%         2.7        42.8
    jdk              25.0%       184.1       408.2
    linux             7.4%        26.1      2268.0
    rails            22.2%         9.6       101.0
    rust             27.0%        15.7       397.3
    tensorflow        9.1%        39.2      1057.4

    Consequently:

    - The tree-diff machinery has to work that much more to gather
      modified paths for all parents of merge commits, significantly
      increasing the runtime of writing the commit-graph file.

    - Storing modified path Bloom filters for all parents of merge
      commits significantly increases the size of the Modified Path
      Bloom Filters chunk, though depending on the percentage of merge
      commits and on the size of the other chunks the relative size
      increase of the whole commit-graph file might not be all that
      much.
  - [TODO: A few old test results suggest that pathspec-limited
    revision walks with default history simplification using a
    commit-graph file storing modified path Bloom filters for all
    merge parents are a few percent slower than when storing Bloom
    filters only for first parents.  Even fewer old test results
    suggest that writing all Bloom filters for first parents first,
    and then all for second..nth parents, might eliminate much of that
    runtime difference.  Definitely needs more benchmarking.]

  - During a pathspec-limited revision walk Git's default history
    simplification only checks whether the given path was modified
    between a merge commit and its second..nth parents when the path
    was modified between that merge commit and its first parent.  This
    usually happens rarely, though these second..nth parent diffs tend
    to be more expensive than first parent diffs (because of the
    considerably more modified paths the tree-diff machinery can't
    short-circuit that early).  Anyway, the potential speedup is low.

  - However, with '--full-history', i.e. without any history
    simplification, all merge commits are compared to all their
    parents, and typical additional speedups are in the range of
    2x-3x, while in some cases over 7x or 11x can be achieved by using
    modified path Bloom filters for all parents of merge commits.

Is it worth it?  For me personally it isn't, but I don't know how
often others use '--full-history' and what trade-offs they might be
willing to make.  So the file format described here adds _optional_
support for storing modified path Bloom filters for all parents of
merge commits, and the users can make this decision themselves.
[TODO: describe it!]


Deduplication
-------------

Some commits have identical modified path Bloom filters, because they
modify the same set of paths (or because they modify different sets of
paths that happen to end up setting the same bit positions in the
Bloom filter).
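One possible writer-side implementation of this deduplication is a
simple map from serialized filter bytes to the offset where those
bytes were first written (a sketch of mine, assuming filters are
already serialized as byte strings; all names are hypothetical):

```python
def write_filters_deduplicated(filters):
    """Lay out Bloom filters in a chunk, reusing the offset of any
    filter whose serialized bytes have already been written.

    'filters' maps commit id -> serialized filter bytes; returns the
    chunk contents and a commit id -> offset index, where multiple
    commit ids may share one offset."""
    chunk = bytearray()
    offset_of = {}   # serialized filter bytes -> offset in chunk
    index = {}       # commit id -> offset into the chunk
    for commit, data in filters.items():
        if data not in offset_of:
            offset_of[data] = len(chunk)
            chunk.extend(data)
        index[commit] = offset_of[data]
    return bytes(chunk), index

chunk, index = write_filters_deduplicated(
    {"c1": b"\x01\x02", "c2": b"\xff", "c3": b"\x01\x02"})
# c1 and c3 produced identical filters, so they share one offset
assert index["c1"] == index["c3"] and len(chunk) == 3
```

The reader side needs no special handling beyond tolerating multiple
index entries that point to the same offset.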
By omitting duplicates from the Modified Path Bloom Filters chunk its
size can typically be reduced by around 5-20%, and in case of the
android-base repository by over 69%.

Explicitly allow multiple entries of the Modified Path Bloom Filter
Index chunk to refer to the same offset in the Modified Path Bloom
Filters chunk.  This is important, because even if an implementation
doesn't perform this deduplication while writing the commit-graph
file, it must be prepared for multiple index entries referring to the
same offset in a commit-graph file written by a different
implementation.


Modified Path Bloom Filter Excludes
-----------------------------------

Some repositories contain leading directories that are modified in the
great majority of commits, e.g. the homebrew-core repository's
'Formula' directory is modified in over 99.7% of commits, while
homebrew-cask's 'Casks' and rust's 'src' are modified in over 93% of
commits.  And there is 'src/main/java/com/company/division/project' in
convention-following Java projects...

Modified path Bloom filters can't speed up revision walks when the
pathspec is such a frequently modified leading directory, because due
to potential false positives we'll have to run tree-diff for the
majority of commits anyway.  And it doesn't really make sense to query
the history of such a leading directory in practice, because it will
list the majority of commits, so one might as well look straight at
the output of a pathspec-less 'git log'.  However, adding those
frequently modified leading directories to the modified path Bloom
filters requires more space and increases the probability of false
positives.

So the file format described here adds support for excluding specific
paths from modified path Bloom filters by listing them in the Modified
Path Bloom Filter Excludes chunk.
[TODO: Figure out the details!]
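Since the details are still a TODO, here is one possible reading as a
sketch (mine): the excludes are literal, full-path matches, and
because the chunk stores them in memcmp() order they can be
binary-searched:

```python
import bisect

def is_excluded(path, excludes_sorted):
    """Binary-search the excludes list (kept in memcmp() order, i.e.
    sorted byte strings) for a literal, full-path match.  This is only
    one possible interpretation of the unresolved TODO above."""
    i = bisect.bisect_left(excludes_sorted, path)
    return i < len(excludes_sorted) and excludes_sorted[i] == path

excludes = sorted([b"Casks", b"Formula", b"src"])
assert is_excluded(b"Formula", excludes)
assert not is_excluded(b"Formula/git.rb", excludes)
```

If leading-directory matches were desired instead, the check would
have to also test whether any excluded entry is a '/'-terminated
prefix of the path; that is exactly the kind of detail the TODO needs
to pin down.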
Limitations
-----------

Because of the possibility of false positives, if a modified path
Bloom filter query returns 'possibly in set', then we have to invoke
tree-diff to make sure that the path in question was indeed modified
by the given commit.  Consequently, Bloom filters can't improve
performance that much when looking for the history of a frequently
modified path, because a lot of tree-diff invocations can't be
eliminated.  In the extreme case of looking for the history of a path
modified in every commit, using Bloom filters will only add extra
overhead.

A modified path Bloom filter doesn't store the names of modified
paths, it only sets a couple of bits based on those paths' hashes.
Consequently, it can only be used when looking for the history of a
concrete path, and:

  - It can't be used with a pathspec containing wildcards like
    'git log -- "*.h"'.  However, it could still be used when the
    pathspec contains leading directories without wildcards, e.g.
    'git log -- "arch/x86/include/*.h"', by limiting tree-diff only to
    commits modifying the 'arch/x86/include/' directory.

  - It can't tell which paths were modified by a given commit; for
    that we would still have to run tree-diff for the full tree.


Submodules [TODO]
-----------------

No modified path Bloom filters should be stored for commits modifying
submodules.  This is questionable, but is necessary to produce the
same output with and without modified path Bloom filters.

If 'dir/submod' is a gitlink file, then currently
'git rev-list HEAD -- dir/submod/whatever' lists all commits touching
'dir/submod', even when that 'whatever' has never existed.  And that
'whatever' can be basically anything, so we will not find it in any of
our modified path Bloom filters, therefore in a Bloom-filter-assisted
revision walk we won't list any commits.  The only way around this is
to not write any modified path Bloom filters for commits modifying
submodules.
Note, however, that once upon a time that command wouldn't list
anything, either, but the behavior changed with commit 74b4f7f277
(tree-walk.c: ignore trailing slash on submodule in
tree_entry_interesting(), 2014-01-23) to what we have now.  As
74b4f7f277's log message only talks about handling 'dir/submod/' and
'dir/submod' (i.e. with and without trailing slash) consistently, I
suspect that this behavior change with 'dir/submod/anything' is an
unintended and undesired side effect.  However, as it involves
submodules I would rather not have an opinion :)

In any case, someone with more clues about submodules should take a
closer look and decide whether this is a bug or not, before this
modified path Bloom filter thing goes much further.  If it is indeed a
bug, then it should be fixed and the remark about submodules should be
removed from the modified path Bloom filter specs.  If the current
behavior is desired, then the remark about submodules should be
updated to use proper English (IMO it must be part of the spec,
because this is a subtle issue that developers of other
implementations (JGit, libgit2, etc.) might easily overlook).


Threats to validity
-------------------

  - Random paths are... random.  Picking random paths can
    over-represent rarely modified files.  Since modified path Bloom
    filters bring more benefits to rarely modified paths, the reported
    speedups later in the series might be higher than what users will
    usually see.  (I suppose that users more often check the logs of
    frequently modified files than of rarely modified ones.)  Though
    some of these random paths made me stumble upon the issue with
    submodules discussed above, so...

  - Bugs :)  It's not that hard to make subtle bugs that don't affect
    correctness, because the probabilistic nature of Bloom filters
    covers them up.
    However, bugs like incorrectly calculating the size of a Bloom
    filter or having an off-by-one error in the filter's array
    handling do affect the false positive rate and in turn the runtime
    of pathspec-limited revision walks.


Alternatives considered
-----------------------

Here are some alternatives that I've considered but discarded, and
ideas that I haven't (yet) followed through:

  - One Bloom filter to rule them all?  No.

    While the first proof of concept implementation [2] demonstrated
    that by combining hashes of modified pathnames with commit and
    parent object ids it is possible to use a single big Bloom filter
    and achieve notable speedups, that approach has several drawbacks:

    - Poor locality: The bit positions that are set for any element
      are supposed to be random, so we need to access random parts of
      the big Bloom filter array for each checked commit and path,
      which is bad for the caches.  Furthermore, we would likely need
      to read the whole big Bloom filter array even when looking for
      only a handful of commits, e.g. for 'git log -3 -- dir/file'.

    - Need to grow: As the repository grows, more and more paths will
      be added to the modified path Bloom filter, thus raising its
      false positive rate.  Eventually we'll reach a point where we
      need to increase the size of the Bloom filter, but a Bloom
      filter can't be resized, so we'll have to create a larger Bloom
      filter and re-fill it from scratch.  This is a considerably more
      expensive operation than creating a
      modified-path-Bloom-filter-less commit-graph from scratch.

    - Expired commits: Unreachable commits are eventually pruned by
      'git gc', but elements can't be removed from a Bloom filter.
      Consequently, all the bits set for paths modified by expired
      commits will remain set, slightly increasing the probability of
      false positives until the Bloom filter is eventually
      resized/rewritten.
      There are enhanced Bloom filter variants that allow removal of
      elements, like a counting Bloom filter, but they come with
      additional complexity and require considerably more space.

    - Excluded commits: Sometimes we can't or just don't want to
      record the paths modified by a particular commit, so we need a
      way to record which commits don't have entries in the big Bloom
      filter.

    - There are plans for split commit-graph files, because rewriting
      the whole commit-graph file in big repositories to add only
      relatively few new commits seems to impose considerable
      overhead.  We would need a big modified path Bloom filter in
      each of the split commit-graph files that we'll combine together
      when the split commit-graph files are combined.  Bloom filters
      of the same size can be quickly combined by bitwise OR-ing their
      arrays, but this implies that the split commit-graph files
      holding only a few commits would have to contain as large a
      Bloom filter as is used in the shared commit-graph file holding
      most of the commits in the repository, wasting valuable storage
      space.  (Though the false positive rate for commits stored in
      the split commit-graph should be spectacularly low.)

    Using a small Bloom filter for each commit avoids these issues,
    though, as discussed earlier, not without introducing issues of
    its own.

  - Xor filters "are a faster and more concise alternative to Bloom
    filters", but they turned out to be too large in our use case.

    Each Xor filter consists of a header and an array: the header
    contains a 64 bit seed and a 64 bit blocklength, while the array
    contains 3 * blocklength bytes.  Consequently, the space advantage
    of Xor filters compared to Bloom filters only holds for large
    filters, where the header's size is negligible compared to the
    array's size.
    We, however, have a lot of very small filters, and all those extra
    headers cause an unacceptable size increase (partly because the
    Xor filter's header is 4 times the size of the header of our
    modified path Bloom filters, and partly because their headers make
    Xor filters too large to be embedded in the index chunk).

                      MPBF chunk
                      Total size    Xor Filter
                      (no dedup)       size
    ------------------------------------------------
    android-base        8620198     28840220     +334%
    cmssw              10347018     21538283     +208%
    cpython              526381      6393731    +1214%
    elasticsearch       4733274      9101088     +192%
    gcc                 3724544     14637474     +393%
    gecko-dev          20591010     56887518     +276%
    git                  160718      3115827    +1938%
    glibc                759132      3162898     +416%
    go                   458375      2676306     +584%
    homebrew-cask         94383      5272317    +5586%
    homebrew-core         27823      8051585   +28938%
    jdk                13213208     15574498     +117%
    linux              26575278     69577374     +261%
    llvm-project        3988133     20235803     +507%
    rails                958691      5172643     +539%
    rust                1985782      7082949     +356%
    tensorflow          4173570      8333062     +200%
    webkit              5891751     15927504     +270%

    https://github.com/FastFilter/xor_singleheader

  - fastrange is "a fast alternative to the modulo reduction", but it
    turned out to significantly increase the average false positive
    rate when using (enhanced) double hashing.

    The traditional way to fairly map an integer value to the range
    [0,m) is the modulo reduction, i.e. 'h % m'.  Alas, integer
    division is a costly instruction, and we have to do it a lot while
    checking modified path Bloom filters for every commit.  Fastrange
    proposes to use the combination of an integer multiplication and a
    shift instead: '((uint64_t)h * (uint64_t)m) >> 32', which is
    supposed to be faster while being just as fair.

    In my benchmarks fastrange did indeed reduce the time spent
    checking modified path Bloom filters by up to ~10% (no table about
    this), though that's usually only a fraction of the total runtime
    of a pathspec-limited revision walk.  Unfortunately, as the table
    below shows, it also increased the average false positive rate by
    about 5x when using enhanced double hashing.
    I haven't yet investigated why this happens.  However, fastrange
    is a viable option when using the same hash function with 'k'
    different seeds, as in that case it's on par with the good old %
    operator, and in fact its false positive rate happens to be
    slightly lower.

                         Enhanced        |
                      double hashing     |        7 seeds
                   % operator  fastrange | % operator  fastrange
    -------------------------------------+----------------------
    android-base   *0.004155%  0.013987% |  0.004214%  0.004873%
    cmssw           0.003953%  0.006129% | *0.003344%  0.005513%
    cpython         0.016607%  0.033979% | *0.016120%  0.018069%
    elasticsearch   0.003249% *0.003149% |  0.003167%  0.003333%
    gcc             0.008919%  0.013098% |  0.010281% *0.007324%
    gecko-dev       0.000725%  0.000923% |  0.000804% *0.000631%
    git             0.069921%  0.106424% | *0.063025%  0.065347%
    glibc           0.016389%  0.025323% | *0.016278%  0.016796%
    go             *0.012616%  0.020914% |  0.013009%  0.012915%
    homebrew-cask   0.009096%  0.182016% |  0.007199% *0.006937%
    homebrew-core   0.005360%  0.092957% | *0.003041%  0.004251%
    jdk             0.006431%  0.005732% |  0.005873% *0.005673%
    linux           0.007494%  0.013862% |  0.007764% *0.007127%
    llvm-project    0.003913%  0.004829% |  0.003367% *0.002943%
    rails           0.013134%  0.019524% | *0.012708%  0.012979%
    rust            0.025509%  0.029824% | *0.024245%  0.025174%
    tensorflow      0.008243%  0.012485% | *0.007907%  0.008120%
    webkit         *0.001007%  0.001044% | *0.000999%  0.001220%
    -------------------------------------+----------------------
    all             0.006085%  0.029034% |  0.005646% *0.005579%
    w/o homebrew    0.005968%  0.009478% |  0.005888% *0.005662%

    https://github.com/lemire/fastrange

  - fastmod "is a header file for fast 32-bit division remainders on
    64-bit hardware", promising faster computation when the same
    divisor is (re)used for multiple remainder calculations.  This is
    exactly what we are doing when checking a modified path Bloom
    filter, as we have to map a bunch of unsigned 32 bit hash values
    to bit positions with pos = hash % nr_bits, using the same nr_bits
    for all hash values.
    However, benchmarking has not shown conclusive improvements: while
    fastmod slightly reduces the time spent checking modified path
    Bloom filters in some repositories, it slightly increases it in
    others, and any effect on the runtime of the whole
    pathspec-limited revision walk is below the noise.  Perhaps we
    have too many "embedded" modified path Bloom filters that we can
    check with a mask.  Or perhaps my CPU is too old and its integer
    multiplication instruction is not that much faster compared to
    integer division.

    https://github.com/lemire/fastmod

  - Varints.  Using 4 bytes for the size of each Bloom filter in the
    Modified Path Bloom Filters chunk is a lot more than necessary.
    How much space can be saved by using varints?  How much is the
    runtime overhead of decoding those varints?  How can we ensure
    that varint decoding doesn't read past the end of the mmapped
    memory region?

  - Path-depth-dependent number of hash functions.

    As shown later in this patch series, we can drastically reduce the
    average false positive rate by checking the presence of the
    pathspec's leading directories as well.  With that change the
    false positive rate will depend on the depth of the path in the
    directory tree as well: the deeper the path (i.e.
    the more leading directories can be checked), the lower the false
    positive rate, sometimes maybe even unnecessarily low (0% is not
    an approximation: there was not a single false positive, though
    the number of checked paths at those depths is quite low):

                          linux.git                    webkit.git
              Number of                     Number of
    Path      checked     Average false     checked     Average false
    depth     paths at    positive rate     paths at    positive rate
              that depth  at that depth     that depth  at that depth
    --------------------------------------------------------------
       1            2     0.31974050%             0     n/a
       2           97     0.06539501%            16     0.04927891%
       3         1052     0.01782569%           176     0.00472128%
       4         1674     0.00396338%           586     0.00399830%
       5         1315     0.00350913%           775     0.00089653%
       6          548     0.00149110%          1396     0.00025000%
       7          113     0.00057555%          1004     0.00004777%
       8          124     0.00085726%           397     0.00001380%
       9           34     0.00025182%           311     0.00001468%
      10           14     0%                    250     0.00000182%
      11            8     0%                     45     0%
      12           19     0%                     39     0%
      13            0     n/a                     3     0%
      14            0     n/a                     1     0%
      15            0     n/a                     1     0%
    --------------------------------------------------------------
    all          5000     0.007607%            5000     0.001012%

    So we could use fewer hashes when adding deep paths to modified
    path Bloom filters, saving some space while still achieving a
    quite low false positive rate.  Or, when aiming for a consistently
    lower false positive rate, we could use more hashes only for
    shallower paths.
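The leading-directory check mentioned above boils down to querying the
filter for every prefix of the pathspec as well as for the full path;
generating those prefixes can be sketched as (my code, not from the
series):

```python
def leading_directories(path):
    """Yield 'arch', 'arch/x86', 'arch/x86/include' for
    'arch/x86/include/asm.h', i.e. every leading directory without a
    trailing '/', matching how directories are added to the filters."""
    components = path.split('/')
    for depth in range(1, len(components)):
        yield '/'.join(components[:depth])

assert list(leading_directories('arch/x86/include/asm.h')) == \
    ['arch', 'arch/x86', 'arch/x86/include']
```

A revision walk can then skip a commit as soon as any one of these
prefixes (or the full path) is definitely absent from the commit's
filter, which is what drives the false positive rate down for deeper
paths.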
[1] The repositories used in the above benchmarks are:

    android-base:   https://android.googlesource.com/platform/frameworks/base
    cmssw:          https://github.com/cms-sw/cmssw
    cpython:        https://github.com/python/cpython
    elasticsearch:  https://github.com/elastic/elasticsearch
    gcc:            git://gcc.gnu.org/git/gcc.git
    gecko-dev:      https://github.com/mozilla/gecko-dev
    git:            https://git.kernel.org/pub/scm/git/git.git
    glibc:          git://sourceware.org/git/glibc.git
    go:             https://github.com/golang/go
    homebrew-cask:  https://github.com/homebrew/homebrew-cask
    homebrew-core:  https://github.com/homebrew/homebrew-core
    jdk:            https://github.com/openjdk/jdk
    linux:          https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
    llvm-project:   https://github.com/llvm/llvm-project.git
    rails:          https://github.com/rails/rails
    rust:           https://github.com/rust-lang/rust
    tensorflow:     https://github.com/tensorflow/tensorflow
    webkit:         git://git.webkit.org/WebKit.git

[2] First PoC Bloom filter experiment:
    https://public-inbox.org/git/20181009193445.21908-1-szeder.dev@gmail.com/

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 .../technical/commit-graph-format.txt         | 125 ++++++++++++++++++
 1 file changed, 125 insertions(+)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index efd0f99d8b..6b041f94ee 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -99,6 +99,131 @@ CHUNK DATA:
       file's OID Lookup chunk is equal to i plus the number of commits in
       all base graphs.
 
       If B is non-zero, this chunk must exist.
 
+  Modified Path Bloom Filter Index (ID: {'M', 'P', 'B', 'I'}) [Optional]
+      2 + (N * 8) bytes
+
+      * 1-byte number (K) of hashes per path
+
+      * An array of 8 byte entries, one for each of the N commits stored
+        in the commit-graph, with the ith entry associated with the ith
+        commit in the OID Lookup chunk.
+      Each entry should be interpreted as follows:
+
+      - If all bits in the 8 bytes are set, then there is no modified
+        path Bloom filter stored for this commit.
+
+        Thou shalt not store a modified path Bloom filter for a commit
+        that modifies a submodule (gitlink file)!
+
+      - If the most significant bit of the first byte is set, then the
+        remaining 63 bits represent the bit array of an "embedded"
+        Bloom filter containing the set of paths that were modified
+        between the associated commit and its first parent, or in case
+        of a root commit between the associated commit and the empty
+        tree.  All embedded modified path Bloom filters use the same
+        hashing scheme that is used in the Modified Path Bloom Filters
+        chunk, see below.
+
+      - If the second most significant bit of the first byte is set,
+        then the last four bytes form an array position into the
+        Modified Path Bloom Filter Merge Index chunk.  The remaining
+        bits in the first four bytes should be set to 0.  This can
+        optionally be used to store Bloom filters for all parents of
+        merge commits.
+        [TODO: Only the last four bytes?  Shouldn't that be the last
+        62 bits?!]
+
+      - If the two most significant bits of the first byte are unset,
+        then the entry represents an 8 byte offset pointing to a Bloom
+        filter in the Modified Path Bloom Filters chunk, which contains
+        the set of paths that were modified between the associated
+        commit and its first parent, or in case of a root commit
+        between the associated commit and the empty tree.  This offset
+        is relative to the start of the Modified Path Bloom Filters
+        chunk.  Multiple entries can point to the same offset.
+
+  Modified Path Bloom Filter Merge Index (ID: {'M', 'P', 'B', 'M'}) [Optional]
+      An array of 8 byte entries.
+      If a merge commit's entry in the Modified Path Bloom Filter
+      Index chunk stores an array position into this chunk, then the
+      entry at that position is associated with that merge commit and
+      its first parent, the next entry with that merge commit and its
+      second parent, etc. for all parents of that merge commit.
+
+      Each entry should be interpreted as follows, similar to the
+      entries in the Modified Path Bloom Filter Index chunk:
+
+      - If all bits in the 8 bytes are set, then there is no modified
+        path Bloom filter stored for the associated merge commit and
+        its corresponding parent.
+
+      - If the MSB of the first byte is set, then the remaining 63
+        bits represent the bit array of an "embedded" Bloom filter
+        containing the set of paths that were modified between the
+        associated merge commit and its corresponding parent.  All
+        embedded modified path Bloom filters use the same hashing
+        scheme that is used in the Modified Path Bloom Filters chunk,
+        see below.
+
+      - If the most significant bit of the first byte is unset, then
+        the entry represents an 8 byte offset pointing to a Bloom
+        filter in the Modified Path Bloom Filters chunk, which
+        contains the set of paths that were modified between the
+        associated merge commit and its corresponding parent.  This
+        offset is relative to the start of the Modified Path Bloom
+        Filters chunk.  Multiple entries can point to the same offset.
+
+      This chunk should not exist if the commit-graph file doesn't
+      contain a Modified Path Bloom Filter Index chunk.  This chunk
+      can be omitted if the Modified Path Bloom Filter Index doesn't
+      contain any array positions pointing into it.  This chunk can be
+      omitted even when the commit-graph contains merge commits.
+
+  Modified Path Bloom Filters (ID: {'M', 'P', 'B', 'F'}) [Optional]
+      A number of consecutive modified path Bloom filters of varying
+      sizes.
+ Each Bloom filter consists of 4 + M bytes: + + - The first 4 bytes specify the number m of bits in the Bloom + filter's bit array. + + - The next M bytes hold the Bloom filter's bit array of m bits. + + The bits in the array are indexed in network byte order, i.e. the + array's 0th bit is the least significant bit of the last byte, and + the (m-1)th bit is in the first byte. If m is not a multiple of 8, + then the unused leading bits should be set to 0. + + For each path (including leading directories) modified between a + commit and its parent, K bit positions should be set using the + following hashing scheme: + + for i in [0, K) + bit_pos[i] = (hash1 + i * hash2 + (i * i * i - i) / 6) % m + + where hash1 and hash2 are the 32 bit MurmurHash3 hashes of the path + with seeds 0xe83c5163 and 0x3b376b0c, respectively. These bit + positions should be calculated using 32 bit unsigned integer + arithmetic. The directory separator is a single '/'. Directories + should be added without a trailing '/'. + + The order of Bloom filters in this chunk is unspecified. Multiple + entries in the Modified Path Bloom Filter Index or Modified Path + Bloom Filter Merge Index chunks can point to the same Bloom filter + in this chunk. + + This chunk should not exist if the commit-graph doesn't contain a + Modified Path Bloom Filter Index chunk. This chunk can be omitted + if neither the Modified Path Bloom Filter Index nor the Modified + Path Bloom Filter Merge Index chunks contain any offsets pointing + into it. + + Modified Path Bloom Filter Excludes (ID: {'M', 'P', 'B', 'X'}) [Optional] + A number of consecutive null terminated strings in memcmp() order. + Paths listed in this chunk should not be added to any modified path + Bloom filter. + [TODO: Clarify! Only literal matches, or should we support + globbing/patterns as well? Only full match, or leading directories + as well?]
+ + This chunk should not exist if the commit-graph doesn't contain a + Modified Path Bloom Filter Index chunk. + TRAILER: H-byte HASH-checksum of all of the above. -- 2.27.0.rc1.431.g5c813f95dc
We'll use Bloom filters to store the paths modified between each commit and one of its parents. This means many small Bloom filters, which puts some constraints on the hashing scheme, as discussed in the previous commit. However, Bloom filters might turn out to be useful in other parts of Git as well, where they might be free from those constraints. So add a Bloom filter implementation that is fairly generic, and copes only with allocation, initialization and release of the filter and with manipulating the bits in its bit array, but leaves all the hashing to the user of the filter. Unfortunately, it's still somewhat specific, though, in how it maps the hashes to specific bits in memory, i.e. how it maps the hashes to array indices and how the bit array is indexed. We'll see how that will pan out... --- Makefile | 1 + bloom-filter.c | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++ bloom-filter.h | 47 ++++++++++++++++++++++++++ 3 files changed, 139 insertions(+) create mode 100644 bloom-filter.c create mode 100644 bloom-filter.h diff --git a/Makefile b/Makefile index 09f98b777c..a9db1e8d9b 100644 --- a/Makefile +++ b/Makefile @@ -839,6 +839,7 @@ LIB_OBJS += base85.o LIB_OBJS += bisect.o LIB_OBJS += blame.o LIB_OBJS += blob.o +LIB_OBJS += bloom-filter.o LIB_OBJS += branch.o LIB_OBJS += bulk-checkin.o LIB_OBJS += bundle.o diff --git a/bloom-filter.c b/bloom-filter.c new file mode 100644 index 0000000000..919da26462 --- /dev/null +++ b/bloom-filter.c @@ -0,0 +1,91 @@ +#include "bloom-filter.h" + +void bloom_filter_init(struct bloom_filter *bf, unsigned int nr_hashes, + uint32_t nr_elements) +{ + /* n * k / ln(2) ≈ n * k / 0.69315 ≈ n * k * 10 / 7 */ + uint32_t nr_bits = st_mult(st_mult(nr_elements, nr_hashes), 10) / 7; + uint32_t nr_bytes = st_add(nr_bits, 7) / 8; + /* + * But we round up to fully utilize all bytes, thus lowering the + * probability of false positives a bit. 
+ */ + bf->nr_bits = nr_bytes * 8; + bf->bits = xcalloc(nr_bytes, sizeof(*(bf->bits))); +} + +void bloom_filter_init_with_size(struct bloom_filter *bf, uint32_t nr_bits) +{ + uint32_t nr_bytes = st_add(nr_bits, 7) / 8; + bf->nr_bits = nr_bits; + bf->bits = xcalloc(nr_bytes, sizeof(*(bf->bits))); +} + +void bloom_filter_free(struct bloom_filter *bf) +{ + FREE_AND_NULL(bf->bits); + bf->nr_bits = 0; +} + +uint32_t bloom_filter_bytes(struct bloom_filter *bf) +{ + return (bf->nr_bits + 7) / 8; +} + +void bloom_filter_clear_all_bits(struct bloom_filter *bf) +{ + memset(bf->bits, 0, bloom_filter_bytes(bf)); +} + +void bloom_filter_set_all_bits(struct bloom_filter *bf) +{ + memset(bf->bits, 0xff, bloom_filter_bytes(bf)); +} + +static inline uint32_t bit_offset(uint32_t nr_bits, uint32_t hash) +{ + return hash % nr_bits; +} + +static inline uint32_t byte_offset(uint32_t nr_bits, uint32_t bit_offset) +{ + return (nr_bits - 1) / 8 - bit_offset / 8; +} + +static inline uint8_t byte_mask(uint32_t bit_offset) +{ + return 1 << (bit_offset % 8); +} + +static inline void bloom_filter_set_one_bit(struct bloom_filter *bf, + uint32_t hash) +{ + uint32_t offset = bit_offset(bf->nr_bits, hash); + bf->bits[byte_offset(bf->nr_bits, offset)] |= byte_mask(offset); +} + +void bloom_filter_set_bits(struct bloom_filter *bf, const uint32_t *hashes, + unsigned int nr_hashes) +{ + unsigned int i; + for (i = 0; i < nr_hashes; i++) + bloom_filter_set_one_bit(bf, hashes[i]); +} + +static inline int bloom_filter_check_one_bit(struct bloom_filter *bf, + uint32_t hash) +{ + uint32_t offset = bit_offset(bf->nr_bits, hash); + return bf->bits[byte_offset(bf->nr_bits, offset)] & byte_mask(offset); +} + +enum bloom_result bloom_filter_check_bits(struct bloom_filter *bf, + const uint32_t *hashes, + unsigned int nr_hashes) +{ + unsigned int i; + for (i = 0; i < nr_hashes; i++) + if (!bloom_filter_check_one_bit(bf, hashes[i])) + return BLOOM_DEFINITELY_NOT; + return BLOOM_POSSIBLY_YES; +} diff --git 
a/bloom-filter.h b/bloom-filter.h new file mode 100644 index 0000000000..2fcd4a2c19 --- /dev/null +++ b/bloom-filter.h @@ -0,0 +1,47 @@ +#ifndef BLOOM_FILTER_H +#define BLOOM_FILTER_H + +#include "git-compat-util.h" + +enum bloom_result { + /* + * BLOOM_CANT_TELL is a value that a caller can use to report that + * a Bloom filter is not available; bloom_filter_check_bits() will + * never return it. + */ + BLOOM_CANT_TELL = -1, + BLOOM_DEFINITELY_NOT = 0, + BLOOM_POSSIBLY_YES = 1 +}; + +struct bloom_filter { + uint32_t nr_bits; + uint8_t *bits; +}; + +/* + * Initialize a Bloom filter with the number of bits that is (close to) + * optimal to hold the given number of elements using the given number + * of hashes per element. + */ +void bloom_filter_init(struct bloom_filter *bf, unsigned int nr_hashes, + uint32_t nr_elements); + +/* Initialize a Bloom filter with the given number of bits */ +void bloom_filter_init_with_size(struct bloom_filter *bf, uint32_t nr_bits); + +void bloom_filter_free(struct bloom_filter *bf); + +/* Return the size of the Bloom filter's bit array in bytes */ +uint32_t bloom_filter_bytes(struct bloom_filter *bf); + +void bloom_filter_clear_all_bits(struct bloom_filter *bf); +void bloom_filter_set_all_bits(struct bloom_filter *bf); + +void bloom_filter_set_bits(struct bloom_filter *bf, const uint32_t *hashes, + unsigned int nr_hashes); +enum bloom_result bloom_filter_check_bits(struct bloom_filter *bf, + const uint32_t *hashes, + unsigned int nr_hashes); + +#endif -- 2.27.0.rc1.431.g5c813f95dc
Import the streaming-capable public domain Murmur3 hash function implementation from https://github.com/aappleby/smhasher/. It's imported as-is, with 2 space indentation (the horror!) and whitespace errors... The reason for importing the streaming-capable implementation is that we might get some benefit when hashing the paths 'dir', 'dir/subdir', 'dir/subdir/file' one after the other. --- Makefile | 1 + compat/PMurHash.c | 291 ++++++++++++++++++++++++++++++++++++++++++++++ compat/PMurHash.h | 62 ++++++++++ 3 files changed, 354 insertions(+) create mode 100644 compat/PMurHash.c create mode 100644 compat/PMurHash.h diff --git a/Makefile b/Makefile index a9db1e8d9b..1338f10796 100644 --- a/Makefile +++ b/Makefile @@ -853,6 +853,7 @@ LIB_OBJS += commit.o LIB_OBJS += commit-graph.o LIB_OBJS += commit-reach.o LIB_OBJS += compat/obstack.o +LIB_OBJS += compat/PMurHash.o LIB_OBJS += compat/terminal.o LIB_OBJS += config.o LIB_OBJS += connect.o diff --git a/compat/PMurHash.c b/compat/PMurHash.c new file mode 100644 index 0000000000..df8842aa29 --- /dev/null +++ b/compat/PMurHash.c @@ -0,0 +1,291 @@ +/*----------------------------------------------------------------------------- + * MurmurHash3 was written by Austin Appleby, and is placed in the public + * domain. + * + * This implementation was written by Shane Day, and is also public domain. + * + * This is a portable ANSI C implementation of MurmurHash3_x86_32 (Murmur3A) + * with support for progressive processing. + */ + +/*----------------------------------------------------------------------------- + +If you want to understand the MurmurHash algorithm you would be much better +off reading the original source. Just point your browser at: +http://code.google.com/p/smhasher/source/browse/trunk/MurmurHash3.cpp + + +What this version provides? + +1. Progressive data feeding. Useful when the entire payload to be hashed +does not fit in memory or when the data is streamed through the application.
+Also useful when hashing a number of strings with a common prefix. A partial +hash of a prefix string can be generated and reused for each suffix string. + +2. Portability. Plain old C so that it should compile on any old compiler. +Both CPU endian and access-alignment neutral, but avoiding inefficient code +when possible depending on CPU capabilities. + +3. Drop in. I personally like nice self contained public domain code, making it +easy to pilfer without loads of refactoring to work properly in the existing +application code & makefile structure and mucking around with licence files. +Just copy PMurHash.h and PMurHash.c and you're ready to go. + + +How does it work? + +We can only process entire 32 bit chunks of input, except for the very end +that may be shorter. So along with the partial hash we need to give back to +the caller a carry containing up to 3 bytes that we were unable to process. +This carry also needs to record the number of bytes the carry holds. I use +the low 2 bits as a count (0..3) and the carry bytes are shifted into the +high byte in stream order. + +To handle endianess I simply use a macro that reads a uint32_t and define +that macro to be a direct read on little endian machines, a read and swap +on big endian machines, or a byte-by-byte read if the endianess is unknown. + +-----------------------------------------------------------------------------*/ + + +#include "PMurHash.h" + +/* I used ugly type names in the header to avoid potential conflicts with + * application or system typedefs & defines. 
Since I'm not including any more + * headers below here I can rename these so that the code reads like C99 */ +#undef uint32_t +#define uint32_t MH_UINT32 +#undef uint8_t +#define uint8_t MH_UINT8 + +/* MSVC warnings we choose to ignore */ +#if defined(_MSC_VER) + #pragma warning(disable: 4127) /* conditional expression is constant */ +#endif + +/*----------------------------------------------------------------------------- + * Endianess, misalignment capabilities and util macros + * + * The following 3 macros are defined in this section. The other macros defined + * are only needed to help derive these 3. + * + * READ_UINT32(x) Read a little endian unsigned 32-bit int + * UNALIGNED_SAFE Defined if READ_UINT32 works on non-word boundaries + * ROTL32(x,r) Rotate x left by r bits + */ + +/* Convention is to define __BYTE_ORDER == to one of these values */ +#if !defined(__BIG_ENDIAN) + #define __BIG_ENDIAN 4321 +#endif +#if !defined(__LITTLE_ENDIAN) + #define __LITTLE_ENDIAN 1234 +#endif + +/* I386 */ +#if defined(_M_IX86) || defined(__i386__) || defined(__i386) || defined(i386) + #define __BYTE_ORDER __LITTLE_ENDIAN + #define UNALIGNED_SAFE +#endif + +/* gcc 'may' define __LITTLE_ENDIAN__ or __BIG_ENDIAN__ to 1 (Note the trailing __), + * or even _LITTLE_ENDIAN or _BIG_ENDIAN (Note the single _ prefix) */ +#if !defined(__BYTE_ORDER) + #if defined(__LITTLE_ENDIAN__) && __LITTLE_ENDIAN__==1 || defined(_LITTLE_ENDIAN) && _LITTLE_ENDIAN==1 + #define __BYTE_ORDER __LITTLE_ENDIAN + #elif defined(__BIG_ENDIAN__) && __BIG_ENDIAN__==1 || defined(_BIG_ENDIAN) && _BIG_ENDIAN==1 + #define __BYTE_ORDER __BIG_ENDIAN + #endif +#endif + +/* gcc (usually) defines xEL/EB macros for ARM and MIPS endianess */ +#if !defined(__BYTE_ORDER) + #if defined(__ARMEL__) || defined(__MIPSEL__) + #define __BYTE_ORDER __LITTLE_ENDIAN + #endif + #if defined(__ARMEB__) || defined(__MIPSEB__) + #define __BYTE_ORDER __BIG_ENDIAN + #endif +#endif + +/* Now find best way we can to READ_UINT32 */ +#if 
__BYTE_ORDER==__LITTLE_ENDIAN + /* CPU endian matches murmurhash algorithm, so read 32-bit word directly */ + #define READ_UINT32(ptr) (*((uint32_t*)(ptr))) +#elif __BYTE_ORDER==__BIG_ENDIAN + /* TODO: Add additional cases below where a compiler provided bswap32 is available */ + #if defined(__GNUC__) && (__GNUC__>4 || (__GNUC__==4 && __GNUC_MINOR__>=3)) + #define READ_UINT32(ptr) (__builtin_bswap32(*((uint32_t*)(ptr)))) + #else + /* Without a known fast bswap32 we're just as well off doing this */ + #define READ_UINT32(ptr) (ptr[0]|ptr[1]<<8|ptr[2]<<16|ptr[3]<<24) + #define UNALIGNED_SAFE + #endif +#else + /* Unknown endianess so last resort is to read individual bytes */ + #define READ_UINT32(ptr) (ptr[0]|ptr[1]<<8|ptr[2]<<16|ptr[3]<<24) + + /* Since we're not doing word-reads we can skip the messing about with realignment */ + #define UNALIGNED_SAFE +#endif + +/* Find best way to ROTL32 */ +#if defined(_MSC_VER) + #include <stdlib.h> /* Microsoft put _rotl declaration in here */ + #define ROTL32(x,r) _rotl(x,r) +#else + /* gcc recognises this code and generates a rotate instruction for CPUs with one */ + #define ROTL32(x,r) (((uint32_t)x << r) | ((uint32_t)x >> (32 - r))) +#endif + + +/*----------------------------------------------------------------------------- + * Core murmurhash algorithm macros */ + +#define C1 (0xcc9e2d51) +#define C2 (0x1b873593) + +/* This is the main processing body of the algorithm. It operates + * on each full 32-bits of input. 
*/ +#define DOBLOCK(h1, k1) do{ \ + k1 *= C1; \ + k1 = ROTL32(k1,15); \ + k1 *= C2; \ + \ + h1 ^= k1; \ + h1 = ROTL32(h1,13); \ + h1 = h1*5+0xe6546b64; \ + }while(0) + + +/* Append unaligned bytes to carry, forcing hash churn if we have 4 bytes */ +/* cnt=bytes to process, h1=name of h1 var, c=carry, n=bytes in c, ptr/len=payload */ +#define DOBYTES(cnt, h1, c, n, ptr, len) do{ \ + int _i = cnt; \ + while(_i--) { \ + c = c>>8 | *ptr++<<24; \ + n++; len--; \ + if(n==4) { \ + DOBLOCK(h1, c); \ + n = 0; \ + } \ + } }while(0) + +/*---------------------------------------------------------------------------*/ + +/* Main hashing function. Initialise carry to 0 and h1 to 0 or an initial seed + * if wanted. Both ph1 and pcarry are required arguments. */ +void PMurHash32_Process(uint32_t *ph1, uint32_t *pcarry, const void *key, int len) +{ + uint32_t h1 = *ph1; + uint32_t c = *pcarry; + + const uint8_t *ptr = (uint8_t*)key; + const uint8_t *end; + + /* Extract carry count from low 2 bits of c value */ + int n = c & 3; + +#if defined(UNALIGNED_SAFE) + /* This CPU handles unaligned word access */ + + /* Consume any carry bytes */ + int i = (4-n) & 3; + if(i && i <= len) { + DOBYTES(i, h1, c, n, ptr, len); + } + + /* Process 32-bit chunks */ + end = ptr + len/4*4; + for( ; ptr < end ; ptr+=4) { + uint32_t k1 = READ_UINT32(ptr); + DOBLOCK(h1, k1); + } + +#else /*UNALIGNED_SAFE*/ + /* This CPU does not handle unaligned word access */ + + /* Consume enough so that the next data byte is word aligned */ + int i = -(long)ptr & 3; + if(i && i <= len) { + DOBYTES(i, h1, c, n, ptr, len); + } + + /* We're now aligned. Process in aligned blocks. 
Specialise for each possible carry count */ + end = ptr + len/4*4; + switch(n) { /* how many bytes in c */ + case 0: /* c=[----] w=[3210] b=[3210]=w c'=[----] */ + for( ; ptr < end ; ptr+=4) { + uint32_t k1 = READ_UINT32(ptr); + DOBLOCK(h1, k1); + } + break; + case 1: /* c=[0---] w=[4321] b=[3210]=c>>24|w<<8 c'=[4---] */ + for( ; ptr < end ; ptr+=4) { + uint32_t k1 = c>>24; + c = READ_UINT32(ptr); + k1 |= c<<8; + DOBLOCK(h1, k1); + } + break; + case 2: /* c=[10--] w=[5432] b=[3210]=c>>16|w<<16 c'=[54--] */ + for( ; ptr < end ; ptr+=4) { + uint32_t k1 = c>>16; + c = READ_UINT32(ptr); + k1 |= c<<16; + DOBLOCK(h1, k1); + } + break; + case 3: /* c=[210-] w=[6543] b=[3210]=c>>8|w<<24 c'=[654-] */ + for( ; ptr < end ; ptr+=4) { + uint32_t k1 = c>>8; + c = READ_UINT32(ptr); + k1 |= c<<24; + DOBLOCK(h1, k1); + } + } +#endif /*UNALIGNED_SAFE*/ + + /* Advance over whole 32-bit chunks, possibly leaving 1..3 bytes */ + len -= len/4*4; + + /* Append any remaining bytes into carry */ + DOBYTES(len, h1, c, n, ptr, len); + + /* Copy out new running hash and carry */ + *ph1 = h1; + *pcarry = (c & ~0xff) | n; +} + +/*---------------------------------------------------------------------------*/ + +/* Finalize a hash. 
To match the original Murmur3A the total_length must be provided */ +uint32_t PMurHash32_Result(uint32_t h, uint32_t carry, uint32_t total_length) +{ + uint32_t k1; + int n = carry & 3; + if(n) { + k1 = carry >> (4-n)*8; + k1 *= C1; k1 = ROTL32(k1,15); k1 *= C2; h ^= k1; + } + h ^= total_length; + + /* fmix */ + h ^= h >> 16; + h *= 0x85ebca6b; + h ^= h >> 13; + h *= 0xc2b2ae35; + h ^= h >> 16; + + return h; +} + +/*---------------------------------------------------------------------------*/ + +/* Murmur3A compatable all-at-once */ +uint32_t PMurHash32(uint32_t seed, const void *key, int len) +{ + uint32_t h1=seed, carry=0; + PMurHash32_Process(&h1, &carry, key, len); + return PMurHash32_Result(h1, carry, len); +} diff --git a/compat/PMurHash.h b/compat/PMurHash.h new file mode 100644 index 0000000000..0ee645669e --- /dev/null +++ b/compat/PMurHash.h @@ -0,0 +1,62 @@ +/*----------------------------------------------------------------------------- + * MurmurHash3 was written by Austin Appleby, and is placed in the public + * domain. + * + * This implementation was written by Shane Day, and is also public domain. + * + * This is a portable ANSI C implementation of MurmurHash3_x86_32 (Murmur3A) + * with support for progressive processing. + */ + +/* ------------------------------------------------------------------------- */ +/* Determine what native type to use for uint32_t */ + +/* We can't use the name 'uint32_t' here because it will conflict with + * any version provided by the system headers or application. 
*/ + +/* First look for special cases */ +#if defined(_MSC_VER) + #define MH_UINT32 unsigned long +#endif + +/* If the compiler says it's C99 then take its word for it */ +#if !defined(MH_UINT32) && ( \ + defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L ) + #include <stdint.h> + #define MH_UINT32 uint32_t +#endif + +/* Otherwise try testing against max value macros from limit.h */ +#if !defined(MH_UINT32) + #include <limits.h> + #if (USHRT_MAX == 0xffffffffUL) + #define MH_UINT32 unsigned short + #elif (UINT_MAX == 0xffffffffUL) + #define MH_UINT32 unsigned int + #elif (ULONG_MAX == 0xffffffffUL) + #define MH_UINT32 unsigned long + #endif +#endif + +#if !defined(MH_UINT32) + #error Unable to determine type name for unsigned 32-bit int +#endif + +/* I'm yet to work on a platform where 'unsigned char' is not 8 bits */ +#define MH_UINT8 unsigned char + + +/* ------------------------------------------------------------------------- */ +/* Prototypes */ + +#ifdef __cplusplus +extern "C" { +#endif + +void PMurHash32_Process(MH_UINT32 *ph1, MH_UINT32 *pcarry, const void *key, int len); +MH_UINT32 PMurHash32_Result(MH_UINT32 h1, MH_UINT32 carry, MH_UINT32 total_length); +MH_UINT32 PMurHash32(MH_UINT32 seed, const void *key, int len); + +#ifdef __cplusplus +} +#endif -- 2.27.0.rc1.431.g5c813f95dc
Write the Modified Path Bloom Filter Index chunk if the 'core.modifiedPathBloomFilters' configuration variable is set. For now store only "no Bloom filter for this commit" entries. The format of the various modified path Bloom filter-related chunks is not (yet) compatible with split commit-graph files, so ignore the config setting and don't write modified path Bloom filters with '--split'. --- Documentation/config/core.txt | 6 ++++++ commit-graph.c | 40 +++++++++++++++++++++++++++++++++++ 2 files changed, 46 insertions(+) diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt index 9e440b160d..820f7d3bcf 100644 --- a/Documentation/config/core.txt +++ b/Documentation/config/core.txt @@ -588,6 +588,12 @@ core.commitGraph:: to parse the graph structure of commits. Defaults to true. See linkgit:git-commit-graph[1] for more information. +core.modifiedPathBloomFilters:: + EXPERIMENTAL!! + If true, then Git will store and use Bloom filters in the + commit-graph file to speed up pathspec-limited revision walks. + Defaults to false. + core.useReplaceRefs:: If set to `false`, behave as if the `--no-replace-objects` option was given on the command line.
See linkgit:git[1] and diff --git a/commit-graph.c b/commit-graph.c index e3adffd58b..28f147a418 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -21,6 +21,7 @@ #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */ #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */ #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */ +#define GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_INDEX 0x4d504249 /* "MPBI" */ #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16) @@ -33,6 +34,10 @@ #define GRAPH_LAST_EDGE 0x80000000 +#define GRAPH_MODIFIED_PATH_BLOOM_FILTER_NONE 0xffffffffffffffff + +#define GRAPH_MODIFIED_PATH_BLOOM_FILTER_DEFAULT_NR_HASHES 7 + #define GRAPH_HEADER_SIZE 8 #define GRAPH_FANOUT_SIZE (4 * 256) #define GRAPH_CHUNKLOOKUP_WIDTH 12 @@ -791,6 +796,11 @@ struct write_commit_graph_context { check_oids:1; const struct split_commit_graph_opts *split_opts; + + struct modified_path_bloom_filter_context { + unsigned use_modified_path_bloom_filters:1; + unsigned int num_hashes; + } mpbfctx; }; static int write_graph_chunk_fanout(struct hashfile *f, @@ -990,6 +1000,20 @@ static int write_graph_chunk_extra_edges(struct hashfile *f, return 0; } +static int write_graph_chunk_modified_path_bloom_index(struct hashfile *f, + struct write_commit_graph_context *ctx) +{ + const uint64_t no_bloom_filter = GRAPH_MODIFIED_PATH_BLOOM_FILTER_NONE; + int i; + + hashwrite(f, &ctx->mpbfctx.num_hashes, sizeof(uint8_t)); + for (i = 0; i < ctx->commits.nr; i++) { + display_progress(ctx->progress, ++ctx->progress_cnt); + hashwrite(f, &no_bloom_filter, sizeof(no_bloom_filter)); + } + return 0; +} + static int oid_compare(const void *_a, const void *_b) { const struct object_id *a = (const struct object_id *)_a; @@ -1428,6 +1452,14 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx) chunks[chunks_nr].write_fn = write_graph_chunk_extra_edges; chunks_nr++; } + if (ctx->mpbfctx.use_modified_path_bloom_filters) { + ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc); + 
chunks[chunks_nr].id = GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_INDEX; + chunks[chunks_nr].size = sizeof(uint8_t) + + ctx->commits.nr * sizeof(uint64_t); + chunks[chunks_nr].write_fn = write_graph_chunk_modified_path_bloom_index; + chunks_nr++; + } if (ctx->num_commit_graphs_after > 1) { ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc); chunks[chunks_nr].id = GRAPH_CHUNKID_BASE; @@ -1824,6 +1856,14 @@ int write_commit_graph(const char *obj_dir, ctx->check_oids = flags & COMMIT_GRAPH_WRITE_CHECK_OIDS ? 1 : 0; ctx->split_opts = split_opts; + git_config_get_bool("core.modifiedPathBloomFilters", &res); + if (res && ctx->split) { + warning("not writing modified path Bloom filters with --split"); + } else if (res) { + ctx->mpbfctx.use_modified_path_bloom_filters = 1; + ctx->mpbfctx.num_hashes = GRAPH_MODIFIED_PATH_BLOOM_FILTER_DEFAULT_NR_HASHES; + } + if (ctx->split) { struct commit_graph *g; prepare_commit_graph(ctx->r); -- 2.27.0.rc1.431.g5c813f95dc
--- commit-graph.c | 36 +++++++++++++++++++++++++++++++++++- 1 file changed, 35 insertions(+), 1 deletion(-) diff --git a/commit-graph.c b/commit-graph.c index 28f147a418..3a38f25ce9 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -14,6 +14,8 @@ #include "hashmap.h" #include "replace-object.h" #include "progress.h" +#include "bloom-filter.h" +#include "commit-slab.h" #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */ #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */ @@ -47,6 +49,20 @@ /* Remember to update object flag allocation in object.h */ #define REACHABLE (1u<<15) +struct modified_path_bloom_filter_info { + struct bloom_filter filter; +}; + +static void free_modified_path_bloom_filter_info_in_slab( + struct modified_path_bloom_filter_info *bfi) +{ + bloom_filter_free(&bfi->filter); +} + +define_commit_slab(modified_path_bloom_filters, + struct modified_path_bloom_filter_info); +static struct modified_path_bloom_filters modified_path_bloom_filters; + char *get_commit_graph_filename(const char *obj_dir) { char *filename = xstrfmt("%s/info/commit-graph", obj_dir); @@ -1008,8 +1024,18 @@ static int write_graph_chunk_modified_path_bloom_index(struct hashfile *f, hashwrite(f, &ctx->mpbfctx.num_hashes, sizeof(uint8_t)); for (i = 0; i < ctx->commits.nr; i++) { + struct commit *commit = ctx->commits.list[i]; + struct modified_path_bloom_filter_info *bfi; + display_progress(ctx->progress, ++ctx->progress_cnt); - hashwrite(f, &no_bloom_filter, sizeof(no_bloom_filter)); + + bfi = modified_path_bloom_filters_peek( + &modified_path_bloom_filters, commit); + + if (!bfi || !bfi->filter.nr_bits) + hashwrite(f, &no_bloom_filter, sizeof(no_bloom_filter)); + else + BUG("writing proper Bloom filters is not implemented yet"); } return 0; } @@ -1864,6 +1890,9 @@ int write_commit_graph(const char *obj_dir, ctx->mpbfctx.num_hashes = GRAPH_MODIFIED_PATH_BLOOM_FILTER_DEFAULT_NR_HASHES; } + init_modified_path_bloom_filters(&modified_path_bloom_filters); + } + if (ctx->split) { 
struct commit_graph *g; prepare_commit_graph(ctx->r); @@ -1984,6 +2013,11 @@ int write_commit_graph(const char *obj_dir, free(ctx->commit_graph_hash_after); } + if (ctx->mpbfctx.use_modified_path_bloom_filters) + deep_clear_modified_path_bloom_filters( + &modified_path_bloom_filters, + free_modified_path_bloom_filter_info_in_slab); + free(ctx); return res; -- 2.27.0.rc1.431.g5c813f95dc
Collect all paths modified by all commits, and write embedded modified path Bloom filters for those commits that modify few enough paths to fit into a 63 bit array using 7 hashes per path, i.e. 6 or fewer. For now all modified path Bloom filters are created from scratch; re-using filters from the existing commit-graph file will be done in a later patch in this series. Plug this into the custom revision walking loop in close_reachable(), so the iteration is at least somewhat friendly to the delta base cache, and we won't have yet another loop iterating over all commits. Take care that this loop might encounter the same commits multiple times (e.g. because more than one ref points to the same commit, or a commit is present in more than one pack file), and skip creating a modified path Bloom filter for already processed commits. We wouldn't need to be careful about this if we were to create modified path Bloom filters a bit later on, after the list of commits has been deduplicated, but at that point the commits are in SHA-1 order, i.e. effectively random, which is downright hostile to the caches. Perhaps processing the modified paths returned by the tree-diff machinery turned out to be more complicated than strictly necessary, but... So, my initial approach was to first let diff_tree_oid() and diffcore_std() do their thing and put all modified files into a diff queue, then iterate over that queue and add all paths (i.e. leading directories as well) to a hashmap with a suitable comparison function for deduplication, use the number of elements in that hashmap to properly size the Bloom filter's bit array, and finally iterate over all paths in the hashmap and add them to the Bloom filter. Simple, short, straightforward, almost idiomatic even, without subtle corner cases.
This approach, however, involved a lot of memory allocations: three allocations for each entry in the diff queue (IOW for each modified file), and one allocation for each hashmap entry (IOW for each modified path), not counting allocations for the hashmap's internals. All these memory allocations caused considerable overhead. This patch implements a more efficient, but more complex, approach: register callback functions in 'struct diff_options' for add/removal and modification, which process, deduplicate, and hash modified paths _while_ the tree-diff machinery scans and compares the two trees. Tree-diff calls these callback functions with the filenames in order [1], so the deduplication of modified paths basically boils down to skipping the common prefix with the previously processed path (and some subtlety in case of dir-file replacement, as explained in an in-code comment). This way we don't do any memory allocations for each modified file or path while processing the diff output, and the memory allocations that are still left are amortized over the whole commit-graph writing process. [1] The tree-diff machinery would call its callbacks with out-of-order filenames if it encountered a bogus tree object with unsorted entries. This doesn't affect the correctness of modified path Bloom filters, only their sizes: the worst thing that can happen is that this approach would count and hash some paths more than once, thus Bloom filters containing paths modified compared to such a bogus tree object would end up bigger than expected. And bogus unsorted tree objects cause all kinds of weird diff issues anyway... [TODO: The pathchange callback is no good.]
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> --- commit-graph.c | 243 ++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 238 insertions(+), 5 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index 3a38f25ce9..413605a29c 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -16,6 +16,8 @@ #include "progress.h" #include "bloom-filter.h" #include "commit-slab.h" +#include "diff.h" +#include "compat/PMurHash.h" #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */ #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */ @@ -37,8 +39,10 @@ #define GRAPH_LAST_EDGE 0x80000000 #define GRAPH_MODIFIED_PATH_BLOOM_FILTER_NONE 0xffffffffffffffff +#define GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS 63 #define GRAPH_MODIFIED_PATH_BLOOM_FILTER_DEFAULT_NR_HASHES 7 +#define GRAPH_MODIFIED_PATH_BLOOM_FILTER_DEFAULT_EMBEDDED_LIMIT 6 #define GRAPH_HEADER_SIZE 8 #define GRAPH_FANOUT_SIZE (4 * 256) @@ -773,6 +777,36 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c); } +static uint32_t modified_path_bloom_seeds[] = { + 0xe83c5163U, + 0x3b376b0cU, + 0x72441af7U, + 0x433860f3U, + 0x6d9617f4U, + 0x22705334U, + 0xc8af66abU, + 0x43b44ccfU, + 0xb84f767cU, + 0x1da177e4U +}; + +static void compute_modified_path_bloom_hashes_for_path(const char *path, + size_t len, unsigned int num_hashes, uint32_t *hashes) +{ + uint32_t h1, h2, i; + + h1 = PMurHash32(modified_path_bloom_seeds[0], path, len); + h2 = PMurHash32(modified_path_bloom_seeds[1], path, len); + + /* Equivalent to hashes[i] = h1 + i * h2 + (i * i * i - i) / 6 */ + hashes[0] = h1; + for (i = 1; i < num_hashes; i++) { + h1 += h2; + h2 += i; + hashes[i] = h1; + } +} + struct packed_commit_list { struct commit **list; int nr; @@ -816,6 +850,29 @@ struct write_commit_graph_context { struct modified_path_bloom_filter_context { unsigned use_modified_path_bloom_filters:1; unsigned int num_hashes; + /* + * Number of paths 
to be added to "embedded" modified path + * Bloom filters. + */ + int embedded_limit; + + struct diff_options diffopt; + + /* + * Used to deduplicate paths modified by the currenly + * processed commit. + */ + struct strbuf prev_path; + size_t *hashed_prefix_lengths; + int hashed_prefix_lengths_nr, hashed_prefix_lengths_alloc; + + /* + * The tree-diff callbacks will fill this array with + * 'num_hashes' hash values for each path modified by the + * currently processed commit. + */ + uint32_t *hashes; + int hashes_nr, hashes_alloc; } mpbfctx; }; @@ -1032,10 +1089,20 @@ static int write_graph_chunk_modified_path_bloom_index(struct hashfile *f, bfi = modified_path_bloom_filters_peek( &modified_path_bloom_filters, commit); - if (!bfi || !bfi->filter.nr_bits) + if (!bfi || !bfi->filter.nr_bits) { hashwrite(f, &no_bloom_filter, sizeof(no_bloom_filter)); - else - BUG("writing proper Bloom filters is not implemented yet"); + } else if (bfi->filter.nr_bits == GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS) { + uint8_t filterdata[sizeof(uint64_t)]; + memcpy(filterdata, bfi->filter.bits, + sizeof(filterdata)); + /* + * Set the most significant bit of the first byte to + * indicate embedded modified path Bloom Filter. + */ + filterdata[0] |= 1 << 7; + hashwrite(f, filterdata, sizeof(filterdata)); + } else + BUG("writing non-embedded Bloom filters is not implemented yet"); } return 0; } @@ -1074,9 +1141,162 @@ static int add_packed_commits(const struct object_id *oid, return 0; } +static void handle_modified_file( + struct modified_path_bloom_filter_context *mpbfctx, + const char *path) +{ + const char *prev = mpbfctx->prev_path.buf; + const char *p = path; + size_t common_prefix_len, already_hashed_prefix_len = 0; + int i; + + /* + * We want to hash each modified path only once, i.e. if a commit + * modifies both 'dir/foo' and 'dir/bar', then we want to hash + * 'dir' only once. 
+ */ + + /* + * Skip common prefix with the previously handled path, so we + * hash any common leading directories only once. + */ + while (*prev && *p && *prev == *p) { + prev++; + p++; + } + common_prefix_len = p - path; + + /* + * However, merely skipping the common prefix is not enough. + * + * If a commit removes a file (symlink, or submodule) and adds a + * directory with the same name (or vice versa), e.g. removes + * 'foo' and adds 'foo/bar', and adds/modifies/removes 'foo.c', + * then the tree-diff machinery will call this function with + * those paths in the following order: + * + * foo + * foo.c + * foo/bar + * + * So the first call hashes 'foo', the second hashes 'foo.c', so + * far so good. In the third call the common prefix of 'foo.c' + * and 'foo/bar' is 'foo', and then an immediate strchrnul(p, '/') + * would find the subsequent '/' and would hash 'foo' _again_. + * Not so good. + * + * To avoid this we maintain the length of already hashed common + * prefixes and skip one more character (i.e. the '/') if + * necessary. 
+ */ + for (i = 0; i < mpbfctx->hashed_prefix_lengths_nr; i++) { + if (common_prefix_len < mpbfctx->hashed_prefix_lengths[i]) { + mpbfctx->hashed_prefix_lengths_nr = i; + break; + } + + already_hashed_prefix_len = mpbfctx->hashed_prefix_lengths[i]; + } + if (common_prefix_len == already_hashed_prefix_len) + p = path + already_hashed_prefix_len + 1; + + do { + unsigned int new_hashes_nr = mpbfctx->hashes_nr + mpbfctx->num_hashes; + + p = strchrnul(p, '/'); + + ALLOC_GROW(mpbfctx->hashes, new_hashes_nr, + mpbfctx->hashes_alloc); + compute_modified_path_bloom_hashes_for_path(path, p - path, + mpbfctx->num_hashes, + mpbfctx->hashes + mpbfctx->hashes_nr); + mpbfctx->hashes_nr = new_hashes_nr; + + ALLOC_GROW(mpbfctx->hashed_prefix_lengths, + mpbfctx->hashed_prefix_lengths_nr + 1, + mpbfctx->hashed_prefix_lengths_alloc); + mpbfctx->hashed_prefix_lengths[mpbfctx->hashed_prefix_lengths_nr] = p - path; + mpbfctx->hashed_prefix_lengths_nr++; + } while (*p && p++); + + /* + * Update the previous path for the next iteration, copying only + * as many characters as necessary. 
+ */ + strbuf_setlen(&mpbfctx->prev_path, common_prefix_len); + strbuf_add(&mpbfctx->prev_path, path + common_prefix_len, + p - (path + common_prefix_len)); +} + +static void file_add_remove_cb(struct diff_options *options, + int addremove, unsigned mode, + const struct object_id *oid, + int oid_valid, const char *fullpath, + unsigned dirty_submodule) +{ + handle_modified_file(options->change_fn_data, fullpath); +} + +static void file_change_cb(struct diff_options *options, + unsigned old_mode, unsigned new_mode, + const struct object_id *old_oid, + const struct object_id *new_oid, + int old_oid_valid, int new_oid_valid, + const char *fullpath, + unsigned old_dirty_submodule, + unsigned new_dirty_submodule) +{ + handle_modified_file(options->change_fn_data, fullpath); +} + +static void create_modified_path_bloom_filter( + struct modified_path_bloom_filter_context *mpbfctx, + struct commit *commit) +{ + struct modified_path_bloom_filter_info *bfi; + struct object_id *parent_oid; + uintmax_t path_component_count; + + if (!mpbfctx->use_modified_path_bloom_filters) + return; + + /* + * This function can be called multiple times for the same commit + * (a commit can be present in multiple pack files, or multiple + * refs can point to the same commit) so check first whether we have + * already created a modified path Bloom filter for it. + */ + bfi = modified_path_bloom_filters_peek(&modified_path_bloom_filters, + commit); + if (bfi && bfi->filter.nr_bits) + return; + + parent_oid = commit->parents ? &commit->parents->item->object.oid : + NULL; + mpbfctx->hashes_nr = 0; + strbuf_reset(&mpbfctx->prev_path); + diff_tree_oid(parent_oid, &commit->object.oid, "", &mpbfctx->diffopt); + path_component_count = mpbfctx->hashes_nr / mpbfctx->num_hashes; + + if (path_component_count > mpbfctx->embedded_limit) { + /* Not implemented yet. 
*/ + } else { + bfi = modified_path_bloom_filters_at( + &modified_path_bloom_filters, commit); + + bloom_filter_init_with_size(&bfi->filter, + GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS); + bloom_filter_set_bits(&bfi->filter, mpbfctx->hashes, + mpbfctx->hashes_nr); + } +} + static void add_missing_parents(struct write_commit_graph_context *ctx, struct commit *commit) { struct commit_list *parent; + + create_modified_path_bloom_filter(&ctx->mpbfctx, commit); + for (parent = commit->parents; parent; parent = parent->next) { if (!(parent->item->object.flags & REACHABLE)) { ALLOC_GROW(ctx->oids.list, ctx->oids.nr + 1, ctx->oids.alloc); @@ -1888,9 +2108,18 @@ int write_commit_graph(const char *obj_dir, } else if (res) { ctx->mpbfctx.use_modified_path_bloom_filters = 1; ctx->mpbfctx.num_hashes = GRAPH_MODIFIED_PATH_BLOOM_FILTER_DEFAULT_NR_HASHES; - } + ctx->mpbfctx.embedded_limit = GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS / (ctx->mpbfctx.num_hashes * 10 / 7); init_modified_path_bloom_filters(&modified_path_bloom_filters); + + repo_diff_setup(ctx->r, &ctx->mpbfctx.diffopt); + ctx->mpbfctx.diffopt.flags.recursive = 1; + ctx->mpbfctx.diffopt.add_remove = file_add_remove_cb; + ctx->mpbfctx.diffopt.change = file_change_cb; + ctx->mpbfctx.diffopt.change_fn_data = &ctx->mpbfctx; + diff_setup_done(&ctx->mpbfctx.diffopt); + + strbuf_init(&ctx->mpbfctx.prev_path, PATH_MAX); } if (ctx->split) { @@ -2013,10 +2242,14 @@ int write_commit_graph(const char *obj_dir, free(ctx->commit_graph_hash_after); } - if (ctx->mpbfctx.use_modified_path_bloom_filters) + if (ctx->mpbfctx.use_modified_path_bloom_filters) { deep_clear_modified_path_bloom_filters( &modified_path_bloom_filters, free_modified_path_bloom_filter_info_in_slab); + strbuf_release(&ctx->mpbfctx.prev_path); + free(ctx->mpbfctx.hashed_prefix_lengths); + free(ctx->mpbfctx.hashes); + } free(ctx); -- 2.27.0.rc1.431.g5c813f95dc
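The hashing recurrence in compute_modified_path_bloom_hashes_for_path() above is compact enough to sanity-check in isolation. Below is a stand-alone sketch (the names derive_bloom_hashes and derive_bloom_hash_ref are illustrative, not part of the patch) that cross-checks the incremental updates against the closed form quoted in the patch's comment:

```c
#include <assert.h>
#include <stdint.h>

/* Derive 'n' Bloom filter hash values from two base hashes using the
 * same incremental recurrence as the patch: each step adds the
 * current h2 to h1, then advances h2 by the iteration counter. */
static void derive_bloom_hashes(uint32_t h1, uint32_t h2,
				unsigned int n, uint32_t *out)
{
	uint32_t i;

	out[0] = h1;
	for (i = 1; i < n; i++) {
		h1 += h2;
		h2 += i;
		out[i] = h1;
	}
}

/* Closed form quoted in the patch's comment:
 * hashes[i] = h1 + i * h2 + (i*i*i - i) / 6 (mod 2^32).
 * (i^3 - i) is always divisible by 6, so the division is exact. */
static uint32_t derive_bloom_hash_ref(uint32_t h1, uint32_t h2, uint32_t i)
{
	uint64_t cubic = ((uint64_t)i * i * i - i) / 6;

	return h1 + i * h2 + (uint32_t)cubic;
}
```

Expanding the recurrence shows why the two agree: after i steps h2 has grown by 1+2+...+(i-1), so h1 has accumulated i*h2 plus the sum of those triangular increments, which telescopes to (i^3 - i)/6.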
Load the Modified Path Bloom Filter Index (MPBI) chunk and check the
embedded modified path Bloom filters within to speed up
pathspec-limited revision walks.

Extend 'struct pathspec_item' to hold the hashes of each path for
modified path Bloom filters. Pathspecs are used by many Git commands,
and many of them don't do any revision walk at all, so don't
initialize those fields in parse_pathspec(), but only when we are
about to start a pathspec-limited revision walk, i.e. in
prepare_revision_walk(). Initialize these fields only in the
revs->pruning.pathspec instance, because that's the one used to limit
the revision walk. Note that some revision walks re-parse the
pathspecs, notably 'git log --follow' and the line-level log, but
they don't use this pathspec instance.

The table below shows the average runtime and speedup of

  git rev-list HEAD -- "$path"

for 5000+ randomly selected paths from each repository with and
without modified path Bloom filters:

                    Average runtime
                  without    with
                   MPBI      MPBI
  --------------------------------------------
  android-base    0.8780s     n/a
  cmssw           0.3143s   0.2104s   -33.1%
  cpython         0.7453s   0.3410s   -54.2%
  gcc             7.1852s   3.3633s   -53.2%
  gecko-dev       4.6113s     n/a
  git             0.6180s   0.2184s   -64.7%
  glibc           0.5618s   0.3245s   -42.2%
  jdk             0.0482s   0.0395s   -18.0%
  linux           0.7043s   0.4810s   -31.6%
  llvm-project    2.6844s     n/a
  rails           0.2784s   0.1792s   -35.6%
  rust            0.7757s   0.6073s   -21.7%
  webkit          1.9137s   1.4832s   -22.5%

The results in the homebrew-core repository can be spectacular, though
that repository looks as if it were specifically designed to show off
the benefits of this patch: over 98% of its almost 164k commits in its
linear history modify one file within the 'Formula' directory, and
that directory contains almost 5000 files. Consequently, almost all
commits will have a 63-bit "embedded" modified path Bloom filter
containing only 2 paths, resulting in a false positive rate of less
than 0.1%, and they can spare a lot of expensive tree entry scanning
to the end of that directory.
So looking at the history of the first and last files in that
directory:

                        Runtime                    Max RSS
                   without    with            without     with
                    MPBI      MPBI   Speedup    MPBI      MPBI
  -----------------------------------------------------------------
  Formula/a2ps.rb    8.390s   0.218s   38.5x   422884k   129212k
  Formula/zzz.rb    80.495s   0.176s  450.4x   423744k    68364k

('Formula/a2ps.rb' is modified 32 times and has two false positives,
while 'zzz.rb' is modified only twice and happens to have only a
single false positive; that's why querying the history of the former
is slower, even though it's at the beginning of the directory.)

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 146 +++++++++++++++++++++++++++++++++++++++++++++++++ commit-graph.h | 13 +++++ pathspec.c | 9 +++ pathspec.h | 12 ++++ revision.c | 18 ++++++ 5 files changed, 198 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index 413605a29c..fb24600bb3 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -26,6 +26,7 @@ #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */ #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */ #define GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_INDEX 0x4d504249 /* "MPBI" */ +#define GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_EXCLUDES 0x4d504258 /* "MPBX" */ #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16) @@ -308,6 +309,23 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd, chunk_repeated = 1; else graph->chunk_base_graphs = data + chunk_offset; + break; + + case GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_INDEX: + if (graph->chunk_mpbf_index) + chunk_repeated = 1; + else { + graph->chunk_mpbf_index = data + chunk_offset; + graph->chunk_mpbf_index_size = next_chunk_offset - chunk_offset; + } + break; + + case GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_EXCLUDES: + if (graph->chunk_mpbf_excludes) + chunk_repeated = 1; + else + graph->chunk_mpbf_excludes = data + chunk_offset; + break; } if (chunk_repeated) { @@ -324,6 +342,29 @@ struct commit_graph *parse_commit_graph(void *graph_map, int
fd, return NULL; } + if (graph->chunk_mpbf_index) { + int res = 0; + git_config_get_bool("core.modifiedPathBloomFilters", &res); + if (res) { + uint64_t expected_size = sizeof(uint8_t) + + graph->num_commits * sizeof(uint64_t); + if (graph->chunk_mpbf_index_size != expected_size) + BUG(_("for %u commits the Modified Path Bloom Filter Index chunk should be %ld bytes, but it's %ld"), + graph->num_commits, expected_size, + graph->chunk_mpbf_index_size); + graph->num_modified_path_bloom_hashes = *graph->chunk_mpbf_index; + if (!graph->num_modified_path_bloom_hashes) + BUG(_("Modified Path Bloom Filter Index chunk using 0 hashes")); + if (graph->chunk_mpbf_excludes) + /* + * Modified Path Bloom Filter Excludes are + * not supported yet. + */ + graph->chunk_mpbf_index = NULL; + } else + graph->chunk_mpbf_index = NULL; + } + return graph; } @@ -777,6 +818,68 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c); } +static int load_modified_path_bloom_filter_from_graph( + struct commit_graph *graph, struct commit *commit, + struct commit *parent, struct bloom_filter *bf) +{ + const uint8_t *bloom_index; + int first_parent = 0; + + if (commit->graph_pos == COMMIT_NOT_FROM_GRAPH) + return 0; + if (!graph) + return 0; + if (!graph->chunk_mpbf_index) + return 0; + + if (!commit->parents || commit->parents->item == parent) + first_parent = 1; + + bloom_index = graph->chunk_mpbf_index + sizeof(uint8_t) + + sizeof(uint64_t) * commit->graph_pos; + + if (bloom_index[0] & (1 << 7)) { + uint64_t v; + if (!first_parent) + return 0; + memcpy(&v, bloom_index, sizeof(v)); + if (v == GRAPH_MODIFIED_PATH_BLOOM_FILTER_NONE) + return 0; + + /* embedded modified path Bloom filter */ + bf->nr_bits = GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS; + bf->bits = (uint8_t*) bloom_index; + return 1; + } + /* support for non-embedded Bloom filters is not implemented yet. 
*/ + return 0; +} + +enum bloom_result check_modified_path_bloom_filter(struct repository *r, + struct pathspec *pathspec, struct commit *parent, + struct commit *commit) +{ + struct bloom_filter bf; + int i; + + if (!pathspec->can_use_modified_path_bloom_filters) + return BLOOM_CANT_TELL; + if (!load_modified_path_bloom_filter_from_graph(r->objects->commit_graph, + commit, parent, &bf)) + return BLOOM_CANT_TELL; + + for (i = 0; i < pathspec->nr; i++) { + struct pathspec_item *pi = &pathspec->items[i]; + + if (bloom_filter_check_bits(&bf, + pi->modified_path_bloom_hashes, + pi->modified_path_bloom_hashes_nr)) + return BLOOM_POSSIBLY_YES; + } + + return BLOOM_DEFINITELY_NOT; +} + static uint32_t modified_path_bloom_seeds[] = { 0xe83c5163U, 0x3b376b0cU, @@ -807,6 +910,49 @@ static void compute_modified_path_bloom_hashes_for_path(const char *path, } } +void init_pathspec_bloom_fields(struct repository *r, + struct pathspec *pathspec) +{ + const unsigned bloom_compatible_magic = PATHSPEC_LITERAL; + struct commit_graph *graph = r->objects->commit_graph; + int i; + + if (!graph) + return; + if (!graph->chunk_mpbf_index) + return; + if (!pathspec->nr) + return; + if (pathspec->has_wildcard) + return; + if (pathspec->magic & ~bloom_compatible_magic) + return; + + for (i = 0; i < pathspec->nr; i++) { + struct pathspec_item *pi = &pathspec->items[i]; + const char *path = pi->match; + size_t len = pi->len; + + /* + * Pathspec parsing has normalized away any consecutive + * slashes, but a trailing slash might still be present, + * "remove" it. 
+ */ + if (path[len - 1] == '/') + len--; + + pi->modified_path_bloom_hashes_nr = graph->num_modified_path_bloom_hashes; + ALLOC_ARRAY(pi->modified_path_bloom_hashes, + pi->modified_path_bloom_hashes_nr); + + compute_modified_path_bloom_hashes_for_path(path, len, + graph->num_modified_path_bloom_hashes, + pi->modified_path_bloom_hashes); + } + + pathspec->can_use_modified_path_bloom_filters = 1; +} + struct packed_commit_list { struct commit **list; int nr; diff --git a/commit-graph.h b/commit-graph.h index 847cd25cfc..09dfc16932 100644 --- a/commit-graph.h +++ b/commit-graph.h @@ -11,6 +11,8 @@ struct commit; struct repository; struct raw_object_store; struct string_list; +struct bloom_filter; +struct pathspec; char *get_commit_graph_filename(const char *obj_dir); int open_commit_graph(const char *graph_file, int *fd, struct stat *st); @@ -38,6 +40,12 @@ void load_commit_graph_info(struct repository *r, struct commit *item); struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit *c); +enum bloom_result check_modified_path_bloom_filter(struct repository *r, + struct pathspec *pathspec, struct commit *parent, + struct commit *commit); +void init_pathspec_bloom_fields(struct repository *r, + struct pathspec *pathspec); + struct commit_graph { int graph_fd; @@ -59,6 +67,11 @@ struct commit_graph { const unsigned char *chunk_commit_data; const unsigned char *chunk_extra_edges; const unsigned char *chunk_base_graphs; + const unsigned char *chunk_mpbf_index; + uint64_t chunk_mpbf_index_size; + const unsigned char *chunk_mpbf_excludes; + + uint8_t num_modified_path_bloom_hashes; }; struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st); diff --git a/pathspec.c b/pathspec.c index 128f27fcb7..01e7ae6944 100644 --- a/pathspec.c +++ b/pathspec.c @@ -406,6 +406,8 @@ static void init_pathspec_item(struct pathspec_item *item, unsigned flags, item->attr_check = NULL; item->attr_match = NULL; item->attr_match_nr = 0; + 
item->modified_path_bloom_hashes_nr = 0; + item->modified_path_bloom_hashes = NULL; /* PATHSPEC_LITERAL_PATH ignores magic */ if (flags & PATHSPEC_LITERAL_PATH) { @@ -674,6 +676,12 @@ void copy_pathspec(struct pathspec *dst, const struct pathspec *src) } d->attr_check = attr_check_dup(s->attr_check); + + ALLOC_ARRAY(d->modified_path_bloom_hashes, + d->modified_path_bloom_hashes_nr); + COPY_ARRAY(d->modified_path_bloom_hashes, + s->modified_path_bloom_hashes, + d->modified_path_bloom_hashes_nr); } } @@ -684,6 +692,7 @@ void clear_pathspec(struct pathspec *pathspec) for (i = 0; i < pathspec->nr; i++) { free(pathspec->items[i].match); free(pathspec->items[i].original); + free(pathspec->items[i].modified_path_bloom_hashes); for (j = 0; j < pathspec->items[i].attr_match_nr; j++) free(pathspec->items[i].attr_match[j].value); diff --git a/pathspec.h b/pathspec.h index 454ce364fa..0c661e1b03 100644 --- a/pathspec.h +++ b/pathspec.h @@ -32,6 +32,7 @@ struct pathspec { unsigned int has_wildcard:1; unsigned int recursive:1; unsigned int recurse_submodules:1; + unsigned int can_use_modified_path_bloom_filters:1; unsigned magic; int max_depth; struct pathspec_item { @@ -52,6 +53,17 @@ struct pathspec { } match_mode; } *attr_match; struct attr_check *attr_check; + + /* + * These fields are only relevant during pathspec-limited + * revision walks, init_pathspec_item() leaves them + * zero-initialized, but copy_pathspec() copies them, + * and clear_pathspec() releases the allocated memory. + * IOW: they are only valid if 'struct pathspec's + * 'can_use_modified_path_bloom_filters' bit is set. 
+ */ + uint32_t modified_path_bloom_hashes_nr; + uint32_t *modified_path_bloom_hashes; } *items; }; diff --git a/revision.c b/revision.c index 9dac1262ef..6ec4bc0cb1 100644 --- a/revision.c +++ b/revision.c @@ -29,6 +29,7 @@ #include "prio-queue.h" #include "hashmap.h" #include "utf8.h" +#include "bloom-filter.h" volatile show_early_output_fn_t show_early_output; @@ -629,6 +630,7 @@ static int rev_compare_tree(struct rev_info *revs, { struct tree *t1 = get_commit_tree(parent); struct tree *t2 = get_commit_tree(commit); + enum bloom_result bloom_ret; if (!t1) return REV_TREE_NEW; @@ -653,6 +655,12 @@ static int rev_compare_tree(struct rev_info *revs, return REV_TREE_SAME; } + bloom_ret = check_modified_path_bloom_filter(revs->repo, + &revs->pruning.pathspec, + parent, commit); + if (bloom_ret == BLOOM_DEFINITELY_NOT) + return REV_TREE_SAME; + tree_difference = REV_TREE_SAME; revs->pruning.flags.has_changes = 0; diff_tree_oid(&t1->object.oid, &t2->object.oid, "", &revs->pruning); @@ -662,10 +670,17 @@ static int rev_compare_tree(struct rev_info *revs, static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit) { struct tree *t1 = get_commit_tree(commit); + enum bloom_result bloom_ret; if (!t1) return 0; + bloom_ret = check_modified_path_bloom_filter(revs->repo, + &revs->pruning.pathspec, + NULL, commit); + if (bloom_ret == BLOOM_DEFINITELY_NOT) + return 1; + tree_difference = REV_TREE_SAME; revs->pruning.flags.has_changes = 0; diff_tree_oid(NULL, &t1->object.oid, "", &revs->pruning); @@ -3376,6 +3391,9 @@ int prepare_revision_walk(struct rev_info *revs) simplify_merges(revs); if (revs->children.name) set_children(revs); + + init_pathspec_bloom_fields(revs->repo, &revs->pruning.pathspec); + return 0; } -- 2.27.0.rc1.431.g5c813f95dc
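The query side above can be illustrated with a toy filter. The real 'struct bloom_filter' and the bloom_filter_set_bits()/bloom_filter_check_bits() helpers live in bloom-filter.h, which is not part of this excerpt, so the bit layout below is an assumption; the point is the short-circuit: one clear bit is the BLOOM_DEFINITELY_NOT case that lets rev_compare_tree() return REV_TREE_SAME without running the tree diff.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for the filter type from bloom-filter.h (not shown in
 * this series); the byte/bit numbering here is an assumption. */
struct toy_bloom_filter {
	uint32_t nr_bits;
	uint8_t bits[8];
};

static void toy_bloom_set(struct toy_bloom_filter *bf,
			  const uint32_t *hashes, unsigned int nr)
{
	unsigned int i;

	for (i = 0; i < nr; i++) {
		uint32_t pos = hashes[i] % bf->nr_bits;
		bf->bits[pos / 8] |= 1 << (pos % 8);
	}
}

/* Returns 0 as soon as one probed bit is clear: definite absence,
 * so the revision walk can skip the tree diff for this commit. */
static int toy_bloom_check(const struct toy_bloom_filter *bf,
			   const uint32_t *hashes, unsigned int nr)
{
	unsigned int i;

	for (i = 0; i < nr; i++) {
		uint32_t pos = hashes[i] % bf->nr_bits;
		if (!(bf->bits[pos / 8] & (1 << (pos % 8))))
			return 0; /* BLOOM_DEFINITELY_NOT */
	}
	return 1; /* possibly modified: fall back to the tree diff */
}
```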
Write the Modified Path Bloom Filters chunk first, keeping track of the offsets where each (non-embedded) Bloom filter has been written, and then write the Modified Path Bloom Filter Index chunk using those recorded offsets. --- commit-graph.c | 78 ++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 69 insertions(+), 9 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index fb24600bb3..3210ec2f93 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -26,6 +26,7 @@ #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */ #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */ #define GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_INDEX 0x4d504249 /* "MPBI" */ +#define GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTERS 0x4d504246 /* "MPBF" */ #define GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_EXCLUDES 0x4d504258 /* "MPBX" */ #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16) @@ -56,6 +57,12 @@ struct modified_path_bloom_filter_info { struct bloom_filter filter; + /* + * The offset relative to the start of the Modified Path Bloom + * Filters chunk where this Bloom filter has been written, + * -1 before that. 
+ */ + uint64_t offset; }; static void free_modified_path_bloom_filter_info_in_slab( @@ -1019,6 +1026,9 @@ struct write_commit_graph_context { */ uint32_t *hashes; int hashes_nr, hashes_alloc; + + /* Excluding embedded modified path Bloom filters */ + uint64_t total_filter_size; } mpbfctx; }; @@ -1219,6 +1229,43 @@ static int write_graph_chunk_extra_edges(struct hashfile *f, return 0; } +static int write_graph_chunk_modified_path_bloom_filters(struct hashfile *f, + struct write_commit_graph_context *ctx) +{ + uint64_t offset = 0; + int i; + + for (i = 0; i < ctx->commits.nr; i++) { + struct commit *commit = ctx->commits.list[i]; + struct modified_path_bloom_filter_info *bfi; + unsigned int filter_size; + + display_progress(ctx->progress, ++ctx->progress_cnt); + + bfi = modified_path_bloom_filters_peek( + &modified_path_bloom_filters, commit); + + if (!bfi || !bfi->filter.nr_bits) + continue; + if (bfi->filter.nr_bits == GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS) + continue; + + if (offset >> 62) + BUG("offset %lu is too large for the Modified Path Bloom Filter Index chunk", + offset); + + bfi->offset = offset; + + filter_size = bloom_filter_bytes(&bfi->filter); + + hashwrite_be32(f, bfi->filter.nr_bits); + hashwrite(f, bfi->filter.bits, filter_size); + + offset += sizeof(uint32_t) + filter_size; + } + return 0; +} + static int write_graph_chunk_modified_path_bloom_index(struct hashfile *f, struct write_commit_graph_context *ctx) { @@ -1247,8 +1294,11 @@ static int write_graph_chunk_modified_path_bloom_index(struct hashfile *f, */ filterdata[0] |= 1 << 7; hashwrite(f, filterdata, sizeof(filterdata)); + } else if (bfi->offset != -1) { + uint64_t offset = htonll(bfi->offset); + hashwrite(f, &offset, sizeof(offset)); } else - BUG("writing non-embedded Bloom filters is not implemented yet"); + BUG("modified path Bloom filter offset is still -1?!"); } return 0; } @@ -1424,17 +1474,20 @@ static void create_modified_path_bloom_filter( diff_tree_oid(parent_oid, 
&commit->object.oid, "", &mpbfctx->diffopt); path_component_count = mpbfctx->hashes_nr / mpbfctx->num_hashes; + bfi = modified_path_bloom_filters_at(&modified_path_bloom_filters, + commit); + bfi->offset = -1; if (path_component_count > mpbfctx->embedded_limit) { - /* Not implemented yet. */ - } else { - bfi = modified_path_bloom_filters_at( - &modified_path_bloom_filters, commit); - + bloom_filter_init(&bfi->filter, mpbfctx->num_hashes, + path_component_count); + mpbfctx->total_filter_size += sizeof(uint32_t) + + bloom_filter_bytes(&bfi->filter); + } else bloom_filter_init_with_size(&bfi->filter, GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS); - bloom_filter_set_bits(&bfi->filter, mpbfctx->hashes, - mpbfctx->hashes_nr); - } + + bloom_filter_set_bits(&bfi->filter, mpbfctx->hashes, + mpbfctx->hashes_nr); } static void add_missing_parents(struct write_commit_graph_context *ctx, struct commit *commit) @@ -1845,6 +1898,13 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx) chunks_nr++; } if (ctx->mpbfctx.use_modified_path_bloom_filters) { + if (ctx->mpbfctx.total_filter_size) { + ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc); + chunks[chunks_nr].id = GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTERS; + chunks[chunks_nr].size = ctx->mpbfctx.total_filter_size; + chunks[chunks_nr].write_fn = write_graph_chunk_modified_path_bloom_filters; + chunks_nr++; + } ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc); chunks[chunks_nr].id = GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_INDEX; chunks[chunks_nr].size = sizeof(uint8_t) + -- 2.27.0.rc1.431.g5c813f95dc
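The size accounting done via 'total_filter_size' above treats each non-embedded record in the MPBF chunk as a 32-bit bit count followed by the filter bytes. A small model of that arithmetic, assuming bloom_filter_bytes() rounds the bit count up to whole bytes (that helper is in bloom-filter.h and not shown in this excerpt):

```c
#include <assert.h>
#include <stdint.h>

/* Size of one non-embedded filter record: a 4-byte bit count plus
 * the filter's payload, rounded up to whole bytes.  This mirrors the
 * 'sizeof(uint32_t) + bloom_filter_bytes(...)' accounting in the
 * writer; the rounding behaviour is an assumption. */
static uint64_t bloom_record_size(uint32_t nr_bits)
{
	uint64_t filter_bytes = (nr_bits + 7) / 8;

	return sizeof(uint32_t) + filter_bytes;
}

/* Offset of the next record, as the writer advances 'offset' after
 * emitting each filter into the MPBF chunk. */
static uint64_t next_record_offset(uint64_t offset, uint32_t nr_bits)
{
	return offset + bloom_record_size(nr_bits);
}
```

These per-record offsets are exactly what the MPBI chunk ends up storing for commits whose filters don't fit in the embedded 63 bits.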
The table below compares the average runtime and memory usage of

  git rev-list HEAD -- "$path"

with and without modified path Bloom filters for 5000+ randomly
selected paths from each repository:

                      Average runtime        |       Max RSS
                   without   with    Speedup | without   with
  -------------------------------------------+-------------------------
  android-base     0.8780s   0.1742s   5.04x |  387MB   227MB   -41.4%
  cmssw            0.3143s   0.0452s   6.95x |  181MB    79MB   -56.4%
  cpython          0.7453s   0.0956s   7.80x |  159MB    71MB   -55.4%
  elasticsearch    0.1492s   0.0191s   7.82x |  134MB    53MB   -60.2%
  gcc              7.1852s   0.3584s  20.05x |  297MB   231MB   -22.3%
  gecko-dev        4.6113s   0.6318s   7.30x |  832MB   600MB   -27.9%
  git              0.6180s   0.0405s  15.26x |  131MB    43MB   -67.2%
  glibc            0.5618s   0.0471s  11.93x |  136MB    50MB   -63.3%
  go               0.4913s   0.0515s   9.53x |  130MB    50MB   -61.5%
  jdk              0.0482s   0.0089s   5.42x |   52MB    39MB   -25.0%
  linux            0.7043s   0.1129s   6.24x |  438MB   270MB   -38.4%
  llvm-project     2.6844s   0.4873s   5.51x |  384MB   282MB   -26.6%
  rails            0.2784s   0.0484s   5.75x |   88MB    51MB   -42.1%
  rust             0.7757s   0.0619s  12.53x |  345MB    89MB   -74.3%
  tensorflow       0.6258s   0.0642s   9.74x |  233MB    85MB   -63.5%
  webkit           1.9137s   0.3332s   5.74x |  941MB   480MB   -49.0%

--- commit-graph.c | 44 ++++++++++++++++++++++++++++++++++++++++++-- commit-graph.h | 2 ++ 2 files changed, 44 insertions(+), 2 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index 3210ec2f93..f9a21ecdfb 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -327,6 +327,15 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd, } break; + case GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTERS: + if (graph->chunk_mpbf_filters) + chunk_repeated = 1; + else { + graph->chunk_mpbf_filters = data + chunk_offset; + graph->chunk_mpbf_filters_size = next_chunk_offset - chunk_offset; + } + break; + case GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_EXCLUDES: if (graph->chunk_mpbf_excludes) chunk_repeated = 1; @@ -830,6 +839,7 @@ static int load_modified_path_bloom_filter_from_graph( struct commit *parent, struct bloom_filter *bf) { const uint8_t *bloom_index; +
uint64_t offset; int first_parent = 0; if (commit->graph_pos == COMMIT_NOT_FROM_GRAPH) @@ -857,9 +867,39 @@ static int load_modified_path_bloom_filter_from_graph( bf->nr_bits = GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS; bf->bits = (uint8_t*) bloom_index; return 1; + } else if (bloom_index[0] & (1 << 6)) { + /* + * Modified path Bloom filters for second..nth parents of + * merge commits are not implemented yet. + */ + return 0; + } else { + if (!first_parent) + return 0; + offset = get_be64(bloom_index); } - /* support for non-embedded Bloom filters is not implemented yet. */ - return 0; + + if (!graph->chunk_mpbf_filters) + BUG("commit %s refers to offset %lu of the Modified Path Bloom Filters chunk, but that chunk is missing", + oid_to_hex(&commit->object.oid), offset); + + if (offset + sizeof(uint32_t) >= graph->chunk_mpbf_filters_size) + BUG("commit %s refers to offset %lu of the Modified Path Bloom Filters chunk, but that's too large for chunk of size %lu bytes", + oid_to_hex(&commit->object.oid), offset, + graph->chunk_mpbf_filters_size); + + bf->nr_bits = get_be32(graph->chunk_mpbf_filters + offset); + if (!bf->nr_bits) + BUG("commit %s has a modified path Bloom filter at offset %lu, which has zero size", + oid_to_hex(&commit->object.oid), offset); + if (offset + sizeof(uint32_t) + bloom_filter_bytes(bf) > graph->chunk_mpbf_filters_size) + BUG("commit %s has a modified path Bloom filter of %u bits at offset %lu, which doesn't fit into a Modified Path Bloom Filters chunk of %lu bytes", + oid_to_hex(&commit->object.oid), bf->nr_bits, offset, + graph->chunk_mpbf_filters_size); + /* Casting away const-ness :( */ + bf->bits = (uint8_t*)(graph->chunk_mpbf_filters + offset + sizeof(uint32_t)); + + return 1; } enum bloom_result check_modified_path_bloom_filter(struct repository *r, diff --git a/commit-graph.h b/commit-graph.h index 09dfc16932..cde0d7fa30 100644 --- a/commit-graph.h +++ b/commit-graph.h @@ -69,6 +69,8 @@ struct commit_graph { const unsigned 
char *chunk_base_graphs; const unsigned char *chunk_mpbf_index; uint64_t chunk_mpbf_index_size; + const unsigned char *chunk_mpbf_filters; + uint64_t chunk_mpbf_filters_size; const unsigned char *chunk_mpbf_excludes; uint8_t num_modified_path_bloom_hashes; -- 2.27.0.rc1.431.g5c813f95dc
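Putting the reader's tag-bit checks together, one 8-byte MPBI entry decodes roughly as follows. This sketch leaves out the bit-6 merge-parent case introduced above (still unimplemented in the series), and the helper name is made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

enum mpbi_entry_kind { MPBI_NO_FILTER, MPBI_EMBEDDED, MPBI_OFFSET };

/* Decode one 8-byte MPBI entry, per the format in these patches: an
 * all-ones entry means "no filter"; otherwise, if the most
 * significant bit of the first byte is set, the remaining 63 bits
 * are an embedded Bloom filter; else the whole entry is a big-endian
 * offset into the Modified Path Bloom Filters chunk. */
static enum mpbi_entry_kind decode_mpbi_entry(const uint8_t entry[8],
					      uint64_t *offset)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | entry[i]; /* like get_be64() */

	if (v == UINT64_MAX)
		return MPBI_NO_FILTER;
	if (entry[0] & 0x80)
		return MPBI_EMBEDDED;
	*offset = v;
	return MPBI_OFFSET;
}
```

Note that the "no filter" sentinel (all bits set) also has the embedded tag bit set, so it has to be checked first, just as the patch does.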
The file 'dir/subdir/file' can only be modified if its leading
directories 'dir' and 'dir/subdir' are modified as well. So when
checking modified path Bloom filters for commits modifying a path
with multiple path components, check not only the full path in the
Bloom filters, but all of its leading directories as well. Take care
to check these paths in "deepest first" order, because it's the full
path that is least likely to be modified, so the Bloom filter queries
can short-circuit sooner.

This can significantly reduce the average false positive rate, by
about an order of magnitude or three(!), and can further speed up
pathspec-limited revision walks. The table below compares the average
false positive rate and runtime of

  git -c core.modifiedPathBloomFilters=1 rev-list HEAD -- "$path"

before and after this change for 5000+ randomly selected paths from
each repository:

                    Average false         Average    Average
                    positive rate         runtime    runtime
                  before      after     before   after   difference
  ------------------------------------------------------------------
  android-base     0.526%   0.00416%   0.1727s  0.1503s   -13.0%
  cmssw            0.494%   0.00395%   0.0426s  0.0332s   -22.0%
  cpython          0.213%   0.01661%   0.0940s  0.0858s    -8.8%
  elasticsearch    0.679%   0.00325%   0.0191s  0.0147s   -23.2%
  gcc              0.398%   0.00892%   0.3423s  0.2192s   -36.0%
  gecko-dev        0.472%   0.00073%   0.6243s  0.5134s   -17.8%
  git              0.191%   0.06992%   0.0384s  0.0319s   -17.1%
  glibc            0.392%   0.01639%   0.0425s  0.0296s   -30.5%
  go               0.453%   0.01262%   0.0515s  0.0425s   -17.5%
  jdk              0.662%   0.00643%   0.0083s  0.0070s   -15.0%
  linux            0.434%   0.00749%   0.1102s  0.0911s   -17.3%
  llvm-project     0.511%   0.00391%   0.4865s  0.4290s   -11.8%
  rails            0.476%   0.01313%   0.0464s  0.0410s   -11.6%
  rust             0.457%   0.02551%   0.0590s  0.0462s   -21.7%
  tensorflow       0.511%   0.00824%   0.0642s  0.0517s   -19.5%
  webkit           0.661%   0.00101%   0.3315s  0.2576s   -22.3%

The improvements in runtime are much smaller than the improvements in
average false positive rate, as we are clearly reaching diminishing
returns here.
However, all these timings depend on access to tree objects being
reasonably fast (warm caches). If we had a partial clone and the tree
objects had to be fetched from a promisor remote, e.g.:

  $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
  $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
        commit-graph write --reachable
  $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
  $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
        rev-list HEAD -- "$path"

then checking all leading path components can reduce the runtime from
over an hour to a few seconds (and this is with the clone and the
promisor on the same machine).

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 27 ++++++++++++++++++++++----- 1 file changed, 22 insertions(+), 5 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index f9a21ecdfb..1fd1b4f8dd 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -977,8 +977,10 @@ void init_pathspec_bloom_fields(struct repository *r, for (i = 0; i < pathspec->nr; i++) { struct pathspec_item *pi = &pathspec->items[i]; - const char *path = pi->match; + const char *path = pi->match, *p; size_t len = pi->len; + int path_component_nr = 0, j; + uint32_t *hashes; /* * Pathspec parsing has normalized away any consecutive @@ -988,13 +990,28 @@ void init_pathspec_bloom_fields(struct repository *r, if (path[len - 1] == '/') len--; - pi->modified_path_bloom_hashes_nr = graph->num_modified_path_bloom_hashes; + p = path; + do { + p = strchrnul(p + 1, '/'); + path_component_nr++; + } while (p - path < len); + + pi->modified_path_bloom_hashes_nr = path_component_nr * graph->num_modified_path_bloom_hashes; ALLOC_ARRAY(pi->modified_path_bloom_hashes, pi->modified_path_bloom_hashes_nr); - compute_modified_path_bloom_hashes_for_path(path, len, - graph->num_modified_path_bloom_hashes, - pi->modified_path_bloom_hashes); + p = path; + hashes = pi->modified_path_bloom_hashes + +
pi->modified_path_bloom_hashes_nr; + for (j = 0; j < path_component_nr; j++) { + p = strchrnul(p + 1, '/'); + + hashes -= graph->num_modified_path_bloom_hashes; + compute_modified_path_bloom_hashes_for_path(path, + p - path, + graph->num_modified_path_bloom_hashes, + hashes); + } } pathspec->can_use_modified_path_bloom_filters = 1; -- 2.27.0.rc1.431.g5c813f95dc
Since the previous patch we check the presence of all leading
directories of a pathspec in modified path Bloom filters, which
significantly lowers the false positive rate.  This means that we are
checking a lot of bit positions in the Bloom filters.  However, as
shown earlier in this series, a significant portion of commits have
embedded modified path Bloom filters, all with the same size of 63
bits.  Use a 64-bit mask to check all relevant bits in embedded
modified path Bloom filters in one step.

I don't have benchmark results at hand, but as far as I can recall
from some old results, this reduces the time spent loading and
checking modified path Bloom filters by up to 10%.  Checking Bloom
filters is of course only a small part of the whole runtime of a
pathspec-limited revision walk, so the overall improvement is only
about 1-2%.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 17 +++++++++++++++--
 pathspec.c     |  1 +
 pathspec.h     |  1 +
 3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 1fd1b4f8dd..56b8f33d10 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -918,10 +918,16 @@ enum bloom_result check_modified_path_bloom_filter(struct repository *r,
 	for (i = 0; i < pathspec->nr; i++) {
 		struct pathspec_item *pi = &pathspec->items[i];
-		if (bloom_filter_check_bits(&bf,
+		if (bf.nr_bits == GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS) {
+			/* Gross! And potential alignment issues?! */
+			if ((*((const uint64_t*)bf.bits) & pi->modified_path_bloom_mask) == pi->modified_path_bloom_mask)
+				return BLOOM_POSSIBLY_YES;
+		} else {
+			if (bloom_filter_check_bits(&bf,
 				pi->modified_path_bloom_hashes,
 				pi->modified_path_bloom_hashes_nr))
-			return BLOOM_POSSIBLY_YES;
+				return BLOOM_POSSIBLY_YES;
+		}
 	}
 	return BLOOM_DEFINITELY_NOT;
@@ -981,6 +987,7 @@ void init_pathspec_bloom_fields(struct repository *r,
 		size_t len = pi->len;
 		int path_component_nr = 0, j;
 		uint32_t *hashes;
+		struct bloom_filter embedded_bf;
 		/*
 		 * Pathspec parsing has normalized away any consecutive
@@ -1012,6 +1019,12 @@ void init_pathspec_bloom_fields(struct repository *r,
 					graph->num_modified_path_bloom_hashes,
 					hashes);
 		}
+
+		embedded_bf.nr_bits = GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS;
+		embedded_bf.bits = (uint8_t*) &pi->modified_path_bloom_mask;
+		bloom_filter_set_bits(&embedded_bf,
+				      pi->modified_path_bloom_hashes,
+				      pi->modified_path_bloom_hashes_nr);
 	}
 	pathspec->can_use_modified_path_bloom_filters = 1;
diff --git a/pathspec.c b/pathspec.c
index 01e7ae6944..29cadce013 100644
--- a/pathspec.c
+++ b/pathspec.c
@@ -408,6 +408,7 @@ static void init_pathspec_item(struct pathspec_item *item, unsigned flags,
 	item->attr_match_nr = 0;
 	item->modified_path_bloom_hashes_nr = 0;
 	item->modified_path_bloom_hashes = NULL;
+	item->modified_path_bloom_mask = 0;
 	/* PATHSPEC_LITERAL_PATH ignores magic */
 	if (flags & PATHSPEC_LITERAL_PATH) {
diff --git a/pathspec.h b/pathspec.h
index 0c661e1b03..c9975aea7a 100644
--- a/pathspec.h
+++ b/pathspec.h
@@ -64,6 +64,7 @@ struct pathspec {
 		 */
 		uint32_t modified_path_bloom_hashes_nr;
 		uint32_t *modified_path_bloom_hashes;
+		uint64_t modified_path_bloom_mask;
 	} *items;
 };
-- 
2.27.0.rc1.431.g5c813f95dc
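The one-step mask check at the heart of this patch can be illustrated standalone. Nothing here comes from Git's code: `make_mask()` is a hypothetical stand-in for `bloom_filter_set_bits()` writing into a `uint64_t`, and the hash values are arbitrary.

```c
#include <stdint.h>

#define EMBEDDED_NR_BITS 63	/* one bit of the 64 is reserved as a flag */

/* Set the bit selected by each hash value; stand-in for bloom_filter_set_bits(). */
static uint64_t make_mask(const uint32_t *hashes, int nr)
{
	uint64_t mask = 0;
	int i;

	for (i = 0; i < nr; i++)
		mask |= (uint64_t)1 << (hashes[i] % EMBEDDED_NR_BITS);
	return mask;
}

/*
 * One AND and one compare replace a loop over the individual bit
 * positions: the filter possibly contains the element iff every bit
 * of the precomputed mask is set in the filter word.
 */
static int mask_possibly_contains(uint64_t filter, uint64_t mask)
{
	return (filter & mask) == mask;
}
```

The mask is computed once per pathspec item at init time, so every per-commit check afterwards is a single word operation instead of k bit probes.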
Some commits modify the same set of paths and, consequently, have
identical modified path Bloom filters.  Use a hashmap to find these
identical Bloom filters, and omit all duplicates from the Modified
Path Bloom Filters chunk, reducing its size.

                      MPBF chunk size
                   without       with               Time spent
                 deduplication  deduping
  --------------------------------------------------------
  android-base      8620198    2618764  -69.6%       0.089s
  cmssw            10347018    9094468  -12.1%       0.056s
  cpython            526381     371629  -29.4%       0.013s
  elasticsearch     4733274    3607106  -23.8%       0.029s
  gcc               3724544    3394521   -8.9%       0.072s
  gecko-dev        20591010   17042732  -17.2%       0.235s
  git                160718     139546  -13.2%       0.004s
  glibc              759132     699623   -7.8%       0.009s
  go                 458375     421394   -8.1%       0.008s
  jdk              13213208   12158533   -8.0%       0.034s
  linux            26575278   25029700   -5.8%       0.169s
  llvm-project      3988133    3269438  -18.0%       0.085s
  rails              958691     614528  -35.9%       0.015s
  rust              1985782    1682919  -15.3%       0.023s
  tensorflow        4173570    3789768   -9.2%       0.022s
  webkit            5891751    5394507   -8.4%       0.096s

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 71 insertions(+), 5 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 56b8f33d10..178bbf0113 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -57,6 +57,7 @@
 struct modified_path_bloom_filter_info {
 	struct bloom_filter filter;
+	struct modified_path_bloom_filter_info *duplicate_of;
 	/*
 	 * The offset relative to the start of the Modified Path Bloom
 	 * Filters chunk where this Bloom filter has been written,
@@ -1099,6 +1100,9 @@ struct write_commit_graph_context {
 		/* Excluding embedded modified path Bloom filters */
 		uint64_t total_filter_size;
+
+		/* Used to find identical modified path Bloom filters */
+		struct hashmap dedup_hashmap;
 	} mpbfctx;
 };
@@ -1315,7 +1319,11 @@ static int write_graph_chunk_modified_path_bloom_filters(struct hashfile *f,
 		bfi = modified_path_bloom_filters_peek(
 				&modified_path_bloom_filters, commit);
-		if (!bfi || !bfi->filter.nr_bits)
+		if (!bfi)
+			continue;
+		if (bfi->duplicate_of)
+			continue;
+		if (!bfi->filter.nr_bits)
 			continue;
 		if (bfi->filter.nr_bits == GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS)
 			continue;
@@ -1364,6 +1372,9 @@ static int write_graph_chunk_modified_path_bloom_index(struct hashfile *f,
 			 */
 			filterdata[0] |= 1 << 7;
 			hashwrite(f, filterdata, sizeof(filterdata));
+		} else if (bfi->duplicate_of) {
+			uint64_t offset = htonll(bfi->duplicate_of->offset);
+			hashwrite(f, &offset, sizeof(offset));
 		} else if (bfi->offset != -1) {
 			uint64_t offset = htonll(bfi->offset);
 			hashwrite(f, &offset, sizeof(offset));
@@ -1515,6 +1526,53 @@ static void file_change_cb(struct diff_options *options,
 	handle_modified_file(options->change_fn_data, fullpath);
 }
+struct modified_path_bloom_filter_dedup_entry {
+	struct hashmap_entry entry;
+	struct modified_path_bloom_filter_info *bfi;
+};
+
+static int modified_path_bloom_filter_dedup_cmp(const void *unused_cmp_data,
+						const struct hashmap_entry *he1,
+						const struct hashmap_entry *he2,
+						const void *unused_keydata)
+{
+	const struct modified_path_bloom_filter_dedup_entry *a, *b;
+
+	a = container_of(he1, const struct modified_path_bloom_filter_dedup_entry, entry);
+	b = container_of(he2, const struct modified_path_bloom_filter_dedup_entry, entry);
+
+	if (a->bfi->filter.nr_bits != b->bfi->filter.nr_bits)
+		return 1;
+
+	return memcmp(a->bfi->filter.bits, b->bfi->filter.bits,
+		      bloom_filter_bytes(&a->bfi->filter));
+}
+
+static int handle_duplicate_modified_path_bloom_filter(
+		struct modified_path_bloom_filter_context *mpbfctx,
+		struct modified_path_bloom_filter_info *bfi)
+{
+	struct modified_path_bloom_filter_dedup_entry *e, stack_entry;
+	unsigned int h;
+
+	h = memhash(bfi->filter.bits, bloom_filter_bytes(&bfi->filter));
+	hashmap_entry_init(&stack_entry.entry, h);
+	stack_entry.bfi = bfi;
+
+	e = hashmap_get_entry(&mpbfctx->dedup_hashmap, &stack_entry, entry,
+			      NULL);
+	if (e) {
+		bfi->duplicate_of = e->bfi;
+		return 1;
+	} else {
+		e = xmalloc(sizeof(*e));
+		hashmap_entry_init(&e->entry, h);
+		e->bfi = bfi;
+		hashmap_add(&mpbfctx->dedup_hashmap, &e->entry);
+		return 0;
+	}
+}
+
 static void create_modified_path_bloom_filter(
 		struct modified_path_bloom_filter_context *mpbfctx,
 		struct commit *commit)
@@ -1547,17 +1605,20 @@ static void create_modified_path_bloom_filter(
 	bfi = modified_path_bloom_filters_at(&modified_path_bloom_filters,
 					     commit);
 	bfi->offset = -1;
-	if (path_component_count > mpbfctx->embedded_limit) {
+	if (path_component_count > mpbfctx->embedded_limit)
 		bloom_filter_init(&bfi->filter, mpbfctx->num_hashes,
 				  path_component_count);
-		mpbfctx->total_filter_size += sizeof(uint32_t) +
-					      bloom_filter_bytes(&bfi->filter);
-	} else
+	else
 		bloom_filter_init_with_size(&bfi->filter,
 				GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS);
 	bloom_filter_set_bits(&bfi->filter, mpbfctx->hashes,
 			      mpbfctx->hashes_nr);
+
+	if (path_component_count > mpbfctx->embedded_limit &&
+	    !handle_duplicate_modified_path_bloom_filter(mpbfctx, bfi))
+		mpbfctx->total_filter_size += sizeof(uint32_t) +
+					      bloom_filter_bytes(&bfi->filter);
 }
 static void add_missing_parents(struct write_commit_graph_context *ctx, struct commit *commit)
@@ -2396,6 +2457,8 @@ int write_commit_graph(const char *obj_dir,
 		diff_setup_done(&ctx->mpbfctx.diffopt);
 		strbuf_init(&ctx->mpbfctx.prev_path, PATH_MAX);
+		hashmap_init(&ctx->mpbfctx.dedup_hashmap,
+			     modified_path_bloom_filter_dedup_cmp, NULL, 0);
 	}
 	if (ctx->split) {
@@ -2525,6 +2588,9 @@ int write_commit_graph(const char *obj_dir,
 		strbuf_release(&ctx->mpbfctx.prev_path);
 		free(ctx->mpbfctx.hashed_prefix_lengths);
 		free(ctx->mpbfctx.hashes);
+		hashmap_free_entries(&ctx->mpbfctx.dedup_hashmap,
+				     struct modified_path_bloom_filter_dedup_entry,
+				     entry);
 	}
 	free(ctx);
-- 
2.27.0.rc1.431.g5c813f95dc
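Independent of git's `hashmap` API, the deduplication idea boils down to: hash the raw filter bytes and point each later identical filter at its first occurrence, so only one copy is written and the rest share its offset. A self-contained sketch, with FNV-1a standing in for `memhash()` and a quadratic scan standing in for the hash table (both purely for illustration):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* FNV-1a over the raw filter bytes; a stand-in for git's memhash(). */
static uint32_t hash_bytes(const uint8_t *data, size_t len)
{
	uint32_t h = 2166136261u;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= data[i];
		h *= 16777619u;
	}
	return h;
}

struct filter {
	const uint8_t *bits;
	size_t len;				/* bloom_filter_bytes() in the real code */
	const struct filter *duplicate_of;	/* NULL if this copy is written out */
};

/*
 * Point each later identical filter at the first occurrence.  The hash
 * is compared first as a cheap reject before the byte-wise comparison,
 * mirroring how a hash table would bucket the entries.
 */
void dedup_filters(struct filter *filters, size_t nr)
{
	size_t i, j;

	for (i = 0; i < nr; i++) {
		uint32_t hi = hash_bytes(filters[i].bits, filters[i].len);

		filters[i].duplicate_of = NULL;
		for (j = 0; j < i; j++) {
			if (filters[j].duplicate_of)
				continue;	/* only compare against "originals" */
			if (hash_bytes(filters[j].bits, filters[j].len) == hi &&
			    filters[j].len == filters[i].len &&
			    !memcmp(filters[j].bits, filters[i].bits, filters[i].len)) {
				filters[i].duplicate_of = &filters[j];
				break;
			}
		}
	}
}
```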
---
 commit-graph.c | 53 +++++++++++++++++++++++++++++++++++++++++++++-----
 commit-graph.h |  2 ++
 2 files changed, 50 insertions(+), 5 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 178bbf0113..32738ba4b8 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -26,6 +26,7 @@
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
 #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
 #define GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_INDEX 0x4d504249 /* "MPBI" */
+#define GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_MERGE_INDEX 0x4d50424d /* "MPBM" */
 #define GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTERS 0x4d504246 /* "MPBF" */
 #define GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_EXCLUDES 0x4d504258 /* "MPBX" */
@@ -328,6 +329,15 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
 			}
 			break;
+		case GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_MERGE_INDEX:
+			if (graph->chunk_mpbf_merge_index)
+				chunk_repeated = 1;
+			else {
+				graph->chunk_mpbf_merge_index = data + chunk_offset;
+				graph->chunk_mpbf_merge_index_size = next_chunk_offset - chunk_offset;
+			}
+			break;
+
 		case GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTERS:
 			if (graph->chunk_mpbf_filters)
 				chunk_repeated = 1;
@@ -869,11 +879,44 @@ static int load_modified_path_bloom_filter_from_graph(
 		bf->bits = (uint8_t*) bloom_index;
 		return 1;
 	} else if (bloom_index[0] & (1 << 6)) {
-		/*
-		 * Modified path Bloom filters for second..nth parents of
-		 * merge commits are not implemented yet.
-		 */
-		return 0;
+		struct commit_list *p;
+		uint32_t pos;
+		int found = 0;
+
+		pos = get_be32(bloom_index + sizeof(uint32_t));
+
+		if (!graph->chunk_mpbf_merge_index)
+			BUG("commit %s refers to position %u in the Modified Path Bloom Filter Merge Index chunk, but that chunk is missing",
+			    oid_to_hex(&commit->object.oid), pos);
+
+		for (p = commit->parents; p; p = p->next, pos++)
+			if (p->item == parent) {
+				found = 1;
+				break;
+			}
+		if (!found)
+			BUG("commit %s has no parent %s\n",
+			    oid_to_hex(&commit->object.oid),
+			    oid_to_hex(&parent->object.oid));
+
+		if (pos * sizeof(uint64_t) > graph->chunk_mpbf_merge_index_size)
+			BUG("commit %s and parent %s refer to position %u in the Modified Path Bloom Filter Merge Index chunk, but that's too large for a chunk of size %lu bytes",
+			    oid_to_hex(&commit->object.oid),
+			    oid_to_hex(&parent->object.oid), pos,
+			    graph->chunk_mpbf_merge_index_size);
+		bloom_index = graph->chunk_mpbf_merge_index + sizeof(uint64_t) * pos;
+		if (bloom_index[0] & (1 << 7)) {
+			uint64_t v;
+			memcpy(&v, bloom_index, sizeof(v));
+			if (v == GRAPH_MODIFIED_PATH_BLOOM_FILTER_NONE)
+				return 0;
+
+			/* embedded modified path Bloom filter */
+			bf->nr_bits = GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS;
+			bf->bits = (uint8_t*) bloom_index;
+			return 1;
+		} else
+			offset = get_be64(bloom_index);
 	} else {
 		if (!first_parent)
 			return 0;
diff --git a/commit-graph.h b/commit-graph.h
index cde0d7fa30..b2605a0e3b 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -69,6 +69,8 @@ struct commit_graph {
 	const unsigned char *chunk_base_graphs;
 	const unsigned char *chunk_mpbf_index;
 	uint64_t chunk_mpbf_index_size;
+	const unsigned char *chunk_mpbf_merge_index;
+	uint64_t chunk_mpbf_merge_index_size;
 	const unsigned char *chunk_mpbf_filters;
 	uint64_t chunk_mpbf_filters_size;
 	const unsigned char *chunk_mpbf_excludes;
-- 
2.27.0.rc1.431.g5c813f95dc
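For readers following the format, here is a sketch of how one 8-byte Bloom index entry appears to be decoded by this patch: bit 7 of the first byte marks an embedded 63-bit filter, bit 6 marks a position into the merge index chunk (carried in the last four bytes, big-endian), and otherwise the entry is a big-endian offset into the filters chunk. Treating the "no filter" sentinel as all ones is an assumption on my part; the value of `GRAPH_MODIFIED_PATH_BLOOM_FILTER_NONE` isn't shown in this excerpt, and all names below are local to the sketch.

```c
#include <stdint.h>

enum index_entry_kind {
	ENTRY_NO_FILTER, ENTRY_EMBEDDED, ENTRY_MERGE_INDEX, ENTRY_OFFSET,
};

static uint32_t get_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | p[3];
}

static uint64_t get_be64(const uint8_t *p)
{
	return ((uint64_t)get_be32(p) << 32) | get_be32(p + 4);
}

/*
 * Decode one 8-byte Modified Path Bloom Filter Index entry.
 * ASSUMPTION: the "no filter" sentinel is taken to be all ones here.
 */
enum index_entry_kind decode_entry(const uint8_t entry[8], uint64_t *value)
{
	int i, all_ones = 1;

	for (i = 0; i < 8; i++)
		if (entry[i] != 0xff)
			all_ones = 0;
	if (all_ones)
		return ENTRY_NO_FILTER;		/* this commit has no filter */
	if (entry[0] & (1 << 7)) {
		/* the remaining 63 bits are the filter itself */
		*value = get_be64(entry) & ~((uint64_t)1 << 63);
		return ENTRY_EMBEDDED;
	}
	if (entry[0] & (1 << 6)) {
		/* position of the first-parent entry in the merge index chunk */
		*value = get_be32(entry + 4);
		return ENTRY_MERGE_INDEX;
	}
	/* otherwise: offset into the Modified Path Bloom Filters chunk */
	*value = get_be64(entry);
	return ENTRY_OFFSET;
}
```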
[Explain!]

The table below compares the runtime of

  git rev-list --full-history HEAD -- "$path"

with modified path Bloom filters for only first parents and with all
parents of merge commits for 5000+ randomly selected paths from each
repository.  It also shows the size increase of the commit-graph file:

                           |       Average runtime        |  commit-graph file size
                           |                              |
                Percentage |  first     all               |  first     all
                of merge   |  parent    merge    Average  |  parent    merge     Size
                commits    |  only      parents  speedup  |  only      parents   increase
  ------------------------+-------------------------------+----------------------------
  android-base     73.6%   | 7.0890s   0.9681s    7.32x   | 29.8MB   218.0MB    7.32x
  cmssw            11.0%   | 0.7164s   0.2577s    2.78x   | 22.9MB    51.8MB    2.26x
  cpython          11.7%   | 0.3815s   0.1460s    2.61x   |  7.8MB    10.6MB    1.36x
  elasticsearch     8.4%   | 0.1217s   0.0582s    2.09x   |  9.6MB    10.9MB    1.14x
  gecko-dev         3.5%   | 1.6432s   0.9689s    1.70x   | 63.5MB    90.4MB    1.42x
  git              25.3%   | 0.8743s   0.1729s    5.06x   |  3.8MB    10.4MB    2.74x
  homebrew-cask     9.6%   | 1.1454s   0.1846s    6.20x   |  6.5MB     7.2MB    1.10x
  jdk              25.0%   | 0.2598s   0.1242s    2.09x   | 15.1MB    20.3MB    1.34x
  linux             7.4%   | 2.8037s   1.3339s    2.10x   | 78.1MB   269.8MB    3.45x
  rails            22.2%   | 0.2854s   0.0926s    3.08x   |  6.0MB     8.4MB    1.40x
  rust             27.0%   | 1.9803s   0.1794s   11.04x   |  8.2MB    22.0MB    2.68x
  tensorflow        9.1%   | 0.3886s   0.1237s    3.14x   |  9.0MB    19.0MB    2.11x

Storing modified path Bloom filters for all parents of merge commits
significantly increases the size of the Modified Path Bloom Filters
chunk, though when looking at the size of the whole commit-graph file
the relative increase is sometimes not that big.  FWIW, the average
speedup in most cases (except android-base and linux) is higher than
the size increase of the commit-graph file.

Beware!  To create a Bloom filter from the paths modified between a
commit and one of its parents, it seems natural to extend
create_modified_path_bloom_filter() with a 'struct commit *parent'
parameter... and then things would blow up when encountering commit
a733a5da97b2 (Merge branches 'release' and 'fluff' into release,
2008-02-07) in linux.git:

  $ git -C linux.git log -1 a733a5da97b2 | head -n2
  commit a733a5da97b238e3e3167d3d0aee8fe1e8d04e97
  Merge: 299cfe38081b 299cfe38081b 9e52797131e8

Notice how the same commit is both the first _and_ second parent!  So
extend create_modified_path_bloom_filter()'s signature with the int
index of the particular parent instead.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 Documentation/config/core.txt |  13 +++
 commit-graph.c                | 169 +++++++++++++++++++++++++++++-----
 2 files changed, 158 insertions(+), 24 deletions(-)

diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
index 820f7d3bcf..b4311c5f31 100644
--- a/Documentation/config/core.txt
+++ b/Documentation/config/core.txt
@@ -594,6 +594,19 @@ core.modifiedPathBloomFilters::
 	commit-graph file to speed up pathspec-limited revision walks.
 	Defaults to false.
+core.modifiedPathBloomFiltersForAllMergeParents::
+	EXPERIMENTAL!!!
+	If true, then Git will store Bloom filters for all parents of
+	merge commits in the commit-graph file.
+	This has no impact on repositories with linear history.  In
+	repositories with lots of merge commits this considerably
+	increases the runtime of writing the commit-graph file and its
+	size.  It has little benefit in pathspec-limited revision walks
+	with the default history simplification, but can speed up
+	`--full-history` considerably.
+	Defaults to false, ignored if `core.modifiedPathBloomFilters` is
+	false.
+
 core.useReplaceRefs::
 	If set to `false`, behave as if the `--no-replace-objects`
 	option was given on the command line. See linkgit:git[1] and
diff --git a/commit-graph.c b/commit-graph.c
index 32738ba4b8..9cce79d452 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -65,12 +65,23 @@ struct modified_path_bloom_filter_info {
 	 * -1 before that.
 	 */
 	uint64_t offset;
+	uint32_t merge_index_pos;
+	struct modified_path_bloom_filter_info *next;
 };
 static void free_modified_path_bloom_filter_info_in_slab(
 		struct modified_path_bloom_filter_info *bfi)
 {
+	/* The instance got as parameter is on the slab, don't free() it. */
 	bloom_filter_free(&bfi->filter);
+	bfi = bfi->next;
+	/* The rest is in a linked list, and must be free()d. */
+	while (bfi) {
+		struct modified_path_bloom_filter_info *prev = bfi;
+		bloom_filter_free(&bfi->filter);
+		bfi = bfi->next;
+		free(prev);
+	}
 }
 define_commit_slab(modified_path_bloom_filters,
@@ -1115,7 +1126,8 @@ struct write_commit_graph_context {
 	const struct split_commit_graph_opts *split_opts;
 	struct modified_path_bloom_filter_context {
-		unsigned use_modified_path_bloom_filters:1;
+		unsigned use_modified_path_bloom_filters:1,
+			 all_merge_parents:1;
 		unsigned int num_hashes;
 		/*
 		 * Number of paths to be added to "embedded" modified path
@@ -1143,6 +1155,7 @@ struct write_commit_graph_context {
 		/* Excluding embedded modified path Bloom filters */
 		uint64_t total_filter_size;
+		uint32_t num_merge_index_entries;
 		/* Used to find identical modified path Bloom filters */
 		struct hashmap dedup_hashmap;
@@ -1361,28 +1374,70 @@ static int write_graph_chunk_modified_path_bloom_filters(struct hashfile *f,
 		bfi = modified_path_bloom_filters_peek(
 				&modified_path_bloom_filters, commit);
+		for (; bfi; bfi = bfi->next) {
+			if (bfi->duplicate_of)
+				continue;
+			if (!bfi->filter.nr_bits)
+				continue;
+			if (bfi->filter.nr_bits == GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS)
+				continue;
-		if (!bfi)
-			continue;
-		if (bfi->duplicate_of)
-			continue;
-		if (!bfi->filter.nr_bits)
-			continue;
-		if (bfi->filter.nr_bits == GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS)
-			continue;
+			if (offset >> 62)
+				BUG("offset %lu is too large for the Modified Path Bloom Filter Index chunk",
+				    offset);
+
+			bfi->offset = offset;
+
+			filter_size = bloom_filter_bytes(&bfi->filter);
+
+			hashwrite_be32(f, bfi->filter.nr_bits);
+			hashwrite(f, bfi->filter.bits, filter_size);
+
+			offset += sizeof(uint32_t) + filter_size;
+		}
+	}
+	return 0;
+}
-		if (offset >> 62)
-			BUG("offset %lu is too large for the Modified Path Bloom Filter Index chunk",
-			    offset);
+static int write_graph_chunk_modified_path_bloom_merge_index(
+		struct hashfile *f, struct write_commit_graph_context *ctx)
+{
+	const uint64_t no_bloom_filter = GRAPH_MODIFIED_PATH_BLOOM_FILTER_NONE;
+	uint32_t pos = 0;
+	int i;
-		bfi->offset = offset;
+	for (i = 0; i < ctx->commits.nr; i++) {
+		struct commit *commit = ctx->commits.list[i];
+		struct modified_path_bloom_filter_info *bfi;
-		filter_size = bloom_filter_bytes(&bfi->filter);
+		display_progress(ctx->progress, ++ctx->progress_cnt);
-		hashwrite_be32(f, bfi->filter.nr_bits);
-		hashwrite(f, bfi->filter.bits, filter_size);
+		bfi = modified_path_bloom_filters_peek(
+				&modified_path_bloom_filters, commit);
-		offset += sizeof(uint32_t) + filter_size;
+		if (!bfi || !bfi->next)
+			/* No merge index entries for this commit. */
+			continue;
+
+		do {
+			if (bfi->duplicate_of) {
+				uint64_t offset = htonll(bfi->duplicate_of->offset);
+				hashwrite(f, &offset, sizeof(offset));
+			} else if (!bfi->filter.nr_bits) {
+				hashwrite(f, &no_bloom_filter,
+					  sizeof(no_bloom_filter));
+			} else if (bfi->filter.nr_bits == GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS) {
+				uint8_t filterdata[sizeof(uint64_t)];
+				memcpy(filterdata, bfi->filter.bits, sizeof(filterdata));
+				filterdata[0] |= 1 << 7;
+				hashwrite(f, filterdata, sizeof(filterdata));
+			} else if (bfi->offset != -1) {
+				uint64_t offset = htonll(bfi->offset);
+				hashwrite(f, &offset, sizeof(offset));
+			} else
+				BUG("modified path Bloom filter offset is still -1?!");
+			bfi->merge_index_pos = pos++;
+		} while ((bfi = bfi->next));
 	}
 	return 0;
 }
@@ -1403,7 +1458,19 @@ static int write_graph_chunk_modified_path_bloom_index(struct hashfile *f,
 		bfi = modified_path_bloom_filters_peek(
 				&modified_path_bloom_filters, commit);
-		if (!bfi || !bfi->filter.nr_bits) {
+		if (!bfi) {
+			hashwrite(f, &no_bloom_filter, sizeof(no_bloom_filter));
+		} else if (bfi->next) {
+			uint8_t filterdata[sizeof(uint64_t)];
+			if (!commit->parents->next)
+				BUG("uh-oh #1");
+			if (bfi->merge_index_pos == -1)
+				BUG("uh-oh #2");
+			memset(filterdata, 0, sizeof(filterdata));
+			filterdata[0] |= 1 << 6;
+			((uint32_t*)filterdata)[1] = htonl(bfi->merge_index_pos);
+			hashwrite(f, filterdata, sizeof(filterdata));
+		} else if (!bfi->filter.nr_bits) {
 			hashwrite(f, &no_bloom_filter, sizeof(no_bloom_filter));
 		} else if (bfi->filter.nr_bits == GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS) {
 			uint8_t filterdata[sizeof(uint64_t)];
@@ -1618,14 +1685,18 @@ static int handle_duplicate_modified_path_bloom_filter(
 static void create_modified_path_bloom_filter(
 		struct modified_path_bloom_filter_context *mpbfctx,
-		struct commit *commit)
+		struct commit *commit, int nth_parent)
 {
 	struct modified_path_bloom_filter_info *bfi;
 	struct object_id *parent_oid;
 	uintmax_t path_component_count;
+	int i;
 	if (!mpbfctx->use_modified_path_bloom_filters)
 		return;
+	if (nth_parent &&
+	    !mpbfctx->all_merge_parents)
+		return;
 	/*
 	 * This function can be called multiple times for the same commit
@@ -1635,11 +1706,25 @@ static void create_modified_path_bloom_filter(
 	 */
 	bfi = modified_path_bloom_filters_peek(&modified_path_bloom_filters,
 					       commit);
-	if (bfi && bfi->filter.nr_bits)
+	for (i = 0; i < nth_parent && bfi; i++, bfi = bfi->next);
+	if (i == nth_parent && bfi && bfi->filter.nr_bits)
 		return;
-	parent_oid = commit->parents ? &commit->parents->item->object.oid :
-				       NULL;
+	if (commit->parents) {
+		struct commit_list *p;
+		for (i = 0, p = commit->parents;
+		     i < nth_parent && p;
+		     i++, p = p->next);
+		if (!p)
+			BUG("couldn't find %dth parent of commit %s\n",
+			    nth_parent + 1, oid_to_hex(&commit->object.oid));
+		parent_oid = &p->item->object.oid;
+	} else {
+		if (nth_parent)
+			BUG("looking for the %dth parent of commit %s with no parents",
+			    nth_parent + 1, oid_to_hex(&commit->object.oid));
+		parent_oid = NULL;
+	}
 	mpbfctx->hashes_nr = 0;
 	strbuf_reset(&mpbfctx->prev_path);
 	diff_tree_oid(parent_oid, &commit->object.oid, "", &mpbfctx->diffopt);
@@ -1647,7 +1732,23 @@ static void create_modified_path_bloom_filter(
 	bfi = modified_path_bloom_filters_at(&modified_path_bloom_filters,
 					     commit);
+	if (!nth_parent) {
+		/* This is the right bloom_filter_info instance. */
+		if (mpbfctx->all_merge_parents &&
+		    commit->parents && commit->parents->next)
+			mpbfctx->num_merge_index_entries++;
+	} else {
+		/* i is one step ahead of bfi */
+		for (i = 1; i < nth_parent && bfi; i++, bfi = bfi->next);
+		if (bfi->next)
+			BUG("Huh?!?");
+		bfi->next = xcalloc(1, sizeof(*bfi));
+		bfi = bfi->next;
+		mpbfctx->num_merge_index_entries++;
+	}
+
 	bfi->offset = -1;
+	bfi->merge_index_pos = -1;
 	if (path_component_count > mpbfctx->embedded_limit)
 		bloom_filter_init(&bfi->filter, mpbfctx->num_hashes,
 				  path_component_count);
@@ -1667,10 +1768,20 @@ static void create_modified_path_bloom_filter(
 static void add_missing_parents(struct write_commit_graph_context *ctx, struct commit *commit)
 {
 	struct commit_list *parent;
+	int nth_parent;
+
+	if (!commit->parents) {
+		/* root commit */
+		create_modified_path_bloom_filter(&ctx->mpbfctx, commit, 0);
+		return;
+	}
-	create_modified_path_bloom_filter(&ctx->mpbfctx, commit);
+	for (parent = commit->parents, nth_parent = 0;
+	     parent;
+	     parent = parent->next, nth_parent++) {
+		create_modified_path_bloom_filter(&ctx->mpbfctx, commit,
+						  nth_parent);
-	for (parent = commit->parents; parent; parent = parent->next) {
 		if (!(parent->item->object.flags & REACHABLE)) {
 			ALLOC_GROW(ctx->oids.list, ctx->oids.nr + 1, ctx->oids.alloc);
 			oidcpy(&ctx->oids.list[ctx->oids.nr], &(parent->item->object.oid));
@@ -2079,6 +2190,13 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		chunks[chunks_nr].write_fn = write_graph_chunk_modified_path_bloom_filters;
 		chunks_nr++;
 	}
+	if (ctx->mpbfctx.num_merge_index_entries) {
+		ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc);
+		chunks[chunks_nr].id = GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_MERGE_INDEX;
+		chunks[chunks_nr].size = ctx->mpbfctx.num_merge_index_entries * sizeof(uint64_t);
+		chunks[chunks_nr].write_fn = write_graph_chunk_modified_path_bloom_merge_index;
+		chunks_nr++;
+	}
 	ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc);
 	chunks[chunks_nr].id = GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_INDEX;
 	chunks[chunks_nr].size = sizeof(uint8_t) +
@@ -2487,6 +2605,9 @@ int write_commit_graph(const char *obj_dir,
 			warning("not writing modified path Bloom filters with --split");
 		} else if (res) {
 			ctx->mpbfctx.use_modified_path_bloom_filters = 1;
+			if (!git_config_get_bool("core.modifiedPathBloomFiltersForAllMergeParents", &res) &&
+			    res)
+				ctx->mpbfctx.all_merge_parents = 1;
 			ctx->mpbfctx.num_hashes = GRAPH_MODIFIED_PATH_BLOOM_FILTER_DEFAULT_NR_HASHES;
 			ctx->mpbfctx.embedded_limit = GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS /
 				(ctx->mpbfctx.num_hashes * 10 / 7);
-- 
2.27.0.rc1.431.g5c813f95dc
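The a733a5da97b2 pitfall above boils down to: look parents up by position, never by identity, because the same commit can appear as more than one parent. A minimal stand-alone sketch; this `commit_list` is a simplified stand-in for Git's, with an `int` item instead of a `struct commit *`:

```c
#include <stddef.h>

struct commit_list {
	int item;			/* stand-in for 'struct commit *' */
	struct commit_list *next;
};

/*
 * Walk to the nth parent (0-based) by position rather than by
 * identity: searching the list for a matching object would always
 * stop at the first of two identical parents, so the second one
 * could never be addressed.
 * Returns NULL if the list has fewer than n+1 entries.
 */
struct commit_list *nth_parent(struct commit_list *parents, int n)
{
	while (n-- > 0 && parents)
		parents = parents->next;
	return parents;
}
```

With two list nodes carrying the same item, `nth_parent(list, 0)` and `nth_parent(list, 1)` still return distinct nodes, which is exactly why the patch passes an `int nth_parent` instead of a `struct commit *parent`.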
The write_commit_graph() function is just over 180 lines long, of
which 86 lines initialize the 'struct write_commit_graph_context'
instance and 32 lines free its resources.  Extract these into
init_write_commit_graph_context() and
free_write_commit_graph_context() functions, respectively, to make
write_commit_graph() look much simpler, and, more importantly, to be
able to reuse those init and free functions in the next patch.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 102 ++++++++++++++++++++++++++++++-------------------
 1 file changed, 62 insertions(+), 40 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 9cce79d452..b5cd4c863a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -2570,19 +2570,16 @@ static void expire_commit_graphs(struct write_commit_graph_context *ctx)
 	strbuf_release(&path);
 }
-int write_commit_graph(const char *obj_dir,
-		       struct string_list *pack_indexes,
-		       struct string_list *commit_hex,
-		       enum commit_graph_write_flags flags,
-		       const struct split_commit_graph_opts *split_opts)
+static struct write_commit_graph_context *init_write_commit_graph_context(
+		const char *obj_dir, enum commit_graph_write_flags flags,
+		const struct split_commit_graph_opts *split_opts)
 {
 	struct write_commit_graph_context *ctx;
-	uint32_t i, count_distinct = 0;
 	size_t len;
 	int res = 0;
 	if (!commit_graph_compatible(the_repository))
-		return 0;
+		return NULL;
 	ctx = xcalloc(1, sizeof(struct write_commit_graph_context));
 	ctx->r = the_repository;
@@ -2637,6 +2634,7 @@ int write_commit_graph(const char *obj_dir,
 	}
 	if (ctx->num_commit_graphs_before) {
+		int i;
 		ALLOC_ARRAY(ctx->commit_graph_filenames_before, ctx->num_commit_graphs_before);
 		i = ctx->num_commit_graphs_before;
 		g = ctx->r->objects->commit_graph;
@@ -2664,8 +2662,64 @@ int write_commit_graph(const char *obj_dir,
 	ctx->oids.alloc = 1024;
 	ALLOC_ARRAY(ctx->oids.list, ctx->oids.alloc);
+	return ctx;
+}
+
+static void free_write_commit_graph_context(
+		struct write_commit_graph_context *ctx)
+{
+	free(ctx->graph_name);
+	free(ctx->commits.list);
+	free(ctx->oids.list);
+	free(ctx->obj_dir);
+
+	if (ctx->commit_graph_filenames_after) {
+		int i;
+		for (i = 0; i < ctx->num_commit_graphs_after; i++) {
+			free(ctx->commit_graph_filenames_after[i]);
+			free(ctx->commit_graph_hash_after[i]);
+		}
+
+		for (i = 0; i < ctx->num_commit_graphs_before; i++)
+			free(ctx->commit_graph_filenames_before[i]);
+
+		free(ctx->commit_graph_filenames_after);
+		free(ctx->commit_graph_filenames_before);
+		free(ctx->commit_graph_hash_after);
+	}
+
+	if (ctx->mpbfctx.use_modified_path_bloom_filters) {
+		deep_clear_modified_path_bloom_filters(
+				&modified_path_bloom_filters,
+				free_modified_path_bloom_filter_info_in_slab);
+		strbuf_release(&ctx->mpbfctx.prev_path);
+		free(ctx->mpbfctx.hashed_prefix_lengths);
+		free(ctx->mpbfctx.hashes);
+		hashmap_free_entries(&ctx->mpbfctx.dedup_hashmap,
+				     struct modified_path_bloom_filter_dedup_entry,
+				     entry);
+	}
+
+	free(ctx);
+}
+
+int write_commit_graph(const char *obj_dir,
+		       struct string_list *pack_indexes,
+		       struct string_list *commit_hex,
+		       enum commit_graph_write_flags flags,
+		       const struct split_commit_graph_opts *split_opts)
+{
+	struct write_commit_graph_context *ctx;
+	uint32_t count_distinct = 0;
+	int res = 0;
+
+	ctx = init_write_commit_graph_context(obj_dir, flags, split_opts);
+	if (!ctx)
+		return 0;
+
 	if (ctx->append && ctx->r->objects->commit_graph) {
 		struct commit_graph *g = ctx->r->objects->commit_graph;
+		int i;
 		for (i = 0; i < g->num_commits; i++) {
 			const unsigned char *hash = g->chunk_oid_lookup + g->hash_len * i;
 			hashcpy(ctx->oids.list[ctx->oids.nr++].hash, hash);
@@ -2726,39 +2780,7 @@ int write_commit_graph(const char *obj_dir,
 		expire_commit_graphs(ctx);
 cleanup:
-	free(ctx->graph_name);
-	free(ctx->commits.list);
-	free(ctx->oids.list);
-	free(ctx->obj_dir);
-
-	if (ctx->commit_graph_filenames_after) {
-		for (i = 0; i < ctx->num_commit_graphs_after; i++) {
-			free(ctx->commit_graph_filenames_after[i]);
-			free(ctx->commit_graph_hash_after[i]);
-		}
-
-		for (i = 0; i < ctx->num_commit_graphs_before; i++)
-			free(ctx->commit_graph_filenames_before[i]);
-
-		free(ctx->commit_graph_filenames_after);
-		free(ctx->commit_graph_filenames_before);
-		free(ctx->commit_graph_hash_after);
-	}
-
-	if (ctx->mpbfctx.use_modified_path_bloom_filters) {
-		deep_clear_modified_path_bloom_filters(
-				&modified_path_bloom_filters,
-				free_modified_path_bloom_filter_info_in_slab);
-		strbuf_release(&ctx->mpbfctx.prev_path);
-		free(ctx->mpbfctx.hashed_prefix_lengths);
-		free(ctx->mpbfctx.hashes);
-		hashmap_free_entries(&ctx->mpbfctx.dedup_hashmap,
-				     struct modified_path_bloom_filter_dedup_entry,
-				     entry);
-	}
-
-	free(ctx);
-
+	free_write_commit_graph_context(ctx);
 	return res;
 }
-- 
2.27.0.rc1.431.g5c813f95dc
---
 commit-graph.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index b5cd4c863a..4cbfce61bf 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1901,21 +1901,6 @@ static int add_ref_to_list(const char *refname,
 	return 0;
 }
-int write_commit_graph_reachable(const char *obj_dir,
-				 enum commit_graph_write_flags flags,
-				 const struct split_commit_graph_opts *split_opts)
-{
-	struct string_list list = STRING_LIST_INIT_DUP;
-	int result;
-
-	for_each_ref(add_ref_to_list, &list);
-	result = write_commit_graph(obj_dir, NULL, &list,
-				    flags, split_opts);
-
-	string_list_clear(&list, 0);
-	return result;
-}
-
 static int fill_oids_from_packs(struct write_commit_graph_context *ctx,
 				struct string_list *pack_indexes)
 {
@@ -2784,6 +2769,21 @@ int write_commit_graph(const char *obj_dir,
 	return res;
 }
+int write_commit_graph_reachable(const char *obj_dir,
+				 enum commit_graph_write_flags flags,
+				 const struct split_commit_graph_opts *split_opts)
+{
+	struct string_list list = STRING_LIST_INIT_DUP;
+	int result;
+
+	for_each_ref(add_ref_to_list, &list);
+	result = write_commit_graph(obj_dir, NULL, &list,
+				    flags, split_opts);
+
+	string_list_clear(&list, 0);
+	return result;
+}
+
 #define VERIFY_COMMIT_GRAPH_ERROR_HASH 2
 static int verify_commit_graph_error;
-- 
2.27.0.rc1.431.g5c813f95dc
Otherwise it would fail with GIT_TEST_COMMIT_GRAPH=1.
---
 t/t7007-show.sh | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/t/t7007-show.sh b/t/t7007-show.sh
index 42d3db6246..93244cd718 100755
--- a/t/t7007-show.sh
+++ b/t/t7007-show.sh
@@ -4,16 +4,15 @@ test_description='git show'
 . ./test-lib.sh
-test_expect_success setup '
+test_expect_success 'showing a tag that points at a missing object' '
+	test_when_finished "git tag -d foo-tag" &&
 	echo hello world >foo &&
 	H=$(git hash-object -w foo) &&
 	git tag -a foo-tag -m "Tags $H" $H &&
 	HH=$(expr "$H" : "\(..\)") &&
 	H38=$(expr "$H" : "..\(.*\)") &&
-	rm -f .git/objects/$HH/$H38
-'
+	rm -f .git/objects/$HH/$H38 &&
-test_expect_success 'showing a tag that point at a missing object' '
 	test_must_fail git --no-pager show foo-tag
 '
-- 
2.27.0.rc1.431.g5c813f95dc
[Not sure it is legal to do this. I hope it's legal. It would be great
if it were legal. But I really doubt it's legal. FWIW, at least the
test suite still passes, even with GIT_TEST_COMMIT_GRAPH=1. Or we have
a hole in test coverage...]

Our revision walk machinery knows a thing or two about how to traverse
the history in an order that is friendlier to the delta base cache.
The hand-rolled revision walk in commit-graph.c:close_reachable() is
unaware of any such tricks: it is just a simple loop iterating over an
array of commits, appending unseen parents to the end.

This didn't really matter in the past, because we only accessed commit
objects while writing the commit-graph. Now, however, we have modified
path Bloom filters, and to fill them we need access to tree objects as
well to gather all modified paths, which puts more strain on the
caches, and the order in which we traverse the commits becomes
relevant.

So let's use the regular revision walk machinery (i.e. 'struct
rev_info', setup_revisions(), get_revision() and friends) for
'--reachable' to speed up writing a commit-graph with modified path
Bloom filters.

However, stick to the current algorithm when scanning pack files for
commits, because passing all packed commits to setup_revisions() might
do more harm than good (presumably; I didn't check; scanning all packs
for commits can be hopelessly inefficient anyway). '--stdin-commits',
too, could benefit from using the revision walk machinery when it's
fed with branch tips from 'git for-each-ref', or fare worse when it's
fed all commits from 'git rev-list' (didn't check this, either)... so
let's just leave it using the current algorithm for now.

The table below shows the reduced runtime and max RSS of writing a
commit-graph file from scratch with '--reachable' and modified path
Bloom filters, i.e.
that of:

  git -c core.modifiedPathBloomFilters=1 commit-graph write --reachable

                      Runtime                   Max RSS
                  before    after           before     after
  --------------------------------------------------------------------------
  android-base    57.722s   40.113s -30.5%  711804k   697624k   -2.0%
  cmssw           27.977s   25.154s -10.1%  980884k   786196k  -19.9%
  cpython         10.161s    8.956s -11.9%  331736k   328704k   -0.9%
  gcc            114.980s   36.975s -67.9%  561928k   484212k  -13.8%
  gecko-dev      132.557s   97.314s -26.6%  2203544k  2161480k  -1.9%
  git              8.097s    5.215s -35.6%  240564k   189652k  -21.2%
  glibc            4.693s    3.996s -14.9%  215924k   207572k   -3.9%
  homebrew-core   56.661s   56.384s  -0.5%  469668k   475596k   +1.3%
  jdk             20.355s   18.936s  -7.0%  327244k   322316k   -1.5%
  linux          215.235s   98.991s -54.0%  1549852k  1445028k  -6.8%
  llvm-project    48.659s   30.933s -36.4%  783740k   736988k   -6.0%
  rails            5.804s    5.389s  -7.2%  284524k   284528k     0%
  rust            20.151s   12.921s -35.9%  687840k   664104k   -3.5%
  webkit          31.226s   30.341s  -2.8%  2018396k  1902884k  -5.7%
---
 commit-graph.c | 110 ++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 94 insertions(+), 16 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 4cbfce61bf..1677e4fb3e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1703,6 +1703,9 @@ static void create_modified_path_bloom_filter(
 	 * (a commit can be present in multiple pack files, or multiple
 	 * refs can point to the same commit) so check first whether we have
 	 * already created a modified path Bloom filter for it.
+	 *
+	 * TODO: with '--reachable' we can now be sure that we look at
+	 * each commit only once, so this is unnecessary in that code path.
 	 */
 	bfi = modified_path_bloom_filters_peek(
 			&modified_path_bloom_filters, commit);
@@ -1891,16 +1894,6 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 	stop_progress(&ctx->progress);
 }
 
-static int add_ref_to_list(const char *refname,
-			   const struct object_id *oid,
-			   int flags, void *cb_data)
-{
-	struct string_list *list = (struct string_list *)cb_data;
-
-	string_list_append(list, oid_to_hex(oid));
-	return 0;
-}
-
 static int fill_oids_from_packs(struct write_commit_graph_context *ctx,
 				struct string_list *pack_indexes)
 {
@@ -2773,14 +2766,99 @@ int write_commit_graph_reachable(const char *obj_dir,
 				 enum commit_graph_write_flags flags,
 				 const struct split_commit_graph_opts *split_opts)
 {
-	struct string_list list = STRING_LIST_INIT_DUP;
-	int result;
+	struct write_commit_graph_context *ctx;
+	struct rev_info revs;
+	const char *revs_argv[] = { NULL, "--all", NULL };
+	struct commit *commit;
+	int result = 0;
+
+	ctx = init_write_commit_graph_context(obj_dir, flags, split_opts);
+	if (!ctx)
+		return 0;
+
+	/*
+	 * Some git commands write a commit-graph in-process at the end;
+	 * prevent their revision walk from interfering with ours.
+	 */
+	clear_object_flags(SEEN | ADDED | SHOWN | TOPO_WALK_EXPLORED |
+			   TOPO_WALK_INDEGREE | UNINTERESTING);
+
+	init_revisions(&revs, NULL);
+	setup_revisions(ARRAY_SIZE(revs_argv) - 1, revs_argv, &revs, NULL);
+	if (prepare_revision_walk(&revs)) {
+		error(_("revision walk setup failed"));
+		result = -1;
+		goto cleanup;
+	}
+
+	if (ctx->report_progress)
+		ctx->progress = start_delayed_progress(
+			_("Gathering commit info"),
+			0);
+	while ((commit = get_revision(&revs))) {
+		display_progress(ctx->progress, ctx->oids.nr + 1);
+		ALLOC_GROW(ctx->oids.list, ctx->oids.nr + 1, ctx->oids.alloc);
+		oidcpy(&ctx->oids.list[ctx->oids.nr], &commit->object.oid);
+		ctx->oids.nr++;
+
+		if (!commit->parents) {
+			create_modified_path_bloom_filter(&ctx->mpbfctx,
+							  commit, 0);
+		} else {
+			struct commit_list *parent;
+			int nth_parent;
+
+			for (parent = commit->parents, nth_parent = 0;
+			     parent;
+			     parent = parent->next, nth_parent++)
+				create_modified_path_bloom_filter(&ctx->mpbfctx,
+								  commit,
+								  nth_parent);
+		}
+	}
+	stop_progress(&ctx->progress);
+
+	if (ctx->oids.nr >= GRAPH_EDGE_LAST_MASK) {
+		error(_("the commit graph format cannot write %d commits"), ctx->oids.nr);
+		result = -1;
+		goto cleanup;
+	}
 
-	for_each_ref(add_ref_to_list, &list);
-	result = write_commit_graph(obj_dir, NULL, &list,
-				    flags, split_opts);
+	QSORT(ctx->oids.list, ctx->oids.nr, oid_compare);
+
+	/*
+	 * TODO: the revision walking loop above could fill ctx->commits
+	 * straight away, counting extra edges as well.
+	 */
+	ctx->commits.alloc = ctx->oids.nr;
+	ALLOC_ARRAY(ctx->commits.list, ctx->commits.alloc);
+
+	/*
+	 * Despite its unassuming name, this function removes duplicate
+	 * commits/oids and, worse, counts extra edges as well.
+	 */
+	copy_oids_to_commits(ctx);
+
+	if (!ctx->commits.nr)
+		goto cleanup;
+
+	if (ctx->split) {
+		split_graph_merge_strategy(ctx);
 
-	string_list_clear(&list, 0);
+		merge_commit_graphs(ctx);
+	} else
+		ctx->num_commit_graphs_after = 1;
+
+	compute_generation_numbers(ctx);
+
+	result = write_commit_graph_file(ctx);
+
+	if (ctx->split)
+		mark_commit_graphs(ctx);
+
+	expire_commit_graphs(ctx);
+
+cleanup:
+	free_write_commit_graph_context(ctx);
 	return result;
 }
-- 
2.27.0.rc1.431.g5c813f95dc
[Explain! Try topo order!]

I don't have recent benchmark results at hand yet, but as far as I can
recall from some old results, this reduces the time spent loading and
checking modified path Bloom filters by another ~10%. Checking Bloom
filters is of course only a small part of the whole runtime of a
pathspec-limited revision walk, so the overall improvement is only
about 1-2%. Oh, well, I expected more from this change... But perhaps
it has larger impact with fewer embedded modified path Bloom filters?

This is the last patch in this series that improves pathspec-limited
revision walk performance, so it's time to compare average runtime and
memory usage of

  git rev-list HEAD -- "$path"

for 5000+ randomly selected paths from each repository with and
without modified path Bloom filters:

                 Average runtime    Average |   Average max RSS
                 without    with    speedup | without    with
  ---------------------------------------------+----------------------------
  android-base   0.8780s  0.1456s    6.03x  |  387MB    91.2MB  -76.5%
  cmssw          0.3143s  0.0313s   10.04x  |  181MB    37.4MB  -79.4%
  cpython        0.7453s  0.0810s    9.20x  |  159MB    43.8MB  -72.5%
  elasticsearch  0.1492s  0.0136s   10.95x  |  134MB    21.8MB  -83.7%
  gcc            7.1852s  0.2114s   34.00x  |  297MB    91.1MB  -69.3%
  gecko-dev      4.6113s  0.4815s    9.58x  |  832MB   175.2MB  -79.0%
  git            0.6180s  0.0310s   19.94x  |  131MB    31.5MB  -76.0%
  glibc          0.5618s  0.0282s   19.92x  |  135MB    25.0MB  -81.6%
  go             0.4913s  0.0403s   12.19x  |  130MB    24.7MB  -81.1%
  jdk            0.0482s  0.0068s    7.09x  |   52MB    27.1MB  -48.2%
  linux          0.7043s  0.0873s    8.08x  |  438MB   150.9MB  -65.5%
  llvm-project   2.6844s  0.4135s    6.49x  |  384MB   126.5MB  -67.1%
  rails          0.2784s  0.0391s    7.12x  |   88MB    28.8MB  -67.3%
  rust           0.7757s  0.0439s   17.67x  |  344MB    42.0MB  -87.8%
  tensorflow     0.6258s  0.0487s   12.85x  |  233MB    36.7MB  -84.2%
  webkit         1.9137s  0.2420s    7.91x  |  940MB    84.7MB  -91.0%

The astute reader may notice that in several cases the achieved
speedup is a bit higher than the calculated potential speedup shown in
the log message of the patch adding the modified path Bloom filter
specs, e.g. 34x vs.
29.7x in the gcc repository. I suspect that by eliminating much of the
tree-diff overhead we load far fewer (tree) objects, putting less
strain on the caches, and in turn making other parts of the process
faster as well.

Alas, this is not without extra cost: writing the commit-graph file
with modified path Bloom filters takes a lot longer and the resulting
commit-graph file is considerably larger:

                 Writing a commit-graph file from |
                 scratch with '--reachable'       |  commit-graph file size
                 without     with                 |  without     with
  ----------------------------------------------+--------------------------------
  android-base    6.230s    40.880s   6.56x  |  25077512   31278605   +24.7%
  cmssw           3.177s    25.691s   8.09x  |  13017516   23971497   +84.1%
  cpython         1.344s     8.951s   6.66x  |   6808068    8152146   +19.7%
  gcc             2.886s    36.917s  12.79x  |  12839660   18068286   +40.7%
  gecko-dev       9.331s    97.729s  10.47x  |  43335208   66568549   +53.6%
  git             0.750s     5.245s   6.99x  |   3388208    4011595   +18.3%
  glibc           0.565s     4.146s   7.34x  |   2832460    3936588   +38.9%
  homebrew-cask   1.083s    27.024s  24.95x  |   5928644    6859160   +15.7%
  homebrew-core   1.702s    55.478s  32.60x  |   9171324   10509040   +14.5%
  jdk             0.568s    19.418s  34.19x  |   3237508   15858410  +389.8%
  linux          13.525s   100.837s   7.46x  |  49800744   81939893   +64.5%
  llvm-project    4.275s    31.188s   7.30x  |  19309116   25336867   +31.2%
  rails           0.986s     5.607s   5.69x  |   4990252    6317541   +26.5%
  rust            1.260s    13.250s  10.52x  |   6053824    8601440   +42.0%
  webkit          3.658s    30.469s   8.33x  |  12288620   19438512   +58.1%
---
 commit-graph.c | 28 +++++++++++++++++++++-------
 1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 1677e4fb3e..8eb0cbedaf 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -67,6 +67,8 @@ struct modified_path_bloom_filter_info {
 	uint64_t offset;
 	uint32_t merge_index_pos;
 	struct modified_path_bloom_filter_info *next;
+	/* TODO: need better names for 'next' and 'next_commit_bfi' */
+	struct modified_path_bloom_filter_info *next_commit_bfi;
 };
 
 static void free_modified_path_bloom_filter_info_in_slab(
@@ -1159,6 +1161,10 @@ struct write_commit_graph_context {
 
 		/* Used to find identical modified path Bloom filters */
 		struct hashmap dedup_hashmap;
+
+		/* To chain up Bloom filters in history order */
+		struct modified_path_bloom_filter_info *first_commit_bfi;
+		struct modified_path_bloom_filter_info *prev_commit_bfi;
 	} mpbfctx;
 };
 
@@ -1363,18 +1369,18 @@ static int write_graph_chunk_modified_path_bloom_filters(struct hashfile *f,
 				struct write_commit_graph_context *ctx)
 {
 	uint64_t offset = 0;
-	int i;
+	struct modified_path_bloom_filter_info *next_commit_bfi;
 
-	for (i = 0; i < ctx->commits.nr; i++) {
-		struct commit *commit = ctx->commits.list[i];
-		struct modified_path_bloom_filter_info *bfi;
-		unsigned int filter_size;
+	next_commit_bfi = ctx->mpbfctx.first_commit_bfi;
+	while (next_commit_bfi) {
+		struct modified_path_bloom_filter_info *bfi = next_commit_bfi;
+
+		next_commit_bfi = next_commit_bfi->next_commit_bfi;
 
 		display_progress(ctx->progress, ++ctx->progress_cnt);
 
-		bfi = modified_path_bloom_filters_peek(
-				&modified_path_bloom_filters, commit);
 		for (; bfi; bfi = bfi->next) {
+			unsigned int filter_size;
 			if (bfi->duplicate_of)
 				continue;
 			if (!bfi->filter.nr_bits)
@@ -1762,6 +1768,14 @@ static void create_modified_path_bloom_filter(
 	bloom_filter_set_bits(&bfi->filter, mpbfctx->hashes,
 			      mpbfctx->hashes_nr);
 
+	if (!mpbfctx->first_commit_bfi) {
+		mpbfctx->first_commit_bfi = bfi;
+		mpbfctx->prev_commit_bfi = bfi;
+	} else if (!nth_parent) {
+		mpbfctx->prev_commit_bfi->next_commit_bfi = bfi;
+		mpbfctx->prev_commit_bfi = bfi;
+	}
+
 	if (path_component_count > mpbfctx->embedded_limit &&
 	    !handle_duplicate_modified_path_bloom_filter(mpbfctx, bfi))
 		mpbfctx->total_filter_size += sizeof(uint32_t) +
-- 
2.27.0.rc1.431.g5c813f95dc
Modified path Bloom filters don't store the names of modified paths,
they only set a couple of bits based on those paths' hashes.
Consequently, they can only be used when looking for the history of a
concrete path, so we disabled them when looking for pathspecs with
wildcards.

However, if the pathspec has "wildcard-less" leading directories, then
we can use modified path Bloom filters to skip commits that don't
modify those leading directories. As a result, something like:

  git -c core.modifiedPathBloomFilters=1 rev-list HEAD -- 'compat/win32/pthread.*'

will take only ~0.045s instead of ~1.24s, achieving over 27x speedup.

For comparison, letting the shell do the wildcard matching, i.e. the
equivalent of using the pathspecs 'compat/win32/pthread.c' and
'compat/win32/pthread.h', takes ~0.311s without using modified path
Bloom filters: apparently tree-diff with wildcards can be considerably
more expensive than without wildcards, even if the wildcard is in the
last path component and that directory only contains a dozen files.
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 44 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 34 insertions(+), 10 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 8eb0cbedaf..db43877426 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1023,9 +1023,9 @@ static void compute_modified_path_bloom_hashes_for_path(const char *path,
 void init_pathspec_bloom_fields(struct repository *r,
 				struct pathspec *pathspec)
 {
-	const unsigned bloom_compatible_magic = PATHSPEC_LITERAL;
+	const unsigned bloom_compatible_magic = PATHSPEC_LITERAL | PATHSPEC_GLOB;
 	struct commit_graph *graph = r->objects->commit_graph;
-	int i;
+	int i, can_use_modified_path_bloom_filters;
 
 	if (!graph)
 		return;
@@ -1033,15 +1033,14 @@ void init_pathspec_bloom_fields(struct repository *r,
 		return;
 	if (!pathspec->nr)
 		return;
-	if (pathspec->has_wildcard)
-		return;
 	if (pathspec->magic & ~bloom_compatible_magic)
 		return;
 
+	can_use_modified_path_bloom_filters = 1;
 	for (i = 0; i < pathspec->nr; i++) {
 		struct pathspec_item *pi = &pathspec->items[i];
 		const char *path = pi->match, *p;
-		size_t len = pi->len;
+		size_t nowildcard_len = pi->nowildcard_len;
 		int path_component_nr = 0, j;
 		uint32_t *hashes;
 		struct bloom_filter embedded_bf;
@@ -1051,14 +1050,29 @@ void init_pathspec_bloom_fields(struct repository *r,
 		 * slashes, but a trailing slash might still be present,
 		 * "remove" it.
 		 */
-		if (path[len - 1] == '/')
-			len--;
+		if (path[nowildcard_len - 1] == '/')
+			nowildcard_len--;
 
 		p = path;
 		do {
 			p = strchrnul(p + 1, '/');
-			path_component_nr++;
-		} while (p - path < len);
+			if (p - path <= nowildcard_len)
+				path_component_nr++;
+		} while (p - path < nowildcard_len);
+
+		/*
+		 * If a pathspec uses wildcards but has wildcard-less
+		 * leading directories, then we can use modified path Bloom
+		 * filters to skip commits that don't modify those leading
+		 * directories.
+		 * However, if there is even one pathspec that has a wildcard
+		 * in its first path component, then we have no choice but
+		 * to run tree-diff anyway, so don't bother with Bloom
+		 * filters at all in that case.
+		 */
+		if (!path_component_nr) {
+			can_use_modified_path_bloom_filters = 0;
+			break;
+		}
 
 		pi->modified_path_bloom_hashes_nr = path_component_nr *
 			graph->num_modified_path_bloom_hashes;
 		ALLOC_ARRAY(pi->modified_path_bloom_hashes,
@@ -1084,7 +1098,17 @@ void init_pathspec_bloom_fields(struct repository *r,
 			pi->modified_path_bloom_hashes_nr);
 	}
 
-	pathspec->can_use_modified_path_bloom_filters = 1;
+	if (can_use_modified_path_bloom_filters) {
+		pathspec->can_use_modified_path_bloom_filters = 1;
+	} else {
+		int j;
+
+		for (j = 0; j < i; j++) {
+			struct pathspec_item *pi = &pathspec->items[j];
+			FREE_AND_NULL(pi->modified_path_bloom_hashes);
+			pi->modified_path_bloom_hashes_nr = 0;
+			pi->modified_path_bloom_mask = 0;
+		}
+	}
 }
 
 struct packed_commit_list {
-- 
2.27.0.rc1.431.g5c813f95dc
On 5/29/2020 4:50 AM, SZEDER Gábor wrote:
> Sigh... but better late than never, right?

True. The best time to present these ideas would have been while the
series was under review, which was open for a reasonable amount of
time. But there's no helping that, so what can we learn from your
extensive work here?

> I experimented quite a bit with modified path Bloom filters a year and
> more ago, and got quite far... but my disappointment in the
> inadequacy of all double hashing schemes, the arrival of split
> commit-graphs, and, well, life in general has put the whole thing on
> the back burner, and I haven't touched it for a couple of releases.
>
> Now I finally managed to take a closer look at the current changed
> paths Bloom filters implementation, and saw that it has some of the
> same issues that I had stumbled upon and that it missed some
> optimization opportunities. Unfortunately, fixing those issues and
> performing those optimizations do require a thorough format change.

My initial reaction is "it's too late for a format change now" but
that's not quite true. Even if we ship v2.27.0 with the existing
format, the data is in optional chunks. It would not be too difficult
to replace the storage with a different set of optional chunks and
stop reading the old ones.

The question remains: exactly _how much better_ is your
implementation, and is it enough to justify the upheaval? The other
thing that impacts this consideration is how complicated the format
becomes. Bloom filters are already complicated, so how can we keep
things simple when possible?

> So here is my proof of concept version, in all its incompleteness,
> with the following benefits:

These are all very interesting items. Here is my initial
categorization: these first ones are bold claims that I hope you have
end-to-end performance numbers to back up. Depending on how large the
benefit is, we can consider a replacement.

> - Better hashing scheme (though it should be better still).
> - Orders of magnitude lower average false positive rate.
> - Consequently, faster pathspec-limited revision walks.

These bullets seem like they may apply to the existing implementation,
since they seem unrelated to the format itself. Is it possible to
contribute these ideas to the existing implementation first?

> - Faster processing of the tree-diff output and lower memory usage
>   while computing Bloom filters (from scratch...).
> - Supports multiple pathspecs right from the start.
> - Supports some wildcards in pathspecs.
> - More cleanups, more bugfixes.

These items are confusing to me. I either don't know what they mean,
or doubt the value of having them present.

> - Better understanding of the problem it tries to optimize.
> - Better understanding of the issues with many small Bloom filters.
> - Deduplicates Bloom filters.
> - It has the right name :) The diff machinery and all its frontends
>   report "modified" paths with the letter 'M', not "changed".
> - Handles as many commits as the commit-graph format can.
> - Optional support for storing Bloom filters for all parents of
>   merge commits.

(This last item in particular is one where I have already communicated
why there is little value in storing filters for diffs against a
second parent. Perhaps you have some concrete numbers on what happens
in terms of storage-vs-performance when adding these diffs.)

> Alas, the drawbacks are significant:
>
> - No tests whatsoever.
> - Computes all modified path Bloom filters from scratch when
>   writing, no matter what.
> - Doesn't work with split commit-graphs.
> - Basically if anything works besides 'git commit-graph write
>   --reachable' it's a miracle.
> - Not a single test.
> - Many BUG()s, which should rather be graceful errors... though I
>   have to admit that at this point they are indeed bugs.
> - Many TODOs, both in commit messages and code, some incomplete
>   commit messages, crappy subject lines, even missing signoffs.
> - Some ridiculously long variable, function, macro and config
>   variable names.
> - It's based on v2.25.0 (no technical reason, but that's the version
>   I used to run the baseline benchmarks the last time, which takes
>   days...)
> - I'm pretty sure that there are more...
> - Oh, did I mention that there are no tests?

And now you get to the _real_ hard part of this feature. It is easy to
create a prototype that implements the idea, but it's another step to
make it a full contribution. You clearly know this, which I expect is
the reason this was not submitted earlier.

> The first 14 patches are preparatory fixes and cleanups:
>
> 01/34 tree-walk.c: don't match submodule entries for 'submod/anything'
>
> This fix or something similar is necessary to have consistent output
> with and without modified path Bloom filters for paths crossing
> submodule boundary.
>
> 02/34 commit-graph: fix parsing the Chunk Lookup table
>
> The minimal (though not the best) fix for a bug which, I think, is as
> old as the commit-graph. I don't know how to test this.
>
> 03/34 commit-graph-format.txt: all multi-byte numbers are in network byte order
> 04/34 commit-slab: add a function to deep free entries on the slab
> 05/34 diff.h: drop diff_tree_oid() & friends' return value
> 06/34 commit-graph: clean up #includes
>
> A couple of minor cleanups.
>
> 07/34 commit-graph: simplify parse_commit_graph() #1
> 08/34 commit-graph: simplify parse_commit_graph() #2
>
> These two would be the right, though not minimal fix for the parsing
> bug above.
>
> 09/34 commit-graph: simplify write_commit_graph_file() #1
> 10/34 commit-graph: simplify write_commit_graph_file() #2
> 11/34 commit-graph: allocate the 'struct chunk_info' array dinamically
>
> I think these three cleanup patches are a better alternative of
> 3be7efcafc (commit-graph: define and use MAX_NUM_CHUNKS, 2020-03-30),
> because...
> 12/34 commit-graph: unify the signatures of all write_graph_chunk_*() functions
> 13/34 commit-graph: simplify write_commit_graph_file() #3
> 14/34 commit-graph: check chunk sizes after writing

These 14 patches are probably worth submitting on their own. I hope
they still apply cleanly to a recent master.

> ... they laid the ground work for this patch.
>
> 15/34 commit-graph-format.txt: document the modified path Bloom filter chunks
>
> This is the most important one, specifying and _justifying_ the new
> chunk formats.
>
> Do grab a cup or pint of your favourite beverage and get comfy before
> reading this one. You have been warned.
>
> 16/34 Add a generic and minimal Bloom filter implementation
> 17/34 Import a streaming-capable Murmur3 hash function implementation
> 18/34 commit-graph: write "empty" Modified Path Bloom Filter Index chunk
> 19/34 commit-graph: add commit slab for modified path Bloom filters
> 20/34 commit-graph: fill the Modified Path Bloom Filter Index chunk
>
> This shows a more efficient approach to process the tree-diff output
> into Bloom filters.
>
> 21/34 commit-graph: load and use the Modified Path Bloom Filter Index chunk
> 22/34 commit-graph: write the Modified Path Bloom Filters chunk
> 23/34 commit-graph: load and use the Modified Path Bloom Filters chunk

Ok, these implement the new filter format.

> 24/34 commit-graph: check all leading directories in modified path Bloom filters
>
> This was a good lightbulb moment. It is essential to try to maintain
> reasonable performance in repositories where the vast majority of
> changes are concentrated to a single directory.

This is really interesting! This could apply to the existing format,
so when you compare formats this patch should either be in both or in
neither.
> 25/34 commit-graph: check embedded modified path Bloom filters with a mask
> 26/34 commit-graph: deduplicate modified path Bloom filters
> 27/34 commit-graph: load modified path Bloom filters for merge commits
> 28/34 commit-graph: write Modified Path Bloom Filter Merge Index chunk
> 29/34 commit-graph: extract init and free write_commit_graph_context
> 30/34 commit-graph: move write_commit_graph_reachable below write_commit_graph
> 31/34 t7007-show: make the first test compatible with the next patch
> 32/34 PoC commit-graph: use revision walk machinery for '--reachable'
>
> Once upon a time I thought this was the greatest idea ever, but as
> time goes by I get more and more concerned that this is a really dumb
> idea, though don't yet know why.

Ok, here are some extra patches that implement your extra features.
The multi-parent filters are a fair thing to compare against the
existing implementation, but we need to compare file size changes,
extra computation time changes, and how much the end-to-end rev-list
performance changes. Are the costs justified?

> 33/34 commit-graph: write modified path Bloom filters in "history order"

I think we've covered this in these two patches:

  3d112755056 commit-graph: examine commits by generation number
  d21ee7d1110 commit-graph: examine changed-path objects in pack order

> 34/34 commit-graph: use modified path Bloom filters with wildcards, if possible
>
> Finally a cherry on top.

This is also something that could be interesting for the current
implementation. Perhaps group it with patches 1-14 and 24, if the idea
applies to the existing implementation.

I think it would be great to apply whatever code cleanups and fixes
that you've presented here to the existing implementation _first_,
then we have the optimal way to compare how your proposed format
improves over the existing implementation. I can't commit more time to
this series today, but hopefully the discussion is just getting
started.
I'll try to poke around some of the patches next week.

Thanks,
-Stolee
On Fri, May 29, 2020 at 10:50:04AM +0200, SZEDER Gábor wrote:
> Sigh... but better late than never, right?

Yes, indeed. I think that there is a balance here: I'm thrilled that
you are choosing to spend your time working on and improving the
changed-path Bloom filter implementation.

Of course, it couldn't have hurt to have these ideas earlier when the
list was more focused on reviewing Garima's original patches. But,
nothing is set in stone, and it seems like there are some re-usable
ideas and clean-ups below.

I'll respond to the patch descriptions below with a "cut-line" where I
think we could extract and apply what's already there before putting
the rest aside to be worked on more.

> I experimented quite a bit with modified path Bloom filters a year and
> more ago, and got quite far... but my disappointment in the
> inadequacy of all double hashing schemes, the arrival of split
> commit-graphs, and, well, life in general has put the whole thing on
> the back burner, and I haven't touched it for a couple of releases.
>
> Now I finally managed to take a closer look at the current changed
> paths Bloom filters implementation, and saw that it has some of the
> same issues that I had stumbled upon and that it missed some
> optimization opportunities. Unfortunately, fixing those issues and
> performing those optimizations do require a thorough format change.
>
> So here is my proof of concept version, in all its incompleteness,
> with the following benefits:
>
> - Better understanding of the problem it tries to optimize.
> - Better understanding of the issues with many small Bloom filters.
> - Better hashing scheme (though it should be better still).
> - Orders of magnitude lower average false positive rate.
> - Consequently, faster pathspec-limited revision walks.
> - Faster processing of the tree-diff output and lower memory usage
>   while computing Bloom filters (from scratch...).
> - Optional support for storing Bloom filters for all parents of
>   merge commits.
> - Deduplicates Bloom filters.
> - Supports multiple pathspecs right from the start.
> - Supports some wildcards in pathspecs.
> - Handles as many commits as the commit-graph format can.
> - It has the right name :) The diff machinery and all its frontends
>   report "modified" paths with the letter 'M', not "changed".
> - More cleanups, more bugfixes.
> - Consistent output with and without modified path Bloom filters for
>   over 80k random paths in 16 repositories, even with submodules in
>   them. Well, at least on my machine, if nowhere else...
>
> Alas, the drawbacks are significant:
>
> - No tests whatsoever.

Yes, obviously this needs to be addressed, but I certainly don't fault
you for posting what you have (thanks for adding the 'PoC' prefix to
indicate it as such).

> - Computes all modified path Bloom filters from scratch when
>   writing, no matter what.
> - Doesn't work with split commit-graphs.

I originally thought that this would be a non-starter, but it may not
be. GitHub doesn't write Bloom filters at all (yet), but one
consideration of ours is to only generate Bloom filters when we
roll-up all of the split graphs. Our invocation during this bigger
repository maintenance is something like:

  $ git commit-graph write --reachable --split=replace

But if it turns out to be expensive to run 'git commit-graph write
--split=no-merge --stdin-commits --write --changed-paths' on every
push, we might just run the above with '--changed-paths'.

I can imagine other schemes where it would be desirable to have your
Bloom filter implementation over split graphs, in which case this
should be addressed. But, I don't have a problem with a version that
only works for non-split graphs first, so long as there isn't a
compatibility boundary between that and a version that does work for
split graphs.

> - Basically if anything works besides 'git commit-graph write
>   --reachable' it's a miracle.
> - Not a single test.
> - Many BUG()s, which should rather be graceful errors... though I
>   have to admit that at this point they are indeed bugs.
> - Many TODOs, both in commit messages and code, some incomplete
>   commit messages, crappy subject lines, even missing signoffs.
> - Some ridiculously long variable, function, macro and config
>   variable names.
> - It's based on v2.25.0 (no technical reason, but that's the version
>   I used to run the baseline benchmarks the last time, which takes
>   days...)
> - I'm pretty sure that there are more...
> - Oh, did I mention that there are no tests?

...I think so ;-).

> The first 14 patches are preparatory fixes and cleanups:
>
> 01/34 tree-walk.c: don't match submodule entries for 'submod/anything'
>
> This fix or something similar is necessary to have consistent output
> with and without modified path Bloom filters for paths crossing
> submodule boundary.
>
> 02/34 commit-graph: fix parsing the Chunk Lookup table
>
> The minimal (though not the best) fix for a bug which, I think, is as
> old as the commit-graph. I don't know how to test this.
>
> 03/34 commit-graph-format.txt: all multi-byte numbers are in network byte order
> 04/34 commit-slab: add a function to deep free entries on the slab
> 05/34 diff.h: drop diff_tree_oid() & friends' return value
> 06/34 commit-graph: clean up #includes
>
> A couple of minor cleanups.
>
> 07/34 commit-graph: simplify parse_commit_graph() #1
> 08/34 commit-graph: simplify parse_commit_graph() #2
>
> These two would be the right, though not minimal fix for the parsing
> bug above.
>
> 09/34 commit-graph: simplify write_commit_graph_file() #1
> 10/34 commit-graph: simplify write_commit_graph_file() #2
> 11/34 commit-graph: allocate the 'struct chunk_info' array dinamically
>
> I think these three cleanup patches are a better alternative of
> 3be7efcafc (commit-graph: define and use MAX_NUM_CHUNKS, 2020-03-30),
> because...
> 12/34 commit-graph: unify the signatures of all write_graph_chunk_*() functions
> 13/34 commit-graph: simplify write_commit_graph_file() #3
> 14/34 commit-graph: check chunk sizes after writing

I think you're right to draw the "laying the ground work" line here.
Could these first fourteen patches be applied cleanly to master? They
all look mostly like improvements to me, especially the second patch.

> ... they laid the ground work for this patch.
>
> 15/34 commit-graph-format.txt: document the modified path Bloom filter chunks
>
> This is the most important one, specifying and _justifying_ the new
> chunk formats.
>
> Do grab a cup or pint of your favourite beverage and get comfy before
> reading this one. You have been warned.
>
> 16/34 Add a generic and minimal Bloom filter implementation
> 17/34 Import a streaming-capable Murmur3 hash function implementation
> 18/34 commit-graph: write "empty" Modified Path Bloom Filter Index chunk
> 19/34 commit-graph: add commit slab for modified path Bloom filters
> 20/34 commit-graph: fill the Modified Path Bloom Filter Index chunk
>
> This shows a more efficient approach to process the tree-diff output
> into Bloom filters.
>
> 21/34 commit-graph: load and use the Modified Path Bloom Filter Index chunk
> 22/34 commit-graph: write the Modified Path Bloom Filters chunk
> 23/34 commit-graph: load and use the Modified Path Bloom Filters chunk
> 24/34 commit-graph: check all leading directories in modified path Bloom filters
>
> This was a good lightbulb moment. It is essential to try to maintain
> reasonable performance in repositories where the vast majority of
> changes are concentrated to a single directory.
> 25/34 commit-graph: check embedded modified path Bloom filters with a mask
> 26/34 commit-graph: deduplicate modified path Bloom filters
> 27/34 commit-graph: load modified path Bloom filters for merge commits
> 28/34 commit-graph: write Modified Path Bloom Filter Merge Index chunk
> 29/34 commit-graph: extract init and free write_commit_graph_context
> 30/34 commit-graph: move write_commit_graph_reachable below write_commit_graph
> 31/34 t7007-show: make the first test compatible with the next patch
> 32/34 PoC commit-graph: use revision walk machinery for '--reachable'
>
> Once upon a time I thought this was the greatest idea ever, but as
> time goes by I get more and more concerned that this is a really dumb
> idea, though I don't yet know why.
>
> 33/34 commit-graph: write modified path Bloom filters in "history order"
> 34/34 commit-graph: use modified path Bloom filters with wildcards, if possible
>
> Finally, a cherry on top.

The next twenty patches look like they need some more work. At a high
level, I think we need to do the following things, in roughly this
order:

- Prove that this is a strict performance improvement over the
  existing Bloom filter implementation.
- Sketch out a way that it would work for cases where '--split' is
  provided and/or '--reachable' isn't.
- Clean up the existing patches, adding tests which exercise the new
  functionality, and prove that there are no regressions against the
  existing Bloom filter implementation.
- Review and repeat.

I guess I'm left wondering whether this is something you want to
continue, given the pretty lengthy path I outlined for getting it into
a suitable shape. Are you planning to keep working on this?

Thanks,
Taylor
Taylor Blau <me@ttaylorr.com> writes:

> On Fri, May 29, 2020 at 10:50:04AM +0200, SZEDER Gábor wrote:
>> Sigh... but better late than never, right?
>
> Yes, indeed. I think that there is a balance here: I'm thrilled that you
> are choosing to spend your time working on and improving the
> changed-path Bloom filter implementation.
>
> Of course, it couldn't have hurt to have these ideas earlier, when the
> list was more focused on reviewing Garima's original patches. But
> nothing is set in stone, and it seems like there are some re-usable
> ideas and clean-ups below.

Yes, I had the same impression as you did, unlike some folks who
sounded as if they felt offended, reading it as a comment that
sabotages existing work. It could have been presented in a more useful
way (i.e. massaged to build on top of what is already there as
improvements, instead of risking the appearance of suggesting a total
replacement, which I do not think was the intention), though.

> I think you're right to draw the "laying the ground work" line here.
> Could these first fourteen patches be applied cleanly to master? They
> all look mostly like improvements to me, especially the second patch.

Again, I concur.

Thanks.
On Fri, May 29, 2020 at 10:50:19AM +0200, SZEDER Gábor wrote:

> Modified Path Bloom Filters
>
> Each modified path Bloom filter consists of:
>
> - 4 bytes specifying the number of bits in the Bloom filter's bit
>   array.
>
>   For practical purposes 32 bits are more than sufficient to store
>   the number of bits in the Bloom filter's array. When using k = 7
>   hashes, i.e. 10 bits per path, we could store over 400 million
>   paths in a single Bloom filter; with k = 32 hashes we'd use 46
>   bits per path, which still allows over 93 million paths.

> Alternatives considered
> -----------------------
>
> Here are some alternatives that I've considered but discarded, and
> ideas that I haven't (yet) followed through on:
>
> - Varints. Using 4 bytes for the size of each Bloom filter in the
>   Modified Path Bloom Filters chunk is a lot more than necessary.
>   How much space can be saved by using varints?

Not that much, at least compared to the whole commit-graph file.

Since 63-bit modified path Bloom filters are embedded in the index
entries, it's pointless to store smaller Bloom filters, so the size of
non-embedded Bloom filters can be defined as

  nr_bits = 64 + decode_varint(...)

A one-byte varint can encode values up to 127, while a two-byte varint
can encode values up to 16511. So the max nr_bits is 191 and 16575,
respectively, which means at most 19 or 1657 paths with 7 hashes,
i.e. 10 bits per path.

The table below shows the percentage of commits with embedded modified
path Bloom filters and of commits whose non-embedded Bloom filters fit
in one-byte and two-byte varints, and how much space is saved.
                 Percentage of commits    |
                 modifying <= N paths     |  varint      commit-graph
                 compared to first parent | size diff     file size
                  <=6     <=19    <=1657  |  (bytes)      (ls -lh)
  ----------------------------------------+-------------------------------
  elasticsearch  18.32%  65.56%   99.79%  |  -90158     9.5M    -0.9%
  jdk            26.62%  70.46%   97.94%  |  -90262      16M    -0.6%
  webkit         38.42%  82.24%   99.92%  | -298824      19M    -1.6%
  android-base   42.32%  88.02%   99.98%  | -132327      30M    -0.5%
  llvm-project   53.60%  93.86%   99.99%  | -344239      24M    -1.4%
  gecko-dev      54.54%  87.15%   99.87%  | -625029      63M    -1.0%
  tensorflow     55.17%  90.76%   99.52%  |  -83758     9.0M    -0.9%
  rails          58.71%  95.13%   99.99%  |  -53153     6.0M    -0.9%
  rust           59.29%  88.03%   99.96%  |  -90677     8.2M    -1.1%
  glibc          61.59%  91.11%   99.96%  |  -38422     3.8M    -1.0%
  gcc            63.80%  95.39%   99.97%  | -179463      18M    -1.0%
  go             65.31%  95.19%   99.99%  |  -39109     3.2M    -1.2%
  cmssw          67.58%  93.02%   99.91%  | -154440      23M    -0.7%
  linux          72.79%  95.27%   99.78%  | -527045      78M    -0.7%
  cpython        81.91%  97.78%  100.00%  |  -40678     7.8M    -0.5%
  git            90.28%  98.56%  100.00%  |  -13406     3.9M    -0.4%
  homebrew-cask  98.61%  99.58%   99.99%  |   -2992     6.6M    -0.1%
  homebrew-core  99.81%  99.93%  100.00%  |    -810      11M    -0.1%

> How much is the runtime overhead of decoding those varints?

Not enough to be above noise level.
("a lot of paths", MurmurHash3, k = 7, enhanced double hashing,
32-bit uint arithmetic)

                                       | Average time spent
                    Average runtime    | loading and querying
                                       | Bloom filters
                  uint32     varint    |  uint32    varint
  -------------------------------------+-----------------------------
  android-base   0.1456s  0.144172s    | 0.01550s  0.01571s    1.3%
  cmssw          0.0313s  0.032046s    | 0.00383s  0.00395s    3.1%
  cpython        0.0810s  0.081649s    | 0.00765s  0.00785s    2.6%
  elasticsearch  0.0136s  0.013899s    | 0.00246s  0.00258s    4.8%
  gcc            0.2114s  0.211080s    | 0.02259s  0.02261s      0%
  gecko-dev      0.4815s  0.474903s    | 0.06004s  0.06028s    0.3%
  git            0.0310s  0.031156s    | 0.00192s  0.00195s    1.5%
  glibc          0.0282s  0.029032s    | 0.00320s  0.00342s    6.8%
  go             0.0403s  0.039408s    | 0.00428s  0.00414s   -3.3%
  jdk            0.0068s  0.006666s    | 0.00117s  0.00113s   -3.5%
  linux          0.0873s  0.087438s    | 0.01109s  0.01133s    2.1%
  llvm-project   0.4135s  0.418390s    | 0.04716s  0.04834s    2.5%
  rails          0.0391s  0.038997s    | 0.00449s  0.00448s   -0.3%
  rust           0.0439s  0.044569s    | 0.00444s  0.00461s    3.8%
  tensorflow     0.0487s  0.049166s    | 0.00618s  0.00634s    2.5%
  webkit         0.2420s  0.241807s    | 0.03353s  0.03379s    0.7%

> How can we ensure that varint decoding doesn't read past the end
> of the mmapped memory region?

With a decode_varint() variant that takes the max number of bytes to
read as an additional parameter, and returns 0 if the varint is too
long.
On 5/29/2020 4:50 AM, SZEDER Gábor wrote:
> +void free_depth_in_slab(int **ptr)
This needs to be "static" to compile with DEVELOPER=1.
(I'm working to apply your patches on top of ds/line-log-on-bloom
so we can get the benefit of these universally-good things. I just
want to report the things that I stumble on.)
Thanks,
-Stolee
On Thu, Jun 04, 2020 at 12:43:32PM -0400, Derrick Stolee wrote:
> On 5/29/2020 4:50 AM, SZEDER Gábor wrote:
> > +void free_depth_in_slab(int **ptr)
>
> This needs to be "static" to compile with DEVELOPER=1.

You are right, of course, but...

Do you get an error because of this when building with DEVELOPER=1?
I don't get an error when building with DEVELOPER=1...

> (I'm working to apply your patches on top of ds/line-log-on-bloom
> so we can get the benefit of these universally-good things. I just
> want to report the things that I stumble on.)

Thanks.
On 6/27/2020 11:56 AM, SZEDER Gábor wrote:
> On Thu, Jun 04, 2020 at 12:43:32PM -0400, Derrick Stolee wrote:
>> On 5/29/2020 4:50 AM, SZEDER Gábor wrote:
>>> +void free_depth_in_slab(int **ptr)
>>
>> This needs to be "static" to compile with DEVELOPER=1.
>
> You are right, of course, but...
>
> Do you get an error because of this when building with DEVELOPER=1? I
> don't get an error when building with DEVELOPER=1...

Yes, I believe that's how I discovered it.

>> (I'm working to apply your patches on top of ds/line-log-on-bloom
>> so we can get the benefit of these universally-good things. I just
>> want to report the things that I stumble on.)
>
> Thanks.

Sure thing.

-Stolee