From: "SZEDER Gábor" <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: "Junio C Hamano" <gitster@pobox.com>,
"Garima Singh" <garima.singh@microsoft.com>,
"Derrick Stolee" <stolee@gmail.com>,
"Jakub Narebski" <jnareb@gmail.com>, "Jeff King" <peff@peff.net>,
"Taylor Blau" <me@ttaylorr.com>,
"SZEDER Gábor" <szeder.dev@gmail.com>
Subject: [PATCH 21/34] commit-graph: load and use the Modified Path Bloom Filter Index chunk
Date: Fri, 29 May 2020 10:50:25 +0200 [thread overview]
Message-ID: <20200529085038.26008-22-szeder.dev@gmail.com> (raw)
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
Load the Modified Path Bloom Filter Index (MPBI) chunk and check the
embedded modified path Bloom filters within to speed up
pathspec-limited revison walks.
Extend 'struct pathspec_item' to hold the hashes of each path for
modified path Bloom filters. Pathspecs are used by many Git commands,
and a lot of them doesn't do any revision walk at all, so don't
initialize those fields in parse_pathspec(), but only when we are
about to start a pathspec-limited revision walk, i.e. in
prepare_revision_walk(). Initialize these fields only in the
revs->pruning.pathspec instance, because that's the one used to limit
the revision walk. Note that some revision walks re-parse the
pathspecs, notably 'git log --follow' and the line-level log, but they
don't use this pathspec instance.
The table below shows the average runtime and speedup of
git rev-list HEAD -- "$path"
for 5000+ randomly selected paths from each repository with and
without modified path Bloom filters:
Average runtime
without with
MPBI MPBI
--------------------------------------------
android-base 0.8780s n/a
cmssw 0.3143s 0.2104s -33.1%
cpython 0.7453s 0.3410s -54.2%
gcc 7.1852s 3.3633s -53.2%
gecko-dev 4.6113s n/a
git 0.6180s 0.2184s -64.7%
glibc 0.5618s 0.3245s -42.2%
jdk 0.0482s 0.0395s -18.0%
linux 0.7043s 0.4810s -31.6%
llvm-project 2.6844s n/a
rails 0.2784s 0.1792s -35.6%
rust 0.7757s 0.6073s -21.7%
webkit 1.9137s 1.4832s -22.5%
The results in the homebrew-core repository can be spectacular, though
that repository looks like as if it was specifically designed to show
off the benefits of this patch: over 98% of its almost 164k commits in
its linear history modify one file within the 'Formula' directory, and
that directory contains almost 5000 files. Consequently, almost all
commits will have a 63 bit "embedded" modified path Bloom filter
containing only 2 paths, resulting in a false positive rate less than
0.1%, and they can spare a lot of expensive tree entry scanning to the
end of that directory. So looking at the history of the first and
last files in that directory:
Runtime Max RSS
without with without with
MPBI MPBI Speedup MPBI MPBI
-----------------------------------------------------------------
Formula/a2ps.rb 8.390s 0.218s 38.5x 422884k 129212k
Formula/zzz.rb 80.495s 0.176s 450.4x 423744k 68364k
('Formula/a2ps.rb' is modified 32 times and has two false positives,
while 'zzz.rb' is modified only twice and happens to have only a
single false positive; that's why querying the history of the former
is slower, even though it's at the beginning of the directory).
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
commit-graph.c | 146 +++++++++++++++++++++++++++++++++++++++++++++++++
commit-graph.h | 13 +++++
pathspec.c | 9 +++
pathspec.h | 12 ++++
revision.c | 18 ++++++
5 files changed, 198 insertions(+)
diff --git a/commit-graph.c b/commit-graph.c
index 413605a29c..fb24600bb3 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -26,6 +26,7 @@
#define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
#define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
#define GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_INDEX 0x4d504249 /* "MPBI" */
+#define GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_EXCLUDES 0x4d504258 /* "MPBX" */
#define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
@@ -308,6 +309,23 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
chunk_repeated = 1;
else
graph->chunk_base_graphs = data + chunk_offset;
+ break;
+
+ case GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_INDEX:
+ if (graph->chunk_mpbf_index)
+ chunk_repeated = 1;
+ else {
+ graph->chunk_mpbf_index = data + chunk_offset;
+ graph->chunk_mpbf_index_size = next_chunk_offset - chunk_offset;
+ }
+ break;
+
+ case GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_EXCLUDES:
+ if (graph->chunk_mpbf_excludes)
+ chunk_repeated = 1;
+ else
+ graph->chunk_mpbf_excludes = data + chunk_offset;
+ break;
}
if (chunk_repeated) {
@@ -324,6 +342,29 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
return NULL;
}
+ if (graph->chunk_mpbf_index) {
+ int res = 0;
+ git_config_get_bool("core.modifiedPathBloomFilters", &res);
+ if (res) {
+ uint64_t expected_size = sizeof(uint8_t) +
+ graph->num_commits * sizeof(uint64_t);
+ if (graph->chunk_mpbf_index_size != expected_size)
+ BUG(_("for %u commits the Modified Path Bloom Filter Index chunk should be %ld bytes, but it's %ld"),
+ graph->num_commits, expected_size,
+ graph->chunk_mpbf_index_size);
+ graph->num_modified_path_bloom_hashes = *graph->chunk_mpbf_index;
+ if (!graph->num_modified_path_bloom_hashes)
+ BUG(_("Modified Path Bloom Filter Index chunk using 0 hashes"));
+ if (graph->chunk_mpbf_excludes)
+ /*
+ * Modified Path Bloom Filter Excludes are
+ * not supported yet.
+ */
+ graph->chunk_mpbf_index = NULL;
+ } else
+ graph->chunk_mpbf_index = NULL;
+ }
+
return graph;
}
@@ -777,6 +818,68 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c);
}
+static int load_modified_path_bloom_filter_from_graph(
+ struct commit_graph *graph, struct commit *commit,
+ struct commit *parent, struct bloom_filter *bf)
+{
+ const uint8_t *bloom_index;
+ int first_parent = 0;
+
+ if (commit->graph_pos == COMMIT_NOT_FROM_GRAPH)
+ return 0;
+ if (!graph)
+ return 0;
+ if (!graph->chunk_mpbf_index)
+ return 0;
+
+ if (!commit->parents || commit->parents->item == parent)
+ first_parent = 1;
+
+ bloom_index = graph->chunk_mpbf_index + sizeof(uint8_t) +
+ sizeof(uint64_t) * commit->graph_pos;
+
+ if (bloom_index[0] & (1 << 7)) {
+ uint64_t v;
+ if (!first_parent)
+ return 0;
+ memcpy(&v, bloom_index, sizeof(v));
+ if (v == GRAPH_MODIFIED_PATH_BLOOM_FILTER_NONE)
+ return 0;
+
+ /* embedded modified path Bloom filter */
+ bf->nr_bits = GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS;
+ bf->bits = (uint8_t*) bloom_index;
+ return 1;
+ }
+ /* support for non-embedded Bloom filters is not implemented yet. */
+ return 0;
+}
+
+enum bloom_result check_modified_path_bloom_filter(struct repository *r,
+ struct pathspec *pathspec, struct commit *parent,
+ struct commit *commit)
+{
+ struct bloom_filter bf;
+ int i;
+
+ if (!pathspec->can_use_modified_path_bloom_filters)
+ return BLOOM_CANT_TELL;
+ if (!load_modified_path_bloom_filter_from_graph(r->objects->commit_graph,
+ commit, parent, &bf))
+ return BLOOM_CANT_TELL;
+
+ for (i = 0; i < pathspec->nr; i++) {
+ struct pathspec_item *pi = &pathspec->items[i];
+
+ if (bloom_filter_check_bits(&bf,
+ pi->modified_path_bloom_hashes,
+ pi->modified_path_bloom_hashes_nr))
+ return BLOOM_POSSIBLY_YES;
+ }
+
+ return BLOOM_DEFINITELY_NOT;
+}
+
static uint32_t modified_path_bloom_seeds[] = {
0xe83c5163U,
0x3b376b0cU,
@@ -807,6 +910,49 @@ static void compute_modified_path_bloom_hashes_for_path(const char *path,
}
}
+void init_pathspec_bloom_fields(struct repository *r,
+ struct pathspec *pathspec)
+{
+ const unsigned bloom_compatible_magic = PATHSPEC_LITERAL;
+ struct commit_graph *graph = r->objects->commit_graph;
+ int i;
+
+ if (!graph)
+ return;
+ if (!graph->chunk_mpbf_index)
+ return;
+ if (!pathspec->nr)
+ return;
+ if (pathspec->has_wildcard)
+ return;
+ if (pathspec->magic & ~bloom_compatible_magic)
+ return;
+
+ for (i = 0; i < pathspec->nr; i++) {
+ struct pathspec_item *pi = &pathspec->items[i];
+ const char *path = pi->match;
+ size_t len = pi->len;
+
+ /*
+ * Pathspec parsing has normalized away any consecutive
+ * slashes, but a trailing slash might still be present,
+ * "remove" it.
+ */
+ if (path[len - 1] == '/')
+ len--;
+
+ pi->modified_path_bloom_hashes_nr = graph->num_modified_path_bloom_hashes;
+ ALLOC_ARRAY(pi->modified_path_bloom_hashes,
+ pi->modified_path_bloom_hashes_nr);
+
+ compute_modified_path_bloom_hashes_for_path(path, len,
+ graph->num_modified_path_bloom_hashes,
+ pi->modified_path_bloom_hashes);
+ }
+
+ pathspec->can_use_modified_path_bloom_filters = 1;
+}
+
struct packed_commit_list {
struct commit **list;
int nr;
diff --git a/commit-graph.h b/commit-graph.h
index 847cd25cfc..09dfc16932 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -11,6 +11,8 @@ struct commit;
struct repository;
struct raw_object_store;
struct string_list;
+struct bloom_filter;
+struct pathspec;
char *get_commit_graph_filename(const char *obj_dir);
int open_commit_graph(const char *graph_file, int *fd, struct stat *st);
@@ -38,6 +40,12 @@ void load_commit_graph_info(struct repository *r, struct commit *item);
struct tree *get_commit_tree_in_graph(struct repository *r,
const struct commit *c);
+enum bloom_result check_modified_path_bloom_filter(struct repository *r,
+ struct pathspec *pathspec, struct commit *parent,
+ struct commit *commit);
+void init_pathspec_bloom_fields(struct repository *r,
+ struct pathspec *pathspec);
+
struct commit_graph {
int graph_fd;
@@ -59,6 +67,11 @@ struct commit_graph {
const unsigned char *chunk_commit_data;
const unsigned char *chunk_extra_edges;
const unsigned char *chunk_base_graphs;
+ const unsigned char *chunk_mpbf_index;
+ uint64_t chunk_mpbf_index_size;
+ const unsigned char *chunk_mpbf_excludes;
+
+ uint8_t num_modified_path_bloom_hashes;
};
struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st);
diff --git a/pathspec.c b/pathspec.c
index 128f27fcb7..01e7ae6944 100644
--- a/pathspec.c
+++ b/pathspec.c
@@ -406,6 +406,8 @@ static void init_pathspec_item(struct pathspec_item *item, unsigned flags,
item->attr_check = NULL;
item->attr_match = NULL;
item->attr_match_nr = 0;
+ item->modified_path_bloom_hashes_nr = 0;
+ item->modified_path_bloom_hashes = NULL;
/* PATHSPEC_LITERAL_PATH ignores magic */
if (flags & PATHSPEC_LITERAL_PATH) {
@@ -674,6 +676,12 @@ void copy_pathspec(struct pathspec *dst, const struct pathspec *src)
}
d->attr_check = attr_check_dup(s->attr_check);
+
+ ALLOC_ARRAY(d->modified_path_bloom_hashes,
+ d->modified_path_bloom_hashes_nr);
+ COPY_ARRAY(d->modified_path_bloom_hashes,
+ s->modified_path_bloom_hashes,
+ d->modified_path_bloom_hashes_nr);
}
}
@@ -684,6 +692,7 @@ void clear_pathspec(struct pathspec *pathspec)
for (i = 0; i < pathspec->nr; i++) {
free(pathspec->items[i].match);
free(pathspec->items[i].original);
+ free(pathspec->items[i].modified_path_bloom_hashes);
for (j = 0; j < pathspec->items[i].attr_match_nr; j++)
free(pathspec->items[i].attr_match[j].value);
diff --git a/pathspec.h b/pathspec.h
index 454ce364fa..0c661e1b03 100644
--- a/pathspec.h
+++ b/pathspec.h
@@ -32,6 +32,7 @@ struct pathspec {
unsigned int has_wildcard:1;
unsigned int recursive:1;
unsigned int recurse_submodules:1;
+ unsigned int can_use_modified_path_bloom_filters:1;
unsigned magic;
int max_depth;
struct pathspec_item {
@@ -52,6 +53,17 @@ struct pathspec {
} match_mode;
} *attr_match;
struct attr_check *attr_check;
+
+ /*
+ * These fields are only relevant during pathspec-limited
+ * revision walks, init_pathspec_item() leaves them
+ * zero-initialized, but copy_pathspec() copies them,
+ * and clear_pathspec() releases the allocated memory.
+ * IOW: they are only valid if 'struct pathspec's
+ * 'can_use_modified_path_bloom_filters' bit is set.
+ */
+ uint32_t modified_path_bloom_hashes_nr;
+ uint32_t *modified_path_bloom_hashes;
} *items;
};
diff --git a/revision.c b/revision.c
index 9dac1262ef..6ec4bc0cb1 100644
--- a/revision.c
+++ b/revision.c
@@ -29,6 +29,7 @@
#include "prio-queue.h"
#include "hashmap.h"
#include "utf8.h"
+#include "bloom-filter.h"
volatile show_early_output_fn_t show_early_output;
@@ -629,6 +630,7 @@ static int rev_compare_tree(struct rev_info *revs,
{
struct tree *t1 = get_commit_tree(parent);
struct tree *t2 = get_commit_tree(commit);
+ enum bloom_result bloom_ret;
if (!t1)
return REV_TREE_NEW;
@@ -653,6 +655,12 @@ static int rev_compare_tree(struct rev_info *revs,
return REV_TREE_SAME;
}
+ bloom_ret = check_modified_path_bloom_filter(revs->repo,
+ &revs->pruning.pathspec,
+ parent, commit);
+ if (bloom_ret == BLOOM_DEFINITELY_NOT)
+ return REV_TREE_SAME;
+
tree_difference = REV_TREE_SAME;
revs->pruning.flags.has_changes = 0;
diff_tree_oid(&t1->object.oid, &t2->object.oid, "", &revs->pruning);
@@ -662,10 +670,17 @@ static int rev_compare_tree(struct rev_info *revs,
static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit)
{
struct tree *t1 = get_commit_tree(commit);
+ enum bloom_result bloom_ret;
if (!t1)
return 0;
+ bloom_ret = check_modified_path_bloom_filter(revs->repo,
+ &revs->pruning.pathspec,
+ NULL, commit);
+ if (bloom_ret == BLOOM_DEFINITELY_NOT)
+ return 1;
+
tree_difference = REV_TREE_SAME;
revs->pruning.flags.has_changes = 0;
diff_tree_oid(NULL, &t1->object.oid, "", &revs->pruning);
@@ -3376,6 +3391,9 @@ int prepare_revision_walk(struct rev_info *revs)
simplify_merges(revs);
if (revs->children.name)
set_children(revs);
+
+ init_pathspec_bloom_fields(revs->repo, &revs->pruning.pathspec);
+
return 0;
}
--
2.27.0.rc1.431.g5c813f95dc
next prev parent reply other threads:[~2020-05-29 8:51 UTC|newest]
Thread overview: 43+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-05-29 8:50 [PoC PATCH 00/34] An alternative modified path Bloom filters implementation SZEDER Gábor
2020-05-28 22:09 ` Johannes Schindelin
2020-05-29 8:50 ` [PATCH 01/34] tree-walk.c: don't match submodule entries for 'submod/anything' SZEDER Gábor
2020-05-29 8:50 ` [PATCH 02/34] commit-graph: fix parsing the Chunk Lookup table SZEDER Gábor
2020-05-29 8:50 ` [PATCH 03/34] commit-graph-format.txt: all multi-byte numbers are in network byte order SZEDER Gábor
2020-05-29 8:50 ` [PATCH 04/34] commit-slab: add a function to deep free entries on the slab SZEDER Gábor
2020-06-04 16:43 ` Derrick Stolee
2020-06-27 15:56 ` SZEDER Gábor
2020-06-29 11:30 ` Derrick Stolee
2020-05-29 8:50 ` [PATCH 05/34] diff.h: drop diff_tree_oid() & friends' return value SZEDER Gábor
2020-05-29 8:50 ` [PATCH 06/34] commit-graph: clean up #includes SZEDER Gábor
2020-05-29 8:50 ` [PATCH 07/34] commit-graph: simplify parse_commit_graph() #1 SZEDER Gábor
2020-05-29 8:50 ` [PATCH 08/34] commit-graph: simplify parse_commit_graph() #2 SZEDER Gábor
2020-05-29 8:50 ` [PATCH 09/34] commit-graph: simplify write_commit_graph_file() #1 SZEDER Gábor
2020-05-29 8:50 ` [PATCH 10/34] commit-graph: simplify write_commit_graph_file() #2 SZEDER Gábor
2020-05-29 8:50 ` [PATCH 11/34] commit-graph: allocate the 'struct chunk_info' array dinamically SZEDER Gábor
2020-05-29 8:50 ` [PATCH 12/34] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor
2020-05-29 8:50 ` [PATCH 13/34] commit-graph: simplify write_commit_graph_file() #3 SZEDER Gábor
2020-05-29 8:50 ` [PATCH 14/34] commit-graph: check chunk sizes after writing SZEDER Gábor
2020-05-29 8:50 ` [PATCH 15/34] commit-graph-format.txt: document the modified path Bloom filter chunks SZEDER Gábor
2020-06-02 17:59 ` SZEDER Gábor
2020-05-29 8:50 ` [PATCH 16/34] Add a generic and minimal Bloom filter implementation SZEDER Gábor
2020-05-29 8:50 ` [PATCH 17/34] Import a streaming-capable Murmur3 hash function implementation SZEDER Gábor
2020-05-29 8:50 ` [PATCH 18/34] commit-graph: write "empty" Modified Path Bloom Filter Index chunk SZEDER Gábor
2020-05-29 8:50 ` [PATCH 19/34] commit-graph: add commit slab for modified path Bloom filters SZEDER Gábor
2020-05-29 8:50 ` [PATCH 20/34] commit-graph: fill the Modified Path Bloom Filter Index chunk SZEDER Gábor
2020-05-29 8:50 ` SZEDER Gábor [this message]
2020-05-29 8:50 ` [PATCH 22/34] commit-graph: write the Modified Path Bloom Filters chunk SZEDER Gábor
2020-05-29 8:50 ` [PATCH 23/34] commit-graph: load and use " SZEDER Gábor
2020-05-29 8:50 ` [PATCH 24/34] commit-graph: check all leading directories in modified path Bloom filters SZEDER Gábor
2020-05-29 8:50 ` [PATCH 25/34] commit-graph: check embedded modified path Bloom filters with a mask SZEDER Gábor
2020-05-29 8:50 ` [PATCH 26/34] commit-graph: deduplicate modified path Bloom filters SZEDER Gábor
2020-05-29 8:50 ` [PATCH 27/34] commit-graph: load modified path Bloom filters for merge commits SZEDER Gábor
2020-05-29 8:50 ` [PATCH 28/34] commit-graph: write Modified Path Bloom Filter Merge Index chunk SZEDER Gábor
2020-05-29 8:50 ` [PATCH 29/34] commit-graph: extract init and free write_commit_graph_context SZEDER Gábor
2020-05-29 8:50 ` [PATCH 30/34] commit-graph: move write_commit_graph_reachable below write_commit_graph SZEDER Gábor
2020-05-29 8:50 ` [PATCH 31/34] t7007-show: make the first test compatible with the next patch SZEDER Gábor
2020-05-29 8:50 ` [PATCH 32/34] PoC commit-graph: use revision walk machinery for '--reachable' SZEDER Gábor
2020-05-29 8:50 ` [PATCH 33/34] commit-graph: write modified path Bloom filters in "history order" SZEDER Gábor
2020-05-29 8:50 ` [PATCH 34/34] commit-graph: use modified path Bloom filters with wildcards, if possible SZEDER Gábor
2020-05-29 13:59 ` [PoC PATCH 00/34] An alternative modified path Bloom filters implementation Derrick Stolee
2020-06-01 23:25 ` Taylor Blau
2020-06-02 17:08 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200529085038.26008-22-szeder.dev@gmail.com \
--to=szeder.dev@gmail.com \
--cc=garima.singh@microsoft.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=jnareb@gmail.com \
--cc=me@ttaylorr.com \
--cc=peff@peff.net \
--cc=stolee@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).