* [PATCH v2 01/11] commit-graph: use MAX_NUM_CHUNKS
2020-02-05 22:56 ` [PATCH v2 00/11] " Garima Singh via GitGitGadget
@ 2020-02-05 22:56 ` Garima Singh via GitGitGadget
2020-02-09 12:39 ` Jakub Narebski
2020-02-05 22:56 ` [PATCH v2 02/11] bloom: core Bloom filter implementation for changed paths Garima Singh via GitGitGadget
` (12 subsequent siblings)
13 siblings, 1 reply; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-02-05 22:56 UTC (permalink / raw)
To: git
Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
garimasigit, jnareb, christian.couder, emilyshaffer, gitster,
Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
This is a minor cleanup to make it easier to change the
number of chunks being written to the commit-graph in the future.
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
commit-graph.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index b205e65ed1..3c4d411326 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -23,6 +23,7 @@
#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
#define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
#define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
+#define MAX_NUM_CHUNKS 5
#define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
@@ -1356,8 +1357,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
int fd;
struct hashfile *f;
struct lock_file lk = LOCK_INIT;
- uint32_t chunk_ids[6];
- uint64_t chunk_offsets[6];
+ uint32_t chunk_ids[MAX_NUM_CHUNKS + 1];
+ uint64_t chunk_offsets[MAX_NUM_CHUNKS + 1];
const unsigned hashsz = the_hash_algo->rawsz;
struct strbuf progress_title = STRBUF_INIT;
int num_chunks = 3;
--
gitgitgadget
* Re: [PATCH v2 01/11] commit-graph: use MAX_NUM_CHUNKS
2020-02-05 22:56 ` [PATCH v2 01/11] commit-graph: use MAX_NUM_CHUNKS Garima Singh via GitGitGadget
@ 2020-02-09 12:39 ` Jakub Narebski
0 siblings, 0 replies; 159+ messages in thread
From: Jakub Narebski @ 2020-02-09 12:39 UTC (permalink / raw)
To: Garima Singh via GitGitGadget
Cc: git, stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
garimasigit, christian.couder, emilyshaffer, gitster,
Garima Singh
"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Garima Singh <garima.singh@microsoft.com>
> Subject: Re: [PATCH v2 01/11] commit-graph: use MAX_NUM_CHUNKS
>
> This is a minor cleanup to make it easier to change the
> number of chunks being written to the commit-graph in the future.
Looks good to me...
...with the very minor possible nitpick that the subject probably should
be
[PATCH v2 01/11] commit-graph: define and use MAX_NUM_CHUNKS
But this is just bikeshedding. Feel free to disregard this.
Best,
--
Jakub Narębski
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
> commit-graph.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index b205e65ed1..3c4d411326 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -23,6 +23,7 @@
> #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
> #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
> #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
> +#define MAX_NUM_CHUNKS 5
>
> #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
>
> @@ -1356,8 +1357,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
> int fd;
> struct hashfile *f;
> struct lock_file lk = LOCK_INIT;
> - uint32_t chunk_ids[6];
> - uint64_t chunk_offsets[6];
> + uint32_t chunk_ids[MAX_NUM_CHUNKS + 1];
> + uint64_t chunk_offsets[MAX_NUM_CHUNKS + 1];
> const unsigned hashsz = the_hash_algo->rawsz;
> struct strbuf progress_title = STRBUF_INIT;
> int num_chunks = 3;
* [PATCH v2 02/11] bloom: core Bloom filter implementation for changed paths
2020-02-05 22:56 ` [PATCH v2 00/11] " Garima Singh via GitGitGadget
2020-02-05 22:56 ` [PATCH v2 01/11] commit-graph: use MAX_NUM_CHUNKS Garima Singh via GitGitGadget
@ 2020-02-05 22:56 ` Garima Singh via GitGitGadget
2020-02-15 17:17 ` Jakub Narebski
2020-02-16 16:49 ` Jakub Narebski
2020-02-05 22:56 ` [PATCH v2 03/11] diff: halt tree-diff early after max_changes Derrick Stolee via GitGitGadget
` (11 subsequent siblings)
13 siblings, 2 replies; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-02-05 22:56 UTC (permalink / raw)
To: git
Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
garimasigit, jnareb, christian.couder, emilyshaffer, gitster,
Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
Add the core Bloom filter logic for computing the paths changed between a
commit and its first parent. For details on what Bloom filters are and how they
work, please refer to Dr. Derrick Stolee's blog post [1]. It provides a concise
explanation of the adoption of Bloom filters as described in [2] and [3].
1. We currently use 7 and 10 for the number of hashes and the size of each
entry respectively. They served as great starting values; the mathematical
details behind this choice are described in [1] and [4]. The implementation,
while not completely open to it at the moment, is flexible enough to allow
for tweaking these settings in the future.
Note: The performance gains we have observed with these values are
significant enough that we did not need to tweak these settings.
The performance numbers are included in the cover letter of this series
and in the message of a subsequent commit where we use Bloom filters
to speed up `git log -- <path>`.
2. As described in the blog and in [3], we do not need 7 independent hashing
functions. We use the Murmur3 hashing scheme. Seed it twice and then
combine those to procure an arbitrary number of hash values.
3. The filters are sized according to the number of changes in each commit,
with a minimum size of one 64-bit word.
4. We fill the Bloom filters as (const char *data, int len) pairs as
"struct bloom_filter"s in a commit slab.
5. The seed_murmur3 method is implemented as described in [5]. It hashes the
given data using a given seed and produces a uniformly distributed hash
value.
[1] https://devblogs.microsoft.com/devops/super-charging-the-git-commit-graph-iv-Bloom-filters/
[2] Flavio Bonomi, Michael Mitzenmacher, Rina Panigrahy, Sushil Singh, George Varghese
"An Improved Construction for Counting Bloom Filters"
http://theory.stanford.edu/~rinap/papers/esa2006b.pdf
https://doi.org/10.1007/11841036_61
[3] Peter C. Dillinger and Panagiotis Manolios
"Bloom Filters in Probabilistic Verification"
http://www.ccs.neu.edu/home/pete/pub/Bloom-filters-verification.pdf
https://doi.org/10.1007/978-3-540-30494-4_26
[4] Thomas Mueller Graf, Daniel Lemire
"Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters"
https://arxiv.org/abs/1912.08258
[5] https://en.wikipedia.org/wiki/MurmurHash#Algorithm
Helped-by: Jeff King <peff@peff.net>
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
Makefile | 2 +
bloom.c | 228 ++++++++++++++++++++++++++++++++++++++++++
bloom.h | 56 +++++++++++
t/helper/test-bloom.c | 84 ++++++++++++++++
t/helper/test-tool.c | 1 +
t/helper/test-tool.h | 1 +
t/t0095-bloom.sh | 113 +++++++++++++++++++++
7 files changed, 485 insertions(+)
create mode 100644 bloom.c
create mode 100644 bloom.h
create mode 100644 t/helper/test-bloom.c
create mode 100755 t/t0095-bloom.sh
diff --git a/Makefile b/Makefile
index 6134104ae6..afba81f4a8 100644
--- a/Makefile
+++ b/Makefile
@@ -695,6 +695,7 @@ X =
PROGRAMS += $(patsubst %.o,git-%$X,$(PROGRAM_OBJS))
+TEST_BUILTINS_OBJS += test-bloom.o
TEST_BUILTINS_OBJS += test-chmtime.o
TEST_BUILTINS_OBJS += test-config.o
TEST_BUILTINS_OBJS += test-ctype.o
@@ -840,6 +841,7 @@ LIB_OBJS += base85.o
LIB_OBJS += bisect.o
LIB_OBJS += blame.o
LIB_OBJS += blob.o
+LIB_OBJS += bloom.o
LIB_OBJS += branch.o
LIB_OBJS += bulk-checkin.o
LIB_OBJS += bundle.o
diff --git a/bloom.c b/bloom.c
new file mode 100644
index 0000000000..6082193a75
--- /dev/null
+++ b/bloom.c
@@ -0,0 +1,228 @@
+#include "git-compat-util.h"
+#include "bloom.h"
+#include "commit-graph.h"
+#include "object-store.h"
+#include "diff.h"
+#include "diffcore.h"
+#include "revision.h"
+#include "hashmap.h"
+
+define_commit_slab(bloom_filter_slab, struct bloom_filter);
+
+struct bloom_filter_slab bloom_filters;
+
+struct pathmap_hash_entry {
+ struct hashmap_entry entry;
+ const char path[FLEX_ARRAY];
+};
+
+static uint32_t rotate_right(uint32_t value, int32_t count)
+{
+ uint32_t mask = 8 * sizeof(uint32_t) - 1;
+ count &= mask;
+ return ((value >> count) | (value << ((-count) & mask)));
+}
+
+/*
+ * Calculate a hash value for the given data using the given seed.
+ * Produces a uniformly distributed hash value.
+ * Not considered to be cryptographically secure.
+ * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
+ **/
+static uint32_t seed_murmur3(uint32_t seed, const char *data, int len)
+{
+ const uint32_t c1 = 0xcc9e2d51;
+ const uint32_t c2 = 0x1b873593;
+ const uint32_t r1 = 15;
+ const uint32_t r2 = 13;
+ const uint32_t m = 5;
+ const uint32_t n = 0xe6546b64;
+ int i;
+ uint32_t k1 = 0;
+ const char *tail;
+
+ int len4 = len / sizeof(uint32_t);
+
+ const uint32_t *blocks = (const uint32_t*)data;
+
+ uint32_t k;
+ for (i = 0; i < len4; i++)
+ {
+ k = blocks[i];
+ k *= c1;
+ k = rotate_right(k, r1);
+ k *= c2;
+
+ seed ^= k;
+ seed = rotate_right(seed, r2) * m + n;
+ }
+
+ tail = (data + len4 * sizeof(uint32_t));
+
+ switch (len & (sizeof(uint32_t) - 1))
+ {
+ case 3:
+ k1 ^= ((uint32_t)tail[2]) << 16;
+ /*-fallthrough*/
+ case 2:
+ k1 ^= ((uint32_t)tail[1]) << 8;
+ /*-fallthrough*/
+ case 1:
+ k1 ^= ((uint32_t)tail[0]) << 0;
+ k1 *= c1;
+ k1 = rotate_right(k1, r1);
+ k1 *= c2;
+ seed ^= k1;
+ break;
+ }
+
+ seed ^= (uint32_t)len;
+ seed ^= (seed >> 16);
+ seed *= 0x85ebca6b;
+ seed ^= (seed >> 13);
+ seed *= 0xc2b2ae35;
+ seed ^= (seed >> 16);
+
+ return seed;
+}
+
+static inline uint64_t get_bitmask(uint32_t pos)
+{
+ return ((uint64_t)1) << (pos & (BITS_PER_WORD - 1));
+}
+
+void load_bloom_filters(void)
+{
+ init_bloom_filter_slab(&bloom_filters);
+}
+
+void fill_bloom_key(const char *data,
+ int len,
+ struct bloom_key *key,
+ struct bloom_filter_settings *settings)
+{
+ int i;
+ const uint32_t seed0 = 0x293ae76f;
+ const uint32_t seed1 = 0x7e646e2c;
+ const uint32_t hash0 = seed_murmur3(seed0, data, len);
+ const uint32_t hash1 = seed_murmur3(seed1, data, len);
+
+ key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
+ for (i = 0; i < settings->num_hashes; i++)
+ key->hashes[i] = hash0 + i * hash1;
+}
+
+void add_key_to_filter(struct bloom_key *key,
+ struct bloom_filter *filter,
+ struct bloom_filter_settings *settings)
+{
+ int i;
+ uint64_t mod = filter->len * BITS_PER_WORD;
+
+ for (i = 0; i < settings->num_hashes; i++) {
+ uint64_t hash_mod = key->hashes[i] % mod;
+ uint64_t block_pos = hash_mod / BITS_PER_WORD;
+
+ filter->data[block_pos] |= get_bitmask(hash_mod);
+ }
+}
+
+struct bloom_filter *get_bloom_filter(struct repository *r,
+ struct commit *c)
+{
+ struct bloom_filter *filter;
+ struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
+ int i;
+ struct diff_options diffopt;
+
+ if (!bloom_filters.slab_size)
+ return NULL;
+
+ filter = bloom_filter_slab_at(&bloom_filters, c);
+
+ repo_diff_setup(r, &diffopt);
+ diffopt.flags.recursive = 1;
+ diff_setup_done(&diffopt);
+
+ if (c->parents)
+ diff_tree_oid(&c->parents->item->object.oid, &c->object.oid, "", &diffopt);
+ else
+ diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
+ diffcore_std(&diffopt);
+
+ if (diff_queued_diff.nr <= 512) {
+ struct hashmap pathmap;
+ struct pathmap_hash_entry* e;
+ struct hashmap_iter iter;
+ hashmap_init(&pathmap, NULL, NULL, 0);
+
+ for (i = 0; i < diff_queued_diff.nr; i++) {
+ const char* path = diff_queued_diff.queue[i]->two->path;
+ const char* p = path;
+
+ /*
+ * Add each leading directory of the changed file, i.e. for
+ * 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
+ * the Bloom filter could be used to speed up commands like
+ * 'git log dir/subdir', too.
+ *
+ * Note that directories are added without the trailing '/'.
+ */
+ do {
+ char* last_slash = strrchr(p, '/');
+
+ FLEX_ALLOC_STR(e, path, path);
+ hashmap_entry_init(&e->entry, strhash(p));
+ hashmap_add(&pathmap, &e->entry);
+
+ if (!last_slash)
+ last_slash = (char*)p;
+ *last_slash = '\0';
+
+ } while (*p);
+
+ diff_free_filepair(diff_queued_diff.queue[i]);
+ }
+
+ filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
+ filter->data = xcalloc(filter->len, sizeof(uint64_t));
+
+ hashmap_for_each_entry(&pathmap, &iter, e, entry) {
+ struct bloom_key key;
+ fill_bloom_key(e->path, strlen(e->path), &key, &settings);
+ add_key_to_filter(&key, filter, &settings);
+ }
+
+ hashmap_free_entries(&pathmap, struct pathmap_hash_entry, entry);
+ } else {
+ for (i = 0; i < diff_queued_diff.nr; i++)
+ diff_free_filepair(diff_queued_diff.queue[i]);
+ filter->data = NULL;
+ filter->len = 0;
+ }
+
+ free(diff_queued_diff.queue);
+ DIFF_QUEUE_CLEAR(&diff_queued_diff);
+
+ return filter;
+}
+
+int bloom_filter_contains(struct bloom_filter *filter,
+ struct bloom_key *key,
+ struct bloom_filter_settings *settings)
+{
+ int i;
+ uint64_t mod = filter->len * BITS_PER_WORD;
+
+ if (!mod)
+ return -1;
+
+ for (i = 0; i < settings->num_hashes; i++) {
+ uint64_t hash_mod = key->hashes[i] % mod;
+ uint64_t block_pos = hash_mod / BITS_PER_WORD;
+ if (!(filter->data[block_pos] & get_bitmask(hash_mod)))
+ return 0;
+ }
+
+ return 1;
+}
diff --git a/bloom.h b/bloom.h
new file mode 100644
index 0000000000..7f40c751f7
--- /dev/null
+++ b/bloom.h
@@ -0,0 +1,56 @@
+#ifndef BLOOM_H
+#define BLOOM_H
+
+struct commit;
+struct repository;
+struct commit_graph;
+
+struct bloom_filter_settings {
+ uint32_t hash_version;
+ uint32_t num_hashes;
+ uint32_t bits_per_entry;
+};
+
+#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
+#define BITS_PER_WORD 64
+
+/*
+ * A bloom_filter struct represents a data segment to
+ * use when testing hash values. The 'len' member
+ * dictates how many uint64_t entries are stored in
+ * 'data'.
+ */
+struct bloom_filter {
+ uint64_t *data;
+ int len;
+};
+
+/*
+ * A bloom_key represents the k hash values for a
+ * given hash input. These can be precomputed and
+ * stored in a bloom_key for re-use when testing
+ * against a bloom_filter.
+ */
+struct bloom_key {
+ uint32_t *hashes;
+};
+
+void load_bloom_filters(void);
+
+void fill_bloom_key(const char *data,
+ int len,
+ struct bloom_key *key,
+ struct bloom_filter_settings *settings);
+
+void add_key_to_filter(struct bloom_key *key,
+ struct bloom_filter *filter,
+ struct bloom_filter_settings *settings);
+
+struct bloom_filter *get_bloom_filter(struct repository *r,
+ struct commit *c);
+
+int bloom_filter_contains(struct bloom_filter *filter,
+ struct bloom_key *key,
+ struct bloom_filter_settings *settings);
+
+#endif
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
new file mode 100644
index 0000000000..331957011b
--- /dev/null
+++ b/t/helper/test-bloom.c
@@ -0,0 +1,84 @@
+#include "test-tool.h"
+#include "git-compat-util.h"
+#include "bloom.h"
+#include "test-tool.h"
+#include "cache.h"
+#include "commit-graph.h"
+#include "commit.h"
+#include "config.h"
+#include "object-store.h"
+#include "object.h"
+#include "repository.h"
+#include "tree.h"
+
+struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
+
+static void print_bloom_filter(struct bloom_filter *filter) {
+ int i;
+
+ if (!filter) {
+ printf("No filter.\n");
+ return;
+ }
+ printf("Filter_Length:%d\n", filter->len);
+ printf("Filter_Data:");
+ for (i = 0; i < filter->len; i++){
+ printf("%"PRIx64"|", filter->data[i]);
+ }
+ printf("\n");
+}
+
+static void add_string_to_filter(const char *data, struct bloom_filter *filter) {
+ struct bloom_key key;
+ int i;
+
+ fill_bloom_key(data, strlen(data), &key, &settings);
+ printf("Hashes:");
+ for (i = 0; i < settings.num_hashes; i++){
+ printf("%08x|", key.hashes[i]);
+ }
+ printf("\n");
+ add_key_to_filter(&key, filter, &settings);
+}
+
+static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
+{
+ struct commit *c;
+ struct bloom_filter *filter;
+ setup_git_directory();
+ c = lookup_commit(the_repository, commit_oid);
+ filter = get_bloom_filter(the_repository, c);
+ print_bloom_filter(filter);
+}
+
+int cmd__bloom(int argc, const char **argv)
+{
+ if (!strcmp(argv[1], "generate_filter")) {
+ struct bloom_filter filter;
+ int i = 2;
+ filter.len = (settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
+ filter.data = xcalloc(filter.len, sizeof(uint64_t));
+
+ if (!argv[2]){
+ die("at least one input string expected");
+ }
+
+ while (argv[i]) {
+ add_string_to_filter(argv[i], &filter);
+ i++;
+ }
+
+ print_bloom_filter(&filter);
+ }
+
+ if (!strcmp(argv[1], "get_filter_for_commit")) {
+ struct object_id oid;
+ const char *end;
+ if (parse_oid_hex(argv[2], &oid, &end))
+ die("cannot parse oid '%s'", argv[2]);
+ load_bloom_filters();
+ get_bloom_filter_for_commit(&oid);
+ }
+
+ return 0;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index c9a232d238..ca4f4b0066 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -14,6 +14,7 @@ struct test_cmd {
};
static struct test_cmd cmds[] = {
+ { "bloom", cmd__bloom },
{ "chmtime", cmd__chmtime },
{ "config", cmd__config },
{ "ctype", cmd__ctype },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index c8549fd87f..05d2b32451 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -4,6 +4,7 @@
#define USE_THE_INDEX_COMPATIBILITY_MACROS
#include "git-compat-util.h"
+int cmd__bloom(int argc, const char **argv);
int cmd__chmtime(int argc, const char **argv);
int cmd__config(int argc, const char **argv);
int cmd__ctype(int argc, const char **argv);
diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
new file mode 100755
index 0000000000..424fe4fc29
--- /dev/null
+++ b/t/t0095-bloom.sh
@@ -0,0 +1,113 @@
+#!/bin/sh
+
+test_description='test bloom.c'
+. ./test-lib.sh
+
+test_expect_success 'get bloom filters for commit with no changes' '
+ git init &&
+ git commit --allow-empty -m "c0" &&
+ cat >expect <<-\EOF &&
+ Filter_Length:0
+ Filter_Data:
+ EOF
+ test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'get bloom filter for commit with 10 changes' '
+ rm actual &&
+ rm expect &&
+ mkdir smallDir &&
+ for i in $(test_seq 0 9)
+ do
+ echo $i >smallDir/$i
+ done &&
+ git add smallDir &&
+ git commit -m "commit with 10 changes" &&
+ cat >expect <<-\EOF &&
+ Filter_Length:4
+ Filter_Data:508928809087080a|8a7648210804001|4089824400951000|841ab310098051a8|
+ EOF
+ test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success EXPENSIVE 'get bloom filter for commit with 513 changes' '
+ rm actual &&
+ rm expect &&
+ mkdir bigDir &&
+ for i in $(test_seq 0 512)
+ do
+ echo $i >bigDir/$i
+ done &&
+ git add bigDir &&
+ git commit -m "commit with 513 changes" &&
+ cat >expect <<-\EOF &&
+ Filter_Length:0
+ Filter_Data:
+ EOF
+ test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'compute bloom key for empty string' '
+ cat >expect <<-\EOF &&
+ Hashes:5615800c|5b966560|61174ab4|66983008|6c19155c|7199fab0|771ae004|
+ Filter_Length:1
+ Filter_Data:11000110001110|
+ EOF
+ test-tool bloom generate_filter "" >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'compute bloom key for whitespace' '
+ cat >expect <<-\EOF &&
+ Hashes:1bf014e6|8a91b50b|f9335530|67d4f555|d676957a|4518359f|b3b9d5c4|
+ Filter_Length:1
+ Filter_Data:401004080200810|
+ EOF
+ test-tool bloom generate_filter " " >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'compute bloom key for a root level folder' '
+ cat >expect <<-\EOF &&
+ Hashes:1a21016f|fff1c06d|e5c27f6b|cb933e69|b163fd67|9734bc65|7d057b63|
+ Filter_Length:1
+ Filter_Data:aaa800000000|
+ EOF
+ test-tool bloom generate_filter "A" >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'compute bloom key for a root level file' '
+ cat >expect <<-\EOF &&
+ Hashes:e2d51107|30970605|7e58fb03|cc1af001|19dce4ff|679ed9fd|b560cefb|
+ Filter_Length:1
+ Filter_Data:a8000000000000aa|
+ EOF
+ test-tool bloom generate_filter "file.txt" >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'compute bloom key for a deep folder' '
+ cat >expect <<-\EOF &&
+ Hashes:864cf838|27f055cd|c993b362|6b3710f7|0cda6e8c|ae7dcc21|502129b6|
+ Filter_Length:1
+ Filter_Data:1c0000600003000|
+ EOF
+ test-tool bloom generate_filter "A/B/C/D/E" >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'compute bloom key for a deep file' '
+ cat >expect <<-\EOF &&
+ Hashes:07cdf850|4af629c7|8e1e5b3e|d1468cb5|146ebe2c|5796efa3|9abf211a|
+ Filter_Length:1
+ Filter_Data:4020100804010080|
+ EOF
+ test-tool bloom generate_filter "A/B/C/D/E/file.txt" >actual &&
+ test_cmp expect actual
+'
+
+test_done
--
gitgitgadget
* Re: [PATCH v2 02/11] bloom: core Bloom filter implementation for changed paths
2020-02-05 22:56 ` [PATCH v2 02/11] bloom: core Bloom filter implementation for changed paths Garima Singh via GitGitGadget
@ 2020-02-15 17:17 ` Jakub Narebski
2020-02-16 16:49 ` Jakub Narebski
1 sibling, 0 replies; 159+ messages in thread
From: Jakub Narebski @ 2020-02-15 17:17 UTC (permalink / raw)
To: Garima Singh via GitGitGadget
Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
Jeff Hostetler, Taylor Blau, Jeff King, Christian Couder,
Emily Shaffer, Junio C Hamano, Garima Singh, Garima Singh
"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Garima Singh <garima.singh@microsoft.com>
>
> Add the core Bloom filter logic for computing the paths changed between a
> commit and its first parent. For details on what Bloom filters are and how they
> work, please refer to Dr. Derrick Stolee's blog post [1]. It provides a concise
> explanation of the adoption of Bloom filters as described in [2] and [3].
^^- to add
>
> 1. We currently use 7 and 10 for the number of hashes and the size of each
> entry respectively. They served as great starting values, the mathematical
> details behind this choice are described in [1] and [4]. The implementation,
^^- to add
> while not completely open to it at the moment, is flexible enough to allow
> for tweaking these settings in the future.
I don't know if it is worth it, but I think it should be size of each
entry, or in other words number of bits per element in the set, as first
value, and number of hashes as second.
About where those values come from. The idea is that you decide on the
acceptable number of false positives, for example 1% (or 0.8% given that
the values must be integers); that gives you number of bits per element
i.e. 10, and from there you can find optimal number of hashes i.e. 7.
The references mentioned (and Wikipedia article) have those equations.
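To make that arithmetic concrete, here is a short worked sketch using the standard approximation for a Bloom filter. The symbols are notation introduced here, not taken from the patch: m is the number of bits, n the number of elements, k the number of hash functions, and p the false-positive probability.

```latex
p \approx \left(1 - e^{-kn/m}\right)^{k}, \qquad
k_{\mathrm{opt}} = \frac{m}{n}\ln 2, \qquad
\frac{m}{n} = -\frac{\ln p}{(\ln 2)^{2}} \quad \text{at } k = k_{\mathrm{opt}}

p = 0.01 \;\Rightarrow\; \frac{m}{n} \approx 9.59 \to 10, \qquad
k_{\mathrm{opt}} = 10 \ln 2 \approx 6.93 \to 7, \qquad
p \approx \left(1 - e^{-0.7}\right)^{7} \approx 0.0082
```

So rounding both values to integers gives 10 bits per element and 7 hashes, with an expected false-positive rate of roughly 0.8%.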
>
> Note: The performance gains we have observed with these values are
> significant enough that we did not need to tweak these settings.
> The performance numbers are included in the cover letter of this series
> and in the message of a subsequent commit where we use Bloom filters
> to speed up `git log -- <path>`.
All right.
>
> 2. As described in the blog and in [3], we do not need 7 independent hashing
> functions. We use the Murmur3 hashing scheme. Seed it twice and then
> combine those to procure an arbitrary number of hash values.
The technique from [3] is called "double hashing" (Algorithm 1 and
equation (4) on page 10). Note that in this paper there is also
presented "enhanced double hashing" scheme (Algorithm 2 and equation
(6)) -- more about it later.
This is a standard technique from the hashing literature, called open
addressing with double hashing in hash tables.
This "enhanced double hashing" technique is further analyzed in [6].
[6] Adam Kirsch, Michael Mitzenmacher
"Less Hashing, Same Performance: Building a Better Bloom Filter"
https://www.eecs.harvard.edu/~michaelm/postscripts/esa2006a.pdf
https://doi.org/10.5555/1400123.1400125
>
> 3. The filters are sized according to the number of changes in each commit,
> with a minimum size of one 64-bit word.
If I understand it correctly (but which might not be entirely clear),
the filter size in bits is the number of changes^* times 10, rounded up
to the nearest multiple of 64.
[*] where the number of changes is the number of changed files (new blob
objects) _and_ the number of changed directories (new tree objects,
excluding root tree object change).
The interesting corner case, which might be worth specifying explicitly,
is what happens in the case there are _no changes_ with respect to first
parent (which can happen with either commit created with `git commit
--allow-empty`, or merge created e.g. with `git merge --strategy=ours`).
Is this case represented as Bloom filter of length 0, or as a Bloom
filter of length of one 64-bit word which is minimal length composed of
all 0's (0x0000000000000000)?
>
> 4. We fill the Bloom filters as (const char *data, int len) pairs as
> "struct bloom_filter"s in a commit slab.
All right.
>
> 5. The seed_murmur3 method is implemented as described in [5]. It hashes the
> given data using a given seed and produces a uniformly distributed hash
> value.
Actually there are two variants of Murmur3 hash, and we should specify
which one we are using. There is Murmur3_32 which returns 32-bit value,
and Murmur3_128 which returns 128-bit value (which is different for x86
and x64 versions). We use Murmur3_32.
Also, seed_murmur3 is the name given to the function, not the name of the
method, i.e. of the non-cryptographic hash function.
One question that one might ask is why use Murmur3 hash instead of, for
example, the already implemented FNV hash from the hashmap implementation (FNV
hash i.e. Fowler–Noll–Vo hash function is another non-cryptographic hash
function). The answer is of course performance while maintaining good
enough quality (and for a Bloom filter there is no problem of "hash
flooding" denial-of-service like there is for a hash table -- no
need for SipHash or similar).
>
> [1] https://devblogs.microsoft.com/devops/super-charging-the-git-commit-graph-iv-Bloom-filters/
I would write it in full, similar to subsequent bibliographical entries,
that is:
[1] Derrick Stolee
"Supercharging the Git Commit Graph IV: Bloom Filters"
https://devblogs.microsoft.com/devops/super-charging-the-git-commit-graph-iv-Bloom-filters/
But that is just a matter of style.
>
> [2] Flavio Bonomi, Michael Mitzenmacher, Rina Panigrahy, Sushil Singh, George Varghese
> "An Improved Construction for Counting Bloom Filters"
> http://theory.stanford.edu/~rinap/papers/esa2006b.pdf
> https://doi.org/10.1007/11841036_61
>
> [3] Peter C. Dillinger and Panagiotis Manolios
> "Bloom Filters in Probabilistic Verification"
> http://www.ccs.neu.edu/home/pete/pub/Bloom-filters-verification.pdf
> https://doi.org/10.1007/978-3-540-30494-4_26
Good, we should be able to find them even if the URL with PDF stops
working for some reason.
>
> [4] Thomas Mueller Graf, Daniel Lemire
> "Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters"
> https://arxiv.org/abs/1912.08258
>
> [5] https://en.wikipedia.org/wiki/MurmurHash#Algorithm
>
> Helped-by: Jeff King <peff@peff.net>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
> Makefile | 2 +
> bloom.c | 228 ++++++++++++++++++++++++++++++++++++++++++
> bloom.h | 56 +++++++++++
> t/helper/test-bloom.c | 84 ++++++++++++++++
> t/helper/test-tool.c | 1 +
> t/helper/test-tool.h | 1 +
> t/t0095-bloom.sh | 113 +++++++++++++++++++++
> 7 files changed, 485 insertions(+)
> create mode 100644 bloom.c
> create mode 100644 bloom.h
> create mode 100644 t/helper/test-bloom.c
> create mode 100755 t/t0095-bloom.sh
As I wrote earlier, in my opinion this patch could be split into three
individual single-functionality pieces, to make it easier to review and
aid in bisectability if needed.
1. Add implementation of MurmurHash v3 (32-bit result)
Include tests based on test-tool (creating a file similar to
t/helper/test-hash.c, or enhancing that file) that the implementation
is correct, for example that 'The quick brown fox jumps over the lazy
dog' or 'Hello world!' with a given seed (for example the default seed
of 0) hashes to the same value as other implementations, including the
reference implementation in https://github.com/aappleby/smhasher
2. Add implementation of [variant of] Bloom filter
Include generic Bloom filter tests i.e. that it correctly answers "no"
and "maybe" (create filter, save it or print it, then use stored
filter), and tests specific to our implementation, namely that the size
of the filter behaves as it should.
3. Bloom filter implementation for changed paths
Here include tests that use 'test-tool bloom get_filter_for_commit',
that filters for a commit with no changes and for a commit with more than 512
changes work correctly, that directories are added along with the files,
etc.
This split would make it easier to distinguish whether the problems with
tests failing on big-endian architectures are caused by different output
from our implementation of Murmur3 hash, different bit sequence in the
Bloom filter, or just different printed output of Bloom filter data.
>
> diff --git a/Makefile b/Makefile
> index 6134104ae6..afba81f4a8 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -695,6 +695,7 @@ X =
>
> PROGRAMS += $(patsubst %.o,git-%$X,$(PROGRAM_OBJS))
>
> +TEST_BUILTINS_OBJS += test-bloom.o
> TEST_BUILTINS_OBJS += test-chmtime.o
> TEST_BUILTINS_OBJS += test-config.o
> TEST_BUILTINS_OBJS += test-ctype.o
> @@ -840,6 +841,7 @@ LIB_OBJS += base85.o
> LIB_OBJS += bisect.o
> LIB_OBJS += blame.o
> LIB_OBJS += blob.o
> +LIB_OBJS += bloom.o
> LIB_OBJS += branch.o
> LIB_OBJS += bulk-checkin.o
> LIB_OBJS += bundle.o
All right.
> diff --git a/bloom.c b/bloom.c
> new file mode 100644
> index 0000000000..6082193a75
> --- /dev/null
> +++ b/bloom.c
> @@ -0,0 +1,228 @@
> +#include "git-compat-util.h"
> +#include "bloom.h"
> +#include "commit-graph.h"
> +#include "object-store.h"
> +#include "diff.h"
> +#include "diffcore.h"
> +#include "revision.h"
> +#include "hashmap.h"
> +
> +define_commit_slab(bloom_filter_slab, struct bloom_filter);
> +
> +struct bloom_filter_slab bloom_filters;
All right, this is needed to store per-commit Bloom filter data
(inside-out object style, or in other jargon stored on slab).
> +
> +struct pathmap_hash_entry {
> + struct hashmap_entry entry;
> + const char path[FLEX_ARRAY];
> +};
O.K. this is used to gather paths, to add them all as elements to the
Bloom filter.
> +
> +static uint32_t rotate_right(uint32_t value, int32_t count)
> +{
> + uint32_t mask = 8 * sizeof(uint32_t) - 1;
> + count &= mask;
> + return ((value >> count) | (value << ((-count) & mask)));
> +}
Hmmm... both the algorithm on Wikipedia and the reference implementation use
rotate *left*, not rotate *right* in the implementation of Murmur3 hash,
see
https://en.wikipedia.org/wiki/MurmurHash#Algorithm
https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp#L23
inline uint32_t rotl32 ( uint32_t x, int8_t r )
{
return (x << r) | (x >> (32 - r));
}
> +
> +/*
> + * Calculate a hash value for the given data using the given seed.
> + * Produces a uniformly distributed hash value.
> + * Not considered to be cryptographically secure.
> + * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
> + **/
^^-- why two _trailing_ asterisks?
Perhaps it would be worth it to add that this hash function is intended
to be fast while being reasonably good (it is distributed randomly
enough, and it doesn't have too many hash collisions on typical inputs).
But this might be too much for a comment.
> +static uint32_t seed_murmur3(uint32_t seed, const char *data, int len)
A few things: name of the function, type of parameters and ordering of
parameters.
About the name: when I first saw seed_murmur3() used, I thought it was
_setting_ the seed, not that it was returning the 32-bit hash value.
Other implementations use either murmur3_32, MurmurHash3_x86_32, or
something similar like hashmurmur3_32. If we were to specify that
'seed' is one of parameters, then using this word as part of suffix
would be better than using seed_ prefix; if we need it at all.
Because there are 32-bit and 128-bit variants of Murmur3, I think the _32
suffix should be a part of function name.
In short, I think that the name of the function should be murmur3_32, or
murmurhash3_32, or possibly murmur3_32_seed, or something like that.
About types of parameters and the return type of function: I understand
that 'data' parameter is of type 'const char *', instead of more generic
'const uint8_t*' or 'const void *' because of what we will be using the
hash function for. On the other hand taking a look at implementation of
FNV hash function in hashmap.{c,h} we see that the 'str*' variants take
'const char *' parameter _without_ length, and 'mem*' variants take
'const void *' parameter with length of data.
Shouldn't 'len' parameter be of 'size_t' type, rather than 'int'? Both
the example implementation in C on Wikipedia page, and implementation in
C in qLibc use 'size_t'; the implementation of FNV hash in hashmap in
Git also uses 'size_t' (while admittedly the reference implementation in
C++ of Austin Appleby uses 'int' type for len parameter).
For 32-bit output variant of Murmur3 hash, using uint32_t as return type
is just fine. The '*hash*' functions from hashmap.{c,h} use 'unsigned
int' but I think 'uint32_t' is better.
About names and ordering of parameters: the 'seed' or 'hash_seed'
parameter should be either first or last; it is a matter of preference.
While example implementation on Wikipedia page, Appleby's reference
implementation in C++ have 'seed' as last parameter, memihash_cont()
from hashmap.c in Git has it as first parameter.
In short: I'm fine with either order (seed parameter first or last), and
either name (be it 'seed' or 'hash_seed').
> +{
> + const uint32_t c1 = 0xcc9e2d51;
> + const uint32_t c2 = 0x1b873593;
> + const uint32_t r1 = 15;
> + const uint32_t r2 = 13;
> + const uint32_t m = 5;
> + const uint32_t n = 0xe6546b64;
> + int i;
> + uint32_t k1 = 0;
> + const char *tail;
> +
> + int len4 = len / sizeof(uint32_t);
> +
> + const uint32_t *blocks = (const uint32_t*)data;
> +
> + uint32_t k;
> + for (i = 0; i < len4; i++)
> + {
> + k = blocks[i];
IMPORTANT: There is a comment around there in the example implementation
in C on Wikipedia that this operation above is a source of differing
results across endianness. The pseudo-code description of the algorithm
on Wikipedia (above the C code) says that endian swapping is only
necessary on big-endian machines (and that it is needed to place the
meaningful digits towards the low end of the value, to not be discarded
by the modulo arithmetic under overflow).
The original / reference implementation by Austin Appleby in C++ uses
getblock32() function for doing the block read... but it doesn't
actually implement the endian-swapping on big-endian architecture:
//-----------------------------------------------------------------------------
// Block read - if your platform needs to do endian-swapping or can only
// handle aligned reads, do the conversion here
FORCE_INLINE uint32_t getblock32 ( const uint32_t * p, int i )
{
return p[i];
}
References:
-----------
1. https://en.wikipedia.org/wiki/MurmurHash#Algorithm
2. https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp
> + k *= c1;
> + k = rotate_right(k, r1);
It is k ROL r1 / ROTL32(k,15) / (k << 15) | (k >> (32 - 15))
(in other implementations), not rotate_right.
> + k *= c2;
> +
> + seed ^= k;
> + seed = rotate_right(seed, r2) * m + n;
It is hash ROL r2 / ROTL32(h1,13) / (h << 13) | (h >> (32 - 13))
(in other implementations), not rotate_right.
References:
-----------
1. https://en.wikipedia.org/wiki/MurmurHash#Algorithm
2. https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp#L94
3. https://github.com/wolkykim/qlibc/blob/master/src/utilities/qhash.c#L258
> + }
> +
> + tail = (data + len4 * sizeof(uint32_t));
Hmmm... in the pseudocode implementation on Wikipedia this is the place
where one needs to respect endianness:
with any remainingBytesInKey do
remainingBytes ← SwapToLittleEndian(remainingBytesInKey)
// Note: Endian swapping is only necessary on big-endian machines.
// The purpose is to place the meaningful digits towards the low end of the value,
// so that these digits have the greatest potential to affect the low range digits
// in the subsequent multiplication. Consider that locating the meaningful digits
// in the high range would produce a greater effect upon the high digits of the
// multiplication, and notably, that such high digits are likely to be discarded
// by the modulo arithmetic under overflow. We don't want that.
On the other hand, in Appleby's reference C++ implementation the
endian-swapping is [assumed to be] done only in the loop over data.
Either should be enough alone, but if doing swapping for remaining bytes
only would work, it would be a better solution -- you do swap only once,
at the end.
It looks like the Chromium implementation in C by Shane Day (public
domain) uses the second solution; well almost, see:
https://chromium.googlesource.com/external/smhasher/+/5b8fd3c31a58b87b80605dca7a64fad6cb3f8a0f/PMurHash.c#189
> +
> + switch (len & (sizeof(uint32_t) - 1))
> + {
> + case 3:
> + k1 ^= ((uint32_t)tail[2]) << 16;
> + /*-fallthrough*/
> + case 2:
> + k1 ^= ((uint32_t)tail[1]) << 8;
> + /*-fallthrough*/
> + case 1:
> + k1 ^= ((uint32_t)tail[0]) << 0;
> + k1 *= c1;
> + k1 = rotate_right(k1, r1);
> + k1 *= c2;
> + seed ^= k1;
> + break;
> + }
> +
> + seed ^= (uint32_t)len;
> + seed ^= (seed >> 16);
> + seed *= 0x85ebca6b;
> + seed ^= (seed >> 13);
> + seed *= 0xc2b2ae35;
> + seed ^= (seed >> 16);
> +
> + return seed;
> +}
> +
> +static inline uint64_t get_bitmask(uint32_t pos)
> +{
> + return ((uint64_t)1) << (pos & (BITS_PER_WORD - 1));
> +}
> +
> +void load_bloom_filters(void)
> +{
> + init_bloom_filter_slab(&bloom_filters);
> +}
> +
> +void fill_bloom_key(const char *data,
> + int len,
> + struct bloom_key *key,
> + struct bloom_filter_settings *settings)
> +{
> + int i;
> + const uint32_t seed0 = 0x293ae76f;
> + const uint32_t seed1 = 0x7e646e2c;
> + const uint32_t hash0 = seed_murmur3(seed0, data, len);
> + const uint32_t hash1 = seed_murmur3(seed1, data, len);
> +
> + key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
> + for (i = 0; i < settings->num_hashes; i++)
> + key->hashes[i] = hash0 + i * hash1;
> +}
> +
> +void add_key_to_filter(struct bloom_key *key,
> + struct bloom_filter *filter,
> + struct bloom_filter_settings *settings)
> +{
> + int i;
> + uint64_t mod = filter->len * BITS_PER_WORD;
> +
> + for (i = 0; i < settings->num_hashes; i++) {
> + uint64_t hash_mod = key->hashes[i] % mod;
> + uint64_t block_pos = hash_mod / BITS_PER_WORD;
> +
> + filter->data[block_pos] |= get_bitmask(hash_mod);
> + }
> +}
> +
> +struct bloom_filter *get_bloom_filter(struct repository *r,
> + struct commit *c)
> +{
> + struct bloom_filter *filter;
> + struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
> + int i;
> + struct diff_options diffopt;
> +
> + if (!bloom_filters.slab_size)
> + return NULL;
> +
> + filter = bloom_filter_slab_at(&bloom_filters, c);
> +
> + repo_diff_setup(r, &diffopt);
> + diffopt.flags.recursive = 1;
> + diff_setup_done(&diffopt);
> +
> + if (c->parents)
> + diff_tree_oid(&c->parents->item->object.oid, &c->object.oid, "", &diffopt);
> + else
> + diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
> + diffcore_std(&diffopt);
> +
> + if (diff_queued_diff.nr <= 512) {
> + struct hashmap pathmap;
> + struct pathmap_hash_entry* e;
> + struct hashmap_iter iter;
> + hashmap_init(&pathmap, NULL, NULL, 0);
> +
> + for (i = 0; i < diff_queued_diff.nr; i++) {
> + const char* path = diff_queued_diff.queue[i]->two->path;
> + const char* p = path;
> +
> + /*
> + * Add each leading directory of the changed file, i.e. for
> + * 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
> + * the Bloom filter could be used to speed up commands like
> + * 'git log dir/subdir', too.
> + *
> + * Note that directories are added without the trailing '/'.
> + */
> + do {
> + char* last_slash = strrchr(p, '/');
> +
> + FLEX_ALLOC_STR(e, path, path);
> + hashmap_entry_init(&e->entry, strhash(p));
> + hashmap_add(&pathmap, &e->entry);
> +
> + if (!last_slash)
> + last_slash = (char*)p;
> + *last_slash = '\0';
> +
> + } while (*p);
> +
> + diff_free_filepair(diff_queued_diff.queue[i]);
> + }
> +
> + filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
> + filter->data = xcalloc(filter->len, sizeof(uint64_t));
> +
> + hashmap_for_each_entry(&pathmap, &iter, e, entry) {
> + struct bloom_key key;
> + fill_bloom_key(e->path, strlen(e->path), &key, &settings);
> + add_key_to_filter(&key, filter, &settings);
> + }
> +
> + hashmap_free_entries(&pathmap, struct pathmap_hash_entry, entry);
> + } else {
> + for (i = 0; i < diff_queued_diff.nr; i++)
> + diff_free_filepair(diff_queued_diff.queue[i]);
> + filter->data = NULL;
> + filter->len = 0;
> + }
> +
> + free(diff_queued_diff.queue);
> + DIFF_QUEUE_CLEAR(&diff_queued_diff);
> +
> + return filter;
> +}
> +
> +int bloom_filter_contains(struct bloom_filter *filter,
> + struct bloom_key *key,
> + struct bloom_filter_settings *settings)
> +{
> + int i;
> + uint64_t mod = filter->len * BITS_PER_WORD;
> +
> + if (!mod)
> + return -1;
> +
> + for (i = 0; i < settings->num_hashes; i++) {
> + uint64_t hash_mod = key->hashes[i] % mod;
> + uint64_t block_pos = hash_mod / BITS_PER_WORD;
> + if (!(filter->data[block_pos] & get_bitmask(hash_mod)))
> + return 0;
> + }
> +
> + return 1;
> +}
> diff --git a/bloom.h b/bloom.h
> new file mode 100644
> index 0000000000..7f40c751f7
> --- /dev/null
> +++ b/bloom.h
> @@ -0,0 +1,56 @@
> +#ifndef BLOOM_H
> +#define BLOOM_H
> +
> +struct commit;
> +struct repository;
> +struct commit_graph;
> +
> +struct bloom_filter_settings {
> + uint32_t hash_version;
> + uint32_t num_hashes;
> + uint32_t bits_per_entry;
> +};
> +
> +#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
> +#define BITS_PER_WORD 64
> +
> +/*
> + * A bloom_filter struct represents a data segment to
> + * use when testing hash values. The 'len' member
> + * dictates how many uint64_t entries are stored in
> + * 'data'.
> + */
> +struct bloom_filter {
> + uint64_t *data;
> + int len;
> +};
> +
> +/*
> + * A bloom_key represents the k hash values for a
> + * given hash input. These can be precomputed and
> + * stored in a bloom_key for re-use when testing
> + * against a bloom_filter.
> + */
> +struct bloom_key {
> + uint32_t *hashes;
> +};
> +
> +void load_bloom_filters(void);
> +
> +void fill_bloom_key(const char *data,
> + int len,
> + struct bloom_key *key,
> + struct bloom_filter_settings *settings);
> +
> +void add_key_to_filter(struct bloom_key *key,
> + struct bloom_filter *filter,
> + struct bloom_filter_settings *settings);
> +
> +struct bloom_filter *get_bloom_filter(struct repository *r,
> + struct commit *c);
> +
> +int bloom_filter_contains(struct bloom_filter *filter,
> + struct bloom_key *key,
> + struct bloom_filter_settings *settings);
> +
> +#endif
> diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
> new file mode 100644
> index 0000000000..331957011b
> --- /dev/null
> +++ b/t/helper/test-bloom.c
> @@ -0,0 +1,84 @@
> +#include "test-tool.h"
> +#include "git-compat-util.h"
> +#include "bloom.h"
> +#include "test-tool.h"
> +#include "cache.h"
> +#include "commit-graph.h"
> +#include "commit.h"
> +#include "config.h"
> +#include "object-store.h"
> +#include "object.h"
> +#include "repository.h"
> +#include "tree.h"
> +
> +struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
> +
> +static void print_bloom_filter(struct bloom_filter *filter) {
> + int i;
> +
> + if (!filter) {
> + printf("No filter.\n");
> + return;
> + }
> + printf("Filter_Length:%d\n", filter->len);
> + printf("Filter_Data:");
> + for (i = 0; i < filter->len; i++){
> + printf("%"PRIx64"|", filter->data[i]);
> + }
> + printf("\n");
> +}
> +
> +static void add_string_to_filter(const char *data, struct bloom_filter *filter) {
> + struct bloom_key key;
> + int i;
> +
> + fill_bloom_key(data, strlen(data), &key, &settings);
> + printf("Hashes:");
> + for (i = 0; i < settings.num_hashes; i++){
> + printf("%08x|", key.hashes[i]);
> + }
> + printf("\n");
> + add_key_to_filter(&key, filter, &settings);
> +}
> +
> +static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
> +{
> + struct commit *c;
> + struct bloom_filter *filter;
> + setup_git_directory();
> + c = lookup_commit(the_repository, commit_oid);
> + filter = get_bloom_filter(the_repository, c);
> + print_bloom_filter(filter);
> +}
> +
> +int cmd__bloom(int argc, const char **argv)
> +{
> + if (!strcmp(argv[1], "generate_filter")) {
> + struct bloom_filter filter;
> + int i = 2;
> + filter.len = (settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
> + filter.data = xcalloc(filter.len, sizeof(uint64_t));
> +
> + if (!argv[2]){
> + die("at least one input string expected");
> + }
> +
> + while (argv[i]) {
> + add_string_to_filter(argv[i], &filter);
> + i++;
> + }
> +
> + print_bloom_filter(&filter);
> + }
> +
> + if (!strcmp(argv[1], "get_filter_for_commit")) {
> + struct object_id oid;
> + const char *end;
> + if (parse_oid_hex(argv[2], &oid, &end))
> + die("cannot parse oid '%s'", argv[2]);
> + load_bloom_filters();
> + get_bloom_filter_for_commit(&oid);
> + }
> +
> + return 0;
> +}
> diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
> index c9a232d238..ca4f4b0066 100644
> --- a/t/helper/test-tool.c
> +++ b/t/helper/test-tool.c
> @@ -14,6 +14,7 @@ struct test_cmd {
> };
>
> static struct test_cmd cmds[] = {
> + { "bloom", cmd__bloom },
> { "chmtime", cmd__chmtime },
> { "config", cmd__config },
> { "ctype", cmd__ctype },
> diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
> index c8549fd87f..05d2b32451 100644
> --- a/t/helper/test-tool.h
> +++ b/t/helper/test-tool.h
> @@ -4,6 +4,7 @@
> #define USE_THE_INDEX_COMPATIBILITY_MACROS
> #include "git-compat-util.h"
>
> +int cmd__bloom(int argc, const char **argv);
> int cmd__chmtime(int argc, const char **argv);
> int cmd__config(int argc, const char **argv);
> int cmd__ctype(int argc, const char **argv);
> diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
> new file mode 100755
> index 0000000000..424fe4fc29
> --- /dev/null
> +++ b/t/t0095-bloom.sh
> @@ -0,0 +1,113 @@
> +#!/bin/sh
> +
> +test_description='test bloom.c'
> +. ./test-lib.sh
> +
> +test_expect_success 'get bloom filters for commit with no changes' '
> + git init &&
> + git commit --allow-empty -m "c0" &&
> + cat >expect <<-\EOF &&
> + Filter_Length:0
> + Filter_Data:
> + EOF
> + test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
> + test_cmp expect actual
> +'
> +
> +test_expect_success 'get bloom filter for commit with 10 changes' '
> + rm actual &&
> + rm expect &&
> + mkdir smallDir &&
> + for i in $(test_seq 0 9)
> + do
> + echo $i >smallDir/$i
> + done &&
> + git add smallDir &&
> + git commit -m "commit with 10 changes" &&
> + cat >expect <<-\EOF &&
> + Filter_Length:4
> + Filter_Data:508928809087080a|8a7648210804001|4089824400951000|841ab310098051a8|
> + EOF
> + test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
> + test_cmp expect actual
> +'
> +
> +test_expect_success EXPENSIVE 'get bloom filter for commit with 513 changes' '
> + rm actual &&
> + rm expect &&
> + mkdir bigDir &&
> + for i in $(test_seq 0 512)
> + do
> + echo $i >bigDir/$i
> + done &&
> + git add bigDir &&
> + git commit -m "commit with 513 changes" &&
> + cat >expect <<-\EOF &&
> + Filter_Length:0
> + Filter_Data:
> + EOF
> + test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
> + test_cmp expect actual
> +'
> +
> +test_expect_success 'compute bloom key for empty string' '
> + cat >expect <<-\EOF &&
> + Hashes:5615800c|5b966560|61174ab4|66983008|6c19155c|7199fab0|771ae004|
> + Filter_Length:1
> + Filter_Data:11000110001110|
> + EOF
> + test-tool bloom generate_filter "" >actual &&
> + test_cmp expect actual
> +'
> +
> +test_expect_success 'compute bloom key for whitespace' '
> + cat >expect <<-\EOF &&
> + Hashes:1bf014e6|8a91b50b|f9335530|67d4f555|d676957a|4518359f|b3b9d5c4|
> + Filter_Length:1
> + Filter_Data:401004080200810|
> + EOF
> + test-tool bloom generate_filter " " >actual &&
> + test_cmp expect actual
> +'
> +
> +test_expect_success 'compute bloom key for a root level folder' '
> + cat >expect <<-\EOF &&
> + Hashes:1a21016f|fff1c06d|e5c27f6b|cb933e69|b163fd67|9734bc65|7d057b63|
> + Filter_Length:1
> + Filter_Data:aaa800000000|
> + EOF
> + test-tool bloom generate_filter "A" >actual &&
> + test_cmp expect actual
> +'
> +
> +test_expect_success 'compute bloom key for a root level file' '
> + cat >expect <<-\EOF &&
> + Hashes:e2d51107|30970605|7e58fb03|cc1af001|19dce4ff|679ed9fd|b560cefb|
> + Filter_Length:1
> + Filter_Data:a8000000000000aa|
> + EOF
> + test-tool bloom generate_filter "file.txt" >actual &&
> + test_cmp expect actual
> +'
> +
> +test_expect_success 'compute bloom key for a deep folder' '
> + cat >expect <<-\EOF &&
> + Hashes:864cf838|27f055cd|c993b362|6b3710f7|0cda6e8c|ae7dcc21|502129b6|
> + Filter_Length:1
> + Filter_Data:1c0000600003000|
> + EOF
> + test-tool bloom generate_filter "A/B/C/D/E" >actual &&
> + test_cmp expect actual
> +'
> +
> +test_expect_success 'compute bloom key for a deep file' '
> + cat >expect <<-\EOF &&
> + Hashes:07cdf850|4af629c7|8e1e5b3e|d1468cb5|146ebe2c|5796efa3|9abf211a|
> + Filter_Length:1
> + Filter_Data:4020100804010080|
> + EOF
> + test-tool bloom generate_filter "A/B/C/D/E/file.txt" >actual &&
> + test_cmp expect actual
> +'
> +
> +test_done
^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v2 02/11] bloom: core Bloom filter implementation for changed paths
2020-02-05 22:56 ` [PATCH v2 02/11] bloom: core Bloom filter implementation for changed paths Garima Singh via GitGitGadget
2020-02-15 17:17 ` Jakub Narebski
@ 2020-02-16 16:49 ` Jakub Narebski
2020-02-22 0:32 ` Garima Singh
1 sibling, 1 reply; 159+ messages in thread
From: Jakub Narebski @ 2020-02-16 16:49 UTC (permalink / raw)
To: Garima Singh via GitGitGadget
Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
Jeff Hostetler, Taylor Blau, Jeff King, Christian Couder,
Emily Shaffer, Junio C Hamano, Garima Singh, Garima Singh
[I'm sorry for accidentally sending unfinished version of this email]
"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Garima Singh <garima.singh@microsoft.com>
>
> Add the core Bloom filter logic for computing the paths changed between a
> commit and its first parent. For details on what Bloom filters are and how they
> work, please refer to Dr. Derrick Stolee's blog post [1]. It provides a concise
> explaination of the adoption of Bloom filters as described in [2] and [3].
^^- to add
>
> 1. We currently use 7 and 10 for the number of hashes and the size of each
> entry respectively. They served as great starting values, the mathematical
> details behind this choice are described in [1] and [4]. The implementation,
^^- to add
> while not completely open to it at the moment, is flexible enough to allow
> for tweaking these settings in the future.
I don't know if it is worth it, but I think it should be size of each
entry, or in other words number of bits per element in the set, as first
value, and number of hashes as second.
About where those values come from. The idea is that you decide on the
acceptable number of false positives, for example 1% (or 0.8% given that
the values must be integers); that gives you number of bits per element
i.e. 10, and from there you can find optimal number of hashes i.e. 7.
The references mentioned (and Wikipedia article) have those equations.
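To spell out that arithmetic, here is a worked sketch using the standard Bloom filter approximations (n elements, m bits, k hash functions; these symbols are my notation, not the patch's). Starting from a 1% target, it recovers 10 bits per element and 7 hashes, and gives the resulting rate of roughly 0.8%:

```latex
% False-positive rate, optimal number of hashes, and bits per element
p \approx \left(1 - e^{-kn/m}\right)^{k}, \qquad
k_{\mathrm{opt}} = \frac{m}{n}\ln 2, \qquad
\frac{m}{n} = -\frac{\ln p}{(\ln 2)^{2}}

% Target p = 0.01
\frac{m}{n} = \frac{4.605}{0.4805} \approx 9.59 \;\to\; 10, \qquad
k_{\mathrm{opt}} = 10 \ln 2 \approx 6.93 \;\to\; 7

% Resulting rate with m/n = 10 and k = 7
p \approx \left(1 - e^{-0.7}\right)^{7} \approx 0.5034^{7} \approx 0.0082
```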
>
> Note: The performance gains we have observed with these values are
> significant enough that we did not need to tweak these settings.
> The performance numbers are included in the cover letter of this series
> and in the message of a subsequent commit where we use Bloom filters in
> to speed up `git log -- <path>`.
All right.
>
> 2. As described in the blog and in [3], we do not need 7 independent hashing
> functions. We use the Murmur3 hashing scheme. Seed it twice and then
> combine those to procure an arbitrary number of hash values.
The technique from [3] is called "double hashing" (Algorithm 1 and
equation (4) on page 10). Note that in this paper there is also
presented "enhanced double hashing" scheme (Algorithm 2 and equation
(6)) -- more about it later.
This is a standard technique from the hashing literature, called open
addressing with double hashing in hash tables.
This "enhanced double hashing" technique is further analyzed in [6].
[6] Adam Kirsch, Michael Mitzenmacher
"Less Hashing, Same Performance: Building a Better Bloom Filter"
https://www.eecs.harvard.edu/~michaelm/postscripts/esa2006a.pdf
https://doi.org/10.5555/1400123.1400125
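For comparison, a minimal sketch of both schemes. The first function mirrors the quoted fill_bloom_key(); the second is how I read Algorithm 2 of [3], so treat it as an assumption rather than a transcription. The function names are hypothetical, and the enhanced variant applies the modulus m (filter size in bits) at every step, which the patch does not do.

```c
#include <stdint.h>

/*
 * Double hashing as in the quoted fill_bloom_key(): the i-th hash is
 * hash0 + i * hash1, with unsigned 32-bit wraparound.
 */
static void double_hashing(uint32_t hash0, uint32_t hash1,
			   uint32_t *out, unsigned num_hashes)
{
	unsigned i;

	for (i = 0; i < num_hashes; i++)
		out[i] = hash0 + i * hash1;
}

/*
 * Enhanced double hashing (assumed reading of Algorithm 2 in [3]):
 * the increment y itself grows by i at each step, which gives
 * x + i*y + (i^3 - i)/6 modulo m. Assumes m > 0.
 */
static void enhanced_double_hashing(uint32_t hash0, uint32_t hash1,
				    uint32_t m, uint32_t *out,
				    unsigned num_hashes)
{
	uint64_t x = hash0 % m;
	uint64_t y = hash1 % m;
	unsigned i;

	if (!num_hashes)
		return;
	out[0] = (uint32_t)x;
	for (i = 1; i < num_hashes; i++) {
		x = (x + y) % m;
		y = (y + i) % m;
		out[i] = (uint32_t)x;
	}
}
```

With hash0 = 0x5615800c and hash1 = 0x0580e554 (the difference between consecutive values in the quoted test for the empty string), double_hashing() reproduces the expected sequence 5615800c, 5b966560, 61174ab4, and so on.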
>
> 3. The filters are sized according to the number of changes in the each commit,
> with minimum size of one 64 bit word.
If I understand it correctly (but which might not be entirely clear),
the filter size in bits is the number of changes^* times 10, rounded up
to the nearest multiple of 64.
[*] where the number of changes is the number of changed files (new blob
objects) _and_ the number of changed directories (new tree objects,
excluding root tree object change).
The interesting corner case, which might be worth specifying explicitly,
is what happens in the case there are _no changes_ with respect to first
parent (which can happen with either commit created with `git commit
--allow-empty`, or merge created e.g. with `git merge --strategy=ours`).
Is this case represented as Bloom filter of length 0, or as a Bloom
filter of length of one 64-bit word which is minimal length composed of
all 0's (0x0000000000000000)?
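Going only by the quoted get_bloom_filter(), the sizing can be sketched as below (bloom_filter_words is a hypothetical name; the rounding expression is the one from the patch). Under this formula zero entries give a length of zero words, which agrees with the "Filter_Length:0" expected by the first test in t0095-bloom.sh, rather than with the "minimum size of one 64 bit word" stated in the commit message.

```c
#include <stddef.h>

#define BITS_PER_WORD 64

/*
 * Number of 64-bit words for a filter holding num_entries paths,
 * mirroring the rounding in the quoted get_bloom_filter().
 */
static size_t bloom_filter_words(size_t num_entries, size_t bits_per_entry)
{
	return (num_entries * bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
}
```

For example, 7 entries at 10 bits each need 70 bits, which rounds up to 2 words (128 bits).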
>
> 4. We fill the Bloom filters as (const char *data, int len) pairs as
> "struct bloom_filter"s in a commit slab.
All right.
>
> 5. The seed_murmur3 method is implemented as described in [5]. It hashes the
> given data using a given seed and produces a uniformly distributed hash
> value.
Actually there are two variants of Murmur3 hash, and we should specify
which one we are using. There is Murmur3_32 which returns 32-bit value,
and Murmur3_128 which returns 128-bit value (which is different for x86
and x64 versions). We use Murmur3_32.
Also, seed_murmur3 is the name given to the function, not the name of the
method, i.e. of the non-cryptographic hash function.
One question that one might ask is why use the Murmur3 hash instead of,
for example, the already implemented FNV hash from the hashmap
implementation (FNV hash, i.e. the Fowler–Noll–Vo hash function, is
another non-cryptographic hash function). The answer is of course
performance while maintaining good enough quality (and for a Bloom filter
there is no problem of "hash flooding" denial-of-service like there is
for a hash table -- no need for SipHash or similar).
>
> [1] https://devblogs.microsoft.com/devops/super-charging-the-git-commit-graph-iv-Bloom-filters/
I would write it in full, similar to subsequent bibliographical entries,
that is:
[1] Derrick Stolee
"Supercharging the Git Commit Graph IV: Bloom Filters"
https://devblogs.microsoft.com/devops/super-charging-the-git-commit-graph-iv-Bloom-filters/
But that is just a matter of style.
>
> [2] Flavio Bonomi, Michael Mitzenmacher, Rina Panigrahy, Sushil Singh, George Varghese
> "An Improved Construction for Counting Bloom Filters"
> http://theory.stanford.edu/~rinap/papers/esa2006b.pdf
> https://doi.org/10.1007/11841036_61
>
> [3] Peter C. Dillinger and Panagiotis Manolios
> "Bloom Filters in Probabilistic Verification"
> http://www.ccs.neu.edu/home/pete/pub/Bloom-filters-verification.pdf
> https://doi.org/10.1007/978-3-540-30494-4_26
Good, we should be able to find them even if the URL with PDF stops
working for some reason.
>
> [4] Thomas Mueller Graf, Daniel Lemire
> "Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters"
> https://arxiv.org/abs/1912.08258
>
> [5] https://en.wikipedia.org/wiki/MurmurHash#Algorithm
>
> Helped-by: Jeff King <peff@peff.net>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
> Makefile | 2 +
> bloom.c | 228 ++++++++++++++++++++++++++++++++++++++++++
> bloom.h | 56 +++++++++++
> t/helper/test-bloom.c | 84 ++++++++++++++++
> t/helper/test-tool.c | 1 +
> t/helper/test-tool.h | 1 +
> t/t0095-bloom.sh | 113 +++++++++++++++++++++
> 7 files changed, 485 insertions(+)
> create mode 100644 bloom.c
> create mode 100644 bloom.h
> create mode 100644 t/helper/test-bloom.c
> create mode 100755 t/t0095-bloom.sh
As I wrote earlier, in my opinion this patch could be split into three
individual single-functionality pieces, to make it easier to review and
aid in bisectability if needed.
1. Add implementation of MurmurHash v3 (32-bit result)
Include tests based on test-tool (creating file similar to the
t/helper/test-hash.c, or enhancing that file) that the implementation
is correct, for example that 'The quick brown fox jumps over the lazy
dog' or 'Hello world!' with a given seed (for example the default seed
of 0) hashes to the same value as other implementations, including the
reference implementation in https://github.com/aappleby/smhasher
2. Add implementation of [variant of] Bloom filter
Include generic Bloom filter tests i.e. that it correctly answers "no"
and "maybe" (create filter, save it or print it, then use stored
filter), and tests specific to our implementation, namely that the size
of the filter behaves as it should.
3. Bloom filter implementation for changed paths
Here include tests that use 'test-tool bloom get_filter_for_commit',
that filter for commit with no changes and for commit with more than 512
changes works correctly, that directories are added along the files,
etc.
This split would make it easier to distinguish whether the problems with
tests failing on big-endian architectures are caused by different output
from our implementation of Murmur3 hash, different bit sequence in the
Bloom filter, or just different printed output of Bloom filter data.
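For the second piece of the split, the generic part that such tests would exercise can be sketched as follows. It follows the bit addressing of the quoted add_key_to_filter() and bloom_filter_contains(), but the struct and function names are hypothetical, and it assumes a filter with at least one word.

```c
#include <stddef.h>
#include <stdint.h>

struct bloom_sketch {
	uint64_t *data;	/* bit array of len 64-bit words */
	size_t len;	/* assumed to be > 0 */
};

/* Set one bit per hash value. */
static void bloom_sketch_add(struct bloom_sketch *f,
			     const uint32_t *hashes, unsigned num_hashes)
{
	uint64_t mod = (uint64_t)f->len * 64;
	unsigned i;

	for (i = 0; i < num_hashes; i++) {
		uint64_t pos = hashes[i] % mod;
		f->data[pos / 64] |= (uint64_t)1 << (pos % 64);
	}
}

/* Returns 0 for "definitely not present", 1 for "maybe present". */
static int bloom_sketch_maybe_contains(const struct bloom_sketch *f,
				       const uint32_t *hashes,
				       unsigned num_hashes)
{
	uint64_t mod = (uint64_t)f->len * 64;
	unsigned i;

	for (i = 0; i < num_hashes; i++) {
		uint64_t pos = hashes[i] % mod;
		if (!(f->data[pos / 64] & ((uint64_t)1 << (pos % 64))))
			return 0;
	}
	return 1;
}
```

A test along these lines only needs fixed hash values, so it stays independent of the Murmur3 implementation and of endianness in the hash input.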
>
> diff --git a/Makefile b/Makefile
> index 6134104ae6..afba81f4a8 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -695,6 +695,7 @@ X =
>
> PROGRAMS += $(patsubst %.o,git-%$X,$(PROGRAM_OBJS))
>
> +TEST_BUILTINS_OBJS += test-bloom.o
> TEST_BUILTINS_OBJS += test-chmtime.o
> TEST_BUILTINS_OBJS += test-config.o
> TEST_BUILTINS_OBJS += test-ctype.o
> @@ -840,6 +841,7 @@ LIB_OBJS += base85.o
> LIB_OBJS += bisect.o
> LIB_OBJS += blame.o
> LIB_OBJS += blob.o
> +LIB_OBJS += bloom.o
> LIB_OBJS += branch.o
> LIB_OBJS += bulk-checkin.o
> LIB_OBJS += bundle.o
All right.
> diff --git a/bloom.c b/bloom.c
> new file mode 100644
> index 0000000000..6082193a75
> --- /dev/null
> +++ b/bloom.c
> @@ -0,0 +1,228 @@
> +#include "git-compat-util.h"
> +#include "bloom.h"
> +#include "commit-graph.h"
> +#include "object-store.h"
> +#include "diff.h"
> +#include "diffcore.h"
> +#include "revision.h"
> +#include "hashmap.h"
> +
> +define_commit_slab(bloom_filter_slab, struct bloom_filter);
> +
> +struct bloom_filter_slab bloom_filters;
All right, this is needed to store per-commit Bloom filter data
(inside-out object style, or in other jargon stored on slab).
> +
> +struct pathmap_hash_entry {
> + struct hashmap_entry entry;
> + const char path[FLEX_ARRAY];
> +};
O.K., this is used to gather paths, to add them all as elements to the
Bloom filter.
> +
> +static uint32_t rotate_right(uint32_t value, int32_t count)
> +{
> + uint32_t mask = 8 * sizeof(uint32_t) - 1;
> + count &= mask;
> + return ((value >> count) | (value << ((-count) & mask)));
> +}
Hmmm... both the algorithm on Wikipedia and the reference implementation use
rotate *left*, not rotate *right* in the implementation of Murmur3 hash,
see
https://en.wikipedia.org/wiki/MurmurHash#Algorithm
https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp#L23
inline uint32_t rotl32 ( uint32_t x, int8_t r )
{
return (x << r) | (x >> (32 - r));
}
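To make the difference concrete, a minimal sketch (the helper names are mine, not from the patch or from the reference implementation). A right rotation by r equals a left rotation by 32 - r, so substituting one for the other changes the hash of any input that reaches the mixing steps.

```c
#include <stdint.h>

/* Rotate left; r is assumed to be in 1..31, as it is for Murmur3 (15 and 13). */
static uint32_t rotl32_sketch(uint32_t x, unsigned r)
{
	return (x << r) | (x >> (32 - r));
}

/* Rotate right, with the same assumption on r. */
static uint32_t rotr32_sketch(uint32_t x, unsigned r)
{
	return (x >> r) | (x << (32 - r));
}
```

For x = 0x80000001 and r = 15 the left rotation gives 0x0000c000 while the right rotation gives 0x00030000.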
> +
> +/*
> + * Calculate a hash value for the given data using the given seed.
> + * Produces a uniformly distributed hash value.
> + * Not considered to be cryptographically secure.
> + * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
> + **/
^^-- why two _trailing_ asterisks?
Perhaps it would be worth it to add that this hash function is intended
to be fast while being reasonably good (it is distributed randomly
enough, and it doesn't have too many hash collisions on typical inputs).
But this might be too much for a comment.
> +static uint32_t seed_murmur3(uint32_t seed, const char *data, int len)
A few things: name of the function, type of parameters and ordering of
parameters.
About the name: when I first saw seed_murmur3() used, I thought it was
_setting_ the seed, not that it was returning the 32-bit hash value.
Other implementations use either murmur3_32, MurmurHash3_x86_32, or
something similar like hashmurmur3_32. If we were to specify that
'seed' is one of parameters, then using this word as part of suffix
would be better than using seed_ prefix; if we need it at all.
Because there are 32-bit and 128-bit variants of Murmur3, I think the _32
suffix should be a part of function name.
In short, I think that the name of the function should be murmur3_32, or
murmurhash3_32, or possibly murmur3_32_seed, or something like that.
About types of parameters and the return type of function: I understand
that 'data' parameter is of type 'const char *', instead of more generic
'const uint8_t*' or 'const void *' because of what we will be using the
hash function for. On the other hand taking a look at implementation of
FNV hash function in hashmap.{c,h} we see that the 'str*' variants take
'const char *' parameter _without_ length, and 'mem*' variants take
'const void *' parameter with length of data.
Shouldn't 'len' parameter be of 'size_t' type, rather than 'int'? Both
the example implementation in C on Wikipedia page, and implementation in
C in qLibc use 'size_t'; the implementation of FNV hash in hashmap in
Git also uses 'size_t' (while admittedly the reference implementation in
C++ of Austin Appleby uses 'int' type for len parameter).
For 32-bit output variant of Murmur3 hash, using uint32_t as return type
is just fine. The '*hash*' functions from hashmap.{c,h} use 'unsigned
int' but I think 'uint32_t' is better.
About names and ordering of parameters: the 'seed' or 'hash_seed'
parameter should be either first or last; it is a matter of preference.
While example implementation on Wikipedia page, Appleby's reference
implementation in C++ have 'seed' as last parameter, memihash_cont()
from hashmap.c in Git has it as first parameter.
In short: I'm fine with either order (seed parameter first or last), and
either name (be it 'seed' or 'hash_seed').
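Putting the suggestions above together, a sketch of what such a function could look like. This is not the patch's code: the name murmur3_32, the 'const void *' / 'size_t' parameters and the seed-first ordering are the choices discussed above, and the body follows the 32-bit algorithm as described in the references, with rotate *left* and with blocks assembled byte by byte so that the result does not depend on host endianness.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t rotl32(uint32_t x, unsigned r)
{
	return (x << r) | (x >> (32 - r));
}

/*
 * Sketch of MurmurHash3 (x86, 32-bit output). Bytes are read as
 * uint8_t, so values above 0x7f are not sign-extended, and each block
 * is assembled in little-endian order regardless of host byte order.
 */
static uint32_t murmur3_32(uint32_t seed, const void *data, size_t len)
{
	const uint8_t *p = data;
	const uint32_t c1 = 0xcc9e2d51, c2 = 0x1b873593;
	uint32_t h = seed, k;
	size_t i, nblocks = len / 4;

	for (i = 0; i < nblocks; i++, p += 4) {
		k = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
		    ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
		k *= c1;
		k = rotl32(k, 15);
		k *= c2;
		h ^= k;
		h = rotl32(h, 13) * 5 + 0xe6546b64;
	}

	/* Tail: the remaining 0..3 bytes, again in little-endian order. */
	k = 0;
	switch (len & 3) {
	case 3:
		k ^= (uint32_t)p[2] << 16;
		/* fallthrough */
	case 2:
		k ^= (uint32_t)p[1] << 8;
		/* fallthrough */
	case 1:
		k ^= p[0];
		k *= c1;
		k = rotl32(k, 15);
		k *= c2;
		h ^= k;
	}

	/* Finalization mix. */
	h ^= (uint32_t)len;
	h ^= h >> 16;
	h *= 0x85ebca6b;
	h ^= h >> 13;
	h *= 0xc2b2ae35;
	h ^= h >> 16;
	return h;
}
```

For the empty string only the finalization runs, so the rotation direction does not matter there: with seed 0x293ae76f this gives 0x5615800c, the first hash in the quoted 'compute bloom key for empty string' test. Inputs of four or more bytes are where a rotate-right implementation would start to differ.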
> +{
> + const uint32_t c1 = 0xcc9e2d51;
> + const uint32_t c2 = 0x1b873593;
> + const uint32_t r1 = 15;
> + const uint32_t r2 = 13;
> + const uint32_t m = 5;
> + const uint32_t n = 0xe6546b64;
> + int i;
> + uint32_t k1 = 0;
> + const char *tail;
> +
> + int len4 = len / sizeof(uint32_t);
> +
> + const uint32_t *blocks = (const uint32_t*)data;
> +
> + uint32_t k;
> + for (i = 0; i < len4; i++)
> + {
> + k = blocks[i];
IMPORTANT: There is a comment at this point in the example implementation
in C on Wikipedia saying that the operation above is a source of differing
results across endianness. The pseudo-code description of the algorithm
on Wikipedia (above the C code) says that endian swapping is only
necessary on big-endian machines (and that it is needed to place the
meaningful digits towards the low end of the value, so that they are not
discarded by the modulo arithmetic under overflow).
The original / reference implementation by Austin Appleby in C++ uses
getblock32() function for doing the block read... but it doesn't
actually implement the endian-swapping on big-endian architecture:
//-----------------------------------------------------------------------------
// Block read - if your platform needs to do endian-swapping or can only
// handle aligned reads, do the conversion here
FORCE_INLINE uint32_t getblock32 ( const uint32_t * p, int i )
{
return p[i];
}
References:
-----------
1. https://en.wikipedia.org/wiki/MurmurHash#Algorithm
2. https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp
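One way to make the block read independent of host byte order (and of
alignment) is to assemble each 32-bit block from individual bytes. This is
only a sketch; the name getblock32_le is made up for illustration and is
not taken from the patch or from the reference implementation.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: read the i-th 4-byte block of 'p' as a
 * little-endian 32-bit value. Works the same on little-endian and
 * big-endian hosts, and does not require 'p' to be aligned.
 */
static inline uint32_t getblock32_le(const unsigned char *p, size_t i)
{
	p += 4 * i;
	return (uint32_t)p[0] |
	       ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) |
	       ((uint32_t)p[3] << 24);
}
```

Compilers commonly recognize this pattern and turn it into a single load
on little-endian targets, but that is an optimization detail, not a
guarantee.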
> + k *= c1;
> + k = rotate_right(k, r1);
It is k ROL r1 / ROTL32(k,15) / (k << 15) | (k >> (32 - 15))
(in other implementations), not rotate_right.
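To make the difference concrete, here is a minimal sketch of both
rotations for 32-bit values. The function names mirror the one used in
the patch, but the bodies are written from scratch for illustration.

```c
#include <stdint.h>

/* Rotate 'value' left by 'count' bits (count taken modulo 32). */
static inline uint32_t rotate_left(uint32_t value, unsigned count)
{
	count &= 31;
	return (value << count) | (value >> ((32 - count) & 31));
}

/* Rotate 'value' right by 'count' bits (count taken modulo 32). */
static inline uint32_t rotate_right(uint32_t value, unsigned count)
{
	count &= 31;
	return (value >> count) | (value << ((32 - count) & 31));
}
```

Rotating left by 15 is the same as rotating right by 17, so swapping one
for the other changes every hash value except in degenerate cases such
as an all-zero input word.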
> + k *= c2;
> +
> + seed ^= k;
> + seed = rotate_right(seed, r2) * m + n;
It is hash ROL r2 / ROTL32(h1,13) / (h << 13) | (h >> (32 - 13))
(in other implementations), not rotate_right.
References:
-----------
1. https://en.wikipedia.org/wiki/MurmurHash#Algorithm
2. https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp#L94
3. https://github.com/wolkykim/qlibc/blob/master/src/utilities/qhash.c#L258
> + }
> +
> + tail = (data + len4 * sizeof(uint32_t));
Hmmm... in the pseudocode implementation on Wikipedia this is the place
where one needs to respect endianness:
with any remainingBytesInKey do
remainingBytes ← SwapToLittleEndian(remainingBytesInKey)
// Note: Endian swapping is only necessary on big-endian machines.
// The purpose is to place the meaningful digits towards the low end of the value,
// so that these digits have the greatest potential to affect the low range digits
// in the subsequent multiplication. Consider that locating the meaningful digits
// in the high range would produce a greater effect upon the high digits of the
// multiplication, and notably, that such high digits are likely to be discarded
// by the modulo arithmetic under overflow. We don't want that.
On the other hand, in Appleby's reference C++ implementation the
endian-swapping is [assumed to be] done only in the loop over data.
Either should be enough alone, but if doing the swap for the remaining
bytes only would work, it would be the better solution -- you swap only
once, at the end.
It looks like the Chromium implementation in C by Shane Day (public
domain) uses the second solution; well almost, see:
https://chromium.googlesource.com/external/smhasher/+/5b8fd3c31a58b87b80605dca7a64fad6cb3f8a0f/PMurHash.c#189
> +
> + switch (len & (sizeof(uint32_t) - 1))
> + {
> + case 3:
> + k1 ^= ((uint32_t)tail[2]) << 16;
> + /*-fallthrough*/
> + case 2:
> + k1 ^= ((uint32_t)tail[1]) << 8;
> + /*-fallthrough*/
> + case 1:
> + k1 ^= ((uint32_t)tail[0]) << 0;
> + k1 *= c1;
> + k1 = rotate_right(k1, r1);
It is remainingBytes ROL r1 / ROTL32(k1,15) / (k << 15) | (k >> (32 - 15))
(in other implementations), not rotate_right. The same references as
before.
> + k1 *= c2;
> + seed ^= k1;
> + break;
> + }
> +
> + seed ^= (uint32_t)len;
> + seed ^= (seed >> 16);
> + seed *= 0x85ebca6b;
> + seed ^= (seed >> 13);
> + seed *= 0xc2b2ae35;
> + seed ^= (seed >> 16);
> +
> + return seed;
> +}
In https://public-inbox.org/git/ba856e20-0a3c-e2d2-6744-b9abfacdc465@gmail.com/
you posted "[PATCH] Process bloom filter data as 1 byte words".
This may avoid the Big-endian vs Little-endian confusion,
that is wrong results on Big-endian architectures, but
it also may slow down the algorithm.
The public domain implementation in PMurHash.c in SMHasher
(re)implementation in Chromium (see URL above) falls back to 1-byte
operations only if it doesn't know the endianness (or if it is neither
little-endian, nor big-endian, i.e. middle-endian or mixed-endian --
though I doubt that Git works correctly on mixed-endian anyway).
Sidenote: it looks like the current implementation of Murmur hash in
Chromium uses MurmurHash3_x86_32, i.e. little-endian unaligned-safe
implementation, but prepares data by swapping with StringToLE32
https://github.com/chromium/chromium/blob/master/components/variations/variations_murmur_hash.h
Assuming that the terminating NUL ("\0") character of a c-string is not
included in hash calculations, then murmur3_x86_32 hash has the
following results (all results are for seed equal 0):
'' -> 0x00000000
' ' -> 0x7ef49b98
'Hello world!' -> 0x627b0c2c
'The quick brown fox jumps over the lazy dog' -> 0x2e4ff723
C source (from Wikipedia): https://godbolt.org/z/ofa2p8
C++ source (Appleby's): https://godbolt.org/z/BoSt6V
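For comparison, a self-contained sketch of the 32-bit variant that uses
rotate-left and reads blocks byte by byte, so it does not depend on host
endianness. The names murmur3_32 and rotl32, the seed-first parameter
order, and the 'const void *' / 'size_t' types are choices made for this
illustration, not what the patch uses. With seed 0 it is expected to
reproduce the values listed above.

```c
#include <stddef.h>
#include <stdint.h>

static inline uint32_t rotl32(uint32_t x, unsigned r)
{
	/* r is always 13 or 15 here, so no shift by 32 can occur */
	return (x << r) | (x >> (32 - r));
}

static uint32_t murmur3_32(uint32_t seed, const void *data, size_t len)
{
	const unsigned char *p = data;
	const uint32_t c1 = 0xcc9e2d51;
	const uint32_t c2 = 0x1b873593;
	size_t nblocks = len / 4;
	uint32_t h = seed;
	uint32_t k;
	size_t i;

	/* body: one little-endian 32-bit block at a time */
	for (i = 0; i < nblocks; i++, p += 4) {
		k = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
		    ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
		k *= c1;
		k = rotl32(k, 15);
		k *= c2;

		h ^= k;
		h = rotl32(h, 13) * 5 + 0xe6546b64;
	}

	/* tail: up to 3 remaining bytes; 'p' now points at them */
	k = 0;
	switch (len & 3) {
	case 3:
		k ^= (uint32_t)p[2] << 16;
		/* fallthrough */
	case 2:
		k ^= (uint32_t)p[1] << 8;
		/* fallthrough */
	case 1:
		k ^= p[0];
		k *= c1;
		k = rotl32(k, 15);
		k *= c2;
		h ^= k;
	}

	/* finalization mix */
	h ^= (uint32_t)len;
	h ^= h >> 16;
	h *= 0x85ebca6b;
	h ^= h >> 13;
	h *= 0xc2b2ae35;
	h ^= h >> 16;

	return h;
}
```

Using 'unsigned char' for the byte pointer also avoids sign extension of
bytes with the high bit set, which a plain 'char' tail can introduce on
platforms where 'char' is signed.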
The implementation provided in this patch, with rotate_right (instead of
rotate_left) gives, on little-endian machine, different results:
'' -> 0x00000000
' ' -> 0xd1f27e64
'Hello world!' -> 0xa0791ad7
'The quick brown fox jumps over the lazy dog' -> 0x99f1676c
https://github.com/gitgitgadget/git/blob/e1b076a714d611e59d3d71c89221e41a3427fae4/bloom.c#L21
C source (via GitGitGadget): https://godbolt.org/z/R9s8Tt
Sidenote: While Godbolt.org site supports compiling with many different
compilers, including GCC, Clang (LLVM), icc (Intel), MSVC (via Wine),
and cross compiling for different platforms, including x86_64, ARM, MIPS,
PowerPC, power64 and power64le, AVR, it allows for execution only on
x86_64 i.e. little-endian.
We could create a test similar to the one for SHA-1 and SHA-256 in
t/t0015-hash.sh but for murmur3, for example:
test_expect_success 'test basic Murmur3_32 hash values' '
printf " " | test-tool murmur3_32 0 >actual &&
printf "7ef49b98" >expected &&
test_cmp expected actual &&
...
'
or
test_expect_success 'test basic Murmur3_32 hash values' '
printf " " | test-tool murmur3_32 0 >actual &&
grep "7ef49b98" actual &&
...
'
> +
> +static inline uint64_t get_bitmask(uint32_t pos)
> +{
> + return ((uint64_t)1) << (pos & (BITS_PER_WORD - 1));
> +}
All right, that creates a 64-bit wide mask with exactly one bit set, for a
64-bit word within the filter data. I just wonder if the trick with the &
operation is truly faster than the simpler-to-understand modulo operation
once compiler optimizations are applied.
static inline uint64_t get_bitmask(uint32_t pos)
{
return ((uint64_t)1) << (pos % BITS_PER_WORD);
}
Anyway, looks good (beside naming things, but I don't have better
proposal, and the function is static i.e. file-local anyway).
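For an unsigned position and a power-of-two word size the two spellings
compute the same value, which the following sketch makes explicit. The
_and / _mod suffixes are invented here just to have both side by side.

```c
#include <stdint.h>

#define BITS_PER_WORD 64

/* Variant from the patch: mask off the low 6 bits of 'pos'. */
static inline uint64_t get_bitmask_and(uint32_t pos)
{
	return ((uint64_t)1) << (pos & (BITS_PER_WORD - 1));
}

/* Variant suggested in the review: plain modulo. */
static inline uint64_t get_bitmask_mod(uint32_t pos)
{
	return ((uint64_t)1) << (pos % BITS_PER_WORD);
}
```

Optimizing compilers generally reduce the unsigned modulo by a constant
power of two to the same AND, so the choice is mostly about readability.
The equivalence would not hold for a signed 'pos' or for a word size
that is not a power of two.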
> +
> +void load_bloom_filters(void)
> +{
> + init_bloom_filter_slab(&bloom_filters);
> +}
Actually this function doesn't load anything. Perhaps it should be
named init_bloom_filters() or init_bloom_filters_storage(), or
bloom_filters_init()?
> +
> +void fill_bloom_key(const char *data,
> + int len,
> + struct bloom_key *key,
> + struct bloom_filter_settings *settings)
The last parameter could be of 'const bloom_filter_settings *' type.
> +{
> + int i;
> + const uint32_t seed0 = 0x293ae76f;
> + const uint32_t seed1 = 0x7e646e2c;
Where did those seed values come from?
> + const uint32_t hash0 = seed_murmur3(seed0, data, len);
> + const uint32_t hash1 = seed_murmur3(seed1, data, len);
> +
> + key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
> + for (i = 0; i < settings->num_hashes; i++)
> + key->hashes[i] = hash0 + i * hash1;
Note that in [3] the authors say that the double hashing technique has some
problems. For one, we should ensure that hash1 is not zero, and even
better that it is odd (which makes it relatively prime to the filter size,
which is a multiple of 64). It also suffers from something called
"approximate fingerprint collisions".
That is why they define the "enhanced double hashing" technique, which does
not suffer from those problems (Algorithm 2, page 11/15).
+ for (i = 0; i < settings->num_hashes; i++) {
+ key->hashes[i] = hash0;
+
+ hash0 = hash0 + hash1;
+ hash1 = hash1 + i;
+ }
This can also be written in closed form, based on equation (6)
+ for (i = 0; i < settings->num_hashes; i++)
+ key->hashes[i] = hash0 + i * hash1 + i*(i*i - 1)/6;
In later paper [6] the closed form for "enhanced double hashing"
(p. 188) is slightly modified (or rather they use different variant of
this technique):
+ for (i = 0; i < settings->num_hashes; i++)
+ key->hashes[i] = hash0 + i * hash1 + i*i;
This is a variant of more generic "enhanced double hashing", section
5.2 (Enhanced) Double Hashing Schemes (page 199):
h_1(u) + i h_2(u) + f(i) mod m
with f(i) = i^2 = i*i.
They tested enhanced double hashing with f(i) equal to i*i and to i*i*i,
as well as the triple hashing technique, and found that enhanced double
hashing performs slightly better than the straight double hashing technique
(Fig. 1, page 212, section 3).
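A small sketch of the incremental scheme next to the cubic closed form,
using plain uint32_t wraparound instead of an explicit "mod m". The
function names are invented for this example. Note the ordering: for
i >= 1 the two running values are updated before the hash is stored,
which is what makes the loop agree with i*(i*i - 1)/6.

```c
#include <stddef.h>
#include <stdint.h>

/* Incremental form: fill 'hashes[0..nr-1]' from two base hashes. */
static void enhanced_double_hashing(uint32_t hash0, uint32_t hash1,
				    uint32_t *hashes, size_t nr)
{
	size_t i;

	if (!nr)
		return;
	hashes[0] = hash0;
	for (i = 1; i < nr; i++) {
		hash0 += hash1;
		hash1 += (uint32_t)i;
		hashes[i] = hash0;
	}
}

/*
 * Closed form for the i-th value. The i*(i*i - 1) product is only
 * safe from overflow for small i, which is fine for a handful of
 * hash functions.
 */
static uint32_t enhanced_closed_form(uint32_t hash0, uint32_t hash1,
				     uint32_t i)
{
	return hash0 + i * hash1 + i * (i * i - 1) / 6;
}
```

The extra term is 0, 0, 1, 4, 10, ... for i = 0, 1, 2, 3, 4, so the first
two values are the same as with plain double hashing.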
> +}
> +
> +void add_key_to_filter(struct bloom_key *key,
> + struct bloom_filter *filter,
> + struct bloom_filter_settings *settings)
Here again the 'settings' argument can be const (as can the 'key'
parameter).
> +{
> + int i;
> + uint64_t mod = filter->len * BITS_PER_WORD;
> +
> + for (i = 0; i < settings->num_hashes; i++) {
> + uint64_t hash_mod = key->hashes[i] % mod;
> + uint64_t block_pos = hash_mod / BITS_PER_WORD;
> +
> + filter->data[block_pos] |= get_bitmask(hash_mod);
> + }
> +}
All right, bloom_key is an intermediate representation that is used both
for creating Bloom filter, and for querying it. In the latter case the
same path may be tested against Bloom filters for commits with different
number of (blob and tree) changes, and thus against Bloom filters with
different lengths. It makes sense for bloom_key to store just values of
hash functions, without arithmetic modulo filter size.
Though I think it could be a good idea to create add_str_to_filter() as
a wrapper around add_key_to_filter() and fill_bloom_key() functions.
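To illustrate why keeping the raw hash values in the key is useful, here
is a stripped-down sketch that sets bits for the same hash values in
filters of different lengths. The mini_filter type and mini_add_hashes()
are stand-ins invented for this example; they only mimic the shape of
bloom_filter and add_key_to_filter() from the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_WORD 64

/* Simplified stand-in for struct bloom_filter. */
struct mini_filter {
	uint64_t *data;
	size_t len;	/* number of 64-bit words in 'data' */
};

/* Set one bit per hash value; the modulo depends on the filter length. */
static void mini_add_hashes(struct mini_filter *filter,
			    const uint32_t *hashes, size_t nr)
{
	uint64_t mod = (uint64_t)filter->len * BITS_PER_WORD;
	size_t i;

	if (!mod)
		return;	/* nothing to set in an empty filter */

	for (i = 0; i < nr; i++) {
		uint64_t hash_mod = hashes[i] % mod;
		uint64_t block_pos = hash_mod / BITS_PER_WORD;

		filter->data[block_pos] |=
			(uint64_t)1 << (hash_mod % BITS_PER_WORD);
	}
}
```

The same array of hash values lands on different bit positions in a
one-word and a two-word filter, because the reduction happens only at
the point where the filter length is known.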
> +
> +struct bloom_filter *get_bloom_filter(struct repository *r,
> + struct commit *c)
> +{
> + struct bloom_filter *filter;
> + struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
> + int i;
> + struct diff_options diffopt;
> +
> + if (!bloom_filters.slab_size)
> + return NULL;
This is testing that commit slab for per-commit Bloom filters is
initialized, isn't it?
First, should we write the condition as
if (!bloom_filters.slab_size)
or would the following be more readable
if (bloom_filters.slab_size == 0)
Second, should we return NULL, or should we just initialize the slab?
Or is non-existence of slab treated as a signal that the Bloom filters
mechanism is turned off?
> +
> + filter = bloom_filter_slab_at(&bloom_filters, c);
Wouldn't it be better to check if the data for the commit already exists on
the slab, and create the Bloom filter for the commit's changes only if it
does not exist, i.e.:
+ filter = bloom_filter_slab_peek(&bloom_filters, c);
+ if (filter)
+ return filter;
+ filter = bloom_filter_slab_at(&bloom_filters, c);
> +
> + repo_diff_setup(r, &diffopt);
> + diffopt.flags.recursive = 1;
> + diff_setup_done(&diffopt);
I'll punt on checking this. Looks all right from first glance, and
follows calling sequence in https://github.com/git/git/blob/master/diff.h#L26
> +
> + if (c->parents)
> + diff_tree_oid(&c->parents->item->object.oid, &c->object.oid, "", &diffopt);
> + else
> + diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
> + diffcore_std(&diffopt);
All right, that computes first-parent diff (or diff from empty tree if
there are no parents).
> +
> + if (diff_queued_diff.nr <= 512) {
First, shouldn't this magic value 512 be hidden behind some symbolic
name (some preprocessor constant), e.g. BLOOM_MAX_CHANGES? On the other
hand this value is used only once (except tests), so it might be not
worth it -- especially coming up with a good name.
Second, there is a minor issue that diff_queue_struct.nr stores the
number of filepairs, that is the number of changed files, while the
number of elements added to Bloom filter is number of changed blobs and
trees. For example if the following files are changed:
sub/dir/file1
sub/file2
then diff_queued_diff.nr is 2, but number of elements to be added to
Bloom filter is 4.
sub/dir/file1
sub/file2
sub/dir/
sub/
I'm not sure if it matters in practice.
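To make the difference between the two counts concrete, here is a sketch
that counts distinct entries (each path plus each of its leading
directories) for a list of changed paths. It is a standalone
illustration with made-up names and a quadratic comparison instead of a
hashmap; it assumes paths without leading, trailing, or doubled slashes.

```c
#include <stddef.h>
#include <string.h>

/*
 * Do the first 'len' bytes of 'path' name 'other' itself or one of
 * the leading directories of 'other'?
 */
static int is_entry_of(const char *other, const char *path, size_t len)
{
	return strlen(other) >= len &&
	       !memcmp(other, path, len) &&
	       (other[len] == '\0' || other[len] == '/');
}

/* Count distinct paths and leading directories over all 'paths'. */
static size_t count_bloom_entries(const char **paths, size_t nr)
{
	size_t i, j, count = 0;

	for (i = 0; i < nr; i++) {
		size_t len = strlen(paths[i]);

		while (len) {
			/* skip entries an earlier path already produced */
			for (j = 0; j < i; j++)
				if (is_entry_of(paths[j], paths[i], len))
					break;
			if (j == i)
				count++;

			/* shrink to the parent directory, if any */
			while (len && paths[i][len - 1] != '/')
				len--;
			if (len)
				len--;	/* drop the '/' itself */
		}
	}
	return count;
}
```

For the two paths in the example above this counts "sub" once, giving 4
entries for 2 changed files.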
> + struct hashmap pathmap;
> + struct pathmap_hash_entry* e;
> + struct hashmap_iter iter;
> + hashmap_init(&pathmap, NULL, NULL, 0);
Stylistic issue: I have just noticed that here (and in some other
places), but not in all cases, you declare pointer types with asterisk
cuddled to type name, not to variable name, which contradicts
CodingGuidelines:
- When declaring pointers, the star sides with the variable
name, i.e. "char *string", not "char* string" or
"char * string". This makes it easier to understand code
like "char *string, c;".
In this case it should be
+ struct pathmap_hash_entry *e;
In many other places in this patch it is correct, though.
> +
> + for (i = 0; i < diff_queued_diff.nr; i++) {
> + const char* path = diff_queued_diff.queue[i]->two->path;
Is it correct that we consider only the post-image name for storing
changes in the Bloom filter? Currently if a file was renamed (or deleted),
it is considered changed, and `git log -- <old-name>` lists the commit that
changed the file name too.
> + const char* p = path;
It should be "const char *" for both.
> +
> + /*
> + * Add each leading directory of the changed file, i.e. for
> + * 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
> + * the Bloom filter could be used to speed up commands like
> + * 'git log dir/subdir', too.
> + *
> + * Note that directories are added without the trailing '/'.
> + */
> + do {
> + char* last_slash = strrchr(p, '/');
> +
> + FLEX_ALLOC_STR(e, path, path);
Here first 'path' is the field name, i.e. pathmap_hash_entry.path,
second 'path' is the name of local variable, aliased also to 'p'.
> + hashmap_entry_init(&e->entry, strhash(p));
I don't know why both 'path' and 'p' are used, while both point to the
same memory (and thus have the same contents). It is a bit confusing.
See also my previous comment.
> + hashmap_add(&pathmap, &e->entry);
> +
> + if (!last_slash)
> + last_slash = (char*)p;
> + *last_slash = '\0';
> +
> + } while (*p);
Looks good. We overwrite '/' with '\0', and gather shrinking pathnames
along the way.
> +
> + diff_free_filepair(diff_queued_diff.queue[i]);
> + }
> +
> + filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
All right, this is division by BITS_PER_WORD, rounding up.
Sidenote: I see now why hashmap was used, it was to be able to get
number of unique changes (changed blobs and trees) easily.
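The rounding-up division can be isolated in a tiny helper, which also
shows directly what happens for zero entries. The helper name is made up
for this sketch; the arithmetic is the same as in the quoted line.

```c
#include <stddef.h>

#define BITS_PER_WORD 64

/* Number of 64-bit words needed for 'nr_entries' entries, rounded up. */
static size_t bloom_filter_words(size_t nr_entries, size_t bits_per_entry)
{
	return (nr_entries * bits_per_entry + BITS_PER_WORD - 1) /
	       BITS_PER_WORD;
}
```

With 10 bits per entry, 6 entries still fit in one word (60 bits), 7
entries need two words, and 0 entries yield a length of 0.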
> + filter->data = xcalloc(filter->len, sizeof(uint64_t));
> +
> + hashmap_for_each_entry(&pathmap, &iter, e, entry) {
> + struct bloom_key key;
> + fill_bloom_key(e->path, strlen(e->path), &key, &settings);
> + add_key_to_filter(&key, filter, &settings);
> + }
All right.
> +
> + hashmap_free_entries(&pathmap, struct pathmap_hash_entry, entry);
> + } else {
> + for (i = 0; i < diff_queued_diff.nr; i++)
> + diff_free_filepair(diff_queued_diff.queue[i]);
All right, that frees the memory taken by diff results.
> + filter->data = NULL;
> + filter->len = 0;
This needs to be explicitly stated both in the commit message and in the
API documentation (in comments) that bloom_filter.len == 0 means "no
data", while "no changes" is represented as bloom_filter with len == 1
and *data == (uint64_t)0;
EDIT: actually "no changes" is also represented as bloom_filter with len
equal 0, as it turns out.
One possible alternative could be representing "no data" value with
Bloom filter of length 1 and all 64 bits set to 1, and "no changes"
represented as filter of length 0. This is not unambiguous choice!
> + }
> +
> + free(diff_queued_diff.queue);
> + DIFF_QUEUE_CLEAR(&diff_queued_diff);
> +
> + return filter;
> +}
All right.
> +
> +int bloom_filter_contains(struct bloom_filter *filter,
> + struct bloom_key *key,
> + struct bloom_filter_settings *settings)
It might be good idea to define enum for return values, that is
NO_DATA = -1, NO = 0, MAYBE = 1.
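A sketch of how such an enum could look together with a simplified
lookup. The enum name, the BLOOM_ prefix, and the mini_filter type are
assumptions made for this example; only the three values come from the
suggestion above.

```c
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_WORD 64

enum bloom_result {
	BLOOM_NO_DATA = -1,	/* filter has length 0 */
	BLOOM_NO = 0,		/* key is definitely not in the filter */
	BLOOM_MAYBE = 1		/* key is probably in the filter */
};

/* Simplified, read-only stand-in for struct bloom_filter. */
struct mini_filter {
	const uint64_t *data;
	size_t len;
};

static enum bloom_result mini_filter_contains(const struct mini_filter *filter,
					      const uint32_t *hashes,
					      size_t nr)
{
	uint64_t mod = (uint64_t)filter->len * BITS_PER_WORD;
	size_t i;

	if (!mod)
		return BLOOM_NO_DATA;

	for (i = 0; i < nr; i++) {
		uint64_t hash_mod = hashes[i] % mod;
		uint64_t block_pos = hash_mod / BITS_PER_WORD;
		uint64_t mask = (uint64_t)1 << (hash_mod % BITS_PER_WORD);

		if (!(filter->data[block_pos] & mask))
			return BLOOM_NO;
	}
	return BLOOM_MAYBE;
}
```

Callers can then write "if (result == BLOOM_NO)" instead of comparing
against bare integers, which makes the three-way outcome harder to
misread as a boolean.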
> +{
> + int i;
> + uint64_t mod = filter->len * BITS_PER_WORD;
> +
> + if (!mod)
> + return -1;
All right, it is different way of writing
if (filter->len == 0)
return -1;
which means "no data" (too many elements for Bloom filter to store).
EDIT: or "no changes".
> +
> + for (i = 0; i < settings->num_hashes; i++) {
> + uint64_t hash_mod = key->hashes[i] % mod;
> + uint64_t block_pos = hash_mod / BITS_PER_WORD;
> + if (!(filter->data[block_pos] & get_bitmask(hash_mod)))
> + return 0;
All right, if any of hash functions (hash results) doesn't match what is
stored in filter, then the key cannot be contained in the Bloom filter.
> + }
> +
> + return 1;
All right, otherwise the key is probably included in filter, but may be
false positive (with around 1% probability in theory).
This means that if we get value of 0, we can skip checking the diff; we
know commit is TREESAME with respect to the path given.
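As a rough check of the "around 1%" figure, the usual approximation for
the false positive probability of a Bloom filter with m bits, n entries
and k hash functions can be evaluated for the default settings of 10
bits per entry and 7 hashes. This assumes independent, uniformly
distributed hash values, which double hashing only approximates.

```latex
p \approx \left(1 - e^{-k n / m}\right)^{k},
\qquad
\frac{m}{n} = 10,\; k = 7
\;\Longrightarrow\;
p \approx \left(1 - e^{-0.7}\right)^{7} \approx 0.0082
```

That is a little under 1%, before accounting for the rounding of the
filter size up to a multiple of 64 bits.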
> +}
> diff --git a/bloom.h b/bloom.h
> new file mode 100644
> index 0000000000..7f40c751f7
> --- /dev/null
> +++ b/bloom.h
> @@ -0,0 +1,56 @@
> +#ifndef BLOOM_H
> +#define BLOOM_H
Should we #include the stdint.h header for uint32_t and uint64_t types?
> +
> +struct commit;
> +struct repository;
> +struct commit_graph;
> +
Perhaps we should add block comment for this struct, like there is one
for struct bloom_filter below.
> +struct bloom_filter_settings {
> + uint32_t hash_version;
> + uint32_t num_hashes;
> + uint32_t bits_per_entry;
I guess that the type uint32_t was chosen to make it easier to store
this information and later retrieve it from the commit-graph file, isn't
it? Otherwise those types are much too large for sensible range of
values (which would all fit in 8-bits byte).
> +};
> +
> +#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
> +#define BITS_PER_WORD 64
Sidenote: While CodingGuidelines explicitly says:
- We try to support a wide range of C compilers to compile Git with,
including old ones. You should not use features from newer C
standard, even if your compiler groks them.
There are a few exceptions to this guideline:
[...]
. since mid 2017 with cbc0f81d, we have been using designated
initializers for struct (e.g. "struct t v = { .val = 'a' };").
I don't think however that using designated initializers in
DEFAULT_BLOOM_FILTER_SETTINGS is needed, as this preprocessor constant
is just below the definition of struct bloom_filter_settings type.
> +
> +/*
> + * A bloom_filter struct represents a data segment to
> + * use when testing hash values. The 'len' member
> + * dictates how many uint64_t entries are stored in
> + * 'data'.
> + */
> +struct bloom_filter {
> + uint64_t *data;
> + int len;
> +};
Just wondering: is there any advantage or disadvantage to putting 'len'
field first (i.e. before 'data') versus putting it after (i.e. after
'data')? Is there a convention that Git uses?
> +
> +/*
> + * A bloom_key represents the k hash values for a
> + * given hash input. These can be precomputed and
> + * stored in a bloom_key for re-use when testing
> + * against a bloom_filter.
We might want to add that the number of hash values is given by Bloom
filter settings, and it is assumed to be the same for all bloom_key
variables / objects.
> + */
> +struct bloom_key {
> + uint32_t *hashes;
> +};
> +
> +void load_bloom_filters(void);
> +
> +void fill_bloom_key(const char *data,
> + int len,
> + struct bloom_key *key,
> + struct bloom_filter_settings *settings);
> +
> +void add_key_to_filter(struct bloom_key *key,
> + struct bloom_filter *filter,
> + struct bloom_filter_settings *settings);
> +
> +struct bloom_filter *get_bloom_filter(struct repository *r,
> + struct commit *c);
> +
> +int bloom_filter_contains(struct bloom_filter *filter,
> + struct bloom_key *key,
> + struct bloom_filter_settings *settings);
> +
> +#endif
> diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
> new file mode 100644
> index 0000000000..331957011b
> --- /dev/null
> +++ b/t/helper/test-bloom.c
> @@ -0,0 +1,84 @@
> +#include "test-tool.h"
> +#include "git-compat-util.h"
> +#include "bloom.h"
> +#include "test-tool.h"
> +#include "cache.h"
> +#include "commit-graph.h"
> +#include "commit.h"
> +#include "config.h"
> +#include "object-store.h"
> +#include "object.h"
> +#include "repository.h"
> +#include "tree.h"
> +
> +struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
> +
> +static void print_bloom_filter(struct bloom_filter *filter) {
> + int i;
> +
> + if (!filter) {
> + printf("No filter.\n");
> + return;
> + }
> + printf("Filter_Length:%d\n", filter->len);
> + printf("Filter_Data:");
> + for (i = 0; i < filter->len; i++){
> + printf("%"PRIx64"|", filter->data[i]);
> + }
> + printf("\n");
> +}
> +
> +static void add_string_to_filter(const char *data, struct bloom_filter *filter) {
> + struct bloom_key key;
> + int i;
> +
> + fill_bloom_key(data, strlen(data), &key, &settings);
> + printf("Hashes:");
> + for (i = 0; i < settings.num_hashes; i++){
> + printf("%08x|", key.hashes[i]);
> + }
> + printf("\n");
> + add_key_to_filter(&key, filter, &settings);
> +}
> +
> +static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
> +{
> + struct commit *c;
> + struct bloom_filter *filter;
> + setup_git_directory();
> + c = lookup_commit(the_repository, commit_oid);
> + filter = get_bloom_filter(the_repository, c);
> + print_bloom_filter(filter);
> +}
> +
> +int cmd__bloom(int argc, const char **argv)
> +{
> + if (!strcmp(argv[1], "generate_filter")) {
> + struct bloom_filter filter;
> + int i = 2;
> + filter.len = (settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
> + filter.data = xcalloc(filter.len, sizeof(uint64_t));
> +
> + if (!argv[2]){
> + die("at least one input string expected");
> + }
> +
> + while (argv[i]) {
> + add_string_to_filter(argv[i], &filter);
> + i++;
> + }
> +
> + print_bloom_filter(&filter);
> + }
> +
> + if (!strcmp(argv[1], "get_filter_for_commit")) {
> + struct object_id oid;
> + const char *end;
> + if (parse_oid_hex(argv[2], &oid, &end))
> + die("cannot parse oid '%s'", argv[2]);
> + load_bloom_filters();
> + get_bloom_filter_for_commit(&oid);
> + }
> +
> + return 0;
> +}
I won't comment on test-tool code, as I think the Bloom filter and
Murmur3 hash tests should be structured differently, which would
completely change test-bloom.c code.
> diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
> index c9a232d238..ca4f4b0066 100644
> --- a/t/helper/test-tool.c
> +++ b/t/helper/test-tool.c
> @@ -14,6 +14,7 @@ struct test_cmd {
> };
>
> static struct test_cmd cmds[] = {
> + { "bloom", cmd__bloom },
> { "chmtime", cmd__chmtime },
> { "config", cmd__config },
> { "ctype", cmd__ctype },
> diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
> index c8549fd87f..05d2b32451 100644
> --- a/t/helper/test-tool.h
> +++ b/t/helper/test-tool.h
> @@ -4,6 +4,7 @@
> #define USE_THE_INDEX_COMPATIBILITY_MACROS
> #include "git-compat-util.h"
>
> +int cmd__bloom(int argc, const char **argv);
> int cmd__chmtime(int argc, const char **argv);
> int cmd__config(int argc, const char **argv);
> int cmd__ctype(int argc, const char **argv);
All right, looks good.
> diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
> new file mode 100755
> index 0000000000..424fe4fc29
> --- /dev/null
> +++ b/t/t0095-bloom.sh
> @@ -0,0 +1,113 @@
> +#!/bin/sh
> +
> +test_description='test bloom.c'
This description is a bit lackluster...
> +. ./test-lib.sh
> +
> +test_expect_success 'get bloom filters for commit with no changes' '
> + git init &&
> + git commit --allow-empty -m "c0" &&
> + cat >expect <<-\EOF &&
> + Filter_Length:0
> + Filter_Data:
> + EOF
> + test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
> + test_cmp expect actual
> +'
A few things. First, I wonder why we need to provide object ID;
couldn't 'test-tool bloom get_filter_for_commit' parse commit-ish
argument, or would it make it too complicated for no reason?
Second, why do both "no changes" (here) and "no data" have the same
representation, a filter with length equal to 0? Let's take a look at the
code.
For no changes:
filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
^^^^^^^^^^^^^^^^^^^^^^^^^^ == 0 for no changes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
\-- == 0 + BITS_PER_WORD - 1 for no changes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
\-- == 0 for no changes
filter->data = xcalloc(filter->len, sizeof(uint64_t));
^^^^^^^^^^^ == 0 for no changes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
\-- is NULL or unique pointer that can be passed to free()
For more than 512 changed files:
filter->data = NULL;
filter->len = 0;
Not being able to distinguish between the "no data" and "no changes in the
commit" cases means that we would always perform a full diff for a commit
with no changes, unnecessarily. Fortunately there should be no hit to
performance, as in this case we simply need to compare the object IDs of
the top-level trees to know that there is no change.
If it is a design decision we go with, it should be in my opinion at
least explained in the commit message explicitly.
> +
> +test_expect_success 'get bloom filter for commit with 10 changes' '
> + rm actual &&
> + rm expect &&
> + mkdir smallDir &&
> + for i in $(test_seq 0 9)
> + do
> + echo $i >smallDir/$i
> + done &&
> + git add smallDir &&
> + git commit -m "commit with 10 changes" &&
> + cat >expect <<-\EOF &&
> + Filter_Length:4
> + Filter_Data:508928809087080a|8a7648210804001|4089824400951000|841ab310098051a8|
> + EOF
> + test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
> + test_cmp expect actual
> +'
This test is in my opinion fragile, as it unnecessarily tests the
implementation details instead of the functionality provided. If we
change the hashing scheme (for example going from double hashing to some
variant of enhanced double hashing), or change the base hash function
(for example from Murmur3_32 to xxHash_64), or change the number of hash
functions (perhaps because of a change in the number of bits per element,
and thus in the optimal number of hash functions from 7 to 6), or change
from 64-bit word blocks to 32-bit word blocks, the test would have to be
changed.
What I think would be a good test is something like t/t0011-hashmap.sh.
For example test that the Bloom filter size scales correctly could look
like this:
test_bloom() {
echo "$1" | test-tool bloom $3 >actual &&
echo "$2" >expect &&
test_cmp expect actual
}
test_expect_success 'Bloom filter for commit size scales with number of changes' '
mkdir smallDir &&
for i in $(test_seq 0 9)
do
echo $i >smallDir/$i
done &&
git add smallDir &&
git commit -m "commit with 10 changes" &&
HEAD=$(git rev-parse HEAD) &&
test-tool bloom >actual <<-EOF &&
add-commit $HEAD
len-commit $HEAD
EOF
echo "4" >expect &&
test_cmp expect actual
'
> +
> +test_expect_success EXPENSIVE 'get bloom filter for commit with 513 changes' '
> + rm actual &&
> + rm expect &&
> + mkdir bigDir &&
> + for i in $(test_seq 0 512)
> + do
> + echo $i >bigDir/$i
> + done &&
> + git add bigDir &&
> + git commit -m "commit with 513 changes" &&
> + cat >expect <<-\EOF &&
> + Filter_Length:0
> + Filter_Data:
> + EOF
> + test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
> + test_cmp expect actual
> +'
All right, it is a good test to have (though perhaps in a modified, less
fragile form).
> +
> +test_expect_success 'compute bloom key for empty string' '
> + cat >expect <<-\EOF &&
> + Hashes:5615800c|5b966560|61174ab4|66983008|6c19155c|7199fab0|771ae004|
> + Filter_Length:1
> + Filter_Data:11000110001110|
> + EOF
> + test-tool bloom generate_filter "" >actual &&
> + test_cmp expect actual
> +'
This might be an unnecessarily fragile test, but it might be a good test
for the double hashing or enhanced double hashing technique. The Murmur3
hash of empty data (empty string) depends only on the seed value (it is
the finalization mix of the seed), so the result of the (enhanced) double
hashing technique is predictable, given two seed values.
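The "Hashes:" line in the quoted test is indeed an arithmetic progression,
as plain double hashing predicts. The sketch below reproduces it from the
first value and the difference between the first two values; those two
inputs are read off the quoted expected output, and the function name is
invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain double hashing: hashes[i] = hash0 + i * hash1 (mod 2^32). */
static void double_hashing(uint32_t hash0, uint32_t hash1,
			   uint32_t *hashes, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		hashes[i] = hash0 + (uint32_t)i * hash1;
}
```

Calling double_hashing(0x5615800c, 0x0580e554, hashes, 7) yields the seven
values from the test, ending in 0x771ae004. A test that checks this
progression would survive a change of the base hash function, as long
as the combination scheme stays the same.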
> +
> +test_expect_success 'compute bloom key for whitespace' '
> + cat >expect <<-\EOF &&
> + Hashes:1bf014e6|8a91b50b|f9335530|67d4f555|d676957a|4518359f|b3b9d5c4|
> + Filter_Length:1
> + Filter_Data:401004080200810|
> + EOF
> + test-tool bloom generate_filter " " >actual &&
> + test_cmp expect actual
> +'
Instead of those two fragile tests (that depend on irrelevant details of
the implementation), it would be better to create test similar to those
in t/t0011-hashmap.sh, for example:
test_expect_success 'testing Bloom filter querying' '
test_bloom "add abc
add abcdef
check abc
check abcdef
check abcdee
check abcdefghi
len" "maybe
maybe
no
no
1"
'
Or maybe something like this:
test_expect_success 'testing Bloom filter querying' '
cat >commands <<-\EOF &&
add abc
add abcdef
check abc
check abcdef
check abcdee
check abcdefghi
len
EOF
cat >expect <<-\EOF &&
maybe
maybe
no
no
1
EOF
test-tool bloom <commands >actual &&
test_cmp expect actual
'
> +
> +test_expect_success 'compute bloom key for a root level folder' '
> + cat >expect <<-\EOF &&
> + Hashes:1a21016f|fff1c06d|e5c27f6b|cb933e69|b163fd67|9734bc65|7d057b63|
> + Filter_Length:1
> + Filter_Data:aaa800000000|
> + EOF
> + test-tool bloom generate_filter "A" >actual &&
> + test_cmp expect actual
> +'
> +
> +test_expect_success 'compute bloom key for a root level file' '
> + cat >expect <<-\EOF &&
> + Hashes:e2d51107|30970605|7e58fb03|cc1af001|19dce4ff|679ed9fd|b560cefb|
> + Filter_Length:1
> + Filter_Data:a8000000000000aa|
> + EOF
> + test-tool bloom generate_filter "file.txt" >actual &&
> + test_cmp expect actual
> +'
> +
> +test_expect_success 'compute bloom key for a deep folder' '
> + cat >expect <<-\EOF &&
> + Hashes:864cf838|27f055cd|c993b362|6b3710f7|0cda6e8c|ae7dcc21|502129b6|
> + Filter_Length:1
> + Filter_Data:1c0000600003000|
> + EOF
> + test-tool bloom generate_filter "A/B/C/D/E" >actual &&
> + test_cmp expect actual
> +'
> +
> +test_expect_success 'compute bloom key for a deep file' '
> + cat >expect <<-\EOF &&
> + Hashes:07cdf850|4af629c7|8e1e5b3e|d1468cb5|146ebe2c|5796efa3|9abf211a|
> + Filter_Length:1
> + Filter_Data:4020100804010080|
> + EOF
> + test-tool bloom generate_filter "A/B/C/D/E/file.txt" >actual &&
> + test_cmp expect actual
> +'
What are those meant to test? For the Bloom filter itself it doesn't
matter if we add "A/B/C/file.txt" string to filter, or "ABC" string.
What we didn't test is that changed _directories_ are also added to the
Bloom filter for a commit. Such test could look like this:
test_expect_success 'changed directories are added to Bloom filter' '
mkdir -p A/B &&
echo "foo" >A/B/file.txt &&
git add A/B/file.txt &&
git commit -m "add A/B/file.txt" &&
HEAD=$(git rev-parse HEAD) &&
cat >commands <<-EOF &&
add-commit $HEAD
check A/B/file.txt
check A/B
check A
EOF
cat >expect <<-\EOF &&
maybe
maybe
maybe
EOF
test-tool bloom <commands >actual &&
test_cmp expect actual
'
> +
> +test_done
Reviewed-by: Jakub Narębski <jnareb@gmail.com>
Thanks for working on this.
Best,
--
Jakub Narębski
* Re: [PATCH v2 02/11] bloom: core Bloom filter implementation for changed paths
2020-02-16 16:49 ` Jakub Narebski
@ 2020-02-22 0:32 ` Garima Singh
2020-02-23 13:38 ` Jakub Narebski
0 siblings, 1 reply; 159+ messages in thread
From: Garima Singh @ 2020-02-22 0:32 UTC (permalink / raw)
To: Jakub Narebski, Garima Singh via GitGitGadget
Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
Jeff Hostetler, Taylor Blau, Jeff King, Christian Couder,
Emily Shaffer, Junio C Hamano, Garima Singh
On 2/16/2020 11:49 AM, Jakub Narebski wrote:
>> From: Garima Singh <garima.singh@microsoft.com>
>>
>> Add the core Bloom filter logic for computing the paths changed between a
>> commit and its first parent. For details on what Bloom filters are and how they
>> work, please refer to Dr. Derrick Stolee's blog post [1]. It provides a concise
>> explaination of the adoption of Bloom filters as described in [2] and [3].
> ^^- to add
Not sure what this means. Can you please clarify.
>> 1. We currently use 7 and 10 for the number of hashes and the size of each
>> entry respectively. They served as great starting values, the mathematical
>> details behind this choice are described in [1] and [4]. The implementation,
> ^^- to add
Not sure what this means. Can you please clarify.
>> 3. The filters are sized according to the number of changes in the each commit,
>> with minimum size of one 64 bit word.
>
> If I understand it correctly (but which might not be entirely clear),
> the filter size in bits is the number of changes^* times 10, rounded up
> to the nearest multiple of 64.
>
> [*] where the number of changes is the number of changed files (new blob
> objects) _and_ the number of changed directories (new tree objects,
> excluding root tree object change).
>
Yes.
> The interesting corner case, which might be worth specifying explicitly,
> is what happens in the case there are _no changes_ with respect to first
> parent (which can happen with either commit created with `git commit
> --allow-empty`, or merge created e.g. with `git merge --strategy=ours`).
> Is this case represented as Bloom filter of length 0, or as a Bloom
> filter of length of one 64-bit word which is minimal length composed of
> all 0's (0x0000000000000000)?
>
See t0095-bloom.sh: The filter for a commit with no changes is of length 0.
I will call it out specifically in the appropriate commit message as well.
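To make the sizing rule above concrete, here is a minimal sketch of it in C.
It follows the rule as worded in the review (number of changes times bits per
entry, rounded up to whole 64-bit words, and length 0 for a commit with no
changes); it is not copied from bloom.c, and the helper name
filter_len_in_words is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BITS_PER_WORD 64

/*
 * Hypothetical helper (not part of the patch): number of 64-bit words
 * needed for a filter that holds nr_changes entries.
 */
static size_t filter_len_in_words(size_t nr_changes, uint32_t bits_per_entry)
{
	/* A commit with no changes gets a filter of length 0. */
	if (!nr_changes)
		return 0;

	/* Round the bit count up to the next multiple of 64. */
	return (nr_changes * bits_per_entry + SKETCH_BITS_PER_WORD - 1) /
		SKETCH_BITS_PER_WORD;
}
```

With 10 bits per entry, up to 6 changes fit into one word, and 7 changes
(70 bits) already need two.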
>>
>> [1] https://devblogs.microsoft.com/devops/super-charging-the-git-commit-graph-iv-Bloom-filters/
>
> I would write it in full, similar to subsequent bibliographical entries,
> that is:
>
> [1] Derrick Stolee
> "Supercharging the Git Commit Graph IV: Bloom Filters"
> https://devblogs.microsoft.com/devops/super-charging-the-git-commit-graph-iv-Bloom-filters/
>
> But that is just a matter of style.
>
Sounds good. Will do.
>>
>> [4] Thomas Mueller Graf, Daniel Lemire
>> "Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters"
>> https://arxiv.org/abs/1912.08258
>>
>> [5] https://en.wikipedia.org/wiki/MurmurHash#Algorithm
>>
>> Helped-by: Jeff King <peff@peff.net>
>> Helped-by: Derrick Stolee <dstolee@microsoft.com>
>> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
>> ---
>> Makefile | 2 +
>> bloom.c | 228 ++++++++++++++++++++++++++++++++++++++++++
>> bloom.h | 56 +++++++++++
>> t/helper/test-bloom.c | 84 ++++++++++++++++
>> t/helper/test-tool.c | 1 +
>> t/helper/test-tool.h | 1 +
>> t/t0095-bloom.sh | 113 +++++++++++++++++++++
>> 7 files changed, 485 insertions(+)
>> create mode 100644 bloom.c
>> create mode 100644 bloom.h
>> create mode 100644 t/helper/test-bloom.c
>> create mode 100755 t/t0095-bloom.sh
>
> As I wrote earlier, In my opinion this patch could be split into three
> individual single-functionality pieces, to make it easier to review and
> aid in bisectability if needed.
>
Doing this in v3.
>> +
>> +static uint32_t rotate_right(uint32_t value, int32_t count)
>> +{
>> + uint32_t mask = 8 * sizeof(uint32_t) - 1;
>> + count &= mask;
>> + return ((value >> count) | (value << ((-count) & mask)));
>> +}
>
> Hmmm... both the algoritm on Wikipedia, and reference implementation use
> rotate *left*, not rotate *right* in the implementation of Murmur3 hash,
> see
>
> https://en.wikipedia.org/wiki/MurmurHash#Algorithm
> https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp#L23
>
>
> inline uint32_t rotl32 ( uint32_t x, int8_t r )
> {
> return (x << r) | (x >> (32 - r));
> }
>
Thanks! Fixed this in v3. More on it later.
>> +
>> +/*
>> + * Calculate a hash value for the given data using the given seed.
>> + * Produces a uniformly distributed hash value.
>> + * Not considered to be cryptographically secure.
>> + * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
>> + **/
> ^^-- why two _trailing_ asterisks?
>
Oops. Fixed.
>> +static uint32_t seed_murmur3(uint32_t seed, const char *data, int len)
>
> In short, I think that the name of the function should be murmur3_32, or
> murmurhash3_32, or possibly murmur3_32_seed, or something like that.
>
Renamed it to murmur3_seeded in v3. The input and output types in the
signature make it clear that it is 32-bit version.
>> +{
>> + const uint32_t c1 = 0xcc9e2d51;
>> + const uint32_t c2 = 0x1b873593;
>> + const uint32_t r1 = 15;
>> + const uint32_t r2 = 13;
>> + const uint32_t m = 5;
>> + const uint32_t n = 0xe6546b64;
>> + int i;
>> + uint32_t k1 = 0;
>> + const char *tail;
>> +
>> + int len4 = len / sizeof(uint32_t);
>> +
>> + const uint32_t *blocks = (const uint32_t*)data;
>> +
>> + uint32_t k;
>> + for (i = 0; i < len4; i++)
>> + {
>> + k = blocks[i];
>
> IMPORTANT: There is a comment around there in the example implementation
> in C on Wikipedia that this operation above is a source of differing
> results across endianness.
Thanks! SZEDER found this on his CI pipeline and we have fixed it to
process the data in 1 byte words to avoid hitting any endian-ness issues.
See this part of the thread that carries the fix and the related discussion.
https://lore.kernel.org/git/ba856e20-0a3c-e2d2-6744-b9abfacdc465@gmail.com/
I will be squashing those changes in appropriately in v3.
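As an illustration of why reading the input one byte at a time sidesteps the
problem, here is a small sketch. It is only an assumption about the general
shape of such a fix, not the code from the linked patch, and load_le32 is a
hypothetical helper name.

```c
#include <stdint.h>

/*
 * Hypothetical helper: assemble a 32-bit block from four bytes in a fixed
 * (little-endian) order. Because only byte reads and shifts are used, the
 * result is the same on little-endian and big-endian hosts, unlike a
 * direct load through a (const uint32_t *) cast.
 */
static uint32_t load_le32(const unsigned char *p)
{
	return (uint32_t)p[0] |
	       ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) |
	       ((uint32_t)p[3] << 24);
}
```

It also avoids the unaligned access that the pointer cast can cause when the
path string does not start on a 4-byte boundary.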
>> + k1 *= c2;
>> + seed ^= k1;
>> + break;
>> + }
>> +
>> + seed ^= (uint32_t)len;
>> + seed ^= (seed >> 16);
>> + seed *= 0x85ebca6b;
>> + seed ^= (seed >> 13);
>> + seed *= 0xc2b2ae35;
>> + seed ^= (seed >> 16);
>> +
>> + return seed;
>> +}
>
> In https://public-inbox.org/git/ba856e20-0a3c-e2d2-6744-b9abfacdc465@gmail.com/
> you posted "[PATCH] Process bloom filter data as 1 byte words".
> This may avoid the Big-endian vs Little-endian confusion,
> that is wrong results on Big-endian architectures, but
> it also may slow down the algorithm.
>
Oh cool! You have seen that patch. And yes, we understand that it might add
a little overhead but at this point it is more important to be correct on all
architectures instead of micro-optimizing and introducing different
implementations for Little-endian and Big-endian. This would make this
series overly complicated. Optimizing the hashing techniques would deserve a
series of its own, which we can definitely revisit later.
> The public domain implementation in PMurHash.c in SMHasher
> (re)implementation in Chromium (see URL above) fall backs to 1-byte
> operations only if it doesn't know the endianness (or if it is neither
> little-endian, nor big-endian, i.e. middle-endian or mixed-endian --
> though I doubt that Git works correctly on mixed-endian anyway).
>
>
> Sidenote: it looks like the current implementation if Murmur hash in
> Cromium uses MurmurHash3_x86_32, i.e. little-endian unaligned-safe
> implementation, but prepares data by swapping with StringToLE32
> https://github.com/chromium/chromium/blob/master/components/variations/variations_murmur_hash.h
>
>
> Assuming that the terminating NUL ("\0") character of a c-string is not
> included in hash calculations, then murmur3_x86_32 hash has the
> following results (all results are for seed equal 0):
>
> '' -> 0x00000000
> ' ' -> 0x7ef49b98
> 'Hello world!' -> 0x627b0c2c
> 'The quick brown fox jumps over the lazy dog' -> 0x2e4ff723
>
> C source (from Wikipedia): https://godbolt.org/z/ofa2p8
> C++ source (Appleby's): https://godbolt.org/z/BoSt6V
>
> The implementation provided in this patch, with rotate_right (instead of
> rotate_left) gives, on little-endian machine, different results:
>
> '' -> 0x00000000
> ' ' -> 0xd1f27e64
> 'Hello world!' -> 0xa0791ad7
> 'The quick brown fox jumps over the lazy dog' -> 0x99f1676c
>
> https://github.com/gitgitgadget/git/blob/e1b076a714d611e59d3d71c89221e41a3427fae4/bloom.c#L21
> C source (via GitGitGadget): https://godbolt.org/z/R9s8Tt
>
Thanks! This is an excellent catch! Fixing the rotate_right to rotate_left,
gives us the same answers as the two implementations you pointed out. I have
added the appropriate unit tests in v3 and they match the values you obtained
from the other implementations. Thanks a lot for the rigor!
We based our implementation on the pseudo code and not on the sample code
presented here: https://en.wikipedia.org/wiki/MurmurHash#Algorithm
We just didn't parse the ROL instruction correctly.
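For reference, here is a minimal, self-contained sketch of 32-bit Murmur3 with
the rotation going left and the input consumed byte by byte. It is not the
code from v3; the names rotate_left_sketch and murmur3_32_sketch are made up
for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Rotate left; callers below only pass counts between 1 and 31. */
static uint32_t rotate_left_sketch(uint32_t value, unsigned count)
{
	return (value << count) | (value >> (32 - count));
}

/*
 * Sketch of murmur3_x86_32. For seed 0 the values quoted earlier in this
 * thread are:
 *   ""                                            -> 0x00000000
 *   "The quick brown fox jumps over the lazy dog" -> 0x2e4ff723
 */
static uint32_t murmur3_32_sketch(uint32_t seed, const char *data, size_t len)
{
	const uint32_t c1 = 0xcc9e2d51;
	const uint32_t c2 = 0x1b873593;
	const unsigned char *p = (const unsigned char *)data;
	size_t nblocks = len / 4;
	size_t i;
	uint32_t h = seed;
	uint32_t k;

	/* Body: one 4-byte block at a time, assembled in little-endian order. */
	for (i = 0; i < nblocks; i++) {
		k = (uint32_t)p[4 * i] |
		    ((uint32_t)p[4 * i + 1] << 8) |
		    ((uint32_t)p[4 * i + 2] << 16) |
		    ((uint32_t)p[4 * i + 3] << 24);
		k *= c1;
		k = rotate_left_sketch(k, 15);
		k *= c2;
		h ^= k;
		h = rotate_left_sketch(h, 13);
		h = h * 5 + 0xe6546b64;
	}

	/* Tail: the remaining 0 to 3 bytes. */
	p += 4 * nblocks;
	k = 0;
	switch (len & 3) {
	case 3:
		k ^= (uint32_t)p[2] << 16;
		/* fallthrough */
	case 2:
		k ^= (uint32_t)p[1] << 8;
		/* fallthrough */
	case 1:
		k ^= p[0];
		k *= c1;
		k = rotate_left_sketch(k, 15);
		k *= c2;
		h ^= k;
	}

	/* Finalization: mix in the length and avalanche the bits. */
	h ^= (uint32_t)len;
	h ^= h >> 16;
	h *= 0x85ebca6b;
	h ^= h >> 13;
	h *= 0xc2b2ae35;
	h ^= h >> 16;
	return h;
}
```

Swapping the two rotate_left_sketch calls for a right rotation is enough to
change every non-empty result, which is the mismatch reported above.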
>> +
>> +void load_bloom_filters(void)
>> +{
>> + init_bloom_filter_slab(&bloom_filters);
>> +}
>
>
> Actually this function doesn't load anything. Perhaps it should be
> named init_bloom_filters() or init_bloom_filters_storage(), or
> bloom_filters_init()?
>
Changed to init_bloom_filters() in v3. Thanks!
>> +
>> +void fill_bloom_key(const char *data,
>> + int len,
>> + struct bloom_key *key,
>> + struct bloom_filter_settings *settings)
>
> The last parameter could be of 'const bloom_filter_settings *' type.
>
Done.
>> +{
>> + int i;
>> + const uint32_t seed0 = 0x293ae76f;
>> + const uint32_t seed1 = 0x7e646e2c;
>
> Where did those seeds values came from?
>
Those values were chosen randomly. They will be fixed constants for the
current hashing version. I will add a note calling this out in the
appropriate commit messages and the Documentation in v3.
>> + const uint32_t hash0 = seed_murmur3(seed0, data, len);
>> + const uint32_t hash1 = seed_murmur3(seed1, data, len);
>> +
>> + key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
>> + for (i = 0; i < settings->num_hashes; i++)
>> + key->hashes[i] = hash0 + i * hash1;
>
> Note that in [3] authors say that double hashing technique has some
> problems. For one, we should ensure that hash1 is not zero, and even
> better that it is odd (which makes it relatively prime to filter size
> which is multiple of 64). It also suffers from something called
> "approximate fingerprint collisions".
>
> That is why the define "enhanced double hashing" technique, which does
> not suffer from those problems (Algorithm 2, page 11/15).
>
> + for (i = 0; i < settings->num_hashes; i++) {
> + key->hashes[i] = hash0;
> +
> + hash0 = hash0 + hash1;
> + hash1 = hash1 + i;
> + }
>
> This can also be written in closed form, based on equation (6)
>
> + for (i = 0; i < settings->num_hashes; i++)
> + key->hashes[i] = hash0 + i * hash1 + i*(i*i - 1)/6;
>
>
> In later paper [6] the closed form for "enhanced double hashing"
> (p. 188) is slightly modified (or rather they use different variant of
> this technique):
>
> + for (i = 0; i < settings->num_hashes; i++)
> + key->hashes[i] = hash0 + i * hash1 + i*i;
>
> This is a variant of more generic "enhanced double hashing", section
> 5.2 (Enhanced) Double Hashing Schemes (page 199):
>
> h_1(u) + i h_2(u) + f(i) mod m
>
> with f(i) = i^2 = i*i.
>
> They have tested that enhanced double hashing with both f(i) equal i*i
> and equal i*i*i, and triple hashing technique, and they have found that
> it performs slightly better than straight double hashing technique
> (Fig. 1, page 212, section 3).
>
Thanks for the detailed research here! The hash becoming zero and the
approximate fingerprint collision are both extremely rare situations. In both
cases, we would just see git log having to diff more trees than if it didn't
occur. While these techniques would be great optimizations to do, especially
if this implementation gets pulled into more generic hashing applications
in the code, we think that for the purposes of the current series - it is not
worth it. I say this because Azure Repos has been using this exact hashing
technique for several years now without any glitches. And we think it would
be great to rely on this battle-tested strategy in at least the first version
of this feature.
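To make the comparison concrete, here is a small sketch that puts the two
schemes side by side. The first function mirrors the hash0 + i * hash1 loop
from the patch; the second follows the "enhanced double hashing" loop exactly
as quoted in the review. Both function names are hypothetical and neither is
meant as the final implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain double hashing, as in the patch: out[i] = hash0 + i * hash1. */
static void fill_double(uint32_t hash0, uint32_t hash1,
			uint32_t *out, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		out[i] = hash0 + (uint32_t)i * hash1;
}

/* Enhanced double hashing, following the loop quoted in the review. */
static void fill_enhanced(uint32_t hash0, uint32_t hash1,
			  uint32_t *out, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		out[i] = hash0;
		hash0 += hash1;
		hash1 += (uint32_t)i;
	}
}
```

With hash1 equal to 0, fill_double() produces the same value for every i, so
the key sets a single bit; fill_enhanced() still moves away from hash0 after
a few steps. That is the degenerate case the review refers to.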
>> +}
>> +
>> +void add_key_to_filter(struct bloom_key *key,
>> + struct bloom_filter *filter,
>> + struct bloom_filter_settings *settings)
>
> Here again the 'settings' argument can be const (as can the 'key'
> parameter).
>
Done.
>> +
>> +struct bloom_filter *get_bloom_filter(struct repository *r,
>> + struct commit *c)
>> +{
>> + struct bloom_filter *filter;
>> + struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
>> + int i;
>> + struct diff_options diffopt;
>> +
>> + if (!bloom_filters.slab_size)
>> + return NULL;
>
> This is testing that commit slab for per-commit Bloom filters is
> initialized, isn't it?
>
> First, should we write the condition as
>
> if (!bloom_filters.slab_size)
>
> or would the following be more readable
>
> if (bloom_filters.slab_size == 0)
>
Sure. Switched to `if (bloom_filters.slab_size == 0)` in v3.
> Second, should we return NULL, or should we just initialize the slab?
> Or is non-existence of slab treated as a signal that the Bloom filters
> mechanism is turned off?
>
Yes. We purposefully chose to return NULL and skip the mechanism
entirely, because we use Bloom filters on a best-effort basis only.
>> +
>> + if (diff_queued_diff.nr <= 512) {
>
> Second, there is a minor issue that diff_queue_struct.nr stores the
> number of filepairs, that is the number of changed files, while the
> number of elements added to Bloom filter is number of changed blobs and
> trees. For example if the following files are changed:
>
> sub/dir/file1
> sub/file2
>
> then diff_queued_diff.nr is 2, but number of elements to be added to
> Bloom filter is 4.
>
> sub/dir/file1
> sub/file2
> sub/dir/
> sub/
>
> I'm not sure if it matters in practice.
>
It does not matter much in practice, since the directories usually tend
to collapse across the changes. Still, I will add another limit after
creating the hashmap entries to cap at 640 so that we have a maximum of
100 changes in the bloom filter.
We plan to make these values configurable later.
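The difference between the number of filepairs and the number of filter
entries can be sketched as follows. This is only an illustration with a tiny
fixed-size set instead of the hashmap used in the patch; struct path_set and
the helper names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

#define SKETCH_MAX_PATHS 64

/* Hypothetical stand-in for the pathmap: a small set of distinct paths. */
struct path_set {
	char *paths[SKETCH_MAX_PATHS];
	size_t nr;
};

/* Add path[0..len) unless it is already present or the set is full. */
static void path_set_add(struct path_set *set, const char *path, size_t len)
{
	size_t i;
	char *copy;

	for (i = 0; i < set->nr; i++)
		if (strlen(set->paths[i]) == len &&
		    !memcmp(set->paths[i], path, len))
			return;
	if (set->nr == SKETCH_MAX_PATHS)
		return;
	copy = malloc(len + 1);
	if (!copy)
		return;
	memcpy(copy, path, len);
	copy[len] = '\0';
	set->paths[set->nr++] = copy;
}

/* Add the path and each leading directory, without the trailing '/'. */
static void path_set_add_with_dirs(struct path_set *set, const char *path)
{
	size_t len = strlen(path);

	while (len) {
		path_set_add(set, path, len);
		/* Strip the last path component ... */
		while (len && path[len - 1] != '/')
			len--;
		/* ... and the slash in front of it. */
		if (len)
			len--;
	}
}

/* Free all stored paths. */
static void path_set_clear(struct path_set *set)
{
	while (set->nr)
		free(set->paths[--set->nr]);
}
```

Feeding it "sub/dir/file1" and "sub/file2" yields four entries (the two
files, "sub/dir" and "sub"), matching the example above, while
diff_queued_diff.nr would be 2.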
>> + struct hashmap pathmap;
>> + struct pathmap_hash_entry* e;
>> + struct hashmap_iter iter;
>> + hashmap_init(&pathmap, NULL, NULL, 0);
>
> Stylistic issue: I have just noticed that here (and in some other
> places), but not in all cases, you declare pointer types with asterisk
> cuddled to type name, not to variable name, which contradicts
> CodingGuidelines
Thanks for noticing that! Fixed all of these in v3.
>> +
>> + for (i = 0; i < diff_queued_diff.nr; i++) {
>> + const char* path = diff_queued_diff.queue[i]->two->path;
>
> Is that correct that we consider only post-image name for storing
> changes in Bloom filter? Currently if file was renamed (or deleted), it
> is considered changed, and `git log -- <old-name>` lists commit that
> changed file name too.
>
The tests in t4216-log-bloom.sh ensure that the output of `git log -- <oldname>`
remains unchanged for renamed and deleted files, when using bloom filters.
I realize that I fat fingered over checking the old name, and didn't have an
explicit deleted file in the test. I have added them in v3, and the tests pass.
So the behavior is preserved and as expected when using Bloom filters.
Thanks for paying close attention!
>> + const char* p = path;
>
> It should be "const char *" for both.
>
>> +
>> + /*
>> + * Add each leading directory of the changed file, i.e. for
>> + * 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
>> + * the Bloom filter could be used to speed up commands like
>> + * 'git log dir/subdir', too.
>> + *
>> + * Note that directories are added without the trailing '/'.
>> + */
>> + do {
>> + char* last_slash = strrchr(p, '/');
>> +
>> + FLEX_ALLOC_STR(e, path, path);
>
> Here first 'path' is the field name, i.e. pathmap_hash_entry.path,
> second 'path' is the name of local variable, aliased also to 'p'.
>
>> + hashmap_entry_init(&e->entry, strhash(p));
>
> I don't know why both 'path' and 'p' are used, while both point to the
> same memory (and thus have the same contents). It is a bit confusing.
> See also my previous comment.
>
Cleaned up in v3. Thanks!
>> + filter->data = NULL;
>> + filter->len = 0;
>
> This needs to be explicitly stated both in the commit message and in the
> API documentation (in comments) that bloom_filter.len == 0 means "no
> data", while "no changes" is represented as bloom_filter with len == 1
> and *data == (uint64_t)0;
>
> EDIT: actually "no changes" is also represented as bloom_filter with len
> equal 0, as it turns out.
>
> One possible alternative could be representing "no data" value with
> Bloom filter of length 1 and all 64 bits set to 1, and "no changes"
> represented as filter of length 0. This is not unambiguous choice!
>
There is no gain in distinguishing between the absence of a filter and
a commit having no changes. The effect on `git log -- path` is the same in
both cases. We fall back to the normal diffing algorithm in revision.c.
I will make this clearer in the appropriate commit messages and in the
Documentation in v3.
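A rough sketch of what that means for a lookup, under the assumption that a
filter of length 0 is simply treated as "cannot rule anything out". The
struct, the function names and the bit order within a word are illustrative
choices, not the API or on-disk layout from bloom.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical filter: len counts 64-bit words, 0 means "nothing stored". */
struct sketch_filter {
	uint64_t *data;
	size_t len;
};

/* Set one bit per hash value; a zero-length filter stores nothing. */
static void sketch_add(struct sketch_filter *f,
		       const uint32_t *hashes, size_t nr)
{
	uint64_t mod = (uint64_t)f->len * 64;
	size_t i;

	if (!mod)
		return;
	for (i = 0; i < nr; i++) {
		uint64_t pos = hashes[i] % mod;
		f->data[pos / 64] |= (uint64_t)1 << (pos % 64);
	}
}

/*
 * Returns 0 for "definitely not in the filter" and 1 for "maybe".
 * A filter of length 0 always answers "maybe", so the caller falls
 * back to the regular tree diff.
 */
static int sketch_maybe_contains(const struct sketch_filter *f,
				 const uint32_t *hashes, size_t nr)
{
	uint64_t mod = (uint64_t)f->len * 64;
	size_t i;

	if (!mod)
		return 1;
	for (i = 0; i < nr; i++) {
		uint64_t pos = hashes[i] % mod;
		if (!(f->data[pos / 64] & ((uint64_t)1 << (pos % 64))))
			return 0;
	}
	return 1;
}
```

Either way the answer is only a hint: "maybe" always has to be confirmed by
the real diff, which is why the two length-0 cases need no separate encoding.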
>> +}
>> diff --git a/bloom.h b/bloom.h
>> new file mode 100644
>> index 0000000000..7f40c751f7
>> --- /dev/null
>> +++ b/bloom.h
>> @@ -0,0 +1,56 @@
>> +#ifndef BLOOM_H
>> +#define BLOOM_H
>
> Should we #include the stdint.h header for uint32_t and uint64_t types?
>
git-compat-util.h takes care of this.
>> +
>> +struct commit;
>> +struct repository;
>> +struct commit_graph;
>> +
>
> Perhaps we should add block comment for this struct, like there is one
> for struct bloom_filter below.
>
Done in v3.
>> +struct bloom_filter_settings {
>> + uint32_t hash_version;
>> + uint32_t num_hashes;
>> + uint32_t bits_per_entry;
>
> I guess that the type uint32_t was chosen to make it easier to store
> this information and later retrieve it from the commit-graph file, isn't
> it? Otherwise those types are much too large for sensible range of
> values (which would all fit in 8-bits byte).
>
Yes.
>> +
>> +/*
>> + * A bloom_key represents the k hash values for a
>> + * given hash input. These can be precomputed and
>> + * stored in a bloom_key for re-use when testing
>> + * against a bloom_filter.
>
> We might want to add that the number of hash values is given by Bloom
> filter settings, and it is assumed to be the same for all bloom_key
> variables / objects.
>
Incorporated in v3.
>> +. ./test-lib.sh
>> +
>> +test_expect_success 'get bloom filters for commit with no changes' '
>> + git init &&
>> + git commit --allow-empty -m "c0" &&
>> + cat >expect <<-\EOF &&
>> + Filter_Length:0
>> + Filter_Data:
>> + EOF
>> + test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
>> + test_cmp expect actual
>> +'
>
> A few things. First, I wonder why we need to provide object ID;
> couldn't 'test-tool bloom get_filter_for_commit' parse commit-ish
> argument, or would it make it too complicated for no reason?
>
Yes it was overkill for what I need in the test.
>> +
>> +test_expect_success 'get bloom filter for commit with 10 changes' '
>> + rm actual &&
>> + rm expect &&
>> + mkdir smallDir &&
>> + for i in $(test_seq 0 9)
>> + do
>> + echo $i >smallDir/$i
>> + done &&
>> + git add smallDir &&
>> + git commit -m "commit with 10 changes" &&
>> + cat >expect <<-\EOF &&
>> + Filter_Length:4
>> + Filter_Data:508928809087080a|8a7648210804001|4089824400951000|841ab310098051a8|
>> + EOF
>> + test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
>> + test_cmp expect actual
>> +'
>
> This test is in my opinion fragile, as it unnecessarily test the
> implementation details instead of the functionality provided. If we
> change the hashing scheme (for example going from double hashing to some
> variant of enhanced double hashing), or change the base hash function
> (for example from Murmur3_32 to xxHash_64), or change the number of hash
> functions (perhaps because changing of number of bits per element, and
> thus optimal number of hash functions from 7 to 6), or change from
> 64-bit word blocks to 32-bit word blocks, the test would have to be
> changed.
>
Regarding this and the rest of your comments on t0095-bloom.sh:
I am tweaking it as necessary but the entire point of these tests is to
break for the things you called out. They need to be intricately tied
to the current hashing strategy and are hence intended to be fragile so
as to catch any subtle or accidental changes in the hashing computation.
Any change like the ones you have called out would require a hash version
change and all the compatibility reactions that come with it.
I have added more tests around the murmur3_seeded method in v3. Removed
some of the redundant ones.
The other more evolved test cases you call out are covered in the e2e
integration tests in t4216-log-bloom.sh
>
> Reviewed-by: Jakub Narębski <jnareb@gmail.com>
>
> Thanks for working on this.
>
> Best,
>
Thank you once again for an excellent and in-depth review of this patch!
You have helped make this code so much better!
Cheers!
Garima Singh
^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v2 02/11] bloom: core Bloom filter implementation for changed paths
2020-02-22 0:32 ` Garima Singh
@ 2020-02-23 13:38 ` Jakub Narebski
2020-02-24 17:34 ` Garima Singh
0 siblings, 1 reply; 159+ messages in thread
From: Jakub Narebski @ 2020-02-23 13:38 UTC (permalink / raw)
To: Garima Singh
Cc: Garima Singh via GitGitGadget, git, Derrick Stolee,
SZEDER Gábor, Jonathan Tan, Jeff Hostetler, Taylor Blau,
Jeff King, Christian Couder, Emily Shaffer, Junio C Hamano,
Garima Singh
Garima Singh <garimasigit@gmail.com> writes:
> On 2/16/2020 11:49 AM, Jakub Narebski wrote:
>>> From: Garima Singh <garima.singh@microsoft.com>
>>>
>>> Add the core Bloom filter logic for computing the paths changed between a
>>> commit and its first parent. For details on what Bloom filters are and how they
>>> work, please refer to Dr. Derrick Stolee's blog post [1]. It provides a concise
>>> explaination of the adoption of Bloom filters as described in [2] and [3].
>> ^^- to add
>
> Not sure what this means. Can you please clarify.
>
>>> 1. We currently use 7 and 10 for the number of hashes and the size of each
>>> entry respectively. They served as great starting values, the mathematical
>>> details behind this choice are described in [1] and [4]. The implementation,
>> ^^- to add
>
> Not sure what this means. Can you please clarify.
I'm sorry for not being clear. What I wanted to say is that in both cases
the last line should have ended with either a full stop (in the first
case) or a comma (in the second case):
"as described in [2] and [3]."
"The implementation,"
What I wrote (trying to put the arrow below the final full stop or comma)
only works when one is using a fixed-width font.
>>> 3. The filters are sized according to the number of changes in the each commit,
>>> with minimum size of one 64 bit word.
[...]
>> The interesting corner case, which might be worth specifying explicitly,
>> is what happens in the case there are _no changes_ with respect to first
>> parent (which can happen with either commit created with `git commit
>> --allow-empty`, or merge created e.g. with `git merge --strategy=ours`).
>> Is this case represented as Bloom filter of length 0, or as a Bloom
>> filter of length of one 64-bit word which is minimal length composed of
>> all 0's (0x0000000000000000)?
>>
>
> See t0095-bloom.sh: The filter for a commit with no changes is of length 0.
> I will call it out specifically in the appropriate commit message as well.
I realized only later that both "no changes" and "no data" use a filter
of length 0, which works well because checking the diff when there are
no changes is cheap (both tree oids are the same).
>>> ---
>>> Makefile | 2 +
>>> bloom.c | 228 ++++++++++++++++++++++++++++++++++++++++++
>>> bloom.h | 56 +++++++++++
>>> t/helper/test-bloom.c | 84 ++++++++++++++++
>>> t/helper/test-tool.c | 1 +
>>> t/helper/test-tool.h | 1 +
>>> t/t0095-bloom.sh | 113 +++++++++++++++++++++
>>> 7 files changed, 485 insertions(+)
>>> create mode 100644 bloom.c
>>> create mode 100644 bloom.h
>>> create mode 100644 t/helper/test-bloom.c
>>> create mode 100755 t/t0095-bloom.sh
>>
>> As I wrote earlier, In my opinion this patch could be split into three
>> individual single-functionality pieces, to make it easier to review and
>> aid in bisectability if needed.
>
> Doing this in v3.
Thanks. Though if it makes (much) more work for you, I can work with
unsplit patch, no problem.
>>> +
>>> +static uint32_t rotate_right(uint32_t value, int32_t count)
>>> +{
>>> + uint32_t mask = 8 * sizeof(uint32_t) - 1;
>>> + count &= mask;
>>> + return ((value >> count) | (value << ((-count) & mask)));
>>> +}
>>
>> Hmmm... both the algoritm on Wikipedia, and reference implementation use
>> rotate *left*, not rotate *right* in the implementation of Murmur3 hash,
>> see
>>
>> https://en.wikipedia.org/wiki/MurmurHash#Algorithm
>> https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp#L23
>>
>>
>> inline uint32_t rotl32 ( uint32_t x, int8_t r )
>> {
>> return (x << r) | (x >> (32 - r));
>> }
>
> Thanks! Fixed this in v3. More on it later.
Sidenote: If I understand it correctly Bloom filters functionality is
included in Scalar [1]. What will happen then with all those Bloom
filter chunks in commit-graph files with wrong hash functions?
[1]: https://devblogs.microsoft.com/devops/introducing-scalar/
>>> +
>>> +/*
>>> + * Calculate a hash value for the given data using the given seed.
>>> + * Produces a uniformly distributed hash value.
>>> + * Not considered to be cryptographically secure.
>>> + * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
>>> + **/
>> ^^-- why two _trailing_ asterisks?
>
> Oops. Fixed.
Often two _leading_ asterisks are used to mark a comment as containing a
docstring in some specific format, like Doxygen. Two _trailing_
asterisks look like a typo.
>>> +static uint32_t seed_murmur3(uint32_t seed, const char *data, int len)
>>
>> In short, I think that the name of the function should be murmur3_32, or
>> murmurhash3_32, or possibly murmur3_32_seed, or something like that.
>
> Renamed it to murmur3_seeded in v3. The input and output types in the
> signature make it clear that it is 32-bit version.
All right, I can agree with that.
>>> +{
>>> + const uint32_t c1 = 0xcc9e2d51;
>>> + const uint32_t c2 = 0x1b873593;
>>> + const uint32_t r1 = 15;
>>> + const uint32_t r2 = 13;
>>> + const uint32_t m = 5;
>>> + const uint32_t n = 0xe6546b64;
>>> + int i;
>>> + uint32_t k1 = 0;
>>> + const char *tail;
>>> +
>>> + int len4 = len / sizeof(uint32_t);
>>> +
>>> + const uint32_t *blocks = (const uint32_t*)data;
>>> +
>>> + uint32_t k;
>>> + for (i = 0; i < len4; i++)
>>> + {
>>> + k = blocks[i];
>>
>> IMPORTANT: There is a comment around there in the example implementation
>> in C on Wikipedia that this operation above is a source of differing
>> results across endianness.
>
> Thanks! SZEDER found this on his CI pipeline and we have fixed it to
> process the data in 1 byte words to avoid hitting any endian-ness issues.
> See this part of the thread that carries the fix and the related discussion.
> https://lore.kernel.org/git/ba856e20-0a3c-e2d2-6744-b9abfacdc465@gmail.com/
> I will be squashing those changes in appropriately in v3.
[...]
>>> + k1 *= c2;
>>> + seed ^= k1;
>>> + break;
>>> + }
>>> +
>>> + seed ^= (uint32_t)len;
>>> + seed ^= (seed >> 16);
>>> + seed *= 0x85ebca6b;
>>> + seed ^= (seed >> 13);
>>> + seed *= 0xc2b2ae35;
>>> + seed ^= (seed >> 16);
>>> +
>>> + return seed;
>>> +}
>>
>> In https://public-inbox.org/git/ba856e20-0a3c-e2d2-6744-b9abfacdc465@gmail.com/
>> you posted "[PATCH] Process bloom filter data as 1 byte words".
>> This may avoid the Big-endian vs Little-endian confusion,
>> that is wrong results on Big-endian architectures, but
>> it also may slow down the algorithm.
>
> Oh cool! You have seen that patch. And yes, we understand that it might add
> a little overhead but at this point it is more important to be correct on all
> architectures instead of micro-optimizing and introducing different
> implementations for Little-endian and Big-endian. This would make this
> series overly complicated. Optimizing the hashing techniques would deserve a
> series of its own, which we can definitely revisit later.
Right, "first make it work, then make it right, and, finally, make it fast".
Anyway, could you maybe compare the performance of file history
operations with Bloom filters for the old version (operating on
32-bit / 4-byte words) and the new version (operating on 1-byte words),
to see whether it matters or not?
>> The public domain implementation in PMurHash.c in SMHasher
>> (re)implementation in Chromium (see URL above) fall backs to 1-byte
>> operations only if it doesn't know the endianness (or if it is neither
>> little-endian, nor big-endian, i.e. middle-endian or mixed-endian --
>> though I doubt that Git works correctly on mixed-endian anyway).
>>
>>
>> Sidenote: it looks like the current implementation if Murmur hash in
>> Chromium uses MurmurHash3_x86_32, i.e. little-endian unaligned-safe
>> implementation, but prepares data by swapping with StringToLE32
>> https://github.com/chromium/chromium/blob/master/components/variations/variations_murmur_hash.h
The solution in PMurHash.c in Chromium, and the pseudo-code algorithm on
Wikipedia, do endian handling only for the remaining bytes (while the
solution in Appleby's code [beginnings of], and in the current
above-mentioned Chromium implementation, do the conversion for all
bytes). I think that handling it only for the remaining bytes (for data
sizes not being a multiple of 32 bits / 4 bytes) is enough; all other
operations, that is multiply, rotate, xor and addition, do not depend on
endianness.
>> Assuming that the terminating NUL ("\0") character of a c-string is not
>> included in hash calculations, then murmur3_x86_32 hash has the
>> following results (all results are for seed equal 0):
>>
>> '' -> 0x00000000
>> ' ' -> 0x7ef49b98
>> 'Hello world!' -> 0x627b0c2c
>> 'The quick brown fox jumps over the lazy dog' -> 0x2e4ff723
>>
>> C source (from Wikipedia): https://godbolt.org/z/ofa2p8
>> C++ source (Appleby's): https://godbolt.org/z/BoSt6V
>>
>> The implementation provided in this patch, with rotate_right (instead of
>> rotate_left) gives, on little-endian machine, different results:
>>
>> '' -> 0x00000000
>> ' ' -> 0xd1f27e64
>> 'Hello world!' -> 0xa0791ad7
>> 'The quick brown fox jumps over the lazy dog' -> 0x99f1676c
>>
>> https://github.com/gitgitgadget/git/blob/e1b076a714d611e59d3d71c89221e41a3427fae4/bloom.c#L21
>> C source (via GitGitGadget): https://godbolt.org/z/R9s8Tt
>>
>
> Thanks! This is an excellent catch! Fixing the rotate_right to rotate_left,
> gives us the same answers as the two implementations you pointed out. I have
> added the appropriate unit tests in v3 and they match the values you obtained
> from the other implementations. Thanks a lot for the rigor!
>
> We based our implementation on the pseudo code and not on the sample code
> presented here: https://en.wikipedia.org/wiki/MurmurHash#Algorithm
> We just didn't parse the ROL instruction correctly.
All right, that's good.
Note that the pseudo code includes the following:
with any remainingBytesInKey do
remainingBytes ← SwapToLittleEndian(remainingBytesInKey)
// Note: Endian swapping is only necessary on big-endian machines.
// The purpose is to place the meaningful digits towards the low end of the value,
// so that these digits have the greatest potential to affect the low range digits
// in the subsequent multiplication. Consider that locating the meaningful digits
// in the high range would produce a greater effect upon the high digits of the
// multiplication, and notably, that such high digits are likely to be discarded
// by the modulo arithmetic under overflow. We don't want that.
[...]
>>> +{
>>> + int i;
>>> + const uint32_t seed0 = 0x293ae76f;
>>> + const uint32_t seed1 = 0x7e646e2c;
>>
>> Where did those seeds values came from?
>>
>
> Those values were chosen randomly. They will be fixed constants for the
> current hashing version. I will add a note calling this out in the
> appropriate commit messages and the Documentation in v3.
Nice to know.
I wonder if those seed values should be relatively prime, and whether
seed1 should be odd (from theoretical point of view).
>>> + const uint32_t hash0 = seed_murmur3(seed0, data, len);
>>> + const uint32_t hash1 = seed_murmur3(seed1, data, len);
>>> +
>>> + key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
>>> + for (i = 0; i < settings->num_hashes; i++)
>>> + key->hashes[i] = hash0 + i * hash1;
>>
>> Note that in [3] authors say that double hashing technique has some
>> problems. For one, we should ensure that hash1 is not zero, and even
>> better that it is odd (which makes it relatively prime to filter size
>> which is multiple of 64). It also suffers from something called
>> "approximate fingerprint collisions".
>>
>> That is why they define the "enhanced double hashing" technique, which does
>> not suffer from those problems (Algorithm 2, page 11/15).
>>
>> + for (i = 0; i < settings->num_hashes; i++) {
>> + key->hashes[i] = hash0;
>> +
>> + hash0 = hash0 + hash1;
>> + hash1 = hash1 + i;
>> + }
>>
>> This can also be written in closed form, based on equation (6)
>>
>> + for (i = 0; i < settings->num_hashes; i++)
>> + key->hashes[i] = hash0 + i * hash1 + i*(i*i - 1)/6;
>>
>>
>> In later paper [6] the closed form for "enhanced double hashing"
>> (p. 188) is slightly modified (or rather they use different variant of
>> this technique):
>>
>> + for (i = 0; i < settings->num_hashes; i++)
>> + key->hashes[i] = hash0 + i * hash1 + i*i;
>>
>> This is a variant of more generic "enhanced double hashing", section
>> 5.2 (Enhanced) Double Hashing Schemes (page 199):
>>
>> h_1(u) + i h_2(u) + f(i) mod m
>>
>> with f(i) = i^2 = i*i.
>>
>> They have tested enhanced double hashing with f(i) equal to both i*i
>> and i*i*i, as well as the triple hashing technique, and they have found
>> that these perform slightly better than the straight double hashing
>> technique (Fig. 1, page 212, section 3).
>>
>
> Thanks for the detailed research here! The hash becoming zero and the
> approximate fingerprint collision are both extremely rare situations. In both
> cases, we would just see `git log` having to diff more trees than if it didn't
> occur. While these techniques would be great optimizations to do, especially
> if this implementation gets pulled into more generic hashing applications
> in the code, we think that for the purposes of the current series - it is not
> worth it. I say this because Azure Repos has been using this exact hashing
> technique for several years now without any glitches. And we think it would
> be great to rely on this battle tested strategy in at least the first version
> of this feature.
All right, that is a good strategy.
I wonder if switching from double hashing to enhanced double hashing
(for example the variant with i*i added) would bring any noticeable
performance improvements in Git operations (due to fewer false
positives).
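To make the comparison concrete, here is a small sketch of the three
schemes side by side. The function names are made up for illustration.
The iterative variant uses an update-then-store ordering with i starting
at 1, which is an assumption on my part; it is the ordering that agrees
with the closed form i*(i*i - 1)/6. If I trace it correctly, the
store-first loop quoted above (i starting at 0) would instead add
i*(i - 1)*(i - 2)/6.

```c
#include <assert.h>
#include <stdint.h>

/* Plain double hashing, as in the patch: g_i = h0 + i * h1 (mod 2^32). */
void double_hashes(uint32_t h0, uint32_t h1, uint32_t *out, int k)
{
	int i;

	assert(k >= 0);
	for (i = 0; i < k; i++)
		out[i] = h0 + (uint32_t)i * h1;
}

/* Iterative enhanced double hashing: update first, then store. */
void enhanced_hashes_iter(uint32_t h0, uint32_t h1, uint32_t *out, int k)
{
	int i;

	assert(k >= 0);
	if (k > 0)
		out[0] = h0;
	for (i = 1; i < k; i++) {
		h0 += h1;
		h1 += (uint32_t)i;
		out[i] = h0;
	}
}

/*
 * Closed form: g_i = h0 + i * h1 + i * (i*i - 1) / 6 (mod 2^32).
 * The division is exact as long as i * (i*i - 1) fits in 32 bits,
 * which is comfortably true for a handful of hash functions.
 */
void enhanced_hashes_closed(uint32_t h0, uint32_t h1, uint32_t *out, int k)
{
	uint32_t i;

	assert(k >= 0);
	for (i = 0; i < (uint32_t)k; i++)
		out[i] = h0 + i * h1 + i * (i * i - 1) / 6;
}
```

The first two indices are identical in all three schemes; the extra
term only starts to separate the probe sequences from the third hash
onwards.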
>>> +
>>> +struct bloom_filter *get_bloom_filter(struct repository *r,
>>> + struct commit *c)
>>> +{
>>> + struct bloom_filter *filter;
>>> + struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
>>> + int i;
>>> + struct diff_options diffopt;
>>> +
>>> + if (!bloom_filters.slab_size)
>>> + return NULL;
>>
>> This is testing that commit slab for per-commit Bloom filters is
>> initialized, isn't it?
>>
>> First, should we write the condition as
>>
>> if (!bloom_filters.slab_size)
>>
>> or would the following be more readable
>>
>> if (bloom_filters.slab_size == 0)
>>
>
> Sure. Switched to `if (bloom_filters.slab_size == 0)` in v3.
Though either works, and the former looks more like a test of whether the
bloom_filters slab is initialized, now that I have thought about it a bit.
Your choice.
>> Second, should we return NULL, or should we just initialize the slab?
>> Or is non-existence of slab treated as a signal that the Bloom filters
>> mechanism is turned off?
>>
>
> Yes. We purposefully choose to return NULL and ignore the mechanism
> overall because we use Bloom filters best effort only.
All right.
>>> +
>>> + if (diff_queued_diff.nr <= 512) {
>>
>> Second, there is a minor issue that diff_queue_struct.nr stores the
>> number of filepairs, that is the number of changed files, while the
>> number of elements added to Bloom filter is number of changed blobs and
>> trees. For example if the following files are changed:
>>
>> sub/dir/file1
>> sub/file2
>>
>> then diff_queued_diff.nr is 2, but number of elements to be added to
>> Bloom filter is 4.
>>
>> sub/dir/file1
>> sub/file2
>> sub/dir/
>> sub/
>>
>> I'm not sure if it matters in practice.
>>
>
> It does not matter much in practice, since the directories usually tend
> to collapse across the changes. Still, I will add another limit after
> creating the hashmap entries to cap at 640 so that we have a maximum of
> 100 changes in the bloom filter.
>
> We plan to make these values configurable later.
I'm not sure if it is truly necessary; we can treat limit on number of
changed paths as "best effort" limit on Bloom filter size.
I just wanted to point out the difference.
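To illustrate the difference in counts, here is a rough sketch that
counts how many distinct elements a list of changed files expands to
once every leading directory is included. The helper names are
hypothetical, and the quadratic duplicate check is only there to keep
the example short; it is not how the series does it (that uses a
hashmap).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Did the first plen bytes of path already occur among paths[0..upto),
 * either as a whole path or as one of its leading directories?
 */
static int prefix_seen(const char **paths, size_t upto,
		       const char *path, size_t plen)
{
	size_t j;

	for (j = 0; j < upto; j++) {
		const char *q = paths[j];
		if (!strncmp(q, path, plen) &&
		    (q[plen] == '/' || q[plen] == '\0'))
			return 1;
	}
	return 0;
}

/* Count each changed path plus each distinct leading directory once. */
size_t count_filter_elements(const char **paths, size_t n)
{
	size_t count = 0, i;

	assert(paths || !n);
	for (i = 0; i < n; i++) {
		const char *path = paths[i];
		const char *p = path;

		/* Every leading directory of this path. */
		while ((p = strchr(p, '/')) != NULL) {
			if (!prefix_seen(paths, i, path, (size_t)(p - path)))
				count++;
			p++;
		}
		/* The path itself. */
		if (!prefix_seen(paths, i, path, strlen(path)))
			count++;
	}
	return count;
}
```

For the two files in the example above this gives 4, while
diff_queued_diff.nr would report 2.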
Side note: I wonder if it would be worth it (in the future) to change
the handling of commits with a large number of changes. I was thinking
about switching to a soft and a hard limit: the soft limit would be on the
size of the Bloom filter, that is if the number of elements times bits per
element is greater than the size threshold, we don't increase the size of
the filter.
This would mean that the false positives ratio (the number of files that
are not present but get answer "maybe" instead of "no" out of the
filter) would increase, so there would be a need for another hard limit
where we decide that it is not worth it, and not store the data for the
Bloom filter -- current "no data" case with empty filter with length 0.
This hard limit can be imposed on number of changed files, or on number
of paths added to filter, or on number of bits set to 1 in the filter
(on popcount), or some combination thereof.
[...]
>>> +
>>> + for (i = 0; i < diff_queued_diff.nr; i++) {
>>> + const char* path = diff_queued_diff.queue[i]->two->path;
>>
>> Is that correct that we consider only post-image name for storing
>> changes in Bloom filter? Currently if file was renamed (or deleted), it
>> is considered changed, and `git log -- <old-name>` lists commit that
>> changed file name too.
>
> The tests in t4216-log-bloom.sh ensure that the output of `git log -- <oldname>`
> remains unchanged for renamed and deleted files, when using bloom filters.
> I realize that I fat fingered over checking the old name, and didn't have an
> explicit deleted file in the test. I have added them in v3, and the tests pass.
> So the behavior is preserved and as expected when using Bloom filters.
> Thanks for paying close attention!
It seems like it shouldn't be working, as we are not adding the old name
to Bloom filter, but that only means that I misunderstood how
diff_tree_oid() works with default options. It turns out that without
explicitly turning on rename detection it shows rename as deletion of
old name and addition of new name -- so if tracking deletion works
correctly, then tracking renames should work correctly.
So it is in fact correct, which as you said was confirmed by (improved)
tests. I think also that if there was a bug in handling renames in this
code it would have been detected when running CI with
GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS.
[...]
>>> + filter->data = NULL;
>>> + filter->len = 0;
>>
>> This needs to be explicitly stated both in the commit message and in the
>> API documentation (in comments) that bloom_filter.len == 0 means "no
>> data", while "no changes" is represented as bloom_filter with len == 1
>> and *data == (uint64_t)0;
>>
>> EDIT: actually "no changes" is also represented as bloom_filter with len
>> equal 0, as it turns out.
>>
>> One possible alternative could be representing "no data" value with
>> Bloom filter of length 1 and all 64 bits set to 1, and "no changes"
>> represented as filter of length 0. This is not unambiguous choice!
>>
>
> There is no gain in distinguishing between the absence of a filter and
> a commit having no changes. The effect on `git log -- path` is the same in
> both cases. We fall back to the normal diffing algorithm in revision.c.
> I will make this clearer in the appropriate commit messages and in the
> Documentation in v3.
You are right, which I have realized only when reviewing subsequent
patches in the series.
In the absence of a filter, the "no data" case, we need to fall back to
examining the diff anyway.
In the case of commit having no changes, the "no changes" case,
computing the diff is cheap because Git can realize that both trees have
the same oid. So we do not lose performance this way, and we avoid
special-casing it (avoiding branching) when computing the Bloom filter,
if the "no change" case was represented by filter of length 1 and all
zero bits as data. Comparing tree oids and matching first hash function
in bloom_key against all zeros Bloom filter should be, I think, of
similar performance.
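A minimal sketch of the query side may make this clearer. The struct and
function names are hypothetical, and the byte-granular layout is an
assumption for brevity (the v2 code uses 64-bit words). A filter of
length 0 covers both "no data" and "no changes", and in both cases the
caller simply falls back to the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the filter; len is in bytes. */
struct sketch_filter {
	const unsigned char *data;
	size_t len; /* 0 means nothing usable is stored */
};

/*
 * Returns 0 for "definitely not changed" and 1 for "maybe changed",
 * in which case the caller has to compute the diff.
 */
int sketch_filter_maybe_contains(const struct sketch_filter *f,
				 const uint32_t *hashes, int num_hashes)
{
	size_t nbits;
	int i;

	assert(hashes || num_hashes == 0);

	/* No filter, or an empty one: we know nothing, so fall back. */
	if (!f || !f->len)
		return 1;

	nbits = f->len * 8;
	for (i = 0; i < num_hashes; i++) {
		size_t bit = hashes[i] % nbits;
		if (!(f->data[bit / 8] & (1u << (bit % 8))))
			return 0; /* one unset bit is enough to say "no" */
	}
	return 1;
}
```

Against an all-zero filter the loop already returns at the first hash,
which is the cheap path mentioned above.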
[...]
>>> +. ./test-lib.sh
>>> +
>>> +test_expect_success 'get bloom filters for commit with no changes' '
>>> + git init &&
>>> + git commit --allow-empty -m "c0" &&
>>> + cat >expect <<-\EOF &&
>>> + Filter_Length:0
>>> + Filter_Data:
>>> + EOF
>>> + test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
>>> + test_cmp expect actual
>>> +'
>>
>> A few things. First, I wonder why we need to provide object ID;
>> couldn't 'test-tool bloom get_filter_for_commit' parse commit-ish
>> argument, or would it make it too complicated for no reason?
>
> Yes it was overkill for what I need in the test.
All right, I agree with that.
>>> +
>>> +test_expect_success 'get bloom filter for commit with 10 changes' '
>>> + rm actual &&
>>> + rm expect &&
>>> + mkdir smallDir &&
>>> + for i in $(test_seq 0 9)
>>> + do
>>> + echo $i >smallDir/$i
>>> + done &&
>>> + git add smallDir &&
>>> + git commit -m "commit with 10 changes" &&
>>> + cat >expect <<-\EOF &&
>>> + Filter_Length:4
>>> + Filter_Data:508928809087080a|8a7648210804001|4089824400951000|841ab310098051a8|
>>> + EOF
>>> + test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
>>> + test_cmp expect actual
>>> +'
>>
>> This test is in my opinion fragile, as it unnecessarily tests the
>> implementation details instead of the functionality provided. If we
>> change the hashing scheme (for example going from double hashing to some
>> variant of enhanced double hashing), or change the base hash function
>> (for example from Murmur3_32 to xxHash_64), or change the number of hash
>> functions (perhaps because of a change in the number of bits per element,
>> and thus in the optimal number of hash functions from 7 to 6), or change
>> from 64-bit word blocks to 32-bit word blocks, the test would have to be
>> changed.
>
> Regarding this and the rest of your comments on t0095-bloom.sh:
>
> I am tweaking it as necessary but the entire point of these tests is to
> break for the things you called out. They need to be intricately tied
> to the current hashing strategy and are hence intended to be fragile so
> as to catch any subtle or accidental changes in the hashing computation.
> Any change like the ones you have called out would require a hash version
> change and all the compatibility reactions that come with it.
All right, if we assume that commit-graph is not something purely local^*,
and we need interoperability, then this test is necessary and is
necessarily fragile.
*. This may happen because the repository and the commit-graph file in
it are on a network disk, and accessed by hosts with different
endianness. Or in the future (or possibly now, if one is using
Scalar) the commit-graph file can be sent together with packfile
during the fetch operation.
On the other hand testing the functionality of Murmur hash, and of Bloom
filter would help finding possible troubles if we decide in the future
to change the algorithm details (change hash function, and/or move from
double hashing to enhanced double hashing, and/or change how commits
with large number of changes are handled, or even switching to xor
filters [1]).
[1]: Graf, Thomas Mueller; Lemire, Daniel (2019), "Xor Filters: Faster
and Smaller Than Bloom and Cuckoo Filters", https://arxiv.org/abs/1912.08258
> I have added more tests around the murmur3_seeded method in v3. Removed
> some of the redundant ones.
There is another test that might be worth adding (see the comment below
why), namely one test checking that bloom_key is computed as expected.
> The other more evolved test cases you call out are covered in the e2e
> integration tests in t4216-log-bloom.sh
All right, but there is another issue to consider. Good tests should
not only catch the breakage, but also help to detect where the bug is.
That is one of advantages that unit tests (like the ones I have
proposed) have over end-to-end functional tests. They are also often
faster.
On the other hand e2e tests can catch problems with integration, and
actually check that the user-visible behaviour is as expected.
Best,
--
Jakub Narębski
>>
>> Reviewed-by: Jakub Narębski <jnareb@gmail.com>
>>
>> Thanks for working on this.
>>
>> Best,
>
> Thank you once again for an excellent and in-depth review of this patch!
> You have helped make this code so much better!
* Re: [PATCH v2 02/11] bloom: core Bloom filter implementation for changed paths
2020-02-23 13:38 ` Jakub Narebski
@ 2020-02-24 17:34 ` Garima Singh
2020-02-24 18:20 ` Jakub Narebski
0 siblings, 1 reply; 159+ messages in thread
From: Garima Singh @ 2020-02-24 17:34 UTC (permalink / raw)
To: Jakub Narebski
Cc: Garima Singh via GitGitGadget, git, Derrick Stolee,
SZEDER Gábor, Jonathan Tan, Jeff Hostetler, Taylor Blau,
Jeff King, Christian Couder, Emily Shaffer, Junio C Hamano,
Garima Singh
On 2/23/2020 8:38 AM, Jakub Narebski wrote:
> Garima Singh <garimasigit@gmail.com> writes:
>> On 2/16/2020 11:49 AM, Jakub Narebski wrote:
>>>> From: Garima Singh <garima.singh@microsoft.com>
>>>>
>>>> Add the core Bloom filter logic for computing the paths changed between a
>>>> commit and its first parent. For details on what Bloom filters are and how they
>>>> work, please refer to Dr. Derrick Stolee's blog post [1]. It provides a concise
>>>> explanation of the adoption of Bloom filters as described in [2] and [3].
>>> ^^- to add
>>
>> Not sure what this means. Can you please clarify.
>>
>>>> 1. We currently use 7 and 10 for the number of hashes and the size of each
>>>> entry respectively. They served as great starting values, the mathematical
>>>> details behind this choice are described in [1] and [4]. The implementation,
>>> ^^- to add
>>
>> Not sure what this means. Can you please clarify.
>
> I'm sorry for not being clear. What I wanted to say is that in both
> cases the last line should have ended in either a full stop in the first
> case, or a comma in the second case:
>
> "as described in [2] and [3]."
>
> "The implementation,"
>
> What I wrote (trying to put the arrow below the final full stop or comma)
> only works when one is using a fixed-width font.
>
Aah. Cool. Thanks!
>>>> ---
>>>> Makefile | 2 +
>>>> bloom.c | 228 ++++++++++++++++++++++++++++++++++++++++++
>>>> bloom.h | 56 +++++++++++
>>>> t/helper/test-bloom.c | 84 ++++++++++++++++
>>>> t/helper/test-tool.c | 1 +
>>>> t/helper/test-tool.h | 1 +
>>>> t/t0095-bloom.sh | 113 +++++++++++++++++++++
>>>> 7 files changed, 485 insertions(+)
>>>> create mode 100644 bloom.c
>>>> create mode 100644 bloom.h
>>>> create mode 100644 t/helper/test-bloom.c
>>>> create mode 100755 t/t0095-bloom.sh
>>>
>>> As I wrote earlier, In my opinion this patch could be split into three
>>> individual single-functionality pieces, to make it easier to review and
>>> aid in bisectability if needed.
>>
>> Doing this in v3.
>
> Thanks. Though if it makes (much) more work for you, I can work with
> unsplit patch, no problem.
>
Thanks! That's great! Splitting the patches will add some overhead. I will
try and do it provided it does not delay getting v3 on the list.
>>>> +
>>>> +static uint32_t rotate_right(uint32_t value, int32_t count)
>>>> +{
>>>> + uint32_t mask = 8 * sizeof(uint32_t) - 1;
>>>> + count &= mask;
>>>> + return ((value >> count) | (value << ((-count) & mask)));
>>>> +}
>>>
>>> Hmmm... both the algorithm on Wikipedia and the reference implementation use
>>> rotate *left*, not rotate *right* in the implementation of Murmur3 hash,
>>> see
>>>
>>> https://en.wikipedia.org/wiki/MurmurHash#Algorithm
>>> https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp#L23
>>>
>>>
>>> inline uint32_t rotl32 ( uint32_t x, int8_t r )
>>> {
>>> return (x << r) | (x >> (32 - r));
>>> }
>>
>> Thanks! Fixed this in v3. More on it later.
>
> Sidenote: If I understand it correctly Bloom filters functionality is
> included in Scalar [1]. What will happen then with all those Bloom
> filter chunks in commit-graph files with wrong hash functions?
>
> [1]: https://devblogs.microsoft.com/devops/introducing-scalar/
>
It is not included in Scalar. Scalar will write to the commit-graph in
the background using the features available in the git version it is working
with. It will update to include changed path Bloom filters when they are
available in git. We are not taking the Bloom filter into microsoft/git
until the format is approved and accepted by the core git community.
>>>> +{
>>>> + const uint32_t c1 = 0xcc9e2d51;
>>>> + const uint32_t c2 = 0x1b873593;
>>>> + const uint32_t r1 = 15;
>>>> + const uint32_t r2 = 13;
>>>> + const uint32_t m = 5;
>>>> + const uint32_t n = 0xe6546b64;
>>>> + int i;
>>>> + uint32_t k1 = 0;
>>>> + const char *tail;
>>>> +
>>>> + int len4 = len / sizeof(uint32_t);
>>>> +
>>>> + const uint32_t *blocks = (const uint32_t*)data;
>>>> +
>>>> + uint32_t k;
>>>> + for (i = 0; i < len4; i++)
>>>> + {
>>>> + k = blocks[i];
>>>
>>> IMPORTANT: There is a comment around there in the example implementation
>>> in C on Wikipedia that this operation above is a source of differing
>>> results across endianness.
>>
>> Thanks! SZEDER found this on his CI pipeline and we have fixed it to
>> process the data in 1 byte words to avoid hitting any endian-ness issues.
>> See this part of the thread that carries the fix and the related discussion.
>> https://lore.kernel.org/git/ba856e20-0a3c-e2d2-6744-b9abfacdc465@gmail.com/
>> I will be squashing those changes in appropriately in v3.
>
> [...]
>>>> + k1 *= c2;
>>>> + seed ^= k1;
>>>> + break;
>>>> + }
>>>> +
>>>> + seed ^= (uint32_t)len;
>>>> + seed ^= (seed >> 16);
>>>> + seed *= 0x85ebca6b;
>>>> + seed ^= (seed >> 13);
>>>> + seed *= 0xc2b2ae35;
>>>> + seed ^= (seed >> 16);
>>>> +
>>>> + return seed;
>>>> +}
>>>
>>> In https://public-inbox.org/git/ba856e20-0a3c-e2d2-6744-b9abfacdc465@gmail.com/
>>> you posted "[PATCH] Process bloom filter data as 1 byte words".
>>> This may avoid the Big-endian vs Little-endian confusion,
>>> that is wrong results on Big-endian architectures, but
>>> it also may slow down the algorithm.
>>
>> Oh cool! You have seen that patch. And yes, we understand that it might add
>> a little overhead but at this point it is more important to be correct on all
>> architectures instead of micro-optimizing and introducing different
>> implementations for Little-endian and Big-endian. This would make this
>> series overly complicated. Optimizing the hashing techniques would deserve a
>> series of its own, which we can definitely revisit later.
>
> Right, "first make it work, then make it right, and, finally, make it fast.".
>
> Anyway, could you maybe compare performance of Git for old version
> (operating on 32-bit/4-bytes words) and new version (operating on 1-byte
> words) file history operation with Bloom filters, to see if it matters
> or not?
>
We chose to switch to 1 byte words for correctness, not performance.
Also, this specific implementation choice is a very small portion of the
end to end time spent computing and writing Bloom filters. We run two murmur3
hashes per path, which is one path per `git log` query; and one path per change
after parsing trees to compute a diff. Measuring performance and micro-optimizing
is not worth the effort and/or trading in the simplicity here.
>>>> +
>>>> +struct bloom_filter *get_bloom_filter(struct repository *r,
>>>> + struct commit *c)
>>>> +{
>>>> + struct bloom_filter *filter;
>>>> + struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
>>>> + int i;
>>>> + struct diff_options diffopt;
>>>> +
>>>> + if (!bloom_filters.slab_size)
>>>> + return NULL;
>>>
>>> This is testing that commit slab for per-commit Bloom filters is
>>> initialized, isn't it?
>>>
>>> First, should we write the condition as
>>>
>>> if (!bloom_filters.slab_size)
>>>
>>> or would the following be more readable
>>>
>>> if (bloom_filters.slab_size == 0)
>>>
>>
>> Sure. Switched to `if (bloom_filters.slab_size == 0)` in v3.
>
> Though either works, and the former looks more like a test of whether the
> bloom_filters slab is initialized, now that I have thought about it a bit.
> Your choice.
>
:)
>>>> +
>>>> + if (diff_queued_diff.nr <= 512) {
>>>
>>> Second, there is a minor issue that diff_queue_struct.nr stores the
>>> number of filepairs, that is the number of changed files, while the
>>> number of elements added to Bloom filter is number of changed blobs and
>>> trees. For example if the following files are changed:
>>>
>>> sub/dir/file1
>>> sub/file2
>>>
>>> then diff_queued_diff.nr is 2, but number of elements to be added to
>>> Bloom filter is 4.
>>>
>>> sub/dir/file1
>>> sub/file2
>>> sub/dir/
>>> sub/
>>>
>>> I'm not sure if it matters in practice.
>>>
>>
>> It does not matter much in practice, since the directories usually tend
>> to collapse across the changes. Still, I will add another limit after
>> creating the hashmap entries to cap at 640 so that we have a maximum of
>> 100 changes in the bloom filter.
>>
>> We plan to make these values configurable later.
>
> I'm not sure if it is truly necessary; we can treat limit on number of
> changed paths as "best effort" limit on Bloom filter size.
>
> I just wanted to point out the difference.
>
Sure. Not doing this for v3. Glad it got discussed here though!
>
> Side note: I wonder if it would be worth it (in the future) to change
> the handling of commits with a large number of changes. I was thinking
> about switching to a soft and a hard limit: the soft limit would be on the
> size of the Bloom filter, that is if the number of elements times bits per
> element is greater than the size threshold, we don't increase the size of
> the filter.
>
> This would mean that the false positives ratio (the number of files that
> are not present but get answer "maybe" instead of "no" out of the
> filter) would increase, so there would be a need for another hard limit
> where we decide that it is not worth it, and not store the data for the
> Bloom filter -- current "no data" case with empty filter with length 0.
> This hard limit can be imposed on number of changed files, or on number
> of paths added to filter, or on number of bits set to 1 in the filter
> (on popcount), or some combination thereof.
>
> [...]
Could be considered in the future. Doesn't make the cut for the current
series though.
Thanks
Garima Singh
* Re: [PATCH v2 02/11] bloom: core Bloom filter implementation for changed paths
2020-02-24 17:34 ` Garima Singh
@ 2020-02-24 18:20 ` Jakub Narebski
0 siblings, 0 replies; 159+ messages in thread
From: Jakub Narebski @ 2020-02-24 18:20 UTC (permalink / raw)
To: Garima Singh
Cc: Garima Singh via GitGitGadget, git, Derrick Stolee,
SZEDER Gábor, Jonathan Tan, Jeff Hostetler, Taylor Blau,
Jeff King, Christian Couder, Emily Shaffer, Junio C Hamano,
Garima Singh
Garima Singh <garimasigit@gmail.com> writes:
> On 2/23/2020 8:38 AM, Jakub Narebski wrote:
>> Garima Singh <garimasigit@gmail.com> writes:
>>> On 2/16/2020 11:49 AM, Jakub Narebski wrote:
>>>>> From: Garima Singh <garima.singh@microsoft.com>
[...]
>>>> IMPORTANT: There is a comment around there in the example implementation
>>>> in C on Wikipedia that this operation above is a source of differing
>>>> results across endianness.
>>>
>>> Thanks! SZEDER found this on his CI pipeline and we have fixed it to
>>> process the data in 1 byte words to avoid hitting any endian-ness issues.
>>> See this part of the thread that carries the fix and the related discussion.
>>> https://lore.kernel.org/git/ba856e20-0a3c-e2d2-6744-b9abfacdc465@gmail.com/
>>> I will be squashing those changes in appropriately in v3.
>>
>> [...]
>>>>
>>>> In https://public-inbox.org/git/ba856e20-0a3c-e2d2-6744-b9abfacdc465@gmail.com/
>>>> you posted "[PATCH] Process bloom filter data as 1 byte words".
>>>> This may avoid the Big-endian vs Little-endian confusion,
>>>> that is wrong results on Big-endian architectures, but
>>>> it also may slow down the algorithm.
>>>
>>> Oh cool! You have seen that patch. And yes, we understand that it might add
>>> a little overhead but at this point it is more important to be correct on all
>>> architectures instead of micro-optimizing and introducing different
>>> implementations for Little-endian and Big-endian. This would make this
>>> series overly complicated. Optimizing the hashing techniques would deserve a
>>> series of its own, which we can definitely revisit later.
>>
>> Right, "first make it work, then make it right, and, finally, make it fast.".
>>
>> Anyway, could you maybe compare performance of Git for old version
>> (operating on 32-bit/4-bytes words) and new version (operating on 1-byte
>> words) file history operation with Bloom filters, to see if it matters
>> or not?
>>
>
> We chose to switch to 1 byte words for correctness, not performance.
> Also, this specific implementation choice is a very small portion of the
> end to end time spent computing and writing Bloom filters. We run two murmur3
> hashes per path, which is one path per `git log` query; and one path per change
> after parsing trees to compute a diff. Measuring performance and micro-optimizing
> is not worth the effort and/or trading in the simplicity here.
All right.
I still think that adding to_le32() invocation before the part that
processes remaining bytes (the 'switch' instruction in v2 code), just
like in pseudo-code on Wikipedia:
with any remainingBytesInKey do
remainingBytes ← SwapToLittleEndian(remainingBytesInKey)
would be enough to have correct results regardless of endianness.
As I wrote
JN> The solution in PMurHash.c in Chromium [1], and the pseudo-code algorithm on
JN> Wikipedia do endian handling only for remaining bytes (while the
JN> beginnings of solution in Appleby's code, and solution in current
JN> above-mentioned Chromium implementation do the conversion for all
JN> bytes). I think that handling it only for remaining bytes (for data
JN> sizes not being a multiple of 32 bits / 4 bytes) is enough; all other
JN> operations, that is multiply, rotate, xor and addition do not depend on
JN> endianness.
[1]: https://chromium.googlesource.com/external/smhasher/+/5b8fd3c31a58b87b80605dca7a64fad6cb3f8a0f/PMurHash.c
If you have access to, or can run code on some big-endian architecture,
it should be easy enough to check it.
Anyway, if you decide on the 1-byte-at-a-time implementation, please put a
comment about the 32-bit chunk implementation.
>> Side note: I wonder if it would be worth it (in the future) to change
>> the handling of commits with a large number of changes. I was thinking
>> about switching to a soft and a hard limit: the soft limit would be on the
>> size of the Bloom filter, that is if the number of elements times bits per
>> element is greater than the size threshold, we don't increase the size of
>> the filter.
>>
>> This would mean that the false positives ratio (the number of files that
>> are not present but get answer "maybe" instead of "no" out of the
>> filter) would increase, so there would be a need for another hard limit
>> where we decide that it is not worth it, and not store the data for the
>> Bloom filter -- current "no data" case with empty filter with length 0.
>> This hard limit can be imposed on number of changed files, or on number
>> of paths added to filter, or on number of bits set to 1 in the filter
>> (on popcount), or some combination thereof.
>>
>> [...]
>
> Could be considered in the future. Doesn't make the cut for the current
> series though.
Right.
Best,
--
Jakub Narębski
* [PATCH v2 03/11] diff: halt tree-diff early after max_changes
2020-02-05 22:56 ` [PATCH v2 00/11] " Garima Singh via GitGitGadget
2020-02-05 22:56 ` [PATCH v2 01/11] commit-graph: use MAX_NUM_CHUNKS Garima Singh via GitGitGadget
2020-02-05 22:56 ` [PATCH v2 02/11] bloom: core Bloom filter implementation for changed paths Garima Singh via GitGitGadget
@ 2020-02-05 22:56 ` Derrick Stolee via GitGitGadget
2020-02-17 0:00 ` Jakub Narebski
2020-02-05 22:56 ` [PATCH v2 04/11] commit-graph: compute Bloom filters for changed paths Garima Singh via GitGitGadget
` (10 subsequent siblings)
13 siblings, 1 reply; 159+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-02-05 22:56 UTC (permalink / raw)
To: git
Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
garimasigit, jnareb, christian.couder, emilyshaffer, gitster,
Garima Singh, Derrick Stolee
From: Derrick Stolee <dstolee@microsoft.com>
When computing the changed-paths bloom filters for the commit-graph,
we limit the size of the filter by restricting the number of paths
in the diff. Instead of computing a large diff and then ignoring the
result, it is better to halt the diff computation early.
Create a new "max_changes" option in struct diff_options. If non-zero,
then halt the diff computation after discovering strictly more changed
paths. This includes paths corresponding to trees that change.
Use this max_changes option in the bloom filter calculations. This
reduces the time taken to compute the filters for the Linux kernel
repo from 2m50s to 2m35s. On a large internal repository with ~500
commits that perform tree-wide changes, the time reduced from
6m15s to 3m48s.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
bloom.c | 4 +++-
diff.h | 5 +++++
tree-diff.c | 6 ++++++
3 files changed, 14 insertions(+), 1 deletion(-)
diff --git a/bloom.c b/bloom.c
index 6082193a75..818382c03b 100644
--- a/bloom.c
+++ b/bloom.c
@@ -134,6 +134,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
int i;
struct diff_options diffopt;
+ int max_changes = 512;
if (!bloom_filters.slab_size)
return NULL;
@@ -142,6 +143,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
repo_diff_setup(r, &diffopt);
diffopt.flags.recursive = 1;
+ diffopt.max_changes = max_changes;
diff_setup_done(&diffopt);
if (c->parents)
@@ -150,7 +152,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
diffcore_std(&diffopt);
- if (diff_queued_diff.nr <= 512) {
+ if (diff_queued_diff.nr <= max_changes) {
struct hashmap pathmap;
struct pathmap_hash_entry* e;
struct hashmap_iter iter;
diff --git a/diff.h b/diff.h
index 6febe7e365..9443dc1b00 100644
--- a/diff.h
+++ b/diff.h
@@ -285,6 +285,11 @@ struct diff_options {
/* Number of hexdigits to abbreviate raw format output to. */
int abbrev;
+ /* If non-zero, then stop computing after this many changes. */
+ int max_changes;
+ /* For internal use only. */
+ int num_changes;
+
int ita_invisible_in_index;
/* white-space error highlighting */
#define WSEH_NEW (1<<12)
diff --git a/tree-diff.c b/tree-diff.c
index 33ded7f8b3..f3d303c6e5 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -434,6 +434,9 @@ static struct combine_diff_path *ll_diff_tree_paths(
if (diff_can_quit_early(opt))
break;
+ if (opt->max_changes && opt->num_changes > opt->max_changes)
+ break;
+
if (opt->pathspec.nr) {
skip_uninteresting(&t, base, opt);
for (i = 0; i < nparent; i++)
@@ -518,6 +521,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
/* t↓ */
update_tree_entry(&t);
+ opt->num_changes++;
}
/* t > p[imin] */
@@ -535,6 +539,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
skip_emit_tp:
/* ∀ pi=p[imin] pi↓ */
update_tp_entries(tp, nparent);
+ opt->num_changes++;
}
}
@@ -552,6 +557,7 @@ struct combine_diff_path *diff_tree_paths(
const struct object_id **parents_oid, int nparent,
struct strbuf *base, struct diff_options *opt)
{
+ opt->num_changes = 0;
p = ll_diff_tree_paths(p, oid, parents_oid, nparent, base, opt);
/*
--
gitgitgadget
^ permalink raw reply related [flat|nested] 159+ messages in thread
* Re: [PATCH v2 03/11] diff: halt tree-diff early after max_changes
2020-02-05 22:56 ` [PATCH v2 03/11] diff: halt tree-diff early after max_changes Derrick Stolee via GitGitGadget
@ 2020-02-17 0:00 ` Jakub Narebski
2020-02-22 0:37 ` Garima Singh
0 siblings, 1 reply; 159+ messages in thread
From: Jakub Narebski @ 2020-02-17 0:00 UTC (permalink / raw)
To: Derrick Stolee via GitGitGadget
Cc: git, Derrick Stolee, Derrick Stolee, SZEDER Gábor,
Jonathan Tan, Jeff Hostetler, Taylor Blau, Jeff King,
Garima Singh, Christian Couder, Emily Shaffer, Junio C Hamano,
Garima Singh
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> When computing the changed-paths bloom filters for the commit-graph,
> we limit the size of the filter by restricting the number of paths
> in the diff. Instead of computing a large diff and then ignoring the
> result, it is better to halt the diff computation early.
Good idea.
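For illustration, here is a minimal sketch of the early-halt pattern, under
stated assumptions: demo_scan() and its `changed` array are hypothetical
stand-ins and are not part of tree-diff.c. Only the shape of the check
(a zero limit means "no limit", and the scan stops once the count is
strictly greater than the limit) follows the patch.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for the tree walk: examine `nr` entries, count
 * the ones flagged as changed, and stop as soon as the count is strictly
 * greater than max_changes. max_changes == 0 means "no limit".
 * Returns the number of entries that were examined.
 */
static size_t demo_scan(const int *changed, size_t nr,
			int max_changes, int *num_changes)
{
	size_t i;

	*num_changes = 0;
	for (i = 0; i < nr; i++) {
		/* Same shape as the check added to ll_diff_tree_paths(). */
		if (max_changes && *num_changes > max_changes)
			break;
		if (changed[i])
			(*num_changes)++;
	}
	return i;
}
```

With six changed entries and a limit of 2, this sketch stops after counting
three changes, which is enough for a caller to conclude "more than the
limit" without looking at the remaining entries.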
>
> Create a new "max_changes" option in struct diff_options. If non-zero,
> then halt the diff computation after discovering strictly more changed
> paths. This includes paths corresponding to trees that change.
All right; also, it doesn't need to be exact, though it would be good if
it was.
512 changed paths (changed files) usually generate more than 512
elements to be added to the Bloom filter (changed directories and
files), anyway.
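A rough sketch of that counting, assuming that every changed file
contributes its own path plus each of its leading directories, and that
paths are normalized (no leading, trailing, or doubled slashes).
count_distinct_entries() is a hypothetical helper written for this
example; it is not the code in bloom.c.

```c
#include <stddef.h>
#include <string.h>

/*
 * Count the distinct entries produced by a list of changed paths when
 * every leading directory is recorded as well. For example "a/b/c.txt"
 * yields "a", "a/b" and "a/b/c.txt". Quadratic on purpose, to keep the
 * sketch short.
 */
static size_t count_distinct_entries(const char **paths, size_t nr)
{
	size_t total = 0, i, j, len;

	for (i = 0; i < nr; i++) {
		const char *path = paths[i];

		for (len = 1; len <= strlen(path); len++) {
			int seen = 0;

			/* Only prefixes ending at a component boundary count. */
			if (path[len] != '/' && path[len] != '\0')
				continue;

			/* Did an earlier path already contribute this prefix? */
			for (j = 0; j < i && !seen; j++)
				if (!strncmp(paths[j], path, len) &&
				    (paths[j][len] == '/' ||
				     paths[j][len] == '\0'))
					seen = 1;
			if (!seen)
				total++;
		}
	}
	return total;
}
```

For the four files "Makefile", "src/a.c", "src/b.c" and "src/lib/x.c" this
gives six entries, because "src" and "src/lib" are counted once each in
addition to the files themselves.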
>
> Use this max_changes option in the bloom filter calculations. This
> reduces the time taken to compute the filters for the Linux kernel
> repo from 2m50s to 2m35s. On a large internal repository with ~500
> commits that perform tree-wide changes, the time reduced from
> 6m15s to 3m48s.
I wonder if there is some large open-source project with many commits
performing tree-wide changes, that is with many commits with more than
512 changed files with respect to the first parent.
Maybe https://github.com/whosonfirst-data/whosonfirst-data-venue-us-ny
from "Top Ten Worst Repositories to host on GitHub - Git Merge 2017"
could be a good repository to test ;-)
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
Looks good to me, but that is from a cursory examination.
I don't know the area well enough to say anything more.
> ---
> bloom.c | 4 +++-
> diff.h | 5 +++++
> tree-diff.c | 6 ++++++
> 3 files changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/bloom.c b/bloom.c
> index 6082193a75..818382c03b 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -134,6 +134,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
> struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
> int i;
> struct diff_options diffopt;
> + int max_changes = 512;
>
> if (!bloom_filters.slab_size)
> return NULL;
> @@ -142,6 +143,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>
> repo_diff_setup(r, &diffopt);
> diffopt.flags.recursive = 1;
> + diffopt.max_changes = max_changes;
> diff_setup_done(&diffopt);
>
> if (c->parents)
> @@ -150,7 +152,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
> diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
> diffcore_std(&diffopt);
>
> - if (diff_queued_diff.nr <= 512) {
> + if (diff_queued_diff.nr <= max_changes) {
> struct hashmap pathmap;
> struct pathmap_hash_entry* e;
> struct hashmap_iter iter;
> diff --git a/diff.h b/diff.h
> index 6febe7e365..9443dc1b00 100644
> --- a/diff.h
> +++ b/diff.h
> @@ -285,6 +285,11 @@ struct diff_options {
> /* Number of hexdigits to abbreviate raw format output to. */
> int abbrev;
>
> + /* If non-zero, then stop computing after this many changes. */
> + int max_changes;
> + /* For internal use only. */
> + int num_changes;
> +
> int ita_invisible_in_index;
> /* white-space error highlighting */
> #define WSEH_NEW (1<<12)
> diff --git a/tree-diff.c b/tree-diff.c
> index 33ded7f8b3..f3d303c6e5 100644
> --- a/tree-diff.c
> +++ b/tree-diff.c
> @@ -434,6 +434,9 @@ static struct combine_diff_path *ll_diff_tree_paths(
> if (diff_can_quit_early(opt))
> break;
>
> + if (opt->max_changes && opt->num_changes > opt->max_changes)
> + break;
> +
> if (opt->pathspec.nr) {
> skip_uninteresting(&t, base, opt);
> for (i = 0; i < nparent; i++)
> @@ -518,6 +521,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
>
> /* t↓ */
> update_tree_entry(&t);
> + opt->num_changes++;
> }
>
> /* t > p[imin] */
> @@ -535,6 +539,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
> skip_emit_tp:
> /* ∀ pi=p[imin] pi↓ */
> update_tp_entries(tp, nparent);
> + opt->num_changes++;
> }
> }
>
> @@ -552,6 +557,7 @@ struct combine_diff_path *diff_tree_paths(
> const struct object_id **parents_oid, int nparent,
> struct strbuf *base, struct diff_options *opt)
> {
> + opt->num_changes = 0;
> p = ll_diff_tree_paths(p, oid, parents_oid, nparent, base, opt);
>
> /*
^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v2 03/11] diff: halt tree-diff early after max_changes
2020-02-17 0:00 ` Jakub Narebski
@ 2020-02-22 0:37 ` Garima Singh
0 siblings, 0 replies; 159+ messages in thread
From: Garima Singh @ 2020-02-22 0:37 UTC (permalink / raw)
To: Jakub Narebski, Derrick Stolee via GitGitGadget
Cc: git, Derrick Stolee, Derrick Stolee, SZEDER Gábor,
Jonathan Tan, Jeff Hostetler, Taylor Blau, Jeff King,
Christian Couder, Emily Shaffer, Junio C Hamano, Garima Singh
On 2/16/2020 7:00 PM, Jakub Narebski wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> Use this max_changes option in the bloom filter calculations. This
>> reduces the time taken to compute the filters for the Linux kernel
>> repo from 2m50s to 2m35s. On a large internal repository with ~500
>> commits that perform tree-wide changes, the time reduced from
>> 6m15s to 3m48s.
>
> I wonder if there is some large open-source project with many commits
> performing tree-wide changes, that is with many commits with more than
> 512 changed files with respect to the first parent.
>
> Maybe https://github.com/whosonfirst-data/whosonfirst-data-venue-us-ny
> from "Top Ten Worst Repositories to host on GitHub - Git Merge 2017"
> could be a good repository to test ;-)
>
Thanks for the suggestion! I will see if any of these repos gives us a
good test bed and add the perf improvement numbers in the appropriate
commit messages in v3.
Cheers!
Garima Singh
^ permalink raw reply [flat|nested] 159+ messages in thread
* [PATCH v2 04/11] commit-graph: compute Bloom filters for changed paths
2020-02-05 22:56 ` [PATCH v2 00/11] " Garima Singh via GitGitGadget
` (2 preceding siblings ...)
2020-02-05 22:56 ` [PATCH v2 03/11] diff: halt tree-diff early after max_changes Derrick Stolee via GitGitGadget
@ 2020-02-05 22:56 ` Garima Singh via GitGitGadget
2020-02-17 21:56 ` Jakub Narebski
2020-02-05 22:56 ` [PATCH v2 05/11] commit-graph: examine changed-path objects in pack order Jeff King via GitGitGadget
` (9 subsequent siblings)
13 siblings, 1 reply; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-02-05 22:56 UTC (permalink / raw)
To: git
Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
garimasigit, jnareb, christian.couder, emilyshaffer, gitster,
Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
Compute Bloom filters for the paths that changed between a commit and its
first parent using the implementation in bloom.c, when the
COMMIT_GRAPH_WRITE_CHANGED_PATHS flag is set. This computation is done on a
commit-by-commit basis. We will write these Bloom filters to the commit graph
file in the next change.
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
commit-graph.c | 32 +++++++++++++++++++++++++++++++-
commit-graph.h | 3 ++-
2 files changed, 33 insertions(+), 2 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index 3c4d411326..724bfcffc4 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -16,6 +16,7 @@
#include "hashmap.h"
#include "replace-object.h"
#include "progress.h"
+#include "bloom.h"
#define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
@@ -795,9 +796,11 @@ struct write_commit_graph_context {
unsigned append:1,
report_progress:1,
split:1,
- check_oids:1;
+ check_oids:1,
+ changed_paths:1;
const struct split_commit_graph_opts *split_opts;
+ uint32_t total_bloom_filter_data_size;
};
static void write_graph_chunk_fanout(struct hashfile *f,
@@ -1140,6 +1143,28 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
stop_progress(&ctx->progress);
}
+static void compute_bloom_filters(struct write_commit_graph_context *ctx)
+{
+ int i;
+ struct progress *progress = NULL;
+
+ load_bloom_filters();
+
+ if (ctx->report_progress)
+ progress = start_progress(
+ _("Computing commit diff Bloom filters"),
+ ctx->commits.nr);
+
+ for (i = 0; i < ctx->commits.nr; i++) {
+ struct commit *c = ctx->commits.list[i];
+ struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
+ ctx->total_bloom_filter_data_size += sizeof(uint64_t) * filter->len;
+ display_progress(progress, i + 1);
+ }
+
+ stop_progress(&progress);
+}
+
static int add_ref_to_list(const char *refname,
const struct object_id *oid,
int flags, void *cb_data)
@@ -1794,6 +1819,8 @@ int write_commit_graph(const char *obj_dir,
ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
ctx->check_oids = flags & COMMIT_GRAPH_WRITE_CHECK_OIDS ? 1 : 0;
ctx->split_opts = split_opts;
+ ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
+ ctx->total_bloom_filter_data_size = 0;
if (ctx->split) {
struct commit_graph *g;
@@ -1888,6 +1915,9 @@ int write_commit_graph(const char *obj_dir,
compute_generation_numbers(ctx);
+ if (ctx->changed_paths)
+ compute_bloom_filters(ctx);
+
res = write_commit_graph_file(ctx);
if (ctx->split)
diff --git a/commit-graph.h b/commit-graph.h
index 7f5c933fa2..952a4b83be 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -76,7 +76,8 @@ enum commit_graph_write_flags {
COMMIT_GRAPH_WRITE_PROGRESS = (1 << 1),
COMMIT_GRAPH_WRITE_SPLIT = (1 << 2),
/* Make sure that each OID in the input is a valid commit OID. */
- COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3)
+ COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3),
+ COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4)
};
struct split_commit_graph_opts {
--
gitgitgadget
^ permalink raw reply related [flat|nested] 159+ messages in thread
* Re: [PATCH v2 04/11] commit-graph: compute Bloom filters for changed paths
2020-02-05 22:56 ` [PATCH v2 04/11] commit-graph: compute Bloom filters for changed paths Garima Singh via GitGitGadget
@ 2020-02-17 21:56 ` Jakub Narebski
2020-02-22 0:55 ` Garima Singh
0 siblings, 1 reply; 159+ messages in thread
From: Jakub Narebski @ 2020-02-17 21:56 UTC (permalink / raw)
To: Garima Singh via GitGitGadget
Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
Jeff Hostetler, Taylor Blau, Jeff King, Garima Singh,
Christian Couder, Emily Shaffer, Junio C Hamano, Garima Singh
"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Garima Singh <garima.singh@microsoft.com>
> Subject: [PATCH v2 04/11] commit-graph: compute Bloom filters for changed paths
>
> Compute Bloom filters for the paths that changed between a commit and its
> first parent using the implementation in bloom.c, when the
> COMMIT_GRAPH_WRITE_CHANGED_PATHS flag is set. This computation is done on a
> commit-by-commit basis. We will write these Bloom filters to the commit graph
> file in the next change.
I have no major complaints about the contents of this patch (except lack
of test, and type of total_bloom_filter_data_size), but the commit
message could have been worded better.
I would write something like this instead:
Add new COMMIT_GRAPH_WRITE_CHANGED_PATHS flag that makes Git compute
Bloom filters that store the information about changed paths (that
changed between a commit and its first parent) for each commit in the
commit-graph. This computation is done on a commit-by-commit basis.
We will write these Bloom filters to the commit-graph file, to store
this data on disk, in the next change in this series.
In my opinion the fact that we compute Bloom filters for each and every
commit in the commit-graph file is more important than the quite obvious
fact that we use the implementation from bloom.c.
>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
> commit-graph.c | 32 +++++++++++++++++++++++++++++++-
> commit-graph.h | 3 ++-
> 2 files changed, 33 insertions(+), 2 deletions(-)
It would be good to have at least sanity check of this feature, perhaps
one that would check that the number of per-commit Bloom filters on slab
matches the number of commits in the commit-graph.
It could look something like this:
test_expect_success 'create Bloom filters for all commit-graph commits' '
# create commit-graph with 2 commits
git rev-parse HEAD HEAD^ | git commit-graph write --stdin-commits &&
# generate Bloom filters for commit-graph commits
cat >commands <<-\EOF &&
add-graph-commits
filters-count
EOF
NUM_FILTERS=$(git test-tool bloom <commands) &&
test "$NUM_FILTERS" -eq 2
'
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 3c4d411326..724bfcffc4 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -16,6 +16,7 @@
> #include "hashmap.h"
> #include "replace-object.h"
> #include "progress.h"
> +#include "bloom.h"
>
> #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
> #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
> @@ -795,9 +796,11 @@ struct write_commit_graph_context {
> unsigned append:1,
> report_progress:1,
> split:1,
> - check_oids:1;
> + check_oids:1,
> + changed_paths:1;
All right, this flag will be used for handling the future `--changed-paths`
option of `git commit-graph write`.
>
> const struct split_commit_graph_opts *split_opts;
> + uint32_t total_bloom_filter_data_size;
This is total size of Bloom filters data, in bytes, that will later be
used for BDAT chunk size. However the commit-graph format uses 8 bytes
for byte-offset, not 4 bytes. Why is it uint32_t and not uint64_t then?
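To make the concern concrete, here is a small sketch of how a 32-bit running
total wraps once the summed filter data reaches 4 GiB. The two helper
functions and the `filter_len` array (lengths in 64-bit words) are
hypothetical; only the `sizeof(uint64_t) * len` accumulation mirrors the
patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Accumulate filter sizes in bytes into a 32-bit total (wraps mod 2^32). */
static uint32_t total_size_u32(const uint64_t *filter_len, size_t nr)
{
	uint32_t total = 0;
	size_t i;

	for (i = 0; i < nr; i++)
		total += (uint32_t)(sizeof(uint64_t) * filter_len[i]);
	return total;
}

/* The same accumulation with a 64-bit total. */
static uint64_t total_size_u64(const uint64_t *filter_len, size_t nr)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < nr; i++)
		total += sizeof(uint64_t) * filter_len[i];
	return total;
}
```

Two filters of 2^28 words each add up to exactly 2^32 bytes: the 32-bit
total silently becomes 0, while the 64-bit total keeps the real size.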
> };
>
> static void write_graph_chunk_fanout(struct hashfile *f,
> @@ -1140,6 +1143,28 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> stop_progress(&ctx->progress);
> }
>
> +static void compute_bloom_filters(struct write_commit_graph_context *ctx)
> +{
> + int i;
> + struct progress *progress = NULL;
> +
> + load_bloom_filters();
> +
> + if (ctx->report_progress)
> + progress = start_progress(
> + _("Computing commit diff Bloom filters"),
> + ctx->commits.nr);
> +
Shouldn't we initialize ctx->total_bloom_filter_data_size to 0 here? We
cannot use compute_bloom_filters() to _update_ Bloom filters data, I
think -- we don't distinguish here between new and existing data (where
existing data size is already included in total Bloom filters size). At
least I don't think so.
> + for (i = 0; i < ctx->commits.nr; i++) {
> + struct commit *c = ctx->commits.list[i];
Here we process commits in whatever order they are in commits.list,
which probably means lexicographical order, so in practice a random
order.
> + struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
> + ctx->total_bloom_filter_data_size += sizeof(uint64_t) * filter->len;
> + display_progress(progress, i + 1);
> + }
> +
> + stop_progress(&progress);
> +}
> +
> static int add_ref_to_list(const char *refname,
> const struct object_id *oid,
> int flags, void *cb_data)
> @@ -1794,6 +1819,8 @@ int write_commit_graph(const char *obj_dir,
> ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
> ctx->check_oids = flags & COMMIT_GRAPH_WRITE_CHECK_OIDS ? 1 : 0;
> ctx->split_opts = split_opts;
> + ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
> + ctx->total_bloom_filter_data_size = 0;
>
> if (ctx->split) {
> struct commit_graph *g;
> @@ -1888,6 +1915,9 @@ int write_commit_graph(const char *obj_dir,
>
> compute_generation_numbers(ctx);
>
> + if (ctx->changed_paths)
> + compute_bloom_filters(ctx);
> +
All right.
> res = write_commit_graph_file(ctx);
>
> if (ctx->split)
> diff --git a/commit-graph.h b/commit-graph.h
> index 7f5c933fa2..952a4b83be 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -76,7 +76,8 @@ enum commit_graph_write_flags {
> COMMIT_GRAPH_WRITE_PROGRESS = (1 << 1),
> COMMIT_GRAPH_WRITE_SPLIT = (1 << 2),
> /* Make sure that each OID in the input is a valid commit OID. */
> - COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3)
> + COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3),
> + COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4)
All right.
Side note: perhaps we could add trailing comma after new enum entry,
that is
+ COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4),
following new CodingGuidelines recommendation
- We try to support a wide range of C compilers to compile Git with,
including old ones. You should not use features from newer C
standard, even if your compiler groks them.
There are a few exceptions to this guideline:
. since early 2012 with e1327023ea, we have been using an enum
definition whose last element is followed by a comma. This, like
an array initializer that ends with a trailing comma, can be used
to reduce the patch noise when adding a new identifier at the end.
https://github.com/git/git/blob/master/Documentation/CodingGuidelines#L197
> };
>
> struct split_commit_graph_opts {
^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v2 04/11] commit-graph: compute Bloom filters for changed paths
2020-02-17 21:56 ` Jakub Narebski
@ 2020-02-22 0:55 ` Garima Singh
2020-02-23 17:34 ` Jakub Narebski
0 siblings, 1 reply; 159+ messages in thread
From: Garima Singh @ 2020-02-22 0:55 UTC (permalink / raw)
To: Jakub Narebski, Garima Singh via GitGitGadget
Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
Jeff Hostetler, Taylor Blau, Jeff King, Christian Couder,
Emily Shaffer, Junio C Hamano, Garima Singh
On 2/17/2020 4:56 PM, Jakub Narebski wrote:
> "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> From: Garima Singh <garima.singh@microsoft.com>
>> Subject: [PATCH v2 04/11] commit-graph: compute Bloom filters for changed paths
>>
>> Compute Bloom filters for the paths that changed between a commit and its
>> first parent using the implementation in bloom.c, when the
>> COMMIT_GRAPH_WRITE_CHANGED_PATHS flag is set. This computation is done on a
>> commit-by-commit basis. We will write these Bloom filters to the commit graph
>> file in the next change.
>
> I have no major complaints about the contents of this patch (except lack
> of test, and type of total_bloom_filter_data_size), but the commit
> message could have been worded better.
>
> I would write something like this instead:
>
> Add new COMMIT_GRAPH_WRITE_CHANGED_PATHS flag that makes Git compute
> Bloom filters that store the information about changed paths (that
> changed between a commit and its first parent) for each commit in the
> commit-graph. This computation is done on a commit-by-commit basis.
>
> We will write these Bloom filters to the commit-graph file, to store
> this data on disk, in the next change in this series.
>
> In my opinion the fact that we compute Bloom filters for each and every
> commit in the commit-graph file is more important than quite obvious
> fact that we use implementation from bloom.c.
>
Nice! Incorporated in v3. Thanks!
>>
>> Helped-by: Derrick Stolee <dstolee@microsoft.com>
>> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
>> ---
>> commit-graph.c | 32 +++++++++++++++++++++++++++++++-
>> commit-graph.h | 3 ++-
>> 2 files changed, 33 insertions(+), 2 deletions(-)
>
> It would be good to have at least sanity check of this feature, perhaps
> one that would check that the number of per-commit Bloom filters on slab
> matches the number of commits in the commit-graph.
>
The combination of all the e2e tests in this series with the test
flag being turned on in the CI, and the performance gains we are seeing
confirm that this is happening correctly.
>>
>> const struct split_commit_graph_opts *split_opts;
>> + uint32_t total_bloom_filter_data_size;
>
> This is total size of Bloom filters data, in bytes, that will later be
> used for BDAT chunk size. However the commit-graph format uses 8 bytes
> for byte-offset, not 4 bytes. Why it is uint32_t and not uint64_t then?
>
Changed to size_t. Thanks for noticing!
>> };
>>
>> static void write_graph_chunk_fanout(struct hashfile *f,
>> @@ -1140,6 +1143,28 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>> stop_progress(&ctx->progress);
>> }
>>
>> +static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>> +{
>> + int i;
>> + struct progress *progress = NULL;
>> +
>> + load_bloom_filters();
>> +
>> + if (ctx->report_progress)
>> + progress = start_progress(
>> + _("Computing commit diff Bloom filters"),
>> + ctx->commits.nr);
>> +
>
> Shouldn't we initialize ctx->total_bloom_filter_data_size to 0 here? We
> cannot use compute_bloom_filters() to _update_ Bloom filters data, I
> think -- we don't distinguish here between new and existing data (where
> existing data size is already included in total Bloom filters size). At
> least I don't think so.
>
This line in commit-graph.c takes care of reinitializing the graph context and
by consequence the bloom filter data size.
ctx = xcalloc(1, sizeof(struct write_commit_graph_context));
So the total size gets recalculated every time, which is correct.
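As a sketch of why this works: calloc() hands back zero-filled memory, so
every integer field of a freshly allocated context starts at 0. The struct
below is a hypothetical, trimmed-down stand-in for
write_commit_graph_context, and plain calloc() is used in place of Git's
xcalloc() wrapper, so the caller has to check for NULL here.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down stand-in for write_commit_graph_context. */
struct demo_context {
	unsigned changed_paths:1;
	size_t total_bloom_filter_data_size;
};

/*
 * Allocate a context with all bytes set to zero. Integer members
 * therefore start at 0 without any explicit assignment.
 * Returns NULL on allocation failure (unlike Git's xcalloc()).
 */
static struct demo_context *new_demo_context(void)
{
	return calloc(1, sizeof(struct demo_context));
}
```

Under that reading, the explicit `ctx->total_bloom_filter_data_size = 0;`
in write_commit_graph() is harmless but not strictly needed.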
>
> Side note: perhaps we could add trailing comma after new enum entry,
> that is
>
> + COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4),
>
> following new CodingGuidelines recommendation
>
Thanks! Fixed in v3.
Cheers!
Garima Singh
^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v2 04/11] commit-graph: compute Bloom filters for changed paths
2020-02-22 0:55 ` Garima Singh
@ 2020-02-23 17:34 ` Jakub Narebski
0 siblings, 0 replies; 159+ messages in thread
From: Jakub Narebski @ 2020-02-23 17:34 UTC (permalink / raw)
To: Garima Singh
Cc: Garima Singh via GitGitGadget, git, Derrick Stolee,
SZEDER Gábor, Jonathan Tan, Jeff Hostetler, Taylor Blau,
Jeff King, Christian Couder, Emily Shaffer, Junio C Hamano,
Garima Singh
Garima Singh <garimasigit@gmail.com> writes:
> On 2/17/2020 4:56 PM, Jakub Narebski wrote:
>> "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
[...]
>>> ---
>>> commit-graph.c | 32 +++++++++++++++++++++++++++++++-
>>> commit-graph.h | 3 ++-
>>> 2 files changed, 33 insertions(+), 2 deletions(-)
>>
>> It would be good to have at least sanity check of this feature, perhaps
>> one that would check that the number of per-commit Bloom filters on slab
>> matches the number of commits in the commit-graph.
>
> The combination of all the e2e tests in this series with the test
> flag being turned on in the CI, and the performance gains we are seeing
> confirm that this is happening correctly.
Well, the advantage of unit tests over e2e functional tests is that they
can pinpoint the source of bug much more easily.
That said, I don't think there is absolute need for unit tests here,
though it would be nice to have them.
>>>
>>> const struct split_commit_graph_opts *split_opts;
>>> + uint32_t total_bloom_filter_data_size;
>>
>> This is total size of Bloom filters data, in bytes, that will later be
>> used for BDAT chunk size. However the commit-graph format uses 8 bytes
>> for byte-offset, not 4 bytes. Why it is uint32_t and not uint64_t then?
>
> Changed to size_t. Thanks for noticing!
Right, this is a local value (size_t may be different size on different
architectures), even though it will be stored indirectly in chunk lookup
table as pair of uint64_t offsets.
>>> };
>>>
>>> static void write_graph_chunk_fanout(struct hashfile *f,
>>> @@ -1140,6 +1143,28 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>>> stop_progress(&ctx->progress);
>>> }
>>>
>>> +static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>>> +{
>>> + int i;
>>> + struct progress *progress = NULL;
>>> +
>>> + load_bloom_filters();
>>> +
>>> + if (ctx->report_progress)
>>> + progress = start_progress(
>>> + _("Computing commit diff Bloom filters"),
>>> + ctx->commits.nr);
>>> +
>>
>> Shouldn't we initialize ctx->total_bloom_filter_data_size to 0 here? We
>> cannot use compute_bloom_filters() to _update_ Bloom filters data, I
>> think -- we don't distinguish here between new and existing data (where
>> existing data size is already included in total Bloom filters size). At
>> least I don't think so.
>>
>
> This line in commit-graph.c takes care of reinitializing the graph context and
> by consequence the bloom filter data size.
>
> ctx = xcalloc(1, sizeof(struct write_commit_graph_context));
>
> So the total size gets recalculated every time, which is correct.
True, I have missed this.
Best,
--
Jakub Narębski
^ permalink raw reply [flat|nested] 159+ messages in thread
* [PATCH v2 05/11] commit-graph: examine changed-path objects in pack order
2020-02-05 22:56 ` [PATCH v2 00/11] " Garima Singh via GitGitGadget
` (3 preceding siblings ...)
2020-02-05 22:56 ` [PATCH v2 04/11] commit-graph: compute Bloom filters for changed paths Garima Singh via GitGitGadget
@ 2020-02-05 22:56 ` Jeff King via GitGitGadget
2020-02-18 17:59 ` Jakub Narebski
2020-02-05 22:56 ` [PATCH v2 06/11] commit-graph: examine commits by generation number Derrick Stolee via GitGitGadget
` (8 subsequent siblings)
13 siblings, 1 reply; 159+ messages in thread
From: Jeff King via GitGitGadget @ 2020-02-05 22:56 UTC (permalink / raw)
To: git
Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
garimasigit, jnareb, christian.couder, emilyshaffer, gitster,
Garima Singh, Jeff King
From: Jeff King <peff@peff.net>
Looking at the diff of commit objects in pack order is much faster than
in sha1 order, as it gives locality to the access of tree deltas
(whereas sha1 order is effectively random). Unfortunately the
commit-graph code sorts the commits (several times, sometimes as an oid
and sometimes a pointer-to-commit), and we ultimately traverse in sha1
order.
Instead, let's remember the position at which we see each commit, and
traverse in that order when looking at bloom filters. This drops my time
for "git commit-graph write --changed-paths" in linux.git from ~4
minutes to ~1.5 minutes.
Probably the "--reachable" code path would want something similar.
Or alternatively, we could use a different data structure (either a
hash, or maybe even just a bit in "struct commit") to keep track of
which oids we've seen, etc instead of sorting. And then we could keep
the original order.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
commit-graph.c | 34 +++++++++++++++++++++++++++++++++-
1 file changed, 33 insertions(+), 1 deletion(-)
diff --git a/commit-graph.c b/commit-graph.c
index 724bfcffc4..e125511a1c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -17,6 +17,7 @@
#include "replace-object.h"
#include "progress.h"
#include "bloom.h"
+#include "commit-slab.h"
#define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
@@ -46,6 +47,29 @@
/* Remember to update object flag allocation in object.h */
#define REACHABLE (1u<<15)
+/* Keep track of the order in which commits are added to our list. */
+define_commit_slab(commit_pos, int);
+static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
+
+static void set_commit_pos(struct repository *r, const struct object_id *oid)
+{
+ static int32_t max_pos;
+ struct commit *commit = lookup_commit(r, oid);
+
+ if (!commit)
+ return; /* should never happen, but be lenient */
+
+ *commit_pos_at(&commit_pos, commit) = max_pos++;
+}
+
+static int commit_pos_cmp(const void *va, const void *vb)
+{
+ const struct commit *a = *(const struct commit **)va;
+ const struct commit *b = *(const struct commit **)vb;
+ return commit_pos_at(&commit_pos, a) -
+ commit_pos_at(&commit_pos, b);
+}
+
char *get_commit_graph_filename(const char *obj_dir)
{
char *filename = xstrfmt("%s/info/commit-graph", obj_dir);
@@ -1027,6 +1051,8 @@ static int add_packed_commits(const struct object_id *oid,
oidcpy(&(ctx->oids.list[ctx->oids.nr]), oid);
ctx->oids.nr++;
+ set_commit_pos(ctx->r, oid);
+
return 0;
}
@@ -1147,6 +1173,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
{
int i;
struct progress *progress = NULL;
+ struct commit **sorted_by_pos;
load_bloom_filters();
@@ -1155,13 +1182,18 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
_("Computing commit diff Bloom filters"),
ctx->commits.nr);
+ ALLOC_ARRAY(sorted_by_pos, ctx->commits.nr);
+ COPY_ARRAY(sorted_by_pos, ctx->commits.list, ctx->commits.nr);
+ QSORT(sorted_by_pos, ctx->commits.nr, commit_pos_cmp);
+
for (i = 0; i < ctx->commits.nr; i++) {
- struct commit *c = ctx->commits.list[i];
+ struct commit *c = sorted_by_pos[i];
struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
ctx->total_bloom_filter_data_size += sizeof(uint64_t) * filter->len;
display_progress(progress, i + 1);
}
+ free(sorted_by_pos);
stop_progress(&progress);
}
--
gitgitgadget
^ permalink raw reply related [flat|nested] 159+ messages in thread
* Re: [PATCH v2 05/11] commit-graph: examine changed-path objects in pack order
2020-02-05 22:56 ` [PATCH v2 05/11] commit-graph: examine changed-path objects in pack order Jeff King via GitGitGadget
@ 2020-02-18 17:59 ` Jakub Narebski
2020-02-24 18:29 ` Garima Singh
0 siblings, 1 reply; 159+ messages in thread
From: Jakub Narebski @ 2020-02-18 17:59 UTC (permalink / raw)
To: Jeff King via GitGitGadget
Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
Jeff Hostetler, Taylor Blau, Jeff King, Garima Singh,
Christian Couder, Emily Shaffer, Junio C Hamano, Garima Singh
"Jeff King via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Jeff King <peff@peff.net>
>
> Looking at the diff of commit objects in pack order is much faster than
> in sha1 order, as it gives locality to the access of tree deltas
Nitpick: should we still say sha1 order? Git is still using SHA-1 as an
*oid*, but hopefully soon it will be transitioning to NewHash = SHA-256.
(No need to change anything.)
> (whereas sha1 order is effectively random). Unfortunately the
> commit-graph code sorts the commits (several times, sometimes as an oid
> and sometimes a pointer-to-commit), and we ultimately traverse in sha1
> order.
Actually, commit-graph code needs write_commit_graph_context.commits.list
to be in lexicographical order to be able to turn position in graph into
reference to a commit. The information about the parents of the commit
are stored using positional references within the graph file.
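A minimal sketch of the lookup that this ordering makes possible, assuming
fixed-width ids kept in a lexicographically sorted table. The 4-byte
DEMO_OID_LEN and demo_oid_pos() are hypothetical and only illustrate the
idea; the real code works on full object ids.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_OID_LEN 4 /* hypothetical short ids, for illustration only */

/*
 * Binary-search a lexicographically sorted table of fixed-width ids.
 * Returns the position of `oid` in the table, or -1 if it is absent.
 * That position is what a positional parent reference would store.
 */
static long demo_oid_pos(const unsigned char (*table)[DEMO_OID_LEN],
			 size_t nr, const unsigned char *oid)
{
	size_t lo = 0, hi = nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(table[mid], oid, DEMO_OID_LEN);

		if (!cmp)
			return (long)mid;
		if (cmp < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	return -1;
}
```

The search is only valid while the table stays sorted, which is why the
patch sorts a copy of the list by pack position instead of reordering
commits.list itself.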
>
> Instead, let's remember the position at which we see each commit, and
> traverse in that order when looking at bloom filters. This drops my time
> for "git commit-graph write --changed-paths" in linux.git from ~4
> minutes to ~1.5 minutes.
Nitpick: with the reordering of patches (which I think is otherwise a good
thing), this patch actually comes before the one adding the "--changed-paths"
option to "git commit-graph write". So it should say 'This would drop my
time...' rather than 'This drops my time...' ;-)
>
> Probably the "--reachable" code path would want something similar.
Has anyone tried doing this?
>
> Or alternatively, we could use a different data structure (either a
> hash, or maybe even just a bit in "struct commit") to keep track of
> which oids we've seen, etc instead of sorting. And then we could keep
> the original order.
I think it is nice to keep those "what ifs?" thoughts in the commit
message. They add some color.
>
> Signed-off-by: Jeff King <peff@peff.net>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
> commit-graph.c | 34 +++++++++++++++++++++++++++++++++-
> 1 file changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 724bfcffc4..e125511a1c 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -17,6 +17,7 @@
> #include "replace-object.h"
> #include "progress.h"
> #include "bloom.h"
> +#include "commit-slab.h"
>
> #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
> #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
> @@ -46,6 +47,29 @@
> /* Remember to update object flag allocation in object.h */
> #define REACHABLE (1u<<15)
>
> +/* Keep track of the order in which commits are added to our list. */
> +define_commit_slab(commit_pos, int);
> +static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
> +
> +static void set_commit_pos(struct repository *r, const struct object_id *oid)
> +{
> + static int32_t max_pos;
> + struct commit *commit = lookup_commit(r, oid);
> +
> + if (!commit)
> + return; /* should never happen, but be lenient */
> +
> + *commit_pos_at(&commit_pos, commit) = max_pos++;
> +}
All right, that is a nice and universal function.
> +
> +static int commit_pos_cmp(const void *va, const void *vb)
> +{
> + const struct commit *a = *(const struct commit **)va;
> + const struct commit *b = *(const struct commit **)vb;
> + return commit_pos_at(&commit_pos, a) -
> + commit_pos_at(&commit_pos, b);
> +}
Hmmm... I wonder what would happen if commit_pos was not set (e.g. for
commit-graph commits not coming from the packfile). Let's look up
the documentation...
commit_pos_at() returns a pointer to an int... why are we comparing
pointers and not values? Shouldn't it be
+ return *commit_pos_at(&commit_pos, a) -
+ *commit_pos_at(&commit_pos, b);
With commit_pos_at() the location to store the data is allocated as
necessary (if data for the commit doesn't exist), and because we are using
xcalloc() the *commit_pos_at() is 0-initialized. This means that if
commits didn't come from the packfile, we sort all commits as being
equal. Luckily we fix that in the next patch.
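A minimal sketch of the dereference point, assuming a hypothetical `struct item` and `pos_at()` as stand-ins for the commit slab (neither is part of Git). The accessor returns a pointer to the slot, so the comparator has to load the values before comparing them:

```c
#include <assert.h>
#include <stdlib.h>

struct item {
	int pos;	/* stand-in for the per-commit slab slot */
};

/* Like a slab accessor: hands back a pointer to the slot, not its value. */
int *pos_at(struct item *it)
{
	return &it->pos;
}

int item_pos_cmp(const void *va, const void *vb)
{
	struct item *a = *(struct item *const *)va;
	struct item *b = *(struct item *const *)vb;
	int pa = *pos_at(a);	/* dereference: compare values, not addresses */
	int pb = *pos_at(b);

	/* (pa > pb) - (pa < pb) sidesteps the overflow a plain pa - pb could hit. */
	return (pa > pb) - (pa < pb);
}

void sort_by_pos(struct item **items, size_t nr)
{
	assert(items != NULL || nr == 0);
	if (nr > 1)
		qsort(items, nr, sizeof(*items), item_pos_cmp);
}
```

If every slot is still zero, the comparator returns 0 for every pair and the resulting order is whatever qsort() happens to produce, which matches the "all commits sort as equal" concern above.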
> +
> char *get_commit_graph_filename(const char *obj_dir)
> {
> char *filename = xstrfmt("%s/info/commit-graph", obj_dir);
> @@ -1027,6 +1051,8 @@ static int add_packed_commits(const struct object_id *oid,
> oidcpy(&(ctx->oids.list[ctx->oids.nr]), oid);
> ctx->oids.nr++;
>
> + set_commit_pos(ctx->r, oid);
> +
> return 0;
> }
>
> @@ -1147,6 +1173,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
> {
> int i;
> struct progress *progress = NULL;
> + struct commit **sorted_by_pos;
In the next patch in the series we sort commits by generation number
and creation date; shouldn't this variable name be more generic to
reflect this, for example just `sorted_commits` or `commits_sorted`?
>
> load_bloom_filters();
>
> @@ -1155,13 +1182,18 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
> _("Computing commit diff Bloom filters"),
> ctx->commits.nr);
>
> + ALLOC_ARRAY(sorted_by_pos, ctx->commits.nr);
> + COPY_ARRAY(sorted_by_pos, ctx->commits.list, ctx->commits.nr);
> + QSORT(sorted_by_pos, ctx->commits.nr, commit_pos_cmp);
> +
All right: allocate array, copy data, sort it.
We need to copy the data because (I think) ctx->commits.list has to stay in
lexicographical order: the parents of a commit are stored as positions in
the graph, and those positions must still map back to the right commit.
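The copy-then-sort pattern can be sketched with plain integers. This is an illustration under assumed names (`sorted_copy()` and `int_cmp()` are hypothetical), using malloc() and memcpy() where the patch uses Git's ALLOC_ARRAY and COPY_ARRAY macros:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

int int_cmp(const void *va, const void *vb)
{
	int a = *(const int *)va;
	int b = *(const int *)vb;

	return (a > b) - (a < b);
}

/*
 * Returns a newly allocated, sorted copy of `src` (caller frees it),
 * or NULL on allocation failure. `src` itself is left untouched, so
 * any index into it stays valid.
 */
int *sorted_copy(const int *src, size_t nr)
{
	int *copy;

	assert(src != NULL && nr > 0);

	copy = malloc(nr * sizeof(*copy));
	if (!copy)
		return NULL;
	memcpy(copy, src, nr * sizeof(*copy));
	qsort(copy, nr, sizeof(*copy), int_cmp);
	return copy;
}
```

The caller iterates over the copy for the expensive work and keeps using the original array wherever positions matter.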
> for (i = 0; i < ctx->commits.nr; i++) {
> - struct commit *c = ctx->commits.list[i];
> + struct commit *c = sorted_by_pos[i];
All right: use sorted data.
> struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
> ctx->total_bloom_filter_data_size += sizeof(uint64_t) * filter->len;
> display_progress(progress, i + 1);
> }
>
> + free(sorted_by_pos);
Can we free the slab data, i.e. call `clear_commit_pos(&commit_pos);`
here? Otherwise we are leaking memory (well, except that when the command
finishes the operating system frees the memory for us).
> stop_progress(&progress);
> }
Best,
--
Jakub Narębski
* Re: [PATCH v2 05/11] commit-graph: examine changed-path objects in pack order
2020-02-18 17:59 ` Jakub Narebski
@ 2020-02-24 18:29 ` Garima Singh
0 siblings, 0 replies; 159+ messages in thread
From: Garima Singh @ 2020-02-24 18:29 UTC (permalink / raw)
To: Jakub Narebski, Jeff King via GitGitGadget
Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
Jeff Hostetler, Taylor Blau, Jeff King, Christian Couder,
Emily Shaffer, Junio C Hamano, Garima Singh
On 2/18/2020 12:59 PM, Jakub Narebski wrote:
> "Jeff King via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> From: Jeff King <peff@peff.net>
>>
>> Looking at the diff of commit objects in pack order is much faster than
>> in sha1 order, as it gives locality to the access of tree deltas
>
> Nitpick: should we still say sha1 order? Git is still using SHA-1 as an
> *oid*, but hopefully soon it will be transitioning to NewHash = SHA-256.
> (No need to change anything.)
>
>> (whereas sha1 order is effectively random). Unfortunately the
>> commit-graph code sorts the commits (several times, sometimes as an oid
>> and sometimes a pointer-to-commit), and we ultimately traverse in sha1
>> order.
>
> Actually, the commit-graph code needs write_commit_graph_context.commits.list
> to be in lexicographical order so that a position in the graph can be turned
> into a reference to a commit. The information about the parents of a commit
> is stored using positional references within the graph file.
>
You are right. Fixing the commit message in v3.
>>
>> Instead, let's remember the position at which we see each commit, and
>> traverse in that order when looking at bloom filters. This drops my time
>> for "git commit-graph write --changed-paths" in linux.git from ~4
>> minutes to ~1.5 minutes.
>
> Nitpick: with the reordering of patches (which I think is otherwise a good
> thing) this patch actually comes before the one adding the "--changed-paths"
> option to "git commit-graph write". So it should be 'This would drop my
> time...' rather than 'This drops my time...' ;-)
>
:) I will fix that up.
>>
>> Probably the "--reachable" code path would want something similar.
>
> Has anyone tried doing this?
>
I will, and I will include the perf numbers as appropriate in v3.
>> +
>> char *get_commit_graph_filename(const char *obj_dir)
>> {
>> char *filename = xstrfmt("%s/info/commit-graph", obj_dir);
>> @@ -1027,6 +1051,8 @@ static int add_packed_commits(const struct object_id *oid,
>> oidcpy(&(ctx->oids.list[ctx->oids.nr]), oid);
>> ctx->oids.nr++;
>>
>> + set_commit_pos(ctx->r, oid);
>> +
>> return 0;
>> }
>>
>> @@ -1147,6 +1173,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>> {
>> int i;
>> struct progress *progress = NULL;
>> + struct commit **sorted_by_pos;
>
> In the next patch in the series we sort commits by generation number
> and creation date; shouldn't this variable name be more generic to
> reflect this, for example just `sorted_commits` or `commits_sorted`?
>
Good call. I will clean this up in both commits.
Thanks for the review!
Cheers!
Garima Singh
* [PATCH v2 06/11] commit-graph: examine commits by generation number
2020-02-05 22:56 ` [PATCH v2 00/11] " Garima Singh via GitGitGadget
` (4 preceding siblings ...)
2020-02-05 22:56 ` [PATCH v2 05/11] commit-graph: examine changed-path objects in pack order Jeff King via GitGitGadget
@ 2020-02-05 22:56 ` Derrick Stolee via GitGitGadget
2020-02-19 0:32 ` Jakub Narebski
2020-02-05 22:56 ` [PATCH v2 07/11] commit-graph: write Bloom filters to commit graph file Garima Singh via GitGitGadget
` (7 subsequent siblings)
13 siblings, 1 reply; 159+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-02-05 22:56 UTC (permalink / raw)
To: git
Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
garimasigit, jnareb, christian.couder, emilyshaffer, gitster,
Garima Singh, Derrick Stolee
From: Derrick Stolee <dstolee@microsoft.com>
When running 'git commit-graph write --changed-paths', we sort the
commits by pack-order to save time when computing the changed-paths
bloom filters. This does not help when finding the commits via the
--reachable flag.
If not using pack-order, then sort by generation number before
examining the diff. Commits with similar generation are more likely
to have many trees in common, making the diff faster.
On the Linux kernel repository, this change reduced the computation
time for 'git commit-graph write --reachable --changed-paths' from
3m00s to 1m37s.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
commit-graph.c | 33 ++++++++++++++++++++++++++++++---
1 file changed, 30 insertions(+), 3 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index e125511a1c..32a315058f 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -70,6 +70,25 @@ static int commit_pos_cmp(const void *va, const void *vb)
commit_pos_at(&commit_pos, b);
}
+static int commit_gen_cmp(const void *va, const void *vb)
+{
+ const struct commit *a = *(const struct commit **)va;
+ const struct commit *b = *(const struct commit **)vb;
+
+ /* lower generation commits first */
+ if (a->generation < b->generation)
+ return -1;
+ else if (a->generation > b->generation)
+ return 1;
+
+ /* use date as a heuristic when generations are equal */
+ if (a->date < b->date)
+ return -1;
+ else if (a->date > b->date)
+ return 1;
+ return 0;
+}
+
char *get_commit_graph_filename(const char *obj_dir)
{
char *filename = xstrfmt("%s/info/commit-graph", obj_dir);
@@ -821,7 +840,8 @@ struct write_commit_graph_context {
report_progress:1,
split:1,
check_oids:1,
- changed_paths:1;
+ changed_paths:1,
+ order_by_pack:1;
const struct split_commit_graph_opts *split_opts;
uint32_t total_bloom_filter_data_size;
@@ -1184,7 +1204,11 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
ALLOC_ARRAY(sorted_by_pos, ctx->commits.nr);
COPY_ARRAY(sorted_by_pos, ctx->commits.list, ctx->commits.nr);
- QSORT(sorted_by_pos, ctx->commits.nr, commit_pos_cmp);
+
+ if (ctx->order_by_pack)
+ QSORT(sorted_by_pos, ctx->commits.nr, commit_pos_cmp);
+ else
+ QSORT(sorted_by_pos, ctx->commits.nr, commit_gen_cmp);
for (i = 0; i < ctx->commits.nr; i++) {
struct commit *c = sorted_by_pos[i];
@@ -1902,6 +1926,7 @@ int write_commit_graph(const char *obj_dir,
}
if (pack_indexes) {
+ ctx->order_by_pack = 1;
if ((res = fill_oids_from_packs(ctx, pack_indexes)))
goto cleanup;
}
@@ -1911,8 +1936,10 @@ int write_commit_graph(const char *obj_dir,
goto cleanup;
}
- if (!pack_indexes && !commit_hex)
+ if (!pack_indexes && !commit_hex) {
+ ctx->order_by_pack = 1;
fill_oids_from_all_packs(ctx);
+ }
close_reachable(ctx);
--
gitgitgadget
* Re: [PATCH v2 06/11] commit-graph: examine commits by generation number
2020-02-05 22:56 ` [PATCH v2 06/11] commit-graph: examine commits by generation number Derrick Stolee via GitGitGadget
@ 2020-02-19 0:32 ` Jakub Narebski
2020-02-24 20:45 ` Garima Singh
0 siblings, 1 reply; 159+ messages in thread
From: Jakub Narebski @ 2020-02-19 0:32 UTC (permalink / raw)
To: Derrick Stolee via GitGitGadget
Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
Jeff Hostetler, Taylor Blau, Jeff King, Garima Singh,
Christian Couder, Emily Shaffer, Junio C Hamano, Garima Singh,
Derrick Stolee
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> When running 'git commit-graph write --changed-paths', we sort the
> commits by pack-order to save time when computing the changed-paths
> bloom filters. This does not help when finding the commits via the
> --reachable flag.
Minor improvement suggestion: s/--reachable flag/'--reachable' flag/.
>
> If not using pack-order, then sort by generation number before
> examining the diff.
All right, that is a good description of what the patch does.
> Commits with similar generation are more likely
> to have many trees in common, making the diff faster.
Is this what causes the performance improvement, that subsequently
examined commits are more likely to have more trees in common, which
means that those trees would be hot in cache, making generating diff
faster? Is it what profiling shows?
>
> On the Linux kernel repository, this change reduced the computation
> time for 'git commit-graph write --reachable --changed-paths' from
> 3m00s to 1m37s.
Would using the trick used for packfiles also for '--reachable', which
would mean commits examined in recency / reachability order, give
similar, worse, or better performance improvements?
We would want this sorting order as one of the possibilities anyway, because
with '--stdin-commits' we could get commits in random order.
>
> Helped-by: Jeff King <peff@peff.net>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
> commit-graph.c | 33 ++++++++++++++++++++++++++++++---
> 1 file changed, 30 insertions(+), 3 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index e125511a1c..32a315058f 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -70,6 +70,25 @@ static int commit_pos_cmp(const void *va, const void *vb)
> commit_pos_at(&commit_pos, b);
> }
>
> +static int commit_gen_cmp(const void *va, const void *vb)
> +{
> + const struct commit *a = *(const struct commit **)va;
> + const struct commit *b = *(const struct commit **)vb;
> +
> + /* lower generation commits first */
Shouldn't higher generation commits come first, in recency-like order?
Or does it not matter whether it is sorted in ascending or descending order,
as long as commits with close generation numbers are examined close
together?
> + if (a->generation < b->generation)
> + return -1;
> + else if (a->generation > b->generation)
> + return 1;
> +
> + /* use date as a heuristic when generations are equal */
> + if (a->date < b->date)
> + return -1;
> + else if (a->date > b->date)
> + return 1;
> + return 0;
> +}
I thought we already had such a comparison function defined somewhere in
Git, but I think I'm wrong here.
> +
> char *get_commit_graph_filename(const char *obj_dir)
> {
> char *filename = xstrfmt("%s/info/commit-graph", obj_dir);
> @@ -821,7 +840,8 @@ struct write_commit_graph_context {
> report_progress:1,
> split:1,
> check_oids:1,
> - changed_paths:1;
> + changed_paths:1,
> + order_by_pack:1;
>
> const struct split_commit_graph_opts *split_opts;
> uint32_t total_bloom_filter_data_size;
> @@ -1184,7 +1204,11 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>
> ALLOC_ARRAY(sorted_by_pos, ctx->commits.nr);
> COPY_ARRAY(sorted_by_pos, ctx->commits.list, ctx->commits.nr);
> - QSORT(sorted_by_pos, ctx->commits.nr, commit_pos_cmp);
> +
> + if (ctx->order_by_pack)
> + QSORT(sorted_by_pos, ctx->commits.nr, commit_pos_cmp);
> + else
> + QSORT(sorted_by_pos, ctx->commits.nr, commit_gen_cmp);
Here the 'sorted_by_pos' variable name no longer reflects reality...
(see the comment on the previous patch in the series).
>
> for (i = 0; i < ctx->commits.nr; i++) {
> struct commit *c = sorted_by_pos[i];
> @@ -1902,6 +1926,7 @@ int write_commit_graph(const char *obj_dir,
> }
>
> if (pack_indexes) {
> + ctx->order_by_pack = 1;
> if ((res = fill_oids_from_packs(ctx, pack_indexes)))
> goto cleanup;
> }
> @@ -1911,8 +1936,10 @@ int write_commit_graph(const char *obj_dir,
> goto cleanup;
> }
>
> - if (!pack_indexes && !commit_hex)
> + if (!pack_indexes && !commit_hex) {
> + ctx->order_by_pack = 1;
> fill_oids_from_all_packs(ctx);
> + }
>
> close_reachable(ctx);
All right, that covers all cases where 'git commit-graph write' writes
serialized commit-graph based on the commits found in packfiles:
'--stdin-packs' and default no option case, in that order.
Looks good.
Best,
--
Jakub Narębski
* Re: [PATCH v2 06/11] commit-graph: examine commits by generation number
2020-02-19 0:32 ` Jakub Narebski
@ 2020-02-24 20:45 ` Garima Singh
0 siblings, 0 replies; 159+ messages in thread
From: Garima Singh @ 2020-02-24 20:45 UTC (permalink / raw)
To: Jakub Narebski, Derrick Stolee via GitGitGadget
Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
Jeff Hostetler, Taylor Blau, Jeff King, Christian Couder,
Emily Shaffer, Junio C Hamano, Garima Singh, Derrick Stolee
On 2/18/2020 7:32 PM, Jakub Narebski wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> When running 'git commit-graph write --changed-paths', we sort the
>> commits by pack-order to save time when computing the changed-paths
>> bloom filters. This does not help when finding the commits via the
>> --reachable flag.
>
> Minor improvement suggestion: s/--reachable flag/'--reachable' flag/.
>
Sure.
>> Commits with similar generation are more likely
>> to have many trees in common, making the diff faster.
>
> Is this what causes the performance improvement, that subsequently
> examined commits are more likely to have more trees in common, which
> means that those trees would be hot in cache, making generating diff
> faster? Is it what profiling shows?
>
Yes.
>>
>> Helped-by: Jeff King <peff@peff.net>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
>> ---
>> commit-graph.c | 33 ++++++++++++++++++++++++++++++---
>> 1 file changed, 30 insertions(+), 3 deletions(-)
>>
>> diff --git a/commit-graph.c b/commit-graph.c
>> index e125511a1c..32a315058f 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -70,6 +70,25 @@ static int commit_pos_cmp(const void *va, const void *vb)
>> commit_pos_at(&commit_pos, b);
>> }
>>
>> +static int commit_gen_cmp(const void *va, const void *vb)
>> +{
>> + const struct commit *a = *(const struct commit **)va;
>> + const struct commit *b = *(const struct commit **)vb;
>> +
>> + /* lower generation commits first */
>
> Shouldn't higher generation commits come first, in recency-like order?
> Or does it not matter whether it is sorted in ascending or descending order,
> as long as commits with close generation numbers are examined close
> together?
>
The direction does not matter. Locality is important.
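A self-contained sketch of the two-key ordering, assuming a hypothetical `struct toy_commit` with just the two fields the comparator reads (this is not Git's `struct commit`). Flipping both comparisons would reverse the order while keeping commits of similar generation adjacent, which is the locality property referred to above:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical miniature of a commit: only the fields used for ordering. */
struct toy_commit {
	uint32_t generation;
	uint64_t date;
};

int toy_gen_cmp(const void *va, const void *vb)
{
	const struct toy_commit *a = *(const struct toy_commit *const *)va;
	const struct toy_commit *b = *(const struct toy_commit *const *)vb;

	/* lower generation first */
	if (a->generation < b->generation)
		return -1;
	if (a->generation > b->generation)
		return 1;

	/* equal generations: fall back to the date */
	if (a->date < b->date)
		return -1;
	if (a->date > b->date)
		return 1;
	return 0;
}

void sort_by_generation(const struct toy_commit **list, size_t nr)
{
	assert(list != NULL || nr == 0);
	if (nr > 1)
		qsort(list, nr, sizeof(*list), toy_gen_cmp);
}
```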
>> + if (a->generation < b->generation)
>> + return -1;
>> + else if (a->generation > b->generation)
>> + return 1;
>> +
>> + /* use date as a heuristic when generations are equal */
>> + if (a->date < b->date)
>> + return -1;
>> + else if (a->date > b->date)
>> + return 1;
>> + return 0;
>> +}
>
> I thought we already had such a comparison function defined somewhere in
> Git, but I think I'm wrong here.
>
It actually exists in commit.h
I will just use it here.
Thanks for pointing it out!
>> +
>> char *get_commit_graph_filename(const char *obj_dir)
>> {
>> char *filename = xstrfmt("%s/info/commit-graph", obj_dir);
>> @@ -821,7 +840,8 @@ struct write_commit_graph_context {
>> report_progress:1,
>> split:1,
>> check_oids:1,
>> - changed_paths:1;
>> + changed_paths:1,
>> + order_by_pack:1;
>>
>> const struct split_commit_graph_opts *split_opts;
>> uint32_t total_bloom_filter_data_size;
>> @@ -1184,7 +1204,11 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>>
>> ALLOC_ARRAY(sorted_by_pos, ctx->commits.nr);
>> COPY_ARRAY(sorted_by_pos, ctx->commits.list, ctx->commits.nr);
>> - QSORT(sorted_by_pos, ctx->commits.nr, commit_pos_cmp);
>> +
>> + if (ctx->order_by_pack)
>> + QSORT(sorted_by_pos, ctx->commits.nr, commit_pos_cmp);
>> + else
>> + QSORT(sorted_by_pos, ctx->commits.nr, commit_gen_cmp);
>
> Here the 'sorted_by_pos' variable name no longer reflects reality...
> (see the comment on the previous patch in the series).
>
Yup. Fixing.
Thanks!
Garima Singh
* [PATCH v2 07/11] commit-graph: write Bloom filters to commit graph file
2020-02-05 22:56 ` [PATCH v2 00/11] " Garima Singh via GitGitGadget
` (5 preceding siblings ...)
2020-02-05 22:56 ` [PATCH v2 06/11] commit-graph: examine commits by generation number Derrick Stolee via GitGitGadget
@ 2020-02-05 22:56 ` Garima Singh via GitGitGadget
2020-02-19 15:13 ` Jakub Narebski
2020-02-05 22:56 ` [PATCH v2 08/11] commit-graph: reuse existing Bloom filters during write Garima Singh via GitGitGadget
` (6 subsequent siblings)
13 siblings, 1 reply; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-02-05 22:56 UTC (permalink / raw)
To: git
Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
garimasigit, jnareb, christian.couder, emilyshaffer, gitster,
Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
Update the technical documentation for commit-graph-format with the formats for
the Bloom filter index (BIDX) and Bloom filter data (BDAT) chunks. Write the
computed Bloom filters information to the commit graph file using this format.
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
.../technical/commit-graph-format.txt | 24 ++++
commit-graph.c | 118 +++++++++++++++++-
commit-graph.h | 7 +-
3 files changed, 145 insertions(+), 4 deletions(-)
diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index a4f17441ae..22e511643d 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -17,6 +17,9 @@ metadata, including:
- The parents of the commit, stored using positional references within
the graph file.
+- The Bloom filter of the commit carrying the paths that were changed between
+ the commit and its first parent.
+
These positional references are stored as unsigned 32-bit integers
corresponding to the array position within the list of commit OIDs. Due
to some special constants we use to track parents, we can store at most
@@ -93,6 +96,27 @@ CHUNK DATA:
positions for the parents until reaching a value with the most-significant
bit on. The other bits correspond to the position of the last parent.
+ Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) (N * 4 bytes) [Optional]
+ * The ith entry, BIDX[i], stores the number of 8-byte word blocks in all
+ Bloom filters from commit 0 to commit i (inclusive) in lexicographic
+ order. The Bloom filter for the i-th commit spans from BIDX[i-1] to
+ BIDX[i] (plus header length), where BIDX[-1] is 0.
+ * The BIDX chunk is ignored if the BDAT chunk is not present.
+
+ Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
+ * It starts with header consisting of three unsigned 32-bit integers:
+ - Version of the hash algorithm being used. We currently only support
+ value 1 which implies the murmur3 hash implemented exactly as described
+ in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
+ - The number of times a path is hashed and hence the number of bit positions
+ that cumulatively determine whether a file is present in the commit.
+ - The minimum number of bits 'b' per entry in the Bloom filter. If the filter
+ contains 'n' entries, then the filter size is the minimum number of 64-bit
+ words that contain n*b bits.
+ * The rest of the chunk is the concatenation of all the computed Bloom
+ filters for the commits in lexicographic order.
+ * The BDAT chunk is present iff BIDX is present.
+
Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
This list of H-byte hashes describe a set of B commit-graph files that
form a commit-graph chain. The graph position for the ith commit in this
diff --git a/commit-graph.c b/commit-graph.c
index 32a315058f..4585b3b702 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -24,8 +24,10 @@
#define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
#define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
+#define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
+#define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
#define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
-#define MAX_NUM_CHUNKS 5
+#define MAX_NUM_CHUNKS 7
#define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
@@ -325,6 +327,32 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
chunk_repeated = 1;
else
graph->chunk_base_graphs = data + chunk_offset;
+ break;
+
+ case GRAPH_CHUNKID_BLOOMINDEXES:
+ if (graph->chunk_bloom_indexes)
+ chunk_repeated = 1;
+ else
+ graph->chunk_bloom_indexes = data + chunk_offset;
+ break;
+
+ case GRAPH_CHUNKID_BLOOMDATA:
+ if (graph->chunk_bloom_data)
+ chunk_repeated = 1;
+ else {
+ uint32_t hash_version;
+ graph->chunk_bloom_data = data + chunk_offset;
+ hash_version = get_be32(data + chunk_offset);
+
+ if (hash_version != 1)
+ break;
+
+ graph->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
+ graph->bloom_filter_settings->hash_version = hash_version;
+ graph->bloom_filter_settings->num_hashes = get_be32(data + chunk_offset + 4);
+ graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
+ }
+ break;
}
if (chunk_repeated) {
@@ -343,6 +371,17 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
last_chunk_offset = chunk_offset;
}
+ /* We need both the bloom chunks to exist together. Else ignore the data */
+ if ((graph->chunk_bloom_indexes && !graph->chunk_bloom_data)
+ || (!graph->chunk_bloom_indexes && graph->chunk_bloom_data)) {
+ graph->chunk_bloom_indexes = NULL;
+ graph->chunk_bloom_data = NULL;
+ graph->bloom_filter_settings = NULL;
+ }
+
+ if (graph->chunk_bloom_indexes && graph->chunk_bloom_data)
+ load_bloom_filters();
+
hashcpy(graph->oid.hash, graph->data + graph->data_len - graph->hash_len);
if (verify_commit_graph_lite(graph)) {
@@ -1040,6 +1079,59 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
}
}
+static void write_graph_chunk_bloom_indexes(struct hashfile *f,
+ struct write_commit_graph_context *ctx)
+{
+ struct commit **list = ctx->commits.list;
+ struct commit **last = ctx->commits.list + ctx->commits.nr;
+ uint32_t cur_pos = 0;
+ struct progress *progress = NULL;
+ int i = 0;
+
+ if (ctx->report_progress)
+ progress = start_delayed_progress(
+ _("Writing changed paths Bloom filters index"),
+ ctx->commits.nr);
+
+ while (list < last) {
+ struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
+ cur_pos += filter->len;
+ display_progress(progress, ++i);
+ hashwrite_be32(f, cur_pos);
+ list++;
+ }
+
+ stop_progress(&progress);
+}
+
+static void write_graph_chunk_bloom_data(struct hashfile *f,
+ struct write_commit_graph_context *ctx,
+ struct bloom_filter_settings *settings)
+{
+ struct commit **list = ctx->commits.list;
+ struct commit **last = ctx->commits.list + ctx->commits.nr;
+ struct progress *progress = NULL;
+ int i = 0;
+
+ if (ctx->report_progress)
+ progress = start_delayed_progress(
+ _("Writing changed paths Bloom filters data"),
+ ctx->commits.nr);
+
+ hashwrite_be32(f, settings->hash_version);
+ hashwrite_be32(f, settings->num_hashes);
+ hashwrite_be32(f, settings->bits_per_entry);
+
+ while (list < last) {
+ struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
+ display_progress(progress, ++i);
+ hashwrite(f, filter->data, filter->len * sizeof(uint64_t));
+ list++;
+ }
+
+ stop_progress(&progress);
+}
+
static int oid_compare(const void *_a, const void *_b)
{
const struct object_id *a = (const struct object_id *)_a;
@@ -1198,8 +1290,8 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
load_bloom_filters();
if (ctx->report_progress)
- progress = start_progress(
- _("Computing commit diff Bloom filters"),
+ progress = start_delayed_progress(
+ _("Computing changed paths Bloom filters"),
ctx->commits.nr);
ALLOC_ARRAY(sorted_by_pos, ctx->commits.nr);
@@ -1444,6 +1536,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
struct strbuf progress_title = STRBUF_INIT;
int num_chunks = 3;
struct object_id file_hash;
+ struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
if (ctx->split) {
struct strbuf tmp_file = STRBUF_INIT;
@@ -1488,6 +1581,12 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
num_chunks++;
}
+ if (ctx->changed_paths) {
+ chunk_ids[num_chunks] = GRAPH_CHUNKID_BLOOMINDEXES;
+ num_chunks++;
+ chunk_ids[num_chunks] = GRAPH_CHUNKID_BLOOMDATA;
+ num_chunks++;
+ }
if (ctx->num_commit_graphs_after > 1) {
chunk_ids[num_chunks] = GRAPH_CHUNKID_BASE;
num_chunks++;
@@ -1506,6 +1605,15 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
4 * ctx->num_extra_edges;
num_chunks++;
}
+ if (ctx->changed_paths) {
+ chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
+ sizeof(uint32_t) * ctx->commits.nr;
+ num_chunks++;
+
+ chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
+ sizeof(uint32_t) * 3 + ctx->total_bloom_filter_data_size;
+ num_chunks++;
+ }
if (ctx->num_commit_graphs_after > 1) {
chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
hashsz * (ctx->num_commit_graphs_after - 1);
@@ -1543,6 +1651,10 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
write_graph_chunk_data(f, hashsz, ctx);
if (ctx->num_extra_edges)
write_graph_chunk_extra_edges(f, ctx);
+ if (ctx->changed_paths) {
+ write_graph_chunk_bloom_indexes(f, ctx);
+ write_graph_chunk_bloom_data(f, ctx, &bloom_settings);
+ }
if (ctx->num_commit_graphs_after > 1 &&
write_graph_chunk_base(f, ctx)) {
return -1;
diff --git a/commit-graph.h b/commit-graph.h
index 952a4b83be..25fefefb3e 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -10,6 +10,7 @@
#define GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD"
struct commit;
+struct bloom_filter_settings;
char *get_commit_graph_filename(const char *obj_dir);
int open_commit_graph(const char *graph_file, int *fd, struct stat *st);
@@ -58,6 +59,10 @@ struct commit_graph {
const unsigned char *chunk_commit_data;
const unsigned char *chunk_extra_edges;
const unsigned char *chunk_base_graphs;
+ const unsigned char *chunk_bloom_indexes;
+ const unsigned char *chunk_bloom_data;
+
+ struct bloom_filter_settings *bloom_filter_settings;
};
struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st);
@@ -77,7 +82,7 @@ enum commit_graph_write_flags {
COMMIT_GRAPH_WRITE_SPLIT = (1 << 2),
/* Make sure that each OID in the input is a valid commit OID. */
COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3),
- COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4)
+ COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4),
};
struct split_commit_graph_opts {
--
gitgitgadget
* Re: [PATCH v2 07/11] commit-graph: write Bloom filters to commit graph file
2020-02-05 22:56 ` [PATCH v2 07/11] commit-graph: write Bloom filters to commit graph file Garima Singh via GitGitGadget
@ 2020-02-19 15:13 ` Jakub Narebski
2020-02-24 21:14 ` Garima Singh
0 siblings, 1 reply; 159+ messages in thread
From: Jakub Narebski @ 2020-02-19 15:13 UTC (permalink / raw)
To: Garima Singh via GitGitGadget
Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
Jeff Hostetler, Taylor Blau, Jeff King, Garima Singh,
Christian Couder, Emily Shaffer, Junio C Hamano, Garima Singh
"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Garima Singh <garima.singh@microsoft.com>
>
> Update the technical documentation for commit-graph-format with the formats for
> the Bloom filter index (BIDX) and Bloom filter data (BDAT) chunks. Write the
> computed Bloom filters information to the commit graph file using this format.
Nice description.
The only minor nitpick is with the formatting: the text is wrapped at
80 characters, which is a bit wide.
>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
> .../technical/commit-graph-format.txt | 24 ++++
> commit-graph.c | 118 +++++++++++++++++-
> commit-graph.h | 7 +-
> 3 files changed, 145 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> index a4f17441ae..22e511643d 100644
> --- a/Documentation/technical/commit-graph-format.txt
> +++ b/Documentation/technical/commit-graph-format.txt
> @@ -17,6 +17,9 @@ metadata, including:
> - The parents of the commit, stored using positional references within
> the graph file.
>
> +- The Bloom filter of the commit carrying the paths that were changed between
> + the commit and its first parent.
> +
All right.
Should we also state that it is optional (meta)data? This would be
the first optional piece of data stored in the commit-graph, I think.
> These positional references are stored as unsigned 32-bit integers
> corresponding to the array position within the list of commit OIDs. Due
> to some special constants we use to track parents, we can store at most
> @@ -93,6 +96,27 @@ CHUNK DATA:
> positions for the parents until reaching a value with the most-significant
> bit on. The other bits correspond to the position of the last parent.
>
> + Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) (N * 4 bytes) [Optional]
> + * The ith entry, BIDX[i], stores the number of 8-byte word blocks in all
> + Bloom filters from commit 0 to commit i (inclusive) in lexicographic
> + order. The Bloom filter for the i-th commit spans from BIDX[i-1] to
> + BIDX[i] (plus header length), where BIDX[-1] is 0.
> + * The BIDX chunk is ignored if the BDAT chunk is not present.
All right. Looks good.
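To make the cumulative layout concrete, here is a minimal sketch of how
the BIDX values follow from the per-commit filter lengths, and how one
filter's span is recovered from them. The helper names build_bidx() and
filter_span() are hypothetical and do not appear in the patch; only the
rule "BIDX[i] is the running total, BIDX[-1] is 0" comes from the text
above.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): fill bidx[i] with the total
 * number of 8-byte words used by the filters of commits 0..i.
 */
static void build_bidx(const uint32_t *filter_len, size_t nr, uint32_t *bidx)
{
	uint32_t total = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		total += filter_len[i];
		bidx[i] = total;
	}
}

/*
 * Recover the word range [*start, *end) of filter i, treating BIDX[-1]
 * as 0. Offsets are relative to the end of the BDAT header.
 */
static void filter_span(const uint32_t *bidx, size_t i,
			uint32_t *start, uint32_t *end)
{
	*start = i ? bidx[i - 1] : 0;
	*end = bidx[i];
}
```

With filter lengths 2, 0, 3, 1 this yields BIDX = 2, 2, 5, 6, so the
second commit's filter is the empty range [2, 2).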
> +
> + Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
> + * It starts with header consisting of three unsigned 32-bit integers:
> + - Version of the hash algorithm being used. We currently only support
> + value 1 which implies the murmur3 hash implemented exactly as described
> + in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
First, a minor issue: shouldn't this nested unordered list be indented
with a hanging indent formatted with spaces? That is, be formatted like
the following:
+ Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
+ * It starts with header consisting of three unsigned 32-bit integers:
+ - Version of the hash algorithm being used. We currently only support
+ value 1 which implies the murmur3 hash implemented exactly as
+ described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
But the existing formatting with spaces and tabs might be fine as it is,
that is, it renders as a nested list with Asciidoc; it only looks a bit
weird as a patch, not as text.
Second, and more important: in my opinion this is not enough information,
at least if we assume that the information in this document should
be enough for a clean-room reimplementation of the Bloom filter
functionality (for example by JGit). To generate compatible Bloom filters,
one also needs to know how to create $k$ functionally independent hash
functions out of the murmur3 hash. We currently do it using the double
hashing technique; if that changes, then the exact set of bits set in the
Bloom filter would also change.
The additional description could look something like the following:
+ * It starts with header consisting of three unsigned 32-bit integers:
+ - Version of the hash algorithm being used. We currently only support
+ value 1 which implies the murmur3_32 hash implemented exactly as
+ described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
+ and double hashing technique with 0x293ae76f and 0x7e646e2c seeds
+ as described in https://doi.org/10.1007/978-3-540-30494-4_26
+ "Bloom Filters in Probabilistic Verification"
Also, it should be explicitly noted that we use murmur3_32, because
there is also a 128-bit version of the murmur3 hash.
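For illustration, the combination being discussed could be sketched as
follows. This is not the code from the series: murmur3_32() is written
from the Wikipedia description, and bloom_bit() is a hypothetical helper
that assumes the i-th probe is h0 + i * h1 reduced modulo the filter size
in bits, with the two seeds quoted above.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t rotl32(uint32_t x, int r)
{
	return (x << r) | (x >> (32 - r));
}

/* murmur3_32, following the algorithm on the Wikipedia page cited above */
static uint32_t murmur3_32(const unsigned char *data, size_t len, uint32_t seed)
{
	const uint32_t c1 = 0xcc9e2d51, c2 = 0x1b873593;
	uint32_t h = seed, k;
	size_t i, nblocks = len / 4;

	for (i = 0; i < nblocks; i++) {
		/* assemble each 4-byte block as a little-endian word */
		k = (uint32_t)data[4 * i] |
		    ((uint32_t)data[4 * i + 1] << 8) |
		    ((uint32_t)data[4 * i + 2] << 16) |
		    ((uint32_t)data[4 * i + 3] << 24);
		k *= c1;
		k = rotl32(k, 15);
		k *= c2;
		h ^= k;
		h = rotl32(h, 13);
		h = h * 5 + 0xe6546b64;
	}

	/* mix in the 1..3 trailing bytes, if any */
	k = 0;
	switch (len & 3) {
	case 3:
		k ^= (uint32_t)data[4 * nblocks + 2] << 16;
		/* fallthrough */
	case 2:
		k ^= (uint32_t)data[4 * nblocks + 1] << 8;
		/* fallthrough */
	case 1:
		k ^= data[4 * nblocks];
		k *= c1;
		k = rotl32(k, 15);
		k *= c2;
		h ^= k;
	}

	/* finalization mix */
	h ^= (uint32_t)len;
	h ^= h >> 16;
	h *= 0x85ebca6b;
	h ^= h >> 13;
	h *= 0xc2b2ae35;
	h ^= h >> 16;
	return h;
}

/*
 * Hypothetical helper: bit position of the i-th probe for 'path' in a
 * filter of 'nbits' bits (nbits must be nonzero), via double hashing.
 */
static uint32_t bloom_bit(const char *path, size_t len, uint32_t i,
			  uint32_t nbits)
{
	uint32_t h0 = murmur3_32((const unsigned char *)path, len, 0x293ae76f);
	uint32_t h1 = murmur3_32((const unsigned char *)path, len, 0x7e646e2c);

	return (h0 + i * h1) % nbits;
}
```

A reimplementation would have to agree on every detail shown here (seeds,
probe formula, modulus) to set the same bits, which is why the format
document should spell them out.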
> + - The number of times a path is hashed and hence the number of bit positions
> + that cumulatively determine whether a file is present in the commit.
All right, in the original Bloom filter it was the number of different
hash functions. With the double hashing technique, it is the number of
times a path is hashed.
> + - The minimum number of bits 'b' per entry in the Bloom filter. If the filter
> + contains 'n' entries, then the filter size is the minimum number of 64-bit
> + words that contain n*b bits.
All right, that means an empty Bloom filter, representing "no changes",
with 'n' equal to 0 entries, is represented as a size 0 filter. That is, if
we read this rule exactly as written.
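Read literally, the sizing rule is a rounded-up division. A minimal
sketch, where filter_len_words() is a hypothetical name and not a
function from the patch:

```c
#include <stdint.h>

/* Smallest number of 64-bit words that hold n * b bits; 0 when n is 0. */
static uint32_t filter_len_words(uint32_t n, uint32_t b)
{
	uint64_t bits = (uint64_t)n * b;

	return (uint32_t)((bits + 63) / 64);
}
```

For example, with b = 10 bits per entry, 7 changed paths need 70 bits and
therefore 2 words, while 0 changed paths give a filter of length 0.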
Should we add the information that a size 0 / length 0 filter is
considered the "no data" case? Or should we leave it to the implementation?
There are two corner cases:
- the "no changes" case, where all queries are answered with "no";
it can be represented as a filter of size 0, or as a Bloom filter with all
bits set to 0
- the "no data" case (used when there are more than 512 changed files),
where all queries are answered with "maybe"; it is currently represented
as a filter of size 0, and can also be represented as a Bloom filter with
all bits set to 1
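The difference between the two cases shows up on the query side. Below is
a sketch under stated assumptions: struct filter_view and filter_query()
are hypothetical, the bit numbering inside a word is an assumption, and a
length 0 filter is treated as "no data" as in the current series.

```c
#include <stddef.h>
#include <stdint.h>

enum bloom_answer { BLOOM_NO = 0, BLOOM_MAYBE = 1 };

/* Hypothetical view of one filter: 'len' counts 64-bit words. */
struct filter_view {
	const uint64_t *data;
	uint32_t len;
};

/* 'hashes' holds the already computed probe values for one path. */
static enum bloom_answer filter_query(const struct filter_view *f,
				      const uint32_t *hashes, int nr)
{
	uint32_t nbits;
	int i;

	/* size 0 is read as "no data": we cannot rule anything out */
	if (!f->len)
		return BLOOM_MAYBE;

	nbits = f->len * 64;
	for (i = 0; i < nr; i++) {
		uint32_t pos = hashes[i] % nbits;

		/* one clear bit is enough to answer a definite "no" */
		if (!(f->data[pos / 64] & ((uint64_t)1 << (pos % 64))))
			return BLOOM_NO;
	}
	return BLOOM_MAYBE;
}
```

An all-zero filter answers "no" to every query and an all-ones filter
answers "maybe" to every query, so either corner case could be encoded
without giving length 0 a special meaning.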
> + * The rest of the chunk is the concatenation of all the computed Bloom
> + filters for the commits in lexicographic order.
All right.
> + * The BDAT chunk is present iff BIDX is present.
Perhaps we should spell 'iff' in full, that is 'if and only if'?
> +
> Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
> This list of H-byte hashes describe a set of B commit-graph files that
> form a commit-graph chain. The graph position for the ith commit in this
> diff --git a/commit-graph.c b/commit-graph.c
> index 32a315058f..4585b3b702 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -24,8 +24,10 @@
> #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
> #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
> #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
> +#define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
> +#define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
> #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
> -#define MAX_NUM_CHUNKS 5
> +#define MAX_NUM_CHUNKS 7
>
> #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
>
> @@ -325,6 +327,32 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
> chunk_repeated = 1;
> else
> graph->chunk_base_graphs = data + chunk_offset;
> + break;
> +
> + case GRAPH_CHUNKID_BLOOMINDEXES:
> + if (graph->chunk_bloom_indexes)
> + chunk_repeated = 1;
> + else
> + graph->chunk_bloom_indexes = data + chunk_offset;
> + break;
> +
> + case GRAPH_CHUNKID_BLOOMDATA:
> + if (graph->chunk_bloom_data)
> + chunk_repeated = 1;
> + else {
> + uint32_t hash_version;
> + graph->chunk_bloom_data = data + chunk_offset;
> + hash_version = get_be32(data + chunk_offset);
> +
> + if (hash_version != 1)
> + break;
Shouldn't we mark the Bloom filter as not to be used? Or is that left for
a later commit?
In the future it might be a good idea to notify the user (perhaps
protected with some advice.* option) that there is a problem with the Bloom
filter data, namely that we have encountered an unsupported hash version.
> +
> + graph->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
Why is this structure allocated dynamically? We are leaking an admittedly
small amount of memory, because we never free this xmalloc() result.
If we need this field to be a pointer to struct so that NULL means no
supported Bloom filter data, we could use the chunk_bloom_*
fields instead - we can set at least one of them to NULL.
> + graph->bloom_filter_settings->hash_version = hash_version;
> + graph->bloom_filter_settings->num_hashes = get_be32(data + chunk_offset + 4);
> + graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
All right; these 4 and 8 are sizeof(uint32_t) and 2*sizeof(uint32_t),
respectively.
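As a standalone illustration of the header layout (three big-endian
32-bit fields at offsets 0, 4 and 8), here is a small sketch. The names
struct bdat_header, read_be32() and parse_bdat_header() are hypothetical
stand-ins, not the identifiers used in the patch.

```c
#include <stdint.h>

/* Hypothetical mirror of the three BDAT header fields. */
struct bdat_header {
	uint32_t hash_version;
	uint32_t num_hashes;
	uint32_t bits_per_entry;
};

/* Decode a 32-bit big-endian value from an unaligned byte pointer. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Returns 0 on success, -1 if the hash version is not supported. */
static int parse_bdat_header(const unsigned char *chunk,
			     struct bdat_header *out)
{
	out->hash_version = read_be32(chunk);
	if (out->hash_version != 1)
		return -1;

	out->num_hashes = read_be32(chunk + sizeof(uint32_t));
	out->bits_per_entry = read_be32(chunk + 2 * sizeof(uint32_t));
	return 0;
}
```

Returning an explicit error for an unknown version gives the caller one
place to decide that the Bloom data must be ignored.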
> + }
> + break;
> }
>
> if (chunk_repeated) {
> @@ -343,6 +371,17 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
> last_chunk_offset = chunk_offset;
> }
>
> + /* We need both the bloom chunks to exist together. Else ignore the data */
> + if ((graph->chunk_bloom_indexes && !graph->chunk_bloom_data)
> + || (!graph->chunk_bloom_indexes && graph->chunk_bloom_data)) {
> + graph->chunk_bloom_indexes = NULL;
> + graph->chunk_bloom_data = NULL;
> + graph->bloom_filter_settings = NULL;
> + }
> +
> + if (graph->chunk_bloom_indexes && graph->chunk_bloom_data)
> + load_bloom_filters();
Wouldn't it be simpler to rely on the fact that both Bloom chunks must
exist for them to matter, and write it like this:
+ if (graph->chunk_bloom_indexes && graph->chunk_bloom_data) {
+ load_bloom_filters();
+ } else {
+ graph->chunk_bloom_indexes = NULL;
+ graph->chunk_bloom_data = NULL;
+ graph->bloom_filter_settings = NULL;
+ }
> +
> hashcpy(graph->oid.hash, graph->data + graph->data_len - graph->hash_len);
>
> if (verify_commit_graph_lite(graph)) {
> @@ -1040,6 +1079,59 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
> }
> }
>
> +static void write_graph_chunk_bloom_indexes(struct hashfile *f,
> + struct write_commit_graph_context *ctx)
> +{
> + struct commit **list = ctx->commits.list;
> + struct commit **last = ctx->commits.list + ctx->commits.nr;
> + uint32_t cur_pos = 0;
> + struct progress *progress = NULL;
> + int i = 0;
> +
> + if (ctx->report_progress)
> + progress = start_delayed_progress(
> + _("Writing changed paths Bloom filters index"),
> + ctx->commits.nr);
> +
> + while (list < last) {
> + struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
> + cur_pos += filter->len;
> + display_progress(progress, ++i);
> + hashwrite_be32(f, cur_pos);
> + list++;
> + }
> +
> + stop_progress(&progress);
> +}
All right, looks good.
> +
> +static void write_graph_chunk_bloom_data(struct hashfile *f,
> + struct write_commit_graph_context *ctx,
> + struct bloom_filter_settings *settings)
> +{
> + struct commit **list = ctx->commits.list;
> + struct commit **last = ctx->commits.list + ctx->commits.nr;
> + struct progress *progress = NULL;
> + int i = 0;
> +
> + if (ctx->report_progress)
> + progress = start_delayed_progress(
> + _("Writing changed paths Bloom filters data"),
> + ctx->commits.nr);
> +
> + hashwrite_be32(f, settings->hash_version);
> + hashwrite_be32(f, settings->num_hashes);
> + hashwrite_be32(f, settings->bits_per_entry);
> +
> + while (list < last) {
> + struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
> + display_progress(progress, ++i);
> + hashwrite(f, filter->data, filter->len * sizeof(uint64_t));
> + list++;
> + }
> +
> + stop_progress(&progress);
> +}
All right, looks good.
Side note: why use a while loop here instead of a for loop, like in
previous patches? I'm not saying this is a bad idea (especially with
the same names for the same variables).
> +
> static int oid_compare(const void *_a, const void *_b)
> {
> const struct object_id *a = (const struct object_id *)_a;
> @@ -1198,8 +1290,8 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
> load_bloom_filters();
>
> if (ctx->report_progress)
> - progress = start_progress(
> - _("Computing commit diff Bloom filters"),
> + progress = start_delayed_progress(
> + _("Computing changed paths Bloom filters"),
> ctx->commits.nr);
>
Oops. This looks like a fixup that should be made to the
earlier commit instead, doesn't it?
> ALLOC_ARRAY(sorted_by_pos, ctx->commits.nr);
> @@ -1444,6 +1536,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
> struct strbuf progress_title = STRBUF_INIT;
> int num_chunks = 3;
> struct object_id file_hash;
> + struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
>
> if (ctx->split) {
> struct strbuf tmp_file = STRBUF_INIT;
> @@ -1488,6 +1581,12 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
> chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
> num_chunks++;
> }
> + if (ctx->changed_paths) {
> + chunk_ids[num_chunks] = GRAPH_CHUNKID_BLOOMINDEXES;
> + num_chunks++;
> + chunk_ids[num_chunks] = GRAPH_CHUNKID_BLOOMDATA;
> + num_chunks++;
> + }
All right, adding chunks and counting them.
> if (ctx->num_commit_graphs_after > 1) {
> chunk_ids[num_chunks] = GRAPH_CHUNKID_BASE;
> num_chunks++;
> @@ -1506,6 +1605,15 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
> 4 * ctx->num_extra_edges;
> num_chunks++;
> }
> + if (ctx->changed_paths) {
> + chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
> + sizeof(uint32_t) * ctx->commits.nr;
> + num_chunks++;
> +
> + chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
> + sizeof(uint32_t) * 3 + ctx->total_bloom_filter_data_size;
> + num_chunks++;
> + }
All right, calculating chunk offsets.
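The arithmetic in this hunk can be restated as two size formulas. A
minimal sketch, assuming total_bloom_filter_data_size counts bytes; the
function names are hypothetical:

```c
#include <stdint.h>

/* BIDX: one 32-bit cumulative count per commit. */
static uint64_t bidx_chunk_size(uint32_t nr_commits)
{
	return (uint64_t)sizeof(uint32_t) * nr_commits;
}

/* BDAT: three 32-bit header fields followed by all filter data. */
static uint64_t bdat_chunk_size(uint64_t total_filter_bytes)
{
	return 3 * (uint64_t)sizeof(uint32_t) + total_filter_bytes;
}
```

Each chunk's end offset is then the previous offset plus the matching
size, e.g. 5 commits with 48 bytes of filter data occupy 20 bytes of BIDX
and 60 bytes of BDAT.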
> if (ctx->num_commit_graphs_after > 1) {
> chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
> hashsz * (ctx->num_commit_graphs_after - 1);
> @@ -1543,6 +1651,10 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
> write_graph_chunk_data(f, hashsz, ctx);
> if (ctx->num_extra_edges)
> write_graph_chunk_extra_edges(f, ctx);
> + if (ctx->changed_paths) {
> + write_graph_chunk_bloom_indexes(f, ctx);
> + write_graph_chunk_bloom_data(f, ctx, &bloom_settings);
> + }
All right, writing BIDX and BDAT chunks with default settings.
By the way, in the future, when appending to an existing commit-graph file,
shouldn't we re-use the existing settings even if they differ from the
default settings? But that is a question for the future...
> if (ctx->num_commit_graphs_after > 1 &&
> write_graph_chunk_base(f, ctx)) {
> return -1;
> diff --git a/commit-graph.h b/commit-graph.h
> index 952a4b83be..25fefefb3e 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -10,6 +10,7 @@
> #define GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD"
>
> struct commit;
> +struct bloom_filter_settings;
>
> char *get_commit_graph_filename(const char *obj_dir);
> int open_commit_graph(const char *graph_file, int *fd, struct stat *st);
> @@ -58,6 +59,10 @@ struct commit_graph {
> const unsigned char *chunk_commit_data;
> const unsigned char *chunk_extra_edges;
> const unsigned char *chunk_base_graphs;
> + const unsigned char *chunk_bloom_indexes;
> + const unsigned char *chunk_bloom_data;
All right.
> +
> + struct bloom_filter_settings *bloom_filter_settings;
Why is it a pointer to struct, instead of being just a struct?
Is there a reason for that?
> };
>
> struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st);
> @@ -77,7 +82,7 @@ enum commit_graph_write_flags {
> COMMIT_GRAPH_WRITE_SPLIT = (1 << 2),
> /* Make sure that each OID in the input is a valid commit OID. */
> COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3),
> - COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4)
> + COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4),
This looks like an accidental change; if we want to use a trailing comma in
the enum, this change should in my opinion be done in the commit that added
COMMIT_GRAPH_WRITE_BLOOM_FILTERS (as I have written in a comment there).
> };
>
> struct split_commit_graph_opts {
Thank you for your work on this series.
Best,
--
Jakub Narębski
* Re: [PATCH v2 07/11] commit-graph: write Bloom filters to commit graph file
2020-02-19 15:13 ` Jakub Narebski
@ 2020-02-24 21:14 ` Garima Singh
2020-02-25 11:40 ` Jakub Narebski
0 siblings, 1 reply; 159+ messages in thread
From: Garima Singh @ 2020-02-24 21:14 UTC (permalink / raw)
To: Jakub Narebski, Garima Singh via GitGitGadget
Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
Jeff Hostetler, Taylor Blau, Jeff King, Christian Couder,
Emily Shaffer, Junio C Hamano, Garima Singh
On 2/19/2020 10:13 AM, Jakub Narebski wrote:
> "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> From: Garima Singh <garima.singh@microsoft.com>
>>
>> Update the technical documentation for commit-graph-format with the formats for
>> the Bloom filter index (BIDX) and Bloom filter data (BDAT) chunks. Write the
>> computed Bloom filters information to the commit graph file using this format.
>
> Nice description.
>
> The only minor nitpick is with the formating: it is 80-character wide,
> which is a bit wide.
>
Fixed in v3. Thanks!
>>
>> Helped-by: Derrick Stolee <dstolee@microsoft.com>
>> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
>> ---
>> .../technical/commit-graph-format.txt | 24 ++++
>> commit-graph.c | 118 +++++++++++++++++-
>> commit-graph.h | 7 +-
>> 3 files changed, 145 insertions(+), 4 deletions(-)
>>
>> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
>> index a4f17441ae..22e511643d 100644
>> --- a/Documentation/technical/commit-graph-format.txt
>> +++ b/Documentation/technical/commit-graph-format.txt
>> @@ -17,6 +17,9 @@ metadata, including:
>> - The parents of the commit, stored using positional references within
>> the graph file.
>>
>> +- The Bloom filter of the commit carrying the paths that were changed between
>> + the commit and its first parent.
>> +
>
> All right.
>
> Should we also state that it is optional (meta)data? This would be
> first optional piece of data stored in commit-graph, I think.
>
However, the entire commit graph file is non-critical metadata, since git
commands work just fine without it, just slower. The same applies to the
changed path Bloom filters.
Based on the definition of optional you are suggesting, edge data is also
optional, because not every commit-graph has octopus merges.
>> +
>> + Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
>> + * It starts with header consisting of three unsigned 32-bit integers:
>> + - Version of the hash algorithm being used. We currently only support
>> + value 1 which implies the murmur3 hash implemented exactly as described
>> + in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
>
> First a minor issue: shouldn't this nested unordered list be indented
> with a hanging indent formatted with spaces? That is be formatted like
> the following:
>
> + Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
> + * It starts with header consisting of three unsigned 32-bit integers:
> + - Version of the hash algorithm being used. We currently only support
> + value 1 which implies the murmur3 hash implemented exactly as
> + described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
>
> But the existing formatting with spaces and tabs might be fine as it is,
> that is it renders as nested list with Asciidoc; it only looks a bit
> weird as patch, not so as text.
>
> Second, and more important: it is in my opinion not enough information,
> at least if we are assuming that the information in this document should
> be enough for clean-room reimplementation of Bloom filter functionality
> (for example by JGit). To generate compatible Bloom filters, one needs
> also the information on how to create $k$ functionally-independent hash
> functions out of murmur3 hash. We do it currently using double hashing
> technique; if that changes then the exact set of bits in the Bloom
> filter would also change.
>
> The additional description could look something like the following:
>
> + * It starts with header consisting of three unsigned 32-bit integers:
> + - Version of the hash algorithm being used. We currently only support
> + value 1 which implies the murmur3_32 hash implemented exactly as
> + described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
> + and double hashing technique with 0x293ae76f and 0x7e646e2c seeds
> + as described in https://doi.org/10.1007/978-3-540-30494-4_26
> + "Bloom Filters in Probabilistic Verification"
>
> Also, it should be explicitly noted that we use murmur3_32, because
> there is also 128-bit version of murmur3 hash.
>
I will incorporate this in. Thanks!
>> + * The BDAT chunk is present iff BIDX is present.
>
> Perhaps we should spell 'iff' in full, that is 'if and only if'?
>
Sure.
>> +
>> Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
>> This list of H-byte hashes describe a set of B commit-graph files that
>> form a commit-graph chain. The graph position for the ith commit in this
>> diff --git a/commit-graph.c b/commit-graph.c
>> index 32a315058f..4585b3b702 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -24,8 +24,10 @@
>> #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
>> #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
>> #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
>> +#define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
>> +#define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
>> #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
>> -#define MAX_NUM_CHUNKS 5
>> +#define MAX_NUM_CHUNKS 7
>>
>> #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
>>
>> @@ -325,6 +327,32 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
>> chunk_repeated = 1;
>> else
>> graph->chunk_base_graphs = data + chunk_offset;
>> + break;
>> +
>> + case GRAPH_CHUNKID_BLOOMINDEXES:
>> + if (graph->chunk_bloom_indexes)
>> + chunk_repeated = 1;
>> + else
>> + graph->chunk_bloom_indexes = data + chunk_offset;
>> + break;
>> +
>> + case GRAPH_CHUNKID_BLOOMDATA:
>> + if (graph->chunk_bloom_data)
>> + chunk_repeated = 1;
>> + else {
>> + uint32_t hash_version;
>> + graph->chunk_bloom_data = data + chunk_offset;
>> + hash_version = get_be32(data + chunk_offset);
>> +
>> + if (hash_version != 1)
>> + break;
>
> Shouldn't we mark Bloom filter as not to be used? Or is it left for
> later commit?
>
We take care of this in line 375.
> In the future it might be good idea to notify the user (perhaps
> protected with some advice.* option) that there is problem with Bloom
> filter data, namely that we have encountered unsupported hash version.
>
>> +
>> + graph->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
>
> Why is this structure allocated dynamically? We are leaking admittedly
> a small amount of memory because we never free this xmalloc() result.
>
> If we need this field being a pointer to struct to have NULL mean no
> supported Bloom filter data, we could have instead use chunk_bloom_*
> fields instead - we can set at least one of them to NULL.
>
I am freeing this in free_commit_graph(), but I failed to put that change
in the right commit. Sorry about that. Fixed in v3.
Also as discussed in https://lore.kernel.org/git/3b7d77a1-aed9-d202-8646-4b964cb965db@gmail.com/
there is a bug in commit-graph.c where we should be calling free_commit_graph() instead of
just free(graph). I will do this in a separate series.
>> + }
>> + break;
>> }
>>
>> if (chunk_repeated) {
>> @@ -343,6 +371,17 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
>> last_chunk_offset = chunk_offset;
>> }
>>
>> + /* We need both the bloom chunks to exist together. Else ignore the data */
>> + if ((graph->chunk_bloom_indexes && !graph->chunk_bloom_data)
>> + || (!graph->chunk_bloom_indexes && graph->chunk_bloom_data)) {
>> + graph->chunk_bloom_indexes = NULL;
>> + graph->chunk_bloom_data = NULL;
>> + graph->bloom_filter_settings = NULL;
>> + }
>> +
>> + if (graph->chunk_bloom_indexes && graph->chunk_bloom_data)
>> + load_bloom_filters();
>
> Wouldn't it be simpler to rely on the fact that both Bloom chunks must
> exists for it to matter, and write it like this:
>
> + if (graph->chunk_bloom_indexes && graph->chunk_bloom_data) {
> + load_bloom_filters();
> + } else {
> + graph->chunk_bloom_indexes = NULL;
> + graph->chunk_bloom_data = NULL;
> + graph->bloom_filter_settings = NULL;
> + }
>
:) Yes. Fixed in v3.
>> +
>> static int oid_compare(const void *_a, const void *_b)
>> {
>> const struct object_id *a = (const struct object_id *)_a;
>> @@ -1198,8 +1290,8 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>> load_bloom_filters();
>>
>> if (ctx->report_progress)
>> - progress = start_progress(
>> - _("Computing commit diff Bloom filters"),
>> + progress = start_delayed_progress(
>> + _("Computing changed paths Bloom filters"),
>> ctx->commits.nr);
>>
>
> Ooops. This look like a fixup which should be made to the original
> earlier commit instead, isn't it?
Yes. Should have been in a previous commit. Fixed in v3.
>> };
>>
>> struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st);
>> @@ -77,7 +82,7 @@ enum commit_graph_write_flags {
>> COMMIT_GRAPH_WRITE_SPLIT = (1 << 2),
>> /* Make sure that each OID in the input is a valid commit OID. */
>> COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3),
>> - COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4)
>> + COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4),
>
> This looks like accidental change; if we want to use trailing comma in
> enum, this change should be in my opinion done in the commit that added
> COMMIT_GRAPH_WRITE_BLOOM_FILTERS (as I have written in a comment there).
>
Yes, I noticed the lack of the comma later and forgot to move it to the right
commit. Fixed in v3.
>
> Thank you for your work on this series.
>
> Best,
>
* Re: [PATCH v2 07/11] commit-graph: write Bloom filters to commit graph file
2020-02-24 21:14 ` Garima Singh
@ 2020-02-25 11:40 ` Jakub Narebski
2020-02-25 15:58 ` Garima Singh
0 siblings, 1 reply; 159+ messages in thread
From: Jakub Narebski @ 2020-02-25 11:40 UTC (permalink / raw)
To: Garima Singh
Cc: Garima Singh via GitGitGadget, git, Derrick Stolee,
SZEDER Gábor, Jonathan Tan, Jeff Hostetler, Taylor Blau,
Jeff King, Christian Couder, Emily Shaffer, Junio C Hamano,
Garima Singh
Garima Singh <garimasigit@gmail.com> writes:
> On 2/19/2020 10:13 AM, Jakub Narebski wrote:
>> "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
[...]
>>> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
>>> index a4f17441ae..22e511643d 100644
>>> --- a/Documentation/technical/commit-graph-format.txt
>>> +++ b/Documentation/technical/commit-graph-format.txt
>>> @@ -17,6 +17,9 @@ metadata, including:
>>> - The parents of the commit, stored using positional references within
>>> the graph file.
>>>
>>> +- The Bloom filter of the commit carrying the paths that were changed between
>>> + the commit and its first parent.
>>> +
>>
>> All right.
>>
>> Should we also state that it is optional (meta)data? This would be
>> first optional piece of data stored in commit-graph, I think.
>>
>
> However the entire commit graph file is non critical metadata since git commands
> work just fine without it, just slower. The same applies to the changed path
> bloom filters.
>
> Based on the definition of optional you are suggesting, edge data is optional
> because not every commit-graph has octopus merges.
Well, edge data (EDGE chunk) is optional in a different way from Bloom
filter data. The former depends on the repository (whether there are
any octopus merges), the latter is an opt-in user choice (whether to run
`git commit-graph write` with the `--changed-paths` option, or in the
future an equivalent config option).
To provide some advice that can be acted upon: perhaps it would be
better to start with "It can store", or end with "if requested" or
"optionally". For example, the change could look like the following
suggestion:
The Git commit graph stores a list of commit OIDs and some associated
metadata, including:
[...]
+- The Bloom filter of the commit carrying the paths that were changed between
+ the commit and its first parent, if requested.
+
Best,
--
Jakub Narębski
* Re: [PATCH v2 07/11] commit-graph: write Bloom filters to commit graph file
2020-02-25 11:40 ` Jakub Narebski
@ 2020-02-25 15:58 ` Garima Singh
0 siblings, 0 replies; 159+ messages in thread
From: Garima Singh @ 2020-02-25 15:58 UTC (permalink / raw)
To: Jakub Narebski
Cc: Garima Singh via GitGitGadget, git, Derrick Stolee,
SZEDER Gábor, Jonathan Tan, Jeff Hostetler, Taylor Blau,
Jeff King, Christian Couder, Emily Shaffer, Junio C Hamano,
Garima Singh
On 2/25/2020 6:40 AM, Jakub Narebski wrote:
> Garima Singh <garimasigit@gmail.com> writes:
>> On 2/19/2020 10:13 AM, Jakub Narebski wrote:
>>> "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
> [...]
>>>> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
>>>> index a4f17441ae..22e511643d 100644
>>>> --- a/Documentation/technical/commit-graph-format.txt
>>>> +++ b/Documentation/technical/commit-graph-format.txt
>>>> @@ -17,6 +17,9 @@ metadata, including:
>>>> - The parents of the commit, stored using positional references within
>>>> the graph file.
>>>>
>>>> +- The Bloom filter of the commit carrying the paths that were changed between
>>>> + the commit and its first parent.
>>>> +
>>>
>>> All right.
>>>
>>> Should we also state that it is optional (meta)data? This would be
>>> first optional piece of data stored in commit-graph, I think.
>>>
>>
>> However the entire commit graph file is non critical metadata since git commands
>> work just fine without it, just slower. The same applies to the changed path
>> bloom filters.
>>
>> Based on the definition of optional you are suggesting, edge data is optional
>> because not every commit-graph has octopus merges.
>
> Well, edge data (EDGE chunk) is optional in different way from Bloom
> filter data. The former depends on the repository (whether there are
> octopus merges used), the latter is opt-in user choice (whether to run
> `git commit-graph write` with the `--changed-paths` option, or in the
> future equivalent config option).
>
> To provide some advise that can be acted upon: perhaps it would be
> better to start with "It can store", or end with "if requested" or
> "optionally". For example the change could look like the following
> suggestion:
>
>
> The Git commit graph stores a list of commit OIDs and some associated
> metadata, including:
> [...]
> +- The Bloom filter of the commit carrying the paths that were changed between
> + the commit and its first parent, if requested.
> +
>
> Best,
>
Sure. That makes sense. Will incorporate in v3.
Cheers!
Garima Singh
* [PATCH v2 08/11] commit-graph: reuse existing Bloom filters during write.
2020-02-05 22:56 ` [PATCH v2 00/11] " Garima Singh via GitGitGadget
` (6 preceding siblings ...)
2020-02-05 22:56 ` [PATCH v2 07/11] commit-graph: write Bloom filters to commit graph file Garima Singh via GitGitGadget
@ 2020-02-05 22:56 ` Garima Singh via GitGitGadget
2020-02-20 18:48 ` Jakub Narebski
2020-02-05 22:56 ` [PATCH v2 09/11] commit-graph: add --changed-paths option to write subcommand Garima Singh via GitGitGadget
` (5 subsequent siblings)
13 siblings, 1 reply; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-02-05 22:56 UTC (permalink / raw)
To: git
Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
garimasigit, jnareb, christian.couder, emilyshaffer, gitster,
Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
Read previously computed Bloom filters from the commit-graph file if
possible to avoid recomputing during commit-graph write.
See Documentation/technical/commit-graph-format.txt for the format in which
the Bloom filter information is written to the commit graph file.
To read Bloom filter for a given commit with lexicographic position
'i' we need to:
1. Read BIDX[i] which essentially gives us the starting index in BDAT for
filter of commit i+1. It is essentially the index past the end
of the filter of commit i. It is called end_index in the code.
2. For i>0, read BIDX[i-1] which will give us the starting index in BDAT
for filter of commit i. It is called the start_index in the code.
For the first commit, where i = 0, Bloom filter data starts at the
beginning, just past the header in the BDAT chunk. Hence, start_index
will be 0.
3. The length of the filter will be end_index - start_index, because
BIDX[i] gives the cumulative 8-byte words including the ith
commit's filter.
We toggle whether Bloom filters should be recomputed based on the
compute_if_null flag.
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
bloom.c | 49 ++++++++++++++++++++++++++++++++++++++++++-
bloom.h | 4 +++-
commit-graph.c | 7 ++++---
t/helper/test-bloom.c | 2 +-
4 files changed, 56 insertions(+), 6 deletions(-)
diff --git a/bloom.c b/bloom.c
index 818382c03b..90d84dc713 100644
--- a/bloom.c
+++ b/bloom.c
@@ -1,5 +1,7 @@
#include "git-compat-util.h"
#include "bloom.h"
+#include "commit.h"
+#include "commit-slab.h"
#include "commit-graph.h"
#include "object-store.h"
#include "diff.h"
@@ -127,8 +129,39 @@ void add_key_to_filter(struct bloom_key *key,
}
}
+static int load_bloom_filter_from_graph(struct commit_graph *g,
+ struct bloom_filter *filter,
+ struct commit *c)
+{
+ uint32_t lex_pos, start_index, end_index;
+
+ while (c->graph_pos < g->num_commits_in_base)
+ g = g->base_graph;
+
+ /* The commit graph commit 'c' lives in doesn't carry bloom filters. */
+ if (!g->chunk_bloom_indexes)
+ return 0;
+
+ lex_pos = c->graph_pos - g->num_commits_in_base;
+
+ end_index = get_be32(g->chunk_bloom_indexes + 4 * lex_pos);
+
+ if (lex_pos)
+ start_index = get_be32(g->chunk_bloom_indexes + 4 * (lex_pos - 1));
+ else
+ start_index = 0;
+
+ filter->len = end_index - start_index;
+ filter->data = (uint64_t *)(g->chunk_bloom_data +
+ sizeof(uint64_t) * start_index +
+ BLOOMDATA_CHUNK_HEADER_SIZE);
+
+ return 1;
+}
+
struct bloom_filter *get_bloom_filter(struct repository *r,
- struct commit *c)
+ struct commit *c,
+ int compute_if_not_present)
{
struct bloom_filter *filter;
struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
@@ -141,6 +174,20 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
filter = bloom_filter_slab_at(&bloom_filters, c);
+ if (!filter->data) {
+ load_commit_graph_info(r, c);
+ if (c->graph_pos != COMMIT_NOT_FROM_GRAPH &&
+ r->objects->commit_graph->chunk_bloom_indexes) {
+ if (load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
+ return filter;
+ else
+ return NULL;
+ }
+ }
+
+ if (filter->data || !compute_if_not_present)
+ return filter;
+
repo_diff_setup(r, &diffopt);
diffopt.flags.recursive = 1;
diffopt.max_changes = max_changes;
diff --git a/bloom.h b/bloom.h
index 7f40c751f7..76f8a9ad0c 100644
--- a/bloom.h
+++ b/bloom.h
@@ -13,6 +13,7 @@ struct bloom_filter_settings {
#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
#define BITS_PER_WORD 64
+#define BLOOMDATA_CHUNK_HEADER_SIZE 3*sizeof(uint32_t)
/*
* A bloom_filter struct represents a data segment to
@@ -47,7 +48,8 @@ void add_key_to_filter(struct bloom_key *key,
struct bloom_filter_settings *settings);
struct bloom_filter *get_bloom_filter(struct repository *r,
- struct commit *c);
+ struct commit *c,
+ int compute_if_not_present);
int bloom_filter_contains(struct bloom_filter *filter,
struct bloom_key *key,
diff --git a/commit-graph.c b/commit-graph.c
index 4585b3b702..c0e9834bf2 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1094,7 +1094,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
ctx->commits.nr);
while (list < last) {
- struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
+ struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
cur_pos += filter->len;
display_progress(progress, ++i);
hashwrite_be32(f, cur_pos);
@@ -1123,7 +1123,7 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
hashwrite_be32(f, settings->bits_per_entry);
while (list < last) {
- struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
+ struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
display_progress(progress, ++i);
hashwrite(f, filter->data, filter->len * sizeof(uint64_t));
list++;
@@ -1304,7 +1304,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
for (i = 0; i < ctx->commits.nr; i++) {
struct commit *c = sorted_by_pos[i];
- struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
+ struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
ctx->total_bloom_filter_data_size += sizeof(uint64_t) * filter->len;
display_progress(progress, i + 1);
}
@@ -2314,6 +2314,7 @@ void free_commit_graph(struct commit_graph *g)
g->data = NULL;
close(g->graph_fd);
}
+ free(g->bloom_filter_settings);
free(g->filename);
free(g);
}
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 331957011b..9b4be97f75 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -47,7 +47,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
struct bloom_filter *filter;
setup_git_directory();
c = lookup_commit(the_repository, commit_oid);
- filter = get_bloom_filter(the_repository, c);
+ filter = get_bloom_filter(the_repository, c, 1);
print_bloom_filter(filter);
}
--
gitgitgadget
* Re: [PATCH v2 08/11] commit-graph: reuse existing Bloom filters during write.
2020-02-05 22:56 ` [PATCH v2 08/11] commit-graph: reuse existing Bloom filters during write Garima Singh via GitGitGadget
@ 2020-02-20 18:48 ` Jakub Narebski
2020-02-24 21:45 ` Garima Singh
0 siblings, 1 reply; 159+ messages in thread
From: Jakub Narebski @ 2020-02-20 18:48 UTC (permalink / raw)
To: Garima Singh via GitGitGadget
Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
Jeff Hostetler, Taylor Blau, Jeff King, Garima Singh,
Christian Couder, Emily Shaffer, Junio C Hamano, Garima Singh
"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Garima Singh <garima.singh@microsoft.com>
>
> Read previously computed Bloom filters from the commit-graph file if
> possible to avoid recomputing during commit-graph write.
All right, what is written makes sense for this point in patch series.
But in my opinion it is more important to state that this commit adds
"parsing" of the Bloom filter data from commit-graph file. This means
that it needs to be calculated only once, then stored in commit-graph,
ready to be re-used.
>
> See Documentation/technical/commit-graph-format for the format in which
> the Bloom filter information is written to the commit graph file.
>
> To read Bloom filter for a given commit with lexicographic position
> 'i' we need to:
> 1. Read BIDX[i] which essentially gives us the starting index in BDAT for
> filter of commit i+1. It is essentially the index past the end
> of the filter of commit i. It is called end_index in the code.
>
> 2. For i>0, read BIDX[i-1] which will give us the starting index in BDAT
> for filter of commit i. It is called the start_index in the code.
> For the first commit, where i = 0, Bloom filter data starts at the
> beginning, just past the header in the BDAT chunk. Hence, start_index
> will be 0.
>
> 3. The length of the filter will be end_index - start_index, because
> BIDX[i] gives the cumulative 8-byte words including the ith
> commit's filter.
>
> We toggle whether Bloom filters should be recomputed based on the
> compute_if_null flag.
Nitpick: the flag (the parameter) is called compute_if_not_present, not
compute_if_null.
All right, this explanation is nice and clear.
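For readers following along, the three steps above can be sketched as a
small standalone C fragment. This is only an illustration under stated
assumptions, not the patch's code: read_be32() is a stand-in for Git's
get_be32(), and bidx_filter_start() / bidx_filter_len() are made-up names.
The buffer is assumed to hold the raw BIDX entries for a single layer.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for Git's get_be32(): read a big-endian 32-bit value. */
uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical helper: index (in 8-byte words) of the first word of the
 * filter for the commit at lexicographic position i. BIDX[i-1] is the
 * cumulative length of all earlier filters; commit 0 starts at word 0.
 */
uint32_t bidx_filter_start(const unsigned char *bidx, uint32_t i)
{
	return i > 0 ? read_be32(bidx + 4 * (size_t)(i - 1)) : 0;
}

/*
 * Hypothetical helper: length (in 8-byte words) of the filter for
 * commit i, computed as end_index - start_index.
 */
uint32_t bidx_filter_len(const unsigned char *bidx, uint32_t i)
{
	uint32_t end_index = read_be32(bidx + 4 * (size_t)i);

	return end_index - bidx_filter_start(bidx, i);
}
```

With BIDX entries 2, 2, 5, commit 0 owns words [0, 2), commit 1 has an
empty filter, and commit 2 owns words [2, 5).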
>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
> bloom.c | 49 ++++++++++++++++++++++++++++++++++++++++++-
> bloom.h | 4 +++-
> commit-graph.c | 7 ++++---
> t/helper/test-bloom.c | 2 +-
> 4 files changed, 56 insertions(+), 6 deletions(-)
>
> diff --git a/bloom.c b/bloom.c
> index 818382c03b..90d84dc713 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -1,5 +1,7 @@
> #include "git-compat-util.h"
> #include "bloom.h"
> +#include "commit.h"
> +#include "commit-slab.h"
> #include "commit-graph.h"
> #include "object-store.h"
> #include "diff.h"
> @@ -127,8 +129,39 @@ void add_key_to_filter(struct bloom_key *key,
> }
> }
>
> +static int load_bloom_filter_from_graph(struct commit_graph *g,
> + struct bloom_filter *filter,
> + struct commit *c)
> +{
> + uint32_t lex_pos, start_index, end_index;
> +
> + while (c->graph_pos < g->num_commits_in_base)
> + g = g->base_graph;
> +
> + /* The commit graph commit 'c' lives in doesn't carry bloom filters. */
> + if (!g->chunk_bloom_indexes)
> + return 0;
> +
> + lex_pos = c->graph_pos - g->num_commits_in_base;
All right, this finds the lexicographical position of the commit by
following the chain of incremental commit-graph files, and also checks
whether the commit-graph fragment that contains the commit in question
has Bloom filter data included.
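The layer walk can be illustrated with a reduced model. The struct below
is hypothetical and keeps only the fields the loop needs; it assumes, as
in the quoted code, that the base layer has num_commits_in_base == 0, so
the loop always terminates there.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced model of one layer in a split commit-graph. */
struct layer_sketch {
	uint32_t num_commits_in_base; /* commits in all layers below this one */
	int has_bloom;                /* nonzero if this layer has BIDX/BDAT */
	struct layer_sketch *base;    /* next layer down, NULL for the base */
};

/*
 * Walk down from the tip layer until reaching the layer that contains
 * graph_pos, and store the layer-local position in *lex_pos.
 */
const struct layer_sketch *find_layer(const struct layer_sketch *g,
				      uint32_t graph_pos,
				      uint32_t *lex_pos)
{
	while (graph_pos < g->num_commits_in_base)
		g = g->base;
	*lex_pos = graph_pos - g->num_commits_in_base;
	return g;
}
```

A caller would then check has_bloom on the returned layer, which mirrors
the !g->chunk_bloom_indexes test in the patch.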
> +
> + end_index = get_be32(g->chunk_bloom_indexes + 4 * lex_pos);
> +
> + if (lex_pos)
Wouldn't it be better to be more explicit, and write
+ if (lex_pos > 0)
> + start_index = get_be32(g->chunk_bloom_indexes + 4 * (lex_pos - 1));
> + else
> + start_index = 0;
All right, here we find start_index and end_index.
It might be a good idea to at least assert() that start_index <= end_index,
though a violation should not happen (that is why I propose that this check
be compiled in only for debug builds).
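A minimal sketch of such a check, assuming plain C assert() is acceptable
here; checked_filter_len() is a made-up helper name, not something in the
patch. Because both indexes are unsigned, a corrupted BIDX with
start_index > end_index would otherwise wrap around to a huge length.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: filter length in 8-byte words. The assert() is
 * compiled out when NDEBUG is defined, so non-debug builds pay nothing.
 */
uint32_t checked_filter_len(uint32_t start_index, uint32_t end_index)
{
	assert(start_index <= end_index);
	return end_index - start_index;
}
```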
> +
> + filter->len = end_index - start_index;
> + filter->data = (uint64_t *)(g->chunk_bloom_data +
> + sizeof(uint64_t) * start_index +
> + BLOOMDATA_CHUNK_HEADER_SIZE);
All right, nice use of constant.
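The byte offset being computed here can be written out as a standalone
sketch. The names below are illustrative only. The macro is parenthesized
in this sketch, which is an assumption about good practice rather than
what the patch does; it keeps the expansion safe if the macro is ever
used inside a larger expression.

```c
#include <stddef.h>
#include <stdint.h>

/* The BDAT chunk starts with three 32-bit settings fields. */
#define BDAT_HEADER_SIZE_SKETCH (3 * sizeof(uint32_t))

/*
 * Hypothetical helper: byte offset, from the start of the BDAT chunk,
 * of the filter whose first 8-byte word has index start_index.
 */
size_t bdat_byte_offset(uint32_t start_index)
{
	return BDAT_HEADER_SIZE_SKETCH + sizeof(uint64_t) * (size_t)start_index;
}
```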
> +
> + return 1;
> +}
> +
> struct bloom_filter *get_bloom_filter(struct repository *r,
> - struct commit *c)
> + struct commit *c,
> + int compute_if_not_present)
> {
> struct bloom_filter *filter;
> struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
> @@ -141,6 +174,20 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>
> filter = bloom_filter_slab_at(&bloom_filters, c);
>
> + if (!filter->data) {
> + load_commit_graph_info(r, c);
> + if (c->graph_pos != COMMIT_NOT_FROM_GRAPH &&
> + r->objects->commit_graph->chunk_bloom_indexes) {
All right, the limitation that the top layer of the incremental commit graph
needs to have Bloom filters enabled for it to be even considered is
a reasonable tradeoff, in my opinion.
> + if (load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
> + return filter;
> + else
> + return NULL;
If it should have filter, return it, otherwise return NULL.
I wonder however when it can return NULL (and whether it should compute
Bloom filters if required instead).
> + }
> + }
> +
> + if (filter->data || !compute_if_not_present)
> + return filter;
If we have a filter from the slab, return it. All right.
However, according to documentation contained in comments in
commit-slab.h, bloom_filter_slab_at() will allocate the location to
store the data, and return freshly allocated memory... fortunately it
uses xcalloc() so returned bloom_filter would have ->len == 0 and
->data == 0.
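To make the zero-initialization point concrete, here is a small sketch
using plain calloc() in place of xcalloc() and the slab machinery. The
struct layout is an assumption that mirrors only the two fields used in
this patch. It also relies on all-bits-zero being a null pointer, which
holds on mainstream platforms.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumed layout: only the two fields this patch touches. */
struct bloom_filter_sketch {
	uint64_t *data;
	int len;
};

/* Hypothetical stand-in for a slab slot allocated on first access. */
struct bloom_filter_sketch *new_slot(void)
{
	return calloc(1, sizeof(struct bloom_filter_sketch));
}

/* Nonzero if the slot has never been filled in. */
int slot_is_empty(const struct bloom_filter_sketch *f)
{
	return !f->data && f->len == 0;
}
```

This is why the `if (!filter->data)` test works as a "not yet loaded or
computed" check on a freshly allocated slot.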
> +
> repo_diff_setup(r, &diffopt);
> diffopt.flags.recursive = 1;
> diffopt.max_changes = max_changes;
> diff --git a/bloom.h b/bloom.h
> index 7f40c751f7..76f8a9ad0c 100644
> --- a/bloom.h
> +++ b/bloom.h
> @@ -13,6 +13,7 @@ struct bloom_filter_settings {
>
> #define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
> #define BITS_PER_WORD 64
> +#define BLOOMDATA_CHUNK_HEADER_SIZE 3*sizeof(uint32_t)
All right.
>
> /*
> * A bloom_filter struct represents a data segment to
> @@ -47,7 +48,8 @@ void add_key_to_filter(struct bloom_key *key,
> struct bloom_filter_settings *settings);
>
> struct bloom_filter *get_bloom_filter(struct repository *r,
> - struct commit *c);
> + struct commit *c,
> + int compute_if_not_present);
>
All right, adding new parameter (changing function signature).
> int bloom_filter_contains(struct bloom_filter *filter,
> struct bloom_key *key,
> diff --git a/commit-graph.c b/commit-graph.c
> index 4585b3b702..c0e9834bf2 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1094,7 +1094,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
> ctx->commits.nr);
>
> while (list < last) {
> - struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
> + struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
> cur_pos += filter->len;
> display_progress(progress, ++i);
> hashwrite_be32(f, cur_pos);
> @@ -1123,7 +1123,7 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
> hashwrite_be32(f, settings->bits_per_entry);
>
> while (list < last) {
> - struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
> + struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
> display_progress(progress, ++i);
> hashwrite(f, filter->data, filter->len * sizeof(uint64_t));
> list++;
All right, if needed (that is, if the '--changed-paths' option from a
future commit is provided to 'git commit-graph write'),
compute_bloom_filters() would be called before write_commit_graph_file(),
which in turn runs write_graph_chunk_bloom_indexes() and *_data().
Actually, when writing Bloom data chunks (BIDX and BDAT) we could have
requested recomputing filters if necessary: slab storage works as
memoization, so you would calculate Bloom filter data for each commit in
the commit-graph only once. And write_graph_chunk_bloom_indexes()
and write_graph_chunk_bloom_data() are called only if ctx->changed_paths
is true.
So it would work with
+ struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 1);
Only in the future would we really need to call it with the
compute_if_not_present parameter set to a false value.
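The memoization argument can be shown with a toy model. Everything below
is hypothetical: a fixed array stands in for the commit slab, and a
counter stands in for the expensive diff against the first parent.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_COMMITS_SKETCH 4

/* Hypothetical cache slot, one per commit, zero-initialized. */
struct cached_filter {
	const uint64_t *data; /* NULL until computed */
	uint32_t len;
};

static struct cached_filter cache[NUM_COMMITS_SKETCH];
static const uint64_t dummy_words[1] = { 0 };

/* Counts how often the expensive path actually runs. */
int compute_calls;

const struct cached_filter *get_filter(int commit, int compute_if_not_present)
{
	struct cached_filter *f = &cache[commit];

	/* Same early return as in the patch: reuse, or decline to compute. */
	if (f->data || !compute_if_not_present)
		return f;

	/* Placeholder for the expensive computation. */
	compute_calls++;
	f->data = dummy_words;
	f->len = 1;
	return f;
}
```

Calling get_filter(c, 1) any number of times runs the expensive path at
most once per commit, so passing 1 from the writers would cost nothing
extra once compute_bloom_filters() has already filled the cache.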
> @@ -1304,7 +1304,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>
> for (i = 0; i < ctx->commits.nr; i++) {
> struct commit *c = sorted_by_pos[i];
> - struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
> + struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
> ctx->total_bloom_filter_data_size += sizeof(uint64_t) * filter->len;
> display_progress(progress, i + 1);
> }
> @@ -2314,6 +2314,7 @@ void free_commit_graph(struct commit_graph *g)
> g->data = NULL;
> close(g->graph_fd);
> }
> + free(g->bloom_filter_settings);
> free(g->filename);
> free(g);
Shouldn't this fixup be added to earlier commit?
> }
> diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
> index 331957011b..9b4be97f75 100644
> --- a/t/helper/test-bloom.c
> +++ b/t/helper/test-bloom.c
> @@ -47,7 +47,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
> struct bloom_filter *filter;
> setup_git_directory();
> c = lookup_commit(the_repository, commit_oid);
> - filter = get_bloom_filter(the_repository, c);
> + filter = get_bloom_filter(the_repository, c, 1);
> print_bloom_filter(filter);
> }
I would like to see some tests, but that needs to wait for patch that
adds --changed-paths option to the 'write' subcommand.
Things to be tested:
1. That after reading commit-graph with Bloom filter:
- that commit(s) in commit-graph have Bloom filter
- that commits outside commit-graph do not have Bloom filter
2. That incremental commit-graph feature works:
- for commits in deeper layer that have Bloom filter chunks
- for commits in deeper layer that do not have Bloom filter chunks
Best,
--
Jakub Narębski
* Re: [PATCH v2 08/11] commit-graph: reuse existing Bloom filters during write.
2020-02-20 18:48 ` Jakub Narebski
@ 2020-02-24 21:45 ` Garima Singh
0 siblings, 0 replies; 159+ messages in thread
From: Garima Singh @ 2020-02-24 21:45 UTC (permalink / raw)
To: Jakub Narebski, Garima Singh via GitGitGadget
Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
Jeff Hostetler, Taylor Blau, Jeff King, Christian Couder,
Emily Shaffer, Junio C Hamano, Garima Singh
On 2/20/2020 1:48 PM, Jakub Narebski wrote:
> "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> From: Garima Singh <garima.singh@microsoft.com>
>>
>> Read previously computed Bloom filters from the commit-graph file if
>> possible to avoid recomputing during commit-graph write.
>
> All right, what is written makes sense for this point in patch series.
>
> But in my opinion it is more important to state that this commit adds
> "parsing" of the Bloom filter data from commit-graph file. This means
> that it needs to be calculated only once, then stored in commit-graph,
> ready to be re-used.
>
Good point. Incorporated in v3.
>>
>> See Documentation/technical/commit-graph-format for the format in which
>> the Bloom filter information is written to the commit graph file.
>>
>> To read Bloom filter for a given commit with lexicographic position
>> 'i' we need to:
>> 1. Read BIDX[i] which essentially gives us the starting index in BDAT for
>> filter of commit i+1. It is essentially the index past the end
>> of the filter of commit i. It is called end_index in the code.
>>
>> 2. For i>0, read BIDX[i-1] which will give us the starting index in BDAT
>> for filter of commit i. It is called the start_index in the code.
>> For the first commit, where i = 0, Bloom filter data starts at the
>> beginning, just past the header in the BDAT chunk. Hence, start_index
>> will be 0.
>>
>> 3. The length of the filter will be end_index - start_index, because
>> BIDX[i] gives the cumulative 8-byte words including the ith
>> commit's filter.
>>
>> We toggle whether Bloom filters should be recomputed based on the
>> compute_if_null flag.
>
> Nitpick: the flag (the parameter) is called compute_if_not_present, not
> compute_if_null.
>
Oops. Fixed in v3.
>> +
>> + end_index = get_be32(g->chunk_bloom_indexes + 4 * lex_pos);
>> +
>> + if (lex_pos)
>
> Wouldn't it be better to be more explicit, and write
>
> + if (lex_pos > 0)
>
>
Sure.
>> + start_index = get_be32(g->chunk_bloom_indexes + 4 * (lex_pos - 1));
>> + else
>> + start_index = 0;
>
> All right, here we find start_index and end_index.
>
> It might be a good idea to at least assert() that start_index <= end_index,
> though a violation should not happen (that is why I propose that this check
> be compiled in only for debug builds).
>
I will look into this. Thanks!
>> @@ -1304,7 +1304,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>>
>> for (i = 0; i < ctx->commits.nr; i++) {
>> struct commit *c = sorted_by_pos[i];
>> - struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
>> + struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
>> ctx->total_bloom_filter_data_size += sizeof(uint64_t) * filter->len;
>> display_progress(progress, i + 1);
>> }
>> @@ -2314,6 +2314,7 @@ void free_commit_graph(struct commit_graph *g)
>> g->data = NULL;
>> close(g->graph_fd);
>> }
>> + free(g->bloom_filter_settings);
>> free(g->filename);
>> free(g);
>
> Shouldn't this fixup be added to earlier commit?
>
Yes.
>> }
>> diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
>> index 331957011b..9b4be97f75 100644
>> --- a/t/helper/test-bloom.c
>> +++ b/t/helper/test-bloom.c
>> @@ -47,7 +47,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
>> struct bloom_filter *filter;
>> setup_git_directory();
>> c = lookup_commit(the_repository, commit_oid);
>> - filter = get_bloom_filter(the_repository, c);
>> + filter = get_bloom_filter(the_repository, c, 1);
>> print_bloom_filter(filter);
>> }
>
> I would like to see some tests, but that needs to wait for patch that
> adds --changed-paths option to the 'write' subcommand.
>
> Things to be tested:
> 1. That after reading commit-graph with Bloom filter:
> - that commit(s) in commit-graph have Bloom filter
> - that commits outside commit-graph do not have Bloom filter
> 2. That incremental commit-graph feature works:
> - for commits in deeper layer that have Bloom filter chunks
> - for commits in deeper layer that do not have Bloom filter chunks
>
Included in later commits.
> Best,
>
* [PATCH v2 09/11] commit-graph: add --changed-paths option to write subcommand
2020-02-05 22:56 ` [PATCH v2 00/11] " Garima Singh via GitGitGadget
` (7 preceding siblings ...)
2020-02-05 22:56 ` [PATCH v2 08/11] commit-graph: reuse existing Bloom filters during write Garima Singh via GitGitGadget
@ 2020-02-05 22:56 ` Garima Singh via GitGitGadget
2020-02-20 20:28 ` Jakub Narebski
2020-02-20 22:10 ` Bryan Turner
2020-02-05 22:56 ` [PATCH v2 10/11] revision.c: use Bloom filters to speed up path based revision walks Garima Singh via GitGitGadget
` (4 subsequent siblings)
13 siblings, 2 replies; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-02-05 22:56 UTC (permalink / raw)
To: git
Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
garimasigit, jnareb, christian.couder, emilyshaffer, gitster,
Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
Add --changed-paths option to git commit-graph write. This option will
allow users to compute information about the paths that have changed
between a commit and its first parent, and write it into the commit graph
file. If the option is passed to the write subcommand we set the
COMMIT_GRAPH_WRITE_BLOOM_FILTERS flag and pass it down to the
commit-graph logic.
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
Documentation/git-commit-graph.txt | 5 +++++
builtin/commit-graph.c | 9 +++++++--
2 files changed, 12 insertions(+), 2 deletions(-)
diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index bcd85c1976..907d703b30 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -54,6 +54,11 @@ or `--stdin-packs`.)
With the `--append` option, include all commits that are present in the
existing commit-graph file.
+
+With the `--changed-paths` option, compute and write information about the
+paths changed between a commit and its first parent. This operation can
+take a while on large repositories. It provides significant performance gains
+for getting history of a directory or a file with `git log -- <path>`.
++
With the `--split` option, write the commit-graph as a chain of multiple
commit-graph files stored in `<dir>/info/commit-graphs`. The new commits
not already in the commit-graph are added in a new "tip" file. This file
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index e0c6fc4bbf..261dcce091 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -9,7 +9,7 @@
static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph verify [--object-dir <objdir>] [--shallow] [--[no-]progress]"),
- N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
+ N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
NULL
};
@@ -19,7 +19,7 @@ static const char * const builtin_commit_graph_verify_usage[] = {
};
static const char * const builtin_commit_graph_write_usage[] = {
- N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
+ N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
NULL
};
@@ -32,6 +32,7 @@ static struct opts_commit_graph {
int split;
int shallow;
int progress;
+ int enable_changed_paths;
} opts;
static int graph_verify(int argc, const char **argv)
@@ -110,6 +111,8 @@ static int graph_write(int argc, const char **argv)
N_("start walk at commits listed by stdin")),
OPT_BOOL(0, "append", &opts.append,
N_("include all commits already in the commit-graph file")),
+ OPT_BOOL(0, "changed-paths", &opts.enable_changed_paths,
+ N_("enable computation for changed paths")),
OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")),
OPT_BOOL(0, "split", &opts.split,
N_("allow writing an incremental commit-graph file")),
@@ -143,6 +146,8 @@ static int graph_write(int argc, const char **argv)
flags |= COMMIT_GRAPH_WRITE_SPLIT;
if (opts.progress)
flags |= COMMIT_GRAPH_WRITE_PROGRESS;
+ if (opts.enable_changed_paths)
+ flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
read_replace_refs = 0;
--
gitgitgadget
* Re: [PATCH v2 09/11] commit-graph: add --changed-paths option to write subcommand
2020-02-05 22:56 ` [PATCH v2 09/11] commit-graph: add --changed-paths option to write subcommand Garima Singh via GitGitGadget
@ 2020-02-20 20:28 ` Jakub Narebski
2020-02-24 21:51 ` Garima Singh
2020-02-20 22:10 ` Bryan Turner
1 sibling, 1 reply; 159+ messages in thread
From: Jakub Narebski @ 2020-02-20 20:28 UTC (permalink / raw)
To: Garima Singh via GitGitGadget
Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
Jeff Hostetler, Taylor Blau, Jeff King, Garima Singh,
Christian Couder, Emily Shaffer, Junio C Hamano, Garima Singh
"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Garima Singh <garima.singh@microsoft.com>
>
> Add --changed-paths option to git commit-graph write. This option will
> allow users to compute information about the paths that have changed
> between a commit and its first parent, and write it into the commit graph
> file. If the option is passed to the write subcommand we set the
> COMMIT_GRAPH_WRITE_BLOOM_FILTERS flag and pass it down to the
> commit-graph logic.
In the manpage you write that this operation (computing Bloom filters)
can take a while on large repositories. Could you perhaps provide some
numbers: how much longer does it take to write commit-graph file with
and without '--changed-paths' for example for Linux kernel, or some
other large repository? Thanks in advance.
>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
> Documentation/git-commit-graph.txt | 5 +++++
> builtin/commit-graph.c | 9 +++++++--
> 2 files changed, 12 insertions(+), 2 deletions(-)
What is missing is some sanity tests: that bloom index and bloom data
chunks are not present without '--changed-paths', and that they are
added with '--changed-paths'.
If possible, maybe also check in a separate test that the size of
bloom_index chunk agrees with the number of commits in the commit graph.
Also, we can now add those tests I wrote about in my review of the
previous patch, that is:
1. If you write commit-graph with --changed-paths, and either add some
commits later or exclude some commits from the commit graph, then:
a.) commit(s) in commit-graph have Bloom filter
b.) commit(s) not in commit-graph do not have Bloom filter
2. If you write commit-graph without --changed-paths as base layer,
and then write next layer with --changed-paths and --split, then:
a.) commit(s) in top layer have Bloom filter(s)
b.) commit(s) in bottom layer don't have Bloom filter(s)
>
> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
> index bcd85c1976..907d703b30 100644
> --- a/Documentation/git-commit-graph.txt
> +++ b/Documentation/git-commit-graph.txt
> @@ -54,6 +54,11 @@ or `--stdin-packs`.)
> With the `--append` option, include all commits that are present in the
> existing commit-graph file.
> +
> +With the `--changed-paths` option, compute and write information about the
> +paths changed between a commit and its first parent. This operation can
> +take a while on large repositories. It provides significant performance gains
> +for getting history of a directory or a file with `git log -- <path>`.
> ++
Should we document the limitation that the topmost layer in the split
commit graph needs to be written with '--changed-paths' for Git to use
this information? Or perhaps we should try (in the future) to remove
this limitation?
> With the `--split` option, write the commit-graph as a chain of multiple
> commit-graph files stored in `<dir>/info/commit-graphs`. The new commits
> not already in the commit-graph are added in a new "tip" file. This file
> diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
> index e0c6fc4bbf..261dcce091 100644
> --- a/builtin/commit-graph.c
> +++ b/builtin/commit-graph.c
> @@ -9,7 +9,7 @@
>
> static char const * const builtin_commit_graph_usage[] = {
> N_("git commit-graph verify [--object-dir <objdir>] [--shallow] [--[no-]progress]"),
> - N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
> + N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
> NULL
> };
>
> @@ -19,7 +19,7 @@ static const char * const builtin_commit_graph_verify_usage[] = {
> };
>
> static const char * const builtin_commit_graph_write_usage[] = {
> - N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
> + N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
> NULL
> };
>
All right.
> @@ -32,6 +32,7 @@ static struct opts_commit_graph {
> int split;
> int shallow;
> int progress;
> + int enable_changed_paths;
Bikeshed painting: should this field be called enable_changed_paths or
simply changed_paths?
> } opts;
>
> static int graph_verify(int argc, const char **argv)
> @@ -110,6 +111,8 @@ static int graph_write(int argc, const char **argv)
> N_("start walk at commits listed by stdin")),
> OPT_BOOL(0, "append", &opts.append,
> N_("include all commits already in the commit-graph file")),
> + OPT_BOOL(0, "changed-paths", &opts.enable_changed_paths,
> + N_("enable computation for changed paths")),
> OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")),
> OPT_BOOL(0, "split", &opts.split,
> N_("allow writing an incremental commit-graph file")),
All right.
> @@ -143,6 +146,8 @@ static int graph_write(int argc, const char **argv)
> flags |= COMMIT_GRAPH_WRITE_SPLIT;
> if (opts.progress)
> flags |= COMMIT_GRAPH_WRITE_PROGRESS;
> + if (opts.enable_changed_paths)
> + flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
>
> read_replace_refs = 0;
All right. This actually turns on calculation of Bloom filters for changed
paths, thanks to
ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
that was added by the "[PATCH v2 04/11] commit-graph: compute Bloom
filters for changed paths" patch.
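The path from the command-line option to ctx->changed_paths can be
sketched as follows. The enum values and struct here are assumptions made
for illustration; the real flag values live in commit-graph.h and may
differ.

```c
/* Assumed bit values, for illustration only. */
enum write_flags_sketch {
	SKETCH_WRITE_APPEND        = 1 << 0,
	SKETCH_WRITE_PROGRESS      = 1 << 1,
	SKETCH_WRITE_SPLIT         = 1 << 2,
	SKETCH_WRITE_BLOOM_FILTERS = 1 << 3
};

/* Reduced model of the parsed options. */
struct opts_sketch {
	int split;
	int progress;
	int enable_changed_paths;
};

/* Translate parsed options into a flag word, as graph_write() does. */
unsigned build_flags(const struct opts_sketch *opts)
{
	unsigned flags = 0;

	if (opts->split)
		flags |= SKETCH_WRITE_SPLIT;
	if (opts->progress)
		flags |= SKETCH_WRITE_PROGRESS;
	if (opts->enable_changed_paths)
		flags |= SKETCH_WRITE_BLOOM_FILTERS;
	return flags;
}

/* Normalize the bit to 0 or 1, as the ctx->changed_paths assignment does. */
int wants_changed_paths(unsigned flags)
{
	return flags & SKETCH_WRITE_BLOOM_FILTERS ? 1 : 0;
}
```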
Though... should this enabling be split into two separate patches like
this?
Best,
--
Jakub Narębski
* Re: [PATCH v2 09/11] commit-graph: add --changed-paths option to write subcommand
2020-02-20 20:28 ` Jakub Narebski
@ 2020-02-24 21:51 ` Garima Singh
2020-02-25 12:10 ` Jakub Narebski
0 siblings, 1 reply; 159+ messages in thread
From: Garima Singh @ 2020-02-24 21:51 UTC (permalink / raw)
To: Jakub Narebski, Garima Singh via GitGitGadget
Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
Jeff Hostetler, Taylor Blau, Jeff King, Christian Couder,
Emily Shaffer, Junio C Hamano, Garima Singh
On 2/20/2020 3:28 PM, Jakub Narebski wrote:
> "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> From: Garima Singh <garima.singh@microsoft.com>
>>
>> Add --changed-paths option to git commit-graph write. This option will
>> allow users to compute information about the paths that have changed
>> between a commit and its first parent, and write it into the commit graph
>> file. If the option is passed to the write subcommand we set the
>> COMMIT_GRAPH_WRITE_BLOOM_FILTERS flag and pass it down to the
>> commit-graph logic.
>
> In the manpage you write that this operation (computing Bloom filters)
> can take a while on large repositories. Could you perhaps provide some
> numbers: how much longer does it take to write commit-graph file with
> and without '--changed-paths' for example for Linux kernel, or some
> other large repository? Thanks in advance.
>
Yes. Will include numbers as appropriate in v3.
>>
>> Helped-by: Derrick Stolee <dstolee@microsoft.com>
>> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
>> ---
>> Documentation/git-commit-graph.txt | 5 +++++
>> builtin/commit-graph.c | 9 +++++++--
>> 2 files changed, 12 insertions(+), 2 deletions(-)
>
> What is missing is some sanity tests: that bloom index and bloom data
> chunks are not present without '--changed-paths', and that they are
> added with '--changed-paths'.
>
> If possible, maybe also check in a separate test that the size of
> bloom_index chunk agrees with the number of commits in the commit graph.
>
>
> Also, we can now add those tests I have wrote about in my review of
> previous patch, that is:
>
> 1. If you write commit-graph with --changed-paths, and either add some
> commits later or exclude some commits from the commit graph, then:
>
> a.) commit(s) in commit-graph have Bloom filter
> b.) commit(s) not in commit-graph do not have Bloom filter
>
> 2. If you write commit-graph without --changed-paths as base layer,
> and then write next layer with --changed-paths and --split, then:
>
> a.) commit(s) in top layer have Bloom filter(s)
> b.) commit(s) in bottom layer don't have Bloom filter(s)
>
I will see what more can be done here.
>>
>> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
>> index bcd85c1976..907d703b30 100644
>> --- a/Documentation/git-commit-graph.txt
>> +++ b/Documentation/git-commit-graph.txt
>> @@ -54,6 +54,11 @@ or `--stdin-packs`.)
>> With the `--append` option, include all commits that are present in the
>> existing commit-graph file.
>> +
>> +With the `--changed-paths` option, compute and write information about the
>> +paths changed between a commit and it's first parent. This operation can
>> +take a while on large repositories. It provides significant performance gains
>> +for getting history of a directory or a file with `git log -- <path>`.
>> ++
>
> Should we write about limitation that the topmost layer in the split
> commit graph needs to be written with '--changed-paths' for Git to use
> this information? Or perhaps we should try (in the future) to remove
> this limitation??
>
Given that this information is going to be used on a best-effort basis, it
would be superfluous to describe every case and conditional that decides
whether this information is being used.
>> @@ -143,6 +146,8 @@ static int graph_write(int argc, const char **argv)
>> flags |= COMMIT_GRAPH_WRITE_SPLIT;
>> if (opts.progress)
>> flags |= COMMIT_GRAPH_WRITE_PROGRESS;
>> + if (opts.enable_changed_paths)
>> + flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
>>
>> read_replace_refs = 0;
>
> All right. This actually turns on calculation Bloom filters for changed
> paths, thanks to
>
> ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
>
> that was added by the "[PATCH v2 04/11] commit-graph: compute Bloom
> filters for changed paths" patch.
>
> Though... should this enabling be split into two separate patches like
> this?
>
The idea is that in 4/11 we compute the filters only if the flag is set.
Between that patch and this one, we prepare the foundational code, so that
the flag is now ready to be set via an opt-in by the user.
>
> Best,
>
^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v2 09/11] commit-graph: add --changed-paths option to write subcommand
2020-02-24 21:51 ` Garima Singh
@ 2020-02-25 12:10 ` Jakub Narebski
0 siblings, 0 replies; 159+ messages in thread
From: Jakub Narebski @ 2020-02-25 12:10 UTC (permalink / raw)
To: Garima Singh
Cc: Garima Singh via GitGitGadget, git, Derrick Stolee,
SZEDER Gábor, Jonathan Tan, Jeff Hostetler, Taylor Blau,
Jeff King, Christian Couder, Emily Shaffer, Junio C Hamano,
Garima Singh
Garima Singh <garimasigit@gmail.com> writes:
> On 2/20/2020 3:28 PM, Jakub Narebski wrote:
>> "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
[...]
>>> --- a/Documentation/git-commit-graph.txt
>>> +++ b/Documentation/git-commit-graph.txt
>>> @@ -54,6 +54,11 @@ or `--stdin-packs`.)
>>> With the `--append` option, include all commits that are present in the
>>> existing commit-graph file.
>>> +
>>> +With the `--changed-paths` option, compute and write information about the
>>> +paths changed between a commit and it's first parent. This operation can
>>> +take a while on large repositories. It provides significant performance gains
>>> +for getting history of a directory or a file with `git log -- <path>`.
>>> ++
>>
>> Should we write about limitation that the topmost layer in the split
>> commit graph needs to be written with '--changed-paths' for Git to use
>> this information? Or perhaps we should try (in the future) to remove
>> this limitation?
>
> Given that this information is going to be used best effort, it would be
> superfluous to describe every case and conditional that decides whether
> this information is being used.
I can somewhat agree with this reasoning.
However, what I would like to avoid is surprising users. If one creates
the base commit-graph with Bloom filter data, but then omits
`--changed-paths` when creating a new layer of the commit-graph (updating
it incrementally), it may be surprising that `git log -- <path>` is now
much slower.
On the other hand, if one updates the commit-graph in a non-incremental
way (rewriting the commit-graph file), losing the Bloom filter
information and the performance of `git log -- <path>` because one forgot
to include `--changed-paths` is not that unexpected.
Anyway, in the future, when this mechanism is controlled by an
appropriate config variable, this whole discussion will become somewhat
moot.
Thought for the future: perhaps `git commit-graph verify` could detect
that a split graph has Bloom filters only for some layers, and inform the
user? But that is almost certainly out of scope for this patch series.
>>> @@ -143,6 +146,8 @@ static int graph_write(int argc, const char **argv)
>>> flags |= COMMIT_GRAPH_WRITE_SPLIT;
>>> if (opts.progress)
>>> flags |= COMMIT_GRAPH_WRITE_PROGRESS;
>>> + if (opts.enable_changed_paths)
>>> + flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
>>>
>>> read_replace_refs = 0;
>>
>> All right. This actually turns on calculation Bloom filters for changed
>> paths, thanks to
>>
>> ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
>>
>> that was added by the "[PATCH v2 04/11] commit-graph: compute Bloom
>> filters for changed paths" patch.
>>
>> Though... should this enabling be split into two separate patches like
>> this?
>
> The idea is that in 4/11 We compute only if the flag is set.
> And between that patch and this one: we prepare the foundational code
> that is now ready for that flag to be set via an opt-in by the user.
All right.
Choosing how to split a large change into a series is not easy. On one
hand, one would want each change to be small and self-contained. On
the other hand, it would be good if each change were testable (test-tool
can help here).
Best,
--
Jakub Narębski
^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v2 09/11] commit-graph: add --changed-paths option to write subcommand
2020-02-05 22:56 ` [PATCH v2 09/11] commit-graph: add --changed-paths option to write subcommand Garima Singh via GitGitGadget
2020-02-20 20:28 ` Jakub Narebski
@ 2020-02-20 22:10 ` Bryan Turner
2020-02-22 1:44 ` Garima Singh
1 sibling, 1 reply; 159+ messages in thread
From: Bryan Turner @ 2020-02-20 22:10 UTC (permalink / raw)
To: Garima Singh via GitGitGadget
Cc: Git Users, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
jeffhost, me, Jeff King, garimasigit, jnareb, Christian Couder,
emilyshaffer, Junio C Hamano, Garima Singh
On Wed, Feb 5, 2020 at 2:56 PM Garima Singh via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Garima Singh <garima.singh@microsoft.com>
>
> Add --changed-paths option to git commit-graph write. This option will
> allow users to compute information about the paths that have changed
> between a commit and its first parent, and write it into the commit graph
> file. If the option is passed to the write subcommand we set the
> COMMIT_GRAPH_WRITE_BLOOM_FILTERS flag and pass it down to the
> commit-graph logic.
>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
> Documentation/git-commit-graph.txt | 5 +++++
> builtin/commit-graph.c | 9 +++++++--
> 2 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
> index bcd85c1976..907d703b30 100644
> --- a/Documentation/git-commit-graph.txt
> +++ b/Documentation/git-commit-graph.txt
> @@ -54,6 +54,11 @@ or `--stdin-packs`.)
> With the `--append` option, include all commits that are present in the
> existing commit-graph file.
> +
> +With the `--changed-paths` option, compute and write information about the
> +paths changed between a commit and it's first parent. This operation can
"its first parent"
(Pardon the grammar nit from the peanut gallery!)
> +take a while on large repositories. It provides significant performance gains
> +for getting history of a directory or a file with `git log -- <path>`.
> ++
> With the `--split` option, write the commit-graph as a chain of multiple
> commit-graph files stored in `<dir>/info/commit-graphs`. The new commits
> not already in the commit-graph are added in a new "tip" file. This file
> diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
> index e0c6fc4bbf..261dcce091 100644
> --- a/builtin/commit-graph.c
> +++ b/builtin/commit-graph.c
> @@ -9,7 +9,7 @@
>
> static char const * const builtin_commit_graph_usage[] = {
> N_("git commit-graph verify [--object-dir <objdir>] [--shallow] [--[no-]progress]"),
> - N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
> + N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
> NULL
> };
>
> @@ -19,7 +19,7 @@ static const char * const builtin_commit_graph_verify_usage[] = {
> };
>
> static const char * const builtin_commit_graph_write_usage[] = {
> - N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
> + N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
> NULL
> };
>
> @@ -32,6 +32,7 @@ static struct opts_commit_graph {
> int split;
> int shallow;
> int progress;
> + int enable_changed_paths;
> } opts;
>
> static int graph_verify(int argc, const char **argv)
> @@ -110,6 +111,8 @@ static int graph_write(int argc, const char **argv)
> N_("start walk at commits listed by stdin")),
> OPT_BOOL(0, "append", &opts.append,
> N_("include all commits already in the commit-graph file")),
> + OPT_BOOL(0, "changed-paths", &opts.enable_changed_paths,
> + N_("enable computation for changed paths")),
> OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")),
> OPT_BOOL(0, "split", &opts.split,
> N_("allow writing an incremental commit-graph file")),
> @@ -143,6 +146,8 @@ static int graph_write(int argc, const char **argv)
> flags |= COMMIT_GRAPH_WRITE_SPLIT;
> if (opts.progress)
> flags |= COMMIT_GRAPH_WRITE_PROGRESS;
> + if (opts.enable_changed_paths)
> + flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
>
> read_replace_refs = 0;
>
> --
> gitgitgadget
>
^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v2 09/11] commit-graph: add --changed-paths option to write subcommand
2020-02-20 22:10 ` Bryan Turner
@ 2020-02-22 1:44 ` Garima Singh
0 siblings, 0 replies; 159+ messages in thread
From: Garima Singh @ 2020-02-22 1:44 UTC (permalink / raw)
To: Bryan Turner, Garima Singh via GitGitGadget
Cc: Git Users, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
jeffhost, me, Jeff King, jnareb, Christian Couder, emilyshaffer,
Junio C Hamano, Garima Singh
On 2/20/2020 5:10 PM, Bryan Turner wrote:
> On Wed, Feb 5, 2020 at 2:56 PM Garima Singh via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
>> index bcd85c1976..907d703b30 100644
>> --- a/Documentation/git-commit-graph.txt
>> +++ b/Documentation/git-commit-graph.txt
>> @@ -54,6 +54,11 @@ or `--stdin-packs`.)
>> With the `--append` option, include all commits that are present in the
>> existing commit-graph file.
>> +
>> +With the `--changed-paths` option, compute and write information about the
>> +paths changed between a commit and it's first parent. This operation can
>
> "its first parent"
>
> (Pardon the grammar nit from the peanut gallery!)
>
:)
Thank you! Fixed in v3.
Cheers!
Garima Singh
^ permalink raw reply [flat|nested] 159+ messages in thread
* [PATCH v2 10/11] revision.c: use Bloom filters to speed up path based revision walks
2020-02-05 22:56 ` [PATCH v2 00/11] " Garima Singh via GitGitGadget
` (8 preceding siblings ...)
2020-02-05 22:56 ` [PATCH v2 09/11] commit-graph: add --changed-paths option to write subcommand Garima Singh via GitGitGadget
@ 2020-02-05 22:56 ` Garima Singh via GitGitGadget
2020-02-21 17:31 ` Jakub Narebski
2020-02-21 22:45 ` Jakub Narebski
2020-02-05 22:56 ` [PATCH v2 11/11] commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag Garima Singh via GitGitGadget
` (3 subsequent siblings)
13 siblings, 2 replies; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-02-05 22:56 UTC (permalink / raw)
To: git
Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
garimasigit, jnareb, christian.couder, emilyshaffer, gitster,
Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
Revision walk will now use Bloom filters for commits to speed up revision
walks for a particular path (for computing history for that path), if they
are present in the commit-graph file.
We load the Bloom filters during the prepare_revision_walk step, but only
when dealing with a single pathspec. While comparing trees in
rev_compare_trees(), if the Bloom filter says that the file is not different
between the two trees, we don't need to compute the expensive diff. This is
where we get our performance gains. The other response of the Bloom filter
is `maybe`, in which case we fall back to the full diff calculation to
determine if the path was changed in the commit.
Performance Gains:
We tested the performance of `git log -- <path>` on the git repo, the linux
and some internal large repos, with a variety of paths of varying depths.
On the git and linux repos:
- we observed a 2x to 5x speed up.
On a large internal repo with files seated 6-10 levels deep in the tree:
- we observed 10x to 20x speed ups, with some paths going up to 28 times
faster.
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Helped-by: SZEDER Gábor <szeder.dev@gmail.com>
Helped-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
revision.c | 124 +++++++++++++++++++++++++++++++-
revision.h | 11 +++
t/helper/test-read-graph.c | 4 ++
t/t4216-log-bloom.sh | 140 +++++++++++++++++++++++++++++++++++++
4 files changed, 277 insertions(+), 2 deletions(-)
create mode 100755 t/t4216-log-bloom.sh
diff --git a/revision.c b/revision.c
index 8136929e23..d1622afa17 100644
--- a/revision.c
+++ b/revision.c
@@ -29,6 +29,8 @@
#include "prio-queue.h"
#include "hashmap.h"
#include "utf8.h"
+#include "bloom.h"
+#include "json-writer.h"
volatile show_early_output_fn_t show_early_output;
@@ -624,11 +626,114 @@ static void file_change(struct diff_options *options,
options->flags.has_changes = 1;
}
+static int bloom_filter_atexit_registered;
+static unsigned int count_bloom_filter_maybe;
+static unsigned int count_bloom_filter_definitely_not;
+static unsigned int count_bloom_filter_false_positive;
+static unsigned int count_bloom_filter_not_present;
+static unsigned int count_bloom_filter_length_zero;
+
+static void trace2_bloom_filter_statistics_atexit(void)
+{
+ struct json_writer jw = JSON_WRITER_INIT;
+
+ jw_object_begin(&jw, 0);
+ jw_object_intmax(&jw, "filter_not_present", count_bloom_filter_not_present);
+ jw_object_intmax(&jw, "zero_length_filter", count_bloom_filter_length_zero);
+ jw_object_intmax(&jw, "maybe", count_bloom_filter_maybe);
+ jw_object_intmax(&jw, "definitely_not", count_bloom_filter_definitely_not);
+ jw_end(&jw);
+
+ trace2_data_json("bloom", the_repository, "statistics", &jw);
+
+ jw_release(&jw);
+}
+
+static void prepare_to_use_bloom_filter(struct rev_info *revs)
+{
+ struct pathspec_item *pi;
+ char *path_alloc = NULL;
+ const char *path;
+ int last_index;
+ int len;
+
+ if (!revs->commits)
+ return;
+
+ repo_parse_commit(revs->repo, revs->commits->item);
+
+ if (!revs->repo->objects->commit_graph)
+ return;
+
+ revs->bloom_filter_settings = revs->repo->objects->commit_graph->bloom_filter_settings;
+ if (!revs->bloom_filter_settings)
+ return;
+
+ pi = &revs->pruning.pathspec.items[0];
+ last_index = pi->len - 1;
+
+ if (pi->match[last_index] == '/') {
+ path_alloc = xstrdup(pi->match);
+ path_alloc[last_index] = '\0';
+ path = path_alloc;
+ } else
+ path = pi->match;
+
+ len = strlen(path);
+
+ revs->bloom_key = xmalloc(sizeof(struct bloom_key));
+ fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
+
+ if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
+ atexit(trace2_bloom_filter_statistics_atexit);
+ bloom_filter_atexit_registered = 1;
+ }
+
+ free(path_alloc);
+}
+
+static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
+ struct commit *commit)
+{
+ struct bloom_filter *filter;
+ int result;
+
+ if (!revs->repo->objects->commit_graph)
+ return -1;
+
+ if (commit->generation == GENERATION_NUMBER_INFINITY)
+ return -1;
+
+ filter = get_bloom_filter(revs->repo, commit, 0);
+
+ if (!filter) {
+ count_bloom_filter_not_present++;
+ return -1;
+ }
+
+ if (!filter->len) {
+ count_bloom_filter_length_zero++;
+ return -1;
+ }
+
+ result = bloom_filter_contains(filter,
+ revs->bloom_key,
+ revs->bloom_filter_settings);
+
+ if (result)
+ count_bloom_filter_maybe++;
+ else
+ count_bloom_filter_definitely_not++;
+
+ return result;
+}
+
static int rev_compare_tree(struct rev_info *revs,
- struct commit *parent, struct commit *commit)
+ struct commit *parent, struct commit *commit, int nth_parent)
{
struct tree *t1 = get_commit_tree(parent);
struct tree *t2 = get_commit_tree(commit);
+ int bloom_ret = 1;
if (!t1)
return REV_TREE_NEW;
@@ -653,11 +758,23 @@ static int rev_compare_tree(struct rev_info *revs,
return REV_TREE_SAME;
}
+ if (revs->pruning.pathspec.nr == 1 && !revs->reflog_info && !nth_parent) {
+ bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
+
+ if (bloom_ret == 0)
+ return REV_TREE_SAME;
+ }
+
tree_difference = REV_TREE_SAME;
revs->pruning.flags.has_changes = 0;
if (diff_tree_oid(&t1->object.oid, &t2->object.oid, "",
&revs->pruning) < 0)
return REV_TREE_DIFFERENT;
+
+ if (!nth_parent)
+ if (bloom_ret == 1 && tree_difference == REV_TREE_SAME)
+ count_bloom_filter_false_positive++;
+
return tree_difference;
}
@@ -855,7 +972,7 @@ static void try_to_simplify_commit(struct rev_info *revs, struct commit *commit)
die("cannot simplify commit %s (because of %s)",
oid_to_hex(&commit->object.oid),
oid_to_hex(&p->object.oid));
- switch (rev_compare_tree(revs, p, commit)) {
+ switch (rev_compare_tree(revs, p, commit, nth_parent)) {
case REV_TREE_SAME:
if (!revs->simplify_history || !relevant_commit(p)) {
/* Even if a merge with an uninteresting
@@ -3362,6 +3479,8 @@ int prepare_revision_walk(struct rev_info *revs)
FOR_EACH_OBJECT_PROMISOR_ONLY);
}
+ if (revs->pruning.pathspec.nr == 1 && !revs->reflog_info)
+ prepare_to_use_bloom_filter(revs);
if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED)
commit_list_sort_by_date(&revs->commits);
if (revs->no_walk)
@@ -3379,6 +3498,7 @@ int prepare_revision_walk(struct rev_info *revs)
simplify_merges(revs);
if (revs->children.name)
set_children(revs);
+
return 0;
}
diff --git a/revision.h b/revision.h
index 475f048fb6..7c026fe41f 100644
--- a/revision.h
+++ b/revision.h
@@ -56,6 +56,8 @@ struct repository;
struct rev_info;
struct string_list;
struct saved_parents;
+struct bloom_key;
+struct bloom_filter_settings;
define_shared_commit_slab(revision_sources, char *);
struct rev_cmdline_info {
@@ -291,6 +293,15 @@ struct rev_info {
struct revision_sources *sources;
struct topo_walk_info *topo_walk_info;
+
+ /* Commit graph bloom filter fields */
+ /* The bloom filter key for the pathspec */
+ struct bloom_key *bloom_key;
+ /*
+ * The bloom filter settings used to generate the key.
+ * This is loaded from the commit-graph being used.
+ */
+ struct bloom_filter_settings *bloom_filter_settings;
};
int ref_excluded(struct string_list *, const char *path);
diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index d2884efe0a..aff597c7a3 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -45,6 +45,10 @@ int cmd__read_graph(int argc, const char **argv)
printf(" commit_metadata");
if (graph->chunk_extra_edges)
printf(" extra_edges");
+ if (graph->chunk_bloom_indexes)
+ printf(" bloom_indexes");
+ if (graph->chunk_bloom_data)
+ printf(" bloom_data");
printf("\n");
UNLEAK(graph);
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
new file mode 100755
index 0000000000..19eca1864b
--- /dev/null
+++ b/t/t4216-log-bloom.sh
@@ -0,0 +1,140 @@
+#!/bin/sh
+
+test_description='git log for a path with bloom filters'
+. ./test-lib.sh
+
+test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
+ git init &&
+ mkdir A A/B A/B/C &&
+ test_commit c1 A/file1 &&
+ test_commit c2 A/B/file2 &&
+ test_commit c3 A/B/C/file3 &&
+ test_commit c4 A/file1 &&
+ test_commit c5 A/B/file2 &&
+ test_commit c6 A/B/C/file3 &&
+ test_commit c7 A/file1 &&
+ test_commit c8 A/B/file2 &&
+ test_commit c9 A/B/C/file3 &&
+ git checkout -b side HEAD~4 &&
+ test_commit side-1 file4 &&
+ git checkout master &&
+ git merge side &&
+ test_commit c10 file5 &&
+ mv file5 file5_renamed &&
+ git add file5_renamed &&
+ git commit -m "rename" &&
+ git commit-graph write --reachable --changed-paths
+'
+graph_read_expect() {
+ OPTIONAL=""
+ NUM_CHUNKS=5
+ cat >expect <<- EOF
+ header: 43475048 1 1 $NUM_CHUNKS 0
+ num_commits: $1
+ chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
+ EOF
+ test-tool read-graph >output &&
+ test_cmp expect output
+}
+
+test_expect_success 'commit-graph write wrote out the bloom chunks' '
+ graph_read_expect 13
+'
+
+setup() {
+ rm output
+ rm "$TRASH_DIRECTORY/trace.perf"
+ git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom
+ GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom
+}
+
+test_bloom_filters_used() {
+ log_args=$1
+ bloom_trace_prefix="statistics:{\"filter_not_present\":0,\"zero_length_filter\":0,\"maybe\""
+ setup "$log_args"
+ grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" && test_cmp log_wo_bloom log_w_bloom
+}
+
+test_bloom_filters_not_used() {
+ log_args=$1
+ setup "$log_args"
+ !(grep -q "statistics:{\"filter_not_present\":" "$TRASH_DIRECTORY/trace.perf") && test_cmp log_wo_bloom log_w_bloom
+}
+
+for path in A A/B A/B/C A/file1 A/B/file2 A/B/C/file3 file4 file5_renamed
+do
+ for option in "" \
+ "--full-history" \
+ "--full-history --simplify-merges" \
+ "--simplify-merges" \
+ "--simplify-by-decoration" \
+ "--follow" \
+ "--first-parent" \
+ "--topo-order" \
+ "--date-order" \
+ "--author-date-order" \
+ "--ancestry-path side..master"
+ do
+ test_expect_success "git log option: $option for path: $path" '
+ test_bloom_filters_used "$option -- $path"
+ '
+ done
+done
+
+test_expect_success 'git log -- folder works with and without the trailing slash' '
+ test_bloom_filters_used "-- A" &&
+ test_bloom_filters_used "-- A/"
+'
+
+test_expect_success 'git log for path that does not exist. ' '
+ test_bloom_filters_used "-- path_does_not_exist"
+'
+
+test_expect_success 'git log with --walk-reflogs does not use bloom filters' '
+ test_bloom_filters_not_used "--walk-reflogs -- A"
+'
+
+test_expect_success 'git log -- multiple path specs does not use bloom filters' '
+ test_bloom_filters_not_used "-- file4 A/file1"
+'
+
+test_expect_success 'git log with wildcard that resolves to a single path uses bloom filters' '
+ test_bloom_filters_used "-- *4" &&
+ test_bloom_filters_used "-- *renamed"
+'
+
+test_expect_success 'git log with wildcard that resolves to a multiple paths does not uses bloom filters' '
+ test_bloom_filters_not_used "-- *" &&
+ test_bloom_filters_not_used "-- file*"
+'
+
+test_expect_success 'setup - add commit-graph to the chain without bloom filters' '
+ test_commit c14 A/anotherFile2 &&
+ test_commit c15 A/B/anotherFile2 &&
+ test_commit c16 A/B/C/anotherFile2 &&
+ GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0 git commit-graph write --reachable --split &&
+ test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain
+'
+
+test_expect_success 'git log does not use bloom filters if the latest graph does not have bloom filters.' '
+ test_bloom_filters_not_used "-- A/B"
+'
+
+test_expect_success 'setup - add commit-graph to the chain with bloom filters' '
+ test_commit c17 A/anotherFile3 &&
+ git commit-graph write --reachable --changed-paths --split &&
+ test_line_count = 3 .git/objects/info/commit-graphs/commit-graph-chain
+'
+
+test_bloom_filters_used_when_some_filters_are_missing() {
+ log_args=$1
+ bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":6"
+ setup "$log_args"
+ grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" && test_cmp log_wo_bloom log_w_bloom
+}
+
+test_expect_success 'git log uses bloom filters if they exist in the latest but not all commit graphs in the chain.' '
+ test_bloom_filters_used_when_some_filters_are_missing "-- A/B"
+'
+
+test_done
--
gitgitgadget
^ permalink raw reply related [flat|nested] 159+ messages in thread
* Re: [PATCH v2 10/11] revision.c: use Bloom filters to speed up path based revision walks
2020-02-05 22:56 ` [PATCH v2 10/11] revision.c: use Bloom filters to speed up path based revision walks Garima Singh via GitGitGadget
@ 2020-02-21 17:31 ` Jakub Narebski
2020-02-21 22:45 ` Jakub Narebski
1 sibling, 0 replies; 159+ messages in thread
From: Jakub Narebski @ 2020-02-21 17:31 UTC (permalink / raw)
To: Garima Singh via GitGitGadget
Cc: git, stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
garimasigit, christian.couder, emilyshaffer, gitster,
Garima Singh
"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Garima Singh <garima.singh@microsoft.com>
>
> Revision walk will now use Bloom filters for commits to speed up revision
> walks for a particular path (for computing history for that path), if they
> are present in the commit-graph file.
Why do we need to turn this feature off for --walk-reflogs?
Anyway, in my opinion this restriction should be stated explicitly in
the commit message, if kept.
>
> We load the Bloom filters during the prepare_revision_walk step, but only
> when dealing with a single pathspec.
I would add the qualifier "currently" here, i.e. s/only/currently only/
to make it clear that it is a limitation of the current implementation,
and not an inherent limitation of the technique.
> While comparing trees in
> rev_compare_trees(), if the Bloom filter says that the file is not different
> between the two trees, we don't need to compute the expensive diff. This is
> where we get our performance gains. The other response of the Bloom filter
> is `maybe`, in which case we fall back to the full diff calculation to
> determine if the path was changed in the commit.
All right, looks good.
Very minor nitpick: s/`maybe`/'maybe'/ (in my opinion).
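To make the 'definitely not' versus 'maybe' distinction concrete, here is
a minimal, self-contained sketch of the idea. It is not the implementation
from this series: the `sketch_*` names, the 64-bit filter size, the three
probes and the FNV-1a based hashing are all assumptions chosen to keep the
example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_FILTER_BITS 64
#define SKETCH_NUM_HASHES 3

/* A toy per-commit filter: one 64-bit word (an assumption of the sketch). */
struct sketch_filter {
	uint64_t bits;
};

/* FNV-1a with a seed; a stand-in hash, not the one used by the series. */
static uint32_t sketch_fnv1a(const char *s, uint32_t seed)
{
	uint32_t h = 2166136261u ^ seed;

	for (; *s; s++) {
		h ^= (unsigned char)*s;
		h *= 16777619u;
	}
	return h;
}

/* i-th bit position for a path, derived by double hashing (h1 + i * h2). */
static uint32_t sketch_bit(const char *path, int i)
{
	uint32_t h1 = sketch_fnv1a(path, 0);
	uint32_t h2 = sketch_fnv1a(path, 0x9e3779b9u);

	assert(i >= 0 && i < SKETCH_NUM_HASHES);
	return (h1 + (uint32_t)i * h2) % SKETCH_FILTER_BITS;
}

static void sketch_filter_add(struct sketch_filter *f, const char *path)
{
	int i;

	for (i = 0; i < SKETCH_NUM_HASHES; i++)
		f->bits |= (uint64_t)1 << sketch_bit(path, i);
}

/* Returns 0 for "definitely not changed", 1 for "maybe changed". */
static int sketch_filter_contains(const struct sketch_filter *f, const char *path)
{
	int i;

	for (i = 0; i < SKETCH_NUM_HASHES; i++)
		if (!(f->bits & ((uint64_t)1 << sketch_bit(path, i))))
			return 0;
	return 1;
}

/* Counts how often the expensive fallback below had to run. */
static int sketch_full_diffs_run;

/* Stand-in for the expensive tree diff: scan the list of changed paths. */
static int sketch_full_diff(const char **changed, size_t nr, const char *path)
{
	size_t i;

	sketch_full_diffs_run++;
	for (i = 0; i < nr; i++)
		if (!strcmp(changed[i], path))
			return 1;
	return 0;
}

/* Ask the filter first; fall back to the full diff only on "maybe". */
static int sketch_path_changed(const struct sketch_filter *f,
			       const char **changed, size_t nr,
			       const char *path)
{
	if (!sketch_filter_contains(f, path))
		return 0; /* definitely not: the diff is skipped */
	return sketch_full_diff(changed, nr, path);
}
```

A path that was added to the filter always yields 'maybe', so the shortcut
can never hide a real change; a false positive only costs one full diff
that then reports no change.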
>
> Performance Gains:
> We tested the performance of `git log -- <path>` on the git repo, the linux
> and some internal large repos, with a variety of paths of varying depths.
Another repository on which we could test the Bloom filters feature would
be, as I have written before, the Android AOSP frameworks core repository
https://android.googlesource.com/platform/frameworks/base/
because, being written in Java, it has a deep path hierarchy, and it also
has a large number of commits.
>
> On the git and linux repos:
> - we observed a 2x to 5x speed up.
It would be nice to have at least one specific and repeatable example:
in a given repository, starting from a given commit or tag, following the
history of a given path, what are the timing results for some specific
command with and without Bloom filters computed and enabled.
One might also want to know the cost of this speedup: how much disk
space does it take (i.e. how large is the commit-graph file with and
without the Bloom filter chunks), and how long does it take to compute
(i.e. how much time writing the commit-graph takes with and without the
--changed-paths option).
>
> On a large internal repo with files seated 6-10 levels deep in the tree:
> - we observed 10x to 20x speed ups, with some paths going up to 28 times
> faster.
This is good to know.
In the future we might want to have a procedurally generated synthetic
repository, where we would be able to control the number of files, the
depth of the filesystem hierarchy, the average number of changes per
commit, etc. to be used for performance testing. (Just wishful thinking)
>
> Helped-by: Derrick Stolee <dstolee@microsoft.com
> Helped-by: SZEDER Gábor <szeder.dev@gmail.com>
> Helped-by: Jonathan Tan <jonathantanmy@google.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
> revision.c | 124 +++++++++++++++++++++++++++++++-
> revision.h | 11 +++
> t/helper/test-read-graph.c | 4 ++
> t/t4216-log-bloom.sh | 140 +++++++++++++++++++++++++++++++++++++
> 4 files changed, 277 insertions(+), 2 deletions(-)
> create mode 100755 t/t4216-log-bloom.sh
>
> diff --git a/revision.c b/revision.c
> index 8136929e23..d1622afa17 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -29,6 +29,8 @@
> #include "prio-queue.h"
> #include "hashmap.h"
> #include "utf8.h"
> +#include "bloom.h"
> +#include "json-writer.h"
>
> volatile show_early_output_fn_t show_early_output;
>
> @@ -624,11 +626,114 @@ static void file_change(struct diff_options *options,
> options->flags.has_changes = 1;
> }
>
> +static int bloom_filter_atexit_registered;
> +static unsigned int count_bloom_filter_maybe;
> +static unsigned int count_bloom_filter_definitely_not;
> +static unsigned int count_bloom_filter_false_positive;
> +static unsigned int count_bloom_filter_not_present;
> +static unsigned int count_bloom_filter_length_zero;
> +
> +static void trace2_bloom_filter_statistics_atexit(void)
> +{
> + struct json_writer jw = JSON_WRITER_INIT;
> +
> + jw_object_begin(&jw, 0);
> + jw_object_intmax(&jw, "filter_not_present", count_bloom_filter_not_present);
> + jw_object_intmax(&jw, "zero_length_filter", count_bloom_filter_length_zero);
> + jw_object_intmax(&jw, "maybe", count_bloom_filter_maybe);
> + jw_object_intmax(&jw, "definitely_not", count_bloom_filter_definitely_not);
> + jw_end(&jw);
> +
> + trace2_data_json("bloom", the_repository, "statistics", &jw);
> +
> + jw_release(&jw);
> +}
I thought that it would be better to put this part together with tests
that absolutely require this functionality in a separate subsequent
patch, but now I am not so sure. It is nice to have all or almost all
tests created in a single patch.
Looks good to me, but I don't know much about trace2 API, so take it
with a pinch of salt.
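For readers who, like me, are not familiar with trace2: judging by the
strings that the new test greps for, the statistics end up as one compact
JSON object. The sketch below reproduces that shape with plain snprintf();
it does not use the json-writer or trace2 APIs, and all `sketch_*` names
are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* The four counters that the atexit handler in the patch reports. */
struct sketch_bloom_stats {
	unsigned int filter_not_present;
	unsigned int zero_length_filter;
	unsigned int maybe;
	unsigned int definitely_not;
};

/*
 * Format the counters as a compact JSON object, in the same key order
 * as the jw_object_intmax() calls in the patch. Returns what snprintf()
 * returns: the length the full string needs, excluding the NUL.
 */
static int sketch_format_stats(const struct sketch_bloom_stats *s,
			       char *buf, size_t size)
{
	assert(s && buf && size > 0);
	return snprintf(buf, size,
			"{\"filter_not_present\":%u,\"zero_length_filter\":%u,"
			"\"maybe\":%u,\"definitely_not\":%u}",
			s->filter_not_present, s->zero_length_filter,
			s->maybe, s->definitely_not);
}

/* Prefix match, similar in spirit to the grep used by the test script. */
static int sketch_has_prefix(const char *str, const char *prefix)
{
	return strncmp(str, prefix, strlen(prefix)) == 0;
}
```

Matching only a prefix, as the test helper does, lets a test pin down the
first counters while leaving the later ones unconstrained.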
> +
> +static void prepare_to_use_bloom_filter(struct rev_info *revs)
> +{
> + struct pathspec_item *pi;
> + char *path_alloc = NULL;
> + const char *path;
> + int last_index;
> + int len;
> +
> + if (!revs->commits)
> + return;
I see that we need this because in the next statement we dereference
revs->commits to get revs->commits->item.
If I understand correctly, an empty pending list may happen with the "--all"
or "--glob" options, but somebody with more experience in this area of
the code would need to confirm that.
Should we test `git log --all -- <path>`?
> +
> + repo_parse_commit(revs->repo, revs->commits->item);
Are we calling this function for its side-effects? Wouldn't using
prepare_commit_graph(revs->repo) here be a better solution?
> +
> + if (!revs->repo->objects->commit_graph)
> + return;
Looks good to me. If there is no commit graph, then there are no Bloom
filters to consult.
> +
> + revs->bloom_filter_settings = revs->repo->objects->commit_graph->bloom_filter_settings;
Hmmm... is that why bloom_filter_settings is a pointer to struct, and
not struct itself?
> + if (!revs->bloom_filter_settings)
> + return;
Looks good to me. If there is no Bloom filter in the commit-graph
file, then there are no Bloom filters to consult.
> +
> + pi = &revs->pruning.pathspec.items[0];
> + last_index = pi->len - 1;
> +
It might be a good idea to add a comment explaining what is happening
here, for example:
+ /* remove single trailing slash from path, if needed */
> + if (pi->match[last_index] == '/') {
> + path_alloc = xstrdup(pi->match);
> + path_alloc[last_index] = '\0';
> + path = path_alloc;
> + } else
> + path = pi->match;
> +
> + len = strlen(path);
We can avoid computing strlen(path) here, because in first branch of
this conditional we have len = last_index, in the second branch we have
len = pi->len.
> +
> + revs->bloom_key = xmalloc(sizeof(struct bloom_key));
> + fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
All right, this is the meat of this function: creating bloom_key for a
path. Looks good to me.
> +
> + if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
> + atexit(trace2_bloom_filter_statistics_atexit);
> + bloom_filter_atexit_registered = 1;
> + }
OK, here we register trace2 Bloom filter statistics handler, but only
once, and only when needed.
> +
> + free(path_alloc);
OK, path_alloc is either xstrdup-ed string, or NULL, and is no longer
needed (after possibly being used to create bloom_key).
> +}
> +
> +static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
> + struct commit *commit)
> +{
> + struct bloom_filter *filter;
> + int result;
> +
> + if (!revs->repo->objects->commit_graph)
> + return -1;
> +
> + if (commit->generation == GENERATION_NUMBER_INFINITY)
> + return -1;
Idle thought: would it be useful to also gather, for the trace2 statistics,
the number of commits encountered that were outside the commit-graph?
> +
> + filter = get_bloom_filter(revs->repo, commit, 0);
> +
> + if (!filter) {
> + count_bloom_filter_not_present++;
> + return -1;
> + }
> +
> + if (!filter->len) {
> + count_bloom_filter_length_zero++;
> + return -1;
> + }
> +
> + result = bloom_filter_contains(filter,
> + revs->bloom_key,
> + revs->bloom_filter_settings);
> +
> + if (result)
> + count_bloom_filter_maybe++;
> + else
> + count_bloom_filter_definitely_not++;
> +
> + return result;
> +}
The whole check_maybe_different_in_bloom_filter() looks good to me,
thanks to designing and building a good API.
> +
> static int rev_compare_tree(struct rev_info *revs,
> - struct commit *parent, struct commit *commit)
> + struct commit *parent, struct commit *commit, int nth_parent)
> {
> struct tree *t1 = get_commit_tree(parent);
> struct tree *t2 = get_commit_tree(commit);
> + int bloom_ret = 1;
I don't understand why it is initialized to 1, and not to 0.
>
> if (!t1)
> return REV_TREE_NEW;
> @@ -653,11 +758,23 @@ static int rev_compare_tree(struct rev_info *revs,
> return REV_TREE_SAME;
> }
>
> + if (revs->pruning.pathspec.nr == 1 && !revs->reflog_info && !nth_parent) {
Shouldn't we check upfront here that revs->bloom_key is not NULL?
I don't think we check this down the callchain...
Or even better replace the first two checks with it, as revs->bloom_key
is set only if (revs->pruning.pathspec.nr == 1 && !revs->reflog_info),
see addition to prepare_revision_walk() below.
Of course the !nth_parent check needs to be kept, as this changes during
the revision walk (it is a limitation of current version of Bloom filter
in that only changes with respect to first parent are stored in filter).
> + bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
> +
> + if (bloom_ret == 0)
> + return REV_TREE_SAME;
> + }
All right, if we have single pathspec, and we don't walk reflog (?), and
we are interested in first parent, then we query the Bloom filter.
The Bloom filter can return 'no' or 'maybe'; if it returns 'no' then we
can short-circuit and avoid computing the tree diff.
> +
> tree_difference = REV_TREE_SAME;
> revs->pruning.flags.has_changes = 0;
> if (diff_tree_oid(&t1->object.oid, &t2->object.oid, "",
> &revs->pruning) < 0)
> return REV_TREE_DIFFERENT;
> +
> + if (!nth_parent)
Shouldn't this condition be exactly the same as the one guarding the call to
check_maybe_different_in_bloom_filter()? Otherwise, because bloom_ret is
initialized to 1, we would get wrong statistics, wouldn't we?
> + if (bloom_ret == 1 && tree_difference == REV_TREE_SAME)
> + count_bloom_filter_false_positive++;
> +
All right, looks good.
> return tree_difference;
> }
>
> @@ -855,7 +972,7 @@ static void try_to_simplify_commit(struct rev_info *revs, struct commit *commit)
> die("cannot simplify commit %s (because of %s)",
> oid_to_hex(&commit->object.oid),
> oid_to_hex(&p->object.oid));
> - switch (rev_compare_tree(revs, p, commit)) {
> + switch (rev_compare_tree(revs, p, commit, nth_parent)) {
> case REV_TREE_SAME:
> if (!revs->simplify_history || !relevant_commit(p)) {
> /* Even if a merge with an uninteresting
OK, we are just adding a new parameter, carrying the information needed to
decide whether Bloom filters can be used or not.
> @@ -3362,6 +3479,8 @@ int prepare_revision_walk(struct rev_info *revs)
> FOR_EACH_OBJECT_PROMISOR_ONLY);
> }
>
> + if (revs->pruning.pathspec.nr == 1 && !revs->reflog_info)
> + prepare_to_use_bloom_filter(revs);
Well, the limitation that the technique _currently_ works only with a
single pathspec is stated explicitly, but the fact that it is turned off
for some reason for --walk-reflog is not.
Otherwise, looks good to me.
> if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED)
> commit_list_sort_by_date(&revs->commits);
> if (revs->no_walk)
> @@ -3379,6 +3498,7 @@ int prepare_revision_walk(struct rev_info *revs)
> simplify_merges(revs);
> if (revs->children.name)
> set_children(revs);
> +
> return 0;
> }
Unrelated coding style fixup, but we are doing changes in the
neighborhood. All right, I can agree to that.
>
> diff --git a/revision.h b/revision.h
> index 475f048fb6..7c026fe41f 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -56,6 +56,8 @@ struct repository;
> struct rev_info;
> struct string_list;
> struct saved_parents;
> +struct bloom_key;
> +struct bloom_filter_settings;
> define_shared_commit_slab(revision_sources, char *);
>
> struct rev_cmdline_info {
> @@ -291,6 +293,15 @@ struct rev_info {
> struct revision_sources *sources;
>
> struct topo_walk_info *topo_walk_info;
> +
> + /* Commit graph bloom filter fields */
> + /* The bloom filter key for the pathspec */
> + struct bloom_key *bloom_key;
> + /*
> + * The bloom filter settings used to generate the key.
> + * This is loaded from the commit-graph being used.
> + */
> + struct bloom_filter_settings *bloom_filter_settings;
It is nice having those explanatory comments.
Sidenote: if I understand it correctly, revs->bloom_key is allocated but
never free()d. On the other hand revs->bloom_filter_settings is a weak
reference / is set to the value of other pointer, which is allocated and
free()d together with commit_graph struct.
> };
>
> int ref_excluded(struct string_list *, const char *path);
> diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
> index d2884efe0a..aff597c7a3 100644
> --- a/t/helper/test-read-graph.c
> +++ b/t/helper/test-read-graph.c
> @@ -45,6 +45,10 @@ int cmd__read_graph(int argc, const char **argv)
> printf(" commit_metadata");
> if (graph->chunk_extra_edges)
> printf(" extra_edges");
> + if (graph->chunk_bloom_indexes)
> + printf(" bloom_indexes");
> + if (graph->chunk_bloom_data)
> + printf(" bloom_data");
> printf("\n");
This chunk could be moved to the commit adding --changed-paths
option... on the other hand if all tests are to be added by this patch,
it can be left as is.
>
> UNLEAK(graph);
> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> new file mode 100755
> index 0000000000..19eca1864b
> --- /dev/null
> +++ b/t/t4216-log-bloom.sh
[...]
I'll leave reviewing tests of this feature for the next email.
Best regards,
--
Jakub Narębski
* Re: [PATCH v2 10/11] revision.c: use Bloom filters to speed up path based revision walks
2020-02-05 22:56 ` [PATCH v2 10/11] revision.c: use Bloom filters to speed up path based revision walks Garima Singh via GitGitGadget
2020-02-21 17:31 ` Jakub Narebski
@ 2020-02-21 22:45 ` Jakub Narebski
1 sibling, 0 replies; 159+ messages in thread
From: Jakub Narebski @ 2020-02-21 22:45 UTC (permalink / raw)
To: Garima Singh via GitGitGadget
Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
Jeff Hostetler, Taylor Blau, Jeff King, Garima Singh,
Christian Couder, Emily Shaffer, Junio C Hamano, Garima Singh
"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
This is a second part of my response, focusing solely on tests of the
Bloom filters feature.
[...]
> diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
> index d2884efe0a..aff597c7a3 100644
> --- a/t/helper/test-read-graph.c
> +++ b/t/helper/test-read-graph.c
> @@ -45,6 +45,10 @@ int cmd__read_graph(int argc, const char **argv)
> printf(" commit_metadata");
> if (graph->chunk_extra_edges)
> printf(" extra_edges");
> + if (graph->chunk_bloom_indexes)
> + printf(" bloom_indexes");
> + if (graph->chunk_bloom_data)
> + printf(" bloom_data");
> printf("\n");
>
All right, that is simple extension of 'test-helper read-graph'.
> UNLEAK(graph);
> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> new file mode 100755
> index 0000000000..19eca1864b
> --- /dev/null
> +++ b/t/t4216-log-bloom.sh
> @@ -0,0 +1,140 @@
> +#!/bin/sh
> +
> +test_description='git log for a path with bloom filters'
> +. ./test-lib.sh
> +
> +test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
> + git init &&
> + mkdir A A/B A/B/C &&
> + test_commit c1 A/file1 &&
> + test_commit c2 A/B/file2 &&
> + test_commit c3 A/B/C/file3 &&
> + test_commit c4 A/file1 &&
> + test_commit c5 A/B/file2 &&
> + test_commit c6 A/B/C/file3 &&
> + test_commit c7 A/file1 &&
> + test_commit c8 A/B/file2 &&
> + test_commit c9 A/B/C/file3 &&
> + git checkout -b side HEAD~4 &&
> + test_commit side-1 file4 &&
> + git checkout master &&
> + git merge side &&
> + test_commit c10 file5 &&
Unfortunately this might not be enough for Git's heuristic similarity-based
rename detection, as it creates the file 'file5' with content 'c10'.
[Checking something]. Well, actually it looks like it works, even with
not much content. I thought you would need to use something like
+ test_write_lines 1 2 3 4 5 6 7 8 9 >file5 &&
+ git add file5 &&
+ git commit -m c10 &&
But it turns out that it is, as far as I have checked, not necessary.
> + mv file5 file5_renamed &&
> + git add file5_renamed &&
> + git commit -m "rename" &&
> + git commit-graph write --reachable --changed-paths
> +'
Hmmm... there is no test for file that was present in history but got
deleted. Might be important (because of pre-image vs post-image name
issues).
Very minor issue: following the style used in t/test-lib-functions.sh
and the style guide in CodingGuidelines, it should be
+graph_read_expect () {
and the same for the following functions.
https://github.com/git/git/blob/master/Documentation/CodingGuidelines#L144
- We prefer a space between the function name and the parentheses,
and no space inside the parentheses. The opening "{" should also
be on the same line.
(incorrect)
my_function(){
...
(correct)
my_function () {
...
> +graph_read_expect() {
> + OPTIONAL=""
> + NUM_CHUNKS=5
> + cat >expect <<- EOF
> + header: 43475048 1 1 $NUM_CHUNKS 0
> + num_commits: $1
> + chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
Either OPTIONAL remains unused, and should be removed, or we leave it
for possible future extension, and we write
+ chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data$OPTIONAL
like in t/t5318-commit-graph.sh.
> + EOF
> + test-tool read-graph >output &&
> + test_cmp expect output
Why 'output', and not 'actual'?
> +}
> +
> +test_expect_success 'commit-graph write wrote out the bloom chunks' '
> + graph_read_expect 13
> +'
All right, that is sanity-checking 'git commit-graph write --changed-paths'.
> +
> +setup() {
I wonder if we can come up with a better name... setup_log(),
setup_log_bloom(), log_compare()?
> + rm output
This shouldn't be here, in this function. Or perhaps it shouldn't even
be used at all; having 'output' doesn't hinder anything.
> + rm "$TRASH_DIRECTORY/trace.perf"
All right, this cleanup is needed.
> + git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom
> + GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom
All right, we prepare for comparing version without Bloom filters
(reference) and with Bloom filters, and for checking if Bloom filters
were used.
> +}
This setup() function above is missing the && chain.
It should then in my opinion read:
+setup () {
+ rm "$TRASH_DIRECTORY/trace.perf" &&
+ git -c core.commitGraph=false log --format="%s" $1 >log_wo_bloom &&
+ GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" \
+ git -c core.commitGraph=true log --format="%s" $1 >log_w_bloom
+}
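To illustrate why the && chain matters, here is a minimal stand-alone
sketch. The helper names unchained and chained are made up for this
example and are not part of t4216; 'false' stands in for a command that
fails, such as rm on a missing file.

```sh
# Hypothetical helpers, not part of t4216: they only demonstrate how a
# missing && lets an early failure go unnoticed.
unchained () {
	false	# stands in for a command that fails
	true	# last command succeeds, so the function "passes"
}

chained () {
	false &&
	true
}

if unchained
then
	echo "unchained: failure masked"
fi
if ! chained
then
	echo "chained: failure reported"
fi
```

Both messages are printed: without the chain, the exit status of the
function is that of its last command only.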
Also, perhaps we should add at the beginning of this test file, outside
any test_expect_success block, the following (see t/*trace2*.sh files):
# Turn off any inherited trace2 settings for this test.
sane_unset GIT_TRACE2 GIT_TRACE2_PERF GIT_TRACE2_EVENT
sane_unset GIT_TRACE2_PERF_BRIEF
sane_unset GIT_TRACE2_CONFIG_PARAMS
> +
> +test_bloom_filters_used() {
> + log_args=$1
> + bloom_trace_prefix="statistics:{\"filter_not_present\":0,\"zero_length_filter\":0,\"maybe\""
> + setup "$log_args"
Missing && chain.
> + grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" && test_cmp log_wo_bloom log_w_bloom
Why no line break after &&?
> +}
Ugh, examining JSON output with regexp is in my opinion quite fragile.
Though I am not sure if requiring Perl and JSON module installed like
t/t0212-trace2-event.sh is any better.
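One possible middle ground, sketched here under assumptions: pull a
single counter out of the statistics line with sed instead of matching a
fixed prefix. The sample line and the bloom_stat helper are made up for
illustration; real trace2 perf output has more leading columns, and this
is still a regexp, so it reduces the fragility rather than removing it.

```sh
# Sample line is made up for illustration; real trace2 perf output has
# additional leading columns.
line='... | data_json | bloom | statistics:{"filter_not_present":0,"zero_length_filter":0,"maybe":6,"definitely_not":6}'

# Hypothetical helper: print the integer value stored under key $1,
# or nothing at all if the key does not occur in the line.
bloom_stat () {
	printf '%s\n' "$line" |
	sed -n "s/.*\"$1\":\([0-9][0-9]*\).*/\1/p"
}

bloom_stat maybe
bloom_stat filter_not_present
```

With such a helper a test could assert on one counter at a time (for
example, that "filter_not_present" is 0) without depending on the order
of the keys or on the values of the other counters.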
> +
> +test_bloom_filters_not_used() {
> + log_args=$1
> + setup "$log_args"
> + !(grep -q "statistics:{\"filter_not_present\":" "$TRASH_DIRECTORY/trace.perf") && test_cmp log_wo_bloom log_w_bloom
We should also check that the "$TRASH_DIRECTORY/trace.perf" file exists,
with test_path_is_file.
Also, testing that something was not found is a bit fragile, but I don't
have any better idea on how to do this test without negating grep exit
value.
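A stand-alone sketch of the concern about the missing file, using plain
'test -f' in place of test_path_is_file so that it runs outside the test
suite. The helper name file_lacks_pattern and the file content are made
up for this example.

```sh
# Hypothetical helper: succeed only if file $2 exists and has no line
# matching pattern $1. A bare "! grep" would also succeed when the file
# is missing (grep exits with 2, which "!" turns into success), and that
# would hide a broken trace setup.
file_lacks_pattern () {
	test -f "$2" &&
	! grep -q "$1" "$2"
}

trace=$(mktemp)
echo "region_enter | bloom" >"$trace"	# made-up content
file_lacks_pattern "statistics:" "$trace" && echo "pattern absent"
rm -f "$trace"
file_lacks_pattern "statistics:" "$trace" || echo "missing file is a failure"
```

Checking for the file first means "Bloom filters were not used" cannot
be confused with "no trace was written at all".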
> +}
> +
> +for path in A A/B A/B/C A/file1 A/B/file2 A/B/C/file3 file4 file5_renamed
NOTE: file5 is missing from this list!
I suspect that adding it might cause the test to fail.
> +do
> + for option in "" \
> + "--full-history" \
> + "--full-history --simplify-merges" \
> + "--simplify-merges" \
> + "--simplify-by-decoration" \
> + "--follow" \
> + "--first-parent" \
> + "--topo-order" \
> + "--date-order" \
> + "--author-date-order" \
> + "--ancestry-path side..master"
> + do
> + test_expect_success "git log option: $option for path: $path" '
> + test_bloom_filters_used "$option -- $path"
All right, this tests that Bloom filters were used *and* that the
command run with Bloom filters and without Bloom filters (without
commit-graph) produces the same output.
> + '
> + done
> +done
> +
> +test_expect_success 'git log -- folder works with and without the trailing slash' '
> + test_bloom_filters_used "-- A" &&
> + test_bloom_filters_used "-- A/"
> +'
All right.
I wonder if we should also test an insane case, namely the pathname of an
ordinary file followed by a slash:
+ test_bloom_filters_used "-- file4" &&
+ test_bloom_filters_used "-- file4/"
The latter should produce no output, being treated as not existing file.
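If I read prepare_to_use_bloom_filter() correctly, the Bloom key for
"file4/" is computed from "file4", because one trailing slash is dropped
whether or not the path is a directory. So the filter alone cannot tell
the two apart, and the empty result has to come from the tree diff. A
shell rendering of that normalization, as a sketch only (the helper name
strip_one_slash is made up):

```sh
# Hypothetical shell rendering of the C logic: drop one trailing slash,
# leave everything else untouched.
strip_one_slash () {
	printf '%s\n' "${1%/}"
}

for p in A A/ file4 file4/ A//
do
	printf '%-7s -> %s\n' "$p" "$(strip_one_slash "$p")"
done
```

Note that only a single slash is removed, so "A//" becomes "A/", which
matches the "if needed" wording of the comment suggested for the C code.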
> +
> +test_expect_success 'git log for path that does not exist. ' '
> + test_bloom_filters_used "-- path_does_not_exist"
> +'
All right.
> +
> +test_expect_success 'git log with --walk-reflogs does not use bloom filters' '
> + test_bloom_filters_not_used "--walk-reflogs -- A"
> +'
All right, but why is it so?
> +
> +test_expect_success 'git log -- multiple path specs does not use bloom filters' '
> + test_bloom_filters_not_used "-- file4 A/file1"
> +'
All right, though this is a limitation of the current code, not a limitation
of the technique, so _maybe_ it would be better to test_expect_failure that
for multiple pathspecs bloom_filters_used...
> +
> +test_expect_success 'git log with wildcard that resolves to a single path uses bloom filters' '
> + test_bloom_filters_used "-- *4" &&
> + test_bloom_filters_used "-- *renamed"
> +'
> +
> +test_expect_success 'git log with wildcard that resolves to a multiple paths does not uses bloom filters' '
> + test_bloom_filters_not_used "-- *" &&
> + test_bloom_filters_not_used "-- file*"
> +'
Same here.
> +
> +test_expect_success 'setup - add commit-graph to the chain without bloom filters' '
> + test_commit c14 A/anotherFile2 &&
> + test_commit c15 A/B/anotherFile2 &&
> + test_commit c16 A/B/C/anotherFile2 &&
> + GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0 git commit-graph write --reachable --split &&
> + test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain
> +'
> +
> +test_expect_success 'git log does not use bloom filters if the latest graph does not have bloom filters.' '
> + test_bloom_filters_not_used "-- A/B"
> +'
All right... though I would try to come up with a shorter test name :-)
> +
> +test_expect_success 'setup - add commit-graph to the chain with bloom filters' '
> + test_commit c17 A/anotherFile3 &&
> + git commit-graph write --reachable --changed-paths --split &&
> + test_line_count = 3 .git/objects/info/commit-graphs/commit-graph-chain
> +'
> +
> +test_bloom_filters_used_when_some_filters_are_missing() {
> + log_args=$1
> + bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":6"
Perhaps a better solution would be to use (enhanced) 'test-tool bloom'
to check which commits have Bloom filters and which do not.
> + setup "$log_args"
> + grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" && test_cmp log_wo_bloom log_w_bloom
> +}
Why is the && chain broken between setup() and the rest, and why is && not
followed by a line break (as before)?
> +
> +test_expect_success 'git log uses bloom filters if they exist in the latest but not all commit graphs in the chain.' '
> + test_bloom_filters_used_when_some_filters_are_missing "-- A/B"
> +'
> +
> +test_done
All right... though the description of this test is a bit long.
Thank you for your work on this series.
Best,
--
Jakub Narębski
* [PATCH v2 11/11] commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag
2020-02-05 22:56 ` [PATCH v2 00/11] " Garima Singh via GitGitGadget
` (9 preceding siblings ...)
2020-02-05 22:56 ` [PATCH v2 10/11] revision.c: use Bloom filters to speed up path based revision walks Garima Singh via GitGitGadget
@ 2020-02-05 22:56 ` Garima Singh via GitGitGadget
2020-02-22 0:11 ` Jakub Narebski
2020-02-07 13:52 ` [PATCH v2 00/11] Changed Paths Bloom Filters SZEDER Gábor
` (2 subsequent siblings)
13 siblings, 1 reply; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-02-05 22:56 UTC (permalink / raw)
To: git
Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
garimasigit, jnareb, christian.couder, emilyshaffer, gitster,
Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
Add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag to the test setup suite
in order to toggle writing Bloom filters when running any of the git tests.
If set to true, we will compute and write Bloom filters every time a test
calls `git commit-graph write`, as if the `--changed-paths` option was
passed in.
The test suite passes when GIT_TEST_COMMIT_GRAPH and
GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS are enabled.
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
builtin/commit-graph.c | 3 ++-
ci/run-build-and-tests.sh | 1 +
commit-graph.h | 1 +
t/README | 5 +++++
t/t4216-log-bloom.sh | 3 +++
t/t5318-commit-graph.sh | 2 ++
t/t5324-split-commit-graph.sh | 1 +
7 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 261dcce091..fc9b234ab0 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -146,7 +146,8 @@ static int graph_write(int argc, const char **argv)
flags |= COMMIT_GRAPH_WRITE_SPLIT;
if (opts.progress)
flags |= COMMIT_GRAPH_WRITE_PROGRESS;
- if (opts.enable_changed_paths)
+ if (opts.enable_changed_paths ||
+ git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0))
flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
read_replace_refs = 0;
diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
index ff0ef7f08e..7b4857651d 100755
--- a/ci/run-build-and-tests.sh
+++ b/ci/run-build-and-tests.sh
@@ -19,6 +19,7 @@ linux-gcc)
export GIT_TEST_OE_SIZE=10
export GIT_TEST_OE_DELTA_SIZE=5
export GIT_TEST_COMMIT_GRAPH=1
+ export GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=1
export GIT_TEST_MULTI_PACK_INDEX=1
make test
;;
diff --git a/commit-graph.h b/commit-graph.h
index 25fefefb3e..4c202ff3d7 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -8,6 +8,7 @@
#define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
#define GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD"
+#define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
struct commit;
struct bloom_filter_settings;
diff --git a/t/README b/t/README
index caa125ba9a..be2f7d7fd2 100644
--- a/t/README
+++ b/t/README
@@ -378,6 +378,11 @@ GIT_TEST_COMMIT_GRAPH=<boolean>, when true, forces the commit-graph to
be written after every 'git commit' command, and overrides the
'core.commitGraph' setting to true.
+GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=<boolean>, when true, forces
+commit-graph write to compute and write changed path Bloom filters for
+every 'git commit-graph write', as if the `--changed-paths` option was
+passed in.
+
GIT_TEST_FSMONITOR=$PWD/t7519/fsmonitor-all exercises the fsmonitor
code path for utilizing a file system monitor to speed up detecting
new or changed files.
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 19eca1864b..7acebb3962 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -3,6 +3,9 @@
test_description='git log for a path with bloom filters'
. ./test-lib.sh
+GIT_TEST_COMMIT_GRAPH=0
+GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
+
test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
git init &&
mkdir A A/B A/B/C &&
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 3f03de6018..973020be2d 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -3,6 +3,8 @@
test_description='commit graph'
. ./test-lib.sh
+GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
+
test_expect_success 'setup full repo' '
mkdir full &&
cd "$TRASH_DIRECTORY/full" &&
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index c24823431f..9235db4561 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -4,6 +4,7 @@ test_description='split commit graph'
. ./test-lib.sh
GIT_TEST_COMMIT_GRAPH=0
+GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
test_expect_success 'setup repo' '
git init &&
--
gitgitgadget
* Re: [PATCH v2 11/11] commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag
2020-02-05 22:56 ` [PATCH v2 11/11] commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag Garima Singh via GitGitGadget
@ 2020-02-22 0:11 ` Jakub Narebski
0 siblings, 0 replies; 159+ messages in thread
From: Jakub Narebski @ 2020-02-22 0:11 UTC (permalink / raw)
To: Garima Singh via GitGitGadget
Cc: git, stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
garimasigit, christian.couder, emilyshaffer, gitster,
Garima Singh
"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Garima Singh <garima.singh@microsoft.com>
>
> Add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag to the test setup suite
> in order to toggle writing Bloom filters when running any of the git tests.
> If set to true, we will compute and write Bloom filters every time a test
> calls `git commit-graph write`, as if the `--changed-paths` option was
> passed in.
>
> The test suite passes when GIT_TEST_COMMIT_GRAPH and
> GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS are enabled.
All right. Nice.
>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
> builtin/commit-graph.c | 3 ++-
> ci/run-build-and-tests.sh | 1 +
> commit-graph.h | 1 +
> t/README | 5 +++++
> t/t4216-log-bloom.sh | 3 +++
> t/t5318-commit-graph.sh | 2 ++
> t/t5324-split-commit-graph.sh | 1 +
> 7 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
> index 261dcce091..fc9b234ab0 100644
> --- a/builtin/commit-graph.c
> +++ b/builtin/commit-graph.c
> @@ -146,7 +146,8 @@ static int graph_write(int argc, const char **argv)
> flags |= COMMIT_GRAPH_WRITE_SPLIT;
> if (opts.progress)
> flags |= COMMIT_GRAPH_WRITE_PROGRESS;
> - if (opts.enable_changed_paths)
> + if (opts.enable_changed_paths ||
> + git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0))
> flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
>
Looks good to me.
> read_replace_refs = 0;
> diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
> index ff0ef7f08e..7b4857651d 100755
> --- a/ci/run-build-and-tests.sh
> +++ b/ci/run-build-and-tests.sh
> @@ -19,6 +19,7 @@ linux-gcc)
> export GIT_TEST_OE_SIZE=10
> export GIT_TEST_OE_DELTA_SIZE=5
> export GIT_TEST_COMMIT_GRAPH=1
> + export GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=1
> export GIT_TEST_MULTI_PACK_INDEX=1
> make test
> ;;
OK, include in continuous integration.
> diff --git a/commit-graph.h b/commit-graph.h
> index 25fefefb3e..4c202ff3d7 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -8,6 +8,7 @@
>
> #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
> #define GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD"
> +#define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
>
Looks good to me.
> struct commit;
> struct bloom_filter_settings;
> diff --git a/t/README b/t/README
> index caa125ba9a..be2f7d7fd2 100644
> --- a/t/README
> +++ b/t/README
> @@ -378,6 +378,11 @@ GIT_TEST_COMMIT_GRAPH=<boolean>, when true, forces the commit-graph to
> be written after every 'git commit' command, and overrides the
> 'core.commitGraph' setting to true.
>
> +GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=<boolean>, when true, forces
> +commit-graph write to compute and write changed path Bloom filters for
> +every 'git commit-graph write', as if the `--changed-paths` option was
> +passed in.
> +
Good, it is documented in README for tests.
> GIT_TEST_FSMONITOR=$PWD/t7519/fsmonitor-all exercises the fsmonitor
> code path for utilizing a file system monitor to speed up detecting
> new or changed files.
> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> index 19eca1864b..7acebb3962 100755
> --- a/t/t4216-log-bloom.sh
> +++ b/t/t4216-log-bloom.sh
> @@ -3,6 +3,9 @@
> test_description='git log for a path with bloom filters'
> . ./test-lib.sh
>
> +GIT_TEST_COMMIT_GRAPH=0
> +GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
> +
All right, we need to ensure that 'git commit-graph write' is not run
automatically, otherwise split / incremental commit-graph tests would
not work.
We also need to ensure that '--changed-paths' is not added
automatically, so that we can test that commit-graph does not include
Bloom filters chunks if not requested.
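A side note on why a plain assignment without "export" should be enough
here, as far as I can tell: when the variable is already exported (as in
the CI job), a later assignment keeps the export attribute, so child
processes see the new value; when it is not in the environment at all,
the git_env_bool() default of 0 applies anyway. A stand-alone sketch,
where DEMO_FLAG is a stand-in name and not a real Git variable:

```sh
# DEMO_FLAG is a stand-in name, not a real Git variable.
export DEMO_FLAG=1	# what a CI job might export
DEMO_FLAG=0		# what the test script assigns, without "export"

# A child process inherits the reassigned value.
child_sees=$(sh -c 'printf %s "$DEMO_FLAG"')
echo "child sees DEMO_FLAG=$child_sees"
```

The child process reports 0, that is, the value assigned by the script
and not the one exported earlier.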
> test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
> git init &&
> mkdir A A/B A/B/C &&
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 3f03de6018..973020be2d 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -3,6 +3,8 @@
> test_description='commit graph'
> . ./test-lib.sh
>
> +GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
> +
OK, otherwise it would screw up checking the content of commit-graph
with 'test-tool read-graph'.
> test_expect_success 'setup full repo' '
> mkdir full &&
> cd "$TRASH_DIRECTORY/full" &&
> diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> index c24823431f..9235db4561 100755
> --- a/t/t5324-split-commit-graph.sh
> +++ b/t/t5324-split-commit-graph.sh
> @@ -4,6 +4,7 @@ test_description='split commit graph'
> . ./test-lib.sh
>
> GIT_TEST_COMMIT_GRAPH=0
> +GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
>
> test_expect_success 'setup repo' '
> git init &&
Same here.
Looks good to me.
--
Jakub Narębski
* Re: [PATCH v2 00/11] Changed Paths Bloom Filters
2020-02-05 22:56 ` [PATCH v2 00/11] " Garima Singh via GitGitGadget
` (10 preceding siblings ...)
2020-02-05 22:56 ` [PATCH v2 11/11] commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag Garima Singh via GitGitGadget
@ 2020-02-07 13:52 ` SZEDER Gábor
2020-02-07 15:09 ` Garima Singh
2020-02-08 23:04 ` Jakub Narebski
2020-03-30 0:31 ` [PATCH v3 00/16] " Garima Singh via GitGitGadget
13 siblings, 1 reply; 159+ messages in thread
From: SZEDER Gábor @ 2020-02-07 13:52 UTC (permalink / raw)
To: Garima Singh via GitGitGadget
Cc: git, stolee, jonathantanmy, jeffhost, me, peff, garimasigit,
jnareb, christian.couder, emilyshaffer, gitster, Garima Singh
On Wed, Feb 05, 2020 at 10:56:19PM +0000, Garima Singh via GitGitGadget wrote:
> Hey!
>
> The commit graph feature brought in a lot of performance improvements across
> multiple commands. However, file based history continues to be a performance
> pain point, especially in large repositories.
>
> Adopting changed path bloom filters has been discussed on the list before,
> and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
> Derrick Stolee [1]. This series is based on Dr. Stolee's proof of concept in
> [2]
>
> Performance Gains: We tested the performance of git log -- path on the git
> repo, the linux repo and some internal large repos, with a variety of paths
> of varying depths.
>
> On the git and linux repos: We observed a 2x to 5x speed up.
>
> On a large internal repo with files seated 6-10 levels deep in the tree: We
> observed 10x to 20x speed ups, with some paths going up to 28 times faster.
>
> Future Work (not included in the scope of this series):
>
> 1. Supporting multiple path based revision walk
> 2. Adopting it in git blame logic.
> 3. Interactions with line log git log -L
>
>
> ----------------------------------------------------------------------------
>
> Updates since the last submission
>
> * Removed all the RFC callouts, this is a ready for full review version
Don't know when I'll find enough time to properly review the series.
Maybe someday...
> * Added unit tests for the bloom filter computation layer
This fails on big endian, e.g. in Travis CI's s390x build:
https://travis-ci.org/szeder/git-cooking-topics-for-travis-ci/jobs/647253022#L2210
(The link highlights the failure, but I'm afraid your browser won't
jump there right away; you'll have to click on the print-test-failures
fold at the bottom, and scroll down a bit...)
^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v2 00/11] Changed Paths Bloom Filters
2020-02-07 13:52 ` [PATCH v2 00/11] Changed Paths Bloom Filters SZEDER Gábor
@ 2020-02-07 15:09 ` Garima Singh
2020-02-07 15:36 ` Derrick Stolee
2020-02-11 19:08 ` Garima Singh
0 siblings, 2 replies; 159+ messages in thread
From: Garima Singh @ 2020-02-07 15:09 UTC (permalink / raw)
To: SZEDER Gábor, Garima Singh via GitGitGadget
Cc: git, stolee, jonathantanmy, jeffhost, me, peff, jnareb,
christian.couder, emilyshaffer, gitster, Garima Singh
On 2/7/2020 8:52 AM, SZEDER Gábor wrote:
>> * Added unit tests for the bloom filter computation layer
>
> This fails on big endian, e.g. in Travis CI's s390x build:
>
> https://travis-ci.org/szeder/git-cooking-topics-for-travis-ci/jobs/647253022#L2210
>
> (The link highlights the failure, but I'm afraid your browser won't
> jump there right away; you'll have to click on the print-test-failures
> fold at the bottom, and scroll down a bit...)
>
Thank you so much for running this pipeline and pointing out the error!
We will carefully review our interactions with the binary data and
hopefully solve this in the next version.
Cheers!
Garima Singh
^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v2 00/11] Changed Paths Bloom Filters
2020-02-07 15:09 ` Garima Singh
@ 2020-02-07 15:36 ` Derrick Stolee
2020-02-07 16:15 ` SZEDER Gábor
2020-02-11 19:08 ` Garima Singh
1 sibling, 1 reply; 159+ messages in thread
From: Derrick Stolee @ 2020-02-07 15:36 UTC (permalink / raw)
To: Garima Singh, SZEDER Gábor, Garima Singh via GitGitGadget
Cc: git, jonathantanmy, jeffhost, me, peff, jnareb, christian.couder,
emilyshaffer, gitster, Garima Singh
On 2/7/2020 10:09 AM, Garima Singh wrote:
>
> On 2/7/2020 8:52 AM, SZEDER Gábor wrote:
>>> * Added unit tests for the bloom filter computation layer
>>
>> This fails on big endian, e.g. in Travis CI's s390x build:
>>
>> https://travis-ci.org/szeder/git-cooking-topics-for-travis-ci/jobs/647253022#L2210
>>
>> (The link highlights the failure, but I'm afraid your browser won't
>> jump there right away; you'll have to click on the print-test-failures
>> fold at the bottom, and scroll down a bit...)
>>
>
> Thank you so much for running this pipeline and pointing out the error!
>
> We will carefully review our interactions with the binary data and
> hopefully solve this in the next version.
Szeder,
Thanks so much for running this test. We don't have access to a big endian
machine right now, so could you please apply this patch and re-run your tests?
The issue is described in the message below, and Garima is working to ensure
the handling of the filter data is clarified in the next version.
This is an issue from WAY back in the original prototype, and it highlights
that we've never been writing the data in network-byte order. This is completely
my fault.
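To illustrate the problem (a minimal sketch only, not part of the fixup
patch below): the bytes of a uint64_t in memory depend on the host CPU,
while bytes stored explicitly in network (big-endian) order do not. The
helper names put_be64_sketch() and get_be64_sketch() are hypothetical and
exist only for this example; the patch itself relies on htonll()/ntohll().

```c
#include <stdint.h>

/*
 * Hypothetical helper: store 'v' into 'out' in network (big-endian)
 * byte order, independent of the host CPU's endianness.
 */
static void put_be64_sketch(unsigned char *out, uint64_t v)
{
	int i;

	for (i = 0; i < 8; i++)
		out[i] = (unsigned char)(v >> (56 - 8 * i));
}

/* Hypothetical helper: read back a value stored by put_be64_sketch(). */
static uint64_t get_be64_sketch(const unsigned char *in)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | in[i];
	return v;
}
```

With these helpers, a word with only bit 3 set is always stored with
0x08 in its last byte, whereas copying the raw uint64_t would put that
byte first on a little-endian host and last on a big-endian one.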
Thanks,
-Stolee
-->8--
From c1067db5d618b2dae430dfe373a11c771517da9e Mon Sep 17 00:00:00 2001
From: Derrick Stolee <dstolee@microsoft.com>
Date: Fri, 7 Feb 2020 10:24:05 -0500
Subject: [PATCH] fixup! bloom: core Bloom filter implementation for changed
paths
The 'data' field of 'struct bloom_filter' can point to a memory location
(when computing one before writing to the commit-graph) or a mmap()'d
file location (when reading from the Bloom data chunk of the commit-graph
file). This means that the memory representation may be backwards in
Little Endian or Big Endian machines.
Always write and read bits from 'filter->data' using network order. This
allows us to avoid loading the data streams from the file into memory
buffers.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
bloom.c | 6 ++++--
t/helper/test-bloom.c | 2 +-
2 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/bloom.c b/bloom.c
index 90d84dc713..aa6896584b 100644
--- a/bloom.c
+++ b/bloom.c
@@ -124,8 +124,9 @@ void add_key_to_filter(struct bloom_key *key,
for (i = 0; i < settings->num_hashes; i++) {
uint64_t hash_mod = key->hashes[i] % mod;
uint64_t block_pos = hash_mod / BITS_PER_WORD;
+ uint64_t bit = get_bitmask(hash_mod);
- filter->data[block_pos] |= get_bitmask(hash_mod);
+ filter->data[block_pos] |= htonll(bit);
}
}
@@ -269,7 +270,8 @@ int bloom_filter_contains(struct bloom_filter *filter,
for (i = 0; i < settings->num_hashes; i++) {
uint64_t hash_mod = key->hashes[i] % mod;
uint64_t block_pos = hash_mod / BITS_PER_WORD;
- if (!(filter->data[block_pos] & get_bitmask(hash_mod)))
+ uint64_t bit = get_bitmask(hash_mod);
+ if (!(filter->data[block_pos] & htonll(bit)))
return 0;
}
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 9b4be97f75..09b2bb0a00 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -23,7 +23,7 @@ static void print_bloom_filter(struct bloom_filter *filter) {
printf("Filter_Length:%d\n", filter->len);
printf("Filter_Data:");
for (i = 0; i < filter->len; i++){
- printf("%"PRIx64"|", filter->data[i]);
+ printf("%"PRIx64"|", ntohll(filter->data[i]));
}
printf("\n");
}
--
2.25.0.vfs.1.1.1.g9906319d24.dirty
^ permalink raw reply related [flat|nested] 159+ messages in thread
* Re: [PATCH v2 00/11] Changed Paths Bloom Filters
2020-02-07 15:36 ` Derrick Stolee
@ 2020-02-07 16:15 ` SZEDER Gábor
2020-02-07 16:33 ` Derrick Stolee
0 siblings, 1 reply; 159+ messages in thread
From: SZEDER Gábor @ 2020-02-07 16:15 UTC (permalink / raw)
To: Derrick Stolee
Cc: Garima Singh, Garima Singh via GitGitGadget, git, jonathantanmy,
jeffhost, me, peff, jnareb, christian.couder, emilyshaffer,
gitster, Garima Singh
On Fri, Feb 07, 2020 at 10:36:58AM -0500, Derrick Stolee wrote:
> On 2/7/2020 10:09 AM, Garima Singh wrote:
> >
> > On 2/7/2020 8:52 AM, SZEDER Gábor wrote:
> >>> * Added unit tests for the bloom filter computation layer
> >>
> >> This fails on big endian, e.g. in Travis CI's s390x build:
> >>
> >> https://travis-ci.org/szeder/git-cooking-topics-for-travis-ci/jobs/647253022#L2210
> >>
> >> (The link highlights the failure, but I'm afraid your browser won't
> >> jump there right away; you'll have to click on the print-test-failures
> >> fold at the bottom, and scroll down a bit...)
> >>
> >
> > Thank you so much for running this pipeline and pointing out the error!
> >
> > We will carefully review our interactions with the binary data and
> > hopefully solve this in the next version.
>
> Szeder,
>
> Thanks so much for running this test. We don't have access to a big endian
> machine right now, so could you please apply this patch and re-run your tests?
Unfortunately, it still failed:
https://travis-ci.org/szeder/git-cooking-topics-for-travis-ci/jobs/647395554#L2204
> The issue is described in the message below, and Garima is working to ensure
> the handling of the filter data is clarified in the next version.
>
> This is an issue from WAY back in the original prototype, and it highlights
> that we've never been writing the data in network-byte order. This is completely
> my fault.
>
> Thanks,
> -Stolee
>
>
> -->8--
>
> From c1067db5d618b2dae430dfe373a11c771517da9e Mon Sep 17 00:00:00 2001
> From: Derrick Stolee <dstolee@microsoft.com>
> Date: Fri, 7 Feb 2020 10:24:05 -0500
> Subject: [PATCH] fixup! bloom: core Bloom filter implementation for changed
> paths
>
> The 'data' field of 'struct bloom_filter' can point to a memory location
> (when computing one before writing to the commit-graph) or a mmap()'d
> file location (when reading from the Bloom data chunk of the commit-graph
> file). This means that the memory representation may be backwards in
> Little Endian or Big Endian machines.
>
> Always write and read bits from 'filter->data' using network order. This
> allows us to avoid loading the data streams from the file into memory
> buffers.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> bloom.c | 6 ++++--
> t/helper/test-bloom.c | 2 +-
> 2 files changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/bloom.c b/bloom.c
> index 90d84dc713..aa6896584b 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -124,8 +124,9 @@ void add_key_to_filter(struct bloom_key *key,
> for (i = 0; i < settings->num_hashes; i++) {
> uint64_t hash_mod = key->hashes[i] % mod;
> uint64_t block_pos = hash_mod / BITS_PER_WORD;
> + uint64_t bit = get_bitmask(hash_mod);
>
> - filter->data[block_pos] |= get_bitmask(hash_mod);
> + filter->data[block_pos] |= htonll(bit);
> }
> }
>
> @@ -269,7 +270,8 @@ int bloom_filter_contains(struct bloom_filter *filter,
> for (i = 0; i < settings->num_hashes; i++) {
> uint64_t hash_mod = key->hashes[i] % mod;
> uint64_t block_pos = hash_mod / BITS_PER_WORD;
> - if (!(filter->data[block_pos] & get_bitmask(hash_mod)))
> + uint64_t bit = get_bitmask(hash_mod);
> + if (!(filter->data[block_pos] & htonll(bit)))
> return 0;
> }
>
> diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
> index 9b4be97f75..09b2bb0a00 100644
> --- a/t/helper/test-bloom.c
> +++ b/t/helper/test-bloom.c
> @@ -23,7 +23,7 @@ static void print_bloom_filter(struct bloom_filter *filter) {
> printf("Filter_Length:%d\n", filter->len);
> printf("Filter_Data:");
> for (i = 0; i < filter->len; i++){
> - printf("%"PRIx64"|", filter->data[i]);
> + printf("%"PRIx64"|", ntohll(filter->data[i]));
> }
> printf("\n");
> }
> --
> 2.25.0.vfs.1.1.1.g9906319d24.dirty
>
>
>
^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v2 00/11] Changed Paths Bloom Filters
2020-02-07 16:15 ` SZEDER Gábor
@ 2020-02-07 16:33 ` Derrick Stolee
0 siblings, 0 replies; 159+ messages in thread
From: Derrick Stolee @ 2020-02-07 16:33 UTC (permalink / raw)
To: SZEDER Gábor
Cc: Garima Singh, Garima Singh via GitGitGadget, git, jonathantanmy,
jeffhost, me, peff, jnareb, christian.couder, emilyshaffer,
gitster, Garima Singh
On 2/7/2020 11:15 AM, SZEDER Gábor wrote:
> On Fri, Feb 07, 2020 at 10:36:58AM -0500, Derrick Stolee wrote:
>> On 2/7/2020 10:09 AM, Garima Singh wrote:
>>>
>>> On 2/7/2020 8:52 AM, SZEDER Gábor wrote:
>>>>> * Added unit tests for the bloom filter computation layer
>>>>
>>>> This fails on big endian, e.g. in Travis CI's s390x build:
>>>>
>>>> https://travis-ci.org/szeder/git-cooking-topics-for-travis-ci/jobs/647253022#L2210
>>>>
>>>> (The link highlights the failure, but I'm afraid your browser won't
>>>> jump there right away; you'll have to click on the print-test-failures
>>>> fold at the bottom, and scroll down a bit...)
>>>>
>>>
>>> Thank you so much for running this pipeline and pointing out the error!
>>>
>>> We will carefully review our interactions with the binary data and
>>> hopefully solve this in the next version.
>>
>> Szeder,
>>
>> Thanks so much for running this test. We don't have access to a big endian
>> machine right now, so could you please apply this patch and re-run your tests?
>
> Unfortunately, it still failed:
>
> https://travis-ci.org/szeder/git-cooking-topics-for-travis-ci/jobs/647395554#L2204
Thanks! Both fail on test 2 of t0095-bloom.sh, which includes this
expected output line:
Filter_Data:508928809087080a|8a7648210804001|4089824400951000|841ab310098051a8|
We may not be properly adjusting the output in the test-helper.
I still think the fixup patch I included is a good idea, but Garima
continues to dig into the problem from all angles to understand this
failure and the full fix.
-Stolee
^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v2 00/11] Changed Paths Bloom Filters
2020-02-07 15:09 ` Garima Singh
2020-02-07 15:36 ` Derrick Stolee
@ 2020-02-11 19:08 ` Garima Singh
1 sibling, 0 replies; 159+ messages in thread
From: Garima Singh @ 2020-02-11 19:08 UTC (permalink / raw)
To: SZEDER Gábor, Garima Singh via GitGitGadget
Cc: git, stolee, jonathantanmy, jeffhost, me, peff, jnareb,
christian.couder, emilyshaffer, gitster, Garima Singh
On 2/7/2020 10:09 AM, Garima Singh wrote:
>
> On 2/7/2020 8:52 AM, SZEDER Gábor wrote:
>>> * Added unit tests for the bloom filter computation layer
>>
>> This fails on big endian, e.g. in Travis CI's s390x build:
>>
>> https://travis-ci.org/szeder/git-cooking-topics-for-travis-ci/jobs/647253022#L2210
>>
>> (The link highlights the failure, but I'm afraid your browser won't
>> jump there right away; you'll have to click on the print-test-failures
>> fold at the bottom, and scroll down a bit...)
>>
>
> Thank you so much for running this pipeline and pointing out the error!
>
> We will carefully review our interactions with the binary data and
> hopefully solve this in the next version.
>
> Cheers!
> Garima Singh
>
Hey!
The patch below carries the fix for the failure on Big-endian architectures.
We now treat bloom filter data as a simple binary stream of 1 byte words
instead of 8 byte words. This avoids the Big-endian vs Little-endian
confusion on different CPU architectures.
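As a rough sketch of the idea (illustrative only; the names
sketch_set_bit(), sketch_test_bit() and SKETCH_BITS_PER_WORD are made up
for this example and do not appear in the patch): when every word is a
single byte, the position of a bit in the stored data depends only on
arithmetic on the hash value, never on how the CPU lays out multi-byte
integers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BITS_PER_WORD 8

/* Set the bit selected by hash value 'h' in a filter of 'len' bytes. */
static void sketch_set_bit(unsigned char *data, size_t len, uint32_t h)
{
	uint32_t pos;

	assert(len > 0);
	pos = h % (uint32_t)(len * SKETCH_BITS_PER_WORD);
	data[pos / SKETCH_BITS_PER_WORD] |=
		(unsigned char)(1u << (pos & (SKETCH_BITS_PER_WORD - 1)));
}

/* Return 1 if the bit selected by hash value 'h' is set, 0 otherwise. */
static int sketch_test_bit(const unsigned char *data, size_t len, uint32_t h)
{
	uint32_t pos;

	assert(len > 0);
	pos = h % (uint32_t)(len * SKETCH_BITS_PER_WORD);
	return (data[pos / SKETCH_BITS_PER_WORD] >>
		(pos & (SKETCH_BITS_PER_WORD - 1))) & 1;
}
```

For a two-byte filter, hash values 0 and 4 both land in data[0] (giving
0x11) and hash value 9 lands in data[1] (giving 0x02), on any
architecture.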
Here is the successful run of SZEDER's Travis CI s390x build.
https://travis-ci.org/szeder/git/jobs/649044879
I will be squashing this patch into the appropriate commits in the series
in v3, which I will send out after people have had a chance to complete
their review of v2.
A special thanks to SZEDER for helping us test our patches on his CI
pipeline and saving us the overhead of setting up a Big-endian machine!
Cheers!
Garima Singh
-->8--
From ee72310dd8c3ad2b810914edb651008f637e7c2a Mon Sep 17 00:00:00 2001
From: Garima Singh <garima.singh@microsoft.com>
Date: Tue, 11 Feb 2020 13:55:03 -0500
Subject: [PATCH] Process bloom filter data as 1 byte words
Process bloom filter data as 1 byte words instead of 8 byte
words to avoid the Big-endian vs Little-endian confusion on
different CPU architectures
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
bloom.c | 24 ++++-----
bloom.h | 4 +-
commit-graph.c | 4 +-
t/helper/test-bloom.c | 4 +-
t/t0095-bloom.sh | 118 +++++++++++++++++++++---------------------
5 files changed, 77 insertions(+), 77 deletions(-)
diff --git a/bloom.c b/bloom.c
index 90d84dc713..6d5d6bb2ef 100644
--- a/bloom.c
+++ b/bloom.c
@@ -45,12 +45,13 @@ static uint32_t seed_murmur3(uint32_t seed, const char *data, int len)
int len4 = len / sizeof(uint32_t);
- const uint32_t *blocks = (const uint32_t*)data;
-
uint32_t k;
- for (i = 0; i < len4; i++)
- {
- k = blocks[i];
+ for (i = 0; i < len4; i++) {
+ uint32_t byte1 = (uint32_t)data[4*i];
+ uint32_t byte2 = ((uint32_t)data[4*i + 1]) << 8;
+ uint32_t byte3 = ((uint32_t)data[4*i + 2]) << 16;
+ uint32_t byte4 = ((uint32_t)data[4*i + 3]) << 24;
+ k = byte1 | byte2 | byte3 | byte4;
k *= c1;
k = rotate_right(k, r1);
k *= c2;
@@ -61,8 +62,7 @@ static uint32_t seed_murmur3(uint32_t seed, const char *data, int len)
tail = (data + len4 * sizeof(uint32_t));
- switch (len & (sizeof(uint32_t) - 1))
- {
+ switch (len & (sizeof(uint32_t) - 1)) {
case 3:
k1 ^= ((uint32_t)tail[2]) << 16;
/*-fallthrough*/
@@ -88,9 +88,9 @@ static uint32_t seed_murmur3(uint32_t seed, const char *data, int len)
return seed;
}
-static inline uint64_t get_bitmask(uint32_t pos)
+static inline unsigned char get_bitmask(uint32_t pos)
{
- return ((uint64_t)1) << (pos & (BITS_PER_WORD - 1));
+ return ((unsigned char)1) << (pos & (BITS_PER_WORD - 1));
}
void load_bloom_filters(void)
@@ -152,8 +152,8 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
start_index = 0;
filter->len = end_index - start_index;
- filter->data = (uint64_t *)(g->chunk_bloom_data +
- sizeof(uint64_t) * start_index +
+ filter->data = (unsigned char *)(g->chunk_bloom_data +
+ sizeof(unsigned char) * start_index +
BLOOMDATA_CHUNK_HEADER_SIZE);
return 1;
@@ -234,7 +234,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
}
filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
- filter->data = xcalloc(filter->len, sizeof(uint64_t));
+ filter->data = xcalloc(filter->len, sizeof(unsigned char));
hashmap_for_each_entry(&pathmap, &iter, e, entry) {
struct bloom_key key;
diff --git a/bloom.h b/bloom.h
index 76f8a9ad0c..9604723ce0 100644
--- a/bloom.h
+++ b/bloom.h
@@ -12,7 +12,7 @@ struct bloom_filter_settings {
};
#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
-#define BITS_PER_WORD 64
+#define BITS_PER_WORD 8
#define BLOOMDATA_CHUNK_HEADER_SIZE 3*sizeof(uint32_t)
/*
@@ -22,7 +22,7 @@ struct bloom_filter_settings {
* 'data'.
*/
struct bloom_filter {
- uint64_t *data;
+ unsigned char *data;
int len;
};
diff --git a/commit-graph.c b/commit-graph.c
index c0e9834bf2..f5f9a23c9a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1125,7 +1125,7 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
while (list < last) {
struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
display_progress(progress, ++i);
- hashwrite(f, filter->data, filter->len * sizeof(uint64_t));
+ hashwrite(f, filter->data, filter->len * sizeof(unsigned char));
list++;
}
@@ -1305,7 +1305,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
for (i = 0; i < ctx->commits.nr; i++) {
struct commit *c = sorted_by_pos[i];
struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
- ctx->total_bloom_filter_data_size += sizeof(uint64_t) * filter->len;
+ ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
display_progress(progress, i + 1);
}
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 9b4be97f75..8fa2d8fc25 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -23,7 +23,7 @@ static void print_bloom_filter(struct bloom_filter *filter) {
printf("Filter_Length:%d\n", filter->len);
printf("Filter_Data:");
for (i = 0; i < filter->len; i++){
- printf("%"PRIx64"|", filter->data[i]);
+ printf("%02x|", filter->data[i]);
}
printf("\n");
}
@@ -57,7 +57,7 @@ int cmd__bloom(int argc, const char **argv)
struct bloom_filter filter;
int i = 2;
filter.len = (settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
- filter.data = xcalloc(filter.len, sizeof(uint64_t));
+ filter.data = xcalloc(filter.len, sizeof(unsigned char));
if (!argv[2]){
die("at least one input string expected");
diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
index 424fe4fc29..58273219ff 100755
--- a/t/t0095-bloom.sh
+++ b/t/t0095-bloom.sh
@@ -3,58 +3,11 @@
test_description='test bloom.c'
. ./test-lib.sh
-test_expect_success 'get bloom filters for commit with no changes' '
- git init &&
- git commit --allow-empty -m "c0" &&
- cat >expect <<-\EOF &&
- Filter_Length:0
- Filter_Data:
- EOF
- test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
- test_cmp expect actual
-'
-
-test_expect_success 'get bloom filter for commit with 10 changes' '
- rm actual &&
- rm expect &&
- mkdir smallDir &&
- for i in $(test_seq 0 9)
- do
- echo $i >smallDir/$i
- done &&
- git add smallDir &&
- git commit -m "commit with 10 changes" &&
- cat >expect <<-\EOF &&
- Filter_Length:4
- Filter_Data:508928809087080a|8a7648210804001|4089824400951000|841ab310098051a8|
- EOF
- test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
- test_cmp expect actual
-'
-
-test_expect_success EXPENSIVE 'get bloom filter for commit with 513 changes' '
- rm actual &&
- rm expect &&
- mkdir bigDir &&
- for i in $(test_seq 0 512)
- do
- echo $i >bigDir/$i
- done &&
- git add bigDir &&
- git commit -m "commit with 513 changes" &&
- cat >expect <<-\EOF &&
- Filter_Length:0
- Filter_Data:
- EOF
- test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
- test_cmp expect actual
-'
-
test_expect_success 'compute bloom key for empty string' '
cat >expect <<-\EOF &&
Hashes:5615800c|5b966560|61174ab4|66983008|6c19155c|7199fab0|771ae004|
- Filter_Length:1
- Filter_Data:11000110001110|
+ Filter_Length:2
+ Filter_Data:11|11|
EOF
test-tool bloom generate_filter "" >actual &&
test_cmp expect actual
@@ -63,8 +16,8 @@ test_expect_success 'compute bloom key for empty string' '
test_expect_success 'compute bloom key for whitespace' '
cat >expect <<-\EOF &&
Hashes:1bf014e6|8a91b50b|f9335530|67d4f555|d676957a|4518359f|b3b9d5c4|
- Filter_Length:1
- Filter_Data:401004080200810|
+ Filter_Length:2
+ Filter_Data:71|8c|
EOF
test-tool bloom generate_filter " " >actual &&
test_cmp expect actual
@@ -73,8 +26,8 @@ test_expect_success 'compute bloom key for whitespace' '
test_expect_success 'compute bloom key for a root level folder' '
cat >expect <<-\EOF &&
Hashes:1a21016f|fff1c06d|e5c27f6b|cb933e69|b163fd67|9734bc65|7d057b63|
- Filter_Length:1
- Filter_Data:aaa800000000|
+ Filter_Length:2
+ Filter_Data:a8|aa|
EOF
test-tool bloom generate_filter "A" >actual &&
test_cmp expect actual
@@ -83,8 +36,8 @@ test_expect_success 'compute bloom key for a root level folder' '
test_expect_success 'compute bloom key for a root level file' '
cat >expect <<-\EOF &&
Hashes:e2d51107|30970605|7e58fb03|cc1af001|19dce4ff|679ed9fd|b560cefb|
- Filter_Length:1
- Filter_Data:a8000000000000aa|
+ Filter_Length:2
+ Filter_Data:aa|a8|
EOF
test-tool bloom generate_filter "file.txt" >actual &&
test_cmp expect actual
@@ -93,8 +46,8 @@ test_expect_success 'compute bloom key for a root level file' '
test_expect_success 'compute bloom key for a deep folder' '
cat >expect <<-\EOF &&
Hashes:864cf838|27f055cd|c993b362|6b3710f7|0cda6e8c|ae7dcc21|502129b6|
- Filter_Length:1
- Filter_Data:1c0000600003000|
+ Filter_Length:2
+ Filter_Data:c6|31|
EOF
test-tool bloom generate_filter "A/B/C/D/E" >actual &&
test_cmp expect actual
@@ -103,11 +56,58 @@ test_expect_success 'compute bloom key for a deep folder' '
test_expect_success 'compute bloom key for a deep file' '
cat >expect <<-\EOF &&
Hashes:07cdf850|4af629c7|8e1e5b3e|d1468cb5|146ebe2c|5796efa3|9abf211a|
- Filter_Length:1
- Filter_Data:4020100804010080|
+ Filter_Length:2
+ Filter_Data:a9|54|
EOF
test-tool bloom generate_filter "A/B/C/D/E/file.txt" >actual &&
test_cmp expect actual
'
+test_expect_success 'get bloom filters for commit with no changes' '
+ git init &&
+ git commit --allow-empty -m "c0" &&
+ cat >expect <<-\EOF &&
+ Filter_Length:0
+ Filter_Data:
+ EOF
+ test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'get bloom filter for commit with 10 changes' '
+ rm actual &&
+ rm expect &&
+ mkdir smallDir &&
+ for i in $(test_seq 0 9)
+ do
+ echo $i >smallDir/$i
+ done &&
+ git add smallDir &&
+ git commit -m "commit with 10 changes" &&
+ cat >expect <<-\EOF &&
+ Filter_Length:25
+ Filter_Data:c2|0b|b8|c0|10|88|f0|1d|c1|0c|01|a4|01|28|81|80|01|30|10|d0|92|be|88|10|8a|
+ EOF
+ test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success EXPENSIVE 'get bloom filter for commit with 513 changes' '
+ rm actual &&
+ rm expect &&
+ mkdir bigDir &&
+ for i in $(test_seq 0 512)
+ do
+ echo $i >bigDir/$i
+ done &&
+ git add bigDir &&
+ git commit -m "commit with 513 changes" &&
+ cat >expect <<-\EOF &&
+ Filter_Length:0
+ Filter_Data:
+ EOF
+ test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
+ test_cmp expect actual
+'
+
test_done
--
2.22.0.windows.1
^ permalink raw reply related [flat|nested] 159+ messages in thread
* Re: [PATCH v2 00/11] Changed Paths Bloom Filters
2020-02-05 22:56 ` [PATCH v2 00/11] " Garima Singh via GitGitGadget
` (11 preceding siblings ...)
2020-02-07 13:52 ` [PATCH v2 00/11] Changed Paths Bloom Filters SZEDER Gábor
@ 2020-02-08 23:04 ` Jakub Narebski
2020-02-21 17:41 ` Garima Singh
2020-03-30 0:31 ` [PATCH v3 00/16] " Garima Singh via GitGitGadget
13 siblings, 1 reply; 159+ messages in thread
From: Jakub Narebski @ 2020-02-08 23:04 UTC (permalink / raw)
To: Garima Singh via GitGitGadget
Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
Jeff Hostetler, Taylor Blau, Jeff King, Garima Singh,
Christian Couder, Emily Shaffer, Junio C Hamano, Garima Singh
"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
> Hey!
>
> The commit graph feature brought in a lot of performance improvements across
> multiple commands. However, file based history continues to be a performance
> pain point, especially in large repositories.
>
> Adopting changed path Bloom filters has been discussed on the list before,
> and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
> Derrick Stolee [1]. This series is based on Dr. Stolee's proof of
> concept in [2].
Sidenote: I wondered why it did use MurmurHash3 (64-bit version), which
requires adding its implementation, instead of reusing FNV-1 hash
(Fowler–Noll–Vo hash function) used by Git hashmap implementation, see
https://github.com/git/git/blob/228f53135a4a41a37b6be8e4d6e2b6153db4a8ed/hashmap.h#L109
Beside the fact that everyone is using MurmurHash for Bloom filters ;-)
It turns out that in various benchmarks MurmurHash is faster and also
slightly better as a hash than FNV-1 or FNV-1a.
I wonder then if it would be a good idea (in the future) to make it easy
to use hashmap with MurmurHash3 instead of FNV-1, or maybe to even make
it the default for hashing strings.
>
> Performance Gains: We tested the performance of git log -- path on the git
> repo, the linux repo and some internal large repos, with a variety of paths
> of varying depths.
As I wrote in reply to the previous version of this series, a good public
repository (and thus one that anyone can use) for testing the Bloom
filter performance improvements could be AOSP (Android) base:
https://android.googlesource.com/platform/frameworks/base/
which is a large repository with long path depths (due to Java file
naming conventions).
>
> On the git and linux repos: We observed a 2x to 5x speed up.
>
> On a large internal repo with files seated 6-10 levels deep in the tree: We
> observed 10x to 20x speed ups, with some paths going up to 28 times faster.
Very nice! Good work!
What is the cost of this feature, that is, how long does it take to
generate the Bloom filters, and how much larger does the commit-graph
file get? It would be nice to know.
>
> Future Work (not included in the scope of this series):
>
> 1. Supporting multiple path based revision walk
Shouldn't the tests added in v2 then mark the use of Bloom filters
with a multiple-path revision walk as _not working *yet*_
(test_expect_failure), rather than as expected not to work
(test_expect_success with test_bloom_filters_not_used)?
> 2. Adopting it in git blame logic.
> 3. Interactions with line log git log -L
>
>
> ----------------------------------------------------------------------------
>
> Updates since the last submission
>
> * Removed all the RFC callouts, this is a ready for full review version
> * Added unit tests for the bloom filter computation layer
> * Added more evolved functional tests for git log
> * Fixed a lot of the bugs found by the tests
> * Reacted to other miscellaneous feedback on the RFC series.
>
> Cheers! Garima Singh
>
> [1] https://lore.kernel.org/git/20181009193445.21908-1-szeder.dev@gmail.com/
> [2] https://lore.kernel.org/git/61559c5b-546e-d61b-d2e1-68de692f5972@gmail.com/
>
> Derrick Stolee (2):
> diff: halt tree-diff early after max_changes
> commit-graph: examine commits by generation number
>
> Garima Singh (8):
> commit-graph: use MAX_NUM_CHUNKS
> bloom: core Bloom filter implementation for changed paths
> commit-graph: compute Bloom filters for changed paths
> commit-graph: write Bloom filters to commit graph file
> commit-graph: reuse existing Bloom filters during write.
> commit-graph: add --changed-paths option to write subcommand
> revision.c: use Bloom filters to speed up path based revision walks
> commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag
>
> Jeff King (1):
> commit-graph: examine changed-path objects in pack order
The shortlog summary is a fine tool to show contributors to the patch
series, but is not as useful to show patch series as a whole: splitting
of patches and their ordering.
I will review each of patches individually, but now I would like to say
a few things about the series as a whole.
- [PATCH v2 01/11] commit-graph: use MAX_NUM_CHUNKS
Simple and non-controversial patch, improvement to existing code with
the goal of helping future development (including further patches).
- [PATCH v2 02/11] bloom: core Bloom filter implementation for changed paths
In my opinion this patch could be split into three individual pieces,
though one might think it is not worth it.
a. Add implementation of MurmurHash v3 (64-bit)
Include tests based on test-tool (creating a file similar to
t/helper/test-hash.c, or enhancing that file) that the
implementation is correct, for example that 'The quick brown fox jumps
over the lazy dog' with a given seed (for example the default seed of 0)
hashes to the same value as in other implementations.
b. Add implementation of Bloom filter
Include generic Bloom filter tests, i.e. that it correctly answers
"no" and "maybe" (create filter, save it or print it, then use stored
filter), and tests specific to our implementation, namely that the
size of the filter behaves as it should.
c. Bloom filter implementation for changed paths
Here include tests that use 'test-tool bloom get_filter_for_commit',
that the filters for a commit with no changes and for a commit with more
than 512 files changed work correctly, that directories are added along the
paths, etc.
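To make the kind of generic test suggested in (b) and (c) concrete, here
is a minimal sketch. It is not the code from the series: the seeded
FNV-1a variant in sketch_hash() is only a stand-in for the series'
murmur3 implementation, and the filter size, the seeds, and all sketch_*
names are assumptions for illustration. It relies on the general property
that a Bloom filter can only answer "definitely not" or "maybe", and must
never answer "definitely not" for a key that was added.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_FILTER_BYTES 16
#define SKETCH_NUM_HASHES 7

struct sketch_filter {
	unsigned char data[SKETCH_FILTER_BYTES];
};

/* Stand-in hash (seeded FNV-1a), NOT the murmur3 used by the series. */
static uint32_t sketch_hash(uint32_t seed, const char *s)
{
	uint32_t h = 2166136261u ^ seed;

	for (; *s; s++) {
		h ^= (unsigned char)*s;
		h *= 16777619u;
	}
	return h;
}

/* Double hashing: the i-th bit position is derived from two base hashes. */
static uint32_t sketch_bit_pos(const char *key, int i)
{
	uint32_t h0 = sketch_hash(0, key);
	uint32_t h1 = sketch_hash(1, key);

	return (h0 + (uint32_t)i * h1) % (SKETCH_FILTER_BYTES * 8);
}

static void sketch_init(struct sketch_filter *f)
{
	memset(f->data, 0, sizeof(f->data));
}

static void sketch_add(struct sketch_filter *f, const char *key)
{
	int i;

	for (i = 0; i < SKETCH_NUM_HASHES; i++) {
		uint32_t pos = sketch_bit_pos(key, i);
		f->data[pos / 8] |= (unsigned char)(1u << (pos % 8));
	}
}

/* Returns 0 for "definitely not present", 1 for "maybe present". */
static int sketch_maybe_contains(const struct sketch_filter *f, const char *key)
{
	int i;

	for (i = 0; i < SKETCH_NUM_HASHES; i++) {
		uint32_t pos = sketch_bit_pos(key, i);
		if (!(f->data[pos / 8] & (1u << (pos % 8))))
			return 0;
	}
	return 1;
}
```

A test along these lines would add "A", "A/B" and "A/B/file.txt" (the
file plus the directories along its path) and check that all three are
reported as "maybe", while an empty filter reports "definitely not" for
any key. Keys that were never added may still be reported as "maybe";
that is the expected false-positive behavior, so a test should not
assert the opposite.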
- [PATCH v2 03/11] diff: halt tree-diff early after max_changes
I think keeping this patch as a separate step makes individual commits
easier to understand and review.
- [PATCH v2 04/11] commit-graph: compute Bloom filters for changed paths
Here we compute Bloom filters for changed paths for each commit in the
commit-graph file, without writing it to file; as a side-effect we
calculate total Bloom filters data size.
This doesn't make much sense as a standalone patch, but it is nice,
easy to understand incremental step in building the feature.
- [PATCH v2 05/11] commit-graph: examine changed-path objects in pack order
- [PATCH v2 06/11] commit-graph: examine commits by generation number
Those two are performance improvements to the previous step. It is good
to keep them as separate commits, as that makes them easier to understand
(and makes it easier to catch errors via git-bisect, should there be any).
- [PATCH v2 07/11] commit-graph: write Bloom filters to commit graph file
This commit includes the documentation of the two new chunks of
commit-graph file format.
I wonder if the 9th patch in this series, namely
commit-graph: add --changed-paths option to write subcommand
should not precede this commit. Otherwise we have this new code but
no way of testing it. On the other hand it makes it easier to
review. On the gripping hand, you can't really test that writing
works without the ability to parse Bloom filter data out of
commit-graph file... which is the next commit.
- [PATCH v2 08/11] commit-graph: reuse existing Bloom filters during write
This implements reading Bloom filters data from commit-graph file.
Is it a good split? I think it makes it easier to review each single
patch, but it also makes them less standalone.
- [PATCH v2 09/11] commit-graph: add --changed-paths option to write subcommand
One thing we could test there is that we are writing two new chunks to
the commit-graph file (and perhaps checking that they are correctly
formatted, and have correct shape).
- [PATCH v2 10/11] revision.c: use Bloom filters to speed up path based revision walks
This is quite a big and involved patch, which in my opinion could be
split in two or three parts:
a. Add a bare bones implementation, like in v2
This limits amount of testing we can do; the only thing we can really
test is that we get the same results with and without Bloom filters.
b.1. Add trace2 Bloom filter statistics
b.2. Use said trace2 statistics to test use of Bloom filters
- [PATCH v2 11/11] commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag
This one is for (optional) exhaustive testing of the feature.
Feel free to disagree with those ideas.
Best,
--
Jakub Narębski
^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v2 00/11] Changed Paths Bloom Filters
2020-02-08 23:04 ` Jakub Narebski
@ 2020-02-21 17:41 ` Garima Singh
2020-03-29 18:36 ` Junio C Hamano
0 siblings, 1 reply; 159+ messages in thread
From: Garima Singh @ 2020-02-21 17:41 UTC (permalink / raw)
To: Jakub Narebski, Garima Singh via GitGitGadget
Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
Jeff Hostetler, Taylor Blau, Jeff King, Christian Couder,
Emily Shaffer, Junio C Hamano, Garima Singh
On 2/8/2020 6:04 PM, Jakub Narebski wrote:
> "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> Hey!
>>
>> The commit graph feature brought in a lot of performance improvements across
>> multiple commands. However, file based history continues to be a performance
>> pain point, especially in large repositories.
>>
>> Adopting changed path Bloom filters has been discussed on the list before,
>> and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
>> Derrick Stolee [1]. This series is based on Dr. Stolee's proof of
>> concept in [2].
>
> Sidenote: I wondered why it did use MurmurHash3 (64-bit version), which
> requires adding its implementation, instead of reusing FNV-1 hash
> (Fowler–Noll–Vo hash function) used by Git hashmap implementation, see
> https://github.com/git/git/blob/228f53135a4a41a37b6be8e4d6e2b6153db4a8ed/hashmap.h#L109
> Beside the fact that everyone is using MurmurHash for Bloom filters ;-)
>
> It turns out that in various benchmark MurmurHash is faster and also
> slightly better as a hash than FNV-1 or FNV-1b.
>
>
> I wonder then if it would be a good idea (in the future) to make it easy
> to use hashmap with MurmurHash3 instead of FNV-1, or maybe to even make
> it the default for hashing strings.
>
Making Murmur3 hash the default for hashing strings is definitely outside the
scope of this series. Also, if the method signatures for the murmur3 hash
matched the existing hash method signatures in hashmap.c, then it would be
appropriate to place them adjacently, even if no hashmap consumer uses it for
hashmaps. However, we need the option to start at a custom seed to do our double
hashing. A change in the future that involves adopting murmur3 in the hashmap
code would involve a simple code move before creating the new methods that
avoid a custom seed. So for now, it makes sense that these methods live in
bloom.c where they are being used for a very specific purpose.
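
The double hashing mentioned above can be sketched as follows. This is a minimal, self-contained illustration and not the code from the series: the function names murmur3_seeded and double_hash are hypothetical, while the two seeds and the count of 7 hashes are taken from the v2 patch text quoted later in this thread. It uses the standard rotate-left form of 32-bit MurmurHash3 and reads blocks as little-endian bytes, which is an assumption of the sketch.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_HASHES 7 /* number of hashes in the series' default settings */

/* Rotate a 32-bit value left by count bits (0 < count < 32). */
static uint32_t rotl32(uint32_t value, unsigned count)
{
	return (value << count) | (value >> (32 - count));
}

/* Standard 32-bit MurmurHash3 with the seed passed in explicitly. */
static uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len)
{
	const uint32_t c1 = 0xcc9e2d51, c2 = 0x1b873593;
	const unsigned char *bytes = (const unsigned char *)data;
	size_t nblocks = len / 4, i;
	uint32_t h = seed, k;

	for (i = 0; i < nblocks; i++) {
		/* Assemble each 4-byte block in little-endian order. */
		k = (uint32_t)bytes[4 * i] |
		    (uint32_t)bytes[4 * i + 1] << 8 |
		    (uint32_t)bytes[4 * i + 2] << 16 |
		    (uint32_t)bytes[4 * i + 3] << 24;
		k *= c1;
		k = rotl32(k, 15);
		k *= c2;
		h ^= k;
		h = rotl32(h, 13) * 5 + 0xe6546b64;
	}

	/* Mix in the 1 to 3 trailing bytes, if any. */
	bytes += nblocks * 4;
	k = 0;
	switch (len & 3) {
	case 3:
		k ^= (uint32_t)bytes[2] << 16;
		/* fallthrough */
	case 2:
		k ^= (uint32_t)bytes[1] << 8;
		/* fallthrough */
	case 1:
		k ^= bytes[0];
		k *= c1;
		k = rotl32(k, 15);
		k *= c2;
		h ^= k;
	}

	/* Final avalanche step. */
	h ^= (uint32_t)len;
	h ^= h >> 16;
	h *= 0x85ebca6b;
	h ^= h >> 13;
	h *= 0xc2b2ae35;
	h ^= h >> 16;
	return h;
}

/*
 * Double hashing: derive NUM_HASHES values from only two seeded hashes,
 * out[i] = h0 + i * h1 (arithmetic wraps modulo 2^32).
 */
static void double_hash(const char *data, size_t len, uint32_t out[NUM_HASHES])
{
	const uint32_t h0 = murmur3_seeded(0x293ae76f, data, len);
	const uint32_t h1 = murmur3_seeded(0x7e646e2c, data, len);
	int i;

	for (i = 0; i < NUM_HASHES; i++)
		out[i] = h0 + (uint32_t)i * h1;
}
```

Because the seed is an ordinary parameter here, the same routine serves both hashes. A hashmap-style signature that takes only the data could not express this, which is the point made in the reply above.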
>>
>> Performance Gains: We tested the performance of git log -- path on the git
>> repo, the linux repo and some internal large repos, with a variety of paths
>> of varying depths.
>
> As I wrote in reply to previous version of this series, a good public
> repository (and thus being able to use by anyone) to test the Bloom
> filter performance improvements could be AOSP (Android) base:
>
> https://android.googlesource.com/platform/frameworks/base/
>
> which is a large repository with long path depths (due to Java file
> naming conventions).
>
Thank you! I will incorporate these results into the commit messages as
appropriate in v3.
>>
>> On the git and linux repos: We observed a 2x to 5x speed up.
>>
>> On a large internal repo with files seated 6-10 levels deep in the tree: We
>> observed 10x to 20x speed ups, with some paths going up to 28 times faster.
>
> Very nice! Good work!
>
> What is the cost of this feature, that is how long it takes to generate
> Bloom filters, and how much larger commit-graph file gets? It would be
> nice to know.
>
The cost of writing is much better now with Peff and Dr. Stolee's improvements.
I will include these numbers as well in the commit messages as appropriate in
v3.
>>
>> Future Work (not included in the scope of this series):
>>
>> 1. Supporting multiple path based revision walk
>
> Shouldn't then tests that were added in v2 mark use of Bloom filters
> with multiple paths revision walking as _not working *yet*_
> (test_expect_failure), and not expected to not work (test_expect_success
> with test_bloom_filters_not_used)?
>
My intent is to ensure that bloom filters are not being used in any of the
unsupported code paths. I don't have a strong preference about the test
semantics as long as I get that coverage :) So I will look into switching it
to test_expect_failure as you have suggested.
>> Derrick Stolee (2):
>> diff: halt tree-diff early after max_changes
>> commit-graph: examine commits by generation number
>>
>> Garima Singh (8):
>> commit-graph: use MAX_NUM_CHUNKS
>> bloom: core Bloom filter implementation for changed paths
>> commit-graph: compute Bloom filters for changed paths
>> commit-graph: write Bloom filters to commit graph file
>> commit-graph: reuse existing Bloom filters during write.
>> commit-graph: add --changed-paths option to write subcommand
>> revision.c: use Bloom filters to speed up path based revision walks
>> commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag
>>
>> Jeff King (1):
>> commit-graph: examine changed-path objects in pack order
>
> The shortlog summary is a fine tool to show contributors to the patch
> series, but is not as useful to show patch series as a whole: splitting
> of patches and their ordering.
>
This is a GitGitGadget specific thing, and it is probably by design. I have
opened an issue in that repo for any follow up discussions:
https://github.com/gitgitgadget/gitgitgadget/issues/203
> - [PATCH v2 02/11] bloom: core Bloom filter implementation for changed paths
>
> In my opinion this patch could be split into three individual pieces,
> though one might think it is not worth it.
>
I have gone back and forth on doing this. I like most of the core Bloom filter
computations being isolated in one patch/commit. But based on the rest of your
review, it seems like you are leaning heavily on having this split out.
So, I will take a proper stab at doing it for v3.
> - [PATCH v2 07/11] commit-graph: write Bloom filters to commit graph file
>
> This commit includes the documentation of the two new chunks of
> commit-graph file format.
>
> I wonder if the 9th patch in this series, namely
> commit-graph: add --changed-paths option to write subcommand
> should not precede this commit. Otherwise we have this new code but
> no way of testing it. On the other hand it makes it easier to
> review. On the gripping hand, you can't really test that writing
> works without the ability to parse Bloom filter data out of
> commit-graph file... which is the next commit.
>
Getting complete test coverage within a single patch would require 2 or 3 of
these patches to be combined. This would lead to a large patch that would be
much more difficult to review.
My tests in the patches following this one run git commands. Hence the tests
get introduced when the command line is ready to use all the new code.
The current ordering of patches works better than adding the --changed-paths
option before the logic that computes and writes. Otherwise the option will not
be doing what it is supposed to do in the patch it was introduced in.
> - [PATCH v2 08/11] commit-graph: reuse existing Bloom filters during write
>
> This implements reading Bloom filters data from commit-graph file.
> Is it a good split? I think it makes it easier to review the single
> patch, but it also makes them less standalone.
>
All the logic up to this point works just fine without the ability to read and
parse precomputed bloom filters. This patch is an enhancement and it also
separates out the reading and writing logic. Reusing existing bloom filters
during write is the simplest interaction that involves reading from the commit
graph file, and builds the foundation to make the `git log` improvements.
Hence, it warrants its own patch and review.
> - [PATCH v2 10/11] revision.c: use Bloom filters to speed up path based revision walks
>
> This is quite a big and involved patch, which in my opinion could be
> split in two or three parts:
>
> a. Add a bare bones implementation, like in v2
>
> This limits amount of testing we can do; the only thing we can really
> test is that we get the same results with and without Bloom filters.
>
> b.1. Add trace2 Bloom filter statistics
> b.2. Use said trace2 statistics to test use of Bloom filters
>
Sure. I will look into doing this split as well for v3.
>
> Feel free to disagree with those ideas.
>
> Best,
Thanks for taking the time for reviewing this series so thoroughly!
It is greatly appreciated!
Cheers,
Garima Singh
^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v2 00/11] Changed Paths Bloom Filters
2020-02-21 17:41 ` Garima Singh
@ 2020-03-29 18:36 ` Junio C Hamano
0 siblings, 0 replies; 159+ messages in thread
From: Junio C Hamano @ 2020-03-29 18:36 UTC (permalink / raw)
To: Garima Singh
Cc: Jakub Narebski, Garima Singh via GitGitGadget, git,
Derrick Stolee, SZEDER Gábor, Jonathan Tan, Jeff Hostetler,
Taylor Blau, Jeff King, Christian Couder, Emily Shaffer,
Garima Singh
Garima Singh <garimasigit@gmail.com> writes:
> On 2/8/2020 6:04 PM, Jakub Narebski wrote:
>> "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> ...
> I have gone back and forth on doing this. I like most of the core Bloom filter
> computations being isolated in one patch/commit. But based on the rest of your
> review, it seems like you are leaning heavily on having this split out.
> So, I will take a proper stab at doing it for v3.
> ...
> Thanks for taking the time for reviewing this series so thoroughly!
> It is greatly appreciated!
Thanks for a great discussion. Just a friendly ping to the thread,
so that something from the discussion thread will stay on the first
page of mailing list archive's threaded view ;-)
^ permalink raw reply [flat|nested] 159+ messages in thread
* [PATCH v3 00/16] Changed Paths Bloom Filters
2020-02-05 22:56 ` [PATCH v2 00/11] " Garima Singh via GitGitGadget
` (12 preceding siblings ...)
2020-02-08 23:04 ` Jakub Narebski
@ 2020-03-30 0:31 ` Garima Singh via GitGitGadget
2020-03-30 0:31 ` [PATCH v3 01/16] commit-graph: define and use MAX_NUM_CHUNKS Garima Singh via GitGitGadget
` (16 more replies)
13 siblings, 17 replies; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-03-30 0:31 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh
Hey!
The commit graph feature brought in a lot of performance improvements across
multiple commands. However, file based history continues to be a performance
pain point, especially in large repositories.
Adopting changed path Bloom filters has been discussed on the list before,
and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
Derrick Stolee [1]. This series is based on Dr. Stolee's proof of concept in
[2].
With the changes in this series, git users will be able to choose to write
Bloom filters to the commit-graph using the following command:
'git commit-graph write --changed-paths'
Subsequent 'git log -- path' commands will use these computed Bloom filters
to decide which commits are worth exploring further to produce the history
of the provided path.
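
As a rough illustration of how a filter lets the walk decide whether a commit is worth exploring, here is a minimal sketch of the membership test. It is not the code from the series: the names filter_sketch, sketch_add and sketch_contains are hypothetical, the filter is modelled as a plain byte array, and the hash values are assumed to be precomputed for the path being queried.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_HASHES 7    /* hashes per key, as in the series' defaults */
#define BITS_PER_WORD 8 /* assumption: the filter is stored as bytes */

/* Hypothetical stand-in for a per-commit changed-path filter. */
struct filter_sketch {
	unsigned char *data;
	size_t len; /* number of bytes in data */
};

/* Set one bit per hash value for a path changed by the commit. */
static void sketch_add(struct filter_sketch *f,
		       const uint32_t hashes[NUM_HASHES])
{
	uint64_t mod = (uint64_t)f->len * BITS_PER_WORD;
	int i;

	if (!mod)
		return; /* empty filter: nothing to record */

	for (i = 0; i < NUM_HASHES; i++) {
		uint64_t pos = hashes[i] % mod;
		f->data[pos / BITS_PER_WORD] |=
			(unsigned char)(1u << (pos % BITS_PER_WORD));
	}
}

/*
 * Returns  1 if the path may have changed (the diff must still be run),
 *          0 if the path definitely did not change (commit can be skipped),
 *         -1 if the filter is empty and the caller must fall back to
 *            the normal diffing algorithm.
 */
static int sketch_contains(const struct filter_sketch *f,
			   const uint32_t hashes[NUM_HASHES])
{
	uint64_t mod = (uint64_t)f->len * BITS_PER_WORD;
	int i;

	if (!mod)
		return -1;

	for (i = 0; i < NUM_HASHES; i++) {
		uint64_t pos = hashes[i] % mod;
		if (!(f->data[pos / BITS_PER_WORD] &
		      (1u << (pos % BITS_PER_WORD))))
			return 0;
	}
	return 1;
}
```

A result of 1 only means "maybe", since Bloom filters allow false positives. The speed-up comes from the commits that answer 0, for which no tree diff has to be computed at all.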
Cost of computing and writing Bloom filters
===========================================
Computing and writing Bloom filters to the commit graph for the first time
implies computing the diffs and the resulting Bloom filters for all the
commits in the repository. This adds a non-trivial amount of run time.
Every subsequent run is incremental, i.e. we reuse the previously
computed Bloom filters. So this is a one-time cost.
Time taken by 'git commit-graph write' with and w/o --changed-paths, speed
up in 'git log -- path' with computed Bloom filters (see a):
-------------------------------------------------------------------------
| Repo        | w/o --changed-paths | with --changed-paths | Speed up   |
-------------------------------------------------------------------------
| git [3]     | 0.9 seconds         | 7 seconds            | 2x to 6x   |
| linux [4]   | 16 seconds          | 1 minute 8 seconds   | 2x to 6x   |
| android [5] | 9 seconds           | 48 seconds           | 2x to 6x   |
| AzDo(see b) | 1 minute            | 5 minutes 2 seconds  | 10x to 30x |
-------------------------------------------------------------------------
a) We tested the performance of git log -- path with randomly chosen paths
of varying depths in each repo. The speed up depends on how deep the files
are in the hierarchy and how often a file has been touched in the commit
history.
b) This internal repository has about 420k commits, 183k files distributed
across 34k folders, the size on disk is about 17 GiB. The most massive gains
on this repository were for files 6-12 levels deep in the tree.
c) These numbers were collected on a Windows machine, except for the linux
repo which was tested on a Linux machine.
Future Work (not included in the scope of this series)
======================================================
1. Supporting multiple path based revision walks
2. Adopting it in git blame logic
3. Interactions with line log (git log -L)
Cheers! Garima Singh
[1] https://lore.kernel.org/git/20181009193445.21908-1-szeder.dev@gmail.com/
[2]
https://lore.kernel.org/git/61559c5b-546e-d61b-d2e1-68de692f5972@gmail.com/
[3] https://github.com/git/git
[4] https://github.com/torvalds/linux
[5] https://android.googlesource.com/platform/frameworks/base/
Cc: jeffhost@microsoft.com, me@ttaylorr.com, peff@peff.net,
garimasigit@gmail.com, jnareb@gmail.com, christian.couder@gmail.com,
emilyshaffer@gmail.com, gitster@pobox.com
Derrick Stolee (1):
diff: halt tree-diff early after max_changes
Garima Singh (14):
commit-graph: define and use MAX_NUM_CHUNKS
bloom.c: add the murmur3 hash implementation
bloom.c: introduce core Bloom filter constructs
bloom.c: core Bloom filter implementation for changed paths.
commit-graph: compute Bloom filters for changed paths
commit-graph: examine commits by generation number
diff: skip batch object download when possible
commit-graph: write Bloom filters to commit graph file
commit-graph: reuse existing Bloom filters during write
commit-graph: add --changed-paths option to write subcommand
revision.c: use Bloom filters to speed up path based revision walks
revision.c: add trace2 stats around Bloom filter usage
t4216: add end to end tests for git log with Bloom filters
commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag
Jeff King (1):
commit-graph: examine changed-path objects in pack order
Documentation/git-commit-graph.txt | 5 +
.../technical/commit-graph-format.txt | 30 ++
Makefile | 2 +
bloom.c | 276 ++++++++++++++++++
bloom.h | 90 ++++++
builtin/commit-graph.c | 10 +-
ci/run-build-and-tests.sh | 1 +
commit-graph.c | 213 +++++++++++++-
commit-graph.h | 9 +-
diff.c | 8 +-
diff.h | 6 +
revision.c | 126 +++++++-
revision.h | 11 +
t/README | 5 +
t/helper/test-bloom.c | 81 +++++
t/helper/test-read-graph.c | 4 +
t/helper/test-tool.c | 1 +
t/helper/test-tool.h | 1 +
t/t0095-bloom.sh | 117 ++++++++
t/t4216-log-bloom.sh | 155 ++++++++++
t/t5318-commit-graph.sh | 2 +
t/t5324-split-commit-graph.sh | 1 +
tree-diff.c | 6 +
23 files changed, 1148 insertions(+), 12 deletions(-)
create mode 100644 bloom.c
create mode 100644 bloom.h
create mode 100644 t/helper/test-bloom.c
create mode 100755 t/t0095-bloom.sh
create mode 100755 t/t4216-log-bloom.sh
base-commit: 3bab5d56259722843359702bc27111475437ad2a
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-497%2Fgarimasi514%2FcoreGit-bloomFilters-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-497/garimasi514/coreGit-bloomFilters-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/497
Range-diff vs v2:
1: bf6b93878af ! 1: c3ffd9820d5 commit-graph: use MAX_NUM_CHUNKS
@@ -1,10 +1,12 @@
Author: Garima Singh <garima.singh@microsoft.com>
- commit-graph: use MAX_NUM_CHUNKS
+ commit-graph: define and use MAX_NUM_CHUNKS
- This is a minor cleanup to make it easier to change the
- number of chunks being written to the commit-graph in the future.
+ This is a minor cleanup to make it easier to change
+ the number of chunks being written to the commit
+ graph.
+ Reviewed-by: Jakub Narębski <jnareb@gmail.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
diff --git a/commit-graph.c b/commit-graph.c
-: ----------- > 2: a5aa3415c05 bloom.c: add the murmur3 hash implementation
-: ----------- > 3: a7702c1afde bloom.c: introduce core Bloom filter constructs
2: 02b16d94227 ! 4: 8304c297520 bloom: core Bloom filter implementation for changed paths
@@ -1,89 +1,33 @@
Author: Garima Singh <garima.singh@microsoft.com>
- bloom: core Bloom filter implementation for changed paths
+ bloom.c: core Bloom filter implementation for changed paths.
- Add the core Bloom filter logic for computing the paths changed between a
- commit and its first parent. For details on what Bloom filters are and how they
- work, please refer to Dr. Derrick Stolee's blog post [1]. It provides a concise
- explaination of the adoption of Bloom filters as described in [2] and [3]
+ Add the core implementation for computing Bloom filters for
+ the paths changed between a commit and it's first parent.
- 1. We currently use 7 and 10 for the number of hashes and the size of each
- entry respectively. They served as great starting values, the mathematical
- details behind this choice are described in [1] and [4]. The implementation
- while not completely open to it at the moment, is flexible enough to allow
- for tweaking these settings in the future.
+ We fill the Bloom filters as (const char *data, int len) pairs
+ as `struct bloom_filters" within a commit slab.
- Note: The performance gains we have observed with these values are
- significant enough that we did not need to tweak these settings.
- The performance numbers are included in the cover letter of this series
- and in the message of a subsequent commit where we use Bloom filters in
- to speed up `git log -- <path>`.
-
- 2. As described in the blog and in [3], we do not need 7 independent hashing
- functions. We use the Murmur3 hashing scheme. Seed it twice and then
- combine those to procure an arbitrary number of hash values.
-
- 3. The filters are sized according to the number of changes in the each commit,
- with minimum size of one 64 bit word.
-
- 4. We fill the Bloom filters as (const char *data, int len) pairs as
- "struct bloom_filter"s in a commit slab.
-
- 5. The seed_murmur3 method is implemented as described in [5]. It hashes the
- given data using a given seed and produces a uniformly distributed hash
- value.
-
- [1] https://devblogs.microsoft.com/devops/super-charging-the-git-commit-graph-iv-Bloom-filters/
-
- [2] Flavio Bonomi, Michael Mitzenmacher, Rina Panigrahy, Sushil Singh, George Varghese
- "An Improved Construction for Counting Bloom Filters"
- http://theory.stanford.edu/~rinap/papers/esa2006b.pdf
- https://doi.org/10.1007/11841036_61
-
- [3] Peter C. Dillinger and Panagiotis Manolios
- "Bloom Filters in Probabilistic Verification"
- http://www.ccs.neu.edu/home/pete/pub/Bloom-filters-verification.pdf
- https://doi.org/10.1007/978-3-540-30494-4_26
-
- [4] Thomas Mueller Graf, Daniel Lemire
- "Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters"
- https://arxiv.org/abs/1912.08258
-
- [5] https://en.wikipedia.org/wiki/MurmurHash#Algorithm
+ Filters for commits with no changes and more than 512 changes,
+ is represented with a filter of length zero. There is no gain
+ in distinguishing between a computed filter of length zero for
+ a commit with no changes, and an uncomputed filter for new commits
+ or for commits with more than 512 changes. The effect on
+ `git log -- path` is the same in both cases. We will fall back to
+ the normal diffing algorithm when we can't benefit from the
+ existence of Bloom filters.
Helped-by: Jeff King <peff@peff.net>
Helped-by: Derrick Stolee <dstolee@microsoft.com>
+ Reviewed-by: Jakub Narębski <jnareb@gmail.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
- diff --git a/Makefile b/Makefile
- --- a/Makefile
- +++ b/Makefile
-@@
-
- PROGRAMS += $(patsubst %.o,git-%$X,$(PROGRAM_OBJS))
-
-+TEST_BUILTINS_OBJS += test-bloom.o
- TEST_BUILTINS_OBJS += test-chmtime.o
- TEST_BUILTINS_OBJS += test-config.o
- TEST_BUILTINS_OBJS += test-ctype.o
-@@
- LIB_OBJS += bisect.o
- LIB_OBJS += blame.o
- LIB_OBJS += blob.o
-+LIB_OBJS += bloom.o
- LIB_OBJS += branch.o
- LIB_OBJS += bulk-checkin.o
- LIB_OBJS += bundle.o
-
diff --git a/bloom.c b/bloom.c
- new file mode 100644
- --- /dev/null
+ --- a/bloom.c
+++ b/bloom.c
@@
-+#include "git-compat-util.h"
-+#include "bloom.h"
-+#include "commit-graph.h"
-+#include "object-store.h"
+ #include "git-compat-util.h"
+ #include "bloom.h"
+#include "diff.h"
+#include "diffcore.h"
+#include "revision.h"
@@ -97,118 +41,19 @@
+ struct hashmap_entry entry;
+ const char path[FLEX_ARRAY];
+};
+
+ static uint32_t rotate_left(uint32_t value, int32_t count)
+ {
+@@
+ filter->data[block_pos] |= get_bitmask(hash_mod);
+ }
+ }
+
-+static uint32_t rotate_right(uint32_t value, int32_t count)
-+{
-+ uint32_t mask = 8 * sizeof(uint32_t) - 1;
-+ count &= mask;
-+ return ((value >> count) | (value << ((-count) & mask)));
-+}
-+
-+/*
-+ * Calculate a hash value for the given data using the given seed.
-+ * Produces a uniformly distributed hash value.
-+ * Not considered to be cryptographically secure.
-+ * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
-+ **/
-+static uint32_t seed_murmur3(uint32_t seed, const char *data, int len)
-+{
-+ const uint32_t c1 = 0xcc9e2d51;
-+ const uint32_t c2 = 0x1b873593;
-+ const uint32_t r1 = 15;
-+ const uint32_t r2 = 13;
-+ const uint32_t m = 5;
-+ const uint32_t n = 0xe6546b64;
-+ int i;
-+ uint32_t k1 = 0;
-+ const char *tail;
-+
-+ int len4 = len / sizeof(uint32_t);
-+
-+ const uint32_t *blocks = (const uint32_t*)data;
-+
-+ uint32_t k;
-+ for (i = 0; i < len4; i++)
-+ {
-+ k = blocks[i];
-+ k *= c1;
-+ k = rotate_right(k, r1);
-+ k *= c2;
-+
-+ seed ^= k;
-+ seed = rotate_right(seed, r2) * m + n;
-+ }
-+
-+ tail = (data + len4 * sizeof(uint32_t));
-+
-+ switch (len & (sizeof(uint32_t) - 1))
-+ {
-+ case 3:
-+ k1 ^= ((uint32_t)tail[2]) << 16;
-+ /*-fallthrough*/
-+ case 2:
-+ k1 ^= ((uint32_t)tail[1]) << 8;
-+ /*-fallthrough*/
-+ case 1:
-+ k1 ^= ((uint32_t)tail[0]) << 0;
-+ k1 *= c1;
-+ k1 = rotate_right(k1, r1);
-+ k1 *= c2;
-+ seed ^= k1;
-+ break;
-+ }
-+
-+ seed ^= (uint32_t)len;
-+ seed ^= (seed >> 16);
-+ seed *= 0x85ebca6b;
-+ seed ^= (seed >> 13);
-+ seed *= 0xc2b2ae35;
-+ seed ^= (seed >> 16);
-+
-+ return seed;
-+}
-+
-+static inline uint64_t get_bitmask(uint32_t pos)
-+{
-+ return ((uint64_t)1) << (pos & (BITS_PER_WORD - 1));
-+}
-+
-+void load_bloom_filters(void)
++void init_bloom_filters(void)
+{
+ init_bloom_filter_slab(&bloom_filters);
+}
+
-+void fill_bloom_key(const char *data,
-+ int len,
-+ struct bloom_key *key,
-+ struct bloom_filter_settings *settings)
-+{
-+ int i;
-+ const uint32_t seed0 = 0x293ae76f;
-+ const uint32_t seed1 = 0x7e646e2c;
-+ const uint32_t hash0 = seed_murmur3(seed0, data, len);
-+ const uint32_t hash1 = seed_murmur3(seed1, data, len);
-+
-+ key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
-+ for (i = 0; i < settings->num_hashes; i++)
-+ key->hashes[i] = hash0 + i * hash1;
-+}
-+
-+void add_key_to_filter(struct bloom_key *key,
-+ struct bloom_filter *filter,
-+ struct bloom_filter_settings *settings)
-+{
-+ int i;
-+ uint64_t mod = filter->len * BITS_PER_WORD;
-+
-+ for (i = 0; i < settings->num_hashes; i++) {
-+ uint64_t hash_mod = key->hashes[i] % mod;
-+ uint64_t block_pos = hash_mod / BITS_PER_WORD;
-+
-+ filter->data[block_pos] |= get_bitmask(hash_mod);
-+ }
-+}
-+
+struct bloom_filter *get_bloom_filter(struct repository *r,
+ struct commit *c)
+{
@@ -217,7 +62,7 @@
+ int i;
+ struct diff_options diffopt;
+
-+ if (!bloom_filters.slab_size)
++ if (bloom_filters.slab_size == 0)
+ return NULL;
+
+ filter = bloom_filter_slab_at(&bloom_filters, c);
@@ -234,13 +79,12 @@
+
+ if (diff_queued_diff.nr <= 512) {
+ struct hashmap pathmap;
-+ struct pathmap_hash_entry* e;
++ struct pathmap_hash_entry *e;
+ struct hashmap_iter iter;
+ hashmap_init(&pathmap, NULL, NULL, 0);
+
+ for (i = 0; i < diff_queued_diff.nr; i++) {
-+ const char* path = diff_queued_diff.queue[i]->two->path;
-+ const char* p = path;
++ const char *path = diff_queued_diff.queue[i]->two->path;
+
+ /*
+ * Add each leading directory of the changed file, i.e. for
@@ -251,23 +95,23 @@
+ * Note that directories are added without the trailing '/'.
+ */
+ do {
-+ char* last_slash = strrchr(p, '/');
++ char *last_slash = strrchr(path, '/');
+
+ FLEX_ALLOC_STR(e, path, path);
-+ hashmap_entry_init(&e->entry, strhash(p));
++ hashmap_entry_init(&e->entry, strhash(path));
+ hashmap_add(&pathmap, &e->entry);
+
+ if (!last_slash)
-+ last_slash = (char*)p;
++ last_slash = (char*)path;
+ *last_slash = '\0';
+
-+ } while (*p);
++ } while (*path);
+
+ diff_free_filepair(diff_queued_diff.queue[i]);
+ }
+
+ filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
-+ filter->data = xcalloc(filter->len, sizeof(uint64_t));
++ filter->data = xcalloc(filter->len, sizeof(unsigned char));
+
+ hashmap_for_each_entry(&pathmap, &iter, e, entry) {
+ struct bloom_key key;
@@ -287,138 +131,48 @@
+ DIFF_QUEUE_CLEAR(&diff_queued_diff);
+
+ return filter;
-+}
-+
-+int bloom_filter_contains(struct bloom_filter *filter,
-+ struct bloom_key *key,
-+ struct bloom_filter_settings *settings)
-+{
-+ int i;
-+ uint64_t mod = filter->len * BITS_PER_WORD;
-+
-+ if (!mod)
-+ return -1;
-+
-+ for (i = 0; i < settings->num_hashes; i++) {
-+ uint64_t hash_mod = key->hashes[i] % mod;
-+ uint64_t block_pos = hash_mod / BITS_PER_WORD;
-+ if (!(filter->data[block_pos] & get_bitmask(hash_mod)))
-+ return 0;
-+ }
-+
-+ return 1;
+}
diff --git a/bloom.h b/bloom.h
- new file mode 100644
- --- /dev/null
+ --- a/bloom.h
+++ b/bloom.h
@@
-+#ifndef BLOOM_H
-+#define BLOOM_H
-+
+ #ifndef BLOOM_H
+ #define BLOOM_H
+
+struct commit;
+struct repository;
-+struct commit_graph;
-+
-+struct bloom_filter_settings {
-+ uint32_t hash_version;
-+ uint32_t num_hashes;
-+ uint32_t bits_per_entry;
-+};
-+
-+#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
-+#define BITS_PER_WORD 64
+
-+/*
-+ * A bloom_filter struct represents a data segment to
-+ * use when testing hash values. The 'len' member
-+ * dictates how many uint64_t entries are stored in
-+ * 'data'.
-+ */
-+struct bloom_filter {
-+ uint64_t *data;
-+ int len;
-+};
-+
-+/*
-+ * A bloom_key represents the k hash values for a
-+ * given hash input. These can be precomputed and
-+ * stored in a bloom_key for re-use when testing
-+ * against a bloom_filter.
-+ */
-+struct bloom_key {
-+ uint32_t *hashes;
-+};
-+
-+void load_bloom_filters(void);
-+
-+void fill_bloom_key(const char *data,
-+ int len,
-+ struct bloom_key *key,
-+ struct bloom_filter_settings *settings);
-+
-+void add_key_to_filter(struct bloom_key *key,
-+ struct bloom_filter *filter,
-+ struct bloom_filter_settings *settings);
+ struct bloom_filter_settings {
+ /*
+ * The version of the hashing technique being used.
+@@
+ struct bloom_filter *filter,
+ const struct bloom_filter_settings *settings);
+
++void init_bloom_filters(void);
+
+struct bloom_filter *get_bloom_filter(struct repository *r,
+ struct commit *c);
+
-+int bloom_filter_contains(struct bloom_filter *filter,
-+ struct bloom_key *key,
-+ struct bloom_filter_settings *settings);
-+
-+#endif
+ #endif
+ \ No newline at end of file
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
- new file mode 100644
- --- /dev/null
+ --- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@
-+#include "test-tool.h"
-+#include "git-compat-util.h"
-+#include "bloom.h"
-+#include "test-tool.h"
-+#include "cache.h"
-+#include "commit-graph.h"
+ #include "git-compat-util.h"
+ #include "bloom.h"
+ #include "test-tool.h"
+#include "commit.h"
-+#include "config.h"
-+#include "object-store.h"
-+#include "object.h"
-+#include "repository.h"
-+#include "tree.h"
-+
-+struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
-+
-+static void print_bloom_filter(struct bloom_filter *filter) {
-+ int i;
-+
-+ if (!filter) {
-+ printf("No filter.\n");
-+ return;
-+ }
-+ printf("Filter_Length:%d\n", filter->len);
-+ printf("Filter_Data:");
-+ for (i = 0; i < filter->len; i++){
-+ printf("%"PRIx64"|", filter->data[i]);
-+ }
-+ printf("\n");
-+}
-+
-+static void add_string_to_filter(const char *data, struct bloom_filter *filter) {
-+ struct bloom_key key;
-+ int i;
-+
-+ fill_bloom_key(data, strlen(data), &key, &settings);
-+ printf("Hashes:");
-+ for (i = 0; i < settings.num_hashes; i++){
-+ printf("%08x|", key.hashes[i]);
-+ }
-+ printf("\n");
-+ add_key_to_filter(&key, filter, &settings);
-+}
-+
+
+ struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
+
+@@
+ printf("\n");
+ }
+
+static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
+{
+ struct commit *c;
@@ -429,72 +183,33 @@
+ print_bloom_filter(filter);
+}
+
-+int cmd__bloom(int argc, const char **argv)
-+{
-+ if (!strcmp(argv[1], "generate_filter")) {
-+ struct bloom_filter filter;
-+ int i = 2;
-+ filter.len = (settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
-+ filter.data = xcalloc(filter.len, sizeof(uint64_t));
-+
-+ if (!argv[2]){
-+ die("at least one input string expected");
-+ }
-+
-+ while (argv[i]) {
-+ add_string_to_filter(argv[i], &filter);
-+ i++;
-+ }
-+
-+ print_bloom_filter(&filter);
-+ }
-+
-+ if (!strcmp(argv[1], "get_filter_for_commit")) {
+ int cmd__bloom(int argc, const char **argv)
+ {
+ if (!strcmp(argv[1], "get_murmur3")) {
+@@
+ print_bloom_filter(&filter);
+ }
+
++ if (!strcmp(argv[1], "get_filter_for_commit")) {
+ struct object_id oid;
+ const char *end;
+ if (parse_oid_hex(argv[2], &oid, &end))
+ die("cannot parse oid '%s'", argv[2]);
-+ load_bloom_filters();
++ init_bloom_filters();
+ get_bloom_filter_for_commit(&oid);
+ }
+
-+ return 0;
-+}
-
- diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
- --- a/t/helper/test-tool.c
- +++ b/t/helper/test-tool.c
-@@
- };
-
- static struct test_cmd cmds[] = {
-+ { "bloom", cmd__bloom },
- { "chmtime", cmd__chmtime },
- { "config", cmd__config },
- { "ctype", cmd__ctype },
-
- diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
- --- a/t/helper/test-tool.h
- +++ b/t/helper/test-tool.h
-@@
- #define USE_THE_INDEX_COMPATIBILITY_MACROS
- #include "git-compat-util.h"
-
-+int cmd__bloom(int argc, const char **argv);
- int cmd__chmtime(int argc, const char **argv);
- int cmd__config(int argc, const char **argv);
- int cmd__ctype(int argc, const char **argv);
+ return 0;
+ }
+ \ No newline at end of file
diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
- new file mode 100755
- --- /dev/null
+ --- a/t/t0095-bloom.sh
+++ b/t/t0095-bloom.sh
@@
-+#!/bin/sh
-+
-+test_description='test bloom.c'
-+. ./test-lib.sh
-+
+ test_cmp expect actual
+ '
+
+test_expect_success 'get bloom filters for commit with no changes' '
+ git init &&
+ git commit --allow-empty -m "c0" &&
@@ -517,8 +232,8 @@
+ git add smallDir &&
+ git commit -m "commit with 10 changes" &&
+ cat >expect <<-\EOF &&
-+ Filter_Length:4
-+ Filter_Data:508928809087080a|8a7648210804001|4089824400951000|841ab310098051a8|
++ Filter_Length:25
++ Filter_Data:82|a0|65|47|0c|92|90|c0|a1|40|02|a0|e2|40|e0|04|0a|9a|66|cf|80|19|85|42|23|
+ EOF
+ test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
+ test_cmp expect actual
@@ -542,64 +257,5 @@
+ test_cmp expect actual
+'
+
-+test_expect_success 'compute bloom key for empty string' '
-+ cat >expect <<-\EOF &&
-+ Hashes:5615800c|5b966560|61174ab4|66983008|6c19155c|7199fab0|771ae004|
-+ Filter_Length:1
-+ Filter_Data:11000110001110|
-+ EOF
-+ test-tool bloom generate_filter "" >actual &&
-+ test_cmp expect actual
-+'
-+
-+test_expect_success 'compute bloom key for whitespace' '
-+ cat >expect <<-\EOF &&
-+ Hashes:1bf014e6|8a91b50b|f9335530|67d4f555|d676957a|4518359f|b3b9d5c4|
-+ Filter_Length:1
-+ Filter_Data:401004080200810|
-+ EOF
-+ test-tool bloom generate_filter " " >actual &&
-+ test_cmp expect actual
-+'
-+
-+test_expect_success 'compute bloom key for a root level folder' '
-+ cat >expect <<-\EOF &&
-+ Hashes:1a21016f|fff1c06d|e5c27f6b|cb933e69|b163fd67|9734bc65|7d057b63|
-+ Filter_Length:1
-+ Filter_Data:aaa800000000|
-+ EOF
-+ test-tool bloom generate_filter "A" >actual &&
-+ test_cmp expect actual
-+'
-+
-+test_expect_success 'compute bloom key for a root level file' '
-+ cat >expect <<-\EOF &&
-+ Hashes:e2d51107|30970605|7e58fb03|cc1af001|19dce4ff|679ed9fd|b560cefb|
-+ Filter_Length:1
-+ Filter_Data:a8000000000000aa|
-+ EOF
-+ test-tool bloom generate_filter "file.txt" >actual &&
-+ test_cmp expect actual
-+'
-+
-+test_expect_success 'compute bloom key for a deep folder' '
-+ cat >expect <<-\EOF &&
-+ Hashes:864cf838|27f055cd|c993b362|6b3710f7|0cda6e8c|ae7dcc21|502129b6|
-+ Filter_Length:1
-+ Filter_Data:1c0000600003000|
-+ EOF
-+ test-tool bloom generate_filter "A/B/C/D/E" >actual &&
-+ test_cmp expect actual
-+'
-+
-+test_expect_success 'compute bloom key for a deep file' '
-+ cat >expect <<-\EOF &&
-+ Hashes:07cdf850|4af629c7|8e1e5b3e|d1468cb5|146ebe2c|5796efa3|9abf211a|
-+ Filter_Length:1
-+ Filter_Data:4020100804010080|
-+ EOF
-+ test-tool bloom generate_filter "A/B/C/D/E/file.txt" >actual &&
-+ test_cmp expect actual
-+'
-+
-+test_done
+ test_done
+ \ No newline at end of file
3: a698c04a78c ! 5: 2d4c0b2da38 diff: halt tree-diff early after max_changes
@@ -29,7 +29,7 @@
struct diff_options diffopt;
+ int max_changes = 512;
- if (!bloom_filters.slab_size)
+ if (bloom_filters.slab_size == 0)
return NULL;
@@
@@ -46,7 +46,7 @@
- if (diff_queued_diff.nr <= 512) {
+ if (diff_queued_diff.nr <= max_changes) {
struct hashmap pathmap;
- struct pathmap_hash_entry* e;
+ struct pathmap_hash_entry *e;
struct hashmap_iter iter;
diff --git a/diff.h b/diff.h
4: c17bbcbc66e ! 6: c38b9b386ef commit-graph: compute Bloom filters for changed paths
@@ -2,11 +2,13 @@
commit-graph: compute Bloom filters for changed paths
- Compute Bloom filters for the paths that changed between a commit and its
- first parent using the implementation in bloom.c, when the
- COMMIT_GRAPH_WRITE_CHANGED_PATHS flag is set. This computation is done on a
- commit-by-commit basis. We will write these Bloom filters to the commit graph
- file in the next change.
+ Add new COMMIT_GRAPH_WRITE_CHANGED_PATHS flag that makes Git compute
+ Bloom filters for the paths that changed between a commit and its
+ first parent, for each commit in the commit-graph. This computation
+ is done on a commit-by-commit basis.
+
+ We will write these Bloom filters to the commit-graph file, to store
+ this data on disk, in the next change in this series.
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
@@ -31,7 +33,7 @@
+ changed_paths:1;
const struct split_commit_graph_opts *split_opts;
-+ uint32_t total_bloom_filter_data_size;
++ size_t total_bloom_filter_data_size;
};
static void write_graph_chunk_fanout(struct hashfile *f,
@@ -44,17 +46,17 @@
+ int i;
+ struct progress *progress = NULL;
+
-+ load_bloom_filters();
++ init_bloom_filters();
+
+ if (ctx->report_progress)
-+ progress = start_progress(
-+ _("Computing commit diff Bloom filters"),
++ progress = start_delayed_progress(
++ _("Computing commit changed paths Bloom filters"),
+ ctx->commits.nr);
+
+ for (i = 0; i < ctx->commits.nr; i++) {
+ struct commit *c = ctx->commits.list[i];
+ struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
-+ ctx->total_bloom_filter_data_size += sizeof(uint64_t) * filter->len;
++ ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
+ display_progress(progress, i + 1);
+ }
+
@@ -93,7 +95,7 @@
/* Make sure that each OID in the input is a valid commit OID. */
- COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3)
+ COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3),
-+ COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4)
++ COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4),
};
struct split_commit_graph_opts {
5: 78e8e49c3a1 ! 7: d24c85c54ef commit-graph: examine changed-path objects in pack order
@@ -39,6 +39,7 @@
/* Remember to update object flag allocation in object.h */
#define REACHABLE (1u<<15)
+-char *get_commit_graph_filename(struct object_directory *odb)
+/* Keep track of the order in which commits are added to our list. */
+define_commit_slab(commit_pos, int);
+static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
@@ -55,16 +56,20 @@
+}
+
+static int commit_pos_cmp(const void *va, const void *vb)
-+{
+ {
+- return xstrfmt("%s/info/commit-graph", odb->path);
+ const struct commit *a = *(const struct commit **)va;
+ const struct commit *b = *(const struct commit **)vb;
+ return commit_pos_at(&commit_pos, a) -
+ commit_pos_at(&commit_pos, b);
+}
+
- char *get_commit_graph_filename(const char *obj_dir)
- {
- char *filename = xstrfmt("%s/info/commit-graph", obj_dir);
++char *get_commit_graph_filename(struct object_directory *obj_dir)
++{
++ return xstrfmt("%s/info/commit-graph", obj_dir->path);
+ }
+
+ static char *get_split_graph_filename(struct object_directory *odb,
@@
oidcpy(&(ctx->oids.list[ctx->oids.nr]), oid);
ctx->oids.nr++;
@@ -78,27 +83,27 @@
{
int i;
struct progress *progress = NULL;
-+ struct commit **sorted_by_pos;
++ struct commit **sorted_commits;
- load_bloom_filters();
+ init_bloom_filters();
@@
- _("Computing commit diff Bloom filters"),
+ _("Computing commit changed paths Bloom filters"),
ctx->commits.nr);
-+ ALLOC_ARRAY(sorted_by_pos, ctx->commits.nr);
-+ COPY_ARRAY(sorted_by_pos, ctx->commits.list, ctx->commits.nr);
-+ QSORT(sorted_by_pos, ctx->commits.nr, commit_pos_cmp);
++ ALLOC_ARRAY(sorted_commits, ctx->commits.nr);
++ COPY_ARRAY(sorted_commits, ctx->commits.list, ctx->commits.nr);
++ QSORT(sorted_commits, ctx->commits.nr, commit_pos_cmp);
+
for (i = 0; i < ctx->commits.nr; i++) {
- struct commit *c = ctx->commits.list[i];
-+ struct commit *c = sorted_by_pos[i];
++ struct commit *c = sorted_commits[i];
struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
- ctx->total_bloom_filter_data_size += sizeof(uint64_t) * filter->len;
+ ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
display_progress(progress, i + 1);
}
-+ free(sorted_by_pos);
++ free(sorted_commits);
stop_progress(&progress);
}
6: 58704d81b6b ! 8: 5ed16f35fed commit-graph: examine commits by generation number
@@ -1,11 +1,11 @@
-Author: Derrick Stolee <dstolee@microsoft.com>
+Author: Garima Singh <garima.singh@microsoft.com>
commit-graph: examine commits by generation number
When running 'git commit-graph write --changed-paths', we sort the
commits by pack-order to save time when computing the changed-paths
bloom filters. This does not help when finding the commits via the
- --reachable flag.
+ '--reachable' flag.
If not using pack-order, then sort by generation number before
examining the diff. Commits with similar generation are more likely
@@ -45,9 +45,9 @@
+ return 0;
+}
+
- char *get_commit_graph_filename(const char *obj_dir)
+ char *get_commit_graph_filename(struct object_directory *obj_dir)
{
- char *filename = xstrfmt("%s/info/commit-graph", obj_dir);
+ return xstrfmt("%s/info/commit-graph", obj_dir->path);
@@
report_progress:1,
split:1,
@@ -57,20 +57,20 @@
+ order_by_pack:1;
const struct split_commit_graph_opts *split_opts;
- uint32_t total_bloom_filter_data_size;
+ size_t total_bloom_filter_data_size;
@@
- ALLOC_ARRAY(sorted_by_pos, ctx->commits.nr);
- COPY_ARRAY(sorted_by_pos, ctx->commits.list, ctx->commits.nr);
-- QSORT(sorted_by_pos, ctx->commits.nr, commit_pos_cmp);
+ ALLOC_ARRAY(sorted_commits, ctx->commits.nr);
+ COPY_ARRAY(sorted_commits, ctx->commits.list, ctx->commits.nr);
+- QSORT(sorted_commits, ctx->commits.nr, commit_pos_cmp);
+
+ if (ctx->order_by_pack)
-+ QSORT(sorted_by_pos, ctx->commits.nr, commit_pos_cmp);
++ QSORT(sorted_commits, ctx->commits.nr, commit_pos_cmp);
+ else
-+ QSORT(sorted_by_pos, ctx->commits.nr, commit_gen_cmp);
++ QSORT(sorted_commits, ctx->commits.nr, commit_gen_cmp);
for (i = 0; i < ctx->commits.nr; i++) {
- struct commit *c = sorted_by_pos[i];
+ struct commit *c = sorted_commits[i];
@@
}
-: ----------- > 9: 55824cda89c diff: skip batch object download when possible
7: 39ee0610800 ! 10: 1e4663523de commit-graph: write Bloom filters to commit graph file
@@ -2,9 +2,10 @@
commit-graph: write Bloom filters to commit graph file
- Update the technical documentation for commit-graph-format with the formats for
- the Bloom filter index (BIDX) and Bloom filter data (BDAT) chunks. Write the
- computed Bloom filters information to the commit graph file using this format.
+ Update the technical documentation for commit-graph-format with
+ the formats for the Bloom filter index (BIDX) and Bloom filter
+ data (BDAT) chunks. Write the computed Bloom filters information
+ to the commit graph file using this format.
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
@@ -17,7 +18,7 @@
the graph file.
+- The Bloom filter of the commit carrying the paths that were changed between
-+ the commit and its first parent.
++ the commit and its first parent, if requested.
+
These positional references are stored as unsigned 32-bit integers
corresponding to the array position within the list of commit OIDs. Due
@@ -36,16 +37,22 @@
+ Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
+ * It starts with header consisting of three unsigned 32-bit integers:
+ - Version of the hash algorithm being used. We currently only support
-+ value 1 which implies the murmur3 hash implemented exactly as described
-+ in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
++ value 1 which corresponds to the 32-bit version of the murmur3 hash
++ implemented exactly as described in
++ https://en.wikipedia.org/wiki/MurmurHash#Algorithm and the double
++ hashing technique using seed values 0x293ae76f and 0x7e646e2 as
++ described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
++ in Probabilistic Verification"
+ - The number of times a path is hashed and hence the number of bit positions
-+ that cumulatively determine whether a file is present in the commit.
++ that cumulatively determine whether a file is present in the commit.
+ - The minimum number of bits 'b' per entry in the Bloom filter. If the filter
-+ contains 'n' entries, then the filter size is the minimum number of 64-bit
-+ words that contain n*b bits.
++ contains 'n' entries, then the filter size is the minimum number of 64-bit
++ words that contain n*b bits.
+ * The rest of the chunk is the concatenation of all the computed Bloom
+ filters for the commits in lexicographic order.
-+ * The BDAT chunk is present iff BIDX is present.
++ * Note: Commits with no changes or more than 512 changes have Bloom filters
++ of length zero.
++ * The BDAT chunk is present if and only if BIDX is present.
+
Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
This list of H-byte hashes describe a set of B commit-graph files that
@@ -103,16 +110,14 @@
last_chunk_offset = chunk_offset;
}
-+ /* We need both the bloom chunks to exist together. Else ignore the data */
-+ if ((graph->chunk_bloom_indexes && !graph->chunk_bloom_data)
-+ || (!graph->chunk_bloom_indexes && graph->chunk_bloom_data)) {
++ if (graph->chunk_bloom_indexes && graph->chunk_bloom_data) {
++ init_bloom_filters();
++ } else {
++ /* We need both the bloom chunks to exist together. Else ignore the data */
+ graph->chunk_bloom_indexes = NULL;
+ graph->chunk_bloom_data = NULL;
+ graph->bloom_filter_settings = NULL;
+ }
-+
-+ if (graph->chunk_bloom_indexes && graph->chunk_bloom_data)
-+ load_bloom_filters();
+
hashcpy(graph->oid.hash, graph->data + graph->data_len - graph->hash_len);
@@ -148,7 +153,7 @@
+
+static void write_graph_chunk_bloom_data(struct hashfile *f,
+ struct write_commit_graph_context *ctx,
-+ struct bloom_filter_settings *settings)
++ const struct bloom_filter_settings *settings)
+{
+ struct commit **list = ctx->commits.list;
+ struct commit **last = ctx->commits.list + ctx->commits.nr;
@@ -167,7 +172,7 @@
+ while (list < last) {
+ struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
+ display_progress(progress, ++i);
-+ hashwrite(f, filter->data, filter->len * sizeof(uint64_t));
++ hashwrite(f, filter->data, filter->len * sizeof(unsigned char));
+ list++;
+ }
+
@@ -177,22 +182,11 @@
static int oid_compare(const void *_a, const void *_b)
{
const struct object_id *a = (const struct object_id *)_a;
-@@
- load_bloom_filters();
-
- if (ctx->report_progress)
-- progress = start_progress(
-- _("Computing commit diff Bloom filters"),
-+ progress = start_delayed_progress(
-+ _("Computing changed paths Bloom filters"),
- ctx->commits.nr);
-
- ALLOC_ARRAY(sorted_by_pos, ctx->commits.nr);
@@
struct strbuf progress_title = STRBUF_INIT;
int num_chunks = 3;
struct object_id file_hash;
-+ struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
++ const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
if (ctx->split) {
struct strbuf tmp_file = STRBUF_INIT;
@@ -236,6 +230,14 @@
if (ctx->num_commit_graphs_after > 1 &&
write_graph_chunk_base(f, ctx)) {
return -1;
+@@
+ close(g->graph_fd);
+ }
+ free(g->filename);
++ free(g->bloom_filter_settings);
+ free(g);
+ }
+
diff --git a/commit-graph.h b/commit-graph.h
--- a/commit-graph.h
@@ -246,7 +248,7 @@
struct commit;
+struct bloom_filter_settings;
- char *get_commit_graph_filename(const char *obj_dir);
+ char *get_commit_graph_filename(struct object_directory *odb);
int open_commit_graph(const char *graph_file, int *fd, struct stat *st);
@@
const unsigned char *chunk_commit_data;
@@ -258,13 +260,4 @@
+ struct bloom_filter_settings *bloom_filter_settings;
};
- struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st);
-@@
- COMMIT_GRAPH_WRITE_SPLIT = (1 << 2),
- /* Make sure that each OID in the input is a valid commit OID. */
- COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3),
-- COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4)
-+ COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4),
- };
-
- struct split_commit_graph_opts {
+ struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st,
8: b20c8d2b209 ! 11: 68395d4051b commit-graph: reuse existing Bloom filters during write.
@@ -1,9 +1,10 @@
Author: Garima Singh <garima.singh@microsoft.com>
- commit-graph: reuse existing Bloom filters during write.
+ commit-graph: reuse existing Bloom filters during write
- Read previously computed Bloom filters from the commit-graph file if
- possible to avoid recomputing during commit-graph write.
+ Add logic to
+ a) parse Bloom filter information from the commit graph file and,
+ b) re-use existing Bloom filters.
See Documentation/technical/commit-graph-format for the format in which
the Bloom filter information is written to the commit graph file.
@@ -25,7 +26,7 @@
commit's filter.
We toggle whether Bloom filters should be recomputed based on the
- compute_if_null flag.
+ compute_if_not_present flag.
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
@@ -34,15 +35,16 @@
--- a/bloom.c
+++ b/bloom.c
@@
- #include "git-compat-util.h"
- #include "bloom.h"
+ #include "diffcore.h"
+ #include "revision.h"
+ #include "hashmap.h"
++#include "commit-graph.h"
+#include "commit.h"
-+#include "commit-slab.h"
- #include "commit-graph.h"
- #include "object-store.h"
- #include "diff.h"
+
+ define_commit_slab(bloom_filter_slab, struct bloom_filter);
+
@@
- }
+ return ((unsigned char)1) << (pos & (BITS_PER_WORD - 1));
}
+static int load_bloom_filter_from_graph(struct commit_graph *g,
@@ -62,23 +64,29 @@
+
+ end_index = get_be32(g->chunk_bloom_indexes + 4 * lex_pos);
+
-+ if (lex_pos)
++ if (lex_pos > 0)
+ start_index = get_be32(g->chunk_bloom_indexes + 4 * (lex_pos - 1));
+ else
+ start_index = 0;
+
+ filter->len = end_index - start_index;
-+ filter->data = (uint64_t *)(g->chunk_bloom_data +
-+ sizeof(uint64_t) * start_index +
++ filter->data = (unsigned char *)(g->chunk_bloom_data +
++ sizeof(unsigned char) * start_index +
+ BLOOMDATA_CHUNK_HEADER_SIZE);
+
+ return 1;
+}
+
+ /*
+ * Calculate the murmur3 32-bit hash value for the given data
+ * using the given seed.
+@@
+ }
+
struct bloom_filter *get_bloom_filter(struct repository *r,
- struct commit *c)
+ struct commit *c,
-+ int compute_if_not_present)
++ int compute_if_not_present)
{
struct bloom_filter *filter;
struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
@@ -102,7 +110,7 @@
+
repo_diff_setup(r, &diffopt);
diffopt.flags.recursive = 1;
- diffopt.max_changes = max_changes;
+ diffopt.detect_rename = 0;
diff --git a/bloom.h b/bloom.h
--- a/bloom.h
@@ -110,21 +118,21 @@
@@
#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
- #define BITS_PER_WORD 64
-+#define BLOOMDATA_CHUNK_HEADER_SIZE 3*sizeof(uint32_t)
+ #define BITS_PER_WORD 8
++#define BLOOMDATA_CHUNK_HEADER_SIZE 3 * sizeof(uint32_t)
/*
* A bloom_filter struct represents a data segment to
@@
- struct bloom_filter_settings *settings);
+ void init_bloom_filters(void);
struct bloom_filter *get_bloom_filter(struct repository *r,
- struct commit *c);
+ struct commit *c,
+ int compute_if_not_present);
- int bloom_filter_contains(struct bloom_filter *filter,
- struct bloom_key *key,
+ #endif
+ \ No newline at end of file
diff --git a/commit-graph.c b/commit-graph.c
--- a/commit-graph.c
@@ -145,25 +153,17 @@
- struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
+ struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
display_progress(progress, ++i);
- hashwrite(f, filter->data, filter->len * sizeof(uint64_t));
+ hashwrite(f, filter->data, filter->len * sizeof(unsigned char));
list++;
@@
for (i = 0; i < ctx->commits.nr; i++) {
- struct commit *c = sorted_by_pos[i];
+ struct commit *c = sorted_commits[i];
- struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
+ struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
- ctx->total_bloom_filter_data_size += sizeof(uint64_t) * filter->len;
+ ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
display_progress(progress, i + 1);
}
-@@
- g->data = NULL;
- close(g->graph_fd);
- }
-+ free(g->bloom_filter_settings);
- free(g->filename);
- free(g);
- }
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
--- a/t/helper/test-bloom.c
9: 3d7ee0c9695 ! 12: 7e450e45236 commit-graph: add --changed-paths option to write subcommand
@@ -56,7 +56,7 @@
+ int enable_changed_paths;
} opts;
- static int graph_verify(int argc, const char **argv)
+ static struct object_directory *find_odb(struct repository *r,
@@
N_("start walk at commits listed by stdin")),
OPT_BOOL(0, "append", &opts.append,
@@ -74,4 +74,4 @@
+ flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
read_replace_refs = 0;
-
+ odb = find_odb(the_repository, opts.obj_dir);
10: 77f1c561e82 ! 13: b18af58aa3e revision.c: use Bloom filters to speed up path based revision walks
@@ -2,17 +2,27 @@
revision.c: use Bloom filters to speed up path based revision walks
- Revision walk will now use Bloom filters for commits to speed up revision
- walks for a particular path (for computing history for that path), if they
- are present in the commit-graph file.
+ Revision walk will now use Bloom filters for commits to speed up
+ revision walks for a particular path (for computing history for
+ that path), if they are present in the commit-graph file.
- We load the Bloom filters during the prepare_revision_walk step, but only
- when dealing with a single pathspec. While comparing trees in
- rev_compare_trees(), if the Bloom filter says that the file is not different
- between the two trees, we don't need to compute the expensive diff. This is
- where we get our performance gains. The other response of the Bloom filter
- is `maybe`, in which case we fall back to the full diff calculation to
- determine if the path was changed in the commit.
+ We load the Bloom filters during the prepare_revision_walk step,
+ currently only when dealing with a single pathspec. Extending
+ it to work with multiple pathspecs can be explored and built on
+ top of this series in the future.
+
+ While comparing trees in rev_compare_trees(), if the Bloom filter
+ says that the file is not different between the two trees, we don't
+ need to compute the expensive diff. This is where we get our
+ performance gains. The other response of the Bloom filter is 'maybe',
+ in which case we fall back to the full diff calculation to determine
+ if the path was changed in the commit.
+
+ We do not try to use Bloom filters when the '--walk-reflogs' option
+ is specified. The '--walk-reflogs' option does not walk the commit
+ ancestry chain like the rest of the options. Incorporating the
+ performance gains when walking reflog entries would add more
+ complexity, and can be explored in a later series.
Performance Gains:
We tested the performance of `git log -- <path>` on the git repo, the linux
@@ -30,6 +40,49 @@
Helped-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
+ diff --git a/bloom.c b/bloom.c
+ --- a/bloom.c
+ +++ b/bloom.c
+@@
+
+ return filter;
+ }
++
++int bloom_filter_contains(const struct bloom_filter *filter,
++ const struct bloom_key *key,
++ const struct bloom_filter_settings *settings)
++{
++ int i;
++ uint64_t mod = filter->len * BITS_PER_WORD;
++
++ if (!mod)
++ return -1;
++
++ for (i = 0; i < settings->num_hashes; i++) {
++ uint64_t hash_mod = key->hashes[i] % mod;
++ uint64_t block_pos = hash_mod / BITS_PER_WORD;
++ if (!(filter->data[block_pos] & get_bitmask(hash_mod)))
++ return 0;
++ }
++
++ return 1;
++}
+ \ No newline at end of file
+
+ diff --git a/bloom.h b/bloom.h
+ --- a/bloom.h
+ +++ b/bloom.h
+@@
+ struct commit *c,
+ int compute_if_not_present);
+
++int bloom_filter_contains(const struct bloom_filter *filter,
++ const struct bloom_key *key,
++ const struct bloom_filter_settings *settings);
++
+ #endif
+ \ No newline at end of file
+
diff --git a/revision.c b/revision.c
--- a/revision.c
+++ b/revision.c
@@ -38,7 +91,6 @@
#include "hashmap.h"
#include "utf8.h"
+#include "bloom.h"
-+#include "json-writer.h"
volatile show_early_output_fn_t show_early_output;
@@ -46,29 +98,6 @@
options->flags.has_changes = 1;
}
-+static int bloom_filter_atexit_registered;
-+static unsigned int count_bloom_filter_maybe;
-+static unsigned int count_bloom_filter_definitely_not;
-+static unsigned int count_bloom_filter_false_positive;
-+static unsigned int count_bloom_filter_not_present;
-+static unsigned int count_bloom_filter_length_zero;
-+
-+static void trace2_bloom_filter_statistics_atexit(void)
-+{
-+ struct json_writer jw = JSON_WRITER_INIT;
-+
-+ jw_object_begin(&jw, 0);
-+ jw_object_intmax(&jw, "filter_not_present", count_bloom_filter_not_present);
-+ jw_object_intmax(&jw, "zero_length_filter", count_bloom_filter_length_zero);
-+ jw_object_intmax(&jw, "maybe", count_bloom_filter_maybe);
-+ jw_object_intmax(&jw, "definitely_not", count_bloom_filter_definitely_not);
-+ jw_end(&jw);
-+
-+ trace2_data_json("bloom", the_repository, "statistics", &jw);
-+
-+ jw_release(&jw);
-+}
-+
+static void prepare_to_use_bloom_filter(struct rev_info *revs)
+{
+ struct pathspec_item *pi;
@@ -92,6 +121,7 @@
+ pi = &revs->pruning.pathspec.items[0];
+ last_index = pi->len - 1;
+
++ /* remove single trailing slash from path, if needed */
+ if (pi->match[last_index] == '/') {
+ path_alloc = xstrdup(pi->match);
+ path_alloc[last_index] = '\0';
@@ -104,11 +134,6 @@
+ revs->bloom_key = xmalloc(sizeof(struct bloom_key));
+ fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
+
-+ if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
-+ atexit(trace2_bloom_filter_statistics_atexit);
-+ bloom_filter_atexit_registered = 1;
-+ }
-+
+ free(path_alloc);
+}
+
@@ -127,12 +152,10 @@
+ filter = get_bloom_filter(revs->repo, commit, 0);
+
+ if (!filter) {
-+ count_bloom_filter_not_present++;
+ return -1;
+ }
+
+ if (!filter->len) {
-+ count_bloom_filter_length_zero++;
+ return -1;
+ }
+
@@ -140,11 +163,6 @@
+ revs->bloom_key,
+ revs->bloom_filter_settings);
+
-+ if (result)
-+ count_bloom_filter_maybe++;
-+ else
-+ count_bloom_filter_definitely_not++;
-+
+ return result;
+}
+
@@ -162,7 +180,7 @@
return REV_TREE_SAME;
}
-+ if (revs->pruning.pathspec.nr == 1 && !revs->reflog_info && !nth_parent) {
++ if (revs->bloom_key && !nth_parent) {
+ bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
+
+ if (bloom_ret == 0)
@@ -174,10 +192,6 @@
if (diff_tree_oid(&t1->object.oid, &t2->object.oid, "",
&revs->pruning) < 0)
return REV_TREE_DIFFERENT;
-+
-+ if (!nth_parent)
-+ if (bloom_ret == 1 && tree_difference == REV_TREE_SAME)
-+ count_bloom_filter_false_positive++;
+
return tree_difference;
}
@@ -237,164 +251,3 @@
};
int ref_excluded(struct string_list *, const char *path);
-
- diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
- --- a/t/helper/test-read-graph.c
- +++ b/t/helper/test-read-graph.c
-@@
- printf(" commit_metadata");
- if (graph->chunk_extra_edges)
- printf(" extra_edges");
-+ if (graph->chunk_bloom_indexes)
-+ printf(" bloom_indexes");
-+ if (graph->chunk_bloom_data)
-+ printf(" bloom_data");
- printf("\n");
-
- UNLEAK(graph);
-
- diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
- new file mode 100755
- --- /dev/null
- +++ b/t/t4216-log-bloom.sh
-@@
-+#!/bin/sh
-+
-+test_description='git log for a path with bloom filters'
-+. ./test-lib.sh
-+
-+test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
-+ git init &&
-+ mkdir A A/B A/B/C &&
-+ test_commit c1 A/file1 &&
-+ test_commit c2 A/B/file2 &&
-+ test_commit c3 A/B/C/file3 &&
-+ test_commit c4 A/file1 &&
-+ test_commit c5 A/B/file2 &&
-+ test_commit c6 A/B/C/file3 &&
-+ test_commit c7 A/file1 &&
-+ test_commit c8 A/B/file2 &&
-+ test_commit c9 A/B/C/file3 &&
-+ git checkout -b side HEAD~4 &&
-+ test_commit side-1 file4 &&
-+ git checkout master &&
-+ git merge side &&
-+ test_commit c10 file5 &&
-+ mv file5 file5_renamed &&
-+ git add file5_renamed &&
-+ git commit -m "rename" &&
-+ git commit-graph write --reachable --changed-paths
-+'
-+graph_read_expect() {
-+ OPTIONAL=""
-+ NUM_CHUNKS=5
-+ cat >expect <<- EOF
-+ header: 43475048 1 1 $NUM_CHUNKS 0
-+ num_commits: $1
-+ chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
-+ EOF
-+ test-tool read-graph >output &&
-+ test_cmp expect output
-+}
-+
-+test_expect_success 'commit-graph write wrote out the bloom chunks' '
-+ graph_read_expect 13
-+'
-+
-+setup() {
-+ rm output
-+ rm "$TRASH_DIRECTORY/trace.perf"
-+ git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom
-+ GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom
-+}
-+
-+test_bloom_filters_used() {
-+ log_args=$1
-+ bloom_trace_prefix="statistics:{\"filter_not_present\":0,\"zero_length_filter\":0,\"maybe\""
-+ setup "$log_args"
-+ grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" && test_cmp log_wo_bloom log_w_bloom
-+}
-+
-+test_bloom_filters_not_used() {
-+ log_args=$1
-+ setup "$log_args"
-+ !(grep -q "statistics:{\"filter_not_present\":" "$TRASH_DIRECTORY/trace.perf") && test_cmp log_wo_bloom log_w_bloom
-+}
-+
-+for path in A A/B A/B/C A/file1 A/B/file2 A/B/C/file3 file4 file5_renamed
-+do
-+ for option in "" \
-+ "--full-history" \
-+ "--full-history --simplify-merges" \
-+ "--simplify-merges" \
-+ "--simplify-by-decoration" \
-+ "--follow" \
-+ "--first-parent" \
-+ "--topo-order" \
-+ "--date-order" \
-+ "--author-date-order" \
-+ "--ancestry-path side..master"
-+ do
-+ test_expect_success "git log option: $option for path: $path" '
-+ test_bloom_filters_used "$option -- $path"
-+ '
-+ done
-+done
-+
-+test_expect_success 'git log -- folder works with and without the trailing slash' '
-+ test_bloom_filters_used "-- A" &&
-+ test_bloom_filters_used "-- A/"
-+'
-+
-+test_expect_success 'git log for path that does not exist. ' '
-+ test_bloom_filters_used "-- path_does_not_exist"
-+'
-+
-+test_expect_success 'git log with --walk-reflogs does not use bloom filters' '
-+ test_bloom_filters_not_used "--walk-reflogs -- A"
-+'
-+
-+test_expect_success 'git log -- multiple path specs does not use bloom filters' '
-+ test_bloom_filters_not_used "-- file4 A/file1"
-+'
-+
-+test_expect_success 'git log with wildcard that resolves to a single path uses bloom filters' '
-+ test_bloom_filters_used "-- *4" &&
-+ test_bloom_filters_used "-- *renamed"
-+'
-+
-+test_expect_success 'git log with wildcard that resolves to a multiple paths does not uses bloom filters' '
-+ test_bloom_filters_not_used "-- *" &&
-+ test_bloom_filters_not_used "-- file*"
-+'
-+
-+test_expect_success 'setup - add commit-graph to the chain without bloom filters' '
-+ test_commit c14 A/anotherFile2 &&
-+ test_commit c15 A/B/anotherFile2 &&
-+ test_commit c16 A/B/C/anotherFile2 &&
-+ GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0 git commit-graph write --reachable --split &&
-+ test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain
-+'
-+
-+test_expect_success 'git log does not use bloom filters if the latest graph does not have bloom filters.' '
-+ test_bloom_filters_not_used "-- A/B"
-+'
-+
-+test_expect_success 'setup - add commit-graph to the chain with bloom filters' '
-+ test_commit c17 A/anotherFile3 &&
-+ git commit-graph write --reachable --changed-paths --split &&
-+ test_line_count = 3 .git/objects/info/commit-graphs/commit-graph-chain
-+'
-+
-+test_bloom_filters_used_when_some_filters_are_missing() {
-+ log_args=$1
-+ bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":6"
-+ setup "$log_args"
-+ grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" && test_cmp log_wo_bloom log_w_bloom
-+}
-+
-+test_expect_success 'git log uses bloom filters if they exist in the latest but not all commit graphs in the chain.' '
-+ test_bloom_filters_used_when_some_filters_are_missing "-- A/B"
-+'
-+
-+test_done
-: ----------- > 14: b5eb280178f revision.c: add trace2 stats around Bloom filter usage
-: ----------- > 15: 3019ef72881 t4216: add end to end tests for git log with Bloom filters
11: e1b076a714d ! 16: 213abb5d895 commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag
@@ -37,8 +37,8 @@
export GIT_TEST_COMMIT_GRAPH=1
+ export GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=1
export GIT_TEST_MULTI_PACK_INDEX=1
+ export GIT_TEST_ADD_I_USE_BUILTIN=1
make test
- ;;
diff --git a/commit-graph.h b/commit-graph.h
--- a/commit-graph.h
@@ -68,20 +68,6 @@
code path for utilizing a file system monitor to speed up detecting
new or changed files.
- diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
- --- a/t/t4216-log-bloom.sh
- +++ b/t/t4216-log-bloom.sh
-@@
- test_description='git log for a path with bloom filters'
- . ./test-lib.sh
-
-+GIT_TEST_COMMIT_GRAPH=0
-+GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
-+
- test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
- git init &&
- mkdir A A/B A/B/C &&
-
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
--
gitgitgadget
* [PATCH v3 01/16] commit-graph: define and use MAX_NUM_CHUNKS
2020-03-30 0:31 ` [PATCH v3 00/16] " Garima Singh via GitGitGadget
@ 2020-03-30 0:31 ` Garima Singh via GitGitGadget
2020-03-30 0:31 ` [PATCH v3 02/16] bloom.c: add the murmur3 hash implementation Garima Singh via GitGitGadget
` (15 subsequent siblings)
16 siblings, 0 replies; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-03-30 0:31 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
This is a minor cleanup to make it easier to change
the number of chunks being written to the commit
graph.
Reviewed-by: Jakub Narębski <jnareb@gmail.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
commit-graph.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index f013a84e294..e4f1a5b2f1a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -23,6 +23,7 @@
#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
#define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
#define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
+#define MAX_NUM_CHUNKS 5
#define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
@@ -1350,8 +1351,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
int fd;
struct hashfile *f;
struct lock_file lk = LOCK_INIT;
- uint32_t chunk_ids[6];
- uint64_t chunk_offsets[6];
+ uint32_t chunk_ids[MAX_NUM_CHUNKS + 1];
+ uint64_t chunk_offsets[MAX_NUM_CHUNKS + 1];
const unsigned hashsz = the_hash_algo->rawsz;
struct strbuf progress_title = STRBUF_INIT;
int num_chunks = 3;
--
gitgitgadget
* [PATCH v3 02/16] bloom.c: add the murmur3 hash implementation
2020-03-30 0:31 ` [PATCH v3 00/16] " Garima Singh via GitGitGadget
2020-03-30 0:31 ` [PATCH v3 01/16] commit-graph: define and use MAX_NUM_CHUNKS Garima Singh via GitGitGadget
@ 2020-03-30 0:31 ` Garima Singh via GitGitGadget
2020-03-30 0:31 ` [PATCH v3 03/16] bloom.c: introduce core Bloom filter constructs Garima Singh via GitGitGadget
` (14 subsequent siblings)
16 siblings, 0 replies; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-03-30 0:31 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
In preparation for computing changed paths Bloom filters,
implement the Murmur3 hash algorithm as described in [1].
It hashes the given data using the given seed and produces
a uniformly distributed hash value.
[1] https://en.wikipedia.org/wiki/MurmurHash#Algorithm
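For readers who want to check the algorithm outside of Git, here is a
minimal standalone sketch of 32-bit Murmur3. It is an illustration, not
the code in this patch: the names murmur3_32_sketch and rotl32 are
hypothetical, and it loads bytes through unsigned char so that bytes at
or above 0x80 are never sign-extended on platforms where plain char is
signed. The constants are the ones listed in [1].

```c
#include <stddef.h>
#include <stdint.h>

/* Rotate a 32-bit value left by r bits; callers only pass 0 < r < 32. */
static uint32_t rotl32(uint32_t x, unsigned r)
{
	return (x << r) | (x >> (32 - r));
}

/* Hypothetical standalone 32-bit Murmur3, following the steps in [1]. */
uint32_t murmur3_32_sketch(uint32_t seed, const void *buf, size_t len)
{
	const unsigned char *data = buf;
	const uint32_t c1 = 0xcc9e2d51;
	const uint32_t c2 = 0x1b873593;
	size_t nblocks = len / 4;
	const unsigned char *tail = data + nblocks * 4;
	uint32_t h = seed;
	uint32_t k;
	size_t i;

	/* Body: mix in each complete 4-byte block, read as little-endian. */
	for (i = 0; i < nblocks; i++) {
		k = ((uint32_t)data[4 * i]) |
		    ((uint32_t)data[4 * i + 1] << 8) |
		    ((uint32_t)data[4 * i + 2] << 16) |
		    ((uint32_t)data[4 * i + 3] << 24);
		k *= c1;
		k = rotl32(k, 15);
		k *= c2;

		h ^= k;
		h = rotl32(h, 13) * 5 + 0xe6546b64;
	}

	/* Tail: mix in the remaining 1 to 3 bytes, if any. */
	k = 0;
	switch (len & 3) {
	case 3:
		k ^= (uint32_t)tail[2] << 16;
		/* fallthrough */
	case 2:
		k ^= (uint32_t)tail[1] << 8;
		/* fallthrough */
	case 1:
		k ^= tail[0];
		k *= c1;
		k = rotl32(k, 15);
		k *= c2;
		h ^= k;
		break;
	}

	/* Finalization: fold in the length, then avalanche the bits. */
	h ^= (uint32_t)len;
	h ^= h >> 16;
	h *= 0x85ebca6b;
	h ^= h >> 13;
	h *= 0xc2b2ae35;
	h ^= h >> 16;

	return h;
}
```

With seed 0, this sketch should agree with the values that t0095 in
this patch expects for the empty string, "Hello world!" and the
"quick brown fox" sentence, since those inputs are plain ASCII.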
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Helped-by: Szeder Gábor <szeder.dev@gmail.com>
Reviewed-by: Jakub Narębski <jnareb@gmail.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
Makefile | 2 ++
bloom.c | 73 +++++++++++++++++++++++++++++++++++++++++++
bloom.h | 13 ++++++++
t/helper/test-bloom.c | 13 ++++++++
t/helper/test-tool.c | 1 +
t/helper/test-tool.h | 1 +
t/t0095-bloom.sh | 30 ++++++++++++++++++
7 files changed, 133 insertions(+)
create mode 100644 bloom.c
create mode 100644 bloom.h
create mode 100644 t/helper/test-bloom.c
create mode 100755 t/t0095-bloom.sh
diff --git a/Makefile b/Makefile
index ef1ff2228f0..491f75e68c5 100644
--- a/Makefile
+++ b/Makefile
@@ -695,6 +695,7 @@ X =
PROGRAMS += $(patsubst %.o,git-%$X,$(PROGRAM_OBJS))
TEST_BUILTINS_OBJS += test-advise.o
+TEST_BUILTINS_OBJS += test-bloom.o
TEST_BUILTINS_OBJS += test-chmtime.o
TEST_BUILTINS_OBJS += test-config.o
TEST_BUILTINS_OBJS += test-ctype.o
@@ -840,6 +841,7 @@ LIB_OBJS += base85.o
LIB_OBJS += bisect.o
LIB_OBJS += blame.o
LIB_OBJS += blob.o
+LIB_OBJS += bloom.o
LIB_OBJS += branch.o
LIB_OBJS += bulk-checkin.o
LIB_OBJS += bundle.o
diff --git a/bloom.c b/bloom.c
new file mode 100644
index 00000000000..40e87632aeb
--- /dev/null
+++ b/bloom.c
@@ -0,0 +1,73 @@
+#include "git-compat-util.h"
+#include "bloom.h"
+
+static uint32_t rotate_left(uint32_t value, int32_t count)
+{
+ uint32_t mask = 8 * sizeof(uint32_t) - 1;
+ count &= mask;
+ return ((value << count) | (value >> ((-count) & mask)));
+}
+
+/*
+ * Calculate the murmur3 32-bit hash value for the given data
+ * using the given seed.
+ * Produces a uniformly distributed hash value.
+ * Not considered to be cryptographically secure.
+ * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
+ */
+uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len)
+{
+ const uint32_t c1 = 0xcc9e2d51;
+ const uint32_t c2 = 0x1b873593;
+ const uint32_t r1 = 15;
+ const uint32_t r2 = 13;
+ const uint32_t m = 5;
+ const uint32_t n = 0xe6546b64;
+ int i;
+ uint32_t k1 = 0;
+ const char *tail;
+
+ int len4 = len / sizeof(uint32_t);
+
+ uint32_t k;
+ for (i = 0; i < len4; i++) {
+ uint32_t byte1 = (uint32_t)data[4*i];
+ uint32_t byte2 = ((uint32_t)data[4*i + 1]) << 8;
+ uint32_t byte3 = ((uint32_t)data[4*i + 2]) << 16;
+ uint32_t byte4 = ((uint32_t)data[4*i + 3]) << 24;
+ k = byte1 | byte2 | byte3 | byte4;
+ k *= c1;
+ k = rotate_left(k, r1);
+ k *= c2;
+
+ seed ^= k;
+ seed = rotate_left(seed, r2) * m + n;
+ }
+
+ tail = (data + len4 * sizeof(uint32_t));
+
+ switch (len & (sizeof(uint32_t) - 1)) {
+ case 3:
+ k1 ^= ((uint32_t)tail[2]) << 16;
+ /*-fallthrough*/
+ case 2:
+ k1 ^= ((uint32_t)tail[1]) << 8;
+ /*-fallthrough*/
+ case 1:
+ k1 ^= ((uint32_t)tail[0]) << 0;
+ k1 *= c1;
+ k1 = rotate_left(k1, r1);
+ k1 *= c2;
+ seed ^= k1;
+ break;
+ }
+
+ seed ^= (uint32_t)len;
+ seed ^= (seed >> 16);
+ seed *= 0x85ebca6b;
+ seed ^= (seed >> 13);
+ seed *= 0xc2b2ae35;
+ seed ^= (seed >> 16);
+
+ return seed;
+}
\ No newline at end of file
diff --git a/bloom.h b/bloom.h
new file mode 100644
index 00000000000..d0fcc5f0aa6
--- /dev/null
+++ b/bloom.h
@@ -0,0 +1,13 @@
+#ifndef BLOOM_H
+#define BLOOM_H
+
+/*
+ * Calculate the murmur3 32-bit hash value for the given data
+ * using the given seed.
+ * Produces a uniformly distributed hash value.
+ * Not considered to be cryptographically secure.
+ * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
+ */
+uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len);
+
+#endif
\ No newline at end of file
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
new file mode 100644
index 00000000000..60ee2043689
--- /dev/null
+++ b/t/helper/test-bloom.c
@@ -0,0 +1,13 @@
+#include "git-compat-util.h"
+#include "bloom.h"
+#include "test-tool.h"
+
+int cmd__bloom(int argc, const char **argv)
+{
+ if (!strcmp(argv[1], "get_murmur3")) {
+ uint32_t hashed = murmur3_seeded(0, argv[2], strlen(argv[2]));
+ printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
+ }
+
+ return 0;
+}
\ No newline at end of file
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 31eedcd241f..6e26bd65c97 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -15,6 +15,7 @@ struct test_cmd {
static struct test_cmd cmds[] = {
{ "advise", cmd__advise_if_enabled },
+ { "bloom", cmd__bloom },
{ "chmtime", cmd__chmtime },
{ "config", cmd__config },
{ "ctype", cmd__ctype },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 4eb5e6609e1..dceeef1d5c2 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -5,6 +5,7 @@
#include "git-compat-util.h"
int cmd__advise_if_enabled(int argc, const char **argv);
+int cmd__bloom(int argc, const char **argv);
int cmd__chmtime(int argc, const char **argv);
int cmd__config(int argc, const char **argv);
int cmd__ctype(int argc, const char **argv);
diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
new file mode 100755
index 00000000000..2dad8c4a94e
--- /dev/null
+++ b/t/t0095-bloom.sh
@@ -0,0 +1,30 @@
+#!/bin/sh
+
+test_description='Testing the various Bloom filter computations in bloom.c'
+. ./test-lib.sh
+
+test_expect_success 'compute unseeded murmur3 hash for empty string' '
+ cat >expect <<-\EOF &&
+ Murmur3 Hash with seed=0:0x00000000
+ EOF
+ test-tool bloom get_murmur3 "" >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'compute unseeded murmur3 hash for test string 1' '
+ cat >expect <<-\EOF &&
+ Murmur3 Hash with seed=0:0x627b0c2c
+ EOF
+ test-tool bloom get_murmur3 "Hello world!" >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'compute unseeded murmur3 hash for test string 2' '
+ cat >expect <<-\EOF &&
+ Murmur3 Hash with seed=0:0x2e4ff723
+ EOF
+ test-tool bloom get_murmur3 "The quick brown fox jumps over the lazy dog" >actual &&
+ test_cmp expect actual
+'
+
+test_done
\ No newline at end of file
--
gitgitgadget
* [PATCH v3 03/16] bloom.c: introduce core Bloom filter constructs
2020-03-30 0:31 ` [PATCH v3 00/16] " Garima Singh via GitGitGadget
2020-03-30 0:31 ` [PATCH v3 01/16] commit-graph: define and use MAX_NUM_CHUNKS Garima Singh via GitGitGadget
2020-03-30 0:31 ` [PATCH v3 02/16] bloom.c: add the murmur3 hash implementation Garima Singh via GitGitGadget
@ 2020-03-30 0:31 ` Garima Singh via GitGitGadget
2020-03-30 0:31 ` [PATCH v3 04/16] bloom.c: core Bloom filter implementation for changed paths Garima Singh via GitGitGadget
` (13 subsequent siblings)
16 siblings, 0 replies; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-03-30 0:31 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
Introduce the constructs for Bloom filters, Bloom filter keys
and Bloom filter settings.
For details on what Bloom filters are and how they work, refer
to Dr. Derrick Stolee's blog post [1]. It provides a concise
explanation of the adoption of Bloom filters as described in
[2] and [3].
Implementation specifics:
1. We currently use 7 and 10 for the number of hashes and the
number of bits per entry, respectively. They served as great
starting values; the mathematical details behind this choice are
described in [1] and [4]. The implementation, while not
completely open to it at the moment, is flexible enough to allow
for tweaking these settings in the future.
Note: The performance gains we have observed with these values
are significant enough that we did not need to tweak these
settings. The performance numbers are included in the cover letter
of this series and in the commit message of the subsequent commit
where we use Bloom filters to speed up `git log -- path`.
2. As described in [1] and [3], we do not need 7 independent hashing
functions. We use the Murmur3 hashing scheme, seed it twice and
then combine those to produce an arbitrary number of hash values.
3. The filters will be sized according to the number of changes in
each commit, in multiples of 8-bit words.
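The three points above can be sketched in a few lines of standalone C.
This is only an illustration under stated assumptions: the helper names
(filter_len_for_entries, derive_hashes, set_bits) are hypothetical, and
the two base hashes are passed in as plain numbers rather than computed
with Murmur3. Hash i is derived as hash0 + i * hash1 with 32-bit
wraparound, and each hash selects one bit in an array of 8-bit words.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_HASHES 7      /* hashes per key, as chosen in point 1 */
#define BITS_PER_ENTRY 10 /* bits per entry, as chosen in point 1 */
#define BITS_PER_WORD 8   /* the filter is an array of 8-bit words */

/* Number of 8-bit words needed for n entries at BITS_PER_ENTRY bits each. */
static size_t filter_len_for_entries(size_t n)
{
	return (n * BITS_PER_ENTRY + BITS_PER_WORD - 1) / BITS_PER_WORD;
}

/* Double hashing: derive NUM_HASHES values from just two base hashes. */
static void derive_hashes(uint32_t hash0, uint32_t hash1,
			  uint32_t out[NUM_HASHES])
{
	uint32_t i;

	for (i = 0; i < NUM_HASHES; i++)
		out[i] = hash0 + i * hash1; /* wraps modulo 2^32 */
}

/*
 * Set the bit selected by each hash. 'len' is the filter length in
 * words and must be non-zero, otherwise the modulus below is zero.
 */
static void set_bits(const uint32_t hashes[NUM_HASHES],
		     unsigned char *data, size_t len)
{
	uint64_t mod = (uint64_t)len * BITS_PER_WORD;
	int i;

	for (i = 0; i < NUM_HASHES; i++) {
		uint64_t pos = hashes[i] % mod;

		data[pos / BITS_PER_WORD] |=
			(unsigned char)(1u << (pos % BITS_PER_WORD));
	}
}
```

As a worked example, take hash0 = 0x5615800c and hash1 = 0x0580e554,
which are the first value and the common difference of the hash list
that the empty-string test in this patch expects. One entry needs two
words, the seven derived hashes fall on bit positions 0, 4, 8 and 12
modulo 16, and the resulting filter data is 11|11.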
[1] Derrick Stolee
"Supercharging the Git Commit Graph IV: Bloom Filters"
https://devblogs.microsoft.com/devops/super-charging-the-git-commit-graph-iv-Bloom-filters/
[2] Flavio Bonomi, Michael Mitzenmacher, Rina Panigrahy, Sushil Singh, George Varghese
"An Improved Construction for Counting Bloom Filters"
http://theory.stanford.edu/~rinap/papers/esa2006b.pdf
https://doi.org/10.1007/11841036_61
[3] Peter C. Dillinger and Panagiotis Manolios
"Bloom Filters in Probabilistic Verification"
http://www.ccs.neu.edu/home/pete/pub/Bloom-filters-verification.pdf
https://doi.org/10.1007/978-3-540-30494-4_26
[4] Thomas Mueller Graf, Daniel Lemire
"Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters"
https://arxiv.org/abs/1912.08258
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Jakub Narębski <jnareb@gmail.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
bloom.c | 38 +++++++++++++++++++++++++-
bloom.h | 63 +++++++++++++++++++++++++++++++++++++++++++
t/helper/test-bloom.c | 48 +++++++++++++++++++++++++++++++++
t/t0095-bloom.sh | 40 +++++++++++++++++++++++++++
4 files changed, 188 insertions(+), 1 deletion(-)
diff --git a/bloom.c b/bloom.c
index 40e87632aeb..888b67f1ea6 100644
--- a/bloom.c
+++ b/bloom.c
@@ -8,6 +8,11 @@ static uint32_t rotate_left(uint32_t value, int32_t count)
return ((value << count) | (value >> ((-count) & mask)));
}
+static inline unsigned char get_bitmask(uint32_t pos)
+{
+ return ((unsigned char)1) << (pos & (BITS_PER_WORD - 1));
+}
+
/*
* Calculate the murmur3 32-bit hash value for the given data
* using the given seed.
@@ -70,4 +75,35 @@ uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len)
seed ^= (seed >> 16);
return seed;
-}
\ No newline at end of file
+}
+
+void fill_bloom_key(const char *data,
+ size_t len,
+ struct bloom_key *key,
+ const struct bloom_filter_settings *settings)
+{
+ int i;
+ const uint32_t seed0 = 0x293ae76f;
+ const uint32_t seed1 = 0x7e646e2c;
+ const uint32_t hash0 = murmur3_seeded(seed0, data, len);
+ const uint32_t hash1 = murmur3_seeded(seed1, data, len);
+
+ key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
+ for (i = 0; i < settings->num_hashes; i++)
+ key->hashes[i] = hash0 + i * hash1;
+}
+
+void add_key_to_filter(const struct bloom_key *key,
+ struct bloom_filter *filter,
+ const struct bloom_filter_settings *settings)
+{
+ int i;
+ uint64_t mod = filter->len * BITS_PER_WORD;
+
+ for (i = 0; i < settings->num_hashes; i++) {
+ uint64_t hash_mod = key->hashes[i] % mod;
+ uint64_t block_pos = hash_mod / BITS_PER_WORD;
+
+ filter->data[block_pos] |= get_bitmask(hash_mod);
+ }
+}
diff --git a/bloom.h b/bloom.h
index d0fcc5f0aa6..b9ce422ca2d 100644
--- a/bloom.h
+++ b/bloom.h
@@ -1,6 +1,60 @@
#ifndef BLOOM_H
#define BLOOM_H
+struct bloom_filter_settings {
+ /*
+ * The version of the hashing technique being used.
+ * We currently only support version = 1 which is
+ * the seeded murmur3 hashing technique implemented
+ * in bloom.c.
+ */
+ uint32_t hash_version;
+
+ /*
+ * The number of times a path is hashed, i.e. the
+ * number of bit positions that cumulatively
+ * determine whether a path is present in the
+ * Bloom filter.
+ */
+ uint32_t num_hashes;
+
+ /*
+ * The minimum number of bits per entry in the Bloom
+ * filter. If the filter contains 'n' entries, then
+ * filter size is the minimum number of 8-bit words
+ * that contain n*b bits.
+ */
+ uint32_t bits_per_entry;
+};
+
+#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
+#define BITS_PER_WORD 8
+
+/*
+ * A bloom_filter struct represents a data segment to
+ * use when testing hash values. The 'len' member
+ * dictates how many entries are stored in
+ * 'data'.
+ */
+struct bloom_filter {
+ unsigned char *data;
+ size_t len;
+};
+
+/*
+ * A bloom_key represents the k hash values for a
+ * given string. These can be precomputed and
+ * stored in a bloom_key for re-use when testing
+ * against a bloom_filter. The number of hashes is
+ * given by the Bloom filter settings and is the same
+ * for all Bloom filters and keys interacting with
+ * the loaded version of the commit graph file and
+ * the Bloom data chunks.
+ */
+struct bloom_key {
+ uint32_t *hashes;
+};
+
/*
* Calculate the murmur3 32-bit hash value for the given data
* using the given seed.
@@ -10,4 +64,13 @@
*/
uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len);
+void fill_bloom_key(const char *data,
+ size_t len,
+ struct bloom_key *key,
+ const struct bloom_filter_settings *settings);
+
+void add_key_to_filter(const struct bloom_key *key,
+ struct bloom_filter *filter,
+ const struct bloom_filter_settings *settings);
+
#endif
\ No newline at end of file
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 60ee2043689..20460cde775 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -2,6 +2,36 @@
#include "bloom.h"
#include "test-tool.h"
+struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
+
+static void add_string_to_filter(const char *data, struct bloom_filter *filter) {
+ struct bloom_key key;
+ int i;
+
+ fill_bloom_key(data, strlen(data), &key, &settings);
+ printf("Hashes:");
+ for (i = 0; i < settings.num_hashes; i++){
+ printf("0x%08x|", key.hashes[i]);
+ }
+ printf("\n");
+ add_key_to_filter(&key, filter, &settings);
+}
+
+static void print_bloom_filter(struct bloom_filter *filter) {
+ int i;
+
+ if (!filter) {
+ printf("No filter.\n");
+ return;
+ }
+ printf("Filter_Length:%d\n", (int)filter->len);
+ printf("Filter_Data:");
+ for (i = 0; i < filter->len; i++){
+ printf("%02x|", filter->data[i]);
+ }
+ printf("\n");
+}
+
int cmd__bloom(int argc, const char **argv)
{
if (!strcmp(argv[1], "get_murmur3")) {
@@ -9,5 +39,23 @@ int cmd__bloom(int argc, const char **argv)
printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
}
+ if (!strcmp(argv[1], "generate_filter")) {
+ struct bloom_filter filter;
+ int i = 2;
+ filter.len = (settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
+ filter.data = xcalloc(filter.len, sizeof(unsigned char));
+
+ if (!argv[2]){
+ die("at least one input string expected");
+ }
+
+ while (argv[i]) {
+ add_string_to_filter(argv[i], &filter);
+ i++;
+ }
+
+ print_bloom_filter(&filter);
+ }
+
return 0;
}
\ No newline at end of file
diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
index 2dad8c4a94e..36a086c7c60 100755
--- a/t/t0095-bloom.sh
+++ b/t/t0095-bloom.sh
@@ -27,4 +27,44 @@ test_expect_success 'compute unseeded murmur3 hash for test string 2' '
test_cmp expect actual
'
+test_expect_success 'compute bloom key for empty string' '
+ cat >expect <<-\EOF &&
+ Hashes:0x5615800c|0x5b966560|0x61174ab4|0x66983008|0x6c19155c|0x7199fab0|0x771ae004|
+ Filter_Length:2
+ Filter_Data:11|11|
+ EOF
+ test-tool bloom generate_filter "" >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'compute bloom key for whitespace' '
+ cat >expect <<-\EOF &&
+ Hashes:0xf178874c|0x5f3d6eb6|0xcd025620|0x3ac73d8a|0xa88c24f4|0x16510c5e|0x8415f3c8|
+ Filter_Length:2
+ Filter_Data:51|55|
+ EOF
+ test-tool bloom generate_filter " " >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'compute bloom key for test string 1' '
+ cat >expect <<-\EOF &&
+ Hashes:0xb270de9b|0x1bb6f26e|0x84fd0641|0xee431a14|0x57892de7|0xc0cf41ba|0x2a15558d|
+ Filter_Length:2
+ Filter_Data:92|6c|
+ EOF
+ test-tool bloom generate_filter "Hello world!" >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'compute bloom key for test string 2' '
+ cat >expect <<-\EOF &&
+ Hashes:0x20ab385b|0xf5237fe2|0xc99bc769|0x9e140ef0|0x728c5677|0x47049dfe|0x1b7ce585|
+ Filter_Length:2
+ Filter_Data:a5|4a|
+ EOF
+ test-tool bloom generate_filter "file.txt" >actual &&
+ test_cmp expect actual
+'
+
test_done
\ No newline at end of file
--
gitgitgadget
* [PATCH v3 04/16] bloom.c: core Bloom filter implementation for changed paths.
2020-03-30 0:31 ` [PATCH v3 00/16] " Garima Singh via GitGitGadget
` (2 preceding siblings ...)
2020-03-30 0:31 ` [PATCH v3 03/16] bloom.c: introduce core Bloom filter constructs Garima Singh via GitGitGadget
@ 2020-03-30 0:31 ` Garima Singh via GitGitGadget
2020-03-30 0:31 ` [PATCH v3 05/16] diff: halt tree-diff early after max_changes Derrick Stolee via GitGitGadget
` (12 subsequent siblings)
16 siblings, 0 replies; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-03-30 0:31 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
Add the core implementation for computing Bloom filters for
the paths changed between a commit and its first parent.
We fill the Bloom filters as (const char *data, int len) pairs,
stored as `struct bloom_filter` within a commit slab.
Filters for commits with no changes, or with more than 512 changes,
are represented with a filter of length zero. There is no gain
in distinguishing between a computed filter of length zero for
a commit with no changes, and an uncomputed filter for new commits
or for commits with more than 512 changes. The effect on
`git log -- path` is the same in both cases. We will fall back to
the normal diffing algorithm when we can't benefit from the
existence of Bloom filters.
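To make the shape of the computation concrete, here is a small
standalone sketch. It is not the code in this patch: the names
for_each_leading_path and changed_path_filter_len are hypothetical, and
the sketch walks the path by length instead of truncating the string in
place. It shows the two rules used when building a filter: every
changed path contributes itself plus each leading directory (without
the trailing '/'), and the filter length is zero for commits with no
changes or with more than 512 changes.

```c
#include <stddef.h>
#include <string.h>

#define MAX_CHANGES 512   /* above this, no filter is computed */
#define BITS_PER_ENTRY 10
#define BITS_PER_WORD 8

/*
 * Call 'cb' (if non-NULL) for 'path' and for each of its leading
 * directories, longest first. For "dir/subdir/file" that means
 * "dir/subdir/file", "dir/subdir" and "dir". Each prefix is passed as
 * a pointer plus a length; it is not NUL-terminated at that length.
 * Returns the number of prefixes visited.
 */
static int for_each_leading_path(const char *path,
				 void (*cb)(const char *, size_t, void *),
				 void *cb_data)
{
	size_t len = strlen(path);
	int n = 0;

	while (len > 0) {
		if (cb)
			cb(path, len, cb_data);
		n++;

		/* Drop the last path component ... */
		while (len > 0 && path[len - 1] != '/')
			len--;
		/* ... and the '/' that preceded it. */
		if (len > 0)
			len--;
	}
	return n;
}

/*
 * Filter length in 8-bit words for a commit with 'nr_changes' changed
 * paths that produced 'nr_entries' filter entries.
 */
static size_t changed_path_filter_len(size_t nr_changes, size_t nr_entries)
{
	if (nr_changes == 0 || nr_changes > MAX_CHANGES)
		return 0;
	return (nr_entries * BITS_PER_ENTRY + BITS_PER_WORD - 1) / BITS_PER_WORD;
}
```

For example, 20 entries at 10 bits each round up to 25 words, while a
commit with 513 changes gets a zero-length filter regardless of how
many entries it would have produced.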
Helped-by: Jeff King <peff@peff.net>
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Jakub Narębski <jnareb@gmail.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
bloom.c | 97 +++++++++++++++++++++++++++++++++++++++++++
bloom.h | 8 ++++
t/helper/test-bloom.c | 20 +++++++++
t/t0095-bloom.sh | 47 +++++++++++++++++++++
4 files changed, 172 insertions(+)
diff --git a/bloom.c b/bloom.c
index 888b67f1ea6..881a9841ede 100644
--- a/bloom.c
+++ b/bloom.c
@@ -1,5 +1,18 @@
#include "git-compat-util.h"
#include "bloom.h"
+#include "diff.h"
+#include "diffcore.h"
+#include "revision.h"
+#include "hashmap.h"
+
+define_commit_slab(bloom_filter_slab, struct bloom_filter);
+
+struct bloom_filter_slab bloom_filters;
+
+struct pathmap_hash_entry {
+ struct hashmap_entry entry;
+ const char path[FLEX_ARRAY];
+};
static uint32_t rotate_left(uint32_t value, int32_t count)
{
@@ -107,3 +120,87 @@ void add_key_to_filter(const struct bloom_key *key,
filter->data[block_pos] |= get_bitmask(hash_mod);
}
}
+
+void init_bloom_filters(void)
+{
+ init_bloom_filter_slab(&bloom_filters);
+}
+
+struct bloom_filter *get_bloom_filter(struct repository *r,
+ struct commit *c)
+{
+ struct bloom_filter *filter;
+ struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
+ int i;
+ struct diff_options diffopt;
+
+ if (bloom_filters.slab_size == 0)
+ return NULL;
+
+ filter = bloom_filter_slab_at(&bloom_filters, c);
+
+ repo_diff_setup(r, &diffopt);
+ diffopt.flags.recursive = 1;
+ diff_setup_done(&diffopt);
+
+ if (c->parents)
+ diff_tree_oid(&c->parents->item->object.oid, &c->object.oid, "", &diffopt);
+ else
+ diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
+ diffcore_std(&diffopt);
+
+ if (diff_queued_diff.nr <= 512) {
+ struct hashmap pathmap;
+ struct pathmap_hash_entry *e;
+ struct hashmap_iter iter;
+ hashmap_init(&pathmap, NULL, NULL, 0);
+
+ for (i = 0; i < diff_queued_diff.nr; i++) {
+ const char *path = diff_queued_diff.queue[i]->two->path;
+
+ /*
+ * Add each leading directory of the changed file, i.e. for
+ * 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
+ * the Bloom filter could be used to speed up commands like
+ * 'git log dir/subdir', too.
+ *
+ * Note that directories are added without the trailing '/'.
+ */
+ do {
+ char *last_slash = strrchr(path, '/');
+
+ FLEX_ALLOC_STR(e, path, path);
+ hashmap_entry_init(&e->entry, strhash(path));
+ hashmap_add(&pathmap, &e->entry);
+
+ if (!last_slash)
+ last_slash = (char*)path;
+ *last_slash = '\0';
+
+ } while (*path);
+
+ diff_free_filepair(diff_queued_diff.queue[i]);
+ }
+
+ filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
+ filter->data = xcalloc(filter->len, sizeof(unsigned char));
+
+ hashmap_for_each_entry(&pathmap, &iter, e, entry) {
+ struct bloom_key key;
+ fill_bloom_key(e->path, strlen(e->path), &key, &settings);
+ add_key_to_filter(&key, filter, &settings);
+ }
+
+ hashmap_free_entries(&pathmap, struct pathmap_hash_entry, entry);
+ } else {
+ for (i = 0; i < diff_queued_diff.nr; i++)
+ diff_free_filepair(diff_queued_diff.queue[i]);
+ filter->data = NULL;
+ filter->len = 0;
+ }
+
+ free(diff_queued_diff.queue);
+ DIFF_QUEUE_CLEAR(&diff_queued_diff);
+
+ return filter;
+}
diff --git a/bloom.h b/bloom.h
index b9ce422ca2d..85ab8e9423d 100644
--- a/bloom.h
+++ b/bloom.h
@@ -1,6 +1,9 @@
#ifndef BLOOM_H
#define BLOOM_H
+struct commit;
+struct repository;
+
struct bloom_filter_settings {
/*
* The version of the hashing technique being used.
@@ -73,4 +76,9 @@ void add_key_to_filter(const struct bloom_key *key,
struct bloom_filter *filter,
const struct bloom_filter_settings *settings);
+void init_bloom_filters(void);
+
+struct bloom_filter *get_bloom_filter(struct repository *r,
+ struct commit *c);
+
#endif
\ No newline at end of file
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 20460cde775..f18d1b722e1 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -1,6 +1,7 @@
#include "git-compat-util.h"
#include "bloom.h"
#include "test-tool.h"
+#include "commit.h"
struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
@@ -32,6 +33,16 @@ static void print_bloom_filter(struct bloom_filter *filter) {
printf("\n");
}
+static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
+{
+ struct commit *c;
+ struct bloom_filter *filter;
+ setup_git_directory();
+ c = lookup_commit(the_repository, commit_oid);
+ filter = get_bloom_filter(the_repository, c);
+ print_bloom_filter(filter);
+}
+
int cmd__bloom(int argc, const char **argv)
{
if (!strcmp(argv[1], "get_murmur3")) {
@@ -57,5 +68,14 @@ int cmd__bloom(int argc, const char **argv)
print_bloom_filter(&filter);
}
+ if (!strcmp(argv[1], "get_filter_for_commit")) {
+ struct object_id oid;
+ const char *end;
+ if (parse_oid_hex(argv[2], &oid, &end))
+ die("cannot parse oid '%s'", argv[2]);
+ init_bloom_filters();
+ get_bloom_filter_for_commit(&oid);
+ }
+
return 0;
}
\ No newline at end of file
diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
index 36a086c7c60..8f9eef116dc 100755
--- a/t/t0095-bloom.sh
+++ b/t/t0095-bloom.sh
@@ -67,4 +67,51 @@ test_expect_success 'compute bloom key for test string 2' '
test_cmp expect actual
'
+test_expect_success 'get bloom filters for commit with no changes' '
+ git init &&
+ git commit --allow-empty -m "c0" &&
+ cat >expect <<-\EOF &&
+ Filter_Length:0
+ Filter_Data:
+ EOF
+ test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'get bloom filter for commit with 10 changes' '
+ rm actual &&
+ rm expect &&
+ mkdir smallDir &&
+ for i in $(test_seq 0 9)
+ do
+ echo $i >smallDir/$i
+ done &&
+ git add smallDir &&
+ git commit -m "commit with 10 changes" &&
+ cat >expect <<-\EOF &&
+ Filter_Length:25
+ Filter_Data:82|a0|65|47|0c|92|90|c0|a1|40|02|a0|e2|40|e0|04|0a|9a|66|cf|80|19|85|42|23|
+ EOF
+ test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success EXPENSIVE 'get bloom filter for commit with 513 changes' '
+ rm actual &&
+ rm expect &&
+ mkdir bigDir &&
+ for i in $(test_seq 0 512)
+ do
+ echo $i >bigDir/$i
+ done &&
+ git add bigDir &&
+ git commit -m "commit with 513 changes" &&
+ cat >expect <<-\EOF &&
+ Filter_Length:0
+ Filter_Data:
+ EOF
+ test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
+ test_cmp expect actual
+'
+
test_done
\ No newline at end of file
--
gitgitgadget
* [PATCH v3 05/16] diff: halt tree-diff early after max_changes
2020-03-30 0:31 ` [PATCH v3 00/16] " Garima Singh via GitGitGadget
` (3 preceding siblings ...)
2020-03-30 0:31 ` [PATCH v3 04/16] bloom.c: core Bloom filter implementation for changed paths Garima Singh via GitGitGadget
@ 2020-03-30 0:31 ` Derrick Stolee via GitGitGadget
2020-03-30 0:31 ` [PATCH v3 06/16] commit-graph: compute Bloom filters for changed paths Garima Singh via GitGitGadget
` (11 subsequent siblings)
16 siblings, 0 replies; 159+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-03-30 0:31 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh, Derrick Stolee
From: Derrick Stolee <dstolee@microsoft.com>
When computing the changed-paths Bloom filters for the commit-graph,
we limit the size of the filter by restricting the number of paths
in the diff. Instead of computing a large diff and then ignoring the
result, it is better to halt the diff computation early.
Create a new "max_changes" option in struct diff_options. If non-zero,
then halt the diff computation after discovering strictly more than
max_changes changed paths. This includes paths corresponding to trees
that change.
Use this max_changes option in the Bloom filter calculations. This
reduces the time taken to compute the filters for the Linux kernel
repo from 2m50s to 2m35s. On a large internal repository with ~500
commits that perform tree-wide changes, the time dropped from
6m15s to 3m48s.
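The early-halt rule can be sketched in isolation as follows. This is a
simplified stand-in, not the tree-diff code: struct walk_opts and
walk_entries are hypothetical names, and the loop just counts entries
instead of comparing trees. The check mirrors the condition described
above: a limit of zero disables it, and the walk stops only once the
counter is strictly greater than the limit.

```c
/* Hypothetical stand-in for the two counters added to diff_options. */
struct walk_opts {
	int max_changes; /* 0 means "no limit" */
	int num_changes; /* reset at the start of every walk */
};

/*
 * Visit up to 'total' changed entries, stopping early once more than
 * opt->max_changes of them have been counted. Returns how many entries
 * were visited before the walk ended.
 */
static int walk_entries(struct walk_opts *opt, int total)
{
	int visited = 0;

	opt->num_changes = 0;
	while (visited < total) {
		if (opt->max_changes && opt->num_changes > opt->max_changes)
			break;

		/* A real walk would emit a changed path here. */
		visited++;
		opt->num_changes++;
	}
	return visited;
}
```

With a limit of 512, a walk over 1000 entries stops after 513, which is
enough for the caller to see that the limit was exceeded and to skip
building a filter.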
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
bloom.c | 4 +++-
diff.h | 5 +++++
tree-diff.c | 6 ++++++
3 files changed, 14 insertions(+), 1 deletion(-)
diff --git a/bloom.c b/bloom.c
index 881a9841ede..a16eee92331 100644
--- a/bloom.c
+++ b/bloom.c
@@ -133,6 +133,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
int i;
struct diff_options diffopt;
+ int max_changes = 512;
if (bloom_filters.slab_size == 0)
return NULL;
@@ -141,6 +142,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
repo_diff_setup(r, &diffopt);
diffopt.flags.recursive = 1;
+ diffopt.max_changes = max_changes;
diff_setup_done(&diffopt);
if (c->parents)
@@ -149,7 +151,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
diffcore_std(&diffopt);
- if (diff_queued_diff.nr <= 512) {
+ if (diff_queued_diff.nr <= max_changes) {
struct hashmap pathmap;
struct pathmap_hash_entry *e;
struct hashmap_iter iter;
diff --git a/diff.h b/diff.h
index 6febe7e3656..9443dc1b003 100644
--- a/diff.h
+++ b/diff.h
@@ -285,6 +285,11 @@ struct diff_options {
/* Number of hexdigits to abbreviate raw format output to. */
int abbrev;
+ /* If non-zero, then stop computing after this many changes. */
+ int max_changes;
+ /* For internal use only. */
+ int num_changes;
+
int ita_invisible_in_index;
/* white-space error highlighting */
#define WSEH_NEW (1<<12)
diff --git a/tree-diff.c b/tree-diff.c
index 33ded7f8b3e..f3d303c6e54 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -434,6 +434,9 @@ static struct combine_diff_path *ll_diff_tree_paths(
if (diff_can_quit_early(opt))
break;
+ if (opt->max_changes && opt->num_changes > opt->max_changes)
+ break;
+
if (opt->pathspec.nr) {
skip_uninteresting(&t, base, opt);
for (i = 0; i < nparent; i++)
@@ -518,6 +521,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
/* t↓ */
update_tree_entry(&t);
+ opt->num_changes++;
}
/* t > p[imin] */
@@ -535,6 +539,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
skip_emit_tp:
/* ∀ pi=p[imin] pi↓ */
update_tp_entries(tp, nparent);
+ opt->num_changes++;
}
}
@@ -552,6 +557,7 @@ struct combine_diff_path *diff_tree_paths(
const struct object_id **parents_oid, int nparent,
struct strbuf *base, struct diff_options *opt)
{
+ opt->num_changes = 0;
p = ll_diff_tree_paths(p, oid, parents_oid, nparent, base, opt);
/*
--
gitgitgadget
* [PATCH v3 06/16] commit-graph: compute Bloom filters for changed paths
2020-03-30 0:31 ` [PATCH v3 00/16] " Garima Singh via GitGitGadget
` (4 preceding siblings ...)
2020-03-30 0:31 ` [PATCH v3 05/16] diff: halt tree-diff early after max_changes Derrick Stolee via GitGitGadget
@ 2020-03-30 0:31 ` Garima Singh via GitGitGadget
2020-03-30 0:31 ` [PATCH v3 07/16] commit-graph: examine changed-path objects in pack order Jeff King via GitGitGadget
` (10 subsequent siblings)
16 siblings, 0 replies; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-03-30 0:31 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
Add a new COMMIT_GRAPH_WRITE_CHANGED_PATHS flag that makes Git compute
Bloom filters for the paths that changed between a commit and its
first parent, for each commit in the commit-graph. This computation
is done on a commit-by-commit basis.
We will write these Bloom filters to the commit-graph file, to store
this data on disk, in the next change in this series.
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
commit-graph.c | 32 +++++++++++++++++++++++++++++++-
commit-graph.h | 3 ++-
2 files changed, 33 insertions(+), 2 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index e4f1a5b2f1a..862a00d67ed 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -16,6 +16,7 @@
#include "hashmap.h"
#include "replace-object.h"
#include "progress.h"
+#include "bloom.h"
#define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
@@ -789,9 +790,11 @@ struct write_commit_graph_context {
unsigned append:1,
report_progress:1,
split:1,
- check_oids:1;
+ check_oids:1,
+ changed_paths:1;
const struct split_commit_graph_opts *split_opts;
+ size_t total_bloom_filter_data_size;
};
static void write_graph_chunk_fanout(struct hashfile *f,
@@ -1134,6 +1137,28 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
stop_progress(&ctx->progress);
}
+static void compute_bloom_filters(struct write_commit_graph_context *ctx)
+{
+ int i;
+ struct progress *progress = NULL;
+
+ init_bloom_filters();
+
+ if (ctx->report_progress)
+ progress = start_delayed_progress(
+ _("Computing commit changed paths Bloom filters"),
+ ctx->commits.nr);
+
+ for (i = 0; i < ctx->commits.nr; i++) {
+ struct commit *c = ctx->commits.list[i];
+ struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
+ ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
+ display_progress(progress, i + 1);
+ }
+
+ stop_progress(&progress);
+}
+
static int add_ref_to_list(const char *refname,
const struct object_id *oid,
int flags, void *cb_data)
@@ -1776,6 +1801,8 @@ int write_commit_graph(struct object_directory *odb,
ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
ctx->check_oids = flags & COMMIT_GRAPH_WRITE_CHECK_OIDS ? 1 : 0;
ctx->split_opts = split_opts;
+ ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
+ ctx->total_bloom_filter_data_size = 0;
if (ctx->split) {
struct commit_graph *g;
@@ -1870,6 +1897,9 @@ int write_commit_graph(struct object_directory *odb,
compute_generation_numbers(ctx);
+ if (ctx->changed_paths)
+ compute_bloom_filters(ctx);
+
res = write_commit_graph_file(ctx);
if (ctx->split)
diff --git a/commit-graph.h b/commit-graph.h
index e87a6f63600..86be81219da 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -79,7 +79,8 @@ enum commit_graph_write_flags {
COMMIT_GRAPH_WRITE_PROGRESS = (1 << 1),
COMMIT_GRAPH_WRITE_SPLIT = (1 << 2),
/* Make sure that each OID in the input is a valid commit OID. */
- COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3)
+ COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3),
+ COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4),
};
struct split_commit_graph_opts {
--
gitgitgadget
^ permalink raw reply related [flat|nested] 159+ messages in thread
* [PATCH v3 07/16] commit-graph: examine changed-path objects in pack order
2020-03-30 0:31 ` [PATCH v3 00/16] " Garima Singh via GitGitGadget
` (5 preceding siblings ...)
2020-03-30 0:31 ` [PATCH v3 06/16] commit-graph: compute Bloom filters for changed paths Garima Singh via GitGitGadget
@ 2020-03-30 0:31 ` Jeff King via GitGitGadget
2020-03-30 0:31 ` [PATCH v3 08/16] commit-graph: examine commits by generation number Garima Singh via GitGitGadget
` (9 subsequent siblings)
16 siblings, 0 replies; 159+ messages in thread
From: Jeff King via GitGitGadget @ 2020-03-30 0:31 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh, Jeff King
From: Jeff King <peff@peff.net>
Looking at the diff of commit objects in pack order is much faster than
in sha1 order, as it gives locality to the access of tree deltas
(whereas sha1 order is effectively random). Unfortunately the
commit-graph code sorts the commits (several times, sometimes as an oid
and sometimes a pointer-to-commit), and we ultimately traverse in sha1
order.
Instead, let's remember the position at which we see each commit, and
traverse in that order when looking at Bloom filters. This drops my time
for "git commit-graph write --changed-paths" in linux.git from ~4
minutes to ~1.5 minutes.
Probably the "--reachable" code path would want something similar.
Or alternatively, we could use a different data structure (either a
hash, or maybe even just a bit in "struct commit") to keep track of
which oids we've seen, etc., instead of sorting. And then we could keep
the original order.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
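A minimal standalone sketch of the idea described above: remember the
order in which each entry is first seen, then sort a copy of the pointer
list by that remembered position. It assumes a plain struct with a pos
member instead of a commit slab; struct item, record_pos, pos_cmp and
sort_by_discovery are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a commit with a remembered position. */
struct item {
	int id;
	int pos;
};

static int next_pos;

/* Record the order in which an item is first seen. */
static void record_pos(struct item *it)
{
	it->pos = next_pos++;
}

/* Compare two "struct item *" array elements by recorded position. */
static int pos_cmp(const void *va, const void *vb)
{
	const struct item *a = *(const struct item *const *)va;
	const struct item *b = *(const struct item *const *)vb;

	/* Yields -1, 0 or 1 without relying on integer subtraction. */
	return (a->pos > b->pos) - (a->pos < b->pos);
}

static void sort_by_discovery(struct item **list, size_t nr)
{
	assert(list || nr == 0);
	qsort(list, nr, sizeof(*list), pos_cmp);
}
```

Note that the comparator reads the stored int values, not the addresses
of the slots that hold them.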
commit-graph.c | 38 +++++++++++++++++++++++++++++++++++---
1 file changed, 35 insertions(+), 3 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index 862a00d67ed..31b06f878ce 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -17,6 +17,7 @@
#include "replace-object.h"
#include "progress.h"
#include "bloom.h"
+#include "commit-slab.h"
#define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
@@ -46,9 +47,32 @@
/* Remember to update object flag allocation in object.h */
#define REACHABLE (1u<<15)
-char *get_commit_graph_filename(struct object_directory *odb)
+/* Keep track of the order in which commits are added to our list. */
+define_commit_slab(commit_pos, int);
+static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
+
+static void set_commit_pos(struct repository *r, const struct object_id *oid)
+{
+ static int32_t max_pos;
+ struct commit *commit = lookup_commit(r, oid);
+
+ if (!commit)
+ return; /* should never happen, but be lenient */
+
+ *commit_pos_at(&commit_pos, commit) = max_pos++;
+}
+
+static int commit_pos_cmp(const void *va, const void *vb)
{
- return xstrfmt("%s/info/commit-graph", odb->path);
+ const struct commit *a = *(const struct commit **)va;
+ const struct commit *b = *(const struct commit **)vb;
+ return commit_pos_at(&commit_pos, a) -
+ commit_pos_at(&commit_pos, b);
+}
+
+char *get_commit_graph_filename(struct object_directory *obj_dir)
+{
+ return xstrfmt("%s/info/commit-graph", obj_dir->path);
}
static char *get_split_graph_filename(struct object_directory *odb,
@@ -1021,6 +1045,8 @@ static int add_packed_commits(const struct object_id *oid,
oidcpy(&(ctx->oids.list[ctx->oids.nr]), oid);
ctx->oids.nr++;
+ set_commit_pos(ctx->r, oid);
+
return 0;
}
@@ -1141,6 +1167,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
{
int i;
struct progress *progress = NULL;
+ struct commit **sorted_commits;
init_bloom_filters();
@@ -1149,13 +1176,18 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
_("Computing commit changed paths Bloom filters"),
ctx->commits.nr);
+ ALLOC_ARRAY(sorted_commits, ctx->commits.nr);
+ COPY_ARRAY(sorted_commits, ctx->commits.list, ctx->commits.nr);
+ QSORT(sorted_commits, ctx->commits.nr, commit_pos_cmp);
+
for (i = 0; i < ctx->commits.nr; i++) {
- struct commit *c = ctx->commits.list[i];
+ struct commit *c = sorted_commits[i];
struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
display_progress(progress, i + 1);
}
+ free(sorted_commits);
stop_progress(&progress);
}
--
gitgitgadget
^ permalink raw reply related [flat|nested] 159+ messages in thread
* [PATCH v3 08/16] commit-graph: examine commits by generation number
2020-03-30 0:31 ` [PATCH v3 00/16] " Garima Singh via GitGitGadget
` (6 preceding siblings ...)
2020-03-30 0:31 ` [PATCH v3 07/16] commit-graph: examine changed-path objects in pack order Jeff King via GitGitGadget
@ 2020-03-30 0:31 ` Garima Singh via GitGitGadget
2020-03-30 0:31 ` [PATCH v3 09/16] diff: skip batch object download when possible Garima Singh via GitGitGadget
` (8 subsequent siblings)
16 siblings, 0 replies; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-03-30 0:31 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
When running 'git commit-graph write --changed-paths', we sort the
commits by pack-order to save time when computing the changed-paths
Bloom filters. This does not help when finding the commits via the
'--reachable' flag.
If not using pack-order, then sort by generation number before
examining the diff. Commits with similar generation are more likely
to have many trees in common, making the diff faster.
On the Linux kernel repository, this change reduced the computation
time for 'git commit-graph write --reachable --changed-paths' from
3m00s to 1m37s.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
commit-graph.c | 33 ++++++++++++++++++++++++++++++---
1 file changed, 30 insertions(+), 3 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index 31b06f878ce..732c81fa1b2 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -70,6 +70,25 @@ static int commit_pos_cmp(const void *va, const void *vb)
commit_pos_at(&commit_pos, b);
}
+static int commit_gen_cmp(const void *va, const void *vb)
+{
+ const struct commit *a = *(const struct commit **)va;
+ const struct commit *b = *(const struct commit **)vb;
+
+ /* lower generation commits first */
+ if (a->generation < b->generation)
+ return -1;
+ else if (a->generation > b->generation)
+ return 1;
+
+ /* use date as a heuristic when generations are equal */
+ if (a->date < b->date)
+ return -1;
+ else if (a->date > b->date)
+ return 1;
+ return 0;
+}
+
char *get_commit_graph_filename(struct object_directory *obj_dir)
{
return xstrfmt("%s/info/commit-graph", obj_dir->path);
@@ -815,7 +834,8 @@ struct write_commit_graph_context {
report_progress:1,
split:1,
check_oids:1,
- changed_paths:1;
+ changed_paths:1,
+ order_by_pack:1;
const struct split_commit_graph_opts *split_opts;
size_t total_bloom_filter_data_size;
@@ -1178,7 +1198,11 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
ALLOC_ARRAY(sorted_commits, ctx->commits.nr);
COPY_ARRAY(sorted_commits, ctx->commits.list, ctx->commits.nr);
- QSORT(sorted_commits, ctx->commits.nr, commit_pos_cmp);
+
+ if (ctx->order_by_pack)
+ QSORT(sorted_commits, ctx->commits.nr, commit_pos_cmp);
+ else
+ QSORT(sorted_commits, ctx->commits.nr, commit_gen_cmp);
for (i = 0; i < ctx->commits.nr; i++) {
struct commit *c = sorted_commits[i];
@@ -1884,6 +1908,7 @@ int write_commit_graph(struct object_directory *odb,
}
if (pack_indexes) {
+ ctx->order_by_pack = 1;
if ((res = fill_oids_from_packs(ctx, pack_indexes)))
goto cleanup;
}
@@ -1893,8 +1918,10 @@ int write_commit_graph(struct object_directory *odb,
goto cleanup;
}
- if (!pack_indexes && !commit_hex)
+ if (!pack_indexes && !commit_hex) {
+ ctx->order_by_pack = 1;
fill_oids_from_all_packs(ctx);
+ }
close_reachable(ctx);
--
gitgitgadget
^ permalink raw reply related [flat|nested] 159+ messages in thread
* [PATCH v3 09/16] diff: skip batch object download when possible
2020-03-30 0:31 ` [PATCH v3 00/16] " Garima Singh via GitGitGadget
` (7 preceding siblings ...)
2020-03-30 0:31 ` [PATCH v3 08/16] commit-graph: examine commits by generation number Garima Singh via GitGitGadget
@ 2020-03-30 0:31 ` Garima Singh via GitGitGadget
2020-03-30 0:31 ` [PATCH v3 10/16] commit-graph: write Bloom filters to commit graph file Garima Singh via GitGitGadget
` (7 subsequent siblings)
16 siblings, 0 replies; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-03-30 0:31 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
When computing changed-path Bloom filters or performing a name-only
diff, we do not need the blob contents before completing the diff
values. Thus, we do not need to download a pack containing the blobs
we do not have on-disk before completing our diff calculation.
This prevents downloading every blob in a partial clone when computing
changed path Bloom filters. It also prevents over-aggressive downloads
during "git log --raw" commands.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
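A minimal standalone sketch of the mask test that decides whether the
batch download can be skipped: it passes only when no output format
other than "name" or "raw" is requested and rename detection is off.
The FMT_* constants and can_skip_prefetch are hypothetical stand-ins
with illustrative values; they are not the definitions in diff.h.

```c
#include <assert.h>

/* Illustrative format bits (assumed values, for this sketch only). */
#define FMT_RAW   0x0001u
#define FMT_PATCH 0x0010u
#define FMT_NAME  0x0100u

/*
 * Return 1 when the diff can be completed from tree entries alone,
 * i.e. no requested format needs blob contents and no rename
 * detection (which compares blob contents) is enabled.
 */
static int can_skip_prefetch(unsigned output_format, int detect_rename)
{
	unsigned needs_blobs = output_format & ~(FMT_NAME | FMT_RAW);

	return !needs_blobs && !detect_rename;
}
```

For instance, assert(can_skip_prefetch(FMT_NAME, 0)) holds, while
adding FMT_PATCH or enabling rename detection makes the function
return 0. An output_format of 0, as in the Bloom filter computation,
also passes the test.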
bloom.c | 1 +
diff.c | 8 +++++++-
diff.h | 1 +
3 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/bloom.c b/bloom.c
index a16eee92331..dbcf594baec 100644
--- a/bloom.c
+++ b/bloom.c
@@ -142,6 +142,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
repo_diff_setup(r, &diffopt);
diffopt.flags.recursive = 1;
+ diffopt.detect_rename = 0;
diffopt.max_changes = max_changes;
diff_setup_done(&diffopt);
diff --git a/diff.c b/diff.c
index 1010d806f50..63376adb011 100644
--- a/diff.c
+++ b/diff.c
@@ -4633,6 +4633,10 @@ void diff_setup_done(struct diff_options *options)
if (!options->use_color || external_diff())
options->color_moved = 0;
+ if (!(options->output_format & ~(DIFF_FORMAT_NAME | DIFF_FORMAT_RAW)) &&
+ !options->detect_rename)
+ options->skip_batch_download_objects = 1;
+
FREE_AND_NULL(options->parseopts);
}
@@ -6507,7 +6511,9 @@ static void add_if_missing(struct repository *r,
void diffcore_std(struct diff_options *options)
{
- if (options->repo == the_repository && has_promisor_remote()) {
+ if (!options->skip_batch_download_objects &&
+ options->repo == the_repository &&
+ has_promisor_remote()) {
/*
* Prefetch the diff pairs that are about to be flushed.
*/
diff --git a/diff.h b/diff.h
index 9443dc1b003..e9f104309c4 100644
--- a/diff.h
+++ b/diff.h
@@ -281,6 +281,7 @@ struct diff_options {
int show_rename_progress;
int dirstat_permille;
int setup;
+ int skip_batch_download_objects;
/* Number of hexdigits to abbreviate raw format output to. */
int abbrev;
--
gitgitgadget
^ permalink raw reply related [flat|nested] 159+ messages in thread
* [PATCH v3 10/16] commit-graph: write Bloom filters to commit graph file
2020-03-30 0:31 ` [PATCH v3 00/16] " Garima Singh via GitGitGadget
` (8 preceding siblings ...)
2020-03-30 0:31 ` [PATCH v3 09/16] diff: skip batch object download when possible Garima Singh via GitGitGadget
@ 2020-03-30 0:31 ` Garima Singh via GitGitGadget
2020-03-30 0:31 ` [PATCH v3 11/16] commit-graph: reuse existing Bloom filters during write Garima Singh via GitGitGadget
` (6 subsequent siblings)
16 siblings, 0 replies; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-03-30 0:31 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
Update the technical documentation for commit-graph-format with
the formats for the Bloom filter index (BIDX) and Bloom filter
data (BDAT) chunks. Write the computed Bloom filters information
to the commit graph file using this format.
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
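A minimal standalone sketch of the BIDX layout written by this patch:
one big-endian 32-bit running total per commit, where each total is
the cumulative filter length up to and including that commit. It
assumes the per-commit filter lengths are already known; put_be32,
read_be32 and build_bloom_index are hypothetical helpers, not
functions from commit-graph.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store v at out[0..3] in network (big-endian) byte order. */
static void put_be32(unsigned char *out, uint32_t v)
{
	out[0] = (unsigned char)(v >> 24);
	out[1] = (unsigned char)(v >> 16);
	out[2] = (unsigned char)(v >> 8);
	out[3] = (unsigned char)v;
}

/* Load a big-endian 32-bit value from in[0..3]. */
static uint32_t read_be32(const unsigned char *in)
{
	return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
	       ((uint32_t)in[2] << 8) | (uint32_t)in[3];
}

/*
 * Fill bidx (4 * nr bytes) with the running totals of lens[0..nr-1]
 * and return the final total, i.e. the size of the filter data that
 * follows the BDAT header.
 */
static uint32_t build_bloom_index(const uint32_t *lens, size_t nr,
				  unsigned char *bidx)
{
	uint32_t cur_pos = 0;
	size_t i;

	assert((lens && bidx) || nr == 0);
	for (i = 0; i < nr; i++) {
		cur_pos += lens[i];
		put_be32(bidx + 4 * i, cur_pos);
	}
	return cur_pos;
}
```

A commit with an empty filter simply repeats the previous total, so
its entry needs no special marker.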
.../technical/commit-graph-format.txt | 30 +++++
commit-graph.c | 113 +++++++++++++++++-
commit-graph.h | 5 +
3 files changed, 147 insertions(+), 1 deletion(-)
diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index a4f17441aed..de56f9f1efd 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -17,6 +17,9 @@ metadata, including:
- The parents of the commit, stored using positional references within
the graph file.
+- The Bloom filter of the commit carrying the paths that were changed between
+ the commit and its first parent, if requested.
+
These positional references are stored as unsigned 32-bit integers
corresponding to the array position within the list of commit OIDs. Due
to some special constants we use to track parents, we can store at most
@@ -93,6 +96,33 @@ CHUNK DATA:
positions for the parents until reaching a value with the most-significant
bit on. The other bits correspond to the position of the last parent.
+ Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) (N * 4 bytes) [Optional]
+ * The ith entry, BIDX[i], stores the number of 8-byte word blocks in all
+ Bloom filters from commit 0 to commit i (inclusive) in lexicographic
+ order. The Bloom filter for the i-th commit spans from BIDX[i-1] to
+ BIDX[i] (plus header length), where BIDX[-1] is 0.
+ * The BIDX chunk is ignored if the BDAT chunk is not present.
+
+ Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
+ * It starts with header consisting of three unsigned 32-bit integers:
+ - Version of the hash algorithm being used. We currently only support
+ value 1 which corresponds to the 32-bit version of the murmur3 hash
+ implemented exactly as described in
+ https://en.wikipedia.org/wiki/MurmurHash#Algorithm and the double
+ hashing technique using seed values 0x293ae76f and 0x7e646e2 as
+ described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
+ in Probabilistic Verification"
+ - The number of times a path is hashed and hence the number of bit positions
+ that cumulatively determine whether a file is present in the commit.
+ - The minimum number of bits 'b' per entry in the Bloom filter. If the filter
+ contains 'n' entries, then the filter size is the minimum number of 64-bit
+ words that contain n*b bits.
+ * The rest of the chunk is the concatenation of all the computed Bloom
+ filters for the commits in lexicographic order.
+ * Note: Commits with no changes or more than 512 changes have Bloom filters
+ of length zero.
+ * The BDAT chunk is present if and only if BIDX is present.
+
Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
This list of H-byte hashes describe a set of B commit-graph files that
form a commit-graph chain. The graph position for the ith commit in this
diff --git a/commit-graph.c b/commit-graph.c
index 732c81fa1b2..a8b6b5cca5d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -24,8 +24,10 @@
#define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
#define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
+#define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
+#define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
#define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
-#define MAX_NUM_CHUNKS 5
+#define MAX_NUM_CHUNKS 7
#define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
@@ -319,6 +321,32 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
chunk_repeated = 1;
else
graph->chunk_base_graphs = data + chunk_offset;
+ break;
+
+ case GRAPH_CHUNKID_BLOOMINDEXES:
+ if (graph->chunk_bloom_indexes)
+ chunk_repeated = 1;
+ else
+ graph->chunk_bloom_indexes = data + chunk_offset;
+ break;
+
+ case GRAPH_CHUNKID_BLOOMDATA:
+ if (graph->chunk_bloom_data)
+ chunk_repeated = 1;
+ else {
+ uint32_t hash_version;
+ graph->chunk_bloom_data = data + chunk_offset;
+ hash_version = get_be32(data + chunk_offset);
+
+ if (hash_version != 1)
+ break;
+
+ graph->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
+ graph->bloom_filter_settings->hash_version = hash_version;
+ graph->bloom_filter_settings->num_hashes = get_be32(data + chunk_offset + 4);
+ graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
+ }
+ break;
}
if (chunk_repeated) {
@@ -337,6 +365,15 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
last_chunk_offset = chunk_offset;
}
+ if (graph->chunk_bloom_indexes && graph->chunk_bloom_data) {
+ init_bloom_filters();
+ } else {
+ /* We need both the bloom chunks to exist together. Else ignore the data */
+ graph->chunk_bloom_indexes = NULL;
+ graph->chunk_bloom_data = NULL;
+ graph->bloom_filter_settings = NULL;
+ }
+
hashcpy(graph->oid.hash, graph->data + graph->data_len - graph->hash_len);
if (verify_commit_graph_lite(graph)) {
@@ -1034,6 +1071,59 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
}
}
+static void write_graph_chunk_bloom_indexes(struct hashfile *f,
+ struct write_commit_graph_context *ctx)
+{
+ struct commit **list = ctx->commits.list;
+ struct commit **last = ctx->commits.list + ctx->commits.nr;
+ uint32_t cur_pos = 0;
+ struct progress *progress = NULL;
+ int i = 0;
+
+ if (ctx->report_progress)
+ progress = start_delayed_progress(
+ _("Writing changed paths Bloom filters index"),
+ ctx->commits.nr);
+
+ while (list < last) {
+ struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
+ cur_pos += filter->len;
+ display_progress(progress, ++i);
+ hashwrite_be32(f, cur_pos);
+ list++;
+ }
+
+ stop_progress(&progress);
+}
+
+static void write_graph_chunk_bloom_data(struct hashfile *f,
+ struct write_commit_graph_context *ctx,
+ const struct bloom_filter_settings *settings)
+{
+ struct commit **list = ctx->commits.list;
+ struct commit **last = ctx->commits.list + ctx->commits.nr;
+ struct progress *progress = NULL;
+ int i = 0;
+
+ if (ctx->report_progress)
+ progress = start_delayed_progress(
+ _("Writing changed paths Bloom filters data"),
+ ctx->commits.nr);
+
+ hashwrite_be32(f, settings->hash_version);
+ hashwrite_be32(f, settings->num_hashes);
+ hashwrite_be32(f, settings->bits_per_entry);
+
+ while (list < last) {
+ struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
+ display_progress(progress, ++i);
+ hashwrite(f, filter->data, filter->len * sizeof(unsigned char));
+ list++;
+ }
+
+ stop_progress(&progress);
+}
+
static int oid_compare(const void *_a, const void *_b)
{
const struct object_id *a = (const struct object_id *)_a;
@@ -1438,6 +1528,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
struct strbuf progress_title = STRBUF_INIT;
int num_chunks = 3;
struct object_id file_hash;
+ const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
if (ctx->split) {
struct strbuf tmp_file = STRBUF_INIT;
@@ -1482,6 +1573,12 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
num_chunks++;
}
+ if (ctx->changed_paths) {
+ chunk_ids[num_chunks] = GRAPH_CHUNKID_BLOOMINDEXES;
+ num_chunks++;
+ chunk_ids[num_chunks] = GRAPH_CHUNKID_BLOOMDATA;
+ num_chunks++;
+ }
if (ctx->num_commit_graphs_after > 1) {
chunk_ids[num_chunks] = GRAPH_CHUNKID_BASE;
num_chunks++;
@@ -1500,6 +1597,15 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
4 * ctx->num_extra_edges;
num_chunks++;
}
+ if (ctx->changed_paths) {
+ chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
+ sizeof(uint32_t) * ctx->commits.nr;
+ num_chunks++;
+
+ chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
+ sizeof(uint32_t) * 3 + ctx->total_bloom_filter_data_size;
+ num_chunks++;
+ }
if (ctx->num_commit_graphs_after > 1) {
chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
hashsz * (ctx->num_commit_graphs_after - 1);
@@ -1537,6 +1643,10 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
write_graph_chunk_data(f, hashsz, ctx);
if (ctx->num_extra_edges)
write_graph_chunk_extra_edges(f, ctx);
+ if (ctx->changed_paths) {
+ write_graph_chunk_bloom_indexes(f, ctx);
+ write_graph_chunk_bloom_data(f, ctx, &bloom_settings);
+ }
if (ctx->num_commit_graphs_after > 1 &&
write_graph_chunk_base(f, ctx)) {
return -1;
@@ -2184,6 +2294,7 @@ void free_commit_graph(struct commit_graph *g)
close(g->graph_fd);
}
free(g->filename);
+ free(g->bloom_filter_settings);
free(g);
}
diff --git a/commit-graph.h b/commit-graph.h
index 86be81219da..8e7a8e0e5b2 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -11,6 +11,7 @@
#define GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD"
struct commit;
+struct bloom_filter_settings;
char *get_commit_graph_filename(struct object_directory *odb);
int open_commit_graph(const char *graph_file, int *fd, struct stat *st);
@@ -59,6 +60,10 @@ struct commit_graph {
const unsigned char *chunk_commit_data;
const unsigned char *chunk_extra_edges;
const unsigned char *chunk_base_graphs;
+ const unsigned char *chunk_bloom_indexes;
+ const unsigned char *chunk_bloom_data;
+
+ struct bloom_filter_settings *bloom_filter_settings;
};
struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st,
--
gitgitgadget
^ permalink raw reply related [flat|nested] 159+ messages in thread
* [PATCH v3 11/16] commit-graph: reuse existing Bloom filters during write
2020-03-30 0:31 ` [PATCH v3 00/16] " Garima Singh via GitGitGadget
` (9 preceding siblings ...)
2020-03-30 0:31 ` [PATCH v3 10/16] commit-graph: write Bloom filters to commit graph file Garima Singh via GitGitGadget
@ 2020-03-30 0:31 ` Garima Singh via GitGitGadget
2020-03-30 0:31 ` [PATCH v3 12/16] commit-graph: add --changed-paths option to write subcommand Garima Singh via GitGitGadget
` (5 subsequent siblings)
16 siblings, 0 replies; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-03-30 0:31 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
Add logic to
a) parse Bloom filter information from the commit graph file and,
b) re-use existing Bloom filters.
See Documentation/technical/commit-graph-format.txt for the format in which
the Bloom filter information is written to the commit graph file.
To read the Bloom filter for a given commit with lexicographic position
'i' we need to:
1. Read BIDX[i], which gives us the starting index in BDAT for the
   filter of commit i+1, i.e. the index just past the end of the
   filter of commit i. It is called end_index in the code.
2. For i>0, read BIDX[i-1], which gives us the starting index in BDAT
   for the filter of commit i. It is called start_index in the code.
   For the first commit, where i = 0, Bloom filter data starts at the
   beginning, just past the header in the BDAT chunk. Hence, start_index
   will be 0.
3. The length of the filter will be end_index - start_index, because
   BIDX[i] gives the cumulative 8-byte words including the ith
   commit's filter.
We toggle whether Bloom filters should be recomputed based on the
compute_if_not_present flag.
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
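The three steps above can be sketched as a standalone lookup, assuming
the BIDX and BDAT chunks are available as byte arrays and the BDAT
header is three 32-bit integers. struct filter_slice, read_be32,
locate_filter and HEADER_SIZE are hypothetical names for illustration;
the macro is parenthesized so that it expands safely inside larger
expressions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Size of the BDAT header: three unsigned 32-bit integers. */
#define HEADER_SIZE (3 * sizeof(uint32_t))

/* Hypothetical view of one filter inside the BDAT chunk. */
struct filter_slice {
	const unsigned char *data;
	uint32_t len;
};

/* Load a big-endian 32-bit value from p[0..3]. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Locate the filter of the commit at lexicographic position i. */
static struct filter_slice locate_filter(const unsigned char *bidx,
					 const unsigned char *bdat,
					 uint32_t i)
{
	struct filter_slice s;
	/* Step 1: BIDX[i] is the index just past this filter. */
	uint32_t end_index = read_be32(bidx + 4 * (size_t)i);
	/* Step 2: BIDX[i-1] is where it starts; 0 for the first commit. */
	uint32_t start_index = i ? read_be32(bidx + 4 * (size_t)(i - 1)) : 0;

	/* Running totals never decrease in a well-formed index. */
	assert(start_index <= end_index);

	/* Step 3: the length is the difference of the two totals. */
	s.len = end_index - start_index;
	s.data = bdat + HEADER_SIZE + start_index;
	return s;
}
```

With BIDX totals of 3, 3 and 8, position 1 yields a zero-length filter
and position 2 yields a filter of length 5 starting 3 bytes past the
header.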
bloom.c | 49 ++++++++++++++++++++++++++++++++++++++++++-
bloom.h | 4 +++-
commit-graph.c | 6 +++---
t/helper/test-bloom.c | 2 +-
4 files changed, 55 insertions(+), 6 deletions(-)
diff --git a/bloom.c b/bloom.c
index dbcf594baec..151d598ce7b 100644
--- a/bloom.c
+++ b/bloom.c
@@ -4,6 +4,8 @@
#include "diffcore.h"
#include "revision.h"
#include "hashmap.h"
+#include "commit-graph.h"
+#include "commit.h"
define_commit_slab(bloom_filter_slab, struct bloom_filter);
@@ -26,6 +28,36 @@ static inline unsigned char get_bitmask(uint32_t pos)
return ((unsigned char)1) << (pos & (BITS_PER_WORD - 1));
}
+static int load_bloom_filter_from_graph(struct commit_graph *g,
+ struct bloom_filter *filter,
+ struct commit *c)
+{
+ uint32_t lex_pos, start_index, end_index;
+
+ while (c->graph_pos < g->num_commits_in_base)
+ g = g->base_graph;
+
+ /* The commit graph commit 'c' lives in doesn't carry bloom filters. */
+ if (!g->chunk_bloom_indexes)
+ return 0;
+
+ lex_pos = c->graph_pos - g->num_commits_in_base;
+
+ end_index = get_be32(g->chunk_bloom_indexes + 4 * lex_pos);
+
+ if (lex_pos > 0)
+ start_index = get_be32(g->chunk_bloom_indexes + 4 * (lex_pos - 1));
+ else
+ start_index = 0;
+
+ filter->len = end_index - start_index;
+ filter->data = (unsigned char *)(g->chunk_bloom_data +
+ sizeof(unsigned char) * start_index +
+ BLOOMDATA_CHUNK_HEADER_SIZE);
+
+ return 1;
+}
+
/*
* Calculate the murmur3 32-bit hash value for the given data
* using the given seed.
@@ -127,7 +159,8 @@ void init_bloom_filters(void)
}
struct bloom_filter *get_bloom_filter(struct repository *r,
- struct commit *c)
+ struct commit *c,
+ int compute_if_not_present)
{
struct bloom_filter *filter;
struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
@@ -140,6 +173,20 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
filter = bloom_filter_slab_at(&bloom_filters, c);
+ if (!filter->data) {
+ load_commit_graph_info(r, c);
+ if (c->graph_pos != COMMIT_NOT_FROM_GRAPH &&
+ r->objects->commit_graph->chunk_bloom_indexes) {
+ if (load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
+ return filter;
+ else
+ return NULL;
+ }
+ }
+
+ if (filter->data || !compute_if_not_present)
+ return filter;
+
repo_diff_setup(r, &diffopt);
diffopt.flags.recursive = 1;
diffopt.detect_rename = 0;
diff --git a/bloom.h b/bloom.h
index 85ab8e9423d..760d7122374 100644
--- a/bloom.h
+++ b/bloom.h
@@ -32,6 +32,7 @@ struct bloom_filter_settings {
#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
#define BITS_PER_WORD 8
+#define BLOOMDATA_CHUNK_HEADER_SIZE 3 * sizeof(uint32_t)
/*
* A bloom_filter struct represents a data segment to
@@ -79,6 +80,7 @@ void add_key_to_filter(const struct bloom_key *key,
void init_bloom_filters(void);
struct bloom_filter *get_bloom_filter(struct repository *r,
- struct commit *c);
+ struct commit *c,
+ int compute_if_not_present);
#endif
\ No newline at end of file
diff --git a/commit-graph.c b/commit-graph.c
index a8b6b5cca5d..77668629e27 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1086,7 +1086,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
ctx->commits.nr);
while (list < last) {
- struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
+ struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
cur_pos += filter->len;
display_progress(progress, ++i);
hashwrite_be32(f, cur_pos);
@@ -1115,7 +1115,7 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
hashwrite_be32(f, settings->bits_per_entry);
while (list < last) {
- struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
+ struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
display_progress(progress, ++i);
hashwrite(f, filter->data, filter->len * sizeof(unsigned char));
list++;
@@ -1296,7 +1296,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
for (i = 0; i < ctx->commits.nr; i++) {
struct commit *c = sorted_commits[i];
- struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
+ struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
display_progress(progress, i + 1);
}
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index f18d1b722e1..ce412664ba9 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -39,7 +39,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
struct bloom_filter *filter;
setup_git_directory();
c = lookup_commit(the_repository, commit_oid);
- filter = get_bloom_filter(the_repository, c);
+ filter = get_bloom_filter(the_repository, c, 1);
print_bloom_filter(filter);
}
--
gitgitgadget
^ permalink raw reply related [flat|nested] 159+ messages in thread
* [PATCH v3 12/16] commit-graph: add --changed-paths option to write subcommand
2020-03-30 0:31 ` [PATCH v3 00/16] " Garima Singh via GitGitGadget
` (10 preceding siblings ...)
2020-03-30 0:31 ` [PATCH v3 11/16] commit-graph: reuse existing Bloom filters during write Garima Singh via GitGitGadget
@ 2020-03-30 0:31 ` Garima Singh via GitGitGadget
2020-03-30 0:31 ` [PATCH v3 13/16] revision.c: use Bloom filters to speed up path based revision walks Garima Singh via GitGitGadget
` (4 subsequent siblings)
16 siblings, 0 replies; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-03-30 0:31 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
Add --changed-paths option to git commit-graph write. This option will
allow users to compute information about the paths that have changed
between a commit and its first parent, and write it into the commit graph
file. If the option is passed to the write subcommand, we set the
COMMIT_GRAPH_WRITE_BLOOM_FILTERS flag and pass it down to the
commit-graph logic.
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
Documentation/git-commit-graph.txt | 5 +++++
builtin/commit-graph.c | 9 +++++++--
2 files changed, 12 insertions(+), 2 deletions(-)
diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 28d1fee5053..f4b13c005b8 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -57,6 +57,11 @@ or `--stdin-packs`.)
With the `--append` option, include all commits that are present in the
existing commit-graph file.
+
+With the `--changed-paths` option, compute and write information about the
+paths changed between a commit and its first parent. This operation can
+take a while on large repositories. It provides significant performance gains
+for getting history of a directory or a file with `git log -- <path>`.
++
With the `--split` option, write the commit-graph as a chain of multiple
commit-graph files stored in `<dir>/info/commit-graphs`. The new commits
not already in the commit-graph are added in a new "tip" file. This file
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index d1ab6625f63..cacb5d04a80 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -9,7 +9,7 @@
static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph verify [--object-dir <objdir>] [--shallow] [--[no-]progress]"),
- N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
+ N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
NULL
};
@@ -19,7 +19,7 @@ static const char * const builtin_commit_graph_verify_usage[] = {
};
static const char * const builtin_commit_graph_write_usage[] = {
- N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
+ N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
NULL
};
@@ -32,6 +32,7 @@ static struct opts_commit_graph {
int split;
int shallow;
int progress;
+ int enable_changed_paths;
} opts;
static struct object_directory *find_odb(struct repository *r,
@@ -135,6 +136,8 @@ static int graph_write(int argc, const char **argv)
N_("start walk at commits listed by stdin")),
OPT_BOOL(0, "append", &opts.append,
N_("include all commits already in the commit-graph file")),
+ OPT_BOOL(0, "changed-paths", &opts.enable_changed_paths,
+ N_("enable computation for changed paths")),
OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")),
OPT_BOOL(0, "split", &opts.split,
N_("allow writing an incremental commit-graph file")),
@@ -168,6 +171,8 @@ static int graph_write(int argc, const char **argv)
flags |= COMMIT_GRAPH_WRITE_SPLIT;
if (opts.progress)
flags |= COMMIT_GRAPH_WRITE_PROGRESS;
+ if (opts.enable_changed_paths)
+ flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
read_replace_refs = 0;
odb = find_odb(the_repository, opts.obj_dir);
--
gitgitgadget
^ permalink raw reply related [flat|nested] 159+ messages in thread
* [PATCH v3 13/16] revision.c: use Bloom filters to speed up path based revision walks
2020-03-30 0:31 ` [PATCH v3 00/16] " Garima Singh via GitGitGadget
` (11 preceding siblings ...)
2020-03-30 0:31 ` [PATCH v3 12/16] commit-graph: add --changed-paths option to write subcommand Garima Singh via GitGitGadget
@ 2020-03-30 0:31 ` Garima Singh via GitGitGadget
2020-03-30 0:31 ` [PATCH v3 14/16] revision.c: add trace2 stats around Bloom filter usage Garima Singh via GitGitGadget
` (3 subsequent siblings)
16 siblings, 0 replies; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-03-30 0:31 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
Revision walk will now use Bloom filters for commits to speed up
revision walks for a particular path (for computing history for
that path), if they are present in the commit-graph file.
We load the Bloom filters during the prepare_revision_walk step,
currently only when dealing with a single pathspec. Extending
it to work with multiple pathspecs can be explored and built on
top of this series in the future.
While comparing trees in rev_compare_tree(), if the Bloom filter
says that the file is not different between the two trees, we don't
need to compute the expensive diff. This is where we get our
performance gains. The other response of the Bloom filter is 'maybe',
in which case we fall back to the full diff calculation to determine
if the path was changed in the commit.
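The two possible answers can be illustrated with a minimal, standalone
sketch. It is not the code from this patch: the sketch_* names, the fixed
number of hashes and the 8-bit word size are assumptions made only for
illustration, and the keys carry pre-computed hash values instead of being
derived from a path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BITS_PER_WORD 8	/* assumed word size for this sketch */
#define SKETCH_NUM_HASHES 3	/* assumed number of probes per key */

/* Hypothetical stand-in for a per-commit filter: a bit array of len bytes. */
struct sketch_filter {
	unsigned char *data;
	size_t len;
};

/* Hypothetical stand-in for a hashed path: one hash value per probe. */
struct sketch_key {
	uint32_t hashes[SKETCH_NUM_HASHES];
};

/* Mask selecting the bit for position pos inside its byte. */
static unsigned char sketch_bitmask(uint64_t pos)
{
	return (unsigned char)(1u << (pos % SKETCH_BITS_PER_WORD));
}

/* Record a key by setting one bit per hash value. */
static void sketch_add(struct sketch_filter *f, const struct sketch_key *k)
{
	uint64_t mod = (uint64_t)f->len * SKETCH_BITS_PER_WORD;
	int i;

	assert(mod != 0);
	for (i = 0; i < SKETCH_NUM_HASHES; i++) {
		uint64_t pos = k->hashes[i] % mod;
		f->data[pos / SKETCH_BITS_PER_WORD] |= sketch_bitmask(pos);
	}
}

/*
 * Returns -1 if the filter is empty and cannot answer,
 *          0 if the key is definitely not in the filter,
 *          1 if the key may be in the filter (could be a false positive).
 */
static int sketch_contains(const struct sketch_filter *f,
			   const struct sketch_key *k)
{
	uint64_t mod = (uint64_t)f->len * SKETCH_BITS_PER_WORD;
	int i;

	if (!mod)
		return -1;

	for (i = 0; i < SKETCH_NUM_HASHES; i++) {
		uint64_t pos = k->hashes[i] % mod;
		if (!(f->data[pos / SKETCH_BITS_PER_WORD] & sketch_bitmask(pos)))
			return 0; /* one unset bit is enough to rule the key out */
	}
	return 1;
}
```

A caller would skip the diff only on an answer of 0; both -1 and 1 fall
back to the full tree comparison.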
We do not try to use Bloom filters when the '--walk-reflogs' option
is specified. The '--walk-reflogs' option does not walk the commit
ancestry chain like the rest of the options. Incorporating the
performance gains when walking reflog entries would add more
complexity, and can be explored in a later series.
Performance Gains:
We tested the performance of `git log -- <path>` on the git repo, the linux
repo and some large internal repos, with a variety of paths of varying depths.
On the git and linux repos:
- we observed a 2x to 5x speed up.
On a large internal repo with files seated 6-10 levels deep in the tree:
- we observed 10x to 20x speed ups, with some paths going up to 28 times
faster.
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Helped-by: SZEDER Gábor <szeder.dev@gmail.com>
Helped-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
bloom.c | 20 +++++++++++++
bloom.h | 4 +++
revision.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
revision.h | 11 +++++++
4 files changed, 118 insertions(+), 2 deletions(-)
diff --git a/bloom.c b/bloom.c
index 151d598ce7b..dd9bab9bbd6 100644
--- a/bloom.c
+++ b/bloom.c
@@ -254,3 +254,23 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
return filter;
}
+
+int bloom_filter_contains(const struct bloom_filter *filter,
+ const struct bloom_key *key,
+ const struct bloom_filter_settings *settings)
+{
+ int i;
+ uint64_t mod = filter->len * BITS_PER_WORD;
+
+ if (!mod)
+ return -1;
+
+ for (i = 0; i < settings->num_hashes; i++) {
+ uint64_t hash_mod = key->hashes[i] % mod;
+ uint64_t block_pos = hash_mod / BITS_PER_WORD;
+ if (!(filter->data[block_pos] & get_bitmask(hash_mod)))
+ return 0;
+ }
+
+ return 1;
+}
\ No newline at end of file
diff --git a/bloom.h b/bloom.h
index 760d7122374..b935186425d 100644
--- a/bloom.h
+++ b/bloom.h
@@ -83,4 +83,8 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
struct commit *c,
int compute_if_not_present);
+int bloom_filter_contains(const struct bloom_filter *filter,
+ const struct bloom_key *key,
+ const struct bloom_filter_settings *settings);
+
#endif
\ No newline at end of file
diff --git a/revision.c b/revision.c
index 8136929e236..d3fcb7c6ff6 100644
--- a/revision.c
+++ b/revision.c
@@ -29,6 +29,7 @@
#include "prio-queue.h"
#include "hashmap.h"
#include "utf8.h"
+#include "bloom.h"
volatile show_early_output_fn_t show_early_output;
@@ -624,11 +625,80 @@ static void file_change(struct diff_options *options,
options->flags.has_changes = 1;
}
+static void prepare_to_use_bloom_filter(struct rev_info *revs)
+{
+ struct pathspec_item *pi;
+ char *path_alloc = NULL;
+ const char *path;
+ int last_index;
+ int len;
+
+ if (!revs->commits)
+ return;
+
+ repo_parse_commit(revs->repo, revs->commits->item);
+
+ if (!revs->repo->objects->commit_graph)
+ return;
+
+ revs->bloom_filter_settings = revs->repo->objects->commit_graph->bloom_filter_settings;
+ if (!revs->bloom_filter_settings)
+ return;
+
+ pi = &revs->pruning.pathspec.items[0];
+ last_index = pi->len - 1;
+
+ /* remove single trailing slash from path, if needed */
+ if (pi->match[last_index] == '/') {
+ path_alloc = xstrdup(pi->match);
+ path_alloc[last_index] = '\0';
+ path = path_alloc;
+ } else
+ path = pi->match;
+
+ len = strlen(path);
+
+ revs->bloom_key = xmalloc(sizeof(struct bloom_key));
+ fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
+
+ free(path_alloc);
+}
+
+static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
+ struct commit *commit)
+{
+ struct bloom_filter *filter;
+ int result;
+
+ if (!revs->repo->objects->commit_graph)
+ return -1;
+
+ if (commit->generation == GENERATION_NUMBER_INFINITY)
+ return -1;
+
+ filter = get_bloom_filter(revs->repo, commit, 0);
+
+ if (!filter) {
+ return -1;
+ }
+
+ if (!filter->len) {
+ return -1;
+ }
+
+ result = bloom_filter_contains(filter,
+ revs->bloom_key,
+ revs->bloom_filter_settings);
+
+ return result;
+}
+
static int rev_compare_tree(struct rev_info *revs,
- struct commit *parent, struct commit *commit)
+ struct commit *parent, struct commit *commit, int nth_parent)
{
struct tree *t1 = get_commit_tree(parent);
struct tree *t2 = get_commit_tree(commit);
+ int bloom_ret = 1;
if (!t1)
return REV_TREE_NEW;
@@ -653,11 +723,19 @@ static int rev_compare_tree(struct rev_info *revs,
return REV_TREE_SAME;
}
+ if (revs->bloom_key && !nth_parent) {
+ bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
+
+ if (bloom_ret == 0)
+ return REV_TREE_SAME;
+ }
+
tree_difference = REV_TREE_SAME;
revs->pruning.flags.has_changes = 0;
if (diff_tree_oid(&t1->object.oid, &t2->object.oid, "",
&revs->pruning) < 0)
return REV_TREE_DIFFERENT;
+
return tree_difference;
}
@@ -855,7 +933,7 @@ static void try_to_simplify_commit(struct rev_info *revs, struct commit *commit)
die("cannot simplify commit %s (because of %s)",
oid_to_hex(&commit->object.oid),
oid_to_hex(&p->object.oid));
- switch (rev_compare_tree(revs, p, commit)) {
+ switch (rev_compare_tree(revs, p, commit, nth_parent)) {
case REV_TREE_SAME:
if (!revs->simplify_history || !relevant_commit(p)) {
/* Even if a merge with an uninteresting
@@ -3362,6 +3440,8 @@ int prepare_revision_walk(struct rev_info *revs)
FOR_EACH_OBJECT_PROMISOR_ONLY);
}
+ if (revs->pruning.pathspec.nr == 1 && !revs->reflog_info)
+ prepare_to_use_bloom_filter(revs);
if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED)
commit_list_sort_by_date(&revs->commits);
if (revs->no_walk)
@@ -3379,6 +3459,7 @@ int prepare_revision_walk(struct rev_info *revs)
simplify_merges(revs);
if (revs->children.name)
set_children(revs);
+
return 0;
}
diff --git a/revision.h b/revision.h
index 475f048fb61..7c026fe41fc 100644
--- a/revision.h
+++ b/revision.h
@@ -56,6 +56,8 @@ struct repository;
struct rev_info;
struct string_list;
struct saved_parents;
+struct bloom_key;
+struct bloom_filter_settings;
define_shared_commit_slab(revision_sources, char *);
struct rev_cmdline_info {
@@ -291,6 +293,15 @@ struct rev_info {
struct revision_sources *sources;
struct topo_walk_info *topo_walk_info;
+
+ /* Commit graph bloom filter fields */
+ /* The bloom filter key for the pathspec */
+ struct bloom_key *bloom_key;
+ /*
+ * The bloom filter settings used to generate the key.
+ * This is loaded from the commit-graph being used.
+ */
+ struct bloom_filter_settings *bloom_filter_settings;
};
int ref_excluded(struct string_list *, const char *path);
--
gitgitgadget
^ permalink raw reply related [flat|nested] 159+ messages in thread
* [PATCH v3 14/16] revision.c: add trace2 stats around Bloom filter usage
2020-03-30 0:31 ` [PATCH v3 00/16] " Garima Singh via GitGitGadget
` (12 preceding siblings ...)
2020-03-30 0:31 ` [PATCH v3 13/16] revision.c: use Bloom filters to speed up path based revision walks Garima Singh via GitGitGadget
@ 2020-03-30 0:31 ` Garima Singh via GitGitGadget
2020-03-30 0:31 ` [PATCH v3 15/16] t4216: add end to end tests for git log with Bloom filters Garima Singh via GitGitGadget
` (2 subsequent siblings)
16 siblings, 0 replies; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-03-30 0:31 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
Add trace2 statistics around Bloom filter usage and behavior
for 'git log -- path' commands that are hoping to benefit from
the presence of computed changed paths Bloom filters.
These statistics are great for performance analysis work and
for formal testing, which we will see in the commit following
this one.
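As a rough illustration of how such counters relate to each other, here is
a small standalone sketch. The struct and function names are hypothetical,
and it collapses the "filter not present" and "zero length filter" cases
into a single counter for brevity. A false positive is a 'maybe' answer
for which the full diff then finds no change to the path.

```c
#include <assert.h>

/* Hypothetical counters, loosely mirroring the statistics described above. */
struct sketch_stats {
	unsigned int not_usable;	/* no filter, or an empty one */
	unsigned int maybe;
	unsigned int definitely_not;
	unsigned int false_positive;
};

/*
 * filter_answer:     -1 no usable filter, 0 definitely not, 1 maybe
 * diff_says_changed: what a full diff reports; it is only consulted
 *                    when the filter cannot rule the path out
 *
 * Returns 1 if the commit is treated as changing the path, 0 otherwise.
 */
static int sketch_record(struct sketch_stats *s, int filter_answer,
			 int diff_says_changed)
{
	assert(filter_answer >= -1 && filter_answer <= 1);

	if (filter_answer < 0) {
		s->not_usable++;
		return diff_says_changed;
	}
	if (filter_answer == 0) {
		s->definitely_not++;
		return 0; /* the diff is skipped entirely */
	}

	s->maybe++;
	if (!diff_says_changed)
		s->false_positive++; /* filter said maybe, diff found nothing */
	return diff_says_changed;
}
```

With counters like these, the ratio of false_positive to maybe gives a
quick sense of how well the filters are doing for a given path.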
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Helped-by: SZEDER Gábor <szeder.dev@gmail.com>
Helped-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
revision.c | 41 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 41 insertions(+)
diff --git a/revision.c b/revision.c
index d3fcb7c6ff6..2b06ee739c8 100644
--- a/revision.c
+++ b/revision.c
@@ -30,6 +30,7 @@
#include "hashmap.h"
#include "utf8.h"
#include "bloom.h"
+#include "json-writer.h"
volatile show_early_output_fn_t show_early_output;
@@ -625,6 +626,30 @@ static void file_change(struct diff_options *options,
options->flags.has_changes = 1;
}
+static int bloom_filter_atexit_registered;
+static unsigned int count_bloom_filter_maybe;
+static unsigned int count_bloom_filter_definitely_not;
+static unsigned int count_bloom_filter_false_positive;
+static unsigned int count_bloom_filter_not_present;
+static unsigned int count_bloom_filter_length_zero;
+
+static void trace2_bloom_filter_statistics_atexit(void)
+{
+ struct json_writer jw = JSON_WRITER_INIT;
+
+ jw_object_begin(&jw, 0);
+ jw_object_intmax(&jw, "filter_not_present", count_bloom_filter_not_present);
+ jw_object_intmax(&jw, "zero_length_filter", count_bloom_filter_length_zero);
+ jw_object_intmax(&jw, "maybe", count_bloom_filter_maybe);
+ jw_object_intmax(&jw, "definitely_not", count_bloom_filter_definitely_not);
+ jw_object_intmax(&jw, "false_positive", count_bloom_filter_false_positive);
+ jw_end(&jw);
+
+ trace2_data_json("bloom", the_repository, "statistics", &jw);
+
+ jw_release(&jw);
+}
+
static void prepare_to_use_bloom_filter(struct rev_info *revs)
{
struct pathspec_item *pi;
@@ -661,6 +686,11 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
revs->bloom_key = xmalloc(sizeof(struct bloom_key));
fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
+ if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
+ atexit(trace2_bloom_filter_statistics_atexit);
+ bloom_filter_atexit_registered = 1;
+ }
+
free(path_alloc);
}
@@ -679,10 +709,12 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
filter = get_bloom_filter(revs->repo, commit, 0);
if (!filter) {
+ count_bloom_filter_not_present++;
return -1;
}
if (!filter->len) {
+ count_bloom_filter_length_zero++;
return -1;
}
@@ -690,6 +722,11 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
revs->bloom_key,
revs->bloom_filter_settings);
+ if (result)
+ count_bloom_filter_maybe++;
+ else
+ count_bloom_filter_definitely_not++;
+
return result;
}
@@ -736,6 +773,10 @@ static int rev_compare_tree(struct rev_info *revs,
&revs->pruning) < 0)
return REV_TREE_DIFFERENT;
+ if (!nth_parent)
+ if (bloom_ret == 1 && tree_difference == REV_TREE_SAME)
+ count_bloom_filter_false_positive++;
+
return tree_difference;
}
--
gitgitgadget
^ permalink raw reply related [flat|nested] 159+ messages in thread
* [PATCH v3 15/16] t4216: add end to end tests for git log with Bloom filters
2020-03-30 0:31 ` [PATCH v3 00/16] " Garima Singh via GitGitGadget
` (13 preceding siblings ...)
2020-03-30 0:31 ` [PATCH v3 14/16] revision.c: add trace2 stats around Bloom filter usage Garima Singh via GitGitGadget
@ 2020-03-30 0:31 ` Garima Singh via GitGitGadget
2020-03-30 0:31 ` [PATCH v3 16/16] commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag Garima Singh via GitGitGadget
2020-04-06 16:59 ` [PATCH v4 00/15] Changed Paths Bloom Filters Garima Singh via GitGitGadget
16 siblings, 0 replies; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-03-30 0:31 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
These tests exercise writing a commit graph with Bloom filters
and exercise 'git log -- path' with all the applicable
options. They check that the output is the same with and
without Bloom filters, and confirm that Bloom filters were used by
checking if trace2 statistics were logged correctly.
They also confirm cases where Bloom filters are not used:
1. Multiple path specs,
2. --walk-reflogs (see patch titled 'revision.c: use Bloom filters...'
for details),
3. If the latest commit graph does not have Bloom filters
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
t/helper/test-read-graph.c | 4 +
t/t4216-log-bloom.sh | 155 +++++++++++++++++++++++++++++++++++++
2 files changed, 159 insertions(+)
create mode 100755 t/t4216-log-bloom.sh
diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index f8a461767ca..4223ff32fb6 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -45,6 +45,10 @@ int cmd__read_graph(int argc, const char **argv)
printf(" commit_metadata");
if (graph->chunk_extra_edges)
printf(" extra_edges");
+ if (graph->chunk_bloom_indexes)
+ printf(" bloom_indexes");
+ if (graph->chunk_bloom_data)
+ printf(" bloom_data");
printf("\n");
UNLEAK(graph);
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
new file mode 100755
index 00000000000..38accd272df
--- /dev/null
+++ b/t/t4216-log-bloom.sh
@@ -0,0 +1,155 @@
+#!/bin/sh
+
+test_description='git log for a path with Bloom filters'
+. ./test-lib.sh
+
+GIT_TEST_COMMIT_GRAPH=0
+GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
+
+test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
+ git init &&
+ mkdir A A/B A/B/C &&
+ test_commit c1 A/file1 &&
+ test_commit c2 A/B/file2 &&
+ test_commit c3 A/B/C/file3 &&
+ test_commit c4 A/file1 &&
+ test_commit c5 A/B/file2 &&
+ test_commit c6 A/B/C/file3 &&
+ test_commit c7 A/file1 &&
+ test_commit c8 A/B/file2 &&
+ test_commit c9 A/B/C/file3 &&
+ test_commit c10 file_to_be_deleted &&
+ git checkout -b side HEAD~4 &&
+ test_commit side-1 file4 &&
+ git checkout master &&
+ git merge side &&
+ test_commit c11 file5 &&
+ mv file5 file5_renamed &&
+ git add file5_renamed &&
+ git commit -m "rename" &&
+ rm file_to_be_deleted &&
+ git add . &&
+ git commit -m "file removed" &&
+ git commit-graph write --reachable --changed-paths
+'
+graph_read_expect () {
+ NUM_CHUNKS=5
+ cat >expect <<- EOF
+ header: 43475048 1 1 $NUM_CHUNKS 0
+ num_commits: $1
+ chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
+ EOF
+ test-tool read-graph >actual &&
+ test_cmp expect actual
+}
+
+test_expect_success 'commit-graph write wrote out the bloom chunks' '
+ graph_read_expect 15
+'
+
+# Turn off any inherited trace2 settings for this test.
+sane_unset GIT_TRACE2 GIT_TRACE2_PERF GIT_TRACE2_EVENT
+sane_unset GIT_TRACE2_PERF_BRIEF
+sane_unset GIT_TRACE2_CONFIG_PARAMS
+
+setup () {
+ rm "$TRASH_DIRECTORY/trace.perf"
+ git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom &&
+ GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom
+}
+
+test_bloom_filters_used () {
+ log_args=$1
+ bloom_trace_prefix="statistics:{\"filter_not_present\":0,\"zero_length_filter\":0,\"maybe\""
+ setup "$log_args" &&
+ grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
+ test_cmp log_wo_bloom log_w_bloom &&
+ test_path_is_file "$TRASH_DIRECTORY/trace.perf"
+}
+
+test_bloom_filters_not_used () {
+ log_args=$1
+ setup "$log_args" &&
+ !(grep -q "statistics:{\"filter_not_present\":" "$TRASH_DIRECTORY/trace.perf") &&
+ test_cmp log_wo_bloom log_w_bloom
+}
+
+for path in A A/B A/B/C A/file1 A/B/file2 A/B/C/file3 file4 file5 file5_renamed file_to_be_deleted
+do
+ for option in "" \
+ "--all" \
+ "--full-history" \
+ "--full-history --simplify-merges" \
+ "--simplify-merges" \
+ "--simplify-by-decoration" \
+ "--follow" \
+ "--first-parent" \
+ "--topo-order" \
+ "--date-order" \
+ "--author-date-order" \
+ "--ancestry-path side..master"
+ do
+ test_expect_success "git log option: $option for path: $path" '
+ test_bloom_filters_used "$option -- $path"
+ '
+ done
+done
+
+test_expect_success 'git log -- folder works with and without the trailing slash' '
+ test_bloom_filters_used "-- A" &&
+ test_bloom_filters_used "-- A/"
+'
+
+test_expect_success 'git log for path that does not exist. ' '
+ test_bloom_filters_used "-- path_does_not_exist"
+'
+
+test_expect_success 'git log with --walk-reflogs does not use Bloom filters' '
+ test_bloom_filters_not_used "--walk-reflogs -- A"
+'
+
+test_expect_success 'git log -- multiple path specs does not use Bloom filters' '
+ test_bloom_filters_not_used "-- file4 A/file1"
+'
+
+test_expect_success 'git log with wildcard that resolves to a single path uses Bloom filters' '
+ test_bloom_filters_used "-- *4" &&
+ test_bloom_filters_used "-- *renamed"
+'
+
+test_expect_success 'git log with wildcard that resolves to multiple paths does not use Bloom filters' '
+ test_bloom_filters_not_used "-- *" &&
+ test_bloom_filters_not_used "-- file*"
+'
+
+test_expect_success 'setup - add commit-graph to the chain without Bloom filters' '
+ test_commit c14 A/anotherFile2 &&
+ test_commit c15 A/B/anotherFile2 &&
+ test_commit c16 A/B/C/anotherFile2 &&
+ GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0 git commit-graph write --reachable --split &&
+ test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain
+'
+
+test_expect_success 'Do not use Bloom filters if the latest graph does not have Bloom filters.' '
+ test_bloom_filters_not_used "-- A/B"
+'
+
+test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
+ test_commit c17 A/anotherFile3 &&
+ git commit-graph write --reachable --changed-paths --split &&
+ test_line_count = 3 .git/objects/info/commit-graphs/commit-graph-chain
+'
+
+test_bloom_filters_used_when_some_filters_are_missing () {
+ log_args=$1
+ bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":8,\"definitely_not\":6"
+ setup "$log_args" &&
+ grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
+ test_cmp log_wo_bloom log_w_bloom
+}
+
+test_expect_success 'Use Bloom filters if they exist in the latest but not all commit graphs in the chain.' '
+ test_bloom_filters_used_when_some_filters_are_missing "-- A/B"
+'
+
+test_done
\ No newline at end of file
--
gitgitgadget
^ permalink raw reply related [flat|nested] 159+ messages in thread
* [PATCH v3 16/16] commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag
2020-03-30 0:31 ` [PATCH v3 00/16] " Garima Singh via GitGitGadget
` (14 preceding siblings ...)
2020-03-30 0:31 ` [PATCH v3 15/16] t4216: add end to end tests for git log with Bloom filters Garima Singh via GitGitGadget
@ 2020-03-30 0:31 ` Garima Singh via GitGitGadget
2020-04-06 16:59 ` [PATCH v4 00/15] Changed Paths Bloom Filters Garima Singh via GitGitGadget
16 siblings, 0 replies; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-03-30 0:31 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
Add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag to the test setup suite
in order to toggle writing Bloom filters when running any of the git tests.
If set to true, we will compute and write Bloom filters every time a test
calls `git commit-graph write`, as if the `--changed-paths` option was
passed in.
The test suite passes when GIT_TEST_COMMIT_GRAPH and
GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS are enabled.
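The effect of the flag can be sketched in isolation as follows. This is a
simplified illustration, not the code in this patch: the sketch_* names
and the bit value are hypothetical, and the boolean parsing only accepts
"1" and "true", which is narrower than what Git's own parsing accepts.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical bit value, used only by this sketch. */
#define SKETCH_WRITE_BLOOM_FILTERS (1u << 0)

/*
 * Simplified boolean parsing: an unset variable yields the default,
 * "1" and "true" are true, anything else is false.
 */
static int sketch_parse_bool(const char *value, int def)
{
	if (!value)
		return def;
	return !strcmp(value, "1") || !strcmp(value, "true");
}

/*
 * Bloom filters are written if either the command line option was
 * given or the test environment variable is set to a true value.
 */
static unsigned int sketch_write_flags(int opt_changed_paths,
				       const char *env_value)
{
	unsigned int flags = 0;

	if (opt_changed_paths || sketch_parse_bool(env_value, 0))
		flags |= SKETCH_WRITE_BLOOM_FILTERS;
	return flags;
}
```

A caller would pass the result of
getenv("GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS") as env_value. Taking the
value as a parameter keeps the helper easy to test without touching the
process environment.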
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
builtin/commit-graph.c | 3 ++-
ci/run-build-and-tests.sh | 1 +
commit-graph.h | 1 +
t/README | 5 +++++
t/t5318-commit-graph.sh | 2 ++
t/t5324-split-commit-graph.sh | 1 +
6 files changed, 12 insertions(+), 1 deletion(-)
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index cacb5d04a80..59009837dc9 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -171,7 +171,8 @@ static int graph_write(int argc, const char **argv)
flags |= COMMIT_GRAPH_WRITE_SPLIT;
if (opts.progress)
flags |= COMMIT_GRAPH_WRITE_PROGRESS;
- if (opts.enable_changed_paths)
+ if (opts.enable_changed_paths ||
+ git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0))
flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
read_replace_refs = 0;
diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
index 4df54c4efea..17e25aade96 100755
--- a/ci/run-build-and-tests.sh
+++ b/ci/run-build-and-tests.sh
@@ -19,6 +19,7 @@ linux-gcc)
export GIT_TEST_OE_SIZE=10
export GIT_TEST_OE_DELTA_SIZE=5
export GIT_TEST_COMMIT_GRAPH=1
+ export GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=1
export GIT_TEST_MULTI_PACK_INDEX=1
export GIT_TEST_ADD_I_USE_BUILTIN=1
make test
diff --git a/commit-graph.h b/commit-graph.h
index 8e7a8e0e5b2..8655d064c14 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -9,6 +9,7 @@
#define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
#define GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD"
+#define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
struct commit;
struct bloom_filter_settings;
diff --git a/t/README b/t/README
index 369e3a9ded8..4f53da53a15 100644
--- a/t/README
+++ b/t/README
@@ -378,6 +378,11 @@ GIT_TEST_COMMIT_GRAPH=<boolean>, when true, forces the commit-graph to
be written after every 'git commit' command, and overrides the
'core.commitGraph' setting to true.
+GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=<boolean>, when true, forces
+commit-graph write to compute and write changed path Bloom filters for
+every 'git commit-graph write', as if the `--changed-paths` option was
+passed in.
+
GIT_TEST_FSMONITOR=$PWD/t7519/fsmonitor-all exercises the fsmonitor
code path for utilizing a file system monitor to speed up detecting
new or changed files.
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 9bf920ae171..18304a65e4d 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -3,6 +3,8 @@
test_description='commit graph'
. ./test-lib.sh
+GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
+
test_expect_success 'setup full repo' '
mkdir full &&
cd "$TRASH_DIRECTORY/full" &&
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 53b2e6b4555..d3f1f2c4a71 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -4,6 +4,7 @@ test_description='split commit graph'
. ./test-lib.sh
GIT_TEST_COMMIT_GRAPH=0
+GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
test_expect_success 'setup repo' '
git init &&
--
gitgitgadget
^ permalink raw reply related [flat|nested] 159+ messages in thread
* [PATCH v4 00/15] Changed Paths Bloom Filters
2020-03-30 0:31 ` [PATCH v3 00/16] " Garima Singh via GitGitGadget
` (15 preceding siblings ...)
2020-03-30 0:31 ` [PATCH v3 16/16] commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag Garima Singh via GitGitGadget
@ 2020-04-06 16:59 ` Garima Singh via GitGitGadget
2020-04-06 16:59 ` [PATCH v4 01/15] commit-graph: define and use MAX_NUM_CHUNKS Garima Singh via GitGitGadget
` (15 more replies)
16 siblings, 16 replies; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-04-06 16:59 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh
Hey!
The commit graph feature brought in a lot of performance improvements across
multiple commands. However, file based history continues to be a performance
pain point, especially in large repositories.
Adopting changed path Bloom filters has been discussed on the list before,
and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
Derrick Stolee [1]. This series is based on Dr. Stolee's proof of concept in
[2]
With the changes in this series, git users will be able to choose to write
Bloom filters to the commit-graph using the following command:
'git commit-graph write --changed-paths'
Subsequent 'git log -- path' commands will use these computed Bloom filters
to decide which commits are worth exploring further to produce the history
of the provided path.
Cost of computing and writing Bloom filters
===========================================
Computing and writing Bloom filters to the commit graph for the first time
implies computing the diffs and the resulting Bloom filters for all the
commits in the repository. This adds a non-trivial amount of time to the
first run. Every subsequent run is incremental, i.e. we reuse the previously
computed Bloom filters. So this is a one-time cost.
Time taken by 'git commit-graph write' with and w/o --changed-paths, speed
up in 'git log -- path' with computed Bloom filters (see a):-
-------------------------------------------------------------------------
| Repo | w/o --changed-paths | with --changed-paths | Speed up |
-------------------------------------------------------------------------
| git [3] | 0.9 seconds | 7 seconds | 2x to 6x |
| linux [4] | 16 seconds | 1 minute 8 seconds | 2x to 6x |
| android [5] | 9 seconds | 48 seconds | 2x to 6x |
| AzDo(see b) | 1 minute | 5 minutes 2 seconds | 10x to 30x |
-------------------------------------------------------------------------
a) We tested the performance of git log -- path with randomly chosen paths
of varying depths in each repo. The speed up depends on how deep the files
are in the hierarchy and how often a file has been touched in the commit
history.
b) This internal repository has about 420k commits, 183k files distributed
across 34k folders, the size on disk is about 17 GiB. The most massive gains
on this repository were for files 6-12 levels deep in the tree.
c) These numbers were collected on a Windows machine, except for the linux
repo which was tested on a Linux machine.
Future Work (not included in the scope of this series)
======================================================
1. Supporting multiple path based revision walk
2. Adopting it in git blame logic.
3. Interactions with line log (git log -L)
Cheers! Garima Singh
[1] https://lore.kernel.org/git/20181009193445.21908-1-szeder.dev@gmail.com/
[2]
https://lore.kernel.org/git/61559c5b-546e-d61b-d2e1-68de692f5972@gmail.com/
[3] https://github.com/git/git
[4] https://github.com/torvalds/linux
[5] https://android.googlesource.com/platform/frameworks/base/
jeffhost@microsoft.com, me@ttaylorr.com, peff@peff.net,
garimasigit@gmail.com,jnareb@gmail.com, christian.couder@gmail.com,
emilyshaffer@gmail.com,gitster@pobox.com
Derrick Stolee (1):
diff: halt tree-diff early after max_changes
Garima Singh (13):
commit-graph: define and use MAX_NUM_CHUNKS
bloom.c: add the murmur3 hash implementation
bloom.c: introduce core Bloom filter constructs
bloom.c: core Bloom filter implementation for changed paths.
commit-graph: compute Bloom filters for changed paths
commit-graph: examine commits by generation number
commit-graph: write Bloom filters to commit graph file
commit-graph: reuse existing Bloom filters during write
commit-graph: add --changed-paths option to write subcommand
revision.c: use Bloom filters to speed up path based revision walks
revision.c: add trace2 stats around Bloom filter usage
t4216: add end to end tests for git log with Bloom filters
commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag
Jeff King (1):
commit-graph: examine changed-path objects in pack order
Documentation/git-commit-graph.txt | 5 +
.../technical/commit-graph-format.txt | 30 ++
Makefile | 2 +
bloom.c | 275 ++++++++++++++++++
bloom.h | 90 ++++++
builtin/commit-graph.c | 10 +-
ci/run-build-and-tests.sh | 1 +
commit-graph.c | 213 +++++++++++++-
commit-graph.h | 9 +-
diff.h | 5 +
revision.c | 126 +++++++-
revision.h | 11 +
t/README | 5 +
t/helper/test-bloom.c | 81 ++++++
t/helper/test-read-graph.c | 4 +
t/helper/test-tool.c | 1 +
t/helper/test-tool.h | 1 +
t/t0095-bloom.sh | 117 ++++++++
t/t4216-log-bloom.sh | 155 ++++++++++
t/t5318-commit-graph.sh | 2 +
t/t5324-split-commit-graph.sh | 1 +
tree-diff.c | 6 +
22 files changed, 1139 insertions(+), 11 deletions(-)
create mode 100644 bloom.c
create mode 100644 bloom.h
create mode 100644 t/helper/test-bloom.c
create mode 100755 t/t0095-bloom.sh
create mode 100755 t/t4216-log-bloom.sh
base-commit: 3bab5d56259722843359702bc27111475437ad2a
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-497%2Fgarimasi514%2FcoreGit-bloomFilters-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-497/garimasi514/coreGit-bloomFilters-v4
Pull-Request: https://github.com/gitgitgadget/git/pull/497
Range-diff vs v3:
1: c3ffd9820d5 = 1: c3ffd9820d5 commit-graph: define and use MAX_NUM_CHUNKS
2: a5aa3415c05 = 2: a5aa3415c05 bloom.c: add the murmur3 hash implementation
3: a7702c1afde = 3: a7702c1afde bloom.c: introduce core Bloom filter constructs
4: 8304c297520 = 4: 8304c297520 bloom.c: core Bloom filter implementation for changed paths.
5: 2d4c0b2da38 = 5: 2d4c0b2da38 diff: halt tree-diff early after max_changes
6: c38b9b386ef = 6: c38b9b386ef commit-graph: compute Bloom filters for changed paths
7: d24c85c54ef = 7: d24c85c54ef commit-graph: examine changed-path objects in pack order
8: 5ed16f35fed = 8: 5ed16f35fed commit-graph: examine commits by generation number
9: 55824cda89c < -: ----------- diff: skip batch object download when possible
10: 1e4663523de = 9: ff6b96aad1e commit-graph: write Bloom filters to commit graph file
11: 68395d4051b ! 10: cc8022bdf82 commit-graph: reuse existing Bloom filters during write
@@ bloom.c: struct bloom_filter *get_bloom_filter(struct repository *r,
+
repo_diff_setup(r, &diffopt);
diffopt.flags.recursive = 1;
- diffopt.detect_rename = 0;
+ diffopt.max_changes = max_changes;
## bloom.h ##
@@ bloom.h: struct bloom_filter_settings {
12: 7e450e45236 = 11: c8b86c383ab commit-graph: add --changed-paths option to write subcommand
13: b18af58aa3e = 12: 617f549ef25 revision.c: use Bloom filters to speed up path based revision walks
14: b5eb280178f = 13: 6beaede7159 revision.c: add trace2 stats around Bloom filter usage
15: 3019ef72881 = 14: b899df5c98e t4216: add end to end tests for git log with Bloom filters
16: 213abb5d895 = 15: 5656e8590e9 commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag
--
gitgitgadget
* [PATCH v4 01/15] commit-graph: define and use MAX_NUM_CHUNKS
2020-04-06 16:59 ` [PATCH v4 00/15] Changed Paths Bloom Filters Garima Singh via GitGitGadget
@ 2020-04-06 16:59 ` Garima Singh via GitGitGadget
2020-04-06 16:59 ` [PATCH v4 02/15] bloom.c: add the murmur3 hash implementation Garima Singh via GitGitGadget
` (14 subsequent siblings)
15 siblings, 0 replies; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-04-06 16:59 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
This is a minor cleanup to make it easier to change
the number of chunks being written to the commit
graph.
Reviewed-by: Jakub Narębski <jnareb@gmail.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
commit-graph.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index f013a84e294..e4f1a5b2f1a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -23,6 +23,7 @@
#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
#define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
#define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
+#define MAX_NUM_CHUNKS 5
#define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
@@ -1350,8 +1351,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
int fd;
struct hashfile *f;
struct lock_file lk = LOCK_INIT;
- uint32_t chunk_ids[6];
- uint64_t chunk_offsets[6];
+ uint32_t chunk_ids[MAX_NUM_CHUNKS + 1];
+ uint64_t chunk_offsets[MAX_NUM_CHUNKS + 1];
const unsigned hashsz = the_hash_algo->rawsz;
struct strbuf progress_title = STRBUF_INIT;
int num_chunks = 3;
--
gitgitgadget
* [PATCH v4 02/15] bloom.c: add the murmur3 hash implementation
2020-04-06 16:59 ` [PATCH v4 00/15] Changed Paths Bloom Filters Garima Singh via GitGitGadget
2020-04-06 16:59 ` [PATCH v4 01/15] commit-graph: define and use MAX_NUM_CHUNKS Garima Singh via GitGitGadget
@ 2020-04-06 16:59 ` Garima Singh via GitGitGadget
2020-04-06 16:59 ` [PATCH v4 03/15] bloom.c: introduce core Bloom filter constructs Garima Singh via GitGitGadget
` (13 subsequent siblings)
15 siblings, 0 replies; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-04-06 16:59 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
In preparation for computing changed paths Bloom filters,
implement the Murmur3 hash algorithm as described in [1].
It hashes the given data using the given seed and produces
a uniformly distributed hash value.
[1] https://en.wikipedia.org/wiki/MurmurHash#Algorithm
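As a rough standalone illustration of the algorithm described in [1], the
sketch below computes the 32-bit Murmur3 hash outside of Git. The names
`rotl32` and `murmur3_sketch` are hypothetical and not part of this patch.
It also assumes bytes are read as `unsigned char`, which gives the same
values as the expected outputs in t0095 for the ASCII test strings.

```c
#include <stddef.h>
#include <stdint.h>

/* Rotate a 32-bit value left; only called with count 13 or 15 here. */
static uint32_t rotl32(uint32_t value, unsigned count)
{
	return (value << count) | (value >> (32 - count));
}

/*
 * Hypothetical standalone 32-bit Murmur3, following the Wikipedia
 * description. Input bytes are treated as unsigned (an assumption of
 * this sketch) and 4-byte blocks are assembled in little-endian order.
 */
static uint32_t murmur3_sketch(uint32_t seed, const void *buf, size_t len)
{
	const unsigned char *data = buf;
	const uint32_t c1 = 0xcc9e2d51;
	const uint32_t c2 = 0x1b873593;
	const size_t nblocks = len / 4;
	const unsigned char *tail = data + 4 * nblocks;
	uint32_t h = seed;
	uint32_t k;
	size_t i;

	/* Body: mix each complete 4-byte block into the running hash. */
	for (i = 0; i < nblocks; i++) {
		const unsigned char *p = data + 4 * i;

		k = (uint32_t)p[0] | (uint32_t)p[1] << 8 |
		    (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
		k *= c1;
		k = rotl32(k, 15);
		k *= c2;

		h ^= k;
		h = rotl32(h, 13) * 5 + 0xe6546b64;
	}

	/* Tail: mix in the remaining 1 to 3 bytes, if any. */
	k = 0;
	switch (len & 3) {
	case 3:
		k ^= (uint32_t)tail[2] << 16;
		/* fallthrough */
	case 2:
		k ^= (uint32_t)tail[1] << 8;
		/* fallthrough */
	case 1:
		k ^= (uint32_t)tail[0];
		k *= c1;
		k = rotl32(k, 15);
		k *= c2;
		h ^= k;
		break;
	}

	/* Finalization: fold in the length and avalanche the bits. */
	h ^= (uint32_t)len;
	h ^= h >> 16;
	h *= 0x85ebca6b;
	h ^= h >> 13;
	h *= 0xc2b2ae35;
	h ^= h >> 16;

	return h;
}
```

Calling `murmur3_sketch(0, "Hello world!", 12)` should reproduce the
0x627b0c2c value that the new test expects for that string.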
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Helped-by: Szeder Gábor <szeder.dev@gmail.com>
Reviewed-by: Jakub Narębski <jnareb@gmail.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
Makefile | 2 ++
bloom.c | 73 +++++++++++++++++++++++++++++++++++++++++++
bloom.h | 13 ++++++++
t/helper/test-bloom.c | 13 ++++++++
t/helper/test-tool.c | 1 +
t/helper/test-tool.h | 1 +
t/t0095-bloom.sh | 30 ++++++++++++++++++
7 files changed, 133 insertions(+)
create mode 100644 bloom.c
create mode 100644 bloom.h
create mode 100644 t/helper/test-bloom.c
create mode 100755 t/t0095-bloom.sh
diff --git a/Makefile b/Makefile
index ef1ff2228f0..491f75e68c5 100644
--- a/Makefile
+++ b/Makefile
@@ -695,6 +695,7 @@ X =
PROGRAMS += $(patsubst %.o,git-%$X,$(PROGRAM_OBJS))
TEST_BUILTINS_OBJS += test-advise.o
+TEST_BUILTINS_OBJS += test-bloom.o
TEST_BUILTINS_OBJS += test-chmtime.o
TEST_BUILTINS_OBJS += test-config.o
TEST_BUILTINS_OBJS += test-ctype.o
@@ -840,6 +841,7 @@ LIB_OBJS += base85.o
LIB_OBJS += bisect.o
LIB_OBJS += blame.o
LIB_OBJS += blob.o
+LIB_OBJS += bloom.o
LIB_OBJS += branch.o
LIB_OBJS += bulk-checkin.o
LIB_OBJS += bundle.o
diff --git a/bloom.c b/bloom.c
new file mode 100644
index 00000000000..40e87632aeb
--- /dev/null
+++ b/bloom.c
@@ -0,0 +1,73 @@
+#include "git-compat-util.h"
+#include "bloom.h"
+
+static uint32_t rotate_left(uint32_t value, int32_t count)
+{
+ uint32_t mask = 8 * sizeof(uint32_t) - 1;
+ count &= mask;
+ return ((value << count) | (value >> ((-count) & mask)));
+}
+
+/*
+ * Calculate the murmur3 32-bit hash value for the given data
+ * using the given seed.
+ * Produces a uniformly distributed hash value.
+ * Not considered to be cryptographically secure.
+ * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
+ */
+uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len)
+{
+ const uint32_t c1 = 0xcc9e2d51;
+ const uint32_t c2 = 0x1b873593;
+ const uint32_t r1 = 15;
+ const uint32_t r2 = 13;
+ const uint32_t m = 5;
+ const uint32_t n = 0xe6546b64;
+ int i;
+ uint32_t k1 = 0;
+ const char *tail;
+
+ int len4 = len / sizeof(uint32_t);
+
+ uint32_t k;
+ for (i = 0; i < len4; i++) {
+ uint32_t byte1 = (uint32_t)data[4*i];
+ uint32_t byte2 = ((uint32_t)data[4*i + 1]) << 8;
+ uint32_t byte3 = ((uint32_t)data[4*i + 2]) << 16;
+ uint32_t byte4 = ((uint32_t)data[4*i + 3]) << 24;
+ k = byte1 | byte2 | byte3 | byte4;
+ k *= c1;
+ k = rotate_left(k, r1);
+ k *= c2;
+
+ seed ^= k;
+ seed = rotate_left(seed, r2) * m + n;
+ }
+
+ tail = (data + len4 * sizeof(uint32_t));
+
+ switch (len & (sizeof(uint32_t) - 1)) {
+ case 3:
+ k1 ^= ((uint32_t)tail[2]) << 16;
+ /*-fallthrough*/
+ case 2:
+ k1 ^= ((uint32_t)tail[1]) << 8;
+ /*-fallthrough*/
+ case 1:
+ k1 ^= ((uint32_t)tail[0]) << 0;
+ k1 *= c1;
+ k1 = rotate_left(k1, r1);
+ k1 *= c2;
+ seed ^= k1;
+ break;
+ }
+
+ seed ^= (uint32_t)len;
+ seed ^= (seed >> 16);
+ seed *= 0x85ebca6b;
+ seed ^= (seed >> 13);
+ seed *= 0xc2b2ae35;
+ seed ^= (seed >> 16);
+
+ return seed;
+}
\ No newline at end of file
diff --git a/bloom.h b/bloom.h
new file mode 100644
index 00000000000..d0fcc5f0aa6
--- /dev/null
+++ b/bloom.h
@@ -0,0 +1,13 @@
+#ifndef BLOOM_H
+#define BLOOM_H
+
+/*
+ * Calculate the murmur3 32-bit hash value for the given data
+ * using the given seed.
+ * Produces a uniformly distributed hash value.
+ * Not considered to be cryptographically secure.
+ * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
+ */
+uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len);
+
+#endif
\ No newline at end of file
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
new file mode 100644
index 00000000000..60ee2043689
--- /dev/null
+++ b/t/helper/test-bloom.c
@@ -0,0 +1,13 @@
+#include "git-compat-util.h"
+#include "bloom.h"
+#include "test-tool.h"
+
+int cmd__bloom(int argc, const char **argv)
+{
+ if (!strcmp(argv[1], "get_murmur3")) {
+ uint32_t hashed = murmur3_seeded(0, argv[2], strlen(argv[2]));
+ printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
+ }
+
+ return 0;
+}
\ No newline at end of file
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 31eedcd241f..6e26bd65c97 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -15,6 +15,7 @@ struct test_cmd {
static struct test_cmd cmds[] = {
{ "advise", cmd__advise_if_enabled },
+ { "bloom", cmd__bloom },
{ "chmtime", cmd__chmtime },
{ "config", cmd__config },
{ "ctype", cmd__ctype },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 4eb5e6609e1..dceeef1d5c2 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -5,6 +5,7 @@
#include "git-compat-util.h"
int cmd__advise_if_enabled(int argc, const char **argv);
+int cmd__bloom(int argc, const char **argv);
int cmd__chmtime(int argc, const char **argv);
int cmd__config(int argc, const char **argv);
int cmd__ctype(int argc, const char **argv);
diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
new file mode 100755
index 00000000000..2dad8c4a94e
--- /dev/null
+++ b/t/t0095-bloom.sh
@@ -0,0 +1,30 @@
+#!/bin/sh
+
+test_description='Testing the various Bloom filter computations in bloom.c'
+. ./test-lib.sh
+
+test_expect_success 'compute unseeded murmur3 hash for empty string' '
+ cat >expect <<-\EOF &&
+ Murmur3 Hash with seed=0:0x00000000
+ EOF
+ test-tool bloom get_murmur3 "" >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'compute unseeded murmur3 hash for test string 1' '
+ cat >expect <<-\EOF &&
+ Murmur3 Hash with seed=0:0x627b0c2c
+ EOF
+ test-tool bloom get_murmur3 "Hello world!" >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'compute unseeded murmur3 hash for test string 2' '
+ cat >expect <<-\EOF &&
+ Murmur3 Hash with seed=0:0x2e4ff723
+ EOF
+ test-tool bloom get_murmur3 "The quick brown fox jumps over the lazy dog" >actual &&
+ test_cmp expect actual
+'
+
+test_done
\ No newline at end of file
--
gitgitgadget
* [PATCH v4 03/15] bloom.c: introduce core Bloom filter constructs
2020-04-06 16:59 ` [PATCH v4 00/15] Changed Paths Bloom Filters Garima Singh via GitGitGadget
2020-04-06 16:59 ` [PATCH v4 01/15] commit-graph: define and use MAX_NUM_CHUNKS Garima Singh via GitGitGadget
2020-04-06 16:59 ` [PATCH v4 02/15] bloom.c: add the murmur3 hash implementation Garima Singh via GitGitGadget
@ 2020-04-06 16:59 ` Garima Singh via GitGitGadget
2020-04-06 16:59 ` [PATCH v4 04/15] bloom.c: core Bloom filter implementation for changed paths Garima Singh via GitGitGadget
` (12 subsequent siblings)
15 siblings, 0 replies; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-04-06 16:59 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
Introduce the constructs for Bloom filters, Bloom filter keys
and Bloom filter settings.
For details on what Bloom filters are and how they work, refer
to Dr. Derrick Stolee's blog post [1]. It provides a concise
explanation of the adoption of Bloom filters as described in
[2] and [3].
Implementation specifics:
1. We currently use 7 for the number of hashes and 10 for the
number of bits per entry. These served as good starting values;
the mathematical details behind this choice are described in
[1] and [4]. The implementation does not yet expose these
settings, but it is flexible enough to allow tweaking them in
the future.
Note: The performance gains we have observed with these values
are significant enough that we did not need to tweak these
settings. The performance numbers are included in the cover letter
of this series and in the commit message of the subsequent commit
where we use Bloom filters to speed up `git log -- path`.
2. As described in [1] and [3], we do not need 7 independent hashing
functions. We use the Murmur3 hashing scheme, seed it twice and
then combine those to procure an arbitrary number of hash values.
3. The filters will be sized according to the number of changes in
each commit, in multiples of 8-bit words.
[1] Derrick Stolee
"Supercharging the Git Commit Graph IV: Bloom Filters"
https://devblogs.microsoft.com/devops/super-charging-the-git-commit-graph-iv-Bloom-filters/
[2] Flavio Bonomi, Michael Mitzenmacher, Rina Panigrahy, Sushil Singh, George Varghese
"An Improved Construction for Counting Bloom Filters"
http://theory.stanford.edu/~rinap/papers/esa2006b.pdf
https://doi.org/10.1007/11841036_61
[3] Peter C. Dillinger and Panagiotis Manolios
"Bloom Filters in Probabilistic Verification"
http://www.ccs.neu.edu/home/pete/pub/Bloom-filters-verification.pdf
https://doi.org/10.1007/978-3-540-30494-4_26
[4] Thomas Mueller Graf, Daniel Lemire
"Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters"
https://arxiv.org/abs/1912.08258
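Points 2 and 3 above can be sketched in isolation as follows. This is a
minimal illustration, not the code from this patch: the `sketch_*` names
are hypothetical, and the two base hashes are passed in directly instead
of being computed with the seeded Murmur3 function.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BITS_PER_WORD 8

/* Number of 8-bit words needed to hold n entries at bits_per_entry bits each. */
static size_t sketch_filter_len(size_t n, uint32_t bits_per_entry)
{
	return (n * bits_per_entry + SKETCH_BITS_PER_WORD - 1) /
	       SKETCH_BITS_PER_WORD;
}

/*
 * Derive num_hashes values from two base hashes instead of running
 * num_hashes independent hash functions: out[i] = hash0 + i * hash1,
 * with 32-bit wraparound.
 */
static void sketch_fill_hashes(uint32_t hash0, uint32_t hash1,
			       uint32_t *out, uint32_t num_hashes)
{
	uint32_t i;

	for (i = 0; i < num_hashes; i++)
		out[i] = hash0 + i * hash1;
}

/* Set one bit per hash value in a filter of len 8-bit words. */
static void sketch_add_hashes(unsigned char *data, size_t len,
			      const uint32_t *hashes, uint32_t num_hashes)
{
	const uint64_t mod = (uint64_t)len * SKETCH_BITS_PER_WORD;
	uint32_t i;

	if (!mod)
		return; /* a zero-length filter has no bits to set */

	for (i = 0; i < num_hashes; i++) {
		const uint64_t pos = hashes[i] % mod;

		data[pos / SKETCH_BITS_PER_WORD] |=
			(unsigned char)(1u << (pos % SKETCH_BITS_PER_WORD));
	}
}
```

With one entry and 10 bits per entry, `sketch_filter_len` gives 2 words,
which matches the `Filter_Length:2` lines in the new tests. Feeding in
the first two hash values that the test prints for the empty string
(0x5615800c, and the difference to the next value, 0x0580e554) should
reproduce both the printed hash sequence and `Filter_Data:11|11|`.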
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Jakub Narębski <jnareb@gmail.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
bloom.c | 38 +++++++++++++++++++++++++-
bloom.h | 63 +++++++++++++++++++++++++++++++++++++++++++
t/helper/test-bloom.c | 48 +++++++++++++++++++++++++++++++++
t/t0095-bloom.sh | 40 +++++++++++++++++++++++++++
4 files changed, 188 insertions(+), 1 deletion(-)
diff --git a/bloom.c b/bloom.c
index 40e87632aeb..888b67f1ea6 100644
--- a/bloom.c
+++ b/bloom.c
@@ -8,6 +8,11 @@ static uint32_t rotate_left(uint32_t value, int32_t count)
return ((value << count) | (value >> ((-count) & mask)));
}
+static inline unsigned char get_bitmask(uint32_t pos)
+{
+ return ((unsigned char)1) << (pos & (BITS_PER_WORD - 1));
+}
+
/*
* Calculate the murmur3 32-bit hash value for the given data
* using the given seed.
@@ -70,4 +75,35 @@ uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len)
seed ^= (seed >> 16);
return seed;
-}
\ No newline at end of file
+}
+
+void fill_bloom_key(const char *data,
+ size_t len,
+ struct bloom_key *key,
+ const struct bloom_filter_settings *settings)
+{
+ int i;
+ const uint32_t seed0 = 0x293ae76f;
+ const uint32_t seed1 = 0x7e646e2c;
+ const uint32_t hash0 = murmur3_seeded(seed0, data, len);
+ const uint32_t hash1 = murmur3_seeded(seed1, data, len);
+
+ key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
+ for (i = 0; i < settings->num_hashes; i++)
+ key->hashes[i] = hash0 + i * hash1;
+}
+
+void add_key_to_filter(const struct bloom_key *key,
+ struct bloom_filter *filter,
+ const struct bloom_filter_settings *settings)
+{
+ int i;
+ uint64_t mod = filter->len * BITS_PER_WORD;
+
+ for (i = 0; i < settings->num_hashes; i++) {
+ uint64_t hash_mod = key->hashes[i] % mod;
+ uint64_t block_pos = hash_mod / BITS_PER_WORD;
+
+ filter->data[block_pos] |= get_bitmask(hash_mod);
+ }
+}
diff --git a/bloom.h b/bloom.h
index d0fcc5f0aa6..b9ce422ca2d 100644
--- a/bloom.h
+++ b/bloom.h
@@ -1,6 +1,60 @@
#ifndef BLOOM_H
#define BLOOM_H
+struct bloom_filter_settings {
+ /*
+ * The version of the hashing technique being used.
+ * We currently only support version = 1 which is
+ * the seeded murmur3 hashing technique implemented
+ * in bloom.c.
+ */
+ uint32_t hash_version;
+
+ /*
+ * The number of times a path is hashed, i.e. the
+ * number of bit positions that cumulatively
+ * determine whether a path is present in the
+ * Bloom filter.
+ */
+ uint32_t num_hashes;
+
+ /*
+ * The minimum number of bits per entry in the Bloom
+ * filter. If the filter contains 'n' entries, then
+ * filter size is the minimum number of 8-bit words
+ * that contain n*b bits.
+ */
+ uint32_t bits_per_entry;
+};
+
+#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
+#define BITS_PER_WORD 8
+
+/*
+ * A bloom_filter struct represents a data segment to
+ * use when testing hash values. The 'len' member
+ * dictates how many entries are stored in
+ * 'data'.
+ */
+struct bloom_filter {
+ unsigned char *data;
+ size_t len;
+};
+
+/*
+ * A bloom_key represents the k hash values for a
+ * given string. These can be precomputed and
+ * stored in a bloom_key for re-use when testing
+ * against a bloom_filter. The number of hashes is
+ * given by the Bloom filter settings and is the same
+ * for all Bloom filters and keys interacting with
+ * the loaded version of the commit graph file and
+ * the Bloom data chunks.
+ */
+struct bloom_key {
+ uint32_t *hashes;
+};
+
/*
* Calculate the murmur3 32-bit hash value for the given data
* using the given seed.
@@ -10,4 +64,13 @@
*/
uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len);
+void fill_bloom_key(const char *data,
+ size_t len,
+ struct bloom_key *key,
+ const struct bloom_filter_settings *settings);
+
+void add_key_to_filter(const struct bloom_key *key,
+ struct bloom_filter *filter,
+ const struct bloom_filter_settings *settings);
+
#endif
\ No newline at end of file
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 60ee2043689..20460cde775 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -2,6 +2,36 @@
#include "bloom.h"
#include "test-tool.h"
+struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
+
+static void add_string_to_filter(const char *data, struct bloom_filter *filter) {
+ struct bloom_key key;
+ int i;
+
+ fill_bloom_key(data, strlen(data), &key, &settings);
+ printf("Hashes:");
+ for (i = 0; i < settings.num_hashes; i++){
+ printf("0x%08x|", key.hashes[i]);
+ }
+ printf("\n");
+ add_key_to_filter(&key, filter, &settings);
+}
+
+static void print_bloom_filter(struct bloom_filter *filter) {
+ int i;
+
+ if (!filter) {
+ printf("No filter.\n");
+ return;
+ }
+ printf("Filter_Length:%d\n", (int)filter->len);
+ printf("Filter_Data:");
+ for (i = 0; i < filter->len; i++){
+ printf("%02x|", filter->data[i]);
+ }
+ printf("\n");
+}
+
int cmd__bloom(int argc, const char **argv)
{
if (!strcmp(argv[1], "get_murmur3")) {
@@ -9,5 +39,23 @@ int cmd__bloom(int argc, const char **argv)
printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
}
+ if (!strcmp(argv[1], "generate_filter")) {
+ struct bloom_filter filter;
+ int i = 2;
+ filter.len = (settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
+ filter.data = xcalloc(filter.len, sizeof(unsigned char));
+
+ if (!argv[2]){
+ die("at least one input string expected");
+ }
+
+ while (argv[i]) {
+ add_string_to_filter(argv[i], &filter);
+ i++;
+ }
+
+ print_bloom_filter(&filter);
+ }
+
return 0;
}
\ No newline at end of file
diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
index 2dad8c4a94e..36a086c7c60 100755
--- a/t/t0095-bloom.sh
+++ b/t/t0095-bloom.sh
@@ -27,4 +27,44 @@ test_expect_success 'compute unseeded murmur3 hash for test string 2' '
test_cmp expect actual
'
+test_expect_success 'compute bloom key for empty string' '
+ cat >expect <<-\EOF &&
+ Hashes:0x5615800c|0x5b966560|0x61174ab4|0x66983008|0x6c19155c|0x7199fab0|0x771ae004|
+ Filter_Length:2
+ Filter_Data:11|11|
+ EOF
+ test-tool bloom generate_filter "" >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'compute bloom key for whitespace' '
+ cat >expect <<-\EOF &&
+ Hashes:0xf178874c|0x5f3d6eb6|0xcd025620|0x3ac73d8a|0xa88c24f4|0x16510c5e|0x8415f3c8|
+ Filter_Length:2
+ Filter_Data:51|55|
+ EOF
+ test-tool bloom generate_filter " " >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'compute bloom key for test string 1' '
+ cat >expect <<-\EOF &&
+ Hashes:0xb270de9b|0x1bb6f26e|0x84fd0641|0xee431a14|0x57892de7|0xc0cf41ba|0x2a15558d|
+ Filter_Length:2
+ Filter_Data:92|6c|
+ EOF
+ test-tool bloom generate_filter "Hello world!" >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'compute bloom key for test string 2' '
+ cat >expect <<-\EOF &&
+ Hashes:0x20ab385b|0xf5237fe2|0xc99bc769|0x9e140ef0|0x728c5677|0x47049dfe|0x1b7ce585|
+ Filter_Length:2
+ Filter_Data:a5|4a|
+ EOF
+ test-tool bloom generate_filter "file.txt" >actual &&
+ test_cmp expect actual
+'
+
test_done
\ No newline at end of file
--
gitgitgadget
* [PATCH v4 04/15] bloom.c: core Bloom filter implementation for changed paths.
2020-04-06 16:59 ` [PATCH v4 00/15] Changed Paths Bloom Filters Garima Singh via GitGitGadget
` (2 preceding siblings ...)
2020-04-06 16:59 ` [PATCH v4 03/15] bloom.c: introduce core Bloom filter constructs Garima Singh via GitGitGadget
@ 2020-04-06 16:59 ` Garima Singh via GitGitGadget
2020-06-27 15:53 ` SZEDER Gábor
2020-04-06 16:59 ` [PATCH v4 05/15] diff: halt tree-diff early after max_changes Derrick Stolee via GitGitGadget
` (11 subsequent siblings)
15 siblings, 1 reply; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-04-06 16:59 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
Add the core implementation for computing Bloom filters for
the paths changed between a commit and its first parent.
We fill the Bloom filters as (const char *data, int len) pairs
as `struct bloom_filter` within a commit slab.
Filters for commits with no changes and for commits with more than
512 changes are represented with a filter of length zero. There is no gain
in distinguishing between a computed filter of length zero for
a commit with no changes, and an uncomputed filter for new commits
or for commits with more than 512 changes. The effect on
`git log -- path` is the same in both cases. We will fall back to
the normal diffing algorithm when we can't benefit from the
existence of Bloom filters.
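To illustrate why a zero-length filter is a safe stand-in for "no
information", here is a hypothetical lookup sketch. It is not part of
this patch (lookups are introduced later in the series), and the
`sketch_filter` type and `sketch_filter_contains` name are assumptions
made for the example only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical read-only view of a filter: len 8-bit words of data. */
struct sketch_filter {
	const unsigned char *data;
	size_t len;
};

/*
 * Returns:
 *   -1  filter is missing or has length zero, so it says nothing and
 *       the caller has to fall back to the normal diff;
 *    0  at least one bit is unset, so the path definitely did not change;
 *    1  all bits are set, so the path may have changed.
 */
static int sketch_filter_contains(const struct sketch_filter *filter,
				  const uint32_t *hashes,
				  uint32_t num_hashes)
{
	uint64_t mod;
	uint32_t i;

	if (!filter || !filter->len)
		return -1;

	mod = (uint64_t)filter->len * 8;
	for (i = 0; i < num_hashes; i++) {
		const uint64_t pos = hashes[i] % mod;

		if (!(filter->data[pos / 8] & (1u << (pos % 8))))
			return 0;
	}
	return 1;
}
```

Under this convention a caller treats -1 and 1 the same way (run the
diff) and only skips the diff on 0, so a commit with no changes and a
commit with too many changes can share the same empty representation.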
Helped-by: Jeff King <peff@peff.net>
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Jakub Narębski <jnareb@gmail.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
bloom.c | 97 +++++++++++++++++++++++++++++++++++++++++++
bloom.h | 8 ++++
t/helper/test-bloom.c | 20 +++++++++
t/t0095-bloom.sh | 47 +++++++++++++++++++++
4 files changed, 172 insertions(+)
diff --git a/bloom.c b/bloom.c
index 888b67f1ea6..881a9841ede 100644
--- a/bloom.c
+++ b/bloom.c
@@ -1,5 +1,18 @@
#include "git-compat-util.h"
#include "bloom.h"
+#include "diff.h"
+#include "diffcore.h"
+#include "revision.h"
+#include "hashmap.h"
+
+define_commit_slab(bloom_filter_slab, struct bloom_filter);
+
+struct bloom_filter_slab bloom_filters;
+
+struct pathmap_hash_entry {
+ struct hashmap_entry entry;
+ const char path[FLEX_ARRAY];
+};
static uint32_t rotate_left(uint32_t value, int32_t count)
{
@@ -107,3 +120,87 @@ void add_key_to_filter(const struct bloom_key *key,
filter->data[block_pos] |= get_bitmask(hash_mod);
}
}
+
+void init_bloom_filters(void)
+{
+ init_bloom_filter_slab(&bloom_filters);
+}
+
+struct bloom_filter *get_bloom_filter(struct repository *r,
+ struct commit *c)
+{
+ struct bloom_filter *filter;
+ struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
+ int i;
+ struct diff_options diffopt;
+
+ if (bloom_filters.slab_size == 0)
+ return NULL;
+
+ filter = bloom_filter_slab_at(&bloom_filters, c);
+
+ repo_diff_setup(r, &diffopt);
+ diffopt.flags.recursive = 1;
+ diff_setup_done(&diffopt);
+
+ if (c->parents)
+ diff_tree_oid(&c->parents->item->object.oid, &c->object.oid, "", &diffopt);
+ else
+ diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
+ diffcore_std(&diffopt);
+
+ if (diff_queued_diff.nr <= 512) {
+ struct hashmap pathmap;
+ struct pathmap_hash_entry *e;
+ struct hashmap_iter iter;
+ hashmap_init(&pathmap, NULL, NULL, 0);
+
+ for (i = 0; i < diff_queued_diff.nr; i++) {
+ const char *path = diff_queued_diff.queue[i]->two->path;
+
+ /*
+ * Add each leading directory of the changed file, i.e. for
+ * 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
+ * the Bloom filter could be used to speed up commands like
+ * 'git log dir/subdir', too.
+ *
+ * Note that directories are added without the trailing '/'.
+ */
+ do {
+ char *last_slash = strrchr(path, '/');
+
+ FLEX_ALLOC_STR(e, path, path);
+ hashmap_entry_init(&e->entry, strhash(path));
+ hashmap_add(&pathmap, &e->entry);
+
+ if (!last_slash)
+ last_slash = (char*)path;
+ *last_slash = '\0';
+
+ } while (*path);
+
+ diff_free_filepair(diff_queued_diff.queue[i]);
+ }
+
+ filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
+ filter->data = xcalloc(filter->len, sizeof(unsigned char));
+
+ hashmap_for_each_entry(&pathmap, &iter, e, entry) {
+ struct bloom_key key;
+ fill_bloom_key(e->path, strlen(e->path), &key, &settings);
+ add_key_to_filter(&key, filter, &settings);
+ }
+
+ hashmap_free_entries(&pathmap, struct pathmap_hash_entry, entry);
+ } else {
+ for (i = 0; i < diff_queued_diff.nr; i++)
+ diff_free_filepair(diff_queued_diff.queue[i]);
+ filter->data = NULL;
+ filter->len = 0;
+ }
+
+ free(diff_queued_diff.queue);
+ DIFF_QUEUE_CLEAR(&diff_queued_diff);
+
+ return filter;
+}
diff --git a/bloom.h b/bloom.h
index b9ce422ca2d..85ab8e9423d 100644
--- a/bloom.h
+++ b/bloom.h
@@ -1,6 +1,9 @@
#ifndef BLOOM_H
#define BLOOM_H
+struct commit;
+struct repository;
+
struct bloom_filter_settings {
/*
* The version of the hashing technique being used.
@@ -73,4 +76,9 @@ void add_key_to_filter(const struct bloom_key *key,
struct bloom_filter *filter,
const struct bloom_filter_settings *settings);
+void init_bloom_filters(void);
+
+struct bloom_filter *get_bloom_filter(struct repository *r,
+ struct commit *c);
+
#endif
\ No newline at end of file
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 20460cde775..f18d1b722e1 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -1,6 +1,7 @@
#include "git-compat-util.h"
#include "bloom.h"
#include "test-tool.h"
+#include "commit.h"
struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
@@ -32,6 +33,16 @@ static void print_bloom_filter(struct bloom_filter *filter) {
printf("\n");
}
+static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
+{
+ struct commit *c;
+ struct bloom_filter *filter;
+ setup_git_directory();
+ c = lookup_commit(the_repository, commit_oid);
+ filter = get_bloom_filter(the_repository, c);
+ print_bloom_filter(filter);
+}
+
int cmd__bloom(int argc, const char **argv)
{
if (!strcmp(argv[1], "get_murmur3")) {
@@ -57,5 +68,14 @@ int cmd__bloom(int argc, const char **argv)
print_bloom_filter(&filter);
}
+ if (!strcmp(argv[1], "get_filter_for_commit")) {
+ struct object_id oid;
+ const char *end;
+ if (parse_oid_hex(argv[2], &oid, &end))
+ die("cannot parse oid '%s'", argv[2]);
+ init_bloom_filters();
+ get_bloom_filter_for_commit(&oid);
+ }
+
return 0;
}
\ No newline at end of file
diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
index 36a086c7c60..8f9eef116dc 100755
--- a/t/t0095-bloom.sh
+++ b/t/t0095-bloom.sh
@@ -67,4 +67,51 @@ test_expect_success 'compute bloom key for test string 2' '
test_cmp expect actual
'
+test_expect_success 'get bloom filters for commit with no changes' '
+ git init &&
+ git commit --allow-empty -m "c0" &&
+ cat >expect <<-\EOF &&
+ Filter_Length:0
+ Filter_Data:
+ EOF
+ test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'get bloom filter for commit with 10 changes' '
+ rm actual &&
+ rm expect &&
+ mkdir smallDir &&
+ for i in $(test_seq 0 9)
+ do
+ echo $i >smallDir/$i
+ done &&
+ git add smallDir &&
+ git commit -m "commit with 10 changes" &&
+ cat >expect <<-\EOF &&
+ Filter_Length:25
+ Filter_Data:82|a0|65|47|0c|92|90|c0|a1|40|02|a0|e2|40|e0|04|0a|9a|66|cf|80|19|85|42|23|
+ EOF
+ test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success EXPENSIVE 'get bloom filter for commit with 513 changes' '
+ rm actual &&
+ rm expect &&
+ mkdir bigDir &&
+ for i in $(test_seq 0 512)
+ do
+ echo $i >bigDir/$i
+ done &&
+ git add bigDir &&
+ git commit -m "commit with 513 changes" &&
+ cat >expect <<-\EOF &&
+ Filter_Length:0
+ Filter_Data:
+ EOF
+ test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
+ test_cmp expect actual
+'
+
test_done
\ No newline at end of file
--
gitgitgadget
* Re: [PATCH v4 04/15] bloom.c: core Bloom filter implementation for changed paths.
2020-04-06 16:59 ` [PATCH v4 04/15] bloom.c: core Bloom filter implementation for changed paths Garima Singh via GitGitGadget
@ 2020-06-27 15:53 ` SZEDER Gábor
0 siblings, 0 replies; 159+ messages in thread
From: SZEDER Gábor @ 2020-06-27 15:53 UTC (permalink / raw)
To: Garima Singh via GitGitGadget; +Cc: git, stolee, jonathantanmy, Garima Singh
On Mon, Apr 06, 2020 at 04:59:44PM +0000, Garima Singh via GitGitGadget wrote:
> From: Garima Singh <garima.singh@microsoft.com>
>
> Add the core implementation for computing Bloom filters for
> the paths changed between a commit and its first parent.
>
> We fill the Bloom filters as (const char *data, int len) pairs
> as `struct bloom_filter` within a commit slab.
>
> Filters for commits with no changes and for commits with more than
> 512 changes are represented with a filter of length zero. There is no gain
> in distinguishing between a computed filter of length zero for
> a commit with no changes, and an uncomputed filter for new commits
> or for commits with more than 512 changes. The effect on
> `git log -- path` is the same in both cases. We will fall back to
> the normal diffing algorithm when we can't benefit from the
> existence of Bloom filters.
>
> Helped-by: Jeff King <peff@peff.net>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Reviewed-by: Jakub Narębski <jnareb@gmail.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
> bloom.c | 97 +++++++++++++++++++++++++++++++++++++++++++
> bloom.h | 8 ++++
> t/helper/test-bloom.c | 20 +++++++++
> t/t0095-bloom.sh | 47 +++++++++++++++++++++
> 4 files changed, 172 insertions(+)
>
> diff --git a/bloom.c b/bloom.c
> index 888b67f1ea6..881a9841ede 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -1,5 +1,18 @@
> #include "git-compat-util.h"
> #include "bloom.h"
> +#include "diff.h"
> +#include "diffcore.h"
> +#include "revision.h"
> +#include "hashmap.h"
> +
> +define_commit_slab(bloom_filter_slab, struct bloom_filter);
So here we define a commit slab for modified path Bloom filters, ...
> +struct bloom_filter_slab bloom_filters;
> +
> +struct pathmap_hash_entry {
> + struct hashmap_entry entry;
> + const char path[FLEX_ARRAY];
> +};
>
> static uint32_t rotate_left(uint32_t value, int32_t count)
> {
> @@ -107,3 +120,87 @@ void add_key_to_filter(const struct bloom_key *key,
> filter->data[block_pos] |= get_bitmask(hash_mod);
> }
> }
> +
> +void init_bloom_filters(void)
> +{
> + init_bloom_filter_slab(&bloom_filters);
... here initialize the slab ...
> +}
> +
> +struct bloom_filter *get_bloom_filter(struct repository *r,
> + struct commit *c)
> +{
> + struct bloom_filter *filter;
> + struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
> + int i;
> + struct diff_options diffopt;
> +
> + if (bloom_filters.slab_size == 0)
> + return NULL;
> +
> + filter = bloom_filter_slab_at(&bloom_filters, c);
... allocate an entry in the slab ...
> +
> + repo_diff_setup(r, &diffopt);
> + diffopt.flags.recursive = 1;
> + diff_setup_done(&diffopt);
> +
> + if (c->parents)
> + diff_tree_oid(&c->parents->item->object.oid, &c->object.oid, "", &diffopt);
> + else
> + diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
> + diffcore_std(&diffopt);
> +
> + if (diff_queued_diff.nr <= 512) {
> + struct hashmap pathmap;
> + struct pathmap_hash_entry *e;
> + struct hashmap_iter iter;
> + hashmap_init(&pathmap, NULL, NULL, 0);
> +
> + for (i = 0; i < diff_queued_diff.nr; i++) {
> + const char *path = diff_queued_diff.queue[i]->two->path;
> +
> + /*
> + * Add each leading directory of the changed file, i.e. for
> + * 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
> + * the Bloom filter could be used to speed up commands like
> + * 'git log dir/subdir', too.
> + *
> + * Note that directories are added without the trailing '/'.
> + */
> + do {
> + char *last_slash = strrchr(path, '/');
> +
> + FLEX_ALLOC_STR(e, path, path);
> + hashmap_entry_init(&e->entry, strhash(path));
> + hashmap_add(&pathmap, &e->entry);
> +
> + if (!last_slash)
> + last_slash = (char*)path;
> + *last_slash = '\0';
> +
> + } while (*path);
> +
> + diff_free_filepair(diff_queued_diff.queue[i]);
> + }
> +
> + filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
> + filter->data = xcalloc(filter->len, sizeof(unsigned char));
... and here we fill the slab with data, including a memory allocation
for each slab entry.
What is missing in this patch or in any of the followup patches is a
place where we clear the slab and the additional memory attached to
it.
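A minimal sketch of the kind of cleanup that seems to be missing. It
uses a toy array-backed stand-in, not git's commit-slab API:
`toy_filter`, `toy_slab` and the three functions are hypothetical
names chosen for illustration only.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct bloom_filter from the patch. */
struct toy_filter {
	unsigned char *data;
	size_t len;
};

/* Hypothetical stand-in for the commit slab: a flat array of filters. */
struct toy_slab {
	struct toy_filter *entries;
	size_t nr;
};

static void toy_slab_init(struct toy_slab *s, size_t nr)
{
	s->entries = calloc(nr, sizeof(*s->entries));
	s->nr = s->entries ? nr : 0;
}

/* Attach a zeroed buffer of 'len' bytes to the entry at 'pos'. */
static int toy_slab_fill(struct toy_slab *s, size_t pos, size_t len)
{
	if (pos >= s->nr || !len)
		return -1;
	s->entries[pos].data = calloc(len, sizeof(unsigned char));
	if (!s->entries[pos].data)
		return -1;
	s->entries[pos].len = len;
	return 0;
}

/*
 * Free the per-entry buffers first, then the slab storage itself.
 * Returns the number of buffers that were released.
 */
static size_t toy_slab_clear(struct toy_slab *s)
{
	size_t i, freed = 0;

	for (i = 0; i < s->nr; i++) {
		if (s->entries[i].data)
			freed++;
		free(s->entries[i].data);
		s->entries[i].data = NULL;
		s->entries[i].len = 0;
	}
	free(s->entries);
	s->entries = NULL;
	s->nr = 0;
	return freed;
}
```

The point of the sketch is only the ordering: the memory hanging off
each entry has to be released before the slab storage that holds the
pointers to it.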
> +
> + hashmap_for_each_entry(&pathmap, &iter, e, entry) {
> + struct bloom_key key;
> + fill_bloom_key(e->path, strlen(e->path), &key, &settings);
> + add_key_to_filter(&key, filter, &settings);
> + }
> +
> + hashmap_free_entries(&pathmap, struct pathmap_hash_entry, entry);
> + } else {
> + for (i = 0; i < diff_queued_diff.nr; i++)
> + diff_free_filepair(diff_queued_diff.queue[i]);
> + filter->data = NULL;
> + filter->len = 0;
> + }
> +
> + free(diff_queued_diff.queue);
> + DIFF_QUEUE_CLEAR(&diff_queued_diff);
> +
> + return filter;
> +}
> diff --git a/bloom.h b/bloom.h
> index b9ce422ca2d..85ab8e9423d 100644
> --- a/bloom.h
> +++ b/bloom.h
> @@ -1,6 +1,9 @@
> #ifndef BLOOM_H
> #define BLOOM_H
>
> +struct commit;
> +struct repository;
> +
> struct bloom_filter_settings {
> /*
> * The version of the hashing technique being used.
> @@ -73,4 +76,9 @@ void add_key_to_filter(const struct bloom_key *key,
> struct bloom_filter *filter,
> const struct bloom_filter_settings *settings);
>
> +void init_bloom_filters(void);
> +
> +struct bloom_filter *get_bloom_filter(struct repository *r,
> + struct commit *c);
> +
> #endif
> \ No newline at end of file
> diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
> index 20460cde775..f18d1b722e1 100644
> --- a/t/helper/test-bloom.c
> +++ b/t/helper/test-bloom.c
> @@ -1,6 +1,7 @@
> #include "git-compat-util.h"
> #include "bloom.h"
> #include "test-tool.h"
> +#include "commit.h"
>
> struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
>
> @@ -32,6 +33,16 @@ static void print_bloom_filter(struct bloom_filter *filter) {
> printf("\n");
> }
>
> +static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
> +{
> + struct commit *c;
> + struct bloom_filter *filter;
> + setup_git_directory();
> + c = lookup_commit(the_repository, commit_oid);
> + filter = get_bloom_filter(the_repository, c);
> + print_bloom_filter(filter);
> +}
> +
> int cmd__bloom(int argc, const char **argv)
> {
> if (!strcmp(argv[1], "get_murmur3")) {
> @@ -57,5 +68,14 @@ int cmd__bloom(int argc, const char **argv)
> print_bloom_filter(&filter);
> }
>
> + if (!strcmp(argv[1], "get_filter_for_commit")) {
> + struct object_id oid;
> + const char *end;
> + if (parse_oid_hex(argv[2], &oid, &end))
> + die("cannot parse oid '%s'", argv[2]);
> + init_bloom_filters();
> + get_bloom_filter_for_commit(&oid);
> + }
> +
> return 0;
> }
> \ No newline at end of file
> diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
> index 36a086c7c60..8f9eef116dc 100755
> --- a/t/t0095-bloom.sh
> +++ b/t/t0095-bloom.sh
> @@ -67,4 +67,51 @@ test_expect_success 'compute bloom key for test string 2' '
> test_cmp expect actual
> '
>
> +test_expect_success 'get bloom filters for commit with no changes' '
> + git init &&
> + git commit --allow-empty -m "c0" &&
> + cat >expect <<-\EOF &&
> + Filter_Length:0
> + Filter_Data:
> + EOF
> + test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
> + test_cmp expect actual
> +'
> +
> +test_expect_success 'get bloom filter for commit with 10 changes' '
> + rm actual &&
> + rm expect &&
> + mkdir smallDir &&
> + for i in $(test_seq 0 9)
> + do
> + echo $i >smallDir/$i
> + done &&
> + git add smallDir &&
> + git commit -m "commit with 10 changes" &&
> + cat >expect <<-\EOF &&
> + Filter_Length:25
> + Filter_Data:82|a0|65|47|0c|92|90|c0|a1|40|02|a0|e2|40|e0|04|0a|9a|66|cf|80|19|85|42|23|
> + EOF
> + test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
> + test_cmp expect actual
> +'
> +
> +test_expect_success EXPENSIVE 'get bloom filter for commit with 513 changes' '
> + rm actual &&
> + rm expect &&
> + mkdir bigDir &&
> + for i in $(test_seq 0 512)
> + do
> + echo $i >bigDir/$i
> + done &&
> + git add bigDir &&
> + git commit -m "commit with 513 changes" &&
> + cat >expect <<-\EOF &&
> + Filter_Length:0
> + Filter_Data:
> + EOF
> + test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
> + test_cmp expect actual
> +'
> +
> test_done
> \ No newline at end of file
> --
> gitgitgadget
>
^ permalink raw reply [flat|nested] 159+ messages in thread
* [PATCH v4 05/15] diff: halt tree-diff early after max_changes
2020-04-06 16:59 ` [PATCH v4 00/15] Changed Paths Bloom Filters Garima Singh via GitGitGadget
` (3 preceding siblings ...)
2020-04-06 16:59 ` [PATCH v4 04/15] bloom.c: core Bloom filter implementation for changed paths Garima Singh via GitGitGadget
@ 2020-04-06 16:59 ` Derrick Stolee via GitGitGadget
2020-08-04 14:47 ` SZEDER Gábor
2020-04-06 16:59 ` [PATCH v4 06/15] commit-graph: compute Bloom filters for changed paths Garima Singh via GitGitGadget
` (10 subsequent siblings)
15 siblings, 1 reply; 159+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-06 16:59 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh, Derrick Stolee
From: Derrick Stolee <dstolee@microsoft.com>
When computing the changed-paths bloom filters for the commit-graph,
we limit the size of the filter by restricting the number of paths
in the diff. Instead of computing a large diff and then ignoring the
result, it is better to halt the diff computation early.
Create a new "max_changes" option in struct diff_options. If non-zero,
then halt the diff computation after discovering strictly more changed
paths. This includes paths corresponding to trees that change.
Use this max_changes option in the bloom filter calculations. This
reduces the time taken to compute the filters for the Linux kernel
repo from 2m50s to 2m35s. On a large internal repository with ~500
commits that perform tree-wide changes, the time reduced from
6m15s to 3m48s.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
bloom.c | 4 +++-
diff.h | 5 +++++
tree-diff.c | 6 ++++++
3 files changed, 14 insertions(+), 1 deletion(-)
diff --git a/bloom.c b/bloom.c
index 881a9841ede..a16eee92331 100644
--- a/bloom.c
+++ b/bloom.c
@@ -133,6 +133,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
int i;
struct diff_options diffopt;
+ int max_changes = 512;
if (bloom_filters.slab_size == 0)
return NULL;
@@ -141,6 +142,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
repo_diff_setup(r, &diffopt);
diffopt.flags.recursive = 1;
+ diffopt.max_changes = max_changes;
diff_setup_done(&diffopt);
if (c->parents)
@@ -149,7 +151,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
diffcore_std(&diffopt);
- if (diff_queued_diff.nr <= 512) {
+ if (diff_queued_diff.nr <= max_changes) {
struct hashmap pathmap;
struct pathmap_hash_entry *e;
struct hashmap_iter iter;
diff --git a/diff.h b/diff.h
index 6febe7e3656..9443dc1b003 100644
--- a/diff.h
+++ b/diff.h
@@ -285,6 +285,11 @@ struct diff_options {
/* Number of hexdigits to abbreviate raw format output to. */
int abbrev;
+ /* If non-zero, then stop computing after this many changes. */
+ int max_changes;
+ /* For internal use only. */
+ int num_changes;
+
int ita_invisible_in_index;
/* white-space error highlighting */
#define WSEH_NEW (1<<12)
diff --git a/tree-diff.c b/tree-diff.c
index 33ded7f8b3e..f3d303c6e54 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -434,6 +434,9 @@ static struct combine_diff_path *ll_diff_tree_paths(
if (diff_can_quit_early(opt))
break;
+ if (opt->max_changes && opt->num_changes > opt->max_changes)
+ break;
+
if (opt->pathspec.nr) {
skip_uninteresting(&t, base, opt);
for (i = 0; i < nparent; i++)
@@ -518,6 +521,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
/* t↓ */
update_tree_entry(&t);
+ opt->num_changes++;
}
/* t > p[imin] */
@@ -535,6 +539,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
skip_emit_tp:
/* ∀ pi=p[imin] pi↓ */
update_tp_entries(tp, nparent);
+ opt->num_changes++;
}
}
@@ -552,6 +557,7 @@ struct combine_diff_path *diff_tree_paths(
const struct object_id **parents_oid, int nparent,
struct strbuf *base, struct diff_options *opt)
{
+ opt->num_changes = 0;
p = ll_diff_tree_paths(p, oid, parents_oid, nparent, base, opt);
/*
--
gitgitgadget
^ permalink raw reply related [flat|nested] 159+ messages in thread
* Re: [PATCH v4 05/15] diff: halt tree-diff early after max_changes
2020-04-06 16:59 ` [PATCH v4 05/15] diff: halt tree-diff early after max_changes Derrick Stolee via GitGitGadget
@ 2020-08-04 14:47 ` SZEDER Gábor
2020-08-04 16:25 ` Derrick Stolee
0 siblings, 1 reply; 159+ messages in thread
From: SZEDER Gábor @ 2020-08-04 14:47 UTC (permalink / raw)
To: Derrick Stolee via GitGitGadget
Cc: git, stolee, jonathantanmy, Garima Singh, Derrick Stolee
On Mon, Apr 06, 2020 at 04:59:45PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> When computing the changed-paths bloom filters for the commit-graph,
> we limit the size of the filter by restricting the number of paths
> in the diff. Instead of computing a large diff and then ignoring the
> result, it is better to halt the diff computation early.
>
> Create a new "max_changes" option in struct diff_options. If non-zero,
> then halt the diff computation after discovering strictly more changed
> paths. This includes paths corresponding to trees that change.
>
> Use this max_changes option in the bloom filter calculations. This
> reduces the time taken to compute the filters for the Linux kernel
> repo from 2m50s to 2m35s. On a large internal repository with ~500
> commits that perform tree-wide changes, the time reduced from
> 6m15s to 3m48s.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
> bloom.c | 4 +++-
> diff.h | 5 +++++
> tree-diff.c | 6 ++++++
> 3 files changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/bloom.c b/bloom.c
> index 881a9841ede..a16eee92331 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -133,6 +133,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
> struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
> int i;
> struct diff_options diffopt;
> + int max_changes = 512;
>
> if (bloom_filters.slab_size == 0)
> return NULL;
> @@ -141,6 +142,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>
> repo_diff_setup(r, &diffopt);
> diffopt.flags.recursive = 1;
> + diffopt.max_changes = max_changes;
> diff_setup_done(&diffopt);
>
> if (c->parents)
> @@ -149,7 +151,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
> diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
> diffcore_std(&diffopt);
>
> - if (diff_queued_diff.nr <= 512) {
> + if (diff_queued_diff.nr <= max_changes) {
> struct hashmap pathmap;
> struct pathmap_hash_entry *e;
> struct hashmap_iter iter;
> diff --git a/diff.h b/diff.h
> index 6febe7e3656..9443dc1b003 100644
> --- a/diff.h
> +++ b/diff.h
> @@ -285,6 +285,11 @@ struct diff_options {
> /* Number of hexdigits to abbreviate raw format output to. */
> int abbrev;
>
> + /* If non-zero, then stop computing after this many changes. */
> + int max_changes;
> + /* For internal use only. */
> + int num_changes;
"For internal use only", understood.
> +
> int ita_invisible_in_index;
> /* white-space error highlighting */
> #define WSEH_NEW (1<<12)
> diff --git a/tree-diff.c b/tree-diff.c
> index 33ded7f8b3e..f3d303c6e54 100644
> --- a/tree-diff.c
> +++ b/tree-diff.c
> @@ -434,6 +434,9 @@ static struct combine_diff_path *ll_diff_tree_paths(
> if (diff_can_quit_early(opt))
> break;
>
> + if (opt->max_changes && opt->num_changes > opt->max_changes)
> + break;
> +
> if (opt->pathspec.nr) {
> skip_uninteresting(&t, base, opt);
> for (i = 0; i < nparent; i++)
> @@ -518,6 +521,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
>
> /* t↓ */
> update_tree_entry(&t);
> + opt->num_changes++;
> }
>
> /* t > p[imin] */
> @@ -535,6 +539,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
> skip_emit_tp:
> /* ∀ pi=p[imin] pi↓ */
> update_tp_entries(tp, nparent);
> + opt->num_changes++;
> }
> }
This counter is basically broken, its value is wrong for over 98% of
commits, and, worse, its value remains 0 for over 85% of commits in
the repositories I usually use to test modified path Bloom filters.
Consequently, a relatively large number of commits modifying more than
512 paths get Bloom filters.
The makeshift tests in the patch below demonstrate these issues as
most of them fail, most notably those two tests that demonstrate that
modifications of existing paths are not counted at all.
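One possible reading of why the counter stays at 0 when a commit only
modifies existing paths: judging from the hunks quoted above,
num_changes is incremented at the end of the "t < p[imin]" and
"t > p[imin]" branches, but never in the "t = p[imin]" branch. The toy
merge walk below models just that; it is not git's tree-diff, and
`toy_entry` and `toy_count_changes` are hypothetical names.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical tree entry: a path plus a stand-in for its object id. */
struct toy_entry {
	const char *path;
	int oid;
};

/*
 * Walk two path-sorted entry lists in lockstep and count differences.
 * With count_modified == 0, entries present on both sides are never
 * counted, which mimics a counter bumped only for adds and deletes.
 */
static int toy_count_changes(const struct toy_entry *a, int na,
			     const struct toy_entry *b, int nb,
			     int count_modified)
{
	int i = 0, j = 0, n = 0;

	while (i < na || j < nb) {
		int cmp;

		if (i >= na)
			cmp = 1;	/* only 'b' has entries left */
		else if (j >= nb)
			cmp = -1;	/* only 'a' has entries left */
		else
			cmp = strcmp(a[i].path, b[j].path);

		if (cmp < 0) {		/* path only in 'a': deleted */
			n++;
			i++;
		} else if (cmp > 0) {	/* path only in 'b': added */
			n++;
			j++;
		} else {		/* same path on both sides */
			if (count_modified && a[i].oid != b[j].oid)
				n++;
			i++;
			j++;
		}
	}
	return n;
}
```

With a single entry "file" whose oid differs between the two sides,
the walk reports 0 changes unless count_modified is set, which matches
the symptom described above.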
--- >8 ---
diff --git a/bloom.c b/bloom.c
index 9b86aa3f59..3db0fde734 100644
--- a/bloom.c
+++ b/bloom.c
@@ -203,7 +203,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
repo_diff_setup(r, &diffopt);
diffopt.flags.recursive = 1;
diffopt.detect_rename = 0;
- diffopt.max_changes = max_changes;
+ diffopt.max_changes = 0;
diff_setup_done(&diffopt);
/* ensure commit is parsed so we have parent information */
@@ -214,6 +214,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
else
diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
diffcore_std(&diffopt);
+ printf("%s %d\n", oid_to_hex(&c->object.oid), diffopt.num_changes);
if (diffopt.num_changes <= max_changes) {
struct hashmap pathmap;
diff --git a/t/t9999-test.sh b/t/t9999-test.sh
new file mode 100755
index 0000000000..8d2bd9f03f
--- /dev/null
+++ b/t/t9999-test.sh
@@ -0,0 +1,142 @@
+#!/bin/sh
+
+test_description='test'
+
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+ test_tick &&
+
+ echo 1 >file &&
+ mkdir -p dir/subdir &&
+ echo 1 >dir/subdir/file1 &&
+ echo 1 >dir/subdir/file2 &&
+ git add file dir &&
+ git commit -m setup &&
+
+ echo 2 >file &&
+ git commit -a -m "modify one path in root" &&
+ mod_one_path=$(git rev-parse HEAD) &&
+
+ echo 2 >dir/subdir/file1 &&
+ echo 2 >dir/subdir/file2 &&
+ git commit -a -m "modify two file two dirs deep" &&
+ mod_four_paths=$(git rev-parse HEAD) &&
+
+ >new-file &&
+ git add new-file &&
+ git commit -m "add new file in root" &&
+ new_file_in_root=$(git rev-parse HEAD) &&
+
+ git rm new-file &&
+ git commit -m "delete file in root" &&
+ delete_file_in_root=$(git rev-parse HEAD) &&
+
+ >dir/new-file &&
+ git add dir/new-file &&
+ git commit -m "add new file in dir" &&
+ new_file_in_dir=$(git rev-parse HEAD) &&
+
+ git rm dir/new-file &&
+ git commit -m "delete file in dir" &&
+ delete_file_in_dir=$(git rev-parse HEAD) &&
+
+ echo 1 >d-f &&
+ git add d-f &&
+ git commit -m foo &&
+ git rm d-f &&
+ mkdir d-f &&
+ echo 2 >d-f/file &&
+ git add d-f &&
+ git commit -m "replace file with dir" &&
+ file_to_dir=$(git rev-parse HEAD) &&
+
+ >d-f.c &&
+ git add d-f.c &&
+ git commit -m "add a file that sorts between d-f and d-f/" &&
+ git rm -r d-f &&
+ echo 3 >d-f &&
+ git add d-f &&
+ git commit -m "replace dir with file" &&
+ dir_to_file=$(git rev-parse HEAD) &&
+
+ bin_sha1=$(git rev-parse HEAD:dir/subdir | hex2oct) &&
+ # leading zero in mode: the content of the tree remains the same,
+ # but its oid does change!
+ printf "040000 subdir\0$bin_sha1" >rawtree &&
+ tree1=$(git hash-object -t tree -w rawtree) &&
+ git cat-file -p HEAD^{tree} >out &&
+ tree2=$(sed -e "s/$(git rev-parse HEAD:dir/)/$tree1/" out |git mktree) &&
+ different_but_same_tree=$(git commit-tree \
+ -m "leading zeros in mode" \
+ -p $(git rev-parse HEAD) $tree2) &&
+ git update-ref HEAD $different_but_same_tree &&
+
+ git commit-graph write --reachable --changed-paths >out &&
+ cat out # debug
+'
+
+test_expect_success 'modify one path in root' '
+ git diff --name-status $mod_one_path^ $mod_one_path &&
+ echo "$mod_one_path 1" >expect &&
+ grep "$mod_one_path" out >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'modify two file two dirs deep' '
+ git diff --name-status $mod_four_paths^ $mod_four_paths &&
+ echo "$mod_four_paths 4" >expect &&
+ grep "$mod_four_paths" out >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'add new file in root' '
+ git diff --name-status $new_file_in_root^ $new_file_in_root &&
+ echo "$new_file_in_root 1" >expect &&
+ grep "$new_file_in_root" out >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'delete file in root' '
+ git diff --name-status $delete_file_in_root^ $delete_file_in_root &&
+ echo "$delete_file_in_root 1" >expect &&
+ grep "$delete_file_in_root" out >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'add new file in dir' '
+ git diff --name-status $new_file_in_dir^ $new_file_in_dir &&
+ echo "$new_file_in_dir 2" >expect &&
+ grep "$new_file_in_dir" out >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'delete file in dir' '
+ git diff --name-status $delete_file_in_dir^ $delete_file_in_dir &&
+ echo "$delete_file_in_dir 2" >expect &&
+ grep "$delete_file_in_dir" out >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'replace file with dir' '
+ git diff --name-status $file_to_dir^ $file_to_dir &&
+ echo "$file_to_dir 2" >expect &&
+ grep "$file_to_dir" out >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'replace dir with file' '
+ git diff --name-status $dir_to_file^ $dir_to_file &&
+ echo "$dir_to_file 2" >expect &&
+ grep "$dir_to_file" out >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'leading zeros in mode' '
+ git diff --name-status $different_but_same_tree^ $different_but_same_tree &&
+ echo "$different_but_same_tree 0" >expect &&
+ grep "$different_but_same_tree" out >actual &&
+ test_cmp expect actual
+'
+
+test_done
--- >8 ---
> @@ -552,6 +557,7 @@ struct combine_diff_path *diff_tree_paths(
> const struct object_id **parents_oid, int nparent,
> struct strbuf *base, struct diff_options *opt)
> {
> + opt->num_changes = 0;
> p = ll_diff_tree_paths(p, oid, parents_oid, nparent, base, opt);
>
> /*
> --
> gitgitgadget
>
^ permalink raw reply related [flat|nested] 159+ messages in thread
* Re: [PATCH v4 05/15] diff: halt tree-diff early after max_changes
2020-08-04 14:47 ` SZEDER Gábor
@ 2020-08-04 16:25 ` Derrick Stolee
2020-08-04 17:00 ` SZEDER Gábor
0 siblings, 1 reply; 159+ messages in thread
From: Derrick Stolee @ 2020-08-04 16:25 UTC (permalink / raw)
To: SZEDER Gábor, Derrick Stolee via GitGitGadget
Cc: git, jonathantanmy, Garima Singh, Derrick Stolee, Taylor Blau
On 8/4/2020 10:47 AM, SZEDER Gábor wrote:
> On Mon, Apr 06, 2020 at 04:59:45PM +0000, Derrick Stolee via GitGitGadget wrote:
> This counter is basically broken, its value is wrong for over 98% of
> commits, and, worse, its value remains 0 for over 85% of commits in
> the repositories I usually use to test modified path Bloom filters.
> Consequently, a relatively large number of commits modifying more than
> 512 paths get Bloom filters.
Thanks for finding this! The counter is only really tested in one
place, and that test only considers _file adds_, which is a problem.
If I understand this correctly, the bug is a performance-only bug
(since this is a performance-only feature), but it is an important
one to fix.
There is certainly some dark magic happening in this tree-diff logic,
so instead of trying to get an accurate count we should just use the
magic global diff_queued_diff to track the current list of file changes.
Note: diff_queued_diff does not track the directory changes, so it
is an under-count for the total changes to track in the Bloom filter.
This is later corrected by the block that adds these leading directory
changes.
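To make the under-count concrete, here is a small sketch of how two
queued file paths turn into four Bloom filter entries once leading
directories are added. It is a toy fixed-size set with linear lookup,
not the hashmap code in bloom.c; `toy_pathset`, `toy_pathset_add` and
`toy_add_with_leading_dirs` are hypothetical names.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_PATHS 64
#define TOY_MAX_LEN 256

/* Hypothetical fixed-size set of distinct paths. */
struct toy_pathset {
	char paths[TOY_MAX_PATHS][TOY_MAX_LEN];
	int nr;
};

/* Returns 1 if added, 0 if already present, -1 if it does not fit. */
static int toy_pathset_add(struct toy_pathset *set, const char *path)
{
	int i;

	for (i = 0; i < set->nr; i++)
		if (!strcmp(set->paths[i], path))
			return 0;
	if (set->nr == TOY_MAX_PATHS || strlen(path) >= TOY_MAX_LEN)
		return -1;
	strcpy(set->paths[set->nr++], path);
	return 1;
}

/*
 * Add 'path' and each of its leading directories (without the
 * trailing '/'), e.g. "dir/subdir/file" also adds "dir/subdir"
 * and "dir".
 */
static int toy_add_with_leading_dirs(struct toy_pathset *set,
				     const char *path)
{
	char buf[TOY_MAX_LEN];
	char *slash;

	if (strlen(path) >= sizeof(buf))
		return -1;
	strcpy(buf, path);
	while (*buf) {
		if (toy_pathset_add(set, buf) < 0)
			return -1;
		slash = strrchr(buf, '/');
		if (!slash)
			break;
		*slash = '\0';
	}
	return 0;
}
```

For "dir/subdir/file1" and "dir/subdir/file2" the queue holds 2
entries while the set ends up with 4, which is the difference between
the expected values in the two versions of the test script.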
> The makeshift tests in the patch below demonstrate these issues as
> most of them fail, most notably those two tests that demonstrate that
> modifications of existing paths are not counted at all.
I adapted your diff along with ripping out 'num_changes' in favor
of diff_queued_diff.nr. This required modifying some of your expected
values in the test script (losing the leading directories in the
count).
I'll work with Taylor to create a fix, and include proper testing
of the logic here. We'll stick it in the v2 of his max-changed-paths
series [1]. He already has some helpful logging that can help create
tests that ensure this logic is performing as expected.
We plan to have that fix available by later today or early tomorrow.
Will you be available to help validate it?
[1] https://lore.kernel.org/git/cover.1596480582.git.me@ttaylorr.com/
Thanks,
-Stolee
--- >8 ---
diff --git a/bloom.c b/bloom.c
index 1a573226e7..b8d6cb9240 100644
--- a/bloom.c
+++ b/bloom.c
@@ -218,8 +218,9 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
else
diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
diffcore_std(&diffopt);
+ printf("%s %d\n", oid_to_hex(&c->object.oid), diff_queued_diff.nr);
- if (diffopt.num_changes <= max_changes) {
+ if (diff_queued_diff.nr <= max_changes) {
struct hashmap pathmap;
struct pathmap_hash_entry *e;
struct hashmap_iter iter;
diff --git a/diff.h b/diff.h
index e0c0af6286b..1d32b718857 100644
--- a/diff.h
+++ b/diff.h
@@ -287,8 +287,6 @@ struct diff_options {
/* If non-zero, then stop computing after this many changes. */
int max_changes;
- /* For internal use only. */
- int num_changes;
int ita_invisible_in_index;
/* white-space error highlighting */
diff --git a/t/t9999-test.sh b/t/t9999-test.sh
new file mode 100755
index 00000000000..1f35aa8e2c5
--- /dev/null
+++ b/t/t9999-test.sh
@@ -0,0 +1,142 @@
+#!/bin/sh
+
+test_description='test'
+
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+ test_tick &&
+
+ echo 1 >file &&
+ mkdir -p dir/subdir &&
+ echo 1 >dir/subdir/file1 &&
+ echo 1 >dir/subdir/file2 &&
+ git add file dir &&
+ git commit -m setup &&
+
+ echo 2 >file &&
+ git commit -a -m "modify one path in root" &&
+ mod_one_path=$(git rev-parse HEAD) &&
+
+ echo 2 >dir/subdir/file1 &&
+ echo 2 >dir/subdir/file2 &&
+ git commit -a -m "modify two file two dirs deep" &&
+ mod_four_paths=$(git rev-parse HEAD) &&
+
+ >new-file &&
+ git add new-file &&
+ git commit -m "add new file in root" &&
+ new_file_in_root=$(git rev-parse HEAD) &&
+
+ git rm new-file &&
+ git commit -m "delete file in root" &&
+ delete_file_in_root=$(git rev-parse HEAD) &&
+
+ >dir/new-file &&
+ git add dir/new-file &&
+ git commit -m "add new file in dir" &&
+ new_file_in_dir=$(git rev-parse HEAD) &&
+
+ git rm dir/new-file &&
+ git commit -m "delete file in dir" &&
+ delete_file_in_dir=$(git rev-parse HEAD) &&
+
+ echo 1 >d-f &&
+ git add d-f &&
+ git commit -m foo &&
+ git rm d-f &&
+ mkdir d-f &&
+ echo 2 >d-f/file &&
+ git add d-f &&
+ git commit -m "replace file with dir" &&
+ file_to_dir=$(git rev-parse HEAD) &&
+
+ >d-f.c &&
+ git add d-f.c &&
+ git commit -m "add a file that sorts between d-f and d-f/" &&
+ git rm -r d-f &&
+ echo 3 >d-f &&
+ git add d-f &&
+ git commit -m "replace dir with file" &&
+ dir_to_file=$(git rev-parse HEAD) &&
+
+ bin_sha1=$(git rev-parse HEAD:dir/subdir | hex2oct) &&
+ # leading zero in mode: the content of the tree remains the same,
+ # but its oid does change!
+ printf "040000 subdir\0$bin_sha1" >rawtree &&
+ tree1=$(git hash-object -t tree -w rawtree) &&
+ git cat-file -p HEAD^{tree} >out &&
+ tree2=$(sed -e "s/$(git rev-parse HEAD:dir/)/$tree1/" out |git mktree) &&
+ different_but_same_tree=$(git commit-tree \
+ -m "leading zeros in mode" \
+ -p $(git rev-parse HEAD) $tree2) &&
+ git update-ref HEAD $different_but_same_tree &&
+
+ git commit-graph write --reachable --changed-paths >out &&
+ cat out # debug
+'
+
+test_expect_success 'modify one path in root' '
+ git diff --name-status $mod_one_path^ $mod_one_path &&
+ echo "$mod_one_path 1" >expect &&
+ grep "$mod_one_path" out >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'modify two file two dirs deep' '
+ git diff --name-status $mod_four_paths^ $mod_four_paths &&
+ echo "$mod_four_paths 2" >expect &&
+ grep "$mod_four_paths" out >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'add new file in root' '
+ git diff --name-status $new_file_in_root^ $new_file_in_root &&
+ echo "$new_file_in_root 1" >expect &&
+ grep "$new_file_in_root" out >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'delete file in root' '
+ git diff --name-status $delete_file_in_root^ $delete_file_in_root &&
+ echo "$delete_file_in_root 1" >expect &&
+ grep "$delete_file_in_root" out >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'add new file in dir' '
+ git diff --name-status $new_file_in_dir^ $new_file_in_dir &&
+ echo "$new_file_in_dir 1" >expect &&
+ grep "$new_file_in_dir" out >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'delete file in dir' '
+ git diff --name-status $delete_file_in_dir^ $delete_file_in_dir &&
+ echo "$delete_file_in_dir 1" >expect &&
+ grep "$delete_file_in_dir" out >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'replace file with dir' '
+ git diff --name-status $file_to_dir^ $file_to_dir &&
+ echo "$file_to_dir 2" >expect &&
+ grep "$file_to_dir" out >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'replace dir with file' '
+ git diff --name-status $dir_to_file^ $dir_to_file &&
+ echo "$dir_to_file 2" >expect &&
+ grep "$dir_to_file" out >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'leading zeros in mode' '
+ git diff --name-status $different_but_same_tree^ $different_but_same_tree &&
+ echo "$different_but_same_tree 0" >expect &&
+ grep "$different_but_same_tree" out >actual &&
+ test_cmp expect actual
+'
+
+test_done
diff --git a/tree-diff.c b/tree-diff.c
index 6ebad1a46f3..7cebbb327e2 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -434,7 +434,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
if (diff_can_quit_early(opt))
break;
- if (opt->max_changes && opt->num_changes > opt->max_changes)
+ if (opt->max_changes && diff_queued_diff.nr > opt->max_changes)
break;
if (opt->pathspec.nr) {
@@ -521,7 +521,6 @@ static struct combine_diff_path *ll_diff_tree_paths(
/* t↓ */
update_tree_entry(&t);
- opt->num_changes++;
}
/* t > p[imin] */
@@ -539,7 +538,6 @@ static struct combine_diff_path *ll_diff_tree_paths(
skip_emit_tp:
/* ∀ pi=p[imin] pi↓ */
update_tp_entries(tp, nparent);
- opt->num_changes++;
}
}
@@ -557,7 +555,6 @@ struct combine_diff_path *diff_tree_paths(
const struct object_id **parents_oid, int nparent,
struct strbuf *base, struct diff_options *opt)
{
- opt->num_changes = 0;
p = ll_diff_tree_paths(p, oid, parents_oid, nparent, base, opt);
/*
^ permalink raw reply related [flat|nested] 159+ messages in thread
* Re: [PATCH v4 05/15] diff: halt tree-diff early after max_changes
2020-08-04 16:25 ` Derrick Stolee
@ 2020-08-04 17:00 ` SZEDER Gábor
2020-08-04 17:31 ` Derrick Stolee
0 siblings, 1 reply; 159+ messages in thread
From: SZEDER Gábor @ 2020-08-04 17:00 UTC (permalink / raw)
To: Derrick Stolee
Cc: Derrick Stolee via GitGitGadget, git, jonathantanmy,
Garima Singh, Derrick Stolee, Taylor Blau
On Tue, Aug 04, 2020 at 12:25:45PM -0400, Derrick Stolee wrote:
> On 8/4/2020 10:47 AM, SZEDER Gábor wrote:
> > On Mon, Apr 06, 2020 at 04:59:45PM +0000, Derrick Stolee via GitGitGadget wrote:
> > This counter is basically broken, its value is wrong for over 98% of
> > commits, and, worse, its value remains 0 for over 85% of commits in
> > the repositories I usually use to test modified path Bloom filters.
> > Consequently, a relatively large number of commits modifying more than
> > 512 paths get Bloom filters.
>
> Thanks for finding this! The counter is only really tested in one
> place, and that test only considers _file adds_, which is a problem.
>
> If I understand this correctly, the bug is a performance-only bug
> (since this is a performance-only feature), but it is an important
> one to fix.
Or a performance-only feature in a performance-only feature, because
those additional modified path Bloom filters can improve the runtime
of pathspec-limited revision walks (assuming that the false positive
rate is low enough).
> There is certainly some dark magic happening in this tree-diff logic,
> so instead of trying to get an accurate count we should just use the
> magic global diff_queued_diff to track the current list of file changes.
>
> Note: diff_queued_diff does not track the directory changes, so it
> is an under-count for the total changes to track in the Bloom filter.
> This is later corrected by the block that adds these leading directory
> changes.
>
> > The makeshift tests in the patch below demonstrate these issues as
> > most of them fail, most notably those two tests that demonstrate that
> > modifications of existing paths are not counted at all.
>
> I adapted your diff along with ripping out 'num_changes' in favor
> of diff_queued_diff.nr. This required modifying some of your expected
> values in the test script (losing the leading directories in the
> count).
>
> I'll work with Taylor to create a fix, and include proper testing
> of the logic here. We'll stick it in the v2 of his max-changed-paths
> series [1]. He already has some helpful logging that can help create
> tests that ensure this logic is performing as expected.
Don't forget to include a check of the hashmap's size, to make sure.
FWIW, the patch below does result in the correct count (read: the same
as in my implementation) for all but 4 commits in those repositories I
use for testing, without adding any memory allocations or extra
strcmp() calls.
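As a standalone illustration of the comparison the patch below relies
on: directories sort as if their name had a trailing '/', and the
extra out-parameter reports when two names are identical except that
one is a file and the other a directory. The sketch takes an is_dir
flag instead of a mode and expects NUL-terminated names;
`toy_base_name_compare_df` is a hypothetical name, not the function
added by the patch.

```c
#include <assert.h>
#include <string.h>

/*
 * Compare two entry names in tree order. A directory compares as if
 * its name ended in '/'. Sets *df to 1 when the names are equal but
 * exactly one of them is a directory (a file/directory conflict).
 */
static int toy_base_name_compare_df(const char *name1, size_t len1, int is_dir1,
				    const char *name2, size_t len2, int is_dir2,
				    int *df)
{
	size_t len = len1 < len2 ? len1 : len2;
	unsigned char c1, c2;
	int cmp;

	*df = 0;

	cmp = memcmp(name1, name2, len);
	if (cmp)
		return cmp;

	/* First character after the common prefix, or NUL at the end. */
	c1 = len < len1 ? (unsigned char)name1[len] : 0;
	c2 = len < len2 ? (unsigned char)name2[len] : 0;
	if (!c1 && is_dir1)
		c1 = '/';
	if (!c2 && is_dir2)
		c2 = '/';
	if (c1 == c2)
		return 0;
	if ((c1 == '/' && !c2) || (!c1 && c2 == '/'))
		*df = 1;
	return c1 < c2 ? -1 : 1;
}
```

Since '.' sorts before '/', a file "d-f.c" lands between the file
"d-f" and the directory "d-f", which is the case the "replace dir
with file" test sets up.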
--- >8 ---
diff --git a/cache.h b/cache.h
index 0f0485ecfe..3fc7e1b427 100644
--- a/cache.h
+++ b/cache.h
@@ -1574,6 +1574,7 @@ int repo_interpret_branch_name(struct repository *r,
int validate_headref(const char *ref);
int base_name_compare(const char *name1, int len1, int mode1, const char *name2, int len2, int mode2);
+int base_name_compare_df(const char *name1, int len1, int mode1, const char *name2, int len2, int mode2, int *df);
int df_name_compare(const char *name1, int len1, int mode1, const char *name2, int len2, int mode2);
int name_compare(const char *name1, size_t len1, const char *name2, size_t len2);
int cache_name_stage_compare(const char *name1, int len1, int stage1, const char *name2, int len2, int stage2);
diff --git a/read-cache.c b/read-cache.c
index aa427c5c17..041af19e60 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -460,13 +460,16 @@ int ie_modified(struct index_state *istate,
return 0;
}
-int base_name_compare(const char *name1, int len1, int mode1,
- const char *name2, int len2, int mode2)
+int base_name_compare_df(const char *name1, int len1, int mode1,
+ const char *name2, int len2, int mode2,
+ int *df)
{
unsigned char c1, c2;
int len = len1 < len2 ? len1 : len2;
int cmp;
+ *df = 0;
+
cmp = memcmp(name1, name2, len);
if (cmp)
return cmp;
@@ -476,7 +479,21 @@ int base_name_compare(const char *name1, int len1, int mode1,
c1 = '/';
if (!c2 && S_ISDIR(mode2))
c2 = '/';
- return (c1 < c2) ? -1 : (c1 > c2) ? 1 : 0;
+ if (c1 == c2)
+ return 0; /* TODO: is this even possible? */
+ if ((c1 == '/' && !c2) ||
+ (!c1 && c2 == '/'))
+ *df = 1;
+ return (c1 < c2) ? -1 : 1;
+}
+
+int base_name_compare(const char *name1, int len1, int mode1,
+ const char *name2, int len2, int mode2)
+{
+ int unused;
+ return base_name_compare_df(name1, len1, mode1,
+ name2, len2, mode2,
+ &unused);
}
/*
diff --git a/t/t9999-test.sh b/t/t9999-test.sh
index 8d2bd9f03f..4f08590b45 100755
--- a/t/t9999-test.sh
+++ b/t/t9999-test.sh
@@ -125,7 +125,7 @@ test_expect_success 'replace file with dir' '
test_cmp expect actual
'
-test_expect_success 'replace dir with file' '
+test_expect_failure 'replace dir with file' '
git diff --name-status $dir_to_file^ $dir_to_file &&
echo "$dir_to_file 2" >expect &&
grep "$dir_to_file" out >actual &&
diff --git a/tree-diff.c b/tree-diff.c
index f3d303c6e5..e27f9c805e 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -46,11 +46,14 @@ static int ll_diff_tree_oid(const struct object_id *old_oid,
* Due to this convention, if trees are scanned in sorted order, all
* non-empty descriptors will be processed first.
*/
-static int tree_entry_pathcmp(struct tree_desc *t1, struct tree_desc *t2)
+static int tree_entry_pathcmp(struct tree_desc *t1, struct tree_desc *t2,
+ int *df)
{
struct name_entry *e1, *e2;
int cmp;
+ *df = 0;
+
/* empty descriptors sort after valid tree entries */
if (!t1->size)
return t2->size ? 1 : 0;
@@ -59,8 +62,9 @@ static int tree_entry_pathcmp(struct tree_desc *t1, struct tree_desc *t2)
e1 = &t1->entry;
e2 = &t2->entry;
- cmp = base_name_compare(e1->path, tree_entry_len(e1), e1->mode,
- e2->path, tree_entry_len(e2), e2->mode);
+ cmp = base_name_compare_df(e1->path, tree_entry_len(e1), e1->mode,
+ e2->path, tree_entry_len(e2), e2->mode,
+ df);
return cmp;
}
@@ -410,7 +414,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
{
struct tree_desc t, *tp;
void *ttree, **tptree;
- int i;
+ int i, df;
FAST_ARRAY_ALLOC(tp, nparent);
FAST_ARRAY_ALLOC(tptree, nparent);
@@ -463,7 +467,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
tp[0].entry.mode &= ~S_IFXMIN_NEQ;
for (i = 1; i < nparent; ++i) {
- cmp = tree_entry_pathcmp(&tp[i], &tp[imin]);
+ cmp = tree_entry_pathcmp(&tp[i], &tp[imin], &df);
if (cmp < 0) {
imin = i;
tp[i].entry.mode &= ~S_IFXMIN_NEQ;
@@ -483,10 +487,12 @@ static struct combine_diff_path *ll_diff_tree_paths(
/* compare t vs p[imin] */
- cmp = tree_entry_pathcmp(&t, &tp[imin]);
+ cmp = tree_entry_pathcmp(&t, &tp[imin], &df);
/* t = p[imin] */
if (cmp == 0) {
+ int prev_num_changes = opt->num_changes;
+
/* are either pi > p[imin] or diff(t,pi) != ø ? */
if (!opt->flags.find_copies_harder) {
for (i = 0; i < nparent; ++i) {
@@ -506,6 +512,9 @@ static struct combine_diff_path *ll_diff_tree_paths(
/* D += {δ(t,pi) if pi=p[imin]; "+a" if pi > p[imin]} */
p = emit_path(p, base, opt, nparent,
&t, tp, imin);
+ if (!(opt->num_changes == prev_num_changes &&
+ S_ISDIR(t.entry.mode)))
+ opt->num_changes++;
skip_emit_t_tp:
/* t↓, ∀ pi=p[imin] pi↓ */
@@ -518,10 +527,11 @@ static struct combine_diff_path *ll_diff_tree_paths(
/* D += "+t" */
p = emit_path(p, base, opt, nparent,
&t, /*tp=*/NULL, -1);
+ if (!df)
+ opt->num_changes++;
/* t↓ */
update_tree_entry(&t);
- opt->num_changes++;
}
/* t > p[imin] */
@@ -535,11 +545,12 @@ static struct combine_diff_path *ll_diff_tree_paths(
p = emit_path(p, base, opt, nparent,
/*t=*/NULL, tp, imin);
+ if (!df)
+ opt->num_changes++;
skip_emit_tp:
/* ∀ pi=p[imin] pi↓ */
update_tp_entries(tp, nparent);
- opt->num_changes++;
}
}
--- >8 ---
Having said that, the best (i.e., faster and accurate) solution to this
issue is probably:
- Update the callchain between diff_tree_oid() and the diff callback
functions to allow the callbacks to break diffing with a non-zero
error code.
- Fill Bloom filters using the approach presented in:
https://public-inbox.org/git/20200529085038.26008-21-szeder.dev@gmail.com/
but modify the callbacks to return non-zero when too many paths
have been processed.
- Drop this counter entirely, as there are no other users.
> We plan to have that fix available by later today or early tomorrow.
> Will you be available to help validate it?
>
> [1] https://lore.kernel.org/git/cover.1596480582.git.me@ttaylorr.com/
>
> Thanks,
> -Stolee
>
> --- >8 ---
>
> diff --git a/bloom.c b/bloom.c
> index 1a573226e7..b8d6cb9240 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -218,8 +218,9 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
> else
> diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
> diffcore_std(&diffopt);
> + printf("%s %d\n", oid_to_hex(&c->object.oid), diff_queued_diff.nr);
>
> - if (diffopt.num_changes <= max_changes) {
> + if (diff_queued_diff.nr <= max_changes) {
> struct hashmap pathmap;
> struct pathmap_hash_entry *e;
> struct hashmap_iter iter;
> diff --git a/diff.h b/diff.h
> index e0c0af6286b..1d32b718857 100644
> --- a/diff.h
> +++ b/diff.h
> @@ -287,8 +287,6 @@ struct diff_options {
>
> /* If non-zero, then stop computing after this many changes. */
> int max_changes;
> - /* For internal use only. */
> - int num_changes;
>
> int ita_invisible_in_index;
> /* white-space error highlighting */
> diff --git a/t/t9999-test.sh b/t/t9999-test.sh
> new file mode 100755
> index 00000000000..1f35aa8e2c5
> --- /dev/null
> +++ b/t/t9999-test.sh
> @@ -0,0 +1,142 @@
> +#!/bin/sh
> +
> +test_description='test'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'setup' '
> + test_tick &&
> +
> + echo 1 >file &&
> + mkdir -p dir/subdir &&
> + echo 1 >dir/subdir/file1 &&
> + echo 1 >dir/subdir/file2 &&
> + git add file dir &&
> + git commit -m setup &&
> +
> + echo 2 >file &&
> + git commit -a -m "modify one path in root" &&
> + mod_one_path=$(git rev-parse HEAD) &&
> +
> + echo 2 >dir/subdir/file1 &&
> + echo 2 >dir/subdir/file2 &&
> + git commit -a -m "modify two file two dirs deep" &&
> + mod_four_paths=$(git rev-parse HEAD) &&
> +
> + >new-file &&
> + git add new-file &&
> + git commit -m "add new file in root" &&
> + new_file_in_root=$(git rev-parse HEAD) &&
> +
> + git rm new-file &&
> + git commit -m "delete file in root" &&
> + delete_file_in_root=$(git rev-parse HEAD) &&
> +
> + >dir/new-file &&
> + git add dir/new-file &&
> + git commit -m "add new file in dir" &&
> + new_file_in_dir=$(git rev-parse HEAD) &&
> +
> + git rm dir/new-file &&
> + git commit -m "delete file in dir" &&
> + delete_file_in_dir=$(git rev-parse HEAD) &&
> +
> + echo 1 >d-f &&
> + git add d-f &&
> + git commit -m foo &&
> + git rm d-f &&
> + mkdir d-f &&
> + echo 2 >d-f/file &&
> + git add d-f &&
> + git commit -m "replace file with dir" &&
> + file_to_dir=$(git rev-parse HEAD) &&
> +
> + >d-f.c &&
> + git add d-f.c &&
> + git commit -m "add a file that sorts between d-f and d-f/" &&
> + git rm -r d-f &&
> + echo 3 >d-f &&
> + git add d-f &&
> + git commit -m "replace dir with file" &&
> + dir_to_file=$(git rev-parse HEAD) &&
> +
> + bin_sha1=$(git rev-parse HEAD:dir/subdir | hex2oct) &&
> + # leading zero in mode: the content of the tree remains the same,
> + # but its oid does change!
> + printf "040000 subdir\0$bin_sha1" >rawtree &&
> + tree1=$(git hash-object -t tree -w rawtree) &&
> + git cat-file -p HEAD^{tree} >out &&
> + tree2=$(sed -e "s/$(git rev-parse HEAD:dir/)/$tree1/" out |git mktree) &&
> + different_but_same_tree=$(git commit-tree \
> + -m "leading zeros in mode" \
> + -p $(git rev-parse HEAD) $tree2) &&
> + git update-ref HEAD $different_but_same_tree &&
> +
> + git commit-graph write --reachable --changed-paths >out &&
> + cat out # debug
> +'
> +
> +test_expect_success 'modify one path in root' '
> + git diff --name-status $mod_one_path^ $mod_one_path &&
> + echo "$mod_one_path 1" >expect &&
> + grep "$mod_one_path" out >actual &&
> + test_cmp expect actual
> +'
> +
> +test_expect_success 'modify two file two dirs deep' '
> + git diff --name-status $mod_four_paths^ $mod_four_paths &&
> + echo "$mod_four_paths 2" >expect &&
> + grep "$mod_four_paths" out >actual &&
> + test_cmp expect actual
> +'
> +
> +test_expect_success 'add new file in root' '
> + git diff --name-status $new_file_in_root^ $new_file_in_root &&
> + echo "$new_file_in_root 1" >expect &&
> + grep "$new_file_in_root" out >actual &&
> + test_cmp expect actual
> +'
> +
> +test_expect_success 'delete file in root' '
> + git diff --name-status $delete_file_in_root^ $delete_file_in_root &&
> + echo "$delete_file_in_root 1" >expect &&
> + grep "$delete_file_in_root" out >actual &&
> + test_cmp expect actual
> +'
> +
> +test_expect_success 'add new file in dir' '
> + git diff --name-status $new_file_in_dir^ $new_file_in_dir &&
> + echo "$new_file_in_dir 1" >expect &&
> + grep "$new_file_in_dir" out >actual &&
> + test_cmp expect actual
> +'
> +
> +test_expect_success 'delete file in dir' '
> + git diff --name-status $delete_file_in_dir^ $delete_file_in_dir &&
> + echo "$delete_file_in_dir 1" >expect &&
> + grep "$delete_file_in_dir" out >actual &&
> + test_cmp expect actual
> +'
> +
> +test_expect_success 'replace file with dir' '
> + git diff --name-status $file_to_dir^ $file_to_dir &&
> + echo "$file_to_dir 2" >expect &&
> + grep "$file_to_dir" out >actual &&
> + test_cmp expect actual
> +'
> +
> +test_expect_success 'replace dir with file' '
> + git diff --name-status $dir_to_file^ $dir_to_file &&
> + echo "$dir_to_file 2" >expect &&
> + grep "$dir_to_file" out >actual &&
> + test_cmp expect actual
> +'
> +
> +test_expect_success 'leading zeros in mode' '
> + git diff --name-status $different_but_same_tree^ $different_but_same_tree &&
> + echo "$different_but_same_tree 0" >expect &&
> + grep "$different_but_same_tree" out >actual &&
> + test_cmp expect actual
> +'
> +
> +test_done
> diff --git a/tree-diff.c b/tree-diff.c
> index 6ebad1a46f3..7cebbb327e2 100644
> --- a/tree-diff.c
> +++ b/tree-diff.c
> @@ -434,7 +434,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
> if (diff_can_quit_early(opt))
> break;
>
> - if (opt->max_changes && opt->num_changes > opt->max_changes)
> + if (opt->max_changes && diff_queued_diff.nr > opt->max_changes)
> break;
>
> if (opt->pathspec.nr) {
> @@ -521,7 +521,6 @@ static struct combine_diff_path *ll_diff_tree_paths(
>
> /* t↓ */
> update_tree_entry(&t);
> - opt->num_changes++;
> }
>
> /* t > p[imin] */
> @@ -539,7 +538,6 @@ static struct combine_diff_path *ll_diff_tree_paths(
> skip_emit_tp:
> /* ∀ pi=p[imin] pi↓ */
> update_tp_entries(tp, nparent);
> - opt->num_changes++;
> }
> }
>
> @@ -557,7 +555,6 @@ struct combine_diff_path *diff_tree_paths(
> const struct object_id **parents_oid, int nparent,
> struct strbuf *base, struct diff_options *opt)
> {
> - opt->num_changes = 0;
> p = ll_diff_tree_paths(p, oid, parents_oid, nparent, base, opt);
>
> /*
^ permalink raw reply related [flat|nested] 159+ messages in thread
* Re: [PATCH v4 05/15] diff: halt tree-diff early after max_changes
2020-08-04 17:00 ` SZEDER Gábor
@ 2020-08-04 17:31 ` Derrick Stolee
2020-08-05 17:08 ` Derrick Stolee
0 siblings, 1 reply; 159+ messages in thread
From: Derrick Stolee @ 2020-08-04 17:31 UTC (permalink / raw)
To: SZEDER Gábor
Cc: Derrick Stolee via GitGitGadget, git, jonathantanmy,
Garima Singh, Derrick Stolee, Taylor Blau
On 8/4/2020 1:00 PM, SZEDER Gábor wrote:
> On Tue, Aug 04, 2020 at 12:25:45PM -0400, Derrick Stolee wrote:
>> On 8/4/2020 10:47 AM, SZEDER Gábor wrote:
>>> On Mon, Apr 06, 2020 at 04:59:45PM +0000, Derrick Stolee via GitGitGadget wrote:
>>> This counter is basically broken, its value is wrong for over 98% of
>>> commits, and, worse, its value remains 0 for over 85% of commits in
>>> the repositories I usually use to test modified path Bloom filters.
>>> Consequently, a relatively large number of commits modifying more than
>>> 512 paths get Bloom filters.
>>
>> Thanks for finding this! The counter is only really tested in one
>> place, and that test only considers _file adds_, which is a problem.
>>
>> If I understand this correctly, the bug is a performance-only bug
>> (since this is a performance-only feature), but it is an important
>> one to fix.
>
> Or a performance-only feature in a performance-only feature, because
> those additional modified path Bloom filters can improve the runtime
> of pathspec-limited revision walks (assuming that the false positive
> rate is low enough).
>
>> There is certainly some dark magic happening in this tree-diff logic,
>> so instead of trying to get an accurate count we should just use the
>> magic global diff_queued_diff to track the current list of file changes.
>>
>> Note: diff_queued_diff does not track the directory changes, so it
>> is an under-count for the total changes to track in the Bloom filter.
>> This is later corrected by the block that adds these leading directory
>> changes.
>>
>>> The makeshift tests in the patch below demonstrate these issues as
>>> most of them fail, most notably those two tests that demonstrate that
>>> modifying existing paths are not counted at all.
>>
>> I adapted your diff along with ripping out 'num_changes' in favor
>> of diff_queued_diff.nr. This required modifying some of your expected
>> values in the test script (losing the leading directories in the
>> count).
>>
>> I'll work with Taylor to create a fix, and include proper testing
>> of the logic here. We'll stick it in the v2 of his max-changed-paths
>> series [1]. He already has some helpful logging that can help create
>> tests that ensure this logic is performing as expected.
>
> Don't forget to include a check of the hashmap's size, to make sure.
Yes, thanks for the pointer. That check is currently not in there,
since the code assumes the hashmap's size will match num_changes.
Hopefully, the tests I intend to write around this would have caught
such an omission.
> FWIW, the patch below does result in the correct count (read: the same
> as in my implementation) for all but 4 commits in those repositories I
> use for testing, without adding any memory allocations and extra
> strcmp() calls.
...
> Having said that, the best (i.e., faster and accurate) solution to this
> issue is probably:
>
> - Update the callchain between diff_tree_oid() and the diff callback
> functions to allow the callbacks to break diffing with a non-zero
> error code.
It looks like this part would not be too difficult. The pathchange
callback is called by emit_path() which returns a struct combine_diff_path
pointer. This could return NULL to signal an early termination, but
we need to update all callers of the following methods to handle NULL
responses:
* emit_path()
* ll_diff_tree_paths()
* diff_tree_paths()
Of some interest: diff_tree_paths() returns a struct combine_diff_path
pointer, but no callers seem to consume it.
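The early-termination idea can be sketched in isolation. This is a hypothetical model, not Git's actual code: path_cb, limit_cb() and walk_paths() are made-up names, and a flat array of strings stands in for the tree walk. The only point it shows is that a non-zero return from the callback stops the walk, and that every caller on the way up has to pass that result along.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: return non-zero to stop the walk. */
typedef int (*path_cb)(const char *path, void *data);

struct limit_state {
	int seen;	/* paths processed so far */
	int max;	/* stop once more than this many were seen */
};

/* Hypothetical callback: count paths, ask to stop when over the limit. */
static int limit_cb(const char *path, void *data)
{
	struct limit_state *st = data;

	(void)path;
	st->seen++;
	return st->seen > st->max;
}

/*
 * Stand-in for the tree walk: visit each path and hand the callback's
 * non-zero result back to the caller as soon as it appears.
 */
static int walk_paths(const char **paths, size_t nr, path_cb cb, void *data)
{
	size_t i;

	assert(cb);
	for (i = 0; i < nr; i++) {
		int ret = cb(paths[i], data);
		if (ret)
			return ret;
	}
	return 0;
}
```

With a limit of 2 and four paths, the walk in this sketch stops after the third callback invocation and reports a non-zero result.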
> - Fill Bloom filters using the approach presented in:
>
> https://public-inbox.org/git/20200529085038.26008-21-szeder.dev@gmail.com/
>
> but modify the callbacks to return non-zero when too many paths
> have been processed.
Thanks for the pointer to that specific patch. You do a good job of
describing your thought process, including why you used the callback
approach instead of the diff queue approach. The main reason seemed to
be memory overhead from populating the entire diff queue before
checking the limit.
However, if we are using the diff queue as the short-circuit, then
perhaps that memory overhead isn't as much of a problem?
You admit yourself that
    This patch implements a more efficient, but more complex, approach:
The logic around matching prefixes definitely seems complex and
hard to test, especially around the file/directory changes with the
sort order problems that have plagued similar prefix checks recently.
I'm not doubting your implementation, just saying that the complexity
is worth considering before jumping to that solution too quickly.
To sum up, I intend to start with a fix that uses the diff queue
count as a limit, then try the callback approach to see if there are
measurable improvements in performance.
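To make the under-count concrete, here is a small sketch (an illustration only, not code from bloom.c) of how the number of entries grows once leading directories are added to the file paths in the queue. The names add_unique() and count_with_leading_dirs() and the fixed MAX_ENTRIES cap are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ENTRIES 256	/* arbitrary cap for this sketch */

struct prefix {
	const char *s;
	size_t len;
};

/* Record (s, len) unless an identical prefix was recorded before. */
static void add_unique(struct prefix *set, size_t *nr, const char *s, size_t len)
{
	size_t i;

	for (i = 0; i < *nr; i++)
		if (set[i].len == len && !memcmp(set[i].s, s, len))
			return;
	assert(*nr < MAX_ENTRIES);
	set[*nr].s = s;
	set[*nr].len = len;
	(*nr)++;
}

/*
 * Count the distinct entries needed for the given file paths:
 * every path itself plus each of its leading directories.
 */
static size_t count_with_leading_dirs(const char **paths, size_t nr_paths)
{
	struct prefix set[MAX_ENTRIES];
	size_t nr = 0, i;

	for (i = 0; i < nr_paths; i++) {
		const char *p = paths[i];
		const char *slash = p;

		while ((slash = strchr(slash, '/')) != NULL) {
			add_unique(set, &nr, p, (size_t)(slash - p));
			slash++;
		}
		add_unique(set, &nr, p, strlen(p));
	}
	return nr;
}
```

For the two files dir/subdir/file1 and dir/subdir/file2, the queue holds 2 entries while this count is 4 (dir, dir/subdir and the two files).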
> - Drop this counter entirely, as there are no other users.
With the callback approach, "this counter" is both num_changes and
max_changes, since the callback would perform all of the short-circuit
logic.
Thanks,
-Stolee
^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v4 05/15] diff: halt tree-diff early after max_changes
2020-08-04 17:31 ` Derrick Stolee
@ 2020-08-05 17:08 ` Derrick Stolee
0 siblings, 0 replies; 159+ messages in thread
From: Derrick Stolee @ 2020-08-05 17:08 UTC (permalink / raw)
To: SZEDER Gábor
Cc: Derrick Stolee via GitGitGadget, git, jonathantanmy,
Garima Singh, Derrick Stolee, Taylor Blau
On 8/4/2020 1:31 PM, Derrick Stolee wrote:
> On 8/4/2020 1:00 PM, SZEDER Gábor wrote:
>> Having said that, the best (i.e., faster and accurate) solution to this
>> issue is probably:
>>
>> - Update the callchain between diff_tree_oid() and the diff callback
>> functions to allow the callbacks to break diffing with a non-zero
>> error code.
>
> It looks like this part would not be too difficult.
Oh, my hubris! I gave this a shot for some time this morning. This
will definitely take some work to do right. Just changing the callbacks
to return 'int' is a wide-sweeping change, but the place where they are
called already has an 'int' return that means something different.
I'm not saying this is impossible. It just takes more attention and care
than I can currently devote, given my other works in progress right now.
>> - Fill Bloom filters using the approach presented in:
>>
>> https://public-inbox.org/git/20200529085038.26008-21-szeder.dev@gmail.com/
>>
>> but modify the callbacks to return non-zero when too many paths
>> have been processed.
>
> Thanks for the pointer to that specific patch. You do a good job of
> describing your thought process, including why you used the callback
> approach instead of the diff queue approach. The main reason seemed to
> be memory overhead from populating the entire diff queue before
> checking the limit.
>
> However, if we are using the diff queue as the short-circuit, then
> perhaps that memory overhead isn't as much of a problem?
>
> You admit yourself, that
>
> This patch implements a more efficient, but more complex, approach:
>
> The logic around matching prefixes definitely seems complex and
> hard to test, especially around the file/directory changes with the
> sort order problems that have plagued similar prefix checks recently.
> I'm not doubting your implementation, just saying that the complexity
> is worth considering before jumping to that solution too quickly.
>
> To sum up, I intend to start with a fix that uses the diff queue
> count as a limit, then try the callback approach to see if there are
> measurable improvements in performance.
That fix is now available [1].
[1] https://lore.kernel.org/git/d1c4bbcaa9627068d5d9fbd0e4a2e8c8834a4bd3.1596646576.git.me@ttaylorr.com/
Again, the callback approach seems promising. The complexity is
stopping me from trying to apply it on top of the current
implementation, while I should be focusing on other things. I completely
believe that that approach is faster and more memory-efficient. I would
love to test and review a patch that takes that approach here.
Thanks,
-Stolee
^ permalink raw reply [flat|nested] 159+ messages in thread
* [PATCH v4 06/15] commit-graph: compute Bloom filters for changed paths
2020-04-06 16:59 ` [PATCH v4 00/15] Changed Paths Bloom Filters Garima Singh via GitGitGadget
` (4 preceding siblings ...)
2020-04-06 16:59 ` [PATCH v4 05/15] diff: halt tree-diff early after max_changes Derrick Stolee via GitGitGadget
@ 2020-04-06 16:59 ` Garima Singh via GitGitGadget
2020-04-06 16:59 ` [PATCH v4 07/15] commit-graph: examine changed-path objects in pack order Jeff King via GitGitGadget
` (9 subsequent siblings)
15 siblings, 0 replies; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-04-06 16:59 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
Add a new COMMIT_GRAPH_WRITE_BLOOM_FILTERS flag that makes Git compute
Bloom filters for the paths that changed between a commit and its
first parent, for each commit in the commit-graph. This computation
is done on a commit-by-commit basis.
We will write these Bloom filters to the commit-graph file, to store
this data on disk, in the next change in this series.
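As a rough illustration of what a per-commit filter does (not the implementation in bloom.c), a toy Bloom filter over path strings could look like the sketch below. The names, the 16-byte size and the 7 probes are arbitrary choices for the example, and FNV-1a is used only as a simple stand-in hash; it is not the hash used by this series.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FILTER_BYTES 16	/* arbitrary size for this sketch */
#define NUM_HASHES 7	/* arbitrary number of bit positions per path */

struct toy_bloom {
	unsigned char data[FILTER_BYTES];
};

static void toy_bloom_init(struct toy_bloom *f)
{
	assert(f);
	memset(f->data, 0, sizeof(f->data));
}

/* FNV-1a, used here only as a simple stand-in hash function. */
static uint32_t toy_hash(const char *s, uint32_t seed)
{
	uint32_t h = 2166136261u ^ seed;

	for (; *s; s++) {
		h ^= (unsigned char)*s;
		h *= 16777619u;
	}
	return h;
}

/* Double hashing: the i-th bit position is (h0 + i * h1) mod nbits. */
static uint32_t toy_bit(const char *path, int i)
{
	uint32_t h0 = toy_hash(path, 0);
	uint32_t h1 = toy_hash(path, 1);

	return (h0 + (uint32_t)i * h1) % (FILTER_BYTES * 8);
}

static void toy_bloom_add(struct toy_bloom *f, const char *path)
{
	int i;

	for (i = 0; i < NUM_HASHES; i++) {
		uint32_t bit = toy_bit(path, i);
		f->data[bit / 8] |= (unsigned char)(1u << (bit % 8));
	}
}

/* Returns 0 for "definitely not added", 1 for "maybe added". */
static int toy_bloom_maybe_contains(const struct toy_bloom *f, const char *path)
{
	int i;

	for (i = 0; i < NUM_HASHES; i++) {
		uint32_t bit = toy_bit(path, i);
		if (!(f->data[bit / 8] & (1u << (bit % 8))))
			return 0;
	}
	return 1;
}
```

A path that was added always answers "maybe"; a path that was never added usually answers "no", with a false-positive rate that depends on the filter size and the number of paths added.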
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
commit-graph.c | 32 +++++++++++++++++++++++++++++++-
commit-graph.h | 3 ++-
2 files changed, 33 insertions(+), 2 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index e4f1a5b2f1a..862a00d67ed 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -16,6 +16,7 @@
#include "hashmap.h"
#include "replace-object.h"
#include "progress.h"
+#include "bloom.h"
#define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
@@ -789,9 +790,11 @@ struct write_commit_graph_context {
unsigned append:1,
report_progress:1,
split:1,
- check_oids:1;
+ check_oids:1,
+ changed_paths:1;
const struct split_commit_graph_opts *split_opts;
+ size_t total_bloom_filter_data_size;
};
static void write_graph_chunk_fanout(struct hashfile *f,
@@ -1134,6 +1137,28 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
stop_progress(&ctx->progress);
}
+static void compute_bloom_filters(struct write_commit_graph_context *ctx)
+{
+ int i;
+ struct progress *progress = NULL;
+
+ init_bloom_filters();
+
+ if (ctx->report_progress)
+ progress = start_delayed_progress(
+ _("Computing commit changed paths Bloom filters"),
+ ctx->commits.nr);
+
+ for (i = 0; i < ctx->commits.nr; i++) {
+ struct commit *c = ctx->commits.list[i];
+ struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
+ ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
+ display_progress(progress, i + 1);
+ }
+
+ stop_progress(&progress);
+}
+
static int add_ref_to_list(const char *refname,
const struct object_id *oid,
int flags, void *cb_data)
@@ -1776,6 +1801,8 @@ int write_commit_graph(struct object_directory *odb,
ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
ctx->check_oids = flags & COMMIT_GRAPH_WRITE_CHECK_OIDS ? 1 : 0;
ctx->split_opts = split_opts;
+ ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
+ ctx->total_bloom_filter_data_size = 0;
if (ctx->split) {
struct commit_graph *g;
@@ -1870,6 +1897,9 @@ int write_commit_graph(struct object_directory *odb,
compute_generation_numbers(ctx);
+ if (ctx->changed_paths)
+ compute_bloom_filters(ctx);
+
res = write_commit_graph_file(ctx);
if (ctx->split)
diff --git a/commit-graph.h b/commit-graph.h
index e87a6f63600..86be81219da 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -79,7 +79,8 @@ enum commit_graph_write_flags {
COMMIT_GRAPH_WRITE_PROGRESS = (1 << 1),
COMMIT_GRAPH_WRITE_SPLIT = (1 << 2),
/* Make sure that each OID in the input is a valid commit OID. */
- COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3)
+ COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3),
+ COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4),
};
struct split_commit_graph_opts {
--
gitgitgadget
^ permalink raw reply related [flat|nested] 159+ messages in thread
* [PATCH v4 07/15] commit-graph: examine changed-path objects in pack order
2020-04-06 16:59 ` [PATCH v4 00/15] Changed Paths Bloom Filters Garima Singh via GitGitGadget
` (5 preceding siblings ...)
2020-04-06 16:59 ` [PATCH v4 06/15] commit-graph: compute Bloom filters for changed paths Garima Singh via GitGitGadget
@ 2020-04-06 16:59 ` Jeff King via GitGitGadget
2020-04-06 16:59 ` [PATCH v4 08/15] commit-graph: examine commits by generation number Garima Singh via GitGitGadget
` (8 subsequent siblings)
15 siblings, 0 replies; 159+ messages in thread
From: Jeff King via GitGitGadget @ 2020-04-06 16:59 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh, Jeff King
From: Jeff King <peff@peff.net>
Looking at the diff of commit objects in pack order is much faster than
in sha1 order, as it gives locality to the access of tree deltas
(whereas sha1 order is effectively random). Unfortunately the
commit-graph code sorts the commits (several times, sometimes as an oid
and sometimes a pointer-to-commit), and we ultimately traverse in sha1
order.
Instead, let's remember the position at which we see each commit, and
traverse in that order when looking at bloom filters. This drops my time
for "git commit-graph write --changed-paths" in linux.git from ~4
minutes to ~1.5 minutes.
Probably the "--reachable" code path would want something similar.
Or alternatively, we could use a different data structure (either a
hash, or maybe even just a bit in "struct commit") to keep track of
which oids we've seen, etc., instead of sorting. And then we could keep
the original order.
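The "remember the position, then sort by it" idea can be shown on its own with a small sketch. This is a simplified model, assuming a hypothetical struct item with an explicit pos field instead of a commit slab; item_pos_cmp() and sort_by_seen_order() are names invented for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical item: "pos" records the order in which it was first seen. */
struct item {
	int id;
	int pos;
};

/* Compare by the recorded position values. */
static int item_pos_cmp(const void *va, const void *vb)
{
	const struct item *a = *(const struct item *const *)va;
	const struct item *b = *(const struct item *const *)vb;

	return (a->pos > b->pos) - (a->pos < b->pos);
}

/* Sort an array of item pointers back into first-seen order. */
static void sort_by_seen_order(struct item **items, size_t nr)
{
	assert(items && nr);
	qsort(items, nr, sizeof(*items), item_pos_cmp);
}
```

However the list was reordered in between (for example by sorting on id), sorting with this comparator restores the order in which the items were first encountered.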
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
commit-graph.c | 38 +++++++++++++++++++++++++++++++++++---
1 file changed, 35 insertions(+), 3 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index 862a00d67ed..31b06f878ce 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -17,6 +17,7 @@
#include "replace-object.h"
#include "progress.h"
#include "bloom.h"
+#include "commit-slab.h"
#define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
@@ -46,9 +47,32 @@
/* Remember to update object flag allocation in object.h */
#define REACHABLE (1u<<15)
-char *get_commit_graph_filename(struct object_directory *odb)
+/* Keep track of the order in which commits are added to our list. */
+define_commit_slab(commit_pos, int);
+static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
+
+static void set_commit_pos(struct repository *r, const struct object_id *oid)
+{
+ static int32_t max_pos;
+ struct commit *commit = lookup_commit(r, oid);
+
+ if (!commit)
+ return; /* should never happen, but be lenient */
+
+ *commit_pos_at(&commit_pos, commit) = max_pos++;
+}
+
+static int commit_pos_cmp(const void *va, const void *vb)
{
- return xstrfmt("%s/info/commit-graph", odb->path);
+ const struct commit *a = *(const struct commit **)va;
+ const struct commit *b = *(const struct commit **)vb;
+ return commit_pos_at(&commit_pos, a) -
+ commit_pos_at(&commit_pos, b);
+}
+
+char *get_commit_graph_filename(struct object_directory *obj_dir)
+{
+ return xstrfmt("%s/info/commit-graph", obj_dir->path);
}
static char *get_split_graph_filename(struct object_directory *odb,
@@ -1021,6 +1045,8 @@ static int add_packed_commits(const struct object_id *oid,
oidcpy(&(ctx->oids.list[ctx->oids.nr]), oid);
ctx->oids.nr++;
+ set_commit_pos(ctx->r, oid);
+
return 0;
}
@@ -1141,6 +1167,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
{
int i;
struct progress *progress = NULL;
+ struct commit **sorted_commits;
init_bloom_filters();
@@ -1149,13 +1176,18 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
_("Computing commit changed paths Bloom filters"),
ctx->commits.nr);
+ ALLOC_ARRAY(sorted_commits, ctx->commits.nr);
+ COPY_ARRAY(sorted_commits, ctx->commits.list, ctx->commits.nr);
+ QSORT(sorted_commits, ctx->commits.nr, commit_pos_cmp);
+
for (i = 0; i < ctx->commits.nr; i++) {
- struct commit *c = ctx->commits.list[i];
+ struct commit *c = sorted_commits[i];
struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
display_progress(progress, i + 1);
}
+ free(sorted_commits);
stop_progress(&progress);
}
--
gitgitgadget
^ permalink raw reply related [flat|nested] 159+ messages in thread
* [PATCH v4 08/15] commit-graph: examine commits by generation number
2020-04-06 16:59 ` [PATCH v4 00/15] Changed Paths Bloom Filters Garima Singh via GitGitGadget
` (6 preceding siblings ...)
2020-04-06 16:59 ` [PATCH v4 07/15] commit-graph: examine changed-path objects in pack order Jeff King via GitGitGadget
@ 2020-04-06 16:59 ` Garima Singh via GitGitGadget
2020-04-06 16:59 ` [PATCH v4 09/15] commit-graph: write Bloom filters to commit graph file Garima Singh via GitGitGadget
` (7 subsequent siblings)
15 siblings, 0 replies; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-04-06 16:59 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
When running 'git commit-graph write --changed-paths', we sort the
commits by pack-order to save time when computing the changed-paths
bloom filters. This does not help when finding the commits via the
'--reachable' flag.
If not using pack-order, then sort by generation number before
examining the diff. Commits with similar generation are more likely
to have many trees in common, making the diff faster.
On the Linux kernel repository, this change reduced the computation
time for 'git commit-graph write --reachable --changed-paths' from
3m00s to 1m37s.
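The ordering described above, generation first and date as a tie-breaker, can be tried out in a standalone sketch. struct toy_commit and the function names below are hypothetical stand-ins for this example and are not Git's struct commit.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in carrying only the two fields used for ordering. */
struct toy_commit {
	uint32_t generation;
	uint64_t date;
};

/* Lower generation first; equal generations are ordered by date. */
static int toy_gen_cmp(const void *va, const void *vb)
{
	const struct toy_commit *a = va, *b = vb;

	if (a->generation != b->generation)
		return a->generation < b->generation ? -1 : 1;
	if (a->date != b->date)
		return a->date < b->date ? -1 : 1;
	return 0;
}

static void sort_by_generation(struct toy_commit *list, size_t nr)
{
	assert(list && nr);
	qsort(list, nr, sizeof(*list), toy_gen_cmp);
}
```

Explicit comparisons are used instead of subtraction so that the unsigned fields cannot wrap around and produce a wrong sign.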
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
commit-graph.c | 33 ++++++++++++++++++++++++++++++---
1 file changed, 30 insertions(+), 3 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index 31b06f878ce..732c81fa1b2 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -70,6 +70,25 @@ static int commit_pos_cmp(const void *va, const void *vb)
commit_pos_at(&commit_pos, b);
}
+static int commit_gen_cmp(const void *va, const void *vb)
+{
+ const struct commit *a = *(const struct commit **)va;
+ const struct commit *b = *(const struct commit **)vb;
+
+ /* lower generation commits first */
+ if (a->generation < b->generation)
+ return -1;
+ else if (a->generation > b->generation)
+ return 1;
+
+ /* use date as a heuristic when generations are equal */
+ if (a->date < b->date)
+ return -1;
+ else if (a->date > b->date)
+ return 1;
+ return 0;
+}
+
char *get_commit_graph_filename(struct object_directory *obj_dir)
{
return xstrfmt("%s/info/commit-graph", obj_dir->path);
@@ -815,7 +834,8 @@ struct write_commit_graph_context {
report_progress:1,
split:1,
check_oids:1,
- changed_paths:1;
+ changed_paths:1,
+ order_by_pack:1;
const struct split_commit_graph_opts *split_opts;
size_t total_bloom_filter_data_size;
@@ -1178,7 +1198,11 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
ALLOC_ARRAY(sorted_commits, ctx->commits.nr);
COPY_ARRAY(sorted_commits, ctx->commits.list, ctx->commits.nr);
- QSORT(sorted_commits, ctx->commits.nr, commit_pos_cmp);
+
+ if (ctx->order_by_pack)
+ QSORT(sorted_commits, ctx->commits.nr, commit_pos_cmp);
+ else
+ QSORT(sorted_commits, ctx->commits.nr, commit_gen_cmp);
for (i = 0; i < ctx->commits.nr; i++) {
struct commit *c = sorted_commits[i];
@@ -1884,6 +1908,7 @@ int write_commit_graph(struct object_directory *odb,
}
if (pack_indexes) {
+ ctx->order_by_pack = 1;
if ((res = fill_oids_from_packs(ctx, pack_indexes)))
goto cleanup;
}
@@ -1893,8 +1918,10 @@ int write_commit_graph(struct object_directory *odb,
goto cleanup;
}
- if (!pack_indexes && !commit_hex)
+ if (!pack_indexes && !commit_hex) {
+ ctx->order_by_pack = 1;
fill_oids_from_all_packs(ctx);
+ }
close_reachable(ctx);
--
gitgitgadget
^ permalink raw reply related [flat|nested] 159+ messages in thread
* [PATCH v4 09/15] commit-graph: write Bloom filters to commit graph file
2020-04-06 16:59 ` [PATCH v4 00/15] Changed Paths Bloom Filters Garima Singh via GitGitGadget
` (7 preceding siblings ...)
2020-04-06 16:59 ` [PATCH v4 08/15] commit-graph: examine commits by generation number Garima Singh via GitGitGadget
@ 2020-04-06 16:59 ` Garima Singh via GitGitGadget
2020-05-29 8:57 ` SZEDER Gábor
2020-07-09 17:00 ` [PATCH] commit-graph: fix "Writing out commit graph" progress counter SZEDER Gábor
2020-04-06 16:59 ` [PATCH v4 10/15] commit-graph: reuse existing Bloom filters during write Garima Singh via GitGitGadget
` (6 subsequent siblings)
15 siblings, 2 replies; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-04-06 16:59 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
Update the technical documentation for commit-graph-format with
the formats for the Bloom filter index (BIDX) and Bloom filter
data (BDAT) chunks. Write the computed Bloom filters information
to the commit graph file using this format.
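The cumulative layout of the index chunk can be read as in the sketch below. It is an illustration under stated assumptions: the values are taken to be already decoded into host-order integers, the header of the data chunk is ignored, and struct filter_span and filter_span_at() are names made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct filter_span {
	uint32_t start;	/* first unit of the filter within the data */
	uint32_t len;	/* number of units in the filter */
};

/*
 * bidx[i] holds the cumulative size of filters 0..i, so filter i
 * occupies [bidx[i - 1], bidx[i]), with bidx[-1] taken as 0.
 */
static struct filter_span filter_span_at(const uint32_t *bidx, size_t nr, size_t i)
{
	struct filter_span span;

	assert(i < nr);
	span.start = i ? bidx[i - 1] : 0;
	span.len = bidx[i] - span.start;
	return span;
}
```

With index values 2, 2, 5 the first filter covers units 0 and 1, the second filter is empty, and the third covers units 2 through 4.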
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
.../technical/commit-graph-format.txt | 30 +++++
commit-graph.c | 113 +++++++++++++++++-
commit-graph.h | 5 +
3 files changed, 147 insertions(+), 1 deletion(-)
diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index a4f17441aed..de56f9f1efd 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -17,6 +17,9 @@ metadata, including:
- The parents of the commit, stored using positional references within
the graph file.
+- The Bloom filter of the commit carrying the paths that were changed between
+ the commit and its first parent, if requested.
+
These positional references are stored as unsigned 32-bit integers
corresponding to the array position within the list of commit OIDs. Due
to some special constants we use to track parents, we can store at most
@@ -93,6 +96,33 @@ CHUNK DATA:
positions for the parents until reaching a value with the most-significant
bit on. The other bits correspond to the position of the last parent.
+ Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) (N * 4 bytes) [Optional]
+ * The ith entry, BIDX[i], stores the number of 8-byte word blocks in all
+ Bloom filters from commit 0 to commit i (inclusive) in lexicographic
+ order. The Bloom filter for the i-th commit spans from BIDX[i-1] to
+ BIDX[i] (plus header length), where BIDX[-1] is 0.
+ * The BIDX chunk is ignored if the BDAT chunk is not present.
+
+ Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
+ * It starts with header consisting of three unsigned 32-bit integers:
+ - Version of the hash algorithm being used. We currently only support
+ value 1 which corresponds to the 32-bit version of the murmur3 hash
+ implemented exactly as described in
+ https://en.wikipedia.org/wiki/MurmurHash#Algorithm and the double
+ hashing technique using seed values 0x293ae76f and 0x7e646e2 as
+ described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
+ in Probabilistic Verification"
+ - The number of times a path is hashed and hence the number of bit positions
+ that cumulatively determine whether a file is present in the commit.
+ - The minimum number of bits 'b' per entry in the Bloom filter. If the filter
+ contains 'n' entries, then the filter size is the minimum number of 64-bit
+ words that contain n*b bits.
+ * The rest of the chunk is the concatenation of all the computed Bloom
+ filters for the commits in lexicographic order.
+ * Note: Commits with no changes or more than 512 changes have Bloom filters
+ of length zero.
+ * The BDAT chunk is present if and only if BIDX is present.
+
Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
This list of H-byte hashes describe a set of B commit-graph files that
form a commit-graph chain. The graph position for the ith commit in this
diff --git a/commit-graph.c b/commit-graph.c
index 732c81fa1b2..a8b6b5cca5d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -24,8 +24,10 @@
#define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
#define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
+#define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
+#define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
#define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
-#define MAX_NUM_CHUNKS 5
+#define MAX_NUM_CHUNKS 7
#define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
@@ -319,6 +321,32 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
chunk_repeated = 1;
else
graph->chunk_base_graphs = data + chunk_offset;
+ break;
+
+ case GRAPH_CHUNKID_BLOOMINDEXES:
+ if (graph->chunk_bloom_indexes)
+ chunk_repeated = 1;
+ else
+ graph->chunk_bloom_indexes = data + chunk_offset;
+ break;
+
+ case GRAPH_CHUNKID_BLOOMDATA:
+ if (graph->chunk_bloom_data)
+ chunk_repeated = 1;
+ else {
+ uint32_t hash_version;
+ graph->chunk_bloom_data = data + chunk_offset;
+ hash_version = get_be32(data + chunk_offset);
+
+ if (hash_version != 1)
+ break;
+
+ graph->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
+ graph->bloom_filter_settings->hash_version = hash_version;
+ graph->bloom_filter_settings->num_hashes = get_be32(data + chunk_offset + 4);
+ graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
+ }
+ break;
}
if (chunk_repeated) {
@@ -337,6 +365,15 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
last_chunk_offset = chunk_offset;
}
+ if (graph->chunk_bloom_indexes && graph->chunk_bloom_data) {
+ init_bloom_filters();
+ } else {
+ /* We need both the bloom chunks to exist together. Else ignore the data */
+ graph->chunk_bloom_indexes = NULL;
+ graph->chunk_bloom_data = NULL;
+ graph->bloom_filter_settings = NULL;
+ }
+
hashcpy(graph->oid.hash, graph->data + graph->data_len - graph->hash_len);
if (verify_commit_graph_lite(graph)) {
@@ -1034,6 +1071,59 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
}
}
+static void write_graph_chunk_bloom_indexes(struct hashfile *f,
+ struct write_commit_graph_context *ctx)
+{
+ struct commit **list = ctx->commits.list;
+ struct commit **last = ctx->commits.list + ctx->commits.nr;
+ uint32_t cur_pos = 0;
+ struct progress *progress = NULL;
+ int i = 0;
+
+ if (ctx->report_progress)
+ progress = start_delayed_progress(
+ _("Writing changed paths Bloom filters index"),
+ ctx->commits.nr);
+
+ while (list < last) {
+ struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
+ cur_pos += filter->len;
+ display_progress(progress, ++i);
+ hashwrite_be32(f, cur_pos);
+ list++;
+ }
+
+ stop_progress(&progress);
+}
+
+static void write_graph_chunk_bloom_data(struct hashfile *f,
+ struct write_commit_graph_context *ctx,
+ const struct bloom_filter_settings *settings)
+{
+ struct commit **list = ctx->commits.list;
+ struct commit **last = ctx->commits.list + ctx->commits.nr;
+ struct progress *progress = NULL;
+ int i = 0;
+
+ if (ctx->report_progress)
+ progress = start_delayed_progress(
+ _("Writing changed paths Bloom filters data"),
+ ctx->commits.nr);
+
+ hashwrite_be32(f, settings->hash_version);
+ hashwrite_be32(f, settings->num_hashes);
+ hashwrite_be32(f, settings->bits_per_entry);
+
+ while (list < last) {
+ struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
+ display_progress(progress, ++i);
+ hashwrite(f, filter->data, filter->len * sizeof(unsigned char));
+ list++;
+ }
+
+ stop_progress(&progress);
+}
+
static int oid_compare(const void *_a, const void *_b)
{
const struct object_id *a = (const struct object_id *)_a;
@@ -1438,6 +1528,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
struct strbuf progress_title = STRBUF_INIT;
int num_chunks = 3;
struct object_id file_hash;
+ const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
if (ctx->split) {
struct strbuf tmp_file = STRBUF_INIT;
@@ -1482,6 +1573,12 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
num_chunks++;
}
+ if (ctx->changed_paths) {
+ chunk_ids[num_chunks] = GRAPH_CHUNKID_BLOOMINDEXES;
+ num_chunks++;
+ chunk_ids[num_chunks] = GRAPH_CHUNKID_BLOOMDATA;
+ num_chunks++;
+ }
if (ctx->num_commit_graphs_after > 1) {
chunk_ids[num_chunks] = GRAPH_CHUNKID_BASE;
num_chunks++;
@@ -1500,6 +1597,15 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
4 * ctx->num_extra_edges;
num_chunks++;
}
+ if (ctx->changed_paths) {
+ chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
+ sizeof(uint32_t) * ctx->commits.nr;
+ num_chunks++;
+
+ chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
+ sizeof(uint32_t) * 3 + ctx->total_bloom_filter_data_size;
+ num_chunks++;
+ }
if (ctx->num_commit_graphs_after > 1) {
chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
hashsz * (ctx->num_commit_graphs_after - 1);
@@ -1537,6 +1643,10 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
write_graph_chunk_data(f, hashsz, ctx);
if (ctx->num_extra_edges)
write_graph_chunk_extra_edges(f, ctx);
+ if (ctx->changed_paths) {
+ write_graph_chunk_bloom_indexes(f, ctx);
+ write_graph_chunk_bloom_data(f, ctx, &bloom_settings);
+ }
if (ctx->num_commit_graphs_after > 1 &&
write_graph_chunk_base(f, ctx)) {
return -1;
@@ -2184,6 +2294,7 @@ void free_commit_graph(struct commit_graph *g)
close(g->graph_fd);
}
free(g->filename);
+ free(g->bloom_filter_settings);
free(g);
}
diff --git a/commit-graph.h b/commit-graph.h
index 86be81219da..8e7a8e0e5b2 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -11,6 +11,7 @@
#define GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD"
struct commit;
+struct bloom_filter_settings;
char *get_commit_graph_filename(struct object_directory *odb);
int open_commit_graph(const char *graph_file, int *fd, struct stat *st);
@@ -59,6 +60,10 @@ struct commit_graph {
const unsigned char *chunk_commit_data;
const unsigned char *chunk_extra_edges;
const unsigned char *chunk_base_graphs;
+ const unsigned char *chunk_bloom_indexes;
+ const unsigned char *chunk_bloom_data;
+
+ struct bloom_filter_settings *bloom_filter_settings;
};
struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st,
--
gitgitgadget
^ permalink raw reply related [flat|nested] 159+ messages in thread
* Re: [PATCH v4 09/15] commit-graph: write Bloom filters to commit graph file
2020-04-06 16:59 ` [PATCH v4 09/15] commit-graph: write Bloom filters to commit graph file Garima Singh via GitGitGadget
@ 2020-05-29 8:57 ` SZEDER Gábor
2020-05-29 13:35 ` Derrick Stolee
2020-07-09 17:00 ` [PATCH] commit-graph: fix "Writing out commit graph" progress counter SZEDER Gábor
1 sibling, 1 reply; 159+ messages in thread
From: SZEDER Gábor @ 2020-05-29 8:57 UTC (permalink / raw)
To: Garima Singh via GitGitGadget; +Cc: git, stolee, jonathantanmy, Garima Singh
On Mon, Apr 06, 2020 at 04:59:49PM +0000, Garima Singh via GitGitGadget wrote:
> From: Garima Singh <garima.singh@microsoft.com>
>
> Update the technical documentation for commit-graph-format with
> the formats for the Bloom filter index (BIDX) and Bloom filter
> data (BDAT) chunks. Write the computed Bloom filters information
> to the commit graph file using this format.
>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
> .../technical/commit-graph-format.txt | 30 +++++
> commit-graph.c | 113 +++++++++++++++++-
> commit-graph.h | 5 +
> 3 files changed, 147 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> index a4f17441aed..de56f9f1efd 100644
> --- a/Documentation/technical/commit-graph-format.txt
> +++ b/Documentation/technical/commit-graph-format.txt
> @@ -17,6 +17,9 @@ metadata, including:
> - The parents of the commit, stored using positional references within
> the graph file.
>
> +- The Bloom filter of the commit carrying the paths that were changed between
> + the commit and its first parent, if requested.
> +
> These positional references are stored as unsigned 32-bit integers
> corresponding to the array position within the list of commit OIDs. Due
> to some special constants we use to track parents, we can store at most
> @@ -93,6 +96,33 @@ CHUNK DATA:
> positions for the parents until reaching a value with the most-significant
> bit on. The other bits correspond to the position of the last parent.
>
> + Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) (N * 4 bytes) [Optional]
> + * The ith entry, BIDX[i], stores the number of 8-byte word blocks in all
This is inconsistent with the implementation: according to the code in
one of the previous patches these entries are simple byte offsets, not
8-byte word offsets, i.e. the combined size of all modified path
Bloom filters can be at most 2^32 bytes.
The commit-graph file can contain information about at most 2^31-1
commits. This means that with that many commits each commit can have
a mere 2 byte Bloom filter on average. When using 7 hashes we'd
need 10 bits per path, so in two bytes we could store only a single
path.
Clearly, using 4 byte index entries significantly lowers the max
number of commits that can be stored with modified path Bloom filters.
IMO every new chunk must support at least 2^31-1 commits.
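To make the arithmetic above concrete, here is a minimal sketch in C.
The macro and function names are made up for illustration and do not
exist in commit-graph.c; the 10 bits per path figure is the one quoted
above for 7 hashes.

```c
#include <stdint.h>

/* Number of bytes addressable by a 4-byte cumulative byte offset. */
#define BIDX_ADDRESSABLE_BYTES (UINT64_C(1) << 32)
/* Commit count limit of the commit-graph file, as stated above. */
#define MAX_GRAPH_COMMITS ((UINT64_C(1) << 31) - 1)

/* Average filter size in bytes if nr_commits (> 0) share the space. */
static uint64_t avg_filter_bytes(uint64_t nr_commits)
{
	return BIDX_ADDRESSABLE_BYTES / nr_commits;
}

/* How many paths fit in a filter of the given size. */
static uint64_t paths_per_filter(uint64_t filter_bytes, uint64_t bits_per_path)
{
	return filter_bytes * 8 / bits_per_path;
}
```

With MAX_GRAPH_COMMITS commits avg_filter_bytes() yields 2, and
paths_per_filter(2, 10) yields 1, i.e. a single path per commit.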
> + Bloom filters from commit 0 to commit i (inclusive) in lexicographic
> + order. The Bloom filter for the i-th commit spans from BIDX[i-1] to
> + BIDX[i] (plus header length), where BIDX[-1] is 0.
> + * The BIDX chunk is ignored if the BDAT chunk is not present.
> +
> + Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
> + * It starts with header consisting of three unsigned 32-bit integers:
> + - Version of the hash algorithm being used. We currently only support
> + value 1 which corresponds to the 32-bit version of the murmur3 hash
> + implemented exactly as described in
> + https://en.wikipedia.org/wiki/MurmurHash#Algorithm and the double
> + hashing technique using seed values 0x293ae76f and 0x7e646e2 as
> + described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
> + in Probabilistic Verification"
How should double hashing compute the k hashes, i.e. using 64 bit or
32 bit unsigned integer arithmetic?
I'm puzzled that you link to this paper and still use double hashing.
Two of the contributions of that paper are that it points out some
shortcomings of the double hashing scheme and provides a better
alternative in the form of enhanced double hashing, which can cut the
false positive rate in half.
However, that paper considers the hashing scheme only in the context
of one big Bloom filter. I've found that when it comes to many small
Bloom filters then the k hashes produced by any double hashing variant
are not independent enough, and "standard" double hashing fares the
worst among them. There are real repositories out there where double
hashing has over an order of magnitude higher average false positive
rate than enhanced double hashing. Though that's not to say that
enhanced double hashing is good...
For details on these issues see
https://public-inbox.org/git/20200529085038.26008-16-szeder.dev@gmail.com
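For readers who do not have the paper at hand, the two schemes can be
sketched roughly as follows. This is only an illustration under my own
assumptions: the function names are hypothetical, and the use of
wrapping 32 bit unsigned arithmetic before the final modulo is exactly
the detail that the spec leaves open, so it is a choice made for this
sketch, not a statement about the implementation.

```c
#include <stdint.h>

/*
 * "Standard" double hashing: the i-th probe is h1 + i * h2, reduced
 * modulo the filter size in bits (nr_bits must be non-zero).
 * Assumption: sums and products wrap at 32 bits before the modulo.
 */
static uint32_t double_hash(uint32_t h1, uint32_t h2, uint32_t i,
			    uint32_t nr_bits)
{
	return (h1 + i * h2) % nr_bits;
}

/*
 * Enhanced double hashing: the increment itself grows with each step,
 * which gives h1 + i * h2 + (i^3 - i) / 6 for the i-th probe.
 */
static uint32_t enhanced_double_hash(uint32_t h1, uint32_t h2, uint32_t i,
				     uint32_t nr_bits)
{
	uint32_t a = h1, b = h2, j;

	for (j = 1; j <= i; j++) {
		a += b;	/* advance by the current increment */
		b += j;	/* and make the next increment larger */
	}
	return a % nr_bits;
}
```

With h1 = 1, h2 = 2 and a 64 bit filter the first four probes are
1, 3, 5, 7 for double hashing and 1, 3, 6, 11 for the enhanced variant.
Note that with 64 bit arithmetic the results can differ whenever the
filter size is not a power of two.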
> + - The number of times a path is hashed and hence the number of bit positions
> + that cumulatively determine whether a file is present in the commit.
> + - The minimum number of bits 'b' per entry in the Bloom filter. If the filter
> + contains 'n' entries, then the filter size is the minimum number of 64-bit
> + words that contain n*b bits.
Since the ideal number of bits per element depends only on the number
of hashes per path (k / ln(2) ≈ k * 10 / 7), why is this value stored
in the commit-graph?
> + * The rest of the chunk is the concatenation of all the computed Bloom
> + filters for the commits in lexicographic order.
> + * Note: Commits with no changes or more than 512 changes have Bloom filters
> + of length zero.
What does this "Note:" prefix mean in the file format specification?
Can an implementation use a one byte Bloom filter with no bits set for
a commit with no changes? Can an implementation still store a Bloom
filter for commits that modify more than 512 paths?
> + * The BDAT chunk is present if and only if BIDX is present.
> +
> Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
> This list of H-byte hashes describe a set of B commit-graph files that
> form a commit-graph chain. The graph position for the ith commit in this
> diff --git a/commit-graph.c b/commit-graph.c
> index 732c81fa1b2..a8b6b5cca5d 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1034,6 +1071,59 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
> }
> }
>
> +static void write_graph_chunk_bloom_indexes(struct hashfile *f,
> + struct write_commit_graph_context *ctx)
> +{
> + struct commit **list = ctx->commits.list;
> + struct commit **last = ctx->commits.list + ctx->commits.nr;
> + uint32_t cur_pos = 0;
> + struct progress *progress = NULL;
> + int i = 0;
> +
> + if (ctx->report_progress)
> + progress = start_delayed_progress(
> + _("Writing changed paths Bloom filters index"),
> + ctx->commits.nr);
> +
> + while (list < last) {
> + struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
> + cur_pos += filter->len;
Given a sufficiently large number of commits with large enough Bloom
filters this will silently overflow.
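A check along these lines before the addition would at least turn the
silent wraparound into a detectable condition. This is a minimal
standalone sketch; the helper name is hypothetical, and inside Git an
existing helper such as unsigned_add_overflows() might be a better fit.

```c
#include <stdint.h>

/*
 * Return 1 if adding filter_len to the running 32 bit offset cur_pos
 * would exceed what a 4-byte BIDX entry can store, 0 otherwise.
 */
static int bloom_offset_would_overflow(uint32_t cur_pos, uint64_t filter_len)
{
	return filter_len > (uint64_t)UINT32_MAX - cur_pos;
}
```

The writer could then bail out (or refuse to write the Bloom chunks)
when the check fires instead of emitting wrapped offsets.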
> + display_progress(progress, ++i);
> + hashwrite_be32(f, cur_pos);
> + list++;
> + }
> +
> + stop_progress(&progress);
> +}
^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v4 09/15] commit-graph: write Bloom filters to commit graph file
2020-05-29 8:57 ` SZEDER Gábor
@ 2020-05-29 13:35 ` Derrick Stolee
2020-05-31 17:23 ` SZEDER Gábor
0 siblings, 1 reply; 159+ messages in thread
From: Derrick Stolee @ 2020-05-29 13:35 UTC (permalink / raw)
To: SZEDER Gábor, Garima Singh via GitGitGadget
Cc: git, jonathantanmy, Garima Singh
On 5/29/2020 4:57 AM, SZEDER Gábor wrote:
> On Mon, Apr 06, 2020 at 04:59:49PM +0000, Garima Singh via GitGitGadget wrote:
>> From: Garima Singh <garima.singh@microsoft.com>
>>
>> Update the technical documentation for commit-graph-format with
>> the formats for the Bloom filter index (BIDX) and Bloom filter
>> data (BDAT) chunks. Write the computed Bloom filters information
>> to the commit graph file using this format.
>>
>> Helped-by: Derrick Stolee <dstolee@microsoft.com>
>> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
>> ---
>> .../technical/commit-graph-format.txt | 30 +++++
>> commit-graph.c | 113 +++++++++++++++++-
>> commit-graph.h | 5 +
>> 3 files changed, 147 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
>> index a4f17441aed..de56f9f1efd 100644
>> --- a/Documentation/technical/commit-graph-format.txt
>> +++ b/Documentation/technical/commit-graph-format.txt
>> @@ -17,6 +17,9 @@ metadata, including:
>> - The parents of the commit, stored using positional references within
>> the graph file.
>>
>> +- The Bloom filter of the commit carrying the paths that were changed between
>> + the commit and its first parent, if requested.
>> +
>> These positional references are stored as unsigned 32-bit integers
>> corresponding to the array position within the list of commit OIDs. Due
>> to some special constants we use to track parents, we can store at most
>> @@ -93,6 +96,33 @@ CHUNK DATA:
>> positions for the parents until reaching a value with the most-significant
>> bit on. The other bits correspond to the position of the last parent.
>>
>> + Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) (N * 4 bytes) [Optional]
>> + * The ith entry, BIDX[i], stores the number of 8-byte word blocks in all
>
> This is inconsistent with the implementation: according to the code in
> one of the previous patches these entries are simple byte offsets, not
> 8-byte word offsets, i.e. the combined size of all modified path
> Bloom filters can be at most 2^32 bytes.
The documentation was fixed in 88093289cdc (Documentation: changed-path Bloom
filters use byte words, 2020-05-11).
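With byte offsets, locating the i-th filter from the BIDX chunk could
look roughly like the sketch below. It is not the code in bloom.c: the
function names and the sample_bidx array are invented for illustration,
and the 12 byte header is derived from the three 32-bit header fields
described in the format document.

```c
#include <stddef.h>
#include <stdint.h>

/* Three unsigned 32-bit integers at the start of the BDAT chunk. */
#define BDAT_HEADER_BYTES 12

/* Hypothetical BIDX contents: cumulative byte sizes 8, 8, 24. */
static const unsigned char sample_bidx[] = {
	0, 0, 0, 8,	/* commit 0: 8 byte filter */
	0, 0, 0, 8,	/* commit 1: zero length filter */
	0, 0, 0, 24,	/* commit 2: 16 byte filter */
};

/* Read a big-endian 32-bit value. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Cumulative size before filter i, i.e. BIDX[i-1] with BIDX[-1] = 0. */
static uint32_t bidx_prev(const unsigned char *bidx, uint32_t i)
{
	return i ? read_be32(bidx + 4 * (size_t)(i - 1)) : 0;
}

/* Byte offset of filter i from the start of the BDAT chunk. */
static uint64_t filter_start(const unsigned char *bidx, uint32_t i)
{
	return (uint64_t)BDAT_HEADER_BYTES + bidx_prev(bidx, i);
}

/* Length of filter i in bytes. */
static uint32_t filter_len(const unsigned char *bidx, uint32_t i)
{
	return read_be32(bidx + 4 * (size_t)i) - bidx_prev(bidx, i);
}
```

For the sample data, filter 2 starts at offset 20 and is 16 bytes long,
while filter 1 has length zero.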
> The commit-graph file can contain information about at most 2^31-1
> commits. This means that with that many commits each commit can have
> a mere 2 byte Bloom filter on average. When using 7 hashes we'd
> need 10 bits per path, so in two bytes we could store only a single
> path.
>
> Clearly, using 4 byte index entries significantly lowers the max
> number of commits that can be stored with modified path Bloom filters.
This is a good point, and certainly the reason for 8-byte multiples.
> IMO every new chunk must support at least 2^31-1 commits.
I'm not sure this is a valid requirement. Even extremely large repositories
(that are created by actual use, not synthetic) are on the scale of 2^24
commits.
You are right that we should make the commit-graph write process more robust
to reaching these limits. You point out that we have a new limit when these
filters are enabled.
For reference, the Windows OS repo has ~4.25 million commits and the
commit-graph file with changed-path Bloom filters is around 520MB. That's
the whole file size, and without the filters it's around 240MB, so the
filters are taking <300MB ~ 2^29 bytes and we would need to grow the repo
by 8x to hit this limit. That's not an unreasonable amount of growth, but
is also far enough away that we can handle it in time.
The incremental commit-graph can actually save us here (and is similar to
how we solved a scale issue in Azure Repos around the multi-pack-index):
we can refuse to merge layers of an incremental commit-graph if the
changed-path filters would exceed the size limit. Of course, the _first_
write of such a commit-graph would need to be aware of this limit and
plan for it in advance, but that's also a theoretical issue.
I'm tracking some follow-up work [1] for the changed-path filters,
including a way to limit the number of filters computed in one
"git commit-graph write" process. I'll make note of your concerns here,
too.
[1] https://github.com/microsoft/git/issues/272
>> + Bloom filters from commit 0 to commit i (inclusive) in lexicographic
>> + order. The Bloom filter for the i-th commit spans from BIDX[i-1] to
>> + BIDX[i] (plus header length), where BIDX[-1] is 0.
>> + * The BIDX chunk is ignored if the BDAT chunk is not present.
>> +
>> + Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
>> + * It starts with header consisting of three unsigned 32-bit integers:
>> + - Version of the hash algorithm being used. We currently only support
>> + value 1 which corresponds to the 32-bit version of the murmur3 hash
>> + implemented exactly as described in
>> + https://en.wikipedia.org/wiki/MurmurHash#Algorithm and the double
>> + hashing technique using seed values 0x293ae76f and 0x7e646e2 as
>> + described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
>> + in Probabilistic Verification"
>
> How should double hashing compute the k hashes, i.e. using 64 bit or
> 32 bit unsigned integer arithmetic?
>
> I'm puzzled that you link to this paper and still use double hashing.
>
> Two of the contributions of that paper are that it points out some
> shortcomings of the double hashing scheme and provides a better
> alternative in the form of enhanced double hashing, which can cut the
> false positive rate in half.
>
> However, that paper considers the hashing scheme only in the context
> of one big Bloom filter. I've found that when it comes to many small
> Bloom filters then the k hashes produced by any double hashing variant
> are not independent enough, and "standard" double hashing fares the
> worst among them. There are real repositories out there where double
> hashing has over an order of magnitude higher average false positive
> rate than enhanced double hashing. Though that's not to say that
> enhanced double hashing is good...
>
> For details on these issues see
>
> https://public-inbox.org/git/20200529085038.26008-16-szeder.dev@gmail.com
That message includes very detailed experimental analysis, which is nice.
We will need to do some concrete side-by-side comparisons to see if there
actually is a meaningful difference. (You may have already done this.)
>> + - The number of times a path is hashed and hence the number of bit positions
>> + that cumulatively determine whether a file is present in the commit.
>> + - The minimum number of bits 'b' per entry in the Bloom filter. If the filter
>> + contains 'n' entries, then the filter size is the minimum number of 64-bit
>> + words that contain n*b bits.
>
> Since the ideal number of bits per element depends only on the number
> of hashes per path (k / ln(2) ≈ k * 10 / 7), why is this value stored
> in the commit-graph?
The ideal number depends also on what false-positive rate you want. In a
hypothetical future where we want to allow customization here, we want
the filters to be consistently sized across all filters.
>> + * The rest of the chunk is the concatenation of all the computed Bloom
>> + filters for the commits in lexicographic order.
>> + * Note: Commits with no changes or more than 512 changes have Bloom filters
>> + of length zero.
>
> What does this "Note:" prefix mean in the file format specification?
>
> Can an implementation use a one byte Bloom filter with no bits set for
> a commit with no changes? Can an implementation still store a Bloom
> filter for commits that modify more than 512 paths?
This is currently due to a hard-coded value in the implementation. It's not a
requirement of the file format.
>> + * The BDAT chunk is present if and only if BIDX is present.
>> +
>> Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
>> This list of H-byte hashes describe a set of B commit-graph files that
>> form a commit-graph chain. The graph position for the ith commit in this
>> diff --git a/commit-graph.c b/commit-graph.c
>> index 732c81fa1b2..a8b6b5cca5d 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>
>> @@ -1034,6 +1071,59 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
>> }
>> }
>>
>> +static void write_graph_chunk_bloom_indexes(struct hashfile *f,
>> + struct write_commit_graph_context *ctx)
>> +{
>> + struct commit **list = ctx->commits.list;
>> + struct commit **last = ctx->commits.list + ctx->commits.nr;
>> + uint32_t cur_pos = 0;
>> + struct progress *progress = NULL;
>> + int i = 0;
>> +
>> + if (ctx->report_progress)
>> + progress = start_delayed_progress(
>> + _("Writing changed paths Bloom filters index"),
>> + ctx->commits.nr);
>> +
>> + while (list < last) {
>> + struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
>> + cur_pos += filter->len;
>
> Given a sufficiently large number of commits with large enough Bloom
> filters this will silently overflow.
Worth fixing, but we are not in a rush. I noted it in my GitHub issue.
Thanks,
-Stolee
^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v4 09/15] commit-graph: write Bloom filters to commit graph file
2020-05-29 13:35 ` Derrick Stolee
@ 2020-05-31 17:23 ` SZEDER Gábor
0 siblings, 0 replies; 159+ messages in thread
From: SZEDER Gábor @ 2020-05-31 17:23 UTC (permalink / raw)
To: Derrick Stolee
Cc: Garima Singh via GitGitGadget, git, jonathantanmy, Garima Singh
On Fri, May 29, 2020 at 09:35:17AM -0400, Derrick Stolee wrote:
> >> + Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) (N * 4 bytes) [Optional]
> >> + * The ith entry, BIDX[i], stores the number of 8-byte word blocks in all
> >
> > This is inconsistent with the implementation: according to the code in
> > one of the previous patches these entries are simple byte offsets, not
> > 8-byte word offsets, i.e. the combined size of all modified path
> > Bloom filters can be at most 2^32 bytes.
>
> The documentation was fixed in 88093289cdc (Documentation: changed-path Bloom
> filters use byte words, 2020-05-11).
Oh, good. I'm waaay behind the curve and haven't seen this fix. Even
better, now I also noticed that two bugs I was about to report have
been fixed already (though both fixes have minor flaws).
Ok, so at least the specs are consistent with the implementation. I'm
not sure this was done in the right direction, though, because too
small Bloom filters do hurt performance.
> > Clearly, using 4 byte index entries significantly lowers the max
> > number of commits that can be stored with modified path Bloom filters.
>
> This is a good point, and certainly the reason for 8-byte multiples.
Note that Bloom filters with power-of-two number of bits have higher
false positive probabilities when using some form of double hashing.
When going for 8 byte blocks all commits modifying <= 12 paths
(assuming 7 hashes per path) will have power-of-2 sized Bloom filters
(64 or 128 bits), and that is a lot of commits.
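The numbers above can be reproduced with a small sketch, assuming 10
bits per path (the value that goes with 7 hashes) and rounding up to
whole 64-bit words. The function names are mine, not Git's.

```c
#include <stdint.h>

/* Filter size in bits when rounded up to a multiple of 64 bits. */
static uint32_t filter_bits_64(uint32_t nr_paths, uint32_t bits_per_entry)
{
	uint32_t bits = nr_paths * bits_per_entry;

	return (bits + 63) / 64 * 64;
}

/* Non-zero if v is a power of two. */
static int is_power_of_two(uint32_t v)
{
	return v && !(v & (v - 1));
}
```

With these, 1 to 6 paths give a 64 bit filter, 7 to 12 paths give a 128
bit filter, and 13 paths is the first case (192 bits) where the size is
not a power of two.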
> The incremental commit-graph can actually save us here
Oh, I haven't thought of that.
> >> + Bloom filters from commit 0 to commit i (inclusive) in lexicographic
> >> + order. The Bloom filter for the i-th commit spans from BIDX[i-1] to
> >> + BIDX[i] (plus header length), where BIDX[-1] is 0.
> >> + * The BIDX chunk is ignored if the BDAT chunk is not present.
> >> +
> >> + Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
> >> + * It starts with header consisting of three unsigned 32-bit integers:
> >> + - Version of the hash algorithm being used. We currently only support
> >> + value 1 which corresponds to the 32-bit version of the murmur3 hash
> >> + implemented exactly as described in
> >> + https://en.wikipedia.org/wiki/MurmurHash#Algorithm and the double
> >> + hashing technique using seed values 0x293ae76f and 0x7e646e2 as
> >> + described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
> >> + in Probabilistic Verification"
> >
> > How should double hashing compute the k hashes, i.e. using 64 bit or
> > 32 bit unsigned integer arithmetic?
Note that this should be clarified in the specs.
> >> + - The number of times a path is hashed and hence the number of bit positions
> >> + that cumulatively determine whether a file is present in the commit.
> >> + - The minimum number of bits 'b' per entry in the Bloom filter. If the filter
> >> + contains 'n' entries, then the filter size is the minimum number of 64-bit
> >> + words that contain n*b bits.
> >
> > Since the ideal number of bits per element depends only on the number
> > of hashes per path (k / ln(2) ≈ k * 10 / 7), why is this value stored
> > in the commit-graph?
>
> The ideal number depends also on what false-positive rate you want.
Well, yes, but indirectly: according to Wikipedia :) the optimal
number of hashes per element depends only on the desired false
positive probability, and the optimal number of bits per element
depends only on the number of hashes per element.
So storing the min number of bits per entry seems to be redundant.
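The relationship can be written down in a few lines. This is a sketch
of the textbook formulas for an optimally sized Bloom filter (bits per
element = k / ln(2), false positive probability = 2^-k), with function
names of my own choosing; it says nothing about how Git picks its
defaults.

```c
#include <stdint.h>

#define LN2 0.6931471805599453

/* Optimal number of bits per element for k hashes: k / ln(2). */
static double optimal_bits_per_entry(uint32_t num_hashes)
{
	return num_hashes / LN2;
}

/* False positive probability at that optimum: 2^-k (k < 64). */
static double optimal_false_positive_rate(uint32_t num_hashes)
{
	return 1.0 / (double)(UINT64_C(1) << num_hashes);
}
```

For 7 hashes this gives about 10.1 bits per element and a false
positive probability of 1/128, which is why the number of hashes alone
would seem to be enough to describe the filter parameters.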
> In a
> hypothetical future where we want to allow customization here, we want
> the filters to be consistently sized across all filters.
Wouldn't customizing through the number of hashes be sufficient?
> >> + * Note: Commits with no changes or more than 512 changes have Bloom filters
> >> + of length zero.
> >
> > What does this "Note:" prefix mean in the file format specification?
> >
> > Can an implementation use a one byte Bloom filter with no bits set for
> > a commit with no changes? Can an implementation still store a Bloom
> > filter for commits that modify more than 512 paths?
>
> This is currently due to a hard-coded value in the implementation. It's not a
> requirement of the file format.
Should an implementation detail like that be part of the specs? It
sure caused a bit of confusion here.
^ permalink raw reply [flat|nested] 159+ messages in thread
* [PATCH] commit-graph: fix "Writing out commit graph" progress counter
2020-04-06 16:59 ` [PATCH v4 09/15] commit-graph: write Bloom filters to commit graph file Garima Singh via GitGitGadget
2020-05-29 8:57 ` SZEDER Gábor
@ 2020-07-09 17:00 ` SZEDER Gábor
2020-07-09 18:01 ` Derrick Stolee
1 sibling, 1 reply; 159+ messages in thread
From: SZEDER Gábor @ 2020-07-09 17:00 UTC (permalink / raw)
To: Junio C Hamano
Cc: Garima Singh via GitGitGadget, stolee, jonathantanmy,
Garima Singh, git, SZEDER Gábor
76ffbca71a (commit-graph: write Bloom filters to commit graph file,
2020-04-06) added two delayed progress lines to writing the Bloom
filter index and data chunk. This is wrong, because a single common
progress is used while writing all chunks, which is not updated while
writing these two new chunks, resulting in incomplete-looking "done"
lines:
Expanding reachable commits in commit graph: 888679, done.
Computing commit changed paths Bloom filters: 100% (888678/888678), done.
Writing out commit graph in 6 passes: 66% (3554712/5332068), done.
Use the common 'struct progress' instance while writing the Bloom
filter chunks as well.
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
commit-graph.c | 22 ++--------------------
1 file changed, 2 insertions(+), 20 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index aaf3327ede..65cf32637c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1086,23 +1086,14 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
struct commit **list = ctx->commits.list;
struct commit **last = ctx->commits.list + ctx->commits.nr;
uint32_t cur_pos = 0;
- struct progress *progress = NULL;
- int i = 0;
-
- if (ctx->report_progress)
- progress = start_delayed_progress(
- _("Writing changed paths Bloom filters index"),
- ctx->commits.nr);
while (list < last) {
struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
cur_pos += filter->len;
- display_progress(progress, ++i);
+ display_progress(ctx->progress, ++ctx->progress_cnt);
hashwrite_be32(f, cur_pos);
list++;
}
-
- stop_progress(&progress);
}
static void write_graph_chunk_bloom_data(struct hashfile *f,
@@ -1111,13 +1102,6 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
{
struct commit **list = ctx->commits.list;
struct commit **last = ctx->commits.list + ctx->commits.nr;
- struct progress *progress = NULL;
- int i = 0;
-
- if (ctx->report_progress)
- progress = start_delayed_progress(
- _("Writing changed paths Bloom filters data"),
- ctx->commits.nr);
hashwrite_be32(f, settings->hash_version);
hashwrite_be32(f, settings->num_hashes);
@@ -1125,12 +1109,10 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
while (list < last) {
struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
- display_progress(progress, ++i);
+ display_progress(ctx->progress, ++ctx->progress_cnt);
hashwrite(f, filter->data, filter->len * sizeof(unsigned char));
list++;
}
-
- stop_progress(&progress);
}
static int oid_compare(const void *_a, const void *_b)
--
2.27.0.547.g4ba2d26563
* Re: [PATCH] commit-graph: fix "Writing out commit graph" progress counter
2020-07-09 17:00 ` [PATCH] commit-graph: fix "Writing out commit graph" progress counter SZEDER Gábor
@ 2020-07-09 18:01 ` Derrick Stolee
2020-07-09 18:20 ` Derrick Stolee
0 siblings, 1 reply; 159+ messages in thread
From: Derrick Stolee @ 2020-07-09 18:01 UTC (permalink / raw)
To: SZEDER Gábor, Junio C Hamano
Cc: Garima Singh via GitGitGadget, jonathantanmy, Garima Singh, git
On 7/9/2020 1:00 PM, SZEDER Gábor wrote:
> 76ffbca71a (commit-graph: write Bloom filters to commit graph file,
> 2020-04-06) added two delayed progress lines to writing the Bloom
> filter index and data chunk. This is wrong, because a single common
> progress is used while writing all chunks, which is not updated while
> writing these two new chunks, resulting in incomplete-looking "done"
> lines:
>
> Expanding reachable commits in commit graph: 888679, done.
> Computing commit changed paths Bloom filters: 100% (888678/888678), done.
> Writing out commit graph in 6 passes: 66% (3554712/5332068), done.
>
> Use the common 'struct progress' instance while writing the Bloom
> filter chunks as well.
Thanks for finding this. It's clearly the correct way to go,
and is one of the things that did not get updated properly
when the old prototype was applied on top of the new code
that included this ctx->progress pattern.
Junio: heads-up that this will conflict with the final patch
in ds/maintenance. I'll remove my edits to these methods in
my v2 to make that merge a bit easier.
Thanks,
-Stolee
* Re: [PATCH] commit-graph: fix "Writing out commit graph" progress counter
2020-07-09 18:01 ` Derrick Stolee
@ 2020-07-09 18:20 ` Derrick Stolee
0 siblings, 0 replies; 159+ messages in thread
From: Derrick Stolee @ 2020-07-09 18:20 UTC (permalink / raw)
To: SZEDER Gábor, Junio C Hamano
Cc: Garima Singh via GitGitGadget, jonathantanmy, Garima Singh, git
On 7/9/2020 2:01 PM, Derrick Stolee wrote:
> Junio: heads-up that this will conflict with the final patch
> in ds/maintenance. I'll remove my edits to these methods in
> my v2 to make that merge a bit easier.
Or, I'm getting confused, because I changed start_progress()
calls in midx.c, not commit-graph.c. Please ignore my scattered
brain.
Thanks,
-Stolee
* [PATCH v4 10/15] commit-graph: reuse existing Bloom filters during write
2020-04-06 16:59 ` [PATCH v4 00/15] Changed Paths Bloom Filters Garima Singh via GitGitGadget
` (8 preceding siblings ...)
2020-04-06 16:59 ` [PATCH v4 09/15] commit-graph: write Bloom filters to commit graph file Garima Singh via GitGitGadget
@ 2020-04-06 16:59 ` Garima Singh via GitGitGadget
2020-06-19 14:02 ` SZEDER Gábor
2020-07-27 21:33 ` SZEDER Gábor
2020-04-06 16:59 ` [PATCH v4 11/15] commit-graph: add --changed-paths option to write subcommand Garima Singh via GitGitGadget
` (5 subsequent siblings)
15 siblings, 2 replies; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-04-06 16:59 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
Add logic to
a) parse Bloom filter information from the commit graph file and,
b) re-use existing Bloom filters.
See Documentation/technical/commit-graph-format for the format in which
the Bloom filter information is written to the commit graph file.
To read the Bloom filter for a given commit with lexicographic position
'i' we need to:
1. Read BIDX[i], which gives us the starting index in BDAT for the
filter of commit i+1. In other words, it is the index just past the end
of the filter of commit i. It is called end_index in the code.
2. For i>0, read BIDX[i-1], which gives us the starting index in BDAT
for the filter of commit i. It is called start_index in the code.
For the first commit, where i = 0, Bloom filter data starts at the
beginning, just past the header in the BDAT chunk. Hence, start_index
will be 0.
3. The length of the filter will be end_index - start_index, because
BIDX[i] gives the cumulative number of 8-bit words (bytes) up to and
including the ith commit's filter.
We toggle whether Bloom filters should be recomputed based on the
compute_if_not_present flag.
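The lookup described in the three steps above can be sketched in
isolation as follows. This is only an illustration on an in-memory
BIDX array: read_be32(), struct filter_span and locate_filter() are
hypothetical names (read_be32() stands in for get_be32()), and the
returned start offset is relative to the first byte after the BDAT
chunk header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit big-endian value; a stand-in for git's get_be32(). */
uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Hypothetical result type: where a filter starts and how long it is. */
struct filter_span {
	uint32_t start; /* offset past the BDAT header, in bytes */
	uint32_t len;   /* filter length, in bytes */
};

/*
 * BIDX holds one cumulative end offset per commit, so the filter of
 * commit i spans [BIDX[i-1], BIDX[i]), with an implicit BIDX[-1] = 0.
 */
struct filter_span locate_filter(const unsigned char *bidx,
				 uint32_t nr_commits, uint32_t i)
{
	struct filter_span span;
	uint32_t end_index;

	assert(i < nr_commits);
	end_index = read_be32(bidx + 4 * (size_t)i);
	span.start = i > 0 ? read_be32(bidx + 4 * (size_t)(i - 1)) : 0;
	span.len = end_index - span.start;
	return span;
}
```

For a BIDX holding the cumulative values 2, 2, 5, commit 0 gets bytes
[0, 2), commit 1 gets a zero-length filter, and commit 2 gets bytes
[2, 5).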
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
bloom.c | 49 ++++++++++++++++++++++++++++++++++++++++++-
bloom.h | 4 +++-
commit-graph.c | 6 +++---
t/helper/test-bloom.c | 2 +-
4 files changed, 55 insertions(+), 6 deletions(-)
diff --git a/bloom.c b/bloom.c
index a16eee92331..0f714dd76ae 100644
--- a/bloom.c
+++ b/bloom.c
@@ -4,6 +4,8 @@
#include "diffcore.h"
#include "revision.h"
#include "hashmap.h"
+#include "commit-graph.h"
+#include "commit.h"
define_commit_slab(bloom_filter_slab, struct bloom_filter);
@@ -26,6 +28,36 @@ static inline unsigned char get_bitmask(uint32_t pos)
return ((unsigned char)1) << (pos & (BITS_PER_WORD - 1));
}
+static int load_bloom_filter_from_graph(struct commit_graph *g,
+ struct bloom_filter *filter,
+ struct commit *c)
+{
+ uint32_t lex_pos, start_index, end_index;
+
+ while (c->graph_pos < g->num_commits_in_base)
+ g = g->base_graph;
+
+ /* The commit graph commit 'c' lives in doesn't carry bloom filters. */
+ if (!g->chunk_bloom_indexes)
+ return 0;
+
+ lex_pos = c->graph_pos - g->num_commits_in_base;
+
+ end_index = get_be32(g->chunk_bloom_indexes + 4 * lex_pos);
+
+ if (lex_pos > 0)
+ start_index = get_be32(g->chunk_bloom_indexes + 4 * (lex_pos - 1));
+ else
+ start_index = 0;
+
+ filter->len = end_index - start_index;
+ filter->data = (unsigned char *)(g->chunk_bloom_data +
+ sizeof(unsigned char) * start_index +
+ BLOOMDATA_CHUNK_HEADER_SIZE);
+
+ return 1;
+}
+
/*
* Calculate the murmur3 32-bit hash value for the given data
* using the given seed.
@@ -127,7 +159,8 @@ void init_bloom_filters(void)
}
struct bloom_filter *get_bloom_filter(struct repository *r,
- struct commit *c)
+ struct commit *c,
+ int compute_if_not_present)
{
struct bloom_filter *filter;
struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
@@ -140,6 +173,20 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
filter = bloom_filter_slab_at(&bloom_filters, c);
+ if (!filter->data) {
+ load_commit_graph_info(r, c);
+ if (c->graph_pos != COMMIT_NOT_FROM_GRAPH &&
+ r->objects->commit_graph->chunk_bloom_indexes) {
+ if (load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
+ return filter;
+ else
+ return NULL;
+ }
+ }
+
+ if (filter->data || !compute_if_not_present)
+ return filter;
+
repo_diff_setup(r, &diffopt);
diffopt.flags.recursive = 1;
diffopt.max_changes = max_changes;
diff --git a/bloom.h b/bloom.h
index 85ab8e9423d..760d7122374 100644
--- a/bloom.h
+++ b/bloom.h
@@ -32,6 +32,7 @@ struct bloom_filter_settings {
#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
#define BITS_PER_WORD 8
+#define BLOOMDATA_CHUNK_HEADER_SIZE 3 * sizeof(uint32_t)
/*
* A bloom_filter struct represents a data segment to
@@ -79,6 +80,7 @@ void add_key_to_filter(const struct bloom_key *key,
void init_bloom_filters(void);
struct bloom_filter *get_bloom_filter(struct repository *r,
- struct commit *c);
+ struct commit *c,
+ int compute_if_not_present);
#endif
\ No newline at end of file
diff --git a/commit-graph.c b/commit-graph.c
index a8b6b5cca5d..77668629e27 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1086,7 +1086,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
ctx->commits.nr);
while (list < last) {
- struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
+ struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
cur_pos += filter->len;
display_progress(progress, ++i);
hashwrite_be32(f, cur_pos);
@@ -1115,7 +1115,7 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
hashwrite_be32(f, settings->bits_per_entry);
while (list < last) {
- struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
+ struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
display_progress(progress, ++i);
hashwrite(f, filter->data, filter->len * sizeof(unsigned char));
list++;
@@ -1296,7 +1296,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
for (i = 0; i < ctx->commits.nr; i++) {
struct commit *c = sorted_commits[i];
- struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
+ struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
display_progress(progress, i + 1);
}
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index f18d1b722e1..ce412664ba9 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -39,7 +39,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
struct bloom_filter *filter;
setup_git_directory();
c = lookup_commit(the_repository, commit_oid);
- filter = get_bloom_filter(the_repository, c);
+ filter = get_bloom_filter(the_repository, c, 1);
print_bloom_filter(filter);
}
--
gitgitgadget
* Re: [PATCH v4 10/15] commit-graph: reuse existing Bloom filters during write
2020-04-06 16:59 ` [PATCH v4 10/15] commit-graph: reuse existing Bloom filters during write Garima Singh via GitGitGadget
@ 2020-06-19 14:02 ` SZEDER Gábor
2020-06-19 19:28 ` Junio C Hamano
2020-07-27 21:33 ` SZEDER Gábor
1 sibling, 1 reply; 159+ messages in thread
From: SZEDER Gábor @ 2020-06-19 14:02 UTC (permalink / raw)
To: Garima Singh via GitGitGadget; +Cc: git, stolee, jonathantanmy, Garima Singh
On Mon, Apr 06, 2020 at 04:59:50PM +0000, Garima Singh via GitGitGadget wrote:
> From: Garima Singh <garima.singh@microsoft.com>
>
> Add logic to
> a) parse Bloom filter information from the commit graph file and,
> b) re-use existing Bloom filters.
>
> See Documentation/technical/commit-graph-format for the format in which
> the Bloom filter information is written to the commit graph file.
>
> To read the Bloom filter for a given commit with lexicographic position
> 'i' we need to:
> 1. Read BIDX[i], which gives us the starting index in BDAT for the
> filter of commit i+1. In other words, it is the index just past the end
> of the filter of commit i. It is called end_index in the code.
>
> 2. For i>0, read BIDX[i-1], which gives us the starting index in BDAT
> for the filter of commit i. It is called start_index in the code.
> For the first commit, where i = 0, Bloom filter data starts at the
> beginning, just past the header in the BDAT chunk. Hence, start_index
> will be 0.
>
> 3. The length of the filter will be end_index - start_index, because
> BIDX[i] gives the cumulative number of 8-bit words (bytes) up to and
> including the ith commit's filter.
>
> We toggle whether Bloom filters should be recomputed based on the
> compute_if_not_present flag.
A very important question is not discussed here: when should we
recompute Bloom filters?
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
> bloom.c | 49 ++++++++++++++++++++++++++++++++++++++++++-
> bloom.h | 4 +++-
> commit-graph.c | 6 +++---
> t/helper/test-bloom.c | 2 +-
> 4 files changed, 55 insertions(+), 6 deletions(-)
>
> diff --git a/bloom.c b/bloom.c
> index a16eee92331..0f714dd76ae 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -4,6 +4,8 @@
> #include "diffcore.h"
> #include "revision.h"
> #include "hashmap.h"
> +#include "commit-graph.h"
> +#include "commit.h"
>
> define_commit_slab(bloom_filter_slab, struct bloom_filter);
>
> @@ -26,6 +28,36 @@ static inline unsigned char get_bitmask(uint32_t pos)
> return ((unsigned char)1) << (pos & (BITS_PER_WORD - 1));
> }
>
> +static int load_bloom_filter_from_graph(struct commit_graph *g,
> + struct bloom_filter *filter,
> + struct commit *c)
> +{
> + uint32_t lex_pos, start_index, end_index;
> +
> + while (c->graph_pos < g->num_commits_in_base)
> + g = g->base_graph;
> +
> + /* The commit graph commit 'c' lives in doesn't carry bloom filters. */
> + if (!g->chunk_bloom_indexes)
> + return 0;
> +
> + lex_pos = c->graph_pos - g->num_commits_in_base;
> +
> + end_index = get_be32(g->chunk_bloom_indexes + 4 * lex_pos);
> +
> + if (lex_pos > 0)
> + start_index = get_be32(g->chunk_bloom_indexes + 4 * (lex_pos - 1));
> + else
> + start_index = 0;
> +
> + filter->len = end_index - start_index;
> + filter->data = (unsigned char *)(g->chunk_bloom_data +
> + sizeof(unsigned char) * start_index +
> + BLOOMDATA_CHUNK_HEADER_SIZE);
> +
> + return 1;
> +}
> +
> /*
> * Calculate the murmur3 32-bit hash value for the given data
> * using the given seed.
> @@ -127,7 +159,8 @@ void init_bloom_filters(void)
> }
>
> struct bloom_filter *get_bloom_filter(struct repository *r,
> - struct commit *c)
> + struct commit *c,
> + int compute_if_not_present)
> {
> struct bloom_filter *filter;
> struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
This line in the hunk context sets the default parameters with which
this process will compute any new changed path Bloom filters.
Note that this is not the settings instance that eventually gets
written to the header of the Bloom filters chunk:
write_commit_graph_file() has its own 'struct bloom_filter_settings'
instance, and that's the one that goes into the chunk header.
> @@ -140,6 +173,20 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>
> filter = bloom_filter_slab_at(&bloom_filters, c);
>
> + if (!filter->data) {
> + load_commit_graph_info(r, c);
> + if (c->graph_pos != COMMIT_NOT_FROM_GRAPH &&
> + r->objects->commit_graph->chunk_bloom_indexes) {
> + if (load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
> + return filter;
> + else
> + return NULL;
> + }
> + }
And in the above conditions we try to load the existing Bloom filter
for the given commit and return it as-is for reuse if it already
exists, or go on to compute a new Bloom filter with the parameters set
at the beginning of the function.
Unfortunately, the parameters used to compute the now reused Bloom
filters are not checked anywhere. In fact this writing process
entirely ignores all parameters in the header of the existing Bloom
filters chunk, and simply replaces them with the default parameters
hard-coded in write_commit_graph_file(). Consequently, we can end up
with Bloom filters computed with different parameters in the same
commit-graph file, which, in turn, can result in commits omitted from
the output of pathspec-limited revision walks.
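To make the failure mode concrete, here is a toy model of a filter
that is written with 6 hashes per path and later queried with 7. It is
a sketch under loud assumptions: toy_hash0() and toy_hash1() are
trivial stand-ins and are NOT the seeded murmur3 hashes that bloom.c
computes, the h0 + i * h1 combination is only assumed for the
illustration, and all toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_FILTER_BYTES 2 /* a tiny 2-byte (16-bit) filter */

/* Toy stand-in for the first hash: sum of the bytes of the path. */
uint32_t toy_hash0(const char *s)
{
	uint32_t sum = 0;

	while (*s)
		sum += (unsigned char)*s++;
	return sum;
}

/* Toy stand-in for the second hash: an odd number derived from the length. */
uint32_t toy_hash1(const char *s)
{
	return 2 * (uint32_t)strlen(s) + 1;
}

/* Bit position of the i-th hash of 'path' within the toy filter. */
uint32_t toy_bit_pos(const char *path, uint32_t i)
{
	return (toy_hash0(path) + i * toy_hash1(path)) % (TOY_FILTER_BYTES * 8);
}

/* Set the first 'num_hashes' bit positions of 'path' in the filter. */
void toy_add(unsigned char *filter, const char *path, uint32_t num_hashes)
{
	uint32_t i;

	for (i = 0; i < num_hashes; i++) {
		uint32_t pos = toy_bit_pos(path, i);
		filter[pos / 8] |= (unsigned char)(1u << (pos % 8));
	}
}

/* Return 1 if all 'num_hashes' bits are set ("maybe"), 0 if definitely absent. */
int toy_maybe_contains(const unsigned char *filter, const char *path,
		       uint32_t num_hashes)
{
	uint32_t i;

	for (i = 0; i < num_hashes; i++) {
		uint32_t pos = toy_bit_pos(path, i);
		if (!(filter[pos / 8] & (1u << (pos % 8))))
			return 0;
	}
	return 1;
}
```

With these toy hashes the path "File" maps to bit positions 0, 9, 2,
11, 4, 13 for the first six hashes and to 6 for the seventh. A filter
filled with toy_add(filter, "File", 6) therefore answers "maybe" when
queried with 6 hashes but "definitely not" when queried with 7, which
is the false negative described above.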
The makeshift (there is no way to override those hard-coded defaults)
tests below demonstrate this issue.
This issue raises a good couple of questions:
- What should we do when updating a commit-graph that was written
with different Bloom filter parameters than our hardcoded
defaults?
Reusing the existing Bloom filters is clearly wrong. Throwing away
all existing Bloom filters and recomputing them with our default
parameters doesn't seem to be a good option, because that's a
considerable amount of work, and the user might have a reason to
choose those parameters.
- What should we do when updating a commit-graph that was written
with different Bloom filter parameters than specified by the user
on the command line or in the config?
Wipe out the old Bloom filters and recompute with new parameters,
spending considerable time in bigger repositories? Or stop with a
warning about the different parameters (maybe it's just a typo),
and require '--force'?
Dunno, and we don't have such options and configuration yet
anyway.
- What about split commit-graphs?
When split commit-graphs were introduced there was not a single
chunk that had its own header. Now the Bloom filters chunk does
have a header, which leads to other questions:
- Should that Bloom filters header be included in every split
commit-graph?
Not sure, but I suppose that having a header in each split
commit-graph file would make loading and parsing that chunk a
bit simpler, because all of them should be parsed the same way.
Anyway, I think the specs should be explicit about it. But...:
- Should we allow different parameters in the Bloom filter chunks
in each split commit-graph?
The point of split commit-graphs is to avoid the overhead of
re-writing the whole commit-graph file every time new commits
are added, and it's crucial that both writing and merging split
commit-graph files are cheap. However, split commit-graph files
using different Bloom filter parameters can't be merged without
recomputing those Bloom filters, making merging quite expensive.
So I don't think that it's a good idea to allow different Bloom
filter parameters in split commit-graphs. But then perhaps it
would be better not to have a Bloom filter chunk header in all
split commit-graph files after all.
In any case, the last test below shows that the Bloom filter
parameters are only read from the header of the most recent split
commit-graph file.
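Whatever the answers turn out to be, both reuse and merging need some
way to notice a parameter mismatch first. A minimal sketch of such a
check over the layers of a split commit-graph is below. It assumes
that every layer carries its own header with the three fields that
write_graph_chunk_bloom_data() writes; struct toy_bloom_settings and
both functions are hypothetical and not existing git code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the three fields written to the Bloom filter data chunk header. */
struct toy_bloom_settings {
	uint32_t hash_version;
	uint32_t num_hashes;
	uint32_t bits_per_entry;
};

/* Filters are only interchangeable if every parameter matches. */
int toy_settings_equal(const struct toy_bloom_settings *a,
		       const struct toy_bloom_settings *b)
{
	return a->hash_version == b->hash_version &&
	       a->num_hashes == b->num_hashes &&
	       a->bits_per_entry == b->bits_per_entry;
}

/*
 * Return the index of the first layer whose settings differ from
 * layer 0, or -1 if all layers agree (or there are no layers).
 */
int toy_first_mismatched_layer(const struct toy_bloom_settings *layers,
			       size_t nr)
{
	size_t i;

	for (i = 1; i < nr; i++)
		if (!toy_settings_equal(&layers[0], &layers[i]))
			return (int)i;
	return -1;
}
```

A writer could refuse to reuse or merge filters when this returns
anything other than -1, or fall back to recomputing them. Which of
those is right is exactly the open question here.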
--- >8 ---
#!/bin/sh
test_description='test'
. ./test-lib.sh
test_expect_success 'yuckiest setup ever!' '
(
cd "$GIT_BUILD_DIR" &&
# The number of hashes per path cannot be configured
# at runtime, so build a dedicated git binary that
# writes Bloom filters using only 6 hashes per path.
sed -i -e "/DEFAULT_BLOOM_FILTER_SETTINGS/ s/7/6/" bloom.h &&
make -j4 git &&
cp git git6 &&
# Revert, rebuild.
sed -i -e "/DEFAULT_BLOOM_FILTER_SETTINGS/ s/6/7/" bloom.h &&
make -j4 git
) &&
git6="$GIT_BUILD_DIR"/git6
'
test_expect_success 'setup' '
# We need a filename whose 7th hash maps to a different bit
# position than any of its first 6 hashes in a 2-byte Bloom
# filter.
file=File &&
test_tick &&
git commit --allow-empty -m initial &&
echo 1 >$file &&
git add $file &&
git commit -m one $file &&
echo 2 >$file &&
git commit -m two $file &&
git log --oneline -- $file >expect
'
test_expect_success 'can read Bloom filters with different parameters' '
test_when_finished "rm -rfv .git/objects/info/commit-graph*" &&
# Write a commit-graph with Bloom filters using only 6 hashes
# per path.
"$git6" commit-graph write --reachable --changed-paths &&
# Try pathspec-limited revision walk with the git binary writing
# Bloom filters using 7 hashes: it still works, because no matter
# how many hashes it would use when writing the commit-graph, the
# reader part respects the nr of hashes stored in the
# commit-graph file. So far so good.
git log --oneline $file >actual &&
test_cmp expect actual
'
test_expect_failure 'commit-graph write does not reuse Bloom filters with different parameters' '
test_when_finished "rm -rfv .git/objects/info/commit-graph*" &&
# Write a commit-graph with Bloom filters using only 6 hashes
# per path for a subset of commits.
git rev-parse HEAD^ |
"$git6" commit-graph write --stdin-commits --changed-paths &&
# Add the rest of the commits to the commit-graph containing Bloom
# filters using 6 hashes with a git version that writes Bloom
# filters using 7 hashes.
# Does it reuse the existing Bloom filters with 6 hashes?
git commit-graph write --reachable --changed-paths &&
# Yes, it does, because these report different filter data,
# even though both commits modified the same file.
test-tool bloom get_filter_for_commit $(git rev-parse HEAD^) &&
test-tool bloom get_filter_for_commit $(git rev-parse HEAD) &&
# Furthermore, it updated the Bloom filter chunk header as well,
# which now stores that all Bloom filters use 7 hashes.
# Consequently, the first commit whose Bloom filter was written
# with only 6 hashes falls victim to a false negative, and is
# omitted from the output.
git log --oneline $file >actual &&
test_cmp expect actual
'
test_expect_failure 'split commit-graphs and Bloom filters with different parameters' '
test_when_finished "rm -rfv .git/objects/info/commit-graph*" &&
git rev-parse HEAD^ |
"$git6" commit-graph write --stdin-commits --changed-paths --split &&
git commit-graph write --reachable --changed-paths --split=no-merge &&
# To make sure that I test what I want, i.e. two commit-graphs
# with one commit in each. (Though "test-tool read-graph" is
# utterly oblivious to split commit graphs...)
test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain &&
verbose test "$(test-tool read-graph |sed -n -e "s/^num_commits: //p")" = 1 &&
test-tool bloom get_filter_for_commit $(git rev-parse HEAD^) &&
test-tool bloom get_filter_for_commit $(git rev-parse HEAD) &&
git log --oneline $file >actual &&
test_cmp expect actual
'
test_done
--- 8< ---
> + if (filter->data || !compute_if_not_present)
> + return filter;
> +
> repo_diff_setup(r, &diffopt);
> diffopt.flags.recursive = 1;
> diffopt.max_changes = max_changes;
> diff --git a/bloom.h b/bloom.h
> index 85ab8e9423d..760d7122374 100644
> --- a/bloom.h
> +++ b/bloom.h
> @@ -32,6 +32,7 @@ struct bloom_filter_settings {
>
> #define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
> #define BITS_PER_WORD 8
> +#define BLOOMDATA_CHUNK_HEADER_SIZE 3 * sizeof(uint32_t)
>
> /*
> * A bloom_filter struct represents a data segment to
> @@ -79,6 +80,7 @@ void add_key_to_filter(const struct bloom_key *key,
> void init_bloom_filters(void);
>
> struct bloom_filter *get_bloom_filter(struct repository *r,
> - struct commit *c);
> + struct commit *c,
> + int compute_if_not_present);
>
> #endif
> \ No newline at end of file
> diff --git a/commit-graph.c b/commit-graph.c
> index a8b6b5cca5d..77668629e27 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1086,7 +1086,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
> ctx->commits.nr);
>
> while (list < last) {
> - struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
> + struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
> cur_pos += filter->len;
> display_progress(progress, ++i);
> hashwrite_be32(f, cur_pos);
> @@ -1115,7 +1115,7 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
> hashwrite_be32(f, settings->bits_per_entry);
>
> while (list < last) {
> - struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
> + struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
> display_progress(progress, ++i);
> hashwrite(f, filter->data, filter->len * sizeof(unsigned char));
> list++;
> @@ -1296,7 +1296,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>
> for (i = 0; i < ctx->commits.nr; i++) {
> struct commit *c = sorted_commits[i];
> - struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
> + struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
> ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
> display_progress(progress, i + 1);
> }
> diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
> index f18d1b722e1..ce412664ba9 100644
> --- a/t/helper/test-bloom.c
> +++ b/t/helper/test-bloom.c
> @@ -39,7 +39,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
> struct bloom_filter *filter;
> setup_git_directory();
> c = lookup_commit(the_repository, commit_oid);
> - filter = get_bloom_filter(the_repository, c);
> + filter = get_bloom_filter(the_repository, c, 1);
> print_bloom_filter(filter);
> }
>
> --
> gitgitgadget
* Re: [PATCH v4 10/15] commit-graph: reuse existing Bloom filters during write
2020-06-19 14:02 ` SZEDER Gábor
@ 2020-06-19 19:28 ` Junio C Hamano
0 siblings, 0 replies; 159+ messages in thread
From: Junio C Hamano @ 2020-06-19 19:28 UTC (permalink / raw)
To: SZEDER Gábor
Cc: Garima Singh via GitGitGadget, git, stolee, jonathantanmy, Garima Singh
SZEDER Gábor <szeder.dev@gmail.com> writes:
> Note that this is not the settings instance that eventually gets
> written to the header of the Bloom filters chunk:
> write_commit_graph_file() has its own 'struct bloom_filter_settings'
> instance, and that's the one that goes into the chunk header.
> ...
> Unfortunately, the parameters used to compute the now reused Bloom
> filters are not checked anywhere. In fact this writing process
> entirely ignores all parameters in the header of the existing Bloom
> filters chunk, and simply replaces them with the default parameters
> hard-coded in write_commit_graph_file(). Consequently, we can end up
> with Bloom filters computed with different parameters in the same
> commit-graph file, which, in turn, can result in commits omitted from
> the output of pathspec-limited revision walks.
Yeah, the whole design seems quite broken and as you said later,
mixing other ingredients like split file would only make things
worse X-<.
> The makeshift (there is no way to override those hard-coded defaults)
> tests below demonstrate this issue.
>
> This issue raises a good couple of questions:
>
> - What should we do when updating a commit-graph that was written
> with different Bloom filter parameters than our hardcoded
> defaults?
>
> Reusing the existing Bloom filters is clearly wrong. Throwing away
> all existing Bloom filters and recomputing them with our default
> parameters doesn't seem to be a good option, because that's a
> considerable amount of work, and the user might have a reason to
> choose those parameters.
>
> - What should we do when updating a commit-graph that was written
> with different Bloom filter parameters than specified by the user
> on the command line or in the config?
>
> Wipe out the old Bloom filters and recompute with new parameters,
> spending considerable time in bigger repositories? Or stop with a
> warning about the different parameters (maybe it's just a typo),
> and require '--force'?
>
> Dunno, and we don't have such options and configuration yet
> anyway.
>
> - What about split commit-graphs?
>
> When split commit-graphs were introduced there was not a single
> chunk that had its own header. Now the Bloom filters chunk does
> have a header, which leads to other questions:
>
> - Should that Bloom filters header be included in every split
> commit-graph?
>
> Not sure, but I suppose that having a header in each split
> commit-graph file would make loading and parsing that chunk a
> bit simpler, because all of them should be parsed the same way.
> Anyway, I think the specs should be explicit about it. But...:
>
> - Should we allow different parameters in the Bloom filter chunks
> in each split commit-graph?
>
> The point of split commit-graphs is to avoid the overhead of
> re-writing the whole commit-graph file every time new commits
> are added, and it's crucial that both writing and merging split
> commit-graph files are cheap. However, split commit-graph files
> using different Bloom filter parameters can't be merged without
> recomputing those Bloom filters, making merging quite expensive.
>
> So I don't think that it's a good idea to allow different Bloom
> filter parameters in split commit-graphs. But then perhaps it
> would be better not to have a Bloom filter chunk header in all
> split commit-graph files after all.
>
> In any case, the last test below shows that the Bloom filter
> parameters are only read from the header of the most recent split
> commit-graph file.
>
>
> --- >8 ---
>
> #!/bin/sh
>
> test_description='test'
>
> . ./test-lib.sh
>
> test_expect_success 'yuckiest setup ever!' '
> (
> cd "$GIT_BUILD_DIR" &&
>
> # The number of hashes per path cannot be configured
> # at runtime, so build a dedicated git binary that
> # writes Bloom filters using only 6 hashes per path.
> sed -i -e "/DEFAULT_BLOOM_FILTER_SETTINGS/ s/7/6/" bloom.h &&
> make -j4 git &&
> cp git git6 &&
>
> # Revert, rebuild.
> sed -i -e "/DEFAULT_BLOOM_FILTER_SETTINGS/ s/6/7/" bloom.h &&
> make -j4 git
> ) &&
> git6="$GIT_BUILD_DIR"/git6
> '
>
> test_expect_success 'setup' '
> # We need a filename whose 7th hash maps to a different bit
> # position than any of its first 6 hashes in a 2-byte Bloom
> # filter.
> file=File &&
>
> test_tick &&
> git commit --allow-empty -m initial &&
> echo 1 >$file &&
> git add $file &&
> git commit -m one $file &&
> echo 2 >$file &&
> git commit -m two $file &&
>
> git log --oneline -- $file >expect
> '
>
> test_expect_success 'can read Bloom filters with different parameters' '
> test_when_finished "rm -rfv .git/objects/info/commit-graph*" &&
>
> # Write a commit-graph with Bloom filters using only 6 hashes
> # per path.
> "$git6" commit-graph write --reachable --changed-paths &&
>
> # Try pathspec-limited revision walk with the git binary writing
> # Bloom filters using 7 hashes: it still works, because no matter
> # how many hashes it would use when writing the commit-graph, the
> # reader part respects the nr of hashes stored in the
> # commit-graph file. So far so good.
> git log --oneline $file >actual &&
> test_cmp expect actual
> '
>
> test_expect_failure 'commit-graph write does not reuse Bloom filters with different parameters' '
> test_when_finished "rm -rfv .git/objects/info/commit-graph*" &&
>
> # Write a commit-graph with Bloom filters using only 6 hashes
> # per path for a subset of commits.
> git rev-parse HEAD^ |
> "$git6" commit-graph write --stdin-commits --changed-paths &&
>
> # Add the rest of the commits to the commit-graph containing Bloom
> # filters using 6 hashes with a git version that writes Bloom
> # filters using 7 hashes.
> # Does it reuse the existing Bloom filters with 6 hashes?
> git commit-graph write --reachable --changed-paths &&
>
> # Yes, it does, because these report different filter data,
> # even though both commits modified the same file.
> test-tool bloom get_filter_for_commit $(git rev-parse HEAD^) &&
> test-tool bloom get_filter_for_commit $(git rev-parse HEAD) &&
>
> # Furthermore, it updated the Bloom filter chunk header as well,
> # which now stores that all Bloom filters use 7 hashes.
> # Consequently, the first commit whose Bloom filter was written
> # with only 6 hashes falls victim to a false negative, and is
> # omitted from the output.
> git log --oneline $file >actual &&
> test_cmp expect actual
> '
>
> test_expect_failure 'split commit-graphs and Bloom filters with different parameters' '
> test_when_finished "rm -rfv .git/objects/info/commit-graph*" &&
>
> git rev-parse HEAD^ |
> "$git6" commit-graph write --stdin-commits --changed-paths --split &&
>
> git commit-graph write --reachable --changed-paths --split=no-merge &&
>
> # To make sure that I test what I want, i.e. two commit-graphs
> # with one commit in each. (Though "test-tool read-graph" is
> # utterly oblivious to split commit graphs...)
> test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain &&
> verbose test "$(test-tool read-graph |sed -n -e "s/^num_commits: //p")" = 1 &&
>
> test-tool bloom get_filter_for_commit $(git rev-parse HEAD^) &&
> test-tool bloom get_filter_for_commit $(git rev-parse HEAD) &&
>
> git log --oneline $file >actual &&
> test_cmp expect actual
> '
>
> test_done
>
> --- 8< ---
>
>> + if (filter->data || !compute_if_not_present)
>> + return filter;
>> +
>> repo_diff_setup(r, &diffopt);
>> diffopt.flags.recursive = 1;
>> diffopt.max_changes = max_changes;
>> diff --git a/bloom.h b/bloom.h
>> index 85ab8e9423d..760d7122374 100644
>> --- a/bloom.h
>> +++ b/bloom.h
>> @@ -32,6 +32,7 @@ struct bloom_filter_settings {
>>
>> #define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
>> #define BITS_PER_WORD 8
>> +#define BLOOMDATA_CHUNK_HEADER_SIZE 3 * sizeof(uint32_t)
>>
>> /*
>> * A bloom_filter struct represents a data segment to
>> @@ -79,6 +80,7 @@ void add_key_to_filter(const struct bloom_key *key,
>> void init_bloom_filters(void);
>>
>> struct bloom_filter *get_bloom_filter(struct repository *r,
>> - struct commit *c);
>> + struct commit *c,
>> + int compute_if_not_present);
>>
>> #endif
>> \ No newline at end of file
>> diff --git a/commit-graph.c b/commit-graph.c
>> index a8b6b5cca5d..77668629e27 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -1086,7 +1086,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
>> ctx->commits.nr);
>>
>> while (list < last) {
>> - struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
>> + struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
>> cur_pos += filter->len;
>> display_progress(progress, ++i);
>> hashwrite_be32(f, cur_pos);
>> @@ -1115,7 +1115,7 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
>> hashwrite_be32(f, settings->bits_per_entry);
>>
>> while (list < last) {
>> - struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
>> + struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
>> display_progress(progress, ++i);
>> hashwrite(f, filter->data, filter->len * sizeof(unsigned char));
>> list++;
>> @@ -1296,7 +1296,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>>
>> for (i = 0; i < ctx->commits.nr; i++) {
>> struct commit *c = sorted_commits[i];
>> - struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
>> + struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
>> ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
>> display_progress(progress, i + 1);
>> }
>> diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
>> index f18d1b722e1..ce412664ba9 100644
>> --- a/t/helper/test-bloom.c
>> +++ b/t/helper/test-bloom.c
>> @@ -39,7 +39,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
>> struct bloom_filter *filter;
>> setup_git_directory();
>> c = lookup_commit(the_repository, commit_oid);
>> - filter = get_bloom_filter(the_repository, c);
>> + filter = get_bloom_filter(the_repository, c, 1);
>> print_bloom_filter(filter);
>> }
>>
>> --
>> gitgitgadget
* Re: [PATCH v4 10/15] commit-graph: reuse existing Bloom filters during write
2020-04-06 16:59 ` [PATCH v4 10/15] commit-graph: reuse existing Bloom filters during write Garima Singh via GitGitGadget
2020-06-19 14:02 ` SZEDER Gábor
@ 2020-07-27 21:33 ` SZEDER Gábor
1 sibling, 0 replies; 159+ messages in thread
From: SZEDER Gábor @ 2020-07-27 21:33 UTC (permalink / raw)
To: Garima Singh via GitGitGadget; +Cc: git, stolee, jonathantanmy, Garima Singh
On Mon, Apr 06, 2020 at 04:59:50PM +0000, Garima Singh via GitGitGadget wrote:
> From: Garima Singh <garima.singh@microsoft.com>
>
> Add logic to
> a) parse Bloom filter information from the commit graph file and,
> b) re-use existing Bloom filters.
>
> See Documentation/technical/commit-graph-format for the format in which
> the Bloom filter information is written to the commit graph file.
>
> To read Bloom filter for a given commit with lexicographic position
> 'i' we need to:
> 1. Read BIDX[i] which essentially gives us the starting index in BDAT for
> filter of commit i+1. It is essentially the index past the end
> of the filter of commit i. It is called end_index in the code.
>
> 2. For i>0, read BIDX[i-1] which will give us the starting index in BDAT
> for filter of commit i. It is called the start_index in the code.
> For the first commit, where i = 0, Bloom filter data starts at the
> beginning, just past the header in the BDAT chunk. Hence, start_index
> will be 0.
>
> 3. The length of the filter will be end_index - start_index, because
> BIDX[i] gives the cumulative 8-byte words including the ith
> commit's filter.
>
> We toggle whether Bloom filters should be recomputed based on the
> compute_if_not_present flag.
>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
> bloom.c | 49 ++++++++++++++++++++++++++++++++++++++++++-
> bloom.h | 4 +++-
> commit-graph.c | 6 +++---
> t/helper/test-bloom.c | 2 +-
> 4 files changed, 55 insertions(+), 6 deletions(-)
>
> diff --git a/bloom.c b/bloom.c
> index a16eee92331..0f714dd76ae 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -4,6 +4,8 @@
> #include "diffcore.h"
> #include "revision.h"
> #include "hashmap.h"
> +#include "commit-graph.h"
> +#include "commit.h"
>
> define_commit_slab(bloom_filter_slab, struct bloom_filter);
>
> @@ -26,6 +28,36 @@ static inline unsigned char get_bitmask(uint32_t pos)
> return ((unsigned char)1) << (pos & (BITS_PER_WORD - 1));
> }
>
> +static int load_bloom_filter_from_graph(struct commit_graph *g,
> + struct bloom_filter *filter,
> + struct commit *c)
> +{
> + uint32_t lex_pos, start_index, end_index;
> +
> + while (c->graph_pos < g->num_commits_in_base)
> + g = g->base_graph;
> +
> + /* The commit graph commit 'c' lives in doesn't carry bloom filters. */
> + if (!g->chunk_bloom_indexes)
> + return 0;
> +
> + lex_pos = c->graph_pos - g->num_commits_in_base;
> +
> + end_index = get_be32(g->chunk_bloom_indexes + 4 * lex_pos);
Let's suppose that we encounter a bogus commit-graph file. This would
then segfault if 'lex_pos' were to point past the end of file, i.e.
past the mmap()-ed memory region.
> +
> + if (lex_pos > 0)
> + start_index = get_be32(g->chunk_bloom_indexes + 4 * (lex_pos - 1));
> + else
> + start_index = 0;
> +
> + filter->len = end_index - start_index;
> + filter->data = (unsigned char *)(g->chunk_bloom_data +
> + sizeof(unsigned char) * start_index +
> + BLOOMDATA_CHUNK_HEADER_SIZE);
And this could lead to segfault later when accessing the Bloom filter
data if 'start_index' or 'end_index' were to point past EOF or
end_index < start_index.
IMO all indices and offsets read from the commit-graph file must be
checked to ensure that they fit in the corresponding chunk, like I did
in my modified path Bloom filters implementation. However, I'm not
sure how it's best to handle an out-of-bounds offset... Simply
erroring out in case of a bogus commit-graph file is the
straightforward possibility, of course, but since the commit-graph is
only an optimization, it would be better user experience to warn and
ignore it and finish the operation without the commit-graph (albeit
slower). But is it even possible to ignore the commit-graph, say, in
the middle of a 'git rev-list --topo-order HEAD'?
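To illustrate the kind of check I mean, here is a minimal standalone
sketch. It is not the patch's code: 'struct bloom_chunks', read_be32()
and lookup_filter() are hypothetical names, and it assumes that the
reader records the sizes of the BIDX and BDAT chunks while parsing the
file, so that every index can be validated before it is used.

```c
#include <stddef.h>
#include <stdint.h>

/* Three 32-bit fields precede the filter data in the BDAT chunk. */
#define BDAT_HEADER_SIZE (3 * sizeof(uint32_t))

/* Hypothetical view of the two Bloom chunks, with their sizes in bytes. */
struct bloom_chunks {
	const unsigned char *bidx;
	size_t bidx_size;
	const unsigned char *bdat;
	size_t bdat_size;
};

/* Read a big-endian 32-bit value without alignment requirements. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Returns 1 and sets *data and *len if the filter of the commit at
 * 'lex_pos' lies entirely inside the chunks, 0 otherwise.
 */
static int lookup_filter(const struct bloom_chunks *c, uint32_t lex_pos,
			 const unsigned char **data, uint32_t *len)
{
	uint32_t start = 0, end;
	size_t payload;

	if (c->bdat_size < BDAT_HEADER_SIZE)
		return 0;
	payload = c->bdat_size - BDAT_HEADER_SIZE;

	/* BIDX[lex_pos] itself must be inside the BIDX chunk. */
	if (lex_pos >= c->bidx_size / 4)
		return 0;

	end = read_be32(c->bidx + 4 * (size_t)lex_pos);
	if (lex_pos > 0)
		start = read_be32(c->bidx + 4 * (size_t)(lex_pos - 1));

	/* Reject decreasing offsets and offsets past the end of BDAT. */
	if (end < start || end > payload)
		return 0;

	*data = c->bdat + BDAT_HEADER_SIZE + start;
	*len = end - start;
	return 1;
}
```

A caller that gets 0 back could warn and treat the commit as having no
filter, which at least avoids the out-of-bounds read. Whether the rest
of the commit-graph can still be trusted at that point is a separate
question.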
> + return 1;
> +}
> +
> /*
> * Calculate the murmur3 32-bit hash value for the given data
> * using the given seed.
> @@ -127,7 +159,8 @@ void init_bloom_filters(void)
> }
>
> struct bloom_filter *get_bloom_filter(struct repository *r,
> - struct commit *c)
> + struct commit *c,
> + int compute_if_not_present)
> {
> struct bloom_filter *filter;
> struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
> @@ -140,6 +173,20 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>
> filter = bloom_filter_slab_at(&bloom_filters, c);
>
> + if (!filter->data) {
> + load_commit_graph_info(r, c);
> + if (c->graph_pos != COMMIT_NOT_FROM_GRAPH &&
> + r->objects->commit_graph->chunk_bloom_indexes) {
> + if (load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
> + return filter;
> + else
> + return NULL;
> + }
> + }
> +
> + if (filter->data || !compute_if_not_present)
> + return filter;
> +
> repo_diff_setup(r, &diffopt);
> diffopt.flags.recursive = 1;
> diffopt.max_changes = max_changes;
> diff --git a/bloom.h b/bloom.h
> index 85ab8e9423d..760d7122374 100644
> --- a/bloom.h
> +++ b/bloom.h
> @@ -32,6 +32,7 @@ struct bloom_filter_settings {
>
> #define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
> #define BITS_PER_WORD 8
> +#define BLOOMDATA_CHUNK_HEADER_SIZE 3 * sizeof(uint32_t)
>
> /*
> * A bloom_filter struct represents a data segment to
> @@ -79,6 +80,7 @@ void add_key_to_filter(const struct bloom_key *key,
> void init_bloom_filters(void);
>
> struct bloom_filter *get_bloom_filter(struct repository *r,
> - struct commit *c);
> + struct commit *c,
> + int compute_if_not_present);
>
> #endif
> \ No newline at end of file
> diff --git a/commit-graph.c b/commit-graph.c
> index a8b6b5cca5d..77668629e27 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1086,7 +1086,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
> ctx->commits.nr);
>
> while (list < last) {
> - struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
> + struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
> cur_pos += filter->len;
> display_progress(progress, ++i);
> hashwrite_be32(f, cur_pos);
> @@ -1115,7 +1115,7 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
> hashwrite_be32(f, settings->bits_per_entry);
>
> while (list < last) {
> - struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
> + struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
> display_progress(progress, ++i);
> hashwrite(f, filter->data, filter->len * sizeof(unsigned char));
> list++;
> @@ -1296,7 +1296,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>
> for (i = 0; i < ctx->commits.nr; i++) {
> struct commit *c = sorted_commits[i];
> - struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
> + struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
> ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
> display_progress(progress, i + 1);
> }
> diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
> index f18d1b722e1..ce412664ba9 100644
> --- a/t/helper/test-bloom.c
> +++ b/t/helper/test-bloom.c
> @@ -39,7 +39,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
> struct bloom_filter *filter;
> setup_git_directory();
> c = lookup_commit(the_repository, commit_oid);
> - filter = get_bloom_filter(the_repository, c);
> + filter = get_bloom_filter(the_repository, c, 1);
> print_bloom_filter(filter);
> }
>
> --
> gitgitgadget
>
* [PATCH v4 11/15] commit-graph: add --changed-paths option to write subcommand
2020-04-06 16:59 ` [PATCH v4 00/15] Changed Paths Bloom Filters Garima Singh via GitGitGadget
` (9 preceding siblings ...)
2020-04-06 16:59 ` [PATCH v4 10/15] commit-graph: reuse existing Bloom filters during write Garima Singh via GitGitGadget
@ 2020-04-06 16:59 ` Garima Singh via GitGitGadget
2020-06-07 22:21 ` SZEDER Gábor
2020-04-06 16:59 ` [PATCH v4 12/15] revision.c: use Bloom filters to speed up path based revision walks Garima Singh via GitGitGadget
` (4 subsequent siblings)
15 siblings, 1 reply; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-04-06 16:59 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
Add --changed-paths option to git commit-graph write. This option will
allow users to compute information about the paths that have changed
between a commit and its first parent, and write it into the commit graph
file. If the option is passed to the write subcommand we set the
COMMIT_GRAPH_WRITE_BLOOM_FILTERS flag and pass it down to the
commit-graph logic.
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
Documentation/git-commit-graph.txt | 5 +++++
builtin/commit-graph.c | 9 +++++++--
2 files changed, 12 insertions(+), 2 deletions(-)
diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 28d1fee5053..f4b13c005b8 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -57,6 +57,11 @@ or `--stdin-packs`.)
With the `--append` option, include all commits that are present in the
existing commit-graph file.
+
+With the `--changed-paths` option, compute and write information about the
+paths changed between a commit and its first parent. This operation can
+take a while on large repositories. It provides significant performance gains
+for getting history of a directory or a file with `git log -- <path>`.
++
With the `--split` option, write the commit-graph as a chain of multiple
commit-graph files stored in `<dir>/info/commit-graphs`. The new commits
not already in the commit-graph are added in a new "tip" file. This file
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index d1ab6625f63..cacb5d04a80 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -9,7 +9,7 @@
static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph verify [--object-dir <objdir>] [--shallow] [--[no-]progress]"),
- N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
+ N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
NULL
};
@@ -19,7 +19,7 @@ static const char * const builtin_commit_graph_verify_usage[] = {
};
static const char * const builtin_commit_graph_write_usage[] = {
- N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
+ N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
NULL
};
@@ -32,6 +32,7 @@ static struct opts_commit_graph {
int split;
int shallow;
int progress;
+ int enable_changed_paths;
} opts;
static struct object_directory *find_odb(struct repository *r,
@@ -135,6 +136,8 @@ static int graph_write(int argc, const char **argv)
N_("start walk at commits listed by stdin")),
OPT_BOOL(0, "append", &opts.append,
N_("include all commits already in the commit-graph file")),
+ OPT_BOOL(0, "changed-paths", &opts.enable_changed_paths,
+ N_("enable computation for changed paths")),
OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")),
OPT_BOOL(0, "split", &opts.split,
N_("allow writing an incremental commit-graph file")),
@@ -168,6 +171,8 @@ static int graph_write(int argc, const char **argv)
flags |= COMMIT_GRAPH_WRITE_SPLIT;
if (opts.progress)
flags |= COMMIT_GRAPH_WRITE_PROGRESS;
+ if (opts.enable_changed_paths)
+ flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
read_replace_refs = 0;
odb = find_odb(the_repository, opts.obj_dir);
--
gitgitgadget
* Re: [PATCH v4 11/15] commit-graph: add --changed-paths option to write subcommand
2020-04-06 16:59 ` [PATCH v4 11/15] commit-graph: add --changed-paths option to write subcommand Garima Singh via GitGitGadget
@ 2020-06-07 22:21 ` SZEDER Gábor
0 siblings, 0 replies; 159+ messages in thread
From: SZEDER Gábor @ 2020-06-07 22:21 UTC (permalink / raw)
To: Garima Singh via GitGitGadget; +Cc: git, stolee, jonathantanmy, Garima Singh
On Mon, Apr 06, 2020 at 04:59:51PM +0000, Garima Singh via GitGitGadget wrote:
> From: Garima Singh <garima.singh@microsoft.com>
>
> Add --changed-paths option to git commit-graph write. This option will
> allow users to compute information about the paths that have changed
> between a commit and its first parent, and write it into the commit graph
> file. If the option is passed to the write subcommand we set the
> COMMIT_GRAPH_WRITE_BLOOM_FILTERS flag and pass it down to the
> commit-graph logic.
>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
> Documentation/git-commit-graph.txt | 5 +++++
> builtin/commit-graph.c | 9 +++++++--
> 2 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
> index 28d1fee5053..f4b13c005b8 100644
> --- a/Documentation/git-commit-graph.txt
> +++ b/Documentation/git-commit-graph.txt
> @@ -57,6 +57,11 @@ or `--stdin-packs`.)
> With the `--append` option, include all commits that are present in the
> existing commit-graph file.
> +
> +With the `--changed-paths` option, compute and write information about the
> +paths changed between a commit and its first parent. This operation can
> +take a while on large repositories. It provides significant performance gains
> +for getting history of a directory or a file with `git log -- <path>`.
So 'git commit-graph write' only computes and writes changed path
Bloom filters if this option is specified. Though not mentioned in
the documentation or in the commit message, the negated
'--no-changed-paths' is supported as well, and it removes Bloom
filters from the commit-graph file. All this is quite reasonable.
However, the most important question is what happens when the
commit-graph file already contains Bloom filters and neither of these
options is specified on the command line. This isn't mentioned in
the docs or in the commit message, either, but as it is implemented in
this patch (i.e. COMMIT_GRAPH_WRITE_BLOOM_FILTERS is not passed from
the builtin to the commit-graph logic) all those existing Bloom
filters are removed from the commit-graph. Considering how expensive
it was to compute those Bloom filters this might not be the most
desirable behaviour.
This is important, because 'git commit-graph write' is not the only
command that writes the commit-graph file. 'git gc' does that by
default, too, and will wipe out any modified path Bloom filters while
doing so. Worse, the user doesn't even have to invoke 'git gc'
manually, because a lot of git commands invoke 'git gc --auto'.
$ git commit-graph write --reachable --changed-paths
$ ~/src/git/t/helper/test-tool read-graph |grep ^chunks
chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
$ git gc --quiet
$ ~/src/git/t/helper/test-tool read-graph |grep ^chunks
chunks: oid_fanout oid_lookup commit_metadata
Consequently, if users want to use modified path Bloom filters, then
they should avoid gc, both manual and auto, or they'll have to
re-generate the Bloom filters every once in a while. That is
definitely not the desired behaviour.
Now compare this e.g. to the behaviour of 'git update-index
--split-index' and '--untracked-cache': both of these options turn on
features that improve performance and write extra stuff to the index,
and after they did so all subsequent git commands updating the index
will keep writing that extra stuff, including 'git update-index'
itself even without those options, until it's finally invoked with the
corresponding '--no-...' option. I particularly like how
'--[no-]untracked-cache' and 'core.untrackedCache' work together and
warn when the given command line option goes against the configured
value, and I think the command line options and configuration
variables controlling modified path Bloom filters should behave
similarly.
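A rough sketch of the precedence I have in mind follows. All names in
it are hypothetical (there is no such enum or helper in the patch), and
the 'config' argument merely stands in for whatever configuration
variable ends up controlling this, with -1 meaning unset.

```c
#include <stdio.h>

/* Hypothetical tri-state: what the user asked for on the command line. */
enum changed_paths_opt {
	CHANGED_PATHS_UNSPECIFIED = -1,
	CHANGED_PATHS_OFF = 0,
	CHANGED_PATHS_ON = 1
};

/*
 * Decide whether this write should include Bloom filters.
 *
 *   opt               - command line option, if any
 *   config            - hypothetical config value: -1 unset, 0 false, 1 true
 *   graph_has_filters - the existing commit-graph already contains filters
 */
static int should_write_bloom_filters(enum changed_paths_opt opt,
				      int config, int graph_has_filters)
{
	if (opt != CHANGED_PATHS_UNSPECIFIED) {
		/* An explicit option wins, but warn if it contradicts the config. */
		if (config >= 0 && (config != 0) != (opt == CHANGED_PATHS_ON))
			fprintf(stderr, "warning: command line option "
				"contradicts the configured value\n");
		return opt == CHANGED_PATHS_ON;
	}

	/* No option given: follow the config if it is set. */
	if (config >= 0)
		return config != 0;

	/* Neither given: keep whatever the existing commit-graph has. */
	return graph_has_filters;
}
```

With something like this, a plain 'git gc' (no option, no config) would
keep existing filters instead of silently dropping them, and only an
explicit '--no-changed-paths' or a false config value would remove them.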
> With the `--split` option, write the commit-graph as a chain of multiple
> commit-graph files stored in `<dir>/info/commit-graphs`. The new commits
> not already in the commit-graph are added in a new "tip" file. This file
> diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
> index d1ab6625f63..cacb5d04a80 100644
> --- a/builtin/commit-graph.c
> +++ b/builtin/commit-graph.c
> @@ -9,7 +9,7 @@
>
> static char const * const builtin_commit_graph_usage[] = {
> N_("git commit-graph verify [--object-dir <objdir>] [--shallow] [--[no-]progress]"),
> - N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
> + N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
> NULL
> };
>
> @@ -19,7 +19,7 @@ static const char * const builtin_commit_graph_verify_usage[] = {
> };
>
> static const char * const builtin_commit_graph_write_usage[] = {
> - N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
> + N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
> NULL
> };
>
> @@ -32,6 +32,7 @@ static struct opts_commit_graph {
> int split;
> int shallow;
> int progress;
> + int enable_changed_paths;
> } opts;
>
> static struct object_directory *find_odb(struct repository *r,
> @@ -135,6 +136,8 @@ static int graph_write(int argc, const char **argv)
> N_("start walk at commits listed by stdin")),
> OPT_BOOL(0, "append", &opts.append,
> N_("include all commits already in the commit-graph file")),
> + OPT_BOOL(0, "changed-paths", &opts.enable_changed_paths,
> + N_("enable computation for changed paths")),
> OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")),
> OPT_BOOL(0, "split", &opts.split,
> N_("allow writing an incremental commit-graph file")),
> @@ -168,6 +171,8 @@ static int graph_write(int argc, const char **argv)
> flags |= COMMIT_GRAPH_WRITE_SPLIT;
> if (opts.progress)
> flags |= COMMIT_GRAPH_WRITE_PROGRESS;
> + if (opts.enable_changed_paths)
> + flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
>
> read_replace_refs = 0;
> odb = find_odb(the_repository, opts.obj_dir);
> --
> gitgitgadget
>
* [PATCH v4 12/15] revision.c: use Bloom filters to speed up path based revision walks
2020-04-06 16:59 ` [PATCH v4 00/15] Changed Paths Bloom Filters Garima Singh via GitGitGadget
` (10 preceding siblings ...)
2020-04-06 16:59 ` [PATCH v4 11/15] commit-graph: add --changed-paths option to write subcommand Garima Singh via GitGitGadget
@ 2020-04-06 16:59 ` Garima Singh via GitGitGadget
2020-06-26 6:34 ` SZEDER Gábor
2020-04-06 16:59 ` [PATCH v4 13/15] revision.c: add trace2 stats around Bloom filter usage Garima Singh via GitGitGadget
` (3 subsequent siblings)
15 siblings, 1 reply; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-04-06 16:59 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
Revision walk will now use Bloom filters for commits to speed up
revision walks for a particular path (for computing history for
that path), if they are present in the commit-graph file.
We load the Bloom filters during the prepare_revision_walk step,
currently only when dealing with a single pathspec. Extending
it to work with multiple pathspecs can be explored and built on
top of this series in the future.
While comparing trees in rev_compare_tree(), if the Bloom filter
says that the file is not different between the two trees, we don't
need to compute the expensive diff. This is where we get our
performance gains. The other response of the Bloom filter is 'maybe',
in which case we fall back to the full diff calculation to determine
if the path was changed in the commit.
We do not try to use Bloom filters when the '--walk-reflogs' option
is specified. The '--walk-reflogs' option does not walk the commit
ancestry chain like the rest of the options. Incorporating the
performance gains when walking reflog entries would add more
complexity, and can be explored in a later series.
Performance Gains:
We tested the performance of `git log -- <path>` on the git repo, the linux
repo, and some large internal repos, with a variety of paths of varying depths.
On the git and linux repos:
- we observed a 2x to 5x speed up.
On a large internal repo with files seated 6-10 levels deep in the tree:
- we observed 10x to 20x speed ups, with some paths going up to 28 times
faster.
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Helped-by: SZEDER Gábor <szeder.dev@gmail.com>
Helped-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
bloom.c | 20 +++++++++++++
bloom.h | 4 +++
revision.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
revision.h | 11 +++++++
4 files changed, 118 insertions(+), 2 deletions(-)
diff --git a/bloom.c b/bloom.c
index 0f714dd76ae..c5b461d1cfe 100644
--- a/bloom.c
+++ b/bloom.c
@@ -253,3 +253,23 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
return filter;
}
+
+int bloom_filter_contains(const struct bloom_filter *filter,
+ const struct bloom_key *key,
+ const struct bloom_filter_settings *settings)
+{
+ int i;
+ uint64_t mod = filter->len * BITS_PER_WORD;
+
+ if (!mod)
+ return -1;
+
+ for (i = 0; i < settings->num_hashes; i++) {
+ uint64_t hash_mod = key->hashes[i] % mod;
+ uint64_t block_pos = hash_mod / BITS_PER_WORD;
+ if (!(filter->data[block_pos] & get_bitmask(hash_mod)))
+ return 0;
+ }
+
+ return 1;
+}
\ No newline at end of file
diff --git a/bloom.h b/bloom.h
index 760d7122374..b935186425d 100644
--- a/bloom.h
+++ b/bloom.h
@@ -83,4 +83,8 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
struct commit *c,
int compute_if_not_present);
+int bloom_filter_contains(const struct bloom_filter *filter,
+ const struct bloom_key *key,
+ const struct bloom_filter_settings *settings);
+
#endif
\ No newline at end of file
diff --git a/revision.c b/revision.c
index 8136929e236..d3fcb7c6ff6 100644
--- a/revision.c
+++ b/revision.c
@@ -29,6 +29,7 @@
#include "prio-queue.h"
#include "hashmap.h"
#include "utf8.h"
+#include "bloom.h"
volatile show_early_output_fn_t show_early_output;
@@ -624,11 +625,80 @@ static void file_change(struct diff_options *options,
options->flags.has_changes = 1;
}
+static void prepare_to_use_bloom_filter(struct rev_info *revs)
+{
+ struct pathspec_item *pi;
+ char *path_alloc = NULL;
+ const char *path;
+ int last_index;
+ int len;
+
+ if (!revs->commits)
+ return;
+
+ repo_parse_commit(revs->repo, revs->commits->item);
+
+ if (!revs->repo->objects->commit_graph)
+ return;
+
+ revs->bloom_filter_settings = revs->repo->objects->commit_graph->bloom_filter_settings;
+ if (!revs->bloom_filter_settings)
+ return;
+
+ pi = &revs->pruning.pathspec.items[0];
+ last_index = pi->len - 1;
+
+ /* remove single trailing slash from path, if needed */
+ if (pi->match[last_index] == '/') {
+ path_alloc = xstrdup(pi->match);
+ path_alloc[last_index] = '\0';
+ path = path_alloc;
+ } else
+ path = pi->match;
+
+ len = strlen(path);
+
+ revs->bloom_key = xmalloc(sizeof(struct bloom_key));
+ fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
+
+ free(path_alloc);
+}
+
+static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
+ struct commit *commit)
+{
+ struct bloom_filter *filter;
+ int result;
+
+ if (!revs->repo->objects->commit_graph)
+ return -1;
+
+ if (commit->generation == GENERATION_NUMBER_INFINITY)
+ return -1;
+
+ filter = get_bloom_filter(revs->repo, commit, 0);
+
+ if (!filter) {
+ return -1;
+ }
+
+ if (!filter->len) {
+ return -1;
+ }
+
+ result = bloom_filter_contains(filter,
+ revs->bloom_key,
+ revs->bloom_filter_settings);
+
+ return result;
+}
+
static int rev_compare_tree(struct rev_info *revs,
- struct commit *parent, struct commit *commit)
+ struct commit *parent, struct commit *commit, int nth_parent)
{
struct tree *t1 = get_commit_tree(parent);
struct tree *t2 = get_commit_tree(commit);
+ int bloom_ret = 1;
if (!t1)
return REV_TREE_NEW;
@@ -653,11 +723,19 @@ static int rev_compare_tree(struct rev_info *revs,
return REV_TREE_SAME;
}
+ if (revs->bloom_key && !nth_parent) {
+ bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
+
+ if (bloom_ret == 0)
+ return REV_TREE_SAME;
+ }
+
tree_difference = REV_TREE_SAME;
revs->pruning.flags.has_changes = 0;
if (diff_tree_oid(&t1->object.oid, &t2->object.oid, "",
&revs->pruning) < 0)
return REV_TREE_DIFFERENT;
+
return tree_difference;
}
@@ -855,7 +933,7 @@ static void try_to_simplify_commit(struct rev_info *revs, struct commit *commit)
die("cannot simplify commit %s (because of %s)",
oid_to_hex(&commit->object.oid),
oid_to_hex(&p->object.oid));
- switch (rev_compare_tree(revs, p, commit)) {
+ switch (rev_compare_tree(revs, p, commit, nth_parent)) {
case REV_TREE_SAME:
if (!revs->simplify_history || !relevant_commit(p)) {
/* Even if a merge with an uninteresting
@@ -3362,6 +3440,8 @@ int prepare_revision_walk(struct rev_info *revs)
FOR_EACH_OBJECT_PROMISOR_ONLY);
}
+ if (revs->pruning.pathspec.nr == 1 && !revs->reflog_info)
+ prepare_to_use_bloom_filter(revs);
if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED)
commit_list_sort_by_date(&revs->commits);
if (revs->no_walk)
@@ -3379,6 +3459,7 @@ int prepare_revision_walk(struct rev_info *revs)
simplify_merges(revs);
if (revs->children.name)
set_children(revs);
+
return 0;
}
diff --git a/revision.h b/revision.h
index 475f048fb61..7c026fe41fc 100644
--- a/revision.h
+++ b/revision.h
@@ -56,6 +56,8 @@ struct repository;
struct rev_info;
struct string_list;
struct saved_parents;
+struct bloom_key;
+struct bloom_filter_settings;
define_shared_commit_slab(revision_sources, char *);
struct rev_cmdline_info {
@@ -291,6 +293,15 @@ struct rev_info {
struct revision_sources *sources;
struct topo_walk_info *topo_walk_info;
+
+ /* Commit graph bloom filter fields */
+ /* The bloom filter key for the pathspec */
+ struct bloom_key *bloom_key;
+ /*
+ * The bloom filter settings used to generate the key.
+ * This is loaded from the commit-graph being used.
+ */
+ struct bloom_filter_settings *bloom_filter_settings;
};
int ref_excluded(struct string_list *, const char *path);
--
gitgitgadget
* Re: [PATCH v4 12/15] revision.c: use Bloom filters to speed up path based revision walks
2020-04-06 16:59 ` [PATCH v4 12/15] revision.c: use Bloom filters to speed up path based revision walks Garima Singh via GitGitGadget
@ 2020-06-26 6:34 ` SZEDER Gábor
0 siblings, 0 replies; 159+ messages in thread
From: SZEDER Gábor @ 2020-06-26 6:34 UTC (permalink / raw)
To: Garima Singh via GitGitGadget; +Cc: git, stolee, jonathantanmy, Garima Singh
On Mon, Apr 06, 2020 at 04:59:52PM +0000, Garima Singh via GitGitGadget wrote:
> +static void prepare_to_use_bloom_filter(struct rev_info *revs)
> +{
> + struct pathspec_item *pi;
> + char *path_alloc = NULL;
> + const char *path;
> + int last_index;
> + int len;
> +
> + if (!revs->commits)
> + return;
> +
> + repo_parse_commit(revs->repo, revs->commits->item);
> +
> + if (!revs->repo->objects->commit_graph)
> + return;
> +
> + revs->bloom_filter_settings = revs->repo->objects->commit_graph->bloom_filter_settings;
> + if (!revs->bloom_filter_settings)
> + return;
> +
> + pi = &revs->pruning.pathspec.items[0];
> + last_index = pi->len - 1;
> +
> + /* remove single trailing slash from path, if needed */
> + if (pi->match[last_index] == '/') {
> + path_alloc = xstrdup(pi->match);
> + path_alloc[last_index] = '\0';
> + path = path_alloc;
fill_bloom_key() takes a length parameter, so there is no need to
duplicate the path to be able to shorten it by one character to remove
that trailing '/'.
> + } else
> + path = pi->match;
> +
> + len = strlen(path);
'struct pathspec_item's 'len' field already contains the length of the
path, so there is no need for this strlen().
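Both remarks can be addressed by trimming the length instead of the
string. A minimal sketch, assuming fill_bloom_key() reads only the first
'len' bytes of the path it is given; the helper name bloom_path_len() is
hypothetical and not part of the patch:

```c
#include <stddef.h>

/*
 * Hypothetical helper: return how many bytes of 'match' should be hashed,
 * dropping a single trailing slash if there is one. 'len' is the length
 * the caller already knows (for example the 'len' field of
 * struct pathspec_item), so neither strlen() nor xstrdup() is needed.
 */
size_t bloom_path_len(const char *match, size_t len)
{
	if (len > 0 && match[len - 1] == '/')
		len--;
	return len;
}
```

With something like this, the caller could pass pi->match and the
returned length straight to fill_bloom_key(), and the path_alloc
allocation and its free() would no longer be needed.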
> +
> + revs->bloom_key = xmalloc(sizeof(struct bloom_key));
> + fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
> +
> + free(path_alloc);
> +}
> @@ -3362,6 +3440,8 @@ int prepare_revision_walk(struct rev_info *revs)
> FOR_EACH_OBJECT_PROMISOR_ONLY);
> }
>
> + if (revs->pruning.pathspec.nr == 1 && !revs->reflog_info)
> + prepare_to_use_bloom_filter(revs);
> if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED)
> commit_list_sort_by_date(&revs->commits);
> if (revs->no_walk)
return 0;
if (revs->limited) {
if (limit_list(revs) < 0)
return -1;
I extended the hunk context a bit to show that
prepare_to_use_bloom_filter() is called before limit_list(). This is
important because a command that combines excluded revisions with a
pathspec, e.g. 'git log ^v1.2.3 -- dir/file', performs a lot of diffs
in limit_list(), and this way it can benefit from Bloom filters too.
> @@ -3379,6 +3459,7 @@ int prepare_revision_walk(struct rev_info *revs)
> simplify_merges(revs);
> if (revs->children.name)
> set_children(revs);
> +
> return 0;
> }
>
> diff --git a/revision.h b/revision.h
> index 475f048fb61..7c026fe41fc 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -56,6 +56,8 @@ struct repository;
> struct rev_info;
> struct string_list;
> struct saved_parents;
> +struct bloom_key;
> +struct bloom_filter_settings;
> define_shared_commit_slab(revision_sources, char *);
>
> struct rev_cmdline_info {
> @@ -291,6 +293,15 @@ struct rev_info {
> struct revision_sources *sources;
>
> struct topo_walk_info *topo_walk_info;
> +
> + /* Commit graph bloom filter fields */
> + /* The bloom filter key for the pathspec */
> + struct bloom_key *bloom_key;
> + /*
> + * The bloom filter settings used to generate the key.
> + * This is loaded from the commit-graph being used.
> + */
> + struct bloom_filter_settings *bloom_filter_settings;
> };
>
> int ref_excluded(struct string_list *, const char *path);
> --
> gitgitgadget
>
* [PATCH v4 13/15] revision.c: add trace2 stats around Bloom filter usage
2020-04-06 16:59 ` [PATCH v4 00/15] Changed Paths Bloom Filters Garima Singh via GitGitGadget
` (11 preceding siblings ...)
2020-04-06 16:59 ` [PATCH v4 12/15] revision.c: use Bloom filters to speed up path based revision walks Garima Singh via GitGitGadget
@ 2020-04-06 16:59 ` Garima Singh via GitGitGadget
2020-04-06 16:59 ` [PATCH v4 14/15] t4216: add end to end tests for git log with Bloom filters Garima Singh via GitGitGadget
` (2 subsequent siblings)
15 siblings, 0 replies; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-04-06 16:59 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
Add trace2 statistics around Bloom filter usage and behavior
for 'git log -- path' commands that are hoping to benefit from
the presence of computed changed paths Bloom filters.
These statistics are great for performance analysis work and
for formal testing, which we will see in the commit following
this one.
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Helped-by: SZEDER Gábor <szeder.dev@gmail.com>
Helped-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
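How the five counters relate to each other can be sketched in isolation.
This is only an illustration: struct bloom_stats and
record_bloom_outcome() are hypothetical names, and the patch itself
spreads this logic across check_maybe_different_in_bloom_filter() and
rev_compare_tree() and counts false positives only for the first parent.

```c
#include <stddef.h>

/* Counters mirroring the five fields reported in the trace2 output. */
struct bloom_stats {
	unsigned int not_present;
	unsigned int length_zero;
	unsigned int maybe;
	unsigned int definitely_not;
	unsigned int false_positive;
};

/*
 * Hypothetical helper that folds one commit's outcome into the counters.
 *   has_filter:       a filter was found for the commit
 *   filter_len:       its length in bytes (0 means it cannot be used)
 *   bloom_says_maybe: non-zero if the filter may contain the path
 *   tree_same:        non-zero if the real tree diff found no change
 * Returns -1 when the filter could not be consulted, otherwise the
 * filter's answer (0 or 1).
 */
int record_bloom_outcome(struct bloom_stats *s, int has_filter,
			 size_t filter_len, int bloom_says_maybe,
			 int tree_same)
{
	if (!has_filter) {
		s->not_present++;
		return -1;
	}
	if (!filter_len) {
		s->length_zero++;
		return -1;
	}
	if (!bloom_says_maybe) {
		s->definitely_not++;
		return 0;
	}
	s->maybe++;
	/* The filter said "maybe" but the diff found nothing. */
	if (tree_same)
		s->false_positive++;
	return 1;
}
```

In this sketch false_positive can never exceed maybe, which is a handy
sanity check when reading the logged statistics.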
revision.c | 41 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 41 insertions(+)
diff --git a/revision.c b/revision.c
index d3fcb7c6ff6..2b06ee739c8 100644
--- a/revision.c
+++ b/revision.c
@@ -30,6 +30,7 @@
#include "hashmap.h"
#include "utf8.h"
#include "bloom.h"
+#include "json-writer.h"
volatile show_early_output_fn_t show_early_output;
@@ -625,6 +626,30 @@ static void file_change(struct diff_options *options,
options->flags.has_changes = 1;
}
+static int bloom_filter_atexit_registered;
+static unsigned int count_bloom_filter_maybe;
+static unsigned int count_bloom_filter_definitely_not;
+static unsigned int count_bloom_filter_false_positive;
+static unsigned int count_bloom_filter_not_present;
+static unsigned int count_bloom_filter_length_zero;
+
+static void trace2_bloom_filter_statistics_atexit(void)
+{
+ struct json_writer jw = JSON_WRITER_INIT;
+
+ jw_object_begin(&jw, 0);
+ jw_object_intmax(&jw, "filter_not_present", count_bloom_filter_not_present);
+ jw_object_intmax(&jw, "zero_length_filter", count_bloom_filter_length_zero);
+ jw_object_intmax(&jw, "maybe", count_bloom_filter_maybe);
+ jw_object_intmax(&jw, "definitely_not", count_bloom_filter_definitely_not);
+ jw_object_intmax(&jw, "false_positive", count_bloom_filter_false_positive);
+ jw_end(&jw);
+
+ trace2_data_json("bloom", the_repository, "statistics", &jw);
+
+ jw_release(&jw);
+}
+
static void prepare_to_use_bloom_filter(struct rev_info *revs)
{
struct pathspec_item *pi;
@@ -661,6 +686,11 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
revs->bloom_key = xmalloc(sizeof(struct bloom_key));
fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
+ if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
+ atexit(trace2_bloom_filter_statistics_atexit);
+ bloom_filter_atexit_registered = 1;
+ }
+
free(path_alloc);
}
@@ -679,10 +709,12 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
filter = get_bloom_filter(revs->repo, commit, 0);
if (!filter) {
+ count_bloom_filter_not_present++;
return -1;
}
if (!filter->len) {
+ count_bloom_filter_length_zero++;
return -1;
}
@@ -690,6 +722,11 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
revs->bloom_key,
revs->bloom_filter_settings);
+ if (result)
+ count_bloom_filter_maybe++;
+ else
+ count_bloom_filter_definitely_not++;
+
return result;
}
@@ -736,6 +773,10 @@ static int rev_compare_tree(struct rev_info *revs,
&revs->pruning) < 0)
return REV_TREE_DIFFERENT;
+ if (!nth_parent)
+ if (bloom_ret == 1 && tree_difference == REV_TREE_SAME)
+ count_bloom_filter_false_positive++;
+
return tree_difference;
}
--
gitgitgadget
* [PATCH v4 14/15] t4216: add end to end tests for git log with Bloom filters
2020-04-06 16:59 ` [PATCH v4 00/15] Changed Paths Bloom Filters Garima Singh via GitGitGadget
` (12 preceding siblings ...)
2020-04-06 16:59 ` [PATCH v4 13/15] revision.c: add trace2 stats around Bloom filter usage Garima Singh via GitGitGadget
@ 2020-04-06 16:59 ` Garima Singh via GitGitGadget
2020-04-06 16:59 ` [PATCH v4 15/15] commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag Garima Singh via GitGitGadget
2020-04-08 15:51 ` [PATCH v4 00/15] Changed Paths Bloom Filters Derrick Stolee
15 siblings, 0 replies; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-04-06 16:59 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
These tests exercise writing the commit-graph with Bloom filters
and exercise 'git log -- path' with all the applicable
options. They check that the output is the same with and
without Bloom filters, and confirm that Bloom filters were used by
checking that trace2 statistics were logged correctly.
They also confirm the cases where Bloom filters are not used:
1. Multiple path specs,
2. --walk-reflogs (see patch titled 'revision.c: use Bloom filters...'
for details),
3. If the latest commit graph does not have Bloom filters
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
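Comparing the log output with and without Bloom filters works because a
Bloom filter may answer "maybe" for a path that was never added, but it
never answers "definitely not" for a path that was. A toy sketch of that
property follows. Everything in it (the toy_* names, the FNV-1a hash,
the 128-bit size, the seven probes) is an assumption made for
illustration and is not the implementation behind bloom.h.

```c
#include <stdint.h>
#include <string.h>

#define TOY_FILTER_BYTES 16	/* 128 bits; arbitrary size for the sketch */
#define TOY_NUM_HASHES 7	/* arbitrary probe count for the sketch */

struct toy_bloom {
	unsigned char bits[TOY_FILTER_BYTES];
};

/* FNV-1a mixed with a seed; a stand-in for a real hash function. */
static uint32_t toy_hash(const char *s, size_t len, uint32_t seed)
{
	uint32_t h = 2166136261u ^ seed;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= (unsigned char)s[i];
		h *= 16777619u;
	}
	return h;
}

/* Bit position of the i-th probe, using double hashing: h0 + i * h1. */
static uint32_t toy_bit(const char *path, uint32_t i)
{
	size_t len = strlen(path);
	uint32_t h0 = toy_hash(path, len, 0);
	uint32_t h1 = toy_hash(path, len, 0x9e3779b9u);

	return (h0 + i * h1) % (TOY_FILTER_BYTES * 8);
}

/* Set every probe bit for 'path'. */
void toy_bloom_add(struct toy_bloom *f, const char *path)
{
	uint32_t i;

	for (i = 0; i < TOY_NUM_HASHES; i++) {
		uint32_t bit = toy_bit(path, i);
		f->bits[bit / 8] |= 1u << (bit % 8);
	}
}

/* Returns 0 for "definitely not changed", 1 for "maybe changed". */
int toy_bloom_contains(const struct toy_bloom *f, const char *path)
{
	uint32_t i;

	for (i = 0; i < TOY_NUM_HASHES; i++) {
		uint32_t bit = toy_bit(path, i);
		if (!(f->bits[bit / 8] & (1u << (bit % 8))))
			return 0;
	}
	return 1;
}
```

A "maybe" answer still has to be confirmed by a real tree diff, which is
why the filter can only skip work and should never change the output.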
t/helper/test-read-graph.c | 4 +
t/t4216-log-bloom.sh | 155 +++++++++++++++++++++++++++++++++++++
2 files changed, 159 insertions(+)
create mode 100755 t/t4216-log-bloom.sh
diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index f8a461767ca..4223ff32fb6 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -45,6 +45,10 @@ int cmd__read_graph(int argc, const char **argv)
printf(" commit_metadata");
if (graph->chunk_extra_edges)
printf(" extra_edges");
+ if (graph->chunk_bloom_indexes)
+ printf(" bloom_indexes");
+ if (graph->chunk_bloom_data)
+ printf(" bloom_data");
printf("\n");
UNLEAK(graph);
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
new file mode 100755
index 00000000000..38accd272df
--- /dev/null
+++ b/t/t4216-log-bloom.sh
@@ -0,0 +1,155 @@
+#!/bin/sh
+
+test_description='git log for a path with Bloom filters'
+. ./test-lib.sh
+
+GIT_TEST_COMMIT_GRAPH=0
+GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
+
+test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
+ git init &&
+ mkdir A A/B A/B/C &&
+ test_commit c1 A/file1 &&
+ test_commit c2 A/B/file2 &&
+ test_commit c3 A/B/C/file3 &&
+ test_commit c4 A/file1 &&
+ test_commit c5 A/B/file2 &&
+ test_commit c6 A/B/C/file3 &&
+ test_commit c7 A/file1 &&
+ test_commit c8 A/B/file2 &&
+ test_commit c9 A/B/C/file3 &&
+ test_commit c10 file_to_be_deleted &&
+ git checkout -b side HEAD~4 &&
+ test_commit side-1 file4 &&
+ git checkout master &&
+ git merge side &&
+ test_commit c11 file5 &&
+ mv file5 file5_renamed &&
+ git add file5_renamed &&
+ git commit -m "rename" &&
+ rm file_to_be_deleted &&
+ git add . &&
+ git commit -m "file removed" &&
+ git commit-graph write --reachable --changed-paths
+'
+graph_read_expect () {
+ NUM_CHUNKS=5
+ cat >expect <<- EOF
+ header: 43475048 1 1 $NUM_CHUNKS 0
+ num_commits: $1
+ chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
+ EOF
+ test-tool read-graph >actual &&
+ test_cmp expect actual
+}
+
+test_expect_success 'commit-graph write wrote out the bloom chunks' '
+ graph_read_expect 15
+'
+
+# Turn off any inherited trace2 settings for this test.
+sane_unset GIT_TRACE2 GIT_TRACE2_PERF GIT_TRACE2_EVENT
+sane_unset GIT_TRACE2_PERF_BRIEF
+sane_unset GIT_TRACE2_CONFIG_PARAMS
+
+setup () {
+ rm -f "$TRASH_DIRECTORY/trace.perf" &&
+ git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom &&
+ GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom
+}
+
+test_bloom_filters_used () {
+ log_args=$1
+ bloom_trace_prefix="statistics:{\"filter_not_present\":0,\"zero_length_filter\":0,\"maybe\""
+ setup "$log_args" &&
+ grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
+ test_cmp log_wo_bloom log_w_bloom &&
+ test_path_is_file "$TRASH_DIRECTORY/trace.perf"
+}
+
+test_bloom_filters_not_used () {
+ log_args=$1
+ setup "$log_args" &&
+ ! grep -q "statistics:{\"filter_not_present\":" "$TRASH_DIRECTORY/trace.perf" &&
+ test_cmp log_wo_bloom log_w_bloom
+}
+
+for path in A A/B A/B/C A/file1 A/B/file2 A/B/C/file3 file4 file5 file5_renamed file_to_be_deleted
+do
+ for option in "" \
+ "--all" \
+ "--full-history" \
+ "--full-history --simplify-merges" \
+ "--simplify-merges" \
+ "--simplify-by-decoration" \
+ "--follow" \
+ "--first-parent" \
+ "--topo-order" \
+ "--date-order" \
+ "--author-date-order" \
+ "--ancestry-path side..master"
+ do
+ test_expect_success "git log option: $option for path: $path" '
+ test_bloom_filters_used "$option -- $path"
+ '
+ done
+done
+
+test_expect_success 'git log -- folder works with and without the trailing slash' '
+ test_bloom_filters_used "-- A" &&
+ test_bloom_filters_used "-- A/"
+'
+
+test_expect_success 'git log for path that does not exist. ' '
+ test_bloom_filters_used "-- path_does_not_exist"
+'
+
+test_expect_success 'git log with --walk-reflogs does not use Bloom filters' '
+ test_bloom_filters_not_used "--walk-reflogs -- A"
+'
+
+test_expect_success 'git log -- multiple path specs does not use Bloom filters' '
+ test_bloom_filters_not_used "-- file4 A/file1"
+'
+
+test_expect_success 'git log with wildcard that resolves to a single path uses Bloom filters' '
+ test_bloom_filters_used "-- *4" &&
+ test_bloom_filters_used "-- *renamed"
+'
+
+test_expect_success 'git log with wildcard that resolves to multiple paths does not use Bloom filters' '
+ test_bloom_filters_not_used "-- *" &&
+ test_bloom_filters_not_used "-- file*"
+'
+
+test_expect_success 'setup - add commit-graph to the chain without Bloom filters' '
+ test_commit c14 A/anotherFile2 &&
+ test_commit c15 A/B/anotherFile2 &&
+ test_commit c16 A/B/C/anotherFile2 &&
+ GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0 git commit-graph write --reachable --split &&
+ test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain
+'
+
+test_expect_success 'Do not use Bloom filters if the latest graph does not have Bloom filters.' '
+ test_bloom_filters_not_used "-- A/B"
+'
+
+test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
+ test_commit c17 A/anotherFile3 &&
+ git commit-graph write --reachable --changed-paths --split &&
+ test_line_count = 3 .git/objects/info/commit-graphs/commit-graph-chain
+'
+
+test_bloom_filters_used_when_some_filters_are_missing () {
+ log_args=$1
+ bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":8,\"definitely_not\":6"
+ setup "$log_args" &&
+ grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
+ test_cmp log_wo_bloom log_w_bloom
+}
+
+test_expect_success 'Use Bloom filters if they exist in the latest but not all commit graphs in the chain.' '
+ test_bloom_filters_used_when_some_filters_are_missing "-- A/B"
+'
+
+test_done
\ No newline at end of file
--
gitgitgadget
* [PATCH v4 15/15] commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag
2020-04-06 16:59 ` [PATCH v4 00/15] Changed Paths Bloom Filters Garima Singh via GitGitGadget
` (13 preceding siblings ...)
2020-04-06 16:59 ` [PATCH v4 14/15] t4216: add end to end tests for git log with Bloom filters Garima Singh via GitGitGadget
@ 2020-04-06 16:59 ` Garima Singh via GitGitGadget
2020-04-08 15:51 ` [PATCH v4 00/15] Changed Paths Bloom Filters Derrick Stolee
15 siblings, 0 replies; 159+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-04-06 16:59 UTC (permalink / raw)
To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh, Garima Singh
From: Garima Singh <garima.singh@microsoft.com>
Add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag to the test setup suite
in order to toggle writing Bloom filters when running any of the git tests.
If set to true, we will compute and write Bloom filters every time a test
calls `git commit-graph write`, as if the `--changed-paths` option was
passed in.
The test suite passes when GIT_TEST_COMMIT_GRAPH and
GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS are enabled.
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
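The toggle amounts to OR-ing the command-line option with a boolean read
from the environment. A simplified sketch: toy_parse_bool(),
toy_env_bool() and want_changed_paths() are hypothetical names, and the
real git_env_bool() likely accepts more spellings and handles malformed
values differently.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical, simplified boolean parser. Returns 'def' when 'value'
 * is NULL (variable unset); otherwise treats "1", "true", "yes" and
 * "on" as true and every other string as false.
 */
int toy_parse_bool(const char *value, int def)
{
	if (!value)
		return def;
	return !strcmp(value, "1") || !strcmp(value, "true") ||
	       !strcmp(value, "yes") || !strcmp(value, "on");
}

/* Look up 'name' in the environment and parse it as a boolean. */
int toy_env_bool(const char *name, int def)
{
	return toy_parse_bool(getenv(name), def);
}

/* Mirrors the condition in graph_write(): option OR environment flag. */
int want_changed_paths(int opt_enable_changed_paths)
{
	return opt_enable_changed_paths ||
	       toy_env_bool("GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS", 0);
}
```

Because the default is 0, an unset variable leaves the behavior of
'git commit-graph write' unchanged outside the test suite.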
builtin/commit-graph.c | 3 ++-
ci/run-build-and-tests.sh | 1 +
commit-graph.h | 1 +
t/README | 5 +++++
t/t5318-commit-graph.sh | 2 ++
t/t5324-split-commit-graph.sh | 1 +
6 files changed, 12 insertions(+), 1 deletion(-)
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index cacb5d04a80..59009837dc9 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -171,7 +171,8 @@ static int graph_write(int argc, const char **argv)
flags |= COMMIT_GRAPH_WRITE_SPLIT;
if (opts.progress)
flags |= COMMIT_GRAPH_WRITE_PROGRESS;
- if (opts.enable_changed_paths)
+ if (opts.enable_changed_paths ||
+ git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0))
flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
read_replace_refs = 0;
diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
index 4df54c4efea..17e25aade96 100755
--- a/ci/run-build-and-tests.sh
+++ b/ci/run-build-and-tests.sh
@@ -19,6 +19,7 @@ linux-gcc)
export GIT_TEST_OE_SIZE=10
export GIT_TEST_OE_DELTA_SIZE=5
export GIT_TEST_COMMIT_GRAPH=1
+ export GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=1
export GIT_TEST_MULTI_PACK_INDEX=1
export GIT_TEST_ADD_I_USE_BUILTIN=1
make test
diff --git a/commit-graph.h b/commit-graph.h
index 8e7a8e0e5b2..8655d064c14 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -9,6 +9,7 @@
#define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
#define GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD"
+#define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
struct commit;
struct bloom_filter_settings;
diff --git a/t/README b/t/README
index 369e3a9ded8..4f53da53a15 100644
--- a/t/README
+++ b/t/README
@@ -378,6 +378,11 @@ GIT_TEST_COMMIT_GRAPH=<boolean>, when true, forces the commit-graph to
be written after every 'git commit' command, and overrides the
'core.commitGraph' setting to true.
+GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=<boolean>, when true, forces
+commit-graph write to compute and write changed path Bloom filters for
+every 'git commit-graph write', as if the `--changed-paths` option was
+passed in.
+
GIT_TEST_FSMONITOR=$PWD/t7519/fsmonitor-all exercises the fsmonitor
code path for utilizing a file system monitor to speed up detecting
new or changed files.
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 9bf920ae171..18304a65e4d 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -3,6 +3,8 @@
test_description='commit graph'
. ./test-lib.sh
+GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
+
test_expect_success 'setup full repo' '
mkdir full &&
cd "$TRASH_DIRECTORY/full" &&
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 53b2e6b4555..d3f1f2c4a71 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -4,6 +4,7 @@ test_description='split commit graph'
. ./test-lib.sh
GIT_TEST_COMMIT_GRAPH=0
+GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
test_expect_success 'setup repo' '
git init &&
--
gitgitgadget
* Re: [PATCH v4 00/15] Changed Paths Bloom Filters
2020-04-06 16:59 ` [PATCH v4 00/15] Changed Paths Bloom Filters Garima Singh via GitGitGadget
` (14 preceding siblings ...)
2020-04-06 16:59 ` [PATCH v4 15/15] commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag Garima Singh via GitGitGadget
@ 2020-04-08 15:51 ` Derrick Stolee
2020-04-08 19:21 ` Junio C Hamano
` (2 more replies)
15 siblings, 3 replies; 159+ messages in thread
From: Derrick Stolee @ 2020-04-08 15:51 UTC (permalink / raw)
To: Garima Singh via GitGitGadget, git
Cc: szeder.dev, jonathantanmy, Garima Singh, jnareb, gitster
On 4/6/2020 12:59 PM, Garima Singh via GitGitGadget wrote:
> Hey!
>
> The commit graph feature brought in a lot of performance improvements across
> multiple commands. However, file based history continues to be a performance
> pain point, especially in large repositories.
>
> Adopting changed path Bloom filters has been discussed on the list before,
> and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
> Derrick Stolee [1]. This series is based on Dr. Stolee's proof of concept in
> [2]
>
> With the changes in this series, git users will be able to choose to write
> Bloom filters to the commit-graph using the following command:
>
> 'git commit-graph write --changed-paths'
>
> Subsequent 'git log -- path' commands will use these computed Bloom filters
> to decide which commits are worth exploring further to produce the history
> of the provided path.
I noticed Jakub was not CC'd on this email. Jakub: do you plan to re-review
the new version? Or are you satisfied with the resolutions to your comments?
Is anyone else planning to review this series?
I'm just wondering when we should take this series to cook in 'next' and
start building things on top of it, such as "git blame" or "git log -L"
improvements. While it cooks, any bugs or issues could be resolved with
patches on top of this version. That would be my preference, anyway.
What do you think, Junio?
Thanks,
-Stolee
* Re: [PATCH v4 00/15] Changed Paths Bloom Filters
2020-04-08 15:51 ` [PATCH v4 00/15] Changed Paths Bloom Filters Derrick Stolee
@ 2020-04-08 19:21 ` Junio C Hamano
2020-04-08 20:05 ` Jakub Narębski
2020-04-12 20:34 ` Taylor Blau
2 siblings, 0 replies; 159+ messages in thread
From: Junio C Hamano @ 2020-04-08 19:21 UTC (permalink / raw)
To: Derrick Stolee
Cc: Garima Singh via GitGitGadget, git, szeder.dev, jonathantanmy,
Garima Singh, jnareb
Derrick Stolee <stolee@gmail.com> writes:
> I noticed Jakub was not CC'd on this email. Jakub: do you plan to re-review
> the new version? Or are you satisfied with the resolutions to your comments?
> ...
> What do you think, Junio?
I was hoping that after Jakub's review, the new round was ready for
'next' to be extended further by building on top as needed. Of
course the path-limited revision walk is one of the most important
parts of the entire system, so I'd welcome reviews from others, too.
Thanks.
* Re: [PATCH v4 00/15] Changed Paths Bloom Filters
2020-04-08 15:51 ` [PATCH v4 00/15] Changed Paths Bloom Filters Derrick Stolee
2020-04-08 19:21 ` Junio C Hamano
@ 2020-04-08 20:05 ` Jakub Narębski
2020-04-12 20:34 ` Taylor Blau
2 siblings, 0 replies; 159+ messages in thread
From: Jakub Narębski @ 2020-04-08 20:05 UTC (permalink / raw)
To: Derrick Stolee
Cc: Garima Singh via GitGitGadget, git, SZEDER Gábor,
Jonathan Tan, Garima Singh, gitster
On Wed, 8 Apr 2020 at 17:51, Derrick Stolee <stolee@gmail.com> wrote:
>
> On 4/6/2020 12:59 PM, Garima Singh via GitGitGadget wrote:
> > Hey!
> >
> > The commit graph feature brought in a lot of performance improvements across
> > multiple commands. However, file based history continues to be a performance
> > pain point, especially in large repositories.
> >
> > Adopting changed path Bloom filters has been discussed on the list before,
> > and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
> > Derrick Stolee [1]. This series is based on Dr. Stolee's proof of concept in
> > [2]
> >
> > With the changes in this series, git users will be able to choose to write
> > Bloom filters to the commit-graph using the following command:
> >
> > 'git commit-graph write --changed-paths'
> >
> > Subsequent 'git log -- path' commands will use these computed Bloom filters
> > to decide which commits are worth exploring further to produce the history
> > of the provided path.
>
> I noticed Jakub was not CC'd on this email. Jakub: do you plan to re-review
> the new version? Or are you satisfied with the resolutions to your comments?
I am planning to re-review v4 of this series when I have time,
which means probably after Easter.
I think if it handles endianness issues correctly, it should be ready.
--
Jakub Narębski
* Re: [PATCH v4 00/15] Changed Paths Bloom Filters
2020-04-08 15:51 ` [PATCH v4 00/15] Changed Paths Bloom Filters Derrick Stolee
2020-04-08 19:21 ` Junio C Hamano
2020-04-08 20:05 ` Jakub Narębski
@ 2020-04-12 20:34 ` Taylor Blau
2 siblings, 0 replies; 159+ messages in thread
From: Taylor Blau @ 2020-04-12 20:34 UTC (permalink / raw)
To: Derrick Stolee
Cc: Garima Singh via GitGitGadget, git, szeder.dev, jonathantanmy,
Garima Singh, jnareb, gitster
Hi Stolee,
On Wed, Apr 08, 2020 at 11:51:14AM -0400, Derrick Stolee wrote:
> On 4/6/2020 12:59 PM, Garima Singh via GitGitGadget wrote:
> > Hey!
> >
> > The commit graph feature brought in a lot of performance improvements across
> > multiple commands. However, file based history continues to be a performance
> > pain point, especially in large repositories.
> >
> > Adopting changed path Bloom filters has been discussed on the list before,
> > and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
> > Derrick Stolee [1]. This series is based on Dr. Stolee's proof of concept in
> > [2]
> >
> > With the changes in this series, git users will be able to choose to write
> > Bloom filters to the commit-graph using the following command:
> >
> > 'git commit-graph write --changed-paths'
> >
> > Subsequent 'git log -- path' commands will use these computed Bloom filters
> > to decide which commits are worth exploring further to produce the history
> > of the provided path.
>
> I noticed Jakub was not CC'd on this email. Jakub: do you plan to re-review
> the new version? Or are you satisfied with the resolutions to your comments?
>
> Is anyone else planning to review this series?
I feel horrible that I've had this patch series sitting in my review
backlog for months and haven't gotten to it yet, especially because I
have such an interest in these patches and know that much care was taken
to prepare them.
I read through these patches over some coffee today at a cursory level.
The high-level approach makes sense to me, and the implementation looks
solid. I think that anything that does come up (see below) can be
addressed in 'next' rather than waiting longer on this series.
For what it's worth, I'm planning on starting to test this series in
some of our testing repositories at GitHub, and I'll report back on our
experience with some notes (and patches) should anything come up.
> I'm just wondering when we should take this series to cook in 'next' and
> start building things on top of it, such as "git blame" or "git log -L"
> improvements. While it cooks, any bugs or issues could be resolved with
> patches on top of this version. That would be my preference, anyway.
That would be my preference, too.
I noticed a few small things (mostly a couple of typos and other very
minor details). But, I'd much rather build on top of this series once it
has landed in 'next' than go to a fifth re-roll since there are many
patches involved.
I also noticed that you have already sent some patches in a separate
series that are based on this one, which would apply cleanly if this
series is merged into next.
I figure that this will also be helpful as I send some patches about
extra 'commit-graph write' options out of GitHub's fork, since they will
inevitably create merge conflicts if we both are targeting 'next'. So,
I figure that this approach will ease some maintainer burden ;-).
>
> What do you think, Junio?
>
> Thanks,
> -Stolee
Thanks,
Taylor