Git Mailing List Archive on lore.kernel.org
 help / color / Atom feed
From: "SZEDER Gábor" <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: "Junio C Hamano" <gitster@pobox.com>,
	"Garima Singh" <garima.singh@microsoft.com>,
	"Derrick Stolee" <stolee@gmail.com>,
	"Jakub Narebski" <jnareb@gmail.com>, "Jeff King" <peff@peff.net>,
	"Taylor Blau" <me@ttaylorr.com>,
	"SZEDER Gábor" <szeder.dev@gmail.com>
Subject: [PATCH 20/34] commit-graph: fill the Modified Path Bloom Filter Index chunk
Date: Fri, 29 May 2020 10:50:24 +0200
Message-ID: <20200529085038.26008-21-szeder.dev@gmail.com> (raw)
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>

Collect all paths modified by all commits, and write embedded modified
path Bloom filters for those commits that modify few enough paths to
fit into a 63 bit array using 7 hashes per path, i.e. 6 or less.

For now all modified path Bloom filters are created from scratch,
re-using filters from the existing commit-graph file will be done in a
later patch in this series.

Plug this into the custom revision walking loop in close_reachable(),
so the iteration is at least somewhat friendly to the delta base
cache, and we won't have yet another loop iterating over all commits.
Take care care that this loop might encounter the same commits
multiple times (e.g. because more than one refs point to the same
commit, or a commit is present in more than one pack file), and skip
creating a modified path Bloom filter for already processed commits.
We wouldn't need to be careful about this if we were to create
modified path Bloom filters a bit later on, after the list of commits
have been deduplicated, but at that point the commits are in SHA-1
order, i.e. effectively random, which is downright hostile to the
caches.

Perhaps processing the modified paths returned by the tree-diff
machinery turned out to be more complicated than strictly necessary,
but...

So, my initial approach was to first let diff_tree_oid() and
diffcore_std() do their thing and put all modified files into a diff
queue, then iterate over that queue and add all paths (i.e. leading
directories as well) to a hashmap with a suitable comparison function
for deduplication, use the number of elements in that hashmap to
properly size the Bloom filter's bit array, and finally iterate over
all paths in the hashmap and add them to the Bloom filter.  Simple,
short, straightforward, almost idiomatic even, without subtle corner
cases.

This approach, however, involved lot of memory allocations: three
allocations for each entry in the diff queue (IOW for each modified
file), and one allocation for each hashmap entry (IOW for each
modified path), not counting allocations for the hashmap's internals.
All these memory allocations caused considerable overhead.

This patch implements a more efficient, but more complex, approach:
register callback functions in 'struct diff_options' for add/removal
and modification, which process, deduplicate, and hash modified paths
_while_ the tree-diff machinery scans and compares the two trees.
Tree-diff calls these callback functions with the filenames in order
[1], so the deduplication of modified paths basically boils down to
skipping the common prefix with the previously processed path (and
some subtlety in case of dir-file replacement, as explained in an
in-code comment).  This way we don't do any memory allocations for
each modified file or path while processing the diff output, and the
memory allocations that are still left are amortized over the whole
commit-graph writing process.

[1] The tree-diff machinery would call its callbacks with out of order
    filenames if it encountered a bogus tree object with unsorted
    entries.  This doesn't affect the correctness of modified
    path Bloom filters only their sizes: the worst thing that can
    happen is that this approach would count and hash some paths more
    than once, thus Bloom filters containing paths modified compared
    to such a bogus tree object would end up bigger than expected.
    And bogus unsorted tree object cause all kinds of weird diff
    issues anyway...

[TODO: The pathchange callback is no good.]

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 243 ++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 238 insertions(+), 5 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 3a38f25ce9..413605a29c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -16,6 +16,8 @@
 #include "progress.h"
 #include "bloom-filter.h"
 #include "commit-slab.h"
+#include "diff.h"
+#include "compat/PMurHash.h"
 
 #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
 #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
@@ -37,8 +39,10 @@
 #define GRAPH_LAST_EDGE 0x80000000
 
 #define GRAPH_MODIFIED_PATH_BLOOM_FILTER_NONE 0xffffffffffffffff
+#define GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS 63
 
 #define GRAPH_MODIFIED_PATH_BLOOM_FILTER_DEFAULT_NR_HASHES 7
+#define GRAPH_MODIFIED_PATH_BLOOM_FILTER_DEFAULT_EMBEDDED_LIMIT 6
 
 #define GRAPH_HEADER_SIZE 8
 #define GRAPH_FANOUT_SIZE (4 * 256)
@@ -773,6 +777,36 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
 	return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c);
 }
 
+static uint32_t modified_path_bloom_seeds[] = {
+	0xe83c5163U,
+	0x3b376b0cU,
+	0x72441af7U,
+	0x433860f3U,
+	0x6d9617f4U,
+	0x22705334U,
+	0xc8af66abU,
+	0x43b44ccfU,
+	0xb84f767cU,
+	0x1da177e4U
+};
+
+static void compute_modified_path_bloom_hashes_for_path(const char *path,
+		size_t len, unsigned int num_hashes, uint32_t *hashes)
+{
+	uint32_t h1, h2, i;
+
+	h1 = PMurHash32(modified_path_bloom_seeds[0], path, len);
+	h2 = PMurHash32(modified_path_bloom_seeds[1], path, len);
+
+	/* Equivalent to hashes[i] = h1 + i * h2 + (i * i * i - i) / 6 */
+	hashes[0] = h1;
+	for (i = 1; i < num_hashes; i++) {
+		h1 += h2;
+		h2 += i;
+		hashes[i] = h1;
+	}
+}
+
 struct packed_commit_list {
 	struct commit **list;
 	int nr;
@@ -816,6 +850,29 @@ struct write_commit_graph_context {
 	struct modified_path_bloom_filter_context {
 		unsigned use_modified_path_bloom_filters:1;
 		unsigned int num_hashes;
+		/*
+		 * Number of paths to be added to "embedded" modified path
+		 * Bloom filters.
+		 */
+		int embedded_limit;
+
+		struct diff_options diffopt;
+
+		/*
+		 * Used to deduplicate paths modified by the currenly
+		 * processed commit.
+		 */
+		struct strbuf prev_path;
+		size_t *hashed_prefix_lengths;
+		int hashed_prefix_lengths_nr, hashed_prefix_lengths_alloc;
+
+		/*
+		 * The tree-diff callbacks will fill this array with
+		 * 'num_hashes' hash values for each path modified by the
+		 * currently processed commit.
+		 */
+		uint32_t *hashes;
+		int hashes_nr, hashes_alloc;
 	} mpbfctx;
 };
 
@@ -1032,10 +1089,20 @@ static int write_graph_chunk_modified_path_bloom_index(struct hashfile *f,
 		bfi = modified_path_bloom_filters_peek(
 				&modified_path_bloom_filters, commit);
 
-		if (!bfi || !bfi->filter.nr_bits)
+		if (!bfi || !bfi->filter.nr_bits) {
 			hashwrite(f, &no_bloom_filter, sizeof(no_bloom_filter));
-		else
-			BUG("writing proper Bloom filters is not implemented yet");
+		} else if (bfi->filter.nr_bits == GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS) {
+			uint8_t filterdata[sizeof(uint64_t)];
+			memcpy(filterdata, bfi->filter.bits,
+			       sizeof(filterdata));
+			/*
+			 * Set the most significant bit of the first byte to
+			 * indicate embedded modified path Bloom Filter.
+			 */
+			filterdata[0] |= 1 << 7;
+			hashwrite(f, filterdata, sizeof(filterdata));
+		} else
+			BUG("writing non-embedded Bloom filters is not implemented yet");
 	}
 	return 0;
 }
@@ -1074,9 +1141,162 @@ static int add_packed_commits(const struct object_id *oid,
 	return 0;
 }
 
+static void handle_modified_file(
+		struct modified_path_bloom_filter_context *mpbfctx,
+		const char *path)
+{
+	const char *prev = mpbfctx->prev_path.buf;
+	const char *p = path;
+	size_t common_prefix_len, already_hashed_prefix_len = 0;
+	int i;
+
+	/*
+	 * We want to hash each modified path only once, i.e. if a commit
+	 * modifies both 'dir/foo' and 'dir/bar', then we want to hash
+	 * 'dir' only once.
+	 */
+
+	/*
+	 * Skip common prefix with the previously handled path, so we
+	 * hash any common leading directories only once.
+	 */
+	while (*prev && *p && *prev == *p) {
+		prev++;
+		p++;
+	}
+	common_prefix_len = p - path;
+
+	/*
+	 * However, merely skipping the common prefix is not enough.
+	 *
+	 * If a commit removes a file (symlink, or submodule) and adds a
+	 * directory with the same name (or vice versa), e.g. removes
+	 * 'foo' and adds 'foo/bar', and adds/modifies/removes 'foo.c',
+	 * then the tree-diff machinery will call this function with
+	 * those paths in the following order:
+	 *
+	 *   foo
+	 *   foo.c
+	 *   foo/bar
+	 *
+	 * So the first call hashes 'foo', the second hashes 'foo.c', so
+	 * far so good.  In the third call the common prefix of 'foo.c'
+	 * and 'foo/bar' is 'foo', and then an immediate strchrnul(p, '/')
+	 * would find the subsequent '/' and would hash 'foo' _again_.
+	 * Not so good.
+	 *
+	 * To avoid this we maintain the length of already hashed common
+	 * prefixes and skip one more character (i.e. the '/') if
+	 * necessary.
+	 */
+	for (i = 0; i < mpbfctx->hashed_prefix_lengths_nr; i++) {
+		if (common_prefix_len < mpbfctx->hashed_prefix_lengths[i]) {
+			mpbfctx->hashed_prefix_lengths_nr = i;
+			break;
+		}
+
+		already_hashed_prefix_len = mpbfctx->hashed_prefix_lengths[i];
+	}
+	if (common_prefix_len == already_hashed_prefix_len)
+		p = path + already_hashed_prefix_len + 1;
+
+	do {
+		unsigned int new_hashes_nr = mpbfctx->hashes_nr + mpbfctx->num_hashes;
+
+		p = strchrnul(p, '/');
+
+		ALLOC_GROW(mpbfctx->hashes, new_hashes_nr,
+			   mpbfctx->hashes_alloc);
+		compute_modified_path_bloom_hashes_for_path(path, p - path,
+				mpbfctx->num_hashes,
+				mpbfctx->hashes + mpbfctx->hashes_nr);
+		mpbfctx->hashes_nr = new_hashes_nr;
+
+		ALLOC_GROW(mpbfctx->hashed_prefix_lengths,
+			   mpbfctx->hashed_prefix_lengths_nr + 1,
+			   mpbfctx->hashed_prefix_lengths_alloc);
+		mpbfctx->hashed_prefix_lengths[mpbfctx->hashed_prefix_lengths_nr] = p - path;
+		mpbfctx->hashed_prefix_lengths_nr++;
+	} while (*p && p++);
+
+	/*
+	 * Update the previous path for the next iteration, copying only
+	 * as many characters as necessary.
+	 */
+	strbuf_setlen(&mpbfctx->prev_path, common_prefix_len);
+	strbuf_add(&mpbfctx->prev_path, path + common_prefix_len,
+		   p - (path + common_prefix_len));
+}
+
+static void file_add_remove_cb(struct diff_options *options,
+			       int addremove, unsigned mode,
+			       const struct object_id *oid,
+			       int oid_valid, const char *fullpath,
+			       unsigned dirty_submodule)
+{
+	handle_modified_file(options->change_fn_data, fullpath);
+}
+
+static void file_change_cb(struct diff_options *options,
+			   unsigned old_mode, unsigned new_mode,
+			   const struct object_id *old_oid,
+			   const struct object_id *new_oid,
+			   int old_oid_valid, int new_oid_valid,
+			   const char *fullpath,
+			   unsigned old_dirty_submodule,
+			   unsigned new_dirty_submodule)
+{
+	handle_modified_file(options->change_fn_data, fullpath);
+}
+
+static void create_modified_path_bloom_filter(
+		struct modified_path_bloom_filter_context *mpbfctx,
+		struct commit *commit)
+{
+	struct modified_path_bloom_filter_info *bfi;
+	struct object_id *parent_oid;
+	uintmax_t path_component_count;
+
+	if (!mpbfctx->use_modified_path_bloom_filters)
+		return;
+
+	/*
+	 * This function can be called multiple times for the same commit
+	 * (a commit can be present in multiple pack files, or multiple
+	 * refs can point to the same commit) so check first whether we have
+	 * already created a modified path Bloom filter for it.
+	 */
+	bfi = modified_path_bloom_filters_peek(&modified_path_bloom_filters,
+					       commit);
+	if (bfi && bfi->filter.nr_bits)
+		return;
+
+	parent_oid = commit->parents ? &commit->parents->item->object.oid :
+				       NULL;
+	mpbfctx->hashes_nr = 0;
+	strbuf_reset(&mpbfctx->prev_path);
+	diff_tree_oid(parent_oid, &commit->object.oid, "", &mpbfctx->diffopt);
+	path_component_count = mpbfctx->hashes_nr / mpbfctx->num_hashes;
+
+	if (path_component_count > mpbfctx->embedded_limit) {
+		/* Not implemented yet. */
+	} else {
+		bfi = modified_path_bloom_filters_at(
+				&modified_path_bloom_filters, commit);
+
+		bloom_filter_init_with_size(&bfi->filter,
+				GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS);
+		bloom_filter_set_bits(&bfi->filter, mpbfctx->hashes,
+				      mpbfctx->hashes_nr);
+	}
+}
+
 static void add_missing_parents(struct write_commit_graph_context *ctx, struct commit *commit)
 {
 	struct commit_list *parent;
+
+	create_modified_path_bloom_filter(&ctx->mpbfctx, commit);
+
 	for (parent = commit->parents; parent; parent = parent->next) {
 		if (!(parent->item->object.flags & REACHABLE)) {
 			ALLOC_GROW(ctx->oids.list, ctx->oids.nr + 1, ctx->oids.alloc);
@@ -1888,9 +2108,18 @@ int write_commit_graph(const char *obj_dir,
 	} else if (res) {
 		ctx->mpbfctx.use_modified_path_bloom_filters = 1;
 		ctx->mpbfctx.num_hashes = GRAPH_MODIFIED_PATH_BLOOM_FILTER_DEFAULT_NR_HASHES;
-	}
+		ctx->mpbfctx.embedded_limit = GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS / (ctx->mpbfctx.num_hashes * 10 / 7);
 
 		init_modified_path_bloom_filters(&modified_path_bloom_filters);
+
+		repo_diff_setup(ctx->r, &ctx->mpbfctx.diffopt);
+		ctx->mpbfctx.diffopt.flags.recursive = 1;
+		ctx->mpbfctx.diffopt.add_remove = file_add_remove_cb;
+		ctx->mpbfctx.diffopt.change = file_change_cb;
+		ctx->mpbfctx.diffopt.change_fn_data = &ctx->mpbfctx;
+		diff_setup_done(&ctx->mpbfctx.diffopt);
+
+		strbuf_init(&ctx->mpbfctx.prev_path, PATH_MAX);
 	}
 
 	if (ctx->split) {
@@ -2013,10 +2242,14 @@ int write_commit_graph(const char *obj_dir,
 		free(ctx->commit_graph_hash_after);
 	}
 
-	if (ctx->mpbfctx.use_modified_path_bloom_filters)
+	if (ctx->mpbfctx.use_modified_path_bloom_filters) {
 		deep_clear_modified_path_bloom_filters(
 				&modified_path_bloom_filters,
 				free_modified_path_bloom_filter_info_in_slab);
+		strbuf_release(&ctx->mpbfctx.prev_path);
+		free(ctx->mpbfctx.hashed_prefix_lengths);
+		free(ctx->mpbfctx.hashes);
+	}
 
 	free(ctx);
 
-- 
2.27.0.rc1.431.g5c813f95dc


  parent reply index

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-29  8:50 [PoC PATCH 00/34] An alternative modified path Bloom filters implementation SZEDER Gábor
2020-05-28 22:09 ` Johannes Schindelin
2020-05-29  8:50 ` [PATCH 01/34] tree-walk.c: don't match submodule entries for 'submod/anything' SZEDER Gábor
2020-05-29  8:50 ` [PATCH 02/34] commit-graph: fix parsing the Chunk Lookup table SZEDER Gábor
2020-05-29  8:50 ` [PATCH 03/34] commit-graph-format.txt: all multi-byte numbers are in network byte order SZEDER Gábor
2020-05-29  8:50 ` [PATCH 04/34] commit-slab: add a function to deep free entries on the slab SZEDER Gábor
2020-06-04 16:43   ` Derrick Stolee
2020-06-27 15:56     ` SZEDER Gábor
2020-06-29 11:30       ` Derrick Stolee
2020-05-29  8:50 ` [PATCH 05/34] diff.h: drop diff_tree_oid() & friends' return value SZEDER Gábor
2020-05-29  8:50 ` [PATCH 06/34] commit-graph: clean up #includes SZEDER Gábor
2020-05-29  8:50 ` [PATCH 07/34] commit-graph: simplify parse_commit_graph() #1 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 08/34] commit-graph: simplify parse_commit_graph() #2 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 09/34] commit-graph: simplify write_commit_graph_file() #1 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 10/34] commit-graph: simplify write_commit_graph_file() #2 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 11/34] commit-graph: allocate the 'struct chunk_info' array dinamically SZEDER Gábor
2020-05-29  8:50 ` [PATCH 12/34] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor
2020-05-29  8:50 ` [PATCH 13/34] commit-graph: simplify write_commit_graph_file() #3 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 14/34] commit-graph: check chunk sizes after writing SZEDER Gábor
2020-05-29  8:50 ` [PATCH 15/34] commit-graph-format.txt: document the modified path Bloom filter chunks SZEDER Gábor
2020-06-02 17:59   ` SZEDER Gábor
2020-05-29  8:50 ` [PATCH 16/34] Add a generic and minimal Bloom filter implementation SZEDER Gábor
2020-05-29  8:50 ` [PATCH 17/34] Import a streaming-capable Murmur3 hash function implementation SZEDER Gábor
2020-05-29  8:50 ` [PATCH 18/34] commit-graph: write "empty" Modified Path Bloom Filter Index chunk SZEDER Gábor
2020-05-29  8:50 ` [PATCH 19/34] commit-graph: add commit slab for modified path Bloom filters SZEDER Gábor
2020-05-29  8:50 ` SZEDER Gábor [this message]
2020-05-29  8:50 ` [PATCH 21/34] commit-graph: load and use the Modified Path Bloom Filter Index chunk SZEDER Gábor
2020-05-29  8:50 ` [PATCH 22/34] commit-graph: write the Modified Path Bloom Filters chunk SZEDER Gábor
2020-05-29  8:50 ` [PATCH 23/34] commit-graph: load and use " SZEDER Gábor
2020-05-29  8:50 ` [PATCH 24/34] commit-graph: check all leading directories in modified path Bloom filters SZEDER Gábor
2020-05-29  8:50 ` [PATCH 25/34] commit-graph: check embedded modified path Bloom filters with a mask SZEDER Gábor
2020-05-29  8:50 ` [PATCH 26/34] commit-graph: deduplicate modified path Bloom filters SZEDER Gábor
2020-05-29  8:50 ` [PATCH 27/34] commit-graph: load modified path Bloom filters for merge commits SZEDER Gábor
2020-05-29  8:50 ` [PATCH 28/34] commit-graph: write Modified Path Bloom Filter Merge Index chunk SZEDER Gábor
2020-05-29  8:50 ` [PATCH 29/34] commit-graph: extract init and free write_commit_graph_context SZEDER Gábor
2020-05-29  8:50 ` [PATCH 30/34] commit-graph: move write_commit_graph_reachable below write_commit_graph SZEDER Gábor
2020-05-29  8:50 ` [PATCH 31/34] t7007-show: make the first test compatible with the next patch SZEDER Gábor
2020-05-29  8:50 ` [PATCH 32/34] PoC commit-graph: use revision walk machinery for '--reachable' SZEDER Gábor
2020-05-29  8:50 ` [PATCH 33/34] commit-graph: write modified path Bloom filters in "history order" SZEDER Gábor
2020-05-29  8:50 ` [PATCH 34/34] commit-graph: use modified path Bloom filters with wildcards, if possible SZEDER Gábor
2020-05-29 13:59 ` [PoC PATCH 00/34] An alternative modified path Bloom filters implementation Derrick Stolee
2020-06-01 23:25 ` Taylor Blau
2020-06-02 17:08   ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200529085038.26008-21-szeder.dev@gmail.com \
    --to=szeder.dev@gmail.com \
    --cc=garima.singh@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jnareb@gmail.com \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Git Mailing List Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/git/0 git/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 git git/ https://lore.kernel.org/git \
		git@vger.kernel.org
	public-inbox-index git

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.git


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git