All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: git@vger.kernel.org
Cc: Patrick Steinhardt <ps@pks.im>
Subject: [PATCH] bitmaps: don't recurse into trees already in the bitmap
Date: Mon, 14 Jun 2021 03:27:03 -0400	[thread overview]
Message-ID: <YMcExxF7DI6k+sZ+@coredump.intra.peff.net> (raw)

If an object is already mentioned in a reachability bitmap we are
building, then by definition so are all of the objects it can reach. We
have an optimization to stop traversing commits when we see they are
already in the bitmap, but we don't do the same for trees.

It's generally unavoidable to recurse into trees for commits not yet
covered by bitmaps (since most commits generally do have unique
top-level trees). But they usually have subtrees that are shared with
other commits (i.e., all of the subtrees the commit _didn't_ touch). And
some of those commits (and their trees) may be covered by the bitmap.

Usually this isn't _too_ big a deal, because we'll visit those subtrees
only once in total for the whole walk. But if you have a large number of
unbitmapped commits, and if your tree is big, then you may end up
opening a lot of sub-trees for no good reason.

We can use the same optimization we do for commits here: when we are
about to open a tree, see if it's in the bitmap (either the one we are
building, or the "seen" bitmap which covers the UNINTERESTING side of
the bitmap when doing a set-difference).

This works especially well because we'll visit all commits before
hitting any trees. So even in a history like:

  A -- B

if "A" has a bitmap on disk but "B" doesn't, we'll already have OR-ed in
the results from A before looking at B's tree (so we really will only
look at trees touched by B).

For most repositories, the timings produced by p5310 are unspectacular.
Here's linux.git:

  Test                         HEAD^             HEAD
  --------------------------------------------------------------------
  5310.4: simulated clone      6.00(5.90+0.10)   5.98(5.90+0.08) -0.3%
  5310.5: simulated fetch      2.98(5.45+0.18)   2.85(5.31+0.18) -4.4%
  5310.7: rev-list (commits)   0.32(0.29+0.03)   0.33(0.30+0.03) +3.1%
  5310.8: rev-list (objects)   1.48(1.44+0.03)   1.49(1.44+0.05) +0.7%

Any improvement there is within the noise (the +3.1% on test 7 has to be
noise, since we are not recursing into trees, and thus the new code
isn't even run). The results for git.git are likewise uninteresting.

But here are numbers from some other real-world repositories (that are
not public). This one's tree is comparable in size to linux.git, but has
~16k refs (and so less complete bitmap coverage):

  Test                         HEAD^               HEAD
  -------------------------------------------------------------------------
  5310.4: simulated clone      38.34(39.86+0.74)   33.95(35.53+0.76) -11.5%
  5310.5: simulated fetch      2.29(6.31+0.35)     2.20(5.97+0.41) -3.9%
  5310.7: rev-list (commits)   0.99(0.86+0.13)     0.96(0.85+0.11) -3.0%
  5310.8: rev-list (objects)   11.32(11.04+0.27)   6.59(6.37+0.21) -41.8%

And here's another with a very large tree (~340k entries), and a fairly
large number of refs (~10k):

  Test                         HEAD^               HEAD
  -------------------------------------------------------------------------
  5310.3: simulated clone      53.83(54.71+1.54)   39.77(40.76+1.50) -26.1%
  5310.4: simulated fetch      19.91(20.11+0.56)   19.79(19.98+0.67) -0.6%
  5310.6: rev-list (commits)   0.54(0.44+0.11)     0.51(0.43+0.07) -5.6%
  5310.7: rev-list (objects)   24.32(23.59+0.73)   9.85(9.49+0.36) -59.5%

This patch provides substantial improvements in these larger cases, and
have any drawbacks for smaller ones (the cost of the bitmap check is
quite small compared to an actual tree traversal).

Note that we have to add a version of revision.c's include_check
callback which handles non-commits. We could possibly consolidate this
into a single callback for all objects types, as there's only one user
of the feature which would need converted (pack-bitmap.c:should_include).
That would in theory let us avoid duplicating any logic. But when I
tried it, the code ended up much worse to read, with lots of repeated
"if it's a commit do this, otherwise do that". Having two separate
callbacks splits that naturally, and matches the existing split of
show_commit/show_object callbacks.

Signed-off-by: Jeff King <peff@peff.net>
---
Patrick, I cc'd you since you mentioned you'd tried using bitmaps to
optimize a few queries, but they sometimes made things slower. This
might help some (but I doubt it will make all problems go away).

Of course, any review is appreciated, as well. :)

 list-objects.c |  3 +++
 pack-bitmap.c  | 17 +++++++++++++++++
 revision.h     |  1 +
 3 files changed, 21 insertions(+)

diff --git a/list-objects.c b/list-objects.c
index 7f404677d5..473a332416 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -164,6 +164,9 @@ static void process_tree(struct traversal_context *ctx,
 		die("bad tree object");
 	if (obj->flags & (UNINTERESTING | SEEN))
 		return;
+	if (revs->include_check_obj &&
+	    !revs->include_check_obj(&tree->object, revs->include_check_data))
+		return;
 
 	failed_parse = parse_tree_gently(tree, 1);
 	if (failed_parse) {
diff --git a/pack-bitmap.c b/pack-bitmap.c
index d90e1d9d8c..9dfee4a5b7 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -525,6 +525,22 @@ static int should_include(struct commit *commit, void *_data)
 	return 1;
 }
 
+static int should_include_obj(struct object *obj, void *_data)
+{
+	struct include_data *data = _data;
+	int bitmap_pos;
+
+	bitmap_pos = bitmap_position(data->bitmap_git, &obj->oid);
+	if (bitmap_pos < 0)
+		return 1;
+	if ((data->seen && bitmap_get(data->seen, bitmap_pos)) ||
+	     bitmap_get(data->base, bitmap_pos)) {
+		obj->flags |= SEEN;
+		return 0;
+	}
+	return 1;
+}
+
 static int add_commit_to_bitmap(struct bitmap_index *bitmap_git,
 				struct bitmap **base,
 				struct commit *commit)
@@ -620,6 +636,7 @@ static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
 		incdata.seen = seen;
 
 		revs->include_check = should_include;
+		revs->include_check_obj = should_include_obj;
 		revs->include_check_data = &incdata;
 
 		if (prepare_revision_walk(revs))
diff --git a/revision.h b/revision.h
index 93aa012f51..7bb5fa4e7c 100644
--- a/revision.h
+++ b/revision.h
@@ -262,6 +262,7 @@ struct rev_info {
 	int min_parents;
 	int max_parents;
 	int (*include_check)(struct commit *, void *);
+	int (*include_check_obj)(struct object *obj, void *);
 	void *include_check_data;
 
 	/* diff info for patches and for paths limiting */
-- 
2.32.0.549.g52475a27a6

             reply	other threads:[~2021-06-14  7:27 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-14  7:27 Jeff King [this message]
2021-06-14 12:05 ` Jeff King
2021-06-15 14:17   ` Derrick Stolee
2021-06-16 12:31     ` Patrick Steinhardt
2021-06-18 12:59       ` Jeff King
2021-06-18 13:35         ` Patrick Steinhardt
2021-06-18 14:10           ` Jeff King
2021-06-22 10:47           ` Patrick Steinhardt
2021-06-22 19:39             ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YMcExxF7DI6k+sZ+@coredump.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=ps@pks.im \
    --subject='Re: [PATCH] bitmaps: don'\''t recurse into trees already in the bitmap' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.