git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "SZEDER Gábor" <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: "Junio C Hamano" <gitster@pobox.com>,
	"Garima Singh" <garima.singh@microsoft.com>,
	"Derrick Stolee" <stolee@gmail.com>,
	"Jakub Narebski" <jnareb@gmail.com>, "Jeff King" <peff@peff.net>,
	"Taylor Blau" <me@ttaylorr.com>,
	"SZEDER Gábor" <szeder.dev@gmail.com>
Subject: [PATCH 32/34] PoC commit-graph: use revision walk machinery for '--reachable'
Date: Fri, 29 May 2020 10:50:36 +0200	[thread overview]
Message-ID: <20200529085038.26008-33-szeder.dev@gmail.com> (raw)
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>

[Not sure it is legal to do this.  I hope it's legal.  It would be
great if it were legal.  But I really doubt it's legal.  FWIW, at
least the test suite still passes, even with GIT_TEST_COMMIT_GRAPH=1.
Or we have a hole in test coverage...]

Our revision walk machinery knows a thing or two about how to traverse
the history in an order that is friendlier to the delta base cache.
The hand-rolled revision walk in commit-graph.c:close_reachable() is
unaware of any such tricks: it is just a simple loop iterating over an
array of commits, appending unseen parents to the end.  This didn't
really matter in the past, because we only accessed commit objects
while writing the commit-graph.  Now, however, we have modified path
Bloom filters, and to fill them we need access to tree objects as well
to gather all modified paths, which puts more strain on the caches,
and the order in which we traverse the commits becomes relevant.

So let's use the regular revision walk machinery (i.e. 'struct
rev_info', setup_revisions(), get_revision() and friends) for
'--reachable' to speed up writing a commit-graph with modified path
Bloom filters.

However, stick to the current algorithm when scanning pack files in
commits, because passing all packed commits to setup_revisions() might
do more harm than good (presumably, I didn't check; scanning all packs
for commits can be hopelessly inefficient anyway).  '--stdin-commits',
too, could benefit from using the revision walk machinery when it's
fed with branch tips from 'git for-each-ref', or fare worse when it's
fed all commits from 'git rev-list' (didn't check this, either)... so
let's just leave it using the current algorithm for now.

The table below shows the reduced runtime and max RSS of writing a
commit-graph file from scratch with '--reachable' and modified path
Bloom filters, i.e. that of:

  git -c core.modifiedPathBloomFilters=1 commit-graph write --reachable

                       Runtime                         Max RSS
                  before     after               before      after
  --------------------------------------------------------------------------
  android-base    57.722s   40.113s   -30.5%     711804k    697624k    -2.0%
  cmssw           27.977s   25.154s   -10.1%     980884k    786196k   -19.9%
  cpython         10.161s    8.956s   -11.9%     331736k    328704k    -0.9%
  gcc            114.980s   36.975s   -67.9%     561928k    484212k   -13.8%
  gecko-dev      132.557s   97.314s   -26.6%    2203544k   2161480k    -1.9%
  git              8.097s    5.215s   -35.6%     240564k    189652k   -21.2%
  glibc            4.693s    3.996s   -14.9%     215924k    207572k    -3.9%
  homebrew-core   56.661s   56.384s    -0.5%     469668k    475596k    +1.3%
  jdk             20.355s   18.936s    -7.0%     327244k    322316k    -1.5%
  linux          215.235s   98.991s   -54.0%    1549852k   1445028k    -6.8%
  llvm-project    48.659s   30.933s   -36.4%     783740k    736988k    -6.0%
  rails            5.804s    5.389s    -7.2%     284524k    284528k       0%
  rust            20.151s   12.921s   -35.9%     687840k    664104k    -3.5%
  webkit          31.226s   30.341s    -2.8%    2018396k   1902884k    -5.7%
---
 commit-graph.c | 110 ++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 94 insertions(+), 16 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 4cbfce61bf..1677e4fb3e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1703,6 +1703,9 @@ static void create_modified_path_bloom_filter(
 	 * (a commit can be present in multiple pack files, or multiple
 	 * refs can point to the same commit) so check first whether we have
 	 * already created a modified path Bloom filter for it.
+	 *
+	 * TODO: with '--reachable' we can now be sure that we only look at
+	 * each commit only once, so this is unnecessary in that code path.
 	 */
 	bfi = modified_path_bloom_filters_peek(&modified_path_bloom_filters,
 					       commit);
@@ -1891,16 +1894,6 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 	stop_progress(&ctx->progress);
 }
 
-static int add_ref_to_list(const char *refname,
-			   const struct object_id *oid,
-			   int flags, void *cb_data)
-{
-	struct string_list *list = (struct string_list *)cb_data;
-
-	string_list_append(list, oid_to_hex(oid));
-	return 0;
-}
-
 static int fill_oids_from_packs(struct write_commit_graph_context *ctx,
 				struct string_list *pack_indexes)
 {
@@ -2773,14 +2766,99 @@ int write_commit_graph_reachable(const char *obj_dir,
 				 enum commit_graph_write_flags flags,
 				 const struct split_commit_graph_opts *split_opts)
 {
-	struct string_list list = STRING_LIST_INIT_DUP;
-	int result;
+	struct write_commit_graph_context *ctx;
+	struct rev_info revs;
+	const char *revs_argv[] = { NULL, "--all", NULL };
+	struct commit *commit;
+	int result = 0;
+
+	ctx = init_write_commit_graph_context(obj_dir, flags, split_opts);
+	if (!ctx)
+		return 0;
+
+	/*
+	 * Some git commands write a commit-graph in-process at the end,
+	 * prevent their revision walk to interfere with ours.
+	 */
+	clear_object_flags(SEEN | ADDED | SHOWN | TOPO_WALK_EXPLORED |
+			   TOPO_WALK_INDEGREE | UNINTERESTING);
+
+	init_revisions(&revs, NULL);
+	setup_revisions(ARRAY_SIZE(revs_argv) - 1, revs_argv, &revs, NULL);
+	if (prepare_revision_walk(&revs)) {
+		error(_("revision walk setup failed"));
+		result = -1;
+		goto cleanup;
+	}
+
+	if (ctx->report_progress)
+		ctx->progress = start_delayed_progress(
+				_("Gathering commit info"),
+				0);
+	while ((commit = get_revision(&revs))) {
+		display_progress(ctx->progress, ctx->oids.nr + 1);
+		ALLOC_GROW(ctx->oids.list, ctx->oids.nr + 1, ctx->oids.alloc);
+		oidcpy(&ctx->oids.list[ctx->oids.nr], &commit->object.oid);
+		ctx->oids.nr++;
+
+		if (!commit->parents) {
+			create_modified_path_bloom_filter(&ctx->mpbfctx,
+							  commit, 0);
+		} else {
+			struct commit_list *parent;
+			int nth_parent;
+
+			for (parent = commit->parents, nth_parent = 0;
+			     parent;
+			     parent = parent->next, nth_parent++)
+				create_modified_path_bloom_filter(&ctx->mpbfctx,
+								  commit, nth_parent);
+		}
+	}
+	stop_progress(&ctx->progress);
+
+	if (ctx->oids.nr >= GRAPH_EDGE_LAST_MASK) {
+		error(_("the commit graph format cannot write %d commits"), ctx->oids.nr);
+		result = -1;
+		goto cleanup;
+	}
 
-	for_each_ref(add_ref_to_list, &list);
-	result = write_commit_graph(obj_dir, NULL, &list,
-				    flags, split_opts);
+	QSORT(ctx->oids.list, ctx->oids.nr, oid_compare);
+
+	/*
+	 * TODO: the revision walking loop above could fill ctx->commits
+	 * straight away, counting extra edges as well.
+	 */
+	ctx->commits.alloc = ctx->oids.nr;
+	ALLOC_ARRAY(ctx->commits.list, ctx->commits.alloc);
+
+	/*
+	 * Despite its unassuming name, this function removes duplicate
+	 * commits/oids and, worse, it does counts extra edges as well.
+	 */
+	copy_oids_to_commits(ctx);
+
+	if (!ctx->commits.nr)
+		goto cleanup;
+
+	if (ctx->split) {
+		split_graph_merge_strategy(ctx);
 
-	string_list_clear(&list, 0);
+		merge_commit_graphs(ctx);
+	} else
+		ctx->num_commit_graphs_after = 1;
+
+	compute_generation_numbers(ctx);
+
+	result = write_commit_graph_file(ctx);
+
+	if (ctx->split)
+		mark_commit_graphs(ctx);
+
+	expire_commit_graphs(ctx);
+
+cleanup:
+	free_write_commit_graph_context(ctx);
 	return result;
 }
 
-- 
2.27.0.rc1.431.g5c813f95dc


  parent reply	other threads:[~2020-05-29  8:52 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-29  8:50 [PoC PATCH 00/34] An alternative modified path Bloom filters implementation SZEDER Gábor
2020-05-28 22:09 ` Johannes Schindelin
2020-05-29  8:50 ` [PATCH 01/34] tree-walk.c: don't match submodule entries for 'submod/anything' SZEDER Gábor
2020-05-29  8:50 ` [PATCH 02/34] commit-graph: fix parsing the Chunk Lookup table SZEDER Gábor
2020-05-29  8:50 ` [PATCH 03/34] commit-graph-format.txt: all multi-byte numbers are in network byte order SZEDER Gábor
2020-05-29  8:50 ` [PATCH 04/34] commit-slab: add a function to deep free entries on the slab SZEDER Gábor
2020-06-04 16:43   ` Derrick Stolee
2020-06-27 15:56     ` SZEDER Gábor
2020-06-29 11:30       ` Derrick Stolee
2020-05-29  8:50 ` [PATCH 05/34] diff.h: drop diff_tree_oid() & friends' return value SZEDER Gábor
2020-05-29  8:50 ` [PATCH 06/34] commit-graph: clean up #includes SZEDER Gábor
2020-05-29  8:50 ` [PATCH 07/34] commit-graph: simplify parse_commit_graph() #1 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 08/34] commit-graph: simplify parse_commit_graph() #2 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 09/34] commit-graph: simplify write_commit_graph_file() #1 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 10/34] commit-graph: simplify write_commit_graph_file() #2 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 11/34] commit-graph: allocate the 'struct chunk_info' array dinamically SZEDER Gábor
2020-05-29  8:50 ` [PATCH 12/34] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor
2020-05-29  8:50 ` [PATCH 13/34] commit-graph: simplify write_commit_graph_file() #3 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 14/34] commit-graph: check chunk sizes after writing SZEDER Gábor
2020-05-29  8:50 ` [PATCH 15/34] commit-graph-format.txt: document the modified path Bloom filter chunks SZEDER Gábor
2020-06-02 17:59   ` SZEDER Gábor
2020-05-29  8:50 ` [PATCH 16/34] Add a generic and minimal Bloom filter implementation SZEDER Gábor
2020-05-29  8:50 ` [PATCH 17/34] Import a streaming-capable Murmur3 hash function implementation SZEDER Gábor
2020-05-29  8:50 ` [PATCH 18/34] commit-graph: write "empty" Modified Path Bloom Filter Index chunk SZEDER Gábor
2020-05-29  8:50 ` [PATCH 19/34] commit-graph: add commit slab for modified path Bloom filters SZEDER Gábor
2020-05-29  8:50 ` [PATCH 20/34] commit-graph: fill the Modified Path Bloom Filter Index chunk SZEDER Gábor
2020-05-29  8:50 ` [PATCH 21/34] commit-graph: load and use " SZEDER Gábor
2020-05-29  8:50 ` [PATCH 22/34] commit-graph: write the Modified Path Bloom Filters chunk SZEDER Gábor
2020-05-29  8:50 ` [PATCH 23/34] commit-graph: load and use " SZEDER Gábor
2020-05-29  8:50 ` [PATCH 24/34] commit-graph: check all leading directories in modified path Bloom filters SZEDER Gábor
2020-05-29  8:50 ` [PATCH 25/34] commit-graph: check embedded modified path Bloom filters with a mask SZEDER Gábor
2020-05-29  8:50 ` [PATCH 26/34] commit-graph: deduplicate modified path Bloom filters SZEDER Gábor
2020-05-29  8:50 ` [PATCH 27/34] commit-graph: load modified path Bloom filters for merge commits SZEDER Gábor
2020-05-29  8:50 ` [PATCH 28/34] commit-graph: write Modified Path Bloom Filter Merge Index chunk SZEDER Gábor
2020-05-29  8:50 ` [PATCH 29/34] commit-graph: extract init and free write_commit_graph_context SZEDER Gábor
2020-05-29  8:50 ` [PATCH 30/34] commit-graph: move write_commit_graph_reachable below write_commit_graph SZEDER Gábor
2020-05-29  8:50 ` [PATCH 31/34] t7007-show: make the first test compatible with the next patch SZEDER Gábor
2020-05-29  8:50 ` SZEDER Gábor [this message]
2020-05-29  8:50 ` [PATCH 33/34] commit-graph: write modified path Bloom filters in "history order" SZEDER Gábor
2020-05-29  8:50 ` [PATCH 34/34] commit-graph: use modified path Bloom filters with wildcards, if possible SZEDER Gábor
2020-05-29 13:59 ` [PoC PATCH 00/34] An alternative modified path Bloom filters implementation Derrick Stolee
2020-06-01 23:25 ` Taylor Blau
2020-06-02 17:08   ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200529085038.26008-33-szeder.dev@gmail.com \
    --to=szeder.dev@gmail.com \
    --cc=garima.singh@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jnareb@gmail.com \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).