From: "SZEDER Gábor" <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: "Junio C Hamano" <gitster@pobox.com>,
"Garima Singh" <garima.singh@microsoft.com>,
"Derrick Stolee" <stolee@gmail.com>,
"Jakub Narebski" <jnareb@gmail.com>, "Jeff King" <peff@peff.net>,
"Taylor Blau" <me@ttaylorr.com>,
"SZEDER Gábor" <szeder.dev@gmail.com>
Subject: [PATCH 32/34] PoC commit-graph: use revision walk machinery for '--reachable'
Date: Fri, 29 May 2020 10:50:36 +0200 [thread overview]
Message-ID: <20200529085038.26008-33-szeder.dev@gmail.com> (raw)
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
[Not sure it is legal to do this. I hope it's legal. It would be
great if it were legal. But I really doubt it's legal. FWIW, at
least the test suite still passes, even with GIT_TEST_COMMIT_GRAPH=1.
Or we have a hole in test coverage...]
Our revision walk machinery knows a thing or two about how to traverse
the history in an order that is friendlier to the delta base cache.
The hand-rolled revision walk in commit-graph.c:close_reachable() is
unaware of any such tricks: it is just a simple loop iterating over an
array of commits, appending unseen parents to the end. This didn't
really matter in the past, because we only accessed commit objects
while writing the commit-graph. Now, however, we have modified path
Bloom filters, and to fill them we need access to tree objects as well
to gather all modified paths, which puts more strain on the caches,
and the order in which we traverse the commits becomes relevant.
So let's use the regular revision walk machinery (i.e. 'struct
rev_info', setup_revisions(), get_revision() and friends) for
'--reachable' to speed up writing a commit-graph with modified path
Bloom filters.
However, stick to the current algorithm when scanning pack files in
commits, because passing all packed commits to setup_revisions() might
do more harm than good (presumably, I didn't check; scanning all packs
for commits can be hopelessly inefficient anyway). '--stdin-commits',
too, could benefit from using the revision walk machinery when it's
fed with branch tips from 'git for-each-ref', or fare worse when it's
fed all commits from 'git rev-list' (didn't check this, either)... so
let's just leave it using the current algorithm for now.
The table below shows the reduced runtime and max RSS of writing a
commit-graph file from scratch with '--reachable' and modified path
Bloom filters, i.e. that of:
git -c core.modifiedPathBloomFilters=1 commit-graph write --reachable
Runtime Max RSS
before after before after
--------------------------------------------------------------------------
android-base 57.722s 40.113s -30.5% 711804k 697624k -2.0%
cmssw 27.977s 25.154s -10.1% 980884k 786196k -19.9%
cpython 10.161s 8.956s -11.9% 331736k 328704k -0.9%
gcc 114.980s 36.975s -67.9% 561928k 484212k -13.8%
gecko-dev 132.557s 97.314s -26.6% 2203544k 2161480k -1.9%
git 8.097s 5.215s -35.6% 240564k 189652k -21.2%
glibc 4.693s 3.996s -14.9% 215924k 207572k -3.9%
homebrew-core 56.661s 56.384s -0.5% 469668k 475596k +1.3%
jdk 20.355s 18.936s -7.0% 327244k 322316k -1.5%
linux 215.235s 98.991s -54.0% 1549852k 1445028k -6.8%
llvm-project 48.659s 30.933s -36.4% 783740k 736988k -6.0%
rails 5.804s 5.389s -7.2% 284524k 284528k 0%
rust 20.151s 12.921s -35.9% 687840k 664104k -3.5%
webkit 31.226s 30.341s -2.8% 2018396k 1902884k -5.7%
---
commit-graph.c | 110 ++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 94 insertions(+), 16 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index 4cbfce61bf..1677e4fb3e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1703,6 +1703,9 @@ static void create_modified_path_bloom_filter(
* (a commit can be present in multiple pack files, or multiple
* refs can point to the same commit) so check first whether we have
* already created a modified path Bloom filter for it.
+ *
+ * TODO: with '--reachable' we can now be sure that we only look at
+ * each commit only once, so this is unnecessary in that code path.
*/
bfi = modified_path_bloom_filters_peek(&modified_path_bloom_filters,
commit);
@@ -1891,16 +1894,6 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
stop_progress(&ctx->progress);
}
-static int add_ref_to_list(const char *refname,
- const struct object_id *oid,
- int flags, void *cb_data)
-{
- struct string_list *list = (struct string_list *)cb_data;
-
- string_list_append(list, oid_to_hex(oid));
- return 0;
-}
-
static int fill_oids_from_packs(struct write_commit_graph_context *ctx,
struct string_list *pack_indexes)
{
@@ -2773,14 +2766,99 @@ int write_commit_graph_reachable(const char *obj_dir,
enum commit_graph_write_flags flags,
const struct split_commit_graph_opts *split_opts)
{
- struct string_list list = STRING_LIST_INIT_DUP;
- int result;
+ struct write_commit_graph_context *ctx;
+ struct rev_info revs;
+ const char *revs_argv[] = { NULL, "--all", NULL };
+ struct commit *commit;
+ int result = 0;
+
+ ctx = init_write_commit_graph_context(obj_dir, flags, split_opts);
+ if (!ctx)
+ return 0;
+
+ /*
+ * Some git commands write a commit-graph in-process at the end,
+ * prevent their revision walk to interfere with ours.
+ */
+ clear_object_flags(SEEN | ADDED | SHOWN | TOPO_WALK_EXPLORED |
+ TOPO_WALK_INDEGREE | UNINTERESTING);
+
+ init_revisions(&revs, NULL);
+ setup_revisions(ARRAY_SIZE(revs_argv) - 1, revs_argv, &revs, NULL);
+ if (prepare_revision_walk(&revs)) {
+ error(_("revision walk setup failed"));
+ result = -1;
+ goto cleanup;
+ }
+
+ if (ctx->report_progress)
+ ctx->progress = start_delayed_progress(
+ _("Gathering commit info"),
+ 0);
+ while ((commit = get_revision(&revs))) {
+ display_progress(ctx->progress, ctx->oids.nr + 1);
+ ALLOC_GROW(ctx->oids.list, ctx->oids.nr + 1, ctx->oids.alloc);
+ oidcpy(&ctx->oids.list[ctx->oids.nr], &commit->object.oid);
+ ctx->oids.nr++;
+
+ if (!commit->parents) {
+ create_modified_path_bloom_filter(&ctx->mpbfctx,
+ commit, 0);
+ } else {
+ struct commit_list *parent;
+ int nth_parent;
+
+ for (parent = commit->parents, nth_parent = 0;
+ parent;
+ parent = parent->next, nth_parent++)
+ create_modified_path_bloom_filter(&ctx->mpbfctx,
+ commit, nth_parent);
+ }
+ }
+ stop_progress(&ctx->progress);
+
+ if (ctx->oids.nr >= GRAPH_EDGE_LAST_MASK) {
+ error(_("the commit graph format cannot write %d commits"), ctx->oids.nr);
+ result = -1;
+ goto cleanup;
+ }
- for_each_ref(add_ref_to_list, &list);
- result = write_commit_graph(obj_dir, NULL, &list,
- flags, split_opts);
+ QSORT(ctx->oids.list, ctx->oids.nr, oid_compare);
+
+ /*
+ * TODO: the revision walking loop above could fill ctx->commits
+ * straight away, counting extra edges as well.
+ */
+ ctx->commits.alloc = ctx->oids.nr;
+ ALLOC_ARRAY(ctx->commits.list, ctx->commits.alloc);
+
+ /*
+ * Despite its unassuming name, this function removes duplicate
+ * commits/oids and, worse, it does counts extra edges as well.
+ */
+ copy_oids_to_commits(ctx);
+
+ if (!ctx->commits.nr)
+ goto cleanup;
+
+ if (ctx->split) {
+ split_graph_merge_strategy(ctx);
- string_list_clear(&list, 0);
+ merge_commit_graphs(ctx);
+ } else
+ ctx->num_commit_graphs_after = 1;
+
+ compute_generation_numbers(ctx);
+
+ result = write_commit_graph_file(ctx);
+
+ if (ctx->split)
+ mark_commit_graphs(ctx);
+
+ expire_commit_graphs(ctx);
+
+cleanup:
+ free_write_commit_graph_context(ctx);
return result;
}
--
2.27.0.rc1.431.g5c813f95dc
next prev parent reply other threads:[~2020-05-29 8:52 UTC|newest]
Thread overview: 43+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-05-29 8:50 [PoC PATCH 00/34] An alternative modified path Bloom filters implementation SZEDER Gábor
2020-05-28 22:09 ` Johannes Schindelin
2020-05-29 8:50 ` [PATCH 01/34] tree-walk.c: don't match submodule entries for 'submod/anything' SZEDER Gábor
2020-05-29 8:50 ` [PATCH 02/34] commit-graph: fix parsing the Chunk Lookup table SZEDER Gábor
2020-05-29 8:50 ` [PATCH 03/34] commit-graph-format.txt: all multi-byte numbers are in network byte order SZEDER Gábor
2020-05-29 8:50 ` [PATCH 04/34] commit-slab: add a function to deep free entries on the slab SZEDER Gábor
2020-06-04 16:43 ` Derrick Stolee
2020-06-27 15:56 ` SZEDER Gábor
2020-06-29 11:30 ` Derrick Stolee
2020-05-29 8:50 ` [PATCH 05/34] diff.h: drop diff_tree_oid() & friends' return value SZEDER Gábor
2020-05-29 8:50 ` [PATCH 06/34] commit-graph: clean up #includes SZEDER Gábor
2020-05-29 8:50 ` [PATCH 07/34] commit-graph: simplify parse_commit_graph() #1 SZEDER Gábor
2020-05-29 8:50 ` [PATCH 08/34] commit-graph: simplify parse_commit_graph() #2 SZEDER Gábor
2020-05-29 8:50 ` [PATCH 09/34] commit-graph: simplify write_commit_graph_file() #1 SZEDER Gábor
2020-05-29 8:50 ` [PATCH 10/34] commit-graph: simplify write_commit_graph_file() #2 SZEDER Gábor
2020-05-29 8:50 ` [PATCH 11/34] commit-graph: allocate the 'struct chunk_info' array dinamically SZEDER Gábor
2020-05-29 8:50 ` [PATCH 12/34] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor
2020-05-29 8:50 ` [PATCH 13/34] commit-graph: simplify write_commit_graph_file() #3 SZEDER Gábor
2020-05-29 8:50 ` [PATCH 14/34] commit-graph: check chunk sizes after writing SZEDER Gábor
2020-05-29 8:50 ` [PATCH 15/34] commit-graph-format.txt: document the modified path Bloom filter chunks SZEDER Gábor
2020-06-02 17:59 ` SZEDER Gábor
2020-05-29 8:50 ` [PATCH 16/34] Add a generic and minimal Bloom filter implementation SZEDER Gábor
2020-05-29 8:50 ` [PATCH 17/34] Import a streaming-capable Murmur3 hash function implementation SZEDER Gábor
2020-05-29 8:50 ` [PATCH 18/34] commit-graph: write "empty" Modified Path Bloom Filter Index chunk SZEDER Gábor
2020-05-29 8:50 ` [PATCH 19/34] commit-graph: add commit slab for modified path Bloom filters SZEDER Gábor
2020-05-29 8:50 ` [PATCH 20/34] commit-graph: fill the Modified Path Bloom Filter Index chunk SZEDER Gábor
2020-05-29 8:50 ` [PATCH 21/34] commit-graph: load and use " SZEDER Gábor
2020-05-29 8:50 ` [PATCH 22/34] commit-graph: write the Modified Path Bloom Filters chunk SZEDER Gábor
2020-05-29 8:50 ` [PATCH 23/34] commit-graph: load and use " SZEDER Gábor
2020-05-29 8:50 ` [PATCH 24/34] commit-graph: check all leading directories in modified path Bloom filters SZEDER Gábor
2020-05-29 8:50 ` [PATCH 25/34] commit-graph: check embedded modified path Bloom filters with a mask SZEDER Gábor
2020-05-29 8:50 ` [PATCH 26/34] commit-graph: deduplicate modified path Bloom filters SZEDER Gábor
2020-05-29 8:50 ` [PATCH 27/34] commit-graph: load modified path Bloom filters for merge commits SZEDER Gábor
2020-05-29 8:50 ` [PATCH 28/34] commit-graph: write Modified Path Bloom Filter Merge Index chunk SZEDER Gábor
2020-05-29 8:50 ` [PATCH 29/34] commit-graph: extract init and free write_commit_graph_context SZEDER Gábor
2020-05-29 8:50 ` [PATCH 30/34] commit-graph: move write_commit_graph_reachable below write_commit_graph SZEDER Gábor
2020-05-29 8:50 ` [PATCH 31/34] t7007-show: make the first test compatible with the next patch SZEDER Gábor
2020-05-29 8:50 ` SZEDER Gábor [this message]
2020-05-29 8:50 ` [PATCH 33/34] commit-graph: write modified path Bloom filters in "history order" SZEDER Gábor
2020-05-29 8:50 ` [PATCH 34/34] commit-graph: use modified path Bloom filters with wildcards, if possible SZEDER Gábor
2020-05-29 13:59 ` [PoC PATCH 00/34] An alternative modified path Bloom filters implementation Derrick Stolee
2020-06-01 23:25 ` Taylor Blau
2020-06-02 17:08 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200529085038.26008-33-szeder.dev@gmail.com \
--to=szeder.dev@gmail.com \
--cc=garima.singh@microsoft.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=jnareb@gmail.com \
--cc=me@ttaylorr.com \
--cc=peff@peff.net \
--cc=stolee@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).