All of lore.kernel.org
 help / color / mirror / Atom feed
From: "SZEDER Gábor" <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: "Junio C Hamano" <gitster@pobox.com>,
	"Garima Singh" <garima.singh@microsoft.com>,
	"Derrick Stolee" <stolee@gmail.com>,
	"Jakub Narebski" <jnareb@gmail.com>, "Jeff King" <peff@peff.net>,
	"Taylor Blau" <me@ttaylorr.com>,
	"SZEDER Gábor" <szeder.dev@gmail.com>
Subject: [PATCH 33/34] commit-graph: write modified path Bloom filters in "history order"
Date: Fri, 29 May 2020 10:50:37 +0200	[thread overview]
Message-ID: <20200529085038.26008-34-szeder.dev@gmail.com> (raw)
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>

[Explain!  Try topo order!]

I don't have recent benchmark results at hand yet, but as far as I can
recall some old results this reduces the time spent loading and
checking modified path Bloom filters by another ~10%.  Checking Bloom
filters is of course only a small part of the whole runtime of a
pathspec-limited revision walk, so the overall improvement is only
about 1-2%.
Oh, well, I expected more from this change... But perhaps it has
larger impact with less embedded modified path Bloom filters?

This is the last patch in this series that improves pathspec-limited
revision walk performance, so it's time to compare average runtime and
memory usage of

  git rev-list HEAD -- "$path"

for 5000+ randomly selected paths from each repository with and
without modified path Bloom filters:

                  Average runtime     Average  |  Average max RSS
                 without      with    speedup  |  without    with
  ---------------------------------------------+----------------------------
  android-base   0.8780s    0.1456s     6.03x  |  387MB    91.2MB    -76.5%
  cmssw          0.3143s    0.0313s    10.04x  |  181MB    37.4MB    -79.4%
  cpython        0.7453s    0.0810s     9.20x  |  159MB    43.8MB    -72.5%
  elasticsearch  0.1492s    0.0136s    10.95x  |  134MB    21.8MB    -83.7%
  gcc            7.1852s    0.2114s    34.00x  |  297MB    91.1MB    -69.3%
  gecko-dev      4.6113s    0.4815s     9.58x  |  832MB   175.2MB    -79.0%
  git            0.6180s    0.0310s    19.94x  |  131MB    31.5MB    -76.0%
  glibc          0.5618s    0.0282s    19.92x  |  135MB    25.0MB    -81.6%
  go             0.4913s    0.0403s    12.19x  |  130MB    24.7MB    -81.1%
  jdk            0.0482s    0.0068s     7.09x  |   52MB    27.1MB    -48.2%
  linux          0.7043s    0.0873s     8.08x  |  438MB   150.9MB    -65.5%
  llvm-project   2.6844s    0.4135s     6.49x  |  384MB   126.5MB    -67.1%
  rails          0.2784s    0.0391s     7.12x  |   88MB    28.8MB    -67.3%
  rust           0.7757s    0.0439s    17.67x  |  344MB    42.0MB    -87.8%
  tensorflow     0.6258s    0.0487s    12.85x  |  233MB    36.7MB    -84.2%
  webkit         1.9137s    0.2420s     7.91x  |  940MB    84.7MB    -91.0%

The astute reader may notice that in several cases the achieved
speedup is a bit higher than the calculated potential speedup shown in
the log message of the patch adding the modified path Bloom filter
specs, e.g. 34x vs. 29.7x in the gcc repository.  I suspect that by
eliminating much of the tree-diff overhead we load a lot less (tree)
objects, putting less strain on the caches, and in turn making other
parts of the process faster as well.

Alas, this is not without extra cost: writing the commit-graph file
with modified path Bloom filters takes a lot longer and the resulting
commit-graph file is considerably larger:

          Writing a commit-graph file from      |
             scratch with '--reachable'         | commit-graph file size
                 without       with             |   without      with
  ----------------------------------------------+--------------------------------
  android-base    6.230s     40.880s     6.56x  |  25077512    31278605    +24.7%
  cmssw           3.177s     25.691s     8.09x  |  13017516    23971497    +84.1%
  cpython         1.344s      8.951s     6.66x  |   6808068     8152146    +19.7%
  gcc             2.886s     36.917s    12.79x  |  12839660    18068286    +40.7%
  gecko-dev       9.331s     97.729s    10.47x  |  43335208    66568549    +53.6%
  git             0.750s      5.245s     6.99x  |   3388208     4011595    +18.3%
  glibc           0.565s      4.146s     7.34x  |   2832460     3936588    +38.9%
  homebrew-cask   1.083s     27.024s    24.95x  |   5928644     6859160    +15.7%
  homebrew-core   1.702s     55.478s    32.60x  |   9171324    10509040    +14.5%
  jdk             0.568s     19.418s    34.19x  |   3237508    15858410   +389.8%
  linux          13.525s    100.837s     7.46x  |  49800744    81939893    +64.5%
  llvm-project    4.275s     31.188s     7.30x  |  19309116    25336867    +31.2%
  rails           0.986s      5.607s     5.69x  |   4990252     6317541    +26.5%
  rust            1.260s     13.250s    10.52x  |   6053824     8601440    +42.0%
  webkit          3.658s     30.469s     8.33x  |  12288620    19438512    +58.1%
---
 commit-graph.c | 28 +++++++++++++++++++++-------
 1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 1677e4fb3e..8eb0cbedaf 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -67,6 +67,8 @@ struct modified_path_bloom_filter_info {
 	uint64_t offset;
 	uint32_t merge_index_pos;
 	struct modified_path_bloom_filter_info *next;
+	/* TODO: need better names for 'next' and 'next_commit_bfi' */
+	struct modified_path_bloom_filter_info *next_commit_bfi;
 };
 
 static void free_modified_path_bloom_filter_info_in_slab(
@@ -1159,6 +1161,10 @@ struct write_commit_graph_context {
 
 		/* Used to find identical modified path Bloom filters */
 		struct hashmap dedup_hashmap;
+
+		/* To chain up Bloom filters in history order */
+		struct modified_path_bloom_filter_info *first_commit_bfi;
+		struct modified_path_bloom_filter_info *prev_commit_bfi;
 	} mpbfctx;
 };
 
@@ -1363,18 +1369,18 @@ static int write_graph_chunk_modified_path_bloom_filters(struct hashfile *f,
 		struct write_commit_graph_context *ctx)
 {
 	uint64_t offset = 0;
-	int i;
+	struct modified_path_bloom_filter_info *next_commit_bfi;
 
-	for (i = 0; i < ctx->commits.nr; i++) {
-		struct commit *commit = ctx->commits.list[i];
-		struct modified_path_bloom_filter_info *bfi;
-		unsigned int filter_size;
+	next_commit_bfi = ctx->mpbfctx.first_commit_bfi;
+	while (next_commit_bfi) {
+		struct modified_path_bloom_filter_info *bfi = next_commit_bfi;
+
+		next_commit_bfi = next_commit_bfi->next_commit_bfi;
 
 		display_progress(ctx->progress, ++ctx->progress_cnt);
 
-		bfi = modified_path_bloom_filters_peek(
-				&modified_path_bloom_filters, commit);
 		for (; bfi; bfi = bfi->next) {
+			unsigned int filter_size;
 			if (bfi->duplicate_of)
 				continue;
 			if (!bfi->filter.nr_bits)
@@ -1762,6 +1768,14 @@ static void create_modified_path_bloom_filter(
 	bloom_filter_set_bits(&bfi->filter, mpbfctx->hashes,
 			      mpbfctx->hashes_nr);
 
+	if (!mpbfctx->first_commit_bfi) {
+		mpbfctx->first_commit_bfi = bfi;
+		mpbfctx->prev_commit_bfi = bfi;
+	} else if (!nth_parent) {
+		mpbfctx->prev_commit_bfi->next_commit_bfi = bfi;
+		mpbfctx->prev_commit_bfi = bfi;
+	}
+
 	if (path_component_count > mpbfctx->embedded_limit &&
 	    !handle_duplicate_modified_path_bloom_filter(mpbfctx, bfi))
 		mpbfctx->total_filter_size += sizeof(uint32_t) +
-- 
2.27.0.rc1.431.g5c813f95dc


  parent reply	other threads:[~2020-05-29  8:52 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-29  8:50 [PoC PATCH 00/34] An alternative modified path Bloom filters implementation SZEDER Gábor
2020-05-28 22:09 ` Johannes Schindelin
2020-05-29  8:50 ` [PATCH 01/34] tree-walk.c: don't match submodule entries for 'submod/anything' SZEDER Gábor
2020-05-29  8:50 ` [PATCH 02/34] commit-graph: fix parsing the Chunk Lookup table SZEDER Gábor
2020-05-29  8:50 ` [PATCH 03/34] commit-graph-format.txt: all multi-byte numbers are in network byte order SZEDER Gábor
2020-05-29  8:50 ` [PATCH 04/34] commit-slab: add a function to deep free entries on the slab SZEDER Gábor
2020-06-04 16:43   ` Derrick Stolee
2020-06-27 15:56     ` SZEDER Gábor
2020-06-29 11:30       ` Derrick Stolee
2020-05-29  8:50 ` [PATCH 05/34] diff.h: drop diff_tree_oid() & friends' return value SZEDER Gábor
2020-05-29  8:50 ` [PATCH 06/34] commit-graph: clean up #includes SZEDER Gábor
2020-05-29  8:50 ` [PATCH 07/34] commit-graph: simplify parse_commit_graph() #1 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 08/34] commit-graph: simplify parse_commit_graph() #2 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 09/34] commit-graph: simplify write_commit_graph_file() #1 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 10/34] commit-graph: simplify write_commit_graph_file() #2 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 11/34] commit-graph: allocate the 'struct chunk_info' array dinamically SZEDER Gábor
2020-05-29  8:50 ` [PATCH 12/34] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor
2020-05-29  8:50 ` [PATCH 13/34] commit-graph: simplify write_commit_graph_file() #3 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 14/34] commit-graph: check chunk sizes after writing SZEDER Gábor
2020-05-29  8:50 ` [PATCH 15/34] commit-graph-format.txt: document the modified path Bloom filter chunks SZEDER Gábor
2020-06-02 17:59   ` SZEDER Gábor
2020-05-29  8:50 ` [PATCH 16/34] Add a generic and minimal Bloom filter implementation SZEDER Gábor
2020-05-29  8:50 ` [PATCH 17/34] Import a streaming-capable Murmur3 hash function implementation SZEDER Gábor
2020-05-29  8:50 ` [PATCH 18/34] commit-graph: write "empty" Modified Path Bloom Filter Index chunk SZEDER Gábor
2020-05-29  8:50 ` [PATCH 19/34] commit-graph: add commit slab for modified path Bloom filters SZEDER Gábor
2020-05-29  8:50 ` [PATCH 20/34] commit-graph: fill the Modified Path Bloom Filter Index chunk SZEDER Gábor
2020-05-29  8:50 ` [PATCH 21/34] commit-graph: load and use " SZEDER Gábor
2020-05-29  8:50 ` [PATCH 22/34] commit-graph: write the Modified Path Bloom Filters chunk SZEDER Gábor
2020-05-29  8:50 ` [PATCH 23/34] commit-graph: load and use " SZEDER Gábor
2020-05-29  8:50 ` [PATCH 24/34] commit-graph: check all leading directories in modified path Bloom filters SZEDER Gábor
2020-05-29  8:50 ` [PATCH 25/34] commit-graph: check embedded modified path Bloom filters with a mask SZEDER Gábor
2020-05-29  8:50 ` [PATCH 26/34] commit-graph: deduplicate modified path Bloom filters SZEDER Gábor
2020-05-29  8:50 ` [PATCH 27/34] commit-graph: load modified path Bloom filters for merge commits SZEDER Gábor
2020-05-29  8:50 ` [PATCH 28/34] commit-graph: write Modified Path Bloom Filter Merge Index chunk SZEDER Gábor
2020-05-29  8:50 ` [PATCH 29/34] commit-graph: extract init and free write_commit_graph_context SZEDER Gábor
2020-05-29  8:50 ` [PATCH 30/34] commit-graph: move write_commit_graph_reachable below write_commit_graph SZEDER Gábor
2020-05-29  8:50 ` [PATCH 31/34] t7007-show: make the first test compatible with the next patch SZEDER Gábor
2020-05-29  8:50 ` [PATCH 32/34] PoC commit-graph: use revision walk machinery for '--reachable' SZEDER Gábor
2020-05-29  8:50 ` SZEDER Gábor [this message]
2020-05-29  8:50 ` [PATCH 34/34] commit-graph: use modified path Bloom filters with wildcards, if possible SZEDER Gábor
2020-05-29 13:59 ` [PoC PATCH 00/34] An alternative modified path Bloom filters implementation Derrick Stolee
2020-06-01 23:25 ` Taylor Blau
2020-06-02 17:08   ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200529085038.26008-34-szeder.dev@gmail.com \
    --to=szeder.dev@gmail.com \
    --cc=garima.singh@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jnareb@gmail.com \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.