All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: gitster@pobox.com, newren@gmail.com, peff@peff.net,
	jrnieder@gmail.com, sunshine@sunshineco.com, pclouds@gmail.com,
	Derrick Stolee <derrickstolee@github.com>,
	Derrick Stolee <dstolee@microsoft.com>
Subject: [PATCH 27/27] cache-tree: integrate with sparse directory entries
Date: Mon, 25 Jan 2021 17:42:13 +0000	[thread overview]
Message-ID: <05e7548b780da6b2bf2342d91d8757568df0a6b8.1611596534.git.gitgitgadget@gmail.com> (raw)
In-Reply-To: <pull.847.git.1611596533.gitgitgadget@gmail.com>

From: Derrick Stolee <dstolee@microsoft.com>

The cache-tree extension was previously disabled with sparse indexes.
However, the cache-tree is an important performance feature for commands
like 'git status' and 'git add'. Integrate it with sparse directory
entries.

When writing a sparse index, completely clear and recalculate the cache
tree. By starting from scratch, the only integration necessary is to
check if we hit a sparse directory entry and create a leaf of the
cache-tree that has an entry_count of one and no subtrees.

Once the cache-tree exists within a sparse index, we finally get
improved performance. I test the sparse index performance using a
private monorepo with over 2.1 million files at HEAD, but with a
sparse-checkout definition that has only 68,000 paths in the populated
cone. The sparse index has about 2,000 sparse directory entries. I
compare three scenarios:

 1. Use the full index. The index size is ~186 MB.
 2. Use the sparse index. The index size is ~5.5 MB.
 3. Use a commit where HEAD matches the populated set. The full index
    size is ~5.3MB.

The third benchmark is included as a theoretical optimium for a
repository of the same object database.

First, a clean 'git status' improves from 3.1s to 240ms.

Benchmark #1: full index (git status)
  Time (mean ± σ):      3.167 s ±  0.036 s    [User: 2.006 s, System: 1.078 s]
  Range (min … max):    3.100 s …  3.208 s    10 runs

Benchmark #2: sparse index (git status)
  Time (mean ± σ):     239.5 ms ±   8.1 ms    [User: 189.4 ms, System: 226.8 ms]
  Range (min … max):   226.0 ms … 251.9 ms    13 runs

Benchmark #3: small tree (git status)
  Time (mean ± σ):     195.3 ms ±   4.5 ms    [User: 116.5 ms, System: 84.4 ms]
  Range (min … max):   188.8 ms … 202.8 ms    15 runs

The optimimum is still 45ms faster. This is due in part to the 2,000+
sparse directory entries, but there might be other optimizations to make
in the sparse-index case. In particular, I find that this performance
difference disappears when I disable FS Monitor, which is somewhat
disabled in the sparse-index case, but might still be adding overhead.

The performance numbers for 'git add .' are much closer to optimal:

Benchmark #1: full index (git add .)
  Time (mean ± σ):      3.076 s ±  0.022 s    [User: 2.065 s, System: 0.943 s]
  Range (min … max):    3.044 s …  3.116 s    10 runs

Benchmark #2: sparse index (git add .)
  Time (mean ± σ):     218.0 ms ±   6.6 ms    [User: 195.7 ms, System: 206.6 ms]
  Range (min … max):   209.8 ms … 228.2 ms    13 runs

Benchmark #3: small tree (git add .)
  Time (mean ± σ):     217.6 ms ±   5.4 ms    [User: 131.9 ms, System: 86.7 ms]
  Range (min … max):   212.1 ms … 228.4 ms    14 runs

In this test, I also used "echo >>README.md" to append a line to the
README.md file, so the 'git add .' command is doing _something_ other
than a no-op. Without this edit (and FS Monitor enabled) the small
tree case again gains about 30ms on the sparse index case.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache-tree.c   | 18 ++++++++++++++++++
 sparse-index.c | 10 +++++++++-
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/cache-tree.c b/cache-tree.c
index 5f07a39e501..9da6a4394e0 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -256,6 +256,24 @@ static int update_one(struct cache_tree *it,
 
 	*skip_count = 0;
 
+	/*
+	 * If the first entry of this region is a sparse directory
+	 * entry corresponding exactly to 'base', then this cache_tree
+	 * struct is a "leaf" in the data structure, pointing to the
+	 * tree OID specified in the entry.
+	 */
+	if (entries > 0) {
+		const struct cache_entry *ce = cache[0];
+
+		if (S_ISSPARSEDIR(ce) &&
+		    ce->ce_namelen == baselen &&
+		    !strncmp(ce->name, base, baselen)) {
+			it->entry_count = 1;
+			oidcpy(&it->oid, &ce->oid);
+			return 1;
+		}
+	}
+
 	if (0 <= it->entry_count && has_object_file(&it->oid))
 		return it->entry_count;
 
diff --git a/sparse-index.c b/sparse-index.c
index a201f3b905c..9ea3b321400 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -181,7 +181,11 @@ int convert_to_sparse(struct index_state *istate)
 	istate->cache_nr = convert_to_sparse_rec(istate,
 						 0, 0, istate->cache_nr,
 						 "", 0, istate->cache_tree);
-	istate->drop_cache_tree = 1;
+
+	/* Clear and recompute the cache-tree */
+	cache_tree_free(&istate->cache_tree);
+	cache_tree_update(istate, 0);
+
 	istate->sparse_index = 1;
 	trace2_region_leave("index", "convert_to_sparse", istate->repo);
 	return 0;
@@ -278,6 +282,10 @@ void ensure_full_index(struct index_state *istate)
 
 	free(full);
 
+	/* Clear and recompute the cache-tree */
+	cache_tree_free(&istate->cache_tree);
+	cache_tree_update(istate, 0);
+
 	trace2_region_leave("index", "ensure_full_index", istate->repo);
 }
 
-- 
gitgitgadget

  parent reply	other threads:[~2021-01-25 17:59 UTC|newest]

Thread overview: 61+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
2021-01-25 17:41 ` [PATCH 01/27] sparse-index: add guard to ensure full index Derrick Stolee via GitGitGadget
2021-01-25 17:41 ` [PATCH 02/27] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
2021-01-27  3:05   ` Elijah Newren
2021-01-27 13:43     ` Derrick Stolee
2021-01-27 16:38       ` Elijah Newren
2021-01-28  5:25     ` Junio C Hamano
2021-01-25 17:41 ` [PATCH 03/27] t1092: compare sparse-checkout to sparse-index Derrick Stolee via GitGitGadget
2021-01-27  3:08   ` Elijah Newren
2021-01-27 13:30     ` Derrick Stolee
2021-01-27 16:54       ` Elijah Newren
2021-01-25 17:41 ` [PATCH 04/27] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
2021-01-27  3:25   ` Elijah Newren
2021-01-25 17:41 ` [PATCH 05/27] test-tool: read-cache --table --no-stat Derrick Stolee via GitGitGadget
2021-01-25 17:41 ` [PATCH 06/27] test-tool: don't force full index Derrick Stolee via GitGitGadget
2021-01-25 17:41 ` [PATCH 07/27] unpack-trees: ensure " Derrick Stolee via GitGitGadget
2021-01-27  4:43   ` Elijah Newren
2021-01-25 17:41 ` [PATCH 08/27] sparse-checkout: hold pattern list in index Derrick Stolee via GitGitGadget
2021-01-27 17:00   ` Elijah Newren
2021-01-28 13:12     ` Derrick Stolee
2021-01-25 17:41 ` [PATCH 09/27] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
2021-01-27 17:30   ` Elijah Newren
2021-01-25 17:41 ` [PATCH 10/27] submodule: sparse-index should not collapse links Derrick Stolee via GitGitGadget
2021-01-25 17:41 ` [PATCH 11/27] unpack-trees: allow sparse directories Derrick Stolee via GitGitGadget
2021-01-27 17:36   ` Elijah Newren
2021-01-25 17:41 ` [PATCH 12/27] sparse-index: check index conversion happens Derrick Stolee via GitGitGadget
2021-01-27 17:46   ` Elijah Newren
2021-01-25 17:41 ` [PATCH 13/27] sparse-index: create extension for compatibility Derrick Stolee via GitGitGadget
2021-01-27 18:03   ` Elijah Newren
2021-01-25 17:42 ` [PATCH 14/27] sparse-checkout: toggle sparse index from builtin Derrick Stolee via GitGitGadget
2021-01-27 18:18   ` Elijah Newren
2021-01-28 15:26     ` Derrick Stolee
2021-01-25 17:42 ` [PATCH 15/27] [RFC-VERSION] *: ensure full index Derrick Stolee via GitGitGadget
2021-02-01 20:22   ` Elijah Newren
2021-02-01 21:10     ` Derrick Stolee
2021-01-25 17:42 ` [PATCH 16/27] unpack-trees: make sparse aware Derrick Stolee via GitGitGadget
2021-02-01 20:50   ` Elijah Newren
2021-02-09 17:23     ` Derrick Stolee
2021-01-25 17:42 ` [PATCH 17/27] dir.c: accept a directory as part of cone-mode patterns Derrick Stolee via GitGitGadget
2021-02-01 22:12   ` Elijah Newren
2021-01-25 17:42 ` [PATCH 18/27] status: use sparse-index throughout Derrick Stolee via GitGitGadget
2021-01-25 17:42 ` [PATCH 19/27] status: skip sparse-checkout percentage with sparse-index Derrick Stolee via GitGitGadget
2021-01-25 17:42 ` [PATCH 20/27] sparse-index: expand_to_path() trivial implementation Derrick Stolee via GitGitGadget
2021-01-25 17:42 ` [PATCH 21/27] sparse-index: expand_to_path no-op if path exists Derrick Stolee via GitGitGadget
2021-02-01 22:34   ` Elijah Newren
2021-01-25 17:42 ` [PATCH 22/27] add: allow operating on a sparse-only index Derrick Stolee via GitGitGadget
2021-02-01 23:08   ` Elijah Newren
2021-01-25 17:42 ` [PATCH 23/27] submodule: die_path_inside_submodule is sparse aware Derrick Stolee via GitGitGadget
2021-01-25 17:42 ` [PATCH 24/27] dir: use expand_to_path in add_patterns() Derrick Stolee via GitGitGadget
2021-02-01 23:21   ` Elijah Newren
2021-01-25 17:42 ` [PATCH 25/27] fsmonitor: disable if index is sparse Derrick Stolee via GitGitGadget
2021-01-25 17:42 ` [PATCH 26/27] pathspec: stop calling ensure_full_index Derrick Stolee via GitGitGadget
2021-02-01 23:24   ` Elijah Newren
2021-02-02  2:39     ` Derrick Stolee
2021-01-25 17:42 ` Derrick Stolee via GitGitGadget [this message]
2021-02-01 23:54   ` [PATCH 27/27] cache-tree: integrate with sparse directory entries Elijah Newren
2021-02-02  2:41     ` Derrick Stolee
2021-02-02  3:05       ` Elijah Newren
2021-01-25 20:10 ` [PATCH 00/27] [RFC] Sparse Index Junio C Hamano
2021-01-25 21:18   ` Derrick Stolee
2021-02-02  3:11 ` Elijah Newren

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=05e7548b780da6b2bf2342d91d8757568df0a6b8.1611596534.git.gitgitgadget@gmail.com \
    --to=gitgitgadget@gmail.com \
    --cc=derrickstolee@github.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jrnieder@gmail.com \
    --cc=newren@gmail.com \
    --cc=pclouds@gmail.com \
    --cc=peff@peff.net \
    --cc=sunshine@sunshineco.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.