All of lore.kernel.org
 help / color / mirror / Atom feed
From: Elijah Newren <newren@gmail.com>
To: Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>
Cc: "Git Mailing List" <git@vger.kernel.org>,
	"Junio C Hamano" <gitster@pobox.com>, "Jeff King" <peff@peff.net>,
	"Jonathan Nieder" <jrnieder@gmail.com>,
	"Eric Sunshine" <sunshine@sunshineco.com>,
	"Nguyễn Thái Ngọc" <pclouds@gmail.com>,
	"Derrick Stolee" <derrickstolee@github.com>,
	"Derrick Stolee" <dstolee@microsoft.com>
Subject: Re: [PATCH 27/27] cache-tree: integrate with sparse directory entries
Date: Mon, 1 Feb 2021 15:54:45 -0800	[thread overview]
Message-ID: <CABPp-BGtF+p7D8x0xvSwMz7XveqVcBWhr20iHQ4=Vrxw6LEoKw@mail.gmail.com> (raw)
In-Reply-To: <05e7548b780da6b2bf2342d91d8757568df0a6b8.1611596534.git.gitgitgadget@gmail.com>

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The cache-tree extension was previously disabled with sparse indexes.
> However, the cache-tree is an important performance feature for commands
> like 'git status' and 'git add'. Integrate it with sparse directory
> entries.
>
> When writing a sparse index, completely clear and recalculate the cache
> tree. By starting from scratch, the only integration necessary is to
> check if we hit a sparse directory entry and create a leaf of the
> cache-tree that has an entry_count of one and no subtrees.
>
> Once the cache-tree exists within a sparse index, we finally get
> improved performance. I test the sparse index performance using a
> private monorepo with over 2.1 million files at HEAD, but with a
> sparse-checkout definition that has only 68,000 paths in the populated
> cone. The sparse index has about 2,000 sparse directory entries. I
> compare three scenarios:

How many .gitignore entries does this monorepo have?  What percentage
of those are populated for #2 and #3?

Can a testcase be devised that others can repeat?  For example, I
created https://github.com/newren/gvfs-like-git-bomb once upon a time
to create a very small repository with a very large index and a lot of
skip-worktree entries, mostly to test some stuff that someone (Ben
Peart?) mentioned as being slow and verify for myself that it wasn't
Windows specific.

>  1. Use the full index. The index size is ~186 MB.
>  2. Use the sparse index. The index size is ~5.5 MB.
>  3. Use a commit where HEAD matches the populated set. The full index
>     size is ~5.3MB.

I'm not sure I'm understanding the difference between #2 and #3, other
than #3 is smaller.  How did you form #2?  Also, what do you mean by
"full index size" for #3, when it's the smallest?  Isn't that index
the most sparse (or least full)?  Or is it an index for a different
commit entirely that has far fewer files in it?

> The third benchmark is included as a theoretical optimium for a

s/optimium/optimum/

> repository of the same object database.

This I'm also not understanding, but maybe this goes back to not
understanding the difference in how #2 and #3 are constructed.

> First, a clean 'git status' improves from 3.1s to 240ms.
>
> Benchmark #1: full index (git status)
>   Time (mean ± σ):      3.167 s ±  0.036 s    [User: 2.006 s, System: 1.078 s]
>   Range (min … max):    3.100 s …  3.208 s    10 runs
>
> Benchmark #2: sparse index (git status)
>   Time (mean ± σ):     239.5 ms ±   8.1 ms    [User: 189.4 ms, System: 226.8 ms]
>   Range (min … max):   226.0 ms … 251.9 ms    13 runs
>
> Benchmark #3: small tree (git status)
>   Time (mean ± σ):     195.3 ms ±   4.5 ms    [User: 116.5 ms, System: 84.4 ms]
>   Range (min … max):   188.8 ms … 202.8 ms    15 runs

Always nice to see a speedup factor greater than 10.  :-)

>
> The optimimum is still 45ms faster. This is due in part to the 2,000+

s/optimimum/optimum/

> sparse directory entries, but there might be other optimizations to make
> in the sparse-index case. In particular, I find that this performance
> difference disappears when I disable FS Monitor, which is somewhat
> disabled in the sparse-index case, but might still be adding overhead.

The FS monitor wording is unclear to me; it feels like multiple negatives.

> The performance numbers for 'git add .' are much closer to optimal:
>
> Benchmark #1: full index (git add .)
>   Time (mean ± σ):      3.076 s ±  0.022 s    [User: 2.065 s, System: 0.943 s]
>   Range (min … max):    3.044 s …  3.116 s    10 runs
>
> Benchmark #2: sparse index (git add .)
>   Time (mean ± σ):     218.0 ms ±   6.6 ms    [User: 195.7 ms, System: 206.6 ms]
>   Range (min … max):   209.8 ms … 228.2 ms    13 runs
>
> Benchmark #3: small tree (git add .)
>   Time (mean ± σ):     217.6 ms ±   5.4 ms    [User: 131.9 ms, System: 86.7 ms]
>   Range (min … max):   212.1 ms … 228.4 ms    14 runs
>
> In this test, I also used "echo >>README.md" to append a line to the
> README.md file, so the 'git add .' command is doing _something_ other
> than a no-op. Without this edit (and FS Monitor enabled) the small
> tree case again gains about 30ms on the sparse index case.

Meaning the small tree is 30 ms faster than reported here, or 30 ms
slower, or that both sparse index and small tree are faster but the
small tree decreases its time more than the sparse index one does?

Sorry, I don't mean to be dense, I'm just struggling with
understanding words today it seems.  (Also, it seems like there's a
joke in there about me being "dense" in a review of a "sparse"
feature...but I'm not quite coming up with it.)

> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  cache-tree.c   | 18 ++++++++++++++++++
>  sparse-index.c | 10 +++++++++-
>  2 files changed, 27 insertions(+), 1 deletion(-)
>
> diff --git a/cache-tree.c b/cache-tree.c
> index 5f07a39e501..9da6a4394e0 100644
> --- a/cache-tree.c
> +++ b/cache-tree.c
> @@ -256,6 +256,24 @@ static int update_one(struct cache_tree *it,
>
>         *skip_count = 0;
>
> +       /*
> +        * If the first entry of this region is a sparse directory
> +        * entry corresponding exactly to 'base', then this cache_tree
> +        * struct is a "leaf" in the data structure, pointing to the
> +        * tree OID specified in the entry.
> +        */
> +       if (entries > 0) {
> +               const struct cache_entry *ce = cache[0];
> +
> +               if (S_ISSPARSEDIR(ce) &&
> +                   ce->ce_namelen == baselen &&
> +                   !strncmp(ce->name, base, baselen)) {
> +                       it->entry_count = 1;
> +                       oidcpy(&it->oid, &ce->oid);
> +                       return 1;
> +               }
> +       }
> +
>         if (0 <= it->entry_count && has_object_file(&it->oid))
>                 return it->entry_count;
>
> diff --git a/sparse-index.c b/sparse-index.c
> index a201f3b905c..9ea3b321400 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -181,7 +181,11 @@ int convert_to_sparse(struct index_state *istate)
>         istate->cache_nr = convert_to_sparse_rec(istate,
>                                                  0, 0, istate->cache_nr,
>                                                  "", 0, istate->cache_tree);
> -       istate->drop_cache_tree = 1;
> +
> +       /* Clear and recompute the cache-tree */
> +       cache_tree_free(&istate->cache_tree);
> +       cache_tree_update(istate, 0);
> +
>         istate->sparse_index = 1;
>         trace2_region_leave("index", "convert_to_sparse", istate->repo);
>         return 0;
> @@ -278,6 +282,10 @@ void ensure_full_index(struct index_state *istate)
>
>         free(full);
>
> +       /* Clear and recompute the cache-tree */
> +       cache_tree_free(&istate->cache_tree);
> +       cache_tree_update(istate, 0);
> +
>         trace2_region_leave("index", "ensure_full_index", istate->repo);
>  }
>
> --
> gitgitgadget

This is very exciting work!!

  reply	other threads:[~2021-02-01 23:55 UTC|newest]

Thread overview: 61+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
2021-01-25 17:41 ` [PATCH 01/27] sparse-index: add guard to ensure full index Derrick Stolee via GitGitGadget
2021-01-25 17:41 ` [PATCH 02/27] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
2021-01-27  3:05   ` Elijah Newren
2021-01-27 13:43     ` Derrick Stolee
2021-01-27 16:38       ` Elijah Newren
2021-01-28  5:25     ` Junio C Hamano
2021-01-25 17:41 ` [PATCH 03/27] t1092: compare sparse-checkout to sparse-index Derrick Stolee via GitGitGadget
2021-01-27  3:08   ` Elijah Newren
2021-01-27 13:30     ` Derrick Stolee
2021-01-27 16:54       ` Elijah Newren
2021-01-25 17:41 ` [PATCH 04/27] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
2021-01-27  3:25   ` Elijah Newren
2021-01-25 17:41 ` [PATCH 05/27] test-tool: read-cache --table --no-stat Derrick Stolee via GitGitGadget
2021-01-25 17:41 ` [PATCH 06/27] test-tool: don't force full index Derrick Stolee via GitGitGadget
2021-01-25 17:41 ` [PATCH 07/27] unpack-trees: ensure " Derrick Stolee via GitGitGadget
2021-01-27  4:43   ` Elijah Newren
2021-01-25 17:41 ` [PATCH 08/27] sparse-checkout: hold pattern list in index Derrick Stolee via GitGitGadget
2021-01-27 17:00   ` Elijah Newren
2021-01-28 13:12     ` Derrick Stolee
2021-01-25 17:41 ` [PATCH 09/27] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
2021-01-27 17:30   ` Elijah Newren
2021-01-25 17:41 ` [PATCH 10/27] submodule: sparse-index should not collapse links Derrick Stolee via GitGitGadget
2021-01-25 17:41 ` [PATCH 11/27] unpack-trees: allow sparse directories Derrick Stolee via GitGitGadget
2021-01-27 17:36   ` Elijah Newren
2021-01-25 17:41 ` [PATCH 12/27] sparse-index: check index conversion happens Derrick Stolee via GitGitGadget
2021-01-27 17:46   ` Elijah Newren
2021-01-25 17:41 ` [PATCH 13/27] sparse-index: create extension for compatibility Derrick Stolee via GitGitGadget
2021-01-27 18:03   ` Elijah Newren
2021-01-25 17:42 ` [PATCH 14/27] sparse-checkout: toggle sparse index from builtin Derrick Stolee via GitGitGadget
2021-01-27 18:18   ` Elijah Newren
2021-01-28 15:26     ` Derrick Stolee
2021-01-25 17:42 ` [PATCH 15/27] [RFC-VERSION] *: ensure full index Derrick Stolee via GitGitGadget
2021-02-01 20:22   ` Elijah Newren
2021-02-01 21:10     ` Derrick Stolee
2021-01-25 17:42 ` [PATCH 16/27] unpack-trees: make sparse aware Derrick Stolee via GitGitGadget
2021-02-01 20:50   ` Elijah Newren
2021-02-09 17:23     ` Derrick Stolee
2021-01-25 17:42 ` [PATCH 17/27] dir.c: accept a directory as part of cone-mode patterns Derrick Stolee via GitGitGadget
2021-02-01 22:12   ` Elijah Newren
2021-01-25 17:42 ` [PATCH 18/27] status: use sparse-index throughout Derrick Stolee via GitGitGadget
2021-01-25 17:42 ` [PATCH 19/27] status: skip sparse-checkout percentage with sparse-index Derrick Stolee via GitGitGadget
2021-01-25 17:42 ` [PATCH 20/27] sparse-index: expand_to_path() trivial implementation Derrick Stolee via GitGitGadget
2021-01-25 17:42 ` [PATCH 21/27] sparse-index: expand_to_path no-op if path exists Derrick Stolee via GitGitGadget
2021-02-01 22:34   ` Elijah Newren
2021-01-25 17:42 ` [PATCH 22/27] add: allow operating on a sparse-only index Derrick Stolee via GitGitGadget
2021-02-01 23:08   ` Elijah Newren
2021-01-25 17:42 ` [PATCH 23/27] submodule: die_path_inside_submodule is sparse aware Derrick Stolee via GitGitGadget
2021-01-25 17:42 ` [PATCH 24/27] dir: use expand_to_path in add_patterns() Derrick Stolee via GitGitGadget
2021-02-01 23:21   ` Elijah Newren
2021-01-25 17:42 ` [PATCH 25/27] fsmonitor: disable if index is sparse Derrick Stolee via GitGitGadget
2021-01-25 17:42 ` [PATCH 26/27] pathspec: stop calling ensure_full_index Derrick Stolee via GitGitGadget
2021-02-01 23:24   ` Elijah Newren
2021-02-02  2:39     ` Derrick Stolee
2021-01-25 17:42 ` [PATCH 27/27] cache-tree: integrate with sparse directory entries Derrick Stolee via GitGitGadget
2021-02-01 23:54   ` Elijah Newren [this message]
2021-02-02  2:41     ` Derrick Stolee
2021-02-02  3:05       ` Elijah Newren
2021-01-25 20:10 ` [PATCH 00/27] [RFC] Sparse Index Junio C Hamano
2021-01-25 21:18   ` Derrick Stolee
2021-02-02  3:11 ` Elijah Newren

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CABPp-BGtF+p7D8x0xvSwMz7XveqVcBWhr20iHQ4=Vrxw6LEoKw@mail.gmail.com' \
    --to=newren@gmail.com \
    --cc=derrickstolee@github.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=jrnieder@gmail.com \
    --cc=pclouds@gmail.com \
    --cc=peff@peff.net \
    --cc=sunshine@sunshineco.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.