All of lore.kernel.org
 help / color / mirror / Atom feed
From: Elijah Newren <newren@gmail.com>
To: Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>
Cc: "Git Mailing List" <git@vger.kernel.org>,
	"Junio C Hamano" <gitster@pobox.com>, "Jeff King" <peff@peff.net>,
	"Jonathan Nieder" <jrnieder@gmail.com>,
	"Eric Sunshine" <sunshine@sunshineco.com>,
	"Nguyễn Thái Ngọc" <pclouds@gmail.com>,
	"Derrick Stolee" <derrickstolee@github.com>
Subject: Re: [PATCH 00/27] [RFC] Sparse Index
Date: Mon, 1 Feb 2021 19:11:27 -0800	[thread overview]
Message-ID: <CABPp-BGy6Nt0b0n2=Z9Qzc+LcQtZ64R9nXqaZA8qmQBjd9gOow@mail.gmail.com> (raw)
In-Reply-To: <pull.847.git.1611596533.gitgitgadget@gmail.com>

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> This is based on ds/more-index-cleanups also available as GitGitGadget PR
> #839.
>
> The sparse checkout feature allows users to specify a "populated set" that
> is smaller than the full list of files at HEAD. Files outside the sparse
> checkout definition are not present in the working directory, but are still
> present in the index (and marked with the CE_SKIP_WORKTREE bit).
>
> This means that the working directory has size O(populated), and commands
> like "git status" or "git checkout" operate using an O(populated) number of
> filesystem operations. However, parsing the index still operates on the
> scale of O(HEAD).
>
> This can be particularly jarring if you are merging a small repository with
> a large monorepo for the purpose of simplifying dependency management. Even
> if users have nothing more in their working directory than they had before,
> they suddenly see a significant increase in their "git status" or "git add"
> times. In these cases, simply parsing the index can be a huge portion of the
> command time.
>
> This RFC proposes an update to the index formats to allow "sparse directory
> entries". These entries correspond to directories that are completely
> excluded from the sparse checkout definition. We can detect that a directory
> is excluded when using "cone mode" patterns.
>
> Since having directory entries is a radical departure from the existing
> index format, a new extension "extensions.sparseIndex" is added. Using a
> sparse index should cause incompatible tools to fail because they do not
> understand this extension.
>
> The index is a critical data structure, so making such a drastic change must
> be handled carefully. This RFC does only enough adjustments to demonstrate
> performance improvements for "git status" and "git add." Other commands
> should operate identically to before, since the other commands will expand a
> sparse index into a full index by parsing trees.
>
> WARNING: I'm getting a failure on the FreeBSD build with my sparse-checkout
> tests. I'm not sure what is causing these failures, but I will explore while
> we discuss the possibility of the feature as a whole.
>
> Here is an outline for this RFC:
>
>  * Patches 1-14: create and test the sparse index format. This is just
>    enough to start writing the format, but all Git commands become slower
>    for using it. This is because everything is guarded to expand to a full
>    index before actually operating on the cache entries.
>
>  * Patch 15: This massive patch is actually a bunch of patches squashed
>    together. I have a branch that adds "ensure_full_index()" guards in each
>    necessary file along with some commentary about how the index is being
>    used. This patch is presented here as one big dump because that
>    commentary isn't particularly interesting if the RFC leads to a very
>    different approach.
>
>  * Patches 16-27: These changes make enough code "sparse aware" such that
>    "git status" and "git add" start operating in time O(populated) instead
>    of O(HEAD).
>
> Performance numbers are given in patch 27, but repeated somewhat here. The
> test environment I use has ~2.1 million paths at HEAD, but only 68,000
> populated paths given the sparse-checkout I'm using. The sparse index has
> about 2,000 sparse directory entries.
>
>  1. Use the full index. The index size is ~186 MB.
>  2. Use the sparse index. The index size is ~5.5 MB.
>  3. Use a commit where HEAD matches the populated set. The full index size
>     is ~5.3MB.
>
> The third benchmark is included as a theoretical optimum for a repository of
> the same object database.
>
> First, a clean 'git status' improves from 3.1s to 240ms.
>
> Benchmark #1: full index (git status) Time (mean ± σ): 3.167 s ± 0.036 s
> [User: 2.006 s, System: 1.078 s] Range (min … max): 3.100 s … 3.208 s 10
> runs
>
> Benchmark #2: sparse index (git status) Time (mean ± σ): 239.5 ms ± 8.1 ms
> [User: 189.4 ms, System: 226.8 ms] Range (min … max): 226.0 ms … 251.9 ms 13
> runs
>
> Benchmark #3: small tree (git status) Time (mean ± σ): 195.3 ms ± 4.5 ms
> [User: 116.5 ms, System: 84.4 ms] Range (min … max): 188.8 ms … 202.8 ms 15
> runs
>
> The performance numbers for 'git add .' are much closer to optimal:
>
> Benchmark #1: full index (git add .) Time (mean ± σ): 3.076 s ± 0.022 s
> [User: 2.065 s, System: 0.943 s] Range (min … max): 3.044 s … 3.116 s 10
> runs
>
> Benchmark #2: sparse index (git add .) Time (mean ± σ): 218.0 ms ± 6.6 ms
> [User: 195.7 ms, System: 206.6 ms] Range (min … max): 209.8 ms … 228.2 ms 13
> runs
>
> Benchmark #3: small tree (git add .) Time (mean ± σ): 217.6 ms ± 5.4 ms
> [User: 131.9 ms, System: 86.7 ms] Range (min … max): 212.1 ms … 228.4 ms 14
> runs
>
> I expect that making a sparse index work optimally through the most common
> Git commands will take a year of effort. During this process, I expect to
> add a lot of testing infrastructure around the sparse-checkout feature,
> especially in corner cases. (This RFC focuses on the happy paths of
> operating only within the sparse cone, but that will change in the future.)
>
> If this general approach is acceptable, then I would follow it with a
> sequence of patch submissions that follow this approach:
>
>  1. Basics of the format. (Patches 1-14)
>  2. Add additional guards around index interactions (Patch 15, but split
>     appropriately.)
>  3. Speed up "git status" and "git add" (Patches 16-27)
>
> After those three items that are represented in this RFC, the work starts to
> parallelize a bit. My basic ideas for moving forward from this point are to
> do these basic steps:
>
>  * Add new index API abstractions where appropriate, make them sparse-aware.
>  * Add new tests around sparse-checkout corner cases. Ensure the sparse
>    index works correctly.
>  * For a given builtin, add extra testing for sparse-checkouts then it them
>    sparse-aware.
>
> Here are some specific questions I'm hoping to answer in this RFC period:
>
>  1. Are these sparse directory entries an appropriate way to extend the
>     index format?
>  2. Is extensions.sparseIndex a good way to signal that these entries can
>     exist?
>  3. Is git sparse-checkout init --cone --sparse-index an appropriate way to
>     toggle the format?
>  4. Are there specific areas that I should target to harden the index API
>     before I submit this work?
>  5. Does anyone have a good idea how to test a large portion of the test
>     suite with sparse-index enabled? The problem I see is that most tests
>     don't use sparse-checkout, so the sparse index is identical to the full
>     index. Would it be interesting to enable the test setup to add such
>     "extra" directories during the test setup?
>
> Thanks, -Stolee

Thanks for working on this.  It's very exciting seeing this idea come
alive.  I had lots of little nitpicks and questions and whatnot here
and there, but that almost feels like a diversion.  Overall, I think
you divided up the series in a very logical and easy to follow
fashion, and actually achieved quite a bit already.

I suspect I have partially answered some of your questions above among
all my comments, and left others unanswered or worse, just re-asked
the same question(s) myself.  Feel free to ping again with the next
round and I'll see if I can dodge your questions again...er, I mean,
try to think of something helpful to say.  :-)

>
> Derrick Stolee (27):
>   sparse-index: add guard to ensure full index
>   sparse-index: implement ensure_full_index()
>   t1092: compare sparse-checkout to sparse-index
>   test-read-cache: print cache entries with --table
>   test-tool: read-cache --table --no-stat
>   test-tool: don't force full index
>   unpack-trees: ensure full index
>   sparse-checkout: hold pattern list in index
>   sparse-index: convert from full to sparse
>   submodule: sparse-index should not collapse links
>   unpack-trees: allow sparse directories
>   sparse-index: check index conversion happens
>   sparse-index: create extension for compatibility
>   sparse-checkout: toggle sparse index from builtin
>   [RFC-VERSION] *: ensure full index
>   unpack-trees: make sparse aware
>   dir.c: accept a directory as part of cone-mode patterns
>   status: use sparse-index throughout
>   status: skip sparse-checkout percentage with sparse-index
>   sparse-index: expand_to_path() trivial implementation
>   sparse-index: expand_to_path no-op if path exists
>   add: allow operating on a sparse-only index
>   submodule: die_path_inside_submodule is sparse aware
>   dir: use expand_to_path in add_patterns()
>   fsmonitor: disable if index is sparse
>   pathspec: stop calling ensure_full_index
>   cache-tree: integrate with sparse directory entries
>
>  Documentation/config/extensions.txt      |   7 +
>  Documentation/git-sparse-checkout.txt    |  14 +
>  Makefile                                 |   1 +
>  apply.c                                  |  10 +-
>  blame.c                                  |   7 +-
>  builtin/add.c                            |   3 +
>  builtin/checkout-index.c                 |   5 +-
>  builtin/commit.c                         |   3 +
>  builtin/grep.c                           |   2 +
>  builtin/ls-files.c                       |   9 +-
>  builtin/merge-index.c                    |   2 +
>  builtin/mv.c                             |   2 +
>  builtin/rm.c                             |   2 +
>  builtin/sparse-checkout.c                |  35 ++-
>  builtin/update-index.c                   |   2 +
>  cache-tree.c                             |  21 ++
>  cache.h                                  |  15 +-
>  diff.c                                   |   2 +
>  dir.c                                    |  19 +-
>  dir.h                                    |   2 +-
>  entry.c                                  |   2 +
>  fsmonitor.c                              |  18 +-
>  merge-recursive.c                        |  22 +-
>  name-hash.c                              |  10 +
>  pathspec.c                               |   4 +-
>  pathspec.h                               |   4 +-
>  preload-index.c                          |   2 +
>  read-cache.c                             |  51 +++-
>  repo-settings.c                          |  15 +
>  repository.c                             |  11 +-
>  repository.h                             |   3 +
>  rerere.c                                 |   2 +
>  resolve-undo.c                           |   6 +
>  setup.c                                  |   3 +
>  sha1-name.c                              |   3 +
>  sparse-index.c                           | 360 +++++++++++++++++++++++
>  sparse-index.h                           |  23 ++
>  split-index.c                            |   2 +
>  submodule.c                              |  22 +-
>  submodule.h                              |   6 +-
>  t/helper/test-read-cache.c               |  77 ++++-
>  t/t1092-sparse-checkout-compatibility.sh | 130 +++++++-
>  tree.c                                   |   2 +
>  unpack-trees.c                           |  40 ++-
>  wt-status.c                              |  21 +-
>  wt-status.h                              |   1 +
>  46 files changed, 942 insertions(+), 61 deletions(-)
>  create mode 100644 sparse-index.c
>  create mode 100644 sparse-index.h
>
>
> base-commit: 2271fe7848aa11b30e5313d95d9caebc2937fce5
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-847%2Fderrickstolee%2Fsparse-index%2Frfc-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-847/derrickstolee/sparse-index/rfc-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/847
> --
> gitgitgadget

      parent reply	other threads:[~2021-02-02  3:12 UTC|newest]

Thread overview: 61+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
2021-01-25 17:41 ` [PATCH 01/27] sparse-index: add guard to ensure full index Derrick Stolee via GitGitGadget
2021-01-25 17:41 ` [PATCH 02/27] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
2021-01-27  3:05   ` Elijah Newren
2021-01-27 13:43     ` Derrick Stolee
2021-01-27 16:38       ` Elijah Newren
2021-01-28  5:25     ` Junio C Hamano
2021-01-25 17:41 ` [PATCH 03/27] t1092: compare sparse-checkout to sparse-index Derrick Stolee via GitGitGadget
2021-01-27  3:08   ` Elijah Newren
2021-01-27 13:30     ` Derrick Stolee
2021-01-27 16:54       ` Elijah Newren
2021-01-25 17:41 ` [PATCH 04/27] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
2021-01-27  3:25   ` Elijah Newren
2021-01-25 17:41 ` [PATCH 05/27] test-tool: read-cache --table --no-stat Derrick Stolee via GitGitGadget
2021-01-25 17:41 ` [PATCH 06/27] test-tool: don't force full index Derrick Stolee via GitGitGadget
2021-01-25 17:41 ` [PATCH 07/27] unpack-trees: ensure " Derrick Stolee via GitGitGadget
2021-01-27  4:43   ` Elijah Newren
2021-01-25 17:41 ` [PATCH 08/27] sparse-checkout: hold pattern list in index Derrick Stolee via GitGitGadget
2021-01-27 17:00   ` Elijah Newren
2021-01-28 13:12     ` Derrick Stolee
2021-01-25 17:41 ` [PATCH 09/27] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
2021-01-27 17:30   ` Elijah Newren
2021-01-25 17:41 ` [PATCH 10/27] submodule: sparse-index should not collapse links Derrick Stolee via GitGitGadget
2021-01-25 17:41 ` [PATCH 11/27] unpack-trees: allow sparse directories Derrick Stolee via GitGitGadget
2021-01-27 17:36   ` Elijah Newren
2021-01-25 17:41 ` [PATCH 12/27] sparse-index: check index conversion happens Derrick Stolee via GitGitGadget
2021-01-27 17:46   ` Elijah Newren
2021-01-25 17:41 ` [PATCH 13/27] sparse-index: create extension for compatibility Derrick Stolee via GitGitGadget
2021-01-27 18:03   ` Elijah Newren
2021-01-25 17:42 ` [PATCH 14/27] sparse-checkout: toggle sparse index from builtin Derrick Stolee via GitGitGadget
2021-01-27 18:18   ` Elijah Newren
2021-01-28 15:26     ` Derrick Stolee
2021-01-25 17:42 ` [PATCH 15/27] [RFC-VERSION] *: ensure full index Derrick Stolee via GitGitGadget
2021-02-01 20:22   ` Elijah Newren
2021-02-01 21:10     ` Derrick Stolee
2021-01-25 17:42 ` [PATCH 16/27] unpack-trees: make sparse aware Derrick Stolee via GitGitGadget
2021-02-01 20:50   ` Elijah Newren
2021-02-09 17:23     ` Derrick Stolee
2021-01-25 17:42 ` [PATCH 17/27] dir.c: accept a directory as part of cone-mode patterns Derrick Stolee via GitGitGadget
2021-02-01 22:12   ` Elijah Newren
2021-01-25 17:42 ` [PATCH 18/27] status: use sparse-index throughout Derrick Stolee via GitGitGadget
2021-01-25 17:42 ` [PATCH 19/27] status: skip sparse-checkout percentage with sparse-index Derrick Stolee via GitGitGadget
2021-01-25 17:42 ` [PATCH 20/27] sparse-index: expand_to_path() trivial implementation Derrick Stolee via GitGitGadget
2021-01-25 17:42 ` [PATCH 21/27] sparse-index: expand_to_path no-op if path exists Derrick Stolee via GitGitGadget
2021-02-01 22:34   ` Elijah Newren
2021-01-25 17:42 ` [PATCH 22/27] add: allow operating on a sparse-only index Derrick Stolee via GitGitGadget
2021-02-01 23:08   ` Elijah Newren
2021-01-25 17:42 ` [PATCH 23/27] submodule: die_path_inside_submodule is sparse aware Derrick Stolee via GitGitGadget
2021-01-25 17:42 ` [PATCH 24/27] dir: use expand_to_path in add_patterns() Derrick Stolee via GitGitGadget
2021-02-01 23:21   ` Elijah Newren
2021-01-25 17:42 ` [PATCH 25/27] fsmonitor: disable if index is sparse Derrick Stolee via GitGitGadget
2021-01-25 17:42 ` [PATCH 26/27] pathspec: stop calling ensure_full_index Derrick Stolee via GitGitGadget
2021-02-01 23:24   ` Elijah Newren
2021-02-02  2:39     ` Derrick Stolee
2021-01-25 17:42 ` [PATCH 27/27] cache-tree: integrate with sparse directory entries Derrick Stolee via GitGitGadget
2021-02-01 23:54   ` Elijah Newren
2021-02-02  2:41     ` Derrick Stolee
2021-02-02  3:05       ` Elijah Newren
2021-01-25 20:10 ` [PATCH 00/27] [RFC] Sparse Index Junio C Hamano
2021-01-25 21:18   ` Derrick Stolee
2021-02-02  3:11 ` Elijah Newren [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CABPp-BGy6Nt0b0n2=Z9Qzc+LcQtZ64R9nXqaZA8qmQBjd9gOow@mail.gmail.com' \
    --to=newren@gmail.com \
    --cc=derrickstolee@github.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=jrnieder@gmail.com \
    --cc=pclouds@gmail.com \
    --cc=peff@peff.net \
    --cc=sunshine@sunshineco.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.