* [PATCH 00/20] Sparse Index: Design, Format, Tests
@ 2021-02-23 20:14 Derrick Stolee via GitGitGadget
  2021-02-23 20:14 ` [PATCH 01/20] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
                   ` (21 more replies)
  0 siblings, 22 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee

Here is the first full patch series submission coming out of the
sparse-index RFC [1].

[1]
https://lore.kernel.org/git/pull.847.git.1611596533.gitgitgadget@gmail.com/

I won't waste too much space here, because PATCH 1 includes a sizeable
design document that describes the feature, the reasoning behind it, and my
plan for getting this implemented widely throughout the codebase.

There are some new things here that were not in the RFC:

 * Design doc and format updates. (Patch 1)
 * Performance test script. (Patches 2 and 20)

Notably missing in this series from the RFC:

 * The mega-patch inserting ensure_full_index() throughout the codebase.
   That will be a follow-up series to this one.
 * The integrations with git status and git add to demonstrate the improved
   performance. Those will also appear in their own series later.

I plan to keep my latest work in this area in my 'sparse-index/wip' branch
[2]. It includes all of the work from the RFC right now, updated with the
work from this series.

[2] https://github.com/derrickstolee/git/tree/sparse-index/wip

Thanks, -Stolee

Derrick Stolee (20):
  sparse-index: design doc and format update
  t/perf: add performance test for sparse operations
  t1092: clean up script quoting
  sparse-index: add guard to ensure full index
  sparse-index: implement ensure_full_index()
  t1092: compare sparse-checkout to sparse-index
  test-read-cache: print cache entries with --table
  test-tool: don't force full index
  unpack-trees: ensure full index
  sparse-checkout: hold pattern list in index
  sparse-index: convert from full to sparse
  submodule: sparse-index should not collapse links
  unpack-trees: allow sparse directories
  sparse-index: check index conversion happens
  sparse-index: create extension for compatibility
  sparse-checkout: toggle sparse index from builtin
  sparse-checkout: disable sparse-index
  cache-tree: integrate with sparse directory entries
  sparse-index: loose integration with cache_tree_verify()
  p2000: add sparse-index repos

 Documentation/config/extensions.txt      |   7 +
 Documentation/git-sparse-checkout.txt    |  14 ++
 Documentation/technical/index-format.txt |   7 +
 Documentation/technical/sparse-index.txt | 167 +++++++++++++
 Makefile                                 |   1 +
 builtin/sparse-checkout.c                |  44 +++-
 cache-tree.c                             |  40 ++++
 cache.h                                  |  12 +-
 read-cache.c                             |  35 ++-
 repo-settings.c                          |  15 ++
 repository.c                             |  11 +-
 repository.h                             |   3 +
 setup.c                                  |   3 +
 sparse-index.c                           | 290 +++++++++++++++++++++++
 sparse-index.h                           |  11 +
 t/README                                 |   3 +
 t/helper/test-read-cache.c               |  61 ++++-
 t/perf/p2000-sparse-operations.sh        | 104 ++++++++
 t/t1091-sparse-checkout-builtin.sh       |  13 +
 t/t1092-sparse-checkout-compatibility.sh | 136 +++++++++--
 unpack-trees.c                           |  16 +-
 21 files changed, 953 insertions(+), 40 deletions(-)
 create mode 100644 Documentation/technical/sparse-index.txt
 create mode 100644 sparse-index.c
 create mode 100644 sparse-index.h
 create mode 100755 t/perf/p2000-sparse-operations.sh


base-commit: 966e671106b2fd38301e7c344c754fd118d0bb07
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-883%2Fderrickstolee%2Fsparse-index%2Fformat-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-883/derrickstolee/sparse-index/format-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/883
-- 
gitgitgadget


* [PATCH 01/20] sparse-index: design doc and format update
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-24  1:13   ` Elijah Newren
  2021-02-23 20:14 ` [PATCH 02/20] t/perf: add performance test for sparse operations Derrick Stolee via GitGitGadget
                   ` (20 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

This begins a long effort to update the index format to allow sparse
directory entries. This should result in a significant improvement to
Git commands when HEAD contains millions of files, but the user has
selected many fewer files to keep in their sparse-checkout definition.

Currently, the index format is only updated in the presence of
extensions.sparseIndex instead of increasing a file format version
number. This is temporary, and index v5 is part of the plan for future
work in this area.

The design document details many of the reasons for embarking on this
work, and also the plan for completing it safely.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/index-format.txt |   7 +
 Documentation/technical/sparse-index.txt | 167 +++++++++++++++++++++++
 2 files changed, 174 insertions(+)
 create mode 100644 Documentation/technical/sparse-index.txt

diff --git a/Documentation/technical/index-format.txt b/Documentation/technical/index-format.txt
index b633482b1bdf..387126582556 100644
--- a/Documentation/technical/index-format.txt
+++ b/Documentation/technical/index-format.txt
@@ -44,6 +44,13 @@ Git index format
   localization, no special casing of directory separator '/'). Entries
   with the same name are sorted by their stage field.
 
+  An index entry typically represents a file. However, if sparse-checkout
+  is enabled in cone mode (`core.sparseCheckoutCone` is enabled) and the
+  `extensions.sparseIndex` extension is enabled, then the index may
+  contain entries for directories outside of the sparse-checkout definition.
+  These entries have mode `0040000`, include the `SKIP_WORKTREE` bit, and
+  the path ends in a directory separator.
+
   32-bit ctime seconds, the last time a file's metadata changed
     this is stat(2) data
 
diff --git a/Documentation/technical/sparse-index.txt b/Documentation/technical/sparse-index.txt
new file mode 100644
index 000000000000..9070836f0655
--- /dev/null
+++ b/Documentation/technical/sparse-index.txt
@@ -0,0 +1,167 @@
+Git Sparse-Index Design Document
+================================
+
+The sparse-checkout feature allows users to focus a working directory on
+a subset of the files at HEAD. The cone mode patterns, enabled by
+`core.sparseCheckoutCone`, allow for very fast pattern matching to
+discover which files at HEAD belong in the sparse-checkout cone.
+
+Three important scale dimensions for a Git worktree are:
+
+* `HEAD`: How many files are present at `HEAD`?
+
+* Populated: How many files are within the sparse-checkout cone?
+
+* Modified: How many files has the user modified in the working directory?
+
+We will use big-O notation -- O(X) -- to denote how expensive certain
+operations are in terms of these dimensions.
+
+These dimensions are ordered by their magnitude: users (typically) modify
+fewer files than are populated, and we can only populate files at `HEAD`.
+These dimensions are also ordered by how expensive they are per item: it
+is more expensive to detect a modified file than it is to write one that we
+know must be populated; changing `HEAD` only really requires updating the
+index.
+
+Problems occur if there is an extreme imbalance in these dimensions. For
+example, if `HEAD` contains millions of paths but the populated set has
+only tens of thousands, then commands like `git status` and `git add` can
+be dominated by operations that require O(`HEAD`) operations instead of
+O(Populated). Primarily, the cost is in parsing and rewriting the index,
+which is filled primarily with files at `HEAD` that are marked with the
+`SKIP_WORKTREE` bit.
+
+The sparse-index intends to take these commands that read and modify the
+index from O(`HEAD`) to O(Populated). To do this, we need to modify the
+index format in a significant way: add "sparse directory" entries.
+
+With cone mode patterns, it is possible to detect when an entire
+directory will have its contents outside of the sparse-checkout definition.
+Instead of listing all of the files it contains as individual entries, a
+sparse-index contains an entry with the directory name, referencing the
+object ID of the tree at `HEAD` and marked with the `SKIP_WORKTREE` bit.
+If we need to discover the details for paths within that directory, we
+can parse trees to find that list.
+
+This addition of sparse-directory entries violates expectations about the
+index format and its in-memory data structure. There are many consumers in
+the codebase that expect to iterate through all of the index entries and
+see only files. In addition, they expect to see all files at `HEAD`. One
+way to handle this is to parse trees to replace a sparse-directory entry
+with all of the files within that tree as the index is loaded. However,
+parsing trees is slower than parsing the index format, so that is a slower
+operation than if we left the index alone.
+
+The implementation plan below follows four phases to slowly integrate with
+the sparse-index. The intention is to incrementally update Git commands to
+interact safely with the sparse-index without significant slowdowns. This
+may not always be possible, but the hope is that the primary commands that
+users need in their daily work are dramatically improved.
+
+Phase I: Format and initial speedups
+------------------------------------
+
+During this phase, Git learns to enable the sparse-index and safely parse
+one. Protections are put in place so that every consumer of the in-memory
+data structure can operate with its current assumption of every file at
+`HEAD`.
+
+At first, every index parse will expand the sparse-directory entries into
+the full list of paths at `HEAD`. This will be slower in all cases. The
+only noticeable change in behavior will be that the serialized index file
+contains sparse-directory entries.
+
+To start, we use a new repository extension, `extensions.sparseIndex`, to
+allow inserting sparse-directory entries into indexes with file format
+versions 2, 3, and 4. This prevents Git versions that do not understand
+the sparse-index from operating on one, but it also prevents other
+operations that do not use the index at all. A new format, index v5, will
+be introduced that includes sparse-directory entries by default. It might
+also introduce other features that have been considered for improving the
+index, as well.
+
+Next, consumers of the index will be guarded against operating on a
+sparse-index by inserting calls to `ensure_full_index()` or
+`expand_index_to_path()`. After these guards are in place, we can begin
+leaving sparse-directory entries in the in-memory index structure.
+
+Even after inserting these guards, we will keep expanding sparse-indexes
+for most Git commands using the `command_requires_full_index` repository
+setting. This setting will be on by default and disabled one builtin at a
+time until we have sufficient confidence that all of the index operations
+are properly guarded.
+
+To complete this phase, the commands `git status` and `git add` will be
+integrated with the sparse-index so that they operate with O(Populated)
+performance. They will be carefully tested for operations within and
+outside the sparse-checkout definition.
+
+Phase II: Careful integrations
+------------------------------
+
+This phase focuses on ensuring that all index extensions and APIs work
+well with a sparse-index. This requires significant increases to our test
+coverage, especially for operations that interact with the working
+directory outside of the sparse-checkout definition. Some of these
+behaviors may not be the desirable ones, such as some tests already
+marked for failure in `t1092-sparse-checkout-compatibility.sh`.
+
+The index extensions that may require special integrations are:
+
+* FS Monitor
+* Untracked cache
+
+While integrating with these features, we should look for patterns that
+might lead to better APIs for interacting with the index. Coalescing
+common usage patterns into an API call can reduce the number of places
+where sparse-directories need to be handled carefully.
+
+Phase III: Important command speedups
+-------------------------------------
+
+At this point, the patterns for testing and implementing sparse-directory
+logic should be relatively stable. This phase focuses on updating some of
+the most common builtins that use the index to operate as O(Populated).
+Here is a potential list of commands that could be valuable to integrate
+at this point:
+
+* `git commit`
+* `git checkout`
+* `git merge`
+* `git rebase`
+
+Along with `git status` and `git add`, these commands cover the majority
+of users' interactions with the working directory. In addition, we can
+integrate with these commands:
+
+* `git grep`
+* `git rm`
+
+These have been proposed as some whose behavior could change when in a
+repo with a sparse-checkout definition. It would be good to include this
+behavior automatically when using a sparse-index. Some clarity is needed
+to make the behavior switch clear to the user.
+
+This phase is the first where parallel work might be possible without too
+many conflicts between topics.
+
+Phase IV: The long tail
+-----------------------
+
+This last phase is less a "phase" and more "the new normal" after all of
+the previous work.
+
+To start, the `command_requires_full_index` option could be removed in
+favor of expanding only when hitting an API guard.
+
+There are many Git commands that could use special attention to operate as
+O(Populated), while some might be so rare that it is acceptable to leave
+them with additional overhead when a sparse-index is present.
+
+Here are some commands that might be useful to update:
+
+* `git sparse-checkout set`
+* `git am`
+* `git clean`
+* `git stash`
-- 
gitgitgadget



* [PATCH 02/20] t/perf: add performance test for sparse operations
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
  2021-02-23 20:14 ` [PATCH 01/20] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-24  2:30   ` Elijah Newren
  2021-02-23 20:14 ` [PATCH 03/20] t1092: clean up script quoting Derrick Stolee via GitGitGadget
                   ` (19 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Create a test script that takes the default performance test (the Git
codebase) and multiplies it by 256 using four layers of duplicated
trees of width four. This results in nearly one million blob entries in
the index. Then, we can clone this repository with sparse-checkout
patterns that demonstrate four copies of the initial repository. Each
clone will use a different index format or mode so performance can be
tested across the different options.
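
The "multiplies it by 256" figure comes from four rounds of duplication
with four subdirectories each (4^4 = 256). As a rough sketch of that
arithmetic, assuming each round also adds one extra blob next to the
four subdirectories (as the setup script below does with 'a'), the
blob count can be computed as follows. The function count_blobs() and
its parameters are hypothetical names used only for illustration; they
are not part of this series.

```c
#include <assert.h>

/*
 * Count blobs in the tree after 'levels' rounds of duplication.
 * Each round builds a new tree holding one extra blob plus 'width'
 * subdirectories that all point at the previous tree.
 */
unsigned long count_blobs(unsigned long base, int levels, int width)
{
	unsigned long total = base;
	int i;

	assert(levels >= 0 && width > 0);
	for (i = 0; i < levels; i++)
		total = 1 + (unsigned long)width * total;
	return total;
}
```

For four levels of width four this works out to 256 * base + 85, so a
hypothetical base tree of 3,900 blobs would produce 998,485 index
entries.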

Note that the initial repo is stripped of submodules before doing the
copies. This preserves the expected data shape of the sparse index,
because directories containing submodules are not collapsed to a sparse
directory entry.

Run a few Git commands on these clones, especially those that use the
index (status, add, commit).

Here are the results on my Linux machine:

Test
--------------------------------------------------------------
2000.2: git status (full-index-v3)             0.37(0.30+0.09)
2000.3: git status (full-index-v4)             0.39(0.32+0.10)
2000.4: git add -A (full-index-v3)             1.42(1.06+0.20)
2000.5: git add -A (full-index-v4)             1.26(0.98+0.16)
2000.6: git add . (full-index-v3)              1.40(1.04+0.18)
2000.7: git add . (full-index-v4)              1.26(0.98+0.17)
2000.8: git commit -a -m A (full-index-v3)     1.42(1.11+0.16)
2000.9: git commit -a -m A (full-index-v4)     1.33(1.08+0.16)

It is perhaps noteworthy that there is an improvement when using index
version 4. This is because the v3 index uses 108 MiB while the v4
index uses 80 MiB. Since the repeated portions of the directories are
very short (f3/f1/f2, for example) this ratio is less pronounced than in
similarly-sized real repositories.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/perf/p2000-sparse-operations.sh | 87 +++++++++++++++++++++++++++++++
 1 file changed, 87 insertions(+)
 create mode 100755 t/perf/p2000-sparse-operations.sh

diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
new file mode 100755
index 000000000000..52597683376e
--- /dev/null
+++ b/t/perf/p2000-sparse-operations.sh
@@ -0,0 +1,87 @@
+#!/bin/sh
+
+test_description="test performance of Git operations using the index"
+
+. ./perf-lib.sh
+
+test_perf_default_repo
+
+SPARSE_CONE=f2/f4/f1
+
+test_expect_success 'setup repo and indexes' '
+	git reset --hard HEAD &&
+	# Remove submodules from the example repo, because our
+	# duplication of the entire repo creates an unlikely data shape.
+	git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
+	rm -f .gitmodules &&
+	git add .gitmodules &&
+	for module in $(awk "{print \$2}" modules)
+	do
+		git rm $module || return 1
+	done &&
+	git add . &&
+	git commit -m "remove submodules" &&
+
+	echo bogus >a &&
+	cp a b &&
+	git add a b &&
+	git commit -m "level 0" &&
+	BLOB=$(git rev-parse HEAD:a) &&
+	OLD_COMMIT=$(git rev-parse HEAD) &&
+	OLD_TREE=$(git rev-parse HEAD^{tree}) &&
+
+	for i in $(test_seq 1 4)
+	do
+		cat >in <<-EOF &&
+			100755 blob $BLOB	a
+			040000 tree $OLD_TREE	f1
+			040000 tree $OLD_TREE	f2
+			040000 tree $OLD_TREE	f3
+			040000 tree $OLD_TREE	f4
+		EOF
+		NEW_TREE=$(git mktree <in) &&
+		NEW_COMMIT=$(git commit-tree $NEW_TREE -p $OLD_COMMIT -m "level $i") &&
+		OLD_TREE=$NEW_TREE &&
+		OLD_COMMIT=$NEW_COMMIT || return 1
+	done &&
+
+	git sparse-checkout init --cone &&
+	git branch -f wide $OLD_COMMIT &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . full-index-v3 &&
+	(
+		cd full-index-v3 &&
+		git sparse-checkout init --cone &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 3 &&
+		git update-index --index-version=3
+	) &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . full-index-v4 &&
+	(
+		cd full-index-v4 &&
+		git sparse-checkout init --cone &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 4 &&
+		git update-index --index-version=4
+	)
+'
+
+test_perf_on_all () {
+	command="$@"
+	for repo in full-index-v3 full-index-v4
+	do
+		test_perf "$command ($repo)" "
+			(
+				cd $repo &&
+				echo >>$SPARSE_CONE/a &&
+				$command
+			)
+		"
+	done
+}
+
+test_perf_on_all git status
+test_perf_on_all git add -A
+test_perf_on_all git add .
+test_perf_on_all git commit -a -m A
+
+test_done
-- 
gitgitgadget



* [PATCH 03/20] t1092: clean up script quoting
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
  2021-02-23 20:14 ` [PATCH 01/20] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
  2021-02-23 20:14 ` [PATCH 02/20] t/perf: add performance test for sparse operations Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-23 20:14 ` [PATCH 04/20] sparse-index: add guard to ensure full index Derrick Stolee via GitGitGadget
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

This test was introduced in 19a0acc83e4 (t1092: test interesting
sparse-checkout scenarios, 2021-01-23), but these issues with quoting
were not noticed until starting this follow-up series. The old mechanism
would drop quoting such as in

   test_all_match git commit -m "touch README.md"

The above happened to work because README.md is a file in the
repository, so 'git commit -m touch README.md' would succeed by
accident.

Other cases included quoting for no good reason, so clean that up now.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t1092-sparse-checkout-compatibility.sh | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 8cd3e5a8d227..3725d3997e70 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -96,20 +96,20 @@ init_repos () {
 run_on_sparse () {
 	(
 		cd sparse-checkout &&
-		$* >../sparse-checkout-out 2>../sparse-checkout-err
+		"$@" >../sparse-checkout-out 2>../sparse-checkout-err
 	)
 }
 
 run_on_all () {
 	(
 		cd full-checkout &&
-		$* >../full-checkout-out 2>../full-checkout-err
+		"$@" >../full-checkout-out 2>../full-checkout-err
 	) &&
-	run_on_sparse $*
+	run_on_sparse "$@"
 }
 
 test_all_match () {
-	run_on_all $* &&
+	run_on_all "$@" &&
 	test_cmp full-checkout-out sparse-checkout-out &&
 	test_cmp full-checkout-err sparse-checkout-err
 }
@@ -119,7 +119,7 @@ test_expect_success 'status with options' '
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git status --porcelain=v2 -z -u &&
 	test_all_match git status --porcelain=v2 -uno &&
-	run_on_all "touch README.md" &&
+	run_on_all touch README.md &&
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git status --porcelain=v2 -z -u &&
 	test_all_match git status --porcelain=v2 -uno &&
@@ -135,7 +135,7 @@ test_expect_success 'add, commit, checkout' '
 	write_script edit-contents <<-\EOF &&
 	echo text >>$1
 	EOF
-	run_on_all "../edit-contents README.md" &&
+	run_on_all ../edit-contents README.md &&
 
 	test_all_match git add README.md &&
 	test_all_match git status --porcelain=v2 &&
@@ -144,7 +144,7 @@ test_expect_success 'add, commit, checkout' '
 	test_all_match git checkout HEAD~1 &&
 	test_all_match git checkout - &&
 
-	run_on_all "../edit-contents README.md" &&
+	run_on_all ../edit-contents README.md &&
 
 	test_all_match git add -A &&
 	test_all_match git status --porcelain=v2 &&
@@ -153,7 +153,7 @@ test_expect_success 'add, commit, checkout' '
 	test_all_match git checkout HEAD~1 &&
 	test_all_match git checkout - &&
 
-	run_on_all "../edit-contents deep/newfile" &&
+	run_on_all ../edit-contents deep/newfile &&
 
 	test_all_match git status --porcelain=v2 -uno &&
 	test_all_match git status --porcelain=v2 &&
@@ -186,7 +186,7 @@ test_expect_success 'diff --staged' '
 	write_script edit-contents <<-\EOF &&
 	echo text >>README.md
 	EOF
-	run_on_all "../edit-contents" &&
+	run_on_all ../edit-contents &&
 
 	test_all_match git diff &&
 	test_all_match git diff --staged &&
@@ -280,7 +280,7 @@ test_expect_success 'clean' '
 	echo bogus >>.gitignore &&
 	run_on_all cp ../.gitignore . &&
 	test_all_match git add .gitignore &&
-	test_all_match git commit -m ignore-bogus-files &&
+	test_all_match git commit -m "ignore bogus files" &&
 
 	run_on_sparse mkdir folder1 &&
 	run_on_all touch folder1/bogus &&
-- 
gitgitgadget



* [PATCH 04/20] sparse-index: add guard to ensure full index
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (2 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 03/20] t1092: clean up script quoting Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-24  2:44   ` Elijah Newren
  2021-02-23 20:14 ` [PATCH 05/20] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
                   ` (17 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Upcoming changes will introduce modifications to the index format that
allow sparse directories. It will be useful to have a mechanism for
converting those sparse index files into full indexes by walking the
tree at those sparse directories. Name this method ensure_full_index()
as it will guarantee that the index is fully expanded.

This method is not implemented yet, and instead we focus on the
scaffolding to declare it and call it at the appropriate time.

Add a 'command_requires_full_index' member to struct repo_settings. This
will be an indicator that we need the index in full mode to do certain
index operations. This starts as being true for every command, then we
will set it to false as some commands integrate with sparse indexes.

If 'command_requires_full_index' is true, then we will immediately
expand a sparse index to a full one upon reading from disk. This
suffices for now, but we will want to add more callers to
ensure_full_index() later.
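
To illustrate the intended control flow, here is a minimal standalone
sketch of the guard. Everything in it is a hypothetical stand-in
(struct mock_index, struct mock_settings, mock_ensure_full_index() and
mock_read_index() are not Git APIs), and the early return for an
already-full index is an assumption about how the real method will
behave once it is implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct index_state. */
struct mock_index {
	unsigned sparse:1;	/* index holds sparse directory entries */
	int expand_calls;	/* how often an expansion actually ran */
};

/* Hypothetical stand-in for struct repo_settings. */
struct mock_settings {
	unsigned command_requires_full_index:1;
};

/* Expand only when needed, so calling this on a full index is cheap. */
void mock_ensure_full_index(struct mock_index *istate)
{
	assert(istate != NULL);
	if (!istate->sparse)
		return;
	istate->sparse = 0;
	istate->expand_calls++;
}

/* Same shape as the read path: load from disk, then apply the guard. */
void mock_read_index(struct mock_index *istate,
		     const struct mock_settings *settings,
		     int on_disk_is_sparse)
{
	assert(istate != NULL && settings != NULL);
	istate->sparse = on_disk_is_sparse ? 1 : 0;
	if (settings->command_requires_full_index)
		mock_ensure_full_index(istate);
}
```

With the setting left at its default of 1, every read ends with a full
index. A command that has been integrated with the sparse index would
clear the setting and call the guard itself only where it needs to.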

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Makefile        |  1 +
 repo-settings.c |  8 ++++++++
 repository.c    | 11 ++++++++++-
 repository.h    |  2 ++
 sparse-index.c  |  8 ++++++++
 sparse-index.h  |  7 +++++++
 6 files changed, 36 insertions(+), 1 deletion(-)
 create mode 100644 sparse-index.c
 create mode 100644 sparse-index.h

diff --git a/Makefile b/Makefile
index 5a239cac20e3..3bf61699238d 100644
--- a/Makefile
+++ b/Makefile
@@ -980,6 +980,7 @@ LIB_OBJS += setup.o
 LIB_OBJS += shallow.o
 LIB_OBJS += sideband.o
 LIB_OBJS += sigchain.o
+LIB_OBJS += sparse-index.o
 LIB_OBJS += split-index.o
 LIB_OBJS += stable-qsort.o
 LIB_OBJS += strbuf.o
diff --git a/repo-settings.c b/repo-settings.c
index f7fff0f5ab83..d63569e4041e 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -77,4 +77,12 @@ void prepare_repo_settings(struct repository *r)
 		UPDATE_DEFAULT_BOOL(r->settings.core_untracked_cache, UNTRACKED_CACHE_KEEP);
 
 	UPDATE_DEFAULT_BOOL(r->settings.fetch_negotiation_algorithm, FETCH_NEGOTIATION_DEFAULT);
+
+	/*
+	 * This setting guards all index reads to require a full index
+	 * over a sparse index. After suitable guards are placed in the
+	 * codebase around uses of the index, this setting will be
+	 * removed.
+	 */
+	r->settings.command_requires_full_index = 1;
 }
diff --git a/repository.c b/repository.c
index c98298acd017..a8acae002f71 100644
--- a/repository.c
+++ b/repository.c
@@ -10,6 +10,7 @@
 #include "object.h"
 #include "lockfile.h"
 #include "submodule-config.h"
+#include "sparse-index.h"
 
 /* The main repository */
 static struct repository the_repo;
@@ -261,6 +262,8 @@ void repo_clear(struct repository *repo)
 
 int repo_read_index(struct repository *repo)
 {
+	int res;
+
 	if (!repo->index)
 		repo->index = xcalloc(1, sizeof(*repo->index));
 
@@ -270,7 +273,13 @@ int repo_read_index(struct repository *repo)
 	else if (repo->index->repo != repo)
 		BUG("repo's index should point back at itself");
 
-	return read_index_from(repo->index, repo->index_file, repo->gitdir);
+	res = read_index_from(repo->index, repo->index_file, repo->gitdir);
+
+	prepare_repo_settings(repo);
+	if (repo->settings.command_requires_full_index)
+		ensure_full_index(repo->index);
+
+	return res;
 }
 
 int repo_hold_locked_index(struct repository *repo,
diff --git a/repository.h b/repository.h
index b385ca3c94b6..e06a23015697 100644
--- a/repository.h
+++ b/repository.h
@@ -41,6 +41,8 @@ struct repo_settings {
 	enum fetch_negotiation_setting fetch_negotiation_algorithm;
 
 	int core_multi_pack_index;
+
+	unsigned command_requires_full_index:1;
 };
 
 struct repository {
diff --git a/sparse-index.c b/sparse-index.c
new file mode 100644
index 000000000000..82183ead563b
--- /dev/null
+++ b/sparse-index.c
@@ -0,0 +1,8 @@
+#include "cache.h"
+#include "repository.h"
+#include "sparse-index.h"
+
+void ensure_full_index(struct index_state *istate)
+{
+	/* intentionally left blank */
+}
diff --git a/sparse-index.h b/sparse-index.h
new file mode 100644
index 000000000000..09a20d036c46
--- /dev/null
+++ b/sparse-index.h
@@ -0,0 +1,7 @@
+#ifndef SPARSE_INDEX_H__
+#define SPARSE_INDEX_H__
+
+struct index_state;
+void ensure_full_index(struct index_state *istate);
+
+#endif
-- 
gitgitgadget



* [PATCH 05/20] sparse-index: implement ensure_full_index()
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (3 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 04/20] sparse-index: add guard to ensure full index Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-24  3:20   ` Elijah Newren
  2021-02-23 20:14 ` [PATCH 06/20] t1092: compare sparse-checkout to sparse-index Derrick Stolee via GitGitGadget
                   ` (16 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

We will mark an in-memory index_state as having sparse directory entries
with the sparse_index bit. These currently cannot exist, but we will add
a mechanism for collapsing a full index to a sparse one in a later
change. That will happen at write time, so we must first allow parsing
the format before writing it.

Commands or methods that require a full index in order to operate can
call ensure_full_index() to expand that index in-memory. This requires
parsing trees using that index's repository.

Sparse directory entries have a specific 'ce_mode' value. The macro
S_ISSPARSEDIR(ce->ce_mode) can check if a cache_entry 'ce' has this type.
This ce_mode is not possible with the existing index formats, so checking
the mode is sufficient and we do not verify every property of a
sparse-directory entry. Those properties are:

 1. ce->ce_mode == 0040000
 2. ce->flags & CE_SKIP_WORKTREE is true
 3. ce->name[ce->namelen - 1] == '/' (ends in dir separator)
 4. ce->oid references a tree object.

These are all enforced to some extent in ensure_full_index(). Any
deviation will cause a warning at minimum or a failure in the worst
case.
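
For readers who want to see properties 1-3 spelled out as a single
check, here is a standalone sketch. It is not code from this patch:
struct mock_entry, MOCK_MODE_DIR, MOCK_SKIP_WORKTREE and
looks_like_sparse_dir() are hypothetical names standing in for
struct cache_entry, S_IFDIR, CE_SKIP_WORKTREE and whatever validation
the real code performs. Property 4 needs an object lookup and is left
out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; the real definitions live in cache.h. */
#define MOCK_MODE_DIR       0040000u
#define MOCK_SKIP_WORKTREE  (1u << 0)

struct mock_entry {
	unsigned int mode;
	unsigned int flags;
	const char *name;
};

/*
 * Return 1 if 'ce' satisfies properties 1-3 from the list above.
 * Property 4 (the object ID names a tree) would require reading the
 * object database and is out of scope for this sketch.
 */
int looks_like_sparse_dir(const struct mock_entry *ce)
{
	size_t len;

	assert(ce != NULL && ce->name != NULL);
	len = strlen(ce->name);

	if (ce->mode != MOCK_MODE_DIR)
		return 0;
	if (!(ce->flags & MOCK_SKIP_WORKTREE))
		return 0;
	return len > 0 && ce->name[len - 1] == '/';
}
```

An entry such as "folder1/" with the directory mode and the skip bit
set passes, while a regular file, a directory path without the
trailing slash, or an entry lacking the skip bit does not.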

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache.h        |  7 +++-
 read-cache.c   |  9 +++++
 sparse-index.c | 95 +++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 109 insertions(+), 2 deletions(-)

diff --git a/cache.h b/cache.h
index d92814961405..1336c8d7435e 100644
--- a/cache.h
+++ b/cache.h
@@ -204,6 +204,8 @@ struct cache_entry {
 #error "CE_EXTENDED_FLAGS out of range"
 #endif
 
+#define S_ISSPARSEDIR(m) ((m) == S_IFDIR)
+
 /* Forward structure decls */
 struct pathspec;
 struct child_process;
@@ -319,7 +321,8 @@ struct index_state {
 		 drop_cache_tree : 1,
 		 updated_workdir : 1,
 		 updated_skipworktree : 1,
-		 fsmonitor_has_run_once : 1;
+		 fsmonitor_has_run_once : 1,
+		 sparse_index : 1;
 	struct hashmap name_hash;
 	struct hashmap dir_hash;
 	struct object_id oid;
@@ -722,6 +725,8 @@ int read_index_from(struct index_state *, const char *path,
 		    const char *gitdir);
 int is_index_unborn(struct index_state *);
 
+void ensure_full_index(struct index_state *istate);
+
 /* For use with `write_locked_index()`. */
 #define COMMIT_LOCK		(1 << 0)
 #define SKIP_IF_UNCHANGED	(1 << 1)
diff --git a/read-cache.c b/read-cache.c
index 29144cf879e7..97dbf2434f30 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -101,6 +101,9 @@ static const char *alternate_index_output;
 
 static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
 {
+	if (S_ISSPARSEDIR(ce->ce_mode))
+		istate->sparse_index = 1;
+
 	istate->cache[nr] = ce;
 	add_name_hash(istate, ce);
 }
@@ -2255,6 +2258,12 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist)
 	trace2_data_intmax("index", the_repository, "read/cache_nr",
 			   istate->cache_nr);
 
+	if (!istate->repo)
+		istate->repo = the_repository;
+	prepare_repo_settings(istate->repo);
+	if (istate->repo->settings.command_requires_full_index)
+		ensure_full_index(istate);
+
 	return istate->cache_nr;
 
 unmap:
diff --git a/sparse-index.c b/sparse-index.c
index 82183ead563b..316cb949b74b 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -1,8 +1,101 @@
 #include "cache.h"
 #include "repository.h"
 #include "sparse-index.h"
+#include "tree.h"
+#include "pathspec.h"
+#include "trace2.h"
+
+static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
+{
+	ALLOC_GROW(istate->cache, nr + 1, istate->cache_alloc);
+
+	istate->cache[nr] = ce;
+	add_name_hash(istate, ce);
+}
+
+static int add_path_to_index(const struct object_id *oid,
+				struct strbuf *base, const char *path,
+				unsigned int mode, int stage, void *context)
+{
+	struct index_state *istate = (struct index_state *)context;
+	struct cache_entry *ce;
+	size_t len = base->len;
+
+	if (S_ISDIR(mode))
+		return READ_TREE_RECURSIVE;
+
+	strbuf_addstr(base, path);
+
+	ce = make_cache_entry(istate, mode, oid, base->buf, 0, 0);
+	ce->ce_flags |= CE_SKIP_WORKTREE;
+	set_index_entry(istate, istate->cache_nr++, ce);
+
+	strbuf_setlen(base, len);
+	return 0;
+}
 
 void ensure_full_index(struct index_state *istate)
 {
-	/* intentionally left blank */
+	int i;
+	struct index_state *full;
+
+	if (!istate || !istate->sparse_index)
+		return;
+
+	if (!istate->repo)
+		istate->repo = the_repository;
+
+	trace2_region_enter("index", "ensure_full_index", istate->repo);
+
+	/* initialize basics of new index */
+	full = xcalloc(1, sizeof(struct index_state));
+	memcpy(full, istate, sizeof(struct index_state));
+
+	/* then change the necessary things */
+	full->sparse_index = 0;
+	full->cache_alloc = (3 * istate->cache_alloc) / 2;
+	full->cache_nr = 0;
+	ALLOC_ARRAY(full->cache, full->cache_alloc);
+
+	for (i = 0; i < istate->cache_nr; i++) {
+		struct cache_entry *ce = istate->cache[i];
+		struct tree *tree;
+		struct pathspec ps;
+
+		if (!S_ISSPARSEDIR(ce->ce_mode)) {
+			set_index_entry(full, full->cache_nr++, ce);
+			continue;
+		}
+		if (!(ce->ce_flags & CE_SKIP_WORKTREE))
+			warning(_("index entry is a directory, but not sparse (%08x)"),
+				ce->ce_flags);
+
+		/* recursively walk into ce->name */
+		tree = lookup_tree(istate->repo, &ce->oid);
+
+		memset(&ps, 0, sizeof(ps));
+		ps.recursive = 1;
+		ps.has_wildcard = 1;
+		ps.max_depth = -1;
+
+		read_tree_recursive(istate->repo, tree,
+				    ce->name, strlen(ce->name),
+				    0, &ps,
+				    add_path_to_index, full);
+
+		/* free directory entries. full entries are re-used */
+		discard_cache_entry(ce);
+	}
+
+	/* Copy back into original index. */
+	memcpy(&istate->name_hash, &full->name_hash, sizeof(full->name_hash));
+	istate->sparse_index = 0;
+	free(istate->cache);
+	istate->cache = full->cache;
+	istate->cache_nr = full->cache_nr;
+	istate->cache_alloc = full->cache_alloc;
+
+	free(full);
+
+	trace2_region_leave("index", "ensure_full_index", istate->repo);
 }
-- 
gitgitgadget


* [PATCH 06/20] t1092: compare sparse-checkout to sparse-index
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (4 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 05/20] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-25  6:37   ` Elijah Newren
  2021-02-23 20:14 ` [PATCH 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
                   ` (15 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add a new 'sparse-index' repo alongside the 'full-checkout' and
'sparse-checkout' repos in t1092-sparse-checkout-compatibility.sh. Also
add run_on_sparse and test_sparse_match helpers. These helpers will be
used when the sparse index is implemented.

Add the GIT_TEST_SPARSE_INDEX environment variable to enable the
sparse-index by default. It is intended for use across the entire
test suite, but it only affects cases where the sparse-checkout
feature is enabled.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/README                                 |  3 +++
 t/t1092-sparse-checkout-compatibility.sh | 24 ++++++++++++++++++++----
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/t/README b/t/README
index 593d4a4e270c..b98bc563aab5 100644
--- a/t/README
+++ b/t/README
@@ -439,6 +439,9 @@ and "sha256".
 GIT_TEST_WRITE_REV_INDEX=<boolean>, when true enables the
 'pack.writeReverseIndex' setting.
 
+GIT_TEST_SPARSE_INDEX=<boolean>, when true enables index writes to use the
+sparse-index format by default.
+
 Naming Tests
 ------------
 
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 3725d3997e70..71d6f9e4c014 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -7,6 +7,7 @@ test_description='compare full workdir to sparse workdir'
 test_expect_success 'setup' '
 	git init initial-repo &&
 	(
+		GIT_TEST_SPARSE_INDEX=0 &&
 		cd initial-repo &&
 		echo a >a &&
 		echo "after deep" >e &&
@@ -87,23 +88,32 @@ init_repos () {
 
 	cp -r initial-repo sparse-checkout &&
 	git -C sparse-checkout reset --hard &&
-	git -C sparse-checkout sparse-checkout init --cone &&
+
+	cp -r initial-repo sparse-index &&
+	git -C sparse-index reset --hard &&
 
 	# initialize sparse-checkout definitions
-	git -C sparse-checkout sparse-checkout set deep
+	git -C sparse-checkout sparse-checkout init --cone &&
+	git -C sparse-checkout sparse-checkout set deep &&
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout init --cone &&
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep
 }
 
 run_on_sparse () {
 	(
 		cd sparse-checkout &&
-		"$@" >../sparse-checkout-out 2>../sparse-checkout-err
+		GIT_TEST_SPARSE_INDEX=0 "$@" >../sparse-checkout-out 2>../sparse-checkout-err
+	) &&
+	(
+		cd sparse-index &&
+		GIT_TEST_SPARSE_INDEX=1 "$@" >../sparse-index-out 2>../sparse-index-err
 	)
 }
 
 run_on_all () {
 	(
 		cd full-checkout &&
-		"$@" >../full-checkout-out 2>../full-checkout-err
+		GIT_TEST_SPARSE_INDEX=0 "$@" >../full-checkout-out 2>../full-checkout-err
 	) &&
 	run_on_sparse "$@"
 }
@@ -114,6 +124,12 @@ test_all_match () {
 	test_cmp full-checkout-err sparse-checkout-err
 }
 
+test_sparse_match () {
+	run_on_sparse $* &&
+	test_cmp sparse-checkout-out sparse-index-out &&
+	test_cmp sparse-checkout-err sparse-index-err
+}
+
 test_expect_success 'status with options' '
 	init_repos &&
 	test_all_match git status --porcelain=v2 &&
-- 
gitgitgadget


* [PATCH 07/20] test-read-cache: print cache entries with --table
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (5 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 06/20] t1092: compare sparse-checkout to sparse-index Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-25  7:02   ` Elijah Newren
  2021-02-23 20:14 ` [PATCH 08/20] test-tool: don't force full index Derrick Stolee via GitGitGadget
                   ` (14 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

This table is helpful for discovering data in the index to ensure it is
being written correctly, especially as we build and test the
sparse-index. This table includes an output format similar to 'git
ls-tree', but should not be compared to that directly. The biggest
reasons are that 'git ls-tree' includes a tree entry for every
subdirectory, even those that would not appear as a sparse directory in
a sparse-index. Further, 'git ls-tree' does not use a trailing directory
separator for its tree rows.

This does not print the stat() information for the blobs. That could be
added in a future change with another option. The tests that are added
in the next few changes care only about the object types and IDs.

To make the option parsing slightly more robust, wrap the string
comparisons in a loop adapted from test-dir-iterator.c.

Care must be taken with the final check for the 'cnt' variable. We
continue the expectation that the numerical value is the final argument.
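
The parsing order described above can be sketched on its own as
follows. This is a simplified illustration, not the helper itself: the
'toy_opts' struct and toy_parse() are hypothetical, strncmp() stands in
for git's starts_with() and skip_prefix(), and argv is assumed to be
NULL-terminated as it is for main().

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the parsed options. */
struct toy_opts {
	const char *name; /* value of --print-and-refresh=<name>, or NULL */
	int table;        /* 1 if --table was given */
	int cnt;          /* trailing numeric argument, default 1 */
};

static struct toy_opts toy_parse(int argc, const char **argv)
{
	struct toy_opts o = { NULL, 0, 1 };
	static const char prefix[] = "--print-and-refresh=";

	/* Skip argv[0], then consume every leading "--" argument. */
	for (++argv, --argc; *argv && !strncmp(*argv, "--", 2); ++argv, --argc) {
		if (!strncmp(*argv, prefix, sizeof(prefix) - 1))
			o.name = *argv + sizeof(prefix) - 1;
		else if (!strcmp(*argv, "--table"))
			o.table = 1;
	}

	/* Exactly one argument left over: treat it as the iteration count. */
	if (argc == 1)
		o.cnt = (int)strtol(argv[0], NULL, 0);
	return o;
}
```

Because the loop decrements argc as it advances argv, the count check
compares against 1 and reads argv[0] rather than the old "argc == 2"
and argv[1].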

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/helper/test-read-cache.c | 50 ++++++++++++++++++++++++++++++--------
 1 file changed, 40 insertions(+), 10 deletions(-)

diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
index 244977a29bdf..e4c3492f7d3e 100644
--- a/t/helper/test-read-cache.c
+++ b/t/helper/test-read-cache.c
@@ -2,35 +2,65 @@
 #include "cache.h"
 #include "config.h"
 
+static void print_cache_entry(struct cache_entry *ce)
+{
+	printf("%06o ", ce->ce_mode & 0777777);
+
+	if (S_ISSPARSEDIR(ce->ce_mode))
+		printf("tree ");
+	else if (S_ISGITLINK(ce->ce_mode))
+		printf("commit ");
+	else
+		printf("blob ");
+
+	printf("%s\t%s\n",
+	       oid_to_hex(&ce->oid),
+	       ce->name);
+}
+
+static void print_cache(struct index_state *cache)
+{
+	int i;
+	for (i = 0; i < cache->cache_nr; i++)
+		print_cache_entry(cache->cache[i]);
+}
+
 int cmd__read_cache(int argc, const char **argv)
 {
+	struct repository *r = the_repository;
 	int i, cnt = 1;
 	const char *name = NULL;
+	int table = 0;
 
-	if (argc > 1 && skip_prefix(argv[1], "--print-and-refresh=", &name)) {
-		argc--;
-		argv++;
+	for (++argv, --argc; *argv && starts_with(*argv, "--"); ++argv, --argc) {
+		if (skip_prefix(*argv, "--print-and-refresh=", &name))
+			continue;
+		if (!strcmp(*argv, "--table"))
+			table = 1;
 	}
 
-	if (argc == 2)
-		cnt = strtol(argv[1], NULL, 0);
+	if (argc == 1)
+		cnt = strtol(argv[0], NULL, 0);
 	setup_git_directory();
 	git_config(git_default_config, NULL);
+
 	for (i = 0; i < cnt; i++) {
-		read_cache();
+		repo_read_index(r);
 		if (name) {
 			int pos;
 
-			refresh_index(&the_index, REFRESH_QUIET,
+			refresh_index(r->index, REFRESH_QUIET,
 				      NULL, NULL, NULL);
-			pos = index_name_pos(&the_index, name, strlen(name));
+			pos = index_name_pos(r->index, name, strlen(name));
 			if (pos < 0)
 				die("%s not in index", name);
 			printf("%s is%s up to date\n", name,
-			       ce_uptodate(the_index.cache[pos]) ? "" : " not");
+			       ce_uptodate(r->index->cache[pos]) ? "" : " not");
 			write_file(name, "%d\n", i);
 		}
-		discard_cache();
+		if (table)
+			print_cache(r->index);
+		discard_index(r->index);
 	}
 	return 0;
 }
-- 
gitgitgadget


* [PATCH 08/20] test-tool: don't force full index
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (6 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-23 20:14 ` [PATCH 09/20] unpack-trees: ensure " Derrick Stolee via GitGitGadget
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

We will use 'test-tool read-cache --table' to check that a sparse
index is written as part of init_repos. Since we will no longer always
expand a sparse index into a full index, add an '--expand' parameter
that adds a call to ensure_full_index() so we can compare a sparse index
directly against a full index, or at least what the in-memory index
looks like when expanded in this way.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/helper/test-read-cache.c               | 13 ++++++++++++-
 t/t1092-sparse-checkout-compatibility.sh |  5 +++++
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
index e4c3492f7d3e..4780429dca6b 100644
--- a/t/helper/test-read-cache.c
+++ b/t/helper/test-read-cache.c
@@ -1,6 +1,7 @@
 #include "test-tool.h"
 #include "cache.h"
 #include "config.h"
+#include "sparse-index.h"
 
 static void print_cache_entry(struct cache_entry *ce)
 {
@@ -30,13 +31,19 @@ int cmd__read_cache(int argc, const char **argv)
 	struct repository *r = the_repository;
 	int i, cnt = 1;
 	const char *name = NULL;
-	int table = 0;
+	int table = 0, expand = 0;
+
+	initialize_the_repository();
+	prepare_repo_settings(r);
+	r->settings.command_requires_full_index = 0;
 
 	for (++argv, --argc; *argv && starts_with(*argv, "--"); ++argv, --argc) {
 		if (skip_prefix(*argv, "--print-and-refresh=", &name))
 			continue;
 		if (!strcmp(*argv, "--table"))
 			table = 1;
+		else if (!strcmp(*argv, "--expand"))
+			expand = 1;
 	}
 
 	if (argc == 1)
@@ -46,6 +53,10 @@ int cmd__read_cache(int argc, const char **argv)
 
 	for (i = 0; i < cnt; i++) {
 		repo_read_index(r);
+
+		if (expand)
+			ensure_full_index(r->index);
+
 		if (name) {
 			int pos;
 
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 71d6f9e4c014..4d789fe86b9d 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -130,6 +130,11 @@ test_sparse_match () {
 	test_cmp sparse-checkout-err sparse-index-err
 }
 
+test_expect_success 'expanded in-memory index matches full index' '
+	init_repos &&
+	test_sparse_match test-tool read-cache --expand --table
+'
+
 test_expect_success 'status with options' '
 	init_repos &&
 	test_all_match git status --porcelain=v2 &&
-- 
gitgitgadget


* [PATCH 09/20] unpack-trees: ensure full index
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (7 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 08/20] test-tool: don't force full index Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-23 20:14 ` [PATCH 10/20] sparse-checkout: hold pattern list in index Derrick Stolee via GitGitGadget
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The next change will translate full indexes into sparse indexes at write
time. The existing logic provides a way for every sparse index to be
expanded to a full index at read time. However, there are cases where an
index is written and then continues to be used in-memory to perform
further updates.

unpack_trees() is frequently called after such a write. In particular,
commands like 'git reset' do this double-update of the index.

Ensure that we have a full index when entering unpack_trees(), but only
when command_requires_full_index is true. This is always true at the
moment, but we will later relax that after unpack_trees() is updated to
handle sparse directory entries.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 unpack-trees.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/unpack-trees.c b/unpack-trees.c
index f5f668f532d8..4dd99219073a 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -1567,6 +1567,7 @@ static int verify_absent(const struct cache_entry *,
  */
 int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options *o)
 {
+	struct repository *repo = the_repository;
 	int i, ret;
 	static struct cache_entry *dfc;
 	struct pattern_list pl;
@@ -1578,6 +1579,12 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
 	trace_performance_enter();
 	trace2_region_enter("unpack_trees", "unpack_trees", the_repository);
 
+	prepare_repo_settings(repo);
+	if (repo->settings.command_requires_full_index) {
+		ensure_full_index(o->src_index);
+		ensure_full_index(o->dst_index);
+	}
+
 	if (!core_apply_sparse_checkout || !o->update)
 		o->skip_sparse_checkout = 1;
 	if (!o->skip_sparse_checkout && !o->pl) {
-- 
gitgitgadget


* [PATCH 10/20] sparse-checkout: hold pattern list in index
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (8 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 09/20] unpack-trees: ensure " Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-25  7:14   ` Elijah Newren
  2021-02-23 20:14 ` [PATCH 11/20] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
                   ` (11 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

As we modify the sparse-checkout definition, we perform index operations
on a pattern_list that only exists in-memory. This allows easy backing
out in case the index update fails.

However, if the index write itself cares about the sparse-checkout
pattern set, we need access to that in-memory copy. Place a pointer to
a 'struct pattern_list' in the index so we can access this on-demand.
This will be used in the next change which uses the sparse-checkout
definition to filter out directories that are outside the sparse cone.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/sparse-checkout.c | 17 ++++++++++-------
 cache.h                   |  2 ++
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
index 2306a9ad98e0..e00b82af727b 100644
--- a/builtin/sparse-checkout.c
+++ b/builtin/sparse-checkout.c
@@ -110,6 +110,8 @@ static int update_working_directory(struct pattern_list *pl)
 	if (is_index_unborn(r->index))
 		return UPDATE_SPARSITY_SUCCESS;
 
+	r->index->sparse_checkout_patterns = pl;
+
 	memset(&o, 0, sizeof(o));
 	o.verbose_update = isatty(2);
 	o.update = 1;
@@ -138,6 +140,7 @@ static int update_working_directory(struct pattern_list *pl)
 	else
 		rollback_lock_file(&lock_file);
 
+	r->index->sparse_checkout_patterns = NULL;
 	return result;
 }
 
@@ -517,19 +520,18 @@ static int modify_pattern_list(int argc, const char **argv, enum modify_type m)
 {
 	int result;
 	int changed_config = 0;
-	struct pattern_list pl;
-	memset(&pl, 0, sizeof(pl));
+	struct pattern_list *pl = xcalloc(1, sizeof(*pl));
 
 	switch (m) {
 	case ADD:
 		if (core_sparse_checkout_cone)
-			add_patterns_cone_mode(argc, argv, &pl);
+			add_patterns_cone_mode(argc, argv, pl);
 		else
-			add_patterns_literal(argc, argv, &pl);
+			add_patterns_literal(argc, argv, pl);
 		break;
 
 	case REPLACE:
-		add_patterns_from_input(&pl, argc, argv);
+		add_patterns_from_input(pl, argc, argv);
 		break;
 	}
 
@@ -539,12 +541,13 @@ static int modify_pattern_list(int argc, const char **argv, enum modify_type m)
 		changed_config = 1;
 	}
 
-	result = write_patterns_and_update(&pl);
+	result = write_patterns_and_update(pl);
 
 	if (result && changed_config)
 		set_config(MODE_NO_PATTERNS);
 
-	clear_pattern_list(&pl);
+	clear_pattern_list(pl);
+	free(pl);
 	return result;
 }
 
diff --git a/cache.h b/cache.h
index 1336c8d7435e..d75b352f38d3 100644
--- a/cache.h
+++ b/cache.h
@@ -307,6 +307,7 @@ static inline unsigned int canon_mode(unsigned int mode)
 struct split_index;
 struct untracked_cache;
 struct progress;
+struct pattern_list;
 
 struct index_state {
 	struct cache_entry **cache;
@@ -332,6 +333,7 @@ struct index_state {
 	struct mem_pool *ce_mem_pool;
 	struct progress *progress;
 	struct repository *repo;
+	struct pattern_list *sparse_checkout_patterns;
 };
 
 /* Name hashing */
-- 
gitgitgadget


* [PATCH 11/20] sparse-index: convert from full to sparse
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (9 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 10/20] sparse-checkout: hold pattern list in index Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-25  7:33   ` Elijah Newren
  2021-02-23 20:14 ` [PATCH 12/20] submodule: sparse-index should not collapse links Derrick Stolee via GitGitGadget
                   ` (10 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

If we have a full index, then we can convert it to a sparse index by
replacing directories outside of the sparse cone with sparse directory
entries. The convert_to_sparse() method does this, when the situation is
appropriate.

For now, we avoid converting the index to a sparse index if:

 1. the index is split.
 2. the index is already sparse.
 3. sparse-checkout is disabled.
 4. sparse-checkout does not use cone mode.

Finally, we currently limit the conversion to when the
GIT_TEST_SPARSE_INDEX environment variable is enabled. A mode using Git
config will be added in a later change.

The trickiest thing about this conversion is that we might not be able
to mark a directory as a sparse directory just because it is outside the
sparse cone. There might be unmerged files within that directory, so we
need to look for those. Also, if there is some strange reason why a file
is not marked with CE_SKIP_WORKTREE, then we should give up on
converting that directory. There is still hope that some of its
subdirectories might be able to convert to sparse, so we keep looking
deeper.
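
The per-directory test described here can be sketched in isolation as
below. It is a simplified illustration rather than the patch's code:
'toy_file_entry', the flag value and the 'outside_cone' argument are
hypothetical stand-ins for 'struct cache_entry', CE_SKIP_WORKTREE and
the result of the pattern-list match.

```c
#include <stddef.h>

/* Hypothetical flag bit; the real CE_SKIP_WORKTREE lives in cache.h. */
#define TOY_CE_SKIP_WORKTREE (1u << 0)

/* Hypothetical stand-in for the fields of struct cache_entry used here. */
struct toy_file_entry {
	unsigned int flags; /* mirrors ce->ce_flags */
	int stage;          /* mirrors ce_stage(ce); nonzero means unmerged */
};

/*
 * Returns 1 if entries[start..end) may be replaced by a single sparse
 * directory entry: the directory is outside the cone, nothing in the
 * range is unmerged, and every entry has the skip-worktree bit.
 */
static int toy_can_collapse(const struct toy_file_entry *entries,
			    size_t start, size_t end, int outside_cone)
{
	size_t i;

	if (!outside_cone)
		return 0;

	for (i = start; i < end; i++) {
		if (entries[i].stage ||
		    !(entries[i].flags & TOY_CE_SKIP_WORKTREE))
			return 0;
	}
	return 1;
}
```

When the check fails for a directory, the same test is simply repeated
on each of its subdirectory ranges, which is why a partially dirty
directory can still contain collapsed children.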

The conversion process is assisted by the cache-tree extension. This is
calculated from the full index if it does not already exist. We then
abandon the cache-tree as it no longer applies to the newly-sparse
index. Thus, this cache-tree will be recalculated in every
sparse-full-sparse round-trip until we integrate the cache-tree
extension with the sparse index.

Some Git commands use the index after writing it. For example, 'git add'
will update the index, then write it to disk, then read its entries to
report information. To keep the in-memory index in a full state after
writing, we re-expand it to a full one after the write. This is wasteful
for commands that only write the index and do not read from it again,
but that is only the case until we make those commands "sparse aware."
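
The collapse, write, re-expand sequence can be pictured with this
minimal sketch. The 'toy_index' struct and the toy_* functions are
hypothetical placeholders for 'struct index_state',
convert_to_sparse(), ensure_full_index() and the actual index write:

```c
/* Hypothetical stand-in for struct index_state. */
struct toy_index {
	int sparse;          /* mirrors istate->sparse_index */
	int wrote_as_sparse; /* records the state the "disk" saw */
};

static int toy_convert_to_sparse(struct toy_index *idx)
{
	idx->sparse = 1;
	return 0;
}

static void toy_ensure_full(struct toy_index *idx)
{
	idx->sparse = 0;
}

static int toy_write_index(struct toy_index *idx)
{
	/* Remember the caller's view before collapsing for the write. */
	int was_full = !idx->sparse;
	int ret = toy_convert_to_sparse(idx);

	if (ret)
		return ret;

	idx->wrote_as_sparse = idx->sparse; /* stands in for the real write */

	/* Restore the full in-memory view for callers that keep reading. */
	if (was_full)
		toy_ensure_full(idx);
	return 0;
}
```

The on-disk copy is sparse while the caller still sees a full index
afterwards, at the cost of one expansion per write.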

We can compare the behavior of the sparse-index in
t1092-sparse-checkout-compatibility.sh by using GIT_TEST_SPARSE_INDEX=1
when operating on the 'sparse-index' repo. We can also compare the two
sparse repos directly, such as comparing their indexes (when expanded to
full in the case of the 'sparse-index' repo). We also verify that the
index is actually populated with sparse directory entries.

The 'checkout and reset (mixed)' test is marked for failure when
comparing a sparse repo to a full repo, but we can compare the two
sparse-checkout cases directly to ensure that we are not changing the
behavior when using a sparse index.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache-tree.c                             |   3 +
 cache.h                                  |   2 +
 read-cache.c                             |  26 ++++-
 sparse-index.c                           | 139 +++++++++++++++++++++++
 sparse-index.h                           |   1 +
 t/t1092-sparse-checkout-compatibility.sh |  61 +++++++++-
 6 files changed, 227 insertions(+), 5 deletions(-)

diff --git a/cache-tree.c b/cache-tree.c
index 2fb483d3c083..5f07a39e501e 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -6,6 +6,7 @@
 #include "object-store.h"
 #include "replace-object.h"
 #include "promisor-remote.h"
+#include "sparse-index.h"
 
 #ifndef DEBUG_CACHE_TREE
 #define DEBUG_CACHE_TREE 0
@@ -442,6 +443,8 @@ int cache_tree_update(struct index_state *istate, int flags)
 	if (i)
 		return i;
 
+	ensure_full_index(istate);
+
 	if (!istate->cache_tree)
 		istate->cache_tree = cache_tree();
 
diff --git a/cache.h b/cache.h
index d75b352f38d3..e8b7d3b4fb33 100644
--- a/cache.h
+++ b/cache.h
@@ -251,6 +251,8 @@ static inline unsigned int create_ce_mode(unsigned int mode)
 {
 	if (S_ISLNK(mode))
 		return S_IFLNK;
+	if (mode == S_IFDIR)
+		return S_IFDIR;
 	if (S_ISDIR(mode) || S_ISGITLINK(mode))
 		return S_IFGITLINK;
 	return S_IFREG | ce_permissions(mode);
diff --git a/read-cache.c b/read-cache.c
index 97dbf2434f30..67acbf202f4e 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -25,6 +25,7 @@
 #include "fsmonitor.h"
 #include "thread-utils.h"
 #include "progress.h"
+#include "sparse-index.h"
 
 /* Mask for the name length in ce_flags in the on-disk index */
 
@@ -1002,8 +1003,14 @@ int verify_path(const char *path, unsigned mode)
 
 			c = *path++;
 			if ((c == '.' && !verify_dotfile(path, mode)) ||
-			    is_dir_sep(c) || c == '\0')
+			    is_dir_sep(c))
 				return 0;
+			/*
+			 * allow terminating directory separators for
+			 * sparse directory entries.
+			 */
+			if (c == '\0')
+				return S_ISDIR(mode);
 		} else if (c == '\\' && protect_ntfs) {
 			if (is_ntfs_dotgit(path))
 				return 0;
@@ -3061,6 +3068,14 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l
 				 unsigned flags)
 {
 	int ret;
+	int was_full = !istate->sparse_index;
+
+	ret = convert_to_sparse(istate);
+
+	if (ret) {
+		warning(_("failed to convert to a sparse-index"));
+		return ret;
+	}
 
 	/*
 	 * TODO trace2: replace "the_repository" with the actual repo instance
@@ -3072,6 +3087,9 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l
 	trace2_region_leave_printf("index", "do_write_index", the_repository,
 				   "%s", get_lock_file_path(lock));
 
+	if (was_full)
+		ensure_full_index(istate);
+
 	if (ret)
 		return ret;
 	if (flags & COMMIT_LOCK)
@@ -3162,9 +3180,10 @@ static int write_shared_index(struct index_state *istate,
 			      struct tempfile **temp)
 {
 	struct split_index *si = istate->split_index;
-	int ret;
+	int ret, was_full = !istate->sparse_index;
 
 	move_cache_to_base_index(istate);
+	convert_to_sparse(istate);
 
 	trace2_region_enter_printf("index", "shared/do_write_index",
 				   the_repository, "%s", get_tempfile_path(*temp));
@@ -3172,6 +3191,9 @@ static int write_shared_index(struct index_state *istate,
 	trace2_region_leave_printf("index", "shared/do_write_index",
 				   the_repository, "%s", get_tempfile_path(*temp));
 
+	if (was_full)
+		ensure_full_index(istate);
+
 	if (ret)
 		return ret;
 	ret = adjust_shared_perm(get_tempfile_path(*temp));
diff --git a/sparse-index.c b/sparse-index.c
index 316cb949b74b..cb1f85635fbc 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -4,6 +4,145 @@
 #include "tree.h"
 #include "pathspec.h"
 #include "trace2.h"
+#include "cache-tree.h"
+#include "config.h"
+#include "dir.h"
+#include "fsmonitor.h"
+
+static struct cache_entry *construct_sparse_dir_entry(
+				struct index_state *istate,
+				const char *sparse_dir,
+				struct cache_tree *tree)
+{
+	struct cache_entry *de;
+
+	de = make_cache_entry(istate, S_IFDIR, &tree->oid, sparse_dir, 0, 0);
+
+	de->ce_flags |= CE_SKIP_WORKTREE;
+	return de;
+}
+
+/*
+ * Returns the number of entries "inserted" into the index.
+ */
+static int convert_to_sparse_rec(struct index_state *istate,
+				 int num_converted,
+				 int start, int end,
+				 const char *ct_path, size_t ct_pathlen,
+				 struct cache_tree *ct)
+{
+	int i, can_convert = 1;
+	int start_converted = num_converted;
+	enum pattern_match_result match;
+	int dtype;
+	struct strbuf child_path = STRBUF_INIT;
+	struct pattern_list *pl = istate->sparse_checkout_patterns;
+
+	/*
+	 * Is the current path outside of the sparse cone?
+	 * Then check if the region can be replaced by a sparse
+	 * directory entry (everything is sparse and merged).
+	 */
+	match = path_matches_pattern_list(ct_path, ct_pathlen,
+					  NULL, &dtype, pl, istate);
+	if (match != NOT_MATCHED)
+		can_convert = 0;
+
+	for (i = start; can_convert && i < end; i++) {
+		struct cache_entry *ce = istate->cache[i];
+
+		if (ce_stage(ce) ||
+		    !(ce->ce_flags & CE_SKIP_WORKTREE))
+			can_convert = 0;
+	}
+
+	if (can_convert) {
+		struct cache_entry *se;
+		se = construct_sparse_dir_entry(istate, ct_path, ct);
+
+		istate->cache[num_converted++] = se;
+		return 1;
+	}
+
+	for (i = start; i < end; ) {
+		int count, span, pos = -1;
+		const char *base, *slash;
+		struct cache_entry *ce = istate->cache[i];
+
+		/*
+		 * Detect if this is a normal entry outside of any subtree
+		 * entry.
+		 */
+		base = ce->name + ct_pathlen;
+		slash = strchr(base, '/');
+
+		if (slash)
+			pos = cache_tree_subtree_pos(ct, base, slash - base);
+
+		if (pos < 0) {
+			istate->cache[num_converted++] = ce;
+			i++;
+			continue;
+		}
+
+		strbuf_setlen(&child_path, 0);
+		strbuf_add(&child_path, ce->name, slash - ce->name + 1);
+
+		span = ct->down[pos]->cache_tree->entry_count;
+		count = convert_to_sparse_rec(istate,
+					      num_converted, i, i + span,
+					      child_path.buf, child_path.len,
+					      ct->down[pos]->cache_tree);
+		num_converted += count;
+		i += span;
+	}
+
+	strbuf_release(&child_path);
+	return num_converted - start_converted;
+}
+
+int convert_to_sparse(struct index_state *istate)
+{
+	if (istate->split_index || istate->sparse_index ||
+	    !core_apply_sparse_checkout || !core_sparse_checkout_cone)
+		return 0;
+
+	/*
+	 * For now, only create a sparse index with the
+	 * GIT_TEST_SPARSE_INDEX environment variable. We will relax
+	 * this once we have a proper way to opt-in (and later still,
+	 * opt-out).
+	 */
+	if (!git_env_bool("GIT_TEST_SPARSE_INDEX", 0))
+		return 0;
+
+	if (!istate->sparse_checkout_patterns) {
+		istate->sparse_checkout_patterns = xcalloc(1, sizeof(struct pattern_list));
+		if (get_sparse_checkout_patterns(istate->sparse_checkout_patterns) < 0)
+			return 0;
+	}
+
+	if (!istate->sparse_checkout_patterns->use_cone_patterns) {
+		warning(_("attempting to use sparse-index without cone mode"));
+		return -1;
+	}
+
+	if (cache_tree_update(istate, 0)) {
+		warning(_("unable to update cache-tree, staying full"));
+		return -1;
+	}
+
+	remove_fsmonitor(istate);
+
+	trace2_region_enter("index", "convert_to_sparse", istate->repo);
+	istate->cache_nr = convert_to_sparse_rec(istate,
+						 0, 0, istate->cache_nr,
+						 "", 0, istate->cache_tree);
+	istate->drop_cache_tree = 1;
+	istate->sparse_index = 1;
+	trace2_region_leave("index", "convert_to_sparse", istate->repo);
+	return 0;
+}
 
 static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
 {
diff --git a/sparse-index.h b/sparse-index.h
index 09a20d036c46..64380e121d80 100644
--- a/sparse-index.h
+++ b/sparse-index.h
@@ -3,5 +3,6 @@
 
 struct index_state;
 void ensure_full_index(struct index_state *istate);
+int convert_to_sparse(struct index_state *istate);
 
 #endif
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 4d789fe86b9d..ca87033d30b0 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -2,6 +2,9 @@
 
 test_description='compare full workdir to sparse workdir'
 
+GIT_TEST_CHECK_CACHE_TREE=0
+GIT_TEST_SPLIT_INDEX=0
+
 . ./test-lib.sh
 
 test_expect_success 'setup' '
@@ -121,15 +124,49 @@ run_on_all () {
 test_all_match () {
 	run_on_all "$@" &&
 	test_cmp full-checkout-out sparse-checkout-out &&
-	test_cmp full-checkout-err sparse-checkout-err
+	test_cmp full-checkout-out sparse-index-out &&
+	test_cmp full-checkout-err sparse-checkout-err &&
+	test_cmp full-checkout-err sparse-index-err
 }
 
 test_sparse_match () {
-	run_on_sparse $* &&
+	run_on_sparse "$@" &&
 	test_cmp sparse-checkout-out sparse-index-out &&
 	test_cmp sparse-checkout-err sparse-index-err
 }
 
+test_expect_success 'sparse-index contents' '
+	init_repos &&
+
+	test-tool -C sparse-index read-cache --table >cache &&
+	for dir in folder1 folder2 x
+	do
+		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
+		grep "040000 tree $TREE	$dir/" cache \
+			|| return 1
+	done &&
+
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set folder1 &&
+
+	test-tool -C sparse-index read-cache --table >cache &&
+	for dir in deep folder2 x
+	do
+		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
+		grep "040000 tree $TREE	$dir/" cache \
+			|| return 1
+	done &&
+
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep/deeper1 &&
+
+	test-tool -C sparse-index read-cache --table >cache &&
+	for dir in deep/deeper2 folder1 folder2 x
+	do
+		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
+		grep "040000 tree $TREE	$dir/" cache \
+			|| return 1
+	done
+'
+
 test_expect_success 'expanded in-memory index matches full index' '
 	init_repos &&
 	test_sparse_match test-tool read-cache --expand --table
@@ -137,6 +174,7 @@ test_expect_success 'expanded in-memory index matches full index' '
 
 test_expect_success 'status with options' '
 	init_repos &&
+	test_sparse_match ls &&
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git status --porcelain=v2 -z -u &&
 	test_all_match git status --porcelain=v2 -uno &&
@@ -273,6 +311,17 @@ test_expect_failure 'checkout and reset (mixed)' '
 	test_all_match git reset update-folder2
 '
 
+# Ensure that sparse-index behaves identically to
+# sparse-checkout with a full index.
+test_expect_success 'checkout and reset (mixed) [sparse]' '
+	init_repos &&
+
+	test_sparse_match git checkout -b reset-test update-deep &&
+	test_sparse_match git reset deepest &&
+	test_sparse_match git reset update-folder1 &&
+	test_sparse_match git reset update-folder2
+'
+
 test_expect_success 'merge' '
 	init_repos &&
 
@@ -309,14 +358,20 @@ test_expect_success 'clean' '
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git clean -f &&
 	test_all_match git status --porcelain=v2 &&
+	test_sparse_match ls &&
+	test_sparse_match ls folder1 &&
 
 	test_all_match git clean -xf &&
 	test_all_match git status --porcelain=v2 &&
+	test_sparse_match ls &&
+	test_sparse_match ls folder1 &&
 
 	test_all_match git clean -xdf &&
 	test_all_match git status --porcelain=v2 &&
+	test_sparse_match ls &&
+	test_sparse_match ls folder1 &&
 
-	test_path_is_dir sparse-checkout/folder1
+	test_sparse_match test_path_is_dir folder1
 '
 
 test_done
-- 
gitgitgadget



* [PATCH 12/20] submodule: sparse-index should not collapse links
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (10 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 11/20] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-23 20:14 ` [PATCH 13/20] unpack-trees: allow sparse directories Derrick Stolee via GitGitGadget
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

A submodule is stored as a "Git link" that actually points to a commit
within a submodule. Submodules are populated or not depending on
submodule configuration, not sparse-checkout. To ensure that the
sparse-index feature integrates correctly with submodules, we should not
collapse a directory if there is a Git link within its range.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
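A minimal, standalone sketch of the collapse check described above, for
illustration only. The struct and the SKETCH_* names are hypothetical
stand-ins for Git's cache_entry, CE_SKIP_WORKTREE and S_ISGITLINK(); only
the shape of the condition is taken from the patch.

```c
#include <stddef.h>

/* Hypothetical stand-ins for Git's real definitions (assumptions). */
#define SKETCH_SKIP_WORKTREE  0x1u
#define SKETCH_MODE_TYPE_MASK 0170000u
#define SKETCH_MODE_GITLINK   0160000u

struct sketch_entry {
	unsigned int mode;  /* e.g. 0100644 for a blob, 0160000 for a Git link */
	unsigned int flags; /* SKETCH_SKIP_WORKTREE when outside the cone */
	int stage;          /* nonzero for unmerged entries */
};

/*
 * Return 1 if entries[start..end) may be collapsed into one sparse
 * directory entry, 0 otherwise. A single Git link, unmerged entry, or
 * entry without skip-worktree anywhere in the range blocks the collapse.
 */
int sketch_can_collapse(const struct sketch_entry *entries,
			size_t start, size_t end)
{
	size_t i;

	for (i = start; i < end; i++) {
		const struct sketch_entry *e = &entries[i];

		if (e->stage ||
		    (e->mode & SKETCH_MODE_TYPE_MASK) == SKETCH_MODE_GITLINK ||
		    !(e->flags & SKETCH_SKIP_WORKTREE))
			return 0;
	}
	return 1;
}
```

With this check, a range holding "modules/a" and a Git link "modules/sub"
is never collapsed, which is what the new test expects to see in the
'read-cache --table' output.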
 sparse-index.c                           |  1 +
 t/t1092-sparse-checkout-compatibility.sh | 17 +++++++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/sparse-index.c b/sparse-index.c
index cb1f85635fbc..14029fafc750 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -52,6 +52,7 @@ static int convert_to_sparse_rec(struct index_state *istate,
 		struct cache_entry *ce = istate->cache[i];
 
 		if (ce_stage(ce) ||
+		    S_ISGITLINK(ce->ce_mode) ||
 		    !(ce->ce_flags & CE_SKIP_WORKTREE))
 			can_convert = 0;
 	}
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index ca87033d30b0..b38fab6455d9 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -374,4 +374,21 @@ test_expect_success 'clean' '
 	test_sparse_match test_path_is_dir folder1
 '
 
+test_expect_success 'submodule handling' '
+	init_repos &&
+
+	test_all_match mkdir modules &&
+	test_all_match touch modules/a &&
+	test_all_match git add modules &&
+	test_all_match git commit -m "add modules directory" &&
+
+	run_on_all git submodule add "$(pwd)/initial-repo" modules/sub &&
+	test_all_match git commit -m "add submodule" &&
+
+	# having a submodule prevents "modules" from collapse
+	test-tool -C sparse-index read-cache --table >cache &&
+	grep "100644 blob .*	modules/a" cache &&
+	grep "160000 commit $(git -C initial-repo rev-parse HEAD)	modules/sub" cache
+'
+
 test_done
-- 
gitgitgadget



* [PATCH 13/20] unpack-trees: allow sparse directories
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (11 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 12/20] submodule: sparse-index should not collapse links Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-25  7:40   ` Elijah Newren
  2021-02-23 20:14 ` [PATCH 14/20] sparse-index: check index conversion happens Derrick Stolee via GitGitGadget
                   ` (8 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

index_pos_by_traverse_info() currently calls BUG() when a directory
entry exists exactly in the index. With a sparse index, such a directory
entry is valid as long as it is marked with the skip-worktree bit.

Negate the 'pos' variable only when it starts out negative. When the
index is full, this behaves exactly as before.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
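To illustrate why the negation has to be conditional, here is a small
standalone sketch. It assumes the usual lookup convention where a miss is
reported as -(insertion point) - 1. The sketch_* functions are
hypothetical and compare whole strings with strcmp(), which is simpler
than what index_name_pos() really does.

```c
#include <string.h>

/*
 * Binary search over a sorted array of names. Returns the position when
 * 'name' is found, otherwise -(insertion point) - 1.
 */
int sketch_name_pos(const char *const *names, int nr, const char *name)
{
	int lo = 0, hi = nr;

	while (lo < hi) {
		int mid = lo + (hi - lo) / 2;
		int cmp = strcmp(name, names[mid]);

		if (!cmp)
			return mid;
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return -lo - 1;
}

/*
 * Position of the first entry that could live under 'dir'. An exact hit
 * (a sparse directory entry) is already a valid position, so only a
 * negative result is decoded.
 */
int sketch_first_candidate(const char *const *names, int nr, const char *dir)
{
	int pos = sketch_name_pos(names, nr, dir);

	if (pos < 0)
		pos = -pos - 1;
	return pos;
}
```

Decoding unconditionally would turn an exact hit at position 1 into -2,
which is why the patch moves the negation into the 'else' branch.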
 unpack-trees.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/unpack-trees.c b/unpack-trees.c
index 4dd99219073a..b324eec2a5d1 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -746,9 +746,12 @@ static int index_pos_by_traverse_info(struct name_entry *names,
 	strbuf_make_traverse_path(&name, info, names->path, names->pathlen);
 	strbuf_addch(&name, '/');
 	pos = index_name_pos(o->src_index, name.buf, name.len);
-	if (pos >= 0)
-		BUG("This is a directory and should not exist in index");
-	pos = -pos - 1;
+	if (pos >= 0) {
+		if (!o->src_index->sparse_index ||
+		    !(o->src_index->cache[pos]->ce_flags & CE_SKIP_WORKTREE))
+			BUG("This is a directory and should not exist in index");
+	} else
+		pos = -pos - 1;
 	if (pos >= o->src_index->cache_nr ||
 	    !starts_with(o->src_index->cache[pos]->name, name.buf) ||
 	    (pos > 0 && starts_with(o->src_index->cache[pos-1]->name, name.buf)))
-- 
gitgitgadget



* [PATCH 14/20] sparse-index: check index conversion happens
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (12 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 13/20] unpack-trees: allow sparse directories Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-23 20:14 ` [PATCH 15/20] sparse-index: create extension for compatibility Derrick Stolee via GitGitGadget
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add a test case that uses test_region to ensure that we are truly
expanding a sparse index to a full one, then converting back to sparse
when writing the index. As we integrate more Git commands with the
sparse index, we will convert these commands to check that we do _not_
convert the sparse index to a full index and instead stay sparse the
entire time.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t1092-sparse-checkout-compatibility.sh | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index b38fab6455d9..bfc9e28ef0e1 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -391,4 +391,22 @@ test_expect_success 'submodule handling' '
 	grep "160000 commit $(git -C initial-repo rev-parse HEAD)	modules/sub" cache
 '
 
+test_expect_success 'sparse-index is expanded and converted back' '
+	init_repos &&
+
+	(
+		GIT_TEST_SPARSE_INDEX=1 &&
+		export GIT_TEST_SPARSE_INDEX &&
+		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+			git -C sparse-index -c core.fsmonitor="" reset --hard &&
+		test_region index convert_to_sparse trace2.txt &&
+		test_region index ensure_full_index trace2.txt &&
+
+		rm trace2.txt &&
+		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+			git -C sparse-index -c core.fsmonitor="" status -uno &&
+		test_region index ensure_full_index trace2.txt
+	)
+'
+
 test_done
-- 
gitgitgadget



* [PATCH 15/20] sparse-index: create extension for compatibility
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (13 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 14/20] sparse-index: check index conversion happens Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-25  7:45   ` Elijah Newren
  2021-02-23 20:14 ` [PATCH 16/20] sparse-checkout: toggle sparse index from builtin Derrick Stolee via GitGitGadget
                   ` (6 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Previously, we enabled the sparse index format only using
GIT_TEST_SPARSE_INDEX=1. This is not a feasible way for users to
select this mode. Further, sparse directory entries are not
understood by the index formats as advertised.

We _could_ add a new index version that explicitly adds these
capabilities, but there are nuances to index formats 2, 3, and 4 that
are still valuable to select as options. For now, create a repo
extension, "extensions.sparseIndex", that specifies that the tool
reading this repository must understand sparse directory entries.

This change only encodes the extension and enables it when
GIT_TEST_SPARSE_INDEX=1. Later, we will add a more user-friendly CLI
mechanism.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
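As a rough illustration of why a repository extension gives the needed
compatibility guarantee, the sketch below models a reader that accepts a
repository only when it recognizes every listed extension. All sketch_*
names are hypothetical, and the real logic in setup.c handles more
extensions and repository format versions than shown here.

```c
#include <string.h>

enum sketch_result { SKETCH_EXT_OK, SKETCH_EXT_UNKNOWN };

/* Hypothetical, reduced stand-in for struct repository_format. */
struct sketch_format {
	int sparse_index;
};

/* Record a known extension, or report it as unknown. */
enum sketch_result sketch_handle_extension(const char *ext,
					   struct sketch_format *fmt)
{
	if (!strcmp(ext, "sparseindex")) {
		fmt->sparse_index = 1;
		return SKETCH_EXT_OK;
	}
	return SKETCH_EXT_UNKNOWN;
}

/*
 * Return 1 if every extension is understood, 0 otherwise. A tool built
 * before "sparseindex" existed would take the 0 path and refuse the
 * repository instead of misreading sparse directory entries.
 */
int sketch_can_read_repo(const char *const *exts, int nr,
			 struct sketch_format *fmt)
{
	int i;

	for (i = 0; i < nr; i++)
		if (sketch_handle_extension(exts[i], fmt) == SKETCH_EXT_UNKNOWN)
			return 0;
	return 1;
}
```

The design choice is that an older tool fails loudly on the unknown
extension rather than silently treating a directory entry as a file.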
 Documentation/config/extensions.txt |  7 ++++++
 cache.h                             |  1 +
 repo-settings.c                     |  7 ++++++
 repository.h                        |  3 ++-
 setup.c                             |  3 +++
 sparse-index.c                      | 38 +++++++++++++++++++++++++----
 6 files changed, 53 insertions(+), 6 deletions(-)

diff --git a/Documentation/config/extensions.txt b/Documentation/config/extensions.txt
index 4e23d73cdcad..5c86b3648732 100644
--- a/Documentation/config/extensions.txt
+++ b/Documentation/config/extensions.txt
@@ -6,3 +6,10 @@ extensions.objectFormat::
 Note that this setting should only be set by linkgit:git-init[1] or
 linkgit:git-clone[1].  Trying to change it after initialization will not
 work and will produce hard-to-diagnose issues.
+
+extensions.sparseIndex::
+	When combined with `core.sparseCheckout=true` and
+	`core.sparseCheckoutCone=true`, the index may contain entries
+	corresponding to directories outside of the sparse-checkout
+	definition. Versions of Git that do not understand this extension
+	do not expect directory entries in the index.
diff --git a/cache.h b/cache.h
index e8b7d3b4fb33..eea61fba7568 100644
--- a/cache.h
+++ b/cache.h
@@ -1053,6 +1053,7 @@ struct repository_format {
 	int worktree_config;
 	int is_bare;
 	int hash_algo;
+	int sparse_index;
 	char *work_tree;
 	struct string_list unknown_extensions;
 	struct string_list v1_only_extensions;
diff --git a/repo-settings.c b/repo-settings.c
index d63569e4041e..9677d50f9238 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -85,4 +85,11 @@ void prepare_repo_settings(struct repository *r)
 	 * removed.
 	 */
 	r->settings.command_requires_full_index = 1;
+
+	/*
+	 * Initialize this as off.
+	 */
+	r->settings.sparse_index = 0;
+	if (!repo_config_get_bool(r, "extensions.sparseindex", &value) && value)
+		r->settings.sparse_index = 1;
 }
diff --git a/repository.h b/repository.h
index e06a23015697..a45f7520fd9e 100644
--- a/repository.h
+++ b/repository.h
@@ -42,7 +42,8 @@ struct repo_settings {
 
 	int core_multi_pack_index;
 
-	unsigned command_requires_full_index:1;
+	unsigned command_requires_full_index:1,
+		 sparse_index:1;
 };
 
 struct repository {
diff --git a/setup.c b/setup.c
index c04cd25a30df..cd8394564613 100644
--- a/setup.c
+++ b/setup.c
@@ -500,6 +500,9 @@ static enum extension_result handle_extension(const char *var,
 			return error("invalid value for 'extensions.objectformat'");
 		data->hash_algo = format;
 		return EXTENSION_OK;
+	} else if (!strcmp(ext, "sparseindex")) {
+		data->sparse_index = 1;
+		return EXTENSION_OK;
 	}
 	return EXTENSION_UNKNOWN;
 }
diff --git a/sparse-index.c b/sparse-index.c
index 14029fafc750..97b0d0c57857 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -102,19 +102,47 @@ static int convert_to_sparse_rec(struct index_state *istate,
 	return num_converted - start_converted;
 }
 
+static int enable_sparse_index(struct repository *repo)
+{
+	const char *config_path = repo_git_path(repo, "config.worktree");
+
+	if (upgrade_repository_format(1) < 0) {
+		warning(_("unable to upgrade repository format to enable sparse-index"));
+		return -1;
+	}
+	git_config_set_in_file_gently(config_path,
+				      "extensions.sparseIndex",
+				      "true");
+
+	prepare_repo_settings(repo);
+	repo->settings.sparse_index = 1;
+	return 0;
+}
+
 int convert_to_sparse(struct index_state *istate)
 {
 	if (istate->split_index || istate->sparse_index ||
 	    !core_apply_sparse_checkout || !core_sparse_checkout_cone)
 		return 0;
 
+	if (!istate->repo)
+		istate->repo = the_repository;
+
+	/*
+	 * The GIT_TEST_SPARSE_INDEX environment variable triggers the
+	 * extensions.sparseIndex config variable to be on.
+	 */
+	if (git_env_bool("GIT_TEST_SPARSE_INDEX", 0)) {
+		int err = enable_sparse_index(istate->repo);
+		if (err < 0)
+			return err;
+	}
+
 	/*
-	 * For now, only create a sparse index with the
-	 * GIT_TEST_SPARSE_INDEX environment variable. We will relax
-	 * this once we have a proper way to opt-in (and later still,
-	 * opt-out).
+	 * Only convert to sparse if extensions.sparseIndex is set.
 	 */
-	if (!git_env_bool("GIT_TEST_SPARSE_INDEX", 0))
+	prepare_repo_settings(istate->repo);
+	if (!istate->repo->settings.sparse_index)
 		return 0;
 
 	if (!istate->sparse_checkout_patterns) {
-- 
gitgitgadget



* [PATCH 16/20] sparse-checkout: toggle sparse index from builtin
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (14 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 15/20] sparse-index: create extension for compatibility Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-24 19:11   ` Martin Ågren
  2021-02-23 20:14 ` [PATCH 17/20] sparse-checkout: disable sparse-index Derrick Stolee via GitGitGadget
                   ` (5 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The sparse index extension is used to signal that index writes should be
in sparse mode. Until now, it could only be enabled using
GIT_TEST_SPARSE_INDEX=1.

Add a '--[no-]sparse-index' option to 'git sparse-checkout init' that
specifies if the sparse index should be used. It also updates the index
to use the correct format, either way. Add a warning in the
documentation that the use of a repository extension might reduce
compatibility with third-party tools. 'git sparse-checkout init' already
sets extensions.worktreeConfig, which places most sparse-checkout users
outside of the scope of most third-party tools.

Update t1092-sparse-checkout-compatibility.sh to use this CLI instead of
GIT_TEST_SPARSE_INDEX=1.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
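The option is handled as a tri-state: it starts at -1 and becomes 0 or 1
only when the user passes --no-sparse-index or --sparse-index. A minimal
sketch of that pattern follows, using a hypothetical sketch_repo struct in
place of the real repository and config machinery.

```c
/* Hypothetical stand-in for the repository settings and config file. */
struct sketch_repo {
	int sparse_index;  /* current value of the setting */
	int config_writes; /* how often the config was modified */
};

/*
 * 'opt' is -1 when the option was not given, 0 for --no-sparse-index,
 * and 1 for --sparse-index. Returns 1 if the config was touched (so the
 * caller should force an index rewrite), 0 otherwise.
 */
int sketch_apply_sparse_index_option(struct sketch_repo *repo, int opt)
{
	if (opt < 0)
		return 0; /* nothing requested: leave the config alone */

	repo->sparse_index = opt ? 1 : 0;
	repo->config_writes++;
	return 1;
}
```

Starting from -1 lets a plain 'git sparse-checkout init --cone' leave an
existing extensions.sparseIndex value untouched.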
 Documentation/git-sparse-checkout.txt    | 14 +++++++++
 builtin/sparse-checkout.c                | 17 ++++++++++-
 sparse-index.c                           | 37 +++++++++++++++--------
 sparse-index.h                           |  3 ++
 t/t1092-sparse-checkout-compatibility.sh | 38 +++++++++++-------------
 5 files changed, 76 insertions(+), 33 deletions(-)

diff --git a/Documentation/git-sparse-checkout.txt b/Documentation/git-sparse-checkout.txt
index a0eeaeb02ee3..b51b8450cfd9 100644
--- a/Documentation/git-sparse-checkout.txt
+++ b/Documentation/git-sparse-checkout.txt
@@ -45,6 +45,20 @@ To avoid interfering with other worktrees, it first enables the
 When `--cone` is provided, the `core.sparseCheckoutCone` setting is
 also set, allowing for better performance with a limited set of
 patterns (see 'CONE PATTERN SET' below).
++
+Use the `--[no-]sparse-index` option to toggle the use of the sparse
+index format. This reduces the size of the index to be more closely
+aligned with your sparse-checkout definition. This can have significant
+performance advantages for commands such as `git status` or `git add`.
+This feature is still experimental. Some commands might be slower with
+a sparse index until they are properly integrated with the feature.
++
+**WARNING:** Using a sparse index requires modifying the index in a way
+that is not completely understood by other tools. Enabling sparse index
+enables the `extensions.sparseIndex` config value, which might cause
+other tools to stop working with your repository. If you have trouble with
+this compatibility, then run `git sparse-checkout init --no-sparse-index` to
+remove this config and rewrite your index to not be sparse.
 
 'set'::
 	Write a set of patterns to the sparse-checkout file, as given as
diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
index e00b82af727b..ca63e2c64e95 100644
--- a/builtin/sparse-checkout.c
+++ b/builtin/sparse-checkout.c
@@ -14,6 +14,7 @@
 #include "unpack-trees.h"
 #include "wt-status.h"
 #include "quote.h"
+#include "sparse-index.h"
 
 static const char *empty_base = "";
 
@@ -283,12 +284,13 @@ static int set_config(enum sparse_checkout_mode mode)
 }
 
 static char const * const builtin_sparse_checkout_init_usage[] = {
-	N_("git sparse-checkout init [--cone]"),
+	N_("git sparse-checkout init [--cone] [--[no-]sparse-index]"),
 	NULL
 };
 
 static struct sparse_checkout_init_opts {
 	int cone_mode;
+	int sparse_index;
 } init_opts;
 
 static int sparse_checkout_init(int argc, const char **argv)
@@ -303,11 +305,15 @@ static int sparse_checkout_init(int argc, const char **argv)
 	static struct option builtin_sparse_checkout_init_options[] = {
 		OPT_BOOL(0, "cone", &init_opts.cone_mode,
 			 N_("initialize the sparse-checkout in cone mode")),
+		OPT_BOOL(0, "sparse-index", &init_opts.sparse_index,
+			 N_("toggle the use of a sparse index")),
 		OPT_END(),
 	};
 
 	repo_read_index(the_repository);
 
+	init_opts.sparse_index = -1;
+
 	argc = parse_options(argc, argv, NULL,
 			     builtin_sparse_checkout_init_options,
 			     builtin_sparse_checkout_init_usage, 0);
@@ -326,6 +332,15 @@ static int sparse_checkout_init(int argc, const char **argv)
 	sparse_filename = get_sparse_checkout_filename();
 	res = add_patterns_from_file_to_list(sparse_filename, "", 0, &pl, NULL);
 
+	if (init_opts.sparse_index >= 0) {
+		if (set_sparse_index_config(the_repository, init_opts.sparse_index) < 0)
+			die(_("failed to modify sparse-index config"));
+
+		/* force an index rewrite */
+		repo_read_index(the_repository);
+		the_repository->index->updated_workdir = 1;
+	}
+
 	/* If we already have a sparse-checkout file, use it. */
 	if (res >= 0) {
 		free(sparse_filename);
diff --git a/sparse-index.c b/sparse-index.c
index 97b0d0c57857..a991c5331e9e 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -104,23 +104,37 @@ static int convert_to_sparse_rec(struct index_state *istate,
 
 static int enable_sparse_index(struct repository *repo)
 {
-	const char *config_path = repo_git_path(repo, "config.worktree");
+	int res;
 
 	if (upgrade_repository_format(1) < 0) {
 		warning(_("unable to upgrade repository format to enable sparse-index"));
 		return -1;
 	}
-	git_config_set_in_file_gently(config_path,
-				      "extensions.sparseIndex",
-				      "true");
+	res = git_config_set_gently("extensions.sparseindex", "true");
 
 	prepare_repo_settings(repo);
 	repo->settings.sparse_index = 1;
-	return 0;
+	return res;
+}
+
+int set_sparse_index_config(struct repository *repo, int enable)
+{
+	int res;
+
+	if (enable)
+		return enable_sparse_index(repo);
+
+	/* Don't downgrade repository format, just remove the extension. */
+	res = git_config_set_gently("extensions.sparseindex", NULL);
+
+	prepare_repo_settings(repo);
+	repo->settings.sparse_index = 0;
+	return res;
 }
 
 int convert_to_sparse(struct index_state *istate)
 {
+	int test_env;
 	if (istate->split_index || istate->sparse_index ||
 	    !core_apply_sparse_checkout || !core_sparse_checkout_cone)
 		return 0;
@@ -129,14 +143,13 @@ int convert_to_sparse(struct index_state *istate)
 		istate->repo = the_repository;
 
 	/*
-	 * The GIT_TEST_SPARSE_INDEX environment variable triggers the
-	 * extensions.sparseIndex config variable to be on.
+	 * If GIT_TEST_SPARSE_INDEX=1, then trigger extensions.sparseIndex
+	 * to be fully enabled. If GIT_TEST_SPARSE_INDEX=0 (set explicitly),
+	 * then purposefully disable the setting.
 	 */
-	if (git_env_bool("GIT_TEST_SPARSE_INDEX", 0)) {
-		int err = enable_sparse_index(istate->repo);
-		if (err < 0)
-			return err;
-	}
+	test_env = git_env_bool("GIT_TEST_SPARSE_INDEX", -1);
+	if (test_env >= 0)
+		set_sparse_index_config(istate->repo, test_env);
 
 	/*
 	 * Only convert to sparse if extensions.sparseIndex is set.
diff --git a/sparse-index.h b/sparse-index.h
index 64380e121d80..39dcc859735e 100644
--- a/sparse-index.h
+++ b/sparse-index.h
@@ -5,4 +5,7 @@ struct index_state;
 void ensure_full_index(struct index_state *istate);
 int convert_to_sparse(struct index_state *istate);
 
+struct repository;
+int set_sparse_index_config(struct repository *repo, int enable);
+
 #endif
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index bfc9e28ef0e1..9c2bc4d25f66 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -4,6 +4,7 @@ test_description='compare full workdir to sparse workdir'
 
 GIT_TEST_CHECK_CACHE_TREE=0
 GIT_TEST_SPLIT_INDEX=0
+GIT_TEST_SPARSE_INDEX=
 
 . ./test-lib.sh
 
@@ -98,25 +99,26 @@ init_repos () {
 	# initialize sparse-checkout definitions
 	git -C sparse-checkout sparse-checkout init --cone &&
 	git -C sparse-checkout sparse-checkout set deep &&
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout init --cone &&
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep
+	git -C sparse-index sparse-checkout init --cone --sparse-index &&
+	test_cmp_config -C sparse-index true extensions.sparseindex &&
+	git -C sparse-index sparse-checkout set deep
 }
 
 run_on_sparse () {
 	(
 		cd sparse-checkout &&
-		GIT_TEST_SPARSE_INDEX=0 "$@" >../sparse-checkout-out 2>../sparse-checkout-err
+		"$@" >../sparse-checkout-out 2>../sparse-checkout-err
 	) &&
 	(
 		cd sparse-index &&
-		GIT_TEST_SPARSE_INDEX=1 "$@" >../sparse-index-out 2>../sparse-index-err
+		"$@" >../sparse-index-out 2>../sparse-index-err
 	)
 }
 
 run_on_all () {
 	(
 		cd full-checkout &&
-		GIT_TEST_SPARSE_INDEX=0 "$@" >../full-checkout-out 2>../full-checkout-err
+		"$@" >../full-checkout-out 2>../full-checkout-err
 	) &&
 	run_on_sparse "$@"
 }
@@ -146,7 +148,7 @@ test_expect_success 'sparse-index contents' '
 			|| return 1
 	done &&
 
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set folder1 &&
+	git -C sparse-index sparse-checkout set folder1 &&
 
 	test-tool -C sparse-index read-cache --table >cache &&
 	for dir in deep folder2 x
@@ -156,7 +158,7 @@ test_expect_success 'sparse-index contents' '
 			|| return 1
 	done &&
 
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep/deeper1 &&
+	git -C sparse-index sparse-checkout set deep/deeper1 &&
 
 	test-tool -C sparse-index read-cache --table >cache &&
 	for dir in deep/deeper2 folder1 folder2 x
@@ -394,19 +396,15 @@ test_expect_success 'submodule handling' '
 test_expect_success 'sparse-index is expanded and converted back' '
 	init_repos &&
 
-	(
-		GIT_TEST_SPARSE_INDEX=1 &&
-		export GIT_TEST_SPARSE_INDEX &&
-		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
-			git -C sparse-index -c core.fsmonitor="" reset --hard &&
-		test_region index convert_to_sparse trace2.txt &&
-		test_region index ensure_full_index trace2.txt &&
-
-		rm trace2.txt &&
-		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
-			git -C sparse-index -c core.fsmonitor="" status -uno &&
-		test_region index ensure_full_index trace2.txt
-	)
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+		git -C sparse-index -c core.fsmonitor="" reset --hard &&
+	test_region index convert_to_sparse trace2.txt &&
+	test_region index ensure_full_index trace2.txt &&
+
+	rm trace2.txt &&
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+		git -C sparse-index -c core.fsmonitor="" status -uno &&
+	test_region index ensure_full_index trace2.txt
 '
 
 test_done
-- 
gitgitgadget



* [PATCH 17/20] sparse-checkout: disable sparse-index
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (15 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 16/20] sparse-checkout: toggle sparse index from builtin Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-27 12:32   ` SZEDER Gábor
  2021-02-23 20:14 ` [PATCH 18/20] cache-tree: integrate with sparse directory entries Derrick Stolee via GitGitGadget
                   ` (4 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

We use 'git sparse-checkout init --cone --sparse-index' to toggle the
sparse-index feature. It makes sense to also disable it when running
'git sparse-checkout disable'. This is particularly important because it
removes the extensions.sparseIndex config option, allowing other tools
to use this Git repository again.

This does mean that 'git sparse-checkout init' will not re-enable the
sparse-index feature, even if it was previously enabled.

While testing this feature, I noticed that the sparse-index was not
being written on the first run, only on the second. This was caught by the
call to 'test-tool read-cache --table'. This requires adjusting some
assignments to core_apply_sparse_checkout and pl.use_cone_patterns in
the sparse_checkout_init() logic.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/sparse-checkout.c          | 10 +++++++++-
 t/t1091-sparse-checkout-builtin.sh | 13 +++++++++++++
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
index ca63e2c64e95..585343fa1972 100644
--- a/builtin/sparse-checkout.c
+++ b/builtin/sparse-checkout.c
@@ -280,6 +280,9 @@ static int set_config(enum sparse_checkout_mode mode)
 				      "core.sparseCheckoutCone",
 				      mode == MODE_CONE_PATTERNS ? "true" : NULL);
 
+	if (mode == MODE_NO_PATTERNS)
+		set_sparse_index_config(the_repository, 0);
+
 	return 0;
 }
 
@@ -341,10 +344,11 @@ static int sparse_checkout_init(int argc, const char **argv)
 		the_repository->index->updated_workdir = 1;
 	}
 
+	core_apply_sparse_checkout = 1;
+
 	/* If we already have a sparse-checkout file, use it. */
 	if (res >= 0) {
 		free(sparse_filename);
-		core_apply_sparse_checkout = 1;
 		return update_working_directory(NULL);
 	}
 
@@ -366,6 +370,7 @@ static int sparse_checkout_init(int argc, const char **argv)
 	add_pattern(strbuf_detach(&pattern, NULL), empty_base, 0, &pl, 0);
 	strbuf_addstr(&pattern, "!/*/");
 	add_pattern(strbuf_detach(&pattern, NULL), empty_base, 0, &pl, 0);
+	pl.use_cone_patterns = init_opts.cone_mode;
 
 	return write_patterns_and_update(&pl);
 }
@@ -632,6 +637,9 @@ static int sparse_checkout_disable(int argc, const char **argv)
 	strbuf_addstr(&match_all, "/*");
 	add_pattern(strbuf_detach(&match_all, NULL), empty_base, 0, &pl, 0);
 
+	prepare_repo_settings(the_repository);
+	the_repository->settings.sparse_index = 0;
+
 	if (update_working_directory(&pl))
 		die(_("error while refreshing working directory"));
 
diff --git a/t/t1091-sparse-checkout-builtin.sh b/t/t1091-sparse-checkout-builtin.sh
index fc64e9ed99f4..ff1ad570a255 100755
--- a/t/t1091-sparse-checkout-builtin.sh
+++ b/t/t1091-sparse-checkout-builtin.sh
@@ -205,6 +205,19 @@ test_expect_success 'sparse-checkout disable' '
 	check_files repo a deep folder1 folder2
 '
 
+test_expect_success 'sparse-index enabled and disabled' '
+	git -C repo sparse-checkout init --cone --sparse-index &&
+	test_cmp_config -C repo true extensions.sparseIndex &&
+	test-tool -C repo read-cache --table >cache &&
+	grep " tree " cache &&
+
+	git -C repo sparse-checkout disable &&
+	test-tool -C repo read-cache --table >cache &&
+	! grep " tree " cache &&
+	git -C repo config --list >config &&
+	! grep extensions.sparseindex config
+'
+
 test_expect_success 'cone mode: init and set' '
 	git -C repo sparse-checkout init --cone &&
 	git -C repo config --list >config &&
-- 
gitgitgadget



* [PATCH 18/20] cache-tree: integrate with sparse directory entries
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (16 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 17/20] sparse-checkout: disable sparse-index Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-23 20:14 ` [PATCH 19/20] sparse-index: loose integration with cache_tree_verify() Derrick Stolee via GitGitGadget
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The cache-tree extension was previously disabled with sparse indexes.
However, the cache-tree is an important performance feature for commands
like 'git status' and 'git add'. Integrate it with sparse directory
entries.

When writing a sparse index, completely clear and recalculate the cache
tree. By starting from scratch, the only integration necessary is to
check if we hit a sparse directory entry and create a leaf of the
cache-tree that has an entry_count of one and no subtrees.
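
The leaf rule described above can be sketched in isolation as follows.
This is only an illustration, not the cache-tree.c code: every sketch_*
name is a hypothetical stand-in for the real structs and for the check
inside update_one(), and it assumes a sparse directory entry is
recognized by mode 0040000, as the index format description says.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: a sparse directory entry is identified by this mode. */
#define SKETCH_MODE_SPARSE_DIR 0040000u

/* Hypothetical, simplified stand-in for an index entry. */
struct sketch_entry {
	unsigned int mode;
	const char *name;       /* sparse directory names end in '/' */
	size_t namelen;
	unsigned char oid[20];  /* tree object ID for a sparse directory */
};

/* Hypothetical, simplified stand-in for a cache-tree node. */
struct sketch_tree_node {
	int entry_count;        /* -1 means "needs recomputation" */
	unsigned char oid[20];
};

/*
 * If the first entry of the region is a sparse directory entry whose
 * name is exactly 'base', turn 'node' into a leaf (entry_count of one,
 * OID copied from the entry) and return 1. Otherwise leave 'node'
 * untouched and return 0.
 */
int sketch_try_sparse_leaf(struct sketch_tree_node *node,
			   const struct sketch_entry *const *entries,
			   size_t nr, const char *base, size_t baselen)
{
	const struct sketch_entry *e;

	assert(node != NULL);
	if (!nr)
		return 0;

	e = entries[0];
	if (e->mode != SKETCH_MODE_SPARSE_DIR ||
	    e->namelen != baselen ||
	    strncmp(e->name, base, baselen))
		return 0;

	node->entry_count = 1;
	memcpy(node->oid, e->oid, sizeof(node->oid));
	return 1;
}
```

A region that starts with a regular file entry, or an empty region,
falls through to the normal (non-leaf) computation.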

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache-tree.c   | 18 ++++++++++++++++++
 sparse-index.c | 10 +++++++++-
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/cache-tree.c b/cache-tree.c
index 5f07a39e501e..950a9615db8f 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -256,6 +256,24 @@ static int update_one(struct cache_tree *it,
 
 	*skip_count = 0;
 
+	/*
+	 * If the first entry of this region is a sparse directory
+	 * entry corresponding exactly to 'base', then this cache_tree
+	 * struct is a "leaf" in the data structure, pointing to the
+	 * tree OID specified in the entry.
+	 */
+	if (entries > 0) {
+		const struct cache_entry *ce = cache[0];
+
+		if (S_ISSPARSEDIR(ce->ce_mode) &&
+		    ce->ce_namelen == baselen &&
+		    !strncmp(ce->name, base, baselen)) {
+			it->entry_count = 1;
+			oidcpy(&it->oid, &ce->oid);
+			return 1;
+		}
+	}
+
 	if (0 <= it->entry_count && has_object_file(&it->oid))
 		return it->entry_count;
 
diff --git a/sparse-index.c b/sparse-index.c
index a991c5331e9e..e541f251b37a 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -180,7 +180,11 @@ int convert_to_sparse(struct index_state *istate)
 	istate->cache_nr = convert_to_sparse_rec(istate,
 						 0, 0, istate->cache_nr,
 						 "", 0, istate->cache_tree);
-	istate->drop_cache_tree = 1;
+
+	/* Clear and recompute the cache-tree */
+	cache_tree_free(&istate->cache_tree);
+	cache_tree_update(istate, 0);
+
 	istate->sparse_index = 1;
 	trace2_region_leave("index", "convert_to_sparse", istate->repo);
 	return 0;
@@ -278,5 +282,9 @@ void ensure_full_index(struct index_state *istate)
 
 	free(full);
 
+	/* Clear and recompute the cache-tree */
+	cache_tree_free(&istate->cache_tree);
+	cache_tree_update(istate, 0);
+
 	trace2_region_leave("index", "ensure_full_index", istate->repo);
 }
-- 
gitgitgadget



* [PATCH 19/20] sparse-index: loose integration with cache_tree_verify()
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (17 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 18/20] cache-tree: integrate with sparse directory entries Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-23 20:14 ` [PATCH 20/20] p2000: add sparse-index repos Derrick Stolee via GitGitGadget
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The cache_tree_verify() method is run when GIT_TEST_CHECK_CACHE_TREE
is enabled, which it is by default in the test suite. The logic must
be adjusted for the presence of sparse directory entries.

For now, leave the test as a simple check for whether the directory
entry is sparse. Do not go any further until needed.

This allows us to re-enable GIT_TEST_CHECK_CACHE_TREE in
t1092-sparse-checkout-compatibility.sh. Further,
p2000-sparse-operations.sh is built on the test suite's library, so the
check is enabled for all of its tests. We need to integrate with it
before we run our
performance tests with a sparse-index.
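
The verify_one() change relies on the return convention of the name
lookup: a non-negative value is the position of an exact match, and a
negative value encodes the insertion point as -(position) - 1. The
following is a minimal sketch of that convention only. The sketch_*
names are hypothetical, and it uses plain strcmp() over an array of
strings rather than the real index comparison.

```c
#include <assert.h>
#include <string.h>

/*
 * Binary-search a sorted array of names. Return the position on an
 * exact match; otherwise return -(insertion position) - 1, which is
 * always negative.
 */
int sketch_name_pos(const char *const *names, int nr, const char *name)
{
	int lo = 0, hi = nr;

	assert(nr >= 0);
	while (lo < hi) {
		int mid = lo + (hi - lo) / 2;
		int cmp = strcmp(name, names[mid]);

		if (!cmp)
			return mid;
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return -lo - 1;
}

/* Map a lookup result to an array position in either case. */
int sketch_decode_pos(int pos)
{
	return pos >= 0 ? pos : -pos - 1;
}
```

With the sorted names "a", "deep/a", "folder1/" and "folder2/a", looking
up "folder1/" returns 2, an exact match, which corresponds to the
sparse-directory case. Looking up "deep/" returns -2, which decodes to
1, the position of the first entry under that directory.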

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache-tree.c                             | 19 +++++++++++++++++++
 t/t1092-sparse-checkout-compatibility.sh |  1 -
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/cache-tree.c b/cache-tree.c
index 950a9615db8f..11bf1fcae6e1 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -808,6 +808,19 @@ int cache_tree_matches_traversal(struct cache_tree *root,
 	return 0;
 }
 
+static void verify_one_sparse(struct repository *r,
+			      struct index_state *istate,
+			      struct cache_tree *it,
+			      struct strbuf *path,
+			      int pos)
+{
+	struct cache_entry *ce = istate->cache[pos];
+
+	if (!S_ISSPARSEDIR(ce->ce_mode))
+		BUG("directory '%s' is present in index, but not sparse",
+		    path->buf);
+}
+
 static void verify_one(struct repository *r,
 		       struct index_state *istate,
 		       struct cache_tree *it,
@@ -830,6 +843,12 @@ static void verify_one(struct repository *r,
 
 	if (path->len) {
 		pos = index_name_pos(istate, path->buf, path->len);
+
+		if (pos >= 0) {
+			verify_one_sparse(r, istate, it, path, pos);
+			return;
+		}
+
 		pos = -pos - 1;
 	} else {
 		pos = 0;
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 9c2bc4d25f66..c2624176c2e0 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -2,7 +2,6 @@
 
 test_description='compare full workdir to sparse workdir'
 
-GIT_TEST_CHECK_CACHE_TREE=0
 GIT_TEST_SPLIT_INDEX=0
 GIT_TEST_SPARSE_INDEX=
 
-- 
gitgitgadget



* [PATCH 20/20] p2000: add sparse-index repos
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (18 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 19/20] sparse-index: loose integration with cache_tree_verify() Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-23 23:49 ` [PATCH 00/20] Sparse Index: Design, Format, Tests Elijah Newren
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

p2000-sparse-operations.sh compares different Git commands in
repositories with many files at HEAD but using sparse-checkout to focus
on a small portion of those files.

Add extra copies of the repository that use the sparse-index format so
we can track how that affects the performance of different commands.

At this point, the sparse-index is pure overhead in terms of CPU time,
and this is measurable in these tests:

Test
---------------------------------------------------------------
2000.2: git status (full-index-v3)              0.59(0.51+0.12)
2000.3: git status (full-index-v4)              0.59(0.52+0.11)
2000.4: git status (sparse-index-v3)            1.40(1.32+0.12)
2000.5: git status (sparse-index-v4)            1.41(1.36+0.08)
2000.6: git add -A (full-index-v3)              2.32(1.97+0.19)
2000.7: git add -A (full-index-v4)              2.17(1.92+0.14)
2000.8: git add -A (sparse-index-v3)            2.31(2.21+0.15)
2000.9: git add -A (sparse-index-v4)            2.30(2.20+0.13)
2000.10: git add . (full-index-v3)              2.39(2.02+0.20)
2000.11: git add . (full-index-v4)              2.20(1.94+0.16)
2000.12: git add . (sparse-index-v3)            2.36(2.27+0.12)
2000.13: git add . (sparse-index-v4)            2.33(2.21+0.16)
2000.14: git commit -a -m A (full-index-v3)     2.47(2.12+0.20)
2000.15: git commit -a -m A (full-index-v4)     2.26(2.00+0.17)
2000.16: git commit -a -m A (sparse-index-v3)   3.01(2.92+0.16)
2000.17: git commit -a -m A (sparse-index-v4)   3.01(2.94+0.15)

Note that there is very little difference between the v3 and v4 index
formats when the sparse-index is enabled. This is primarily due to the
fact that the relative file sizes are the same, and the command time is
mostly taken up by parsing tree objects to expand the sparse index into
a full one.

With the current file layout, the index file sizes are given by this
table:

       |  full index | sparse index |
       +-------------+--------------+
    v3 |     108 MiB |      1.6 MiB |
    v4 |      80 MiB |      1.2 MiB |

Future updates will improve the performance of Git commands when the
index is sparse.
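
To put the two tables above in relative terms, a trivial helper is
enough. This is only a worked-arithmetic sketch; sketch_ratio() is a
hypothetical name, and the inputs are the figures quoted above.

```c
#include <assert.h>

/* Ratio of two positive measurements (sizes in MiB or times in seconds). */
double sketch_ratio(double numerator, double denominator)
{
	assert(denominator > 0.0);
	return numerator / denominator;
}
```

For the index sizes, sketch_ratio(108, 1.6) is 67.5 and
sketch_ratio(80, 1.2) is about 66.7, so the sparse index is roughly 67
times smaller in both formats. The v3-to-v4 ratio is 1.35 for the full
index and about 1.33 for the sparse index. For 'git status' on v3,
sketch_ratio(1.40, 0.59) is about 2.4, which is the slowdown from
expanding the sparse index on every read.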

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/perf/p2000-sparse-operations.sh | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
index 52597683376e..f9c7f3c6e27e 100755
--- a/t/perf/p2000-sparse-operations.sh
+++ b/t/perf/p2000-sparse-operations.sh
@@ -62,12 +62,29 @@ test_expect_success 'setup repo and indexes' '
 		git sparse-checkout set $SPARSE_CONE &&
 		git config index.version 4 &&
 		git update-index --index-version=4
+	) &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . sparse-index-v3 &&
+	(
+		cd sparse-index-v3 &&
+		git sparse-checkout init --cone --sparse-index &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 3 &&
+		git update-index --index-version=3
+	) &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . sparse-index-v4 &&
+	(
+		cd sparse-index-v4 &&
+		git sparse-checkout init --cone --sparse-index &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 4 &&
+		git update-index --index-version=4
 	)
 '
 
 test_perf_on_all () {
 	command="$@"
-	for repo in full-index-v3 full-index-v4
+	for repo in full-index-v3 full-index-v4 \
+		    sparse-index-v3 sparse-index-v4
 	do
 		test_perf "$command ($repo)" "
 			(
-- 
gitgitgadget


* Re: [PATCH 00/20] Sparse Index: Design, Format, Tests
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (19 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 20/20] p2000: add sparse-index repos Derrick Stolee via GitGitGadget
@ 2021-02-23 23:49 ` Elijah Newren
  2021-02-26 21:28   ` Elijah Newren
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  21 siblings, 1 reply; 203+ messages in thread
From: Elijah Newren @ 2021-02-23 23:49 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee

On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> Here is the first full patch series submission coming out of the
> sparse-index RFC [1].

Wahoo!  I'll be reading these over the next few days.

> [1]
> https://lore.kernel.org/git/pull.847.git.1611596533.gitgitgadget@gmail.com/
>
> I won't waste too much space here, because PATCH 1 includes a sizeable
> design document that describes the feature, the reasoning behind it, and my
> plan for getting this implemented widely throughout the codebase.
>
> There are some new things here that were not in the RFC:
>
>  * Design doc and format updates. (Patch 1)
>  * Performance test script. (Patches 2 and 20)
>
> Notably missing in this series from the RFC:
>
>  * The mega-patch inserting ensure_full_index() throughout the codebase.
>    That will be a follow-up series to this one.
>  * The integrations with git status and git add to demonstrate the improved
>    performance. Those will also appear in their own series later.
>
> I plan to keep my latest work in this area in my 'sparse-index/wip' branch
> [2]. It includes all of the work from the RFC right now, updated with the
> work from this series.
>
> [2] https://github.com/derrickstolee/git/tree/sparse-index/wip
>
> Thanks, -Stolee
>
> Derrick Stolee (20):
>   sparse-index: design doc and format update
>   t/perf: add performance test for sparse operations
>   t1092: clean up script quoting
>   sparse-index: add guard to ensure full index
>   sparse-index: implement ensure_full_index()
>   t1092: compare sparse-checkout to sparse-index
>   test-read-cache: print cache entries with --table
>   test-tool: don't force full index
>   unpack-trees: ensure full index
>   sparse-checkout: hold pattern list in index
>   sparse-index: convert from full to sparse
>   submodule: sparse-index should not collapse links
>   unpack-trees: allow sparse directories
>   sparse-index: check index conversion happens
>   sparse-index: create extension for compatibility
>   sparse-checkout: toggle sparse index from builtin
>   sparse-checkout: disable sparse-index
>   cache-tree: integrate with sparse directory entries
>   sparse-index: loose integration with cache_tree_verify()
>   p2000: add sparse-index repos
>
>  Documentation/config/extensions.txt      |   7 +
>  Documentation/git-sparse-checkout.txt    |  14 ++
>  Documentation/technical/index-format.txt |   7 +
>  Documentation/technical/sparse-index.txt | 167 +++++++++++++
>  Makefile                                 |   1 +
>  builtin/sparse-checkout.c                |  44 +++-
>  cache-tree.c                             |  40 ++++
>  cache.h                                  |  12 +-
>  read-cache.c                             |  35 ++-
>  repo-settings.c                          |  15 ++
>  repository.c                             |  11 +-
>  repository.h                             |   3 +
>  setup.c                                  |   3 +
>  sparse-index.c                           | 290 +++++++++++++++++++++++
>  sparse-index.h                           |  11 +
>  t/README                                 |   3 +
>  t/helper/test-read-cache.c               |  61 ++++-
>  t/perf/p2000-sparse-operations.sh        | 104 ++++++++
>  t/t1091-sparse-checkout-builtin.sh       |  13 +
>  t/t1092-sparse-checkout-compatibility.sh | 136 +++++++++--
>  unpack-trees.c                           |  16 +-
>  21 files changed, 953 insertions(+), 40 deletions(-)
>  create mode 100644 Documentation/technical/sparse-index.txt
>  create mode 100644 sparse-index.c
>  create mode 100644 sparse-index.h
>  create mode 100755 t/perf/p2000-sparse-operations.sh
>
>
> base-commit: 966e671106b2fd38301e7c344c754fd118d0bb07
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-883%2Fderrickstolee%2Fsparse-index%2Fformat-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-883/derrickstolee/sparse-index/format-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/883
> --
> gitgitgadget


* Re: [PATCH 01/20] sparse-index: design doc and format update
  2021-02-23 20:14 ` [PATCH 01/20] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
@ 2021-02-24  1:13   ` Elijah Newren
  2021-02-25 15:29     ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Elijah Newren @ 2021-02-24  1:13 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee, Matheus Tavares Bernardino

On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> This begins a long effort to update the index format to allow sparse
> directory entries. This should result in a significant improvement to
> Git commands when HEAD contains millions of files, but the user has
> selected many fewer files to keep in their sparse-checkout definition.
>
> Currently, the index format is only updated in the presence of
> extensions.sparseIndex instead of increasing a file format version
> number. This is temporary, and index v5 is part of the plan for future
> work in this area.
>
> The design document details many of the reasons for embarking on this
> work, and also the plan for completing it safely.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/technical/index-format.txt |   7 +
>  Documentation/technical/sparse-index.txt | 167 +++++++++++++++++++++++
>  2 files changed, 174 insertions(+)
>  create mode 100644 Documentation/technical/sparse-index.txt
>
> diff --git a/Documentation/technical/index-format.txt b/Documentation/technical/index-format.txt
> index b633482b1bdf..387126582556 100644
> --- a/Documentation/technical/index-format.txt
> +++ b/Documentation/technical/index-format.txt
> @@ -44,6 +44,13 @@ Git index format
>    localization, no special casing of directory separator '/'). Entries
>    with the same name are sorted by their stage field.
>
> +  An index entry typically represents a file. However, if sparse-checkout
> +  is enabled in cone mode (`core.sparseCheckoutCone` is enabled) and the
> +  `extensions.sparseIndex` extension is enabled, then the index may
> +  contain entries for directories outside of the sparse-checkout definition.
> +  These entries have mode `0040000`, include the `SKIP_WORKTREE` bit, and
> +  the path ends in a directory separator.
> +
>    32-bit ctime seconds, the last time a file's metadata changed
>      this is stat(2) data
>
> diff --git a/Documentation/technical/sparse-index.txt b/Documentation/technical/sparse-index.txt
> new file mode 100644
> index 000000000000..9070836f0655
> --- /dev/null
> +++ b/Documentation/technical/sparse-index.txt
> @@ -0,0 +1,167 @@
> +Git Sparse-Index Design Document
> +================================
> +
> +The sparse-checkout feature allows users to focus a working directory on
> +a subset of the files at HEAD. The cone mode patterns, enabled by
> +`core.sparseCheckoutCone`, allow for very fast pattern matching to
> +discover which files at HEAD belong in the sparse-checkout cone.
> +
> +Three important scale dimensions for a Git worktree are:
> +
> +* `HEAD`: How many files are present at `HEAD`?
> +
> +* Populated: How many files are within the sparse-checkout cone.
> +
> +* Modified: How many files has the user modified in the working directory?
> +
> +We will use big-O notation -- O(X) -- to denote how expensive certain
> +operations are in terms of these dimensions.
> +
> +These dimensions are ordered by their magnitude: users (typically) modify
> +fewer files than are populated, and we can only populate files at `HEAD`.
> +These dimensions are also ordered by how expensive they are per item: it
> +is more expensive to detect a modified file than it is to write one that we
> +know must be populated; changing `HEAD` only really requires updating the
> +index.
> +
> +Problems occur if there is an extreme imbalance in these dimensions. For
> +example, if `HEAD` contains millions of paths but the populated set has
> +only tens of thousands, then commands like `git status` and `git add` can
> +be dominated by operations that require O(`HEAD`) operations instead of
> +O(Populated). Primarily, the cost is in parsing and rewriting the index,
> +which is filled primarily with files at `HEAD` that are marked with the
> +`SKIP_WORKTREE` bit.
> +
> +The sparse-index intends to take these commands that read and modify the
> +index from O(`HEAD`) to O(Populated). To do this, we need to modify the
> +index format in a significant way: add "sparse directory" entries.
> +
> +With cone mode patterns, it is possible to detect when an entire
> +directory will have its contents outside of the sparse-checkout definition.
> +Instead of listing all of the files it contains as individual entries, a
> +sparse-index contains an entry with the directory name, referencing the
> +object ID of the tree at `HEAD` and marked with the `SKIP_WORKTREE` bit.
> +If we need to discover the details for paths within that directory, we
> +can parse trees to find that list.
> +
> +This addition of sparse-directory entries violates expectations about the

Violates current expectations, yes.  Documentation tends to live a
long time, and I suspect that 2-3 years from now reading this sentence
might be jarring since we'll have modified the code to have an updated
set of expectations.  Maybe a simple "As of time of writing, ..." at
the beginning of the sentence here?  Or maybe I'm just being overly
worried...

> +index format and its in-memory data structure. There are many consumers in
> +the codebase that expect to iterate through all of the index entries and
> +see only files. In addition, they expect to see all files at `HEAD`. One
> +way to handle this is to parse trees to replace a sparse-directory entry
> +with all of the files within that tree as the index is loaded. However,
> +parsing trees is slower than parsing the index format, so that is a slower
> +operation than if we left the index alone.
> +
> +The implementation plan below follows four phases to slowly integrate with
> +the sparse-index. The intention is to incrementally update Git commands to
> +interact safely with the sparse-index without significant slowdowns. This
> +may not always be possible, but the hope is that the primary commands that
> +users need in their daily work are dramatically improved.
> +
> +Phase I: Format and initial speedups
> +------------------------------------
> +
> +During this phase, Git learns to enable the sparse-index and safely parse
> +one. Protections are put in place so that every consumer of the in-memory
> +data structure can operate with its current assumption of every file at
> +`HEAD`.
> +
> +At first, every index parse will expand the sparse-directory entries into
> +the full list of paths at `HEAD`. This will be slower in all cases. The
> +only noticable change in behavior will be that the serialized index file

noticeable

> +contains sparse-directory entries.
> +
> +To start, we use a new repository extension, `extensions.sparseIndex`, to
> +allow inserting sparse-directory entries into indexes with file format
> +versions 2, 3, and 4. This prevents Git versions that do not understand
> +the sparse-index from operating on one, but it also prevents other
> +operations that do not use the index at all. A new format, index v5, will
> +be introduced that includes sparse-directory entries by default. It might
> +also introduce other features that have been considered for improving the
> +index, as well.
> +
> +Next, consumers of the index will be guarded against operating on a
> +sparse-index by inserting calls to `ensure_full_index()` or
> +`expand_index_to_path()`. After these guards are in place, we can begin
> +leaving sparse-directory entries in the in-memory index structure.
> +
> +Even after inserting these guards, we will keep expanding sparse-indexes
> +for most Git commands using the `command_requires_full_index` repository
> +setting. This setting will be on by default and disabled one builtin at a
> +time until we have sufficient confidence that all of the index operations
> +are properly guarded.
> +
> +To complete this phase, the commands `git status` and `git add` will be
> +integrated with the sparse-index so that they operate with O(Populated)
> +performance. They will be carefully tested for operations within and
> +outside the sparse-checkout definition.

Good plan so far, but there's something else bugging me a little here.
One thing we noticed with our usage of `sparse-checkout` is that
although unimportant _tracked_ files go away, leftover build files and
other untracked files stick around.  So, although 'git status'
shouldn't have to check the tracked files anymore, it is still going
to have to look at each of the *untracked* files and compare to
.gitignore files in order to correctly classify each file as ignored
or just plain untracked.  Our `sparsify` tool has for a long time
tried to warn about such files when changing the sparsity
patterns/modules and had a --remove-old-ignores option for clearing
out ignored files that are found within directories that are sparse
(meaning the directories where all files under them are marked
SKIP_WORKTREE).  I was never sure whether a warning was enough, or if
pushing that option more made sense, but about a month ago my
colleagues made the tool just auto-invoke that option from other
sparsify invocations.  To my knowledge, there have been no complaints
about that being automatically turned on; but there were
complaints/confusion before about the directories being left around.
(Of course, non-ignored files are still left around by that option.)

I'm worried that since sparse-checkout doesn't do anything to help
with all these untracked/ignored files, we might not get all the
performance improvements we want from a `git status` with sparse
directories.  We'll be dropping from walking O(2*HEAD) files (1 source
+ 1 object file) down to O(HEAD) files (just the object files) rather
than actually getting down to O(Populated).

> +
> +Phase II: Careful integrations
> +------------------------------
> +
> +This phase focuses on ensuring that all index extensions and APIs work
> +well with a sparse-index. This requires significant increases to our test
> +coverage, especially for operations that interact with the working
> +directory outside of the sparse-checkout definition. Some of these
> +behaviors may not be the desirable ones, such as some tests already
> +marked for failure in `t1092-sparse-checkout-compatibility.sh`.
> +
> +The index extensions that may require special integrations are:
> +
> +* FS Monitor
> +* Untracked cache
> +
> +While integrating with these features, we should look for patterns that
> +might lead to better APIs for interacting with the index. Coalescing
> +common usage patterns into an API call can reduce the number of places
> +where sparse-directories need to be handled carefully.

Makes sense.

> +Phase III: Important command speedups
> +-------------------------------------
> +
> +At this point, the patterns for testing and implementing sparse-directory
> +logic should be relatively stable. This phase focuses on updating some of
> +the most common builtins that use the index to operate as O(Populated).
> +Here is a potential list of commands that could be valuable to integrate
> +at this point:
> +
> +* `git commit`
> +* `git checkout`
> +* `git merge`
> +* `git rebase`
> +
> +Along with `git status` and `git add`, these commands cover the majority
> +of users' interactions with the working directory.

Sounds like a good plan as well.

I hope we get to make this specific to the merge-ort backend.  It
localizes the index-related code to (a) a call to unpack_trees()
called from checkout-like code (which would probably automatically be
handled by your updates to git checkout), and (b) a single function
named record_conflicted_index_entries().  I feel it should be pretty
easy to update.

In contrast, the idea of attempting to update merge-recursive with
this kind of change sounds overwhelming.

>  In addition, we can
> +integrate with these commands:
> +
> +* `git grep`
> +* `git rm`
> +
> +These have been proposed as some whose behavior could change when in a
> +repo with a sparse-checkout definition. It would be good to include this
> +behavior automatically when using a sparse-index. Some clarity is needed
> +to make the behavior switch clear to the user.

Is this leftover from before recent events?  I think this portion of
the document should just be stricken.

I argued about how these were buggy as-is due to SKIP_WORKTREE always
having been an incomplete implementation of an idea at [1], but didn't
hear a further response from you.  I'm curious if you disagreed with
my reasoning, or it just slipped through the cracks in a busy schedule
and this portion of the document was leftover from before.  In my
opinion, both commands are just buggy and should be fixed for general
sparse-checkout usage cases, not just for sparse-index.

As for git grep, it has options for searching the working tree
(default) OR searching the index (--cached) OR searching an old commit
(passing a REVISION).  But never some combination or more than one of
these.  The fact that it combined some in the cases of SKIP_WORKTREE
entries looks entirely like a bug to me.  For the same reasons I
argued that --untracked and --cached are incompatible[2], we shouldn't
be combining results from searching the working tree and searching the
index.  Luckily, this fix has already been submitted[3] and picked up
in mt/grep-sparse-checkout and is marked in the cooking emails as
"Will merge to next".

As for git rm, I'll quote from my email to Matheus:

"""As far as the longer term discussion about making git rm configurable...
_If_ it comes up again in the future, I will argue that if git rm
should have configuration to delete paths outside the sparsity
specification, then git add should have configuration to add paths
outside the sparsity specification that happen to be present despite
being SKIP_WORKTREE, that git diff with no revision arguments (nor
--cached) should have configuration to diff against paths that are
SKIP_WORKTREE but happen to be present, that git status should have
configuration to report on changes to paths that are SKIP_WORKTREE but
happen to be present, that git checkout should have configuration to
write files to the working tree despite matching sparsity paths, etc.
And I'll argue that you do ALL of those or you're being inconsistent.
I hope that people see these are actually all the same request and
that it is horribly inconsistent to do some of these and not others,
and that at least by the time I get to mentioning checkout that they
realize it's a crazy request.  We should just tell users to extend
their sparsity if they want the working copy (and commands that
interact with the working copy) to handle the additional paths.  Maybe
I'm just really biased, but I don't see how this makes sense.  I would
argue more about it, but no one has responded.  My plan was to just
fix the default behavior, and then see if anyone ever actually cared
enough to come back and ask for more configurability."""

Also, for rm, Matheus has already submitted the fix[4], though at
Junio's request he separated out some fixes for git-add as a separate
preliminary series[5] and then will resubmit the other `add` and `rm`
fixes.

[1] https://lore.kernel.org/git/CABPp-BHwNoVnooqDFPAsZxBT9aR5Dwk5D9sDRCvYSb8akxAJgA@mail.gmail.com/
[2] https://lore.kernel.org/git/xmqqtuql0yfp.fsf@gitster.c.googlers.com/
[3] https://lore.kernel.org/git/5f3f7ac77039d41d1692ceae4b0c5df3bb45b74a.1612901326.git.matheus.bernardino@usp.br/
[4] https://lore.kernel.org/git/61a77cd5f45ba02c7dff4b7932abdebb17c1667f.1613593946.git.matheus.bernardino@usp.br/
[5] https://lore.kernel.org/git/cover.1614037664.git.matheus.bernardino@usp.br/

Anyway, that's a long way of saying I think this section of your
document is already obsolete.  (Which is a good thing -- less work to
do to get sparse-index working.  Thanks, Matheus!).

> +This phase is the first where parallel work might be possible without too
> +many conflicts between topics.
> +
> +Phase IV: The long tail
> +-----------------------
> +
> +This last phase is less a "phase" and more "the new normal" after all of
> +the previous work.
> +
> +To start, the `command_requires_full_index` option could be removed in
> +favor of expanding only when hitting an API guard.
> +
> +There are many Git commands that could use special attention to operate as
> +O(Populated), while some might be so rare that it is acceptable to leave
> +them with additional overhead when a sparse-index is present.
> +
> +Here are some commands that might be useful to update:
> +
> +* `git sparse-checkout set`
> +* `git am`
> +* `git clean`
> +* `git stash`

Oh, man, git stash is definitely in need of work.  It's still a
minimalistic transliteration of shell to C, complete with lots of
process forking and piping output between various low-level commands.
It might be interesting to rewrite this in terms of the merge
machinery, though its separate stashing of staged stuff, unstaged
stuff, and possibly untracked stuff means that there is a sequence of
two or three merges needed and interesting failure handling to do if
those merges fail, especially if the user uses --index.  But I
digress...


Anyway, overall, very nicely written and planned out.  Thanks for
taking the time to write this all up.

* Re: [PATCH 02/20] t/perf: add performance test for sparse operations
  2021-02-23 20:14 ` [PATCH 02/20] t/perf: add performance test for sparse operations Derrick Stolee via GitGitGadget
@ 2021-02-24  2:30   ` Elijah Newren
  2021-03-09 20:03     ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Elijah Newren @ 2021-02-24  2:30 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> Create a test script that takes the default performance test (the Git
> codebase) and multiplies it by 256 using four layers of duplicated
> trees of width four. This results in nearly one million blob entries in
> the index. Then, we can clone this repository with sparse-checkout
> patterns that demonstrate four copies of the initial repository. Each
> clone will use a different index format or mode so performance can be
> tested across the different options.
>
> Note that the initial repo is stripped of submodules before doing the
> copies. This preserves the expected data shape of the sparse index,
> because directories containing submodules are not collapsed to a sparse
> directory entry.
>
> Run a few Git commands on these clones, especially those that use the
> index (status, add, commit).
>
> Here are the results on my Linux machine:
>
> Test
> --------------------------------------------------------------
> 2000.2: git status (full-index-v3)             0.37(0.30+0.09)
> 2000.3: git status (full-index-v4)             0.39(0.32+0.10)
> 2000.4: git add -A (full-index-v3)             1.42(1.06+0.20)
> 2000.5: git add -A (full-index-v4)             1.26(0.98+0.16)
> 2000.6: git add . (full-index-v3)              1.40(1.04+0.18)
> 2000.7: git add . (full-index-v4)              1.26(0.98+0.17)
> 2000.8: git commit -a -m A (full-index-v3)     1.42(1.11+0.16)
> 2000.9: git commit -a -m A (full-index-v4)     1.33(1.08+0.16)
>
> It is perhaps noteworthy that there is an improvement when using index
> version 4. This is because the v3 index uses 108 MiB while the v4
> index uses 80 MiB. Since the repeated portions of the directories are
> very short (f3/f1/f2, for example) this ratio is less pronounced than in
> similarly-sized real repositories.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  t/perf/p2000-sparse-operations.sh | 87 +++++++++++++++++++++++++++++++
>  1 file changed, 87 insertions(+)
>  create mode 100755 t/perf/p2000-sparse-operations.sh
>
> diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
> new file mode 100755
> index 000000000000..52597683376e
> --- /dev/null
> +++ b/t/perf/p2000-sparse-operations.sh
> @@ -0,0 +1,87 @@
> +#!/bin/sh
> +
> +test_description="test performance of Git operations using the index"
> +
> +. ./perf-lib.sh
> +
> +test_perf_default_repo
> +
> +SPARSE_CONE=f2/f4/f1
> +
> +test_expect_success 'setup repo and indexes' '
> +       git reset --hard HEAD &&
> +       # Remove submodules from the example repo, because our
> +       # duplication of the entire repo creates an unlikely data shape.
> +       git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
> +       rm -f .gitmodules &&
> +       git add .gitmodules &&

Why not `git rm [-f] .gitmodules` instead of these two commands?  Is
there something special about .gitmodules that requires this special
handling?

> +       for module in $(awk "{print \$2}" modules)
> +       do
> +               git rm $module || return 1
> +       done &&
> +       git add . &&

What does the `git add .` do?  I don't see any changes that weren't
already git-add'ed or git-rm'ed.

> +       git commit -m "remove submodules" &&
> +
> +       echo bogus >a &&
> +       cp a b &&
> +       git add a b &&
> +       git commit -m "level 0" &&
> +       BLOB=$(git rev-parse HEAD:a) &&
> +       OLD_COMMIT=$(git rev-parse HEAD) &&
> +       OLD_TREE=$(git rev-parse HEAD^{tree}) &&
> +
> +       for i in $(test_seq 1 4)
> +       do
> +               cat >in <<-EOF &&
> +                       100755 blob $BLOB       a
> +                       040000 tree $OLD_TREE   f1
> +                       040000 tree $OLD_TREE   f2
> +                       040000 tree $OLD_TREE   f3
> +                       040000 tree $OLD_TREE   f4
> +               EOF
> +               NEW_TREE=$(git mktree <in) &&
> +               NEW_COMMIT=$(git commit-tree $NEW_TREE -p $OLD_COMMIT -m "level $i") &&
> +               OLD_TREE=$NEW_TREE &&
> +               OLD_COMMIT=$NEW_COMMIT || return 1
> +       done &&
> +
> +       git sparse-checkout init --cone &&
> +       git branch -f wide $OLD_COMMIT &&
> +       git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . full-index-v3 &&
> +       (
> +               cd full-index-v3 &&
> +               git sparse-checkout init --cone &&
> +               git sparse-checkout set $SPARSE_CONE &&
> +               git config index.version 3 &&
> +               git update-index --index-version=3
> +       ) &&
> +       git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . full-index-v4 &&
> +       (
> +               cd full-index-v4 &&
> +               git sparse-checkout init --cone &&
> +               git sparse-checkout set $SPARSE_CONE &&
> +               git config index.version 4 &&
> +               git update-index --index-version=4
> +       )
> +'
> +
> +test_perf_on_all () {
> +       command="$@"
> +       for repo in full-index-v3 full-index-v4
> +       do
> +               test_perf "$command ($repo)" "
> +                       (
> +                               cd $repo &&
> +                               echo >>$SPARSE_CONE/a &&
> +                               $command
> +                       )
> +               "
> +       done
> +}
> +
> +test_perf_on_all git status
> +test_perf_on_all git add -A
> +test_perf_on_all git add .
> +test_perf_on_all git commit -a -m A
> +
> +test_done
> --
> gitgitgadget

Other than the two minor questions, the rest looks good to me.
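
On the v3 vs. v4 size difference in the commit message: index v4
prefix-compresses each path against the previous entry's path, so the
saving grows with the length of the shared prefix.  A rough sketch of
that arithmetic follows.  The helper names are made up for
illustration and are not Git functions, and the sketch ignores the
variable-width integer that the real format stores alongside each
suffix.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the leading run of bytes shared by two paths. */
static size_t common_prefix_len(const char *a, const char *b)
{
	size_t n = 0;

	while (a[n] && a[n] == b[n])
		n++;
	return n;
}

/*
 * Hypothetical helper: path bytes a v4-style writer still has to
 * store for 'cur' after reusing the prefix it shares with 'prev'.
 */
static size_t suffix_bytes(const char *prev, const char *cur)
{
	size_t shared = common_prefix_len(prev, cur);

	assert(shared <= strlen(cur));
	return strlen(cur) - shared;
}
```

With sibling paths such as f3/f1/f2/a and f3/f1/f2/b only the nine
bytes of "f3/f1/f2/" are shared, which fits the commit message's point
that short directory names limit the benefit in this synthetic repo.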

* Re: [PATCH 04/20] sparse-index: add guard to ensure full index
  2021-02-23 20:14 ` [PATCH 04/20] sparse-index: add guard to ensure full index Derrick Stolee via GitGitGadget
@ 2021-02-24  2:44   ` Elijah Newren
  0 siblings, 0 replies; 203+ messages in thread
From: Elijah Newren @ 2021-02-24  2:44 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> Upcoming changes will introduce modifications to the index format that
> allow sparse directories. It will be useful to have a mechanism for
> converting those sparse index files into full indexes by walking the
> tree at those sparse directories. Name this method ensure_full_index()
> as it will guarantee that the index is fully expanded.
>
> This method is not implemented yet, and instead we focus on the
> scaffolding to declare it and call it at the appropriate time.
>
> Add a 'command_requires_full_index' member to struct repo_settings. This
> will be an indicator that we need the index in full mode to do certain
> index operations. This starts as being true for every command, then we
> will set it to false as some commands integrate with sparse indexes.
>
> If 'command_requires_full_index' is true, then we will immediately
> expand a sparse index to a full one upon reading from disk. This
> suffices for now, but we will want to add more callers to
> ensure_full_index() later.

Same as 01/27 of your RFC series; looks good.
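
For readers skimming the thread, the guard described above boils down
to "expand right after reading unless the command has opted out".  A
standalone sketch of that control flow, using mock types rather than
the real struct repository / struct index_state (every mock_* name
here is invented for the example):

```c
#include <assert.h>

struct mock_settings {
	unsigned command_requires_full_index:1;
};

struct mock_index {
	unsigned sparse_index:1;
	int expand_calls;	/* counts expansions, for illustration only */
};

/* Stand-in for ensure_full_index(): a no-op on an already-full index. */
static void mock_ensure_full_index(struct mock_index *istate)
{
	if (!istate || !istate->sparse_index)
		return;
	istate->sparse_index = 0;
	istate->expand_calls++;
}

/* Stand-in for the read path: load, then expand if the command needs it. */
static int mock_read_index(struct mock_index *istate,
			   const struct mock_settings *settings,
			   int on_disk_sparse)
{
	assert(istate && settings);
	istate->sparse_index = on_disk_sparse ? 1 : 0;
	if (settings->command_requires_full_index)
		mock_ensure_full_index(istate);
	return 0;
}
```

A command that has been taught about sparse directory entries would
clear the flag before reading and so keep the index sparse in memory.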

> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Makefile        |  1 +
>  repo-settings.c |  8 ++++++++
>  repository.c    | 11 ++++++++++-
>  repository.h    |  2 ++
>  sparse-index.c  |  8 ++++++++
>  sparse-index.h  |  7 +++++++
>  6 files changed, 36 insertions(+), 1 deletion(-)
>  create mode 100644 sparse-index.c
>  create mode 100644 sparse-index.h
>
> diff --git a/Makefile b/Makefile
> index 5a239cac20e3..3bf61699238d 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -980,6 +980,7 @@ LIB_OBJS += setup.o
>  LIB_OBJS += shallow.o
>  LIB_OBJS += sideband.o
>  LIB_OBJS += sigchain.o
> +LIB_OBJS += sparse-index.o
>  LIB_OBJS += split-index.o
>  LIB_OBJS += stable-qsort.o
>  LIB_OBJS += strbuf.o
> diff --git a/repo-settings.c b/repo-settings.c
> index f7fff0f5ab83..d63569e4041e 100644
> --- a/repo-settings.c
> +++ b/repo-settings.c
> @@ -77,4 +77,12 @@ void prepare_repo_settings(struct repository *r)
>                 UPDATE_DEFAULT_BOOL(r->settings.core_untracked_cache, UNTRACKED_CACHE_KEEP);
>
>         UPDATE_DEFAULT_BOOL(r->settings.fetch_negotiation_algorithm, FETCH_NEGOTIATION_DEFAULT);
> +
> +       /*
> +        * This setting guards all index reads to require a full index
> +        * over a sparse index. After suitable guards are placed in the
> +        * codebase around uses of the index, this setting will be
> +        * removed.
> +        */
> +       r->settings.command_requires_full_index = 1;
>  }
> diff --git a/repository.c b/repository.c
> index c98298acd017..a8acae002f71 100644
> --- a/repository.c
> +++ b/repository.c
> @@ -10,6 +10,7 @@
>  #include "object.h"
>  #include "lockfile.h"
>  #include "submodule-config.h"
> +#include "sparse-index.h"
>
>  /* The main repository */
>  static struct repository the_repo;
> @@ -261,6 +262,8 @@ void repo_clear(struct repository *repo)
>
>  int repo_read_index(struct repository *repo)
>  {
> +       int res;
> +
>         if (!repo->index)
>                 repo->index = xcalloc(1, sizeof(*repo->index));
>
> @@ -270,7 +273,13 @@ int repo_read_index(struct repository *repo)
>         else if (repo->index->repo != repo)
>                 BUG("repo's index should point back at itself");
>
> -       return read_index_from(repo->index, repo->index_file, repo->gitdir);
> +       res = read_index_from(repo->index, repo->index_file, repo->gitdir);
> +
> +       prepare_repo_settings(repo);
> +       if (repo->settings.command_requires_full_index)
> +               ensure_full_index(repo->index);
> +
> +       return res;
>  }
>
>  int repo_hold_locked_index(struct repository *repo,
> diff --git a/repository.h b/repository.h
> index b385ca3c94b6..e06a23015697 100644
> --- a/repository.h
> +++ b/repository.h
> @@ -41,6 +41,8 @@ struct repo_settings {
>         enum fetch_negotiation_setting fetch_negotiation_algorithm;
>
>         int core_multi_pack_index;
> +
> +       unsigned command_requires_full_index:1;
>  };
>
>  struct repository {
> diff --git a/sparse-index.c b/sparse-index.c
> new file mode 100644
> index 000000000000..82183ead563b
> --- /dev/null
> +++ b/sparse-index.c
> @@ -0,0 +1,8 @@
> +#include "cache.h"
> +#include "repository.h"
> +#include "sparse-index.h"
> +
> +void ensure_full_index(struct index_state *istate)
> +{
> +       /* intentionally left blank */
> +}
> diff --git a/sparse-index.h b/sparse-index.h
> new file mode 100644
> index 000000000000..09a20d036c46
> --- /dev/null
> +++ b/sparse-index.h
> @@ -0,0 +1,7 @@
> +#ifndef SPARSE_INDEX_H__
> +#define SPARSE_INDEX_H__
> +
> +struct index_state;
> +void ensure_full_index(struct index_state *istate);
> +
> +#endif
> --
> gitgitgadget
>

* Re: [PATCH 05/20] sparse-index: implement ensure_full_index()
  2021-02-23 20:14 ` [PATCH 05/20] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
@ 2021-02-24  3:20   ` Elijah Newren
  0 siblings, 0 replies; 203+ messages in thread
From: Elijah Newren @ 2021-02-24  3:20 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> We will mark an in-memory index_state as having sparse directory entries
> with the sparse_index bit. These currently cannot exist, but we will add
> a mechanism for collapsing a full index to a sparse one in a later
> change. That will happen at write time, so we must first allow parsing
> the format before writing it.
>
> Commands or methods that require a full index in order to operate can
> call ensure_full_index() to expand that index in-memory. This requires
> parsing trees using that index's repository.
>
> Sparse directory entries have a specific 'ce_mode' value. The macro
> S_ISSPARSEDIR(ce->ce_mode) can check if a cache_entry 'ce' has this type.
> This ce_mode is not possible with the existing index formats, so we don't
> also verify all properties of a sparse-directory entry, which are:
>
>  1. ce->ce_mode == 0040000
>  2. ce->flags & CE_SKIP_WORKTREE is true
>  3. ce->name[ce->namelen - 1] == '/' (ends in dir separator)
>  4. ce->oid references a tree object.
>
> These are all semi-enforced in ensure_full_index() to some extent. Any
> deviation will cause a warning at minimum or a failure in the worst
> case.

Thanks for spelling these all out; looks good.
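
As a reading aid, the four properties can be written as one predicate.
This is only a sketch against a mock entry type: the MOCK_* constants
and the oid_is_tree field are stand-ins I made up, since checking
property 4 for real needs an object lookup.

```c
#include <assert.h>
#include <string.h>

#define MOCK_S_IFDIR		0040000u
#define MOCK_CE_SKIP_WORKTREE	(1u << 30)	/* assumed bit position */

struct mock_entry {
	unsigned mode;
	unsigned flags;
	const char *name;
	int oid_is_tree;	/* stand-in for looking up the object type */
};

/* Nonzero only if all four listed properties hold for 'ce'. */
static int looks_like_sparse_dir(const struct mock_entry *ce)
{
	size_t len;

	assert(ce && ce->name);
	len = strlen(ce->name);

	return ce->mode == MOCK_S_IFDIR &&			/* 1 */
	       (ce->flags & MOCK_CE_SKIP_WORKTREE) != 0 &&	/* 2 */
	       len > 0 && ce->name[len - 1] == '/' &&		/* 3 */
	       ce->oid_is_tree;					/* 4 */
}
```

Note that the patch itself only tests the mode via S_ISSPARSEDIR() and
warns on a missing SKIP_WORKTREE bit; it does not run a combined check
like this one.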

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  cache.h        |  7 +++-
>  read-cache.c   |  9 +++++
>  sparse-index.c | 95 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 109 insertions(+), 2 deletions(-)
>
> diff --git a/cache.h b/cache.h
> index d92814961405..1336c8d7435e 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -204,6 +204,8 @@ struct cache_entry {
>  #error "CE_EXTENDED_FLAGS out of range"
>  #endif
>
> +#define S_ISSPARSEDIR(m) ((m) == S_IFDIR)

Much nicer, thanks.  :-)

> +
>  /* Forward structure decls */
>  struct pathspec;
>  struct child_process;
> @@ -319,7 +321,8 @@ struct index_state {
>                  drop_cache_tree : 1,
>                  updated_workdir : 1,
>                  updated_skipworktree : 1,
> -                fsmonitor_has_run_once : 1;
> +                fsmonitor_has_run_once : 1,
> +                sparse_index : 1;
>         struct hashmap name_hash;
>         struct hashmap dir_hash;
>         struct object_id oid;
> @@ -722,6 +725,8 @@ int read_index_from(struct index_state *, const char *path,
>                     const char *gitdir);
>  int is_index_unborn(struct index_state *);
>
> +void ensure_full_index(struct index_state *istate);
> +
>  /* For use with `write_locked_index()`. */
>  #define COMMIT_LOCK            (1 << 0)
>  #define SKIP_IF_UNCHANGED      (1 << 1)
> diff --git a/read-cache.c b/read-cache.c
> index 29144cf879e7..97dbf2434f30 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -101,6 +101,9 @@ static const char *alternate_index_output;
>
>  static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
>  {
> +       if (S_ISSPARSEDIR(ce->ce_mode))
> +               istate->sparse_index = 1;

A very minor question -- someone who sees "sparse_index" could
probably easily think either "index with missing entries, due to
having a SKIP_WORKTREE directory instead" or perhaps "index when using
the sparse-checkout feature, i.e. it has some SKIP_WORKTREE entries in
it".  From the code here, clearly the former is your intent.  I wonder
if it'd help to have a small comment near the declaration of
sparse_index to mention its intent.
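
Something along these lines is what I have in mind for the comment.
The struct below is a cut-down mock so that the snippet stands alone;
the wording of the comment is only a suggestion, and the mock_* /
MOCK_* names are not from the patch.

```c
#include <assert.h>

#define MOCK_S_IFDIR		0040000u
#define MOCK_S_ISSPARSEDIR(m)	((m) == MOCK_S_IFDIR)
#define MOCK_CACHE_MAX		8

struct mock_entry {
	unsigned mode;
	const char *name;
};

struct mock_index {
	struct mock_entry *cache[MOCK_CACHE_MAX];
	/*
	 * sparse_index == 1 means the index holds at least one sparse
	 * directory entry, i.e. a tree standing in for the files below
	 * it.  It does not mean "sparse-checkout is in use": a full
	 * index whose file entries are SKIP_WORKTREE keeps this at 0.
	 */
	unsigned sparse_index:1;
};

/* Mirrors the quoted hunk: storing a sparse directory marks the index. */
static void mock_set_index_entry(struct mock_index *istate, unsigned nr,
				 struct mock_entry *ce)
{
	assert(istate && ce && nr < MOCK_CACHE_MAX);
	if (MOCK_S_ISSPARSEDIR(ce->mode))
		istate->sparse_index = 1;
	istate->cache[nr] = ce;
}
```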

> +
>         istate->cache[nr] = ce;
>         add_name_hash(istate, ce);
>  }
> @@ -2255,6 +2258,12 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist)
>         trace2_data_intmax("index", the_repository, "read/cache_nr",
>                            istate->cache_nr);
>
> +       if (!istate->repo)
> +               istate->repo = the_repository;
> +       prepare_repo_settings(istate->repo);
> +       if (istate->repo->settings.command_requires_full_index)
> +               ensure_full_index(istate);
> +
>         return istate->cache_nr;
>
>  unmap:
> diff --git a/sparse-index.c b/sparse-index.c
> index 82183ead563b..316cb949b74b 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -1,8 +1,101 @@
>  #include "cache.h"
>  #include "repository.h"
>  #include "sparse-index.h"
> +#include "tree.h"
> +#include "pathspec.h"
> +#include "trace2.h"
> +
> +static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
> +{
> +       ALLOC_GROW(istate->cache, nr + 1, istate->cache_alloc);
> +
> +       istate->cache[nr] = ce;
> +       add_name_hash(istate, ce);
> +}
> +
> +static int add_path_to_index(const struct object_id *oid,
> +                               struct strbuf *base, const char *path,
> +                               unsigned int mode, int stage, void *context)
> +{
> +       struct index_state *istate = (struct index_state *)context;
> +       struct cache_entry *ce;
> +       size_t len = base->len;
> +
> +       if (S_ISDIR(mode))
> +               return READ_TREE_RECURSIVE;
> +
> +       strbuf_addstr(base, path);
> +
> +       ce = make_cache_entry(istate, mode, oid, base->buf, 0, 0);
> +       ce->ce_flags |= CE_SKIP_WORKTREE;
> +       set_index_entry(istate, istate->cache_nr++, ce);
> +
> +       strbuf_setlen(base, len);
> +       return 0;
> +}
>
>  void ensure_full_index(struct index_state *istate)
>  {
> -       /* intentionally left blank */
> +       int i;
> +       struct index_state *full;
> +
> +       if (!istate || !istate->sparse_index)
> +               return;
> +
> +       if (!istate->repo)
> +               istate->repo = the_repository;
> +
> +       trace2_region_enter("index", "ensure_full_index", istate->repo);
> +
> +       /* initialize basics of new index */
> +       full = xcalloc(1, sizeof(struct index_state));
> +       memcpy(full, istate, sizeof(struct index_state));
> +
> +       /* then change the necessary things */
> +       full->sparse_index = 0;
> +       full->cache_alloc = (3 * istate->cache_alloc) / 2;
> +       full->cache_nr = 0;
> +       ALLOC_ARRAY(full->cache, full->cache_alloc);
> +
> +       for (i = 0; i < istate->cache_nr; i++) {
> +               struct cache_entry *ce = istate->cache[i];
> +               struct tree *tree;
> +               struct pathspec ps;
> +
> +               if (!S_ISSPARSEDIR(ce->ce_mode)) {
> +                       set_index_entry(full, full->cache_nr++, ce);
> +                       continue;
> +               }
> +               if (!(ce->ce_flags & CE_SKIP_WORKTREE))
> +                       warning(_("index entry is a directory, but not sparse (%08x)"),
> +                               ce->ce_flags);
> +
> +               /* recursively walk into ce->name */
> +               tree = lookup_tree(istate->repo, &ce->oid);
> +
> +               memset(&ps, 0, sizeof(ps));
> +               ps.recursive = 1;
> +               ps.has_wildcard = 1;
> +               ps.max_depth = -1;
> +
> +               read_tree_recursive(istate->repo, tree,
> +                                   ce->name, strlen(ce->name),
> +                                   0, &ps,
> +                                   add_path_to_index, full);
> +
> +               /* free directory entries. full entries are re-used */
> +               discard_cache_entry(ce);
> +       }
> +
> +       /* Copy back into original index. */
> +       memcpy(&istate->name_hash, &full->name_hash, sizeof(full->name_hash));
> +       istate->sparse_index = 0;
> +       free(istate->cache);

Thanks for fixing that leak from the RFC series.

> +       istate->cache = full->cache;
> +       istate->cache_nr = full->cache_nr;
> +       istate->cache_alloc = full->cache_alloc;
> +
> +       free(full);
> +
> +       trace2_region_leave("index", "ensure_full_index", istate->repo);
>  }
> --
> gitgitgadget

Looks good to me.

* Re: [PATCH 16/20] sparse-checkout: toggle sparse index from builtin
  2021-02-23 20:14 ` [PATCH 16/20] sparse-checkout: toggle sparse index from builtin Derrick Stolee via GitGitGadget
@ 2021-02-24 19:11   ` Martin Ågren
  2021-03-09 20:52     ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Martin Ågren @ 2021-02-24 19:11 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Elijah Newren, Junio C Hamano,
	Nguyễn Thái Ngọc Duy, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On Wed, 24 Feb 2021 at 00:57, Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
> +that is not completely understood by other tools. Enabling sparse index
> +enables the `extensions.spareseIndex` config value, which might cause

s/sparese/sparse

> +other tools to stop working with your repository. If you have trouble with
> +this compatibility, then run `git sparse-checkout sparse-index disable` to
> +remove this config and rewrite your index to not be sparse.

While I'm commenting on this..:

There are several "layers" here, for lack of a better term. "Enabling foo
enables bar which may cause baz. If you fail due to baz, try dropping
bar by dropping foo." If I remove any mention of the config variable from
your text, I get the following.

 Enabling sparse index might cause other tools to stop working with your
 repository. If you have trouble with this compatibility, then run `git
 sparse-checkout sparse-index disable` to rewrite your index to not be
 sparse.

I'm tempted to suggest such a rewrite to relieve readers of knowing of
the middle step, which you could say is more of an implementation
detail. But if we think that the symptoms / error messages might involve
"extensions.sparseIndex" or, as would be the case with an older Git
installation,

  fatal: unknown repository extensions found:
          sparseindex

maybe there is some value in mentioning the config item by name. Just
thinking out loud, really, and I don't have any strong opinion. I only
came here to point out the typo in the docs.

Martin

* Re: [PATCH 06/20] t1092: compare sparse-checkout to sparse-index
  2021-02-23 20:14 ` [PATCH 06/20] t1092: compare sparse-checkout to sparse-index Derrick Stolee via GitGitGadget
@ 2021-02-25  6:37   ` Elijah Newren
  0 siblings, 0 replies; 203+ messages in thread
From: Elijah Newren @ 2021-02-25  6:37 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> Add a new 'sparse-index' repo alongside the 'full-checkout' and
> 'sparse-checkout' repos in t1092-sparse-checkout-compatibility.sh. Also
> add run_on_sparse and test_sparse_match helpers. These helpers will be
> used when the sparse index is implemented.
>
> Add GIT_TEST_SPARSE_INDEX environment variable to enable the
> sparse-index by default. This will be intended to use across the entire
> test suite, except that it will only affect cases where the
> sparse-checkout feature is enabled.

This last sentence was a bit awkward to read.  "will be intended to
use" -> "is intended to be used"?

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  t/README                                 |  3 +++
>  t/t1092-sparse-checkout-compatibility.sh | 24 ++++++++++++++++++++----
>  2 files changed, 23 insertions(+), 4 deletions(-)
>
> diff --git a/t/README b/t/README
> index 593d4a4e270c..b98bc563aab5 100644
> --- a/t/README
> +++ b/t/README
> @@ -439,6 +439,9 @@ and "sha256".
>  GIT_TEST_WRITE_REV_INDEX=<boolean>, when true enables the
>  'pack.writeReverseIndex' setting.
>
> +GIT_TEST_SPARSE_INDEX=<boolean>, when true enables index writes to use the
> +sparse-index format by default.
> +
>  Naming Tests
>  ------------
>
> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> index 3725d3997e70..71d6f9e4c014 100755
> --- a/t/t1092-sparse-checkout-compatibility.sh
> +++ b/t/t1092-sparse-checkout-compatibility.sh
> @@ -7,6 +7,7 @@ test_description='compare full workdir to sparse workdir'
>  test_expect_success 'setup' '
>         git init initial-repo &&
>         (
> +               GIT_TEST_SPARSE_INDEX=0 &&
>                 cd initial-repo &&
>                 echo a >a &&
>                 echo "after deep" >e &&
> @@ -87,23 +88,32 @@ init_repos () {
>
>         cp -r initial-repo sparse-checkout &&
>         git -C sparse-checkout reset --hard &&
> -       git -C sparse-checkout sparse-checkout init --cone &&
> +
> +       cp -r initial-repo sparse-index &&
> +       git -C sparse-index reset --hard &&
>
>         # initialize sparse-checkout definitions
> -       git -C sparse-checkout sparse-checkout set deep
> +       git -C sparse-checkout sparse-checkout init --cone &&
> +       git -C sparse-checkout sparse-checkout set deep &&
> +       GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout init --cone &&
> +       GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep
>  }
>
>  run_on_sparse () {
>         (
>                 cd sparse-checkout &&
> -               "$@" >../sparse-checkout-out 2>../sparse-checkout-err
> +               GIT_TEST_SPARSE_INDEX=0 "$@" >../sparse-checkout-out 2>../sparse-checkout-err
> +       ) &&
> +       (
> +               cd sparse-index &&
> +               GIT_TEST_SPARSE_INDEX=1 "$@" >../sparse-index-out 2>../sparse-index-err
>         )
>  }
>
>  run_on_all () {
>         (
>                 cd full-checkout &&
> -               "$@" >../full-checkout-out 2>../full-checkout-err
> +               GIT_TEST_SPARSE_INDEX=0 "$@" >../full-checkout-out 2>../full-checkout-err
>         ) &&
>         run_on_sparse "$@"
>  }
> @@ -114,6 +124,12 @@ test_all_match () {
>         test_cmp full-checkout-err sparse-checkout-err
>  }
>
> +test_sparse_match () {
> +       run_on_sparse $* &&

Should this be
   run_on_sparse "$@"
in order to allow arguments with spaces?

> +       test_cmp sparse-checkout-out sparse-index-out &&
> +       test_cmp sparse-checkout-err sparse-index-err
> +}
> +
>  test_expect_success 'status with options' '
>         init_repos &&
>         test_all_match git status --porcelain=v2 &&
> --
> gitgitgadget

Other than those minor comments, looks good to me.

* Re: [PATCH 07/20] test-read-cache: print cache entries with --table
  2021-02-23 20:14 ` [PATCH 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
@ 2021-02-25  7:02   ` Elijah Newren
  2021-03-09 21:00     ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Elijah Newren @ 2021-02-25  7:02 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> This table is helpful for discovering data in the index to ensure it is
> being written correctly, especially as we build and test the
> sparse-index. This table includes an output format similar to 'git
> ls-tree', but should not be compared to that directly. The biggest
> reasons are that 'git ls-tree' includes a tree entry for every
> subdirectory, even those that would not appear as a sparse directory in
> a sparse-index. Further, 'git ls-tree' does not use a trailing directory
> separator for its tree rows.
>
> This does not print the stat() information for the blobs. That could be
> added in a future change with another option. The tests that are added
> in the next few changes care only about the object types and IDs.
>
> To make the option parsing slightly more robust, wrap the string
> comparisons in a loop adapted from test-dir-iterator.c.
>
> Care must be taken with the final check for the 'cnt' variable. We
> continue the expectation that the numerical value is the final argument.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  t/helper/test-read-cache.c | 50 ++++++++++++++++++++++++++++++--------
>  1 file changed, 40 insertions(+), 10 deletions(-)
>
> diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
> index 244977a29bdf..e4c3492f7d3e 100644
> --- a/t/helper/test-read-cache.c
> +++ b/t/helper/test-read-cache.c
> @@ -2,35 +2,65 @@
>  #include "cache.h"
>  #include "config.h"
>
> +static void print_cache_entry(struct cache_entry *ce)
> +{
> +       printf("%06o ", ce->ce_mode & 0777777);

This constant is curious.  I think it's because you want to strip off
the special in-memory bits of the ce_mode where git stores extra data,
which would be everything beyond the first 16 bits (as noted in a
comment near the beginning of cache.h).  But here you keep the first
18 bits.  Are CE_UPDATE and CE_REMOVE just 0 in the cases you've viewed
so this works (but you really should use 0177777 or 0xFFFF), or am I
off in my guess of what you're trying to do and you do want to see
those two flags?

It also seems surprising to me that this constant isn't already
defined somewhere in cache.h or as some variant of S_IFMT, though I'm
not finding it.
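
To make the bit counting concrete: six octal 7s cover 18 bits, while
0177777 equals 0xFFFF and covers 16.  A small standalone sketch
(MOCK_MODE_MASK, on_disk_mode() and the "extra bit" are invented for
the example and are not anything from cache.h):

```c
#include <assert.h>

/* Low 16 bits: file type plus permission bits. */
#define MOCK_MODE_MASK 0177777u

static unsigned on_disk_mode(unsigned ce_mode)
{
	return ce_mode & MOCK_MODE_MASK;
}

static void check_mode_masks(void)
{
	unsigned extra_bit = 1u << 16;	/* hypothetical in-memory-only bit */
	unsigned mode = 0100644u | extra_bit;

	assert(0177777u == 0xFFFFu);	/* 16 bits */
	assert(0777777u == 0x3FFFFu);	/* 18 bits */

	/* The 16-bit mask drops the extra bit... */
	assert(on_disk_mode(mode) == 0100644u);
	/* ...while the 18-bit mask lets it leak into the output. */
	assert((mode & 0777777u) != 0100644u);
}
```

So the wider mask only gives the expected output as long as bits 16
and 17 happen to be clear.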

> +
> +       if (S_ISSPARSEDIR(ce->ce_mode))
> +               printf("tree ");
> +       else if (S_ISGITLINK(ce->ce_mode))
> +               printf("commit ");
> +       else
> +               printf("blob ");

Perhaps make use of the tree_type, commit_type, and blob_type global constants?
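
Something along these lines is what I have in mind; a rough sketch only.
The three string globals below are local stand-ins for the ones Git
already declares, and the SKETCH_* mode constants and entry_type_name()
are hypothetical names standing in for the S_ISSPARSEDIR()/S_ISGITLINK()
checks in the patch.

```c
/* Stand-ins for Git's globals; in the real tree these already exist. */
static const char *tree_type = "tree";
static const char *commit_type = "commit";
static const char *blob_type = "blob";

/* Assumed mode values, mirroring the usual S_IFMT-style layout. */
#define SKETCH_IFMT      0170000u
#define SKETCH_IFDIR     0040000u
#define SKETCH_IFGITLINK 0160000u

/* Pick the object type name to print for an index entry's mode. */
static const char *entry_type_name(unsigned int mode)
{
	if (mode == SKETCH_IFDIR)	/* sparse directory entry */
		return tree_type;
	if ((mode & SKETCH_IFMT) == SKETCH_IFGITLINK)
		return commit_type;
	return blob_type;
}
```

The caller would then do a single printf("%s ", ...) with the returned
name instead of three separate string literals.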

> +
> +       printf("%s\t%s\n",
> +              oid_to_hex(&ce->oid),
> +              ce->name);
> +}
> +
> +static void print_cache(struct index_state *cache)
> +{
> +       int i;
> +       for (i = 0; i < the_index.cache_nr; i++)
> +               print_cache_entry(the_index.cache[i]);

Why are you passing cache as a parameter, then ignoring it and using the_index?
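
For illustration, a self-contained sketch of the loop using its
parameter. The structs are cut-down stand-ins with only the fields the
loop touches, the entry printer is simplified to the name, and the
return value is added here purely so the sketch can be checked; none of
that is meant as the real helper.

```c
#include <stdio.h>

/* Minimal stand-ins for Git's structs, reduced to the fields used here. */
struct cache_entry {
	const char *name;
};

struct index_state {
	struct cache_entry **cache;
	unsigned int cache_nr;
};

static void print_cache_entry(const struct cache_entry *ce)
{
	printf("%s\n", ce->name);
}

/* Walk the index that was passed in; returns the number of entries printed. */
static unsigned int print_cache(const struct index_state *istate)
{
	unsigned int i;

	for (i = 0; i < istate->cache_nr; i++)
		print_cache_entry(istate->cache[i]);
	return i;
}
```

With the loop written against the parameter, the helper keeps working
if a caller ever hands it an index other than the global one.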

> +}
> +
>  int cmd__read_cache(int argc, const char **argv)
>  {
> +       struct repository *r = the_repository;
>         int i, cnt = 1;
>         const char *name = NULL;
> +       int table = 0;
>
> -       if (argc > 1 && skip_prefix(argv[1], "--print-and-refresh=", &name)) {
> -               argc--;
> -               argv++;
> +       for (++argv, --argc; *argv && starts_with(*argv, "--"); ++argv, --argc) {
> +               if (skip_prefix(*argv, "--print-and-refresh=", &name))
> +                       continue;
> +               if (!strcmp(*argv, "--table"))
> +                       table = 1;
>         }
>
> -       if (argc == 2)
> -               cnt = strtol(argv[1], NULL, 0);
> +       if (argc == 1)
> +               cnt = strtol(argv[0], NULL, 0);
>         setup_git_directory();
>         git_config(git_default_config, NULL);
> +
>         for (i = 0; i < cnt; i++) {
> -               read_cache();
> +               repo_read_index(r);
>                 if (name) {
>                         int pos;
>
> -                       refresh_index(&the_index, REFRESH_QUIET,
> +                       refresh_index(r->index, REFRESH_QUIET,
>                                       NULL, NULL, NULL);
> -                       pos = index_name_pos(&the_index, name, strlen(name));
> +                       pos = index_name_pos(r->index, name, strlen(name));
>                         if (pos < 0)
>                                 die("%s not in index", name);
>                         printf("%s is%s up to date\n", name,
> -                              ce_uptodate(the_index.cache[pos]) ? "" : " not");
> +                              ce_uptodate(r->index->cache[pos]) ? "" : " not");
>                         write_file(name, "%d\n", i);
>                 }
> -               discard_cache();
> +               if (table)
> +                       print_cache(r->index);
> +               discard_index(r->index);
>         }
>         return 0;
>  }
> --
> gitgitgadget

* Re: [PATCH 10/20] sparse-checkout: hold pattern list in index
  2021-02-23 20:14 ` [PATCH 10/20] sparse-checkout: hold pattern list in index Derrick Stolee via GitGitGadget
@ 2021-02-25  7:14   ` Elijah Newren
  0 siblings, 0 replies; 203+ messages in thread
From: Elijah Newren @ 2021-02-25  7:14 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> As we modify the sparse-checkout definition, we perform index operations
> on a pattern_list that only exists in-memory. This allows easy backing
> out in case the index update fails.
>
> However, if the index write itself cares about the sparse-checkout
> pattern set, we need access to that in-memory copy. Place a pointer to
> a 'struct pattern_list' in the index so we can access this on-demand.
> This will be used in the next change which uses the sparse-checkout
> definition to filter out directories that are outsie the sparse cone.

Looks like you still have the "outsie" typo.  ;-)

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  builtin/sparse-checkout.c | 17 ++++++++++-------
>  cache.h                   |  2 ++
>  2 files changed, 12 insertions(+), 7 deletions(-)
>
> diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
> index 2306a9ad98e0..e00b82af727b 100644
> --- a/builtin/sparse-checkout.c
> +++ b/builtin/sparse-checkout.c
> @@ -110,6 +110,8 @@ static int update_working_directory(struct pattern_list *pl)
>         if (is_index_unborn(r->index))
>                 return UPDATE_SPARSITY_SUCCESS;
>
> +       r->index->sparse_checkout_patterns = pl;
> +
>         memset(&o, 0, sizeof(o));
>         o.verbose_update = isatty(2);
>         o.update = 1;
> @@ -138,6 +140,7 @@ static int update_working_directory(struct pattern_list *pl)
>         else
>                 rollback_lock_file(&lock_file);
>
> +       r->index->sparse_checkout_patterns = NULL;
>         return result;
>  }
>
> @@ -517,19 +520,18 @@ static int modify_pattern_list(int argc, const char **argv, enum modify_type m)
>  {
>         int result;
>         int changed_config = 0;
> -       struct pattern_list pl;
> -       memset(&pl, 0, sizeof(pl));
> +       struct pattern_list *pl = xcalloc(1, sizeof(*pl));
>
>         switch (m) {
>         case ADD:
>                 if (core_sparse_checkout_cone)
> -                       add_patterns_cone_mode(argc, argv, &pl);
> +                       add_patterns_cone_mode(argc, argv, pl);
>                 else
> -                       add_patterns_literal(argc, argv, &pl);
> +                       add_patterns_literal(argc, argv, pl);
>                 break;
>
>         case REPLACE:
> -               add_patterns_from_input(&pl, argc, argv);
> +               add_patterns_from_input(pl, argc, argv);
>                 break;
>         }
>
> @@ -539,12 +541,13 @@ static int modify_pattern_list(int argc, const char **argv, enum modify_type m)
>                 changed_config = 1;
>         }
>
> -       result = write_patterns_and_update(&pl);
> +       result = write_patterns_and_update(pl);
>
>         if (result && changed_config)
>                 set_config(MODE_NO_PATTERNS);
>
> -       clear_pattern_list(&pl);
> +       clear_pattern_list(pl);
> +       free(pl);
>         return result;
>  }
>
> diff --git a/cache.h b/cache.h
> index 1336c8d7435e..d75b352f38d3 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -307,6 +307,7 @@ static inline unsigned int canon_mode(unsigned int mode)
>  struct split_index;
>  struct untracked_cache;
>  struct progress;
> +struct pattern_list;
>
>  struct index_state {
>         struct cache_entry **cache;
> @@ -332,6 +333,7 @@ struct index_state {
>         struct mem_pool *ce_mem_pool;
>         struct progress *progress;
>         struct repository *repo;
> +       struct pattern_list *sparse_checkout_patterns;
>  };
>
>  /* Name hashing */
> --
> gitgitgadget
>

* Re: [PATCH 11/20] sparse-index: convert from full to sparse
  2021-02-23 20:14 ` [PATCH 11/20] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
@ 2021-02-25  7:33   ` Elijah Newren
  2021-03-09 21:13     ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Elijah Newren @ 2021-02-25  7:33 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> If we have a full index, then we can convert it to a sparse index by
> replacing directories outside of the sparse cone with sparse directory
> entries. The convert_to_sparse() method does this, when the situation is
> appropriate.
>
> For now, we avoid converting the index to a sparse index if:
>
>  1. the index is split.
>  2. the index is already sparse.
>  3. sparse-checkout is disabled.
>  4. sparse-checkout does not use cone mode.
>
> Finally, we currently limit the conversion to when the
> GIT_TEST_SPARSE_INDEX environment variable is enabled. A mode using Git
> config will be added in a later change.
>
> The trickiest thing about this conversion is that we might not be able
> to mark a directory as a sparse directory just because it is outside the
> sparse cone. There might be unmerged files within that directory, so we
> need to look for those. Also, if there is some strange reason why a file
> is not marked with CE_SKIP_WORKTREE, then we should give up on
> converting that directory. There is still hope that some of its
> subdirectories might be able to convert to sparse, so we keep looking
> deeper.
>
> The conversion process is assisted by the cache-tree extension. This is
> calculated from the full index if it does not already exist. We then
> abandon the cache-tree as it no longer applies to the newly-sparse
> index. Thus, this cache-tree will be recalculated in every
> sparse-full-sparse round-trip until we integrate the cache-tree
> extension with the sparse index.
>
> Some Git commands use the index after writing it. For example, 'git add'
> will update the index, then write it to disk, then read its entries to
> report information. To keep the in-memory index in a full state after
> writing, we re-expand it to a full one after the write. This is wasteful
> for commands that only write the index and do not read from it again,
> but that is only the case until we make those commands "sparse aware."
>
> We can compare the behavior of the sparse-index in
> t1092-sparse-checkout-compability.sh by using GIT_TEST_SPARSE_INDEX=1
> when operating on the 'sparse-index' repo. We can also compare the two
> sparse repos directly, such as comparing their indexes (when expanded to
> full in the case of the 'sparse-index' repo). We also verify that the
> index is actually populated with sparse directory entries.
>
> The 'checkout and reset (mixed)' test is marked for failure when
> comparing a sparse repo to a full repo, but we can compare the two
> sparse-checkout cases directly to ensure that we are not changing the
> behavior when using a sparse index.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  cache-tree.c                             |   3 +
>  cache.h                                  |   2 +
>  read-cache.c                             |  26 ++++-
>  sparse-index.c                           | 139 +++++++++++++++++++++++
>  sparse-index.h                           |   1 +
>  t/t1092-sparse-checkout-compatibility.sh |  61 +++++++++-
>  6 files changed, 227 insertions(+), 5 deletions(-)
>
> diff --git a/cache-tree.c b/cache-tree.c
> index 2fb483d3c083..5f07a39e501e 100644
> --- a/cache-tree.c
> +++ b/cache-tree.c
> @@ -6,6 +6,7 @@
>  #include "object-store.h"
>  #include "replace-object.h"
>  #include "promisor-remote.h"
> +#include "sparse-index.h"
>
>  #ifndef DEBUG_CACHE_TREE
>  #define DEBUG_CACHE_TREE 0
> @@ -442,6 +443,8 @@ int cache_tree_update(struct index_state *istate, int flags)
>         if (i)
>                 return i;
>
> +       ensure_full_index(istate);
> +
>         if (!istate->cache_tree)
>                 istate->cache_tree = cache_tree();
>
> diff --git a/cache.h b/cache.h
> index d75b352f38d3..e8b7d3b4fb33 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -251,6 +251,8 @@ static inline unsigned int create_ce_mode(unsigned int mode)
>  {
>         if (S_ISLNK(mode))
>                 return S_IFLNK;
> +       if (mode == S_IFDIR)
> +               return S_IFDIR;
>         if (S_ISDIR(mode) || S_ISGITLINK(mode))
>                 return S_IFGITLINK;
>         return S_IFREG | ce_permissions(mode);
> diff --git a/read-cache.c b/read-cache.c
> index 97dbf2434f30..67acbf202f4e 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -25,6 +25,7 @@
>  #include "fsmonitor.h"
>  #include "thread-utils.h"
>  #include "progress.h"
> +#include "sparse-index.h"
>
>  /* Mask for the name length in ce_flags in the on-disk index */
>
> @@ -1002,8 +1003,14 @@ int verify_path(const char *path, unsigned mode)
>
>                         c = *path++;
>                         if ((c == '.' && !verify_dotfile(path, mode)) ||
> -                           is_dir_sep(c) || c == '\0')
> +                           is_dir_sep(c))
>                                 return 0;
> +                       /*
> +                        * allow terminating directory separators for
> +                        * sparse directory enries.

enries -> entries

> +                        */
> +                       if (c == '\0')
> +                               return S_ISDIR(mode);

Yaay, much simpler (than the RFC version).

>                 } else if (c == '\\' && protect_ntfs) {
>                         if (is_ntfs_dotgit(path))
>                                 return 0;
> @@ -3061,6 +3068,14 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l
>                                  unsigned flags)
>  {
>         int ret;
> +       int was_full = !istate->sparse_index;
> +
> +       ret = convert_to_sparse(istate);
> +
> +       if (ret) {
> +               warning(_("failed to convert to a sparse-index"));
> +               return ret;
> +       }
>
>         /*
>          * TODO trace2: replace "the_repository" with the actual repo instance
> @@ -3072,6 +3087,9 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l
>         trace2_region_leave_printf("index", "do_write_index", the_repository,
>                                    "%s", get_lock_file_path(lock));
>
> +       if (was_full)
> +               ensure_full_index(istate);
> +
>         if (ret)
>                 return ret;
>         if (flags & COMMIT_LOCK)
> @@ -3162,9 +3180,10 @@ static int write_shared_index(struct index_state *istate,
>                               struct tempfile **temp)
>  {
>         struct split_index *si = istate->split_index;
> -       int ret;
> +       int ret, was_full = !istate->sparse_index;
>
>         move_cache_to_base_index(istate);
> +       convert_to_sparse(istate);
>
>         trace2_region_enter_printf("index", "shared/do_write_index",
>                                    the_repository, "%s", get_tempfile_path(*temp));
> @@ -3172,6 +3191,9 @@ static int write_shared_index(struct index_state *istate,
>         trace2_region_leave_printf("index", "shared/do_write_index",
>                                    the_repository, "%s", get_tempfile_path(*temp));
>
> +       if (was_full)
> +               ensure_full_index(istate);
> +
>         if (ret)
>                 return ret;
>         ret = adjust_shared_perm(get_tempfile_path(*temp));
> diff --git a/sparse-index.c b/sparse-index.c
> index 316cb949b74b..cb1f85635fbc 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -4,6 +4,145 @@
>  #include "tree.h"
>  #include "pathspec.h"
>  #include "trace2.h"
> +#include "cache-tree.h"
> +#include "config.h"
> +#include "dir.h"
> +#include "fsmonitor.h"
> +
> +static struct cache_entry *construct_sparse_dir_entry(
> +                               struct index_state *istate,
> +                               const char *sparse_dir,
> +                               struct cache_tree *tree)
> +{
> +       struct cache_entry *de;
> +
> +       de = make_cache_entry(istate, S_IFDIR, &tree->oid, sparse_dir, 0, 0);
> +
> +       de->ce_flags |= CE_SKIP_WORKTREE;
> +       return de;
> +}
> +
> +/*
> + * Returns the number of entries "inserted" into the index.
> + */
> +static int convert_to_sparse_rec(struct index_state *istate,
> +                                int num_converted,
> +                                int start, int end,
> +                                const char *ct_path, size_t ct_pathlen,
> +                                struct cache_tree *ct)
> +{
> +       int i, can_convert = 1;
> +       int start_converted = num_converted;
> +       enum pattern_match_result match;
> +       int dtype;
> +       struct strbuf child_path = STRBUF_INIT;
> +       struct pattern_list *pl = istate->sparse_checkout_patterns;
> +
> +       /*
> +        * Is the current path outside of the sparse cone?
> +        * Then check if the region can be replaced by a sparse
> +        * directory entry (everything is sparse and merged).
> +        */
> +       match = path_matches_pattern_list(ct_path, ct_pathlen,
> +                                         NULL, &dtype, pl, istate);
> +       if (match != NOT_MATCHED)
> +               can_convert = 0;

Not sure if you saw my comments on the flow control at
https://lore.kernel.org/git/CABPp-BE9wPwmC0=pA4p1_QSRDHrO8RzqfJQdE2NxYZsYL_Rcig@mail.gmail.com/
(the typos elsewhere seem to still be present).  If you saw it and
decided against it, that's fine, just wanted the idea to at least be
floated.

> +
> +       for (i = start; can_convert && i < end; i++) {
> +               struct cache_entry *ce = istate->cache[i];
> +
> +               if (ce_stage(ce) ||
> +                   !(ce->ce_flags & CE_SKIP_WORKTREE))
> +                       can_convert = 0;
> +       }
> +
> +       if (can_convert) {
> +               struct cache_entry *se;
> +               se = construct_sparse_dir_entry(istate, ct_path, ct);
> +
> +               istate->cache[num_converted++] = se;
> +               return 1;
> +       }
> +
> +       for (i = start; i < end; ) {
> +               int count, span, pos = -1;
> +               const char *base, *slash;
> +               struct cache_entry *ce = istate->cache[i];
> +
> +               /*
> +                * Detect if this is a normal entry oustide of any subtree

s/oustide/outside/

> +                * entry.
> +                */
> +               base = ce->name + ct_pathlen;
> +               slash = strchr(base, '/');
> +
> +               if (slash)
> +                       pos = cache_tree_subtree_pos(ct, base, slash - base);
> +
> +               if (pos < 0) {
> +                       istate->cache[num_converted++] = ce;
> +                       i++;
> +                       continue;
> +               }
> +
> +               strbuf_setlen(&child_path, 0);
> +               strbuf_add(&child_path, ce->name, slash - ce->name + 1);
> +
> +               span = ct->down[pos]->cache_tree->entry_count;
> +               count = convert_to_sparse_rec(istate,
> +                                             num_converted, i, i + span,
> +                                             child_path.buf, child_path.len,
> +                                             ct->down[pos]->cache_tree);
> +               num_converted += count;
> +               i += span;
> +       }
> +
> +       strbuf_release(&child_path);
> +       return num_converted - start_converted;
> +}
> +
> +int convert_to_sparse(struct index_state *istate)
> +{
> +       if (istate->split_index || istate->sparse_index ||
> +           !core_apply_sparse_checkout || !core_sparse_checkout_cone)
> +               return 0;
> +
> +       /*
> +        * For now, only create a sparse index with the
> +        * GIT_TEST_SPARSE_INDEX environment variable. We will relax
> +        * this once we have a proper way to opt-in (and later still,
> +        * opt-out).
> +        */
> +       if (!git_env_bool("GIT_TEST_SPARSE_INDEX", 0))
> +               return 0;
> +
> +       if (!istate->sparse_checkout_patterns) {
> +               istate->sparse_checkout_patterns = xcalloc(1, sizeof(struct pattern_list));
> +               if (get_sparse_checkout_patterns(istate->sparse_checkout_patterns) < 0)
> +                       return 0;
> +       }
> +
> +       if (!istate->sparse_checkout_patterns->use_cone_patterns) {
> +               warning(_("attempting to use sparse-index without cone mode"));
> +               return -1;
> +       }
> +
> +       if (cache_tree_update(istate, 0)) {
> +               warning(_("unable to update cache-tree, staying full"));
> +               return -1;
> +       }
> +
> +       remove_fsmonitor(istate);
> +
> +       trace2_region_enter("index", "convert_to_sparse", istate->repo);
> +       istate->cache_nr = convert_to_sparse_rec(istate,
> +                                                0, 0, istate->cache_nr,
> +                                                "", 0, istate->cache_tree);
> +       istate->drop_cache_tree = 1;
> +       istate->sparse_index = 1;
> +       trace2_region_leave("index", "convert_to_sparse", istate->repo);
> +       return 0;
> +}
>
>  static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
>  {
> diff --git a/sparse-index.h b/sparse-index.h
> index 09a20d036c46..64380e121d80 100644
> --- a/sparse-index.h
> +++ b/sparse-index.h
> @@ -3,5 +3,6 @@
>
>  struct index_state;
>  void ensure_full_index(struct index_state *istate);
> +int convert_to_sparse(struct index_state *istate);
>
>  #endif
> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> index 4d789fe86b9d..ca87033d30b0 100755
> --- a/t/t1092-sparse-checkout-compatibility.sh
> +++ b/t/t1092-sparse-checkout-compatibility.sh
> @@ -2,6 +2,9 @@
>
>  test_description='compare full workdir to sparse workdir'
>
> +GIT_TEST_CHECK_CACHE_TREE=0

Same question as I posted for the RFC series:

Why do you need to set this?  I vaguely remember needing to mess with
this when working with sparse checkouts because it did weird stuff but
I don't remember details.  But since your patch touches cache_trees, it
seems weird to show up without explanation.

> +GIT_TEST_SPLIT_INDEX=0
> +
>  . ./test-lib.sh
>
>  test_expect_success 'setup' '
> @@ -121,15 +124,49 @@ run_on_all () {
>  test_all_match () {
>         run_on_all "$@" &&
>         test_cmp full-checkout-out sparse-checkout-out &&
> -       test_cmp full-checkout-err sparse-checkout-err
> +       test_cmp full-checkout-out sparse-index-out &&
> +       test_cmp full-checkout-err sparse-checkout-err &&
> +       test_cmp full-checkout-err sparse-index-err
>  }
>
>  test_sparse_match () {
> -       run_on_sparse $* &&
> +       run_on_sparse "$@" &&
>         test_cmp sparse-checkout-out sparse-index-out &&
>         test_cmp sparse-checkout-err sparse-index-err
>  }
>
> +test_expect_success 'sparse-index contents' '
> +       init_repos &&
> +
> +       test-tool -C sparse-index read-cache --table >cache &&
> +       for dir in folder1 folder2 x
> +       do
> +               TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
> +               grep "040000 tree $TREE $dir/" cache \
> +                       || return 1
> +       done &&

Thanks for making the output look more like ls-tree output; it's
easier to parse that way, at least for me.

> +
> +       GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set folder1 &&
> +
> +       test-tool -C sparse-index read-cache --table >cache &&
> +       for dir in deep folder2 x
> +       do
> +               TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
> +               grep "040000 tree $TREE $dir/" cache \
> +                       || return 1
> +       done &&
> +
> +       GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep/deeper1 &&
> +
> +       test-tool -C sparse-index read-cache --table >cache &&
> +       for dir in deep/deeper2 folder1 folder2 x
> +       do
> +               TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
> +               grep "040000 tree $TREE $dir/" cache \
> +                       || return 1
> +       done
> +'
> +
>  test_expect_success 'expanded in-memory index matches full index' '
>         init_repos &&
>         test_sparse_match test-tool read-cache --expand --table
> @@ -137,6 +174,7 @@ test_expect_success 'expanded in-memory index matches full index' '
>
>  test_expect_success 'status with options' '
>         init_repos &&
> +       test_sparse_match ls &&
>         test_all_match git status --porcelain=v2 &&
>         test_all_match git status --porcelain=v2 -z -u &&
>         test_all_match git status --porcelain=v2 -uno &&
> @@ -273,6 +311,17 @@ test_expect_failure 'checkout and reset (mixed)' '
>         test_all_match git reset update-folder2
>  '
>
> +# Ensure that sparse-index behaves identically to
> +# sparse-checkout with a full index.
> +test_expect_success 'checkout and reset (mixed) [sparse]' '
> +       init_repos &&
> +
> +       test_sparse_match git checkout -b reset-test update-deep &&
> +       test_sparse_match git reset deepest &&
> +       test_sparse_match git reset update-folder1 &&
> +       test_sparse_match git reset update-folder2
> +'
> +
>  test_expect_success 'merge' '
>         init_repos &&
>
> @@ -309,14 +358,20 @@ test_expect_success 'clean' '
>         test_all_match git status --porcelain=v2 &&
>         test_all_match git clean -f &&
>         test_all_match git status --porcelain=v2 &&
> +       test_sparse_match ls &&
> +       test_sparse_match ls folder1 &&
>
>         test_all_match git clean -xf &&
>         test_all_match git status --porcelain=v2 &&
> +       test_sparse_match ls &&
> +       test_sparse_match ls folder1 &&
>
>         test_all_match git clean -xdf &&
>         test_all_match git status --porcelain=v2 &&
> +       test_sparse_match ls &&
> +       test_sparse_match ls folder1 &&
>
> -       test_path_is_dir sparse-checkout/folder1
> +       test_sparse_match test_path_is_dir folder1
>  '
>
>  test_done
> --
> gitgitgadget

I mostly read over the range-diff since it was much shorter.  You've
addressed a number of questions/comments I had on the RFC version, but
there are still some I didn't see a response to, so I reposted them here.

* Re: [PATCH 13/20] unpack-trees: allow sparse directories
  2021-02-23 20:14 ` [PATCH 13/20] unpack-trees: allow sparse directories Derrick Stolee via GitGitGadget
@ 2021-02-25  7:40   ` Elijah Newren
  2021-03-09 21:35     ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Elijah Newren @ 2021-02-25  7:40 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The index_pos_by_traverse_info() currently throws a BUG() when a
> directory entry exists exactly in the index. We need to consider that it
> is possible to have a directory in a sparse index as long as that entry
> is itself marked with the skip-worktree bit.
>
> The negation of the 'pos' variable must be conditioned to only when it
> starts as negative. This is identical behavior as before when the index
> is full.

Same comment on the second paragraph as I made in the RFC series --
https://lore.kernel.org/git/CABPp-BGPJgA4guWHVm3AVS=hM0fTixUpRvJe5i9NnHT-3QJMfw@mail.gmail.com/.
I apologize if I'm repeating stuff you chose to not change, but I
didn't see a response and given the three typos left in previous
patches, I'm unsure whether it was unaddressed on purpose or by
accident.
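
For anyone following along, the return convention that paragraph relies
on can be sketched as below. This is a simplified stand-in over a plain
sorted array of strings, not the real index_name_pos(); name_pos() and
first_candidate() are hypothetical names.

```c
#include <string.h>

/*
 * Binary search over a sorted array of names. Returns the index when
 * the name is found, or -(insertion point) - 1 when it is not.
 */
static int name_pos(const char **names, int nr, const char *name)
{
	int lo = 0, hi = nr;

	while (lo < hi) {
		int mid = lo + (hi - lo) / 2;
		int cmp = strcmp(name, names[mid]);

		if (!cmp)
			return mid;
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return -lo - 1;
}

/* Decode only when negative; a non-negative result is already an index. */
static int first_candidate(int pos)
{
	return pos < 0 ? -pos - 1 : pos;
}
```

With a full index the directory name is never found, so the result is
always negative and always decoded. Once an exact match is possible,
decoding unconditionally would turn a valid index such as 1 into -2,
which is why the negation has to be guarded.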

> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  unpack-trees.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/unpack-trees.c b/unpack-trees.c
> index 4dd99219073a..b324eec2a5d1 100644
> --- a/unpack-trees.c
> +++ b/unpack-trees.c
> @@ -746,9 +746,12 @@ static int index_pos_by_traverse_info(struct name_entry *names,
>         strbuf_make_traverse_path(&name, info, names->path, names->pathlen);
>         strbuf_addch(&name, '/');
>         pos = index_name_pos(o->src_index, name.buf, name.len);
> -       if (pos >= 0)
> -               BUG("This is a directory and should not exist in index");
> -       pos = -pos - 1;
> +       if (pos >= 0) {
> +               if (!o->src_index->sparse_index ||
> +                   !(o->src_index->cache[pos]->ce_flags & CE_SKIP_WORKTREE))
> +                       BUG("This is a directory and should not exist in index");
> +       } else
> +               pos = -pos - 1;
>         if (pos >= o->src_index->cache_nr ||
>             !starts_with(o->src_index->cache[pos]->name, name.buf) ||
>             (pos > 0 && starts_with(o->src_index->cache[pos-1]->name, name.buf)))
> --
> gitgitgadget
>

* Re: [PATCH 15/20] sparse-index: create extension for compatibility
  2021-02-23 20:14 ` [PATCH 15/20] sparse-index: create extension for compatibility Derrick Stolee via GitGitGadget
@ 2021-02-25  7:45   ` Elijah Newren
  2021-03-09 21:45     ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Elijah Newren @ 2021-02-25  7:45 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> Previously, we enabled the sparse index format only using
> GIT_TEST_SPARSE_INDEX=1. This is not a feasible direction for users to
> actually select this mode. Further, sparse directory entries are not
> understood by the index formats as advertised.
>
> We _could_ add a new index version that explicitly adds these
> capabilities, but there are nuances to index formats 2, 3, and 4 that
> are still valuable to select as options. For now, create a repo
> extension, "extensions.sparseIndex", that specifies that the tool
> reading this repository must understand sparse directory entries.

This commit is unchanged from the RFC series, but given your comments
in the design document about how you do intend to create an index
format v5 now, do you want to reference that here?

>
> This change only encodes the extension and enables it when
> GIT_TEST_SPARSE_INDEX=1. Later, we will add a more user-friendly CLI
> mechanism.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/config/extensions.txt |  7 ++++++
>  cache.h                             |  1 +
>  repo-settings.c                     |  7 ++++++
>  repository.h                        |  3 ++-
>  setup.c                             |  3 +++
>  sparse-index.c                      | 38 +++++++++++++++++++++++++----
>  6 files changed, 53 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/config/extensions.txt b/Documentation/config/extensions.txt
> index 4e23d73cdcad..5c86b3648732 100644
> --- a/Documentation/config/extensions.txt
> +++ b/Documentation/config/extensions.txt
> @@ -6,3 +6,10 @@ extensions.objectFormat::
>  Note that this setting should only be set by linkgit:git-init[1] or
>  linkgit:git-clone[1].  Trying to change it after initialization will not
>  work and will produce hard-to-diagnose issues.
> +
> +extensions.sparseIndex::
> +       When combined with `core.sparseCheckout=true` and
> +       `core.sparseCheckoutCone=true`, the index may contain entries
> +       corresponding to directories outside of the sparse-checkout
> +       definition. Versions of Git that do not understand this extension
> +       do not expect directory entries in the index.

I had a wording suggestion for this paragraph in the RFC series --
https://lore.kernel.org/git/CABPp-BFEJE82k4VgkR=Jf7V7sZxZzo2pHMfAGshhi9_vV6iK0w@mail.gmail.com/.
Let me know if you just decided to leave it out so I don't bug you
about stuff you already considered.

> diff --git a/cache.h b/cache.h
> index e8b7d3b4fb33..eea61fba7568 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -1053,6 +1053,7 @@ struct repository_format {
>         int worktree_config;
>         int is_bare;
>         int hash_algo;
> +       int sparse_index;
>         char *work_tree;
>         struct string_list unknown_extensions;
>         struct string_list v1_only_extensions;
> diff --git a/repo-settings.c b/repo-settings.c
> index d63569e4041e..9677d50f9238 100644
> --- a/repo-settings.c
> +++ b/repo-settings.c
> @@ -85,4 +85,11 @@ void prepare_repo_settings(struct repository *r)
>          * removed.
>          */
>         r->settings.command_requires_full_index = 1;
> +
> +       /*
> +        * Initialize this as off.
> +        */
> +       r->settings.sparse_index = 0;
> +       if (!repo_config_get_bool(r, "extensions.sparseindex", &value) && value)
> +               r->settings.sparse_index = 1;
>  }
> diff --git a/repository.h b/repository.h
> index e06a23015697..a45f7520fd9e 100644
> --- a/repository.h
> +++ b/repository.h
> @@ -42,7 +42,8 @@ struct repo_settings {
>
>         int core_multi_pack_index;
>
> -       unsigned command_requires_full_index:1;
> +       unsigned command_requires_full_index:1,
> +                sparse_index:1;
>  };
>
>  struct repository {
> diff --git a/setup.c b/setup.c
> index c04cd25a30df..cd8394564613 100644
> --- a/setup.c
> +++ b/setup.c
> @@ -500,6 +500,9 @@ static enum extension_result handle_extension(const char *var,
>                         return error("invalid value for 'extensions.objectformat'");
>                 data->hash_algo = format;
>                 return EXTENSION_OK;
> +       } else if (!strcmp(ext, "sparseindex")) {
> +               data->sparse_index = 1;
> +               return EXTENSION_OK;
>         }
>         return EXTENSION_UNKNOWN;
>  }
> diff --git a/sparse-index.c b/sparse-index.c
> index 14029fafc750..97b0d0c57857 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -102,19 +102,47 @@ static int convert_to_sparse_rec(struct index_state *istate,
>         return num_converted - start_converted;
>  }
>
> +static int enable_sparse_index(struct repository *repo)
> +{
> +       const char *config_path = repo_git_path(repo, "config.worktree");
> +
> +       if (upgrade_repository_format(1) < 0) {
> +               warning(_("unable to upgrade repository format to enable sparse-index"));
> +               return -1;
> +       }
> +       git_config_set_in_file_gently(config_path,
> +                                     "extensions.sparseIndex",
> +                                     "true");
> +
> +       prepare_repo_settings(repo);
> +       repo->settings.sparse_index = 1;
> +       return 0;
> +}
> +
>  int convert_to_sparse(struct index_state *istate)
>  {
>         if (istate->split_index || istate->sparse_index ||
>             !core_apply_sparse_checkout || !core_sparse_checkout_cone)
>                 return 0;
>
> +       if (!istate->repo)
> +               istate->repo = the_repository;
> +
> +       /*
> +        * The GIT_TEST_SPARSE_INDEX environment variable triggers the
> +        * extensions.sparseIndex config variable to be on.
> +        */
> +       if (git_env_bool("GIT_TEST_SPARSE_INDEX", 0)) {
> +               int err = enable_sparse_index(istate->repo);
> +               if (err < 0)
> +                       return err;
> +       }
> +
>         /*
> -        * For now, only create a sparse index with the
> -        * GIT_TEST_SPARSE_INDEX environment variable. We will relax
> -        * this once we have a proper way to opt-in (and later still,
> -        * opt-out).
> +        * Only convert to sparse if extensions.sparseIndex is set.
>          */
> -       if (!git_env_bool("GIT_TEST_SPARSE_INDEX", 0))
> +       prepare_repo_settings(istate->repo);
> +       if (!istate->repo->settings.sparse_index)
>                 return 0;
>
>         if (!istate->sparse_checkout_patterns) {
> --
> gitgitgadget

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 01/20] sparse-index: design doc and format update
  2021-02-24  1:13   ` Elijah Newren
@ 2021-02-25 15:29     ` Derrick Stolee
  2021-02-25 20:14       ` Elijah Newren
  0 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee @ 2021-02-25 15:29 UTC (permalink / raw)
  To: Elijah Newren, Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee, Matheus Tavares Bernardino

On 2/23/2021 8:13 PM, Elijah Newren wrote:
> On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> +This addition of sparse-directory entries violates expectations about the
> 
> Violates current expectations, yes.  Documentation tends to live a
> long time, and I suspect that 2-3 years from now reading this sentence
> might be jarring since we'll have modified the code to have an updated
> set of expectations.  Maybe a simple "As of time of writing, ..." at
> the beginning of the sentence here?  Or maybe I'm just being overly
> worried...

I was hoping that the phrase "this addition of" places this statement in
a moment of time where sparse-directory entries didn't exist, but now they
will. I'm open to alternatives and will give this some thought.

>> +To complete this phase, the commands `git status` and `git add` will be
>> +integrated with the sparse-index so that they operate with O(Populated)
>> +performance. They will be carefully tested for operations within and
>> +outside the sparse-checkout definition.
> 
> Good plan so far, but there's something else bugging me a little here.
> One thing we noticed with our usage of `sparse-checkout` is that
> although unimportant _tracked_ files go away, leftover build files and
> other untracked files stick around.  So, although 'git status'
> shouldn't have to check the tracked files anymore, it is still going
> to have to look at each of the *untracked* files and compare to
> .gitignore files in order to correctly classify each file as ignored
> or just plain untracked.  Our `sparsify` tool has for a long time
> tried to warn about such files when changing the sparsity
> patterns/modules and had an --remove-old-ignores option for clearing
> out ignored files that are found within directories that are sparse
> (meaning the directories where all files under them are marked
> SKIP_WORKTREE). I was never sure whether a warning was enough, or if
> pushing that option more made sense, but about a month ago my
> colleagues made the tool just auto-invoke that option from other
> sparsify invocations.  To my knowledge, there have been no complaints
> about that being automatically turned on; but there were
> complaints/confusion before about the directories being left around.
> (Of course, non-ignored files are still left around by that option.)
> 
> I'm worried that since sparse-checkout doesn't do anything to help
> with all these untracked/ignored files, we might not get all the
> performance improvements we want from a `git status` with sparse
> directories.  We'll be dropping from walking O(2*HEAD) files (1 source
> + 1 object file) down to O(HEAD) files (just the object files) rather
> than actually getting down to O(Populated).

This definitely seems like a valuable _enhancement_ to sparse-checkout
that could occur in parallel.

For a workaround in the moment: is "git clean -xdf" sufficient to help
these users?

>> +Phase III: Important command speedups
>> +-------------------------------------
>> +
>> +At this point, the patterns for testing and implementing sparse-directory
>> +logic should be relatively stable. This phase focuses on updating some of
>> +the most common builtins that use the index to operate as O(Populated).
>> +Here is a potential list of commands that could be valuable to integrate
>> +at this point:
>> +
>> +* `git commit`
>> +* `git checkout`
>> +* `git merge`
>> +* `git rebase`
>> +
>> +Along with `git status` and `git add`, these commands cover the majority
>> +of users' interactions with the working directory.
> 
> Sounds like a good plan as well.
> 
> I hope we get to make this specific to the merge-ort backend.  It
> localizes the index-related code to (a) a call to unpack_trees()
> called from checkout-like code (which would probably automatically be
> handled by your updates to git checkout), and (b) a single function
> named record_conflicted_index_entries().  I feel it should be pretty
> easy to update.
> 
> In contrast, the idea of attempting to update merge-recursive with
> this kind of change sounds overwhelming.

Yes, I'm hoping to eventually say "if you are in a sparse-checkout, then
you should use ORT by default." Then, if someone opts to do merge-recursive
instead, then they pay the index expansion cost.

While this is very clear in my head, it might be worth mentioning that
explicitly here.

>>  In addition, we can
>> +integrate with these commands:
>> +
>> +* `git grep`
>> +* `git rm`
>> +
>> +These have been proposed as some whose behavior could change when in a
>> +repo with a sparse-checkout definition. It would be good to include this
>> +behavior automatically when using a sparse-index. Some clarity is needed
>> +to make the behavior switch clear to the user.
> 
> Is this leftover from before recent events?  I think this portion of
> the document should just be stricken.
> 
> I argued about how these were buggy as-is due to SKIP_WORKTREE always
> having been an incomplete implementation of an idea at [1], but didn't
> hear a further response from you.  I'm curious if you disagreed with
> my reasoning, or it just slipped through the cracks in a busy schedule
> and this portion of the document was leftover from before.  In my
> opinion, both commands are just buggy and should be fixed for general
> sparse-checkout usage cases, not just for sparse-index.

This is definitely a case of "I've been too busy to read those topics
in detail." I figured that there was something going on that was relevant
to the sparse-checkout and would affect the sparse-index, but I shelved
it in my mind until I had space to think about it directly.

> Anyway, that's a long way of saying I think this section of your
> document is already obsolete.  (Which is a good thing -- less work to
> do to get sparse-index working.  Thanks, Matheus!).

Thank you for your summary, which helps a lot. Thanks, Matheus, too!
If those fixes already include coverage for the behavior, then I'll see
if they also translate to sparse-index tests easily.

I feel like a lot of these later contributions will be more about adding
tests than actually writing a lot of code.

>> +This phase is the first where parallel work might be possible without too
>> +many conflicts between topics.
>> +
>> +Phase IV: The long tail
>> +-----------------------
>> +
>> +This last phase is less a "phase" and more "the new normal" after all of
>> +the previous work.
>> +
>> +To start, the `command_requires_full_index` option could be removed in
>> +favor of expanding only when hitting an API guard.
>> +
>> +There are many Git commands that could use special attention to operate as
>> +O(Populated), while some might be so rare that it is acceptable to leave
>> +them with additional overhead when a sparse-index is present.
>> +
>> +Here are some commands that might be useful to update:
>> +
>> +* `git sparse-checkout set`
>> +* `git am`
>> +* `git clean`
>> +* `git stash`
> 
> Oh, man, git stash is definitely in need of work.  It's still a
> minimalistic transliteration of shell to C, complete with lots of
> process forking and piping output between various low-level commands.
> It might be interesting to rewrite this in terms of the merge
> machinery, though its separate stashing of staged stuff, unstaged
> stuff, and possibly untracked stuff means that there is a sequence of
> two or three merges needed and interesting failure handling to do if
> those merges fail, especially if the user uses --index.  But I
> digress...

I would prefer to leave 'git stash' alone, but it's used by enough
people that I need to care about it eventually.

> Anyway, overall, very nicely written and planned out.  Thanks for
> taking the time to write this all up.

Thanks for your detailed comments!
-Stolee
 


^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 01/20] sparse-index: design doc and format update
  2021-02-25 15:29     ` Derrick Stolee
@ 2021-02-25 20:14       ` Elijah Newren
  0 siblings, 0 replies; 203+ messages in thread
From: Elijah Newren @ 2021-02-25 20:14 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, Git Mailing List,
	Junio C Hamano, Nguyễn Thái Ngọc,
	Jonathan Nieder, Derrick Stolee, Derrick Stolee,
	Matheus Tavares Bernardino

On Thu, Feb 25, 2021 at 7:29 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 2/23/2021 8:13 PM, Elijah Newren wrote:
> > On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
> > <gitgitgadget@gmail.com> wrote:
> >> +This addition of sparse-directory entries violates expectations about the
> >
> > Violates current expectations, yes.  Documentation tends to live a
> > long time, and I suspect that 2-3 years from now reading this sentence
> > might be jarring since we'll have modified the code to have an updated
> > set of expectations.  Maybe a simple "As of time of writing, ..." at
> > the beginning of the sentence here?  Or maybe I'm just being overly
> > worried...
>
> I was hoping that the phrase "this addition of" places this statement in
> a moment of time where sparse-directory entries didn't exist, but now they
> will. I'm open to alternatives and will give this some thought.

I already listed my only suggestion -- adding a "As of time of
writing," at the beginning.  I'm totally open to other
proposals/suggestions, and it's admittedly a minor point so you can
feel free to just ignore it if we can't come up with wording everyone
likes.

>
> >> +To complete this phase, the commands `git status` and `git add` will be
> >> +integrated with the sparse-index so that they operate with O(Populated)
> >> +performance. They will be carefully tested for operations within and
> >> +outside the sparse-checkout definition.
> >
> > Good plan so far, but there's something else bugging me a little here.
> > One thing we noticed with our usage of `sparse-checkout` is that
> > although unimportant _tracked_ files go away, leftover build files and
> > other untracked files stick around.  So, although 'git status'
> > shouldn't have to check the tracked files anymore, it is still going
> > to have to look at each of the *untracked* files and compare to
> > .gitignore files in order to correctly classify each file as ignored
> > or just plain untracked.  Our `sparsify` tool has for a long time
> > tried to warn about such files when changing the sparsity
> > patterns/modules and had an --remove-old-ignores option for clearing
> > out ignored files that are found within directories that are sparse
> > (meaning the directories where all files under them are marked
> > SKIP_WORKTREE). I was never sure whether a warning was enough, or if
> > pushing that option more made sense, but about a month ago my
> > colleagues made the tool just auto-invoke that option from other
> > sparsify invocations.  To my knowledge, there have been no complaints
> > about that being automatically turned on; but there were
> > complaints/confusion before about the directories being left around.
> > (Of course, non-ignored files are still left around by that option.)
> >
> > I'm worried that since sparse-checkout doesn't do anything to help
> > with all these untracked/ignored files, we might not get all the
> > performance improvements we want from a `git status` with sparse
> > directories.  We'll be dropping from walking O(2*HEAD) files (1 source
> > + 1 object file) down to O(HEAD) files (just the object files) rather
> > than actually getting down to O(Populated).
>
> This definitely seems like a valuable _enhancement_ to sparse-checkout
> that could occur in parallel.

Yes, indeed.  Your discussion of performance just reminded me of it,
and since this idea might be important in order to drive the costs
down to O(populated) in practice, I thought I'd mention it.

> For a workaround in the moment: is "git clean -xdf" sufficient to help
> these users?

Not really; that wouldn't remove the ignored stuff (build files) under
sparsified directories which is the point.  (Builds build everything
over here; once you sparsify you have leftover build files from
projects you now don't care about.)  If you convert it to "git clean
-Xdf" then you're closer, but that wouldn't just remove build files
from sparse projects, it'd force users to rebuild all the stuff
they're interested in.

It's close though; what's wanted is basically a special flag that runs
"git clean -Xf <long list of sparsified directories>", without the
user having to specify 300 directories.

However, for now, since I've got a 'sparsify' script anyway (needed
for determining inter-module dependencies and certain directories that
always need to be present, etc.), it just has a flag for running "git
clean -Xf <long list of sparsified directories>" since it has logic to
compute what all those directories are anyway.
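
A minimal sketch of that directory computation, assuming cone mode and
looking only at top-level directories. The helper name sparse_only_dirs
is hypothetical and is not part of the 'sparsify' script; it reads
`git ls-files -t` output on stdin, where an "S" tag marks a
SKIP_WORKTREE entry:

```shell
# Hypothetical helper: print each top-level directory whose tracked
# files are all marked SKIP_WORKTREE.
# Input (stdin): lines in `git ls-files -t` format, "<tag> <path>".
sparse_only_dirs () {
	awk '
	{
		tag = $1
		path = substr($0, 3)            # keep paths containing spaces intact
		n = split(path, parts, "/")
		if (n < 2)
			next                    # skip files at the top level
		dir = parts[1]
		seen[dir] = 1
		if (tag != "S")
			populated[dir] = 1      # at least one entry is not sparse
	}
	END {
		for (d in seen)
			if (!(d in populated))
				print d
	}' | sort
}
```

The output of `git ls-files -t | sparse_only_dirs` could then be passed
to "git clean -Xn -- <dirs>" as a dry run before switching to -Xf. An
empty list needs a guard, since "git clean -X" without paths applies to
everything under the current directory.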

> >> +Phase III: Important command speedups
> >> +-------------------------------------
> >> +
> >> +At this point, the patterns for testing and implementing sparse-directory
> >> +logic should be relatively stable. This phase focuses on updating some of
> >> +the most common builtins that use the index to operate as O(Populated).
> >> +Here is a potential list of commands that could be valuable to integrate
> >> +at this point:
> >> +
> >> +* `git commit`
> >> +* `git checkout`
> >> +* `git merge`
> >> +* `git rebase`
> >> +
> >> +Along with `git status` and `git add`, these commands cover the majority
> >> +of users' interactions with the working directory.
> >
> > Sounds like a good plan as well.
> >
> > I hope we get to make this specific to the merge-ort backend.  It
> > localizes the index-related code to (a) a call to unpack_trees()
> > called from checkout-like code (which would probably automatically be
> > handled by your updates to git checkout), and (b) a single function
> > named record_conflicted_index_entries().  I feel it should be pretty
> > easy to update.
> >
> > In contrast, the idea of attempting to update merge-recursive with
> > this kind of change sounds overwhelming.
>
> Yes, I'm hoping to eventually say "if you are in a sparse-checkout, then
> you should use ORT by default." Then, if someone opts to do merge-recursive
> instead, then they pay the index expansion cost.
>
> While this is very clear in my head, it might be worth mentioning that
> explicitly here.

:+1:

> >>  In addition, we can
> >> +integrate with these commands:
> >> +
> >> +* `git grep`
> >> +* `git rm`
> >> +
> >> +These have been proposed as some whose behavior could change when in a
> >> +repo with a sparse-checkout definition. It would be good to include this
> >> +behavior automatically when using a sparse-index. Some clarity is needed
> >> +to make the behavior switch clear to the user.
> >
> > Is this leftover from before recent events?  I think this portion of
> > the document should just be stricken.
> >
> > > I argued about how these were buggy as-is due to SKIP_WORKTREE always
> > having been an incomplete implementation of an idea at [1], but didn't
> > hear a further response from you.  I'm curious if you disagreed with
> > my reasoning, or it just slipped through the cracks in a busy schedule
> > and this portion of the document was leftover from before.  In my
> > opinion, both commands are just buggy and should be fixed for general
> > sparse-checkout usage cases, not just for sparse-index.
>
> This is definitely a case of "I've been too busy to read those topics
> in detail." I figured that there was something going on that was relevant
> to the sparse-checkout and would affect the sparse-index, but I shelved
> it in my mind until I had space to think about it directly.
>
> > Anyway, that's a long way of saying I think this section of your
> > document is already obsolete.  (Which is a good thing -- less work to
> > do to get sparse-index working.  Thanks, Matheus!).
>
> Thank you for your summary, which helps a lot. Thanks, Matheus, too!
> If those fixes already include coverage for the behavior, then I'll see
> if they also translate to sparse-index tests easily.
>
> I feel like a lot of these later contributions will be more about adding
> tests than actually writing a lot of code.
>
> >> +This phase is the first where parallel work might be possible without too
> >> +many conflicts between topics.
> >> +
> >> +Phase IV: The long tail
> >> +-----------------------
> >> +
> >> +This last phase is less a "phase" and more "the new normal" after all of
> >> +the previous work.
> >> +
> >> +To start, the `command_requires_full_index` option could be removed in
> >> +favor of expanding only when hitting an API guard.
> >> +
> >> +There are many Git commands that could use special attention to operate as
> >> +O(Populated), while some might be so rare that it is acceptable to leave
> >> +them with additional overhead when a sparse-index is present.
> >> +
> >> +Here are some commands that might be useful to update:
> >> +
> >> +* `git sparse-checkout set`
> >> +* `git am`
> >> +* `git clean`
> >> +* `git stash`
> >
> > Oh, man, git stash is definitely in need of work.  It's still a
> > minimalistic transliteration of shell to C, complete with lots of
> > process forking and piping output between various low-level commands.
> > It might be interesting to rewrite this in terms of the merge
> > machinery, though its separate stashing of staged stuff, unstaged
> > stuff, and possibly untracked stuff means that there is a sequence of
> > two or three merges needed and interesting failure handling to do if
> > those merges fail, especially if the user uses --index.  But I
> > digress...
>
> I would prefer to leave 'git stash' alone, but it's used by enough
> people that I need to care about it eventually.

Oh, it can definitely come later.  And I agree about the desirability
of touching that code; I was avoiding it for a long time, but there
was one important sparse-checkout-related bug recently[1] so I've
already been forced to touch it once.  That might mean I'm
(eventually) on the hook to make it sparse-index friendly, especially
since it might involve using merge-ort to do so...

[1] https://lore.kernel.org/git/pull.919.git.git.1605891222.gitgitgadget@gmail.com/

> > Anyway, overall, very nicely written and planned out.  Thanks for
> > taking the time to write this all up.
>
> Thanks for your detailed comments!
> -Stolee

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 00/20] Sparse Index: Design, Format, Tests
  2021-02-23 23:49 ` [PATCH 00/20] Sparse Index: Design, Format, Tests Elijah Newren
@ 2021-02-26 21:28   ` Elijah Newren
  0 siblings, 0 replies; 203+ messages in thread
From: Elijah Newren @ 2021-02-26 21:28 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee

On Tue, Feb 23, 2021 at 3:49 PM Elijah Newren <newren@gmail.com> wrote:
>
> On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
> >
> > Here is the first full patch series submission coming out of the
> > sparse-index RFC [1].
>
> Wahoo!  I'll be reading these over the next few days.

I finally finished the last five patches today, and didn't spot
anything on those to comment on.

Overall, I find the series well constructed, motivated, and explained.
I've left various comments on individual patches, but they're mostly
all minor things that should be easy to cleanup and/or just comment
on.

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 17/20] sparse-checkout: disable sparse-index
  2021-02-23 20:14 ` [PATCH 17/20] sparse-checkout: disable sparse-index Derrick Stolee via GitGitGadget
@ 2021-02-27 12:32   ` SZEDER Gábor
  2021-03-09 20:20     ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: SZEDER Gábor @ 2021-02-27 12:32 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

On Tue, Feb 23, 2021 at 08:14:26PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> We use 'git sparse-checkout init --cone --sparse-index' to toggle the
> sparse-index feature. It makes sense to also disable it when running
> 'git sparse-checkout disable'. This is particularly important because it
> removes the extensions.sparseIndex config option, allowing other tools
> to use this Git repository again.
> 
> This does mean that 'git sparse-checkout init' will not re-enable the
> sparse-index feature, even if it was previously enabled.
> 
> While testing this feature, I noticed that the sparse-index was not
> being written on the first run, but by a second. This was caught by the
> call to 'test-tool read-cache --table'. This requires adjusting some
> assignments to core_apply_sparse_checkout and pl.use_cone_patterns in
> the sparse_checkout_init() logic.
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  builtin/sparse-checkout.c          | 10 +++++++++-
>  t/t1091-sparse-checkout-builtin.sh | 13 +++++++++++++
>  2 files changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
> index ca63e2c64e95..585343fa1972 100644
> --- a/builtin/sparse-checkout.c
> +++ b/builtin/sparse-checkout.c
> @@ -280,6 +280,9 @@ static int set_config(enum sparse_checkout_mode mode)
>  				      "core.sparseCheckoutCone",
>  				      mode == MODE_CONE_PATTERNS ? "true" : NULL);
>  
> +	if (mode == MODE_NO_PATTERNS)
> +		set_sparse_index_config(the_repository, 0);
> +
>  	return 0;
>  }
>  
> @@ -341,10 +344,11 @@ static int sparse_checkout_init(int argc, const char **argv)
>  		the_repository->index->updated_workdir = 1;
>  	}
>  
> +	core_apply_sparse_checkout = 1;
> +
>  	/* If we already have a sparse-checkout file, use it. */
>  	if (res >= 0) {
>  		free(sparse_filename);
> -		core_apply_sparse_checkout = 1;
>  		return update_working_directory(NULL);
>  	}
>  
> @@ -366,6 +370,7 @@ static int sparse_checkout_init(int argc, const char **argv)
>  	add_pattern(strbuf_detach(&pattern, NULL), empty_base, 0, &pl, 0);
>  	strbuf_addstr(&pattern, "!/*/");
>  	add_pattern(strbuf_detach(&pattern, NULL), empty_base, 0, &pl, 0);
> +	pl.use_cone_patterns = init_opts.cone_mode;
>  
>  	return write_patterns_and_update(&pl);
>  }
> @@ -632,6 +637,9 @@ static int sparse_checkout_disable(int argc, const char **argv)
>  	strbuf_addstr(&match_all, "/*");
>  	add_pattern(strbuf_detach(&match_all, NULL), empty_base, 0, &pl, 0);
>  
> +	prepare_repo_settings(the_repository);
> +	the_repository->settings.sparse_index = 0;
> +
>  	if (update_working_directory(&pl))
>  		die(_("error while refreshing working directory"));
>  
> diff --git a/t/t1091-sparse-checkout-builtin.sh b/t/t1091-sparse-checkout-builtin.sh
> index fc64e9ed99f4..ff1ad570a255 100755
> --- a/t/t1091-sparse-checkout-builtin.sh
> +++ b/t/t1091-sparse-checkout-builtin.sh
> @@ -205,6 +205,19 @@ test_expect_success 'sparse-checkout disable' '
>  	check_files repo a deep folder1 folder2
>  '
>  
> +test_expect_success 'sparse-index enabled and disabled' '
> +	git -C repo sparse-checkout init --cone --sparse-index &&
> +	test_cmp_config -C repo true extensions.sparseIndex &&
> +	test-tool -C repo read-cache --table >cache &&
> +	grep " tree " cache &&
> +
> +	git -C repo sparse-checkout disable &&
> +	test-tool -C repo read-cache --table >cache &&
> +	! grep " tree " cache &&
> +	git -C repo config --list >config &&
> +	! grep extensions.sparseindex config
> +'

This test passes with GIT_TEST_SPLIT_INDEX=1 at the moment, because,
unfortunately, GIT_TEST_SPLIT_INDEX has been broken for the past two
years.  However, if I run it with my WIP fixes for that issue [1],
then it will fail:

  +git -C repo sparse-checkout init --cone --sparse-index
  +test_cmp_config -C repo true extensions.sparseIndex
  +test-tool -C repo read-cache --table
  +grep  tree  cache
  error: last command exited with $?=1
  not ok 16 - sparse-index enabled and disabled

https://travis-ci.com/github/szeder/git-cooking-topics-for-travis-ci/jobs/486702444#L2594

[1] Try to run it with:

      https://github.com/szeder/git split-index-fixes

    The code is, I believe, close to final, the commit messages,
    however, are far from being finished.


> +
>  test_expect_success 'cone mode: init and set' '
>  	git -C repo sparse-checkout init --cone &&
>  	git -C repo config --list >config &&
> -- 
> gitgitgadget
> 

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 02/20] t/perf: add performance test for sparse operations
  2021-02-24  2:30   ` Elijah Newren
@ 2021-03-09 20:03     ` Derrick Stolee
  0 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee @ 2021-03-09 20:03 UTC (permalink / raw)
  To: Elijah Newren, Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On 2/23/2021 9:30 PM, Elijah Newren wrote:
> On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> +test_expect_success 'setup repo and indexes' '
>> +       git reset --hard HEAD &&
>> +       # Remove submodules from the example repo, because our
>> +       # duplication of the entire repo creates an unlikely data shape.
>> +       git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
>> +       rm -f .gitmodules &&
>> +       git add .gitmodules &&
> Why not `git rm [-f] .gitmodules` instead of these two commands?  Is
> there something special about .gitmodules that requires this special
> handling?

No, I'm just being sloppy. Will clean up.

>> +       for module in $(awk "{print \$2}" modules)
>> +       do
>> +               git rm $module || return 1
>> +       done &&
>> +       git add . &&
> What does the `git add .` do?  I don't see any changes that weren't
> already git-add'ed or git-rm'ed.

Same here. Thanks.

-Stolee


^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 17/20] sparse-checkout: disable sparse-index
  2021-02-27 12:32   ` SZEDER Gábor
@ 2021-03-09 20:20     ` Derrick Stolee
  2021-03-10 18:20       ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee @ 2021-03-09 20:20 UTC (permalink / raw)
  To: SZEDER Gábor, Derrick Stolee via GitGitGadget
  Cc: git, newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

On 2/27/2021 7:32 AM, SZEDER Gábor wrote:
> On Tue, Feb 23, 2021 at 08:14:26PM +0000, Derrick Stolee via GitGitGadget wrote:
>> +test_expect_success 'sparse-index enabled and disabled' '
>> +	git -C repo sparse-checkout init --cone --sparse-index &&
>> +	test_cmp_config -C repo true extensions.sparseIndex &&
>> +	test-tool -C repo read-cache --table >cache &&
>> +	grep " tree " cache &&
>> +
>> +	git -C repo sparse-checkout disable &&
>> +	test-tool -C repo read-cache --table >cache &&
>> +	! grep " tree " cache &&
>> +	git -C repo config --list >config &&
>> +	! grep extensions.sparseindex config
>> +'
> 
> This test passes with GIT_TEST_SPLIT_INDEX=1 at the moment, because,
> unfortunately, GIT_TEST_SPLIT_INDEX has been broken for the past two
> years.  However, if I run it with my WIP fixes for that issue [1],
> then it will fail:
> 
>   +git -C repo sparse-checkout init --cone --sparse-index
>   +test_cmp_config -C repo true extensions.sparseIndex
>   +test-tool -C repo read-cache --table
>   +grep  tree  cache
>   error: last command exited with $?=1
>   not ok 16 - sparse-index enabled and disabled
> 
> https://travis-ci.com/github/szeder/git-cooking-topics-for-travis-ci/jobs/486702444#L2594
> 
> [1] Try to run it with:
> 
>       https://github.com/szeder/git split-index-fixes
> 
>     The code is, I believe, close to final, the commit messages,
>     however, are far from being finished.

I'll keep that in mind. I should have added a variable
that disables GIT_TEST_SPLIT_INDEX for this test script,
since the sparse-index is (currently) incompatible with
the split-index. I bet that the test is failing because
it isn't actually writing the sparse-directory entry due
to that short-circuit check.

Thanks,
-Stolee

* Re: [PATCH 16/20] sparse-checkout: toggle sparse index from builtin
  2021-02-24 19:11   ` Martin Ågren
@ 2021-03-09 20:52     ` Derrick Stolee
  2021-03-09 21:03       ` Elijah Newren
  2021-03-14 20:08       ` Martin Ågren
  0 siblings, 2 replies; 203+ messages in thread
From: Derrick Stolee @ 2021-03-09 20:52 UTC (permalink / raw)
  To: Martin Ågren, Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Elijah Newren, Junio C Hamano,
	Nguyễn Thái Ngọc Duy, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On 2/24/2021 2:11 PM, Martin Ågren wrote:
> On Wed, 24 Feb 2021 at 00:57, Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> +that is not completely understood by other tools. Enabling sparse index
>> +enables the `extensions.spareseIndex` config value, which might cause
> 
> s/sparese/sparse

Thanks!

 
>> +other tools to stop working with your repository. If you have trouble with
>> +this compatibility, then run `git sparse-checkout sparse-index disable` to
>> +remove this config and rewrite your index to not be sparse.
> 
> While I'm commenting on this..:
> 
> There are several "layers" here, for lack of a better term. "Enabling foo
> enables bar which may cause baz. If you fail due to baz, try dropping
> bar by dropping foo." If I remove any mention of the config variable from
> your text, I get the following.
> 
>  Enabling sparse index might cause other tools to stop working with your
>  repository. If you have trouble with this compatibility, then run `git
>  sparse-checkout sparse-index disable` to rewrite your index to not be
>  sparse.
> 
> I'm tempted to suggest such a rewrite to relieve readers of knowing of
> the middle step, which you could say is more of an implementation
> detail. But if we think that the symptoms / error messages might involve
> "extensions.sparseIndex" or, as would be the case with an older Git
> installation,
> 
>   fatal: unknown repository extensions found:
>           sparseindex
> 
> maybe there is some value in mentioning the config item by name. Just
> thinking out loud, really, and I don't have any strong opinion. I only
> came here to point out the typo in the docs.
 
I agree that the layers are confusing. We could rearrange and have
a similar flow to what you recommend by mentioning the extension at
the end:

**WARNING:** Using a sparse index requires modifying the index in a way
that is not completely understood by other tools. If you have trouble with
this compatibility, then run `git sparse-checkout sparse-index disable` to
rewrite your index to not be sparse. Older versions of Git will not
understand the `sparseIndex` repository extension and may fail to interact
with your repository until it is disabled.

Thanks,
-Stolee

* Re: [PATCH 07/20] test-read-cache: print cache entries with --table
  2021-02-25  7:02   ` Elijah Newren
@ 2021-03-09 21:00     ` Derrick Stolee
  0 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee @ 2021-03-09 21:00 UTC (permalink / raw)
  To: Elijah Newren, Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On 2/25/2021 2:02 AM, Elijah Newren wrote:
> On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> This table is helpful for discovering data in the index to ensure it is
>> being written correctly, especially as we build and test the
>> sparse-index. This table includes an output format similar to 'git
>> ls-tree', but should not be compared to that directly. The biggest
>> reasons are that 'git ls-tree' includes a tree entry for every
>> subdirectory, even those that would not appear as a sparse directory in
>> a sparse-index. Further, 'git ls-tree' does not use a trailing directory
>> separator for its tree rows.
>>
>> This does not print the stat() information for the blobs. That could be
>> added in a future change with another option. The tests that are added
>> in the next few changes care only about the object types and IDs.
>>
>> To make the option parsing slightly more robust, wrap the string
>> comparisons in a loop adapted from test-dir-iterator.c.
>>
>> Care must be taken with the final check for the 'cnt' variable. We
>> continue the expectation that the numerical value is the final argument.
>>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>  t/helper/test-read-cache.c | 50 ++++++++++++++++++++++++++++++--------
>>  1 file changed, 40 insertions(+), 10 deletions(-)
>>
>> diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
>> index 244977a29bdf..e4c3492f7d3e 100644
>> --- a/t/helper/test-read-cache.c
>> +++ b/t/helper/test-read-cache.c
>> @@ -2,35 +2,65 @@
>>  #include "cache.h"
>>  #include "config.h"
>>
>> +static void print_cache_entry(struct cache_entry *ce)
>> +{
>> +       printf("%06o ", ce->ce_mode & 0777777);
> 
> This constant is curious.  I think it's because you want to strip off
> the special in-memory bits of the ce_mode where git stores extra data,
> which would be everything beyond the first 16 bits (as noted in a
> comment near the beginning of cache.h).  But here you keep the first
> 18 bits.  Is CE_UPDATE and CE_REMOVE just 0 in the cases you've viewed
> so this works (but you really should use 0177777 or 0xFFFF), or am I
> off in my guess of what you're trying to do and you do want to see
> those two flags?

You're right, 0177777 is what I want. I'm focusing only on the
mode bits (file type and permissions) that are reported by ls-tree.
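
To make the difference between the two masks concrete, here is a small
shell sketch. The mode value is a made-up example (a regular-file mode
with two extra high bits set), not something read from a real index, and
the variable names are only for illustration.

```shell
# Hypothetical value: 0100644 (regular file) plus bits 16 and 17 set.
mode=$(( 0100644 | 0200000 | 0400000 ))

# The 18-bit mask from the patch lets the two extra bits through.
wide=$(printf '%06o' $(( mode & 0777777 )))

# The 16-bit mask (0177777, the same value as 0xFFFF) strips them.
narrow=$(printf '%06o' $(( mode & 0177777 )))

echo "mask 0777777: $wide"
echo "mask 0177777: $narrow"
```

With the narrower mask only the low 16 bits survive, so the printed
value has the same shape as the mode column in ls-tree output.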

> It also seems surprising to me that this constant isn't already
> defined somewhere in cache.h or as some variant of S_IFMT, though I'm
> not finding it.

I'm not, either.

>> +
>> +       if (S_ISSPARSEDIR(ce->ce_mode))
>> +               printf("tree ");
>> +       else if (S_ISGITLINK(ce->ce_mode))
>> +               printf("commit ");
>> +       else
>> +               printf("blob ");
> 
> Perhaps make use of the tree_type, commit_type, and blob_type global constants?

Today I Learned...

>> +
>> +       printf("%s\t%s\n",
>> +              oid_to_hex(&ce->oid),
>> +              ce->name);
>> +}
>> +
>> +static void print_cache(struct index_state *cache)
>> +{
>> +       int i;
>> +       for (i = 0; i < the_index.cache_nr; i++)
>> +               print_cache_entry(the_index.cache[i]);
> 
> Why are you passing cache as a parameter, then ignoring it and using the_index?

Good catch!

Thanks,
-Stolee

* Re: [PATCH 16/20] sparse-checkout: toggle sparse index from builtin
  2021-03-09 20:52     ` Derrick Stolee
@ 2021-03-09 21:03       ` Elijah Newren
  2021-03-09 21:10         ` Derrick Stolee
  2021-03-14 20:08       ` Martin Ågren
  1 sibling, 1 reply; 203+ messages in thread
From: Elijah Newren @ 2021-03-09 21:03 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Martin Ågren, Derrick Stolee via GitGitGadget,
	Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc Duy, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On Tue, Mar 9, 2021 at 12:52 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 2/24/2021 2:11 PM, Martin Ågren wrote:
> > On Wed, 24 Feb 2021 at 00:57, Derrick Stolee via GitGitGadget
> > <gitgitgadget@gmail.com> wrote:
> >> +that is not completely understood by other tools. Enabling sparse index
> >> +enables the `extensions.spareseIndex` config value, which might cause
> >
> > s/sparese/sparse
>
> Thanks!
>
>
> >> +other tools to stop working with your repository. If you have trouble with
> >> +this compatibility, then run `git sparse-checkout sparse-index disable` to
> >> +remove this config and rewrite your index to not be sparse.
> >
> > While I'm commenting on this..:
> >
> > There are several "layers" here, for lack of a better term. "Enabling foo
> > enables bar which may cause baz. If you fail due to baz, try dropping
> > bar by dropping foo." If I remove any mention of the config variable from
> > your text, I get the following.
> >
> >  Enabling sparse index might cause other tools to stop working with your
> >  repository. If you have trouble with this compatibility, then run `git
> >  sparse-checkout sparse-index disable` to rewrite your index to not be
> >  sparse.
> >
> > I'm tempted to suggest such a rewrite to relieve readers of knowing of
> > the middle step, which you could say is more of an implementation
> > detail. But if we think that the symptoms / error messages might involve
> > "extensions.sparseIndex" or, as would be the case with an older Git
> > installation,
> >
> >   fatal: unknown repository extensions found:
> >           sparseindex
> >
> > maybe there is some value in mentioning the config item by name. Just
> > thinking out loud, really, and I don't have any strong opinion. I only
> > came here to point out the typo in the docs.
>
> I agree that the layers are confusing. We could rearrange and have
> a similar flow to what you recommend by mentioning the extension at
> the end:
>
> **WARNING:** Using a sparse index requires modifying the index in a way
> that is not completely understood by other tools. If you have trouble with
> this compatibility, then run `git sparse-checkout sparse-index disable` to
> rewrite your index to not be sparse. Older versions of Git will not
> understand the `sparseIndex` repository extension and may fail to interact
> with your repository until it is disabled.
>
> Thanks,
> -Stolee

This looks pretty good to me, but could we change the first sentence
to read "...modifying the index in a way that may not yet be
understood by external tools." ?  I'm worried "other tools" might make
people worry about different builtin commands (e.g. fast-export, log).
I also prefer "may" and "yet" because I suspect most external tools
(e.g. git filter-repo just to name a personal example) won't need to
read an index format and will thus be unaffected, and any tools that
do read the index format will probably eventually learn how to work
with the new format.

* Re: [PATCH 16/20] sparse-checkout: toggle sparse index from builtin
  2021-03-09 21:03       ` Elijah Newren
@ 2021-03-09 21:10         ` Derrick Stolee
  2021-03-09 21:38           ` Elijah Newren
  0 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee @ 2021-03-09 21:10 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Martin Ågren, Derrick Stolee via GitGitGadget,
	Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc Duy, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On 3/9/2021 4:03 PM, Elijah Newren wrote:
> On Tue, Mar 9, 2021 at 12:52 PM Derrick Stolee <stolee@gmail.com> wrote:
>>
>> On 2/24/2021 2:11 PM, Martin Ågren wrote:
>>> There are several "layers" here, for lack of a better term. "Enabling foo
>>> enables bar which may cause baz. If you fail due to baz, try dropping
>>> bar by dropping foo." If I remove any mention of the config variable from
>>> your text, I get the following.
>>>
>>>  Enabling sparse index might cause other tools to stop working with your
>>>  repository. If you have trouble with this compatibility, then run `git
>>>  sparse-checkout sparse-index disable` to rewrite your index to not be
>>>  sparse.
>>>
>>> I'm tempted to suggest such a rewrite to relieve readers of knowing of
>>> the middle step, which you could say is more of an implementation
>>> detail. But if we think that the symptoms / error messages might involve
>>> "extensions.sparseIndex" or, as would be the case with an older Git
>>> installation,
>>>
>>>   fatal: unknown repository extensions found:
>>>           sparseindex
>>>
>>> maybe there is some value in mentioning the config item by name. Just
>>> thinking out loud, really, and I don't have any strong opinion. I only
>>> came here to point out the typo in the docs.
>>
>> I agree that the layers are confusing. We could rearrange and have
>> a similar flow to what you recommend by mentioning the extension at
>> the end:
>>
>> **WARNING:** Using a sparse index requires modifying the index in a way
>> that is not completely understood by other tools. If you have trouble with
>> this compatibility, then run `git sparse-checkout sparse-index disable` to
>> rewrite your index to not be sparse. Older versions of Git will not
>> understand the `sparseIndex` repository extension and may fail to interact
>> with your repository until it is disabled.
>>
>> Thanks,
>> -Stolee
> 
> This looks pretty good to me, but could we change the first sentence
> to read "...modifying the index in a way that may not yet be
> understood by external tools." ?  I'm worried "other tools" might make
> people worry about different builtin commands (e.g. fast-export, log).
> I also prefer "may" and "yet" because I suspect most external tools
> (e.g. git filter-repo just to name a personal example) won't need to
> read an index format and will thus be unaffected, and any tools that
> do read the index format will probably eventually learn how to work
> with the new format.

I can make the change, but I do want to point out that the current
use of a repository extension _does_ mean that tools that (correctly)
interact with a Git repository should fail even if they don't try to
access the index file. This is only something to make this work until
we introduce a new index file format version and then can drop the
extension.

"git filter-repo" _should_ be safe because it's really just shelling
to Git, right? I'm more concerned about tools like libgit2.

Thanks,
-Stolee

* Re: [PATCH 11/20] sparse-index: convert from full to sparse
  2021-02-25  7:33   ` Elijah Newren
@ 2021-03-09 21:13     ` Derrick Stolee
  0 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee @ 2021-03-09 21:13 UTC (permalink / raw)
  To: Elijah Newren, Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On 2/25/2021 2:33 AM, Elijah Newren wrote:
> On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:

>> +                       /*
>> +                        * allow terminating directory separators for
>> +                        * sparse directory enries.
> 
> enries -> entries

Thanks.

>> +                        */
>> +                       if (c == '\0')
>> +                               return S_ISDIR(mode);
> 
> Yaay, much simpler (than the RFC version).

>> +       /*
>> +        * Is the current path outside of the sparse cone?
>> +        * Then check if the region can be replaced by a sparse
>> +        * directory entry (everything is sparse and merged).
>> +        */
>> +       match = path_matches_pattern_list(ct_path, ct_pathlen,
>> +                                         NULL, &dtype, pl, istate);
>> +       if (match != NOT_MATCHED)
>> +               can_convert = 0;
> 
> Not sure if you saw my comments on the flow control at
> https://lore.kernel.org/git/CABPp-BE9wPwmC0=pA4p1_QSRDHrO8RzqfJQdE2NxYZsYL_Rcig@mail.gmail.com/
> (the typos elsewhere seem to still be present).  If you saw it and
> decided against it, that's fine, just wanted the idea to at least be
> floated.

Sorry for dropping this one. I _did_ decide against it, and
primarily because the "if (can_convert)" condition contains
a return statement. I like to use 'gotos' for blocks that
will eventually be entered by all paths through the code,
such as "goto cleanup;" but here I find the "can_convert"
check to be clearer.

>> +               /*
>> +                * Detect if this is a normal entry oustide of any subtree
> 
> s/oustide/outside/

Got it.

>> +test_expect_success 'sparse-index contents' '
>> +       init_repos &&
>> +
>> +       test-tool -C sparse-index read-cache --table >cache &&
>> +       for dir in folder1 folder2 x
>> +       do
>> +               TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
>> +               grep "040000 tree $TREE $dir/" cache \
>> +                       || return 1
>> +       done &&
> 
> Thanks for making the output look more like ls-tree output; it's
> easier to parse that way, at least for me.

Excellent.
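
For readers who want to try the kind of check these tests perform
without building test-tool, the sketch below runs the same grep against
hand-written rows. It assumes the "<mode> <type> <oid><TAB><path>"
layout printed by print_cache_entry() in patch 07; the object ID is a
placeholder and has_sparse_dir is a hypothetical helper, not something
from the series.

```shell
# Placeholder object ID; it does not name a real object.
oid=0123456789abcdef0123456789abcdef01234567

# Hand-written rows: a full index lists only blobs, while a sparse
# index may also carry a directory row with mode 040000.
full_table=$(printf '100644 blob %s\tfolder1/a\n' "$oid")
sparse_table=$(printf '100644 blob %s\ta\n040000 tree %s\tfolder1/\n' "$oid" "$oid")

# Hypothetical helper: succeeds if any row has "tree" in the type column.
has_sparse_dir () {
	printf '%s\n' "$1" | grep -q " tree "
}

if has_sparse_dir "$sparse_table"; then sparse_result=sparse; else sparse_result=full; fi
if has_sparse_dir "$full_table"; then full_result=sparse; else full_result=full; fi

echo "sparse table: $sparse_result"
echo "full table: $full_result"
```

The pattern only looks at the type column, so it cannot tell which
directory was collapsed; the quoted test adds the mode, OID, and path
for that.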
 
> I mostly read over the range-diff since it was much shorter.  You've
> addressed a number of questions/comments I had on the RFC version, but
> there's still some I didn't see a response to so I reposted them here.
 
Thanks for being diligent!
-Stolee

* Re: [PATCH 13/20] unpack-trees: allow sparse directories
  2021-02-25  7:40   ` Elijah Newren
@ 2021-03-09 21:35     ` Derrick Stolee
  2021-03-09 21:39       ` Elijah Newren
  0 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee @ 2021-03-09 21:35 UTC (permalink / raw)
  To: Elijah Newren, Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On 2/25/2021 2:40 AM, Elijah Newren wrote:
> On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> The index_pos_by_traverse_info() currently throws a BUG() when a
>> directory entry exists exactly in the index. We need to consider that it
>> is possible to have a directory in a sparse index as long as that entry
>> is itself marked with the skip-worktree bit.
>>
>> The negation of the 'pos' variable must be conditioned to only when it
>> starts as negative. This is identical behavior as before when the index
>> is full.
> 
> Same comment on the second paragraph as I made in the RFC series --
> https://lore.kernel.org/git/CABPp-BGPJgA4guWHVm3AVS=hM0fTixUpRvJe5i9NnHT-3QJMfw@mail.gmail.com/.
> I apologize if I'm repeating stuff you chose to not change, but I
> didn't see a response and given the three typos left in previous
> patches, I'm unsure whether it was unaddressed on purpose or on
> accident.

Yes, I dropped this one. How about this?

    The 'pos' variable is assigned a negative value if an exact match is not
    found. Since a directory name can be an exact match, it is no longer an
    error to have a nonnegative 'pos' value.

Thanks,
-Stolee

* Re: [PATCH 16/20] sparse-checkout: toggle sparse index from builtin
  2021-03-09 21:10         ` Derrick Stolee
@ 2021-03-09 21:38           ` Elijah Newren
  0 siblings, 0 replies; 203+ messages in thread
From: Elijah Newren @ 2021-03-09 21:38 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Martin Ågren, Derrick Stolee via GitGitGadget,
	Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc Duy, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On Tue, Mar 9, 2021 at 1:10 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 3/9/2021 4:03 PM, Elijah Newren wrote:
> > On Tue, Mar 9, 2021 at 12:52 PM Derrick Stolee <stolee@gmail.com> wrote:
> >>
> >> On 2/24/2021 2:11 PM, Martin Ågren wrote:
> >>> There are several "layers" here, for lack of a better term. "Enabling foo
> >>> enables bar which may cause baz. If you fail due to baz, try dropping
> >>> bar by dropping foo." If I remove any mention of the config variable from
> >>> your text, I get the following.
> >>>
> >>>  Enabling sparse index might cause other tools to stop working with your
> >>>  repository. If you have trouble with this compatibility, then run `git
> >>>  sparse-checkout sparse-index disable` to rewrite your index to not be
> >>>  sparse.
> >>>
> >>> I'm tempted to suggest such a rewrite to relieve readers of knowing of
> >>> the middle step, which you could say is more of an implementation
> >>> detail. But if we think that the symptoms / error messages might involve
> >>> "extensions.sparseIndex" or, as would be the case with an older Git
> >>> installation,
> >>>
> >>>   fatal: unknown repository extensions found:
> >>>           sparseindex
> >>>
> >>> maybe there is some value in mentioning the config item by name. Just
> >>> thinking out loud, really, and I don't have any strong opinion. I only
> >>> came here to point out the typo in the docs.
> >>
> >> I agree that the layers are confusing. We could rearrange and have
> >> a similar flow to what you recommend by mentioning the extension at
> >> the end:
> >>
> >> **WARNING:** Using a sparse index requires modifying the index in a way
> >> that is not completely understood by other tools. If you have trouble with
> >> this compatibility, then run `git sparse-checkout sparse-index disable` to
> >> rewrite your index to not be sparse. Older versions of Git will not
> >> understand the `sparseIndex` repository extension and may fail to interact
> >> with your repository until it is disabled.
> >>
> >> Thanks,
> >> -Stolee
> >
> > This looks pretty good to me, but could we change the first sentence
> > to read "...modifying the index in a way that may not yet be
> > understood by external tools." ?  I'm worried "other tools" might make
> > people worry about different builtin commands (e.g. fast-export, log).
> > I also prefer "may" and "yet" because I suspect most external tools
> > (e.g. git filter-repo just to name a personal example) won't need to
> > read an index format and will thus be unaffected, and any tools that
> > do read the index format will probably eventually learn how to work
> > with the new format.
>
> I can make the change, but I do want to point out that the current
> use of a repository extension _does_ mean that tools that (correctly)
> interact with a Git repository should fail even if they don't try to
> access the index file. This is only something to make this work until
> we introduce a new index file format version and then can drop the
> extension.

Good point, though...

> "git filter-repo" _should_ be safe because it's really just shelling
> to Git, right? I'm more concerned about tools like libgit2.

Yes, libgit2, jgit, and similar tools are clearly going to be
deeply affected.  Those are of concern, but I suspect most users
when they see "external tools" will be thinking of the large multitude
of scripts out there that just shell out to git under the hood to
provide some higher level wrapper of some sort.  And anything that
operates that way won't be affected directly by the repository
extension.

I'm not sure I'd even mark things that shell out to git as _should_ be
safe.  In general, scripts can make all kinds of assumptions on
interpreting output, and I suspect some of those may become
invalidated by this new feature.  We have a recent guidepost that's
very close to home on that too -- git stash had *3* different bugs in
it once sparse-checkouts were introduced, based on the fact that it
was designed as a just-shell-out-to-low-level-git-commands script and
it made assumptions on how those commands worked together.  See
https://lore.kernel.org/git/ccfedc7140dbf63ba26a15f93bd3885180b26517.1606861519.git.gitgitgadget@gmail.com/.
Sure git-stash is a builtin (supposedly, anyway), but external tools
can make similar logical jumps.

* Re: [PATCH 13/20] unpack-trees: allow sparse directories
  2021-03-09 21:35     ` Derrick Stolee
@ 2021-03-09 21:39       ` Elijah Newren
  0 siblings, 0 replies; 203+ messages in thread
From: Elijah Newren @ 2021-03-09 21:39 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, Git Mailing List,
	Junio C Hamano, Nguyễn Thái Ngọc,
	Jonathan Nieder, Derrick Stolee, Derrick Stolee

On Tue, Mar 9, 2021 at 1:35 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 2/25/2021 2:40 AM, Elijah Newren wrote:
> > On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
> > <gitgitgadget@gmail.com> wrote:
> >>
> >> From: Derrick Stolee <dstolee@microsoft.com>
> >>
> >> The index_pos_by_traverse_info() currently throws a BUG() when a
> >> directory entry exists exactly in the index. We need to consider that it
> >> is possible to have a directory in a sparse index as long as that entry
> >> is itself marked with the skip-worktree bit.
> >>
> >> The negation of the 'pos' variable must be conditioned to only when it
> >> starts as negative. This is identical behavior as before when the index
> >> is full.
> >
> > Same comment on the second paragraph as I made in the RFC series --
> > https://lore.kernel.org/git/CABPp-BGPJgA4guWHVm3AVS=hM0fTixUpRvJe5i9NnHT-3QJMfw@mail.gmail.com/.
> > I apologize if I'm repeating stuff you chose to not change, but I
> > didn't see a response and given the three typos left in previous
> > patches, I'm unsure whether it was unaddressed on purpose or on
> > accident.
>
> Yes, I dropped this one. How about this?
>
>     The 'pos' variable is assigned a negative value if an exact match is not
>     found. Since a directory name can be an exact match, it is no longer an
>     error to have a nonnegative 'pos' value.

I like it!

* Re: [PATCH 15/20] sparse-index: create extension for compatibility
  2021-02-25  7:45   ` Elijah Newren
@ 2021-03-09 21:45     ` Derrick Stolee
  0 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee @ 2021-03-09 21:45 UTC (permalink / raw)
  To: Elijah Newren, Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On 2/25/2021 2:45 AM, Elijah Newren wrote:
> On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> Previously, we enabled the sparse index format only using
>> GIT_TEST_SPARSE_INDEX=1. This is not a feasible direction for users to
>> actually select this mode. Further, sparse directory entries are not
>> understood by the index formats as advertised.
>>
>> We _could_ add a new index version that explicitly adds these
>> capabilities, but there are nuances to index formats 2, 3, and 4 that
>> are still valuable to select as options. For now, create a repo
>> extension, "extensions.sparseIndex", that specifies that the tool
>> reading this repository must understand sparse directory entries.
> 
> This commit is unchanged from the RFC series, but given your comments
> in the design document about how you do intend to create an index
> format v5 now, do you want to reference that here?

I'll insert detail about v5.
 
>> +extensions.sparseIndex::
>> +       When combined with `core.sparseCheckout=true` and
>> +       `core.sparseCheckoutCone=true`, the index may contain entries
>> +       corresponding to directories outside of the sparse-checkout
>> +       definition. Versions of Git that do not understand this extension
>> +       do not expect directory entries in the index.
> 
> I had a wording suggestion for this paragraph in the RFC series --
> https://lore.kernel.org/git/CABPp-BFEJE82k4VgkR=Jf7V7sZxZzo2pHMfAGshhi9_vV6iK0w@mail.gmail.com/.
> Let me know if you just decided to leave it out so I don't bug you
> about stuff you already considered.

I'll take your suggestion, thanks.

-Stolee

* Re: [PATCH 17/20] sparse-checkout: disable sparse-index
  2021-03-09 20:20     ` Derrick Stolee
@ 2021-03-10 18:20       ` Derrick Stolee
  0 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee @ 2021-03-10 18:20 UTC (permalink / raw)
  To: SZEDER Gábor, Derrick Stolee via GitGitGadget
  Cc: git, newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

On 3/9/2021 3:20 PM, Derrick Stolee wrote:
> On 2/27/2021 7:32 AM, SZEDER Gábor wrote:
>> On Tue, Feb 23, 2021 at 08:14:26PM +0000, Derrick Stolee via GitGitGadget wrote:
>>> +test_expect_success 'sparse-index enabled and disabled' '
>>> +	git -C repo sparse-checkout init --cone --sparse-index &&
>>> +	test_cmp_config -C repo true extensions.sparseIndex &&
>>> +	test-tool -C repo read-cache --table >cache &&
>>> +	grep " tree " cache &&
>>> +
>>> +	git -C repo sparse-checkout disable &&
>>> +	test-tool -C repo read-cache --table >cache &&
>>> +	! grep " tree " cache &&
>>> +	git -C repo config --list >config &&
>>> +	! grep extensions.sparseindex config
>>> +'
>>
>> This test passes with GIT_TEST_SPLIT_INDEX=1 at the moment, because,
>> unfortunately, GIT_TEST_SPLIT_INDEX has been broken for the past two
>> years.  However, if I run it with my WIP fixes for that issue [1],
>> then it will fail:
>>
>>   +git -C repo sparse-checkout init --cone --sparse-index
>>   +test_cmp_config -C repo true extensions.sparseIndex
>>   +test-tool -C repo read-cache --table
>>   +grep  tree  cache
>>   error: last command exited with $?=1
>>   not ok 16 - sparse-index enabled and disabled
>>
>> https://travis-ci.com/github/szeder/git-cooking-topics-for-travis-ci/jobs/486702444#L2594
>>
>> [1] Try to run it with:
>>
>>       https://github.com/szeder/git split-index-fixes
>>
>>     The code is, I believe, close to final, the commit messages,
>>     however, are far from being finished.
> 
> I'll keep that in mind. I should have added a variable
> that disables GIT_TEST_SPLIT_INDEX for this test script,
> since the sparse-index is (currently) incompatible with
> the split-index. I bet that the test is failing because
> it isn't actually writing the sparse-directory entry due
> to that short-circuit check.

The next version will include GIT_TEST_SPLIT_INDEX=0 at
the start and that will make it work with your branch.
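
A minimal sketch of why a plain assignment at the top of the script is
enough, assuming the variable arrives through the environment (for
example from a CI job) and the test library is sourced after the
assignment. Nothing here runs the real test suite; it only shows the
shell behavior being relied on.

```shell
# Simulate a CI environment that turns split-index testing on globally.
GIT_TEST_SPLIT_INDEX=1
export GIT_TEST_SPLIT_INDEX

# At the top of the test script: force the setting off for this script.
# The variable is already exported, so the new value is what children see.
GIT_TEST_SPLIT_INDEX=0

# Any process started from here on inherits the overridden value.
seen=$(sh -c 'echo "$GIT_TEST_SPLIT_INDEX"')
echo "GIT_TEST_SPLIT_INDEX seen by child process: $seen"
```

Because the override lives in the script itself, it applies no matter
how the suite is invoked.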

Thanks,
-Stolee

* [PATCH v2 00/20] Sparse Index: Design, Format, Tests
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (20 preceding siblings ...)
  2021-02-23 23:49 ` [PATCH 00/20] Sparse Index: Design, Format, Tests Elijah Newren
@ 2021-03-10 19:30 ` Derrick Stolee via GitGitGadget
  2021-03-10 19:30   ` [PATCH v2 01/20] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
                     ` (21 more replies)
  21 siblings, 22 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee

Here is the first full patch series submission coming out of the
sparse-index RFC [1].

[1]
https://lore.kernel.org/git/pull.847.git.1611596533.gitgitgadget@gmail.com/

I won't waste too much space here, because PATCH 1 includes a sizeable
design document that describes the feature, the reasoning behind it, and my
plan for getting this implemented widely throughout the codebase.

There are some new things here that were not in the RFC:

 * Design doc and format updates. (Patch 1)
 * Performance test script. (Patches 2 and 20)

Notably missing in this series from the RFC:

 * The mega-patch inserting ensure_full_index() throughout the codebase.
   That will be a follow-up series to this one.
 * The integrations with git status and git add to demonstrate the improved
   performance. Those will also appear in their own series later.

I plan to keep my latest work in this area in my 'sparse-index/wip' branch
[2]. It includes all of the work from the RFC right now, updated with the
work from this series.

[2] https://github.com/derrickstolee/git/tree/sparse-index/wip


Updates in V2
=============

 * Various typos and awkward grammar are fixed.
 * Cleaned up unnecessary commands in p2000-sparse-operations.sh.
 * Added a comment to the sparse_index member of struct index_state.
 * Used tree_type, commit_type, and blob_type in test-read-cache.c.

Thanks, -Stolee

Derrick Stolee (20):
  sparse-index: design doc and format update
  t/perf: add performance test for sparse operations
  t1092: clean up script quoting
  sparse-index: add guard to ensure full index
  sparse-index: implement ensure_full_index()
  t1092: compare sparse-checkout to sparse-index
  test-read-cache: print cache entries with --table
  test-tool: don't force full index
  unpack-trees: ensure full index
  sparse-checkout: hold pattern list in index
  sparse-index: convert from full to sparse
  submodule: sparse-index should not collapse links
  unpack-trees: allow sparse directories
  sparse-index: check index conversion happens
  sparse-index: create extension for compatibility
  sparse-checkout: toggle sparse index from builtin
  sparse-checkout: disable sparse-index
  cache-tree: integrate with sparse directory entries
  sparse-index: loose integration with cache_tree_verify()
  p2000: add sparse-index repos

 Documentation/config/extensions.txt      |   8 +
 Documentation/git-sparse-checkout.txt    |  14 ++
 Documentation/technical/index-format.txt |   7 +
 Documentation/technical/sparse-index.txt | 173 ++++++++++++++
 Makefile                                 |   1 +
 builtin/sparse-checkout.c                |  44 +++-
 cache-tree.c                             |  40 ++++
 cache.h                                  |  18 +-
 read-cache.c                             |  35 ++-
 repo-settings.c                          |  15 ++
 repository.c                             |  11 +-
 repository.h                             |   3 +
 setup.c                                  |   3 +
 sparse-index.c                           | 290 +++++++++++++++++++++++
 sparse-index.h                           |  11 +
 t/README                                 |   3 +
 t/helper/test-read-cache.c               |  66 +++++-
 t/perf/p2000-sparse-operations.sh        | 102 ++++++++
 t/t1091-sparse-checkout-builtin.sh       |  13 +
 t/t1092-sparse-checkout-compatibility.sh | 136 +++++++++--
 unpack-trees.c                           |  16 +-
 21 files changed, 969 insertions(+), 40 deletions(-)
 create mode 100644 Documentation/technical/sparse-index.txt
 create mode 100644 sparse-index.c
 create mode 100644 sparse-index.h
 create mode 100755 t/perf/p2000-sparse-operations.sh


base-commit: 966e671106b2fd38301e7c344c754fd118d0bb07
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-883%2Fderrickstolee%2Fsparse-index%2Fformat-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-883/derrickstolee/sparse-index/format-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/883

Range-diff vs v1:

  1:  daa9a6bcefbc !  1:  2fe413fdac80 sparse-index: design doc and format update
     @@ Documentation/technical/sparse-index.txt (new)
      +If we need to discover the details for paths within that directory, we
      +can parse trees to find that list.
      +
     -+This addition of sparse-directory entries violates expectations about the
     ++At time of writing, sparse-directory entries violate expectations about the
      +index format and its in-memory data structure. There are many consumers in
      +the codebase that expect to iterate through all of the index entries and
      +see only files. In addition, they expect to see all files at `HEAD`. One
     @@ Documentation/technical/sparse-index.txt (new)
      +* `git merge`
      +* `git rebase`
      +
     ++Hopefully, commands such as `git merge` and `git rebase` can benefit
     ++instead from merge algorithms that do not use the index as a data
     ++structure, such as the merge-ORT strategy. As these topics mature, we
     ++may enable the ORT strategy by default for repositories using the
     ++sparse-index feature.
     ++
      +Along with `git status` and `git add`, these commands cover the majority
      +of users' interactions with the working directory. In addition, we can
      +integrate with these commands:
  2:  a8c6322a3dbe !  2:  540ab5495065 t/perf: add performance test for sparse operations
     @@ t/perf/p2000-sparse-operations.sh (new)
      +	# Remove submodules from the example repo, because our
      +	# duplication of the entire repo creates an unlikely data shape.
      +	git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
     -+	rm -f .gitmodules &&
     -+	git add .gitmodules &&
     ++	git rm -f .gitmodules &&
      +	for module in $(awk "{print \$2}" modules)
      +	do
      +		git rm $module || return 1
      +	done &&
     -+	git add . &&
      +	git commit -m "remove submodules" &&
      +
      +	echo bogus >a &&
  3:  6e783c88821e =  3:  5cbedb377b37 t1092: clean up script quoting
  4:  01da4c48a1fa =  4:  6e21f776e883 sparse-index: add guard to ensure full index
  5:  2b83989fbcd3 !  5:  399ddb0bad56 sparse-index: implement ensure_full_index()
     @@ cache.h: struct index_state {
       		 updated_skipworktree : 1,
      -		 fsmonitor_has_run_once : 1;
      +		 fsmonitor_has_run_once : 1,
     ++
     ++		 /*
     ++		  * sparse_index == 1 when sparse-directory
     ++		  * entries exist. Requires sparse-checkout
     ++		  * in cone mode.
     ++		  */
      +		 sparse_index : 1;
       	struct hashmap name_hash;
       	struct hashmap dir_hash;
  6:  c9910a37579c =  6:  eac2db5efc22 t1092: compare sparse-checkout to sparse-index
  7:  3d92df7a0cf9 !  7:  e9c82d2eda82 test-read-cache: print cache entries with --table
     @@ Commit message
      
       ## t/helper/test-read-cache.c ##
      @@
     + #include "test-tool.h"
       #include "cache.h"
       #include "config.h"
     - 
     ++#include "blob.h"
     ++#include "commit.h"
     ++#include "tree.h"
     ++
      +static void print_cache_entry(struct cache_entry *ce)
      +{
     -+	printf("%06o ", ce->ce_mode & 0777777);
     ++	const char *type;
     ++	printf("%06o ", ce->ce_mode & 0177777);
      +
      +	if (S_ISSPARSEDIR(ce->ce_mode))
     -+		printf("tree ");
     ++		type = tree_type;
      +	else if (S_ISGITLINK(ce->ce_mode))
     -+		printf("commit ");
     ++		type = commit_type;
      +	else
     -+		printf("blob ");
     ++		type = blob_type;
      +
     -+	printf("%s\t%s\n",
     ++	printf("%s %s\t%s\n",
     ++	       type,
      +	       oid_to_hex(&ce->oid),
      +	       ce->name);
      +}
      +
     -+static void print_cache(struct index_state *cache)
     ++static void print_cache(struct index_state *istate)
      +{
      +	int i;
     -+	for (i = 0; i < the_index.cache_nr; i++)
     -+		print_cache_entry(the_index.cache[i]);
     ++	for (i = 0; i < istate->cache_nr; i++)
     ++		print_cache_entry(istate->cache[i]);
      +}
     -+
     + 
       int cmd__read_cache(int argc, const char **argv)
       {
      +	struct repository *r = the_repository;
  8:  94373e2bfbbc !  8:  243541fc5820 test-tool: don't force full index
     @@ Commit message
      
       ## t/helper/test-read-cache.c ##
      @@
     - #include "test-tool.h"
     - #include "cache.h"
     - #include "config.h"
     + #include "blob.h"
     + #include "commit.h"
     + #include "tree.h"
      +#include "sparse-index.h"
       
       static void print_cache_entry(struct cache_entry *ce)
  9:  e71f033c2871 =  9:  48f65093b3da unpack-trees: ensure full index
 10:  f86d3dc154d1 ! 10:  83aac8b7a1ec sparse-checkout: hold pattern list in index
     @@ Commit message
          pattern set, we need access to that in-memory copy. Place a pointer to
          a 'struct pattern_list' in the index so we can access this on-demand.
          This will be used in the next change which uses the sparse-checkout
     -    definition to filter out directories that are outsie the sparse cone.
     +    definition to filter out directories that are outside the sparse cone.
      
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
 11:  a2d77c23a0cb ! 11:  f6db0c27a285 sparse-index: convert from full to sparse
     @@ read-cache.c: int verify_path(const char *path, unsigned mode)
       				return 0;
      +			/*
      +			 * allow terminating directory separators for
     -+			 * sparse directory enries.
     ++			 * sparse directory entries.
      +			 */
      +			if (c == '\0')
      +				return S_ISDIR(mode);
     @@ sparse-index.c
      +		struct cache_entry *ce = istate->cache[i];
      +
      +		/*
     -+		 * Detect if this is a normal entry oustide of any subtree
     ++		 * Detect if this is a normal entry outside of any subtree
      +		 * entry.
      +		 */
      +		base = ce->name + ct_pathlen;
 12:  4405a9115c3b = 12:  f2a3e7298798 submodule: sparse-index should not collapse links
 13:  fda23f07e6a2 ! 13:  6f1ebe6ccc08 unpack-trees: allow sparse directories
     @@ Commit message
          is possible to have a directory in a sparse index as long as that entry
          is itself marked with the skip-worktree bit.
      
     -    The negation of the 'pos' variable must be conditioned to only when it
     -    starts as negative. This is identical behavior as before when the index
     -    is full.
     +    The 'pos' variable is assigned a negative value if an exact match is not
     +    found. Since a directory name can be an exact match, it is no longer an
     +    error to have a nonnegative 'pos' value.
      
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
 14:  7d4627574bb8 = 14:  3fa684b315fb sparse-index: check index conversion happens
 15:  564503f78784 ! 15:  d74576d677f6 sparse-index: create extension for compatibility
     @@ Commit message
      
          We _could_ add a new index version that explicitly adds these
          capabilities, but there are nuances to index formats 2, 3, and 4 that
     -    are still valuable to select as options. For now, create a repo
     -    extension, "extensions.sparseIndex", that specifies that the tool
     -    reading this repository must understand sparse directory entries.
     +    are still valuable to select as options. Until we add index format
     +    version 5, create a repo extension, "extensions.sparseIndex", that
     +    specifies that the tool reading this repository must understand sparse
     +    directory entries.
      
          This change only encodes the extension and enables it when
          GIT_TEST_SPARSE_INDEX=1. Later, we will add a more user-friendly CLI
     @@ Documentation/config/extensions.txt: extensions.objectFormat::
      +	When combined with `core.sparseCheckout=true` and
      +	`core.sparseCheckoutCone=true`, the index may contain entries
      +	corresponding to directories outside of the sparse-checkout
     -+	definition. Versions of Git that do not understand this extension
     -+	do not expect directory entries in the index.
     ++	definition in lieu of containing each path under such directories.
     ++	Versions of Git that do not understand this extension do not
     ++	expect directory entries in the index.
      
       ## cache.h ##
      @@ cache.h: struct repository_format {
 16:  6d6b230e3318 ! 16:  e530ca5f668d sparse-checkout: toggle sparse index from builtin
     @@ Documentation/git-sparse-checkout.txt: To avoid interfering with other worktrees
      +a sparse index until they are properly integrated with the feature.
      ++
      +**WARNING:** Using a sparse index requires modifying the index in a way
     -+that is not completely understood by other tools. Enabling sparse index
     -+enables the `extensions.spareseIndex` config value, which might cause
     -+other tools to stop working with your repository. If you have trouble with
     -+this compatibility, then run `git sparse-checkout sparse-index disable` to
     -+remove this config and rewrite your index to not be sparse.
     ++that is not completely understood by external tools. If you have trouble
     ++with this compatibility, then run `git sparse-checkout sparse-index disable`
     ++to rewrite your index to not be sparse. Older versions of Git will not
     ++understand the `sparseIndex` repository extension and may fail to interact
     ++with your repository until it is disabled.
       
       'set'::
       	Write a set of patterns to the sparse-checkout file, as given as
 17:  bcf960ef2362 = 17:  42d0da9c5def sparse-checkout: disable sparse-index
 18:  e6afec58674e = 18:  6bb0976a6295 cache-tree: integrate with sparse directory entries
 19:  2be4981fe698 = 19:  07f34e80609a sparse-index: loose integration with cache_tree_verify()
 20:  a738b0ba8ab4 = 20:  41e3b56b9c17 p2000: add sparse-index repos

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v2 01/20] sparse-index: design doc and format update
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
@ 2021-03-10 19:30   ` Derrick Stolee via GitGitGadget
  2021-03-10 22:19     ` Elijah Newren
  2021-03-10 19:30   ` [PATCH v2 02/20] t/perf: add performance test for sparse operations Derrick Stolee via GitGitGadget
                     ` (20 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

This begins a long effort to update the index format to allow sparse
directory entries. This should result in a significant improvement to
Git commands when HEAD contains millions of files, but the user has
selected many fewer files to keep in their sparse-checkout definition.

Currently, the index format is only updated in the presence of
extensions.sparseIndex instead of increasing a file format version
number. This is temporary, and index v5 is part of the plan for future
work in this area.

The design document details many of the reasons for embarking on this
work, and also the plan for completing it safely.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/index-format.txt |   7 +
 Documentation/technical/sparse-index.txt | 173 +++++++++++++++++++++++
 2 files changed, 180 insertions(+)
 create mode 100644 Documentation/technical/sparse-index.txt

diff --git a/Documentation/technical/index-format.txt b/Documentation/technical/index-format.txt
index b633482b1bdf..387126582556 100644
--- a/Documentation/technical/index-format.txt
+++ b/Documentation/technical/index-format.txt
@@ -44,6 +44,13 @@ Git index format
   localization, no special casing of directory separator '/'). Entries
   with the same name are sorted by their stage field.
 
+  An index entry typically represents a file. However, if sparse-checkout
+  is enabled in cone mode (`core.sparseCheckoutCone` is enabled) and the
+  `extensions.sparseIndex` extension is enabled, then the index may
+  contain entries for directories outside of the sparse-checkout definition.
+  These entries have mode `0040000`, include the `SKIP_WORKTREE` bit, and
+  the path ends in a directory separator.
+
   32-bit ctime seconds, the last time a file's metadata changed
     this is stat(2) data
 
diff --git a/Documentation/technical/sparse-index.txt b/Documentation/technical/sparse-index.txt
new file mode 100644
index 000000000000..787a2a0b3b81
--- /dev/null
+++ b/Documentation/technical/sparse-index.txt
@@ -0,0 +1,173 @@
+Git Sparse-Index Design Document
+================================
+
+The sparse-checkout feature allows users to focus a working directory on
+a subset of the files at HEAD. The cone mode patterns, enabled by
+`core.sparseCheckoutCone`, allow for very fast pattern matching to
+discover which files at HEAD belong in the sparse-checkout cone.
+
+Three important scale dimensions for a Git worktree are:
+
+* `HEAD`: How many files are present at `HEAD`?
+
+* Populated: How many files are within the sparse-checkout cone?
+
+* Modified: How many files has the user modified in the working directory?
+
+We will use big-O notation -- O(X) -- to denote how expensive certain
+operations are in terms of these dimensions.
+
+These dimensions are ordered by their magnitude: users (typically) modify
+fewer files than are populated, and we can only populate files at `HEAD`.
+These dimensions are also ordered by how expensive they are per item: it
+is more expensive to detect a modified file than it is to write one that we
+know must be populated; changing `HEAD` only really requires updating the
+index.
+
+Problems occur if there is an extreme imbalance in these dimensions. For
+example, if `HEAD` contains millions of paths but the populated set has
+only tens of thousands, then commands like `git status` and `git add` can
+be dominated by operations that require O(`HEAD`) operations instead of
+O(Populated). Primarily, the cost is in parsing and rewriting the index,
+which is filled primarily with files at `HEAD` that are marked with the
+`SKIP_WORKTREE` bit.
+
+The sparse-index intends to take these commands that read and modify the
+index from O(`HEAD`) to O(Populated). To do this, we need to modify the
+index format in a significant way: add "sparse directory" entries.
+
+With cone mode patterns, it is possible to detect when an entire
+directory will have its contents outside of the sparse-checkout definition.
+Instead of listing all of the files it contains as individual entries, a
+sparse-index contains an entry with the directory name, referencing the
+object ID of the tree at `HEAD` and marked with the `SKIP_WORKTREE` bit.
+If we need to discover the details for paths within that directory, we
+can parse trees to find that list.
+
+At time of writing, sparse-directory entries violate expectations about the
+index format and its in-memory data structure. There are many consumers in
+the codebase that expect to iterate through all of the index entries and
+see only files. In addition, they expect to see all files at `HEAD`. One
+way to handle this is to parse trees to replace a sparse-directory entry
+with all of the files within that tree as the index is loaded. However,
+parsing trees is slower than parsing the index format, so that is a slower
+operation than if we left the index alone.
+
+The implementation plan below follows four phases to slowly integrate with
+the sparse-index. The intention is to incrementally update Git commands to
+interact safely with the sparse-index without significant slowdowns. This
+may not always be possible, but the hope is that the primary commands that
+users need in their daily work are dramatically improved.
+
+Phase I: Format and initial speedups
+------------------------------------
+
+During this phase, Git learns to enable the sparse-index and safely parse
+one. Protections are put in place so that every consumer of the in-memory
+data structure can operate with its current assumption of every file at
+`HEAD`.
+
+At first, every index parse will expand the sparse-directory entries into
+the full list of paths at `HEAD`. This will be slower in all cases. The
+only noticeable change in behavior will be that the serialized index file
+contains sparse-directory entries.
+
+To start, we use a new repository extension, `extensions.sparseIndex`, to
+allow inserting sparse-directory entries into indexes with file format
+versions 2, 3, and 4. This prevents Git versions that do not understand
+the sparse-index from operating on one, but it also prevents other
+operations that do not use the index at all. A new format, index v5, will
+be introduced that includes sparse-directory entries by default. It might
+also introduce other features that have been considered for improving the
+index.
+
+Next, consumers of the index will be guarded against operating on a
+sparse-index by inserting calls to `ensure_full_index()` or
+`expand_index_to_path()`. After these guards are in place, we can begin
+leaving sparse-directory entries in the in-memory index structure.
+
+Even after inserting these guards, we will keep expanding sparse-indexes
+for most Git commands using the `command_requires_full_index` repository
+setting. This setting will be on by default and disabled one builtin at a
+time until we have sufficient confidence that all of the index operations
+are properly guarded.
+
+To complete this phase, the commands `git status` and `git add` will be
+integrated with the sparse-index so that they operate with O(Populated)
+performance. They will be carefully tested for operations within and
+outside the sparse-checkout definition.
+
+Phase II: Careful integrations
+------------------------------
+
+This phase focuses on ensuring that all index extensions and APIs work
+well with a sparse-index. This requires significant increases to our test
+coverage, especially for operations that interact with the working
+directory outside of the sparse-checkout definition. Some of these
+behaviors may not be the desirable ones, such as some tests already
+marked for failure in `t1092-sparse-checkout-compatibility.sh`.
+
+The index extensions that may require special integrations are:
+
+* FS Monitor
+* Untracked cache
+
+While integrating with these features, we should look for patterns that
+might lead to better APIs for interacting with the index. Coalescing
+common usage patterns into an API call can reduce the number of places
+where sparse-directories need to be handled carefully.
+
+Phase III: Important command speedups
+-------------------------------------
+
+At this point, the patterns for testing and implementing sparse-directory
+logic should be relatively stable. This phase focuses on updating some of
+the most common builtins that use the index to operate as O(Populated).
+Here is a potential list of commands that could be valuable to integrate
+at this point:
+
+* `git commit`
+* `git checkout`
+* `git merge`
+* `git rebase`
+
+Hopefully, commands such as `git merge` and `git rebase` can benefit
+instead from merge algorithms that do not use the index as a data
+structure, such as the merge-ORT strategy. As these topics mature, we
+may enable the ORT strategy by default for repositories using the
+sparse-index feature.
+
+Along with `git status` and `git add`, these commands cover the majority
+of users' interactions with the working directory. In addition, we can
+integrate with these commands:
+
+* `git grep`
+* `git rm`
+
+These have been proposed as some whose behavior could change when in a
+repo with a sparse-checkout definition. It would be good to include this
+behavior automatically when using a sparse-index. Some clarity is needed
+to make the behavior switch clear to the user.
+
+This phase is the first where parallel work might be possible without too
+many conflicts between topics.
+
+Phase IV: The long tail
+-----------------------
+
+This last phase is less a "phase" and more "the new normal" after all of
+the previous work.
+
+To start, the `command_requires_full_index` option could be removed in
+favor of expanding only when hitting an API guard.
+
+There are many Git commands that could use special attention to operate as
+O(Populated), while some might be so rare that it is acceptable to leave
+them with additional overhead when a sparse-index is present.
+
+Here are some commands that might be useful to update:
+
+* `git sparse-checkout set`
+* `git am`
+* `git clean`
+* `git stash`
-- 
gitgitgadget
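
As a rough illustration of the sparse-directory idea in the design
document above, here is a toy sketch in plain shell and awk. It is not
how Git builds a sparse index (the series works from the cache-tree and
the sparse-checkout pattern list); `collapse_index` is a hypothetical
helper and the path list is made up. It only shows which entries would
remain when every directory wholly outside a single cone directory is
replaced by one entry whose path ends in a directory separator.

```shell
#!/bin/sh
# collapse_index CONE: read sorted paths on stdin and print a toy
# "sparse index". Hypothetical helper, not part of Git.
collapse_index () {
	awk -v cone="$1" '
	{
		n = split($0, part, "/")
		prefix = ""
		for (i = 1; i < n; i++) {
			prefix = prefix part[i] "/"
			# Directory inside the cone: keep the file entry.
			if (index(prefix, cone "/") == 1)
				break
			# Directory is an ancestor of the cone: look deeper.
			if (index(cone "/", prefix) == 1)
				continue
			# Directory is wholly outside the cone: emit one
			# entry ending in "/" and skip its contents.
			if (!(prefix in seen)) {
				seen[prefix] = 1
				print prefix
			}
			next
		}
		print
	}'
}

# Made-up paths standing in for the files at HEAD.
printf "%s\n" a f1/a f1/b f2/a f2/f1/a f2/f4/a \
	f2/f4/f1/a f2/f4/f1/sub/b f2/f4/f2/a f3/a |
collapse_index f2/f4/f1
```

Files directly inside the ancestors of the cone (a, f2/a, f2/f4/a) stay
as file entries, which matches my understanding of cone mode, while
f1/, f2/f1/, f2/f4/f2/ and f3/ each shrink to a single line regardless
of how many files they hold.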


^ permalink raw reply related	[flat|nested] 203+ messages in thread

* [PATCH v2 02/20] t/perf: add performance test for sparse operations
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  2021-03-10 19:30   ` [PATCH v2 01/20] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
@ 2021-03-10 19:30   ` Derrick Stolee via GitGitGadget
  2021-03-10 19:30   ` [PATCH v2 03/20] t1092: clean up script quoting Derrick Stolee via GitGitGadget
                     ` (19 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Create a test script that takes the default performance test (the Git
codebase) and multiplies it by 256 using four layers of duplicated
trees of width four. This results in nearly one million blob entries in
the index. Then, we can clone this repository with sparse-checkout
patterns that demonstrate four copies of the initial repository. Each
clone will use a different index format or mode so performance can be
tested across the different options.

Note that the initial repo is stripped of submodules before doing the
copies. This preserves the expected data shape of the sparse index,
because directories containing submodules are not collapsed to a sparse
directory entry.

Run a few Git commands on these clones, especially those that use the
index (status, add, commit).

Here are the results on my Linux machine:

Test
--------------------------------------------------------------
2000.2: git status (full-index-v3)             0.37(0.30+0.09)
2000.3: git status (full-index-v4)             0.39(0.32+0.10)
2000.4: git add -A (full-index-v3)             1.42(1.06+0.20)
2000.5: git add -A (full-index-v4)             1.26(0.98+0.16)
2000.6: git add . (full-index-v3)              1.40(1.04+0.18)
2000.7: git add . (full-index-v4)              1.26(0.98+0.17)
2000.8: git commit -a -m A (full-index-v3)     1.42(1.11+0.16)
2000.9: git commit -a -m A (full-index-v4)     1.33(1.08+0.16)

It is perhaps noteworthy that there is an improvement when using index
version 4. This is because the v3 index uses 108 MiB while the v4
index uses 80 MiB. Since the repeated portions of the directories are
very short (f3/f1/f2, for example) this ratio is less pronounced than in
similarly-sized real repositories.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/perf/p2000-sparse-operations.sh | 85 +++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)
 create mode 100755 t/perf/p2000-sparse-operations.sh

diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
new file mode 100755
index 000000000000..2fbc81b22119
--- /dev/null
+++ b/t/perf/p2000-sparse-operations.sh
@@ -0,0 +1,85 @@
+#!/bin/sh
+
+test_description="test performance of Git operations using the index"
+
+. ./perf-lib.sh
+
+test_perf_default_repo
+
+SPARSE_CONE=f2/f4/f1
+
+test_expect_success 'setup repo and indexes' '
+	git reset --hard HEAD &&
+	# Remove submodules from the example repo, because our
+	# duplication of the entire repo creates an unlikely data shape.
+	git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
+	git rm -f .gitmodules &&
+	for module in $(awk "{print \$2}" modules)
+	do
+		git rm $module || return 1
+	done &&
+	git commit -m "remove submodules" &&
+
+	echo bogus >a &&
+	cp a b &&
+	git add a b &&
+	git commit -m "level 0" &&
+	BLOB=$(git rev-parse HEAD:a) &&
+	OLD_COMMIT=$(git rev-parse HEAD) &&
+	OLD_TREE=$(git rev-parse HEAD^{tree}) &&
+
+	for i in $(test_seq 1 4)
+	do
+		cat >in <<-EOF &&
+			100755 blob $BLOB	a
+			040000 tree $OLD_TREE	f1
+			040000 tree $OLD_TREE	f2
+			040000 tree $OLD_TREE	f3
+			040000 tree $OLD_TREE	f4
+		EOF
+		NEW_TREE=$(git mktree <in) &&
+		NEW_COMMIT=$(git commit-tree $NEW_TREE -p $OLD_COMMIT -m "level $i") &&
+		OLD_TREE=$NEW_TREE &&
+		OLD_COMMIT=$NEW_COMMIT || return 1
+	done &&
+
+	git sparse-checkout init --cone &&
+	git branch -f wide $OLD_COMMIT &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . full-index-v3 &&
+	(
+		cd full-index-v3 &&
+		git sparse-checkout init --cone &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 3 &&
+		git update-index --index-version=3
+	) &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . full-index-v4 &&
+	(
+		cd full-index-v4 &&
+		git sparse-checkout init --cone &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 4 &&
+		git update-index --index-version=4
+	)
+'
+
+test_perf_on_all () {
+	command="$@"
+	for repo in full-index-v3 full-index-v4
+	do
+		test_perf "$command ($repo)" "
+			(
+				cd $repo &&
+				echo >>$SPARSE_CONE/a &&
+				$command
+			)
+		"
+	done
+}
+
+test_perf_on_all git status
+test_perf_on_all git add -A
+test_perf_on_all git add .
+test_perf_on_all git commit -a -m A
+
+test_done
-- 
gitgitgadget
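
The arithmetic behind "multiplies it by 256" and "nearly one million
blob entries" can be checked with a short sketch. This only models the
mktree loop in the setup step; the starting count of 3900 blobs is an
assumption chosen for illustration, not a measured size of the Git
repository, and `count_entries` is a hypothetical helper.

```shell
#!/bin/sh
# Toy model of the tree duplication in the p2000 setup step.
# ASSUMPTION: 3900 is a made-up blob count for the level-0 tree.

# count_entries BASE LEVELS: blobs reachable from the top tree after
# LEVELS rounds of duplication.
count_entries () {
	n=$1
	i=1
	while [ "$i" -le "$2" ]
	do
		# Each new tree holds the blob "a" plus four copies
		# (f1..f4) of the previous level's tree.
		n=$((1 + 4 * n))
		i=$((i + 1))
	done
	echo "$n"
}

head_count=$(count_entries 3900 4)
# A cone three directories deep (such as f2/f4/f1) keeps one level-1
# tree plus the blob "a" in each of the three directories above it.
populated=$(( $(count_entries 3900 1) + 3 ))

echo "entries at HEAD: $head_count"
echo "entries in cone: $populated"
```

With that assumed base, the top-level tree holds 256 copies of the base
plus 85 extra "a" blobs, while the cone holds four copies plus a
handful of files, which is the HEAD versus Populated imbalance the
performance test is meant to exercise.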


^ permalink raw reply related	[flat|nested] 203+ messages in thread

* [PATCH v2 03/20] t1092: clean up script quoting
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  2021-03-10 19:30   ` [PATCH v2 01/20] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
  2021-03-10 19:30   ` [PATCH v2 02/20] t/perf: add performance test for sparse operations Derrick Stolee via GitGitGadget
@ 2021-03-10 19:30   ` Derrick Stolee via GitGitGadget
  2021-03-10 19:30   ` [PATCH v2 04/20] sparse-index: add guard to ensure full index Derrick Stolee via GitGitGadget
                     ` (18 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

This test was introduced in 19a0acc83e4 (t1092: test interesting
sparse-checkout scenarios, 2021-01-23), but these issues with quoting
were not noticed until starting this follow-up series. The old mechanism
would drop quoting such as in

   test_all_match git commit -m "touch README.md"

The above happened to work because README.md is a file in the
repository, so 'git commit -m touch README.md' would succeed by
accident.

Other cases included quoting for no good reason, so clean that up now.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t1092-sparse-checkout-compatibility.sh | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 8cd3e5a8d227..3725d3997e70 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -96,20 +96,20 @@ init_repos () {
 run_on_sparse () {
 	(
 		cd sparse-checkout &&
-		$* >../sparse-checkout-out 2>../sparse-checkout-err
+		"$@" >../sparse-checkout-out 2>../sparse-checkout-err
 	)
 }
 
 run_on_all () {
 	(
 		cd full-checkout &&
-		$* >../full-checkout-out 2>../full-checkout-err
+		"$@" >../full-checkout-out 2>../full-checkout-err
 	) &&
-	run_on_sparse $*
+	run_on_sparse "$@"
 }
 
 test_all_match () {
-	run_on_all $* &&
+	run_on_all "$@" &&
 	test_cmp full-checkout-out sparse-checkout-out &&
 	test_cmp full-checkout-err sparse-checkout-err
 }
@@ -119,7 +119,7 @@ test_expect_success 'status with options' '
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git status --porcelain=v2 -z -u &&
 	test_all_match git status --porcelain=v2 -uno &&
-	run_on_all "touch README.md" &&
+	run_on_all touch README.md &&
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git status --porcelain=v2 -z -u &&
 	test_all_match git status --porcelain=v2 -uno &&
@@ -135,7 +135,7 @@ test_expect_success 'add, commit, checkout' '
 	write_script edit-contents <<-\EOF &&
 	echo text >>$1
 	EOF
-	run_on_all "../edit-contents README.md" &&
+	run_on_all ../edit-contents README.md &&
 
 	test_all_match git add README.md &&
 	test_all_match git status --porcelain=v2 &&
@@ -144,7 +144,7 @@ test_expect_success 'add, commit, checkout' '
 	test_all_match git checkout HEAD~1 &&
 	test_all_match git checkout - &&
 
-	run_on_all "../edit-contents README.md" &&
+	run_on_all ../edit-contents README.md &&
 
 	test_all_match git add -A &&
 	test_all_match git status --porcelain=v2 &&
@@ -153,7 +153,7 @@ test_expect_success 'add, commit, checkout' '
 	test_all_match git checkout HEAD~1 &&
 	test_all_match git checkout - &&
 
-	run_on_all "../edit-contents deep/newfile" &&
+	run_on_all ../edit-contents deep/newfile &&
 
 	test_all_match git status --porcelain=v2 -uno &&
 	test_all_match git status --porcelain=v2 &&
@@ -186,7 +186,7 @@ test_expect_success 'diff --staged' '
 	write_script edit-contents <<-\EOF &&
 	echo text >>README.md
 	EOF
-	run_on_all "../edit-contents" &&
+	run_on_all ../edit-contents &&
 
 	test_all_match git diff &&
 	test_all_match git diff --staged &&
@@ -280,7 +280,7 @@ test_expect_success 'clean' '
 	echo bogus >>.gitignore &&
 	run_on_all cp ../.gitignore . &&
 	test_all_match git add .gitignore &&
-	test_all_match git commit -m ignore-bogus-files &&
+	test_all_match git commit -m "ignore bogus files" &&
 
 	run_on_sparse mkdir folder1 &&
 	run_on_all touch folder1/bogus &&
-- 
gitgitgadget



* [PATCH v2 04/20] sparse-index: add guard to ensure full index
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (2 preceding siblings ...)
  2021-03-10 19:30   ` [PATCH v2 03/20] t1092: clean up script quoting Derrick Stolee via GitGitGadget
@ 2021-03-10 19:30   ` Derrick Stolee via GitGitGadget
  2021-03-10 19:30   ` [PATCH v2 05/20] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
                     ` (17 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Upcoming changes will introduce modifications to the index format that
allow sparse directories. It will be useful to have a mechanism for
converting those sparse index files into full indexes by walking the
tree at those sparse directories. Name this method ensure_full_index()
as it will guarantee that the index is fully expanded.

This method is not implemented yet, and instead we focus on the
scaffolding to declare it and call it at the appropriate time.

Add a 'command_requires_full_index' member to struct repo_settings. This
will be an indicator that we need the index in full mode to do certain
index operations. This starts as being true for every command, then we
will set it to false as some commands integrate with sparse indexes.

If 'command_requires_full_index' is true, then we will immediately
expand a sparse index to a full one upon reading from disk. This
suffices for now, but we will want to add more callers to
ensure_full_index() later.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Makefile        |  1 +
 repo-settings.c |  8 ++++++++
 repository.c    | 11 ++++++++++-
 repository.h    |  2 ++
 sparse-index.c  |  8 ++++++++
 sparse-index.h  |  7 +++++++
 6 files changed, 36 insertions(+), 1 deletion(-)
 create mode 100644 sparse-index.c
 create mode 100644 sparse-index.h

diff --git a/Makefile b/Makefile
index 5a239cac20e3..3bf61699238d 100644
--- a/Makefile
+++ b/Makefile
@@ -980,6 +980,7 @@ LIB_OBJS += setup.o
 LIB_OBJS += shallow.o
 LIB_OBJS += sideband.o
 LIB_OBJS += sigchain.o
+LIB_OBJS += sparse-index.o
 LIB_OBJS += split-index.o
 LIB_OBJS += stable-qsort.o
 LIB_OBJS += strbuf.o
diff --git a/repo-settings.c b/repo-settings.c
index f7fff0f5ab83..d63569e4041e 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -77,4 +77,12 @@ void prepare_repo_settings(struct repository *r)
 		UPDATE_DEFAULT_BOOL(r->settings.core_untracked_cache, UNTRACKED_CACHE_KEEP);
 
 	UPDATE_DEFAULT_BOOL(r->settings.fetch_negotiation_algorithm, FETCH_NEGOTIATION_DEFAULT);
+
+	/*
+	 * This setting guards all index reads to require a full index
+	 * over a sparse index. After suitable guards are placed in the
+	 * codebase around uses of the index, this setting will be
+	 * removed.
+	 */
+	r->settings.command_requires_full_index = 1;
 }
diff --git a/repository.c b/repository.c
index c98298acd017..a8acae002f71 100644
--- a/repository.c
+++ b/repository.c
@@ -10,6 +10,7 @@
 #include "object.h"
 #include "lockfile.h"
 #include "submodule-config.h"
+#include "sparse-index.h"
 
 /* The main repository */
 static struct repository the_repo;
@@ -261,6 +262,8 @@ void repo_clear(struct repository *repo)
 
 int repo_read_index(struct repository *repo)
 {
+	int res;
+
 	if (!repo->index)
 		repo->index = xcalloc(1, sizeof(*repo->index));
 
@@ -270,7 +273,13 @@ int repo_read_index(struct repository *repo)
 	else if (repo->index->repo != repo)
 		BUG("repo's index should point back at itself");
 
-	return read_index_from(repo->index, repo->index_file, repo->gitdir);
+	res = read_index_from(repo->index, repo->index_file, repo->gitdir);
+
+	prepare_repo_settings(repo);
+	if (repo->settings.command_requires_full_index)
+		ensure_full_index(repo->index);
+
+	return res;
 }
 
 int repo_hold_locked_index(struct repository *repo,
diff --git a/repository.h b/repository.h
index b385ca3c94b6..e06a23015697 100644
--- a/repository.h
+++ b/repository.h
@@ -41,6 +41,8 @@ struct repo_settings {
 	enum fetch_negotiation_setting fetch_negotiation_algorithm;
 
 	int core_multi_pack_index;
+
+	unsigned command_requires_full_index:1;
 };
 
 struct repository {
diff --git a/sparse-index.c b/sparse-index.c
new file mode 100644
index 000000000000..82183ead563b
--- /dev/null
+++ b/sparse-index.c
@@ -0,0 +1,8 @@
+#include "cache.h"
+#include "repository.h"
+#include "sparse-index.h"
+
+void ensure_full_index(struct index_state *istate)
+{
+	/* intentionally left blank */
+}
diff --git a/sparse-index.h b/sparse-index.h
new file mode 100644
index 000000000000..09a20d036c46
--- /dev/null
+++ b/sparse-index.h
@@ -0,0 +1,7 @@
+#ifndef SPARSE_INDEX_H__
+#define SPARSE_INDEX_H__
+
+struct index_state;
+void ensure_full_index(struct index_state *istate);
+
+#endif
-- 
gitgitgadget



* [PATCH v2 05/20] sparse-index: implement ensure_full_index()
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (3 preceding siblings ...)
  2021-03-10 19:30   ` [PATCH v2 04/20] sparse-index: add guard to ensure full index Derrick Stolee via GitGitGadget
@ 2021-03-10 19:30   ` Derrick Stolee via GitGitGadget
  2021-03-12  6:50     ` Junio C Hamano
  2021-03-10 19:30   ` [PATCH v2 06/20] t1092: compare sparse-checkout to sparse-index Derrick Stolee via GitGitGadget
                     ` (16 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

We will mark an in-memory index_state as having sparse directory entries
with the sparse_index bit. These currently cannot exist, but we will add
a mechanism for collapsing a full index to a sparse one in a later
change. That will happen at write time, so we must first allow parsing
the format before writing it.

Commands or methods that require a full index in order to operate can
call ensure_full_index() to expand that index in-memory. This requires
parsing trees using that index's repository.

Sparse directory entries have a specific 'ce_mode' value. The macro
S_ISSPARSEDIR(ce->ce_mode) can check if a cache_entry 'ce' has this type.
This ce_mode is not possible with the existing index formats, so the
macro does not also verify the other properties of a sparse-directory
entry. The full set of properties is:

 1. ce->ce_mode == 0040000
 2. ce->ce_flags & CE_SKIP_WORKTREE is true
 3. ce->name[ce->ce_namelen - 1] == '/' (ends in dir separator)
 4. ce->oid references a tree object.

These are only partially enforced in ensure_full_index(). Any
deviation will cause a warning at minimum or a failure in the worst
case.
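
The effect of the expansion can be sketched with a toy model that works
on text lines instead of the in-memory index. Everything here is an
assumption made for illustration: expand_index and list_tree_blobs are
hypothetical names, and list_tree_blobs returns a hard-coded listing
where the real code walks a tree object.

```shell
#!/bin/sh
# Hypothetical stand-in for reading a tree object: prints one
# "mode path" line for every blob below the named directory.
list_tree_blobs () {
	case "$1" in
	folder1/)
		printf '100644 %s\n' folder1/a folder1/sub/b
		;;
	esac
}

# Reads "mode path" lines. A line with the sparse-directory mode is
# replaced, in place, by the blobs below it; all other lines are
# copied through unchanged.
expand_index () {
	while read -r mode path
	do
		if test "$mode" = 040000
		then
			list_tree_blobs "$path"
		else
			printf '%s %s\n' "$mode" "$path"
		fi
	done
}

printf '%s\n' '100644 a' '040000 folder1/' '100644 z' | expand_index
```

The one sparse-directory line becomes two blob lines while 'a' and 'z'
keep their positions, so the toy output stays sorted by path.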

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache.h        | 13 ++++++-
 read-cache.c   |  9 +++++
 sparse-index.c | 95 +++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 115 insertions(+), 2 deletions(-)

diff --git a/cache.h b/cache.h
index d92814961405..1f0b42264606 100644
--- a/cache.h
+++ b/cache.h
@@ -204,6 +204,8 @@ struct cache_entry {
 #error "CE_EXTENDED_FLAGS out of range"
 #endif
 
+#define S_ISSPARSEDIR(m) ((m) == S_IFDIR)
+
 /* Forward structure decls */
 struct pathspec;
 struct child_process;
@@ -319,7 +321,14 @@ struct index_state {
 		 drop_cache_tree : 1,
 		 updated_workdir : 1,
 		 updated_skipworktree : 1,
-		 fsmonitor_has_run_once : 1;
+		 fsmonitor_has_run_once : 1,
+
+		 /*
+		  * sparse_index == 1 when sparse-directory
+		  * entries exist. Requires sparse-checkout
+		  * in cone mode.
+		  */
+		 sparse_index : 1;
 	struct hashmap name_hash;
 	struct hashmap dir_hash;
 	struct object_id oid;
@@ -722,6 +731,8 @@ int read_index_from(struct index_state *, const char *path,
 		    const char *gitdir);
 int is_index_unborn(struct index_state *);
 
+void ensure_full_index(struct index_state *istate);
+
 /* For use with `write_locked_index()`. */
 #define COMMIT_LOCK		(1 << 0)
 #define SKIP_IF_UNCHANGED	(1 << 1)
diff --git a/read-cache.c b/read-cache.c
index 29144cf879e7..97dbf2434f30 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -101,6 +101,9 @@ static const char *alternate_index_output;
 
 static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
 {
+	if (S_ISSPARSEDIR(ce->ce_mode))
+		istate->sparse_index = 1;
+
 	istate->cache[nr] = ce;
 	add_name_hash(istate, ce);
 }
@@ -2255,6 +2258,12 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist)
 	trace2_data_intmax("index", the_repository, "read/cache_nr",
 			   istate->cache_nr);
 
+	if (!istate->repo)
+		istate->repo = the_repository;
+	prepare_repo_settings(istate->repo);
+	if (istate->repo->settings.command_requires_full_index)
+		ensure_full_index(istate);
+
 	return istate->cache_nr;
 
 unmap:
diff --git a/sparse-index.c b/sparse-index.c
index 82183ead563b..316cb949b74b 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -1,8 +1,101 @@
 #include "cache.h"
 #include "repository.h"
 #include "sparse-index.h"
+#include "tree.h"
+#include "pathspec.h"
+#include "trace2.h"
+
+static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
+{
+	ALLOC_GROW(istate->cache, nr + 1, istate->cache_alloc);
+
+	istate->cache[nr] = ce;
+	add_name_hash(istate, ce);
+}
+
+static int add_path_to_index(const struct object_id *oid,
+				struct strbuf *base, const char *path,
+				unsigned int mode, int stage, void *context)
+{
+	struct index_state *istate = (struct index_state *)context;
+	struct cache_entry *ce;
+	size_t len = base->len;
+
+	if (S_ISDIR(mode))
+		return READ_TREE_RECURSIVE;
+
+	strbuf_addstr(base, path);
+
+	ce = make_cache_entry(istate, mode, oid, base->buf, 0, 0);
+	ce->ce_flags |= CE_SKIP_WORKTREE;
+	set_index_entry(istate, istate->cache_nr++, ce);
+
+	strbuf_setlen(base, len);
+	return 0;
+}
 
 void ensure_full_index(struct index_state *istate)
 {
-	/* intentionally left blank */
+	int i;
+	struct index_state *full;
+
+	if (!istate || !istate->sparse_index)
+		return;
+
+	if (!istate->repo)
+		istate->repo = the_repository;
+
+	trace2_region_enter("index", "ensure_full_index", istate->repo);
+
+	/* initialize basics of new index */
+	full = xcalloc(1, sizeof(struct index_state));
+	memcpy(full, istate, sizeof(struct index_state));
+
+	/* then change the necessary things */
+	full->sparse_index = 0;
+	full->cache_alloc = (3 * istate->cache_alloc) / 2;
+	full->cache_nr = 0;
+	ALLOC_ARRAY(full->cache, full->cache_alloc);
+
+	for (i = 0; i < istate->cache_nr; i++) {
+		struct cache_entry *ce = istate->cache[i];
+		struct tree *tree;
+		struct pathspec ps;
+
+		if (!S_ISSPARSEDIR(ce->ce_mode)) {
+			set_index_entry(full, full->cache_nr++, ce);
+			continue;
+		}
+		if (!(ce->ce_flags & CE_SKIP_WORKTREE))
+			warning(_("index entry is a directory, but not sparse (%08x)"),
+				ce->ce_flags);
+
+		/* recursively walk into ce->name */
+		tree = lookup_tree(istate->repo, &ce->oid);
+
+		memset(&ps, 0, sizeof(ps));
+		ps.recursive = 1;
+		ps.has_wildcard = 1;
+		ps.max_depth = -1;
+
+		read_tree_recursive(istate->repo, tree,
+				    ce->name, strlen(ce->name),
+				    0, &ps,
+				    add_path_to_index, full);
+
+		/* free directory entries. full entries are re-used */
+		discard_cache_entry(ce);
+	}
+
+	/* Copy back into original index. */
+	memcpy(&istate->name_hash, &full->name_hash, sizeof(full->name_hash));
+	istate->sparse_index = 0;
+	free(istate->cache);
+	istate->cache = full->cache;
+	istate->cache_nr = full->cache_nr;
+	istate->cache_alloc = full->cache_alloc;
+
+	free(full);
+
+	trace2_region_leave("index", "ensure_full_index", istate->repo);
 }
-- 
gitgitgadget



* [PATCH v2 06/20] t1092: compare sparse-checkout to sparse-index
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (4 preceding siblings ...)
  2021-03-10 19:30   ` [PATCH v2 05/20] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
@ 2021-03-10 19:30   ` Derrick Stolee via GitGitGadget
  2021-03-10 23:04     ` Elijah Newren
  2021-03-10 19:30   ` [PATCH v2 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
                     ` (15 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add a new 'sparse-index' repo alongside the 'full-checkout' and
'sparse-checkout' repos in t1092-sparse-checkout-compatibility.sh. Also
add run_on_sparse and test_sparse_match helpers. These helpers will be
used when the sparse index is implemented.

Add GIT_TEST_SPARSE_INDEX environment variable to enable the
sparse-index by default. This is intended to be used across the entire
test suite, but it will only affect cases where the sparse-checkout
feature is enabled.
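
The per-command assignment used by these helpers can be sketched as
follows. This is only an illustration: run_with_each_setting is a
hypothetical name, and the child 'sh -c' stands in for the git commands
that the real helpers run.

```shell
#!/bin/sh
# Start from a known state so the last line below is predictable.
unset GIT_TEST_SPARSE_INDEX

# Hypothetical wrapper modeled on run_on_sparse: run the same external
# command once per setting. The assignment prefix applies only to the
# environment of that single command.
run_with_each_setting () {
	GIT_TEST_SPARSE_INDEX=0 "$@" &&
	GIT_TEST_SPARSE_INDEX=1 "$@"
}

run_with_each_setting sh -c 'echo "child sees $GIT_TEST_SPARSE_INDEX"'
echo "caller sees ${GIT_TEST_SPARSE_INDEX-unset}"
```

Because the variable is set per invocation rather than exported once,
the two repos can be driven with different settings from the same test
without either setting leaking into later commands.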

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/README                                 |  3 +++
 t/t1092-sparse-checkout-compatibility.sh | 24 ++++++++++++++++++++----
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/t/README b/t/README
index 593d4a4e270c..b98bc563aab5 100644
--- a/t/README
+++ b/t/README
@@ -439,6 +439,9 @@ and "sha256".
 GIT_TEST_WRITE_REV_INDEX=<boolean>, when true enables the
 'pack.writeReverseIndex' setting.
 
+GIT_TEST_SPARSE_INDEX=<boolean>, when true enables index writes to use the
+sparse-index format by default.
+
 Naming Tests
 ------------
 
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 3725d3997e70..71d6f9e4c014 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -7,6 +7,7 @@ test_description='compare full workdir to sparse workdir'
 test_expect_success 'setup' '
 	git init initial-repo &&
 	(
+		GIT_TEST_SPARSE_INDEX=0 &&
 		cd initial-repo &&
 		echo a >a &&
 		echo "after deep" >e &&
@@ -87,23 +88,32 @@ init_repos () {
 
 	cp -r initial-repo sparse-checkout &&
 	git -C sparse-checkout reset --hard &&
-	git -C sparse-checkout sparse-checkout init --cone &&
+
+	cp -r initial-repo sparse-index &&
+	git -C sparse-index reset --hard &&
 
 	# initialize sparse-checkout definitions
-	git -C sparse-checkout sparse-checkout set deep
+	git -C sparse-checkout sparse-checkout init --cone &&
+	git -C sparse-checkout sparse-checkout set deep &&
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout init --cone &&
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep
 }
 
 run_on_sparse () {
 	(
 		cd sparse-checkout &&
-		"$@" >../sparse-checkout-out 2>../sparse-checkout-err
+		GIT_TEST_SPARSE_INDEX=0 "$@" >../sparse-checkout-out 2>../sparse-checkout-err
+	) &&
+	(
+		cd sparse-index &&
+		GIT_TEST_SPARSE_INDEX=1 "$@" >../sparse-index-out 2>../sparse-index-err
 	)
 }
 
 run_on_all () {
 	(
 		cd full-checkout &&
-		"$@" >../full-checkout-out 2>../full-checkout-err
+		GIT_TEST_SPARSE_INDEX=0 "$@" >../full-checkout-out 2>../full-checkout-err
 	) &&
 	run_on_sparse "$@"
 }
@@ -114,6 +124,12 @@ test_all_match () {
 	test_cmp full-checkout-err sparse-checkout-err
 }
 
+test_sparse_match () {
+	run_on_sparse "$@" &&
+	test_cmp sparse-checkout-out sparse-index-out &&
+	test_cmp sparse-checkout-err sparse-index-err
+}
+
 test_expect_success 'status with options' '
 	init_repos &&
 	test_all_match git status --porcelain=v2 &&
-- 
gitgitgadget



* [PATCH v2 07/20] test-read-cache: print cache entries with --table
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (5 preceding siblings ...)
  2021-03-10 19:30   ` [PATCH v2 06/20] t1092: compare sparse-checkout to sparse-index Derrick Stolee via GitGitGadget
@ 2021-03-10 19:30   ` Derrick Stolee via GitGitGadget
  2021-03-10 19:30   ` [PATCH v2 08/20] test-tool: don't force full index Derrick Stolee via GitGitGadget
                     ` (14 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

This table is helpful for discovering data in the index to ensure it is
being written correctly, especially as we build and test the
sparse-index. This table includes an output format similar to 'git
ls-tree', but should not be compared to that directly. The biggest
reasons are that 'git ls-tree' includes a tree entry for every
subdirectory, even those that would not appear as a sparse directory in
a sparse-index. Further, 'git ls-tree' does not use a trailing directory
separator for its tree rows.

This does not print the stat() information for the blobs. That could be
added in a future change with another option. The tests that are added
in the next few changes care only about the object types and IDs.

To make the option parsing slightly more robust, wrap the string
comparisons in a loop adapted from test-dir-iterator.c.

Care must be taken with the final check for the 'cnt' variable. We
continue the expectation that the numerical value is the final argument.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/helper/test-read-cache.c | 55 +++++++++++++++++++++++++++++++-------
 1 file changed, 45 insertions(+), 10 deletions(-)

diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
index 244977a29bdf..6cfd8f2de71c 100644
--- a/t/helper/test-read-cache.c
+++ b/t/helper/test-read-cache.c
@@ -1,36 +1,71 @@
 #include "test-tool.h"
 #include "cache.h"
 #include "config.h"
+#include "blob.h"
+#include "commit.h"
+#include "tree.h"
+
+static void print_cache_entry(struct cache_entry *ce)
+{
+	const char *type;
+	printf("%06o ", ce->ce_mode & 0177777);
+
+	if (S_ISSPARSEDIR(ce->ce_mode))
+		type = tree_type;
+	else if (S_ISGITLINK(ce->ce_mode))
+		type = commit_type;
+	else
+		type = blob_type;
+
+	printf("%s %s\t%s\n",
+	       type,
+	       oid_to_hex(&ce->oid),
+	       ce->name);
+}
+
+static void print_cache(struct index_state *istate)
+{
+	int i;
+	for (i = 0; i < istate->cache_nr; i++)
+		print_cache_entry(istate->cache[i]);
+}
 
 int cmd__read_cache(int argc, const char **argv)
 {
+	struct repository *r = the_repository;
 	int i, cnt = 1;
 	const char *name = NULL;
+	int table = 0;
 
-	if (argc > 1 && skip_prefix(argv[1], "--print-and-refresh=", &name)) {
-		argc--;
-		argv++;
+	for (++argv, --argc; *argv && starts_with(*argv, "--"); ++argv, --argc) {
+		if (skip_prefix(*argv, "--print-and-refresh=", &name))
+			continue;
+		if (!strcmp(*argv, "--table"))
+			table = 1;
 	}
 
-	if (argc == 2)
-		cnt = strtol(argv[1], NULL, 0);
+	if (argc == 1)
+		cnt = strtol(argv[0], NULL, 0);
 	setup_git_directory();
 	git_config(git_default_config, NULL);
+
 	for (i = 0; i < cnt; i++) {
-		read_cache();
+		repo_read_index(r);
 		if (name) {
 			int pos;
 
-			refresh_index(&the_index, REFRESH_QUIET,
+			refresh_index(r->index, REFRESH_QUIET,
 				      NULL, NULL, NULL);
-			pos = index_name_pos(&the_index, name, strlen(name));
+			pos = index_name_pos(r->index, name, strlen(name));
 			if (pos < 0)
 				die("%s not in index", name);
 			printf("%s is%s up to date\n", name,
-			       ce_uptodate(the_index.cache[pos]) ? "" : " not");
+			       ce_uptodate(r->index->cache[pos]) ? "" : " not");
 			write_file(name, "%d\n", i);
 		}
-		discard_cache();
+		if (table)
+			print_cache(r->index);
+		discard_index(r->index);
 	}
 	return 0;
 }
-- 
gitgitgadget



* [PATCH v2 08/20] test-tool: don't force full index
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (6 preceding siblings ...)
  2021-03-10 19:30   ` [PATCH v2 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
@ 2021-03-10 19:30   ` Derrick Stolee via GitGitGadget
  2021-03-10 19:30   ` [PATCH v2 09/20] unpack-trees: ensure " Derrick Stolee via GitGitGadget
                     ` (13 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

We will use 'test-tool read-cache --table' to check that a sparse
index is written as part of init_repos. Since we will no longer always
expand a sparse index into a full index, add an '--expand' parameter
that adds a call to ensure_full_index() so we can compare a sparse index
directly against a full index, or at least what the in-memory index
looks like when expanded in this way.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/helper/test-read-cache.c               | 13 ++++++++++++-
 t/t1092-sparse-checkout-compatibility.sh |  5 +++++
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
index 6cfd8f2de71c..b52c174acc7a 100644
--- a/t/helper/test-read-cache.c
+++ b/t/helper/test-read-cache.c
@@ -4,6 +4,7 @@
 #include "blob.h"
 #include "commit.h"
 #include "tree.h"
+#include "sparse-index.h"
 
 static void print_cache_entry(struct cache_entry *ce)
 {
@@ -35,13 +36,19 @@ int cmd__read_cache(int argc, const char **argv)
 	struct repository *r = the_repository;
 	int i, cnt = 1;
 	const char *name = NULL;
-	int table = 0;
+	int table = 0, expand = 0;
+
+	initialize_the_repository();
+	prepare_repo_settings(r);
+	r->settings.command_requires_full_index = 0;
 
 	for (++argv, --argc; *argv && starts_with(*argv, "--"); ++argv, --argc) {
 		if (skip_prefix(*argv, "--print-and-refresh=", &name))
 			continue;
 		if (!strcmp(*argv, "--table"))
 			table = 1;
+		else if (!strcmp(*argv, "--expand"))
+			expand = 1;
 	}
 
 	if (argc == 1)
@@ -51,6 +58,10 @@ int cmd__read_cache(int argc, const char **argv)
 
 	for (i = 0; i < cnt; i++) {
 		repo_read_index(r);
+
+		if (expand)
+			ensure_full_index(r->index);
+
 		if (name) {
 			int pos;
 
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 71d6f9e4c014..4d789fe86b9d 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -130,6 +130,11 @@ test_sparse_match () {
 	test_cmp sparse-checkout-err sparse-index-err
 }
 
+test_expect_success 'expanded in-memory index matches full index' '
+	init_repos &&
+	test_sparse_match test-tool read-cache --expand --table
+'
+
 test_expect_success 'status with options' '
 	init_repos &&
 	test_all_match git status --porcelain=v2 &&
-- 
gitgitgadget



* [PATCH v2 09/20] unpack-trees: ensure full index
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (7 preceding siblings ...)
  2021-03-10 19:30   ` [PATCH v2 08/20] test-tool: don't force full index Derrick Stolee via GitGitGadget
@ 2021-03-10 19:30   ` Derrick Stolee via GitGitGadget
  2021-03-10 19:30   ` [PATCH v2 10/20] sparse-checkout: hold pattern list in index Derrick Stolee via GitGitGadget
                     ` (12 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The next change will translate full indexes into sparse indexes at write
time. The existing logic provides a way for every sparse index to be
expanded to a full index at read time. However, there are cases where an
index is written and then continues to be used in-memory to perform
further updates.

unpack_trees() is frequently called after such a write. In particular,
commands like 'git reset' do this double-update of the index.

Ensure that we have a full index when entering unpack_trees(), but only
when command_requires_full_index is true. This is always true at the
moment, but we will later relax that after unpack_trees() is updated to
handle sparse directory entries.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 unpack-trees.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/unpack-trees.c b/unpack-trees.c
index f5f668f532d8..4dd99219073a 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -1567,6 +1567,7 @@ static int verify_absent(const struct cache_entry *,
  */
 int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options *o)
 {
+	struct repository *repo = the_repository;
 	int i, ret;
 	static struct cache_entry *dfc;
 	struct pattern_list pl;
@@ -1578,6 +1579,12 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
 	trace_performance_enter();
 	trace2_region_enter("unpack_trees", "unpack_trees", the_repository);
 
+	prepare_repo_settings(repo);
+	if (repo->settings.command_requires_full_index) {
+		ensure_full_index(o->src_index);
+		ensure_full_index(o->dst_index);
+	}
+
 	if (!core_apply_sparse_checkout || !o->update)
 		o->skip_sparse_checkout = 1;
 	if (!o->skip_sparse_checkout && !o->pl) {
-- 
gitgitgadget



* [PATCH v2 10/20] sparse-checkout: hold pattern list in index
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (8 preceding siblings ...)
  2021-03-10 19:30   ` [PATCH v2 09/20] unpack-trees: ensure " Derrick Stolee via GitGitGadget
@ 2021-03-10 19:30   ` Derrick Stolee via GitGitGadget
  2021-03-10 19:30   ` [PATCH v2 11/20] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
                     ` (11 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

As we modify the sparse-checkout definition, we perform index operations
on a pattern_list that only exists in-memory. This allows easy backing
out in case the index update fails.

However, if the index write itself cares about the sparse-checkout
pattern set, we need access to that in-memory copy. Place a pointer to
a 'struct pattern_list' in the index so we can access this on-demand.
This will be used in the next change which uses the sparse-checkout
definition to filter out directories that are outside the sparse cone.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/sparse-checkout.c | 17 ++++++++++-------
 cache.h                   |  2 ++
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
index 2306a9ad98e0..e00b82af727b 100644
--- a/builtin/sparse-checkout.c
+++ b/builtin/sparse-checkout.c
@@ -110,6 +110,8 @@ static int update_working_directory(struct pattern_list *pl)
 	if (is_index_unborn(r->index))
 		return UPDATE_SPARSITY_SUCCESS;
 
+	r->index->sparse_checkout_patterns = pl;
+
 	memset(&o, 0, sizeof(o));
 	o.verbose_update = isatty(2);
 	o.update = 1;
@@ -138,6 +140,7 @@ static int update_working_directory(struct pattern_list *pl)
 	else
 		rollback_lock_file(&lock_file);
 
+	r->index->sparse_checkout_patterns = NULL;
 	return result;
 }
 
@@ -517,19 +520,18 @@ static int modify_pattern_list(int argc, const char **argv, enum modify_type m)
 {
 	int result;
 	int changed_config = 0;
-	struct pattern_list pl;
-	memset(&pl, 0, sizeof(pl));
+	struct pattern_list *pl = xcalloc(1, sizeof(*pl));
 
 	switch (m) {
 	case ADD:
 		if (core_sparse_checkout_cone)
-			add_patterns_cone_mode(argc, argv, &pl);
+			add_patterns_cone_mode(argc, argv, pl);
 		else
-			add_patterns_literal(argc, argv, &pl);
+			add_patterns_literal(argc, argv, pl);
 		break;
 
 	case REPLACE:
-		add_patterns_from_input(&pl, argc, argv);
+		add_patterns_from_input(pl, argc, argv);
 		break;
 	}
 
@@ -539,12 +541,13 @@ static int modify_pattern_list(int argc, const char **argv, enum modify_type m)
 		changed_config = 1;
 	}
 
-	result = write_patterns_and_update(&pl);
+	result = write_patterns_and_update(pl);
 
 	if (result && changed_config)
 		set_config(MODE_NO_PATTERNS);
 
-	clear_pattern_list(&pl);
+	clear_pattern_list(pl);
+	free(pl);
 	return result;
 }
 
diff --git a/cache.h b/cache.h
index 1f0b42264606..303411726e10 100644
--- a/cache.h
+++ b/cache.h
@@ -307,6 +307,7 @@ static inline unsigned int canon_mode(unsigned int mode)
 struct split_index;
 struct untracked_cache;
 struct progress;
+struct pattern_list;
 
 struct index_state {
 	struct cache_entry **cache;
@@ -338,6 +339,7 @@ struct index_state {
 	struct mem_pool *ce_mem_pool;
 	struct progress *progress;
 	struct repository *repo;
+	struct pattern_list *sparse_checkout_patterns;
 };
 
 /* Name hashing */
-- 
gitgitgadget



* [PATCH v2 11/20] sparse-index: convert from full to sparse
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (9 preceding siblings ...)
  2021-03-10 19:30   ` [PATCH v2 10/20] sparse-checkout: hold pattern list in index Derrick Stolee via GitGitGadget
@ 2021-03-10 19:30   ` Derrick Stolee via GitGitGadget
  2021-03-10 23:44     ` Elijah Newren
  2021-03-10 19:30   ` [PATCH v2 12/20] submodule: sparse-index should not collapse links Derrick Stolee via GitGitGadget
                     ` (10 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

If we have a full index, then we can convert it to a sparse index by
replacing directories outside of the sparse cone with sparse directory
entries. The convert_to_sparse() method does this, when the situation is
appropriate.

For now, we avoid converting the index to a sparse index if:

 1. the index is split.
 2. the index is already sparse.
 3. sparse-checkout is disabled.
 4. sparse-checkout does not use cone mode.

Finally, we currently limit the conversion to when the
GIT_TEST_SPARSE_INDEX environment variable is enabled. A mode using Git
config will be added in a later change.

The trickiest thing about this conversion is that we might not be able
to mark a directory as a sparse directory just because it is outside the
sparse cone. There might be unmerged files within that directory, so we
need to look for those. Also, if there is some strange reason why a file
is not marked with CE_SKIP_WORKTREE, then we should give up on
converting that directory. There is still hope that some of its
subdirectories might be able to convert to sparse, so we keep looking
deeper.
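
The collapse condition described above can be sketched in isolation. This
is a minimal illustration, not the code from the patch: 'struct toy_entry'
and toy_range_can_collapse() are hypothetical stand-ins for
'struct cache_entry' and the first loop in convert_to_sparse_rec(), and the
sparse-cone pattern check is assumed to have been done already.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for Git's struct cache_entry. */
struct toy_entry {
	const char *name;
	int stage;         /* nonzero means the entry is unmerged */
	int skip_worktree; /* stands in for the CE_SKIP_WORKTREE flag */
};

/*
 * Return 1 if every entry in [start, end) is merged and marked
 * skip-worktree, so the range could be replaced by a single sparse
 * directory entry. Return 0 as soon as one entry disqualifies it.
 */
static int toy_range_can_collapse(const struct toy_entry *entries,
				  size_t start, size_t end)
{
	size_t i;

	assert(start <= end);
	for (i = start; i < end; i++) {
		if (entries[i].stage || !entries[i].skip_worktree)
			return 0;
	}
	return 1;
}
```

When a range fails this check, the recursion would apply the same test to
each subdirectory range within it, which is how a subdirectory can still
collapse even though its parent cannot.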

The conversion process is assisted by the cache-tree extension. This is
calculated from the full index if it does not already exist. We then
abandon the cache-tree as it no longer applies to the newly-sparse
index. Thus, this cache-tree will be recalculated in every
sparse-full-sparse round-trip until we integrate the cache-tree
extension with the sparse index.

Some Git commands use the index after writing it. For example, 'git add'
will update the index, then write it to disk, then read its entries to
report information. To keep the in-memory index in a full state after
writing, we re-expand it to a full one after the write. This is wasteful
for commands that only write the index and do not read from it again,
but that is only the case until we make those commands "sparse aware."
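
The write-then-re-expand pattern can be sketched as follows. This is a toy
model under stated assumptions: 'struct toy_index' and the toy_* helpers
are hypothetical, and the real conversion and expansion are reduced to
flipping a flag. Only the 'was_full' bookkeeping mirrors the patch.

```c
#include <assert.h>

/* Hypothetical model of the in-memory index state. */
struct toy_index {
	int sparse;              /* 1 if the in-memory index is sparse */
	int last_written_sparse; /* format used by the most recent write */
	int writes;              /* number of writes performed */
};

static void toy_convert_to_sparse(struct toy_index *idx) { idx->sparse = 1; }
static void toy_ensure_full(struct toy_index *idx) { idx->sparse = 0; }

/*
 * Write the index in sparse form, then restore the in-memory state the
 * caller started with, so that later reads still see a full index.
 */
static int toy_write_index(struct toy_index *idx)
{
	int was_full;

	assert(idx);
	was_full = !idx->sparse;

	toy_convert_to_sparse(idx);
	idx->last_written_sparse = idx->sparse; /* "on-disk" format */
	idx->writes++;

	if (was_full)
		toy_ensure_full(idx);
	return 0;
}
```

An index that was already sparse before the write stays sparse afterwards,
so the extra expansion cost is only paid by callers that held a full index.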

We can compare the behavior of the sparse-index in
t1092-sparse-checkout-compatibility.sh by using GIT_TEST_SPARSE_INDEX=1
when operating on the 'sparse-index' repo. We can also compare the two
sparse repos directly, such as comparing their indexes (when expanded to
full in the case of the 'sparse-index' repo). We also verify that the
index is actually populated with sparse directory entries.

The 'checkout and reset (mixed)' test is marked for failure when
comparing a sparse repo to a full repo, but we can compare the two
sparse-checkout cases directly to ensure that we are not changing the
behavior when using a sparse index.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache-tree.c                             |   3 +
 cache.h                                  |   2 +
 read-cache.c                             |  26 ++++-
 sparse-index.c                           | 139 +++++++++++++++++++++++
 sparse-index.h                           |   1 +
 t/t1092-sparse-checkout-compatibility.sh |  61 +++++++++-
 6 files changed, 227 insertions(+), 5 deletions(-)

diff --git a/cache-tree.c b/cache-tree.c
index 2fb483d3c083..5f07a39e501e 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -6,6 +6,7 @@
 #include "object-store.h"
 #include "replace-object.h"
 #include "promisor-remote.h"
+#include "sparse-index.h"
 
 #ifndef DEBUG_CACHE_TREE
 #define DEBUG_CACHE_TREE 0
@@ -442,6 +443,8 @@ int cache_tree_update(struct index_state *istate, int flags)
 	if (i)
 		return i;
 
+	ensure_full_index(istate);
+
 	if (!istate->cache_tree)
 		istate->cache_tree = cache_tree();
 
diff --git a/cache.h b/cache.h
index 303411726e10..9217d405b9b8 100644
--- a/cache.h
+++ b/cache.h
@@ -251,6 +251,8 @@ static inline unsigned int create_ce_mode(unsigned int mode)
 {
 	if (S_ISLNK(mode))
 		return S_IFLNK;
+	if (mode == S_IFDIR)
+		return S_IFDIR;
 	if (S_ISDIR(mode) || S_ISGITLINK(mode))
 		return S_IFGITLINK;
 	return S_IFREG | ce_permissions(mode);
diff --git a/read-cache.c b/read-cache.c
index 97dbf2434f30..92126b9d23c9 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -25,6 +25,7 @@
 #include "fsmonitor.h"
 #include "thread-utils.h"
 #include "progress.h"
+#include "sparse-index.h"
 
 /* Mask for the name length in ce_flags in the on-disk index */
 
@@ -1002,8 +1003,14 @@ int verify_path(const char *path, unsigned mode)
 
 			c = *path++;
 			if ((c == '.' && !verify_dotfile(path, mode)) ||
-			    is_dir_sep(c) || c == '\0')
+			    is_dir_sep(c))
 				return 0;
+			/*
+			 * allow terminating directory separators for
+			 * sparse directory entries.
+			 */
+			if (c == '\0')
+				return S_ISDIR(mode);
 		} else if (c == '\\' && protect_ntfs) {
 			if (is_ntfs_dotgit(path))
 				return 0;
@@ -3061,6 +3068,14 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l
 				 unsigned flags)
 {
 	int ret;
+	int was_full = !istate->sparse_index;
+
+	ret = convert_to_sparse(istate);
+
+	if (ret) {
+		warning(_("failed to convert to a sparse-index"));
+		return ret;
+	}
 
 	/*
 	 * TODO trace2: replace "the_repository" with the actual repo instance
@@ -3072,6 +3087,9 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l
 	trace2_region_leave_printf("index", "do_write_index", the_repository,
 				   "%s", get_lock_file_path(lock));
 
+	if (was_full)
+		ensure_full_index(istate);
+
 	if (ret)
 		return ret;
 	if (flags & COMMIT_LOCK)
@@ -3162,9 +3180,10 @@ static int write_shared_index(struct index_state *istate,
 			      struct tempfile **temp)
 {
 	struct split_index *si = istate->split_index;
-	int ret;
+	int ret, was_full = !istate->sparse_index;
 
 	move_cache_to_base_index(istate);
+	convert_to_sparse(istate);
 
 	trace2_region_enter_printf("index", "shared/do_write_index",
 				   the_repository, "%s", get_tempfile_path(*temp));
@@ -3172,6 +3191,9 @@ static int write_shared_index(struct index_state *istate,
 	trace2_region_leave_printf("index", "shared/do_write_index",
 				   the_repository, "%s", get_tempfile_path(*temp));
 
+	if (was_full)
+		ensure_full_index(istate);
+
 	if (ret)
 		return ret;
 	ret = adjust_shared_perm(get_tempfile_path(*temp));
diff --git a/sparse-index.c b/sparse-index.c
index 316cb949b74b..5eb561259bb1 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -4,6 +4,145 @@
 #include "tree.h"
 #include "pathspec.h"
 #include "trace2.h"
+#include "cache-tree.h"
+#include "config.h"
+#include "dir.h"
+#include "fsmonitor.h"
+
+static struct cache_entry *construct_sparse_dir_entry(
+				struct index_state *istate,
+				const char *sparse_dir,
+				struct cache_tree *tree)
+{
+	struct cache_entry *de;
+
+	de = make_cache_entry(istate, S_IFDIR, &tree->oid, sparse_dir, 0, 0);
+
+	de->ce_flags |= CE_SKIP_WORKTREE;
+	return de;
+}
+
+/*
+ * Returns the number of entries "inserted" into the index.
+ */
+static int convert_to_sparse_rec(struct index_state *istate,
+				 int num_converted,
+				 int start, int end,
+				 const char *ct_path, size_t ct_pathlen,
+				 struct cache_tree *ct)
+{
+	int i, can_convert = 1;
+	int start_converted = num_converted;
+	enum pattern_match_result match;
+	int dtype;
+	struct strbuf child_path = STRBUF_INIT;
+	struct pattern_list *pl = istate->sparse_checkout_patterns;
+
+	/*
+	 * Is the current path outside of the sparse cone?
+	 * Then check if the region can be replaced by a sparse
+	 * directory entry (everything is sparse and merged).
+	 */
+	match = path_matches_pattern_list(ct_path, ct_pathlen,
+					  NULL, &dtype, pl, istate);
+	if (match != NOT_MATCHED)
+		can_convert = 0;
+
+	for (i = start; can_convert && i < end; i++) {
+		struct cache_entry *ce = istate->cache[i];
+
+		if (ce_stage(ce) ||
+		    !(ce->ce_flags & CE_SKIP_WORKTREE))
+			can_convert = 0;
+	}
+
+	if (can_convert) {
+		struct cache_entry *se;
+		se = construct_sparse_dir_entry(istate, ct_path, ct);
+
+		istate->cache[num_converted++] = se;
+		return 1;
+	}
+
+	for (i = start; i < end; ) {
+		int count, span, pos = -1;
+		const char *base, *slash;
+		struct cache_entry *ce = istate->cache[i];
+
+		/*
+		 * Detect if this is a normal entry outside of any subtree
+		 * entry.
+		 */
+		base = ce->name + ct_pathlen;
+		slash = strchr(base, '/');
+
+		if (slash)
+			pos = cache_tree_subtree_pos(ct, base, slash - base);
+
+		if (pos < 0) {
+			istate->cache[num_converted++] = ce;
+			i++;
+			continue;
+		}
+
+		strbuf_setlen(&child_path, 0);
+		strbuf_add(&child_path, ce->name, slash - ce->name + 1);
+
+		span = ct->down[pos]->cache_tree->entry_count;
+		count = convert_to_sparse_rec(istate,
+					      num_converted, i, i + span,
+					      child_path.buf, child_path.len,
+					      ct->down[pos]->cache_tree);
+		num_converted += count;
+		i += span;
+	}
+
+	strbuf_release(&child_path);
+	return num_converted - start_converted;
+}
+
+int convert_to_sparse(struct index_state *istate)
+{
+	if (istate->split_index || istate->sparse_index ||
+	    !core_apply_sparse_checkout || !core_sparse_checkout_cone)
+		return 0;
+
+	/*
+	 * For now, only create a sparse index with the
+	 * GIT_TEST_SPARSE_INDEX environment variable. We will relax
+	 * this once we have a proper way to opt-in (and later still,
+	 * opt-out).
+	 */
+	if (!git_env_bool("GIT_TEST_SPARSE_INDEX", 0))
+		return 0;
+
+	if (!istate->sparse_checkout_patterns) {
+		istate->sparse_checkout_patterns = xcalloc(1, sizeof(struct pattern_list));
+		if (get_sparse_checkout_patterns(istate->sparse_checkout_patterns) < 0)
+			return 0;
+	}
+
+	if (!istate->sparse_checkout_patterns->use_cone_patterns) {
+		warning(_("attempting to use sparse-index without cone mode"));
+		return -1;
+	}
+
+	if (cache_tree_update(istate, 0)) {
+		warning(_("unable to update cache-tree, staying full"));
+		return -1;
+	}
+
+	remove_fsmonitor(istate);
+
+	trace2_region_enter("index", "convert_to_sparse", istate->repo);
+	istate->cache_nr = convert_to_sparse_rec(istate,
+						 0, 0, istate->cache_nr,
+						 "", 0, istate->cache_tree);
+	istate->drop_cache_tree = 1;
+	istate->sparse_index = 1;
+	trace2_region_leave("index", "convert_to_sparse", istate->repo);
+	return 0;
+}
 
 static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
 {
diff --git a/sparse-index.h b/sparse-index.h
index 09a20d036c46..64380e121d80 100644
--- a/sparse-index.h
+++ b/sparse-index.h
@@ -3,5 +3,6 @@
 
 struct index_state;
 void ensure_full_index(struct index_state *istate);
+int convert_to_sparse(struct index_state *istate);
 
 #endif
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 4d789fe86b9d..ca87033d30b0 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -2,6 +2,9 @@
 
 test_description='compare full workdir to sparse workdir'
 
+GIT_TEST_CHECK_CACHE_TREE=0
+GIT_TEST_SPLIT_INDEX=0
+
 . ./test-lib.sh
 
 test_expect_success 'setup' '
@@ -121,15 +124,49 @@ run_on_all () {
 test_all_match () {
 	run_on_all "$@" &&
 	test_cmp full-checkout-out sparse-checkout-out &&
-	test_cmp full-checkout-err sparse-checkout-err
+	test_cmp full-checkout-out sparse-index-out &&
+	test_cmp full-checkout-err sparse-checkout-err &&
+	test_cmp full-checkout-err sparse-index-err
 }
 
 test_sparse_match () {
-	run_on_sparse $* &&
+	run_on_sparse "$@" &&
 	test_cmp sparse-checkout-out sparse-index-out &&
 	test_cmp sparse-checkout-err sparse-index-err
 }
 
+test_expect_success 'sparse-index contents' '
+	init_repos &&
+
+	test-tool -C sparse-index read-cache --table >cache &&
+	for dir in folder1 folder2 x
+	do
+		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
+		grep "040000 tree $TREE	$dir/" cache \
+			|| return 1
+	done &&
+
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set folder1 &&
+
+	test-tool -C sparse-index read-cache --table >cache &&
+	for dir in deep folder2 x
+	do
+		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
+		grep "040000 tree $TREE	$dir/" cache \
+			|| return 1
+	done &&
+
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep/deeper1 &&
+
+	test-tool -C sparse-index read-cache --table >cache &&
+	for dir in deep/deeper2 folder1 folder2 x
+	do
+		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
+		grep "040000 tree $TREE	$dir/" cache \
+			|| return 1
+	done
+'
+
 test_expect_success 'expanded in-memory index matches full index' '
 	init_repos &&
 	test_sparse_match test-tool read-cache --expand --table
@@ -137,6 +174,7 @@ test_expect_success 'expanded in-memory index matches full index' '
 
 test_expect_success 'status with options' '
 	init_repos &&
+	test_sparse_match ls &&
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git status --porcelain=v2 -z -u &&
 	test_all_match git status --porcelain=v2 -uno &&
@@ -273,6 +311,17 @@ test_expect_failure 'checkout and reset (mixed)' '
 	test_all_match git reset update-folder2
 '
 
+# Ensure that sparse-index behaves identically to
+# sparse-checkout with a full index.
+test_expect_success 'checkout and reset (mixed) [sparse]' '
+	init_repos &&
+
+	test_sparse_match git checkout -b reset-test update-deep &&
+	test_sparse_match git reset deepest &&
+	test_sparse_match git reset update-folder1 &&
+	test_sparse_match git reset update-folder2
+'
+
 test_expect_success 'merge' '
 	init_repos &&
 
@@ -309,14 +358,20 @@ test_expect_success 'clean' '
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git clean -f &&
 	test_all_match git status --porcelain=v2 &&
+	test_sparse_match ls &&
+	test_sparse_match ls folder1 &&
 
 	test_all_match git clean -xf &&
 	test_all_match git status --porcelain=v2 &&
+	test_sparse_match ls &&
+	test_sparse_match ls folder1 &&
 
 	test_all_match git clean -xdf &&
 	test_all_match git status --porcelain=v2 &&
+	test_sparse_match ls &&
+	test_sparse_match ls folder1 &&
 
-	test_path_is_dir sparse-checkout/folder1
+	test_sparse_match test_path_is_dir folder1
 '
 
 test_done
-- 
gitgitgadget



* [PATCH v2 12/20] submodule: sparse-index should not collapse links
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (10 preceding siblings ...)
  2021-03-10 19:30   ` [PATCH v2 11/20] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
@ 2021-03-10 19:30   ` Derrick Stolee via GitGitGadget
  2021-03-10 19:30   ` [PATCH v2 13/20] unpack-trees: allow sparse directories Derrick Stolee via GitGitGadget
                     ` (9 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

A submodule is stored as a "Git link" that actually points to a commit
within a submodule. Submodules are populated or not depending on
submodule configuration, not sparse-checkout. To ensure that the
sparse-index feature integrates correctly with submodules, we should not
collapse a directory if there is a Git link within its range.
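
As a rough illustration of the extra condition, a Git link is recognized
purely by its mode bits (0160000, shown as "160000 commit" in the test
output below). The TOY_* macros and function names here are hypothetical;
they only mimic what S_ISGITLINK() checks.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_IFMT      0170000 /* file-type bits of a mode */
#define TOY_IFGITLINK 0160000 /* mode Git uses for submodule links */

/* Return 1 if the mode describes a Git link (submodule). */
static int toy_is_gitlink(unsigned int mode)
{
	return (mode & TOY_IFMT) == TOY_IFGITLINK;
}

/* Return 1 if any of the n modes is a Git link. */
static int toy_range_has_gitlink(const unsigned int *modes, size_t n)
{
	size_t i;

	assert(modes || !n);
	for (i = 0; i < n; i++) {
		if (toy_is_gitlink(modes[i]))
			return 1;
	}
	return 0;
}
```

A directory range for which toy_range_has_gitlink() returns 1 would be
left expanded, which is why "modules/a" remains a blob entry in the test.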

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 sparse-index.c                           |  1 +
 t/t1092-sparse-checkout-compatibility.sh | 17 +++++++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/sparse-index.c b/sparse-index.c
index 5eb561259bb1..36b4dde7eeda 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -52,6 +52,7 @@ static int convert_to_sparse_rec(struct index_state *istate,
 		struct cache_entry *ce = istate->cache[i];
 
 		if (ce_stage(ce) ||
+		    S_ISGITLINK(ce->ce_mode) ||
 		    !(ce->ce_flags & CE_SKIP_WORKTREE))
 			can_convert = 0;
 	}
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index ca87033d30b0..b38fab6455d9 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -374,4 +374,21 @@ test_expect_success 'clean' '
 	test_sparse_match test_path_is_dir folder1
 '
 
+test_expect_success 'submodule handling' '
+	init_repos &&
+
+	test_all_match mkdir modules &&
+	test_all_match touch modules/a &&
+	test_all_match git add modules &&
+	test_all_match git commit -m "add modules directory" &&
+
+	run_on_all git submodule add "$(pwd)/initial-repo" modules/sub &&
+	test_all_match git commit -m "add submodule" &&
+
+	# having a submodule prevents "modules" from collapse
+	test-tool -C sparse-index read-cache --table >cache &&
+	grep "100644 blob .*	modules/a" cache &&
+	grep "160000 commit $(git -C initial-repo rev-parse HEAD)	modules/sub" cache
+'
+
 test_done
-- 
gitgitgadget



* [PATCH v2 13/20] unpack-trees: allow sparse directories
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (11 preceding siblings ...)
  2021-03-10 19:30   ` [PATCH v2 12/20] submodule: sparse-index should not collapse links Derrick Stolee via GitGitGadget
@ 2021-03-10 19:30   ` Derrick Stolee via GitGitGadget
  2021-03-10 19:30   ` [PATCH v2 14/20] sparse-index: check index conversion happens Derrick Stolee via GitGitGadget
                     ` (8 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The index_pos_by_traverse_info() method currently throws a BUG() when a
directory entry exists exactly in the index. We need to consider that it
is possible to have a directory in a sparse index as long as that entry
is itself marked with the skip-worktree bit.

The 'pos' variable is assigned a negative value if an exact match is not
found. Since a directory name can be an exact match, it is no longer an
error to have a nonnegative 'pos' value.
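
The return-value convention can be sketched with a plain binary search.
This is an assumption-laden toy: toy_name_pos() compares names with
strcmp() only, whereas the real index_name_pos() also accounts for name
length and stage. The "-pos - 1" decoding is the part taken from the patch.

```c
#include <assert.h>
#include <string.h>

/*
 * Search a sorted array of names. Return the position of an exact match,
 * or -(insertion position) - 1 if the name is not present.
 */
static int toy_name_pos(const char **names, int nr, const char *name)
{
	int lo = 0, hi = nr;

	assert(nr >= 0);
	while (lo < hi) {
		int mid = lo + (hi - lo) / 2;
		int cmp = strcmp(name, names[mid]);

		if (!cmp)
			return mid;
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return -lo - 1;
}

/*
 * Decode a result: an exact match is used as-is (a sparse directory
 * entry), otherwise recover the insertion position.
 */
static int toy_first_pos(int pos)
{
	return pos >= 0 ? pos : -pos - 1;
}
```

With a sparse index, a lookup of "deep/" can hit an entry exactly, while a
lookup of "folder1/" in a full index yields the position of the first
entry under that directory.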

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 unpack-trees.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/unpack-trees.c b/unpack-trees.c
index 4dd99219073a..b324eec2a5d1 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -746,9 +746,12 @@ static int index_pos_by_traverse_info(struct name_entry *names,
 	strbuf_make_traverse_path(&name, info, names->path, names->pathlen);
 	strbuf_addch(&name, '/');
 	pos = index_name_pos(o->src_index, name.buf, name.len);
-	if (pos >= 0)
-		BUG("This is a directory and should not exist in index");
-	pos = -pos - 1;
+	if (pos >= 0) {
+		if (!o->src_index->sparse_index ||
+		    !(o->src_index->cache[pos]->ce_flags & CE_SKIP_WORKTREE))
+			BUG("This is a directory and should not exist in index");
+	} else
+		pos = -pos - 1;
 	if (pos >= o->src_index->cache_nr ||
 	    !starts_with(o->src_index->cache[pos]->name, name.buf) ||
 	    (pos > 0 && starts_with(o->src_index->cache[pos-1]->name, name.buf)))
-- 
gitgitgadget



* [PATCH v2 14/20] sparse-index: check index conversion happens
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (12 preceding siblings ...)
  2021-03-10 19:30   ` [PATCH v2 13/20] unpack-trees: allow sparse directories Derrick Stolee via GitGitGadget
@ 2021-03-10 19:30   ` Derrick Stolee via GitGitGadget
  2021-03-10 19:30   ` [PATCH v2 15/20] sparse-index: create extension for compatibility Derrick Stolee via GitGitGadget
                     ` (7 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add a test case that uses test_region to ensure that we are truly
expanding a sparse index to a full one, then converting back to sparse
when writing the index. As we integrate more Git commands with the
sparse index, we will convert these commands to check that we do _not_
convert the sparse index to a full index and instead stay sparse the
entire time.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t1092-sparse-checkout-compatibility.sh | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index b38fab6455d9..bfc9e28ef0e1 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -391,4 +391,22 @@ test_expect_success 'submodule handling' '
 	grep "160000 commit $(git -C initial-repo rev-parse HEAD)	modules/sub" cache
 '
 
+test_expect_success 'sparse-index is expanded and converted back' '
+	init_repos &&
+
+	(
+		GIT_TEST_SPARSE_INDEX=1 &&
+		export GIT_TEST_SPARSE_INDEX &&
+		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+			git -C sparse-index -c core.fsmonitor="" reset --hard &&
+		test_region index convert_to_sparse trace2.txt &&
+		test_region index ensure_full_index trace2.txt &&
+
+		rm trace2.txt &&
+		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+			git -C sparse-index -c core.fsmonitor="" status -uno &&
+		test_region index ensure_full_index trace2.txt
+	)
+'
+
 test_done
-- 
gitgitgadget



* [PATCH v2 15/20] sparse-index: create extension for compatibility
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (13 preceding siblings ...)
  2021-03-10 19:30   ` [PATCH v2 14/20] sparse-index: check index conversion happens Derrick Stolee via GitGitGadget
@ 2021-03-10 19:30   ` Derrick Stolee via GitGitGadget
  2021-03-10 19:30   ` [PATCH v2 16/20] sparse-checkout: toggle sparse index from builtin Derrick Stolee via GitGitGadget
                     ` (6 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Previously, we enabled the sparse index format only using
GIT_TEST_SPARSE_INDEX=1. This is not a feasible direction for users to
actually select this mode. Further, sparse directory entries are not
understood by the index formats as advertised.

We _could_ add a new index version that explicitly adds these
capabilities, but there are nuances to index formats 2, 3, and 4 that
are still valuable to select as options. Until we add index format
version 5, create a repo extension, "extensions.sparseIndex", that
specifies that the tool reading this repository must understand sparse
directory entries.
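
To illustrate why a repository extension provides this guarantee, here is
a simplified sketch of extension handling. The toy_* names are
hypothetical, and the rule that a reader refuses to proceed when it sees an
extension it does not know is modeled by toy_can_open(); the real logic
lives in setup.c and is more involved.

```c
#include <assert.h>
#include <string.h>

enum toy_extension_result {
	TOY_EXTENSION_UNKNOWN = 0,
	TOY_EXTENSION_OK = 1
};

/* Hypothetical subset of the repository format data. */
struct toy_repo_format {
	int sparse_index;  /* set when extensions.sparseIndex is seen */
	int unknown_count; /* number of extensions not understood */
};

/* Record one "extensions.<ext>" key found in the config. */
static enum toy_extension_result toy_handle_extension(const char *ext,
						      struct toy_repo_format *fmt)
{
	assert(ext && fmt);
	if (!strcmp(ext, "sparseindex")) {
		fmt->sparse_index = 1;
		return TOY_EXTENSION_OK;
	}
	fmt->unknown_count++;
	return TOY_EXTENSION_UNKNOWN;
}

/* A reader may only operate if it understood every extension. */
static int toy_can_open(const struct toy_repo_format *fmt)
{
	return fmt->unknown_count == 0;
}
```

A reader without the "sparseindex" branch would count the extension as
unknown and decline to open the repository, rather than misread sparse
directory entries.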

This change only encodes the extension and enables it when
GIT_TEST_SPARSE_INDEX=1. Later, we will add a more user-friendly CLI
mechanism.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config/extensions.txt |  8 ++++++
 cache.h                             |  1 +
 repo-settings.c                     |  7 ++++++
 repository.h                        |  3 ++-
 setup.c                             |  3 +++
 sparse-index.c                      | 38 +++++++++++++++++++++++++----
 6 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/Documentation/config/extensions.txt b/Documentation/config/extensions.txt
index 4e23d73cdcad..c02e09af0046 100644
--- a/Documentation/config/extensions.txt
+++ b/Documentation/config/extensions.txt
@@ -6,3 +6,11 @@ extensions.objectFormat::
 Note that this setting should only be set by linkgit:git-init[1] or
 linkgit:git-clone[1].  Trying to change it after initialization will not
 work and will produce hard-to-diagnose issues.
+
+extensions.sparseIndex::
+	When combined with `core.sparseCheckout=true` and
+	`core.sparseCheckoutCone=true`, the index may contain entries
+	corresponding to directories outside of the sparse-checkout
+	definition in lieu of containing each path under such directories.
+	Versions of Git that do not understand this extension do not
+	expect directory entries in the index.
diff --git a/cache.h b/cache.h
index 9217d405b9b8..03f931c5f34d 100644
--- a/cache.h
+++ b/cache.h
@@ -1059,6 +1059,7 @@ struct repository_format {
 	int worktree_config;
 	int is_bare;
 	int hash_algo;
+	int sparse_index;
 	char *work_tree;
 	struct string_list unknown_extensions;
 	struct string_list v1_only_extensions;
diff --git a/repo-settings.c b/repo-settings.c
index d63569e4041e..9677d50f9238 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -85,4 +85,11 @@ void prepare_repo_settings(struct repository *r)
 	 * removed.
 	 */
 	r->settings.command_requires_full_index = 1;
+
+	/*
+	 * Initialize this as off.
+	 */
+	r->settings.sparse_index = 0;
+	if (!repo_config_get_bool(r, "extensions.sparseindex", &value) && value)
+		r->settings.sparse_index = 1;
 }
diff --git a/repository.h b/repository.h
index e06a23015697..a45f7520fd9e 100644
--- a/repository.h
+++ b/repository.h
@@ -42,7 +42,8 @@ struct repo_settings {
 
 	int core_multi_pack_index;
 
-	unsigned command_requires_full_index:1;
+	unsigned command_requires_full_index:1,
+		 sparse_index:1;
 };
 
 struct repository {
diff --git a/setup.c b/setup.c
index c04cd25a30df..cd8394564613 100644
--- a/setup.c
+++ b/setup.c
@@ -500,6 +500,9 @@ static enum extension_result handle_extension(const char *var,
 			return error("invalid value for 'extensions.objectformat'");
 		data->hash_algo = format;
 		return EXTENSION_OK;
+	} else if (!strcmp(ext, "sparseindex")) {
+		data->sparse_index = 1;
+		return EXTENSION_OK;
 	}
 	return EXTENSION_UNKNOWN;
 }
diff --git a/sparse-index.c b/sparse-index.c
index 36b4dde7eeda..b9c14ef7ab50 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -102,19 +102,47 @@ static int convert_to_sparse_rec(struct index_state *istate,
 	return num_converted - start_converted;
 }
 
+static int enable_sparse_index(struct repository *repo)
+{
+	const char *config_path = repo_git_path(repo, "config.worktree");
+
+	if (upgrade_repository_format(1) < 0) {
+		warning(_("unable to upgrade repository format to enable sparse-index"));
+		return -1;
+	}
+	git_config_set_in_file_gently(config_path,
+				      "extensions.sparseIndex",
+				      "true");
+
+	prepare_repo_settings(repo);
+	repo->settings.sparse_index = 1;
+	return 0;
+}
+
 int convert_to_sparse(struct index_state *istate)
 {
 	if (istate->split_index || istate->sparse_index ||
 	    !core_apply_sparse_checkout || !core_sparse_checkout_cone)
 		return 0;
 
+	if (!istate->repo)
+		istate->repo = the_repository;
+
+	/*
+	 * The GIT_TEST_SPARSE_INDEX environment variable triggers the
+	 * extensions.sparseIndex config variable to be on.
+	 */
+	if (git_env_bool("GIT_TEST_SPARSE_INDEX", 0)) {
+		int err = enable_sparse_index(istate->repo);
+		if (err < 0)
+			return err;
+	}
+
 	/*
-	 * For now, only create a sparse index with the
-	 * GIT_TEST_SPARSE_INDEX environment variable. We will relax
-	 * this once we have a proper way to opt-in (and later still,
-	 * opt-out).
+	 * Only convert to sparse if extensions.sparseIndex is set.
 	 */
-	if (!git_env_bool("GIT_TEST_SPARSE_INDEX", 0))
+	prepare_repo_settings(istate->repo);
+	if (!istate->repo->settings.sparse_index)
 		return 0;
 
 	if (!istate->sparse_checkout_patterns) {
-- 
gitgitgadget



* [PATCH v2 16/20] sparse-checkout: toggle sparse index from builtin
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (14 preceding siblings ...)
  2021-03-10 19:30   ` [PATCH v2 15/20] sparse-index: create extension for compatibility Derrick Stolee via GitGitGadget
@ 2021-03-10 19:30   ` Derrick Stolee via GitGitGadget
  2021-03-10 19:31   ` [PATCH v2 17/20] sparse-checkout: disable sparse-index Derrick Stolee via GitGitGadget
                     ` (5 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The sparse index extension is used to signal that index writes should be
in sparse mode. Until now, it could only be enabled using
GIT_TEST_SPARSE_INDEX=1.

Add a '--[no-]sparse-index' option to 'git sparse-checkout init' that
specifies if the sparse index should be used. It also updates the index
to use the correct format, either way. Add a warning in the
documentation that the use of a repository extension might reduce
compatibility with third-party tools. 'git sparse-checkout init' already
sets extensions.worktreeConfig, which places most sparse-checkout users
outside of the scope of most third-party tools.
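
The option is effectively tri-state: unspecified, enabled, or disabled.
The sketch below shows that idea with hand-rolled argument scanning; it
does not use Git's parse_options() API, and toy_parse_sparse_index_flag()
is a hypothetical name. The "-1 means not specified" convention matches
how the patch initializes init_opts.sparse_index.

```c
#include <assert.h>
#include <string.h>

/*
 * Return -1 if neither flag is present, 1 for --sparse-index, and 0 for
 * --no-sparse-index. If both appear, the last one wins.
 */
static int toy_parse_sparse_index_flag(int argc, const char **argv)
{
	int value = -1;
	int i;

	assert(argc >= 0);
	for (i = 0; i < argc; i++) {
		if (!strcmp(argv[i], "--sparse-index"))
			value = 1;
		else if (!strcmp(argv[i], "--no-sparse-index"))
			value = 0;
	}
	return value;
}
```

A caller would touch the extension config and force an index rewrite only
when the result is >= 0, leaving existing settings alone otherwise.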

Update t1092-sparse-checkout-compatibility.sh to use this CLI instead of
GIT_TEST_SPARSE_INDEX=1.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-sparse-checkout.txt    | 14 +++++++++
 builtin/sparse-checkout.c                | 17 ++++++++++-
 sparse-index.c                           | 37 +++++++++++++++--------
 sparse-index.h                           |  3 ++
 t/t1092-sparse-checkout-compatibility.sh | 38 +++++++++++-------------
 5 files changed, 76 insertions(+), 33 deletions(-)

diff --git a/Documentation/git-sparse-checkout.txt b/Documentation/git-sparse-checkout.txt
index a0eeaeb02ee3..4a8343cf7fa4 100644
--- a/Documentation/git-sparse-checkout.txt
+++ b/Documentation/git-sparse-checkout.txt
@@ -45,6 +45,20 @@ To avoid interfering with other worktrees, it first enables the
 When `--cone` is provided, the `core.sparseCheckoutCone` setting is
 also set, allowing for better performance with a limited set of
 patterns (see 'CONE PATTERN SET' below).
++
+Use the `--[no-]sparse-index` option to toggle the use of the sparse
+index format. This reduces the size of the index to be more closely
+aligned with your sparse-checkout definition. This can have significant
+performance advantages for commands such as `git status` or `git add`.
+This feature is still experimental. Some commands might be slower with
+a sparse index until they are properly integrated with the feature.
++
+**WARNING:** Using a sparse index requires modifying the index in a way
+that is not completely understood by external tools. If you have trouble
+with this compatibility, then run `git sparse-checkout init --no-sparse-index`
+to rewrite your index to not be sparse. Older versions of Git will not
+understand the `sparseIndex` repository extension and may fail to interact
+with your repository until it is disabled.
 
 'set'::
 	Write a set of patterns to the sparse-checkout file, as given as
diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
index e00b82af727b..ca63e2c64e95 100644
--- a/builtin/sparse-checkout.c
+++ b/builtin/sparse-checkout.c
@@ -14,6 +14,7 @@
 #include "unpack-trees.h"
 #include "wt-status.h"
 #include "quote.h"
+#include "sparse-index.h"
 
 static const char *empty_base = "";
 
@@ -283,12 +284,13 @@ static int set_config(enum sparse_checkout_mode mode)
 }
 
 static char const * const builtin_sparse_checkout_init_usage[] = {
-	N_("git sparse-checkout init [--cone]"),
+	N_("git sparse-checkout init [--cone] [--[no-]sparse-index]"),
 	NULL
 };
 
 static struct sparse_checkout_init_opts {
 	int cone_mode;
+	int sparse_index;
 } init_opts;
 
 static int sparse_checkout_init(int argc, const char **argv)
@@ -303,11 +305,15 @@ static int sparse_checkout_init(int argc, const char **argv)
 	static struct option builtin_sparse_checkout_init_options[] = {
 		OPT_BOOL(0, "cone", &init_opts.cone_mode,
 			 N_("initialize the sparse-checkout in cone mode")),
+		OPT_BOOL(0, "sparse-index", &init_opts.sparse_index,
+			 N_("toggle the use of a sparse index")),
 		OPT_END(),
 	};
 
 	repo_read_index(the_repository);
 
+	init_opts.sparse_index = -1;
+
 	argc = parse_options(argc, argv, NULL,
 			     builtin_sparse_checkout_init_options,
 			     builtin_sparse_checkout_init_usage, 0);
@@ -326,6 +332,15 @@ static int sparse_checkout_init(int argc, const char **argv)
 	sparse_filename = get_sparse_checkout_filename();
 	res = add_patterns_from_file_to_list(sparse_filename, "", 0, &pl, NULL);
 
+	if (init_opts.sparse_index >= 0) {
+		if (set_sparse_index_config(the_repository, init_opts.sparse_index) < 0)
+			die(_("failed to modify sparse-index config"));
+
+		/* force an index rewrite */
+		repo_read_index(the_repository);
+		the_repository->index->updated_workdir = 1;
+	}
+
 	/* If we already have a sparse-checkout file, use it. */
 	if (res >= 0) {
 		free(sparse_filename);
diff --git a/sparse-index.c b/sparse-index.c
index b9c14ef7ab50..1c84cac255bf 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -104,23 +104,37 @@ static int convert_to_sparse_rec(struct index_state *istate,
 
 static int enable_sparse_index(struct repository *repo)
 {
-	const char *config_path = repo_git_path(repo, "config.worktree");
+	int res;
 
 	if (upgrade_repository_format(1) < 0) {
 		warning(_("unable to upgrade repository format to enable sparse-index"));
 		return -1;
 	}
-	git_config_set_in_file_gently(config_path,
-				      "extensions.sparseIndex",
-				      "true");
+	res = git_config_set_gently("extensions.sparseindex", "true");
 
 	prepare_repo_settings(repo);
 	repo->settings.sparse_index = 1;
-	return 0;
+	return res;
+}
+
+int set_sparse_index_config(struct repository *repo, int enable)
+{
+	int res;
+
+	if (enable)
+		return enable_sparse_index(repo);
+
+	/* Don't downgrade repository format, just remove the extension. */
+	res = git_config_set_gently("extensions.sparseindex", NULL);
+
+	prepare_repo_settings(repo);
+	repo->settings.sparse_index = 0;
+	return res;
 }
 
 int convert_to_sparse(struct index_state *istate)
 {
+	int test_env;
 	if (istate->split_index || istate->sparse_index ||
 	    !core_apply_sparse_checkout || !core_sparse_checkout_cone)
 		return 0;
@@ -129,14 +143,13 @@ int convert_to_sparse(struct index_state *istate)
 		istate->repo = the_repository;
 
 	/*
-	 * The GIT_TEST_SPARSE_INDEX environment variable triggers the
-	 * extensions.sparseIndex config variable to be on.
+	 * If GIT_TEST_SPARSE_INDEX=1, then trigger extensions.sparseIndex
+	 * to be fully enabled. If GIT_TEST_SPARSE_INDEX=0 (set explicitly),
+	 * then purposefully disable the setting.
 	 */
-	if (git_env_bool("GIT_TEST_SPARSE_INDEX", 0)) {
-		int err = enable_sparse_index(istate->repo);
-		if (err < 0)
-			return err;
-	}
+	test_env = git_env_bool("GIT_TEST_SPARSE_INDEX", -1);
+	if (test_env >= 0)
+		set_sparse_index_config(istate->repo, test_env);
 
 	/*
 	 * Only convert to sparse if extensions.sparseIndex is set.
diff --git a/sparse-index.h b/sparse-index.h
index 64380e121d80..39dcc859735e 100644
--- a/sparse-index.h
+++ b/sparse-index.h
@@ -5,4 +5,7 @@ struct index_state;
 void ensure_full_index(struct index_state *istate);
 int convert_to_sparse(struct index_state *istate);
 
+struct repository;
+int set_sparse_index_config(struct repository *repo, int enable);
+
 #endif
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index bfc9e28ef0e1..9c2bc4d25f66 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -4,6 +4,7 @@ test_description='compare full workdir to sparse workdir'
 
 GIT_TEST_CHECK_CACHE_TREE=0
 GIT_TEST_SPLIT_INDEX=0
+GIT_TEST_SPARSE_INDEX=
 
 . ./test-lib.sh
 
@@ -98,25 +99,26 @@ init_repos () {
 	# initialize sparse-checkout definitions
 	git -C sparse-checkout sparse-checkout init --cone &&
 	git -C sparse-checkout sparse-checkout set deep &&
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout init --cone &&
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep
+	git -C sparse-index sparse-checkout init --cone --sparse-index &&
+	test_cmp_config -C sparse-index true extensions.sparseindex &&
+	git -C sparse-index sparse-checkout set deep
 }
 
 run_on_sparse () {
 	(
 		cd sparse-checkout &&
-		GIT_TEST_SPARSE_INDEX=0 "$@" >../sparse-checkout-out 2>../sparse-checkout-err
+		"$@" >../sparse-checkout-out 2>../sparse-checkout-err
 	) &&
 	(
 		cd sparse-index &&
-		GIT_TEST_SPARSE_INDEX=1 "$@" >../sparse-index-out 2>../sparse-index-err
+		"$@" >../sparse-index-out 2>../sparse-index-err
 	)
 }
 
 run_on_all () {
 	(
 		cd full-checkout &&
-		GIT_TEST_SPARSE_INDEX=0 "$@" >../full-checkout-out 2>../full-checkout-err
+		"$@" >../full-checkout-out 2>../full-checkout-err
 	) &&
 	run_on_sparse "$@"
 }
@@ -146,7 +148,7 @@ test_expect_success 'sparse-index contents' '
 			|| return 1
 	done &&
 
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set folder1 &&
+	git -C sparse-index sparse-checkout set folder1 &&
 
 	test-tool -C sparse-index read-cache --table >cache &&
 	for dir in deep folder2 x
@@ -156,7 +158,7 @@ test_expect_success 'sparse-index contents' '
 			|| return 1
 	done &&
 
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep/deeper1 &&
+	git -C sparse-index sparse-checkout set deep/deeper1 &&
 
 	test-tool -C sparse-index read-cache --table >cache &&
 	for dir in deep/deeper2 folder1 folder2 x
@@ -394,19 +396,15 @@ test_expect_success 'submodule handling' '
 test_expect_success 'sparse-index is expanded and converted back' '
 	init_repos &&
 
-	(
-		GIT_TEST_SPARSE_INDEX=1 &&
-		export GIT_TEST_SPARSE_INDEX &&
-		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
-			git -C sparse-index -c core.fsmonitor="" reset --hard &&
-		test_region index convert_to_sparse trace2.txt &&
-		test_region index ensure_full_index trace2.txt &&
-
-		rm trace2.txt &&
-		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
-			git -C sparse-index -c core.fsmonitor="" status -uno &&
-		test_region index ensure_full_index trace2.txt
-	)
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+		git -C sparse-index -c core.fsmonitor="" reset --hard &&
+	test_region index convert_to_sparse trace2.txt &&
+	test_region index ensure_full_index trace2.txt &&
+
+	rm trace2.txt &&
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+		git -C sparse-index -c core.fsmonitor="" status -uno &&
+	test_region index ensure_full_index trace2.txt
 '
 
 test_done
-- 
gitgitgadget



* [PATCH v2 17/20] sparse-checkout: disable sparse-index
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (15 preceding siblings ...)
  2021-03-10 19:30   ` [PATCH v2 16/20] sparse-checkout: toggle sparse index from builtin Derrick Stolee via GitGitGadget
@ 2021-03-10 19:31   ` Derrick Stolee via GitGitGadget
  2021-03-10 19:31   ` [PATCH v2 18/20] cache-tree: integrate with sparse directory entries Derrick Stolee via GitGitGadget
                     ` (4 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:31 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

We use 'git sparse-checkout init --cone --sparse-index' to toggle the
sparse-index feature. It makes sense to also disable it when running
'git sparse-checkout disable'. This is particularly important because it
removes the extensions.sparseIndex config option, allowing other tools
to use this Git repository again.

This does mean that 'git sparse-checkout init' will not re-enable the
sparse-index feature, even if it was previously enabled.

While testing this feature, I noticed that the sparse-index was not
being written on the first run, but only on a second one. This was caught by the
call to 'test-tool read-cache --table'. This requires adjusting some
assignments to core_apply_sparse_checkout and pl.use_cone_patterns in
the sparse_checkout_init() logic.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/sparse-checkout.c          | 10 +++++++++-
 t/t1091-sparse-checkout-builtin.sh | 13 +++++++++++++
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
index ca63e2c64e95..585343fa1972 100644
--- a/builtin/sparse-checkout.c
+++ b/builtin/sparse-checkout.c
@@ -280,6 +280,9 @@ static int set_config(enum sparse_checkout_mode mode)
 				      "core.sparseCheckoutCone",
 				      mode == MODE_CONE_PATTERNS ? "true" : NULL);
 
+	if (mode == MODE_NO_PATTERNS)
+		set_sparse_index_config(the_repository, 0);
+
 	return 0;
 }
 
@@ -341,10 +344,11 @@ static int sparse_checkout_init(int argc, const char **argv)
 		the_repository->index->updated_workdir = 1;
 	}
 
+	core_apply_sparse_checkout = 1;
+
 	/* If we already have a sparse-checkout file, use it. */
 	if (res >= 0) {
 		free(sparse_filename);
-		core_apply_sparse_checkout = 1;
 		return update_working_directory(NULL);
 	}
 
@@ -366,6 +370,7 @@ static int sparse_checkout_init(int argc, const char **argv)
 	add_pattern(strbuf_detach(&pattern, NULL), empty_base, 0, &pl, 0);
 	strbuf_addstr(&pattern, "!/*/");
 	add_pattern(strbuf_detach(&pattern, NULL), empty_base, 0, &pl, 0);
+	pl.use_cone_patterns = init_opts.cone_mode;
 
 	return write_patterns_and_update(&pl);
 }
@@ -632,6 +637,9 @@ static int sparse_checkout_disable(int argc, const char **argv)
 	strbuf_addstr(&match_all, "/*");
 	add_pattern(strbuf_detach(&match_all, NULL), empty_base, 0, &pl, 0);
 
+	prepare_repo_settings(the_repository);
+	the_repository->settings.sparse_index = 0;
+
 	if (update_working_directory(&pl))
 		die(_("error while refreshing working directory"));
 
diff --git a/t/t1091-sparse-checkout-builtin.sh b/t/t1091-sparse-checkout-builtin.sh
index fc64e9ed99f4..ff1ad570a255 100755
--- a/t/t1091-sparse-checkout-builtin.sh
+++ b/t/t1091-sparse-checkout-builtin.sh
@@ -205,6 +205,19 @@ test_expect_success 'sparse-checkout disable' '
 	check_files repo a deep folder1 folder2
 '
 
+test_expect_success 'sparse-index enabled and disabled' '
+	git -C repo sparse-checkout init --cone --sparse-index &&
+	test_cmp_config -C repo true extensions.sparseIndex &&
+	test-tool -C repo read-cache --table >cache &&
+	grep " tree " cache &&
+
+	git -C repo sparse-checkout disable &&
+	test-tool -C repo read-cache --table >cache &&
+	! grep " tree " cache &&
+	git -C repo config --list >config &&
+	! grep extensions.sparseindex config
+'
+
 test_expect_success 'cone mode: init and set' '
 	git -C repo sparse-checkout init --cone &&
 	git -C repo config --list >config &&
-- 
gitgitgadget



* [PATCH v2 18/20] cache-tree: integrate with sparse directory entries
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (16 preceding siblings ...)
  2021-03-10 19:31   ` [PATCH v2 17/20] sparse-checkout: disable sparse-index Derrick Stolee via GitGitGadget
@ 2021-03-10 19:31   ` Derrick Stolee via GitGitGadget
  2021-03-10 19:31   ` [PATCH v2 19/20] sparse-index: loose integration with cache_tree_verify() Derrick Stolee via GitGitGadget
                     ` (3 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:31 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The cache-tree extension was previously disabled with sparse indexes.
However, the cache-tree is an important performance feature for commands
like 'git status' and 'git add'. Integrate it with sparse directory
entries.

When writing a sparse index, completely clear and recalculate the cache
tree. By starting from scratch, the only integration necessary is to
check if we hit a sparse directory entry and create a leaf of the
cache-tree that has an entry_count of one and no subtrees.
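
As a rough illustration of the leaf rule described above, here is a
minimal sketch. The demo_* structs and function are hypothetical
stand-ins for struct cache_entry, struct cache_tree and the check in
update_one(); they are not Git's actual definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for Git's struct cache_entry. */
struct demo_entry {
	const char *name;   /* full path; sparse directories end in '/' */
	size_t namelen;
	int is_sparse_dir;  /* stands in for S_ISSPARSEDIR(ce->ce_mode) */
};

/* Hypothetical stand-in for Git's struct cache_tree. */
struct demo_tree {
	int entry_count;    /* -1 means "not yet computed" */
};

/*
 * Return 1 and mark 'it' as a leaf when the first entry of the region
 * is a sparse directory whose name is exactly 'base'. Otherwise return
 * 0 and leave 'it' untouched so the caller can do the normal walk.
 */
int demo_try_sparse_leaf(struct demo_tree *it,
			 const struct demo_entry *entries, size_t nr,
			 const char *base, size_t baselen)
{
	assert(it != NULL);

	if (nr == 0)
		return 0;
	if (!entries[0].is_sparse_dir ||
	    entries[0].namelen != baselen ||
	    strncmp(entries[0].name, base, baselen))
		return 0;

	it->entry_count = 1; /* one index entry, no subtrees */
	return 1;
}
```

In the real patch the leaf also copies the tree OID from the entry;
the sketch leaves that out to stay self-contained.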

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache-tree.c   | 18 ++++++++++++++++++
 sparse-index.c | 10 +++++++++-
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/cache-tree.c b/cache-tree.c
index 5f07a39e501e..950a9615db8f 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -256,6 +256,24 @@ static int update_one(struct cache_tree *it,
 
 	*skip_count = 0;
 
+	/*
+	 * If the first entry of this region is a sparse directory
+	 * entry corresponding exactly to 'base', then this cache_tree
+	 * struct is a "leaf" in the data structure, pointing to the
+	 * tree OID specified in the entry.
+	 */
+	if (entries > 0) {
+		const struct cache_entry *ce = cache[0];
+
+		if (S_ISSPARSEDIR(ce->ce_mode) &&
+		    ce->ce_namelen == baselen &&
+		    !strncmp(ce->name, base, baselen)) {
+			it->entry_count = 1;
+			oidcpy(&it->oid, &ce->oid);
+			return 1;
+		}
+	}
+
 	if (0 <= it->entry_count && has_object_file(&it->oid))
 		return it->entry_count;
 
diff --git a/sparse-index.c b/sparse-index.c
index 1c84cac255bf..ea603201a323 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -180,7 +180,11 @@ int convert_to_sparse(struct index_state *istate)
 	istate->cache_nr = convert_to_sparse_rec(istate,
 						 0, 0, istate->cache_nr,
 						 "", 0, istate->cache_tree);
-	istate->drop_cache_tree = 1;
+
+	/* Clear and recompute the cache-tree */
+	cache_tree_free(&istate->cache_tree);
+	cache_tree_update(istate, 0);
+
 	istate->sparse_index = 1;
 	trace2_region_leave("index", "convert_to_sparse", istate->repo);
 	return 0;
@@ -278,5 +282,9 @@ void ensure_full_index(struct index_state *istate)
 
 	free(full);
 
+	/* Clear and recompute the cache-tree */
+	cache_tree_free(&istate->cache_tree);
+	cache_tree_update(istate, 0);
+
 	trace2_region_leave("index", "ensure_full_index", istate->repo);
 }
-- 
gitgitgadget



* [PATCH v2 19/20] sparse-index: loose integration with cache_tree_verify()
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (17 preceding siblings ...)
  2021-03-10 19:31   ` [PATCH v2 18/20] cache-tree: integrate with sparse directory entries Derrick Stolee via GitGitGadget
@ 2021-03-10 19:31   ` Derrick Stolee via GitGitGadget
  2021-03-10 19:31   ` [PATCH v2 20/20] p2000: add sparse-index repos Derrick Stolee via GitGitGadget
                     ` (2 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:31 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The cache_tree_verify() method is run when GIT_TEST_CHECK_CACHE_TREE
is enabled, which it is by default in the test suite. The logic must
be adjusted for the presence of these directory entries.

For now, leave the test as a simple check for whether the directory
entry is sparse. Do not go any further until needed.

This allows us to re-enable GIT_TEST_CHECK_CACHE_TREE in
t1092-sparse-checkout-compatibility.sh. Further,
p2000-sparse-operations.sh uses the test suite and hence this is enabled
for all tests. We need to integrate with it before we run our
performance tests with a sparse-index.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache-tree.c                             | 19 +++++++++++++++++++
 t/t1092-sparse-checkout-compatibility.sh |  1 -
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/cache-tree.c b/cache-tree.c
index 950a9615db8f..11bf1fcae6e1 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -808,6 +808,19 @@ int cache_tree_matches_traversal(struct cache_tree *root,
 	return 0;
 }
 
+static void verify_one_sparse(struct repository *r,
+			      struct index_state *istate,
+			      struct cache_tree *it,
+			      struct strbuf *path,
+			      int pos)
+{
+	struct cache_entry *ce = istate->cache[pos];
+
+	if (!S_ISSPARSEDIR(ce->ce_mode))
+		BUG("directory '%s' is present in index, but not sparse",
+		    path->buf);
+}
+
 static void verify_one(struct repository *r,
 		       struct index_state *istate,
 		       struct cache_tree *it,
@@ -830,6 +843,12 @@ static void verify_one(struct repository *r,
 
 	if (path->len) {
 		pos = index_name_pos(istate, path->buf, path->len);
+
+		if (pos >= 0) {
+			verify_one_sparse(r, istate, it, path, pos);
+			return;
+		}
+
 		pos = -pos - 1;
 	} else {
 		pos = 0;
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 9c2bc4d25f66..c2624176c2e0 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -2,7 +2,6 @@
 
 test_description='compare full workdir to sparse workdir'
 
-GIT_TEST_CHECK_CACHE_TREE=0
 GIT_TEST_SPLIT_INDEX=0
 GIT_TEST_SPARSE_INDEX=
 
-- 
gitgitgadget



* [PATCH v2 20/20] p2000: add sparse-index repos
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (18 preceding siblings ...)
  2021-03-10 19:31   ` [PATCH v2 19/20] sparse-index: loose integration with cache_tree_verify() Derrick Stolee via GitGitGadget
@ 2021-03-10 19:31   ` Derrick Stolee via GitGitGadget
  2021-03-11  0:07   ` [PATCH v2 00/20] Sparse Index: Design, Format, Tests Elijah Newren
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:31 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

p2000-sparse-operations.sh compares different Git commands in
repositories with many files at HEAD but using sparse-checkout to focus
on a small portion of those files.

Add extra copies of the repository that use the sparse-index format so
we can track how that affects the performance of different commands.

At this point in time, the sparse-index is pure CPU overhead, and this
is measurable in these tests:

Test
---------------------------------------------------------------
2000.2: git status (full-index-v3)              0.59(0.51+0.12)
2000.3: git status (full-index-v4)              0.59(0.52+0.11)
2000.4: git status (sparse-index-v3)            1.40(1.32+0.12)
2000.5: git status (sparse-index-v4)            1.41(1.36+0.08)
2000.6: git add -A (full-index-v3)              2.32(1.97+0.19)
2000.7: git add -A (full-index-v4)              2.17(1.92+0.14)
2000.8: git add -A (sparse-index-v3)            2.31(2.21+0.15)
2000.9: git add -A (sparse-index-v4)            2.30(2.20+0.13)
2000.10: git add . (full-index-v3)              2.39(2.02+0.20)
2000.11: git add . (full-index-v4)              2.20(1.94+0.16)
2000.12: git add . (sparse-index-v3)            2.36(2.27+0.12)
2000.13: git add . (sparse-index-v4)            2.33(2.21+0.16)
2000.14: git commit -a -m A (full-index-v3)     2.47(2.12+0.20)
2000.15: git commit -a -m A (full-index-v4)     2.26(2.00+0.17)
2000.16: git commit -a -m A (sparse-index-v3)   3.01(2.92+0.16)
2000.17: git commit -a -m A (sparse-index-v4)   3.01(2.94+0.15)

Note that there is very little difference between the v3 and v4 index
formats when the sparse-index is enabled. This is primarily due to the
fact that the relative file sizes are the same, and the command time is
mostly taken up by parsing tree objects to expand the sparse index into
a full one.

With the current file layout, the index file sizes are given by this
table:

       |  full index | sparse index |
       +-------------+--------------+
    v3 |     108 MiB |      1.6 MiB |
    v4 |      80 MiB |      1.2 MiB |
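
One way to read "relative file sizes are the same" is that v4 shrinks
the index by about the same fraction whether or not it is sparse. The
helper below is a hypothetical sketch (size_ratio() is not part of
Git) for checking that against the table.

```c
#include <assert.h>

/*
 * Size of one index file relative to another, both in MiB.
 * For example, the v4 size relative to the v3 size.
 */
double size_ratio(double numerator_mib, double denominator_mib)
{
	assert(denominator_mib > 0.0);
	return numerator_mib / denominator_mib;
}
```

With the numbers above, size_ratio(80.0, 108.0) is about 0.74 and
size_ratio(1.2, 1.6) is about 0.75, so v4 saves roughly a quarter in
both layouts. By contrast, size_ratio(108.0, 1.6) is about 67.5: the
sparse index is far smaller on disk than the full one.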

Future updates will improve the performance of Git commands when the
index is sparse.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/perf/p2000-sparse-operations.sh | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
index 2fbc81b22119..e527316e66d6 100755
--- a/t/perf/p2000-sparse-operations.sh
+++ b/t/perf/p2000-sparse-operations.sh
@@ -60,12 +60,29 @@ test_expect_success 'setup repo and indexes' '
 		git sparse-checkout set $SPARSE_CONE &&
 		git config index.version 4 &&
 		git update-index --index-version=4
+	) &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . sparse-index-v3 &&
+	(
+		cd sparse-index-v3 &&
+		git sparse-checkout init --cone --sparse-index &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 3 &&
+		git update-index --index-version=3
+	) &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . sparse-index-v4 &&
+	(
+		cd sparse-index-v4 &&
+		git sparse-checkout init --cone --sparse-index &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 4 &&
+		git update-index --index-version=4
 	)
 '
 
 test_perf_on_all () {
 	command="$@"
-	for repo in full-index-v3 full-index-v4
+	for repo in full-index-v3 full-index-v4 \
+		    sparse-index-v3 sparse-index-v4
 	do
 		test_perf "$command ($repo)" "
 			(
-- 
gitgitgadget


* Re: [PATCH v2 01/20] sparse-index: design doc and format update
  2021-03-10 19:30   ` [PATCH v2 01/20] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
@ 2021-03-10 22:19     ` Elijah Newren
  0 siblings, 0 replies; 203+ messages in thread
From: Elijah Newren @ 2021-03-10 22:19 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, Derrick Stolee, SZEDER Gábor,
	Derrick Stolee, Derrick Stolee

On Wed, Mar 10, 2021 at 11:31 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> This begins a long effort to update the index format to allow sparse
> directory entries. This should result in a significant improvement to
> Git commands when HEAD contains millions of files, but the user has
> selected many fewer files to keep in their sparse-checkout definition.
>
> Currently, the index format is only updated in the presence of
> extensions.sparseIndex instead of increasing a file format version
> number. This is temporary, and index v5 is part of the plan for future
> work in this area.
>
> The design document details many of the reasons for embarking on this
> work, and also the plan for completing it safely.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/technical/index-format.txt |   7 +
>  Documentation/technical/sparse-index.txt | 173 +++++++++++++++++++++++
>  2 files changed, 180 insertions(+)
>  create mode 100644 Documentation/technical/sparse-index.txt
>
> diff --git a/Documentation/technical/index-format.txt b/Documentation/technical/index-format.txt
> index b633482b1bdf..387126582556 100644
> --- a/Documentation/technical/index-format.txt
> +++ b/Documentation/technical/index-format.txt
> @@ -44,6 +44,13 @@ Git index format
>    localization, no special casing of directory separator '/'). Entries
>    with the same name are sorted by their stage field.
>
> +  An index entry typically represents a file. However, if sparse-checkout
> +  is enabled in cone mode (`core.sparseCheckoutCone` is enabled) and the
> +  `extensions.sparseIndex` extension is enabled, then the index may
> +  contain entries for directories outside of the sparse-checkout definition.
> +  These entries have mode `0040000`, include the `SKIP_WORKTREE` bit, and
> +  the path ends in a directory separator.
> +
>    32-bit ctime seconds, the last time a file's metadata changed
>      this is stat(2) data
>
> diff --git a/Documentation/technical/sparse-index.txt b/Documentation/technical/sparse-index.txt
> new file mode 100644
> index 000000000000..787a2a0b3b81
> --- /dev/null
> +++ b/Documentation/technical/sparse-index.txt
> @@ -0,0 +1,173 @@
> +Git Sparse-Index Design Document
> +================================
> +
> +The sparse-checkout feature allows users to focus a working directory on
> +a subset of the files at HEAD. The cone mode patterns, enabled by
> +`core.sparseCheckoutCone`, allow for very fast pattern matching to
> +discover which files at HEAD belong in the sparse-checkout cone.
> +
> +Three important scale dimensions for a Git worktree are:
> +
> +* `HEAD`: How many files are present at `HEAD`?
> +
> +* Populated: How many files are within the sparse-checkout cone?
> +
> +* Modified: How many files has the user modified in the working directory?
> +
> +We will use big-O notation -- O(X) -- to denote how expensive certain
> +operations are in terms of these dimensions.
> +
> +These dimensions are ordered by their magnitude: users (typically) modify
> +fewer files than are populated, and we can only populate files at `HEAD`.
> +These dimensions are also ordered by how expensive they are per item: it
> +is more expensive to detect a modified file than it is to write one that we
> +know must be populated; changing `HEAD` only really requires updating the
> +index.
> +
> +Problems occur if there is an extreme imbalance in these dimensions. For
> +example, if `HEAD` contains millions of paths but the populated set has
> +only tens of thousands, then commands like `git status` and `git add` can
> +be dominated by operations that require O(`HEAD`) operations instead of
> +O(Populated). Primarily, the cost is in parsing and rewriting the index,
> +which is filled primarily with files at `HEAD` that are marked with the
> +`SKIP_WORKTREE` bit.
> +
> +The sparse-index intends to take these commands that read and modify the
> +index from O(`HEAD`) to O(Populated). To do this, we need to modify the
> +index format in a significant way: add "sparse directory" entries.
> +
> +With cone mode patterns, it is possible to detect when an entire
> +directory will have its contents outside of the sparse-checkout definition.
> +Instead of listing all of the files it contains as individual entries, a
> +sparse-index contains an entry with the directory name, referencing the
> +object ID of the tree at `HEAD` and marked with the `SKIP_WORKTREE` bit.
> +If we need to discover the details for paths within that directory, we
> +can parse trees to find that list.
> +
> +At time of writing, sparse-directory entries violate expectations about the
> +index format and its in-memory data structure. There are many consumers in
> +the codebase that expect to iterate through all of the index entries and
> +see only files. In addition, they expect to see all files at `HEAD`. One
> +way to handle this is to parse trees to replace a sparse-directory entry
> +with all of the files within that tree as the index is loaded. However,
> +parsing trees is slower than parsing the index format, so that is a slower
> +operation than if we left the index alone.
> +
> +The implementation plan below follows four phases to slowly integrate with
> +the sparse-index. The intention is to incrementally update Git commands to
> +interact safely with the sparse-index without significant slowdowns. This
> +may not always be possible, but the hope is that the primary commands that
> +users need in their daily work are dramatically improved.
> +
> +Phase I: Format and initial speedups
> +------------------------------------
> +
> +During this phase, Git learns to enable the sparse-index and safely parse
> +one. Protections are put in place so that every consumer of the in-memory
> +data structure can operate with its current assumption of every file at
> +`HEAD`.
> +
> +At first, every index parse will expand the sparse-directory entries into
> +the full list of paths at `HEAD`. This will be slower in all cases. The
> +only noticeable change in behavior will be that the serialized index file
> +contains sparse-directory entries.
> +
> +To start, we use a new repository extension, `extensions.sparseIndex`, to
> +allow inserting sparse-directory entries into indexes with file format
> +versions 2, 3, and 4. This prevents Git versions that do not understand
> +the sparse-index from operating on one, but it also prevents other
> +operations that do not use the index at all. A new format, index v5, will
> +be introduced that includes sparse-directory entries by default. It might
> +also introduce other features that have been considered for improving the
> +index, as well.
> +
> +Next, consumers of the index will be guarded against operating on a
> +sparse-index by inserting calls to `ensure_full_index()` or
> +`expand_index_to_path()`. After these guards are in place, we can begin
> +leaving sparse-directory entries in the in-memory index structure.
> +
> +Even after inserting these guards, we will keep expanding sparse-indexes
> +for most Git commands using the `command_requires_full_index` repository
> +setting. This setting will be on by default and disabled one builtin at a
> +time until we have sufficient confidence that all of the index operations
> +are properly guarded.
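
To make the guard idea concrete, the two layers described here could
be sketched as below. This is only an illustration: the demo_* names
are hypothetical, and the real ensure_full_index() parses trees to
rebuild the file entries rather than flipping a flag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the in-memory index. */
struct demo_index {
	int sparse;        /* nonzero while sparse-directory entries remain */
	int expand_calls;  /* counts expansions, for illustration only */
};

/* Hypothetical stand-in for the per-repository settings. */
struct demo_settings {
	int command_requires_full_index; /* on by default in this phase */
};

/* API guard: callers that assume "files only" invoke this first. */
void demo_ensure_full_index(struct demo_index *istate)
{
	assert(istate != NULL);

	if (!istate->sparse)
		return; /* already full, nothing to do */

	/* Real code would replace sparse directories with their files. */
	istate->sparse = 0;
	istate->expand_calls++;
}

/* Blanket guard: expand right after reading unless the builtin opted out. */
void demo_after_read(struct demo_index *istate,
		     const struct demo_settings *settings)
{
	if (settings->command_requires_full_index)
		demo_ensure_full_index(istate);
}
```

A builtin that has been audited would clear
command_requires_full_index and rely only on the per-call guards.
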
> +
> +To complete this phase, the commands `git status` and `git add` will be
> +integrated with the sparse-index so that they operate with O(Populated)
> +performance. They will be carefully tested for operations within and
> +outside the sparse-checkout definition.
> +
> +Phase II: Careful integrations
> +------------------------------
> +
> +This phase focuses on ensuring that all index extensions and APIs work
> +well with a sparse-index. This requires significant increases to our test
> +coverage, especially for operations that interact with the working
> +directory outside of the sparse-checkout definition. Some of these
> +behaviors may not be the desirable ones, such as some tests already
> +marked for failure in `t1092-sparse-checkout-compatibility.sh`.
> +
> +The index extensions that may require special integrations are:
> +
> +* FS Monitor
> +* Untracked cache
> +
> +While integrating with these features, we should look for patterns that
> +might lead to better APIs for interacting with the index. Coalescing
> +common usage patterns into an API call can reduce the number of places
> +where sparse-directories need to be handled carefully.
> +
> +Phase III: Important command speedups
> +-------------------------------------
> +
> +At this point, the patterns for testing and implementing sparse-directory
> +logic should be relatively stable. This phase focuses on updating some of
> +the most common builtins that use the index to operate as O(Populated).
> +Here is a potential list of commands that could be valuable to integrate
> +at this point:
> +
> +* `git commit`
> +* `git checkout`
> +* `git merge`
> +* `git rebase`
> +
> +Hopefully, commands such as `git merge` and `git rebase` can benefit
> +instead from merge algorithms that do not use the index as a data
> +structure, such as the merge-ORT strategy. As these topics mature, we
> +may enalbe the ORT strategy by default for repositories using the

s/enalbe/enable/

> +sparse-index feature.
> +
> +Along with `git status` and `git add`, these commands cover the majority
> +of users' interactions with the working directory. In addition, we can
> +integrate with these commands:
> +
> +* `git grep`
> +* `git rm`
> +
> +These have been proposed as some whose behavior could change when in a
> +repo with a sparse-checkout definition. It would be good to include this
> +behavior automatically when using a sparse-index. Some clarity is needed
> +to make the behavior switch clear to the user.
> +
> +This phase is the first where parallel work might be possible without too
> +many conflicts between topics.
> +
> +Phase IV: The long tail
> +-----------------------
> +
> +This last phase is less a "phase" and more "the new normal" after all of
> +the previous work.
> +
> +To start, the `command_requires_full_index` option could be removed in
> +favor of expanding only when hitting an API guard.
> +
> +There are many Git commands that could use special attention to operate as
> +O(Populated), while some might be so rare that it is acceptable to leave
> +them with additional overhead when a sparse-index is present.
> +
> +Here are some commands that might be useful to update:
> +
> +* `git sparse-checkout set`
> +* `git am`
> +* `git clean`
> +* `git stash`
> --
> gitgitgadget
>


* Re: [PATCH v2 06/20] t1092: compare sparse-checkout to sparse-index
  2021-03-10 19:30   ` [PATCH v2 06/20] t1092: compare sparse-checkout to sparse-index Derrick Stolee via GitGitGadget
@ 2021-03-10 23:04     ` Elijah Newren
  2021-03-11 14:17       ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Elijah Newren @ 2021-03-10 23:04 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, Derrick Stolee, SZEDER Gábor,
	Derrick Stolee, Derrick Stolee

On Wed, Mar 10, 2021 at 11:31 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> Add a new 'sparse-index' repo alongside the 'full-checkout' and
> 'sparse-checkout' repos in t1092-sparse-checkout-compatibility.sh. Also
> add run_on_sparse and test_sparse_match helpers. These helpers will be
> used when the sparse index is implemented.
>
> Add GIT_TEST_SPARSE_INDEX environment variable to enable the
> sparse-index by default. This will be intended to use across the entire
> test suite, except that it will only affect cases where the
> sparse-checkout feature is enabled.

This last sentence was a bit awkward to read.  "will be intended to
use" -> "is intended to be used"?
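
For background on the variable described above, here is a minimal sketch
of the one-shot `VAR=value command` form that the helpers in this patch
use. It assumes a POSIX shell, and `sh -c` is only a stand-in for a git
command that reads the variable from its environment.

```shell
#!/bin/sh
# Start from a clean state so the result does not depend on the caller.
unset GIT_TEST_SPARSE_INDEX

# One-shot prefix: the variable is exported to this one child process only.
# "sh -c" is a stand-in (assumption) for a git command reading the variable.
with_prefix=$(GIT_TEST_SPARSE_INDEX=1 sh -c 'echo "${GIT_TEST_SPARSE_INDEX:-unset}"')

# A later child process does not see the value.
after=$(sh -c 'echo "${GIT_TEST_SPARSE_INDEX:-unset}"')

echo "with prefix: $with_prefix, afterwards: $after"
```

The prefix affects only that one child process, so each invocation in
the helpers can pick its own value.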

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  t/README                                 |  3 +++
>  t/t1092-sparse-checkout-compatibility.sh | 24 ++++++++++++++++++++----
>  2 files changed, 23 insertions(+), 4 deletions(-)
>
> diff --git a/t/README b/t/README
> index 593d4a4e270c..b98bc563aab5 100644
> --- a/t/README
> +++ b/t/README
> @@ -439,6 +439,9 @@ and "sha256".
>  GIT_TEST_WRITE_REV_INDEX=<boolean>, when true enables the
>  'pack.writeReverseIndex' setting.
>
> +GIT_TEST_SPARSE_INDEX=<boolean>, when true enables index writes to use the
> +sparse-index format by default.
> +
>  Naming Tests
>  ------------
>
> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> index 3725d3997e70..71d6f9e4c014 100755
> --- a/t/t1092-sparse-checkout-compatibility.sh
> +++ b/t/t1092-sparse-checkout-compatibility.sh
> @@ -7,6 +7,7 @@ test_description='compare full workdir to sparse workdir'
>  test_expect_success 'setup' '
>         git init initial-repo &&
>         (
> +               GIT_TEST_SPARSE_INDEX=0 &&
>                 cd initial-repo &&
>                 echo a >a &&
>                 echo "after deep" >e &&
> @@ -87,23 +88,32 @@ init_repos () {
>
>         cp -r initial-repo sparse-checkout &&
>         git -C sparse-checkout reset --hard &&
> -       git -C sparse-checkout sparse-checkout init --cone &&
> +
> +       cp -r initial-repo sparse-index &&
> +       git -C sparse-index reset --hard &&
>
>         # initialize sparse-checkout definitions
> -       git -C sparse-checkout sparse-checkout set deep
> +       git -C sparse-checkout sparse-checkout init --cone &&
> +       git -C sparse-checkout sparse-checkout set deep &&
> +       GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout init --cone &&
> +       GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep
>  }
>
>  run_on_sparse () {
>         (
>                 cd sparse-checkout &&
> -               "$@" >../sparse-checkout-out 2>../sparse-checkout-err
> +               GIT_TEST_SPARSE_INDEX=0 "$@" >../sparse-checkout-out 2>../sparse-checkout-err
> +       ) &&
> +       (
> +               cd sparse-index &&
> +               GIT_TEST_SPARSE_INDEX=1 "$@" >../sparse-index-out 2>../sparse-index-err
>         )
>  }
>
>  run_on_all () {
>         (
>                 cd full-checkout &&
> -               "$@" >../full-checkout-out 2>../full-checkout-err
> +               GIT_TEST_SPARSE_INDEX=0 "$@" >../full-checkout-out 2>../full-checkout-err
>         ) &&
>         run_on_sparse "$@"
>  }
> @@ -114,6 +124,12 @@ test_all_match () {
>         test_cmp full-checkout-err sparse-checkout-err
>  }
>
> +test_sparse_match () {
> +       run_on_sparse $* &&

Should this be
   run_on_sparse "$@"
in order to allow arguments with spaces?
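
To illustrate the difference, here is a small sketch. The helpers
`count_args`, `forward_star`, and `forward_at` are hypothetical, not
part of t1092, and the default IFS is assumed.

```shell
#!/bin/sh
# Print how many arguments were received.
count_args () {
	echo "$#"
}

# Forward with unquoted $*: arguments are re-split on whitespace.
forward_star () {
	count_args $*
}

# Forward with "$@": each original argument is passed through intact.
forward_at () {
	count_args "$@"
}

star=$(forward_star "commit message" file)
at=$(forward_at "commit message" file)
echo "star=$star at=$at"
```

With unquoted `$*`, the argument "commit message" is split into two
words, so the callee sees three arguments instead of two.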

> +       test_cmp sparse-checkout-out sparse-index-out &&
> +       test_cmp sparse-checkout-err sparse-index-err
> +}
> +
>  test_expect_success 'status with options' '
>         init_repos &&
>         test_all_match git status --porcelain=v2 &&
> --
> gitgitgadget
>


* Re: [PATCH v2 11/20] sparse-index: convert from full to sparse
  2021-03-10 19:30   ` [PATCH v2 11/20] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
@ 2021-03-10 23:44     ` Elijah Newren
  2021-03-11 14:13       ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Elijah Newren @ 2021-03-10 23:44 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, Derrick Stolee, SZEDER Gábor,
	Derrick Stolee, Derrick Stolee

On Wed, Mar 10, 2021 at 11:31 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> If we have a full index, then we can convert it to a sparse index by
> replacing directories outside of the sparse cone with sparse directory
> entries. The convert_to_sparse() method does this, when the situation is
> appropriate.
>
> For now, we avoid converting the index to a sparse index if:
>
>  1. the index is split.
>  2. the index is already sparse.
>  3. sparse-checkout is disabled.
>  4. sparse-checkout does not use cone mode.
>
> Finally, we currently limit the conversion to when the
> GIT_TEST_SPARSE_INDEX environment variable is enabled. A mode using Git
> config will be added in a later change.
>
> The trickiest thing about this conversion is that we might not be able
> to mark a directory as a sparse directory just because it is outside the
> sparse cone. There might be unmerged files within that directory, so we
> need to look for those. Also, if there is some strange reason why a file
> is not marked with CE_SKIP_WORKTREE, then we should give up on
> converting that directory. There is still hope that some of its
> subdirectories might be able to convert to sparse, so we keep looking
> deeper.
>
> The conversion process is assisted by the cache-tree extension. This is
> calculated from the full index if it does not already exist. We then
> abandon the cache-tree as it no longer applies to the newly-sparse
> index. Thus, this cache-tree will be recalculated in every
> sparse-full-sparse round-trip until we integrate the cache-tree
> extension with the sparse index.
>
> Some Git commands use the index after writing it. For example, 'git add'
> will update the index, then write it to disk, then read its entries to
> report information. To keep the in-memory index in a full state after
> writing, we re-expand it to a full one after the write. This is wasteful
> for commands that only write the index and do not read from it again,
> but that is only the case until we make those commands "sparse aware."
>
> We can compare the behavior of the sparse-index in
> t1092-sparse-checkout-compatibility.sh by using GIT_TEST_SPARSE_INDEX=1
> when operating on the 'sparse-index' repo. We can also compare the two
> sparse repos directly, such as comparing their indexes (when expanded to
> full in the case of the 'sparse-index' repo). We also verify that the
> index is actually populated with sparse directory entries.
>
> The 'checkout and reset (mixed)' test is marked for failure when
> comparing a sparse repo to a full repo, but we can compare the two
> sparse-checkout cases directly to ensure that we are not changing the
> behavior when using a sparse index.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  cache-tree.c                             |   3 +
>  cache.h                                  |   2 +
>  read-cache.c                             |  26 ++++-
>  sparse-index.c                           | 139 +++++++++++++++++++++++
>  sparse-index.h                           |   1 +
>  t/t1092-sparse-checkout-compatibility.sh |  61 +++++++++-
>  6 files changed, 227 insertions(+), 5 deletions(-)
>
> diff --git a/cache-tree.c b/cache-tree.c
> index 2fb483d3c083..5f07a39e501e 100644
> --- a/cache-tree.c
> +++ b/cache-tree.c
> @@ -6,6 +6,7 @@
>  #include "object-store.h"
>  #include "replace-object.h"
>  #include "promisor-remote.h"
> +#include "sparse-index.h"
>
>  #ifndef DEBUG_CACHE_TREE
>  #define DEBUG_CACHE_TREE 0
> @@ -442,6 +443,8 @@ int cache_tree_update(struct index_state *istate, int flags)
>         if (i)
>                 return i;
>
> +       ensure_full_index(istate);
> +
>         if (!istate->cache_tree)
>                 istate->cache_tree = cache_tree();
>
> diff --git a/cache.h b/cache.h
> index 303411726e10..9217d405b9b8 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -251,6 +251,8 @@ static inline unsigned int create_ce_mode(unsigned int mode)
>  {
>         if (S_ISLNK(mode))
>                 return S_IFLNK;
> +       if (mode == S_IFDIR)
> +               return S_IFDIR;
>         if (S_ISDIR(mode) || S_ISGITLINK(mode))
>                 return S_IFGITLINK;
>         return S_IFREG | ce_permissions(mode);
> diff --git a/read-cache.c b/read-cache.c
> index 97dbf2434f30..92126b9d23c9 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -25,6 +25,7 @@
>  #include "fsmonitor.h"
>  #include "thread-utils.h"
>  #include "progress.h"
> +#include "sparse-index.h"
>
>  /* Mask for the name length in ce_flags in the on-disk index */
>
> @@ -1002,8 +1003,14 @@ int verify_path(const char *path, unsigned mode)
>
>                         c = *path++;
>                         if ((c == '.' && !verify_dotfile(path, mode)) ||
> -                           is_dir_sep(c) || c == '\0')
> +                           is_dir_sep(c))
>                                 return 0;
> +                       /*
> +                        * allow terminating directory separators for
> +                        * sparse directory entries.
> +                        */
> +                       if (c == '\0')
> +                               return S_ISDIR(mode);
>                 } else if (c == '\\' && protect_ntfs) {
>                         if (is_ntfs_dotgit(path))
>                                 return 0;
> @@ -3061,6 +3068,14 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l
>                                  unsigned flags)
>  {
>         int ret;
> +       int was_full = !istate->sparse_index;
> +
> +       ret = convert_to_sparse(istate);
> +
> +       if (ret) {
> +               warning(_("failed to convert to a sparse-index"));
> +               return ret;
> +       }
>
>         /*
>          * TODO trace2: replace "the_repository" with the actual repo instance
> @@ -3072,6 +3087,9 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l
>         trace2_region_leave_printf("index", "do_write_index", the_repository,
>                                    "%s", get_lock_file_path(lock));
>
> +       if (was_full)
> +               ensure_full_index(istate);
> +
>         if (ret)
>                 return ret;
>         if (flags & COMMIT_LOCK)
> @@ -3162,9 +3180,10 @@ static int write_shared_index(struct index_state *istate,
>                               struct tempfile **temp)
>  {
>         struct split_index *si = istate->split_index;
> -       int ret;
> +       int ret, was_full = !istate->sparse_index;
>
>         move_cache_to_base_index(istate);
> +       convert_to_sparse(istate);
>
>         trace2_region_enter_printf("index", "shared/do_write_index",
>                                    the_repository, "%s", get_tempfile_path(*temp));
> @@ -3172,6 +3191,9 @@ static int write_shared_index(struct index_state *istate,
>         trace2_region_leave_printf("index", "shared/do_write_index",
>                                    the_repository, "%s", get_tempfile_path(*temp));
>
> +       if (was_full)
> +               ensure_full_index(istate);
> +
>         if (ret)
>                 return ret;
>         ret = adjust_shared_perm(get_tempfile_path(*temp));
> diff --git a/sparse-index.c b/sparse-index.c
> index 316cb949b74b..5eb561259bb1 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -4,6 +4,145 @@
>  #include "tree.h"
>  #include "pathspec.h"
>  #include "trace2.h"
> +#include "cache-tree.h"
> +#include "config.h"
> +#include "dir.h"
> +#include "fsmonitor.h"
> +
> +static struct cache_entry *construct_sparse_dir_entry(
> +                               struct index_state *istate,
> +                               const char *sparse_dir,
> +                               struct cache_tree *tree)
> +{
> +       struct cache_entry *de;
> +
> +       de = make_cache_entry(istate, S_IFDIR, &tree->oid, sparse_dir, 0, 0);
> +
> +       de->ce_flags |= CE_SKIP_WORKTREE;
> +       return de;
> +}
> +
> +/*
> + * Returns the number of entries "inserted" into the index.
> + */
> +static int convert_to_sparse_rec(struct index_state *istate,
> +                                int num_converted,
> +                                int start, int end,
> +                                const char *ct_path, size_t ct_pathlen,
> +                                struct cache_tree *ct)
> +{
> +       int i, can_convert = 1;
> +       int start_converted = num_converted;
> +       enum pattern_match_result match;
> +       int dtype;
> +       struct strbuf child_path = STRBUF_INIT;
> +       struct pattern_list *pl = istate->sparse_checkout_patterns;
> +
> +       /*
> +        * Is the current path outside of the sparse cone?
> +        * Then check if the region can be replaced by a sparse
> +        * directory entry (everything is sparse and merged).
> +        */
> +       match = path_matches_pattern_list(ct_path, ct_pathlen,
> +                                         NULL, &dtype, pl, istate);
> +       if (match != NOT_MATCHED)
> +               can_convert = 0;
> +
> +       for (i = start; can_convert && i < end; i++) {
> +               struct cache_entry *ce = istate->cache[i];
> +
> +               if (ce_stage(ce) ||
> +                   !(ce->ce_flags & CE_SKIP_WORKTREE))
> +                       can_convert = 0;
> +       }
> +
> +       if (can_convert) {
> +               struct cache_entry *se;
> +               se = construct_sparse_dir_entry(istate, ct_path, ct);
> +
> +               istate->cache[num_converted++] = se;
> +               return 1;
> +       }
> +
> +       for (i = start; i < end; ) {
> +               int count, span, pos = -1;
> +               const char *base, *slash;
> +               struct cache_entry *ce = istate->cache[i];
> +
> +               /*
> +                * Detect if this is a normal entry outside of any subtree
> +                * entry.
> +                */
> +               base = ce->name + ct_pathlen;
> +               slash = strchr(base, '/');
> +
> +               if (slash)
> +                       pos = cache_tree_subtree_pos(ct, base, slash - base);
> +
> +               if (pos < 0) {
> +                       istate->cache[num_converted++] = ce;
> +                       i++;
> +                       continue;
> +               }
> +
> +               strbuf_setlen(&child_path, 0);
> +               strbuf_add(&child_path, ce->name, slash - ce->name + 1);
> +
> +               span = ct->down[pos]->cache_tree->entry_count;
> +               count = convert_to_sparse_rec(istate,
> +                                             num_converted, i, i + span,
> +                                             child_path.buf, child_path.len,
> +                                             ct->down[pos]->cache_tree);
> +               num_converted += count;
> +               i += span;
> +       }
> +
> +       strbuf_release(&child_path);
> +       return num_converted - start_converted;
> +}
> +
> +int convert_to_sparse(struct index_state *istate)
> +{
> +       if (istate->split_index || istate->sparse_index ||
> +           !core_apply_sparse_checkout || !core_sparse_checkout_cone)
> +               return 0;
> +
> +       /*
> +        * For now, only create a sparse index with the
> +        * GIT_TEST_SPARSE_INDEX environment variable. We will relax
> +        * this once we have a proper way to opt-in (and later still,
> +        * opt-out).
> +        */
> +       if (!git_env_bool("GIT_TEST_SPARSE_INDEX", 0))
> +               return 0;
> +
> +       if (!istate->sparse_checkout_patterns) {
> +               istate->sparse_checkout_patterns = xcalloc(1, sizeof(struct pattern_list));
> +               if (get_sparse_checkout_patterns(istate->sparse_checkout_patterns) < 0)
> +                       return 0;
> +       }
> +
> +       if (!istate->sparse_checkout_patterns->use_cone_patterns) {
> +               warning(_("attempting to use sparse-index without cone mode"));
> +               return -1;
> +       }
> +
> +       if (cache_tree_update(istate, 0)) {
> +               warning(_("unable to update cache-tree, staying full"));
> +               return -1;
> +       }
> +
> +       remove_fsmonitor(istate);
> +
> +       trace2_region_enter("index", "convert_to_sparse", istate->repo);
> +       istate->cache_nr = convert_to_sparse_rec(istate,
> +                                                0, 0, istate->cache_nr,
> +                                                "", 0, istate->cache_tree);
> +       istate->drop_cache_tree = 1;
> +       istate->sparse_index = 1;
> +       trace2_region_leave("index", "convert_to_sparse", istate->repo);
> +       return 0;
> +}
>
>  static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
>  {
> diff --git a/sparse-index.h b/sparse-index.h
> index 09a20d036c46..64380e121d80 100644
> --- a/sparse-index.h
> +++ b/sparse-index.h
> @@ -3,5 +3,6 @@
>
>  struct index_state;
>  void ensure_full_index(struct index_state *istate);
> +int convert_to_sparse(struct index_state *istate);
>
>  #endif
> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> index 4d789fe86b9d..ca87033d30b0 100755
> --- a/t/t1092-sparse-checkout-compatibility.sh
> +++ b/t/t1092-sparse-checkout-compatibility.sh
> @@ -2,6 +2,9 @@
>
>  test_description='compare full workdir to sparse workdir'
>
> +GIT_TEST_CHECK_CACHE_TREE=0

I still think it'd be nice to get a comment, either in the code or the
commit message, explaining why your series needs to set
GIT_TEST_CHECK_CACHE_TREE to 0.  I feel like I should almost know the
answer (was this just a preliminary step and it'll later be turned on?
did the cache-tree checking do stuff that assumes no sparse directory
entries? is it really slow?), but I don't.

> +GIT_TEST_SPLIT_INDEX=0
> +
>  . ./test-lib.sh
>
>  test_expect_success 'setup' '
> @@ -121,15 +124,49 @@ run_on_all () {
>  test_all_match () {
>         run_on_all "$@" &&
>         test_cmp full-checkout-out sparse-checkout-out &&
> -       test_cmp full-checkout-err sparse-checkout-err
> +       test_cmp full-checkout-out sparse-index-out &&
> +       test_cmp full-checkout-err sparse-checkout-err &&
> +       test_cmp full-checkout-err sparse-index-err
>  }
>
>  test_sparse_match () {
> -       run_on_sparse $* &&
> +       run_on_sparse "$@" &&
>         test_cmp sparse-checkout-out sparse-index-out &&
>         test_cmp sparse-checkout-err sparse-index-err
>  }
>
> +test_expect_success 'sparse-index contents' '
> +       init_repos &&
> +
> +       test-tool -C sparse-index read-cache --table >cache &&
> +       for dir in folder1 folder2 x
> +       do
> +               TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
> +               grep "040000 tree $TREE $dir/" cache \
> +                       || return 1
> +       done &&
> +
> +       GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set folder1 &&
> +
> +       test-tool -C sparse-index read-cache --table >cache &&
> +       for dir in deep folder2 x
> +       do
> +               TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
> +               grep "040000 tree $TREE $dir/" cache \
> +                       || return 1
> +       done &&
> +
> +       GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep/deeper1 &&
> +
> +       test-tool -C sparse-index read-cache --table >cache &&
> +       for dir in deep/deeper2 folder1 folder2 x
> +       do
> +               TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
> +               grep "040000 tree $TREE $dir/" cache \
> +                       || return 1
> +       done
> +'
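
An aside on the `|| return 1` at the end of each loop body above: a
`for` loop's exit status is that of the last command it ran, so a
failure in an early iteration would otherwise go unnoticed. A minimal
sketch, with hypothetical function names that are not part of the test
suite:

```shell
#!/bin/sh
# Without a guard, the failure on "2" is masked by the later success on "3".
check_unguarded () {
	for n in 1 2 3
	do
		test "$n" != 2
	done
}

# With the guard, the function stops at the first failing iteration.
check_guarded () {
	for n in 1 2 3
	do
		test "$n" != 2 || return 1
	done
}

check_unguarded && unguarded=0 || unguarded=1
check_guarded && guarded=0 || guarded=1
echo "unguarded=$unguarded guarded=$guarded"
```
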
> +
>  test_expect_success 'expanded in-memory index matches full index' '
>         init_repos &&
>         test_sparse_match test-tool read-cache --expand --table
> @@ -137,6 +174,7 @@ test_expect_success 'expanded in-memory index matches full index' '
>
>  test_expect_success 'status with options' '
>         init_repos &&
> +       test_sparse_match ls &&
>         test_all_match git status --porcelain=v2 &&
>         test_all_match git status --porcelain=v2 -z -u &&
>         test_all_match git status --porcelain=v2 -uno &&
> @@ -273,6 +311,17 @@ test_expect_failure 'checkout and reset (mixed)' '
>         test_all_match git reset update-folder2
>  '
>
> +# Ensure that sparse-index behaves identically to
> +# sparse-checkout with a full index.
> +test_expect_success 'checkout and reset (mixed) [sparse]' '
> +       init_repos &&
> +
> +       test_sparse_match git checkout -b reset-test update-deep &&
> +       test_sparse_match git reset deepest &&
> +       test_sparse_match git reset update-folder1 &&
> +       test_sparse_match git reset update-folder2
> +'
> +
>  test_expect_success 'merge' '
>         init_repos &&
>
> @@ -309,14 +358,20 @@ test_expect_success 'clean' '
>         test_all_match git status --porcelain=v2 &&
>         test_all_match git clean -f &&
>         test_all_match git status --porcelain=v2 &&
> +       test_sparse_match ls &&
> +       test_sparse_match ls folder1 &&
>
>         test_all_match git clean -xf &&
>         test_all_match git status --porcelain=v2 &&
> +       test_sparse_match ls &&
> +       test_sparse_match ls folder1 &&
>
>         test_all_match git clean -xdf &&
>         test_all_match git status --porcelain=v2 &&
> +       test_sparse_match ls &&
> +       test_sparse_match ls folder1 &&
>
> -       test_path_is_dir sparse-checkout/folder1
> +       test_sparse_match test_path_is_dir folder1
>  '
>
>  test_done
> --
> gitgitgadget
>


* Re: [PATCH v2 00/20] Sparse Index: Design, Format, Tests
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (19 preceding siblings ...)
  2021-03-10 19:31   ` [PATCH v2 20/20] p2000: add sparse-index repos Derrick Stolee via GitGitGadget
@ 2021-03-11  0:07   ` Elijah Newren
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
  21 siblings, 0 replies; 203+ messages in thread
From: Elijah Newren @ 2021-03-11  0:07 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, Derrick Stolee, SZEDER Gábor,
	Derrick Stolee

On Wed, Mar 10, 2021 at 11:31 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> Here is the first full patch series submission coming out of the
> sparse-index RFC [1].
>
> [1]
> https://lore.kernel.org/git/pull.847.git.1611596533.gitgitgadget@gmail.com/
>
> I won't waste too much space here, because PATCH 1 includes a sizeable
> design document that describes the feature, the reasoning behind it, and my
> plan for getting this implemented widely throughout the codebase.
>
> There are some new things here that were not in the RFC:
>
>  * Design doc and format updates. (Patch 1)
>  * Performance test script. (Patches 2 and 20)
>
> Notably missing in this series from the RFC:
>
>  * The mega-patch inserting ensure_full_index() throughout the codebase.
>    That will be a follow-up series to this one.
>  * The integrations with git status and git add to demonstrate the improved
>    performance. Those will also appear in their own series later.
>
> I plan to keep my latest work in this area in my 'sparse-index/wip' branch
> [2]. It includes all of the work from the RFC right now, updated with the
> work from this series.
>
> [2] https://github.com/derrickstolee/git/tree/sparse-index/wip
>
>
> Updates in V2
> =============
>
>  * Various typos and awkward grammar are fixed.
>  * Cleaned up unnecessary commands in p2000-sparse-operations.sh
>  * Added a comment to the sparse_index member of struct index_state.
>  * Used tree_type, commit_type, and blob_type in test-read-cache.c.

I read through the range-diff and comments from the previous series.
There's only a few things left (as I noted in comments), but they're
all pretty trivial so this one is:

Reviewed-by: Elijah Newren <newren@gmail.com>

>
> Thanks, -Stolee
>
> Derrick Stolee (20):
>   sparse-index: design doc and format update
>   t/perf: add performance test for sparse operations
>   t1092: clean up script quoting
>   sparse-index: add guard to ensure full index
>   sparse-index: implement ensure_full_index()
>   t1092: compare sparse-checkout to sparse-index
>   test-read-cache: print cache entries with --table
>   test-tool: don't force full index
>   unpack-trees: ensure full index
>   sparse-checkout: hold pattern list in index
>   sparse-index: convert from full to sparse
>   submodule: sparse-index should not collapse links
>   unpack-trees: allow sparse directories
>   sparse-index: check index conversion happens
>   sparse-index: create extension for compatibility
>   sparse-checkout: toggle sparse index from builtin
>   sparse-checkout: disable sparse-index
>   cache-tree: integrate with sparse directory entries
>   sparse-index: loose integration with cache_tree_verify()
>   p2000: add sparse-index repos
>
>  Documentation/config/extensions.txt      |   8 +
>  Documentation/git-sparse-checkout.txt    |  14 ++
>  Documentation/technical/index-format.txt |   7 +
>  Documentation/technical/sparse-index.txt | 173 ++++++++++++++
>  Makefile                                 |   1 +
>  builtin/sparse-checkout.c                |  44 +++-
>  cache-tree.c                             |  40 ++++
>  cache.h                                  |  18 +-
>  read-cache.c                             |  35 ++-
>  repo-settings.c                          |  15 ++
>  repository.c                             |  11 +-
>  repository.h                             |   3 +
>  setup.c                                  |   3 +
>  sparse-index.c                           | 290 +++++++++++++++++++++++
>  sparse-index.h                           |  11 +
>  t/README                                 |   3 +
>  t/helper/test-read-cache.c               |  66 +++++-
>  t/perf/p2000-sparse-operations.sh        | 102 ++++++++
>  t/t1091-sparse-checkout-builtin.sh       |  13 +
>  t/t1092-sparse-checkout-compatibility.sh | 136 +++++++++--
>  unpack-trees.c                           |  16 +-
>  21 files changed, 969 insertions(+), 40 deletions(-)
>  create mode 100644 Documentation/technical/sparse-index.txt
>  create mode 100644 sparse-index.c
>  create mode 100644 sparse-index.h
>  create mode 100755 t/perf/p2000-sparse-operations.sh
>
>
> base-commit: 966e671106b2fd38301e7c344c754fd118d0bb07
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-883%2Fderrickstolee%2Fsparse-index%2Fformat-v2
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-883/derrickstolee/sparse-index/format-v2
> Pull-Request: https://github.com/gitgitgadget/git/pull/883
>
> Range-diff vs v1:
>
>   1:  daa9a6bcefbc !  1:  2fe413fdac80 sparse-index: design doc and format update
>      @@ Documentation/technical/sparse-index.txt (new)
>       +If we need to discover the details for paths within that directory, we
>       +can parse trees to find that list.
>       +
>      -+This addition of sparse-directory entries violates expectations about the
>      ++At time of writing, sparse-directory entries violate expectations about the
>       +index format and its in-memory data structure. There are many consumers in
>       +the codebase that expect to iterate through all of the index entries and
>       +see only files. In addition, they expect to see all files at `HEAD`. One
>      @@ Documentation/technical/sparse-index.txt (new)
>       +* `git merge`
>       +* `git rebase`
>       +
>      ++Hopefully, commands such as `git merge` and `git rebase` can benefit
>      ++instead from merge algorithms that do not use the index as a data
>      ++structure, such as the merge-ORT strategy. As these topics mature, we
>      ++may enalbe the ORT strategy by default for repositories using the
>      ++sparse-index feature.
>      ++
>       +Along with `git status` and `git add`, these commands cover the majority
>       +of users' interactions with the working directory. In addition, we can
>       +integrate with these commands:
>   2:  a8c6322a3dbe !  2:  540ab5495065 t/perf: add performance test for sparse operations
>      @@ t/perf/p2000-sparse-operations.sh (new)
>       + # Remove submodules from the example repo, because our
>       + # duplication of the entire repo creates an unlikly data shape.
>       + git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
>      -+ rm -f .gitmodules &&
>      -+ git add .gitmodules &&
>      ++ git rm -f .gitmodules &&
>       + for module in $(awk "{print \$2}" modules)
>       + do
>       +         git rm $module || return 1
>       + done &&
>      -+ git add . &&
>       + git commit -m "remove submodules" &&
>       +
>       + echo bogus >a &&
>   3:  6e783c88821e =  3:  5cbedb377b37 t1092: clean up script quoting
>   4:  01da4c48a1fa =  4:  6e21f776e883 sparse-index: add guard to ensure full index
>   5:  2b83989fbcd3 !  5:  399ddb0bad56 sparse-index: implement ensure_full_index()
>      @@ cache.h: struct index_state {
>                  updated_skipworktree : 1,
>       -          fsmonitor_has_run_once : 1;
>       +          fsmonitor_has_run_once : 1,
>      ++
>      ++          /*
>      ++           * sparse_index == 1 when sparse-directory
>      ++           * entries exist. Requires sparse-checkout
>      ++           * in cone mode.
>      ++           */
>       +          sparse_index : 1;
>         struct hashmap name_hash;
>         struct hashmap dir_hash;
>   6:  c9910a37579c =  6:  eac2db5efc22 t1092: compare sparse-checkout to sparse-index
>   7:  3d92df7a0cf9 !  7:  e9c82d2eda82 test-read-cache: print cache entries with --table
>      @@ Commit message
>
>        ## t/helper/test-read-cache.c ##
>       @@
>      + #include "test-tool.h"
>        #include "cache.h"
>        #include "config.h"
>      -
>      ++#include "blob.h"
>      ++#include "commit.h"
>      ++#include "tree.h"
>      ++
>       +static void print_cache_entry(struct cache_entry *ce)
>       +{
>      -+ printf("%06o ", ce->ce_mode & 0777777);
>      ++ const char *type;
>      ++ printf("%06o ", ce->ce_mode & 0177777);
>       +
>       + if (S_ISSPARSEDIR(ce->ce_mode))
>      -+         printf("tree ");
>      ++         type = tree_type;
>       + else if (S_ISGITLINK(ce->ce_mode))
>      -+         printf("commit ");
>      ++         type = commit_type;
>       + else
>      -+         printf("blob ");
>      ++         type = blob_type;
>       +
>      -+ printf("%s\t%s\n",
>      ++ printf("%s %s\t%s\n",
>      ++        type,
>       +        oid_to_hex(&ce->oid),
>       +        ce->name);
>       +}
>       +
>      -+static void print_cache(struct index_state *cache)
>      ++static void print_cache(struct index_state *istate)
>       +{
>       + int i;
>      -+ for (i = 0; i < the_index.cache_nr; i++)
>      -+         print_cache_entry(the_index.cache[i]);
>      ++ for (i = 0; i < istate->cache_nr; i++)
>      ++         print_cache_entry(istate->cache[i]);
>       +}
>      -+
>      +
>        int cmd__read_cache(int argc, const char **argv)
>        {
>       + struct repository *r = the_repository;
>   8:  94373e2bfbbc !  8:  243541fc5820 test-tool: don't force full index
>      @@ Commit message
>
>        ## t/helper/test-read-cache.c ##
>       @@
>      - #include "test-tool.h"
>      - #include "cache.h"
>      - #include "config.h"
>      + #include "blob.h"
>      + #include "commit.h"
>      + #include "tree.h"
>       +#include "sparse-index.h"
>
>        static void print_cache_entry(struct cache_entry *ce)
>   9:  e71f033c2871 =  9:  48f65093b3da unpack-trees: ensure full index
>  10:  f86d3dc154d1 ! 10:  83aac8b7a1ec sparse-checkout: hold pattern list in index
>      @@ Commit message
>           pattern set, we need access to that in-memory copy. Place a pointer to
>           a 'struct pattern_list' in the index so we can access this on-demand.
>           This will be used in the next change which uses the sparse-checkout
>      -    definition to filter out directories that are outsie the sparse cone.
>      +    definition to filter out directories that are outside the sparse cone.
>
>           Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>
>  11:  a2d77c23a0cb ! 11:  f6db0c27a285 sparse-index: convert from full to sparse
>      @@ read-cache.c: int verify_path(const char *path, unsigned mode)
>                                 return 0;
>       +                 /*
>       +                  * allow terminating directory separators for
>      -+                  * sparse directory enries.
>      ++                  * sparse directory entries.
>       +                  */
>       +                 if (c == '\0')
>       +                         return S_ISDIR(mode);
>      @@ sparse-index.c
>       +         struct cache_entry *ce = istate->cache[i];
>       +
>       +         /*
>      -+          * Detect if this is a normal entry oustide of any subtree
>      ++          * Detect if this is a normal entry outside of any subtree
>       +          * entry.
>       +          */
>       +         base = ce->name + ct_pathlen;
>  12:  4405a9115c3b = 12:  f2a3e7298798 submodule: sparse-index should not collapse links
>  13:  fda23f07e6a2 ! 13:  6f1ebe6ccc08 unpack-trees: allow sparse directories
>      @@ Commit message
>           is possible to have a directory in a sparse index as long as that entry
>           is itself marked with the skip-worktree bit.
>
>      -    The negation of the 'pos' variable must be conditioned to only when it
>      -    starts as negative. This is identical behavior as before when the index
>      -    is full.
>      +    The 'pos' variable is assigned a negative value if an exact match is not
>      +    found. Since a directory name can be an exact match, it is no longer an
>      +    error to have a nonnegative 'pos' value.
>
>           Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>
>  14:  7d4627574bb8 = 14:  3fa684b315fb sparse-index: check index conversion happens
>  15:  564503f78784 ! 15:  d74576d677f6 sparse-index: create extension for compatibility
>      @@ Commit message
>
>           We _could_ add a new index version that explicitly adds these
>           capabilities, but there are nuances to index formats 2, 3, and 4 that
>      -    are still valuable to select as options. For now, create a repo
>      -    extension, "extensions.sparseIndex", that specifies that the tool
>      -    reading this repository must understand sparse directory entries.
>      +    are still valuable to select as options. Until we add index format
>      +    version 5, create a repo extension, "extensions.sparseIndex", that
>      +    specifies that the tool reading this repository must understand sparse
>      +    directory entries.
>
>           This change only encodes the extension and enables it when
>           GIT_TEST_SPARSE_INDEX=1. Later, we will add a more user-friendly CLI
>      @@ Documentation/config/extensions.txt: extensions.objectFormat::
>       + When combined with `core.sparseCheckout=true` and
>       + `core.sparseCheckoutCone=true`, the index may contain entries
>       + corresponding to directories outside of the sparse-checkout
>      -+ definition. Versions of Git that do not understand this extension
>      -+ do not expect directory entries in the index.
>      ++ definition in lieu of containing each path under such directories.
>      ++ Versions of Git that do not understand this extension do not
>      ++ expect directory entries in the index.
>
>        ## cache.h ##
>       @@ cache.h: struct repository_format {
>  16:  6d6b230e3318 ! 16:  e530ca5f668d sparse-checkout: toggle sparse index from builtin
>      @@ Documentation/git-sparse-checkout.txt: To avoid interfering with other worktrees
>       +a sparse index until they are properly integrated with the feature.
>       ++
>       +**WARNING:** Using a sparse index requires modifying the index in a way
>      -+that is not completely understood by other tools. Enabling sparse index
>      -+enables the `extensions.spareseIndex` config value, which might cause
>      -+other tools to stop working with your repository. If you have trouble with
>      -+this compatibility, then run `git sparse-checkout sparse-index disable` to
>      -+remove this config and rewrite your index to not be sparse.
>      ++that is not completely understood by external tools. If you have trouble
>      ++with this compatibility, then run `git sparse-checkout sparse-index disable`
>      ++to rewrite your index to not be sparse. Older versions of Git will not
>      ++understand the `sparseIndex` repository extension and may fail to interact
>      ++with your repository until it is disabled.
>
>        'set'::
>         Write a set of patterns to the sparse-checkout file, as given as
>  17:  bcf960ef2362 = 17:  42d0da9c5def sparse-checkout: disable sparse-index
>  18:  e6afec58674e = 18:  6bb0976a6295 cache-tree: integrate with sparse directory entries
>  19:  2be4981fe698 = 19:  07f34e80609a sparse-index: loose integration with cache_tree_verify()
>  20:  a738b0ba8ab4 = 20:  41e3b56b9c17 p2000: add sparse-index repos
>
> --
> gitgitgadget


* Re: [PATCH v2 11/20] sparse-index: convert from full to sparse
  2021-03-10 23:44     ` Elijah Newren
@ 2021-03-11 14:13       ` Derrick Stolee
  0 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee @ 2021-03-11 14:13 UTC (permalink / raw)
  To: Elijah Newren, Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

On 3/10/2021 6:44 PM, Elijah Newren wrote:
> On Wed, Mar 10, 2021 at 11:31 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> +GIT_TEST_CHECK_CACHE_TREE=0
> 
> I still think it'd be nice to get a comment, either in the code or the
> commit message, explaining why your series needs to set
> GIT_TEST_CHECK_CACHE_TREE to 0.  I feel like I should almost know the
> answer (was this just a preliminary step and it'll later be turned on?
> did the cache-tree checking do stuff that assumes no sparse directory
> entries? is it really slow?), but I don't.

Sorry I missed commenting on this earlier.

The GIT_TEST_CHECK_CACHE_TREE environment variable is enabled by the
test suite by default, and it does extra validation to see that the
cache-tree extension exists and matches the index contents. Since at
this point we don't have the cache-tree extension enabled with
sparse-index, that validation would start failing.

This is re-enabled in "sparse-index: loose integration with
cache_tree_verify()" so everything is being verified at the end of the
series.

Thanks,
-Stolee


* Re: [PATCH v2 06/20] t1092: compare sparse-checkout to sparse-index
  2021-03-10 23:04     ` Elijah Newren
@ 2021-03-11 14:17       ` Derrick Stolee
  0 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee @ 2021-03-11 14:17 UTC (permalink / raw)
  To: Elijah Newren, Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

On 3/10/2021 6:04 PM, Elijah Newren wrote:
> On Wed, Mar 10, 2021 at 11:31 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> Add GIT_TEST_SPARSE_INDEX environment variable to enable the
>> sparse-index by default. This will be intended to use across the entire
>> test suite, except that it will only affect cases where the
>> sparse-checkout feature is enabled.
> 
> This last sentence was a bit awkward to read.  "will be intended to
> use" -> "is intended to be used"?

Fixed locally to:

    Add the GIT_TEST_SPARSE_INDEX environment variable to enable the
    sparse-index by default. This can be enabled across all tests, but that
    will only affect cases where the sparse-checkout feature is enabled.
 
>> +test_sparse_match () {
>> +       run_on_sparse $* &&
> 
> Should this be
>    run_on_sparse "$@"
> in order to allow arguments with spaces?

Sorry I missed this one. It was fixed to the right use in
"sparse-index: convert from full to sparse" so I thought I
had already covered this one when looking at the tip of my
branch.
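
For anyone following along, the difference between the two forms can be
reproduced outside the test suite. This is only a minimal sketch:
count_args, forward_unquoted, and forward_quoted are hypothetical
helpers made up for illustration and are not part of t1092.

```shell
# Hypothetical helpers (not from the series): count_args reports how
# many arguments it received, so we can see what the wrapper forwarded.
count_args () {
	echo $#
}

# Forwards with an unquoted $*, so every argument is word-split again.
forward_unquoted () {
	count_args $*
}

# Forwards with "$@", so each original argument stays intact.
forward_quoted () {
	count_args "$@"
}

forward_unquoted "commit -m" "two words"	# arguments are re-split
forward_quoted "commit -m" "two words"		# arguments are preserved
```

With the default IFS, the unquoted form hands four words to the callee
while the quoted form hands over the original two arguments, which is
why "$@" is the safer choice for a wrapper like test_sparse_match().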
 
Thanks,
-Stolee


* Re: [PATCH v2 05/20] sparse-index: implement ensure_full_index()
  2021-03-10 19:30   ` [PATCH v2 05/20] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
@ 2021-03-12  6:50     ` Junio C Hamano
  2021-03-12 13:56       ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Junio C Hamano @ 2021-03-12  6:50 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, newren, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee, Ævar Arnfjörð Bjarmason

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

>  void ensure_full_index(struct index_state *istate)
>  {
> ...
> +	int i;
> +		tree = lookup_tree(istate->repo, &ce->oid);
> +
> +		memset(&ps, 0, sizeof(ps));
> +		ps.recursive = 1;
> +		ps.has_wildcard = 1;
> +		ps.max_depth = -1;
> +
> +		read_tree_recursive(istate->repo, tree,
> +				    ce->name, strlen(ce->name),
> +				    0, &ps,
> +				    add_path_to_index, full);

Ævar, the assumption that led to your e68237bb (tree.h API: remove
support for starting at prefix != "", 2021-03-08) closes the door
for this code rather badly.  Please work with Derrick to figure out
what the best course of action would be.

Thanks.

> +		/* free directory entries. full entries are re-used */
> +		discard_cache_entry(ce);
> +	}
> +
> +	/* Copy back into original index. */
> +	memcpy(&istate->name_hash, &full->name_hash, sizeof(full->name_hash));
> +	istate->sparse_index = 0;
> +	free(istate->cache);
> +	istate->cache = full->cache;
> +	istate->cache_nr = full->cache_nr;
> +	istate->cache_alloc = full->cache_alloc;
> +
> +	free(full);
> +
> +	trace2_region_leave("index", "ensure_full_index", istate->repo);
>  }


* Re: [PATCH v2 05/20] sparse-index: implement ensure_full_index()
  2021-03-12  6:50     ` Junio C Hamano
@ 2021-03-12 13:56       ` Derrick Stolee
  2021-03-12 20:08         ` Junio C Hamano
  0 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee @ 2021-03-12 13:56 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, newren, pclouds, jrnieder, Martin Ågren,
	SZEDER Gábor, Derrick Stolee, Derrick Stolee,
	Ævar Arnfjörð Bjarmason

On 3/12/2021 1:50 AM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>>  void ensure_full_index(struct index_state *istate)
>>  {
>> ...
>> +	int i;
>> +		tree = lookup_tree(istate->repo, &ce->oid);
>> +
>> +		memset(&ps, 0, sizeof(ps));
>> +		ps.recursive = 1;
>> +		ps.has_wildcard = 1;
>> +		ps.max_depth = -1;
>> +
>> +		read_tree_recursive(istate->repo, tree,
>> +				    ce->name, strlen(ce->name),
>> +				    0, &ps,
>> +				    add_path_to_index, full);
> 
> Ævar, the assumption that led to your e68237bb (tree.h API: remove
> support for starting at prefix != "", 2021-03-08) closes the door
> for this code rather badly.  Please work with Derrick to figure out
> what the best course of action would be.

Thanks for pointing this out, Junio.

My preference would be to drop "tree.h API: remove support for
starting at prefix != """, but it should be OK to keep "tree.h API:
remove "stage" parameter from read_tree_recursive()" (currently
b3a078863f6), even though it introduces a semantic conflict here.

Since I haven't seen my sparse-index topic get picked up by a
tracking branch, I'd be happy to rebase on top of Ævar's topic if
I can still set a non-root prefix.

Thanks,
-Stolee


* Re: [PATCH v2 05/20] sparse-index: implement ensure_full_index()
  2021-03-12 13:56       ` Derrick Stolee
@ 2021-03-12 20:08         ` Junio C Hamano
  2021-03-12 20:11           ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Junio C Hamano @ 2021-03-12 20:08 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, newren, pclouds, jrnieder,
	Martin Ågren, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee, Ævar Arnfjörð Bjarmason

Derrick Stolee <stolee@gmail.com> writes:

>> Ævar, the assumption that led to your e68237bb (tree.h API: remove
>> support for starting at prefix != "", 2021-03-08) closes the door
>> for this code rather badly.  Please work with Derrick to figure out
>> what the best course of action would be.
>
> Thanks for pointing this out, Junio.
>
> My preference would be to drop "tree.h API: remove support for
> starting at prefix != """, but it should be OK to keep "tree.h API:
> remove "stage" parameter from read_tree_recursive()" (currently
> b3a078863f6), even though it introduces a semantic conflict here.
>
> Since I haven't seen my sparse-index topic get picked up by a
> tracking branch, I'd be happy to rebase on top of Ævar's topic if
> I can still set a non-root prefix.

I did try to have both in 'seen' (after all, that is the primary way
I find out these conflicts early---no one can keep all the details
of all the topics in flight in one's head), and saw that we now have
a need for non-empty prefix that we thought we no longer have in the
other topic --- I think we should probably keep support of non-empty
prefix (as the primary reason why that patch exists is because we
saw no in-tree users---now if your 05/20 proves to be a good use of
the feature, there is one fewer reason to remove the support) in
some form, so discarding e68237bb certainly is an option.


If we were to base the sparse-index topic on top of ab/read-tree, we
may be able to gain further simplification and clean-up of the API.

I think all the clean-up value e68237bb has are on the calling side
(they no longer have to pass constant ("", 0) to the function), and
we could rewrite e68237bb by

 - renaming "read_tree_recursive()" to "read_tree_at()", with the
   non-empty prefix support.

 - creating a new function "read_tree()", which lacks the support
   for prefix, as a thin-wrapper around "read_tree_at()".

 - modifying the callers of "read_tree_recursive()" changed by
   e68237bb to instead call "read_tree()" (without prefix).

to simplify majority of calling sites without losing functionality.

Then your [05/20] can use the read_tree_at() to read with a prefix.


But that kind of details, I'd want to see you two figure out
yourselves.

Thanks.



* Re: [PATCH v2 05/20] sparse-index: implement ensure_full_index()
  2021-03-12 20:08         ` Junio C Hamano
@ 2021-03-12 20:11           ` Derrick Stolee
  2021-03-15 23:52             ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee @ 2021-03-12 20:11 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, git, newren, pclouds, jrnieder,
	Martin Ågren, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee, Ævar Arnfjörð Bjarmason

On 3/12/2021 3:08 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> 
>>> Ævar, the assumption that led to your e68237bb (tree.h API: remove
>>> support for starting at prefix != "", 2021-03-08) closes the door
>>> for this code rather badly.  Please work with Derrick to figure out
>>> what the best course of action would be.
>>
>> Thanks for pointing this out, Junio.
>>
>> My preference would be to drop "tree.h API: remove support for
>> starting at prefix != """, but it should be OK to keep "tree.h API:
>> remove "stage" parameter from read_tree_recursive()" (currently
>> b3a078863f6), even though it introduces a semantic conflict here.
>>
>> Since I haven't seen my sparse-index topic get picked up by a
>> tracking branch, I'd be happy to rebase on top of Ævar's topic if
>> I can still set a non-root prefix.
> I think all the clean-up value e68237bb has are on the calling side
> (they no longer have to pass constant ("", 0) to the function), and
> we could rewrite e68237bb by
> 
>  - renaming "read_tree_recursive()" to "read_tree_at()", with the
>    non-empty prefix support.
> 
>  - creating a new function "read_tree()", which lacks the support
>    for prefix, as a thin-wrapper around "read_tree_at()".
> 
>  - modifying the callers of "read_tree_recursive()" changed by
>    e68237bb to instead call "read_tree()" (without prefix).
> 
> to simplify majority of calling sites without losing functionality.
> 
> Then your [05/20] can use the read_tree_at() to read with a prefix.
> 
> 
> But that kind of details, I'd want to see you two figure out
> yourselves.

You've given us a great proposal. I'll wait for Ævar to chime in
(and probably update his topic) before I submit a new version.

Thanks,
-Stolee


* Re: [PATCH 16/20] sparse-checkout: toggle sparse index from builtin
  2021-03-09 20:52     ` Derrick Stolee
  2021-03-09 21:03       ` Elijah Newren
@ 2021-03-14 20:08       ` Martin Ågren
  2021-03-15 13:36         ` Derrick Stolee
  1 sibling, 1 reply; 203+ messages in thread
From: Martin Ågren @ 2021-03-14 20:08 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, Git Mailing List, Elijah Newren,
	Junio C Hamano, Nguyễn Thái Ngọc Duy,
	Jonathan Nieder, Derrick Stolee, Derrick Stolee

On Tue, 9 Mar 2021 at 21:52, Derrick Stolee <stolee@gmail.com> wrote:
>
> I agree that the layers are confusing. We could rearrange and have
> a similar flow to what you recommend by mentioning the extension at
> the end:
>
> **WARNING:** Using a sparse index requires modifying the index in a way
> that is not completely understood by other tools. If you have trouble with
> this compatibility, then run `git sparse-checkout sparse-index disable` to
> rewrite your index to not be sparse. Older versions of Git will not
> understand the `sparseIndex` repository extension and may fail to interact
> with your repository until it is disabled.

I like it. I find this easier to read than the previous version. That
said, is `git sparse-checkout sparse-index disable` really the way to do
this? I don't see a "sparse-index" subcommand of git-sparse-checkout.
... Hmm, no, after building and installing your patches, I get

  $ git sparse-checkout sparse-index disable
  usage: git sparse-checkout (init|list|set|add|reapply|disable) <options>

Should that be `git sparse-checkout init --no-sparse-index`? I just
tried that on a fresh, empty repo. It seems to work in the sense that it
drops the config item. I'm guessing re-initing a sparse checkout is a
safe and sane thing to do?

I don't find any tests for this. If re-initing should be ok and in
particular if it should allow toggling the use of sparse index, it might
be good having a test. At a minimum to see that the command passes and
that the config item goes away? And check that the actual index is
rewritten back to the "old" format? (Sorry if you have that already and
I'm just bad at finding it.)

Martin


* Re: [PATCH 16/20] sparse-checkout: toggle sparse index from builtin
  2021-03-14 20:08       ` Martin Ågren
@ 2021-03-15 13:36         ` Derrick Stolee
  0 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee @ 2021-03-15 13:36 UTC (permalink / raw)
  To: Martin Ågren
  Cc: Derrick Stolee via GitGitGadget, Git Mailing List, Elijah Newren,
	Junio C Hamano, Nguyễn Thái Ngọc Duy,
	Jonathan Nieder, Derrick Stolee, Derrick Stolee

On 3/14/2021 4:08 PM, Martin Ågren wrote:
> On Tue, 9 Mar 2021 at 21:52, Derrick Stolee <stolee@gmail.com> wrote:
>>
>> I agree that the layers are confusing. We could rearrange and have
>> a similar flow to what you recommend by mentioning the extension at
>> the end:
>>
>> **WARNING:** Using a sparse index requires modifying the index in a way
>> that is not completely understood by other tools. If you have trouble with
>> this compatibility, then run `git sparse-checkout sparse-index disable` to
>> rewrite your index to not be sparse. Older versions of Git will not
>> understand the `sparseIndex` repository extension and may fail to interact
>> with your repository until it is disabled.
> 
> I like it. I find this easier to read than the previous version. That
> said, is `git sparse-checkout sparse-index disable` really the way to do
> this? I don't see a "sparse-index" subcommand of git-sparse-checkout.
> ... Hmm, no, after building and installing your patches, I get
> 
>   $ git sparse-checkout sparse-index disable
>   usage: git sparse-checkout (init|list|set|add|reapply|disable) <options>
> 
> Should that be `git sparse-checkout init --no-sparse-index`? I just
> tried that on a fresh, empty repo. It seems to work in the sense that it
> drops the config item. I'm guessing re-initing a sparse checkout is a
> safe and sane thing to do?

Yes! Sorry I missed updating this instance when changing the
design. Your suggestion is indeed the proper way to disable the
sparse-index.
 
> I don't find any tests for this. If re-initing should be ok and in
> particular if it should allow toggling the use of sparse index, it might
> be good having a test. At a minimum to see that the command passes and
> that the config item goes away? And check that the actual index is
> rewritten back to the "old" format? (Sorry if you have that already and
> I'm just bad at finding it.)

We have tests already that 'git sparse-checkout init' will preserve
existing sparse-checkout patterns.

I should definitely have a test to ensure that '--no-sparse-index'
rewrites the index to be a full one. Thanks!

-Stolee


* Re: [PATCH v2 05/20] sparse-index: implement ensure_full_index()
  2021-03-12 20:11           ` Derrick Stolee
@ 2021-03-15 23:52             ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-15 23:52 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Junio C Hamano, Derrick Stolee via GitGitGadget, git, newren,
	pclouds, jrnieder, Martin Ågren, SZEDER Gábor,
	Derrick Stolee, Derrick Stolee


On Fri, Mar 12 2021, Derrick Stolee wrote:

> On 3/12/2021 3:08 PM, Junio C Hamano wrote:
>> Derrick Stolee <stolee@gmail.com> writes:
>> 
>>>> Ævar, the assumption that led to your e68237bb (tree.h API: remove
>>>> support for starting at prefix != "", 2021-03-08) closes the door
>>>> for this code rather badly.  Please work with Derrick to figure out
>>>> what the best course of action would be.
>>>
>>> Thanks for pointing this out, Junio.
>>>
>>> My preference would be to drop "tree.h API: remove support for
>>> starting at prefix != """, but it should be OK to keep "tree.h API:
>>> remove "stage" parameter from read_tree_recursive()" (currently
>>> b3a078863f6), even though it introduces a semantic conflict here.
>>>
>>> Since I haven't seen my sparse-index topic get picked up by a
>>> tracking branch, I'd be happy to rebase on top of Ævar's topic if
>>> I can still set a non-root prefix.
>> I think all the clean-up value e68237bb has are on the calling side
>> (they no longer have to pass constant ("", 0) to the function), and
>> we could rewrite e68237bb by
>> 
>>  - renaming "read_tree_recursive()" to "read_tree_at()", with the
>>    non-empty prefix support.
>> 
>>  - creating a new function "read_tree()", which lacks the support
>>    for prefix, as a thin-wrapper around "read_tree_at()".
>> 
>>  - modifying the callers of "read_tree_recursive()" changed by
>>    e68237bb to instead call "read_tree()" (without prefix).
>> 
>> to simplify majority of calling sites without losing functionality.
>> 
>> Then your [05/20] can use the read_tree_at() to read with a prefix.
>> 
>> 
>> But that kind of details, I'd want to see you two figure out
>> yourselves.
>
> You've given us a great proposal. I'll wait for Ævar to chime in
> (and probably update his topic) before I submit a new version.

I've just re-rolled my series (sorry for the delay):
https://lore.kernel.org/git/20210315234344.28427-1-avarab@gmail.com/

You should be able to rebase easily on top of it, although note that the
new read_tree_at() uses a strbuf, but is otherwise the same as the old
read_tree_recursive().

Note that the pathspec can also be used to get to where
read_tree_recursive() would have brought you. I haven't looked at
whether there are reasons to convert in-tree (or this) code to pathspec
use, or vice-versa convert some things that use pathspecs now
(e.g. ls-tree with a path) to providing a prefix via the strbuf.


* [PATCH v3 00/20] Sparse Index: Design, Format, Tests
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (20 preceding siblings ...)
  2021-03-11  0:07   ` [PATCH v2 00/20] Sparse Index: Design, Format, Tests Elijah Newren
@ 2021-03-16 16:42   ` Derrick Stolee via GitGitGadget
  2021-03-16 16:42     ` [PATCH v3 01/20] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
                       ` (23 more replies)
  21 siblings, 24 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

Here is the first full patch series submission coming out of the
sparse-index RFC [1].

[1]
https://lore.kernel.org/git/pull.847.git.1611596533.gitgitgadget@gmail.com/

I won't waste too much space here, because PATCH 1 includes a sizeable
design document that describes the feature, the reasoning behind it, and my
plan for getting this implemented widely throughout the codebase.

There are some new things here that were not in the RFC:

 * Design doc and format updates. (Patch 1)
 * Performance test script. (Patches 2 and 20)

Notably missing in this series from the RFC:

 * The mega-patch inserting ensure_full_index() throughout the codebase.
   That will be a follow-up series to this one.
 * The integrations with git status and git add to demonstrate the improved
   performance. Those will also appear in their own series later.

I plan to keep my latest work in this area in my 'sparse-index/wip' branch
[2]. It includes all of the work from the RFC right now, updated with the
work from this series.

[2] https://github.com/derrickstolee/git/tree/sparse-index/wip


Updates in V3
=============

For this version, I took Ævar's latest patches and applied them to v2.31.0
and rebased this series on top. It uses his new "read_tree_at()" helper and
the associated changes to the function pointer type.

 * Fixed more typos. Thanks Martin and Elijah!
 * Updated the test_sparse_match() macro to use "$@" instead of $*
 * Added a test that git sparse-checkout init --no-sparse-index rewrites the
   index to be full.


Updates in V2
=============

 * Various typos and awkward grammar are fixed.
 * Cleaned up unnecessary commands in p2000-sparse-operations.sh
 * Added a comment to the sparse_index member of struct index_state.
 * Used tree_type, commit_type, and blob_type in test-read-cache.c.

Thanks, -Stolee

Derrick Stolee (20):
  sparse-index: design doc and format update
  t/perf: add performance test for sparse operations
  t1092: clean up script quoting
  sparse-index: add guard to ensure full index
  sparse-index: implement ensure_full_index()
  t1092: compare sparse-checkout to sparse-index
  test-read-cache: print cache entries with --table
  test-tool: don't force full index
  unpack-trees: ensure full index
  sparse-checkout: hold pattern list in index
  sparse-index: convert from full to sparse
  submodule: sparse-index should not collapse links
  unpack-trees: allow sparse directories
  sparse-index: check index conversion happens
  sparse-index: create extension for compatibility
  sparse-checkout: toggle sparse index from builtin
  sparse-checkout: disable sparse-index
  cache-tree: integrate with sparse directory entries
  sparse-index: loose integration with cache_tree_verify()
  p2000: add sparse-index repos

 Documentation/config/extensions.txt      |   8 +
 Documentation/git-sparse-checkout.txt    |  14 ++
 Documentation/technical/index-format.txt |   7 +
 Documentation/technical/sparse-index.txt | 173 +++++++++++++
 Makefile                                 |   1 +
 builtin/sparse-checkout.c                |  44 +++-
 cache-tree.c                             |  40 ++++
 cache.h                                  |  18 +-
 read-cache.c                             |  35 ++-
 repo-settings.c                          |  15 ++
 repository.c                             |  11 +-
 repository.h                             |   3 +
 setup.c                                  |   3 +
 sparse-index.c                           | 293 +++++++++++++++++++++++
 sparse-index.h                           |  11 +
 t/README                                 |   3 +
 t/helper/test-read-cache.c               |  66 ++++-
 t/perf/p2000-sparse-operations.sh        | 102 ++++++++
 t/t1091-sparse-checkout-builtin.sh       |  13 +
 t/t1092-sparse-checkout-compatibility.sh | 143 +++++++++--
 unpack-trees.c                           |  16 +-
 21 files changed, 979 insertions(+), 40 deletions(-)
 create mode 100644 Documentation/technical/sparse-index.txt
 create mode 100644 sparse-index.c
 create mode 100644 sparse-index.h
 create mode 100755 t/perf/p2000-sparse-operations.sh


base-commit: 9c34e7ffd7b544199d889e2f3f7d9ba663c4357d
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-883%2Fderrickstolee%2Fsparse-index%2Fformat-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-883/derrickstolee/sparse-index/format-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/883

Range-diff vs v2:

  1:  2fe413fdac80 !  1:  62ac13945bec sparse-index: design doc and format update
     @@ Documentation/technical/sparse-index.txt (new)
      +Hopefully, commands such as `git merge` and `git rebase` can benefit
      +instead from merge algorithms that do not use the index as a data
      +structure, such as the merge-ORT strategy. As these topics mature, we
     -+may enalbe the ORT strategy by default for repositories using the
     ++may enable the ORT strategy by default for repositories using the
      +sparse-index feature.
      +
      +Along with `git status` and `git add`, these commands cover the majority
  2:  540ab5495065 =  2:  d2197e895e4d t/perf: add performance test for sparse operations
  3:  5cbedb377b37 =  3:  d3cfd34b8418 t1092: clean up script quoting
  4:  6e21f776e883 =  4:  4472118cf903 sparse-index: add guard to ensure full index
  5:  399ddb0bad56 !  5:  99292cdbaae4 sparse-index: implement ensure_full_index()
     @@ sparse-index.c
      +}
      +
      +static int add_path_to_index(const struct object_id *oid,
     -+				struct strbuf *base, const char *path,
     -+				unsigned int mode, int stage, void *context)
     ++			     struct strbuf *base, const char *path,
     ++			     unsigned int mode, void *context)
      +{
      +	struct index_state *istate = (struct index_state *)context;
      +	struct cache_entry *ce;
     @@ sparse-index.c
      -	/* intentionally left blank */
      +	int i;
      +	struct index_state *full;
     ++	struct strbuf base = STRBUF_INIT;
      +
      +	if (!istate || !istate->sparse_index)
      +		return;
     @@ sparse-index.c
      +		ps.has_wildcard = 1;
      +		ps.max_depth = -1;
      +
     -+		read_tree_recursive(istate->repo, tree,
     -+				    ce->name, strlen(ce->name),
     -+				    0, &ps,
     -+				    add_path_to_index, full);
     ++		strbuf_setlen(&base, 0);
     ++		strbuf_add(&base, ce->name, strlen(ce->name));
     ++
     ++		read_tree_at(istate->repo, tree, &base, &ps,
     ++			     add_path_to_index, full);
      +
      +		/* free directory entries. full entries are re-used */
      +		discard_cache_entry(ce);
     @@ sparse-index.c
      +	istate->cache_nr = full->cache_nr;
      +	istate->cache_alloc = full->cache_alloc;
      +
     ++	strbuf_release(&base);
      +	free(full);
      +
      +	trace2_region_leave("index", "ensure_full_index", istate->repo);
  6:  eac2db5efc22 !  6:  fae5663a17bb t1092: compare sparse-checkout to sparse-index
     @@ Commit message
          add run_on_sparse and test_sparse_match helpers. These helpers will be
          used when the sparse index is implemented.
      
     -    Add GIT_TEST_SPARSE_INDEX environment variable to enable the
     -    sparse-index by default. This will be intended to use across the entire
     -    test suite, except that it will only affect cases where the
     -    sparse-checkout feature is enabled.
     +    Add the GIT_TEST_SPARSE_INDEX environment variable to enable the
     +    sparse-index by default. This can be enabled across all tests, but that
     +    will only affect cases where the sparse-checkout feature is enabled.
      
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
     @@ t/t1092-sparse-checkout-compatibility.sh: test_all_match () {
       }
       
      +test_sparse_match () {
     -+	run_on_sparse $* &&
     ++	run_on_sparse "$@" &&
      +	test_cmp sparse-checkout-out sparse-index-out &&
      +	test_cmp sparse-checkout-err sparse-index-err
      +}
  7:  e9c82d2eda82 =  7:  dffe8821fde2 test-read-cache: print cache entries with --table
  8:  243541fc5820 =  8:  f4ad081f25bb test-tool: don't force full index
  9:  48f65093b3da =  9:  4780076a50df unpack-trees: ensure full index
 10:  83aac8b7a1ec = 10:  33fdba2b8cfd sparse-checkout: hold pattern list in index
 11:  f6db0c27a285 ! 11:  e41b14e03ebb sparse-index: convert from full to sparse
     @@ t/t1092-sparse-checkout-compatibility.sh
       
       test_description='compare full workdir to sparse workdir'
       
     ++# The verify_cache_tree() check is not sparse-aware (yet).
     ++# So, disable the check until that integration is complete.
      +GIT_TEST_CHECK_CACHE_TREE=0
      +GIT_TEST_SPLIT_INDEX=0
      +
     @@ t/t1092-sparse-checkout-compatibility.sh: run_on_all () {
       }
       
       test_sparse_match () {
     --	run_on_sparse $* &&
     -+	run_on_sparse "$@" &&
     - 	test_cmp sparse-checkout-out sparse-index-out &&
     +@@ t/t1092-sparse-checkout-compatibility.sh: test_sparse_match () {
       	test_cmp sparse-checkout-err sparse-index-err
       }
       
 12:  f2a3e7298798 = 12:  b77cd6b02265 submodule: sparse-index should not collapse links
 13:  6f1ebe6ccc08 = 13:  4000c5cdd4cf unpack-trees: allow sparse directories
 14:  3fa684b315fb = 14:  1a2be38b2ca7 sparse-index: check index conversion happens
 15:  d74576d677f6 = 15:  f89891b0ae4e sparse-index: create extension for compatibility
 16:  e530ca5f668d ! 16:  bd703c76c859 sparse-checkout: toggle sparse index from builtin
     @@ Documentation/git-sparse-checkout.txt: To avoid interfering with other worktrees
      ++
      +**WARNING:** Using a sparse index requires modifying the index in a way
      +that is not completely understood by external tools. If you have trouble
     -+with this compatibility, then run `git sparse-checkout sparse-index disable`
     ++with this compatibility, then run `git sparse-checkout init --no-sparse-index`
      +to rewrite your index to not be sparse. Older versions of Git will not
      +understand the `sparseIndex` repository extension and may fail to interact
      +with your repository until it is disabled.
     @@ sparse-index.h: struct index_state;
      
       ## t/t1092-sparse-checkout-compatibility.sh ##
      @@ t/t1092-sparse-checkout-compatibility.sh: test_description='compare full workdir to sparse workdir'
     - 
     + # So, disable the check until that integration is complete.
       GIT_TEST_CHECK_CACHE_TREE=0
       GIT_TEST_SPLIT_INDEX=0
      +GIT_TEST_SPARSE_INDEX=
     @@ t/t1092-sparse-checkout-compatibility.sh: test_expect_success 'sparse-index cont
       
       	test-tool -C sparse-index read-cache --table >cache &&
       	for dir in deep/deeper2 folder1 folder2 x
     +@@ t/t1092-sparse-checkout-compatibility.sh: test_expect_success 'sparse-index contents' '
     + 		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
     + 		grep "040000 tree $TREE	$dir/" cache \
     + 			|| return 1
     +-	done
     ++	done &&
     ++
     ++	# Disabling the sparse-index removes tree entries with full ones
     ++	git -C sparse-index sparse-checkout init --no-sparse-index &&
     ++
     ++	test-tool -C sparse-index read-cache --table >cache &&
     ++	! grep "040000 tree" cache &&
     ++	test_sparse_match test-tool read-cache --table
     + '
     + 
     + test_expect_success 'expanded in-memory index matches full index' '
      @@ t/t1092-sparse-checkout-compatibility.sh: test_expect_success 'submodule handling' '
       test_expect_success 'sparse-index is expanded and converted back' '
       	init_repos &&
 17:  42d0da9c5def = 17:  598557f90a2a sparse-checkout: disable sparse-index
 18:  6bb0976a6295 ! 18:  c2d0c17db31a cache-tree: integrate with sparse directory entries
     @@ sparse-index.c: int convert_to_sparse(struct index_state *istate)
       	trace2_region_leave("index", "convert_to_sparse", istate->repo);
       	return 0;
      @@ sparse-index.c: void ensure_full_index(struct index_state *istate)
     - 
     + 	strbuf_release(&base);
       	free(full);
       
      +	/* Clear and recompute the cache-tree */
 19:  07f34e80609a ! 19:  6fdd9323c14e sparse-index: loose integration with cache_tree_verify()
     @@ t/t1092-sparse-checkout-compatibility.sh
       
       test_description='compare full workdir to sparse workdir'
       
     +-# The verify_cache_tree() check is not sparse-aware (yet).
     +-# So, disable the check until that integration is complete.
      -GIT_TEST_CHECK_CACHE_TREE=0
       GIT_TEST_SPLIT_INDEX=0
       GIT_TEST_SPARSE_INDEX=
 20:  41e3b56b9c17 = 20:  3db06ac46dd5 p2000: add sparse-index repos

-- 
gitgitgadget


* [PATCH v3 01/20] sparse-index: design doc and format update
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
@ 2021-03-16 16:42     ` Derrick Stolee via GitGitGadget
  2021-03-19 23:43       ` Junio C Hamano
  2021-03-16 16:42     ` [PATCH v3 02/20] t/perf: add performance test for sparse operations Derrick Stolee via GitGitGadget
                       ` (22 subsequent siblings)
  23 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

This begins a long effort to update the index format to allow sparse
directory entries. This should result in a significant improvement to
Git commands when HEAD contains millions of files, but the user has
selected many fewer files to keep in their sparse-checkout definition.
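
As a rough illustration of what a "sparse directory entry" means, here is a
toy model in plain shell. It is only a sketch, not how Git implements the
conversion: the helper name collapse_outside_cone is hypothetical, it only
handles a single top-level cone directory, and it assumes the path list
arrives sorted as it would be in the index.

```shell
#!/bin/sh

# Toy model: read sorted paths on stdin and collapse every top-level
# directory outside the given cone into a single "dir/" entry.
# collapse_outside_cone is a hypothetical helper, not a Git function.
collapse_outside_cone () {
	cone=$1
	last_dir=
	while IFS= read -r path
	do
		case "$path" in
		"$cone"/*)
			# Inside the cone: keep the individual file entry.
			echo "$path" ;;
		*/*)
			# Outside the cone: emit one entry per top-level directory.
			dir=${path%%/*}/
			if test "$dir" != "$last_dir"
			then
				echo "$dir"
				last_dir=$dir
			fi ;;
		*)
			# Files at the root are always kept in cone mode.
			echo "$path" ;;
		esac
	done
}

result=$(printf '%s\n' a deep/a deep/deeper1/a folder1/a folder1/b folder2/a |
	collapse_outside_cone deep)
echo "$result"
```

With the cone set to "deep", the six input paths shrink to five entries, and
the two files under folder1/ are represented by the single entry "folder1/".
In a real repository the savings grow with the number of files outside the
cone.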

Currently, the index format is only updated in the presence of
extensions.sparseIndex instead of increasing a file format version
number. This is temporary, and index v5 is part of the plan for future
work in this area.

The design document details many of the reasons for embarking on this
work, and also the plan for completing it safely.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/index-format.txt |   7 +
 Documentation/technical/sparse-index.txt | 173 +++++++++++++++++++++++
 2 files changed, 180 insertions(+)
 create mode 100644 Documentation/technical/sparse-index.txt

diff --git a/Documentation/technical/index-format.txt b/Documentation/technical/index-format.txt
index d363a71c37ec..cc548eaa0e97 100644
--- a/Documentation/technical/index-format.txt
+++ b/Documentation/technical/index-format.txt
@@ -44,6 +44,13 @@ Git index format
   localization, no special casing of directory separator '/'). Entries
   with the same name are sorted by their stage field.
 
+  An index entry typically represents a file. However, if sparse-checkout
+  is enabled in cone mode (`core.sparseCheckoutCone` is enabled) and the
+  `extensions.sparseIndex` extension is enabled, then the index may
+  contain entries for directories outside of the sparse-checkout definition.
+  These entries have mode `0040000`, include the `SKIP_WORKTREE` bit, and
+  the path ends in a directory separator.
+
   32-bit ctime seconds, the last time a file's metadata changed
     this is stat(2) data
 
diff --git a/Documentation/technical/sparse-index.txt b/Documentation/technical/sparse-index.txt
new file mode 100644
index 000000000000..aa116406a016
--- /dev/null
+++ b/Documentation/technical/sparse-index.txt
@@ -0,0 +1,173 @@
+Git Sparse-Index Design Document
+================================
+
+The sparse-checkout feature allows users to focus a working directory on
+a subset of the files at HEAD. The cone mode patterns, enabled by
+`core.sparseCheckoutCone`, allow for very fast pattern matching to
+discover which files at HEAD belong in the sparse-checkout cone.
+
+Three important scale dimensions for a Git worktree are:
+
+* `HEAD`: How many files are present at `HEAD`?
+
+* Populated: How many files are within the sparse-checkout cone?
+
+* Modified: How many files has the user modified in the working directory?
+
+We will use big-O notation -- O(X) -- to denote how expensive certain
+operations are in terms of these dimensions.
+
+These dimensions are ordered by their magnitude: users (typically) modify
+fewer files than are populated, and we can only populate files at `HEAD`.
+These dimensions are also ordered by how expensive they are per item: it
+is more expensive to detect a modified file than it is to write one that we
+know must be populated; changing `HEAD` only really requires updating the
+index.
+
+Problems occur if there is an extreme imbalance in these dimensions. For
+example, if `HEAD` contains millions of paths but the populated set has
+only tens of thousands, then commands like `git status` and `git add` can
+be dominated by operations that require O(`HEAD`) operations instead of
+O(Populated). Primarily, the cost is in parsing and rewriting the index,
+which is filled primarily with files at `HEAD` that are marked with the
+`SKIP_WORKTREE` bit.
+
+The sparse-index intends to take these commands that read and modify the
+index from O(`HEAD`) to O(Populated). To do this, we need to modify the
+index format in a significant way: add "sparse directory" entries.
+
+With cone mode patterns, it is possible to detect when an entire
+directory will have its contents outside of the sparse-checkout definition.
+Instead of listing all of the files it contains as individual entries, a
+sparse-index contains an entry with the directory name, referencing the
+object ID of the tree at `HEAD` and marked with the `SKIP_WORKTREE` bit.
+If we need to discover the details for paths within that directory, we
+can parse trees to find that list.
+
+At time of writing, sparse-directory entries violate expectations about the
+index format and its in-memory data structure. There are many consumers in
+the codebase that expect to iterate through all of the index entries and
+see only files. In addition, they expect to see all files at `HEAD`. One
+way to handle this is to parse trees to replace a sparse-directory entry
+with all of the files within that tree as the index is loaded. However,
+parsing trees is slower than parsing the index format, so that is a slower
+operation than if we left the index alone.
+
+The implementation plan below follows four phases to slowly integrate with
+the sparse-index. The intention is to incrementally update Git commands to
+interact safely with the sparse-index without significant slowdowns. This
+may not always be possible, but the hope is that the primary commands that
+users need in their daily work are dramatically improved.
+
+Phase I: Format and initial speedups
+------------------------------------
+
+During this phase, Git learns to enable the sparse-index and safely parse
+one. Protections are put in place so that every consumer of the in-memory
+data structure can operate with its current assumption of every file at
+`HEAD`.
+
+At first, every index parse will expand the sparse-directory entries into
+the full list of paths at `HEAD`. This will be slower in all cases. The
+only noticeable change in behavior will be that the serialized index file
+contains sparse-directory entries.
+
+To start, we use a new repository extension, `extensions.sparseIndex`, to
+allow inserting sparse-directory entries into indexes with file format
+versions 2, 3, and 4. This prevents Git versions that do not understand
+the sparse-index from operating on one, but it also prevents other
+operations that do not use the index at all. A new format, index v5, will
+be introduced that includes sparse-directory entries by default. It might
+also introduce other features that have been considered for improving the
+index, as well.
+
+Next, consumers of the index will be guarded against operating on a
+sparse-index by inserting calls to `ensure_full_index()` or
+`expand_index_to_path()`. After these guards are in place, we can begin
+leaving sparse-directory entries in the in-memory index structure.
+
+Even after inserting these guards, we will keep expanding sparse-indexes
+for most Git commands using the `command_requires_full_index` repository
+setting. This setting will be on by default and disabled one builtin at a
+time until we have sufficient confidence that all of the index operations
+are properly guarded.
+
+To complete this phase, the commands `git status` and `git add` will be
+integrated with the sparse-index so that they operate with O(Populated)
+performance. They will be carefully tested for operations within and
+outside the sparse-checkout definition.
+
+Phase II: Careful integrations
+------------------------------
+
+This phase focuses on ensuring that all index extensions and APIs work
+well with a sparse-index. This requires significant increases to our test
+coverage, especially for operations that interact with the working
+directory outside of the sparse-checkout definition. Some of these
+behaviors may not be the desirable ones, such as some tests already
+marked for failure in `t1092-sparse-checkout-compatibility.sh`.
+
+The index extensions that may require special integrations are:
+
+* FS Monitor
+* Untracked cache
+
+While integrating with these features, we should look for patterns that
+might lead to better APIs for interacting with the index. Coalescing
+common usage patterns into an API call can reduce the number of places
+where sparse-directories need to be handled carefully.
+
+Phase III: Important command speedups
+-------------------------------------
+
+At this point, the patterns for testing and implementing sparse-directory
+logic should be relatively stable. This phase focuses on updating some of
+the most common builtins that use the index to operate as O(Populated).
+Here is a potential list of commands that could be valuable to integrate
+at this point:
+
+* `git commit`
+* `git checkout`
+* `git merge`
+* `git rebase`
+
+Hopefully, commands such as `git merge` and `git rebase` can benefit
+instead from merge algorithms that do not use the index as a data
+structure, such as the merge-ORT strategy. As these topics mature, we
+may enable the ORT strategy by default for repositories using the
+sparse-index feature.
+
+Along with `git status` and `git add`, these commands cover the majority
+of users' interactions with the working directory. In addition, we can
+integrate with these commands:
+
+* `git grep`
+* `git rm`
+
+These have been proposed as some whose behavior could change when in a
+repo with a sparse-checkout definition. It would be good to include this
+behavior automatically when using a sparse-index. Some clarity is needed
+to make the behavior switch clear to the user.
+
+This phase is the first where parallel work might be possible without too
+many conflicts between topics.
+
+Phase IV: The long tail
+-----------------------
+
+This last phase is less a "phase" and more "the new normal" after all of
+the previous work.
+
+To start, the `command_requires_full_index` option could be removed in
+favor of expanding only when hitting an API guard.
+
+There are many Git commands that could use special attention to operate as
+O(Populated), while some might be so rare that it is acceptable to leave
+them with additional overhead when a sparse-index is present.
+
+Here are some commands that might be useful to update:
+
+* `git sparse-checkout set`
+* `git am`
+* `git clean`
+* `git stash`
-- 
gitgitgadget



* [PATCH v3 02/20] t/perf: add performance test for sparse operations
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
  2021-03-16 16:42     ` [PATCH v3 01/20] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
@ 2021-03-16 16:42     ` Derrick Stolee via GitGitGadget
  2021-03-17  8:41       ` Ævar Arnfjörð Bjarmason
  2021-03-16 16:42     ` [PATCH v3 03/20] t1092: clean up script quoting Derrick Stolee via GitGitGadget
                       ` (21 subsequent siblings)
  23 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Create a test script that takes the default performance test (the Git
codebase) and multiplies it by 256 using four layers of duplicated
trees of width four. This results in nearly one million blob entries in
the index. Then, we can clone this repository with sparse-checkout
patterns that demonstrate four copies of the initial repository. Each
clone will use a different index format or mode so performance can be
tested across the different options.
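
To make the factor of 256 concrete: each layer places four copies of the
previous tree side by side, and there are four layers. The following is only
a back-of-the-envelope sketch of that arithmetic, not part of the test
script; the variable names are illustrative.

```shell
#!/bin/sh

# Each of the four layers holds four copies of the previous layer's
# tree (f1..f4), so the number of copies grows by a factor of four
# per layer.
copies=1
for level in 1 2 3 4
do
	copies=$((copies * 4))
done
echo "$copies"
```

This prints 256, which is the multiplier applied to the file count of the
base repository.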

Note that the initial repo is stripped of submodules before doing the
copies. This preserves the expected data shape of the sparse index,
because directories containing submodules are not collapsed to a sparse
directory entry.

Run a few Git commands on these clones, especially those that use the
index (status, add, commit).

Here are the results on my Linux machine:

Test
--------------------------------------------------------------
2000.2: git status (full-index-v3)             0.37(0.30+0.09)
2000.3: git status (full-index-v4)             0.39(0.32+0.10)
2000.4: git add -A (full-index-v3)             1.42(1.06+0.20)
2000.5: git add -A (full-index-v4)             1.26(0.98+0.16)
2000.6: git add . (full-index-v3)              1.40(1.04+0.18)
2000.7: git add . (full-index-v4)              1.26(0.98+0.17)
2000.8: git commit -a -m A (full-index-v3)     1.42(1.11+0.16)
2000.9: git commit -a -m A (full-index-v4)     1.33(1.08+0.16)

It is perhaps noteworthy that there is an improvement when using index
version 4. This is because the v3 index uses 108 MiB while the v4
index uses 80 MiB. Since the repeated portions of the directories are
very short (f3/f1/f2, for example) this ratio is less pronounced than in
similarly-sized real repositories.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/perf/p2000-sparse-operations.sh | 85 +++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)
 create mode 100755 t/perf/p2000-sparse-operations.sh

diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
new file mode 100755
index 000000000000..2fbc81b22119
--- /dev/null
+++ b/t/perf/p2000-sparse-operations.sh
@@ -0,0 +1,85 @@
+#!/bin/sh
+
+test_description="test performance of Git operations using the index"
+
+. ./perf-lib.sh
+
+test_perf_default_repo
+
+SPARSE_CONE=f2/f4/f1
+
+test_expect_success 'setup repo and indexes' '
+	git reset --hard HEAD &&
+	# Remove submodules from the example repo, because our
+	# duplication of the entire repo creates an unlikely data shape.
+	git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
+	git rm -f .gitmodules &&
+	for module in $(awk "{print \$2}" modules)
+	do
+		git rm $module || return 1
+	done &&
+	git commit -m "remove submodules" &&
+
+	echo bogus >a &&
+	cp a b &&
+	git add a b &&
+	git commit -m "level 0" &&
+	BLOB=$(git rev-parse HEAD:a) &&
+	OLD_COMMIT=$(git rev-parse HEAD) &&
+	OLD_TREE=$(git rev-parse HEAD^{tree}) &&
+
+	for i in $(test_seq 1 4)
+	do
+		cat >in <<-EOF &&
+			100755 blob $BLOB	a
+			040000 tree $OLD_TREE	f1
+			040000 tree $OLD_TREE	f2
+			040000 tree $OLD_TREE	f3
+			040000 tree $OLD_TREE	f4
+		EOF
+		NEW_TREE=$(git mktree <in) &&
+		NEW_COMMIT=$(git commit-tree $NEW_TREE -p $OLD_COMMIT -m "level $i") &&
+		OLD_TREE=$NEW_TREE &&
+		OLD_COMMIT=$NEW_COMMIT || return 1
+	done &&
+
+	git sparse-checkout init --cone &&
+	git branch -f wide $OLD_COMMIT &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . full-index-v3 &&
+	(
+		cd full-index-v3 &&
+		git sparse-checkout init --cone &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 3 &&
+		git update-index --index-version=3
+	) &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . full-index-v4 &&
+	(
+		cd full-index-v4 &&
+		git sparse-checkout init --cone &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 4 &&
+		git update-index --index-version=4
+	)
+'
+
+test_perf_on_all () {
+	command="$@"
+	for repo in full-index-v3 full-index-v4
+	do
+		test_perf "$command ($repo)" "
+			(
+				cd $repo &&
+				echo >>$SPARSE_CONE/a &&
+				$command
+			)
+		"
+	done
+}
+
+test_perf_on_all git status
+test_perf_on_all git add -A
+test_perf_on_all git add .
+test_perf_on_all git commit -a -m A
+
+test_done
-- 
gitgitgadget



* [PATCH v3 03/20] t1092: clean up script quoting
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
  2021-03-16 16:42     ` [PATCH v3 01/20] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
  2021-03-16 16:42     ` [PATCH v3 02/20] t/perf: add performance test for sparse operations Derrick Stolee via GitGitGadget
@ 2021-03-16 16:42     ` Derrick Stolee via GitGitGadget
  2021-03-17  8:47       ` Ævar Arnfjörð Bjarmason
  2021-03-16 16:42     ` [PATCH v3 04/20] sparse-index: add guard to ensure full index Derrick Stolee via GitGitGadget
                       ` (20 subsequent siblings)
  23 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

This test was introduced in 19a0acc83e4 (t1092: test interesting
sparse-checkout scenarios, 2021-01-23), but these issues with quoting
were not noticed until starting this follow-up series. The old mechanism
would drop quoting such as in

   test_all_match git commit -m "touch README.md"

The above happened to work because README.md is a file in the
repository, so 'git commit -m touch README.md' would succeed by
accident.
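
The difference can be seen outside the test suite with a small sketch. The
helper names below (count_args, forward_unquoted, forward_quoted) are
hypothetical and exist only for this illustration; they mirror the shape of
the old and new run_on_all, not its exact contents.

```shell
#!/bin/sh

# Hypothetical helper: report how many arguments it received.
count_args () {
	echo $#
}

# Old style: unquoted $* is re-split on whitespace, so the caller's
# quoting is lost.
forward_unquoted () {
	count_args $*
}

# New style: "$@" passes each original argument through intact.
forward_quoted () {
	count_args "$@"
}

unquoted=$(forward_unquoted git commit -m "touch README.md")
quoted=$(forward_quoted git commit -m "touch README.md")
echo "unquoted: $unquoted arguments, quoted: $quoted arguments"
```

The unquoted form sees five arguments because the commit message is split
into "touch" and "README.md", while the quoted form sees the intended four.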

Other cases included quoting for no good reason, so clean that up now.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t1092-sparse-checkout-compatibility.sh | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 8cd3e5a8d227..3725d3997e70 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -96,20 +96,20 @@ init_repos () {
 run_on_sparse () {
 	(
 		cd sparse-checkout &&
-		$* >../sparse-checkout-out 2>../sparse-checkout-err
+		"$@" >../sparse-checkout-out 2>../sparse-checkout-err
 	)
 }
 
 run_on_all () {
 	(
 		cd full-checkout &&
-		$* >../full-checkout-out 2>../full-checkout-err
+		"$@" >../full-checkout-out 2>../full-checkout-err
 	) &&
-	run_on_sparse $*
+	run_on_sparse "$@"
 }
 
 test_all_match () {
-	run_on_all $* &&
+	run_on_all "$@" &&
 	test_cmp full-checkout-out sparse-checkout-out &&
 	test_cmp full-checkout-err sparse-checkout-err
 }
@@ -119,7 +119,7 @@ test_expect_success 'status with options' '
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git status --porcelain=v2 -z -u &&
 	test_all_match git status --porcelain=v2 -uno &&
-	run_on_all "touch README.md" &&
+	run_on_all touch README.md &&
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git status --porcelain=v2 -z -u &&
 	test_all_match git status --porcelain=v2 -uno &&
@@ -135,7 +135,7 @@ test_expect_success 'add, commit, checkout' '
 	write_script edit-contents <<-\EOF &&
 	echo text >>$1
 	EOF
-	run_on_all "../edit-contents README.md" &&
+	run_on_all ../edit-contents README.md &&
 
 	test_all_match git add README.md &&
 	test_all_match git status --porcelain=v2 &&
@@ -144,7 +144,7 @@ test_expect_success 'add, commit, checkout' '
 	test_all_match git checkout HEAD~1 &&
 	test_all_match git checkout - &&
 
-	run_on_all "../edit-contents README.md" &&
+	run_on_all ../edit-contents README.md &&
 
 	test_all_match git add -A &&
 	test_all_match git status --porcelain=v2 &&
@@ -153,7 +153,7 @@ test_expect_success 'add, commit, checkout' '
 	test_all_match git checkout HEAD~1 &&
 	test_all_match git checkout - &&
 
-	run_on_all "../edit-contents deep/newfile" &&
+	run_on_all ../edit-contents deep/newfile &&
 
 	test_all_match git status --porcelain=v2 -uno &&
 	test_all_match git status --porcelain=v2 &&
@@ -186,7 +186,7 @@ test_expect_success 'diff --staged' '
 	write_script edit-contents <<-\EOF &&
 	echo text >>README.md
 	EOF
-	run_on_all "../edit-contents" &&
+	run_on_all ../edit-contents &&
 
 	test_all_match git diff &&
 	test_all_match git diff --staged &&
@@ -280,7 +280,7 @@ test_expect_success 'clean' '
 	echo bogus >>.gitignore &&
 	run_on_all cp ../.gitignore . &&
 	test_all_match git add .gitignore &&
-	test_all_match git commit -m ignore-bogus-files &&
+	test_all_match git commit -m "ignore bogus files" &&
 
 	run_on_sparse mkdir folder1 &&
 	run_on_all touch folder1/bogus &&
-- 
gitgitgadget



* [PATCH v3 04/20] sparse-index: add guard to ensure full index
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (2 preceding siblings ...)
  2021-03-16 16:42     ` [PATCH v3 03/20] t1092: clean up script quoting Derrick Stolee via GitGitGadget
@ 2021-03-16 16:42     ` Derrick Stolee via GitGitGadget
  2021-03-16 16:42     ` [PATCH v3 05/20] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
                       ` (19 subsequent siblings)
  23 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Upcoming changes will introduce modifications to the index format that
allow sparse directories. It will be useful to have a mechanism for
converting those sparse index files into full indexes by walking the
tree at those sparse directories. Name this method ensure_full_index()
as it will guarantee that the index is fully expanded.

This method is not implemented yet, and instead we focus on the
scaffolding to declare it and call it at the appropriate time.

Add a 'command_requires_full_index' member to struct repo_settings. This
will be an indicator that we need the index in full mode to do certain
index operations. This starts as being true for every command, then we
will set it to false as some commands integrate with sparse indexes.

If 'command_requires_full_index' is true, then we will immediately
expand a sparse index to a full one upon reading from disk. This
suffices for now, but we will want to add more callers to
ensure_full_index() later.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Makefile        |  1 +
 repo-settings.c |  8 ++++++++
 repository.c    | 11 ++++++++++-
 repository.h    |  2 ++
 sparse-index.c  |  8 ++++++++
 sparse-index.h  |  7 +++++++
 6 files changed, 36 insertions(+), 1 deletion(-)
 create mode 100644 sparse-index.c
 create mode 100644 sparse-index.h

diff --git a/Makefile b/Makefile
index dfb0f1000fa3..89b1d5374107 100644
--- a/Makefile
+++ b/Makefile
@@ -985,6 +985,7 @@ LIB_OBJS += setup.o
 LIB_OBJS += shallow.o
 LIB_OBJS += sideband.o
 LIB_OBJS += sigchain.o
+LIB_OBJS += sparse-index.o
 LIB_OBJS += split-index.o
 LIB_OBJS += stable-qsort.o
 LIB_OBJS += strbuf.o
diff --git a/repo-settings.c b/repo-settings.c
index f7fff0f5ab83..d63569e4041e 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -77,4 +77,12 @@ void prepare_repo_settings(struct repository *r)
 		UPDATE_DEFAULT_BOOL(r->settings.core_untracked_cache, UNTRACKED_CACHE_KEEP);
 
 	UPDATE_DEFAULT_BOOL(r->settings.fetch_negotiation_algorithm, FETCH_NEGOTIATION_DEFAULT);
+
+	/*
+	 * This setting guards all index reads to require a full index
+	 * over a sparse index. After suitable guards are placed in the
+	 * codebase around uses of the index, this setting will be
+	 * removed.
+	 */
+	r->settings.command_requires_full_index = 1;
 }
diff --git a/repository.c b/repository.c
index c98298acd017..a8acae002f71 100644
--- a/repository.c
+++ b/repository.c
@@ -10,6 +10,7 @@
 #include "object.h"
 #include "lockfile.h"
 #include "submodule-config.h"
+#include "sparse-index.h"
 
 /* The main repository */
 static struct repository the_repo;
@@ -261,6 +262,8 @@ void repo_clear(struct repository *repo)
 
 int repo_read_index(struct repository *repo)
 {
+	int res;
+
 	if (!repo->index)
 		repo->index = xcalloc(1, sizeof(*repo->index));
 
@@ -270,7 +273,13 @@ int repo_read_index(struct repository *repo)
 	else if (repo->index->repo != repo)
 		BUG("repo's index should point back at itself");
 
-	return read_index_from(repo->index, repo->index_file, repo->gitdir);
+	res = read_index_from(repo->index, repo->index_file, repo->gitdir);
+
+	prepare_repo_settings(repo);
+	if (repo->settings.command_requires_full_index)
+		ensure_full_index(repo->index);
+
+	return res;
 }
 
 int repo_hold_locked_index(struct repository *repo,
diff --git a/repository.h b/repository.h
index b385ca3c94b6..e06a23015697 100644
--- a/repository.h
+++ b/repository.h
@@ -41,6 +41,8 @@ struct repo_settings {
 	enum fetch_negotiation_setting fetch_negotiation_algorithm;
 
 	int core_multi_pack_index;
+
+	unsigned command_requires_full_index:1;
 };
 
 struct repository {
diff --git a/sparse-index.c b/sparse-index.c
new file mode 100644
index 000000000000..82183ead563b
--- /dev/null
+++ b/sparse-index.c
@@ -0,0 +1,8 @@
+#include "cache.h"
+#include "repository.h"
+#include "sparse-index.h"
+
+void ensure_full_index(struct index_state *istate)
+{
+	/* intentionally left blank */
+}
diff --git a/sparse-index.h b/sparse-index.h
new file mode 100644
index 000000000000..09a20d036c46
--- /dev/null
+++ b/sparse-index.h
@@ -0,0 +1,7 @@
+#ifndef SPARSE_INDEX_H__
+#define SPARSE_INDEX_H__
+
+struct index_state;
+void ensure_full_index(struct index_state *istate);
+
+#endif
-- 
gitgitgadget



* [PATCH v3 05/20] sparse-index: implement ensure_full_index()
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (3 preceding siblings ...)
  2021-03-16 16:42     ` [PATCH v3 04/20] sparse-index: add guard to ensure full index Derrick Stolee via GitGitGadget
@ 2021-03-16 16:42     ` Derrick Stolee via GitGitGadget
  2021-03-17 13:03       ` Ævar Arnfjörð Bjarmason
  2021-03-16 16:42     ` [PATCH v3 06/20] t1092: compare sparse-checkout to sparse-index Derrick Stolee via GitGitGadget
                       ` (18 subsequent siblings)
  23 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

We will mark an in-memory index_state as having sparse directory entries
with the sparse_index bit. These currently cannot exist, but we will add
a mechanism for collapsing a full index to a sparse one in a later
change. That will happen at write time, so we must first allow parsing
the format before writing it.

Commands or methods that require a full index in order to operate can
call ensure_full_index() to expand that index in-memory. This requires
parsing trees using that index's repository.
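
As a rough illustration of what expanding one sparse directory entry involves, the sketch below walks a toy tree recursively and emits one path per blob, prefixed by the directory name. The toy_* types, the fixed-size output buffer, and the in-memory tree are all assumptions made for the example; the real code walks tree objects with read_tree_at() and creates cache entries instead of strings.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct toy_tree;

struct toy_tree_item {
	const char *name;
	const struct toy_tree *subtree;   /* NULL for a blob */
};

struct toy_tree {
	size_t nr;
	const struct toy_tree_item *items;
};

#define TOY_MAX_PATHS 16
#define TOY_PATH_LEN 64

struct toy_paths {
	size_t nr;
	char path[TOY_MAX_PATHS][TOY_PATH_LEN];
};

/*
 * Append every blob below 't' to 'out', using 'base' (which must end
 * in '/') as the path prefix. Subtrees are descended into and do not
 * produce entries of their own.
 */
static void toy_expand(const struct toy_tree *t, const char *base,
		       struct toy_paths *out)
{
	size_t i;

	for (i = 0; i < t->nr; i++) {
		const struct toy_tree_item *it = &t->items[i];

		if (it->subtree) {
			char buf[TOY_PATH_LEN];

			snprintf(buf, sizeof(buf), "%s%s/", base, it->name);
			toy_expand(it->subtree, buf, out);
		} else if (out->nr < TOY_MAX_PATHS) {
			snprintf(out->path[out->nr++], TOY_PATH_LEN,
				 "%s%s", base, it->name);
		}
	}
}
```

Calling toy_expand() with the base "folder1/" on a tree holding "a", "sub/c" and "z" yields the three file paths a full index would list for that directory.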

Sparse directory entries have a specific 'ce_mode' value. The macro
S_ISSPARSEDIR(ce->ce_mode) can check if a cache_entry 'ce' has this type.
This ce_mode is not possible with the existing index formats, so the macro
does not also verify the other properties of a sparse-directory entry. The
full set of properties is:

 1. ce->ce_mode == 0040000
 2. ce->ce_flags & CE_SKIP_WORKTREE is true
 3. ce->name[ce->namelen - 1] == '/' (ends in dir separator)
 4. ce->oid references a tree object.

These are only partially enforced in ensure_full_index(). Any deviation
will cause a warning at minimum or a failure in the worst case.
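
The four properties can be expressed as a small checker. This is a sketch under stated assumptions: 'struct toy_cache_entry' is a hypothetical simplification of 'struct cache_entry', the 'oid_is_tree' field stands in for an object lookup, and the TOY_* constants are local to the example. It is not the validation that ensure_full_index() performs.

```c
#include <assert.h>
#include <string.h>

#define TOY_S_IFDIR          0040000u
#define TOY_CE_SKIP_WORKTREE (1u << 30)   /* bit position is illustrative */

struct toy_cache_entry {
	unsigned int ce_mode;
	unsigned int ce_flags;
	const char *name;
	int oid_is_tree;   /* stand-in for "ce->oid references a tree object" */
};

/* Property 1 alone identifies the entry type, mirroring S_ISSPARSEDIR(). */
static int toy_is_sparse_dir(const struct toy_cache_entry *ce)
{
	return ce->ce_mode == TOY_S_IFDIR;
}

/* Returns a bitmask of violated properties; 0 means all four hold. */
static unsigned toy_check_sparse_dir(const struct toy_cache_entry *ce)
{
	unsigned bad = 0;
	size_t len = strlen(ce->name);

	if (ce->ce_mode != TOY_S_IFDIR)
		bad |= 1u << 0;
	if (!(ce->ce_flags & TOY_CE_SKIP_WORKTREE))
		bad |= 1u << 1;
	if (!len || ce->name[len - 1] != '/')
		bad |= 1u << 2;
	if (!ce->oid_is_tree)
		bad |= 1u << 3;
	return bad;
}
```

Separating the cheap mode test from the full check mirrors the reasoning above: the mode is enough to recognize the entry type, and the remaining properties are consistency checks.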

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache.h        | 13 ++++++-
 read-cache.c   |  9 +++++
 sparse-index.c | 98 +++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 118 insertions(+), 2 deletions(-)

diff --git a/cache.h b/cache.h
index c2f8a8eadf67..abb00a068e5d 100644
--- a/cache.h
+++ b/cache.h
@@ -204,6 +204,8 @@ struct cache_entry {
 #error "CE_EXTENDED_FLAGS out of range"
 #endif
 
+#define S_ISSPARSEDIR(m) ((m) == S_IFDIR)
+
 /* Forward structure decls */
 struct pathspec;
 struct child_process;
@@ -319,7 +321,14 @@ struct index_state {
 		 drop_cache_tree : 1,
 		 updated_workdir : 1,
 		 updated_skipworktree : 1,
-		 fsmonitor_has_run_once : 1;
+		 fsmonitor_has_run_once : 1,
+
+		 /*
+		  * sparse_index == 1 when sparse-directory
+		  * entries exist. Requires sparse-checkout
+		  * in cone mode.
+		  */
+		 sparse_index : 1;
 	struct hashmap name_hash;
 	struct hashmap dir_hash;
 	struct object_id oid;
@@ -722,6 +731,8 @@ int read_index_from(struct index_state *, const char *path,
 		    const char *gitdir);
 int is_index_unborn(struct index_state *);
 
+void ensure_full_index(struct index_state *istate);
+
 /* For use with `write_locked_index()`. */
 #define COMMIT_LOCK		(1 << 0)
 #define SKIP_IF_UNCHANGED	(1 << 1)
diff --git a/read-cache.c b/read-cache.c
index 1e9a50c6c734..dd3980c12b53 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -101,6 +101,9 @@ static const char *alternate_index_output;
 
 static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
 {
+	if (S_ISSPARSEDIR(ce->ce_mode))
+		istate->sparse_index = 1;
+
 	istate->cache[nr] = ce;
 	add_name_hash(istate, ce);
 }
@@ -2273,6 +2276,12 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist)
 	trace2_data_intmax("index", the_repository, "read/cache_nr",
 			   istate->cache_nr);
 
+	if (!istate->repo)
+		istate->repo = the_repository;
+	prepare_repo_settings(istate->repo);
+	if (istate->repo->settings.command_requires_full_index)
+		ensure_full_index(istate);
+
 	return istate->cache_nr;
 
 unmap:
diff --git a/sparse-index.c b/sparse-index.c
index 82183ead563b..7095378a1b28 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -1,8 +1,104 @@
 #include "cache.h"
 #include "repository.h"
 #include "sparse-index.h"
+#include "tree.h"
+#include "pathspec.h"
+#include "trace2.h"
+
+static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
+{
+	ALLOC_GROW(istate->cache, nr + 1, istate->cache_alloc);
+
+	istate->cache[nr] = ce;
+	add_name_hash(istate, ce);
+}
+
+static int add_path_to_index(const struct object_id *oid,
+			     struct strbuf *base, const char *path,
+			     unsigned int mode, void *context)
+{
+	struct index_state *istate = (struct index_state *)context;
+	struct cache_entry *ce;
+	size_t len = base->len;
+
+	if (S_ISDIR(mode))
+		return READ_TREE_RECURSIVE;
+
+	strbuf_addstr(base, path);
+
+	ce = make_cache_entry(istate, mode, oid, base->buf, 0, 0);
+	ce->ce_flags |= CE_SKIP_WORKTREE;
+	set_index_entry(istate, istate->cache_nr++, ce);
+
+	strbuf_setlen(base, len);
+	return 0;
+}
 
 void ensure_full_index(struct index_state *istate)
 {
-	/* intentionally left blank */
+	int i;
+	struct index_state *full;
+	struct strbuf base = STRBUF_INIT;
+
+	if (!istate || !istate->sparse_index)
+		return;
+
+	if (!istate->repo)
+		istate->repo = the_repository;
+
+	trace2_region_enter("index", "ensure_full_index", istate->repo);
+
+	/* initialize basics of new index */
+	full = xcalloc(1, sizeof(struct index_state));
+	memcpy(full, istate, sizeof(struct index_state));
+
+	/* then change the necessary things */
+	full->sparse_index = 0;
+	full->cache_alloc = (3 * istate->cache_alloc) / 2;
+	full->cache_nr = 0;
+	ALLOC_ARRAY(full->cache, full->cache_alloc);
+
+	for (i = 0; i < istate->cache_nr; i++) {
+		struct cache_entry *ce = istate->cache[i];
+		struct tree *tree;
+		struct pathspec ps;
+
+		if (!S_ISSPARSEDIR(ce->ce_mode)) {
+			set_index_entry(full, full->cache_nr++, ce);
+			continue;
+		}
+		if (!(ce->ce_flags & CE_SKIP_WORKTREE))
+			warning(_("index entry is a directory, but not sparse (%08x)"),
+				ce->ce_flags);
+
+		/* recursively walk into ce->name */
+		tree = lookup_tree(istate->repo, &ce->oid);
+
+		memset(&ps, 0, sizeof(ps));
+		ps.recursive = 1;
+		ps.has_wildcard = 1;
+		ps.max_depth = -1;
+
+		strbuf_setlen(&base, 0);
+		strbuf_add(&base, ce->name, strlen(ce->name));
+
+		read_tree_at(istate->repo, tree, &base, &ps,
+			     add_path_to_index, full);
+
+		/* free directory entries. full entries are re-used */
+		discard_cache_entry(ce);
+	}
+
+	/* Copy back into original index. */
+	memcpy(&istate->name_hash, &full->name_hash, sizeof(full->name_hash));
+	istate->sparse_index = 0;
+	free(istate->cache);
+	istate->cache = full->cache;
+	istate->cache_nr = full->cache_nr;
+	istate->cache_alloc = full->cache_alloc;
+
+	strbuf_release(&base);
+	free(full);
+
+	trace2_region_leave("index", "ensure_full_index", istate->repo);
 }
-- 
gitgitgadget



* [PATCH v3 06/20] t1092: compare sparse-checkout to sparse-index
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (4 preceding siblings ...)
  2021-03-16 16:42     ` [PATCH v3 05/20] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
@ 2021-03-16 16:42     ` Derrick Stolee via GitGitGadget
  2021-03-16 16:42     ` [PATCH v3 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
                       ` (17 subsequent siblings)
  23 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add a new 'sparse-index' repo alongside the 'full-checkout' and
'sparse-checkout' repos in t1092-sparse-checkout-compatibility.sh. Also
add run_on_sparse and test_sparse_match helpers. These helpers will be
used when the sparse index is implemented.

Add the GIT_TEST_SPARSE_INDEX environment variable to enable the
sparse-index by default. This can be enabled across all tests, but that
will only affect cases where the sparse-checkout feature is enabled.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/README                                 |  3 +++
 t/t1092-sparse-checkout-compatibility.sh | 24 ++++++++++++++++++++----
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/t/README b/t/README
index 593d4a4e270c..b98bc563aab5 100644
--- a/t/README
+++ b/t/README
@@ -439,6 +439,9 @@ and "sha256".
 GIT_TEST_WRITE_REV_INDEX=<boolean>, when true enables the
 'pack.writeReverseIndex' setting.
 
+GIT_TEST_SPARSE_INDEX=<boolean>, when true enables index writes to use the
+sparse-index format by default.
+
 Naming Tests
 ------------
 
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 3725d3997e70..de5d8461c993 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -7,6 +7,7 @@ test_description='compare full workdir to sparse workdir'
 test_expect_success 'setup' '
 	git init initial-repo &&
 	(
+		GIT_TEST_SPARSE_INDEX=0 &&
 		cd initial-repo &&
 		echo a >a &&
 		echo "after deep" >e &&
@@ -87,23 +88,32 @@ init_repos () {
 
 	cp -r initial-repo sparse-checkout &&
 	git -C sparse-checkout reset --hard &&
-	git -C sparse-checkout sparse-checkout init --cone &&
+
+	cp -r initial-repo sparse-index &&
+	git -C sparse-index reset --hard &&
 
 	# initialize sparse-checkout definitions
-	git -C sparse-checkout sparse-checkout set deep
+	git -C sparse-checkout sparse-checkout init --cone &&
+	git -C sparse-checkout sparse-checkout set deep &&
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout init --cone &&
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep
 }
 
 run_on_sparse () {
 	(
 		cd sparse-checkout &&
-		"$@" >../sparse-checkout-out 2>../sparse-checkout-err
+		GIT_TEST_SPARSE_INDEX=0 "$@" >../sparse-checkout-out 2>../sparse-checkout-err
+	) &&
+	(
+		cd sparse-index &&
+		GIT_TEST_SPARSE_INDEX=1 "$@" >../sparse-index-out 2>../sparse-index-err
 	)
 }
 
 run_on_all () {
 	(
 		cd full-checkout &&
-		"$@" >../full-checkout-out 2>../full-checkout-err
+		GIT_TEST_SPARSE_INDEX=0 "$@" >../full-checkout-out 2>../full-checkout-err
 	) &&
 	run_on_sparse "$@"
 }
@@ -114,6 +124,12 @@ test_all_match () {
 	test_cmp full-checkout-err sparse-checkout-err
 }
 
+test_sparse_match () {
+	run_on_sparse "$@" &&
+	test_cmp sparse-checkout-out sparse-index-out &&
+	test_cmp sparse-checkout-err sparse-index-err
+}
+
 test_expect_success 'status with options' '
 	init_repos &&
 	test_all_match git status --porcelain=v2 &&
-- 
gitgitgadget



* [PATCH v3 07/20] test-read-cache: print cache entries with --table
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (5 preceding siblings ...)
  2021-03-16 16:42     ` [PATCH v3 06/20] t1092: compare sparse-checkout to sparse-index Derrick Stolee via GitGitGadget
@ 2021-03-16 16:42     ` Derrick Stolee via GitGitGadget
  2021-03-17 13:28       ` [RFC/PATCH 0/5] " Ævar Arnfjörð Bjarmason
                         ` (5 more replies)
  2021-03-16 16:42     ` [PATCH v3 08/20] test-tool: don't force full index Derrick Stolee via GitGitGadget
                       ` (16 subsequent siblings)
  23 siblings, 6 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

This table is helpful for discovering data in the index to ensure it is
being written correctly, especially as we build and test the
sparse-index. This table includes an output format similar to 'git
ls-tree', but should not be compared to that directly. The biggest
reason is that 'git ls-tree' includes a tree entry for every
subdirectory, even those that would not appear as a sparse directory in
a sparse-index. Further, 'git ls-tree' does not use a trailing directory
separator for its tree rows.

This does not print the stat() information for the blobs. That could be
added in a future change with another option. The tests that are added
in the next few changes care only about the object types and IDs.

To make the option parsing slightly more robust, wrap the string
comparisons in a loop adapted from test-dir-iterator.c.

Care must be taken with the final check for the 'cnt' variable. We
continue the expectation that the numerical value is the final argument.
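
The parsing shape described here can be sketched with only the C standard library. The toy_* names are hypothetical, and toy_starts_with() is a local helper written for this example rather than Git's starts_with() or skip_prefix(). The loop assumes argv is NULL-terminated, as it is when passed to main().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct toy_opts {
	int table;
	const char *name;   /* value of --print-and-refresh=<name>, or NULL */
	long cnt;           /* repeat count; defaults to 1 */
};

/* Local helper: does 's' begin with 'prefix'? */
static int toy_starts_with(const char *s, const char *prefix)
{
	return strncmp(s, prefix, strlen(prefix)) == 0;
}

static void toy_parse(int argc, const char **argv, struct toy_opts *o)
{
	static const char refresh[] = "--print-and-refresh=";

	o->table = 0;
	o->name = NULL;
	o->cnt = 1;

	/* Skip argv[0], then consume every leading "--" option. */
	for (++argv, --argc; *argv && toy_starts_with(*argv, "--"); ++argv, --argc) {
		if (toy_starts_with(*argv, refresh))
			o->name = *argv + strlen(refresh);
		else if (!strcmp(*argv, "--table"))
			o->table = 1;
	}

	/* Exactly one argument left over: treat it as the count. */
	if (argc == 1)
		o->cnt = strtol(argv[0], NULL, 0);
}
```

Because the loop has already stepped past the program name and the options, the count is found at argv[0] with argc == 1, not at argv[1] with argc == 2 as before the loop was introduced.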

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/helper/test-read-cache.c | 55 +++++++++++++++++++++++++++++++-------
 1 file changed, 45 insertions(+), 10 deletions(-)

diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
index 244977a29bdf..6cfd8f2de71c 100644
--- a/t/helper/test-read-cache.c
+++ b/t/helper/test-read-cache.c
@@ -1,36 +1,71 @@
 #include "test-tool.h"
 #include "cache.h"
 #include "config.h"
+#include "blob.h"
+#include "commit.h"
+#include "tree.h"
+
+static void print_cache_entry(struct cache_entry *ce)
+{
+	const char *type;
+	printf("%06o ", ce->ce_mode & 0177777);
+
+	if (S_ISSPARSEDIR(ce->ce_mode))
+		type = tree_type;
+	else if (S_ISGITLINK(ce->ce_mode))
+		type = commit_type;
+	else
+		type = blob_type;
+
+	printf("%s %s\t%s\n",
+	       type,
+	       oid_to_hex(&ce->oid),
+	       ce->name);
+}
+
+static void print_cache(struct index_state *istate)
+{
+	int i;
+	for (i = 0; i < istate->cache_nr; i++)
+		print_cache_entry(istate->cache[i]);
+}
 
 int cmd__read_cache(int argc, const char **argv)
 {
+	struct repository *r = the_repository;
 	int i, cnt = 1;
 	const char *name = NULL;
+	int table = 0;
 
-	if (argc > 1 && skip_prefix(argv[1], "--print-and-refresh=", &name)) {
-		argc--;
-		argv++;
+	for (++argv, --argc; *argv && starts_with(*argv, "--"); ++argv, --argc) {
+		if (skip_prefix(*argv, "--print-and-refresh=", &name))
+			continue;
+		if (!strcmp(*argv, "--table"))
+			table = 1;
 	}
 
-	if (argc == 2)
-		cnt = strtol(argv[1], NULL, 0);
+	if (argc == 1)
+		cnt = strtol(argv[0], NULL, 0);
 	setup_git_directory();
 	git_config(git_default_config, NULL);
+
 	for (i = 0; i < cnt; i++) {
-		read_cache();
+		repo_read_index(r);
 		if (name) {
 			int pos;
 
-			refresh_index(&the_index, REFRESH_QUIET,
+			refresh_index(r->index, REFRESH_QUIET,
 				      NULL, NULL, NULL);
-			pos = index_name_pos(&the_index, name, strlen(name));
+			pos = index_name_pos(r->index, name, strlen(name));
 			if (pos < 0)
 				die("%s not in index", name);
 			printf("%s is%s up to date\n", name,
-			       ce_uptodate(the_index.cache[pos]) ? "" : " not");
+			       ce_uptodate(r->index->cache[pos]) ? "" : " not");
 			write_file(name, "%d\n", i);
 		}
-		discard_cache();
+		if (table)
+			print_cache(r->index);
+		discard_index(r->index);
 	}
 	return 0;
 }
-- 
gitgitgadget



* [PATCH v3 08/20] test-tool: don't force full index
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (6 preceding siblings ...)
  2021-03-16 16:42     ` [PATCH v3 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
@ 2021-03-16 16:42     ` Derrick Stolee via GitGitGadget
  2021-03-16 16:42     ` [PATCH v3 09/20] unpack-trees: ensure " Derrick Stolee via GitGitGadget
                       ` (15 subsequent siblings)
  23 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

We will use 'test-tool read-cache --table' to check that a sparse
index is written as part of init_repos. Since we will no longer always
expand a sparse index into a full index, add an '--expand' parameter
that adds a call to ensure_full_index() so we can compare a sparse index
directly against a full index, or at least what the in-memory index
looks like when expanded in this way.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/helper/test-read-cache.c               | 13 ++++++++++++-
 t/t1092-sparse-checkout-compatibility.sh |  5 +++++
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
index 6cfd8f2de71c..b52c174acc7a 100644
--- a/t/helper/test-read-cache.c
+++ b/t/helper/test-read-cache.c
@@ -4,6 +4,7 @@
 #include "blob.h"
 #include "commit.h"
 #include "tree.h"
+#include "sparse-index.h"
 
 static void print_cache_entry(struct cache_entry *ce)
 {
@@ -35,13 +36,19 @@ int cmd__read_cache(int argc, const char **argv)
 	struct repository *r = the_repository;
 	int i, cnt = 1;
 	const char *name = NULL;
-	int table = 0;
+	int table = 0, expand = 0;
+
+	initialize_the_repository();
+	prepare_repo_settings(r);
+	r->settings.command_requires_full_index = 0;
 
 	for (++argv, --argc; *argv && starts_with(*argv, "--"); ++argv, --argc) {
 		if (skip_prefix(*argv, "--print-and-refresh=", &name))
 			continue;
 		if (!strcmp(*argv, "--table"))
 			table = 1;
+		else if (!strcmp(*argv, "--expand"))
+			expand = 1;
 	}
 
 	if (argc == 1)
@@ -51,6 +58,10 @@ int cmd__read_cache(int argc, const char **argv)
 
 	for (i = 0; i < cnt; i++) {
 		repo_read_index(r);
+
+		if (expand)
+			ensure_full_index(r->index);
+
 		if (name) {
 			int pos;
 
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index de5d8461c993..a1aea141c62c 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -130,6 +130,11 @@ test_sparse_match () {
 	test_cmp sparse-checkout-err sparse-index-err
 }
 
+test_expect_success 'expanded in-memory index matches full index' '
+	init_repos &&
+	test_sparse_match test-tool read-cache --expand --table
+'
+
 test_expect_success 'status with options' '
 	init_repos &&
 	test_all_match git status --porcelain=v2 &&
-- 
gitgitgadget



* [PATCH v3 09/20] unpack-trees: ensure full index
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (7 preceding siblings ...)
  2021-03-16 16:42     ` [PATCH v3 08/20] test-tool: don't force full index Derrick Stolee via GitGitGadget
@ 2021-03-16 16:42     ` Derrick Stolee via GitGitGadget
  2021-03-16 16:42     ` [PATCH v3 10/20] sparse-checkout: hold pattern list in index Derrick Stolee via GitGitGadget
                       ` (14 subsequent siblings)
  23 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The next change will translate full indexes into sparse indexes at write
time. The existing logic provides a way for every sparse index to be
expanded to a full index at read time. However, there are cases where an
index is written and then continues to be used in-memory to perform
further updates.

unpack_trees() is frequently called after such a write. In particular,
commands like 'git reset' do this double-update of the index.

Ensure that we have a full index when entering unpack_trees(), but only
when command_requires_full_index is true. This is always true at the
moment, but we will later relax that after unpack_trees() is updated to
handle sparse directory entries.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 unpack-trees.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/unpack-trees.c b/unpack-trees.c
index eb8fcda31ba7..2da3e5ec77a1 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -1570,6 +1570,7 @@ static int verify_absent(const struct cache_entry *,
  */
 int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options *o)
 {
+	struct repository *repo = the_repository;
 	int i, ret;
 	static struct cache_entry *dfc;
 	struct pattern_list pl;
@@ -1581,6 +1582,12 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
 	trace_performance_enter();
 	trace2_region_enter("unpack_trees", "unpack_trees", the_repository);
 
+	prepare_repo_settings(repo);
+	if (repo->settings.command_requires_full_index) {
+		ensure_full_index(o->src_index);
+		ensure_full_index(o->dst_index);
+	}
+
 	if (!core_apply_sparse_checkout || !o->update)
 		o->skip_sparse_checkout = 1;
 	if (!o->skip_sparse_checkout && !o->pl) {
-- 
gitgitgadget



* [PATCH v3 10/20] sparse-checkout: hold pattern list in index
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (8 preceding siblings ...)
  2021-03-16 16:42     ` [PATCH v3 09/20] unpack-trees: ensure " Derrick Stolee via GitGitGadget
@ 2021-03-16 16:42     ` Derrick Stolee via GitGitGadget
  2021-03-16 16:42     ` [PATCH v3 11/20] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
                       ` (13 subsequent siblings)
  23 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

As we modify the sparse-checkout definition, we perform index operations
on a pattern_list that only exists in-memory. This allows easy backing
out in case the index update fails.

However, if the index write itself cares about the sparse-checkout
pattern set, we need access to that in-memory copy. Place a pointer to
a 'struct pattern_list' in the index so we can access this on-demand.
This will be used in the next change which uses the sparse-checkout
definition to filter out directories that are outside the sparse cone.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/sparse-checkout.c | 17 ++++++++++-------
 cache.h                   |  2 ++
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
index 2306a9ad98e0..e00b82af727b 100644
--- a/builtin/sparse-checkout.c
+++ b/builtin/sparse-checkout.c
@@ -110,6 +110,8 @@ static int update_working_directory(struct pattern_list *pl)
 	if (is_index_unborn(r->index))
 		return UPDATE_SPARSITY_SUCCESS;
 
+	r->index->sparse_checkout_patterns = pl;
+
 	memset(&o, 0, sizeof(o));
 	o.verbose_update = isatty(2);
 	o.update = 1;
@@ -138,6 +140,7 @@ static int update_working_directory(struct pattern_list *pl)
 	else
 		rollback_lock_file(&lock_file);
 
+	r->index->sparse_checkout_patterns = NULL;
 	return result;
 }
 
@@ -517,19 +520,18 @@ static int modify_pattern_list(int argc, const char **argv, enum modify_type m)
 {
 	int result;
 	int changed_config = 0;
-	struct pattern_list pl;
-	memset(&pl, 0, sizeof(pl));
+	struct pattern_list *pl = xcalloc(1, sizeof(*pl));
 
 	switch (m) {
 	case ADD:
 		if (core_sparse_checkout_cone)
-			add_patterns_cone_mode(argc, argv, &pl);
+			add_patterns_cone_mode(argc, argv, pl);
 		else
-			add_patterns_literal(argc, argv, &pl);
+			add_patterns_literal(argc, argv, pl);
 		break;
 
 	case REPLACE:
-		add_patterns_from_input(&pl, argc, argv);
+		add_patterns_from_input(pl, argc, argv);
 		break;
 	}
 
@@ -539,12 +541,13 @@ static int modify_pattern_list(int argc, const char **argv, enum modify_type m)
 		changed_config = 1;
 	}
 
-	result = write_patterns_and_update(&pl);
+	result = write_patterns_and_update(pl);
 
 	if (result && changed_config)
 		set_config(MODE_NO_PATTERNS);
 
-	clear_pattern_list(&pl);
+	clear_pattern_list(pl);
+	free(pl);
 	return result;
 }
 
diff --git a/cache.h b/cache.h
index abb00a068e5d..759ca92e2ecc 100644
--- a/cache.h
+++ b/cache.h
@@ -307,6 +307,7 @@ static inline unsigned int canon_mode(unsigned int mode)
 struct split_index;
 struct untracked_cache;
 struct progress;
+struct pattern_list;
 
 struct index_state {
 	struct cache_entry **cache;
@@ -338,6 +339,7 @@ struct index_state {
 	struct mem_pool *ce_mem_pool;
 	struct progress *progress;
 	struct repository *repo;
+	struct pattern_list *sparse_checkout_patterns;
 };
 
 /* Name hashing */
-- 
gitgitgadget



* [PATCH v3 11/20] sparse-index: convert from full to sparse
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (9 preceding siblings ...)
  2021-03-16 16:42     ` [PATCH v3 10/20] sparse-checkout: hold pattern list in index Derrick Stolee via GitGitGadget
@ 2021-03-16 16:42     ` Derrick Stolee via GitGitGadget
  2021-03-17 13:43       ` Ævar Arnfjörð Bjarmason
  2021-03-16 16:42     ` [PATCH v3 12/20] submodule: sparse-index should not collapse links Derrick Stolee via GitGitGadget
                       ` (12 subsequent siblings)
  23 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

If we have a full index, then we can convert it to a sparse index by
replacing directories outside of the sparse cone with sparse directory
entries. The convert_to_sparse() method does this, when the situation is
appropriate.

For now, we avoid converting the index to a sparse index if:

 1. the index is split.
 2. the index is already sparse.
 3. sparse-checkout is disabled.
 4. sparse-checkout does not use cone mode.

Finally, we currently limit the conversion to when the
GIT_TEST_SPARSE_INDEX environment variable is enabled. A mode using Git
config will be added in a later change.

The trickiest thing about this conversion is that we might not be able
to mark a directory as a sparse directory just because it is outside the
sparse cone. There might be unmerged files within that directory, so we
need to look for those. Also, if there is some strange reason why a file
is not marked with CE_SKIP_WORKTREE, then we should give up on
converting that directory. There is still hope that some of its
subdirectories might be able to convert to sparse, so we keep looking
deeper.
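
The per-directory decision can be sketched as follows. This is an illustration under simplifying assumptions, not the logic of convert_to_sparse(): 'struct toy_entry' and toy_can_collapse() are hypothetical, the index is modeled as a flat array of paths, and the check of whether the directory lies outside the sparse cone is omitted entirely.

```c
#include <assert.h>
#include <string.h>

struct toy_entry {
	const char *path;
	int skip_worktree;   /* stand-in for CE_SKIP_WORKTREE */
	int stage;           /* nonzero means the entry is unmerged */
};

/*
 * A directory may collapse only if it contains at least one entry and
 * every entry below it is merged and marked skip-worktree. 'dir' must
 * end in '/'.
 */
static int toy_can_collapse(const struct toy_entry *e, size_t n,
			    const char *dir)
{
	size_t i, dirlen = strlen(dir);
	int seen = 0;

	for (i = 0; i < n; i++) {
		if (strncmp(e[i].path, dir, dirlen))
			continue;   /* not inside dir */
		seen = 1;
		if (e[i].stage || !e[i].skip_worktree)
			return 0;
	}
	return seen;
}
```

When a directory fails this test because of a single unmerged file, a subdirectory that holds only merged, skipped entries can still pass it, which is why the conversion keeps looking deeper.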

The conversion process is assisted by the cache-tree extension. This is
calculated from the full index if it does not already exist. We then
abandon the cache-tree as it no longer applies to the newly-sparse
index. Thus, this cache-tree will be recalculated in every
sparse-full-sparse round-trip until we integrate the cache-tree
extension with the sparse index.

Some Git commands use the index after writing it. For example, 'git add'
will update the index, then write it to disk, then read its entries to
report information. To keep the in-memory index in a full state after
writing, we re-expand it to a full one after the write. This is wasteful
for commands that only write the index and do not read from it again,
but that is only the case until we make those commands "sparse aware."
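
The write-then-re-expand pattern can be sketched like this. The rt_* names are hypothetical, and the "write" is reduced to recording whether the index was sparse at the moment it would have reached disk. It assumes the conversion always succeeds and always produces a sparse index, which the real convert_to_sparse() does not guarantee.

```c
#include <assert.h>

struct rt_index {
	unsigned sparse : 1;
	int wrote_sparse;   /* what the on-disk copy looked like, for illustration */
};

/* Stand-in for convert_to_sparse(); returns 0 on success. */
static int rt_convert_to_sparse(struct rt_index *istate)
{
	istate->sparse = 1;
	return 0;
}

/* Stand-in for ensure_full_index(). */
static void rt_ensure_full(struct rt_index *istate)
{
	istate->sparse = 0;
}

static int rt_write(struct rt_index *istate)
{
	int was_full = !istate->sparse;
	int ret = rt_convert_to_sparse(istate);

	if (ret)
		return ret;

	/* The on-disk copy is written while the index is sparse. */
	istate->wrote_sparse = istate->sparse;

	/* Callers that held a full index get a full index back. */
	if (was_full)
		rt_ensure_full(istate);
	return 0;
}
```

Recording 'was_full' before the conversion is what lets an index that was already sparse stay sparse, while an index that was full is restored for callers that read it again after the write.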

We can compare the behavior of the sparse-index in
t1092-sparse-checkout-compatibility.sh by using GIT_TEST_SPARSE_INDEX=1
when operating on the 'sparse-index' repo. We can also compare the two
sparse repos directly, such as comparing their indexes (when expanded to
full in the case of the 'sparse-index' repo). We also verify that the
index is actually populated with sparse directory entries.

The 'checkout and reset (mixed)' test is marked for failure when
comparing a sparse repo to a full repo, but we can compare the two
sparse-checkout cases directly to ensure that we are not changing the
behavior when using a sparse index.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache-tree.c                             |   3 +
 cache.h                                  |   2 +
 read-cache.c                             |  26 ++++-
 sparse-index.c                           | 139 +++++++++++++++++++++++
 sparse-index.h                           |   1 +
 t/t1092-sparse-checkout-compatibility.sh |  61 +++++++++-
 6 files changed, 228 insertions(+), 4 deletions(-)

diff --git a/cache-tree.c b/cache-tree.c
index 2fb483d3c083..5f07a39e501e 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -6,6 +6,7 @@
 #include "object-store.h"
 #include "replace-object.h"
 #include "promisor-remote.h"
+#include "sparse-index.h"
 
 #ifndef DEBUG_CACHE_TREE
 #define DEBUG_CACHE_TREE 0
@@ -442,6 +443,8 @@ int cache_tree_update(struct index_state *istate, int flags)
 	if (i)
 		return i;
 
+	ensure_full_index(istate);
+
 	if (!istate->cache_tree)
 		istate->cache_tree = cache_tree();
 
diff --git a/cache.h b/cache.h
index 759ca92e2ecc..69a32146cd77 100644
--- a/cache.h
+++ b/cache.h
@@ -251,6 +251,8 @@ static inline unsigned int create_ce_mode(unsigned int mode)
 {
 	if (S_ISLNK(mode))
 		return S_IFLNK;
+	if (mode == S_IFDIR)
+		return S_IFDIR;
 	if (S_ISDIR(mode) || S_ISGITLINK(mode))
 		return S_IFGITLINK;
 	return S_IFREG | ce_permissions(mode);
diff --git a/read-cache.c b/read-cache.c
index dd3980c12b53..b9c08773466c 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -25,6 +25,7 @@
 #include "fsmonitor.h"
 #include "thread-utils.h"
 #include "progress.h"
+#include "sparse-index.h"
 
 /* Mask for the name length in ce_flags in the on-disk index */
 
@@ -1002,8 +1003,14 @@ int verify_path(const char *path, unsigned mode)
 
 			c = *path++;
 			if ((c == '.' && !verify_dotfile(path, mode)) ||
-			    is_dir_sep(c) || c == '\0')
+			    is_dir_sep(c))
 				return 0;
+			/*
+			 * allow terminating directory separators for
+			 * sparse directory entries.
+			 */
+			if (c == '\0')
+				return S_ISDIR(mode);
 		} else if (c == '\\' && protect_ntfs) {
 			if (is_ntfs_dotgit(path))
 				return 0;
@@ -3079,6 +3086,14 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l
 				 unsigned flags)
 {
 	int ret;
+	int was_full = !istate->sparse_index;
+
+	ret = convert_to_sparse(istate);
+
+	if (ret) {
+		warning(_("failed to convert to a sparse-index"));
+		return ret;
+	}
 
 	/*
 	 * TODO trace2: replace "the_repository" with the actual repo instance
@@ -3090,6 +3105,9 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l
 	trace2_region_leave_printf("index", "do_write_index", the_repository,
 				   "%s", get_lock_file_path(lock));
 
+	if (was_full)
+		ensure_full_index(istate);
+
 	if (ret)
 		return ret;
 	if (flags & COMMIT_LOCK)
@@ -3180,9 +3198,10 @@ static int write_shared_index(struct index_state *istate,
 			      struct tempfile **temp)
 {
 	struct split_index *si = istate->split_index;
-	int ret;
+	int ret, was_full = !istate->sparse_index;
 
 	move_cache_to_base_index(istate);
+	convert_to_sparse(istate);
 
 	trace2_region_enter_printf("index", "shared/do_write_index",
 				   the_repository, "%s", get_tempfile_path(*temp));
@@ -3190,6 +3209,9 @@ static int write_shared_index(struct index_state *istate,
 	trace2_region_leave_printf("index", "shared/do_write_index",
 				   the_repository, "%s", get_tempfile_path(*temp));
 
+	if (was_full)
+		ensure_full_index(istate);
+
 	if (ret)
 		return ret;
 	ret = adjust_shared_perm(get_tempfile_path(*temp));
diff --git a/sparse-index.c b/sparse-index.c
index 7095378a1b28..619ff7c2e217 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -4,6 +4,145 @@
 #include "tree.h"
 #include "pathspec.h"
 #include "trace2.h"
+#include "cache-tree.h"
+#include "config.h"
+#include "dir.h"
+#include "fsmonitor.h"
+
+static struct cache_entry *construct_sparse_dir_entry(
+				struct index_state *istate,
+				const char *sparse_dir,
+				struct cache_tree *tree)
+{
+	struct cache_entry *de;
+
+	de = make_cache_entry(istate, S_IFDIR, &tree->oid, sparse_dir, 0, 0);
+
+	de->ce_flags |= CE_SKIP_WORKTREE;
+	return de;
+}
+
+/*
+ * Returns the number of entries "inserted" into the index.
+ */
+static int convert_to_sparse_rec(struct index_state *istate,
+				 int num_converted,
+				 int start, int end,
+				 const char *ct_path, size_t ct_pathlen,
+				 struct cache_tree *ct)
+{
+	int i, can_convert = 1;
+	int start_converted = num_converted;
+	enum pattern_match_result match;
+	int dtype;
+	struct strbuf child_path = STRBUF_INIT;
+	struct pattern_list *pl = istate->sparse_checkout_patterns;
+
+	/*
+	 * Is the current path outside of the sparse cone?
+	 * Then check if the region can be replaced by a sparse
+	 * directory entry (everything is sparse and merged).
+	 */
+	match = path_matches_pattern_list(ct_path, ct_pathlen,
+					  NULL, &dtype, pl, istate);
+	if (match != NOT_MATCHED)
+		can_convert = 0;
+
+	for (i = start; can_convert && i < end; i++) {
+		struct cache_entry *ce = istate->cache[i];
+
+		if (ce_stage(ce) ||
+		    !(ce->ce_flags & CE_SKIP_WORKTREE))
+			can_convert = 0;
+	}
+
+	if (can_convert) {
+		struct cache_entry *se;
+		se = construct_sparse_dir_entry(istate, ct_path, ct);
+
+		istate->cache[num_converted++] = se;
+		return 1;
+	}
+
+	for (i = start; i < end; ) {
+		int count, span, pos = -1;
+		const char *base, *slash;
+		struct cache_entry *ce = istate->cache[i];
+
+		/*
+		 * Detect if this is a normal entry outside of any subtree
+		 * entry.
+		 */
+		base = ce->name + ct_pathlen;
+		slash = strchr(base, '/');
+
+		if (slash)
+			pos = cache_tree_subtree_pos(ct, base, slash - base);
+
+		if (pos < 0) {
+			istate->cache[num_converted++] = ce;
+			i++;
+			continue;
+		}
+
+		strbuf_setlen(&child_path, 0);
+		strbuf_add(&child_path, ce->name, slash - ce->name + 1);
+
+		span = ct->down[pos]->cache_tree->entry_count;
+		count = convert_to_sparse_rec(istate,
+					      num_converted, i, i + span,
+					      child_path.buf, child_path.len,
+					      ct->down[pos]->cache_tree);
+		num_converted += count;
+		i += span;
+	}
+
+	strbuf_release(&child_path);
+	return num_converted - start_converted;
+}
+
+int convert_to_sparse(struct index_state *istate)
+{
+	if (istate->split_index || istate->sparse_index ||
+	    !core_apply_sparse_checkout || !core_sparse_checkout_cone)
+		return 0;
+
+	/*
+	 * For now, only create a sparse index with the
+	 * GIT_TEST_SPARSE_INDEX environment variable. We will relax
+	 * this once we have a proper way to opt-in (and later still,
+	 * opt-out).
+	 */
+	if (!git_env_bool("GIT_TEST_SPARSE_INDEX", 0))
+		return 0;
+
+	if (!istate->sparse_checkout_patterns) {
+		istate->sparse_checkout_patterns = xcalloc(1, sizeof(struct pattern_list));
+		if (get_sparse_checkout_patterns(istate->sparse_checkout_patterns) < 0)
+			return 0;
+	}
+
+	if (!istate->sparse_checkout_patterns->use_cone_patterns) {
+		warning(_("attempting to use sparse-index without cone mode"));
+		return -1;
+	}
+
+	if (cache_tree_update(istate, 0)) {
+		warning(_("unable to update cache-tree, staying full"));
+		return -1;
+	}
+
+	remove_fsmonitor(istate);
+
+	trace2_region_enter("index", "convert_to_sparse", istate->repo);
+	istate->cache_nr = convert_to_sparse_rec(istate,
+						 0, 0, istate->cache_nr,
+						 "", 0, istate->cache_tree);
+	istate->drop_cache_tree = 1;
+	istate->sparse_index = 1;
+	trace2_region_leave("index", "convert_to_sparse", istate->repo);
+	return 0;
+}
 
 static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
 {
diff --git a/sparse-index.h b/sparse-index.h
index 09a20d036c46..64380e121d80 100644
--- a/sparse-index.h
+++ b/sparse-index.h
@@ -3,5 +3,6 @@
 
 struct index_state;
 void ensure_full_index(struct index_state *istate);
+int convert_to_sparse(struct index_state *istate);
 
 #endif
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index a1aea141c62c..1e888d195122 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -2,6 +2,11 @@
 
 test_description='compare full workdir to sparse workdir'
 
+# The verify_cache_tree() check is not sparse-aware (yet).
+# So, disable the check until that integration is complete.
+GIT_TEST_CHECK_CACHE_TREE=0
+GIT_TEST_SPLIT_INDEX=0
+
 . ./test-lib.sh
 
 test_expect_success 'setup' '
@@ -121,7 +126,9 @@ run_on_all () {
 test_all_match () {
 	run_on_all "$@" &&
 	test_cmp full-checkout-out sparse-checkout-out &&
-	test_cmp full-checkout-err sparse-checkout-err
+	test_cmp full-checkout-out sparse-index-out &&
+	test_cmp full-checkout-err sparse-checkout-err &&
+	test_cmp full-checkout-err sparse-index-err
 }
 
 test_sparse_match () {
@@ -130,6 +137,38 @@ test_sparse_match () {
 	test_cmp sparse-checkout-err sparse-index-err
 }
 
+test_expect_success 'sparse-index contents' '
+	init_repos &&
+
+	test-tool -C sparse-index read-cache --table >cache &&
+	for dir in folder1 folder2 x
+	do
+		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
+		grep "040000 tree $TREE	$dir/" cache \
+			|| return 1
+	done &&
+
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set folder1 &&
+
+	test-tool -C sparse-index read-cache --table >cache &&
+	for dir in deep folder2 x
+	do
+		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
+		grep "040000 tree $TREE	$dir/" cache \
+			|| return 1
+	done &&
+
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep/deeper1 &&
+
+	test-tool -C sparse-index read-cache --table >cache &&
+	for dir in deep/deeper2 folder1 folder2 x
+	do
+		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
+		grep "040000 tree $TREE	$dir/" cache \
+			|| return 1
+	done
+'
+
 test_expect_success 'expanded in-memory index matches full index' '
 	init_repos &&
 	test_sparse_match test-tool read-cache --expand --table
@@ -137,6 +176,7 @@ test_expect_success 'expanded in-memory index matches full index' '
 
 test_expect_success 'status with options' '
 	init_repos &&
+	test_sparse_match ls &&
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git status --porcelain=v2 -z -u &&
 	test_all_match git status --porcelain=v2 -uno &&
@@ -273,6 +313,17 @@ test_expect_failure 'checkout and reset (mixed)' '
 	test_all_match git reset update-folder2
 '
 
+# Ensure that sparse-index behaves identically to
+# sparse-checkout with a full index.
+test_expect_success 'checkout and reset (mixed) [sparse]' '
+	init_repos &&
+
+	test_sparse_match git checkout -b reset-test update-deep &&
+	test_sparse_match git reset deepest &&
+	test_sparse_match git reset update-folder1 &&
+	test_sparse_match git reset update-folder2
+'
+
 test_expect_success 'merge' '
 	init_repos &&
 
@@ -309,14 +360,20 @@ test_expect_success 'clean' '
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git clean -f &&
 	test_all_match git status --porcelain=v2 &&
+	test_sparse_match ls &&
+	test_sparse_match ls folder1 &&
 
 	test_all_match git clean -xf &&
 	test_all_match git status --porcelain=v2 &&
+	test_sparse_match ls &&
+	test_sparse_match ls folder1 &&
 
 	test_all_match git clean -xdf &&
 	test_all_match git status --porcelain=v2 &&
+	test_sparse_match ls &&
+	test_sparse_match ls folder1 &&
 
-	test_path_is_dir sparse-checkout/folder1
+	test_sparse_match test_path_is_dir folder1
 '
 
 test_done
-- 
gitgitgadget



* [PATCH v3 12/20] submodule: sparse-index should not collapse links
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (10 preceding siblings ...)
  2021-03-16 16:42     ` [PATCH v3 11/20] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
@ 2021-03-16 16:42     ` Derrick Stolee via GitGitGadget
  2021-03-16 16:42     ` [PATCH v3 13/20] unpack-trees: allow sparse directories Derrick Stolee via GitGitGadget
                       ` (11 subsequent siblings)
  23 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

A submodule is stored as a "Git link" that actually points to a commit
within a submodule. Submodules are populated or not depending on
submodule configuration, not sparse-checkout. To ensure that the
sparse-index feature integrates correctly with submodules, we should not
collapse a directory if there is a Git link within its range.
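The rule above can be sketched outside of Git as a small, self-contained
model. Everything here is hypothetical (the toy_entry struct, the toy_
names, the mask constants); it only mirrors the shape of the check in
convert_to_sparse_rec(), where a single unmerged entry, Git link, or
entry without the skip-worktree bit keeps the whole range expanded.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the fields Git inspects. */
#define TOY_MODE_TYPE_MASK 0170000u
#define TOY_MODE_GITLINK   0160000u	/* mode used for submodule links */

struct toy_entry {
	unsigned int mode;	/* e.g. 0100644 for a blob, 0160000 for a link */
	int stage;		/* nonzero while the path is unmerged */
	int skip_worktree;	/* 1 if the skip-worktree bit is set */
};

/*
 * Return 1 if every entry in [start, end) could be folded into one
 * sparse directory entry, 0 otherwise.
 */
int toy_can_collapse(const struct toy_entry *entries,
		     size_t start, size_t end)
{
	size_t i;

	assert(start <= end);
	for (i = start; i < end; i++) {
		const struct toy_entry *e = &entries[i];

		if (e->stage ||
		    (e->mode & TOY_MODE_TYPE_MASK) == TOY_MODE_GITLINK ||
		    !e->skip_worktree)
			return 0;
	}
	return 1;
}
```

In this model, a range holding only skip-worktree blobs collapses, while
the same range plus one Git link does not, even when the link itself
carries the skip-worktree bit.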

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 sparse-index.c                           |  1 +
 t/t1092-sparse-checkout-compatibility.sh | 17 +++++++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/sparse-index.c b/sparse-index.c
index 619ff7c2e217..7631f7bd00b7 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -52,6 +52,7 @@ static int convert_to_sparse_rec(struct index_state *istate,
 		struct cache_entry *ce = istate->cache[i];
 
 		if (ce_stage(ce) ||
+		    S_ISGITLINK(ce->ce_mode) ||
 		    !(ce->ce_flags & CE_SKIP_WORKTREE))
 			can_convert = 0;
 	}
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 1e888d195122..cba5f89b1e96 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -376,4 +376,21 @@ test_expect_success 'clean' '
 	test_sparse_match test_path_is_dir folder1
 '
 
+test_expect_success 'submodule handling' '
+	init_repos &&
+
+	test_all_match mkdir modules &&
+	test_all_match touch modules/a &&
+	test_all_match git add modules &&
+	test_all_match git commit -m "add modules directory" &&
+
+	run_on_all git submodule add "$(pwd)/initial-repo" modules/sub &&
+	test_all_match git commit -m "add submodule" &&
+
+	# having a submodule prevents "modules" from collapse
+	test-tool -C sparse-index read-cache --table >cache &&
+	grep "100644 blob .*	modules/a" cache &&
+	grep "160000 commit $(git -C initial-repo rev-parse HEAD)	modules/sub" cache
+'
+
 test_done
-- 
gitgitgadget



* [PATCH v3 13/20] unpack-trees: allow sparse directories
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (11 preceding siblings ...)
  2021-03-16 16:42     ` [PATCH v3 12/20] submodule: sparse-index should not collapse links Derrick Stolee via GitGitGadget
@ 2021-03-16 16:42     ` Derrick Stolee via GitGitGadget
  2021-03-17 13:35       ` Ævar Arnfjörð Bjarmason
  2021-03-16 16:42     ` [PATCH v3 14/20] sparse-index: check index conversion happens Derrick Stolee via GitGitGadget
                       ` (10 subsequent siblings)
  23 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

index_pos_by_traverse_info() currently calls BUG() when a directory
entry is an exact match in the index. We need to consider that it is
possible to have a directory in a sparse index as long as that entry is
itself marked with the skip-worktree bit.

The 'pos' variable is assigned a negative value if an exact match is not
found. Since a directory name can be an exact match, it is no longer an
error to have a nonnegative 'pos' value.
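The sign convention for 'pos' can be illustrated with a standalone
sketch. The helpers below are hypothetical and only imitate the
convention described above: an exact match returns its index, and a miss
returns -(insertion point) - 1, so that "-pos - 1" recovers the position
where the name would be inserted.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: binary search over a sorted array of names.
 * Returns the index of an exact match, or -(insertion point) - 1.
 */
int toy_name_pos(const char *const *names, int nr, const char *name)
{
	int lo = 0, hi = nr;

	assert(nr >= 0);
	while (lo < hi) {
		int mid = lo + (hi - lo) / 2;
		int cmp = strcmp(name, names[mid]);

		if (!cmp)
			return mid;
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return -lo - 1;
}

/* Map either kind of result to the first position at or after the name. */
int toy_first_pos(int pos)
{
	return pos >= 0 ? pos : -pos - 1;
}
```

With a full index holding "deep/a" and "deep/b", a lookup of "deep/"
misses and decodes to the position of "deep/a". With a sparse index
holding the directory entry "deep/", the same lookup is an exact match
and the nonnegative result is used directly.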

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 unpack-trees.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/unpack-trees.c b/unpack-trees.c
index 2da3e5ec77a1..e81d82d72d89 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -749,9 +749,12 @@ static int index_pos_by_traverse_info(struct name_entry *names,
 	strbuf_make_traverse_path(&name, info, names->path, names->pathlen);
 	strbuf_addch(&name, '/');
 	pos = index_name_pos(o->src_index, name.buf, name.len);
-	if (pos >= 0)
-		BUG("This is a directory and should not exist in index");
-	pos = -pos - 1;
+	if (pos >= 0) {
+		if (!o->src_index->sparse_index ||
+		    !(o->src_index->cache[pos]->ce_flags & CE_SKIP_WORKTREE))
+			BUG("This is a directory and should not exist in index");
+	} else
+		pos = -pos - 1;
 	if (pos >= o->src_index->cache_nr ||
 	    !starts_with(o->src_index->cache[pos]->name, name.buf) ||
 	    (pos > 0 && starts_with(o->src_index->cache[pos-1]->name, name.buf)))
-- 
gitgitgadget



* [PATCH v3 14/20] sparse-index: check index conversion happens
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (12 preceding siblings ...)
  2021-03-16 16:42     ` [PATCH v3 13/20] unpack-trees: allow sparse directories Derrick Stolee via GitGitGadget
@ 2021-03-16 16:42     ` Derrick Stolee via GitGitGadget
  2021-03-16 16:42     ` [PATCH v3 15/20] sparse-index: create extension for compatibility Derrick Stolee via GitGitGadget
                       ` (9 subsequent siblings)
  23 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add a test case that uses test_region to ensure that we are truly
expanding a sparse index to a full one, then converting back to sparse
when writing the index. As we integrate more Git commands with the
sparse index, we will convert these commands to check that we do _not_
convert the sparse index to a full index and instead stay sparse the
entire time.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t1092-sparse-checkout-compatibility.sh | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index cba5f89b1e96..47f983217852 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -393,4 +393,22 @@ test_expect_success 'submodule handling' '
 	grep "160000 commit $(git -C initial-repo rev-parse HEAD)	modules/sub" cache
 '
 
+test_expect_success 'sparse-index is expanded and converted back' '
+	init_repos &&
+
+	(
+		GIT_TEST_SPARSE_INDEX=1 &&
+		export GIT_TEST_SPARSE_INDEX &&
+		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+			git -C sparse-index -c core.fsmonitor="" reset --hard &&
+		test_region index convert_to_sparse trace2.txt &&
+		test_region index ensure_full_index trace2.txt &&
+
+		rm trace2.txt &&
+		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+			git -C sparse-index -c core.fsmonitor="" status -uno &&
+		test_region index ensure_full_index trace2.txt
+	)
+'
+
 test_done
-- 
gitgitgadget



* [PATCH v3 15/20] sparse-index: create extension for compatibility
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (13 preceding siblings ...)
  2021-03-16 16:42     ` [PATCH v3 14/20] sparse-index: check index conversion happens Derrick Stolee via GitGitGadget
@ 2021-03-16 16:42     ` Derrick Stolee via GitGitGadget
  2021-03-16 16:42     ` [PATCH v3 16/20] sparse-checkout: toggle sparse index from builtin Derrick Stolee via GitGitGadget
                       ` (8 subsequent siblings)
  23 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Previously, we enabled the sparse index format only using
GIT_TEST_SPARSE_INDEX=1. That is not a feasible way for users to
actually select this mode. Further, sparse directory entries are not
understood by the index formats as advertised.

We _could_ add a new index version that explicitly adds these
capabilities, but there are nuances to index formats 2, 3, and 4 that
are still valuable to select as options. Until we add index format
version 5, create a repo extension, "extensions.sparseIndex", that
specifies that the tool reading this repository must understand sparse
directory entries.

This change only encodes the extension and enables it when
GIT_TEST_SPARSE_INDEX=1. Later, we will add a more user-friendly CLI
mechanism.
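As a rough illustration of why a repository extension gives the
compatibility guarantee, here is a minimal sketch. It assumes that a
reader refuses to open a repository listing an extension it does not
recognize; the toy_ types and functions are hypothetical and only
imitate the shape of the handle_extension() change in this patch.

```c
#include <assert.h>
#include <string.h>

enum toy_ext_result { TOY_EXT_UNKNOWN, TOY_EXT_OK };

/* Hypothetical subset of what a reader records while parsing config. */
struct toy_repo_format {
	int sparse_index;	/* set when extensions.sparseindex is seen */
	int unknown_count;	/* number of extensions not understood */
};

/* 'ext' is the lowercased key following the "extensions." prefix. */
enum toy_ext_result toy_handle_extension(const char *ext,
					 struct toy_repo_format *fmt)
{
	assert(ext && fmt);
	if (!strcmp(ext, "sparseindex")) {
		fmt->sparse_index = 1;
		return TOY_EXT_OK;
	}
	fmt->unknown_count++;
	return TOY_EXT_UNKNOWN;
}

/* Assumption: the repository is usable only if every extension is known. */
int toy_can_open(const struct toy_repo_format *fmt)
{
	return fmt->unknown_count == 0;
}
```

A reader built without the "sparseindex" branch would count the
extension as unknown and decline to open the repository, instead of
misreading directory entries in the index.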

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config/extensions.txt |  8 ++++++
 cache.h                             |  1 +
 repo-settings.c                     |  7 ++++++
 repository.h                        |  3 ++-
 setup.c                             |  3 +++
 sparse-index.c                      | 38 +++++++++++++++++++++++++----
 6 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/Documentation/config/extensions.txt b/Documentation/config/extensions.txt
index 4e23d73cdcad..c02e09af0046 100644
--- a/Documentation/config/extensions.txt
+++ b/Documentation/config/extensions.txt
@@ -6,3 +6,11 @@ extensions.objectFormat::
 Note that this setting should only be set by linkgit:git-init[1] or
 linkgit:git-clone[1].  Trying to change it after initialization will not
 work and will produce hard-to-diagnose issues.
+
+extensions.sparseIndex::
+	When combined with `core.sparseCheckout=true` and
+	`core.sparseCheckoutCone=true`, the index may contain entries
+	corresponding to directories outside of the sparse-checkout
+	definition in lieu of containing each path under such directories.
+	Versions of Git that do not understand this extension do not
+	expect directory entries in the index.
diff --git a/cache.h b/cache.h
index 69a32146cd77..4ca6cd7f782c 100644
--- a/cache.h
+++ b/cache.h
@@ -1059,6 +1059,7 @@ struct repository_format {
 	int worktree_config;
 	int is_bare;
 	int hash_algo;
+	int sparse_index;
 	char *work_tree;
 	struct string_list unknown_extensions;
 	struct string_list v1_only_extensions;
diff --git a/repo-settings.c b/repo-settings.c
index d63569e4041e..9677d50f9238 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -85,4 +85,11 @@ void prepare_repo_settings(struct repository *r)
 	 * removed.
 	 */
 	r->settings.command_requires_full_index = 1;
+
+	/*
+	 * Initialize this as off.
+	 */
+	r->settings.sparse_index = 0;
+	if (!repo_config_get_bool(r, "extensions.sparseindex", &value) && value)
+		r->settings.sparse_index = 1;
 }
diff --git a/repository.h b/repository.h
index e06a23015697..a45f7520fd9e 100644
--- a/repository.h
+++ b/repository.h
@@ -42,7 +42,8 @@ struct repo_settings {
 
 	int core_multi_pack_index;
 
-	unsigned command_requires_full_index:1;
+	unsigned command_requires_full_index:1,
+		 sparse_index:1;
 };
 
 struct repository {
diff --git a/setup.c b/setup.c
index c04cd25a30df..cd8394564613 100644
--- a/setup.c
+++ b/setup.c
@@ -500,6 +500,9 @@ static enum extension_result handle_extension(const char *var,
 			return error("invalid value for 'extensions.objectformat'");
 		data->hash_algo = format;
 		return EXTENSION_OK;
+	} else if (!strcmp(ext, "sparseindex")) {
+		data->sparse_index = 1;
+		return EXTENSION_OK;
 	}
 	return EXTENSION_UNKNOWN;
 }
diff --git a/sparse-index.c b/sparse-index.c
index 7631f7bd00b7..3a6df66faeab 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -102,19 +102,47 @@ static int convert_to_sparse_rec(struct index_state *istate,
 	return num_converted - start_converted;
 }
 
+static int enable_sparse_index(struct repository *repo)
+{
+	const char *config_path = repo_git_path(repo, "config.worktree");
+
+	if (upgrade_repository_format(1) < 0) {
+		warning(_("unable to upgrade repository format to enable sparse-index"));
+		return -1;
+	}
+	git_config_set_in_file_gently(config_path,
+				      "extensions.sparseIndex",
+				      "true");
+
+	prepare_repo_settings(repo);
+	repo->settings.sparse_index = 1;
+	return 0;
+}
+
 int convert_to_sparse(struct index_state *istate)
 {
 	if (istate->split_index || istate->sparse_index ||
 	    !core_apply_sparse_checkout || !core_sparse_checkout_cone)
 		return 0;
 
+	if (!istate->repo)
+		istate->repo = the_repository;
+
+	/*
+	 * The GIT_TEST_SPARSE_INDEX environment variable triggers the
+	 * extensions.sparseIndex config variable to be on.
+	 */
+	if (git_env_bool("GIT_TEST_SPARSE_INDEX", 0)) {
+		int err = enable_sparse_index(istate->repo);
+		if (err < 0)
+			return err;
+	}
+
 	/*
-	 * For now, only create a sparse index with the
-	 * GIT_TEST_SPARSE_INDEX environment variable. We will relax
-	 * this once we have a proper way to opt-in (and later still,
-	 * opt-out).
+	 * Only convert to sparse if extensions.sparseIndex is set.
 	 */
-	if (!git_env_bool("GIT_TEST_SPARSE_INDEX", 0))
+	prepare_repo_settings(istate->repo);
+	if (!istate->repo->settings.sparse_index)
 		return 0;
 
 	if (!istate->sparse_checkout_patterns) {
-- 
gitgitgadget



* [PATCH v3 16/20] sparse-checkout: toggle sparse index from builtin
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (14 preceding siblings ...)
  2021-03-16 16:42     ` [PATCH v3 15/20] sparse-index: create extension for compatibility Derrick Stolee via GitGitGadget
@ 2021-03-16 16:42     ` Derrick Stolee via GitGitGadget
  2021-03-16 16:43     ` [PATCH v3 17/20] sparse-checkout: disable sparse-index Derrick Stolee via GitGitGadget
                       ` (7 subsequent siblings)
  23 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The sparse index extension is used to signal that index writes should be
in sparse mode. Until now, it could only be enabled using
GIT_TEST_SPARSE_INDEX=1.

Add a '--[no-]sparse-index' option to 'git sparse-checkout init' that
specifies if the sparse index should be used. It also updates the index
to use the correct format, either way. Add a warning in the
documentation that the use of a repository extension might reduce
compatibility with third-party tools. 'git sparse-checkout init' already
sets extensions.worktreeConfig, which places most sparse-checkout users
outside of the scope of most third-party tools.
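Both the new option and the test variable are treated as tri-state: -1
means "not specified, leave the setting alone", while 0 and 1 request an
explicit change. A minimal sketch of that pattern follows. The helpers
are hypothetical and far simpler than Git's real boolean parsing; in
particular, treating unrecognized text as "unset" is an assumption made
only for this example.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical parser: returns 1 or 0 for a recognized boolean, and -1
 * when the value is absent (NULL) or not recognized.
 */
int toy_env_tristate(const char *value)
{
	if (!value)
		return -1;
	if (!strcmp(value, "1") || !strcmp(value, "true"))
		return 1;
	if (!strcmp(value, "0") || !strcmp(value, "false"))
		return 0;
	return -1;
}

/* Apply a tri-state request to the current setting; -1 changes nothing. */
int toy_apply_request(int current, int requested)
{
	assert(current == 0 || current == 1);
	return requested < 0 ? current : requested;
}
```

This mirrors how the option starts at -1 and the config is only touched
when the parsed value is nonnegative, so running 'init' without either
flag leaves an existing sparse-index choice in place.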

Update t1092-sparse-checkout-compatibility.sh to use this CLI instead of
GIT_TEST_SPARSE_INDEX=1.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-sparse-checkout.txt    | 14 +++++++
 builtin/sparse-checkout.c                | 17 ++++++++-
 sparse-index.c                           | 37 +++++++++++++------
 sparse-index.h                           |  3 ++
 t/t1092-sparse-checkout-compatibility.sh | 47 +++++++++++++-----------
 5 files changed, 84 insertions(+), 34 deletions(-)

diff --git a/Documentation/git-sparse-checkout.txt b/Documentation/git-sparse-checkout.txt
index a0eeaeb02ee3..2ff66c5a4e41 100644
--- a/Documentation/git-sparse-checkout.txt
+++ b/Documentation/git-sparse-checkout.txt
@@ -45,6 +45,20 @@ To avoid interfering with other worktrees, it first enables the
 When `--cone` is provided, the `core.sparseCheckoutCone` setting is
 also set, allowing for better performance with a limited set of
 patterns (see 'CONE PATTERN SET' below).
++
+Use the `--[no-]sparse-index` option to toggle the use of the sparse
+index format. This reduces the size of the index to be more closely
+aligned with your sparse-checkout definition. This can have significant
+performance advantages for commands such as `git status` or `git add`.
+This feature is still experimental. Some commands might be slower with
+a sparse index until they are properly integrated with the feature.
++
+**WARNING:** Using a sparse index requires modifying the index in a way
+that is not completely understood by external tools. If you have trouble
+with this compatibility, then run `git sparse-checkout init --no-sparse-index`
+to rewrite your index to not be sparse. Older versions of Git will not
+understand the `sparseIndex` repository extension and may fail to interact
+with your repository until it is disabled.
 
 'set'::
 	Write a set of patterns to the sparse-checkout file, as given as
diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
index e00b82af727b..ca63e2c64e95 100644
--- a/builtin/sparse-checkout.c
+++ b/builtin/sparse-checkout.c
@@ -14,6 +14,7 @@
 #include "unpack-trees.h"
 #include "wt-status.h"
 #include "quote.h"
+#include "sparse-index.h"
 
 static const char *empty_base = "";
 
@@ -283,12 +284,13 @@ static int set_config(enum sparse_checkout_mode mode)
 }
 
 static char const * const builtin_sparse_checkout_init_usage[] = {
-	N_("git sparse-checkout init [--cone]"),
+	N_("git sparse-checkout init [--cone] [--[no-]sparse-index]"),
 	NULL
 };
 
 static struct sparse_checkout_init_opts {
 	int cone_mode;
+	int sparse_index;
 } init_opts;
 
 static int sparse_checkout_init(int argc, const char **argv)
@@ -303,11 +305,15 @@ static int sparse_checkout_init(int argc, const char **argv)
 	static struct option builtin_sparse_checkout_init_options[] = {
 		OPT_BOOL(0, "cone", &init_opts.cone_mode,
 			 N_("initialize the sparse-checkout in cone mode")),
+		OPT_BOOL(0, "sparse-index", &init_opts.sparse_index,
+			 N_("toggle the use of a sparse index")),
 		OPT_END(),
 	};
 
 	repo_read_index(the_repository);
 
+	init_opts.sparse_index = -1;
+
 	argc = parse_options(argc, argv, NULL,
 			     builtin_sparse_checkout_init_options,
 			     builtin_sparse_checkout_init_usage, 0);
@@ -326,6 +332,15 @@ static int sparse_checkout_init(int argc, const char **argv)
 	sparse_filename = get_sparse_checkout_filename();
 	res = add_patterns_from_file_to_list(sparse_filename, "", 0, &pl, NULL);
 
+	if (init_opts.sparse_index >= 0) {
+		if (set_sparse_index_config(the_repository, init_opts.sparse_index) < 0)
+			die(_("failed to modify sparse-index config"));
+
+		/* force an index rewrite */
+		repo_read_index(the_repository);
+		the_repository->index->updated_workdir = 1;
+	}
+
 	/* If we already have a sparse-checkout file, use it. */
 	if (res >= 0) {
 		free(sparse_filename);
diff --git a/sparse-index.c b/sparse-index.c
index 3a6df66faeab..30c1a11fd62d 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -104,23 +104,37 @@ static int convert_to_sparse_rec(struct index_state *istate,
 
 static int enable_sparse_index(struct repository *repo)
 {
-	const char *config_path = repo_git_path(repo, "config.worktree");
+	int res;
 
 	if (upgrade_repository_format(1) < 0) {
 		warning(_("unable to upgrade repository format to enable sparse-index"));
 		return -1;
 	}
-	git_config_set_in_file_gently(config_path,
-				      "extensions.sparseIndex",
-				      "true");
+	res = git_config_set_gently("extensions.sparseindex", "true");
 
 	prepare_repo_settings(repo);
 	repo->settings.sparse_index = 1;
-	return 0;
+	return res;
+}
+
+int set_sparse_index_config(struct repository *repo, int enable)
+{
+	int res;
+
+	if (enable)
+		return enable_sparse_index(repo);
+
+	/* Don't downgrade repository format, just remove the extension. */
+	res = git_config_set_gently("extensions.sparseindex", NULL);
+
+	prepare_repo_settings(repo);
+	repo->settings.sparse_index = 0;
+	return res;
 }
 
 int convert_to_sparse(struct index_state *istate)
 {
+	int test_env;
 	if (istate->split_index || istate->sparse_index ||
 	    !core_apply_sparse_checkout || !core_sparse_checkout_cone)
 		return 0;
@@ -129,14 +143,13 @@ int convert_to_sparse(struct index_state *istate)
 		istate->repo = the_repository;
 
 	/*
-	 * The GIT_TEST_SPARSE_INDEX environment variable triggers the
-	 * extensions.sparseIndex config variable to be on.
+	 * If GIT_TEST_SPARSE_INDEX=1, then trigger extensions.sparseIndex
+	 * to be fully enabled. If GIT_TEST_SPARSE_INDEX=0 (set explicitly),
+	 * then purposefully disable the setting.
 	 */
-	if (git_env_bool("GIT_TEST_SPARSE_INDEX", 0)) {
-		int err = enable_sparse_index(istate->repo);
-		if (err < 0)
-			return err;
-	}
+	test_env = git_env_bool("GIT_TEST_SPARSE_INDEX", -1);
+	if (test_env >= 0)
+		set_sparse_index_config(istate->repo, test_env);
 
 	/*
 	 * Only convert to sparse if extensions.sparseIndex is set.
diff --git a/sparse-index.h b/sparse-index.h
index 64380e121d80..39dcc859735e 100644
--- a/sparse-index.h
+++ b/sparse-index.h
@@ -5,4 +5,7 @@ struct index_state;
 void ensure_full_index(struct index_state *istate);
 int convert_to_sparse(struct index_state *istate);
 
+struct repository;
+int set_sparse_index_config(struct repository *repo, int enable);
+
 #endif
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 47f983217852..f14dc48924d2 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -6,6 +6,7 @@ test_description='compare full workdir to sparse workdir'
 # So, disable the check until that integration is complete.
 GIT_TEST_CHECK_CACHE_TREE=0
 GIT_TEST_SPLIT_INDEX=0
+GIT_TEST_SPARSE_INDEX=
 
 . ./test-lib.sh
 
@@ -100,25 +101,26 @@ init_repos () {
 	# initialize sparse-checkout definitions
 	git -C sparse-checkout sparse-checkout init --cone &&
 	git -C sparse-checkout sparse-checkout set deep &&
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout init --cone &&
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep
+	git -C sparse-index sparse-checkout init --cone --sparse-index &&
+	test_cmp_config -C sparse-index true extensions.sparseindex &&
+	git -C sparse-index sparse-checkout set deep
 }
 
 run_on_sparse () {
 	(
 		cd sparse-checkout &&
-		GIT_TEST_SPARSE_INDEX=0 "$@" >../sparse-checkout-out 2>../sparse-checkout-err
+		"$@" >../sparse-checkout-out 2>../sparse-checkout-err
 	) &&
 	(
 		cd sparse-index &&
-		GIT_TEST_SPARSE_INDEX=1 "$@" >../sparse-index-out 2>../sparse-index-err
+		"$@" >../sparse-index-out 2>../sparse-index-err
 	)
 }
 
 run_on_all () {
 	(
 		cd full-checkout &&
-		GIT_TEST_SPARSE_INDEX=0 "$@" >../full-checkout-out 2>../full-checkout-err
+		"$@" >../full-checkout-out 2>../full-checkout-err
 	) &&
 	run_on_sparse "$@"
 }
@@ -148,7 +150,7 @@ test_expect_success 'sparse-index contents' '
 			|| return 1
 	done &&
 
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set folder1 &&
+	git -C sparse-index sparse-checkout set folder1 &&
 
 	test-tool -C sparse-index read-cache --table >cache &&
 	for dir in deep folder2 x
@@ -158,7 +160,7 @@ test_expect_success 'sparse-index contents' '
 			|| return 1
 	done &&
 
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep/deeper1 &&
+	git -C sparse-index sparse-checkout set deep/deeper1 &&
 
 	test-tool -C sparse-index read-cache --table >cache &&
 	for dir in deep/deeper2 folder1 folder2 x
@@ -166,7 +168,14 @@ test_expect_success 'sparse-index contents' '
 		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
 		grep "040000 tree $TREE	$dir/" cache \
 			|| return 1
-	done
+	done &&
+
+	# Disabling the sparse-index replaces tree entries with full ones
+	git -C sparse-index sparse-checkout init --no-sparse-index &&
+
+	test-tool -C sparse-index read-cache --table >cache &&
+	! grep "040000 tree" cache &&
+	test_sparse_match test-tool read-cache --table
 '
 
 test_expect_success 'expanded in-memory index matches full index' '
@@ -396,19 +405,15 @@ test_expect_success 'submodule handling' '
 test_expect_success 'sparse-index is expanded and converted back' '
 	init_repos &&
 
-	(
-		GIT_TEST_SPARSE_INDEX=1 &&
-		export GIT_TEST_SPARSE_INDEX &&
-		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
-			git -C sparse-index -c core.fsmonitor="" reset --hard &&
-		test_region index convert_to_sparse trace2.txt &&
-		test_region index ensure_full_index trace2.txt &&
-
-		rm trace2.txt &&
-		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
-			git -C sparse-index -c core.fsmonitor="" status -uno &&
-		test_region index ensure_full_index trace2.txt
-	)
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+		git -C sparse-index -c core.fsmonitor="" reset --hard &&
+	test_region index convert_to_sparse trace2.txt &&
+	test_region index ensure_full_index trace2.txt &&
+
+	rm trace2.txt &&
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+		git -C sparse-index -c core.fsmonitor="" status -uno &&
+	test_region index ensure_full_index trace2.txt
 '
 
 test_done
-- 
gitgitgadget



* [PATCH v3 17/20] sparse-checkout: disable sparse-index
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (15 preceding siblings ...)
  2021-03-16 16:42     ` [PATCH v3 16/20] sparse-checkout: toggle sparse index from builtin Derrick Stolee via GitGitGadget
@ 2021-03-16 16:43     ` Derrick Stolee via GitGitGadget
  2021-03-16 16:43     ` [PATCH v3 18/20] cache-tree: integrate with sparse directory entries Derrick Stolee via GitGitGadget
                       ` (6 subsequent siblings)
  23 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:43 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

We use 'git sparse-checkout init --cone --sparse-index' to toggle the
sparse-index feature. It makes sense to also disable it when running
'git sparse-checkout disable'. This is particularly important because it
removes the extensions.sparseIndex config option, allowing other tools
to use this Git repository again.

This does mean that 'git sparse-checkout init' will not re-enable the
sparse-index feature, even if it was previously enabled.

While testing this feature, I noticed that the sparse-index was not
written by the first run, only by a second one. This was caught by the
call to 'test-tool read-cache --table'. Fixing it requires adjusting some
assignments to core_apply_sparse_checkout and pl.use_cone_patterns in
the sparse_checkout_init() logic.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/sparse-checkout.c          | 10 +++++++++-
 t/t1091-sparse-checkout-builtin.sh | 13 +++++++++++++
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
index ca63e2c64e95..585343fa1972 100644
--- a/builtin/sparse-checkout.c
+++ b/builtin/sparse-checkout.c
@@ -280,6 +280,9 @@ static int set_config(enum sparse_checkout_mode mode)
 				      "core.sparseCheckoutCone",
 				      mode == MODE_CONE_PATTERNS ? "true" : NULL);
 
+	if (mode == MODE_NO_PATTERNS)
+		set_sparse_index_config(the_repository, 0);
+
 	return 0;
 }
 
@@ -341,10 +344,11 @@ static int sparse_checkout_init(int argc, const char **argv)
 		the_repository->index->updated_workdir = 1;
 	}
 
+	core_apply_sparse_checkout = 1;
+
 	/* If we already have a sparse-checkout file, use it. */
 	if (res >= 0) {
 		free(sparse_filename);
-		core_apply_sparse_checkout = 1;
 		return update_working_directory(NULL);
 	}
 
@@ -366,6 +370,7 @@ static int sparse_checkout_init(int argc, const char **argv)
 	add_pattern(strbuf_detach(&pattern, NULL), empty_base, 0, &pl, 0);
 	strbuf_addstr(&pattern, "!/*/");
 	add_pattern(strbuf_detach(&pattern, NULL), empty_base, 0, &pl, 0);
+	pl.use_cone_patterns = init_opts.cone_mode;
 
 	return write_patterns_and_update(&pl);
 }
@@ -632,6 +637,9 @@ static int sparse_checkout_disable(int argc, const char **argv)
 	strbuf_addstr(&match_all, "/*");
 	add_pattern(strbuf_detach(&match_all, NULL), empty_base, 0, &pl, 0);
 
+	prepare_repo_settings(the_repository);
+	the_repository->settings.sparse_index = 0;
+
 	if (update_working_directory(&pl))
 		die(_("error while refreshing working directory"));
 
diff --git a/t/t1091-sparse-checkout-builtin.sh b/t/t1091-sparse-checkout-builtin.sh
index fc64e9ed99f4..ff1ad570a255 100755
--- a/t/t1091-sparse-checkout-builtin.sh
+++ b/t/t1091-sparse-checkout-builtin.sh
@@ -205,6 +205,19 @@ test_expect_success 'sparse-checkout disable' '
 	check_files repo a deep folder1 folder2
 '
 
+test_expect_success 'sparse-index enabled and disabled' '
+	git -C repo sparse-checkout init --cone --sparse-index &&
+	test_cmp_config -C repo true extensions.sparseIndex &&
+	test-tool -C repo read-cache --table >cache &&
+	grep " tree " cache &&
+
+	git -C repo sparse-checkout disable &&
+	test-tool -C repo read-cache --table >cache &&
+	! grep " tree " cache &&
+	git -C repo config --list >config &&
+	! grep extensions.sparseindex config
+'
+
 test_expect_success 'cone mode: init and set' '
 	git -C repo sparse-checkout init --cone &&
 	git -C repo config --list >config &&
-- 
gitgitgadget



* [PATCH v3 18/20] cache-tree: integrate with sparse directory entries
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (16 preceding siblings ...)
  2021-03-16 16:43     ` [PATCH v3 17/20] sparse-checkout: disable sparse-index Derrick Stolee via GitGitGadget
@ 2021-03-16 16:43     ` Derrick Stolee via GitGitGadget
  2021-03-16 16:43     ` [PATCH v3 19/20] sparse-index: loose integration with cache_tree_verify() Derrick Stolee via GitGitGadget
                       ` (5 subsequent siblings)
  23 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:43 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The cache-tree extension was previously disabled with sparse indexes.
However, the cache-tree is an important performance feature for commands
like 'git status' and 'git add'. Integrate it with sparse directory
entries.

When writing a sparse index, completely clear and recalculate the cache
tree. By starting from scratch, the only integration necessary is to
check if we hit a sparse directory entry and create a leaf of the
cache-tree that has an entry_count of one and no subtrees.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache-tree.c   | 18 ++++++++++++++++++
 sparse-index.c | 10 +++++++++-
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/cache-tree.c b/cache-tree.c
index 5f07a39e501e..950a9615db8f 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -256,6 +256,24 @@ static int update_one(struct cache_tree *it,
 
 	*skip_count = 0;
 
+	/*
+	 * If the first entry of this region is a sparse directory
+	 * entry corresponding exactly to 'base', then this cache_tree
+	 * struct is a "leaf" in the data structure, pointing to the
+	 * tree OID specified in the entry.
+	 */
+	if (entries > 0) {
+		const struct cache_entry *ce = cache[0];
+
+		if (S_ISSPARSEDIR(ce->ce_mode) &&
+		    ce->ce_namelen == baselen &&
+		    !strncmp(ce->name, base, baselen)) {
+			it->entry_count = 1;
+			oidcpy(&it->oid, &ce->oid);
+			return 1;
+		}
+	}
+
 	if (0 <= it->entry_count && has_object_file(&it->oid))
 		return it->entry_count;
 
diff --git a/sparse-index.c b/sparse-index.c
index 30c1a11fd62d..56313e805d9d 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -180,7 +180,11 @@ int convert_to_sparse(struct index_state *istate)
 	istate->cache_nr = convert_to_sparse_rec(istate,
 						 0, 0, istate->cache_nr,
 						 "", 0, istate->cache_tree);
-	istate->drop_cache_tree = 1;
+
+	/* Clear and recompute the cache-tree */
+	cache_tree_free(&istate->cache_tree);
+	cache_tree_update(istate, 0);
+
 	istate->sparse_index = 1;
 	trace2_region_leave("index", "convert_to_sparse", istate->repo);
 	return 0;
@@ -281,5 +285,9 @@ void ensure_full_index(struct index_state *istate)
 	strbuf_release(&base);
 	free(full);
 
+	/* Clear and recompute the cache-tree */
+	cache_tree_free(&istate->cache_tree);
+	cache_tree_update(istate, 0);
+
 	trace2_region_leave("index", "ensure_full_index", istate->repo);
 }
-- 
gitgitgadget



* [PATCH v3 19/20] sparse-index: loose integration with cache_tree_verify()
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (17 preceding siblings ...)
  2021-03-16 16:43     ` [PATCH v3 18/20] cache-tree: integrate with sparse directory entries Derrick Stolee via GitGitGadget
@ 2021-03-16 16:43     ` Derrick Stolee via GitGitGadget
  2021-03-16 16:43     ` [PATCH v3 20/20] p2000: add sparse-index repos Derrick Stolee via GitGitGadget
                       ` (4 subsequent siblings)
  23 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:43 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The cache_tree_verify() method is run when GIT_TEST_CHECK_CACHE_TREE
is enabled, which it is by default in the test suite. The logic must
be adjusted for the presence of these directory entries.

For now, leave the test as a simple check for whether the directory
entry is sparse. Do not go any further until needed.

This allows us to re-enable GIT_TEST_CHECK_CACHE_TREE in
t1092-sparse-checkout-compatibility.sh. Further,
p2000-sparse-operations.sh uses the test suite and hence this is enabled
for all tests. We need to integrate with it before we run our
performance tests with a sparse-index.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache-tree.c                             | 19 +++++++++++++++++++
 t/t1092-sparse-checkout-compatibility.sh |  3 ---
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/cache-tree.c b/cache-tree.c
index 950a9615db8f..11bf1fcae6e1 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -808,6 +808,19 @@ int cache_tree_matches_traversal(struct cache_tree *root,
 	return 0;
 }
 
+static void verify_one_sparse(struct repository *r,
+			      struct index_state *istate,
+			      struct cache_tree *it,
+			      struct strbuf *path,
+			      int pos)
+{
+	struct cache_entry *ce = istate->cache[pos];
+
+	if (!S_ISSPARSEDIR(ce->ce_mode))
+		BUG("directory '%s' is present in index, but not sparse",
+		    path->buf);
+}
+
 static void verify_one(struct repository *r,
 		       struct index_state *istate,
 		       struct cache_tree *it,
@@ -830,6 +843,12 @@ static void verify_one(struct repository *r,
 
 	if (path->len) {
 		pos = index_name_pos(istate, path->buf, path->len);
+
+		if (pos >= 0) {
+			verify_one_sparse(r, istate, it, path, pos);
+			return;
+		}
+
 		pos = -pos - 1;
 	} else {
 		pos = 0;
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index f14dc48924d2..d97bf9b64527 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -2,9 +2,6 @@
 
 test_description='compare full workdir to sparse workdir'
 
-# The verify_cache_tree() check is not sparse-aware (yet).
-# So, disable the check until that integration is complete.
-GIT_TEST_CHECK_CACHE_TREE=0
 GIT_TEST_SPLIT_INDEX=0
 GIT_TEST_SPARSE_INDEX=
 
-- 
gitgitgadget



* [PATCH v3 20/20] p2000: add sparse-index repos
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (18 preceding siblings ...)
  2021-03-16 16:43     ` [PATCH v3 19/20] sparse-index: loose integration with cache_tree_verify() Derrick Stolee via GitGitGadget
@ 2021-03-16 16:43     ` Derrick Stolee via GitGitGadget
  2021-03-16 16:59     ` [PATCH v3 00/20] Sparse Index: Design, Format, Tests Derrick Stolee
                       ` (3 subsequent siblings)
  23 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:43 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

p2000-sparse-operations.sh compares different Git commands in
repositories with many files at HEAD but using sparse-checkout to focus
on a small portion of those files.

Add extra copies of the repository that use the sparse-index format so
we can track how that affects the performance of different commands.

At this point in time, the sparse-index is pure overhead in terms of CPU
time, and this is measurable in these tests:

Test
---------------------------------------------------------------
2000.2: git status (full-index-v3)              0.59(0.51+0.12)
2000.3: git status (full-index-v4)              0.59(0.52+0.11)
2000.4: git status (sparse-index-v3)            1.40(1.32+0.12)
2000.5: git status (sparse-index-v4)            1.41(1.36+0.08)
2000.6: git add -A (full-index-v3)              2.32(1.97+0.19)
2000.7: git add -A (full-index-v4)              2.17(1.92+0.14)
2000.8: git add -A (sparse-index-v3)            2.31(2.21+0.15)
2000.9: git add -A (sparse-index-v4)            2.30(2.20+0.13)
2000.10: git add . (full-index-v3)              2.39(2.02+0.20)
2000.11: git add . (full-index-v4)              2.20(1.94+0.16)
2000.12: git add . (sparse-index-v3)            2.36(2.27+0.12)
2000.13: git add . (sparse-index-v4)            2.33(2.21+0.16)
2000.14: git commit -a -m A (full-index-v3)     2.47(2.12+0.20)
2000.15: git commit -a -m A (full-index-v4)     2.26(2.00+0.17)
2000.16: git commit -a -m A (sparse-index-v3)   3.01(2.92+0.16)
2000.17: git commit -a -m A (sparse-index-v4)   3.01(2.94+0.15)

Note that there is very little difference between the v3 and v4 index
formats when the sparse-index is enabled. This is primarily due to the
fact that the relative file sizes are the same, and the command time is
mostly taken up by parsing tree objects to expand the sparse index into
a full one.

With the current file layout, the index file sizes are given by this
table:

       |  full index | sparse index |
       +-------------+--------------+
    v3 |     108 MiB |      1.6 MiB |
    v4 |      80 MiB |      1.2 MiB |
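
As a rough check of the "relative file sizes" remark, the ratios can be
recomputed from the table above. This is only an illustrative sketch;
`ratio` is a hypothetical helper and not part of the perf suite. The v4/v3
ratio comes out at about 0.74 for the full index and 0.75 for the sparse
index, which is why the format version matters little once the index is
sparse.

```shell
#!/bin/sh
# Index sizes in MiB are copied from the table above.
# ratio is a hypothetical helper: it prints $1 / $2 to two decimals.
ratio () {
	awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

ratio 80 108    # v4 / v3 for the full index
ratio 1.2 1.6   # v4 / v3 for the sparse index
ratio 108 1.6   # full / sparse for v3
ratio 80 1.2    # full / sparse for v4
```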

Future updates will improve the performance of Git commands when the
index is sparse.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/perf/p2000-sparse-operations.sh | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
index 2fbc81b22119..e527316e66d6 100755
--- a/t/perf/p2000-sparse-operations.sh
+++ b/t/perf/p2000-sparse-operations.sh
@@ -60,12 +60,29 @@ test_expect_success 'setup repo and indexes' '
 		git sparse-checkout set $SPARSE_CONE &&
 		git config index.version 4 &&
 		git update-index --index-version=4
+	) &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . sparse-index-v3 &&
+	(
+		cd sparse-index-v3 &&
+		git sparse-checkout init --cone --sparse-index &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 3 &&
+		git update-index --index-version=3
+	) &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . sparse-index-v4 &&
+	(
+		cd sparse-index-v4 &&
+		git sparse-checkout init --cone --sparse-index &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 4 &&
+		git update-index --index-version=4
 	)
 '
 
 test_perf_on_all () {
 	command="$@"
-	for repo in full-index-v3 full-index-v4
+	for repo in full-index-v3 full-index-v4 \
+		    sparse-index-v3 sparse-index-v4
 	do
 		test_perf "$command ($repo)" "
 			(
-- 
gitgitgadget


* Re: [PATCH v3 00/20] Sparse Index: Design, Format, Tests
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (19 preceding siblings ...)
  2021-03-16 16:43     ` [PATCH v3 20/20] p2000: add sparse-index repos Derrick Stolee via GitGitGadget
@ 2021-03-16 16:59     ` Derrick Stolee
  2021-03-16 21:18     ` Elijah Newren
                       ` (2 subsequent siblings)
  23 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee @ 2021-03-16 16:59 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	SZEDER Gábor, Ævar Arnfjörð Bjarmason,
	Derrick Stolee

On 3/16/2021 12:42 PM, Derrick Stolee via GitGitGadget wrote:
> Updates in V3
> =============
> 
> For this version, I took Ævar's latest patches and applied them to v2.31.0
> and rebased this series on top. It uses his new "read_tree_at()" helper and
> the associated changes to the function pointer type.

Junio, I wanted to call your attention to this change in base.

Here is the relevant part of the range-diff:

>   5:  399ddb0bad56 !  5:  99292cdbaae4 sparse-index: implement ensure_full_index()
>      @@ sparse-index.c
>       +}
>       +
>       +static int add_path_to_index(const struct object_id *oid,
>      -+				struct strbuf *base, const char *path,
>      -+				unsigned int mode, int stage, void *context)
>      ++			     struct strbuf *base, const char *path,
>      ++			     unsigned int mode, void *context)
>       +{
>       +	struct index_state *istate = (struct index_state *)context;
>       +	struct cache_entry *ce;
>      @@ sparse-index.c
>       -	/* intentionally left blank */
>       +	int i;
>       +	struct index_state *full;
>      ++	struct strbuf base = STRBUF_INIT;
>       +
>       +	if (!istate || !istate->sparse_index)
>       +		return;
>      @@ sparse-index.c
>       +		ps.has_wildcard = 1;
>       +		ps.max_depth = -1;
>       +
>      -+		read_tree_recursive(istate->repo, tree,
>      -+				    ce->name, strlen(ce->name),
>      -+				    0, &ps,
>      -+				    add_path_to_index, full);
>      ++		strbuf_setlen(&base, 0);
>      ++		strbuf_add(&base, ce->name, strlen(ce->name));
>      ++
>      ++		read_tree_at(istate->repo, tree, &base, &ps,
>      ++			     add_path_to_index, full);
>       +
>       +		/* free directory entries. full entries are re-used */
>       +		discard_cache_entry(ce);
>      @@ sparse-index.c
>       +	istate->cache_nr = full->cache_nr;
>       +	istate->cache_alloc = full->cache_alloc;
>       +
>      ++	strbuf_release(&base);
>       +	free(full);
>       +
>       +	trace2_region_leave("index", "ensure_full_index", istate->repo);

Thanks,
-Stolee


* Re: [PATCH v3 00/20] Sparse Index: Design, Format, Tests
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (20 preceding siblings ...)
  2021-03-16 16:59     ` [PATCH v3 00/20] Sparse Index: Design, Format, Tests Derrick Stolee
@ 2021-03-16 21:18     ` Elijah Newren
  2021-03-18 21:50     ` Junio C Hamano
  2021-03-23 13:44     ` [PATCH v4 " Derrick Stolee via GitGitGadget
  23 siblings, 0 replies; 203+ messages in thread
From: Elijah Newren @ 2021-03-16 21:18 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

On Tue, Mar 16, 2021 at 9:43 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> Here is the first full patch series submission coming out of the
> sparse-index RFC [1].
>
> [1]
> https://lore.kernel.org/git/pull.847.git.1611596533.gitgitgadget@gmail.com/
>
> I won't waste too much space here, because PATCH 1 includes a sizeable
> design document that describes the feature, the reasoning behind it, and my
> plan for getting this implemented widely throughout the codebase.
>
> There are some new things here that were not in the RFC:
>
>  * Design doc and format updates. (Patch 1)
>  * Performance test script. (Patches 2 and 20)
>
> Notably missing in this series from the RFC:
>
>  * The mega-patch inserting ensure_full_index() throughout the codebase.
>    That will be a follow-up series to this one.
>  * The integrations with git status and git add to demonstrate the improved
>    performance. Those will also appear in their own series later.
>
> I plan to keep my latest work in this area in my 'sparse-index/wip' branch
> [2]. It includes all of the work from the RFC right now, updated with the
> work from this series.
>
> [2] https://github.com/derrickstolee/git/tree/sparse-index/wip
>
>
> Updates in V3
> =============
>
> For this version, I took Ævar's latest patches and applied them to v2.31.0
> and rebased this series on top. It uses his new "read_tree_at()" helper and
> the associated changes to the function pointer type.
>
>  * Fixed more typos. Thanks Martin and Elijah!
>  * Updated the test_sparse_match() macro to use "$@" instead of $*
>  * Added a test that git sparse-checkout init --no-sparse-index rewrites the
>    index to be full.

I've read through the range-diff.  Sorry for not spotting the conflict
with Ævar's series (that I also reviewed).  Anyway, my Reviewed-by
from the last series still holds.


* Re: [PATCH v3 02/20] t/perf: add performance test for sparse operations
  2021-03-16 16:42     ` [PATCH v3 02/20] t/perf: add performance test for sparse operations Derrick Stolee via GitGitGadget
@ 2021-03-17  8:41       ` Ævar Arnfjörð Bjarmason
  2021-03-17 13:05         ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-17  8:41 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee


On Tue, Mar 16 2021, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> Create a test script that takes the default performance test (the Git
> codebase) and multiplies it by 256 using four layers of duplicated
> trees of width four. This results in nearly one million blob entries in
> the index. Then, we can clone this repository with sparse-checkout
> patterns that demonstrate four copies of the initial repository. Each
> clone will use a different index format or mode so performance can be
> tested across the different options.
>
> Note that the initial repo is stripped of submodules before doing the
> copies. This preserves the expected data shape of the sparse index,
> because directories containing submodules are not collapsed to a sparse
> directory entry.
>
> Run a few Git commands on these clones, especially those that use the
> index (status, add, commit).
>
> Here are the results on my Linux machine:
>
> Test
> --------------------------------------------------------------
> 2000.2: git status (full-index-v3)             0.37(0.30+0.09)
> 2000.3: git status (full-index-v4)             0.39(0.32+0.10)
> 2000.4: git add -A (full-index-v3)             1.42(1.06+0.20)
> 2000.5: git add -A (full-index-v4)             1.26(0.98+0.16)
> 2000.6: git add . (full-index-v3)              1.40(1.04+0.18)
> 2000.7: git add . (full-index-v4)              1.26(0.98+0.17)
> 2000.8: git commit -a -m A (full-index-v3)     1.42(1.11+0.16)
> 2000.9: git commit -a -m A (full-index-v4)     1.33(1.08+0.16)
>
> It is perhaps noteworthy that there is an improvement when using index
> version 4. This is because the v3 index uses 108 MiB while the v4
> index uses 80 MiB. Since the repeated portions of the directories are
> very short (f3/f1/f2, for example) this ratio is less pronounced than in
> similarly-sized real repositories.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  t/perf/p2000-sparse-operations.sh | 85 +++++++++++++++++++++++++++++++
>  1 file changed, 85 insertions(+)
>  create mode 100755 t/perf/p2000-sparse-operations.sh
>
> diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
> new file mode 100755
> index 000000000000..2fbc81b22119
> --- /dev/null
> +++ b/t/perf/p2000-sparse-operations.sh
> @@ -0,0 +1,85 @@
> +#!/bin/sh
> +
> +test_description="test performance of Git operations using the index"
> +
> +. ./perf-lib.sh
> +
> +test_perf_default_repo
> +
> +SPARSE_CONE=f2/f4/f1
> +
> +test_expect_success 'setup repo and indexes' '
> +	git reset --hard HEAD &&
> +	# Remove submodules from the example repo, because our
> +	# duplication of the entire repo creates an unlikly data shape.
> +	git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
> +	git rm -f .gitmodules &&
> +	for module in $(awk "{print \$2}" modules)
> +	do
> +		git rm $module || return 1
> +	done &&
> +	git commit -m "remove submodules" &&

Paradoxically, with this you can only use git.git or another repo that
has submodules; on a repo without them we'll die trying to remove
them.

Also you don't have to "git rm .gitmodules", the "git rm" command
removes submodule entries.

Perhaps just:

    for module in $(git ls-files --stage | grep ^160000 | awk -F '\t' '{ print $2 }')
    do
        git rm "$module"
    done

Or another way of guarding against rm getting the empty list && commit?
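
One possible shape for such a guard, as a sketch only: the helper names
gitlink_paths and remove_gitlinks are made up for illustration and are not
in perf-lib.sh. The filter is split out so that it works on any
"git ls-files --stage" output, and the removal loop is skipped entirely
when no gitlink (mode 160000) entries are found.

```shell
#!/bin/sh
# Sketch only: gitlink_paths and remove_gitlinks are hypothetical names.

# Read `git ls-files --stage` output on stdin and print the path of
# every gitlink (mode 160000) entry, one per line.
gitlink_paths () {
	awk -F '\t' '/^160000 / { print $2 }'
}

# Remove every submodule entry from the index and working tree.
# Succeeds without doing anything when the repo has no submodules.
remove_gitlinks () {
	modules=$(git ls-files --stage | gitlink_paths)
	if test -z "$modules"
	then
		return 0
	fi
	while IFS= read -r module
	do
		git rm -q "$module" || return 1
	done <<EOF
$modules
EOF
}
```

The follow-up "git commit" would still need its own guard, since there is
nothing to commit when no submodule was removed.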

But it seems odd to be doing this at all, the point of the perf
framework is that you can point it at any repo, and some repos you want
to test will have submodules.

Seems like something like the WIP patch at the end on top would be
better.

> +	echo bogus >a &&
> +	cp a b &&
> +	git add a b &&
> +	git commit -m "level 0" &&
> +	BLOB=$(git rev-parse HEAD:a) &&

Isn't the way we're getting this $BLOB equivalent to just 'echo bogus |
git hash-object --stdin -w'? Why commit it?

> +	OLD_COMMIT=$(git rev-parse HEAD) &&
> +	OLD_TREE=$(git rev-parse HEAD^{tree}) &&
> +
> +	for i in $(test_seq 1 4)
> +	do
> +		cat >in <<-EOF &&
> +			100755 blob $BLOB	a
> +			040000 tree $OLD_TREE	f1
> +			040000 tree $OLD_TREE	f2
> +			040000 tree $OLD_TREE	f3
> +			040000 tree $OLD_TREE	f4
> +		EOF
> +		NEW_TREE=$(git mktree <in) &&
> +		NEW_COMMIT=$(git commit-tree $NEW_TREE -p $OLD_COMMIT -m "level $i") &&
> +		OLD_TREE=$NEW_TREE &&
> +		OLD_COMMIT=$NEW_COMMIT || return 1
> +	done &&
> +
> +	git sparse-checkout init --cone &&
> +	git branch -f wide $OLD_COMMIT &&
> +	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . full-index-v3 &&
> +	(
> +		cd full-index-v3 &&
> +		git sparse-checkout init --cone &&
> +		git sparse-checkout set $SPARSE_CONE &&
> +		git config index.version 3 &&
> +		git update-index --index-version=3
> +	) &&
> +	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . full-index-v4 &&
> +	(
> +		cd full-index-v4 &&
> +		git sparse-checkout init --cone &&
> +		git sparse-checkout set $SPARSE_CONE &&
> +		git config index.version 4 &&
> +		git update-index --index-version=4
> +	)
> +'

This whole thing makes me think you just wanted a test_perf_fresh_repo
all along, but I think this would be much more useful if you took the
default repo and multiplied the size in its tree by some multiple.

E.g. take the files we have in git.git, write a copy at prefix-1/,
prefix-2/ etc.

The whole point of test_perf_{default,large}_repo is being able to point
them at a local repo you're testing for performance and get numbers
representative of that repo.

So maybe that's not what's wanted here at all, but that brings us back
to test_perf_fresh_repo...

> +test_perf_on_all () {
> +	command="$@"
> +	for repo in full-index-v3 full-index-v4
> +	do
> +		test_perf "$command ($repo)" "
> +			(
> +				cd $repo &&
> +				echo >>$SPARSE_CONE/a &&
> +				$command
> +			)
> +		"
> +	done
> +}
> +
> +test_perf_on_all git status
> +test_perf_on_all git add -A
> +test_perf_on_all git add .
> +test_perf_on_all git commit -a -m A
> +
> +test_done

diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
index e527316e66..2c07b04159 100755
--- a/t/perf/p2000-sparse-operations.sh
+++ b/t/perf/p2000-sparse-operations.sh
@@ -4,22 +4,11 @@ test_description="test performance of Git operations using the index"
 
 . ./perf-lib.sh
 
-test_perf_default_repo
+test_perf_nosubodules_repo
 
 SPARSE_CONE=f2/f4/f1
 
 test_expect_success 'setup repo and indexes' '
-	git reset --hard HEAD &&
-	# Remove submodules from the example repo, because our
-	# duplication of the entire repo creates an unlikly data shape.
-	git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
-	git rm -f .gitmodules &&
-	for module in $(awk "{print \$2}" modules)
-	do
-		git rm $module || return 1
-	done &&
-	git commit -m "remove submodules" &&
-
 	echo bogus >a &&
 	cp a b &&
 	git add a b &&
diff --git a/t/perf/perf-lib.sh b/t/perf/perf-lib.sh
index e385c6896f..86b716ce8f 100644
--- a/t/perf/perf-lib.sh
+++ b/t/perf/perf-lib.sh
@@ -128,6 +128,15 @@ test_perf_large_repo () {
 	fi
 	test_perf_create_repo_from "${1:-$TRASH_DIRECTORY}" "$GIT_PERF_LARGE_REPO"
 }
+test_perf_nosubodules_repo () {
+	if test "$GIT_PERF_NOSUBMODULES_REPO" = "$GIT_BUILD_DIR"; then
+		echo "warning: \$GIT_PERF_NOSUBMODULES_REPO is \$GIT_BUILD_DIR." >&2
+		echo "warning: This will probably work, but it has a submodule!" >&2
+		echo "warning: point to another repo for representative measurements." >&2
+		# git rm dance here? optionally?
+	fi
+	test_perf_create_repo_from "${1:-$TRASH_DIRECTORY}" "$GIT_PERF_NOSUBMODULES_REPO"
+}
 test_checkout_worktree () {
 	git checkout-index -u -a ||
 	error "git checkout-index failed"
@@ -196,7 +205,7 @@ test_perf_ () {
 	else
 		echo "perf $test_count - $1:"
 	fi
-	for i in $(test_seq 1 $GIT_PERF_REPEAT_COUNT); do
+	for i in $(test_seq 1 $GIT_PERF_REP
 		say >&3 "running: $2"
 		if test_run_perf_ "$2"
 		then


* Re: [PATCH v3 03/20] t1092: clean up script quoting
  2021-03-16 16:42     ` [PATCH v3 03/20] t1092: clean up script quoting Derrick Stolee via GitGitGadget
@ 2021-03-17  8:47       ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-17  8:47 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee


On Tue, Mar 16 2021, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> This test was introduced in 19a0acc83e4 (t1092: test interesting
> sparse-checkout scenarios, 2021-01-23), but these issues with quoting
> were not noticed until starting this follow-up series. The old mechanism
> would drop quoting such as in

The "but these issues" follows a partial sentence where we haven't
introduced "what issues?".

Perhaps leading with some summary about $@ vs. $*:

    Fix a bug in the sparse checkout tests of "$@" being conflated with
    "$*". The bug was introduced in 19a0acc83e4 ([...]), but had no
    effect until now because XYZ ...


>    test_all_match git commit -m "touch README.md"
>
> The above happened to work because README.md is a file in the
> repository, so 'git commit -m touch README.md' would succeed by
> accident.
>
> Other cases included quoting for no good reason, so clean that up now.

Maybe just my taste, per your comment on another series of mine we might
not have the same sense of splitting up commits, but...

I think in this case it's clearer to have these be two commits. We have
3 hunks fixing the bug, and 6 on an unrelated cleanup. It's a lot easier
for eyeballing a fix to be able to glance just at the 3, especially with
something like $@ vs. $*.
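
To make the $@ vs. $* difference concrete, here is a minimal sketch. The
helper names (count_args, run_unquoted, run_quoted) are made up for
illustration and are not part of t1092; they only mimic the shape of
run_on_all before and after the fix.

```shell
#!/bin/sh
# count_args prints how many arguments it received.
count_args () {
	echo $#
}

# Old helper style: unquoted $* re-splits every argument on whitespace,
# so quoting done by the caller is lost.
run_unquoted () {
	$*
}

# Fixed helper style: "$@" passes each argument through unchanged.
run_quoted () {
	"$@"
}

# -m, touch, README.md: the message is split into two words.
run_unquoted count_args -m "touch README.md"
# -m, "touch README.md": the message stays one argument.
run_quoted count_args -m "touch README.md"
```

With the unquoted form, 'git commit -m "touch README.md"' turns into a
commit with message "touch" and a pathspec of README.md, which is why the
original test only passed by accident.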

> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  t/t1092-sparse-checkout-compatibility.sh | 20 ++++++++++----------
>  1 file changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> index 8cd3e5a8d227..3725d3997e70 100755
> --- a/t/t1092-sparse-checkout-compatibility.sh
> +++ b/t/t1092-sparse-checkout-compatibility.sh
> @@ -96,20 +96,20 @@ init_repos () {
>  run_on_sparse () {
>  	(
>  		cd sparse-checkout &&
> -		$* >../sparse-checkout-out 2>../sparse-checkout-err
> +		"$@" >../sparse-checkout-out 2>../sparse-checkout-err
>  	)
>  }
>  
>  run_on_all () {
>  	(
>  		cd full-checkout &&
> -		$* >../full-checkout-out 2>../full-checkout-err
> +		"$@" >../full-checkout-out 2>../full-checkout-err
>  	) &&
> -	run_on_sparse $*
> +	run_on_sparse "$@"
>  }
>  
>  test_all_match () {
> -	run_on_all $* &&
> +	run_on_all "$@" &&
>  	test_cmp full-checkout-out sparse-checkout-out &&
>  	test_cmp full-checkout-err sparse-checkout-err
>  }
> @@ -119,7 +119,7 @@ test_expect_success 'status with options' '
>  	test_all_match git status --porcelain=v2 &&
>  	test_all_match git status --porcelain=v2 -z -u &&
>  	test_all_match git status --porcelain=v2 -uno &&
> -	run_on_all "touch README.md" &&
> +	run_on_all touch README.md &&
>  	test_all_match git status --porcelain=v2 &&
>  	test_all_match git status --porcelain=v2 -z -u &&
>  	test_all_match git status --porcelain=v2 -uno &&
> @@ -135,7 +135,7 @@ test_expect_success 'add, commit, checkout' '
>  	write_script edit-contents <<-\EOF &&
>  	echo text >>$1
>  	EOF
> -	run_on_all "../edit-contents README.md" &&
> +	run_on_all ../edit-contents README.md &&
>  
>  	test_all_match git add README.md &&
>  	test_all_match git status --porcelain=v2 &&
> @@ -144,7 +144,7 @@ test_expect_success 'add, commit, checkout' '
>  	test_all_match git checkout HEAD~1 &&
>  	test_all_match git checkout - &&
>  
> -	run_on_all "../edit-contents README.md" &&
> +	run_on_all ../edit-contents README.md &&
>  
>  	test_all_match git add -A &&
>  	test_all_match git status --porcelain=v2 &&
> @@ -153,7 +153,7 @@ test_expect_success 'add, commit, checkout' '
>  	test_all_match git checkout HEAD~1 &&
>  	test_all_match git checkout - &&
>  
> -	run_on_all "../edit-contents deep/newfile" &&
> +	run_on_all ../edit-contents deep/newfile &&
>  
>  	test_all_match git status --porcelain=v2 -uno &&
>  	test_all_match git status --porcelain=v2 &&
> @@ -186,7 +186,7 @@ test_expect_success 'diff --staged' '
>  	write_script edit-contents <<-\EOF &&
>  	echo text >>README.md
>  	EOF
> -	run_on_all "../edit-contents" &&
> +	run_on_all ../edit-contents &&
>  
>  	test_all_match git diff &&
>  	test_all_match git diff --staged &&
> @@ -280,7 +280,7 @@ test_expect_success 'clean' '
>  	echo bogus >>.gitignore &&
>  	run_on_all cp ../.gitignore . &&
>  	test_all_match git add .gitignore &&
> -	test_all_match git commit -m ignore-bogus-files &&
> +	test_all_match git commit -m "ignore bogus files" &&
>  
>  	run_on_sparse mkdir folder1 &&
>  	run_on_all touch folder1/bogus &&



* Re: [PATCH v3 05/20] sparse-index: implement ensure_full_index()
  2021-03-16 16:42     ` [PATCH v3 05/20] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
@ 2021-03-17 13:03       ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-17 13:03 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee


On Tue, Mar 16 2021, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <dstolee@microsoft.com>
> [...]
> +static int add_path_to_index(const struct object_id *oid,
> +			     struct strbuf *base, const char *path,
> +			     unsigned int mode, void *context)
> +{
> +	struct index_state *istate = (struct index_state *)context;
> +	struct cache_entry *ce;
> +	size_t len = base->len;
> +
> +	if (S_ISDIR(mode))
> +		return READ_TREE_RECURSIVE;
> +
> +	strbuf_addstr(base, path);
> +
> +	ce = make_cache_entry(istate, mode, oid, base->buf, 0, 0);
> +	ce->ce_flags |= CE_SKIP_WORKTREE;
> +	set_index_entry(istate, istate->cache_nr++, ce);
> +
> +	strbuf_setlen(base, len);
> +	return 0;
> +}
>  
>  void ensure_full_index(struct index_state *istate)
>  {
> -	/* intentionally left blank */
> +	int i;
> +	struct index_state *full;
> +	struct strbuf base = STRBUF_INIT;
> +
> +	if (!istate || !istate->sparse_index)
> +		return;
> +
> +	if (!istate->repo)
> +		istate->repo = the_repository;
> +
> +	trace2_region_enter("index", "ensure_full_index", istate->repo);
> +
> +	/* initialize basics of new index */
> +	full = xcalloc(1, sizeof(struct index_state));
> +	memcpy(full, istate, sizeof(struct index_state));
> +
> +	/* then change the necessary things */
> +	full->sparse_index = 0;
> +	full->cache_alloc = (3 * istate->cache_alloc) / 2;
> +	full->cache_nr = 0;
> +	ALLOC_ARRAY(full->cache, full->cache_alloc);
> +
> +	for (i = 0; i < istate->cache_nr; i++) {
> +		struct cache_entry *ce = istate->cache[i];
> +		struct tree *tree;
> +		struct pathspec ps;
> +
> +		if (!S_ISSPARSEDIR(ce->ce_mode)) {
> +			set_index_entry(full, full->cache_nr++, ce);
> +			continue;
> +		}
> +		if (!(ce->ce_flags & CE_SKIP_WORKTREE))
> +			warning(_("index entry is a directory, but not sparse (%08x)"),
> +				ce->ce_flags);
> +
> +		/* recursively walk into ce->name */
> +		tree = lookup_tree(istate->repo, &ce->oid);
> +
> +		memset(&ps, 0, sizeof(ps));
> +		ps.recursive = 1;
> +		ps.has_wildcard = 1;
> +		ps.max_depth = -1;
> +
> +		strbuf_setlen(&base, 0);
> +		strbuf_add(&base, ce->name, strlen(ce->name));
> +
> +		read_tree_at(istate->repo, tree, &base, &ps,
> +			     add_path_to_index, full);
> +
> +		/* free directory entries. full entries are re-used */
> +		discard_cache_entry(ce);
> +	}
> +
> +	/* Copy back into original index. */
> +	memcpy(&istate->name_hash, &full->name_hash, sizeof(full->name_hash));
> +	istate->sparse_index = 0;
> +	free(istate->cache);
> +	istate->cache = full->cache;
> +	istate->cache_nr = full->cache_nr;
> +	istate->cache_alloc = full->cache_alloc;
> +
> +	strbuf_release(&base);
> +	free(full);
> +
> +	trace2_region_leave("index", "ensure_full_index", istate->repo);
>  }

Not that I mind having added the read_tree_at() again, but just thinking
aloud here.

So we need this loop here because there's nothing like a read_tree_at()
that knows how to start at the non-tree root of the index, and then for
each directory there we're going to perform the equivalent of a
read_tree() there, but we need to set the base for add_path_to_index()
since we started at subdirs, not the root.

That's fine, but grepping around a bit I wonder if we shouldn't
eventually have some slightly fancier API that just works like
read_tree() but takes an optional "start at the index's root" instead.

Well, things that want that usually care about the index-specific bits,
whereas this "I just care about the tree for these" is more of a special
case I guess.
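
As a rough illustration of what the per-directory walk produces, here is
a shell sketch. It is not the C implementation: "git ls-tree -r" stands
in for the recursion that add_path_to_index() requests by returning
READ_TREE_RECURSIVE, and the helper prefixes every path with the sparse
directory's name, which is the job of "base" in the patch. The function
name expand_sparse_dir and the "mode oid stage<TAB>path" output shape are
assumptions made for this example.

```shell
#!/bin/sh
# expand_sparse_dir BASE
#
# Reads "git ls-tree -r <tree>" output on stdin, i.e. lines of the form
#   <mode> SP <type> SP <oid> TAB <path>
# and prints one index-style line per entry, with BASE prepended to the
# path and the stage fixed at 0.
expand_sparse_dir () {
	base=$1
	awk -F '\t' -v base="$base" '{
		# meta[1] = mode, meta[2] = type, meta[3] = oid
		split($1, meta, " ")
		printf "%s %s 0\t%s%s\n", meta[1], meta[3], base, $2
	}'
}
```

Something like "git ls-tree -r HEAD:folder1 | expand_sparse_dir folder1/"
would then list the entries that replace the single "folder1/" sparse
directory entry, assuming folder1 is a tree in HEAD.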


* Re: [PATCH v3 02/20] t/perf: add performance test for sparse operations
  2021-03-17  8:41       ` Ævar Arnfjörð Bjarmason
@ 2021-03-17 13:05         ` Derrick Stolee
  2021-03-17 13:21           ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee @ 2021-03-17 13:05 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Derrick Stolee via GitGitGadget
  Cc: git, newren, gitster, pclouds, jrnieder, Martin Ågren,
	SZEDER Gábor, Derrick Stolee, Derrick Stolee

On 3/17/2021 4:41 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Tue, Mar 16 2021, Derrick Stolee via GitGitGadget wrote:
>> +test_expect_success 'setup repo and indexes' '
>> +	git reset --hard HEAD &&
>> +	# Remove submodules from the example repo, because our
>> +	# duplication of the entire repo creates an unlikly data shape.
>> +	git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
>> +	git rm -f .gitmodules &&
>> +	for module in $(awk "{print \$2}" modules)
>> +	do
>> +		git rm $module || return 1
>> +	done &&
>> +	git commit -m "remove submodules" &&
> 
> Paradoxically with this you can no longer use a repo that's not git.git
> or another repo that has submodules, since we'll die in trying to remove
> them.

Good point.

> Also you don't have to "git rm .gitmodules", the "git rm" command
> removes submodule entries.

Sure.

> Perhaps just:
> 
>     for module in $(git ls-files --stage | grep ^160000 | awk -F '\t' '{ print $2 }')
>     do
>         git rm "$module"
>     done
> 
> Or another way of guarding against rm getting the empty list && commit?
> 
> But it seems odd to be doing this at all, the point of the perf
> framework is that you can point it at any repo, and some repos you want
> to test will have submodules.

You're right that it should handle all repos. However, the point of
the test is to have many copies of the repo, but most of them are
excluded by sparse-directory entries. We don't collapse sparse-directory
entries if there is a submodule inside, so the data shape is wrong after
making all the copies.

So, I disagree with your approach in your suggested diff, and instead
offer this one. I've tested this with git.git and another local repo
without submodules and checked that everything works as expected.

diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
index e527316e66d..5c0d78eeeea 100755
--- a/t/perf/p2000-sparse-operations.sh
+++ b/t/perf/p2000-sparse-operations.sh
@@ -10,15 +10,17 @@ SPARSE_CONE=f2/f4/f1
 
 test_expect_success 'setup repo and indexes' '
 	git reset --hard HEAD &&
+
 	# Remove submodules from the example repo, because our
-	# duplication of the entire repo creates an unlikly data shape.
-	git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
-	git rm -f .gitmodules &&
-	for module in $(awk "{print \$2}" modules)
-	do
-		git rm $module || return 1
-	done &&
-	git commit -m "remove submodules" &&
+	# duplication of the entire repo creates an unlikely data shape.
+	if (git config --file .gitmodules --get-regexp "submodule.*.path" >modules)
+	then
+		for module in $(awk "{print \$2}" modules)
+		do
+			git rm $module || return 1
+		done &&
+		git commit -m "remove submodules" || return 1
+	fi &&
 
 	echo bogus >a &&
 	cp a b &&

> Seems like something like the WIP patch at the end on top would be
> better.
> 
>> +	echo bogus >a &&
>> +	cp a b &&
>> +	git add a b &&
>> +	git commit -m "level 0" &&
>> +	BLOB=$(git rev-parse HEAD:a) &&
> 
> Isn't the way we're getting this $BLOB equivalent to just 'echo bogus |
> git hash-object --stdin -w' why commit it?

We are committing it so we can add commits that deepen the copies,
but within those copies we have these known file paths.

> This whole thing makes me think you just wanted a test_perf_fresh_repo
> all along, but I think this would be much more useful if you took the
> default repo and multiplied the size in its tree by some multiple.
> 
> E.g. take the files we have in git.git, write a copy at prefix-1/,
> prefix-2/ etc.

That is essentially what is happening here, but using multiple levels
of directories. Using these multiple levels presents extra tree
lookups and parsing in the event of expanding a sparse index to a
full one.

Thanks,
-Stolee


* Re: [PATCH v3 02/20] t/perf: add performance test for sparse operations
  2021-03-17 13:05         ` Derrick Stolee
@ 2021-03-17 13:21           ` Ævar Arnfjörð Bjarmason
  2021-03-17 18:02             ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-17 13:21 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, newren, gitster, pclouds,
	jrnieder, Martin Ågren, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee


On Wed, Mar 17 2021, Derrick Stolee wrote:

> On 3/17/2021 4:41 AM, Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Tue, Mar 16 2021, Derrick Stolee via GitGitGadget wrote:
>>> +test_expect_success 'setup repo and indexes' '
>>> +	git reset --hard HEAD &&
>>> +	# Remove submodules from the example repo, because our
>>> +	# duplication of the entire repo creates an unlikly data shape.
>>> +	git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
>>> +	git rm -f .gitmodules &&
>>> +	for module in $(awk "{print \$2}" modules)
>>> +	do
>>> +		git rm $module || return 1
>>> +	done &&
>>> +	git commit -m "remove submodules" &&
>> 
>> Paradoxically with this you can no longer use a repo that's not git.git
>> or another repo that has submodules, since we'll die in trying to remove
>> them.
>
> Good point.
>
>> Also you don't have to "git rm .gitmodules", the "git rm" command
>> removes submodule entries.
>
> Sure.
>
>> Perhaps just:
>> 
>>     for module in $(git ls-files --stage | grep ^160000 | awk -F '\t' '{ print $2 }')
>>     do
>>         git rm "$module"
>>     done
>> 
>> Or another way of guarding against rm getting the empty list && commit?
>> 
>> But it seems odd to be doing this at all, the point of the perf
>> framework is that you can point it at any repo, and some repos you want
>> to test will have submodules.
>
> You're right that it should handle all repos. However, the point of
> the test is to have many copies of the repo, but most of them are
> excluded by sparse-directory entries. We don't collapse sparse-directory
> entries if there is a submodule inside, so the data shape is wrong after
> making all the copies.
>
> So, I disagree with your approach in your suggested diff, and instead
> offer this one. I've tested this with git.git and another local repo
> without submodules and checked that everything works as expected.

What's got me confused here is that there's two uses for the perf
framework in this context.

It's to use an empty/git.git as a test repo to demonstrate something,
but then also that you can run it in your arbitrary repo, and e.g. see
how much a given feature might benefit you.

Hence suggesting that maybe test_perf_fresh_repo is better here, because
by using test_perf_default_repo you're creating the expectation that you
can run the perf test, observe an %X difference, and that'll be
give-or-take what you'll get for that use case if you enable the feature.

Except it won't because the repo has submodules, which we deleted for
the perf test...

> diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
> index e527316e66d..5c0d78eeeea 100755
> --- a/t/perf/p2000-sparse-operations.sh
> +++ b/t/perf/p2000-sparse-operations.sh
> @@ -10,15 +10,17 @@ SPARSE_CONE=f2/f4/f1
>  
>  test_expect_success 'setup repo and indexes' '
>  	git reset --hard HEAD &&
> +
>  	# Remove submodules from the example repo, because our
> -	# duplication of the entire repo creates an unlikly data shape.
> -	git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
> -	git rm -f .gitmodules &&
> -	for module in $(awk "{print \$2}" modules)
> -	do
> -		git rm $module || return 1
> -	done &&
> -	git commit -m "remove submodules" &&
> +	# duplication of the entire repo creates an unlikely data shape.
> +	if (git config --file .gitmodules --get-regexp "submodule.*.path" >modules)

A subshell isn't needed here.

FWIW the reason I got this out of ls-files is because you can have
submodules without .gitmodules entries, rare and broken, but seemed more
direct to grep the mode bits.
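
A sketch of that mode-bits approach, written so the filter can be tried
on canned input. The function name list_gitlinks is hypothetical; the
only thing it relies on is the "git ls-files --stage" line format of
"<mode> <oid> <stage><TAB><path>", with gitlinks carrying mode 160000.

```shell
#!/bin/sh
# list_gitlinks
#
# Reads "git ls-files --stage" output on stdin and prints the path of
# every gitlink (submodule) entry, whether or not .gitmodules knows
# about it. Prints nothing when there are no submodules.
list_gitlinks () {
	awk -F '\t' '$1 ~ /^160000 / { print $2 }'
}

# Possible use in the setup step (left as a comment on purpose, and
# assuming submodule paths without whitespace):
#
#	modules=$(git ls-files --stage | list_gitlinks) &&
#	if test -n "$modules"
#	then
#		git rm $modules &&
#		git commit -m "remove submodules"
#	fi
```

Checking for an empty list first also covers the case of "git rm" being
handed no paths at all.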

> +	then
> +		for module in $(awk "{print \$2}" modules)
> +		do
> +			git rm $module || return 1
> +		done &&

Once we know we have submodules we can just do this without the loop.

    git rm $(awk "{print \$2}" modules)



> +		git commit -m "remove submodules" || return 1
> +	fi &&
>  
>  	echo bogus >a &&
>  	cp a b &&
>
>> Seems like something like the WIP patch at the end on top would be
>> better.
>> 
>>> +	echo bogus >a &&
>>> +	cp a b &&
>>> +	git add a b &&
>>> +	git commit -m "level 0" &&
>>> +	BLOB=$(git rev-parse HEAD:a) &&
>> 
>> Isn't the way we're getting this $BLOB equivalent to just 'echo bogus |
>> git hash-object --stdin -w' why commit it?
>
> We are committing it so we can add commits that deepen the copies,
> but within those copies we have these known file paths.
>
>> This whole thing makes me think you just wanted a test_perf_fresh_repo
>> all along, but I think this would be much more useful if you took the
>> default repo and multiplied the size in its tree by some multiple.
>> 
>> E.g. take the files we have in git.git, write a copy at prefix-1/,
>> prefix-2/ etc.
>
> That is essentially what is happening here, but using multiple levels
> of directories. Using these multiple levels presents extra tree
> lookups and parsing in the event of expanding a sparse index to a
> full one.

*nod*

Anyway, this thread's a bit of a bikeshed on my part, I was just
wondering if & what part of the test relied on the existing repo if it
was mostly setting up its own test data.


* [RFC/PATCH 0/5] Re: [PATCH v3 07/20] test-read-cache: print cache entries with --table
  2021-03-16 16:42     ` [PATCH v3 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
@ 2021-03-17 13:28       ` Ævar Arnfjörð Bjarmason
  2021-03-17 18:28         ` Elijah Newren
  2021-03-17 13:28       ` [RFC/PATCH 1/5] ls-files: defer read_index() after parse_options() etc Ævar Arnfjörð Bjarmason
                         ` (4 subsequent siblings)
  5 siblings, 1 reply; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-17 13:28 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee, dstolee

> From: Derrick Stolee <dstolee@microsoft.com>
>
> This table is helpful for discovering data in the index to ensure it is
> being written correctly, especially as we build and test the
> sparse-index. This table includes an output format similar to 'git
> ls-tree', but should not be compared to that directly. The biggest
> reasons are that 'git ls-tree' includes a tree entry for every
> subdirectory, even those that would not appear as a sparse directory in
> a sparse-index. Further, 'git ls-tree' does not use a trailing directory
> separator for its tree rows.
>
> This does not print the stat() information for the blobs. That could be
> added in a future change with another option. The tests that are added
> in the next few changes care only about the object types and IDs.
>
> To make the option parsing slightly more robust, wrap the string
> comparisons in a loop adapted from test-dir-iterator.c.
>
> Care must be taken with the final check for the 'cnt' variable. We
> continue the expectation that the numerical value is the final argument.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  t/helper/test-read-cache.c | 55 +++++++++++++++++++++++++++++++-------
>  1 file changed, 45 insertions(+), 10 deletions(-)
>
> diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
> index 244977a29bdf..6cfd8f2de71c 100644
> --- a/t/helper/test-read-cache.c
> +++ b/t/helper/test-read-cache.c
> @@ -1,36 +1,71 @@
>  #include "test-tool.h"
>  #include "cache.h"
>  #include "config.h"
> +#include "blob.h"
> +#include "commit.h"
> +#include "tree.h"
> +
> +static void print_cache_entry(struct cache_entry *ce)
> +{
> +	const char *type;
> +	printf("%06o ", ce->ce_mode & 0177777);
> +
> +	if (S_ISSPARSEDIR(ce->ce_mode))
> +		type = tree_type;
> +	else if (S_ISGITLINK(ce->ce_mode))
> +		type = commit_type;
> +	else
> +		type = blob_type;
> +
> +	printf("%s %s\t%s\n",
> +	       type,
> +	       oid_to_hex(&ce->oid),
> +	       ce->name);
> +}
> +

So we have a test tool that's mostly ls-files but mocks the output
ls-tree would emit. Won't these tests eventually care about what stage
things are in?

What follows is an RFC series on top. It's the result of me wondering
why, if we're adding new index constructs, we aren't updating our
plumbing to emit that data. Can we just add this to ls-files and drop
this test helper?

Turns out: Yes we can.

Ævar Arnfjörð Bjarmason (5):
  ls-files: defer read_index() after parse_options() etc.
  ls-files: make "mode" in show_ce() loop a variable
  ls-files: add and use a new --sparse option
  test-tool read-cache: --table is redundant to ls-files
  test-tool: split up test-tool read-cache

 Documentation/git-ls-files.txt           |  4 ++
 Makefile                                 |  3 +-
 builtin/ls-files.c                       | 29 +++++++--
 t/helper/test-read-cache-again.c         | 31 +++++++++
 t/helper/test-read-cache-perf.c          | 21 ++++++
 t/helper/test-read-cache.c               | 82 ------------------------
 t/helper/test-tool.c                     |  3 +-
 t/helper/test-tool.h                     |  3 +-
 t/perf/p0002-read-cache.sh               |  2 +-
 t/t1091-sparse-checkout-builtin.sh       |  9 +--
 t/t1092-sparse-checkout-compatibility.sh | 57 ++++++++++------
 t/t7519-status-fsmonitor.sh              |  2 +-
 12 files changed, 131 insertions(+), 115 deletions(-)
 create mode 100644 t/helper/test-read-cache-again.c
 create mode 100644 t/helper/test-read-cache-perf.c
 delete mode 100644 t/helper/test-read-cache.c

-- 
2.31.0.260.g719c683c1d



* [RFC/PATCH 1/5] ls-files: defer read_index() after parse_options() etc.
  2021-03-16 16:42     ` [PATCH v3 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
  2021-03-17 13:28       ` [RFC/PATCH 0/5] " Ævar Arnfjörð Bjarmason
@ 2021-03-17 13:28       ` Ævar Arnfjörð Bjarmason
  2021-03-17 13:28       ` [RFC/PATCH 2/5] ls-files: make "mode" in show_ce() loop a variable Ævar Arnfjörð Bjarmason
                         ` (3 subsequent siblings)
  5 siblings, 0 replies; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-17 13:28 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee, dstolee

Move the reading of the index below the parsing of options. We'll need
to set up some index options in the next commit after option parsing,
but in any case it makes sense to give parse_options() handling a
chance to die early before we perform the more expensive operation of
reading the index.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 builtin/ls-files.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/builtin/ls-files.c b/builtin/ls-files.c
index 13bcc2d847..eb72d16493 100644
--- a/builtin/ls-files.c
+++ b/builtin/ls-files.c
@@ -681,9 +681,6 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
 		prefix_len = strlen(prefix);
 	git_config(git_default_config, NULL);
 
-	if (repo_read_index(the_repository) < 0)
-		die("index file corrupt");
-
 	argc = parse_options(argc, argv, prefix, builtin_ls_files_options,
 			ls_files_usage, 0);
 	pl = add_pattern_list(&dir, EXC_CMDL, "--exclude option");
@@ -743,6 +740,12 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
 		max_prefix = common_prefix(&pathspec);
 	max_prefix_len = get_common_prefix_len(max_prefix);
 
+	/*
+	 * Read the index after parse options etc. have had a chance
+	 * to die early.
+	 */
+	if (repo_read_index(the_repository) < 0)
+		die("index file corrupt");
 	prune_index(the_repository->index, max_prefix, max_prefix_len);
 
 	/* Treat unmatching pathspec elements as errors */
-- 
2.31.0.260.g719c683c1d



* [RFC/PATCH 2/5] ls-files: make "mode" in show_ce() loop a variable
  2021-03-16 16:42     ` [PATCH v3 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
  2021-03-17 13:28       ` [RFC/PATCH 0/5] " Ævar Arnfjörð Bjarmason
  2021-03-17 13:28       ` [RFC/PATCH 1/5] ls-files: defer read_index() after parse_options() etc Ævar Arnfjörð Bjarmason
@ 2021-03-17 13:28       ` Ævar Arnfjörð Bjarmason
  2021-03-17 18:11         ` Elijah Newren
  2021-03-17 13:28       ` [RFC/PATCH 3/5] ls-files: add and use a new --sparse option Ævar Arnfjörð Bjarmason
                         ` (2 subsequent siblings)
  5 siblings, 1 reply; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-17 13:28 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee, dstolee

In a subsequent commit I'll optionally change the mode in a new sparse
mode. Let's do this first to make that change smaller.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 builtin/ls-files.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/builtin/ls-files.c b/builtin/ls-files.c
index eb72d16493..4db75351f2 100644
--- a/builtin/ls-files.c
+++ b/builtin/ls-files.c
@@ -242,9 +242,17 @@ static void show_ce(struct repository *repo, struct dir_struct *dir,
 		if (!show_stage) {
 			fputs(tag, stdout);
 		} else {
+			unsigned int mode = ce->ce_mode;
+			if (show_sparse && S_ISSPARSEDIR(mode))
+				/*
+				 * We could just do & 0177777 all the
+				 * time, just make it clear this is
+				 * for --stage-sparse.
+				 */
+				mode &= 0177777;
 			printf("%s%06o %s %d\t",
 			       tag,
-			       ce->ce_mode,
+			       mode,
 			       find_unique_abbrev(&ce->oid, abbrev),
 			       ce_stage(ce));
 		}
-- 
2.31.0.260.g719c683c1d



* [RFC/PATCH 3/5] ls-files: add and use a new --sparse option
  2021-03-16 16:42     ` [PATCH v3 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
                         ` (2 preceding siblings ...)
  2021-03-17 13:28       ` [RFC/PATCH 2/5] ls-files: make "mode" in show_ce() loop a variable Ævar Arnfjörð Bjarmason
@ 2021-03-17 13:28       ` Ævar Arnfjörð Bjarmason
  2021-03-17 18:19         ` Elijah Newren
  2021-03-17 20:43         ` Derrick Stolee
  2021-03-17 13:28       ` [RFC/PATCH 4/5] test-tool read-cache: --table is redundant to ls-files Ævar Arnfjörð Bjarmason
  2021-03-17 13:28       ` [RFC/PATCH 5/5] test-tool: split up test-tool read-cache Ævar Arnfjörð Bjarmason
  5 siblings, 2 replies; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-17 13:28 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee, dstolee

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Documentation/git-ls-files.txt           |  4 ++
 builtin/ls-files.c                       | 10 ++++-
 t/t1091-sparse-checkout-builtin.sh       |  9 ++--
 t/t1092-sparse-checkout-compatibility.sh | 57 ++++++++++++++++--------
 4 files changed, 56 insertions(+), 24 deletions(-)

diff --git a/Documentation/git-ls-files.txt b/Documentation/git-ls-files.txt
index 6d11ab506b..1145e960a4 100644
--- a/Documentation/git-ls-files.txt
+++ b/Documentation/git-ls-files.txt
@@ -71,6 +71,10 @@ OPTIONS
 --unmerged::
 	Show unmerged files in the output (forces --stage)
 
+--sparse::
+	Show sparse directories in the output instead of expanding
+	them (forces --stage)
+
 -k::
 --killed::
 	Show files on the filesystem that need to be removed due
diff --git a/builtin/ls-files.c b/builtin/ls-files.c
index 4db75351f2..1ebbb63c10 100644
--- a/builtin/ls-files.c
+++ b/builtin/ls-files.c
@@ -26,6 +26,7 @@ static int show_deleted;
 static int show_cached;
 static int show_others;
 static int show_stage;
+static int show_sparse;
 static int show_unmerged;
 static int show_resolve_undo;
 static int show_modified;
@@ -639,6 +640,8 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
 			DIR_SHOW_IGNORED),
 		OPT_BOOL('s', "stage", &show_stage,
 			N_("show staged contents' object name in the output")),
+		OPT_BOOL(0, "sparse", &show_sparse,
+			N_("show unexpanded sparse directories in the output")),
 		OPT_BOOL('k', "killed", &show_killed,
 			N_("show files on the filesystem that need to be removed")),
 		OPT_BIT(0, "directory", &dir.flags,
@@ -705,12 +708,17 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
 		tag_skip_worktree = "S ";
 		tag_resolve_undo = "U ";
 	}
+	if (show_sparse) {
+		prepare_repo_settings(the_repository);
+		the_repository->settings.command_requires_full_index = 0;
+	}
 	if (show_modified || show_others || show_deleted || (dir.flags & DIR_SHOW_IGNORED) || show_killed)
 		require_work_tree = 1;
-	if (show_unmerged)
+	if (show_unmerged || show_sparse)
 		/*
 		 * There's no point in showing unmerged unless
 		 * you also show the stage information.
+		 * The same goes for the --sparse option.
 		 */
 		show_stage = 1;
 	if (show_tag || show_stage)
diff --git a/t/t1091-sparse-checkout-builtin.sh b/t/t1091-sparse-checkout-builtin.sh
index ff1ad570a2..c823df423c 100755
--- a/t/t1091-sparse-checkout-builtin.sh
+++ b/t/t1091-sparse-checkout-builtin.sh
@@ -208,12 +208,13 @@ test_expect_success 'sparse-checkout disable' '
 test_expect_success 'sparse-index enabled and disabled' '
 	git -C repo sparse-checkout init --cone --sparse-index &&
 	test_cmp_config -C repo true extensions.sparseIndex &&
-	test-tool -C repo read-cache --table >cache &&
-	grep " tree " cache &&
+	git -C repo ls-files --sparse >cache &&
+	grep "^040000 " cache >lines &&
+	test_line_count = 3 lines &&
 
 	git -C repo sparse-checkout disable &&
-	test-tool -C repo read-cache --table >cache &&
-	! grep " tree " cache &&
+	git -C repo ls-files --sparse >cache &&
+	! grep "^040000 " cache &&
 	git -C repo config --list >config &&
 	! grep extensions.sparseindex config
 '
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index d97bf9b645..48d3920490 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -136,48 +136,67 @@ test_sparse_match () {
 	test_cmp sparse-checkout-err sparse-index-err
 }
 
+test_index_entry_like () {
+	dir=$1
+	shift
+	fmt=$1
+	shift
+	rev=$1
+	shift
+	entry=$1
+	shift
+	file=$1
+	shift
+	hash=$(git -C "$dir" rev-parse "$rev") &&
+	printf "$fmt\n" "$hash" "$entry" >expected &&
+	if grep "$entry" "$file" >line
+	then
+		test_cmp expected line
+	else
+		cat "$file" &&
+		false
+	fi
+}
+
 test_expect_success 'sparse-index contents' '
 	init_repos &&
 
-	test-tool -C sparse-index read-cache --table >cache &&
+	git -C sparse-index ls-files --sparse >cache &&
 	for dir in folder1 folder2 x
 	do
-		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
-		grep "040000 tree $TREE	$dir/" cache \
-			|| return 1
+		test_index_entry_like sparse-index "040000 %s 0\t%s" "HEAD:$dir" "$dir/" cache || return 1
 	done &&
 
 	git -C sparse-index sparse-checkout set folder1 &&
 
-	test-tool -C sparse-index read-cache --table >cache &&
+	git -C sparse-index ls-files --sparse >cache &&
 	for dir in deep folder2 x
 	do
-		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
-		grep "040000 tree $TREE	$dir/" cache \
-			|| return 1
+		test_index_entry_like sparse-index "040000 %s 0\t%s" "HEAD:$dir" "$dir/" cache || return 1
 	done &&
 
 	git -C sparse-index sparse-checkout set deep/deeper1 &&
 
-	test-tool -C sparse-index read-cache --table >cache &&
+	git -C sparse-index ls-files --sparse >cache &&
 	for dir in deep/deeper2 folder1 folder2 x
 	do
-		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
-		grep "040000 tree $TREE	$dir/" cache \
-			|| return 1
+		test_index_entry_like sparse-index "040000 %s 0\t%s" "HEAD:$dir" "$dir/" cache || return 1
 	done &&
 
+	grep 040000 cache >lines &&
+	test_line_count = 4 lines &&
+
 	# Disabling the sparse-index removes tree entries with full ones
 	git -C sparse-index sparse-checkout init --no-sparse-index &&
 
-	test-tool -C sparse-index read-cache --table >cache &&
-	! grep "040000 tree" cache &&
-	test_sparse_match test-tool read-cache --table
+	git -C sparse-index ls-files --sparse >cache &&
+	! grep "^040000 " cache >lines &&
+	test_sparse_match git ls-tree -r HEAD
 '
 
 test_expect_success 'expanded in-memory index matches full index' '
 	init_repos &&
-	test_sparse_match test-tool read-cache --expand --table
+	test_sparse_match git ls-tree -r HEAD
 '
 
 test_expect_success 'status with options' '
@@ -394,9 +413,9 @@ test_expect_success 'submodule handling' '
 	test_all_match git commit -m "add submodule" &&
 
 	# having a submodule prevents "modules" from collapse
-	test-tool -C sparse-index read-cache --table >cache &&
-	grep "100644 blob .*	modules/a" cache &&
-	grep "160000 commit $(git -C initial-repo rev-parse HEAD)	modules/sub" cache
+	git -C sparse-index ls-files --sparse >cache &&
+	test_index_entry_like sparse-index "100644 %s 0\t%s" "HEAD:modules/a" "modules/a" cache &&
+	test_index_entry_like sparse-index "160000 %s 0\t%s" "HEAD:modules/sub" "modules/sub" cache
 '
 
 test_expect_success 'sparse-index is expanded and converted back' '
-- 
2.31.0.260.g719c683c1d


^ permalink raw reply related	[flat|nested] 203+ messages in thread

* [RFC/PATCH 4/5] test-tool read-cache: --table is redundant to ls-files
  2021-03-16 16:42     ` [PATCH v3 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
                         ` (3 preceding siblings ...)
  2021-03-17 13:28       ` [RFC/PATCH 3/5] ls-files: add and use a new --sparse option Ævar Arnfjörð Bjarmason
@ 2021-03-17 13:28       ` Ævar Arnfjörð Bjarmason
  2021-03-17 13:28       ` [RFC/PATCH 5/5] test-tool: split up test-tool read-cache Ævar Arnfjörð Bjarmason
  5 siblings, 0 replies; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-17 13:28 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee, dstolee

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 t/helper/test-read-cache.c | 43 --------------------------------------
 1 file changed, 43 deletions(-)

diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
index b52c174acc..2499999af3 100644
--- a/t/helper/test-read-cache.c
+++ b/t/helper/test-read-cache.c
@@ -1,54 +1,16 @@
 #include "test-tool.h"
 #include "cache.h"
 #include "config.h"
-#include "blob.h"
-#include "commit.h"
-#include "tree.h"
-#include "sparse-index.h"
-
-static void print_cache_entry(struct cache_entry *ce)
-{
-	const char *type;
-	printf("%06o ", ce->ce_mode & 0177777);
-
-	if (S_ISSPARSEDIR(ce->ce_mode))
-		type = tree_type;
-	else if (S_ISGITLINK(ce->ce_mode))
-		type = commit_type;
-	else
-		type = blob_type;
-
-	printf("%s %s\t%s\n",
-	       type,
-	       oid_to_hex(&ce->oid),
-	       ce->name);
-}
-
-static void print_cache(struct index_state *istate)
-{
-	int i;
-	for (i = 0; i < istate->cache_nr; i++)
-		print_cache_entry(istate->cache[i]);
-}
 
 int cmd__read_cache(int argc, const char **argv)
 {
 	struct repository *r = the_repository;
 	int i, cnt = 1;
 	const char *name = NULL;
-	int table = 0, expand = 0;
-
-	initialize_the_repository();
-	prepare_repo_settings(r);
-	r->settings.command_requires_full_index = 0;
 
 	for (++argv, --argc; *argv && starts_with(*argv, "--"); ++argv, --argc) {
 		if (skip_prefix(*argv, "--print-and-refresh=", &name))
 			continue;
-		if (!strcmp(*argv, "--table"))
-			table = 1;
-		else if (!strcmp(*argv, "--expand"))
-			expand = 1;
 	}
 
 	if (argc == 1)
@@ -59,9 +21,6 @@ int cmd__read_cache(int argc, const char **argv)
 	for (i = 0; i < cnt; i++) {
 		repo_read_index(r);
 
-		if (expand)
-			ensure_full_index(r->index);
-
 		if (name) {
 			int pos;
 
@@ -74,8 +33,6 @@ int cmd__read_cache(int argc, const char **argv)
 			       ce_uptodate(r->index->cache[pos]) ? "" : " not");
 			write_file(name, "%d\n", i);
 		}
-		if (table)
-			print_cache(r->index);
 		discard_index(r->index);
 	}
 	return 0;
-- 
2.31.0.260.g719c683c1d


^ permalink raw reply related	[flat|nested] 203+ messages in thread

* [RFC/PATCH 5/5] test-tool: split up test-tool read-cache
  2021-03-16 16:42     ` [PATCH v3 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
                         ` (4 preceding siblings ...)
  2021-03-17 13:28       ` [RFC/PATCH 4/5] test-tool read-cache: --table is redundant to ls-files Ævar Arnfjörð Bjarmason
@ 2021-03-17 13:28       ` Ævar Arnfjörð Bjarmason
  2021-03-17 13:32         ` Ævar Arnfjörð Bjarmason
  5 siblings, 1 reply; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-17 13:28 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee, dstolee

Since the "test-tool read-cache" was originally added back in
1ecb5ff141 (read-cache: add simple performance test, 2013-06-09) it's
been growing all sorts of bells and whistles that aren't very
conducive to performance testing the index, e.g. it learned how to
read config.

Let's split what remains of the "test-tool read-cache" into the two
narrow use-cases it's used for.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Makefile                         |  3 ++-
 t/helper/test-read-cache-again.c | 31 +++++++++++++++++++++++++
 t/helper/test-read-cache-perf.c  | 21 +++++++++++++++++
 t/helper/test-read-cache.c       | 39 --------------------------------
 t/helper/test-tool.c             |  3 ++-
 t/helper/test-tool.h             |  3 ++-
 t/perf/p0002-read-cache.sh       |  2 +-
 t/t7519-status-fsmonitor.sh      |  2 +-
 8 files changed, 60 insertions(+), 44 deletions(-)
 create mode 100644 t/helper/test-read-cache-again.c
 create mode 100644 t/helper/test-read-cache-perf.c
 delete mode 100644 t/helper/test-read-cache.c

diff --git a/Makefile b/Makefile
index 89b1d53741..a1bbb818d9 100644
--- a/Makefile
+++ b/Makefile
@@ -724,7 +724,8 @@ TEST_BUILTINS_OBJS += test-prio-queue.o
 TEST_BUILTINS_OBJS += test-proc-receive.o
 TEST_BUILTINS_OBJS += test-progress.o
 TEST_BUILTINS_OBJS += test-reach.o
-TEST_BUILTINS_OBJS += test-read-cache.o
+TEST_BUILTINS_OBJS += test-read-cache-again.o
+TEST_BUILTINS_OBJS += test-read-cache-perf.o
 TEST_BUILTINS_OBJS += test-read-graph.o
 TEST_BUILTINS_OBJS += test-read-midx.o
 TEST_BUILTINS_OBJS += test-ref-store.o
diff --git a/t/helper/test-read-cache-again.c b/t/helper/test-read-cache-again.c
new file mode 100644
index 0000000000..5e20ca1c8f
--- /dev/null
+++ b/t/helper/test-read-cache-again.c
@@ -0,0 +1,31 @@
+#include "test-tool.h"
+#include "cache.h"
+
+int cmd__read_cache_again(int argc, const char **argv)
+{
+	struct repository *r = the_repository;
+	int cnt;
+	const char *name;
+
+	if (argc != 2)
+		die("usage: test-tool read-cache-again <count> <file>");
+
+	cnt = strtol(argv[0], NULL, 0);
+	name = argv[2];
+
+	setup_git_directory();
+	while (cnt--) {
+		int pos;
+		repo_read_index(r);
+		refresh_index(r->index, REFRESH_QUIET,
+			      NULL, NULL, NULL);
+		pos = index_name_pos(r->index, name, strlen(name));
+		if (pos < 0)
+			die("%s not in index", name);
+		printf("%s is%s up to date\n", name,
+		       ce_uptodate(r->index->cache[pos]) ? "" : " not");
+		write_file(name, "%d\n", cnt);
+		discard_index(r->index);
+	}
+	return 0;
+}
diff --git a/t/helper/test-read-cache-perf.c b/t/helper/test-read-cache-perf.c
new file mode 100644
index 0000000000..ac9c297efa
--- /dev/null
+++ b/t/helper/test-read-cache-perf.c
@@ -0,0 +1,21 @@
+#include "test-tool.h"
+#include "cache.h"
+
+int cmd__read_cache_perf(int argc, const char **argv)
+{
+	struct repository *r = the_repository;
+	int cnt = 1000;
+
+	if (argc == 2)
+		cnt = strtol(argv[1], NULL, 0);
+	else if (argc != 1)
+		die("usage: test-tool read-cache-perf [<count>]");
+
+	setup_git_directory();
+	while (cnt--) {
+		repo_read_index(r);
+		discard_index(r->index);
+	}
+
+	return 0;
+}
diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
deleted file mode 100644
index 2499999af3..0000000000
--- a/t/helper/test-read-cache.c
+++ /dev/null
@@ -1,39 +0,0 @@
-#include "test-tool.h"
-#include "cache.h"
-#include "config.h"
-
-int cmd__read_cache(int argc, const char **argv)
-{
-	struct repository *r = the_repository;
-	int i, cnt = 1;
-	const char *name = NULL;
-
-	for (++argv, --argc; *argv && starts_with(*argv, "--"); ++argv, --argc) {
-		if (skip_prefix(*argv, "--print-and-refresh=", &name))
-			continue;
-	}
-
-	if (argc == 1)
-		cnt = strtol(argv[0], NULL, 0);
-	setup_git_directory();
-	git_config(git_default_config, NULL);
-
-	for (i = 0; i < cnt; i++) {
-		repo_read_index(r);
-
-		if (name) {
-			int pos;
-
-			refresh_index(r->index, REFRESH_QUIET,
-				      NULL, NULL, NULL);
-			pos = index_name_pos(r->index, name, strlen(name));
-			if (pos < 0)
-				die("%s not in index", name);
-			printf("%s is%s up to date\n", name,
-			       ce_uptodate(r->index->cache[pos]) ? "" : " not");
-			write_file(name, "%d\n", i);
-		}
-		discard_index(r->index);
-	}
-	return 0;
-}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index f97cd9f48a..1334fa25ba 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -52,7 +52,8 @@ static struct test_cmd cmds[] = {
 	{ "proc-receive", cmd__proc_receive},
 	{ "progress", cmd__progress },
 	{ "reach", cmd__reach },
-	{ "read-cache", cmd__read_cache },
+	{ "read-cache-again", cmd__read_cache_again },
+	{ "read-cache-perf", cmd__read_cache_perf },
 	{ "read-graph", cmd__read_graph },
 	{ "read-midx", cmd__read_midx },
 	{ "ref-store", cmd__ref_store },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 28072c0ad5..d70cde8574 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -41,7 +41,8 @@ int cmd__prio_queue(int argc, const char **argv);
 int cmd__proc_receive(int argc, const char **argv);
 int cmd__progress(int argc, const char **argv);
 int cmd__reach(int argc, const char **argv);
-int cmd__read_cache(int argc, const char **argv);
+int cmd__read_cache_again(int argc, const char **argv);
+int cmd__read_cache_perf(int argc, const char **argv);
 int cmd__read_graph(int argc, const char **argv);
 int cmd__read_midx(int argc, const char **argv);
 int cmd__ref_store(int argc, const char **argv);
diff --git a/t/perf/p0002-read-cache.sh b/t/perf/p0002-read-cache.sh
index cdd105a594..d0ba5173fb 100755
--- a/t/perf/p0002-read-cache.sh
+++ b/t/perf/p0002-read-cache.sh
@@ -8,7 +8,7 @@ test_perf_default_repo
 
 count=1000
 test_perf "read_cache/discard_cache $count times" "
-	test-tool read-cache $count
+	test-tool read-cache-perf $count
 "
 
 test_done
diff --git a/t/t7519-status-fsmonitor.sh b/t/t7519-status-fsmonitor.sh
index 45d025f960..3761a8781d 100755
--- a/t/t7519-status-fsmonitor.sh
+++ b/t/t7519-status-fsmonitor.sh
@@ -359,7 +359,7 @@ test_expect_success UNTRACKED_CACHE 'ignore .git changes when invalidating UNTR'
 test_expect_success 'discard_index() also discards fsmonitor info' '
 	test_config core.fsmonitor "$TEST_DIRECTORY/t7519/fsmonitor-all" &&
 	test_might_fail git update-index --refresh &&
-	test-tool read-cache --print-and-refresh=tracked 2 >actual &&
+	test-tool read-cache-again 2 tracked >actual &&
 	printf "tracked is%s up to date\n" "" " not" >expect &&
 	test_cmp expect actual
 '
-- 
2.31.0.260.g719c683c1d


^ permalink raw reply related	[flat|nested] 203+ messages in thread

* Re: [RFC/PATCH 5/5] test-tool: split up test-tool read-cache
  2021-03-17 13:28       ` [RFC/PATCH 5/5] test-tool: split up test-tool read-cache Ævar Arnfjörð Bjarmason
@ 2021-03-17 13:32         ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-17 13:32 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee, dstolee


On Wed, Mar 17 2021, Ævar Arnfjörð Bjarmason wrote:

> +	if (argc != 2)
> +		die("usage: test-tool read-cache-again <count> <file>");
> +
> +	cnt = strtol(argv[0], NULL, 0);
> +	name = argv[2];

This fixup is needed on top (the perils of sending out ad-hoc RFC
patches from the working tree):

diff --git a/t/helper/test-read-cache-again.c b/t/helper/test-read-cache-again.c
index 5e20ca1c8f..aa97b3aaf3 100644
--- a/t/helper/test-read-cache-again.c
+++ b/t/helper/test-read-cache-again.c
@@ -7,10 +7,9 @@ int cmd__read_cache_again(int argc, const char **argv)
 	int cnt;
 	const char *name;
 
-	if (argc != 2)
+	if (argc != 3)
 		die("usage: test-tool read-cache-again <count> <file>");
-
-	cnt = strtol(argv[0], NULL, 0);
+	cnt = strtol(argv[1], NULL, 0);
 	name = argv[2];
 
 	setup_git_directory();
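
For reference, a minimal sketch of why the counts shift by one. It assumes,
consistent with the fixup above, that the subcommand function receives an
argv whose first element is the subcommand name itself. The names
parse_again_args() and struct again_args are hypothetical and exist only
for illustration; they are not part of the patch:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical holder for the two positional arguments. */
struct again_args {
	long cnt;
	const char *name;
};

/*
 * Parse "<subcommand> <count> <file>". argv[0] is assumed to be the
 * subcommand name, so two positional arguments mean argc == 3.
 * Returns 0 on success, -1 on a usage error.
 */
int parse_again_args(int argc, const char **argv, struct again_args *out)
{
	if (argc != 3)
		return -1;
	out->cnt = strtol(argv[1], NULL, 0);
	out->name = argv[2];
	return 0;
}
```

With "read-cache-again 2 tracked" this yields cnt == 2 and name ==
"tracked"; checking argc != 2 and reading argv[0] would instead reject
the valid invocation, or parse the subcommand name as the count.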

^ permalink raw reply related	[flat|nested] 203+ messages in thread

* Re: [PATCH v3 13/20] unpack-trees: allow sparse directories
  2021-03-16 16:42     ` [PATCH v3 13/20] unpack-trees: allow sparse directories Derrick Stolee via GitGitGadget
@ 2021-03-17 13:35       ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-17 13:35 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee


On Tue, Mar 16 2021, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> The index_pos_by_traverse_info() currently throws a BUG() when a
> directory entry exists exactly in the index. We need to consider that it
> is possible to have a directory in a sparse index as long as that entry
> is itself marked with the skip-worktree bit.
>
> The 'pos' variable is assigned a negative value if an exact match is not
> found. Since a directory name can be an exact match, it is no longer an
> error to have a nonnegative 'pos' value.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  unpack-trees.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/unpack-trees.c b/unpack-trees.c
> index 2da3e5ec77a1..e81d82d72d89 100644
> --- a/unpack-trees.c
> +++ b/unpack-trees.c
> @@ -749,9 +749,12 @@ static int index_pos_by_traverse_info(struct name_entry *names,
>  	strbuf_make_traverse_path(&name, info, names->path, names->pathlen);
>  	strbuf_addch(&name, '/');
>  	pos = index_name_pos(o->src_index, name.buf, name.len);
> -	if (pos >= 0)
> -		BUG("This is a directory and should not exist in index");
> -	pos = -pos - 1;
> +	if (pos >= 0) {
> +		if (!o->src_index->sparse_index ||
> +		    !(o->src_index->cache[pos]->ce_flags & CE_SKIP_WORKTREE))
> +			BUG("This is a directory and should not exist in index");
> +	} else
> +		pos = -pos - 1;

Style nit: add {}'s to the "else" once the "if" gets one.

>  	if (pos >= o->src_index->cache_nr ||
>  	    !starts_with(o->src_index->cache[pos]->name, name.buf) ||
>  	    (pos > 0 && starts_with(o->src_index->cache[pos-1]->name, name.buf)))
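
The 'pos' convention described in the quoted commit message can be
sketched roughly as below. This is only an illustration: name_pos() and
insertion_pos() are hypothetical stand-ins that use plain strcmp() on a
sorted array, not the real index_name_pos() implementation.

```c
#include <assert.h>
#include <string.h>

/*
 * Illustrative stand-in for index_name_pos(): binary-search a sorted
 * array of names. Returns the position on an exact match, otherwise
 * -(insertion position) - 1, which is always negative.
 */
int name_pos(const char **names, int nr, const char *name)
{
	int first = 0, last = nr;

	while (first < last) {
		int mid = first + (last - first) / 2;
		int cmp = strcmp(name, names[mid]);

		if (!cmp)
			return mid;
		if (cmp < 0)
			last = mid;
		else
			first = mid + 1;
	}
	return -first - 1;
}

/* Recover the insertion position from a "not found" result. */
int insertion_pos(int pos)
{
	return pos < 0 ? -pos - 1 : pos;
}
```

In a full index "folder2/" is never an exact match, so the lookup goes
negative and -pos - 1 points at the first entry under that directory. In
a sparse index the directory entry itself can match, in which case the
nonnegative result is already the position to use.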


^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v3 11/20] sparse-index: convert from full to sparse
  2021-03-16 16:42     ` [PATCH v3 11/20] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
@ 2021-03-17 13:43       ` Ævar Arnfjörð Bjarmason
  2021-03-17 19:55         ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-17 13:43 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee


On Tue, Mar 16 2021, Derrick Stolee via GitGitGadget wrote:

> diff --git a/cache-tree.c b/cache-tree.c
> index 2fb483d3c083..5f07a39e501e 100644
> --- a/cache-tree.c
> +++ b/cache-tree.c
> @@ -6,6 +6,7 @@
>  #include "object-store.h"
>  #include "replace-object.h"
>  #include "promisor-remote.h"
> +#include "sparse-index.h"
>  
>  #ifndef DEBUG_CACHE_TREE
>  #define DEBUG_CACHE_TREE 0
> @@ -442,6 +443,8 @@ int cache_tree_update(struct index_state *istate, int flags)
>  	if (i)
>  		return i;
>  
> +	ensure_full_index(istate);
> +
>  	if (!istate->cache_tree)
>  		istate->cache_tree = cache_tree();
>  
> diff --git a/cache.h b/cache.h
> index 759ca92e2ecc..69a32146cd77 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -251,6 +251,8 @@ static inline unsigned int create_ce_mode(unsigned int mode)
>  {
>  	if (S_ISLNK(mode))
>  		return S_IFLNK;
> +	if (mode == S_IFDIR)
> +		return S_IFDIR;

Does this actually need to be mode == S_IFDIR v.s. S_ISDIR(mode)? Those
aren't the same thing...

>  	if (S_ISDIR(mode) || S_ISGITLINK(mode))
>  		return S_IFGITLINK;

...and if it can be S_ISDIR(mode) then this becomes just
S_ISGITLINK(mode), but losing the "if" there makes me suspect that some
dir == submodule heuristic is being broken somewhere..
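
To spell out the difference between the two checks, here is a small
sketch using only the standard <sys/stat.h> macros. The helper names
is_exactly_ifdir() and is_dir_type() are made up for illustration:

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <sys/stat.h>

/* Exact comparison: true only when no permission bits are set. */
int is_exactly_ifdir(unsigned int mode)
{
	return mode == S_IFDIR;
}

/* Type test: masks with S_IFMT, so permission bits are ignored. */
int is_dir_type(unsigned int mode)
{
	return S_ISDIR(mode) ? 1 : 0;
}
```

A mode such as S_IFDIR | 0755 (what stat() reports for a directory on
disk) passes the S_ISDIR() test but fails the exact comparison, so under
the quoted hunk it would still fall through to the S_IFGITLINK branch.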


^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v3 02/20] t/perf: add performance test for sparse operations
  2021-03-17 13:21           ` Ævar Arnfjörð Bjarmason
@ 2021-03-17 18:02             ` Derrick Stolee
  0 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee @ 2021-03-17 18:02 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Derrick Stolee via GitGitGadget, git, newren, gitster, pclouds,
	jrnieder, Martin Ågren, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

On 3/17/2021 9:21 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, Mar 17 2021, Derrick Stolee wrote:
> 
>> On 3/17/2021 4:41 AM, Ævar Arnfjörð Bjarmason wrote:
>>> But it seems odd to be doing this at all, the point of the perf
>>> framework is that you can point it at any repo, and some repos you want
>>> to test will have submodules.
>>
>> You're right that it should handle all repos. However, the point of
>> the test is to have many copies of the repo, but most of them are
>> excluded by sparse-directory entries. We don't collapse sparse-directory
>> entries if there is a submodule inside, so the data shape is wrong after
>> making all the copies.
>>
>> So, I disagree with your approach in your suggested diff, and instead
>> offer this one. I've tested this with git.git and another local repo
>> without submodules and checked that everything works as expected.
> 
> What's got me confused here is that there's two uses for the perf
> framework in this context.
> 
> It's to use an empty/git.git as a test repo to demonstrate something,
> but then also that you can run it in your arbitrary repo, and e.g. see
> how much a given feature might benefit you.
> 
> Hence suggesting that maybe test_perf_fresh_repo is better here, because
> by using test_perf_default_repo you're creating the expectation that you
> can run the perf test, observe an %X difference, and that'll be
> give-or-take what you'll get for that use case if you enable the feature.
> 
> Except it won't because the repo has submodules, which we deleted for
> the perf test...

I'm also dramatically changing the repository shape to expose index
reads and writes as a bottleneck. The benefit of using other repos
(like git.git or optionally choosing the Linux kernel repo) is to
change how much of the time is spent crawling the populated set.

>> diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
>> index e527316e66d..5c0d78eeeea 100755
>> --- a/t/perf/p2000-sparse-operations.sh
>> +++ b/t/perf/p2000-sparse-operations.sh
>> @@ -10,15 +10,17 @@ SPARSE_CONE=f2/f4/f1
>>  
>>  test_expect_success 'setup repo and indexes' '
>>  	git reset --hard HEAD &&
>> +
>>  	# Remove submodules from the example repo, because our
>> -	# duplication of the entire repo creates an unlikly data shape.
>> -	git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
>> -	git rm -f .gitmodules &&
>> -	for module in $(awk "{print \$2}" modules)
>> -	do
>> -		git rm $module || return 1
>> -	done &&
>> -	git commit -m "remove submodules" &&
>> +	# duplication of the entire repo creates an unlikely data shape.
>> +	if (git config --file .gitmodules --get-regexp "submodule.*.path" >modules)
> 
> A subshell isn't needed here.
> 
> FWIW the reason I got this out of ls-files is because you can have
> submodules without .gitmodules entries, rare and broken, but seemed more
> direct to grep the mode bits.

I'd prefer to do something (textually) simpler, expecting the input
repos to have correct data.

>> +	then
>> +		for module in $(awk "{print \$2}" modules)
>> +		do
>> +			git rm $module || return 1
>> +		done &&
> 
> Once we know we have submodules we can just do this without the loop.
> 
>     git rm $(awk "{print \$2}" modules)

Ok. That works for me.

>>> Seems like something like the WIP patch at the end on top would be
>>> better.
>>>
>>>> +	echo bogus >a &&
>>>> +	cp a b &&
>>>> +	git add a b &&
>>>> +	git commit -m "level 0" &&
>>>> +	BLOB=$(git rev-parse HEAD:a) &&
>>>
>>> Isn't the way we're getting this $BLOB equivalent to just 'echo bogus |
>>> git hash-object --stdin -w' why commit it?
>>
>> We are committing it so we can add commits that deepen the copies,
>> but within those copies we have these known file paths.
>>
>>> This whole thing makes me think you just wanted a test_perf_fresh_repo
>>> all along, but I think this would be much more useful if you took the
>>> default repo and multiplied the size in its tree by some multiple.
>>>
>>> E.g. take the files we have in git.git, write a copy at prefix-1/,
>>> prefix-2/ etc.
>>
>> That is essentially what is happening here, but using multiple levels
>> of directories. Using these multiple levels presents extra tree
>> lookups and parsing in the event of expanding a sparse index to a
>> full one.
> 
> *nod*
> 
> Anyway, this thread's a bit of a bikeshed on my part, I was just
> wondering if & what part of the test relied on the existing repo if it
> was mostly setting up its own test data.

Again, the benefit is to depend on the repo shape in some aspects,
while exaggerating the data shape to make the non-populated set
extremely large.

This presents different aspects that are worth examining, such as
git.git is much smaller than linux.git, and that is noticeable with
these different performance numbers (taken at the end of this
series):

git.git
Test                                            this tree      
---------------------------------------------------------------
2000.2: git status (full-index-v3)              0.39(0.35+0.08)
2000.3: git status (full-index-v4)              0.39(0.34+0.09)
2000.4: git status (sparse-index-v3)            2.46(2.33+0.16)
2000.5: git status (sparse-index-v4)            2.42(2.31+0.15)
2000.6: git add -A (full-index-v3)              1.35(0.98+0.20)
2000.7: git add -A (full-index-v4)              1.25(0.96+0.18)
2000.8: git add -A (sparse-index-v3)            2.39(2.26+0.17)
2000.9: git add -A (sparse-index-v4)            2.35(2.29+0.11)
2000.10: git add . (full-index-v3)              1.39(1.01+0.19)
2000.11: git add . (full-index-v4)              1.31(1.00+0.19)
2000.12: git add . (sparse-index-v3)            2.41(2.28+0.16)
2000.13: git add . (sparse-index-v4)            2.45(2.32+0.16)
2000.14: git commit -a -m A (full-index-v3)     1.44(1.08+0.21)
2000.15: git commit -a -m A (full-index-v4)     1.31(1.04+0.19)
2000.16: git commit -a -m A (sparse-index-v3)   2.44(2.35+0.16)
2000.17: git commit -a -m A (sparse-index-v4)   2.44(2.36+0.16)

linux.git
Test                                            this tree        
-----------------------------------------------------------------
2000.2: git status (full-index-v3)              7.14(6.06+1.79)  
2000.3: git status (full-index-v4)              7.01(6.16+1.60)  
2000.4: git status (sparse-index-v3)            58.50(56.86+2.34)
2000.5: git status (sparse-index-v4)            57.52(55.80+2.45)
2000.6: git add -A (full-index-v3)              25.52(18.70+3.18)
2000.7: git add -A (full-index-v4)              22.26(17.52+2.72)
2000.8: git add -A (sparse-index-v3)            56.65(55.00+2.35)
2000.9: git add -A (sparse-index-v4)            56.56(54.98+2.29)
2000.10: git add . (full-index-v3)              25.87(19.12+3.15)
2000.11: git add . (full-index-v4)              22.56(17.85+2.71)
2000.12: git add . (sparse-index-v3)            57.01(55.28+2.42)
2000.13: git add . (sparse-index-v4)            56.84(55.38+2.19)
2000.14: git commit -a -m A (full-index-v3)     26.83(20.69+3.24)
2000.15: git commit -a -m A (full-index-v4)     24.04(19.86+2.65)
2000.16: git commit -a -m A (sparse-index-v3)   60.23(58.99+2.44)
2000.17: git commit -a -m A (sparse-index-v4)   60.52(59.09+2.74)

The intention is to make these numbers improve in the future
so that the sparse-index is a better approach.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [RFC/PATCH 2/5] ls-files: make "mode" in show_ce() loop a variable
  2021-03-17 13:28       ` [RFC/PATCH 2/5] ls-files: make "mode" in show_ce() loop a variable Ævar Arnfjörð Bjarmason
@ 2021-03-17 18:11         ` Elijah Newren
  2021-03-24  0:46           ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 203+ messages in thread
From: Elijah Newren @ 2021-03-17 18:11 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, Derrick Stolee, SZEDER Gábor,
	Derrick Stolee, Derrick Stolee

On Wed, Mar 17, 2021 at 6:28 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> In a subsequent commit I'll optionally change the mode in a new sparse
> mode, let's do this first to make that change smaller.
>
> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> ---
>  builtin/ls-files.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/builtin/ls-files.c b/builtin/ls-files.c
> index eb72d16493..4db75351f2 100644
> --- a/builtin/ls-files.c
> +++ b/builtin/ls-files.c
> @@ -242,9 +242,17 @@ static void show_ce(struct repository *repo, struct dir_struct *dir,
>                 if (!show_stage) {
>                         fputs(tag, stdout);
>                 } else {
> +                       unsigned int mode = ce->ce_mode;
> +                       if (show_sparse && S_ISSPARSEDIR(mode))
> +                               /*
> +                                * We could just do & 0177777 all the
> +                                * time, just make it clear this is
> +                                * for --stage-sparse.
> +                                */
> +                               mode &= 0177777;

I could kind of see referencing the magic constant 0177777 in a test-*
source file, but it really needs an explanation when showing up in
actual git source code.  At least reference something about how
cache.h mentions these are the mode bits, or better yet #define this
constant somewhere in cache.h with an explanation.
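
Something along these lines is what I have in mind. This is a sketch
only: CE_MODE_BITS_MASK and format_mode() are hypothetical names, and it
makes no claim about whether ce_mode ever carries bits above the mask.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical name for the constant under discussion. 0177777
 * (octal) keeps the low 16 bits of the value and clears the rest.
 */
#define CE_MODE_BITS_MASK 0177777

/* Format a mode as six octal digits after applying the mask. */
int format_mode(char *buf, size_t len, unsigned int mode)
{
	return snprintf(buf, len, "%06o", mode & CE_MODE_BITS_MASK);
}
```

A named constant with a comment next to it lets the reader see what the
mask is for without having to decode the octal literal at the call site.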

Also, what is --stage-sparse?

>                         printf("%s%06o %s %d\t",
>                                tag,
> -                              ce->ce_mode,
> +                              mode,
>                                find_unique_abbrev(&ce->oid, abbrev),
>                                ce_stage(ce));
>                 }
> --
> 2.31.0.260.g719c683c1d

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [RFC/PATCH 3/5] ls-files: add and use a new --sparse option
  2021-03-17 13:28       ` [RFC/PATCH 3/5] ls-files: add and use a new --sparse option Ævar Arnfjörð Bjarmason
@ 2021-03-17 18:19         ` Elijah Newren
  2021-03-17 18:27           ` Ævar Arnfjörð Bjarmason
  2021-03-17 20:43         ` Derrick Stolee
  1 sibling, 1 reply; 203+ messages in thread
From: Elijah Newren @ 2021-03-17 18:19 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, Derrick Stolee, SZEDER Gábor,
	Derrick Stolee, Derrick Stolee

On Wed, Mar 17, 2021 at 6:28 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> ---
>  Documentation/git-ls-files.txt           |  4 ++
>  builtin/ls-files.c                       | 10 ++++-
>  t/t1091-sparse-checkout-builtin.sh       |  9 ++--
>  t/t1092-sparse-checkout-compatibility.sh | 57 ++++++++++++++++--------
>  4 files changed, 56 insertions(+), 24 deletions(-)
>
> diff --git a/Documentation/git-ls-files.txt b/Documentation/git-ls-files.txt
> index 6d11ab506b..1145e960a4 100644
> --- a/Documentation/git-ls-files.txt
> +++ b/Documentation/git-ls-files.txt
> @@ -71,6 +71,10 @@ OPTIONS
>  --unmerged::
>         Show unmerged files in the output (forces --stage)
>
> +--sparse::
> +       Show sparse directories in the output instead of expanding
> +       them (forces --stage)
> +
>  -k::
>  --killed::
>         Show files on the filesystem that need to be removed due
> diff --git a/builtin/ls-files.c b/builtin/ls-files.c
> index 4db75351f2..1ebbb63c10 100644
> --- a/builtin/ls-files.c
> +++ b/builtin/ls-files.c
> @@ -26,6 +26,7 @@ static int show_deleted;
>  static int show_cached;
>  static int show_others;
>  static int show_stage;
> +static int show_sparse;
>  static int show_unmerged;
>  static int show_resolve_undo;
>  static int show_modified;
> @@ -639,6 +640,8 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
>                         DIR_SHOW_IGNORED),
>                 OPT_BOOL('s', "stage", &show_stage,
>                         N_("show staged contents' object name in the output")),
> +               OPT_BOOL(0, "sparse", &show_sparse,
> +                       N_("show unexpanded sparse directories in the output")),
>                 OPT_BOOL('k', "killed", &show_killed,
>                         N_("show files on the filesystem that need to be removed")),
>                 OPT_BIT(0, "directory", &dir.flags,
> @@ -705,12 +708,17 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
>                 tag_skip_worktree = "S ";
>                 tag_resolve_undo = "U ";
>         }
> +       if (show_sparse) {
> +               prepare_repo_settings(the_repository);
> +               the_repository->settings.command_requires_full_index = 0;
> +       }
>         if (show_modified || show_others || show_deleted || (dir.flags & DIR_SHOW_IGNORED) || show_killed)
>                 require_work_tree = 1;
> -       if (show_unmerged)
> +       if (show_unmerged || show_sparse)
>                 /*
>                  * There's no point in showing unmerged unless
>                  * you also show the stage information.
> +                * The same goes for the --sparse option.

Yuck, haven't you just made --sparse an alias for --stage?  Why does
it need an alias?

Was the goal just to get a quick way to make the command run under
repo->settings.command_requires_full_index = 0 without auditing the
codepaths?  It seems to rely on them having been audited anyway, since
it just falls back to the code used for --stage, so I don't see how it
helps.  It also suggests the command might do unexpected or weird
things if run without the --sparse option?  If people manually
configure a sparse-checkout and cone mode AND a sparse-index (it's
annoying how they have to specify all three instead of having to just
pass one flag somewhere), then now we also need to force them to
remember to pass extra flags to random various commands for them to
operate in a sane manner in their environment??

I think this is a bad path to go down.

However, if you want to write the necessary tests to make it so that
ls-files can operate with command_requires_full_index = 0, then I
think that's useful.  If you want to add a special flag so that folks
in a sparse-checkout-with-cone-mode-with-sparse-index setup can
operate densely (i.e. to show what files would be in the index if it
were fully populated), then I think that's useful.  But having
sparse-yes-with-cone-yes-very-sparse folks need to specify an extra
flag to commands to get sparse behavior just seems wrong to me.

>                  */
>                 show_stage = 1;
>         if (show_tag || show_stage)
> diff --git a/t/t1091-sparse-checkout-builtin.sh b/t/t1091-sparse-checkout-builtin.sh
> index ff1ad570a2..c823df423c 100755
> --- a/t/t1091-sparse-checkout-builtin.sh
> +++ b/t/t1091-sparse-checkout-builtin.sh
> @@ -208,12 +208,13 @@ test_expect_success 'sparse-checkout disable' '
>  test_expect_success 'sparse-index enabled and disabled' '
>         git -C repo sparse-checkout init --cone --sparse-index &&
>         test_cmp_config -C repo true extensions.sparseIndex &&
> -       test-tool -C repo read-cache --table >cache &&
> -       grep " tree " cache &&
> +       git -C repo ls-files --sparse >cache &&
> +       grep "^040000 " cache >lines &&
> +       test_line_count = 3 lines &&
>
>         git -C repo sparse-checkout disable &&
> -       test-tool -C repo read-cache --table >cache &&
> -       ! grep " tree " cache &&
> +       git -C repo ls-files --sparse >cache &&
> +       ! grep "^040000 " cache &&
>         git -C repo config --list >config &&
>         ! grep extensions.sparseindex config
>  '
> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> index d97bf9b645..48d3920490 100755
> --- a/t/t1092-sparse-checkout-compatibility.sh
> +++ b/t/t1092-sparse-checkout-compatibility.sh
> @@ -136,48 +136,67 @@ test_sparse_match () {
>         test_cmp sparse-checkout-err sparse-index-err
>  }
>
> +test_index_entry_like () {
> +       dir=$1
> +       shift
> +       fmt=$1
> +       shift
> +       rev=$1
> +       shift
> +       entry=$1
> +       shift
> +       file=$1
> +       shift
> +       hash=$(git -C "$dir" rev-parse "$rev") &&
> +       printf "$fmt\n" "$hash" "$entry" >expected &&
> +       if grep "$entry" "$file" >line
> +       then
> +               test_cmp expected line
> +       else
> +               cat cache &&
> +               false
> +       fi
> +}
> +
>  test_expect_success 'sparse-index contents' '
>         init_repos &&
>
> -       test-tool -C sparse-index read-cache --table >cache &&
> +       git -C sparse-index ls-files --sparse >cache &&
>         for dir in folder1 folder2 x
>         do
> -               TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
> -               grep "040000 tree $TREE $dir/" cache \
> -                       || return 1
> +               test_index_entry_like sparse-index "040000 %s 0\t%s" "HEAD:$dir" "$dir/" cache || return 1
>         done &&
>
>         git -C sparse-index sparse-checkout set folder1 &&
>
> -       test-tool -C sparse-index read-cache --table >cache &&
> +       git -C sparse-index ls-files --sparse >cache &&
>         for dir in deep folder2 x
>         do
> -               TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
> -               grep "040000 tree $TREE $dir/" cache \
> -                       || return 1
> +               test_index_entry_like sparse-index "040000 %s 0\t%s" "HEAD:$dir" "$dir/" cache || return 1
>         done &&
>
>         git -C sparse-index sparse-checkout set deep/deeper1 &&
>
> -       test-tool -C sparse-index read-cache --table >cache &&
> +       git -C sparse-index ls-files --sparse >cache &&
>         for dir in deep/deeper2 folder1 folder2 x
>         do
> -               TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
> -               grep "040000 tree $TREE $dir/" cache \
> -                       || return 1
> +               test_index_entry_like sparse-index "040000 %s 0\t%s" "HEAD:$dir" "$dir/" cache || return 1
>         done &&
>
> +       grep 040000 cache >lines &&
> +       test_line_count = 4 lines &&
> +
>         # Disabling the sparse-index removes tree entries with full ones
>         git -C sparse-index sparse-checkout init --no-sparse-index &&
>
> -       test-tool -C sparse-index read-cache --table >cache &&
> -       ! grep "040000 tree" cache &&
> -       test_sparse_match test-tool read-cache --table
> +       git -C sparse-index ls-files --sparse >cache &&
> +       ! grep "^040000 " cache >lines &&
> +       test_sparse_match git ls-tree -r HEAD
>  '
>
>  test_expect_success 'expanded in-memory index matches full index' '
>         init_repos &&
> -       test_sparse_match test-tool read-cache --expand --table
> +       test_sparse_match git ls-tree -r HEAD
>  '
>
>  test_expect_success 'status with options' '
> @@ -394,9 +413,9 @@ test_expect_success 'submodule handling' '
>         test_all_match git commit -m "add submodule" &&
>
>         # having a submodule prevents "modules" from collapse
> -       test-tool -C sparse-index read-cache --table >cache &&
> -       grep "100644 blob .*    modules/a" cache &&
> -       grep "160000 commit $(git -C initial-repo rev-parse HEAD)       modules/sub" cache
> +       git -C sparse-index ls-files --sparse >cache &&
> +       test_index_entry_like sparse-index "100644 %s 0\t%s" "HEAD:modules/a" "modules/a" cache &&
> +       test_index_entry_like sparse-index "160000 %s 0\t%s" "HEAD:modules/sub" "modules/sub" cache
>  '
>
>  test_expect_success 'sparse-index is expanded and converted back' '
> --
> 2.31.0.260.g719c683c1d

I do like the tests and your idea that we can use ls-files to list
whatever entries are in the index, I just think the tests should use
--stage to do that.
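
As a rough sketch of what such a check could look like: the patch's
format strings imply `ls-files --stage` lines of the form
"<mode> <object> <stage><TAB><path>", with a sparse directory entry
carrying mode 040000. The helper name check_index_entry and the object
IDs below are made up for illustration, and the script reads a stand-in
file rather than a real index:

```shell
#!/bin/sh
# Work in a scratch directory so the sketch leaves nothing behind.
cd "$(mktemp -d)" || exit 1

# Stand-in for `git ls-files --stage` output, one entry per line:
#   <mode> <object> <stage><TAB><path>
# The object IDs are made up for illustration.
printf '%s %s %s\t%s\n' \
    100644 1111111111111111111111111111111111111111 0 a \
    040000 2222222222222222222222222222222222222222 0 folder1/ \
    100644 3333333333333333333333333333333333333333 0 folder1-sibling \
    >cache

# Hypothetical helper: succeed only if the entries in file $1 whose path
# is exactly $4 amount to a single stage-0 line with mode $2 and object $3.
check_index_entry () {
    printf '%s %s 0\t%s\n' "$2" "$3" "$4" >expected &&
    # Compare the path field literally instead of using it as a regex.
    awk -F '\t' -v path="$4" '$2 == path' "$1" >actual &&
    cmp -s expected actual
}

check_index_entry cache 040000 2222222222222222222222222222222222222222 folder1/ &&
echo "folder1/ matches"

check_index_entry cache 040000 2222222222222222222222222222222222222222 folder2/ ||
echo "folder2/ is absent or differs"
```

Matching the path field exactly is a choice made for this sketch; a
plain grep on the path would also match any longer path that contains
it.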


* Re: [RFC/PATCH 3/5] ls-files: add and use a new --sparse option
  2021-03-17 18:19         ` Elijah Newren
@ 2021-03-17 18:27           ` Ævar Arnfjörð Bjarmason
  2021-03-17 18:44             ` Elijah Newren
  0 siblings, 1 reply; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-17 18:27 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, Derrick Stolee, SZEDER Gábor,
	Derrick Stolee, Derrick Stolee


On Wed, Mar 17 2021, Elijah Newren wrote:

> On Wed, Mar 17, 2021 at 6:28 AM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
>> ---
>>  Documentation/git-ls-files.txt           |  4 ++
>>  builtin/ls-files.c                       | 10 ++++-
>>  t/t1091-sparse-checkout-builtin.sh       |  9 ++--
>>  t/t1092-sparse-checkout-compatibility.sh | 57 ++++++++++++++++--------
>>  4 files changed, 56 insertions(+), 24 deletions(-)
>>
>> diff --git a/Documentation/git-ls-files.txt b/Documentation/git-ls-files.txt
>> index 6d11ab506b..1145e960a4 100644
>> --- a/Documentation/git-ls-files.txt
>> +++ b/Documentation/git-ls-files.txt
>> @@ -71,6 +71,10 @@ OPTIONS
>>  --unmerged::
>>         Show unmerged files in the output (forces --stage)
>>
>> +--sparse::
>> +       Show sparse directories in the output instead of expanding
>> +       them (forces --stage)
>> +
>>  -k::
>>  --killed::
>>         Show files on the filesystem that need to be removed due
>> diff --git a/builtin/ls-files.c b/builtin/ls-files.c
>> index 4db75351f2..1ebbb63c10 100644
>> --- a/builtin/ls-files.c
>> +++ b/builtin/ls-files.c
>> @@ -26,6 +26,7 @@ static int show_deleted;
>>  static int show_cached;
>>  static int show_others;
>>  static int show_stage;
>> +static int show_sparse;
>>  static int show_unmerged;
>>  static int show_resolve_undo;
>>  static int show_modified;
>> @@ -639,6 +640,8 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
>>                         DIR_SHOW_IGNORED),
>>                 OPT_BOOL('s', "stage", &show_stage,
>>                         N_("show staged contents' object name in the output")),
>> +               OPT_BOOL(0, "sparse", &show_sparse,
>> +                       N_("show unexpanded sparse directories in the output")),
>>                 OPT_BOOL('k', "killed", &show_killed,
>>                         N_("show files on the filesystem that need to be removed")),
>>                 OPT_BIT(0, "directory", &dir.flags,
>> @@ -705,12 +708,17 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
>>                 tag_skip_worktree = "S ";
>>                 tag_resolve_undo = "U ";
>>         }
>> +       if (show_sparse) {
>> +               prepare_repo_settings(the_repository);
>> +               the_repository->settings.command_requires_full_index = 0;
>> +       }
>>         if (show_modified || show_others || show_deleted || (dir.flags & DIR_SHOW_IGNORED) || show_killed)
>>                 require_work_tree = 1;
>> -       if (show_unmerged)
>> +       if (show_unmerged || show_sparse)
>>                 /*
>>                  * There's no point in showing unmerged unless
>>                  * you also show the stage information.
>> +                * The same goes for the --sparse option.
>
> Yuck, haven't you just made --sparse an alias for --stage?  Why does
> it need an alias?

It doesn't, but --unmerged, the one other option which purely modifies
--stage output, implies --stage.

So it's in line with the existing UI convention in the command; it's
probably better to keep following that than to have new options behave
differently.

But yeah, we could spell out --stage --sparse in the tests.

> Was the goal just to get a quick way to make the command run under
> repo->settings.command_requires_full_index = 0 without auditing the
> codepaths?  It seems to rely on them having been audited anyway, since
> it just falls back to the code used for --stage, so I don't see how it
> helps.  It also suggests the command might do unexpected or weird
> things if run without the --sparse option?  If people manually
> configure a sparse-checkout and cone mode AND a sparse-index (it's
> annoying how they have to specify all three instead of having to just
> pass one flag somewhere), then now we also need to force them to
> remember to pass extra flags to random various commands for them to
> operate in a sane manner in their environment??
>
> I think this is a bad path to go down.

Those are probably good points; I don't have enough overview of the
whole sparse thing yet to say.

I just thought it didn't make sense to have a series changing the nature
of the index without corresponding tooling changes to interrogate the
state of the index.

> However, if you want to write the necessary tests to make it so that
> ls-files can operate with command_requires_full_index = 0, then I
> think that's useful.  If you want to add a special flag so that folks
> in a sparse-checkout-with-cone-mode-with-sparse-index setup can
> operate densely (i.e. to show what files would be in the index if it
> were fully populated), then I think that's useful.  But having
> sparse-yes-with-cone-yes-very-sparse folks need to specify an extra
> flag to commands to get sparse behavior just seems wrong to me.

Maybe, but what else do you suggest for getting this information out of
the index?

>>                  */
>>                 show_stage = 1;
>>         if (show_tag || show_stage)
>> diff --git a/t/t1091-sparse-checkout-builtin.sh b/t/t1091-sparse-checkout-builtin.sh
>> index ff1ad570a2..c823df423c 100755
>> --- a/t/t1091-sparse-checkout-builtin.sh
>> +++ b/t/t1091-sparse-checkout-builtin.sh
>> @@ -208,12 +208,13 @@ test_expect_success 'sparse-checkout disable' '
>>  test_expect_success 'sparse-index enabled and disabled' '
>>         git -C repo sparse-checkout init --cone --sparse-index &&
>>         test_cmp_config -C repo true extensions.sparseIndex &&
>> -       test-tool -C repo read-cache --table >cache &&
>> -       grep " tree " cache &&
>> +       git -C repo ls-files --sparse >cache &&
>> +       grep "^040000 " cache >lines &&
>> +       test_line_count = 3 lines &&
>>
>>         git -C repo sparse-checkout disable &&
>> -       test-tool -C repo read-cache --table >cache &&
>> -       ! grep " tree " cache &&
>> +       git -C repo ls-files --sparse >cache &&
>> +       ! grep "^040000 " cache &&
>>         git -C repo config --list >config &&
>>         ! grep extensions.sparseindex config
>>  '
>> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
>> index d97bf9b645..48d3920490 100755
>> --- a/t/t1092-sparse-checkout-compatibility.sh
>> +++ b/t/t1092-sparse-checkout-compatibility.sh
>> @@ -136,48 +136,67 @@ test_sparse_match () {
>>         test_cmp sparse-checkout-err sparse-index-err
>>  }
>>
>> +test_index_entry_like () {
>> +       dir=$1
>> +       shift
>> +       fmt=$1
>> +       shift
>> +       rev=$1
>> +       shift
>> +       entry=$1
>> +       shift
>> +       file=$1
>> +       shift
>> +       hash=$(git -C "$dir" rev-parse "$rev") &&
>> +       printf "$fmt\n" "$hash" "$entry" >expected &&
>> +       if grep "$entry" "$file" >line
>> +       then
>> +               test_cmp expected line
>> +       else
>> +               cat cache &&
>> +               false
>> +       fi
>> +}
>> +
>>  test_expect_success 'sparse-index contents' '
>>         init_repos &&
>>
>> -       test-tool -C sparse-index read-cache --table >cache &&
>> +       git -C sparse-index ls-files --sparse >cache &&
>>         for dir in folder1 folder2 x
>>         do
>> -               TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
>> -               grep "040000 tree $TREE $dir/" cache \
>> -                       || return 1
>> +               test_index_entry_like sparse-index "040000 %s 0\t%s" "HEAD:$dir" "$dir/" cache || return 1
>>         done &&
>>
>>         git -C sparse-index sparse-checkout set folder1 &&
>>
>> -       test-tool -C sparse-index read-cache --table >cache &&
>> +       git -C sparse-index ls-files --sparse >cache &&
>>         for dir in deep folder2 x
>>         do
>> -               TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
>> -               grep "040000 tree $TREE $dir/" cache \
>> -                       || return 1
>> +               test_index_entry_like sparse-index "040000 %s 0\t%s" "HEAD:$dir" "$dir/" cache || return 1
>>         done &&
>>
>>         git -C sparse-index sparse-checkout set deep/deeper1 &&
>>
>> -       test-tool -C sparse-index read-cache --table >cache &&
>> +       git -C sparse-index ls-files --sparse >cache &&
>>         for dir in deep/deeper2 folder1 folder2 x
>>         do
>> -               TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
>> -               grep "040000 tree $TREE $dir/" cache \
>> -                       || return 1
>> +               test_index_entry_like sparse-index "040000 %s 0\t%s" "HEAD:$dir" "$dir/" cache || return 1
>>         done &&
>>
>> +       grep 040000 cache >lines &&
>> +       test_line_count = 4 lines &&
>> +
>>         # Disabling the sparse-index removes tree entries with full ones
>>         git -C sparse-index sparse-checkout init --no-sparse-index &&
>>
>> -       test-tool -C sparse-index read-cache --table >cache &&
>> -       ! grep "040000 tree" cache &&
>> -       test_sparse_match test-tool read-cache --table
>> +       git -C sparse-index ls-files --sparse >cache &&
>> +       ! grep "^040000 " cache >lines &&
>> +       test_sparse_match git ls-tree -r HEAD
>>  '
>>
>>  test_expect_success 'expanded in-memory index matches full index' '
>>         init_repos &&
>> -       test_sparse_match test-tool read-cache --expand --table
>> +       test_sparse_match git ls-tree -r HEAD
>>  '
>>
>>  test_expect_success 'status with options' '
>> @@ -394,9 +413,9 @@ test_expect_success 'submodule handling' '
>>         test_all_match git commit -m "add submodule" &&
>>
>>         # having a submodule prevents "modules" from collapse
>> -       test-tool -C sparse-index read-cache --table >cache &&
>> -       grep "100644 blob .*    modules/a" cache &&
>> -       grep "160000 commit $(git -C initial-repo rev-parse HEAD)       modules/sub" cache
>> +       git -C sparse-index ls-files --sparse >cache &&
>> +       test_index_entry_like sparse-index "100644 %s 0\t%s" "HEAD:modules/a" "modules/a" cache &&
>> +       test_index_entry_like sparse-index "160000 %s 0\t%s" "HEAD:modules/sub" "modules/sub" cache
>>  '
>>
>>  test_expect_success 'sparse-index is expanded and converted back' '
>> --
>> 2.31.0.260.g719c683c1d
>
> I do like the tests and your idea that we can use ls-files to list
> whatever entries are in the index, I just think the tests should use
> --stage to do that.



* Re: [RFC/PATCH 0/5] Re: [PATCH v3 07/20] test-read-cache: print cache entries with --table
  2021-03-17 13:28       ` [RFC/PATCH 0/5] " Ævar Arnfjörð Bjarmason
@ 2021-03-17 18:28         ` Elijah Newren
  2021-03-17 19:46           ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Elijah Newren @ 2021-03-17 18:28 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, Derrick Stolee, SZEDER Gábor,
	Derrick Stolee, Derrick Stolee

On Wed, Mar 17, 2021 at 6:28 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> > From: Derrick Stolee <dstolee@microsoft.com>
> >
> > This table is helpful for discovering data in the index to ensure it is
> > being written correctly, especially as we build and test the
> > sparse-index. This table includes an output format similar to 'git
> > ls-tree', but should not be compared to that directly. The biggest
> > reasons are that 'git ls-tree' includes a tree entry for every
> > subdirectory, even those that would not appear as a sparse directory in
> > a sparse-index. Further, 'git ls-tree' does not use a trailing directory
> > separator for its tree rows.
> >
> > This does not print the stat() information for the blobs. That could be
> > added in a future change with another option. The tests that are added
> > in the next few changes care only about the object types and IDs.
> >
> > To make the option parsing slightly more robust, wrap the string
> > comparisons in a loop adapted from test-dir-iterator.c.
> >
> > Care must be taken with the final check for the 'cnt' variable. We
> > continue the expectation that the numerical value is the final argument.
> >
> > Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> > ---
> >  t/helper/test-read-cache.c | 55 +++++++++++++++++++++++++++++++-------
> >  1 file changed, 45 insertions(+), 10 deletions(-)
> >
> > diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
> > index 244977a29bdf..6cfd8f2de71c 100644
> > --- a/t/helper/test-read-cache.c
> > +++ b/t/helper/test-read-cache.c
> > @@ -1,36 +1,71 @@
> >  #include "test-tool.h"
> >  #include "cache.h"
> >  #include "config.h"
> > +#include "blob.h"
> > +#include "commit.h"
> > +#include "tree.h"
> > +
> > +static void print_cache_entry(struct cache_entry *ce)
> > +{
> > +     const char *type;
> > +     printf("%06o ", ce->ce_mode & 0177777);
> > +
> > +     if (S_ISSPARSEDIR(ce->ce_mode))
> > +             type = tree_type;
> > +     else if (S_ISGITLINK(ce->ce_mode))
> > +             type = commit_type;
> > +     else
> > +             type = blob_type;
> > +
> > +     printf("%s %s\t%s\n",
> > +            type,
> > +            oid_to_hex(&ce->oid),
> > +            ce->name);
> > +}
> > +
>
> So we have a test tool that's mostly ls-files but mocks the output
> ls-tree would emit. Won't these tests eventually care about what stage
> things are in?
>
> What follows is an RFC series on top. It is the result of me wondering
> why, if we're adding new index constructs, we aren't updating our
> plumbing to emit that data. Can we just add this to ls-files and drop
> this test helper?
>
> Turns out: Yes we can.

I like the idea of having ls-files be usable to show the entries that
are in the index; that seems great to me.  I very much dislike the
--sparse flag to ls-files, as noted on that commit.

Also, as a minor point, the first two patches seemed a bit confusing
to me.  The first commit said that it was there solely to make "the
next commit" easier, and the second was worded as just making the next
patch easier, which made me wonder if the wording in the first commit
message was referring to 3/5 when it said "the next commit".  Both of
the first two commits were so tiny that if they are both prep for 3/5,
maybe it makes sense to combine them (together or both to 3/5)?  If
not, maybe the commit messages could be cleaned up or clarified a bit?

> Ævar Arnfjörð Bjarmason (5):
>   ls-files: defer read_index() after parse_options() etc.
>   ls-files: make "mode" in show_ce() loop a variable
>   ls-files: add and use a new --sparse option
>   test-tool read-cache: --table is redundant to ls-files
>   test-tool: split up test-tool read-cache
>
>  Documentation/git-ls-files.txt           |  4 ++
>  Makefile                                 |  3 +-
>  builtin/ls-files.c                       | 29 +++++++--
>  t/helper/test-read-cache-again.c         | 31 +++++++++
>  t/helper/test-read-cache-perf.c          | 21 ++++++
>  t/helper/test-read-cache.c               | 82 ------------------------
>  t/helper/test-tool.c                     |  3 +-
>  t/helper/test-tool.h                     |  3 +-
>  t/perf/p0002-read-cache.sh               |  2 +-
>  t/t1091-sparse-checkout-builtin.sh       |  9 +--
>  t/t1092-sparse-checkout-compatibility.sh | 57 ++++++++++------
>  t/t7519-status-fsmonitor.sh              |  2 +-
>  12 files changed, 131 insertions(+), 115 deletions(-)
>  create mode 100644 t/helper/test-read-cache-again.c
>  create mode 100644 t/helper/test-read-cache-perf.c
>  delete mode 100644 t/helper/test-read-cache.c
>
> --
> 2.31.0.260.g719c683c1d


* Re: [RFC/PATCH 3/5] ls-files: add and use a new --sparse option
  2021-03-17 18:27           ` Ævar Arnfjörð Bjarmason
@ 2021-03-17 18:44             ` Elijah Newren
  0 siblings, 0 replies; 203+ messages in thread
From: Elijah Newren @ 2021-03-17 18:44 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, Derrick Stolee, SZEDER Gábor,
	Derrick Stolee, Derrick Stolee

On Wed, Mar 17, 2021 at 11:27 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> On Wed, Mar 17 2021, Elijah Newren wrote:
>
> > On Wed, Mar 17, 2021 at 6:28 AM Ævar Arnfjörð Bjarmason
> > <avarab@gmail.com> wrote:
> >>
> >> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> >> ---
> >>  Documentation/git-ls-files.txt           |  4 ++
> >>  builtin/ls-files.c                       | 10 ++++-
> >>  t/t1091-sparse-checkout-builtin.sh       |  9 ++--
> >>  t/t1092-sparse-checkout-compatibility.sh | 57 ++++++++++++++++--------
> >>  4 files changed, 56 insertions(+), 24 deletions(-)
> >>
> >> diff --git a/Documentation/git-ls-files.txt b/Documentation/git-ls-files.txt
> >> index 6d11ab506b..1145e960a4 100644
> >> --- a/Documentation/git-ls-files.txt
> >> +++ b/Documentation/git-ls-files.txt
> >> @@ -71,6 +71,10 @@ OPTIONS
> >>  --unmerged::
> >>         Show unmerged files in the output (forces --stage)
> >>
> >> +--sparse::
> >> +       Show sparse directories in the output instead of expanding
> >> +       them (forces --stage)
> >> +
> >>  -k::
> >>  --killed::
> >>         Show files on the filesystem that need to be removed due
> >> diff --git a/builtin/ls-files.c b/builtin/ls-files.c
> >> index 4db75351f2..1ebbb63c10 100644
> >> --- a/builtin/ls-files.c
> >> +++ b/builtin/ls-files.c
> >> @@ -26,6 +26,7 @@ static int show_deleted;
> >>  static int show_cached;
> >>  static int show_others;
> >>  static int show_stage;
> >> +static int show_sparse;
> >>  static int show_unmerged;
> >>  static int show_resolve_undo;
> >>  static int show_modified;
> >> @@ -639,6 +640,8 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
> >>                         DIR_SHOW_IGNORED),
> >>                 OPT_BOOL('s', "stage", &show_stage,
> >>                         N_("show staged contents' object name in the output")),
> >> +               OPT_BOOL(0, "sparse", &show_sparse,
> >> +                       N_("show unexpanded sparse directories in the output")),
> >>                 OPT_BOOL('k', "killed", &show_killed,
> >>                         N_("show files on the filesystem that need to be removed")),
> >>                 OPT_BIT(0, "directory", &dir.flags,
> >> @@ -705,12 +708,17 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
> >>                 tag_skip_worktree = "S ";
> >>                 tag_resolve_undo = "U ";
> >>         }
> >> +       if (show_sparse) {
> >> +               prepare_repo_settings(the_repository);
> >> +               the_repository->settings.command_requires_full_index = 0;
> >> +       }
> >>         if (show_modified || show_others || show_deleted || (dir.flags & DIR_SHOW_IGNORED) || show_killed)
> >>                 require_work_tree = 1;
> >> -       if (show_unmerged)
> >> +       if (show_unmerged || show_sparse)
> >>                 /*
> >>                  * There's no point in showing unmerged unless
> >>                  * you also show the stage information.
> >> +                * The same goes for the --sparse option.
> >
> > Yuck, haven't you just made --sparse an alias for --stage?  Why does
> > it need an alias?
>
> It doesn't, but --unmerged, the one other option which purely modifies
> --stage output, implies --stage.

--unmerged modifies --stage output.  --sparse won't.  (Maybe it does
_now_ because the command doesn't yet support sparse-indexes, but
that's a temporary artifact.  Long term, there should be no difference
in the output.)

> So it's in line with the existing UI convention in the command; it's
> probably better to keep following that than to have new options behave
> differently.
>
> But yeah, we could spell out --stage --sparse in the tests.

There should not be a --sparse option.  The index is _already_ sparse
and users had to take multiple steps to make it so; users shouldn't
have to repeat, with each and every command they ever type, that they
want sparse behavior when they've already created a sparse index.

You should just spell it "--stage".

> > Was the goal just to get a quick way to make the command run under
> > repo->settings.command_requires_full_index = 0 without auditing the
> > codepaths?  It seems to rely on them having been audited anyway, since
> > it just falls back to the code used for --stage, so I don't see how it
> > helps.  It also suggests the command might do unexpected or weird
> > things if run without the --sparse option?  If people manually
> > configure a sparse-checkout and cone mode AND a sparse-index (it's
> > annoying how they have to specify all three instead of having to just
> > pass one flag somewhere), then now we also need to force them to
> > remember to pass extra flags to random various commands for them to
> > operate in a sane manner in their environment??
> >
> > I think this is a bad path to go down.
>
> Those are probably good points; I don't have enough overview of the
> whole sparse thing yet to say.
>
> I just thought it didn't make sense to have a series changing the nature
> of the index without corresponding tooling changes to interrogate the
> state of the index.

That makes sense to me; I agree with you on that point.

> > However, if you want to write the necessary tests to make it so that
> > ls-files can operate with command_requires_full_index = 0, then I
> > think that's useful.  If you want to add a special flag so that folks
> > in a sparse-checkout-with-cone-mode-with-sparse-index setup can
> > operate densely (i.e. to show what files would be in the index if it
> > were fully populated), then I think that's useful.  But having
> > sparse-yes-with-cone-yes-very-sparse folks need to specify an extra
> > flag to commands to get sparse behavior just seems wrong to me.
>
> Maybe, but what else do you suggest for getting this information out of
> the index?

Use git ls-files without new options...as I stated here:

...
> > I do like the tests and your idea that we can use ls-files to list
> > whatever entries are in the index, I just think the tests should use
> > --stage to do that.

In other words, I think making "git ls-files" the first, or at least
one of the first, commands to be modified to behave properly in a
sparse-index world is what you should be aiming for, not some
new-option-shortcut that'll make no sense long term and persist
indefinitely.

List the entries in the index:
`git ls-files`

List the entries in the index with their hash, mode, and stage:
`git ls-files --stage`

List all the entries that would be in the index if it weren't sparse:
`git ls-files --$SOME_NEW_OPTION_NAME`

You don't need to implement the --$SOME_NEW_OPTION_NAME yet, of
course.  We can just note that it's the plan to add it later.
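
For reference, a minimal sketch of the first two commands against a
throwaway repository with an ordinary (full) index. The file names are
arbitrary, and nothing here exercises a sparse index; it only shows the
existing output that the proposal would build on:

```shell
#!/bin/sh
# Build a throwaway repository with one top-level file and one nested file.
repo=$(mktemp -d) &&
cd "$repo" &&
git init -q . &&
mkdir folder1 &&
echo a >a &&
echo b >folder1/b &&
git add a folder1/b &&

# Paths only, one per line.
git ls-files >paths &&

# Mode, object ID and stage number, then a tab and the path.
git ls-files --stage >staged &&

cat paths staged
```

With a full index every line is a file entry; whether a sparse index
should show its directory entries here by default is the question
under discussion.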


* Re: [RFC/PATCH 0/5] Re: [PATCH v3 07/20] test-read-cache: print cache entries with --table
  2021-03-17 18:28         ` Elijah Newren
@ 2021-03-17 19:46           ` Derrick Stolee
  2021-03-17 20:26             ` Elijah Newren
  0 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee @ 2021-03-17 19:46 UTC (permalink / raw)
  To: Elijah Newren, Ævar Arnfjörð Bjarmason
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

On 3/17/2021 2:28 PM, Elijah Newren wrote:
> On Wed, Mar 17, 2021 at 6:28 AM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>>> From: Derrick Stolee <dstolee@microsoft.com>

>>
>> So we have a test tool that's mostly ls-files but mocks the output
>> ls-tree would emit, won't these tests eventually care about what stage
>> things are in?
>>
>> What follows is an RFC series on top that's the result of me wondering
>> why if we're adding new index constructs we aren't updating our
>> plumbing to emit that data, can we just add this to ls-files and drop
>> this test helper?
>>
>> Turns out: Yes we can.
> 
> I like the idea of having ls-files be usable to show the entries that
> are in the index; that seems great to me.  I very much dislike the
> --sparse flag to ls-files, as noted on that commit.

I don't like this idea. I don't think exposing internal structures
like this is something we want to do so quickly. Further, I intend
to use this test tool in the future to _also_ show the stored stat()
data, which would be inappropriate here in ls-files.

I would prefer to continue using the test helper here and leave
functional changes to ls-files to be considered independently.

Thanks,
-Stolee


* Re: [PATCH v3 11/20] sparse-index: convert from full to sparse
  2021-03-17 13:43       ` Ævar Arnfjörð Bjarmason
@ 2021-03-17 19:55         ` Derrick Stolee
  2021-03-18 13:38           ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee @ 2021-03-17 19:55 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Derrick Stolee via GitGitGadget
  Cc: git, newren, gitster, pclouds, jrnieder, Martin Ågren,
	SZEDER Gábor, Derrick Stolee, Derrick Stolee

On 3/17/2021 9:43 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Tue, Mar 16 2021, Derrick Stolee via GitGitGadget wrote:
>> @@ -251,6 +251,8 @@ static inline unsigned int create_ce_mode(unsigned int mode)
>>  {
>>  	if (S_ISLNK(mode))
>>  		return S_IFLNK;
>> +	if (mode == S_IFDIR)
>> +		return S_IFDIR;
> 
> Does this actually need to be mode == S_IFDIR vs. S_ISDIR(mode)? Those
> aren't the same thing...
> 
>>  	if (S_ISDIR(mode) || S_ISGITLINK(mode))
>>  		return S_IFGITLINK;
> 
> ...and if it can be S_ISDIR(mode) then this becomes just
> S_ISGITLINK(mode), but losing the "if" there makes me suspect that some
> dir == submodule heuristic is being broken somewhere..
 
I have a vague recollection that I did that at one point, and
it didn't work. However, using the simpler

	if (S_ISDIR(mode))
		return S_IFDIR;
	if (S_ISGITLINK(mode))
		return S_IFGITLINK;

passes all of my tests.

Looking at the history of create_ce_mode(), this "||"
condition was created in this commit:

commit 9eec4795d44439cd170fb52c73827c728252648d
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Mon Apr 9 21:14:58 2007 -0700

    Add "S_IFDIRLNK" file mode infrastructure for git links
    
    This just adds the basic helper functions to recognize and work with git
    tree entries that are links to other git repositories ("subprojects").
    They still aren't actually connected up to any of the code-paths, but
    now all the infrastructure is in place.
    
    The next commit will start actually adding actual subproject support.
    
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Junio C Hamano <junkio@cox.net>

There isn't any justification of why S_ISDIR() is there. Perhaps
it was defensive programming? If that is the case, then this simpler
logic will work.

Thanks,
-Stolee


* Re: [RFC/PATCH 0/5] Re: [PATCH v3 07/20] test-read-cache: print cache entries with --table
  2021-03-17 19:46           ` Derrick Stolee
@ 2021-03-17 20:26             ` Elijah Newren
  2021-03-17 20:34               ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Elijah Newren @ 2021-03-17 20:26 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Ævar Arnfjörð Bjarmason, Git Mailing List,
	Junio C Hamano, Nguyễn Thái Ngọc,
	Jonathan Nieder, Martin Ågren, SZEDER Gábor,
	Derrick Stolee, Derrick Stolee

On Wed, Mar 17, 2021 at 12:46 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 3/17/2021 2:28 PM, Elijah Newren wrote:
> > On Wed, Mar 17, 2021 at 6:28 AM Ævar Arnfjörð Bjarmason
> > <avarab@gmail.com> wrote:
> >>
> >>> From: Derrick Stolee <dstolee@microsoft.com>
>
> >>
> >> So we have a test tool that's mostly ls-files but mocks the output
> >> ls-tree would emit, won't these tests eventually care about what stage
> >> things are in?
> >>
> >> What follows is an RFC series on top that's the result of me wondering
> >> why if we're adding new index constructs we aren't updating our
> >> plumbing to emit that data, can we just add this to ls-files and drop
> >> this test helper?
> >>
> >> Turns out: Yes we can.
> >
> > I like the idea of having ls-files be usable to show the entries that
> > are in the index; that seems great to me.  I very much dislike the
> > --sparse flag to ls-files, as noted on that commit.
>
> I don't like this idea. I don't think exposing internal structures
> like this is something we want to do so quickly.

Not sure I follow; ls-files was already about exposing three bits of
internal structures for index entries: mode, hash, and stage number.
These are quantities that are well-defined for sparse directories too.
It would not be exposing any new or different internal structures, nor
changing the output format.  (Ævar changed the tests to not look for
"tree" but to look for the "040000" mode number.)

>  Further, I intend
> to use this test tool in the future to _also_ show the stored stat()
> data, which would be inappropriate here in ls-files.
>
> I would prefer to continue using the test helper here and leave
> functional changes to ls-files to be considered independently.

Well, I was okay with it being in a test helper regardless of whether
it could be done with ls-files, and then just circling back and fixing
up ls-files later.  But perhaps it's worth calling out in the commit
message about your plans to add stat() data and how that future piece
can't be done in ls-files (without functional changes of some sort)
just to make it clearer why we're using a test helper instead of
front-loading the port of ls-files over to sparse-indexes?


* Re: [RFC/PATCH 0/5] Re: [PATCH v3 07/20] test-read-cache: print cache entries with --table
  2021-03-17 20:26             ` Elijah Newren
@ 2021-03-17 20:34               ` Derrick Stolee
  0 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee @ 2021-03-17 20:34 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Ævar Arnfjörð Bjarmason, Git Mailing List,
	Junio C Hamano, Nguyễn Thái Ngọc,
	Jonathan Nieder, Martin Ågren, SZEDER Gábor,
	Derrick Stolee, Derrick Stolee

On 3/17/2021 4:26 PM, Elijah Newren wrote:
> On Wed, Mar 17, 2021 at 12:46 PM Derrick Stolee <stolee@gmail.com> wrote:
>>
>> On 3/17/2021 2:28 PM, Elijah Newren wrote:
>>> On Wed, Mar 17, 2021 at 6:28 AM Ævar Arnfjörð Bjarmason
>>> <avarab@gmail.com> wrote:
>>>>
>>>>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>>>>
>>>> So we have a test tool that's mostly ls-files but mocks the output
>>>> ls-tree would emit, won't these tests eventually care about what stage
>>>> things are in?
>>>>
>>>> What follows is an RFC series on top that's the result of me wondering
>>>> why if we're adding new index constructs we aren't updating our
>>>> plumbing to emit that data, can we just add this to ls-files and drop
>>>> this test helper?
>>>>
>>>> Turns out: Yes we can.
>>>
>>> I like the idea of having ls-files be usable to show the entries that
>>> are in the index; that seems great to me.  I very much dislike the
>>> --sparse flag to ls-files, as noted on that commit.
>>
>> I don't like this idea. I don't think exposing internal structures
>> like this is something we want to do so quickly.
> 
> Not sure I follow; ls-files was already about exposing three bits of
> internal structures for index entries: mode, hash, and stage number.
> These are quantities that are well-defined for sparse directories too.
> It would not be exposing any new or different internal structures, nor
> changing the output format.  (Ævar changed the tests to not look for
> "tree" but to look for the "040000" mode number.)

True, that is some internal information already.

>>  Further, I intend
>> to use this test tool in the future to _also_ show the stored stat()
>> data, which would be inappropriate here in ls-files.
>>
>> I would prefer to continue using the test helper here and leave
>> functional changes to ls-files to be considered independently.
> 
> Well, I was okay with it being in a test helper regardless of whether
> it could be done with ls-files, and then just circling back and fixing
> up ls-files later.  But perhaps it's worth calling out in the commit
> message about your plans to add stat() data and how that future piece
> can't be done in ls-files (without functional changes of some sort)
> just to make it clearer why we're using a test helper instead of
> front-loading the port of ls-files over to sparse-indexes?

Adding this justification to the commit message would definitely be
helpful, so I will do that.

Thanks,
-Stolee


* Re: [RFC/PATCH 3/5] ls-files: add and use a new --sparse option
  2021-03-17 13:28       ` [RFC/PATCH 3/5] ls-files: add and use a new --sparse option Ævar Arnfjörð Bjarmason
  2021-03-17 18:19         ` Elijah Newren
@ 2021-03-17 20:43         ` Derrick Stolee
  2021-03-24  0:52           ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 203+ messages in thread
From: Derrick Stolee @ 2021-03-17 20:43 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	SZEDER Gábor, Derrick Stolee, dstolee

On 3/17/2021 9:28 AM, Ævar Arnfjörð Bjarmason wrote:
> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh

I want to learn from your suggested changes to the test,
so forgive my questions here:
  
> +test_index_entry_like () {
> +	dir=$1
> +	shift
> +	fmt=$1
> +	shift
> +	rev=$1
> +	shift
> +	entry=$1
> +	shift
> +	file=$1
> +	shift

Why all the shifts? Why not just use $1, $2, $3,...? My
guess is that you want to be able to insert a new parameter
in the middle in the future without changing the later
numbers, but that seems unlikely, and we could just add
the parameter at the end.

> +	hash=$(git -C "$dir" rev-parse "$rev") &&
> +	printf "$fmt\n" "$hash" "$entry" >expected &&
> +	if grep "$entry" "$file" >line
> +	then
> +		test_cmp expected line
> +	else
> +		cat cache &&
> +		false
> +	fi
> +}
> +
>  test_expect_success 'sparse-index contents' '
>  	init_repos &&
>  
> -	test-tool -C sparse-index read-cache --table >cache &&
> +	git -C sparse-index ls-files --sparse >cache &&
>  	for dir in folder1 folder2 x
>  	do
> -		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
> -		grep "040000 tree $TREE	$dir/" cache \
> -			|| return 1
> +		test_index_entry_like sparse-index "040000 %s 0\t%s" "HEAD:$dir" "$dir/" cache || return 1

I see how this uses only one line, but test_index_entry_like
seems so generic that each call becomes a complicated mess of
format strings that must be copied over and over again.

Perhaps instead it could be a "test_entry_is_tree" helper
that takes only "cache" and "$dir"? Then we could drop the loop and
just have

	test_entry_is_tree cache folder1 &&
	test_entry_is_tree cache folder2 &&
	test_entry_is_tree cache x &&

or we could still use the loop, especially when we test for four trees.

> -	test-tool -C sparse-index read-cache --table >cache &&
> +	git -C sparse-index ls-files --sparse >cache &&
>  	for dir in deep/deeper2 folder1 folder2 x
>  	do
> -		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
> -		grep "040000 tree $TREE	$dir/" cache \
> -			|| return 1
> +		test_index_entry_like sparse-index "040000 %s 0\t%s" "HEAD:$dir" "$dir/" cache || return 1
>  	done &&
>  
> +	grep 040000 cache >lines &&
> +	test_line_count = 4 lines &&
> +

The point here is to check that no other entries are trees? We know
that this number will be _at least_ 4 based on the loop above.

Thanks,
-Stolee


* Re: [PATCH v3 11/20] sparse-index: convert from full to sparse
  2021-03-17 19:55         ` Derrick Stolee
@ 2021-03-18 13:38           ` Derrick Stolee
  0 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee @ 2021-03-18 13:38 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Derrick Stolee via GitGitGadget
  Cc: git, newren, gitster, pclouds, jrnieder, Martin Ågren,
	SZEDER Gábor, Derrick Stolee, Derrick Stolee

On 3/17/2021 3:55 PM, Derrick Stolee wrote:
> On 3/17/2021 9:43 AM, Ævar Arnfjörð Bjarmason wrote:
>>
>> On Tue, Mar 16 2021, Derrick Stolee via GitGitGadget wrote:
>>> @@ -251,6 +251,8 @@ static inline unsigned int create_ce_mode(unsigned int mode)
>>>  {
>>>  	if (S_ISLNK(mode))
>>>  		return S_IFLNK;
>>> +	if (mode == S_IFDIR)
>>> +		return S_IFDIR;
>>
>> Does this actually need to be mode == S_IFDIR vs. S_ISDIR(mode)? Those
>> aren't the same thing...
>>
>>>  	if (S_ISDIR(mode) || S_ISGITLINK(mode))
>>>  		return S_IFGITLINK;
>>
>> ...and if it can be S_ISDIR(mode) then this becomes just
>> S_ISGITLINK(mode), but losing the "if" there makes me suspect that some
>> dir == submodule heuristic is being broken somewhere..
>  
> I have a vague recollection that I did that at one point, and
> it didn't work. However, using the simpler
> 
> 	if (S_ISDIR(mode))
> 		return S_IFDIR;
> 	if (S_ISGITLINK(mode))
> 		return S_IFGITLINK;
> 
> passes all of my tests.

I'm not sure why it was passing yesterday (maybe I was in the
wrong worktree) but I _do_ get failures, such as this one in t2105:

expecting success of 2105.4 'add gitlink to relative .git file': 
        git update-index --add -- sub2

+ git update-index --add -- sub2
warning: index entry is a directory, but not sparse (00000000)
error: Could not read 50e526bb426771f6036ad3a8b0c81d511d91fc2a
BUG: read-cache.c:324: unsupported ce_mode: 40000
Aborted (core dumped)
error: last command exited with $?=134
not ok 4 - add gitlink to relative .git file
#
#               git update-index --add -- sub2
#

In this case, the mode that is specified is equal to 040775,
so we need to use the permission bits outside of __S_IFMT
(0170000) to determine if this is a sparse directory or a
submodule entry. Submodules will never be sparse, so
permissions matter. Sparse directories never actually exist,
so permissions don't matter.

Playing around with it, I still only see the exact equality
as working for me.

I can, however, use this format for these if statements:

	if (S_ISSPARSEDIR(mode))
		return S_IFDIR;
	if (S_ISDIR(mode) || S_ISGITLINK(mode))
		return S_IFGITLINK;

The S_ISSPARSEDIR macro expands to the exact equality.
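
A minimal sketch of the distinction above, assuming the conventional
octal file-type values. The `SKETCH_`-prefixed names and the regular-file
fallback are illustrative assumptions rather than the code in cache.h;
only the order of the checks follows the snippet quoted in this message:

```c
#include <assert.h>
#include <stdio.h>

/* Conventional octal file-type values; the SKETCH_ prefix is made up
 * for this example so nothing clashes with <sys/stat.h>. */
#define SKETCH_IFMT      0170000u
#define SKETCH_IFREG     0100000u
#define SKETCH_IFLNK     0120000u
#define SKETCH_IFDIR     0040000u
#define SKETCH_IFGITLINK 0160000u

/* Type checks that mask off the permission bits first. */
#define SKETCH_ISLNK(m)     (((m) & SKETCH_IFMT) == SKETCH_IFLNK)
#define SKETCH_ISDIR(m)     (((m) & SKETCH_IFMT) == SKETCH_IFDIR)
#define SKETCH_ISGITLINK(m) (((m) & SKETCH_IFMT) == SKETCH_IFGITLINK)

/* Exact equality: true only when no permission bits are set at all. */
#define SKETCH_ISSPARSEDIR(m) ((m) == SKETCH_IFDIR)

unsigned int sketch_create_ce_mode(unsigned int mode)
{
	if (SKETCH_ISLNK(mode))
		return SKETCH_IFLNK;
	if (SKETCH_ISSPARSEDIR(mode))
		return SKETCH_IFDIR;
	if (SKETCH_ISDIR(mode) || SKETCH_ISGITLINK(mode))
		return SKETCH_IFGITLINK;
	/* Assumed fallback: regular file, keeping only the executable bit. */
	return SKETCH_IFREG | ((mode & 0100) ? 0755 : 0644);
}

/* Exercises the two directory cases discussed in this message. */
void sketch_demo(void)
{
	/* Bare 040000: treated as a sparse directory entry. */
	assert(sketch_create_ce_mode(0040000) == SKETCH_IFDIR);
	/* 040775 passes SKETCH_ISDIR() but not the exact-equality check. */
	assert(SKETCH_ISDIR(0040775u) && !SKETCH_ISSPARSEDIR(0040775u));
	assert(sketch_create_ce_mode(0040775) == SKETCH_IFGITLINK);
	printf("%06o -> %06o\n", 0040775u, sketch_create_ce_mode(0040775));
}
```

With these definitions a bare 040000 maps to a sparse directory, while
040775 (a directory mode with permission bits) still falls through to
the gitlink case, which is the case the t2105 failure above exercises.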

Now, if we intended to make this work differently, then a
change would be required to construct_sparse_dir_entry()
in sparse-index.c:

static struct cache_entry *construct_sparse_dir_entry(
				struct index_state *istate,
				const char *sparse_dir,
				struct cache_tree *tree)
{
	struct cache_entry *de;

	de = make_cache_entry(istate, S_IFDIR, &tree->oid, sparse_dir, 0, 0);

	de->ce_flags |= CE_SKIP_WORKTREE;
	return de;
}

For instance, we could at this point assign de->ce_mode to
be S_IFDIR directly. It seems like the wrong place to do that
to me, but I'm open to suggestions.

Thanks,
-Stolee


* Re: [PATCH v3 00/20] Sparse Index: Design, Format, Tests
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (21 preceding siblings ...)
  2021-03-16 21:18     ` Elijah Newren
@ 2021-03-18 21:50     ` Junio C Hamano
  2021-03-19 13:00       ` Derrick Stolee
  2021-03-23 13:44     ` [PATCH v4 " Derrick Stolee via GitGitGadget
  23 siblings, 1 reply; 203+ messages in thread
From: Junio C Hamano @ 2021-03-18 21:50 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, newren, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> For this version, I took Ævar's latest patches and applied them to v2.31.0
> and rebased this series on top. It uses his new "read_tree_at()" helper and
> the associated changes to the function pointer type.
>
>  * Fixed more typos. Thanks Martin and Elijah!
>  * Updated the test_sparse_match() macro to use "$@" instead of $*
>  * Added a test that git sparse-checkout init --no-sparse-index rewrites the
>    index to be full.

Thanks.  I expect ab/read-tree would be rerolled at least one more
time, if only to straighten out the "oops #5 was screwy, let's patch
it up on top with three more steps", but I do not expect the end
state would be all that different, so tentatively I'll queue these
patches on top of the latest iteration of the topic for now and
hope that the other topic will be updated soonish.




* Re: [PATCH v3 00/20] Sparse Index: Design, Format, Tests
  2021-03-18 21:50     ` Junio C Hamano
@ 2021-03-19 13:00       ` Derrick Stolee
  0 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee @ 2021-03-19 13:00 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, newren, pclouds, jrnieder, Martin Ågren,
	SZEDER Gábor, Ævar Arnfjörð Bjarmason,
	Derrick Stolee



On 3/18/2021 5:50 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> For this version, I took Ævar's latest patches and applied them to v2.31.0
>> and rebased this series on top. It uses his new "read_tree_at()" helper and
>> the associated changes to the function pointer type.
>>
>>  * Fixed more typos. Thanks Martin and Elijah!
>>  * Updated the test_sparse_match() macro to use "$@" instead of $*
>>  * Added a test that git sparse-checkout init --no-sparse-index rewrites the
>>    index to be full.
> 
> Thanks.  I expect ab/read-tree would be rerolled at least one more
> time, if only to straighten out the "oops #5 was screwy, let's patch
> it up on top with three more steps", but I do not expect the end
> state would be all that different, so tentatively I'll queue these
> patches on top of the latest iteration of the topic for now and
> hope that the other topic will be updated soonish.

Thanks. I'm grateful that it can spend some time in 'seen' if only
to avoid these conflicts in the meantime.

I'm waiting for that reroll of ab/read-tree before updating this
version with the feedback from v3.

Thanks,
-Stolee


* Re: [PATCH v3 01/20] sparse-index: design doc and format update
  2021-03-16 16:42     ` [PATCH v3 01/20] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
@ 2021-03-19 23:43       ` Junio C Hamano
  2021-03-23 11:16         ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Junio C Hamano @ 2021-03-19 23:43 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, newren, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> This begins a long effort to update the index format to allow sparse
> directory entries. This should result in a significant improvement to
> Git commands when HEAD contains millions of files, but the user has
> selected many fewer files to keep in their sparse-checkout definition.

This compromise makes sense.

In the past, we often dreamed of recording trees in the index
(instead of using a bolted on extension like cache-tree, treating
trees as first-class citizens) and lazily expanding it only when the
user starts modifying the paths within the subdirectory.

But such an optimization never materialized, as the dual and
conflicting nature of the index to keep track of the contents for
the "next" commit (for which it is sufficient to just record trees
for parts that have not been modified) and to cache stat information
to detect which working tree paths may possibly have modifications
(for which, we used the one-entry-per-path nature of the cache
entries so far) was never resolved.

But if we limit the use of trees-in-index for sparse/cone checkout
case, we do not even have to worry about having to cache the stat
information for those paths that we are not going to populate in the
working tree at all.  It is a great simplification of the problem.

> +  These entries have mode `0040000`, include the `SKIP_WORKTREE` bit, and
> +  the path ends in a directory separator.
> +

Why leading two 0's?  At the tree object level, we do not 0-pad blob
mode word, and if you are writing for C programmers, you need only
one '0' prefix to signal that it is in octal (in the on-disk index
file, the blob mode word is stored in a be16 word).
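
The three properties named in the hunk above can be checked with a small
predicate. This is a sketch under assumptions: `struct sketch_entry`,
`sketch_is_sparse_dir()` and the flag value are hypothetical stand-ins,
not Git's `struct cache_entry` or its real flag constants.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_IFDIR         0040000u
#define SKETCH_SKIP_WORKTREE (1u << 30)	/* hypothetical flag bit */

/* Hypothetical, stripped-down stand-in for an index entry. */
struct sketch_entry {
	unsigned int mode;
	unsigned int flags;
	const char *path;
};

/* Returns 1 when the entry has directory mode with no permission bits,
 * carries the skip-worktree flag, and its path ends in '/'. */
int sketch_is_sparse_dir(const struct sketch_entry *e)
{
	size_t len = strlen(e->path);

	return e->mode == SKETCH_IFDIR &&
	       (e->flags & SKETCH_SKIP_WORKTREE) != 0 &&
	       len > 0 && e->path[len - 1] == '/';
}

void sketch_demo(void)
{
	struct sketch_entry dir = { 040000, SKETCH_SKIP_WORKTREE, "folder1/" };
	struct sketch_entry file = { 0100644, SKETCH_SKIP_WORKTREE, "folder1/a" };

	assert(sketch_is_sparse_dir(&dir));
	assert(!sketch_is_sparse_dir(&file));
	/* One or two leading zeros: the same octal constant in C. */
	assert(040000 == 0040000);
}
```

In C, `040000` and `0040000` denote the same value, so the extra leading
zero changes only how the mode is presented.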

> diff --git a/Documentation/technical/sparse-index.txt b/Documentation/technical/sparse-index.txt
> new file mode 100644
> index 000000000000..aa116406a016
> --- /dev/null
> +++ b/Documentation/technical/sparse-index.txt
> @@ -0,0 +1,173 @@
> +Git Sparse-Index Design Document
> +================================
> +
> +The sparse-checkout feature allows users to focus a working directory on
> +a subset of the files at HEAD. The cone mode patterns, enabled by
> +`core.sparseCheckoutCone`, allow for very fast pattern matching to
> +discover which files at HEAD belong in the sparse-checkout cone.
> +
> +Three important scale dimensions for a Git worktree are:

s/worktree/working tree/; The former is the thing the "git worktree"
command deals with.  The latter is relevant even when "git worktree"
is not used (the traditional "git clone and you get a working tree
to work in").

> +* `HEAD`: How many files are present at `HEAD`?
> +
> +* Populated: How many files are within the sparse-checkout cone?
> +
> +* Modified: How many files has the user modified in the working directory?
> +
> +We will use big-O notation -- O(X) -- to denote how expensive certain
> +operations are in terms of these dimensions.
> +
> +These dimensions are ordered by their magnitude: users (typically) modify
> +fewer files than are populated, and we can only populate files at `HEAD`.

OK.

> +These dimensions are also ordered by how expensive they are per item: it
> +is more expensive to detect a modified file than it is to write one that we
> +know must be populated; changing `HEAD` only really requires updating the
> +index.

This is a bit too dense to grok.  Among Populated, there are some
Modified but it takes lstat(2) per path or fsmonitor listening to
inotify to know which ones are in the Modified set.  Is that the
"expensive" you are referring to here?  I am not sure how you
compared the cost to know if a path is modified or merely populated
with the cost of "write one that we know must be populated" (which I
take as "given a populated file, make modification to it").  Also it
is unclear what you mean by "changing HEAD only require updating the
index".  Certainly when "git switch" flips HEAD from one commit to
another, you'd update the index and update the files in the working
tree (in the Populated part that is in the sparse-checkout cone) to
match, no?

> +Problems occur if there is an extreme imbalance in these dimensions. For
> +example, if `HEAD` contains millions of paths but the populated set has
> +only tens of thousands, then commands like `git status` and `git add` can
> +be dominated by operations that require O(`HEAD`) operations instead of
> +O(Populated). Primarily, the cost is in parsing and rewriting the index,
> +which is filled primarily with files at `HEAD` that are marked with the
> +`SKIP_WORKTREE` bit.
> +
> +The sparse-index intends to take these commands that read and modify the
> +index from O(`HEAD`) to O(Populated). To do this, we need to modify the
> +index format in a significant way: add "sparse directory" entries.

OK.

> +With cone mode patterns, it is possible to detect when an entire
> +directory will have its contents outside of the sparse-checkout definition.
> +Instead of listing all of the files it contains as individual entries, a
> +sparse-index contains an entry with the directory name, referencing the
> +object ID of the tree at `HEAD` and marked with the `SKIP_WORKTREE` bit.
> +If we need to discover the details for paths within that directory, we
> +can parse trees to find that list.

;-)

> +At time of writing, sparse-directory entries violate expectations about the
> +index format and its in-memory data structure. There are many consumers in
> +the codebase that expect to iterate through all of the index entries and
> +see only files.

True.

> In addition, they expect to see all files at `HEAD`.

It is not clear to me what this means.  After "git add", "git
ls-files" would expect to see a file that may not even be in HEAD.
After "git rm", it would expect to see some file missing from the
set of paths in HEAD.  While I do not think that is what you meant
here, it is hard to guess what you wanted to say.

> One
> +way to handle this is to parse trees to replace a sparse-directory entry
> +with all of the files within that tree as the index is loaded. However,
> +parsing trees is slower than parsing the index format, so that is a slower
> +operation than if we left the index alone.

Besides, that would leave in-core index fully populated, so I would
suspect that you'd lose a lot of benefit that comes from having to
keep much fewer entries in the in-core index than what is in HEAD.
It would be nice for "git diff-index --cached" (which is part of
"git status") to be able to skip a single "tree" entry in the sparse
index as "known to be untouched", rather than skipping thousands of paths
in that single subdirectory (in a mega monorepo project) as "these
are marked with SKIP_WORKTREE so ignore what is in the working tree".
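
To illustrate the "expand on load" approach described in the quoted
paragraph, here is a toy model, assuming a flat array of paths and a
one-level lookup table in place of real tree objects. All names
(`sketch_index`, `sketch_tree`, `sketch_expand`) are hypothetical;
nested trees, object IDs and flags are deliberately left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAX       16
#define SKETCH_MAX_FILES 4

/* Toy "tree": the files directly under one sparse directory. */
struct sketch_tree {
	const char *dir;			/* e.g. "folder1/" */
	const char *files[SKETCH_MAX_FILES];	/* NULL-terminated */
};

/* Toy index: just a list of paths. */
struct sketch_index {
	char paths[SKETCH_MAX][64];
	int nr;
};

static int sketch_add(struct sketch_index *idx, const char *dir,
		      const char *name)
{
	if (idx->nr >= SKETCH_MAX)
		return -1;
	snprintf(idx->paths[idx->nr], sizeof(idx->paths[0]), "%s%s", dir, name);
	idx->nr++;
	return 0;
}

/* Copy 'sparse' into 'full', replacing each path that ends in '/' with
 * one entry per file listed for that directory. Returns 0 on success. */
int sketch_expand(const struct sketch_index *sparse,
		  const struct sketch_tree *trees, int nr_trees,
		  struct sketch_index *full)
{
	int i, t, f;

	full->nr = 0;
	for (i = 0; i < sparse->nr; i++) {
		const char *path = sparse->paths[i];
		size_t len = strlen(path);
		const struct sketch_tree *tree = NULL;

		if (len == 0 || path[len - 1] != '/') {
			/* Ordinary file entry: copy it unchanged. */
			if (sketch_add(full, "", path))
				return -1;
			continue;
		}
		for (t = 0; t < nr_trees; t++)
			if (!strcmp(trees[t].dir, path))
				tree = &trees[t];
		if (!tree)
			return -1;	/* unknown directory */
		for (f = 0; f < SKETCH_MAX_FILES && tree->files[f]; f++)
			if (sketch_add(full, path, tree->files[f]))
				return -1;
	}
	return 0;
}

void sketch_demo(void)
{
	static const struct sketch_tree trees[] = {
		{ "folder1/", { "a", "b", NULL } },
	};
	struct sketch_index sparse = { { "deep/file", "folder1/", "z" }, 3 };
	struct sketch_index full;
	int i;

	if (sketch_expand(&sparse, trees, 1, &full))
		return;
	assert(full.nr == 4);	/* 3 sparse entries became 4 file entries */
	for (i = 0; i < full.nr; i++)
		printf("%s\n", full.paths[i]);
}
```

After expansion the in-memory list again holds one entry per file, which
is why this approach keeps the O(`HEAD`) cost in memory even when the
on-disk index is sparse.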

> +The implementation plan below follows four phases to slowly integrate with
> +the sparse-index. The intention is to incrementally update Git commands to
> +interact safely with the sparse-index without significant slowdowns. This
> +may not always be possible, but the hope is that the primary commands that
> +users need in their daily work are dramatically improved.

OK.

> +Phase I: Format and initial speedups
> +------------------------------------
> +
> +During this phase, Git learns to enable the sparse-index and safely parse
> +one. Protections are put in place so that every consumer of the in-memory
> +data structure can operate with its current assumption of every file at
> +`HEAD`.

IOW, before they iterate over the in-core index, tree entries are expanded
into a bunch of individual entries with the SKIP_WORKTREE bit?  Makes sense.

> +At first, every index parse will expand the sparse-directory entries into
> +the full list of paths at `HEAD`. This will be slower in all cases. The
> +only noticeable change in behavior will be that the serialized index file
> +contains sparse-directory entries.

Hmph, do you mean that the expansion is done by not replacing each
"tree" entry with blob entries for the contents of the directory,
but the original "tree" entry is still left in the in-core index?
It is not immediately clear what we are trying to gain by leaving it
in, but let's read on.  Perhaps we can get rid of cache-tree
extension and replace its use with these "tree" entries whose
content paths are populated in the index?

> +To start, we use a new repository extension, `extensions.sparseIndex`, to
> +allow inserting sparse-directory entries into indexes with file format
> +versions 2, 3, and 4. This prevents Git versions that do not understand
> +the sparse-index from operating on one, but it also prevents other
> +operations that do not use the index at all. A new format, index v5, will
> +be introduced that includes sparse-directory entries by default. It might
> +also introduce other features that have been considered for improving the
> +index, as well.

OK.

> +Next, consumers of the index will be guarded against operating on a
> +sparse-index by inserting calls to `ensure_full_index()` or
> +`expand_index_to_path()`. After these guards are in place, we can begin
> +leaving sparse-directory entries in the in-memory index structure.

It is unclear why "we can begin leaving"; an iterator that only
expects to see blobs would need to be updated to skip them, too, no?
They would probably be already skipping blob entries that are marked
with the SKIP_WORKTREE bit, so it may be just a matter of skipping
more things than the current code.

Or did I misread the design presented earlier, and when a directory
that is outside the cone is expanded into the paths of blobs in the
directory, the "tree" entry is removed from the in-core index?

> +Even after inserting these guards, we will keep expanding sparse-indexes
> +for most Git commands using the `command_requires_full_index` repository
> +setting. This setting will be on by default and disabled one builtin at a
> +time until we have sufficient confidence that all of the index operations
> +are properly guarded.

OK.

> +To complete this phase, the commands `git status` and `git add` will be
> +integrated with the sparse-index so that they operate with O(Populated)
> +performance. They will be carefully tested for operations within and
> +outside the sparse-checkout definition.

;-)



* Re: [PATCH v3 01/20] sparse-index: design doc and format update
  2021-03-19 23:43       ` Junio C Hamano
@ 2021-03-23 11:16         ` Derrick Stolee
  2021-03-23 20:10           ` Junio C Hamano
  0 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee @ 2021-03-23 11:16 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, newren, pclouds, jrnieder, Martin Ågren,
	SZEDER Gábor, Ævar Arnfjörð Bjarmason,
	Derrick Stolee, Derrick Stolee

On 3/19/2021 7:43 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> This begins a long effort to update the index format to allow sparse
>> directory entries. This should result in a significant improvement to
>> Git commands when HEAD contains millions of files, but the user has
>> selected many fewer files to keep in their sparse-checkout definition.
> 
> This compromise makes sense.
> 
> In the past, we often dreamed of recording trees in the index
> (instead of using a bolted on extension like cache-tree, treating
> trees as first-class citizens) and lazily expanding it only when the
> user starts modifying the paths within the subdirectory.
> 
> But such an optimization never materialized, as the dual and
> conflicting nature of the index to keep track of the contents for
> the "next" commit (for which it is sufficient to just record trees
> for parts that have not been modified) and to cache stat information
> to detect which working tree paths may possibly have modifications
> (for which, we used the one-entry-per-path nature of the cache
> entries so far) was never resolved.
> 
> But if we limit the use of trees-in-index for sparse/cone checkout
> case, we do not even have to worry about having to cache the stat
> information for those paths that we are not going to populate in the
> working tree at all.  It is a great simplification of the problem.

Thanks. I appreciate your input here.
 
>> +  These entries have mode `0040000`, include the `SKIP_WORKTREE` bit, and
>> +  the path ends in a directory separator.
>> +
> 
> Why leading two 0's?  At the tree object level, we do not 0-pad blob
> mode word, and if you are writing for C programmers, you need only
> one '0' prefix to signal that it is in octal (in the on-disk index
> file, the blob mode word is stored in a be16 word).

Fixed.
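
To make the point about octal notation concrete, here is a minimal C
sketch. The helper name `format_tree_mode` is made up for illustration
and is not a Git function. It only shows that extra leading zeros do not
change the value of an octal literal, and that printing the mode without
padding gives the spelling a tree object uses for a directory.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Git): write `mode` as octal digits
 * with no zero padding, which is how tree objects spell modes.
 */
static void format_tree_mode(unsigned int mode, char *buf, size_t len)
{
	assert(buf && len > 0);
	snprintf(buf, len, "%o", mode);
}
```

Calling `format_tree_mode(040000, buf, sizeof(buf))` leaves "40000" in
`buf`, and passing `0040000` gives the same result because both literals
denote the same number.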

>> diff --git a/Documentation/technical/sparse-index.txt b/Documentation/technical/sparse-index.txt
>> new file mode 100644
>> index 000000000000..aa116406a016
>> --- /dev/null
>> +++ b/Documentation/technical/sparse-index.txt
>> @@ -0,0 +1,173 @@
>> +Git Sparse-Index Design Document
>> +================================
>> +
>> +The sparse-checkout feature allows users to focus a working directory on
>> +a subset of the files at HEAD. The cone mode patterns, enabled by
>> +`core.sparseCheckoutCone`, allow for very fast pattern matching to
>> +discover which files at HEAD belong in the sparse-checkout cone.
>> +
>> +Three important scale dimensions for a Git worktree are:
> 
> s/worktree/working tree/; The former is the thing the "git worktree"
> command deals with.  The latter is relevant even when "git worktree"
> is not used (the traditional "git clone and you get a working tree
> to work in").

I guess I'm distracted by using SKIP_WORKTREE a lot, but "working
directory" is more specific and hence better.

>> +These dimensions are also ordered by how expensive they are per item: it
>> +is expensive to detect a modified file than it is to write one that we
>> +know must be populated; changing `HEAD` only really requires updating the
>> +index.
> 
> This is a bit too dense to grok.  Among Populated, there are some
> Modified but it takes lstat(2) per path or fsmonitor listening to
> inotify to know which ones are in the Modified set.  Is that the
> "expensive" you are referring to here?  I am not sure how you
> compared the cost to know if a path is modified or merely populated
> with the cost of "write one that we know must be populated" (which I
> take as "given a populated file, make modification to it"). 

I could rearrange things here. The important things to note are:

1. Updating index entries is very fast, but adds up at large scale.

2. It is faster to write a file to disk from Git's object database
   than it is to compare a file on disk to the copy in the database,
   which is frequently necessary when the mtime on disk doesn't match
   the mtime in the index.

> Also it
> is unclear what you mean by "changing HEAD only require updating the
> index".  Certainly when "git switch" flips HEAD from one commit to
> another, you'd update the index and update the files in the working
> tree (in the Populated part that is in the sparse-checkout cone) to
> match, no?

That was unclear on my part. I was thinking more along the lines of
"git reset" (soft mode), which updates HEAD without changing the files
on disk.

After all of this postulating, I think that the offending sentences
are better off deleted. They don't add clarity over what can be
inferred by an interested reader.

>> In addition, they expect to see all files at `HEAD`.
> 
> It is not clear to me what this means.  After "git add", "git
> ls-files" would expect to see a file that may not even be in HEAD.
> After "git rm", it would expect to see some file missing from the
> set of paths in HEAD.  While I do not think that is what you meant
> here, it is hard to guess what you wanted to say.

I'm mixing terms incorrectly. I think what I really mean is

  In fact, these loops expect to see a reference to every
  staged file.

>> One
>> +way to handle this is to parse trees to replace a sparse-directory entry
>> +with all of the files within that tree as the index is loaded. However,
>> +parsing trees is slower than parsing the index format, so that is a slower
>> +operation than if we left the index alone.
> 
> Besides, that would leave in-core index fully populated, so I would
> suspect that you'd lose a lot of benefit that comes from having to
> keep much fewer entries in the in-core index than what is in HEAD.
> It would be nice for "git diff-index --cached" (which is part of
> "git status") to be able to skip a single "tree" entry in the sparse
> index as "known to be untouched", than skipping thousands of paths
> in that single subdirectory (in a mega monorepo project) as "these
> are marked with SKIP_WORKTREE so ignore what is in the working tree".

Absolutely! I'm burying the lead here, so I should get to the real
point by adding this to the end:

 The plan is to make all of these integrations "sparse aware" so
 this expansion through tree parsing is unnecessary and they use
 fewer resources than when using a full index.

>> +Phase I: Format and initial speedups
>> +------------------------------------
>> +
>> +During this phase, Git learns to enable the sparse-index and safely parse
>> +one. Protections are put in place so that every consumer of the in-memory
>> +data structure can operate with its current assumption of every file at
>> +`HEAD`.
> 
> IOW, before they iterate over the in-core index, tree entries are expanded
> into a bunch of individual entries with SKIP_WORKTREE bit?  Makes sense.
> 
>> +At first, every index parse will expand the sparse-directory entries into
>> +the full list of paths at `HEAD`. This will be slower in all cases. The
>> +only noticable change in behavior will be that the serialized index file
>> +contains sparse-directory entries.
> 
> Hmph, do you mean that the expansion is done by not replacing each
> "tree" entry with blob entries for the contents of the directory,
> but the original "tree" entry is still left in the in-core index?

What I meant by "serialized index file" is that the file written to disk has
the sparse directory entries, but the in-core copy will not (except for
a very brief moment in time, during do_read_index()).

The intention at this point in time is that all code behaves identically
to the full index case, except that the index file itself is smaller due
to these sparse directory entries.
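
As a rough illustration of that expansion step, here is a toy sketch in
C. It is not the code from this series: the `toy_*` names, the
hard-coded `toy_trees` table standing in for tree objects, and the
single level of nesting are all assumptions made to keep it short. The
real expansion parses tree objects and has to recurse into subtrees.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_MAX_ENTRIES 16
#define TOY_MAX_PATH 64

/* Hypothetical stand-in for a tree object: a directory and its files. */
struct toy_tree {
	const char *dir;      /* ends in '/', like a sparse-directory entry */
	const char *files[4]; /* NULL-terminated list of file names */
};

static const struct toy_tree toy_trees[] = {
	{ "docs/", { "a.txt", "b.txt", NULL } },
};

/* In this toy model, a path ending in '/' is a sparse-directory entry. */
static int toy_is_sparse_dir(const char *path)
{
	size_t len = strlen(path);
	return len > 0 && path[len - 1] == '/';
}

static const struct toy_tree *toy_find_tree(const char *dir)
{
	size_t i;

	for (i = 0; i < sizeof(toy_trees) / sizeof(toy_trees[0]); i++)
		if (!strcmp(toy_trees[i].dir, dir))
			return &toy_trees[i];
	return NULL;
}

/*
 * Copy the entries of `in` to `out`, replacing each sparse-directory
 * entry with one entry per file in the matching toy tree. Returns the
 * number of entries written, or -1 on overflow or an unknown tree.
 */
static int toy_expand(const char **in, int nr_in,
		      char out[][TOY_MAX_PATH], int max_out)
{
	int i, j, nr_out = 0;

	assert(in && out && max_out > 0);
	for (i = 0; i < nr_in; i++) {
		const struct toy_tree *tree;

		if (!toy_is_sparse_dir(in[i])) {
			if (nr_out >= max_out)
				return -1;
			snprintf(out[nr_out++], TOY_MAX_PATH, "%s", in[i]);
			continue;
		}

		tree = toy_find_tree(in[i]);
		if (!tree)
			return -1;
		for (j = 0; tree->files[j]; j++) {
			if (nr_out >= max_out)
				return -1;
			snprintf(out[nr_out++], TOY_MAX_PATH, "%s%s",
				 tree->dir, tree->files[j]);
		}
	}
	return nr_out;
}
```

Given the three entries "README", "docs/" and "src/main.c", the sketch
produces four entries, with "docs/" replaced by "docs/a.txt" and
"docs/b.txt". The on-disk list stays short, while the in-core list is
what existing index consumers expect to iterate over.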

> It is not immediately clear what we are trying to gain by leaving it
> in, but let's read on.  Perhaps we can get rid of cache-tree
> extension and replace its use with these "tree" entries whose
> content paths are populated in the index?

This is an interesting idea, but not one I plan to pursue with this work.

>> +Next, consumers of the index will be guarded against operating on a
>> +sparse-index by inserting calls to `ensure_full_index()` or
>> +`expand_index_to_path()`. After these guards are in place, we can begin
>> +leaving sparse-directory entries in the in-memory index structure.
> 
> It is unclear why "we can begin leaving"; an iterator that only
> expects to see blobs would need to be updated to skip them, too, no?
> They would probably be already skipping blob entries that are marked
> with the SKIP_WORKTREE bit, so it may be just a matter of skipping
> more things than the current code.
> 
> Or did I misread the design presented earlier, and when a directory
> that is outside the cone is expanded into the paths of blobs in the
> directory, the "tree" entry is removed from the in-core index?

I will make this more explicit.
 
Thanks for your help improving this doc! Hopefully the plan is a
little more clear, now.

-Stolee


* [PATCH v4 00/20] Sparse Index: Design, Format, Tests
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (22 preceding siblings ...)
  2021-03-18 21:50     ` Junio C Hamano
@ 2021-03-23 13:44     ` Derrick Stolee via GitGitGadget
  2021-03-23 13:44       ` [PATCH v4 01/20] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
                         ` (21 more replies)
  23 siblings, 22 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-23 13:44 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

Here is the first full patch series submission coming out of the
sparse-index RFC [1].

[1]
https://lore.kernel.org/git/pull.847.git.1611596533.gitgitgadget@gmail.com/

I won't waste too much space here, because PATCH 1 includes a sizeable
design document that describes the feature, the reasoning behind it, and my
plan for getting this implemented widely throughout the codebase.

There are some new things here that were not in the RFC:

 * Design doc and format updates. (Patch 1)
 * Performance test script. (Patches 2 and 20)

Notably missing in this series from the RFC:

 * The mega-patch inserting ensure_full_index() throughout the codebase.
   That will be a follow-up series to this one.
 * The integrations with git status and git add to demonstrate the improved
   performance. Those will also appear in their own series later.

I plan to keep my latest work in this area in my 'sparse-index/wip' branch
[2]. It includes all of the work from the RFC right now, updated with the
work from this series.

[2] https://github.com/derrickstolee/git/tree/sparse-index/wip


Updates in V4
=============

 * Rebased onto the latest copy of ab/read-tree.
 * Updated the design document as per Junio's comments.
 * Updated the submodule handling in the performance test.
 * Followed up on some other review from Ævar, mostly style or commit
   message things.


Updates in V3
=============

For this version, I took Ævar's latest patches and applied them to v2.31.0
and rebased this series on top. It uses his new "read_tree_at()" helper and
the associated changes to the function pointer type.

 * Fixed more typos. Thanks Martin and Elijah!
 * Updated the test_sparse_match() macro to use "$@" instead of $*
 * Added a test that git sparse-checkout init --no-sparse-index rewrites the
   index to be full.


Updates in V2
=============

 * Fixed various typos and awkward grammar.
 * Cleaned up unnecessary commands in p2000-sparse-operations.sh
 * Added a comment to the sparse_index member of struct index_state.
 * Used tree_type, commit_type, and blob_type in test-read-cache.c.

Thanks, -Stolee

Derrick Stolee (20):
  sparse-index: design doc and format update
  t/perf: add performance test for sparse operations
  t1092: clean up script quoting
  sparse-index: add guard to ensure full index
  sparse-index: implement ensure_full_index()
  t1092: compare sparse-checkout to sparse-index
  test-read-cache: print cache entries with --table
  test-tool: don't force full index
  unpack-trees: ensure full index
  sparse-checkout: hold pattern list in index
  sparse-index: convert from full to sparse
  submodule: sparse-index should not collapse links
  unpack-trees: allow sparse directories
  sparse-index: check index conversion happens
  sparse-index: create extension for compatibility
  sparse-checkout: toggle sparse index from builtin
  sparse-checkout: disable sparse-index
  cache-tree: integrate with sparse directory entries
  sparse-index: loose integration with cache_tree_verify()
  p2000: add sparse-index repos

 Documentation/config/extensions.txt      |   8 +
 Documentation/git-sparse-checkout.txt    |  14 ++
 Documentation/technical/index-format.txt |   7 +
 Documentation/technical/sparse-index.txt | 174 ++++++++++++++
 Makefile                                 |   1 +
 builtin/sparse-checkout.c                |  44 +++-
 cache-tree.c                             |  40 ++++
 cache.h                                  |  18 +-
 read-cache.c                             |  35 ++-
 repo-settings.c                          |  15 ++
 repository.c                             |  11 +-
 repository.h                             |   3 +
 setup.c                                  |   3 +
 sparse-index.c                           | 293 +++++++++++++++++++++++
 sparse-index.h                           |  11 +
 t/README                                 |   3 +
 t/helper/test-read-cache.c               |  66 ++++-
 t/perf/p2000-sparse-operations.sh        | 101 ++++++++
 t/t1091-sparse-checkout-builtin.sh       |  13 +
 t/t1092-sparse-checkout-compatibility.sh | 143 +++++++++--
 unpack-trees.c                           |  17 +-
 21 files changed, 980 insertions(+), 40 deletions(-)
 create mode 100644 Documentation/technical/sparse-index.txt
 create mode 100644 sparse-index.c
 create mode 100644 sparse-index.h
 create mode 100755 t/perf/p2000-sparse-operations.sh


base-commit: 47957485b3b731a7860e0554d2bd12c0dce1c75a
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-883%2Fderrickstolee%2Fsparse-index%2Fformat-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-883/derrickstolee/sparse-index/format-v4
Pull-Request: https://github.com/gitgitgadget/git/pull/883

Range-diff vs v3:

  1:  62ac13945bec !  1:  6426a5c60e53 sparse-index: design doc and format update
     @@ Documentation/technical/index-format.txt: Git index format
      +  is enabled in cone mode (`core.sparseCheckoutCone` is enabled) and the
      +  `extensions.sparseIndex` extension is enabled, then the index may
      +  contain entries for directories outside of the sparse-checkout definition.
     -+  These entries have mode `0040000`, include the `SKIP_WORKTREE` bit, and
     ++  These entries have mode `040000`, include the `SKIP_WORKTREE` bit, and
      +  the path ends in a directory separator.
      +
         32-bit ctime seconds, the last time a file's metadata changed
     @@ Documentation/technical/sparse-index.txt (new)
      +`core.sparseCheckoutCone`, allow for very fast pattern matching to
      +discover which files at HEAD belong in the sparse-checkout cone.
      +
     -+Three important scale dimensions for a Git worktree are:
     ++Three important scale dimensions for a Git working directory are:
      +
      +* `HEAD`: How many files are present at `HEAD`?
      +
     @@ Documentation/technical/sparse-index.txt (new)
      +
      +These dimensions are ordered by their magnitude: users (typically) modify
      +fewer files than are populated, and we can only populate files at `HEAD`.
     -+These dimensions are also ordered by how expensive they are per item: it
     -+is expensive to detect a modified file than it is to write one that we
     -+know must be populated; changing `HEAD` only really requires updating the
     -+index.
      +
      +Problems occur if there is an extreme imbalance in these dimensions. For
      +example, if `HEAD` contains millions of paths but the populated set has
     @@ Documentation/technical/sparse-index.txt (new)
      +At time of writing, sparse-directory entries violate expectations about the
      +index format and its in-memory data structure. There are many consumers in
      +the codebase that expect to iterate through all of the index entries and
     -+see only files. In addition, they expect to see all files at `HEAD`. One
     -+way to handle this is to parse trees to replace a sparse-directory entry
     -+with all of the files within that tree as the index is loaded. However,
     -+parsing trees is slower than parsing the index format, so that is a slower
     -+operation than if we left the index alone.
     ++see only files. In fact, these loops expect to see a reference to every
     ++staged file. One way to handle this is to parse trees to replace a
     ++sparse-directory entry with all of the files within that tree as the index
     ++is loaded. However, parsing trees is slower than parsing the index format,
     ++so that is a slower operation than if we left the index alone. The plan is
     ++to make all of these integrations "sparse aware" so this expansion through
     ++tree parsing is unnecessary and they use fewer resources than when using a
     ++full index.
      +
      +The implementation plan below follows four phases to slowly integrate with
      +the sparse-index. The intention is to incrementally update Git commands to
     @@ Documentation/technical/sparse-index.txt (new)
      +data structure can operate with its current assumption of every file at
      +`HEAD`.
      +
     -+At first, every index parse will expand the sparse-directory entries into
     -+the full list of paths at `HEAD`. This will be slower in all cases. The
     -+only noticable change in behavior will be that the serialized index file
     -+contains sparse-directory entries.
     ++At first, every index parse will call a helper method,
     ++`ensure_full_index()`, which scans the index for sparse-directory entries
     ++(pointing to trees) and replaces them with the full list of paths (with
     ++blob contents) by parsing tree objects. This will be slower in all cases.
     ++The only noticeable change in behavior will be that the serialized index
     ++file contains sparse-directory entries.
      +
      +To start, we use a new repository extension, `extensions.sparseIndex`, to
      +allow inserting sparse-directory entries into indexes with file format
  2:  d2197e895e4d !  2:  7eabc1d0586c t/perf: add performance test for sparse operations
     @@ t/perf/p2000-sparse-operations.sh (new)
      +
      +test_expect_success 'setup repo and indexes' '
      +	git reset --hard HEAD &&
     ++
      +	# Remove submodules from the example repo, because our
     -+	# duplication of the entire repo creates an unlikly data shape.
     -+	git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
     -+	git rm -f .gitmodules &&
     -+	for module in $(awk "{print \$2}" modules)
     -+	do
     -+		git rm $module || return 1
     -+	done &&
     -+	git commit -m "remove submodules" &&
     ++	# duplication of the entire repo creates an unlikely data shape.
     ++	if git config --file .gitmodules --get-regexp "submodule.*.path" >modules
     ++	then
     ++		git rm $(awk "{print \$2}" modules) &&
     ++		git commit -m "remove submodules" || return 1
     ++	fi &&
      +
      +	echo bogus >a &&
      +	cp a b &&
  3:  d3cfd34b8418 !  3:  c9e21d78ecba t1092: clean up script quoting
     @@ Commit message
          t1092: clean up script quoting
      
          This test was introduced in 19a0acc83e4 (t1092: test interesting
     -    sparse-checkout scenarios, 2021-01-23), but these issues with quoting
     -    were not noticed until starting this follow-up series. The old mechanism
     -    would drop quoting such as in
     +    sparse-checkout scenarios, 2021-01-23), but it contains issues with quoting
     +    that were not noticed until starting this follow-up series. The old
     +    mechanism would drop quoting such as in
      
             test_all_match git commit -m "touch README.md"
      
  4:  4472118cf903 =  4:  03cdde756563 sparse-index: add guard to ensure full index
  5:  99292cdbaae4 =  5:  6b3b6d86385d sparse-index: implement ensure_full_index()
  6:  fae5663a17bb =  6:  7f67adba0498 t1092: compare sparse-checkout to sparse-index
  7:  dffe8821fde2 !  7:  7ebd9570b1ad test-read-cache: print cache entries with --table
     @@ Commit message
          a sparse-index. Further, 'git ls-tree' does not use a trailing directory
          separator for its tree rows.
      
     -    This does not print the stat() information for the blobs. That could be
     +    This does not print the stat() information for the blobs. That will be
          added in a future change with another option. The tests that are added
          in the next few changes care only about the object types and IDs.
     +    However, this future need for full index information justifies the need
     +    for this test helper over extending a user-facing feature, such as 'git
     +    ls-files'.
      
          To make the option parsing slightly more robust, wrap the string
          comparisons in a loop adapted from test-dir-iterator.c.
  8:  f4ad081f25bb =  8:  db7bbd06dbcc test-tool: don't force full index
  9:  4780076a50df =  9:  3ddd5e794b5e unpack-trees: ensure full index
 10:  33fdba2b8cfd = 10:  7308c87697f1 sparse-checkout: hold pattern list in index
 11:  e41b14e03ebb ! 11:  7c10d653ca6b sparse-index: convert from full to sparse
     @@ cache.h: static inline unsigned int create_ce_mode(unsigned int mode)
       {
       	if (S_ISLNK(mode))
       		return S_IFLNK;
     -+	if (mode == S_IFDIR)
     ++	if (S_ISSPARSEDIR(mode))
      +		return S_IFDIR;
       	if (S_ISDIR(mode) || S_ISGITLINK(mode))
       		return S_IFGITLINK;
 12:  b77cd6b02265 = 12:  6db36f33e960 submodule: sparse-index should not collapse links
 13:  4000c5cdd4cf ! 13:  d24bd3348d98 unpack-trees: allow sparse directories
     @@ unpack-trees.c: static int index_pos_by_traverse_info(struct name_entry *names,
      +		if (!o->src_index->sparse_index ||
      +		    !(o->src_index->cache[pos]->ce_flags & CE_SKIP_WORKTREE))
      +			BUG("This is a directory and should not exist in index");
     -+	} else
     ++	} else {
      +		pos = -pos - 1;
     ++	}
       	if (pos >= o->src_index->cache_nr ||
       	    !starts_with(o->src_index->cache[pos]->name, name.buf) ||
       	    (pos > 0 && starts_with(o->src_index->cache[pos-1]->name, name.buf)))
 14:  1a2be38b2ca7 = 14:  08d9f5f3c0d1 sparse-index: check index conversion happens
 15:  f89891b0ae4e = 15:  6f38cef196b0 sparse-index: create extension for compatibility
 16:  bd703c76c859 = 16:  923081e7e079 sparse-checkout: toggle sparse index from builtin
 17:  598557f90a2a = 17:  6f1ad72c390d sparse-checkout: disable sparse-index
 18:  c2d0c17db31a = 18:  bd94e6b7d089 cache-tree: integrate with sparse directory entries
 19:  6fdd9323c14e = 19:  e7190376b806 sparse-index: loose integration with cache_tree_verify()
 20:  3db06ac46dd5 = 20:  bcf0a58eb38c p2000: add sparse-index repos

-- 
gitgitgadget


* [PATCH v4 01/20] sparse-index: design doc and format update
  2021-03-23 13:44     ` [PATCH v4 " Derrick Stolee via GitGitGadget
@ 2021-03-23 13:44       ` Derrick Stolee via GitGitGadget
  2021-03-26 20:29         ` SZEDER Gábor
  2021-03-23 13:44       ` [PATCH v4 02/20] t/perf: add performance test for sparse operations Derrick Stolee via GitGitGadget
                         ` (20 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-23 13:44 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

This begins a long effort to update the index format to allow sparse
directory entries. This should result in a significant improvement to
Git commands when HEAD contains millions of files, but the user has
selected many fewer files to keep in their sparse-checkout definition.

Currently, the index format is only updated in the presence of
extensions.sparseIndex instead of increasing a file format version
number. This is temporary, and index v5 is part of the plan for future
work in this area.

The design document details many of the reasons for embarking on this
work, and also the plan for completing it safely.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/index-format.txt |   7 +
 Documentation/technical/sparse-index.txt | 174 +++++++++++++++++++++++
 2 files changed, 181 insertions(+)
 create mode 100644 Documentation/technical/sparse-index.txt

diff --git a/Documentation/technical/index-format.txt b/Documentation/technical/index-format.txt
index d363a71c37ec..3b74c05647db 100644
--- a/Documentation/technical/index-format.txt
+++ b/Documentation/technical/index-format.txt
@@ -44,6 +44,13 @@ Git index format
   localization, no special casing of directory separator '/'). Entries
   with the same name are sorted by their stage field.
 
+  An index entry typically represents a file. However, if sparse-checkout
+  is enabled in cone mode (`core.sparseCheckoutCone` is enabled) and the
+  `extensions.sparseIndex` extension is enabled, then the index may
+  contain entries for directories outside of the sparse-checkout definition.
+  These entries have mode `040000`, include the `SKIP_WORKTREE` bit, and
+  the path ends in a directory separator.
+
   32-bit ctime seconds, the last time a file's metadata changed
     this is stat(2) data
 
diff --git a/Documentation/technical/sparse-index.txt b/Documentation/technical/sparse-index.txt
new file mode 100644
index 000000000000..62f6dc225a44
--- /dev/null
+++ b/Documentation/technical/sparse-index.txt
@@ -0,0 +1,174 @@
+Git Sparse-Index Design Document
+================================
+
+The sparse-checkout feature allows users to focus a working directory on
+a subset of the files at HEAD. The cone mode patterns, enabled by
+`core.sparseCheckoutCone`, allow for very fast pattern matching to
+discover which files at HEAD belong in the sparse-checkout cone.
+
+Three important scale dimensions for a Git working directory are:
+
+* `HEAD`: How many files are present at `HEAD`?
+
+* Populated: How many files are within the sparse-checkout cone?
+
+* Modified: How many files has the user modified in the working directory?
+
+We will use big-O notation -- O(X) -- to denote how expensive certain
+operations are in terms of these dimensions.
+
+These dimensions are ordered by their magnitude: users (typically) modify
+fewer files than are populated, and we can only populate files at `HEAD`.
+
+Problems occur if there is an extreme imbalance in these dimensions. For
+example, if `HEAD` contains millions of paths but the populated set has
+only tens of thousands, then commands like `git status` and `git add` can
+be dominated by operations that require O(`HEAD`) operations instead of
+O(Populated). Primarily, the cost is in parsing and rewriting the index,
+which is filled primarily with files at `HEAD` that are marked with the
+`SKIP_WORKTREE` bit.
+
+The sparse-index intends to take these commands that read and modify the
+index from O(`HEAD`) to O(Populated). To do this, we need to modify the
+index format in a significant way: add "sparse directory" entries.
+
+With cone mode patterns, it is possible to detect when an entire
+directory will have its contents outside of the sparse-checkout definition.
+Instead of listing all of the files it contains as individual entries, a
+sparse-index contains an entry with the directory name, referencing the
+object ID of the tree at `HEAD` and marked with the `SKIP_WORKTREE` bit.
+If we need to discover the details for paths within that directory, we
+can parse trees to find that list.
+
+At time of writing, sparse-directory entries violate expectations about the
+index format and its in-memory data structure. There are many consumers in
+the codebase that expect to iterate through all of the index entries and
+see only files. In fact, these loops expect to see a reference to every
+staged file. One way to handle this is to parse trees to replace a
+sparse-directory entry with all of the files within that tree as the index
+is loaded. However, parsing trees is slower than parsing the index format,
+so that is a slower operation than if we left the index alone. The plan is
+to make all of these integrations "sparse aware" so this expansion through
+tree parsing is unnecessary and they use fewer resources than when using a
+full index.
+
+The implementation plan below follows four phases to slowly integrate with
+the sparse-index. The intention is to incrementally update Git commands to
+interact safely with the sparse-index without significant slowdowns. This
+may not always be possible, but the hope is that the primary commands that
+users need in their daily work are dramatically improved.
+
+Phase I: Format and initial speedups
+------------------------------------
+
+During this phase, Git learns to enable the sparse-index and safely parse
+one. Protections are put in place so that every consumer of the in-memory
+data structure can operate with its current assumption of every file at
+`HEAD`.
+
+At first, every index parse will call a helper method,
+`ensure_full_index()`, which scans the index for sparse-directory entries
+(pointing to trees) and replaces them with the full list of paths (with
+blob contents) by parsing tree objects. This will be slower in all cases.
+The only noticeable change in behavior will be that the serialized index
+file contains sparse-directory entries.
+
+To start, we use a new repository extension, `extensions.sparseIndex`, to
+allow inserting sparse-directory entries into indexes with file format
+versions 2, 3, and 4. This prevents Git versions that do not understand
+the sparse-index from operating on one, but it also prevents other
+operations that do not use the index at all. A new format, index v5, will
+be introduced that includes sparse-directory entries by default. It might
+introduce other features that have been considered for improving the
+index, as well.
+
+Next, consumers of the index will be guarded against operating on a
+sparse-index by inserting calls to `ensure_full_index()` or
+`expand_index_to_path()`. After these guards are in place, we can begin
+leaving sparse-directory entries in the in-memory index structure.
+
+Even after inserting these guards, we will keep expanding sparse-indexes
+for most Git commands using the `command_requires_full_index` repository
+setting. This setting will be on by default and disabled one builtin at a
+time until we have sufficient confidence that all of the index operations
+are properly guarded.
+
+To complete this phase, the commands `git status` and `git add` will be
+integrated with the sparse-index so that they operate with O(Populated)
+performance. They will be carefully tested for operations within and
+outside the sparse-checkout definition.
+
+Phase II: Careful integrations
+------------------------------
+
+This phase focuses on ensuring that all index extensions and APIs work
+well with a sparse-index. This requires significant increases to our test
+coverage, especially for operations that interact with the working
+directory outside of the sparse-checkout definition. Some of these
+behaviors may not be the desirable ones, such as some tests already
+marked for failure in `t1092-sparse-checkout-compatibility.sh`.
+
+The index extensions that may require special integrations are:
+
+* FS Monitor
+* Untracked cache
+
+While integrating with these features, we should look for patterns that
+might lead to better APIs for interacting with the index. Coalescing
+common usage patterns into an API call can reduce the number of places
+where sparse-directories need to be handled carefully.
+
+Phase III: Important command speedups
+-------------------------------------
+
+At this point, the patterns for testing and implementing sparse-directory
+logic should be relatively stable. This phase focuses on updating some of
+the most common builtins that use the index to operate as O(Populated).
+Here is a potential list of commands that could be valuable to integrate
+at this point:
+
+* `git commit`
+* `git checkout`
+* `git merge`
+* `git rebase`
+
+Hopefully, commands such as `git merge` and `git rebase` can benefit
+instead from merge algorithms that do not use the index as a data
+structure, such as the merge-ORT strategy. As these topics mature, we
+may enable the ORT strategy by default for repositories using the
+sparse-index feature.
+
+Along with `git status` and `git add`, these commands cover the majority
+of users' interactions with the working directory. In addition, we can
+integrate with these commands:
+
+* `git grep`
+* `git rm`
+
+These have been proposed as some whose behavior could change when in a
+repo with a sparse-checkout definition. It would be good to include this
+behavior automatically when using a sparse-index. Some clarity is needed
+to make the behavior switch clear to the user.
+
+This phase is the first where parallel work might be possible without too
+many conflicts between topics.
+
+Phase IV: The long tail
+-----------------------
+
+This last phase is less a "phase" and more "the new normal" after all of
+the previous work.
+
+To start, the `command_requires_full_index` option could be removed in
+favor of expanding only when hitting an API guard.
+
+There are many Git commands that could use special attention to operate as
+O(Populated), while some might be so rare that it is acceptable to leave
+them with additional overhead when a sparse-index is present.
+
+Here are some commands that might be useful to update:
+
+* `git sparse-checkout set`
+* `git am`
+* `git clean`
+* `git stash`
-- 
gitgitgadget



* [PATCH v4 02/20] t/perf: add performance test for sparse operations
  2021-03-23 13:44     ` [PATCH v4 " Derrick Stolee via GitGitGadget
  2021-03-23 13:44       ` [PATCH v4 01/20] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
@ 2021-03-23 13:44       ` Derrick Stolee via GitGitGadget
  2021-03-23 13:44       ` [PATCH v4 03/20] t1092: clean up script quoting Derrick Stolee via GitGitGadget
                         ` (19 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-23 13:44 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Create a test script that takes the default performance test (the Git
codebase) and multiplies it by 256 using four layers of duplicated
trees of width four. This results in nearly one million blob entries in
the index. Then, we can clone this repository with sparse-checkout
patterns that demonstrate four copies of the initial repository. Each
clone will use a different index format or mode so performance can be
tested across the different options.
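
As a rough check of the "nearly one million" figure, the sketch below
replays the arithmetic of the tree construction: each of the four levels
holds one blob plus four copies of the previous tree. The starting count
of 3900 files is a hypothetical stand-in, not a measured value for the Git
codebase.

```shell
# Hypothetical number of blobs in the base tree (assumption for illustration).
base=3900
entries=$base

# Each level creates a tree with blob "a" plus f1..f4, where every fN
# is a copy of the previous level's tree.
for level in 1 2 3 4
do
	entries=$((4 * entries + 1))
done

echo "blob entries after four levels: $entries"
```

With that assumed base, the loop yields 256 * base + 85 entries; the extra
85 are the copies of blob "a" introduced at each level.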

Note that the initial repo is stripped of submodules before doing the
copies. This preserves the expected data shape of the sparse index,
because directories containing submodules are not collapsed to a sparse
directory entry.

Run a few Git commands on these clones, especially those that use the
index (status, add, commit).

Here are the results on my Linux machine:

Test
--------------------------------------------------------------
2000.2: git status (full-index-v3)             0.37(0.30+0.09)
2000.3: git status (full-index-v4)             0.39(0.32+0.10)
2000.4: git add -A (full-index-v3)             1.42(1.06+0.20)
2000.5: git add -A (full-index-v4)             1.26(0.98+0.16)
2000.6: git add . (full-index-v3)              1.40(1.04+0.18)
2000.7: git add . (full-index-v4)              1.26(0.98+0.17)
2000.8: git commit -a -m A (full-index-v3)     1.42(1.11+0.16)
2000.9: git commit -a -m A (full-index-v4)     1.33(1.08+0.16)

It is perhaps noteworthy that there is an improvement when using index
version 4. This is because the v3 index uses 108 MiB while the v4
index uses 80 MiB. Since the repeated portions of the directories are
very short (f3/f1/f2, for example) this ratio is less pronounced than in
similarly-sized real repositories.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/perf/p2000-sparse-operations.sh | 84 +++++++++++++++++++++++++++++++
 1 file changed, 84 insertions(+)
 create mode 100755 t/perf/p2000-sparse-operations.sh

diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
new file mode 100755
index 000000000000..dddd527b6330
--- /dev/null
+++ b/t/perf/p2000-sparse-operations.sh
@@ -0,0 +1,84 @@
+#!/bin/sh
+
+test_description="test performance of Git operations using the index"
+
+. ./perf-lib.sh
+
+test_perf_default_repo
+
+SPARSE_CONE=f2/f4/f1
+
+test_expect_success 'setup repo and indexes' '
+	git reset --hard HEAD &&
+
+	# Remove submodules from the example repo, because our
+	# duplication of the entire repo creates an unlikely data shape.
+	if git config --file .gitmodules --get-regexp "submodule.*.path" >modules
+	then
+		git rm $(awk "{print \$2}" modules) &&
+		git commit -m "remove submodules" || return 1
+	fi &&
+
+	echo bogus >a &&
+	cp a b &&
+	git add a b &&
+	git commit -m "level 0" &&
+	BLOB=$(git rev-parse HEAD:a) &&
+	OLD_COMMIT=$(git rev-parse HEAD) &&
+	OLD_TREE=$(git rev-parse HEAD^{tree}) &&
+
+	for i in $(test_seq 1 4)
+	do
+		cat >in <<-EOF &&
+			100755 blob $BLOB	a
+			040000 tree $OLD_TREE	f1
+			040000 tree $OLD_TREE	f2
+			040000 tree $OLD_TREE	f3
+			040000 tree $OLD_TREE	f4
+		EOF
+		NEW_TREE=$(git mktree <in) &&
+		NEW_COMMIT=$(git commit-tree $NEW_TREE -p $OLD_COMMIT -m "level $i") &&
+		OLD_TREE=$NEW_TREE &&
+		OLD_COMMIT=$NEW_COMMIT || return 1
+	done &&
+
+	git sparse-checkout init --cone &&
+	git branch -f wide $OLD_COMMIT &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . full-index-v3 &&
+	(
+		cd full-index-v3 &&
+		git sparse-checkout init --cone &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 3 &&
+		git update-index --index-version=3
+	) &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . full-index-v4 &&
+	(
+		cd full-index-v4 &&
+		git sparse-checkout init --cone &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 4 &&
+		git update-index --index-version=4
+	)
+'
+
+test_perf_on_all () {
+	command="$@"
+	for repo in full-index-v3 full-index-v4
+	do
+		test_perf "$command ($repo)" "
+			(
+				cd $repo &&
+				echo >>$SPARSE_CONE/a &&
+				$command
+			)
+		"
+	done
+}
+
+test_perf_on_all git status
+test_perf_on_all git add -A
+test_perf_on_all git add .
+test_perf_on_all git commit -a -m A
+
+test_done
-- 
gitgitgadget



* [PATCH v4 03/20] t1092: clean up script quoting
  2021-03-23 13:44     ` [PATCH v4 " Derrick Stolee via GitGitGadget
  2021-03-23 13:44       ` [PATCH v4 01/20] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
  2021-03-23 13:44       ` [PATCH v4 02/20] t/perf: add performance test for sparse operations Derrick Stolee via GitGitGadget
@ 2021-03-23 13:44       ` Derrick Stolee via GitGitGadget
  2021-03-23 13:44       ` [PATCH v4 04/20] sparse-index: add guard to ensure full index Derrick Stolee via GitGitGadget
                         ` (18 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-23 13:44 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

This test was introduced in 19a0acc83e4 (t1092: test interesting
sparse-checkout scenarios, 2021-01-23), but it contains issues with quoting
that were not noticed until starting this follow-up series. The old
mechanism would drop quoting such as in

   test_all_match git commit -m "touch README.md"

The above happened to work because README.md is a file in the
repository, so 'git commit -m touch README.md' would succeed by
accident.

Other cases included quoting for no good reason, so clean that up now.
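
The word splitting described above can be reproduced outside the test
suite. This is a minimal sketch, assuming the default IFS; count_args,
forward_unquoted and forward_quoted are hypothetical helpers, not part of
t1092.

```shell
# Hypothetical helper: print how many arguments were received.
count_args () {
	echo $#
}

# Old style: unquoted $* re-splits every argument on whitespace.
forward_unquoted () {
	count_args $*
}

# Fixed style: "$@" forwards each original argument unchanged.
forward_quoted () {
	count_args "$@"
}

unquoted=$(forward_unquoted git commit -m "touch README.md")
quoted=$(forward_quoted git commit -m "touch README.md")
echo "unquoted: $unquoted arguments, quoted: $quoted arguments"
```

The unquoted form sees "touch" and "README.md" as two separate words, which
is why the old helper passed README.md to 'git commit' as a pathspec.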

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t1092-sparse-checkout-compatibility.sh | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 8cd3e5a8d227..3725d3997e70 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -96,20 +96,20 @@ init_repos () {
 run_on_sparse () {
 	(
 		cd sparse-checkout &&
-		$* >../sparse-checkout-out 2>../sparse-checkout-err
+		"$@" >../sparse-checkout-out 2>../sparse-checkout-err
 	)
 }
 
 run_on_all () {
 	(
 		cd full-checkout &&
-		$* >../full-checkout-out 2>../full-checkout-err
+		"$@" >../full-checkout-out 2>../full-checkout-err
 	) &&
-	run_on_sparse $*
+	run_on_sparse "$@"
 }
 
 test_all_match () {
-	run_on_all $* &&
+	run_on_all "$@" &&
 	test_cmp full-checkout-out sparse-checkout-out &&
 	test_cmp full-checkout-err sparse-checkout-err
 }
@@ -119,7 +119,7 @@ test_expect_success 'status with options' '
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git status --porcelain=v2 -z -u &&
 	test_all_match git status --porcelain=v2 -uno &&
-	run_on_all "touch README.md" &&
+	run_on_all touch README.md &&
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git status --porcelain=v2 -z -u &&
 	test_all_match git status --porcelain=v2 -uno &&
@@ -135,7 +135,7 @@ test_expect_success 'add, commit, checkout' '
 	write_script edit-contents <<-\EOF &&
 	echo text >>$1
 	EOF
-	run_on_all "../edit-contents README.md" &&
+	run_on_all ../edit-contents README.md &&
 
 	test_all_match git add README.md &&
 	test_all_match git status --porcelain=v2 &&
@@ -144,7 +144,7 @@ test_expect_success 'add, commit, checkout' '
 	test_all_match git checkout HEAD~1 &&
 	test_all_match git checkout - &&
 
-	run_on_all "../edit-contents README.md" &&
+	run_on_all ../edit-contents README.md &&
 
 	test_all_match git add -A &&
 	test_all_match git status --porcelain=v2 &&
@@ -153,7 +153,7 @@ test_expect_success 'add, commit, checkout' '
 	test_all_match git checkout HEAD~1 &&
 	test_all_match git checkout - &&
 
-	run_on_all "../edit-contents deep/newfile" &&
+	run_on_all ../edit-contents deep/newfile &&
 
 	test_all_match git status --porcelain=v2 -uno &&
 	test_all_match git status --porcelain=v2 &&
@@ -186,7 +186,7 @@ test_expect_success 'diff --staged' '
 	write_script edit-contents <<-\EOF &&
 	echo text >>README.md
 	EOF
-	run_on_all "../edit-contents" &&
+	run_on_all ../edit-contents &&
 
 	test_all_match git diff &&
 	test_all_match git diff --staged &&
@@ -280,7 +280,7 @@ test_expect_success 'clean' '
 	echo bogus >>.gitignore &&
 	run_on_all cp ../.gitignore . &&
 	test_all_match git add .gitignore &&
-	test_all_match git commit -m ignore-bogus-files &&
+	test_all_match git commit -m "ignore bogus files" &&
 
 	run_on_sparse mkdir folder1 &&
 	run_on_all touch folder1/bogus &&
-- 
gitgitgadget



* [PATCH v4 04/20] sparse-index: add guard to ensure full index
  2021-03-23 13:44     ` [PATCH v4 " Derrick Stolee via GitGitGadget
                         ` (2 preceding siblings ...)
  2021-03-23 13:44       ` [PATCH v4 03/20] t1092: clean up script quoting Derrick Stolee via GitGitGadget
@ 2021-03-23 13:44       ` Derrick Stolee via GitGitGadget
  2021-03-23 13:44       ` [PATCH v4 05/20] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
                         ` (17 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-23 13:44 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Upcoming changes will introduce modifications to the index format that
allow sparse directories. It will be useful to have a mechanism for
converting those sparse index files into full indexes by walking the
tree at those sparse directories. Name this method ensure_full_index()
as it will guarantee that the index is fully expanded.

This method is not implemented yet, and instead we focus on the
scaffolding to declare it and call it at the appropriate time.

Add a 'command_requires_full_index' member to struct repo_settings. This
will be an indicator that we need the index in full mode to do certain
index operations. This starts as being true for every command, then we
will set it to false as some commands integrate with sparse indexes.

If 'command_requires_full_index' is true, then we will immediately
expand a sparse index to a full one upon reading from disk. This
suffices for now, but we will want to add more callers to
ensure_full_index() later.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Makefile        |  1 +
 repo-settings.c |  8 ++++++++
 repository.c    | 11 ++++++++++-
 repository.h    |  2 ++
 sparse-index.c  |  8 ++++++++
 sparse-index.h  |  7 +++++++
 6 files changed, 36 insertions(+), 1 deletion(-)
 create mode 100644 sparse-index.c
 create mode 100644 sparse-index.h

diff --git a/Makefile b/Makefile
index dfb0f1000fa3..89b1d5374107 100644
--- a/Makefile
+++ b/Makefile
@@ -985,6 +985,7 @@ LIB_OBJS += setup.o
 LIB_OBJS += shallow.o
 LIB_OBJS += sideband.o
 LIB_OBJS += sigchain.o
+LIB_OBJS += sparse-index.o
 LIB_OBJS += split-index.o
 LIB_OBJS += stable-qsort.o
 LIB_OBJS += strbuf.o
diff --git a/repo-settings.c b/repo-settings.c
index f7fff0f5ab83..d63569e4041e 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -77,4 +77,12 @@ void prepare_repo_settings(struct repository *r)
 		UPDATE_DEFAULT_BOOL(r->settings.core_untracked_cache, UNTRACKED_CACHE_KEEP);
 
 	UPDATE_DEFAULT_BOOL(r->settings.fetch_negotiation_algorithm, FETCH_NEGOTIATION_DEFAULT);
+
+	/*
+	 * This setting guards all index reads to require a full index
+	 * over a sparse index. After suitable guards are placed in the
+	 * codebase around uses of the index, this setting will be
+	 * removed.
+	 */
+	r->settings.command_requires_full_index = 1;
 }
diff --git a/repository.c b/repository.c
index c98298acd017..a8acae002f71 100644
--- a/repository.c
+++ b/repository.c
@@ -10,6 +10,7 @@
 #include "object.h"
 #include "lockfile.h"
 #include "submodule-config.h"
+#include "sparse-index.h"
 
 /* The main repository */
 static struct repository the_repo;
@@ -261,6 +262,8 @@ void repo_clear(struct repository *repo)
 
 int repo_read_index(struct repository *repo)
 {
+	int res;
+
 	if (!repo->index)
 		repo->index = xcalloc(1, sizeof(*repo->index));
 
@@ -270,7 +273,13 @@ int repo_read_index(struct repository *repo)
 	else if (repo->index->repo != repo)
 		BUG("repo's index should point back at itself");
 
-	return read_index_from(repo->index, repo->index_file, repo->gitdir);
+	res = read_index_from(repo->index, repo->index_file, repo->gitdir);
+
+	prepare_repo_settings(repo);
+	if (repo->settings.command_requires_full_index)
+		ensure_full_index(repo->index);
+
+	return res;
 }
 
 int repo_hold_locked_index(struct repository *repo,
diff --git a/repository.h b/repository.h
index b385ca3c94b6..e06a23015697 100644
--- a/repository.h
+++ b/repository.h
@@ -41,6 +41,8 @@ struct repo_settings {
 	enum fetch_negotiation_setting fetch_negotiation_algorithm;
 
 	int core_multi_pack_index;
+
+	unsigned command_requires_full_index:1;
 };
 
 struct repository {
diff --git a/sparse-index.c b/sparse-index.c
new file mode 100644
index 000000000000..82183ead563b
--- /dev/null
+++ b/sparse-index.c
@@ -0,0 +1,8 @@
+#include "cache.h"
+#include "repository.h"
+#include "sparse-index.h"
+
+void ensure_full_index(struct index_state *istate)
+{
+	/* intentionally left blank */
+}
diff --git a/sparse-index.h b/sparse-index.h
new file mode 100644
index 000000000000..09a20d036c46
--- /dev/null
+++ b/sparse-index.h
@@ -0,0 +1,7 @@
+#ifndef SPARSE_INDEX_H__
+#define SPARSE_INDEX_H__
+
+struct index_state;
+void ensure_full_index(struct index_state *istate);
+
+#endif
-- 
gitgitgadget



* [PATCH v4 05/20] sparse-index: implement ensure_full_index()
  2021-03-23 13:44     ` [PATCH v4 " Derrick Stolee via GitGitGadget
                         ` (3 preceding siblings ...)
  2021-03-23 13:44       ` [PATCH v4 04/20] sparse-index: add guard to ensure full index Derrick Stolee via GitGitGadget
@ 2021-03-23 13:44       ` Derrick Stolee via GitGitGadget
  2021-03-23 13:44       ` [PATCH v4 06/20] t1092: compare sparse-checkout to sparse-index Derrick Stolee via GitGitGadget
                         ` (16 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-23 13:44 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

We will mark an in-memory index_state as having sparse directory entries
with the sparse_index bit. These currently cannot exist, but we will add
a mechanism for collapsing a full index to a sparse one in a later
change. That will happen at write time, so we must first allow parsing
the format before writing it.

Commands or methods that require a full index in order to operate can
call ensure_full_index() to expand that index in-memory. This requires
parsing trees using that index's repository.

Sparse directory entries have a specific 'ce_mode' value. The macro
S_ISSPARSEDIR(ce->ce_mode) can check if a cache_entry 'ce' has this type.
This ce_mode is not possible with the existing index formats, so the macro
checks only the mode and does not verify the other properties of a
sparse-directory entry. The full set of properties is:

 1. ce->ce_mode == 0040000
 2. ce->ce_flags & CE_SKIP_WORKTREE is true
 3. ce->name[ce->ce_namelen - 1] == '/' (ends in dir separator)
 4. ce->oid references a tree object.

These are all partially enforced in ensure_full_index(). Any
deviation will cause a warning at minimum or a failure in the worst
case.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache.h        | 13 ++++++-
 read-cache.c   |  9 +++++
 sparse-index.c | 98 +++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 118 insertions(+), 2 deletions(-)

diff --git a/cache.h b/cache.h
index bb317abc91fb..136dd496c95d 100644
--- a/cache.h
+++ b/cache.h
@@ -204,6 +204,8 @@ struct cache_entry {
 #error "CE_EXTENDED_FLAGS out of range"
 #endif
 
+#define S_ISSPARSEDIR(m) ((m) == S_IFDIR)
+
 /* Forward structure decls */
 struct pathspec;
 struct child_process;
@@ -319,7 +321,14 @@ struct index_state {
 		 drop_cache_tree : 1,
 		 updated_workdir : 1,
 		 updated_skipworktree : 1,
-		 fsmonitor_has_run_once : 1;
+		 fsmonitor_has_run_once : 1,
+
+		 /*
+		  * sparse_index == 1 when sparse-directory
+		  * entries exist. Requires sparse-checkout
+		  * in cone mode.
+		  */
+		 sparse_index : 1;
 	struct hashmap name_hash;
 	struct hashmap dir_hash;
 	struct object_id oid;
@@ -722,6 +731,8 @@ int read_index_from(struct index_state *, const char *path,
 		    const char *gitdir);
 int is_index_unborn(struct index_state *);
 
+void ensure_full_index(struct index_state *istate);
+
 /* For use with `write_locked_index()`. */
 #define COMMIT_LOCK		(1 << 0)
 #define SKIP_IF_UNCHANGED	(1 << 1)
diff --git a/read-cache.c b/read-cache.c
index 1e9a50c6c734..dd3980c12b53 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -101,6 +101,9 @@ static const char *alternate_index_output;
 
 static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
 {
+	if (S_ISSPARSEDIR(ce->ce_mode))
+		istate->sparse_index = 1;
+
 	istate->cache[nr] = ce;
 	add_name_hash(istate, ce);
 }
@@ -2273,6 +2276,12 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist)
 	trace2_data_intmax("index", the_repository, "read/cache_nr",
 			   istate->cache_nr);
 
+	if (!istate->repo)
+		istate->repo = the_repository;
+	prepare_repo_settings(istate->repo);
+	if (istate->repo->settings.command_requires_full_index)
+		ensure_full_index(istate);
+
 	return istate->cache_nr;
 
 unmap:
diff --git a/sparse-index.c b/sparse-index.c
index 82183ead563b..7095378a1b28 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -1,8 +1,104 @@
 #include "cache.h"
 #include "repository.h"
 #include "sparse-index.h"
+#include "tree.h"
+#include "pathspec.h"
+#include "trace2.h"
+
+static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
+{
+	ALLOC_GROW(istate->cache, nr + 1, istate->cache_alloc);
+
+	istate->cache[nr] = ce;
+	add_name_hash(istate, ce);
+}
+
+static int add_path_to_index(const struct object_id *oid,
+			     struct strbuf *base, const char *path,
+			     unsigned int mode, void *context)
+{
+	struct index_state *istate = (struct index_state *)context;
+	struct cache_entry *ce;
+	size_t len = base->len;
+
+	if (S_ISDIR(mode))
+		return READ_TREE_RECURSIVE;
+
+	strbuf_addstr(base, path);
+
+	ce = make_cache_entry(istate, mode, oid, base->buf, 0, 0);
+	ce->ce_flags |= CE_SKIP_WORKTREE;
+	set_index_entry(istate, istate->cache_nr++, ce);
+
+	strbuf_setlen(base, len);
+	return 0;
+}
 
 void ensure_full_index(struct index_state *istate)
 {
-	/* intentionally left blank */
+	int i;
+	struct index_state *full;
+	struct strbuf base = STRBUF_INIT;
+
+	if (!istate || !istate->sparse_index)
+		return;
+
+	if (!istate->repo)
+		istate->repo = the_repository;
+
+	trace2_region_enter("index", "ensure_full_index", istate->repo);
+
+	/* initialize basics of new index */
+	full = xcalloc(1, sizeof(struct index_state));
+	memcpy(full, istate, sizeof(struct index_state));
+
+	/* then change the necessary things */
+	full->sparse_index = 0;
+	full->cache_alloc = (3 * istate->cache_alloc) / 2;
+	full->cache_nr = 0;
+	ALLOC_ARRAY(full->cache, full->cache_alloc);
+
+	for (i = 0; i < istate->cache_nr; i++) {
+		struct cache_entry *ce = istate->cache[i];
+		struct tree *tree;
+		struct pathspec ps;
+
+		if (!S_ISSPARSEDIR(ce->ce_mode)) {
+			set_index_entry(full, full->cache_nr++, ce);
+			continue;
+		}
+		if (!(ce->ce_flags & CE_SKIP_WORKTREE))
+			warning(_("index entry is a directory, but not sparse (%08x)"),
+				ce->ce_flags);
+
+		/* recursively walk into ce->name */
+		tree = lookup_tree(istate->repo, &ce->oid);
+
+		memset(&ps, 0, sizeof(ps));
+		ps.recursive = 1;
+		ps.has_wildcard = 1;
+		ps.max_depth = -1;
+
+		strbuf_setlen(&base, 0);
+		strbuf_add(&base, ce->name, strlen(ce->name));
+
+		read_tree_at(istate->repo, tree, &base, &ps,
+			     add_path_to_index, full);
+
+		/* free directory entries. full entries are re-used */
+		discard_cache_entry(ce);
+	}
+
+	/* Copy back into original index. */
+	memcpy(&istate->name_hash, &full->name_hash, sizeof(full->name_hash));
+	istate->sparse_index = 0;
+	free(istate->cache);
+	istate->cache = full->cache;
+	istate->cache_nr = full->cache_nr;
+	istate->cache_alloc = full->cache_alloc;
+
+	strbuf_release(&base);
+	free(full);
+
+	trace2_region_leave("index", "ensure_full_index", istate->repo);
 }
-- 
gitgitgadget



* [PATCH v4 06/20] t1092: compare sparse-checkout to sparse-index
  2021-03-23 13:44     ` [PATCH v4 " Derrick Stolee via GitGitGadget
                         ` (4 preceding siblings ...)
  2021-03-23 13:44       ` [PATCH v4 05/20] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
@ 2021-03-23 13:44       ` Derrick Stolee via GitGitGadget
  2021-03-23 13:44       ` [PATCH v4 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
                         ` (15 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-23 13:44 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add a new 'sparse-index' repo alongside the 'full-checkout' and
'sparse-checkout' repos in t1092-sparse-checkout-compatibility.sh. Also
add run_on_sparse and test_sparse_match helpers. These helpers will be
used when the sparse index is implemented.

Add the GIT_TEST_SPARSE_INDEX environment variable to enable the
sparse-index by default. This can be enabled across all tests, but that
will only affect cases where the sparse-checkout feature is enabled.
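
The helpers in this patch use the shell's one-shot assignment form
(VAR=value command), which sets the variable for that single command only.
A minimal sketch of that scoping, with a hypothetical DEMO_FLAG variable
standing in for GIT_TEST_SPARSE_INDEX:

```shell
# Hypothetical variable standing in for GIT_TEST_SPARSE_INDEX.
DEMO_FLAG=0
export DEMO_FLAG

# One-shot assignment: only this child process sees DEMO_FLAG=1.
one_shot=$(DEMO_FLAG=1 sh -c 'echo "$DEMO_FLAG"')

# A later child process still sees the exported value.
after=$(sh -c 'echo "$DEMO_FLAG"')

echo "one-shot: $one_shot, afterwards: $after"
```

This is why each repository can be exercised with its own setting in the
same test without the value leaking into the next command.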

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/README                                 |  3 +++
 t/t1092-sparse-checkout-compatibility.sh | 24 ++++++++++++++++++++----
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/t/README b/t/README
index 593d4a4e270c..b98bc563aab5 100644
--- a/t/README
+++ b/t/README
@@ -439,6 +439,9 @@ and "sha256".
 GIT_TEST_WRITE_REV_INDEX=<boolean>, when true enables the
 'pack.writeReverseIndex' setting.
 
+GIT_TEST_SPARSE_INDEX=<boolean>, when true enables index writes to use the
+sparse-index format by default.
+
 Naming Tests
 ------------
 
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 3725d3997e70..de5d8461c993 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -7,6 +7,7 @@ test_description='compare full workdir to sparse workdir'
 test_expect_success 'setup' '
 	git init initial-repo &&
 	(
+		GIT_TEST_SPARSE_INDEX=0 &&
 		cd initial-repo &&
 		echo a >a &&
 		echo "after deep" >e &&
@@ -87,23 +88,32 @@ init_repos () {
 
 	cp -r initial-repo sparse-checkout &&
 	git -C sparse-checkout reset --hard &&
-	git -C sparse-checkout sparse-checkout init --cone &&
+
+	cp -r initial-repo sparse-index &&
+	git -C sparse-index reset --hard &&
 
 	# initialize sparse-checkout definitions
-	git -C sparse-checkout sparse-checkout set deep
+	git -C sparse-checkout sparse-checkout init --cone &&
+	git -C sparse-checkout sparse-checkout set deep &&
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout init --cone &&
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep
 }
 
 run_on_sparse () {
 	(
 		cd sparse-checkout &&
-		"$@" >../sparse-checkout-out 2>../sparse-checkout-err
+		GIT_TEST_SPARSE_INDEX=0 "$@" >../sparse-checkout-out 2>../sparse-checkout-err
+	) &&
+	(
+		cd sparse-index &&
+		GIT_TEST_SPARSE_INDEX=1 "$@" >../sparse-index-out 2>../sparse-index-err
 	)
 }
 
 run_on_all () {
 	(
 		cd full-checkout &&
-		"$@" >../full-checkout-out 2>../full-checkout-err
+		GIT_TEST_SPARSE_INDEX=0 "$@" >../full-checkout-out 2>../full-checkout-err
 	) &&
 	run_on_sparse "$@"
 }
@@ -114,6 +124,12 @@ test_all_match () {
 	test_cmp full-checkout-err sparse-checkout-err
 }
 
+test_sparse_match () {
+	run_on_sparse "$@" &&
+	test_cmp sparse-checkout-out sparse-index-out &&
+	test_cmp sparse-checkout-err sparse-index-err
+}
+
 test_expect_success 'status with options' '
 	init_repos &&
 	test_all_match git status --porcelain=v2 &&
-- 
gitgitgadget



* [PATCH v4 07/20] test-read-cache: print cache entries with --table
  2021-03-23 13:44     ` [PATCH v4 " Derrick Stolee via GitGitGadget
                         ` (5 preceding siblings ...)
  2021-03-23 13:44       ` [PATCH v4 06/20] t1092: compare sparse-checkout to sparse-index Derrick Stolee via GitGitGadget
@ 2021-03-23 13:44       ` Derrick Stolee via GitGitGadget
  2021-03-24  1:24         ` Ævar Arnfjörð Bjarmason
  2021-03-23 13:44       ` [PATCH v4 08/20] test-tool: don't force full index Derrick Stolee via GitGitGadget
                         ` (14 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-23 13:44 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

This table is helpful for discovering data in the index to ensure it is
being written correctly, especially as we build and test the
sparse-index. This table includes an output format similar to 'git
ls-tree', but should not be compared to that directly. The biggest
reasons are that 'git ls-tree' includes a tree entry for every
subdirectory, even those that would not appear as a sparse directory in
a sparse-index. Further, 'git ls-tree' does not use a trailing directory
separator for its tree rows.

This does not print the stat() information for the blobs. That will be
added in a future change with another option. The tests that are added
in the next few changes care only about the object types and IDs.
However, this future need for full index information justifies the need
for this test helper over extending a user-facing feature, such as 'git
ls-files'.

To make the option parsing slightly more robust, wrap the string
comparisons in a loop adapted from test-dir-iterator.c.

Care must be taken with the final check for the 'cnt' variable. We
continue the expectation that the numerical value is the final argument.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/helper/test-read-cache.c | 55 +++++++++++++++++++++++++++++++-------
 1 file changed, 45 insertions(+), 10 deletions(-)

diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
index 244977a29bdf..6cfd8f2de71c 100644
--- a/t/helper/test-read-cache.c
+++ b/t/helper/test-read-cache.c
@@ -1,36 +1,71 @@
 #include "test-tool.h"
 #include "cache.h"
 #include "config.h"
+#include "blob.h"
+#include "commit.h"
+#include "tree.h"
+
+static void print_cache_entry(struct cache_entry *ce)
+{
+	const char *type;
+	printf("%06o ", ce->ce_mode & 0177777);
+
+	if (S_ISSPARSEDIR(ce->ce_mode))
+		type = tree_type;
+	else if (S_ISGITLINK(ce->ce_mode))
+		type = commit_type;
+	else
+		type = blob_type;
+
+	printf("%s %s\t%s\n",
+	       type,
+	       oid_to_hex(&ce->oid),
+	       ce->name);
+}
+
+static void print_cache(struct index_state *istate)
+{
+	int i;
+	for (i = 0; i < istate->cache_nr; i++)
+		print_cache_entry(istate->cache[i]);
+}
 
 int cmd__read_cache(int argc, const char **argv)
 {
+	struct repository *r = the_repository;
 	int i, cnt = 1;
 	const char *name = NULL;
+	int table = 0;
 
-	if (argc > 1 && skip_prefix(argv[1], "--print-and-refresh=", &name)) {
-		argc--;
-		argv++;
+	for (++argv, --argc; *argv && starts_with(*argv, "--"); ++argv, --argc) {
+		if (skip_prefix(*argv, "--print-and-refresh=", &name))
+			continue;
+		if (!strcmp(*argv, "--table"))
+			table = 1;
 	}
 
-	if (argc == 2)
-		cnt = strtol(argv[1], NULL, 0);
+	if (argc == 1)
+		cnt = strtol(argv[0], NULL, 0);
 	setup_git_directory();
 	git_config(git_default_config, NULL);
+
 	for (i = 0; i < cnt; i++) {
-		read_cache();
+		repo_read_index(r);
 		if (name) {
 			int pos;
 
-			refresh_index(&the_index, REFRESH_QUIET,
+			refresh_index(r->index, REFRESH_QUIET,
 				      NULL, NULL, NULL);
-			pos = index_name_pos(&the_index, name, strlen(name));
+			pos = index_name_pos(r->index, name, strlen(name));
 			if (pos < 0)
 				die("%s not in index", name);
 			printf("%s is%s up to date\n", name,
-			       ce_uptodate(the_index.cache[pos]) ? "" : " not");
+			       ce_uptodate(r->index->cache[pos]) ? "" : " not");
 			write_file(name, "%d\n", i);
 		}
-		discard_cache();
+		if (table)
+			print_cache(r->index);
+		discard_index(r->index);
 	}
 	return 0;
 }
-- 
gitgitgadget



* [PATCH v4 08/20] test-tool: don't force full index
  2021-03-23 13:44     ` [PATCH v4 " Derrick Stolee via GitGitGadget
                         ` (6 preceding siblings ...)
  2021-03-23 13:44       ` [PATCH v4 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
@ 2021-03-23 13:44       ` Derrick Stolee via GitGitGadget
  2021-03-23 13:44       ` [PATCH v4 09/20] unpack-trees: ensure " Derrick Stolee via GitGitGadget
                         ` (13 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-23 13:44 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

We will use 'test-tool read-cache --table' to check that a sparse
index is written as part of init_repos. Since we will no longer always
expand a sparse index into a full index, add an '--expand' parameter
that adds a call to ensure_full_index() so we can compare a sparse index
directly against a full index, or at least what the in-memory index
looks like when expanded in this way.
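
As a rough illustration only (not code from this patch), the flag
handling can be modeled as below. The names toy_opts and toy_parse()
are hypothetical, and the sketch checks argc instead of relying on a
NULL-terminated argv as the helper does.

```c
#include <string.h>

/* Hypothetical option holder; not a structure from test-read-cache.c. */
struct toy_opts {
	int table;
	int expand;
};

/*
 * Consume leading "--" arguments, following the loop shape used by the
 * helper. argv[0] is the command name and is skipped. Returns the
 * number of positional arguments left over.
 */
int toy_parse(int argc, const char **argv, struct toy_opts *opts)
{
	opts->table = 0;
	opts->expand = 0;

	for (++argv, --argc; argc > 0 && !strncmp(*argv, "--", 2); ++argv, --argc) {
		if (!strcmp(*argv, "--table"))
			opts->table = 1;
		else if (!strcmp(*argv, "--expand"))
			opts->expand = 1;
	}
	return argc;
}
```

With the arguments "read-cache --expand --table 3", both flags end up
set and one positional argument (the count) remains.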

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/helper/test-read-cache.c               | 13 ++++++++++++-
 t/t1092-sparse-checkout-compatibility.sh |  5 +++++
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
index 6cfd8f2de71c..b52c174acc7a 100644
--- a/t/helper/test-read-cache.c
+++ b/t/helper/test-read-cache.c
@@ -4,6 +4,7 @@
 #include "blob.h"
 #include "commit.h"
 #include "tree.h"
+#include "sparse-index.h"
 
 static void print_cache_entry(struct cache_entry *ce)
 {
@@ -35,13 +36,19 @@ int cmd__read_cache(int argc, const char **argv)
 	struct repository *r = the_repository;
 	int i, cnt = 1;
 	const char *name = NULL;
-	int table = 0;
+	int table = 0, expand = 0;
+
+	initialize_the_repository();
+	prepare_repo_settings(r);
+	r->settings.command_requires_full_index = 0;
 
 	for (++argv, --argc; *argv && starts_with(*argv, "--"); ++argv, --argc) {
 		if (skip_prefix(*argv, "--print-and-refresh=", &name))
 			continue;
 		if (!strcmp(*argv, "--table"))
 			table = 1;
+		else if (!strcmp(*argv, "--expand"))
+			expand = 1;
 	}
 
 	if (argc == 1)
@@ -51,6 +58,10 @@ int cmd__read_cache(int argc, const char **argv)
 
 	for (i = 0; i < cnt; i++) {
 		repo_read_index(r);
+
+		if (expand)
+			ensure_full_index(r->index);
+
 		if (name) {
 			int pos;
 
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index de5d8461c993..a1aea141c62c 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -130,6 +130,11 @@ test_sparse_match () {
 	test_cmp sparse-checkout-err sparse-index-err
 }
 
+test_expect_success 'expanded in-memory index matches full index' '
+	init_repos &&
+	test_sparse_match test-tool read-cache --expand --table
+'
+
 test_expect_success 'status with options' '
 	init_repos &&
 	test_all_match git status --porcelain=v2 &&
-- 
gitgitgadget



* [PATCH v4 09/20] unpack-trees: ensure full index
  2021-03-23 13:44     ` [PATCH v4 " Derrick Stolee via GitGitGadget
                         ` (7 preceding siblings ...)
  2021-03-23 13:44       ` [PATCH v4 08/20] test-tool: don't force full index Derrick Stolee via GitGitGadget
@ 2021-03-23 13:44       ` Derrick Stolee via GitGitGadget
  2021-03-23 13:44       ` [PATCH v4 10/20] sparse-checkout: hold pattern list in index Derrick Stolee via GitGitGadget
                         ` (12 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-23 13:44 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The next change will translate full indexes into sparse indexes at write
time. The existing logic provides a way for every sparse index to be
expanded to a full index at read time. However, there are cases where an
index is written and then continues to be used in-memory to perform
further updates.

unpack_trees() is frequently called after such a write. In particular,
commands like 'git reset' do this double-update of the index.

Ensure that we have a full index when entering unpack_trees(), but only
when command_requires_full_index is true. This is always true at the
moment, but we will later relax that after unpack_trees() is updated to
handle sparse directory entries.
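
A minimal sketch of this guard, using toy stand-ins for the real
structures. Every toy_* name is hypothetical, and the assumption that
expanding an already-full (or NULL) index is a no-op is stated here for
the sketch only.

```c
/* Toy stand-ins; the real types live in cache.h and repository.h. */
struct toy_index {
	int sparse_index;	/* nonzero while sparse directory entries exist */
	int expansions;		/* counts expansions, for illustration only */
};

struct toy_settings {
	int command_requires_full_index;
};

/* Assumed behavior: nothing to do for a NULL or already-full index. */
void toy_ensure_full_index(struct toy_index *istate)
{
	if (!istate || !istate->sparse_index)
		return;
	istate->sparse_index = 0;
	istate->expansions++;
}

/* Guard at the top of a code path that is not yet sparse-aware. */
void toy_enter_unpack(const struct toy_settings *s,
		      struct toy_index *src, struct toy_index *dst)
{
	if (s->command_requires_full_index) {
		toy_ensure_full_index(src);
		toy_ensure_full_index(dst);
	}
}
```

When src and dst are the same index, the second call finds it already
full and does nothing. Once the setting is relaxed, a sparse index
passes through untouched.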

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 unpack-trees.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/unpack-trees.c b/unpack-trees.c
index f5f668f532d8..4dd99219073a 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -1567,6 +1567,7 @@ static int verify_absent(const struct cache_entry *,
  */
 int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options *o)
 {
+	struct repository *repo = the_repository;
 	int i, ret;
 	static struct cache_entry *dfc;
 	struct pattern_list pl;
@@ -1578,6 +1579,12 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
 	trace_performance_enter();
 	trace2_region_enter("unpack_trees", "unpack_trees", the_repository);
 
+	prepare_repo_settings(repo);
+	if (repo->settings.command_requires_full_index) {
+		ensure_full_index(o->src_index);
+		ensure_full_index(o->dst_index);
+	}
+
 	if (!core_apply_sparse_checkout || !o->update)
 		o->skip_sparse_checkout = 1;
 	if (!o->skip_sparse_checkout && !o->pl) {
-- 
gitgitgadget



* [PATCH v4 10/20] sparse-checkout: hold pattern list in index
  2021-03-23 13:44     ` [PATCH v4 " Derrick Stolee via GitGitGadget
                         ` (8 preceding siblings ...)
  2021-03-23 13:44       ` [PATCH v4 09/20] unpack-trees: ensure " Derrick Stolee via GitGitGadget
@ 2021-03-23 13:44       ` Derrick Stolee via GitGitGadget
  2021-03-23 13:44       ` [PATCH v4 11/20] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
                         ` (11 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-23 13:44 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

As we modify the sparse-checkout definition, we perform index operations
on a pattern_list that only exists in-memory. This allows easy backing
out in case the index update fails.

However, if the index write itself cares about the sparse-checkout
pattern set, we need access to that in-memory copy. Place a pointer to
a 'struct pattern_list' in the index so we can access this on-demand.
This will be used in the next change which uses the sparse-checkout
definition to filter out directories that are outside the sparse cone.
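
The lend-then-clear pattern can be sketched as follows. This is an
illustration under simplified assumptions, not the code in this patch:
the toy_* types and the callback-based writer are hypothetical.

```c
#include <stddef.h>

struct toy_patterns {
	int nr;
};

struct toy_index {
	/* Borrowed, not owned: valid only while an update is running. */
	const struct toy_patterns *sparse_checkout_patterns;
};

/* Toy writer: succeeds only if it can see the in-memory patterns. */
int toy_write_sees_patterns(struct toy_index *istate)
{
	return istate->sparse_checkout_patterns ? 0 : -1;
}

/*
 * Lend the caller's pattern list to the index for the duration of the
 * write, then clear the pointer so it can never outlive 'pl'.
 */
int toy_update(struct toy_index *istate, const struct toy_patterns *pl,
	       int (*write_index)(struct toy_index *))
{
	int result;

	istate->sparse_checkout_patterns = pl;
	result = write_index(istate);
	istate->sparse_checkout_patterns = NULL;
	return result;
}
```

Clearing the pointer before returning keeps the caller free to release
or discard the pattern list, which is what allows backing out when the
update fails.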

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/sparse-checkout.c | 17 ++++++++++-------
 cache.h                   |  2 ++
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
index 2306a9ad98e0..e00b82af727b 100644
--- a/builtin/sparse-checkout.c
+++ b/builtin/sparse-checkout.c
@@ -110,6 +110,8 @@ static int update_working_directory(struct pattern_list *pl)
 	if (is_index_unborn(r->index))
 		return UPDATE_SPARSITY_SUCCESS;
 
+	r->index->sparse_checkout_patterns = pl;
+
 	memset(&o, 0, sizeof(o));
 	o.verbose_update = isatty(2);
 	o.update = 1;
@@ -138,6 +140,7 @@ static int update_working_directory(struct pattern_list *pl)
 	else
 		rollback_lock_file(&lock_file);
 
+	r->index->sparse_checkout_patterns = NULL;
 	return result;
 }
 
@@ -517,19 +520,18 @@ static int modify_pattern_list(int argc, const char **argv, enum modify_type m)
 {
 	int result;
 	int changed_config = 0;
-	struct pattern_list pl;
-	memset(&pl, 0, sizeof(pl));
+	struct pattern_list *pl = xcalloc(1, sizeof(*pl));
 
 	switch (m) {
 	case ADD:
 		if (core_sparse_checkout_cone)
-			add_patterns_cone_mode(argc, argv, &pl);
+			add_patterns_cone_mode(argc, argv, pl);
 		else
-			add_patterns_literal(argc, argv, &pl);
+			add_patterns_literal(argc, argv, pl);
 		break;
 
 	case REPLACE:
-		add_patterns_from_input(&pl, argc, argv);
+		add_patterns_from_input(pl, argc, argv);
 		break;
 	}
 
@@ -539,12 +541,13 @@ static int modify_pattern_list(int argc, const char **argv, enum modify_type m)
 		changed_config = 1;
 	}
 
-	result = write_patterns_and_update(&pl);
+	result = write_patterns_and_update(pl);
 
 	if (result && changed_config)
 		set_config(MODE_NO_PATTERNS);
 
-	clear_pattern_list(&pl);
+	clear_pattern_list(pl);
+	free(pl);
 	return result;
 }
 
diff --git a/cache.h b/cache.h
index 136dd496c95d..8c4464420d0a 100644
--- a/cache.h
+++ b/cache.h
@@ -307,6 +307,7 @@ static inline unsigned int canon_mode(unsigned int mode)
 struct split_index;
 struct untracked_cache;
 struct progress;
+struct pattern_list;
 
 struct index_state {
 	struct cache_entry **cache;
@@ -338,6 +339,7 @@ struct index_state {
 	struct mem_pool *ce_mem_pool;
 	struct progress *progress;
 	struct repository *repo;
+	struct pattern_list *sparse_checkout_patterns;
 };
 
 /* Name hashing */
-- 
gitgitgadget



* [PATCH v4 11/20] sparse-index: convert from full to sparse
  2021-03-23 13:44     ` [PATCH v4 " Derrick Stolee via GitGitGadget
                         ` (9 preceding siblings ...)
  2021-03-23 13:44       ` [PATCH v4 10/20] sparse-checkout: hold pattern list in index Derrick Stolee via GitGitGadget
@ 2021-03-23 13:44       ` Derrick Stolee via GitGitGadget
  2021-03-23 13:44       ` [PATCH v4 12/20] submodule: sparse-index should not collapse links Derrick Stolee via GitGitGadget
                         ` (10 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-23 13:44 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

If we have a full index, then we can convert it to a sparse index by
replacing directories outside of the sparse cone with sparse directory
entries. The convert_to_sparse() method does this, when the situation is
appropriate.

For now, we avoid converting the index to a sparse index if:

 1. the index is split.
 2. the index is already sparse.
 3. sparse-checkout is disabled.
 4. sparse-checkout does not use cone mode.

Finally, we currently limit the conversion to when the
GIT_TEST_SPARSE_INDEX environment variable is enabled. A mode using Git
config will be added in a later change.

The trickiest thing about this conversion is that we might not be able
to mark a directory as a sparse directory just because it is outside the
sparse cone. There might be unmerged files within that directory, so we
need to look for those. Also, if there is some strange reason why a file
is not marked with CE_SKIP_WORKTREE, then we should give up on
converting that directory. There is still hope that some of its
subdirectories might be able to convert to sparse, so we keep looking
deeper.
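
To make the per-range check concrete, here is a small model of it. It
is a sketch only: toy_entry, TOY_SKIP_WORKTREE and toy_can_collapse()
are hypothetical stand-ins for the cache entry, CE_SKIP_WORKTREE and
the loop in the conversion, and the pattern-list match is left out.

```c
#include <stddef.h>

#define TOY_SKIP_WORKTREE 0x1	/* stands in for CE_SKIP_WORKTREE */

struct toy_entry {
	const char *name;
	int stage;		/* nonzero means unmerged */
	unsigned flags;
};

/*
 * A directory covering entries [start, end) may collapse only if every
 * entry in that range is merged (stage 0) and marked skip-worktree.
 */
int toy_can_collapse(const struct toy_entry *cache, size_t start, size_t end)
{
	size_t i;

	for (i = start; i < end; i++) {
		if (cache[i].stage || !(cache[i].flags & TOY_SKIP_WORKTREE))
			return 0;
	}
	return 1;
}
```

A single unmerged entry anywhere in the range blocks the collapse of
the enclosing directory, but a narrower range that excludes it (a
subdirectory, in the real index) can still pass the check.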

The conversion process is assisted by the cache-tree extension. This is
calculated from the full index if it does not already exist. We then
abandon the cache-tree as it no longer applies to the newly-sparse
index. Thus, this cache-tree will be recalculated in every
sparse-full-sparse round-trip until we integrate the cache-tree
extension with the sparse index.

Some Git commands use the index after writing it. For example, 'git add'
will update the index, then write it to disk, then read its entries to
report information. To keep the in-memory index in a full state after
writing, we re-expand it to a full one after the write. This is wasteful
for commands that only write the index and do not read from it again,
but that is only the case until we make those commands "sparse aware."
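
The write path can be pictured with the following simplified model.
All toy_* names are hypothetical, error handling is omitted, and the
counter merely stands in for the actual write to disk.

```c
struct toy_index {
	int sparse_index;		/* nonzero while the index is sparse */
	int writes_while_sparse;	/* stands in for on-disk writes */
};

void toy_convert_to_sparse(struct toy_index *istate)
{
	istate->sparse_index = 1;
}

void toy_ensure_full_index(struct toy_index *istate)
{
	istate->sparse_index = 0;
}

/*
 * Remember the state on entry, write in sparse form, then restore a
 * full in-memory index for callers that keep reading entries.
 */
int toy_write_index(struct toy_index *istate)
{
	int was_full = !istate->sparse_index;

	toy_convert_to_sparse(istate);
	if (istate->sparse_index)
		istate->writes_while_sparse++;
	if (was_full)
		toy_ensure_full_index(istate);
	return 0;
}
```

An index that was full on entry is full again on return, while an index
that was already sparse is left sparse.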

We can compare the behavior of the sparse-index in
t1092-sparse-checkout-compatibility.sh by using GIT_TEST_SPARSE_INDEX=1
when operating on the 'sparse-index' repo. We can also compare the two
sparse repos directly, such as comparing their indexes (when expanded to
full in the case of the 'sparse-index' repo). We also verify that the
index is actually populated with sparse directory entries.

The 'checkout and reset (mixed)' test is marked for failure when
comparing a sparse repo to a full repo, but we can compare the two
sparse-checkout cases directly to ensure that we are not changing the
behavior when using a sparse index.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache-tree.c                             |   3 +
 cache.h                                  |   2 +
 read-cache.c                             |  26 ++++-
 sparse-index.c                           | 139 +++++++++++++++++++++++
 sparse-index.h                           |   1 +
 t/t1092-sparse-checkout-compatibility.sh |  61 +++++++++-
 6 files changed, 228 insertions(+), 4 deletions(-)

diff --git a/cache-tree.c b/cache-tree.c
index 2fb483d3c083..5f07a39e501e 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -6,6 +6,7 @@
 #include "object-store.h"
 #include "replace-object.h"
 #include "promisor-remote.h"
+#include "sparse-index.h"
 
 #ifndef DEBUG_CACHE_TREE
 #define DEBUG_CACHE_TREE 0
@@ -442,6 +443,8 @@ int cache_tree_update(struct index_state *istate, int flags)
 	if (i)
 		return i;
 
+	ensure_full_index(istate);
+
 	if (!istate->cache_tree)
 		istate->cache_tree = cache_tree();
 
diff --git a/cache.h b/cache.h
index 8c4464420d0a..74b43aaa2bd1 100644
--- a/cache.h
+++ b/cache.h
@@ -251,6 +251,8 @@ static inline unsigned int create_ce_mode(unsigned int mode)
 {
 	if (S_ISLNK(mode))
 		return S_IFLNK;
+	if (S_ISSPARSEDIR(mode))
+		return S_IFDIR;
 	if (S_ISDIR(mode) || S_ISGITLINK(mode))
 		return S_IFGITLINK;
 	return S_IFREG | ce_permissions(mode);
diff --git a/read-cache.c b/read-cache.c
index dd3980c12b53..b9c08773466c 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -25,6 +25,7 @@
 #include "fsmonitor.h"
 #include "thread-utils.h"
 #include "progress.h"
+#include "sparse-index.h"
 
 /* Mask for the name length in ce_flags in the on-disk index */
 
@@ -1002,8 +1003,14 @@ int verify_path(const char *path, unsigned mode)
 
 			c = *path++;
 			if ((c == '.' && !verify_dotfile(path, mode)) ||
-			    is_dir_sep(c) || c == '\0')
+			    is_dir_sep(c))
 				return 0;
+			/*
+			 * allow terminating directory separators for
+			 * sparse directory entries.
+			 */
+			if (c == '\0')
+				return S_ISDIR(mode);
 		} else if (c == '\\' && protect_ntfs) {
 			if (is_ntfs_dotgit(path))
 				return 0;
@@ -3079,6 +3086,14 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l
 				 unsigned flags)
 {
 	int ret;
+	int was_full = !istate->sparse_index;
+
+	ret = convert_to_sparse(istate);
+
+	if (ret) {
+		warning(_("failed to convert to a sparse-index"));
+		return ret;
+	}
 
 	/*
 	 * TODO trace2: replace "the_repository" with the actual repo instance
@@ -3090,6 +3105,9 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l
 	trace2_region_leave_printf("index", "do_write_index", the_repository,
 				   "%s", get_lock_file_path(lock));
 
+	if (was_full)
+		ensure_full_index(istate);
+
 	if (ret)
 		return ret;
 	if (flags & COMMIT_LOCK)
@@ -3180,9 +3198,10 @@ static int write_shared_index(struct index_state *istate,
 			      struct tempfile **temp)
 {
 	struct split_index *si = istate->split_index;
-	int ret;
+	int ret, was_full = !istate->sparse_index;
 
 	move_cache_to_base_index(istate);
+	convert_to_sparse(istate);
 
 	trace2_region_enter_printf("index", "shared/do_write_index",
 				   the_repository, "%s", get_tempfile_path(*temp));
@@ -3190,6 +3209,9 @@ static int write_shared_index(struct index_state *istate,
 	trace2_region_leave_printf("index", "shared/do_write_index",
 				   the_repository, "%s", get_tempfile_path(*temp));
 
+	if (was_full)
+		ensure_full_index(istate);
+
 	if (ret)
 		return ret;
 	ret = adjust_shared_perm(get_tempfile_path(*temp));
diff --git a/sparse-index.c b/sparse-index.c
index 7095378a1b28..619ff7c2e217 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -4,6 +4,145 @@
 #include "tree.h"
 #include "pathspec.h"
 #include "trace2.h"
+#include "cache-tree.h"
+#include "config.h"
+#include "dir.h"
+#include "fsmonitor.h"
+
+static struct cache_entry *construct_sparse_dir_entry(
+				struct index_state *istate,
+				const char *sparse_dir,
+				struct cache_tree *tree)
+{
+	struct cache_entry *de;
+
+	de = make_cache_entry(istate, S_IFDIR, &tree->oid, sparse_dir, 0, 0);
+
+	de->ce_flags |= CE_SKIP_WORKTREE;
+	return de;
+}
+
+/*
+ * Returns the number of entries "inserted" into the index.
+ */
+static int convert_to_sparse_rec(struct index_state *istate,
+				 int num_converted,
+				 int start, int end,
+				 const char *ct_path, size_t ct_pathlen,
+				 struct cache_tree *ct)
+{
+	int i, can_convert = 1;
+	int start_converted = num_converted;
+	enum pattern_match_result match;
+	int dtype;
+	struct strbuf child_path = STRBUF_INIT;
+	struct pattern_list *pl = istate->sparse_checkout_patterns;
+
+	/*
+	 * Is the current path outside of the sparse cone?
+	 * Then check if the region can be replaced by a sparse
+	 * directory entry (everything is sparse and merged).
+	 */
+	match = path_matches_pattern_list(ct_path, ct_pathlen,
+					  NULL, &dtype, pl, istate);
+	if (match != NOT_MATCHED)
+		can_convert = 0;
+
+	for (i = start; can_convert && i < end; i++) {
+		struct cache_entry *ce = istate->cache[i];
+
+		if (ce_stage(ce) ||
+		    !(ce->ce_flags & CE_SKIP_WORKTREE))
+			can_convert = 0;
+	}
+
+	if (can_convert) {
+		struct cache_entry *se;
+		se = construct_sparse_dir_entry(istate, ct_path, ct);
+
+		istate->cache[num_converted++] = se;
+		return 1;
+	}
+
+	for (i = start; i < end; ) {
+		int count, span, pos = -1;
+		const char *base, *slash;
+		struct cache_entry *ce = istate->cache[i];
+
+		/*
+		 * Detect if this is a normal entry outside of any subtree
+		 * entry.
+		 */
+		base = ce->name + ct_pathlen;
+		slash = strchr(base, '/');
+
+		if (slash)
+			pos = cache_tree_subtree_pos(ct, base, slash - base);
+
+		if (pos < 0) {
+			istate->cache[num_converted++] = ce;
+			i++;
+			continue;
+		}
+
+		strbuf_setlen(&child_path, 0);
+		strbuf_add(&child_path, ce->name, slash - ce->name + 1);
+
+		span = ct->down[pos]->cache_tree->entry_count;
+		count = convert_to_sparse_rec(istate,
+					      num_converted, i, i + span,
+					      child_path.buf, child_path.len,
+					      ct->down[pos]->cache_tree);
+		num_converted += count;
+		i += span;
+	}
+
+	strbuf_release(&child_path);
+	return num_converted - start_converted;
+}
+
+int convert_to_sparse(struct index_state *istate)
+{
+	if (istate->split_index || istate->sparse_index ||
+	    !core_apply_sparse_checkout || !core_sparse_checkout_cone)
+		return 0;
+
+	/*
+	 * For now, only create a sparse index with the
+	 * GIT_TEST_SPARSE_INDEX environment variable. We will relax
+	 * this once we have a proper way to opt-in (and later still,
+	 * opt-out).
+	 */
+	if (!git_env_bool("GIT_TEST_SPARSE_INDEX", 0))
+		return 0;
+
+	if (!istate->sparse_checkout_patterns) {
+		istate->sparse_checkout_patterns = xcalloc(1, sizeof(struct pattern_list));
+		if (get_sparse_checkout_patterns(istate->sparse_checkout_patterns) < 0)
+			return 0;
+	}
+
+	if (!istate->sparse_checkout_patterns->use_cone_patterns) {
+		warning(_("attempting to use sparse-index without cone mode"));
+		return -1;
+	}
+
+	if (cache_tree_update(istate, 0)) {
+		warning(_("unable to update cache-tree, staying full"));
+		return -1;
+	}
+
+	remove_fsmonitor(istate);
+
+	trace2_region_enter("index", "convert_to_sparse", istate->repo);
+	istate->cache_nr = convert_to_sparse_rec(istate,
+						 0, 0, istate->cache_nr,
+						 "", 0, istate->cache_tree);
+	istate->drop_cache_tree = 1;
+	istate->sparse_index = 1;
+	trace2_region_leave("index", "convert_to_sparse", istate->repo);
+	return 0;
+}
 
 static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
 {
diff --git a/sparse-index.h b/sparse-index.h
index 09a20d036c46..64380e121d80 100644
--- a/sparse-index.h
+++ b/sparse-index.h
@@ -3,5 +3,6 @@
 
 struct index_state;
 void ensure_full_index(struct index_state *istate);
+int convert_to_sparse(struct index_state *istate);
 
 #endif
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index a1aea141c62c..1e888d195122 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -2,6 +2,11 @@
 
 test_description='compare full workdir to sparse workdir'
 
+# The verify_cache_tree() check is not sparse-aware (yet).
+# So, disable the check until that integration is complete.
+GIT_TEST_CHECK_CACHE_TREE=0
+GIT_TEST_SPLIT_INDEX=0
+
 . ./test-lib.sh
 
 test_expect_success 'setup' '
@@ -121,7 +126,9 @@ run_on_all () {
 test_all_match () {
 	run_on_all "$@" &&
 	test_cmp full-checkout-out sparse-checkout-out &&
-	test_cmp full-checkout-err sparse-checkout-err
+	test_cmp full-checkout-out sparse-index-out &&
+	test_cmp full-checkout-err sparse-checkout-err &&
+	test_cmp full-checkout-err sparse-index-err
 }
 
 test_sparse_match () {
@@ -130,6 +137,38 @@ test_sparse_match () {
 	test_cmp sparse-checkout-err sparse-index-err
 }
 
+test_expect_success 'sparse-index contents' '
+	init_repos &&
+
+	test-tool -C sparse-index read-cache --table >cache &&
+	for dir in folder1 folder2 x
+	do
+		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
+		grep "040000 tree $TREE	$dir/" cache \
+			|| return 1
+	done &&
+
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set folder1 &&
+
+	test-tool -C sparse-index read-cache --table >cache &&
+	for dir in deep folder2 x
+	do
+		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
+		grep "040000 tree $TREE	$dir/" cache \
+			|| return 1
+	done &&
+
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep/deeper1 &&
+
+	test-tool -C sparse-index read-cache --table >cache &&
+	for dir in deep/deeper2 folder1 folder2 x
+	do
+		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
+		grep "040000 tree $TREE	$dir/" cache \
+			|| return 1
+	done
+'
+
 test_expect_success 'expanded in-memory index matches full index' '
 	init_repos &&
 	test_sparse_match test-tool read-cache --expand --table
@@ -137,6 +176,7 @@ test_expect_success 'expanded in-memory index matches full index' '
 
 test_expect_success 'status with options' '
 	init_repos &&
+	test_sparse_match ls &&
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git status --porcelain=v2 -z -u &&
 	test_all_match git status --porcelain=v2 -uno &&
@@ -273,6 +313,17 @@ test_expect_failure 'checkout and reset (mixed)' '
 	test_all_match git reset update-folder2
 '
 
+# Ensure that sparse-index behaves identically to
+# sparse-checkout with a full index.
+test_expect_success 'checkout and reset (mixed) [sparse]' '
+	init_repos &&
+
+	test_sparse_match git checkout -b reset-test update-deep &&
+	test_sparse_match git reset deepest &&
+	test_sparse_match git reset update-folder1 &&
+	test_sparse_match git reset update-folder2
+'
+
 test_expect_success 'merge' '
 	init_repos &&
 
@@ -309,14 +360,20 @@ test_expect_success 'clean' '
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git clean -f &&
 	test_all_match git status --porcelain=v2 &&
+	test_sparse_match ls &&
+	test_sparse_match ls folder1 &&
 
 	test_all_match git clean -xf &&
 	test_all_match git status --porcelain=v2 &&
+	test_sparse_match ls &&
+	test_sparse_match ls folder1 &&
 
 	test_all_match git clean -xdf &&
 	test_all_match git status --porcelain=v2 &&
+	test_sparse_match ls &&
+	test_sparse_match ls folder1 &&
 
-	test_path_is_dir sparse-checkout/folder1
+	test_sparse_match test_path_is_dir folder1
 '
 
 test_done
-- 
gitgitgadget



* [PATCH v4 12/20] submodule: sparse-index should not collapse links
  2021-03-23 13:44     ` [PATCH v4 " Derrick Stolee via GitGitGadget
                         ` (10 preceding siblings ...)
  2021-03-23 13:44       ` [PATCH v4 11/20] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
@ 2021-03-23 13:44       ` Derrick Stolee via GitGitGadget
  2021-03-23 13:44       ` [PATCH v4 13/20] unpack-trees: allow sparse directories Derrick Stolee via GitGitGadget
                         ` (9 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-23 13:44 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

A submodule is stored as a "Git link" that actually points to a commit
within a submodule. Submodules are populated or not depending on
submodule configuration, not sparse-checkout. To ensure that the
sparse-index feature integrates correctly with submodules, we should not
collapse a directory if there is a Git link within its range.
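
For illustration, a Git link is recognizable by its mode alone (160000
in octal, as seen in the 'read-cache --table' output). The sketch below
uses its own TOY_* macros rather than the real S_ISGITLINK() so that it
stays self-contained; toy_range_has_gitlink() is a hypothetical helper.

```c
#include <stddef.h>

#define TOY_IFMT	0170000	/* file-type bits of a mode */
#define TOY_IFGITLINK	0160000	/* mode recorded for a submodule entry */
#define TOY_ISGITLINK(m) (((m) & TOY_IFMT) == TOY_IFGITLINK)

/*
 * Returns 1 if any mode in the range belongs to a Git link, in which
 * case the enclosing directory must not be collapsed.
 */
int toy_range_has_gitlink(const unsigned *modes, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (TOY_ISGITLINK(modes[i]))
			return 1;
	}
	return 0;
}
```

A range holding only regular blobs (100644, 100755) passes, while a
range holding a blob next to a 160000 entry is kept expanded.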

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 sparse-index.c                           |  1 +
 t/t1092-sparse-checkout-compatibility.sh | 17 +++++++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/sparse-index.c b/sparse-index.c
index 619ff7c2e217..7631f7bd00b7 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -52,6 +52,7 @@ static int convert_to_sparse_rec(struct index_state *istate,
 		struct cache_entry *ce = istate->cache[i];
 
 		if (ce_stage(ce) ||
+		    S_ISGITLINK(ce->ce_mode) ||
 		    !(ce->ce_flags & CE_SKIP_WORKTREE))
 			can_convert = 0;
 	}
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 1e888d195122..cba5f89b1e96 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -376,4 +376,21 @@ test_expect_success 'clean' '
 	test_sparse_match test_path_is_dir folder1
 '
 
+test_expect_success 'submodule handling' '
+	init_repos &&
+
+	test_all_match mkdir modules &&
+	test_all_match touch modules/a &&
+	test_all_match git add modules &&
+	test_all_match git commit -m "add modules directory" &&
+
+	run_on_all git submodule add "$(pwd)/initial-repo" modules/sub &&
+	test_all_match git commit -m "add submodule" &&
+
+	# having a submodule prevents "modules" from collapse
+	test-tool -C sparse-index read-cache --table >cache &&
+	grep "100644 blob .*	modules/a" cache &&
+	grep "160000 commit $(git -C initial-repo rev-parse HEAD)	modules/sub" cache
+'
+
 test_done
-- 
gitgitgadget



* [PATCH v4 13/20] unpack-trees: allow sparse directories
  2021-03-23 13:44     ` [PATCH v4 " Derrick Stolee via GitGitGadget
                         ` (11 preceding siblings ...)
  2021-03-23 13:44       ` [PATCH v4 12/20] submodule: sparse-index should not collapse links Derrick Stolee via GitGitGadget
@ 2021-03-23 13:44       ` Derrick Stolee via GitGitGadget
  2021-03-23 13:44       ` [PATCH v4 14/20] sparse-index: check index conversion happens Derrick Stolee via GitGitGadget
                         ` (8 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-23 13:44 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

index_pos_by_traverse_info() currently calls BUG() when a directory
entry is an exact match for an entry in the index. A sparse index can
legitimately contain a directory entry, as long as that entry is
itself marked with the skip-worktree bit.

The 'pos' variable is assigned a negative value if an exact match is not
found. Since a directory name can be an exact match, it is no longer an
error to have a nonnegative 'pos' value.
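
The sign convention can be illustrated with a plain binary search over
sorted names. This is a sketch, not index_name_pos() itself: it
compares whole strings with strcmp() and ignores stages, and both
toy_* functions are hypothetical.

```c
#include <string.h>

/*
 * Binary search over sorted names. Returns the position on an exact
 * match, or -(insertion point) - 1 when the name is absent.
 */
int toy_name_pos(const char **names, int nr, const char *name)
{
	int first = 0, last = nr;

	while (first < last) {
		int mid = first + (last - first) / 2;
		int cmp = strcmp(name, names[mid]);

		if (!cmp)
			return mid;
		if (cmp < 0)
			last = mid;
		else
			first = mid + 1;
	}
	return -first - 1;
}

/* Map either kind of result to where matching entries would start. */
int toy_first_candidate(int pos)
{
	return pos >= 0 ? pos : -pos - 1;
}
```

In a full index holding "a", "folder1/a", "folder1/b", "z", a lookup of
"folder1/" returns -2, which decodes to position 1. In a sparse index
holding "a", "folder1/", "z", the same lookup returns 1 directly, which
is the nonnegative case that must no longer be treated as a bug.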

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 unpack-trees.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/unpack-trees.c b/unpack-trees.c
index 4dd99219073a..0b888dab2246 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -746,9 +746,13 @@ static int index_pos_by_traverse_info(struct name_entry *names,
 	strbuf_make_traverse_path(&name, info, names->path, names->pathlen);
 	strbuf_addch(&name, '/');
 	pos = index_name_pos(o->src_index, name.buf, name.len);
-	if (pos >= 0)
-		BUG("This is a directory and should not exist in index");
-	pos = -pos - 1;
+	if (pos >= 0) {
+		if (!o->src_index->sparse_index ||
+		    !(o->src_index->cache[pos]->ce_flags & CE_SKIP_WORKTREE))
+			BUG("This is a directory and should not exist in index");
+	} else {
+		pos = -pos - 1;
+	}
 	if (pos >= o->src_index->cache_nr ||
 	    !starts_with(o->src_index->cache[pos]->name, name.buf) ||
 	    (pos > 0 && starts_with(o->src_index->cache[pos-1]->name, name.buf)))
-- 
gitgitgadget



* [PATCH v4 14/20] sparse-index: check index conversion happens
  2021-03-23 13:44     ` [PATCH v4 " Derrick Stolee via GitGitGadget
                         ` (12 preceding siblings ...)
  2021-03-23 13:44       ` [PATCH v4 13/20] unpack-trees: allow sparse directories Derrick Stolee via GitGitGadget
@ 2021-03-23 13:44       ` Derrick Stolee via GitGitGadget
  2021-03-23 13:44       ` [PATCH v4 15/20] sparse-index: create extension for compatibility Derrick Stolee via GitGitGadget
                         ` (7 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-23 13:44 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add a test case that uses test_region to ensure that we are truly
expanding a sparse index to a full one, then converting back to sparse
when writing the index. As we integrate more Git commands with the
sparse index, we will convert these commands to check that we do _not_
convert the sparse index to a full index and instead stay sparse the
entire time.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t1092-sparse-checkout-compatibility.sh | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index cba5f89b1e96..47f983217852 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -393,4 +393,22 @@ test_expect_success 'submodule handling' '
 	grep "160000 commit $(git -C initial-repo rev-parse HEAD)	modules/sub" cache
 '
 
+test_expect_success 'sparse-index is expanded and converted back' '
+	init_repos &&
+
+	(
+		GIT_TEST_SPARSE_INDEX=1 &&
+		export GIT_TEST_SPARSE_INDEX &&
+		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+			git -C sparse-index -c core.fsmonitor="" reset --hard &&
+		test_region index convert_to_sparse trace2.txt &&
+		test_region index ensure_full_index trace2.txt &&
+
+		rm trace2.txt &&
+		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+			git -C sparse-index -c core.fsmonitor="" status -uno &&
+		test_region index ensure_full_index trace2.txt
+	)
+'
+
 test_done
-- 
gitgitgadget



* [PATCH v4 15/20] sparse-index: create extension for compatibility
  2021-03-23 13:44     ` [PATCH v4 " Derrick Stolee via GitGitGadget
                         ` (13 preceding siblings ...)
  2021-03-23 13:44       ` [PATCH v4 14/20] sparse-index: check index conversion happens Derrick Stolee via GitGitGadget
@ 2021-03-23 13:44       ` Derrick Stolee via GitGitGadget
  2021-03-23 13:44       ` [PATCH v4 16/20] sparse-checkout: toggle sparse index from builtin Derrick Stolee via GitGitGadget
                         ` (6 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-23 13:44 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Previously, we enabled the sparse index format only using
GIT_TEST_SPARSE_INDEX=1. This is not a feasible direction for users to
actually select this mode. Further, sparse directory entries are not
understood by the index formats as advertised.

We _could_ add a new index version that explicitly adds these
capabilities, but there are nuances to index formats 2, 3, and 4 that
are still valuable to select as options. Until we add index format
version 5, create a repo extension, "extensions.sparseIndex", that
specifies that the tool reading this repository must understand sparse
directory entries.

This change only encodes the extension and enables it when
GIT_TEST_SPARSE_INDEX=1. Later, we will add a more user-friendly CLI
mechanism.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config/extensions.txt |  8 ++++++
 cache.h                             |  1 +
 repo-settings.c                     |  7 ++++++
 repository.h                        |  3 ++-
 setup.c                             |  3 +++
 sparse-index.c                      | 38 +++++++++++++++++++++++++----
 6 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/Documentation/config/extensions.txt b/Documentation/config/extensions.txt
index 4e23d73cdcad..c02e09af0046 100644
--- a/Documentation/config/extensions.txt
+++ b/Documentation/config/extensions.txt
@@ -6,3 +6,11 @@ extensions.objectFormat::
 Note that this setting should only be set by linkgit:git-init[1] or
 linkgit:git-clone[1].  Trying to change it after initialization will not
 work and will produce hard-to-diagnose issues.
+
+extensions.sparseIndex::
+	When combined with `core.sparseCheckout=true` and
+	`core.sparseCheckoutCone=true`, the index may contain entries
+	corresponding to directories outside of the sparse-checkout
+	definition in lieu of containing each path under such directories.
+	Versions of Git that do not understand this extension do not
+	expect directory entries in the index.
diff --git a/cache.h b/cache.h
index 74b43aaa2bd1..8aede373aeb3 100644
--- a/cache.h
+++ b/cache.h
@@ -1059,6 +1059,7 @@ struct repository_format {
 	int worktree_config;
 	int is_bare;
 	int hash_algo;
+	int sparse_index;
 	char *work_tree;
 	struct string_list unknown_extensions;
 	struct string_list v1_only_extensions;
diff --git a/repo-settings.c b/repo-settings.c
index d63569e4041e..9677d50f9238 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -85,4 +85,11 @@ void prepare_repo_settings(struct repository *r)
 	 * removed.
 	 */
 	r->settings.command_requires_full_index = 1;
+
+	/*
+	 * Initialize this as off.
+	 */
+	r->settings.sparse_index = 0;
+	if (!repo_config_get_bool(r, "extensions.sparseindex", &value) && value)
+		r->settings.sparse_index = 1;
 }
diff --git a/repository.h b/repository.h
index e06a23015697..a45f7520fd9e 100644
--- a/repository.h
+++ b/repository.h
@@ -42,7 +42,8 @@ struct repo_settings {
 
 	int core_multi_pack_index;
 
-	unsigned command_requires_full_index:1;
+	unsigned command_requires_full_index:1,
+		 sparse_index:1;
 };
 
 struct repository {
diff --git a/setup.c b/setup.c
index c04cd25a30df..cd8394564613 100644
--- a/setup.c
+++ b/setup.c
@@ -500,6 +500,9 @@ static enum extension_result handle_extension(const char *var,
 			return error("invalid value for 'extensions.objectformat'");
 		data->hash_algo = format;
 		return EXTENSION_OK;
+	} else if (!strcmp(ext, "sparseindex")) {
+		data->sparse_index = 1;
+		return EXTENSION_OK;
 	}
 	return EXTENSION_UNKNOWN;
 }
diff --git a/sparse-index.c b/sparse-index.c
index 7631f7bd00b7..3a6df66faeab 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -102,19 +102,47 @@ static int convert_to_sparse_rec(struct index_state *istate,
 	return num_converted - start_converted;
 }
 
+static int enable_sparse_index(struct repository *repo)
+{
+	const char *config_path = repo_git_path(repo, "config.worktree");
+
+	if (upgrade_repository_format(1) < 0) {
+		warning(_("unable to upgrade repository format to enable sparse-index"));
+		return -1;
+	}
+	git_config_set_in_file_gently(config_path,
+				      "extensions.sparseIndex",
+				      "true");
+
+	prepare_repo_settings(repo);
+	repo->settings.sparse_index = 1;
+	return 0;
+}
+
 int convert_to_sparse(struct index_state *istate)
 {
 	if (istate->split_index || istate->sparse_index ||
 	    !core_apply_sparse_checkout || !core_sparse_checkout_cone)
 		return 0;
 
+	if (!istate->repo)
+		istate->repo = the_repository;
+
+	/*
+	 * The GIT_TEST_SPARSE_INDEX environment variable triggers the
+	 * extensions.sparseIndex config variable to be on.
+	 */
+	if (git_env_bool("GIT_TEST_SPARSE_INDEX", 0)) {
+		int err = enable_sparse_index(istate->repo);
+		if (err < 0)
+			return err;
+	}
+
 	/*
-	 * For now, only create a sparse index with the
-	 * GIT_TEST_SPARSE_INDEX environment variable. We will relax
-	 * this once we have a proper way to opt-in (and later still,
-	 * opt-out).
+	 * Only convert to sparse if extensions.sparseIndex is set.
 	 */
-	if (!git_env_bool("GIT_TEST_SPARSE_INDEX", 0))
+	prepare_repo_settings(istate->repo);
+	if (!istate->repo->settings.sparse_index)
 		return 0;
 
 	if (!istate->sparse_checkout_patterns) {
-- 
gitgitgadget



* [PATCH v4 16/20] sparse-checkout: toggle sparse index from builtin
  2021-03-23 13:44     ` [PATCH v4 " Derrick Stolee via GitGitGadget
                         ` (14 preceding siblings ...)
  2021-03-23 13:44       ` [PATCH v4 15/20] sparse-index: create extension for compatibility Derrick Stolee via GitGitGadget
@ 2021-03-23 13:44       ` Derrick Stolee via GitGitGadget
  2021-03-23 13:44       ` [PATCH v4 17/20] sparse-checkout: disable sparse-index Derrick Stolee via GitGitGadget
                         ` (5 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-23 13:44 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The sparse index extension is used to signal that index writes should be
in sparse mode. This was only updated using GIT_TEST_SPARSE_INDEX=1.

Add a '--[no-]sparse-index' option to 'git sparse-checkout init' that
specifies if the sparse index should be used. It also updates the index
to use the correct format, either way. Add a warning in the
documentation that the use of a repository extension might reduce
compatibility with third-party tools. 'git sparse-checkout init' already
sets extensions.worktreeConfig, which places most sparse-checkout users
outside of the scope of most third-party tools.
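
As a rough illustration of the tri-state handling behind a
'--[no-]sparse-index' style option (unset, off, on), here is a minimal
sketch. It does not use Git's parse_options API; the names
sketch_parse_sparse_index_flag() and sketch_should_update_config() are
hypothetical and exist only for this example. The idea it tries to
capture is that the value starts at -1 and the config is only touched
when the user stated a preference.

```c
#include <string.h>

/* Hypothetical tri-state values; the patch stores -1/0/1 in a plain int. */
enum sketch_tri_state {
	SKETCH_UNSET = -1,
	SKETCH_OFF = 0,
	SKETCH_ON = 1
};

/*
 * Scan argv for "--sparse-index" or "--no-sparse-index".
 * Returns SKETCH_UNSET if neither appears; otherwise the last
 * occurrence wins.
 */
int sketch_parse_sparse_index_flag(int argc, const char **argv)
{
	int value = SKETCH_UNSET;
	int i;

	for (i = 0; i < argc; i++) {
		if (!strcmp(argv[i], "--sparse-index"))
			value = SKETCH_ON;
		else if (!strcmp(argv[i], "--no-sparse-index"))
			value = SKETCH_OFF;
	}
	return value;
}

/* Only modify the config when the user expressed a preference. */
int sketch_should_update_config(int value)
{
	return value >= 0;
}
```

With this shape, a plain 'init --cone' leaves the existing setting
alone, while either form of the flag forces a config update.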

Update t1092-sparse-checkout-compatibility.sh to use this CLI instead of
GIT_TEST_SPARSE_INDEX=1.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-sparse-checkout.txt    | 14 +++++++
 builtin/sparse-checkout.c                | 17 ++++++++-
 sparse-index.c                           | 37 +++++++++++++------
 sparse-index.h                           |  3 ++
 t/t1092-sparse-checkout-compatibility.sh | 47 +++++++++++++-----------
 5 files changed, 84 insertions(+), 34 deletions(-)

diff --git a/Documentation/git-sparse-checkout.txt b/Documentation/git-sparse-checkout.txt
index a0eeaeb02ee3..2ff66c5a4e41 100644
--- a/Documentation/git-sparse-checkout.txt
+++ b/Documentation/git-sparse-checkout.txt
@@ -45,6 +45,20 @@ To avoid interfering with other worktrees, it first enables the
 When `--cone` is provided, the `core.sparseCheckoutCone` setting is
 also set, allowing for better performance with a limited set of
 patterns (see 'CONE PATTERN SET' below).
++
+Use the `--[no-]sparse-index` option to toggle the use of the sparse
+index format. This reduces the size of the index to be more closely
+aligned with your sparse-checkout definition. This can have significant
+performance advantages for commands such as `git status` or `git add`.
+This feature is still experimental. Some commands might be slower with
+a sparse index until they are properly integrated with the feature.
++
+**WARNING:** Using a sparse index requires modifying the index in a way
+that is not completely understood by external tools. If you have trouble
+with this compatibility, then run `git sparse-checkout init --no-sparse-index`
+to rewrite your index to not be sparse. Older versions of Git will not
+understand the `sparseIndex` repository extension and may fail to interact
+with your repository until it is disabled.
 
 'set'::
 	Write a set of patterns to the sparse-checkout file, as given as
diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
index e00b82af727b..ca63e2c64e95 100644
--- a/builtin/sparse-checkout.c
+++ b/builtin/sparse-checkout.c
@@ -14,6 +14,7 @@
 #include "unpack-trees.h"
 #include "wt-status.h"
 #include "quote.h"
+#include "sparse-index.h"
 
 static const char *empty_base = "";
 
@@ -283,12 +284,13 @@ static int set_config(enum sparse_checkout_mode mode)
 }
 
 static char const * const builtin_sparse_checkout_init_usage[] = {
-	N_("git sparse-checkout init [--cone]"),
+	N_("git sparse-checkout init [--cone] [--[no-]sparse-index]"),
 	NULL
 };
 
 static struct sparse_checkout_init_opts {
 	int cone_mode;
+	int sparse_index;
 } init_opts;
 
 static int sparse_checkout_init(int argc, const char **argv)
@@ -303,11 +305,15 @@ static int sparse_checkout_init(int argc, const char **argv)
 	static struct option builtin_sparse_checkout_init_options[] = {
 		OPT_BOOL(0, "cone", &init_opts.cone_mode,
 			 N_("initialize the sparse-checkout in cone mode")),
+		OPT_BOOL(0, "sparse-index", &init_opts.sparse_index,
+			 N_("toggle the use of a sparse index")),
 		OPT_END(),
 	};
 
 	repo_read_index(the_repository);
 
+	init_opts.sparse_index = -1;
+
 	argc = parse_options(argc, argv, NULL,
 			     builtin_sparse_checkout_init_options,
 			     builtin_sparse_checkout_init_usage, 0);
@@ -326,6 +332,15 @@ static int sparse_checkout_init(int argc, const char **argv)
 	sparse_filename = get_sparse_checkout_filename();
 	res = add_patterns_from_file_to_list(sparse_filename, "", 0, &pl, NULL);
 
+	if (init_opts.sparse_index >= 0) {
+		if (set_sparse_index_config(the_repository, init_opts.sparse_index) < 0)
+			die(_("failed to modify sparse-index config"));
+
+		/* force an index rewrite */
+		repo_read_index(the_repository);
+		the_repository->index->updated_workdir = 1;
+	}
+
 	/* If we already have a sparse-checkout file, use it. */
 	if (res >= 0) {
 		free(sparse_filename);
diff --git a/sparse-index.c b/sparse-index.c
index 3a6df66faeab..30c1a11fd62d 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -104,23 +104,37 @@ static int convert_to_sparse_rec(struct index_state *istate,
 
 static int enable_sparse_index(struct repository *repo)
 {
-	const char *config_path = repo_git_path(repo, "config.worktree");
+	int res;
 
 	if (upgrade_repository_format(1) < 0) {
 		warning(_("unable to upgrade repository format to enable sparse-index"));
 		return -1;
 	}
-	git_config_set_in_file_gently(config_path,
-				      "extensions.sparseIndex",
-				      "true");
+	res = git_config_set_gently("extensions.sparseindex", "true");
 
 	prepare_repo_settings(repo);
 	repo->settings.sparse_index = 1;
-	return 0;
+	return res;
+}
+
+int set_sparse_index_config(struct repository *repo, int enable)
+{
+	int res;
+
+	if (enable)
+		return enable_sparse_index(repo);
+
+	/* Don't downgrade repository format, just remove the extension. */
+	res = git_config_set_gently("extensions.sparseindex", NULL);
+
+	prepare_repo_settings(repo);
+	repo->settings.sparse_index = 0;
+	return res;
 }
 
 int convert_to_sparse(struct index_state *istate)
 {
+	int test_env;
 	if (istate->split_index || istate->sparse_index ||
 	    !core_apply_sparse_checkout || !core_sparse_checkout_cone)
 		return 0;
@@ -129,14 +143,13 @@ int convert_to_sparse(struct index_state *istate)
 		istate->repo = the_repository;
 
 	/*
-	 * The GIT_TEST_SPARSE_INDEX environment variable triggers the
-	 * extensions.sparseIndex config variable to be on.
+	 * If GIT_TEST_SPARSE_INDEX=1, then trigger extensions.sparseIndex
+	 * to be fully enabled. If GIT_TEST_SPARSE_INDEX=0 (set explicitly),
+	 * then purposefully disable the setting.
 	 */
-	if (git_env_bool("GIT_TEST_SPARSE_INDEX", 0)) {
-		int err = enable_sparse_index(istate->repo);
-		if (err < 0)
-			return err;
-	}
+	test_env = git_env_bool("GIT_TEST_SPARSE_INDEX", -1);
+	if (test_env >= 0)
+		set_sparse_index_config(istate->repo, test_env);
 
 	/*
 	 * Only convert to sparse if extensions.sparseIndex is set.
diff --git a/sparse-index.h b/sparse-index.h
index 64380e121d80..39dcc859735e 100644
--- a/sparse-index.h
+++ b/sparse-index.h
@@ -5,4 +5,7 @@ struct index_state;
 void ensure_full_index(struct index_state *istate);
 int convert_to_sparse(struct index_state *istate);
 
+struct repository;
+int set_sparse_index_config(struct repository *repo, int enable);
+
 #endif
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 47f983217852..f14dc48924d2 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -6,6 +6,7 @@ test_description='compare full workdir to sparse workdir'
 # So, disable the check until that integration is complete.
 GIT_TEST_CHECK_CACHE_TREE=0
 GIT_TEST_SPLIT_INDEX=0
+GIT_TEST_SPARSE_INDEX=
 
 . ./test-lib.sh
 
@@ -100,25 +101,26 @@ init_repos () {
 	# initialize sparse-checkout definitions
 	git -C sparse-checkout sparse-checkout init --cone &&
 	git -C sparse-checkout sparse-checkout set deep &&
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout init --cone &&
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep
+	git -C sparse-index sparse-checkout init --cone --sparse-index &&
+	test_cmp_config -C sparse-index true extensions.sparseindex &&
+	git -C sparse-index sparse-checkout set deep
 }
 
 run_on_sparse () {
 	(
 		cd sparse-checkout &&
-		GIT_TEST_SPARSE_INDEX=0 "$@" >../sparse-checkout-out 2>../sparse-checkout-err
+		"$@" >../sparse-checkout-out 2>../sparse-checkout-err
 	) &&
 	(
 		cd sparse-index &&
-		GIT_TEST_SPARSE_INDEX=1 "$@" >../sparse-index-out 2>../sparse-index-err
+		"$@" >../sparse-index-out 2>../sparse-index-err
 	)
 }
 
 run_on_all () {
 	(
 		cd full-checkout &&
-		GIT_TEST_SPARSE_INDEX=0 "$@" >../full-checkout-out 2>../full-checkout-err
+		"$@" >../full-checkout-out 2>../full-checkout-err
 	) &&
 	run_on_sparse "$@"
 }
@@ -148,7 +150,7 @@ test_expect_success 'sparse-index contents' '
 			|| return 1
 	done &&
 
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set folder1 &&
+	git -C sparse-index sparse-checkout set folder1 &&
 
 	test-tool -C sparse-index read-cache --table >cache &&
 	for dir in deep folder2 x
@@ -158,7 +160,7 @@ test_expect_success 'sparse-index contents' '
 			|| return 1
 	done &&
 
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep/deeper1 &&
+	git -C sparse-index sparse-checkout set deep/deeper1 &&
 
 	test-tool -C sparse-index read-cache --table >cache &&
 	for dir in deep/deeper2 folder1 folder2 x
@@ -166,7 +168,14 @@ test_expect_success 'sparse-index contents' '
 		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
 		grep "040000 tree $TREE	$dir/" cache \
 			|| return 1
-	done
+	done &&
+
+	# Disabling the sparse-index removes tree entries with full ones
+	git -C sparse-index sparse-checkout init --no-sparse-index &&
+
+	test-tool -C sparse-index read-cache --table >cache &&
+	! grep "040000 tree" cache &&
+	test_sparse_match test-tool read-cache --table
 '
 
 test_expect_success 'expanded in-memory index matches full index' '
@@ -396,19 +405,15 @@ test_expect_success 'submodule handling' '
 test_expect_success 'sparse-index is expanded and converted back' '
 	init_repos &&
 
-	(
-		GIT_TEST_SPARSE_INDEX=1 &&
-		export GIT_TEST_SPARSE_INDEX &&
-		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
-			git -C sparse-index -c core.fsmonitor="" reset --hard &&
-		test_region index convert_to_sparse trace2.txt &&
-		test_region index ensure_full_index trace2.txt &&
-
-		rm trace2.txt &&
-		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
-			git -C sparse-index -c core.fsmonitor="" status -uno &&
-		test_region index ensure_full_index trace2.txt
-	)
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+		git -C sparse-index -c core.fsmonitor="" reset --hard &&
+	test_region index convert_to_sparse trace2.txt &&
+	test_region index ensure_full_index trace2.txt &&
+
+	rm trace2.txt &&
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+		git -C sparse-index -c core.fsmonitor="" status -uno &&
+	test_region index ensure_full_index trace2.txt
 '
 
 test_done
-- 
gitgitgadget



* [PATCH v4 17/20] sparse-checkout: disable sparse-index
  2021-03-23 13:44     ` [PATCH v4 " Derrick Stolee via GitGitGadget
                         ` (15 preceding siblings ...)
  2021-03-23 13:44       ` [PATCH v4 16/20] sparse-checkout: toggle sparse index from builtin Derrick Stolee via GitGitGadget
@ 2021-03-23 13:44       ` Derrick Stolee via GitGitGadget
  2021-03-23 13:44       ` [PATCH v4 18/20] cache-tree: integrate with sparse directory entries Derrick Stolee via GitGitGadget
                         ` (4 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-23 13:44 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

We use 'git sparse-checkout init --cone --sparse-index' to toggle the
sparse-index feature. It makes sense to also disable it when running
'git sparse-checkout disable'. This is particularly important because it
removes the extensions.sparseIndex config option, allowing other tools
to use this Git repository again.

This does mean that 'git sparse-checkout init' will not re-enable the
sparse-index feature, even if it was previously enabled.

While testing this feature, I noticed that the sparse-index was not
being written on the first run, but only on a second. This was caught by the
call to 'test-tool read-cache --table'. This requires adjusting some
assignments to core_apply_sparse_checkout and pl.use_cone_patterns in
the sparse_checkout_init() logic.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/sparse-checkout.c          | 10 +++++++++-
 t/t1091-sparse-checkout-builtin.sh | 13 +++++++++++++
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
index ca63e2c64e95..585343fa1972 100644
--- a/builtin/sparse-checkout.c
+++ b/builtin/sparse-checkout.c
@@ -280,6 +280,9 @@ static int set_config(enum sparse_checkout_mode mode)
 				      "core.sparseCheckoutCone",
 				      mode == MODE_CONE_PATTERNS ? "true" : NULL);
 
+	if (mode == MODE_NO_PATTERNS)
+		set_sparse_index_config(the_repository, 0);
+
 	return 0;
 }
 
@@ -341,10 +344,11 @@ static int sparse_checkout_init(int argc, const char **argv)
 		the_repository->index->updated_workdir = 1;
 	}
 
+	core_apply_sparse_checkout = 1;
+
 	/* If we already have a sparse-checkout file, use it. */
 	if (res >= 0) {
 		free(sparse_filename);
-		core_apply_sparse_checkout = 1;
 		return update_working_directory(NULL);
 	}
 
@@ -366,6 +370,7 @@ static int sparse_checkout_init(int argc, const char **argv)
 	add_pattern(strbuf_detach(&pattern, NULL), empty_base, 0, &pl, 0);
 	strbuf_addstr(&pattern, "!/*/");
 	add_pattern(strbuf_detach(&pattern, NULL), empty_base, 0, &pl, 0);
+	pl.use_cone_patterns = init_opts.cone_mode;
 
 	return write_patterns_and_update(&pl);
 }
@@ -632,6 +637,9 @@ static int sparse_checkout_disable(int argc, const char **argv)
 	strbuf_addstr(&match_all, "/*");
 	add_pattern(strbuf_detach(&match_all, NULL), empty_base, 0, &pl, 0);
 
+	prepare_repo_settings(the_repository);
+	the_repository->settings.sparse_index = 0;
+
 	if (update_working_directory(&pl))
 		die(_("error while refreshing working directory"));
 
diff --git a/t/t1091-sparse-checkout-builtin.sh b/t/t1091-sparse-checkout-builtin.sh
index fc64e9ed99f4..ff1ad570a255 100755
--- a/t/t1091-sparse-checkout-builtin.sh
+++ b/t/t1091-sparse-checkout-builtin.sh
@@ -205,6 +205,19 @@ test_expect_success 'sparse-checkout disable' '
 	check_files repo a deep folder1 folder2
 '
 
+test_expect_success 'sparse-index enabled and disabled' '
+	git -C repo sparse-checkout init --cone --sparse-index &&
+	test_cmp_config -C repo true extensions.sparseIndex &&
+	test-tool -C repo read-cache --table >cache &&
+	grep " tree " cache &&
+
+	git -C repo sparse-checkout disable &&
+	test-tool -C repo read-cache --table >cache &&
+	! grep " tree " cache &&
+	git -C repo config --list >config &&
+	! grep extensions.sparseindex config
+'
+
 test_expect_success 'cone mode: init and set' '
 	git -C repo sparse-checkout init --cone &&
 	git -C repo config --list >config &&
-- 
gitgitgadget



* [PATCH v4 18/20] cache-tree: integrate with sparse directory entries
  2021-03-23 13:44     ` [PATCH v4 " Derrick Stolee via GitGitGadget
                         ` (16 preceding siblings ...)
  2021-03-23 13:44       ` [PATCH v4 17/20] sparse-checkout: disable sparse-index Derrick Stolee via GitGitGadget
@ 2021-03-23 13:44       ` Derrick Stolee via GitGitGadget
  2021-03-23 13:44       ` [PATCH v4 19/20] sparse-index: loose integration with cache_tree_verify() Derrick Stolee via GitGitGadget
                         ` (3 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-23 13:44 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The cache-tree extension was previously disabled with sparse indexes.
However, the cache-tree is an important performance feature for commands
like 'git status' and 'git add'. Integrate it with sparse directory
entries.

When writing a sparse index, completely clear and recalculate the cache
tree. By starting from scratch, the only integration necessary is to
check if we hit a sparse directory entry and create a leaf of the
cache-tree that has an entry_count of one and no subtrees.
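
The leaf check described above can be sketched in isolation as follows.
This is a simplified illustration, not the code in cache-tree.c: the
struct and function names are hypothetical, and it assumes (as the
"040000 tree" lines in the test output suggest) that a sparse directory
entry carries exactly the directory mode and a name ending in '/'.

```c
#include <string.h>

/* Same numeric value as the POSIX directory mode bit pattern. */
#define SKETCH_S_IFDIR 0040000u

/* Simplified, hypothetical stand-in for an index entry. */
struct sketch_entry {
	unsigned int mode;
	const char *name;	/* sparse directories end in '/' */
};

/* Assumption: a sparse directory entry has exactly the directory mode. */
int sketch_is_sparse_dir(const struct sketch_entry *e)
{
	return e->mode == SKETCH_S_IFDIR;
}

/*
 * Return 1 when the first entry of a region is a sparse directory
 * whose name is exactly 'base'. In that case the cache-tree node
 * would be a leaf with an entry count of one and no subtrees.
 */
int sketch_is_leaf(const struct sketch_entry *first, const char *base)
{
	size_t baselen = strlen(base);

	return sketch_is_sparse_dir(first) &&
	       strlen(first->name) == baselen &&
	       !strncmp(first->name, base, baselen);
}
```

A regular file under the same directory, or a sparse directory at a
different path, does not satisfy the check and falls through to the
normal recursion.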

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache-tree.c   | 18 ++++++++++++++++++
 sparse-index.c | 10 +++++++++-
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/cache-tree.c b/cache-tree.c
index 5f07a39e501e..950a9615db8f 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -256,6 +256,24 @@ static int update_one(struct cache_tree *it,
 
 	*skip_count = 0;
 
+	/*
+	 * If the first entry of this region is a sparse directory
+	 * entry corresponding exactly to 'base', then this cache_tree
+	 * struct is a "leaf" in the data structure, pointing to the
+	 * tree OID specified in the entry.
+	 */
+	if (entries > 0) {
+		const struct cache_entry *ce = cache[0];
+
+		if (S_ISSPARSEDIR(ce->ce_mode) &&
+		    ce->ce_namelen == baselen &&
+		    !strncmp(ce->name, base, baselen)) {
+			it->entry_count = 1;
+			oidcpy(&it->oid, &ce->oid);
+			return 1;
+		}
+	}
+
 	if (0 <= it->entry_count && has_object_file(&it->oid))
 		return it->entry_count;
 
diff --git a/sparse-index.c b/sparse-index.c
index 30c1a11fd62d..56313e805d9d 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -180,7 +180,11 @@ int convert_to_sparse(struct index_state *istate)
 	istate->cache_nr = convert_to_sparse_rec(istate,
 						 0, 0, istate->cache_nr,
 						 "", 0, istate->cache_tree);
-	istate->drop_cache_tree = 1;
+
+	/* Clear and recompute the cache-tree */
+	cache_tree_free(&istate->cache_tree);
+	cache_tree_update(istate, 0);
+
 	istate->sparse_index = 1;
 	trace2_region_leave("index", "convert_to_sparse", istate->repo);
 	return 0;
@@ -281,5 +285,9 @@ void ensure_full_index(struct index_state *istate)
 	strbuf_release(&base);
 	free(full);
 
+	/* Clear and recompute the cache-tree */
+	cache_tree_free(&istate->cache_tree);
+	cache_tree_update(istate, 0);
+
 	trace2_region_leave("index", "ensure_full_index", istate->repo);
 }
-- 
gitgitgadget



* [PATCH v4 19/20] sparse-index: loose integration with cache_tree_verify()
  2021-03-23 13:44     ` [PATCH v4 " Derrick Stolee via GitGitGadget
                         ` (17 preceding siblings ...)
  2021-03-23 13:44       ` [PATCH v4 18/20] cache-tree: integrate with sparse directory entries Derrick Stolee via GitGitGadget
@ 2021-03-23 13:44       ` Derrick Stolee via GitGitGadget
  2021-03-23 13:44       ` [PATCH v4 20/20] p2000: add sparse-index repos Derrick Stolee via GitGitGadget
                         ` (2 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-23 13:44 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The cache_tree_verify() method is run when GIT_TEST_CHECK_CACHE_TREE
is enabled, which it is by default in the test suite. The logic must
be adjusted for the presence of these directory entries.

For now, leave the test as a simple check for whether the directory
entry is sparse. Do not go any further until needed.
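
The check relies on the lookup convention that a non-negative position
means the path itself is present as an entry, while a negative result
encodes where it would be inserted. A minimal sketch of that
convention, using a hypothetical sketch_name_pos() over a plain sorted
string array (Git's real comparison also accounts for name length and
stage, which is ignored here):

```c
#include <string.h>

/*
 * Binary search over a sorted array of names.
 * Returns the position when 'name' is found. Otherwise returns
 * -(insertion point) - 1, so the caller can recover the insertion
 * point with -pos - 1.
 */
int sketch_name_pos(const char **names, int nr, const char *name)
{
	int lo = 0, hi = nr;

	while (lo < hi) {
		int mid = lo + (hi - lo) / 2;
		int cmp = strcmp(name, names[mid]);

		if (!cmp)
			return mid;
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return -lo - 1;
}
```

In a full index a directory path such as "folder1/" never matches an
entry exactly, so the result is negative. In a sparse index it can
match, and that is the case where the entry is expected to be a sparse
directory.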

This allows us to re-enable GIT_TEST_CHECK_CACHE_TREE in
t1092-sparse-checkout-compatibility.sh. Further,
p2000-sparse-operations.sh uses the test suite and hence this is enabled
for all tests. We need to integrate with it before we run our
performance tests with a sparse-index.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache-tree.c                             | 19 +++++++++++++++++++
 t/t1092-sparse-checkout-compatibility.sh |  3 ---
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/cache-tree.c b/cache-tree.c
index 950a9615db8f..11bf1fcae6e1 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -808,6 +808,19 @@ int cache_tree_matches_traversal(struct cache_tree *root,
 	return 0;
 }
 
+static void verify_one_sparse(struct repository *r,
+			      struct index_state *istate,
+			      struct cache_tree *it,
+			      struct strbuf *path,
+			      int pos)
+{
+	struct cache_entry *ce = istate->cache[pos];
+
+	if (!S_ISSPARSEDIR(ce->ce_mode))
+		BUG("directory '%s' is present in index, but not sparse",
+		    path->buf);
+}
+
 static void verify_one(struct repository *r,
 		       struct index_state *istate,
 		       struct cache_tree *it,
@@ -830,6 +843,12 @@ static void verify_one(struct repository *r,
 
 	if (path->len) {
 		pos = index_name_pos(istate, path->buf, path->len);
+
+		if (pos >= 0) {
+			verify_one_sparse(r, istate, it, path, pos);
+			return;
+		}
+
 		pos = -pos - 1;
 	} else {
 		pos = 0;
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index f14dc48924d2..d97bf9b64527 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -2,9 +2,6 @@
 
 test_description='compare full workdir to sparse workdir'
 
-# The verify_cache_tree() check is not sparse-aware (yet).
-# So, disable the check until that integration is complete.
-GIT_TEST_CHECK_CACHE_TREE=0
 GIT_TEST_SPLIT_INDEX=0
 GIT_TEST_SPARSE_INDEX=
 
-- 
gitgitgadget



* [PATCH v4 20/20] p2000: add sparse-index repos
  2021-03-23 13:44     ` [PATCH v4 " Derrick Stolee via GitGitGadget
                         ` (18 preceding siblings ...)
  2021-03-23 13:44       ` [PATCH v4 19/20] sparse-index: loose integration with cache_tree_verify() Derrick Stolee via GitGitGadget
@ 2021-03-23 13:44       ` Derrick Stolee via GitGitGadget
  2021-03-23 16:16       ` [PATCH v4 00/20] Sparse Index: Design, Format, Tests Elijah Newren
  2021-03-30 13:10       ` [PATCH v5 00/21] " Derrick Stolee via GitGitGadget
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-23 13:44 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

p2000-sparse-operations.sh compares different Git commands in
repositories with many files at HEAD but using sparse-checkout to focus
on a small portion of those files.

Add extra copies of the repository that use the sparse-index format so
we can track how that affects the performance of different commands.

At this point in time, the sparse-index is pure overhead in terms of CPU
time, and this is measurable in these tests:

Test
---------------------------------------------------------------
2000.2: git status (full-index-v3)              0.59(0.51+0.12)
2000.3: git status (full-index-v4)              0.59(0.52+0.11)
2000.4: git status (sparse-index-v3)            1.40(1.32+0.12)
2000.5: git status (sparse-index-v4)            1.41(1.36+0.08)
2000.6: git add -A (full-index-v3)              2.32(1.97+0.19)
2000.7: git add -A (full-index-v4)              2.17(1.92+0.14)
2000.8: git add -A (sparse-index-v3)            2.31(2.21+0.15)
2000.9: git add -A (sparse-index-v4)            2.30(2.20+0.13)
2000.10: git add . (full-index-v3)              2.39(2.02+0.20)
2000.11: git add . (full-index-v4)              2.20(1.94+0.16)
2000.12: git add . (sparse-index-v3)            2.36(2.27+0.12)
2000.13: git add . (sparse-index-v4)            2.33(2.21+0.16)
2000.14: git commit -a -m A (full-index-v3)     2.47(2.12+0.20)
2000.15: git commit -a -m A (full-index-v4)     2.26(2.00+0.17)
2000.16: git commit -a -m A (sparse-index-v3)   3.01(2.92+0.16)
2000.17: git commit -a -m A (sparse-index-v4)   3.01(2.94+0.15)

Note that there is very little difference between the v3 and v4 index
formats when the sparse-index is enabled. This is primarily due to the
fact that the relative file sizes are the same, and the command time is
mostly taken up by parsing tree objects to expand the sparse index into
a full one.

With the current file layout, the index file sizes are given by this
table:

       |  full index | sparse index |
       +-------------+--------------+
    v3 |     108 MiB |      1.6 MiB |
    v4 |      80 MiB |      1.2 MiB |
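
To make the "relative file sizes are the same" remark concrete, the
arithmetic can be worked from the table: the sparse index is roughly
67 times smaller in both formats, and v4 is about three quarters the
size of v3 whether or not the index is sparse. A small sketch of that
calculation follows; the struct and function names are hypothetical,
and the numbers are copied from the table above.

```c
/* One row of the size table, in MiB. */
struct sketch_index_sizes {
	const char *version;
	double full_mib;
	double sparse_mib;
};

/* Values taken from the table in this message. */
const struct sketch_index_sizes sketch_sizes[] = {
	{ "v3", 108.0, 1.6 },
	{ "v4",  80.0, 1.2 },
};

/* How many times smaller the sparse index is than the full index. */
double sketch_shrink_factor(const struct sketch_index_sizes *s)
{
	return s->full_mib / s->sparse_mib;
}

/* Size of v4 relative to v3 for full indexes (about 0.74). */
double sketch_v4_over_v3_full(void)
{
	return sketch_sizes[1].full_mib / sketch_sizes[0].full_mib;
}

/* Size of v4 relative to v3 for sparse indexes (about 0.75). */
double sketch_v4_over_v3_sparse(void)
{
	return sketch_sizes[1].sparse_mib / sketch_sizes[0].sparse_mib;
}
```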

Future updates will improve the performance of Git commands when the
index is sparse.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/perf/p2000-sparse-operations.sh | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
index dddd527b6330..94513c977489 100755
--- a/t/perf/p2000-sparse-operations.sh
+++ b/t/perf/p2000-sparse-operations.sh
@@ -59,12 +59,29 @@ test_expect_success 'setup repo and indexes' '
 		git sparse-checkout set $SPARSE_CONE &&
 		git config index.version 4 &&
 		git update-index --index-version=4
+	) &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . sparse-index-v3 &&
+	(
+		cd sparse-index-v3 &&
+		git sparse-checkout init --cone --sparse-index &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 3 &&
+		git update-index --index-version=3
+	) &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . sparse-index-v4 &&
+	(
+		cd sparse-index-v4 &&
+		git sparse-checkout init --cone --sparse-index &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 4 &&
+		git update-index --index-version=4
 	)
 '
 
 test_perf_on_all () {
 	command="$@"
-	for repo in full-index-v3 full-index-v4
+	for repo in full-index-v3 full-index-v4 \
+		    sparse-index-v3 sparse-index-v4
 	do
 		test_perf "$command ($repo)" "
 			(
-- 
gitgitgadget


* Re: [PATCH v4 00/20] Sparse Index: Design, Format, Tests
  2021-03-23 13:44     ` [PATCH v4 " Derrick Stolee via GitGitGadget
                         ` (19 preceding siblings ...)
  2021-03-23 13:44       ` [PATCH v4 20/20] p2000: add sparse-index repos Derrick Stolee via GitGitGadget
@ 2021-03-23 16:16       ` Elijah Newren
  2021-03-30 13:10       ` [PATCH v5 00/21] " Derrick Stolee via GitGitGadget
  21 siblings, 0 replies; 203+ messages in thread
From: Elijah Newren @ 2021-03-23 16:16 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

On Tue, Mar 23, 2021 at 6:44 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> Here is the first full patch series submission coming out of the
> sparse-index RFC [1].
>
...
>
> Updates in V4
> =============
>
>  * Rebased onto the latest copy of ab/read-tree.
>  * Updated the design document as per Junio's comments.
>  * Updated the submodule handling in the performance test.
>  * Followed up on some other review from Ævar, mostly style or commit
>    message things.

Range-diff looks good to me; my Reviewed-by still holds.


* Re: [PATCH v3 01/20] sparse-index: design doc and format update
  2021-03-23 11:16         ` Derrick Stolee
@ 2021-03-23 20:10           ` Junio C Hamano
  2021-03-23 20:42             ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Junio C Hamano @ 2021-03-23 20:10 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, newren, pclouds, jrnieder,
	Martin Ågren, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

>>> +Three important scale dimensions for a Git worktree are:
>> 
>> s/worktree/working tree/; The former is the thing the "git worktree"
>> command deals with.  The latter is relevant even when "git worktree"
>> is not used (the traditional "git clone and you get a working tree
>> to work in").
>
> I guess I'm distracted by using SKIP_WORKTREE a lot, but "working
> directory" is more specific and hence better.

Since the user's current working directory can be outside any
working tree that is governed by any git repository, "working
directory" is a term I try to avoid when describing the directory
where a checkout of a revision lives.

Documentation/glossary-content.txt is where the suggestion for
"working tree" comes from.

> I could rearrange things here. The important things to note are:
>
> 1. Updating index entries is very fast, but adds up at large scale.

This is the "checkout to match the index to the tree of HEAD" part,
ignoring the cost of writing working tree files out?

> 2. It is faster to write a file to disk from Git's object database
>    than it is to compare a file on disk to the copy in the database,
>    which is frequently necessary when the mtime on disk doesn't match
>    the mtime in the index.

True.  But of course, not having to do either (i.e. having a fresh
cached stat info) would be even faster ;-).

>> Also it
>> is unclear what you mean by "changing HEAD only require updating the
>> index".  Certainly when "git switch" flips HEAD from one commit to
>> another, you'd update the index and update the files in the working
>> tree (in the Populated part that is in the sparse-checkout cone) to
>> match, no?
>
> This was unclear of me. I was thinking more along the lines of "git reset"
> (soft mode) which updates HEAD without changing the files on disk.

OK, and that is in line with your "updating index entries is very
fast (but adds up)".

> After all of this postulating, I think that the offending sentences
> are better off deleted. They don't add clarity over what can be
> inferred by an interested reader.

OK.

> I'm mixing terms incorrectly. I think what I really mean is
>
>   In fact, these loops expect to see a reference to every
>   staged file.

OK.

>  The plan is to make all of these integrations "sparse aware" so
>  this expansion through tree parsing is unnecessary and they use
>  fewer resources than when using a full index.

;-)

> What I meant by "serialized index file" is that the file written to disk
> has the sparse directory entries, but the in-core copy will not (except
> for a very brief moment in time, during do_read_index()).

Nice.  That would probably mean cache-tree extension on-disk can go
away, because we can populate in-core cache-tree from these entries.
I've always hated the on-disk encoding of that extension.

Or we are not doing this "extra tree" everywhere (i.e. limited only
to the parts that are marked for "sparse checkout")?

Thanks.


* Re: [PATCH v3 01/20] sparse-index: design doc and format update
  2021-03-23 20:10           ` Junio C Hamano
@ 2021-03-23 20:42             ` Derrick Stolee
  0 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee @ 2021-03-23 20:42 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, git, newren, pclouds, jrnieder,
	Martin Ågren, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

On 3/23/2021 4:10 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> 
>>>> +Three important scale dimensions for a Git worktree are:
>>>
>>> s/worktree/working tree/; The former is the thing the "git worktree"
>>> command deals with.  The latter is relevant even when "git worktree"
>>> is not used (the traditional "git clone and you get a working tree
>>> to work in").
>>
>> I guess I'm distracted by using SKIP_WORKTREE a lot, but "working
>> directory" is more specific and hence better.
> 
> Since the user's current working directory can be outside any
> working tree that is governed by any git repository, "working
> directory" is a term I try to avoid when describing the directory
> where a checkout of a revision lives.
> 
> Documentation/glossary-content.txt is where the suggestion for
> "working tree" comes from.

Whoops. Somehow I read that wrong. Thanks for pointing out my error.

>> What I meant by "serialized index file" is that the file written to disk
>> has the sparse directory entries, but the in-core copy will not (except
>> for a very brief moment in time, during do_read_index()).
> 
> Nice.  That would probably mean cache-tree extension on-disk can go
> away, because we can populate in-core cache-tree from these entries.
> I've always hated the on-disk encoding of that extension.
> 
> Or we are not doing this "extra tree" everywhere (i.e. limited only
> to the parts that are marked for "sparse checkout")?

The current design is to only have these entries when all paths
within the directory are marked with SKIP_WORKTREE. This pairs
with the cache-tree extension, which has these directories as
nodes, but only consuming one cache entry (for itself).
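
To make the rule concrete, here is a rough sketch, not code from this
series. It assumes input in the "<tag> <path>" form printed by
"git ls-files -t", where "S" marks a SKIP_WORKTREE entry, and it only
looks at top-level directories and ignores paths containing spaces.
The function name is made up. A directory is reported only when every
path under it is tagged "S":

```shell
# Hypothetical helper: list top-level directories whose entries are
# all marked skip-worktree ("S"), given "git ls-files -t"-style input.
collapsible_dirs () {
	awk '
	index($2, "/") {
		# First path component is the top-level directory.
		dir = substr($2, 1, index($2, "/") - 1)
		total[dir]++
		if ($1 == "S")
			skipped[dir]++
	}
	END {
		for (dir in total)
			if (total[dir] == skipped[dir])
				print dir "/"
	}' "$1" | sort
}
```

A directory holding a mix of "H" and "S" paths is not reported by this
simplified version, even though a subdirectory of it might itself
qualify.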

I haven't considered the idea of inserting trees for other
reasons. Seems like a valuable experiment.

Thanks,
-Stolee


* Re: [RFC/PATCH 2/5] ls-files: make "mode" in show_ce() loop a variable
  2021-03-17 18:11         ` Elijah Newren
@ 2021-03-24  0:46           ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-24  0:46 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, Derrick Stolee, SZEDER Gábor,
	Derrick Stolee, Derrick Stolee


On Wed, Mar 17 2021, Elijah Newren wrote:

> On Wed, Mar 17, 2021 at 6:28 AM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>> In a subsequent commit I'll optionally change the mode in a new sparse
>> mode, let's do this first to make that change smaller.
>>
>> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
>> ---
>>  builtin/ls-files.c | 10 +++++++++-
>>  1 file changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/builtin/ls-files.c b/builtin/ls-files.c
>> index eb72d16493..4db75351f2 100644
>> --- a/builtin/ls-files.c
>> +++ b/builtin/ls-files.c
>> @@ -242,9 +242,17 @@ static void show_ce(struct repository *repo, struct dir_struct *dir,
>>                 if (!show_stage) {
>>                         fputs(tag, stdout);
>>                 } else {
>> +                       unsigned int mode = ce->ce_mode;
>> +                       if (show_sparse && S_ISSPARSEDIR(mode))
>> +                               /*
>> +                                * We could just do & 0177777 all the
>> +                                * time, just make it clear this is
>> +                                * for --stage-sparse.
>> +                                */
>> +                               mode &= 0177777;
>
> I could kind of see referencing the magic constant 0177777 in a test-*
> source file, but it really needs an explanation when showing up in
> actual git source code.  At least reference something about how
> cache.h mentions these are the mode bits, or better yet #define this
> constant somewhere in cache.h with an explanation.
>
> Also, what is --stage-sparse?

A relic from a WIP version of this patch. I ended up just calling it
--sparse in 3/5.
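
For readers wondering what the mask does: 0177777 is octal for 0xFFFF,
so "& 0177777" keeps only the low 16 bits of the mode. A small shell
sketch of the arithmetic follows. The helper name is made up, and the
third value is an invented mode with one extra high bit, used only to
show the effect; nothing here claims such a mode appears in a real
index.

```shell
# Hypothetical helper: print a mode the way "%06o" would after masking.
mask_mode () {
	printf '%06o\n' "$(( $1 & 0177777 ))"
}

mask_mode 0100644	# regular file mode, unchanged by the mask
mask_mode 040000	# directory mode, unchanged by the mask
mask_mode 01040000	# invented value: the bit above the mask is dropped
```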

>>                         printf("%s%06o %s %d\t",
>>                                tag,
>> -                              ce->ce_mode,
>> +                              mode,
>>                                find_unique_abbrev(&ce->oid, abbrev),
>>                                ce_stage(ce));
>>                 }
>> --
>> 2.31.0.260.g719c683c1d



* Re: [RFC/PATCH 3/5] ls-files: add and use a new --sparse option
  2021-03-17 20:43         ` Derrick Stolee
@ 2021-03-24  0:52           ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-24  0:52 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, newren, gitster, pclouds, jrnieder, Martin Ågren,
	SZEDER Gábor, Derrick Stolee, dstolee


On Wed, Mar 17 2021, Derrick Stolee wrote:

> On 3/17/2021 9:28 AM, Ævar Arnfjörð Bjarmason wrote:
>> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
>> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
>
> I want to learn from your suggested changes to the test,
> so forgive my questions here:
>   
>> +test_index_entry_like () {
>> +	dir=$1
>> +	shift
>> +	fmt=$1
>> +	shift
>> +	rev=$1
>> +	shift
>> +	entry=$1
>> +	shift
>> +	file=$1
>> +	shift
>
> Why all the shifts? Why not just use $1, $2, $3,...? My
> guess is that you want to be able to insert a new parameter
> in the middle in the future without changing the later
> numbers, but that seems unlikely, and we could just add
> the parameter at the end.

It's just crappy RFC-quality code. I probably copied some other function
and went with it. No good reason. Yeah it's ugly.

>> +	hash=$(git -C "$dir" rev-parse "$rev") &&
>> +	printf "$fmt\n" "$hash" "$entry" >expected &&
>> +	if grep "$entry" "$file" >line
>> +	then
>> +		test_cmp expected line
>> +	else
>> +		cat cache &&
>> +		false
>> +	fi
>> +}
>> +
>>  test_expect_success 'sparse-index contents' '
>>  	init_repos &&
>>  
>> -	test-tool -C sparse-index read-cache --table >cache &&
>> +	git -C sparse-index ls-files --sparse >cache &&
>>  	for dir in folder1 folder2 x
>>  	do
>> -		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
>> -		grep "040000 tree $TREE	$dir/" cache \
>> -			|| return 1
>> +		test_index_entry_like sparse-index "040000 %s 0\t%s" "HEAD:$dir" "$dir/" cache || return 1
>
> I see how this uses only one line, but test_index_entry_like
> seems so generic that every caller ends up repeating a
> complicated mess of format strings.
>
> Perhaps instead it could be a "test_entry_is_tree"
> and it only passes "$dir" and "cache"? Then we could drop the loop and
> just have
>
> 	test_entry_is_tree cache folder1 &&
> 	test_entry_is_tree cache folder2 &&
> 	test_entry_is_tree cache x &&
>
> or we could still use the loop, especially when we test for four trees.

Yeah that sounds good. Personally I don't mind 4x similar lines
copy/pasted over a for-loop in the tests. You don't need to worry about
the || return doing the right thing, and just setting up the for-loop is
already 3 lines...
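
For illustration, a helper along those lines might look roughly like
the sketch below. It is not from either series. The name
test_entry_is_tree comes from the suggestion above, but the third
argument (the expected tree OID) is an assumption made so the sketch
runs without a repository; in t1092 it would presumably be computed
with "git -C sparse-index rev-parse HEAD:$dir" and checked with
test_cmp rather than cmp.

```shell
# Hypothetical helper; sketch only, not part of the posted patches.
# usage: test_entry_is_tree <file with --stage-style lines> <dir> <oid>
test_entry_is_tree () {
	file=$1 dir=$2 oid=$3
	tab=$(printf '\t')

	# A sparse directory entry is expected to look like
	# "040000 <oid> 0<TAB><dir>/".
	printf '040000 %s 0\t%s/\n' "$oid" "$dir" >expected &&
	grep "$tab$dir/\$" "$file" >line &&
	cmp -s expected line
}
```

Plain positional parameters are enough here, and the call sites stay
short, e.g. test_entry_is_tree cache folder1 "$oid".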

>> -	test-tool -C sparse-index read-cache --table >cache &&
>> +	git -C sparse-index ls-files --sparse >cache &&
>>  	for dir in deep/deeper2 folder1 folder2 x
>>  	do
>> -		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
>> -		grep "040000 tree $TREE	$dir/" cache \
>> -			|| return 1
>> +		test_index_entry_like sparse-index "040000 %s 0\t%s" "HEAD:$dir" "$dir/" cache || return 1
>>  	done &&
>>  
>> +	grep 040000 cache >lines &&
>> +	test_line_count = 4 lines &&
>> +
>
> The point here is to check that no other entries are trees? We know
> that this number will be _at least_ 4 based on the loop above.

It's exactly 4 because we have 4 folders we're checking. But you tell
me. I was just trying to refactor this dependence on the ls-tree format
while moving it over to ls-files without spending too much time on
understanding all the specifics.
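
One aside on the count check: an unanchored "grep 040000" also matches
any line whose object ID happens to contain those six digits, so
anchoring on the mode column is a bit safer. A sketch, assuming the
file holds "ls-files --stage"-style lines; count_tree_entries is a
made-up name:

```shell
# Hypothetical helper: count entries whose mode column is 040000.
count_tree_entries () {
	# Match the mode only at the start of the line.
	grep -c "^040000 " "$1"
}
```

In the test this could feed a comparison such as
test "$(count_tree_entries cache)" = 4.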


* Re: [PATCH v4 07/20] test-read-cache: print cache entries with --table
  2021-03-23 13:44       ` [PATCH v4 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
@ 2021-03-24  1:24         ` Ævar Arnfjörð Bjarmason
  2021-03-24 12:33           ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-24  1:24 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee


On Tue, Mar 23 2021, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> This table is helpful for discovering data in the index to ensure it is
> being written correctly, especially as we build and test the
> sparse-index. This table includes an output format similar to 'git
> ls-tree', but should not be compared to that directly. The biggest
> reasons are that 'git ls-tree' includes a tree entry for every
> subdirectory, even those that would not appear as a sparse directory in
> a sparse-index. Further, 'git ls-tree' does not use a trailing directory
> separator for its tree rows.
>
> This does not print the stat() information for the blobs. That will be
> added in a future change with another option. The tests that are added
> in the next few changes care only about the object types and IDs.
> However, this future need for full index information justifies the need
> for this test helper over extending a user-facing feature, such as 'git
> ls-files'.

Is that stat() information that's going to be essential to grab in the
same process that runs the "for (i = 0; i < istate->cache_nr; i++)"
for-loop, or stat() information that could be grabbed as:

    git ls-files -z --stage | some-program-that-stats-all-listed-blobs

It's not so much that I still disagree as I feel like I'm missing
something. I haven't gone through this topic with a fine toothed comb,
so ...

If and when these patches land and I'm using this nascent sparse
checkout support why wouldn't I want ls-files or another not-a-test-tool
to support extracting this new information that's in the index?

That's why I sent the RFC patches at
https://lore.kernel.org/git/20210317132814.30175-2-avarab@gmail.com/ to
roll this functionality into ls-files.

Still, if there's a good reason why we want this in the index but
never want our plumbing to be able to dump it in some user-facing way,
I think that just as a matter of reviewing this code it would be much
simpler if it was in ls-files behind some
git_env_bool("GIT_TEST_...") flag or something.

Or maybe I'm the only one who spends a lot of time with both ls-files.c
and test-read-cache.c open while trying to review this trying to keep
track of if and how this helper is and isn't subtly different from
ls-files (as my RFC series shows, not really that different at all...).
Especially with the really-just-ls-files-plus-one-thing tool mimicking
ls-tree output, for reasons I still don't get...

> To make the option parsing slightly more robust, wrap the string
> comparisons in a loop adapted from test-dir-iterator.c.
>
> Care must be taken with the final check for the 'cnt' variable. We
> continue the expectation that the numerical value is the final argument.

I think even if you're set on not having this exposed in some
builtin/*.c command this code would be much clearer based on some
version of my
https://lore.kernel.org/git/20210317132814.30175-6-avarab@gmail.com/
i.e. the part that isn't entirely deleting t/helper/test-read-cache.c,
which would survive as t/helper/test-read-cache-sparse.c or something.

As that patch shows this code is needlessly convoluted because it's
serving 3x wildly different in-tree use-cases. I don't see how the very
small amount of de-duplication we're getting is worth the complexity.

At that point we don't need any care with the cnt variable, because
we're not combining the fsmonitor and perf use-cases of reading the
index in some loop with the ls-files-alike.

> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  t/helper/test-read-cache.c | 55 +++++++++++++++++++++++++++++++-------
>  1 file changed, 45 insertions(+), 10 deletions(-)
>
> diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
> index 244977a29bdf..6cfd8f2de71c 100644
> --- a/t/helper/test-read-cache.c
> +++ b/t/helper/test-read-cache.c
> @@ -1,36 +1,71 @@
>  #include "test-tool.h"
>  #include "cache.h"
>  #include "config.h"
> +#include "blob.h"
> +#include "commit.h"
> +#include "tree.h"
> +
> +static void print_cache_entry(struct cache_entry *ce)
> +{
> +	const char *type;
> +	printf("%06o ", ce->ce_mode & 0177777);
> +
> +	if (S_ISSPARSEDIR(ce->ce_mode))
> +		type = tree_type;
> +	else if (S_ISGITLINK(ce->ce_mode))
> +		type = commit_type;
> +	else
> +		type = blob_type;
> +
> +	printf("%s %s\t%s\n",
> +	       type,
> +	       oid_to_hex(&ce->oid),
> +	       ce->name);
> +}
> +
> +static void print_cache(struct index_state *istate)
> +{
> +	int i;
> +	for (i = 0; i < istate->cache_nr; i++)
> +		print_cache_entry(istate->cache[i]);
> +}
>  
>  int cmd__read_cache(int argc, const char **argv)
>  {
> +	struct repository *r = the_repository;
>  	int i, cnt = 1;
>  	const char *name = NULL;
> +	int table = 0;
>  
> -	if (argc > 1 && skip_prefix(argv[1], "--print-and-refresh=", &name)) {
> -		argc--;
> -		argv++;
> +	for (++argv, --argc; *argv && starts_with(*argv, "--"); ++argv, --argc) {
> +		if (skip_prefix(*argv, "--print-and-refresh=", &name))
> +			continue;
> +		if (!strcmp(*argv, "--table"))
> +			table = 1;
>  	}
>  
> -	if (argc == 2)
> -		cnt = strtol(argv[1], NULL, 0);
> +	if (argc == 1)
> +		cnt = strtol(argv[0], NULL, 0);
>  	setup_git_directory();
>  	git_config(git_default_config, NULL);
> +
>  	for (i = 0; i < cnt; i++) {
> -		read_cache();
> +		repo_read_index(r);
>  		if (name) {
>  			int pos;
>  
> -			refresh_index(&the_index, REFRESH_QUIET,
> +			refresh_index(r->index, REFRESH_QUIET,
>  				      NULL, NULL, NULL);
> -			pos = index_name_pos(&the_index, name, strlen(name));
> +			pos = index_name_pos(r->index, name, strlen(name));
>  			if (pos < 0)
>  				die("%s not in index", name);
>  			printf("%s is%s up to date\n", name,
> -			       ce_uptodate(the_index.cache[pos]) ? "" : " not");
> +			       ce_uptodate(r->index->cache[pos]) ? "" : " not");
>  			write_file(name, "%d\n", i);
>  		}
> -		discard_cache();
> +		if (table)
> +			print_cache(r->index);
> +		discard_index(r->index);
>  	}
>  	return 0;
>  }



* Re: [PATCH v4 07/20] test-read-cache: print cache entries with --table
  2021-03-24  1:24         ` Ævar Arnfjörð Bjarmason
@ 2021-03-24 12:33           ` Derrick Stolee
  2021-03-25  3:41             ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee @ 2021-03-24 12:33 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Derrick Stolee via GitGitGadget
  Cc: git, newren, gitster, pclouds, jrnieder, Martin Ågren,
	SZEDER Gábor, Derrick Stolee, Derrick Stolee

On 3/23/21 9:24 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Tue, Mar 23 2021, Derrick Stolee via GitGitGadget wrote:
> 
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> This table is helpful for discovering data in the index to ensure it is
>> being written correctly, especially as we build and test the
>> sparse-index. This table includes an output format similar to 'git
>> ls-tree', but should not be compared to that directly. The biggest
>> reasons are that 'git ls-tree' includes a tree entry for every
>> subdirectory, even those that would not appear as a sparse directory in
>> a sparse-index. Further, 'git ls-tree' does not use a trailing directory
>> separator for its tree rows.
>>
>> This does not print the stat() information for the blobs. That will be
>> added in a future change with another option. The tests that are added
>> in the next few changes care only about the object types and IDs.
>> However, this future need for full index information justifies the need
>> for this test helper over extending a user-facing feature, such as 'git
>> ls-files'.
> 
> Is that stat() information that's going to be essential to grab in the
> same process that runs the "for (i = 0; i < istate->cache_nr; i++)"
> for-loop, or stat() information that could be grabbed as:
> 
>     git ls-files -z --stage | some-program-that-stats-all-listed-blobs

The point is not to find the stat() data from disk, but to ensure that
the stat() data is correctly stored in the index (say, after converting
an existing index from another format). This pipe strategy does not
allow for that scenario.

> It's not so much that I still disagree as I feel like I'm missing
> something. I haven't gone through this topic with a fine toothed comb,
> so ...
> 
> If and when these patches land and I'm using this nascent sparse
> checkout support why wouldn't I want ls-files or another not-a-test-tool
> to support extracting this new information that's in the index?
> 
> That's why I sent the RFC patches at
> https://lore.kernel.org/git/20210317132814.30175-2-avarab@gmail.com/ to
> roll this functionality into ls-files.

And I recommend that you continue to pursue them as an independent
series, but I'm not going to incorporate them into this one. I'm
not going to distract from this internal data structure with changes
to user-facing commands until I think it's ready to use. As the design
document describes the plan, I don't expect this to be something I
will recommend to users until most of "Phase 3" is complete, making
the most common Git commands aware of a sparse index. (I expect to
fast-track a prototype to willing users that covers that functionality
while review continues on the mailing list.)

Making a change to a builtin is _forever_, and since the only
purpose right now is to expose the data in a test environment, I
don't want to adjust the builtin until either there is a real user
need or the feature has otherwise stabilized. If you want to take on
that responsibility, then please do.

Otherwise, I will eventually need to make "git ls-files" sparse-aware
when removing 'command_requires_full_index' (Phase 4), so that would
be a good opportunity to adjust the expectations.

Thanks,
-Stolee


* Re: [PATCH v4 07/20] test-read-cache: print cache entries with --table
  2021-03-24 12:33           ` Derrick Stolee
@ 2021-03-25  3:41             ` Ævar Arnfjörð Bjarmason
  2021-03-26  0:12               ` Elijah Newren
  0 siblings, 1 reply; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-25  3:41 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, newren, gitster, pclouds,
	jrnieder, Martin Ågren, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee


On Wed, Mar 24 2021, Derrick Stolee wrote:

> On 3/23/21 9:24 PM, Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Tue, Mar 23 2021, Derrick Stolee via GitGitGadget wrote:
>> 
>>> From: Derrick Stolee <dstolee@microsoft.com>
>>>
>>> This table is helpful for discovering data in the index to ensure it is
>>> being written correctly, especially as we build and test the
>>> sparse-index. This table includes an output format similar to 'git
>>> ls-tree', but should not be compared to that directly. The biggest
>>> reasons are that 'git ls-tree' includes a tree entry for every
>>> subdirectory, even those that would not appear as a sparse directory in
>>> a sparse-index. Further, 'git ls-tree' does not use a trailing directory
>>> separator for its tree rows.
>>>
>>> This does not print the stat() information for the blobs. That will be
>>> added in a future change with another option. The tests that are added
>>> in the next few changes care only about the object types and IDs.
>>> However, this future need for full index information justifies the need
>>> for this test helper over extending a user-facing feature, such as 'git
>>> ls-files'.
>> 
>> Is that stat() information that's going to be essential to grab in the
>> same process that runs the "for (i = 0; i < istate->cache_nr; i++)"
>> for-loop, or stat() information that could be grabbed as:
>> 
>>     git ls-files -z --stage | some-program-that-stats-all-listed-blobs
>
> The point is not to find the stat() data from disk, but to ensure that
> the stat() data is correctly stored in the index (say, after converting
> an existing index from another format). This pipe strategy does not
> allow for that scenario.

So a dump of ce->ce_stat_data, i.e. the same thing ls-files --debug
prints out now, or...?

>> It's not so much that I still disagree as I feel like I'm missing
>> something. I haven't gone through this topic with a fine toothed comb,
>> so ...
>> 
>> If and when these patches land and I'm using this nascent sparse
>> checkout support why wouldn't I want ls-files or another not-a-test-tool
>> to support extracting this new information that's in the index?
>> 
>> That's why I sent the RFC patches at
>> https://lore.kernel.org/git/20210317132814.30175-2-avarab@gmail.com/ to
>> roll this functionality into ls-files.
>
> And I recommend that you continue to pursue them as an independent
> series, but I'm not going to incorporate them into this one. I'm
> not going to distract from this internal data structure with changes
> to user-facing commands until I think it's ready to use. As the design
> document describes the plan, I don't expect this to be something I
> will recommend to users until most of "Phase 3" is complete, making
> the most common Git commands aware of a sparse index. (I expect to
> fast-track a prototype to willing users that covers that functionality
> while review continues on the mailing list.)

This series is 20 patches. Your current derrickstolee/sparse-index/wip
is another 36, and from skimming those patches & your design doc those
56 seem to be partway into Phase I of IV.

So at the rate things tend to get reviewed / re-rolled & land in git.git
it seems exceedingly likely that we'll have some part-way implementation
of this for at least a major release or two. No?

Which is why I'm suggesting/asking if we shouldn't have something like
this debugging helper as part of installed tooling, because people are
going to try it, it's probably going to have bugs and do other weird
things, and I'd rather not have to manually build some test-tool to
debug some local sparse checkout somewhere.

> Making a change to a builtin is _forever_, and since the only
> purpose right now is to expose the data in a test environment, I
> don't want to adjust the builtin until either there is a real user
> need or the feature has otherwise stabilized. If you want to take on
> that responsibility, then please do.

That's just not the case, we have plenty of unstable debug-esque options
in various built-in commands, in fact ls-files already has a --debug
option whose docs say:

    This is intended to show as much information as possible for manual
    inspection; the exact format may change at any time.

It was added in 84974217151 (ls-files: learn a debugging dump format,
2010-07-31) and "just tacks all available data from the cache onto each
file's line" so in a way not adjusting it and using it would be a
regression, after all this is new data in the cache, so it should print
it :)

There's also PARSE_OPT_HIDDEN for other such in-tree use. Whatever the
sanity/merits of me suggesting that this specific thing be in ls-files
instead of a test-helper, it seems far fetched that something like that
hidden behind a GIT_TEST_* env var (or hidden option, --debug etc.) is
something we'd need to worry about backwards compatibility for.

So, whatever you think about the merits of including this functionality
in ls-files I think your stance of this being a no-go for adding to the
builtin is based on a false premise. It's fine to have
unstable/transitory/debug output in the builtins. We just name &
document them as such.

I also had some feedback in that series and on the earlier iteration
that I think is appropriate to be incorporated into a re-roll of this
one, which doesn't have anything to do with the question of whether we
use ls-files or the helper in the tests. Such as us shoving more stuff
into the read-cache.c test-tool vs. splitting it up, which makes that
code needlessly convoluted.

I don't see how recommending that I pursue that as an independent series
is productive for anyone. So as you re-roll this I should submit another
series on top to refactor your in-flight code & tests?

Either my suggestions are just bad, and we shouldn't do them at all, or
it makes sense to incorporate relevant feedback in re-rolls. I'll let
other reviewers draw their own conclusions on that.

That's not a snarky "I'm right" b.t.w., I may honestly be full of it on
this particular topic.

But if those suggested changes are worth doing at all, then doing them
in that way seems like a massive waste of time for everyone involved, or
maybe I'm not getting what you're suggesting by pursuing them as an
independent series.

> Otherwise, I will eventually need to make "git ls-files" sparse-aware
> when removing 'command_requires_full_index' (Phase 4), so that would
> be a good opportunity to adjust the expectations.

At which point you'd be adjusting your tests that expect ls-tree format
output to using ls-files output, instead of using ls-files-like output
from the beginning?

At the end of this E-Mail is a patch on top that adds an undocumented
--debug-sparse in addition to the existing --debug. Running that in the
middle of one of your tests:
    
    $ ~/g/git/git ls-files --debug -- a folder1
    a
      ctime: 1616641434:474004002
      mtime: 1616641434:474004002
      dev: 2306     ino: 28576528
      uid: 1001     gid: 1001
      size: 8       flags: 0
    folder1/a
      ctime: 0:0
      mtime: 0:0
      dev: 0        ino: 0
      uid: 0        gid: 0
      size: 0       flags: 40000000
    $ ~/g/git/git ls-files --debug --debug-sparse -- a folder1
    a 
      ctime: 1616641434:474004002
      mtime: 1616641434:474004002
      dev: 2306     ino: 28576528
      uid: 1001     gid: 1001
      size: 8       flags: 0
    folder1/
      ctime: 0:0
      mtime: 0:0
      dev: 0        ino: 0
      uid: 0        gid: 0
      size: 0       flags: 40004000
    $ ~/g/git/git ls-files --stage -- a folder1
    100644 e79c5e8f964493290a409888d5413a737e8e5dd5 0       a
    100644 e79c5e8f964493290a409888d5413a737e8e5dd5 0       folder1/a
    $ ~/g/git/git ls-files --stage --debug-sparse -- a folder1
    100644 e79c5e8f964493290a409888d5413a737e8e5dd5 0       a
    040000 f203181537ff55dcf7896bf8c5b5c35af1514421 0       folder1/

I.e. it gives you everything your helper does and more with a trivial
addition of a --debug-sparse (which we can later just remove, it's a
debug option...).
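
For illustration only, here is a minimal sketch of how the two --stage
lines above are formed. It mirrors the printf() format and the 0177777
mask from the patch below, but format_stage_line() and
path_looks_like_directory() are hypothetical names for this sketch, not
functions in git.git.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: in the output above, the sparse directory entry
 * is shown with a trailing '/' ("folder1/"), unlike a file entry.
 */
int path_looks_like_directory(const char *path)
{
	size_t len;

	assert(path != NULL);
	len = strlen(path);
	return len > 0 && path[len - 1] == '/';
}

/*
 * Hypothetical helper that formats one line as
 * "<6-digit octal mode> <hash> <stage>\t<path>".
 * Returns 0 on success, -1 if the buffer is too small.
 */
int format_stage_line(char *buf, size_t size, unsigned int mode,
		      const char *hash, int stage, const char *path)
{
	int n;

	assert(buf != NULL && hash != NULL && path != NULL);
	mode &= 0177777; /* keep only the low 16 bits, as the patch does */
	n = snprintf(buf, size, "%06o %s %d\t%s", mode, hash, stage, path);
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

Calling format_stage_line() with mode 040000 and the path "folder1/"
gives a line of the same shape as the last one in the listing above.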

See e.g. my recent 15c9649730d (grep/log: remove hidden --debug and
--grep-debug options, 2021-01-26) which is already in a release, and
AFAICT nobody has noticed or cared.

I don't know if that's the stat() information you wanted (your WIP
branch doesn't have such a change), but presumably it either is the info
you want, or ls-files's --debug would want to emit any such info
that's now missing too.

diff --git a/builtin/ls-files.c b/builtin/ls-files.c
index 13bcc2d8473..e691512d4f8 100644
--- a/builtin/ls-files.c
+++ b/builtin/ls-files.c
@@ -34,6 +34,7 @@ static int show_valid_bit;
 static int show_fsmonitor_bit;
 static int line_terminator = '\n';
 static int debug_mode;
+static int debug_sparse_mode;
 static int show_eol;
 static int recurse_submodules;
 static int skipping_duplicates;
@@ -242,9 +243,17 @@ static void show_ce(struct repository *repo, struct dir_struct *dir,
 		if (!show_stage) {
 			fputs(tag, stdout);
 		} else {
+			unsigned int mode = ce->ce_mode;
+			if (debug_sparse_mode && S_ISSPARSEDIR(mode))
+				/*
+				 * We could just do & 0177777 all the
+				 * time, just make it clear this is
+				 * for --debug-sparse.
+				 */
+				mode &= 0177777;
 			printf("%s%06o %s %d\t",
 			       tag,
-			       ce->ce_mode,
+			       mode,
 			       find_unique_abbrev(&ce->oid, abbrev),
 			       ce_stage(ce));
 		}
@@ -667,6 +676,7 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
 			N_("pretend that paths removed since <tree-ish> are still present")),
 		OPT__ABBREV(&abbrev),
 		OPT_BOOL(0, "debug", &debug_mode, N_("show debugging data")),
+		OPT_BOOL(0, "debug-sparse", &debug_sparse_mode, N_("show sparse debugging data")),
 		OPT_BOOL(0, "deduplicate", &skipping_duplicates,
 			 N_("suppress duplicate entries")),
 		OPT_END()
@@ -681,9 +691,6 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
 		prefix_len = strlen(prefix);
 	git_config(git_default_config, NULL);
 
-	if (repo_read_index(the_repository) < 0)
-		die("index file corrupt");
-
 	argc = parse_options(argc, argv, prefix, builtin_ls_files_options,
 			ls_files_usage, 0);
 	pl = add_pattern_list(&dir, EXC_CMDL, "--exclude option");
@@ -700,6 +707,10 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
 		tag_skip_worktree = "S ";
 		tag_resolve_undo = "U ";
 	}
+	if (debug_sparse_mode) {
+		prepare_repo_settings(the_repository);
+		the_repository->settings.command_requires_full_index = 0;
+	}
 	if (show_modified || show_others || show_deleted || (dir.flags & DIR_SHOW_IGNORED) || show_killed)
 		require_work_tree = 1;
 	if (show_unmerged)
@@ -743,6 +754,12 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
 		max_prefix = common_prefix(&pathspec);
 	max_prefix_len = get_common_prefix_len(max_prefix);
 
+	/*
+	 * Read the index after parse options etc. have had a chance
+	 * to die early.
+	 */
+	if (repo_read_index(the_repository) < 0)
+		die("index file corrupt");
 	prune_index(the_repository->index, max_prefix, max_prefix_len);
 
 	/* Treat unmatching pathspec elements as errors */


* Re: [PATCH v4 07/20] test-read-cache: print cache entries with --table
  2021-03-25  3:41             ` Ævar Arnfjörð Bjarmason
@ 2021-03-26  0:12               ` Elijah Newren
  2021-03-28 15:31                 ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 203+ messages in thread
From: Elijah Newren @ 2021-03-26  0:12 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Derrick Stolee, Derrick Stolee via GitGitGadget,
	Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

Hi,

On Wed, Mar 24, 2021 at 8:41 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> On Wed, Mar 24 2021, Derrick Stolee wrote:
>
> > On 3/23/21 9:24 PM, Ævar Arnfjörð Bjarmason wrote:
> >>
> >> On Tue, Mar 23 2021, Derrick Stolee via GitGitGadget wrote:
> >>
> >>> From: Derrick Stolee <dstolee@microsoft.com>
> >>>
...
> >> It's not so much that I still disagree as I feel like I'm missing
> >> something. I haven't gone through this topic with a fine toothed comb,
> >> so ...
> >>
> >> If and when these patches land and I'm using this nascent sparse
> >> checkout support why wouldn't I want ls-files or another not-a-test-tool
> >> to support extracting this new information that's in the index?
> >>
> >> That's why I sent the RFC patches at
> >> https://lore.kernel.org/git/20210317132814.30175-2-avarab@gmail.com/ to
> >> roll this functionality into ls-files.
> >
> > And I recommend that you continue to pursue them as an independent
> > series, but I'm not going to incorporate them into this one. I'm
> > not going to distract from this internal data structure with changes
> > to user-facing commands until I think it's ready to use. As the design
> > document describes the plan, I don't expect this to be something I
> > will recommend to users until most of "Phase 3" is complete, making
> > the most common Git commands aware of a sparse index. (I expect to
> > fast-track a prototype to willing users that covers that functionality
> > while review continues on the mailing list.)
>
> This series is 20 patches. Your current derrickstolee/sparse-index/wip
> is another 36, and from skimming those patches & your design doc those
> 56 seem to be partway into Phase I of IV.
>
> So at the rate things tend to get reviewed / re-rolled & land in git.git
> it seems exceedingly likely that we'll have some part-way implementation
> of this for at least a major release or two. No?
>
> Which is why I'm suggesting/asking if we shouldn't have something like
> this debugging helper as part of installed tooling, because people are
> going to try it, it's probably going to have bugs and do other weird
> things, and I'd rather not have to manually build some test-tool to
> debug some local sparse checkout somewhere.

I'm curious why you feel it's critical that this particular piece of
debugging machinery needs to be prioritized early and exposed; in
particular, I'm not sure I follow the "people are going to try it"
assertion.  Are you the one who is going to try it or are you going to
give it to your users?  If so, what do you need out of the debugging
tool?

You are correct that this will span multiple releases; Stolee already
said he was planning to be working on this for most of 2021.  But just
because pieces of the code exist and are shipped doesn't mean it'll be
announced or supported.  For example, the git-2.30 and git-2.31
release notes were completely silent about merge-ort.  It existed in
both releases; in fact, the version that ships in git-2.31 could
theoretically be used successfully by the vast majority of users for
their daily workflow.  (But it does have known shortcomings and test
failures so I definitely did *not* want it to be announced at that
time.)

> > Making a change to a builtin is _forever_, and since the only
> > purpose right now is to expose the data in a test environment, I
> > don't want to adjust the builtin until either there is a real user
> > need or the feature has otherwise stabilized. If you want to take on
> > that responsibility, then please do.
>
> That's just not the case, we have plenty of unstable debug-esque options
> in various built-in commands, in fact ls-files already has a --debug
> option whose docs say:
>
>     This is intended to show as much information as possible for manual
>     inspection; the exact format may change at any time.
>
> It was added in 84974217151 (ls-files: learn a debugging dump format,
> 2010-07-31) and "just tacks all available data from the cache onto each
> file's line" so in a way not adjusting it and using it would be a
> regression, after all this is new data in the cache, so it should print
> it :)
>
> There's also PARSE_OPT_HIDDEN for other such in-tree use. Whatever the
> sanity/merits of me suggesting that this specific thing be in ls-files
> instead of a test-helper, it seems far fetched that something like that
> hidden behind a GIT_TEST_* env var (or hidden option, --debug etc.) is
> something we'd need to worry about backwards compatibility for.
>
> So, whatever you think about the merits of including this functionality
> in ls-files I think your stance of this being a no-go for adding to the
> builtin is based on a false premise. It's fine to have
> unstable/transitory/debug output in the builtins. We just name &
> document them as such.
>
> I also had some feedback in that series and on the earlier iteration
> that I think is appropriate to be incorporated into a re-roll of this
> one, which doesn't have anything to do with the question of whether we
> use ls-files or the helper in the tests. For example, shoving more stuff
> into the read-cache.c test-tool, vs. splitting it up, makes that code
> needlessly convoluted.

Well:
  * you seem to be strongly opposed to test-read-cache.c containing
    this code (though I don't quite follow why)
  * Stolee seems to be strongly opposed to modifying
    builtin/ls-files.c until he has time to think through how builtins
    should work.

So putting it in another test file that looks slightly duplicative of
test-read-cache.c might indeed be a good way out of this conundrum.
:-)

(I'm not opposed to any of the three solutions, I'm mostly chiming in
here because I'm worried about possible bubbling frustration; see
below.)

> I don't see how recommending that I pursue that as an independent series
> is productive for anyone. So as you re-roll this I should submit another
> series on top to refactor your in-flight code & tests?

Your tone suggests some frustration; I have a suspicion there's some
lack of understanding or misreading that has occurred (perhaps on my
part too), and before that misunderstanding morphs into motive
questioning, let me see if I might be able to help...

So far, you have advocated for:
  A) Moving the checks to ls-files with a permanent new flag (--sparse)
  B) Duplicating test-read-cache.c (which is admittedly pretty small)
     and then modifying the duplicate to have the new behavior, or
     alternatively:
  C) Just stating files to get the information
  D) Creating new debug option(s) to ls-files so that end users can
     use this in the next few releases before the feature is ready for
     prime time
You also mentioned you had read just part of the series.

Option D comes with the problem that it's not at all clear who these
end-users are, why they want the option, or how we should design it.
Personally, I'm totally onboard that ls-files should generally have
the ability to show information in the index (e.g. if there are tree
entries in addition to blob entries, it should be able to show both),
but I'm not following the reasoning for why it needs to be there as
part of the early stages of development of the sparse-index feature
and who it's supposed to be helping in these next few releases.

The progression also suggests that Option B might have just been a
step along the way and that you were advocating for Option D now.  I
think it'd be easy to miss that you still had option B open and
considered it equivalently good to option D (or am I misreading?),
much like you missed how option C wasn't even relevant to the problem
at hand or option A would have introduced perpetual confusion as a
mere duplicate of --stage (in the best case scenario, anyway).
They're all easy misunderstandings.

> Either my suggestions are just bad, and we shouldn't do them at all, or
> it makes sense to incorporate relevant feedback in re-rolls. I'll let
> other reviewers draw their own conclusions on that.

I think that's a bit unfair; Stolee has been incorporating feedback.
He even called out fixing up things at your suggestion in v4 of his
re-roll.

> That's not a snarky "I'm right" b.t.w., I may honestly be full of it on
> this particular topic.
>
> But if those suggested changes are worth doing at all, then doing them
> in that way seems like a massive waste of time for everyone involved, or
> maybe I'm not getting what you're suggesting by pursuing them as an
> independent series.

I think you should instead read it as: he has no idea why this needs to
be exposed in ls-files, who the users are that you assert will be
using it, or how to cater to their needs.  Shouldn't the person who
implements this understand those pieces to avoid a massive waste of
time?

> > Otherwise, I will need to eventually handle "git ls-files" being
> > sparse-aware when eventually removing 'command_requires_full_index',
> > (Phase 4) so that would be a good opportunity to adjust the
> > expectations.
>
> At which point you'd be adjusting your tests that expect ls-tree format
> output to using ls-files output, instead of using ls-files-like output
> from the beginning?

I don't understand what you're getting at here.  I was the one who
requested Stolee make the output look like ls-tree in his original
RFC series, so if there's a problem with this style of output, I'm to
blame.  But what exactly is the problem?  Old-style ls-files output
just isn't relevant anymore.  ls-tree prints four things: mode, type,
hash, and filename.  ls-files prints all of those except "type".  The
reason ls-files never included type before was because it was always
"blob".  This series changes that, and adds "tree" to the mix.  Once
you have different types included in the index, then ls-files has to
print all the same fields that ls-tree does...so why not make it look
similar?
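
As a rough illustration of the "type" column discussed above, the
sketch below derives it from an entry's mode. The names
entry_type_from_mode() and entry_type_name() are hypothetical, and the
sketch simplifies by treating every non-directory mode as a blob; it is
not how git.git implements this.

```c
#include <assert.h>

/* The two entry types discussed above (an assumption of this sketch). */
enum entry_type { ENTRY_BLOB, ENTRY_TREE };

/*
 * Hypothetical helper: mode 040000 (a directory) maps to "tree";
 * anything else, e.g. 100644, is treated as "blob" in this sketch.
 */
enum entry_type entry_type_from_mode(unsigned int mode)
{
	return (mode & 0170000) == 0040000 ? ENTRY_TREE : ENTRY_BLOB;
}

/* Hypothetical helper: the name to print in an ls-tree-like column. */
const char *entry_type_name(enum entry_type type)
{
	assert(type == ENTRY_BLOB || type == ENTRY_TREE);
	return type == ENTRY_TREE ? "tree" : "blob";
}
```

With only file entries in the index the result is always "blob", which
is why the column could be omitted until now.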

> At the end of this E-Mail is a patch on top that adds an undocumented
> --debug-sparse in addition to the existing --debug. Running that in the
> middle of one of your tests:
>
>     $ ~/g/git/git ls-files --debug -- a folder1
>     a
>       ctime: 1616641434:474004002
>       mtime: 1616641434:474004002
>       dev: 2306     ino: 28576528
>       uid: 1001     gid: 1001
>       size: 8       flags: 0
>     folder1/a
>       ctime: 0:0
>       mtime: 0:0
>       dev: 0        ino: 0
>       uid: 0        gid: 0
>       size: 0       flags: 40000000
>     $ ~/g/git/git ls-files --debug --debug-sparse -- a folder1
>     a
>       ctime: 1616641434:474004002
>       mtime: 1616641434:474004002
>       dev: 2306     ino: 28576528
>       uid: 1001     gid: 1001
>       size: 8       flags: 0
>     folder1/
>       ctime: 0:0
>       mtime: 0:0
>       dev: 0        ino: 0
>       uid: 0        gid: 0
>       size: 0       flags: 40004000
>     $ ~/g/git/git ls-files --stage -- a folder1
>     100644 e79c5e8f964493290a409888d5413a737e8e5dd5 0       a
>     100644 e79c5e8f964493290a409888d5413a737e8e5dd5 0       folder1/a
>     $ ~/g/git/git ls-files --stage --debug-sparse -- a folder1
>     100644 e79c5e8f964493290a409888d5413a737e8e5dd5 0       a
>     040000 f203181537ff55dcf7896bf8c5b5c35af1514421 0       folder1/
>
> I.e. it gives you everything your helper does and more with a trivial
> addition of a --debug-sparse (which we can later just remove, it's a
> debug option...).
>
> See e.g. my recent 15c9649730d (grep/log: remove hidden --debug and
> --grep-debug options, 2021-01-26) which is already in a release, and
> AFAICT nobody has noticed or cared.
>
> I don't know if that's the stat() information you wanted (your WIP
> branch doesn't have such a change), but presumably it either is the info
> you want, or ls-files's --debug would want to emit any such info
> that's now missing too.
>
> diff --git a/builtin/ls-files.c b/builtin/ls-files.c
> index 13bcc2d8473..e691512d4f8 100644
> --- a/builtin/ls-files.c
> +++ b/builtin/ls-files.c
> @@ -34,6 +34,7 @@ static int show_valid_bit;
>  static int show_fsmonitor_bit;
>  static int line_terminator = '\n';
>  static int debug_mode;
> +static int debug_sparse_mode;
>  static int show_eol;
>  static int recurse_submodules;
>  static int skipping_duplicates;
> @@ -242,9 +243,17 @@ static void show_ce(struct repository *repo, struct dir_struct *dir,
>                 if (!show_stage) {
>                         fputs(tag, stdout);
>                 } else {
> +                       unsigned int mode = ce->ce_mode;
> +                       if (debug_sparse_mode && S_ISSPARSEDIR(mode))
> +                               /*
> +                                * We could just do & 0177777 all the
> +                                * time, just make it clear this is
> +                                * for --debug-sparse.
> +                                */
> +                               mode &= 0177777;
>                         printf("%s%06o %s %d\t",
>                                tag,
> -                              ce->ce_mode,
> +                              mode,
>                                find_unique_abbrev(&ce->oid, abbrev),
>                                ce_stage(ce));
>                 }
> @@ -667,6 +676,7 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
>                         N_("pretend that paths removed since <tree-ish> are still present")),
>                 OPT__ABBREV(&abbrev),
>                 OPT_BOOL(0, "debug", &debug_mode, N_("show debugging data")),
> +               OPT_BOOL(0, "debug-sparse", &debug_sparse_mode, N_("show sparse debugging data")),
>                 OPT_BOOL(0, "deduplicate", &skipping_duplicates,
>                          N_("suppress duplicate entries")),
>                 OPT_END()
> @@ -681,9 +691,6 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
>                 prefix_len = strlen(prefix);
>         git_config(git_default_config, NULL);
>
> -       if (repo_read_index(the_repository) < 0)
> -               die("index file corrupt");
> -
>         argc = parse_options(argc, argv, prefix, builtin_ls_files_options,
>                         ls_files_usage, 0);
>         pl = add_pattern_list(&dir, EXC_CMDL, "--exclude option");
> @@ -700,6 +707,10 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
>                 tag_skip_worktree = "S ";
>                 tag_resolve_undo = "U ";
>         }
> +       if (debug_sparse_mode) {
> +               prepare_repo_settings(the_repository);
> +               the_repository->settings.command_requires_full_index = 0;
> +       }
>         if (show_modified || show_others || show_deleted || (dir.flags & DIR_SHOW_IGNORED) || show_killed)
>                 require_work_tree = 1;
>         if (show_unmerged)
> @@ -743,6 +754,12 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
>                 max_prefix = common_prefix(&pathspec);
>         max_prefix_len = get_common_prefix_len(max_prefix);
>
> +       /*
> +        * Read the index after parse options etc. have had a chance
> +        * to die early.
> +        */
> +       if (repo_read_index(the_repository) < 0)
> +               die("index file corrupt");
>         prune_index(the_repository->index, max_prefix, max_prefix_len);
>
>         /* Treat unmatching pathspec elements as errors */


* Re: [PATCH v4 01/20] sparse-index: design doc and format update
  2021-03-23 13:44       ` [PATCH v4 01/20] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
@ 2021-03-26 20:29         ` SZEDER Gábor
  2021-03-28  1:47           ` Junio C Hamano
  0 siblings, 1 reply; 203+ messages in thread
From: SZEDER Gábor @ 2021-03-26 20:29 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Derrick Stolee, Derrick Stolee

On Tue, Mar 23, 2021 at 01:44:09PM +0000, Derrick Stolee via GitGitGadget wrote:
> Currently, the index format is only updated in the presence of
> extensions.sparseIndex instead of increasing a file format version
> number. This is temporary, and index v5 is part of the plan for future
> work in this area.

> diff --git a/Documentation/technical/sparse-index.txt b/Documentation/technical/sparse-index.txt
> new file mode 100644
> index 000000000000..62f6dc225a44
> --- /dev/null
> +++ b/Documentation/technical/sparse-index.txt

> +To start, we use a new repository extension, `extensions.sparseIndex`, to
> +allow inserting sparse-directory entries into indexes with file format
> +versions 2, 3, and 4. This prevents Git versions that do not understand
> +the sparse-index from operating on one, but it also prevents other
> +operations that do not use the index at all.

Why is this not a non-optional index extension?  That would allow
older Git versions and other implementations not understanding sparse
directory entries to still perform any operation that doesn't involve
the index.  More importantly, that would prevent older Git versions
and other implementations not understanding repository extensions from
potentially wreaking havoc when the index contains sparse directory
entries.  Notably JGit's current version (5.11.0.202103091610-r) does
still completely ignore repository extensions, and e.g. happily
attempts any operations on a SHA256 repository, so it would do the
same in the presence of 'extensions.sparseIndex' as well.  JGit does
respect non-optional index extensions, see e.g. 87a6bb701a
(t5310-pack-bitmaps: make JGit tests work with GIT_TEST_SPLIT_INDEX,
2018-05-10).

This really should be a non-optional index extension.

> A new format, index v5, will
> +be introduced that includes sparse-directory entries by default. It might
> +also introduce other features that have been considered for improving the
> +index, as well.


* Re: [PATCH v4 01/20] sparse-index: design doc and format update
  2021-03-26 20:29         ` SZEDER Gábor
@ 2021-03-28  1:47           ` Junio C Hamano
  2021-03-29 14:32             ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Junio C Hamano @ 2021-03-28  1:47 UTC (permalink / raw)
  To: SZEDER Gábor
  Cc: Derrick Stolee via GitGitGadget, git, newren, pclouds, jrnieder,
	Martin Ågren, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

SZEDER Gábor <szeder.dev@gmail.com> writes:

>> +To start, we use a new repository extension, `extensions.sparseIndex`, to
>> +allow inserting sparse-directory entries into indexes with file format
>> +versions 2, 3, and 4. This prevents Git versions that do not understand
>> +the sparse-index from operating on one, but it also prevents other
>> +operations that do not use the index at all.
>
> Why is this not a non-optional index extension?  ...
> This really should be a non-optional index extension.

Yeah, the index extension mechanism was designed with optional and
required kinds because we wanted to support exactly a use case like
this one.

Thanks for pointing it out.


* Re: [PATCH v4 07/20] test-read-cache: print cache entries with --table
  2021-03-26  0:12               ` Elijah Newren
@ 2021-03-28 15:31                 ` Ævar Arnfjörð Bjarmason
  2021-03-29 19:46                   ` Derrick Stolee
  2021-03-29 22:02                   ` Elijah Newren
  0 siblings, 2 replies; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-28 15:31 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Derrick Stolee, Derrick Stolee via GitGitGadget,
	Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee


On Fri, Mar 26 2021, Elijah Newren wrote:

> Hi,
>
> On Wed, Mar 24, 2021 at 8:41 PM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>> On Wed, Mar 24 2021, Derrick Stolee wrote:
>>
>> > On 3/23/21 9:24 PM, Ævar Arnfjörð Bjarmason wrote:
>> >>
>> >> On Tue, Mar 23 2021, Derrick Stolee via GitGitGadget wrote:
>> >>
>> >>> From: Derrick Stolee <dstolee@microsoft.com>
>> >>>
> ...
>> >> It's not so much that I still disagree as I feel like I'm missing
>> >> something. I haven't gone through this topic with a fine toothed comb,
>> >> so ...
>> >>
>> >> If and when these patches land and I'm using this nascent sparse
>> >> checkout support why wouldn't I want ls-files or another not-a-test-tool
>> >> to support extracting this new information that's in the index?
>> >>
>> >> That's why I sent the RFC patches at
>> >> https://lore.kernel.org/git/20210317132814.30175-2-avarab@gmail.com/ to
>> >> roll this functionality into ls-files.
>> >
>> > And I recommend that you continue to pursue them as an independent
>> > series, but I'm not going to incorporate them into this one. I'm
>> > not going to distract from this internal data structure with changes
>> > to user-facing commands until I think it's ready to use. As the design
>> > document describes the plan, I don't expect this to be something I
>> > will recommend to users until most of "Phase 3" is complete, making
>> > the most common Git commands aware of a sparse index. (I expect to
>> > fast-track a prototype to willing users that covers that functionality
>> > while review continues on the mailing list.)
>>
>> This series is 20 patches. Your current derrickstolee/sparse-index/wip
>> is another 36, and from skimming those patches & your design doc those
>> 56 seem to be partway into Phase I of IV.
>>
>> So at the rate things tend to get reviewed / re-rolled & land in git.git
>> it seems exceedingly likely that we'll have some part-way implementation
>> of this for at least a major release or two. No?
>>
>> Which is why I'm suggesting/asking if we shouldn't have something like
>> this debugging helper as part of installed tooling, because people are
>> going to try it, it's probably going to have bugs and do other weird
>> things, and I'd rather not have to manually build some test-tool to
>> debug some local sparse checkout somewhere.
>
> I'm curious why you feel it's critical that this particular piece of
> debugging machinery needs to be prioritized early and exposed; in
> particular, I'm not sure I follow the "people are going to try it"
> assertion.

The debugging machinery's already there, the question is why we have a
need for duplicating code in-tree.

I just did some cursory review of this topic, and wondered why its tests
couldn't use a builtin instead of (mostly) reinventing the wheel.

It seems to me that the reason for that state is based on a
misunderstanding about what we would and wouldn't add to builtin/*.c,
i.e. that we wouldn't have something like a --debug option, but as
ls-files shows that's not a problem.

So my interest is twofold:

 * Just a comment on "can we avoid this code duplication"

 * The related one of not wanting to re-learn some custom test helper as
   (presumably) we get N large patch series on this topic,
   if it turns out that we can use an existing well-known tool with
   minimal changes.

> Are you the one who is going to try it or are you going to
> give it to your users?  If so, what do you need out of the debugging
> tool?

I haven't understood the sparse index feature well enough to know if
anyone would ever want to run this --debug-sparse outside of the test
suite.

Isn't extracting info about its internal state going to be useful sooner
rather than later in the scenarios where you'd care enough to run "ls-files
--stage" now?

Maybe I've misunderstood this feature and it's going to be so
transparent that nobody will ever have any reason to dump how it's
working out of the index...

> You are correct that this will span multiple releases; Stolee already
> said he was planning to be working on this for most of 2021.  But just
> because pieces of the code exist and are shipped doesn't mean it'll be
> announced or supported.  For example, the git-2.30 and git-2.31
> release notes were completely silent about merge-ort.  It existed in
> both releases; in fact, the version that ships in git-2.31 could
> theoretically be used successfully by the vast majority of users for
> their daily workflow.  (But it does have known shortcomings and test
> failures so I definitely did *not* want it to be announced at that
> time.)

Yes, and that's fine. But if you'd been bending over backwards to add
merge-ort to t/helper/ "because it's not ready yet" or something I'd
have probably commented to the effect of "can't we just add it as part
of builtins but not advertise it?" which is what you did :)

>> > Making a change to a builtin is _forever_, and since the only
>> > purpose right now is to expose the data in a test environment, I
>> > don't want to adjust the builtin until either there is a real user
>> > need or the feature has otherwise stabilized. If you want to take on
>> > that responsibility, then please do.
>>
>> That's just not the case, we have plenty of unstable debug-esque options
>> in various built-in commands, in fact ls-files already has a --debug
>> option whose docs say:
>>
>>     This is intended to show as much information as possible for manual
>>     inspection; the exact format may change at any time.
>>
>> It was added in 84974217151 (ls-files: learn a debugging dump format,
>> 2010-07-31) and "just tacks all available data from the cache onto each
>> file's line" so in a way not adjusting it and using it would be a
>> regression, after all this is new data in the cache, so it should print
>> it :)
>>
>> There's also PARSE_OPT_HIDDEN for other such in-tree use. Whatever the
>> sanity/merits of me suggesting that this specific thing be in ls-files
>> instead of a test-helper, it seems far fetched that something like that
>> hidden behind a GIT_TEST_* env var (or hidden option, --debug etc.) is
>> something we'd need to worry about backwards compatibility for.
>>
>> So, whatever you think about the merits of including this functionality
>> in ls-files I think your stance of this being a no-go for adding to the
>> builtin is based on a false premise. It's fine to have
>> unstable/transitory/debug output in the builtins. We just name &
>> document them as such.
>>
>> I also had some feedback in that series and on the earlier iteration
>> that I think is appropriate to be incorporated into a re-roll of this
>> one, which doesn't have anything to do with the question of whether we
>> use ls-files or the helper in the tests. For example, shoving more stuff
>> into the read-cache.c test-tool, vs. splitting it up, makes that code
>> needlessly convoluted.
>
> Well:
>   * you seem to be strongly opposed to test-read-cache.c containing
> this code (though I don't quite follow why)

See above.

>   * Stolee seems to be strongly opposed to modifying
> builtin/ls-files.c until he has time to think through how builtins
> should work.

As noted above my reading of upthread is that those reasons basically
boil down to not knowing "git ls-files --debug" exists, and that we can
extend it.

> So putting it in another test file that looks slightly duplicative of
> test-read-cache.c might indeed be a good way out of this conundrum.
> :-)

FWIW I think that read-cache.c split is worth doing even if this series
doesn't modify t/helper/read-cache.c.

The "this is for fsmonitor" and "this is for the perf test" use-cases
are (as I think my RFC patch shows) clearer once they're split up.

> (I'm not opposed to any of the three solutions, I'm mostly chiming in
> here because I'm worried about possible bubbling frustration; see
> below.)
>
>> I don't see how recommending that I pursue that as an independent series
>> is productive for anyone. So as you re-roll this I should submit another
>> series on top to refactor your in-flight code & tests?
>
> Your tone suggests some frustration; I have a suspicion there's some
> lack of understanding or misreading that has occurred (perhaps on my
> part too), and before that misunderstanding morphs into motive
> questioning, let me see if I might be able to help...

Honestly more flabbergasted than anything, so I'm trying to clarify what
the author thinks of this direction.

I mean it's fine if it's just a "I don't think this is important and
don't want to spend time on it, but it seems like a good idea", in which
case others have the option of re-rolling some of these patches if they
care (at this point I wouldn't).

Or "this is just a bad idea for XYZ reason", which is also fine, and
even more valuable to document for future work in the area.

But to have another series built on this with refactorings back and
forth before code's landed on master just seems like needless churn.

> So far, you have advocated for:
>   A) Moving the checks to ls-files with a permanent new flag (--sparse)
>   B) Duplicating test-read-cache.c (which is admittedly pretty small)
> and then modifying the duplicate to have the new behavior, or
> alternatively:
>   C) Just stating files to get the information
>   D) Creating new debug option(s) to ls-files so that end users can
> use this in the next few releases before the feature is ready for
> prime time
> You also mentioned you had read just part of the series.
>
> Option D comes with the problem that it's not at all clear who these
> end-users are, why they want the option, or how we should design it. [...]

I think s/advocated/read the series and sent some flow-of-thought
not-ready-for-anything RFC patches on top/ would be more accurate :)

I.e. the A) --sparse thing was just reading the patch and seeing if
ls-files couldn't be made to do this, but yes, having the documented
--sparse interface might not make sense.

We discussed B) above.

C) Was a question to clarify what was meant with stat data, since it's
an offhand comment in the commit message. Does it mean "stat after the
fact" or "this will have a mode like ls-files --debug has now"?

Right now I'm just suggesting with D) that this might be rolled into the
dev-only-not-for-end-users --debug mode. 

> I'm totally onboard that ls-files should generally have
> the ability to show information in the index (e.g. if there are tree
> entries in addition to blob entries, it should be able to show both),
> but I'm not following the reasoning for why it needs to be there as
> part of the early stages of development of the sparse-index feature
> and who it's supposed to be helping in these next few releases.

We already are extracting the info at this early stage, just with a
custom helper. All I'm suggesting right now is that if the motivation
for the custom helper is "this isn't for end users", then surely having
a patch around 1/2 the size to add it to already reviewed/tested
ls-files code under a --debug option makes more sense.

Especially since the upthread commit mentions wanting to incorporate
stat() data. I'm not sure how exactly (there are no outstanding patches,
even on a WIP branch for it, AFAICT), but most likely it's further
duplication of data "ls-files --debug" already spews out.

So the patch would be 1/2 the size, and instead of saying "let's do stat
stuff in the future" it would get it for free.

Or not, part of that's speculation on information that's just in
Stolee's head. Hence this side-discussion.

> [...]

[Cut parts hopefully all clarified with the above comments]

>> [..]
>> At which point you'd be adjusting your tests that expect ls-tree format
>> output to using ls-files output, instead of using ls-files-like output
>> from the beginning?
>
> I don't understand what you're getting at here.  I was the one who
> requested Stolee make the output look like ls-trees in his original
> RFC series, so if there's a problem with this style of output, I'm to
> blame.

I didn't read the RFC series, so I missed that there was past discussion
on this point.

Perhaps something to roll into an updated commit message? My reading of
the current version is that it suggests that the ls-tree-like output is
important to get at the data we need, which my patch-for-discussion
shows isn't the case.

> [...] Once you have different types included in the index, then
> ls-files has to print all the same fields that ls-tree does...so why
> not make it look similar?

I don't have a problem with how the output looks, I happen to like the
ls-tree output better, I've just been suggesting that differing output
== code duplication.

In any case. I'm sorry about any comments I've made that came across as
snarky or whatever. Since we're talking in a text-based medium I'm going
to take the reading of a third-party native speaker (you) over mine.

I didn't mean any comments I've made that way, I'm very interested in
seeing this feature land, and just want to try to help it along. Given
the size of this thread over a relatively trivial matter I think that
"help" is probably counterproductive at this point.

I don't think this is critical or needs to be done or whatever. I've
only kept up this thread for the reasons stated above, i.e. it seeming
to me to be based on the premise that we can't add certain code to
builtin/*.c, and if we can get around that we can make this simpler.


* Re: [PATCH v4 01/20] sparse-index: design doc and format update
  2021-03-28  1:47           ` Junio C Hamano
@ 2021-03-29 14:32             ` Derrick Stolee
  0 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee @ 2021-03-29 14:32 UTC (permalink / raw)
  To: Junio C Hamano, SZEDER Gábor
  Cc: Derrick Stolee via GitGitGadget, git, newren, pclouds, jrnieder,
	Martin Ågren, Ævar Arnfjörð Bjarmason,
	Derrick Stolee, Derrick Stolee

On 3/27/2021 9:47 PM, Junio C Hamano wrote:
> SZEDER Gábor <szeder.dev@gmail.com> writes:
> 
>>> +To start, we use a new repository extension, `extensions.sparseIndex`, to
>>> +allow inserting sparse-directory entries into indexes with file format
>>> +versions 2, 3, and 4. This prevents Git versions that do not understand
>>> +the sparse-index from operating on one, but it also prevents other
>>> +operations that do not use the index at all.
>>
>> Why is this not a non-optional index extension?  ...
>> This really should be a non-optional index extension.
> 
> Yeah, the index extension mechanism was designed with optional and
> required kinds because we wanted to support exactly a use case like
> this one.
> 
> Thanks for pointing it out.

Ok, so let me be sure I understand the request, as I believe it is
a very good one:

Using a REQUIRED index extension that says "this index has
sparse-directory entries" will allow tools that don't touch
the index to be compatible with repos using the sparse-index,
while also avoiding a new index version.

I'll work on this right away. Thanks!
-Stolee


* Re: [PATCH v4 07/20] test-read-cache: print cache entries with --table
  2021-03-28 15:31                 ` Ævar Arnfjörð Bjarmason
@ 2021-03-29 19:46                   ` Derrick Stolee
  2021-03-29 21:44                     ` Junio C Hamano
  2021-03-29 23:06                     ` Ævar Arnfjörð Bjarmason
  2021-03-29 22:02                   ` Elijah Newren
  1 sibling, 2 replies; 203+ messages in thread
From: Derrick Stolee @ 2021-03-29 19:46 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Elijah Newren
  Cc: Derrick Stolee via GitGitGadget, Git Mailing List,
	Junio C Hamano, Nguyễn Thái Ngọc,
	Jonathan Nieder, Martin Ågren, SZEDER Gábor,
	Derrick Stolee, Derrick Stolee

On 3/28/2021 11:31 AM, Ævar Arnfjörð Bjarmason wrote:
> It seems to me that the reason for that state is based on a
> misunderstanding about what we would and wouldn't add to builtin/*.c,
> i.e. that we wouldn't have something like a --debug option, but as
> ls-files shows that's not a problem.

I feel _strongly_ that a change to the user-facing CLI should come
with a good reason and care about how it locks-in behavior for the
future.

Any adjustment to 'git ls-files' deserves its own series and
attention, not in an already-too-large series like this one.

I'm not happy that this series and the next are so long, but that's
the best I can do to make them reviewable and still capture a
complete scenario. Hopefully the remaining series after these first
two are smaller. Things like "what should 'git ls-files' do with a
sparse index?" can fit cleanly on top once the core functionality
of the internals is stable.

I have an _opinion_ that the ls-files output is not well-suited to
testing because the --debug output splits details across multiple
lines. This is a minor point that could probably be corrected by
a complicated script method, but that's why I list this as an
opinion.

> I mean it's fine if it's just a "I don't think this is important and
> don't want to spend time on it, but it seems like a good idea", in which
> case others have the option of re-rolling some of these patches if they
> care (at this point I wouldn't).
> 
> Or "this is just a bad idea for XYZ reason", which is also fine, and
> even more valuable to document for future work in the area.
> 
> But to have another series built on this with refactorings back and
> forth before code's landed on master just seems like needless churn.

I think changing 'ls-files' before the sparse index has stabilized is
premature. I said that a series like the RFC you sent would be
appropriate after this concept is more stable. I do _not_ recommend
trying to juggle it on top of the work while the patches are in flight.

Thanks,
-Stolee


* Re: [PATCH v4 07/20] test-read-cache: print cache entries with --table
  2021-03-29 19:46                   ` Derrick Stolee
@ 2021-03-29 21:44                     ` Junio C Hamano
  2021-03-30 11:28                       ` Derrick Stolee
  2021-03-29 23:06                     ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 203+ messages in thread
From: Junio C Hamano @ 2021-03-29 21:44 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Ævar Arnfjörð Bjarmason, Elijah Newren,
	Derrick Stolee via GitGitGadget, Git Mailing List,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> I think changing 'ls-files' before the sparse index has stabilized is
> premature. I said that a series like the RFC you sent would be
> appropriate after this concept is more stable. I do _not_ recommend
> trying to juggle it on top of the work while the patches are in flight.

I do not have a problem with either of the approaches to help debugging
(i.e. extending "ls-files --debug" or a new test helper), but I am
curious to be reminded what the plan for "git ls-files [-s]" output
is, when run in a repository in which sparse cone checkout is used.

Do we see trees and paths outside the cone omitted, or does the act
of running "ls-files" dehydrate the trees into their constituent
blobs?

Thanks.


* Re: [PATCH v4 07/20] test-read-cache: print cache entries with --table
  2021-03-28 15:31                 ` Ævar Arnfjörð Bjarmason
  2021-03-29 19:46                   ` Derrick Stolee
@ 2021-03-29 22:02                   ` Elijah Newren
  1 sibling, 0 replies; 203+ messages in thread
From: Elijah Newren @ 2021-03-29 22:02 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Derrick Stolee, Derrick Stolee via GitGitGadget,
	Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

Hi,

On Sun, Mar 28, 2021 at 8:31 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> On Fri, Mar 26 2021, Elijah Newren wrote:
>
[...]
> > You are correct that this will span multiple releases; Stolee already
> > said he was planning to be working on this for most of 2021.  But just
> > because pieces of the code exist and are shipped doesn't mean it'll be
> > announced or supported.  For example, the git-2.30 and git-2.31
> > release notes were completely silent about merge-ort.  It existed in
> > both releases; in fact, the version that ships in git-2.31, could
> > theoretically be used successfully by the vast majority of users for
> > their daily workflow.  (But it does have known shortcomings and test
> > failures so I definitely did *not* want it to be announced at that
> > time.)
>
> Yes, and that's fine. But if you'd been bending over backwards to add
> merge-ort to t/helper/ "because it's not ready yet" or something I'd
> have probably commented to the effect of "can't we just add it as part
> of builtins but not advertise it?" which is what you did :)

Actually, I did add a t/helper/test-fast-rebase.c (which is a few
hundred lines long) as part of the work on merge-ort, because
merge-ort wasn't ready and because rewiring sequencer.c was a huge
amount of work that I didn't want to get distracted by at the time.  I
originally suggested making fast-rebase a non-advertised builtin, but
multiple reviewers suggested the test helper route instead.

¯\_(ツ)_/¯


* Re: [PATCH v4 07/20] test-read-cache: print cache entries with --table
  2021-03-29 19:46                   ` Derrick Stolee
  2021-03-29 21:44                     ` Junio C Hamano
@ 2021-03-29 23:06                     ` Ævar Arnfjörð Bjarmason
  2021-03-30 11:41                       ` Derrick Stolee
  1 sibling, 1 reply; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-29 23:06 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren, Derrick Stolee via GitGitGadget, Git Mailing List,
	Junio C Hamano, Nguyễn Thái Ngọc,
	Jonathan Nieder, Martin Ågren, SZEDER Gábor,
	Derrick Stolee, Derrick Stolee


On Mon, Mar 29 2021, Derrick Stolee wrote:

> On 3/28/2021 11:31 AM, Ævar Arnfjörð Bjarmason wrote:
>> It seems to me that the reason for that state is based on a
>> misunderstanding about what we would and wouldn't add to builtin/*.c,
>> i.e. that we wouldn't have something like a --debug option, but as
>> ls-files shows that's not a problem.

At the risk of going in circles here...

> I feel _strongly_ that a change to the user-facing CLI should come
> with a good reason and care about how it locks-in behavior for the
> future.

And I agree with you. Where we disagree is whether "lives in
builtin/*.c" == "user-facing". I think --debug options are != that. It
seems Junio downthread agrees with that.

> Any adjustment to 'git ls-files' deserves its own series and
> attention[...]

A user-facing change to it yes, but I don't see how use of an (existing
even) --debug option would warrant any more attention than a new test
helper, less actually, it's less new code.

> [...] not in an already-too-large series like this one.

The alternative way of doing it at the end of
https://lore.kernel.org/git/874kgzq4qi.fsf@evledraar.gmail.com would
make this series smaller.

Anyway. As I noted in the E-Mail you're replying to
(https://lore.kernel.org/git/87eeg0ng78.fsf@evledraar.gmail.com/) I
really don't care that much.

I'm just still perplexed at how you keep bringing up use of an
internal-only --debug option as "user-facing", and here "already too
large" when we're talking about a proposed alternate direction that
would reduce the size.

> I'm not happy that this series and the next are so long, but that's
> the best I can do to make them reviewable and still capture a
> complete scenario. Hopefully the remaining series after these first
> two are smaller. Things like "what should 'git ls-files' do with a
> sparse index?" can fit cleanly on top once the core functionality
> of the internals is stable.

Sure. I'm fully on board with just moving forward with this in some
manner.

I'm not on board with the part of this that seems like it could just be
rephrased/understood as "...and we're not touching ls-files even with a
--debug option now because that would be user-facing[...]".

> I have an _opinion_ that the ls-files output is not well-suited to
> testing because the --debug output splits details across multiple
> lines. This is a minor point that could probably be corrected by
> a complicated script method, but that's why I list this as an
> opinion.

If the --debug output it's spewing now isn't handy we can just change
the output format. The docs say:

    This is intended to show as much information as possible for manual
    inspection; the exact format may change at any time.

And we don't have existing in-tree users, something like this would make
it rather trivial:
    
    diff --git a/builtin/ls-files.c b/builtin/ls-files.c
    index f6f9e483b27..7596edc9f9d 100644
    --- a/builtin/ls-files.c
    +++ b/builtin/ls-files.c
    @@ -113,11 +113,11 @@ static void print_debug(const struct cache_entry *ce)
     	if (debug_mode) {
     		const struct stat_data *sd = &ce->ce_stat_data;
     
    -		printf("  ctime: %u:%u\n", sd->sd_ctime.sec, sd->sd_ctime.nsec);
    -		printf("  mtime: %u:%u\n", sd->sd_mtime.sec, sd->sd_mtime.nsec);
    -		printf("  dev: %u\tino: %u\n", sd->sd_dev, sd->sd_ino);
    -		printf("  uid: %u\tgid: %u\n", sd->sd_uid, sd->sd_gid);
    -		printf("  size: %u\tflags: %x\n", sd->sd_size, ce->ce_flags);
    +		printf("  ctime: %u:%u%c", sd->sd_ctime.sec, sd->sd_ctime.nsec, line_terminator);
    +		printf("  mtime: %u:%u%c", sd->sd_mtime.sec, sd->sd_mtime.nsec, line_terminator);
    +		printf("  dev: %u\tino: %u%c", sd->sd_dev, sd->sd_ino, line_terminator);
    +		printf("  uid: %u\tgid: %u%c", sd->sd_uid, sd->sd_gid, line_terminator);
    +		printf("  size: %u\tflags: %x%c", sd->sd_size, ce->ce_flags, line_terminator);
     	}
     }

But even without that it wouldn't be some complicated post-processing,
just a pipe to a small perl or awk process.
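
For illustration only, here is the same joining step sketched in C
rather than perl or awk. It assumes the layout implied by the
print_debug() hunk above: each detail line starts with two spaces and
any other line begins a new entry. The function name join_debug_lines()
is hypothetical and is not part of Git.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Git): collapse "ls-files --debug"
 * style text so that each index entry occupies a single line.
 *
 * Assumption: detail lines start with two spaces, as in the quoted
 * print_debug() hunk; every other line starts a new entry. Detail
 * lines are appended to their entry, separated by a tab. The result
 * is NUL-terminated and silently truncated if 'out' is too small.
 */
static void join_debug_lines(const char *in, char *out, size_t outlen)
{
	size_t n = 0;

	if (!outlen)
		return;

	while (*in) {
		const char *eol = strchr(in, '\n');
		size_t len = eol ? (size_t)(eol - in) : strlen(in);
		const char *next = eol ? eol + 1 : in + len;
		size_t i = 0;

		if (len >= 2 && in[0] == ' ' && in[1] == ' ') {
			/* Detail line: drop the indent, join with a tab. */
			i = 2;
			if (n + 1 < outlen)
				out[n++] = '\t';
		} else if (n) {
			/* New entry: terminate the previous one first. */
			if (n + 1 < outlen)
				out[n++] = '\n';
		}

		for (; i < len && n + 1 < outlen; i++)
			out[n++] = in[i];
		in = next;
	}

	/* Terminate the final entry, if there was any output at all. */
	if (n && n + 1 < outlen)
		out[n++] = '\n';
	out[n] = '\0';
}
```

A test could then compare one line per entry instead of matching a
multi-line block.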
     
>> I mean it's fine if it's just a "I don't think this is important and
>> don't want to spend time on it, but it seems like a good idea", in which
>> case others have the option of re-rolling some of these patches if they
>> care (at this point I wouldn't).
>> 
>> Or "this is just a bad idea for XYZ reason", which is also fine, and
>> even more valuable to document for future work in the area.
>> 
>> But to have another series built on this with refactorings back and
>> forth before code's landed on master just seems like needless churn.
>
> I think changing 'ls-files' before the sparse index has stabilized is
> premature. I said that a series like the RFC you sent would be
> appropriate after this concept is more stable. I do _not_ recommend
> trying to juggle it on top of the work while the patches are in flight.

Just to clarify, upthread in [1] you said:

    And I recommend that you continue to pursue [these RFC patches] as
    an independent series, but I'm not going to incorporate them into
    this one[...]

So do I understand it right that you're referring to phase IV in your
opinion being the first point where we'd consider piggy-backing on
anything in builtin (that "user-facing" dilemma again...)?

But at that point wouldn't you have your own ideas about some
user-facing ls-files or other porcelain for this, so I'm not sure where
to place the encouragement that I continue to pursue that RFC series,
other than setting a reminder in my calendar for 6-12 months in the
future :)

1. https://lore.kernel.org/git/ca8a96a4-5897-2484-b195-57e5b3820576@gmail.com/


* Re: [PATCH v4 07/20] test-read-cache: print cache entries with --table
  2021-03-29 21:44                     ` Junio C Hamano
@ 2021-03-30 11:28                       ` Derrick Stolee
  0 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee @ 2021-03-30 11:28 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ævar Arnfjörð Bjarmason, Elijah Newren,
	Derrick Stolee via GitGitGadget, Git Mailing List,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

On 3/29/2021 5:44 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> 
>> I think changing 'ls-files' before the sparse index has stabilized is
>> premature. I said that a series like the RFC you sent would be
>> appropriate after this concept is more stable. I do _not_ recommend
>> trying to juggle it on top of the work while the patches are in flight.
> 
> I do not have a problem with either of the approaches to help debugging
> (i.e. extending "ls-files --debug" or a new test helper), but I am
> curious to be reminded what the plan for "git ls-files [-s]" output
> is, when run in a repository in which sparse cone checkout is used.
> 
> Do we see trees and paths outside the cone omitted, or does the act
> of running "ls-files" dehydrate the trees into their constituent
> blobs?

At the moment, end-to-end behavior is identical to before: sparse
directory entries are expanded to all of the contained blobs instead
of writing the tree entries.

The sparse-index work will not be complete until every command is
audited for potential behavior change when disabling the
command_requires_full_index setting. That includes deciding on the
best behavior for ls-files, and will likely include an option
for both possible outputs (tree entries, or expanding to blobs). The
interesting discussion that is worth its own topic is whether or not
the tree entries should be displayed by default.
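
As a rough sketch of the guard described above, the toy model below
expands a sparse index whenever the running command still has
command_requires_full_index enabled. The struct and the toy_* functions
are hypothetical stand-ins chosen for illustration. They are not Git's
actual index_state or ensure_full_index() implementation, and the
counters merely stand in for real cache entries.

```c
#include <stdbool.h>

/*
 * Simplified model, not Git's data structures: 'toy_index' and its
 * fields are hypothetical stand-ins for the in-memory index.
 */
struct toy_index {
	bool sparse;               /* has sparse-directory entries */
	unsigned int nr_blobs;     /* blob entries currently listed */
	unsigned int nr_collapsed; /* blobs hidden in sparse directories */
};

/* Stand-in for ensure_full_index(): expand directories into blobs. */
static void toy_ensure_full_index(struct toy_index *istate)
{
	if (!istate->sparse)
		return;
	istate->nr_blobs += istate->nr_collapsed;
	istate->nr_collapsed = 0;
	istate->sparse = false;
}

/*
 * Stand-in for the read path: until a command has been audited, its
 * command_requires_full_index setting stays enabled, so the index is
 * expanded before the command ever sees a sparse-directory entry.
 */
static void toy_read_index(struct toy_index *istate,
			   bool command_requires_full_index)
{
	if (command_requires_full_index)
		toy_ensure_full_index(istate);
}
```

In this model an unaudited command always observes the expanded blob
list, which matches the "identical end-to-end behavior" described
above; only a command that has disabled the setting would ever see the
collapsed form.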

So the plan is: this _will_ be addressed, but in the future after
the core functionality and value of the sparse-index is set.

Thanks,
-Stolee


* Re: [PATCH v4 07/20] test-read-cache: print cache entries with --table
  2021-03-29 23:06                     ` Ævar Arnfjörð Bjarmason
@ 2021-03-30 11:41                       ` Derrick Stolee
  0 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee @ 2021-03-30 11:41 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Elijah Newren, Derrick Stolee via GitGitGadget, Git Mailing List,
	Junio C Hamano, Nguyễn Thái Ngọc,
	Jonathan Nieder, Martin Ågren, SZEDER Gábor,
	Derrick Stolee, Derrick Stolee

On 3/29/2021 7:06 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Mon, Mar 29 2021, Derrick Stolee wrote:
> 
>> On 3/28/2021 11:31 AM, Ævar Arnfjörð Bjarmason wrote:
>>> It seems to me that the reason for that state is based on a
>>> misunderstanding about what we would and wouldn't add to builtin/*.c,
>>> i.e. that we wouldn't have something like a --debug option, but as
>>> ls-files shows that's not a problem.
> 
> At the risk of going in circles here...
> 
>> I feel _strongly_ that a change to the user-facing CLI should come
>> with a good reason and care about how it locks-in behavior for the
>> future.
> 
> And I agree with you. Where we disagree is whether "lives in
> builtin/*.c" == "user-facing". I think --debug options are != that. It
> seems Junio downthread agrees with that.
> 
>> Any adjustment to 'git ls-files' deserves its own series and
>> attention[...]
> 
> A user-facing change to it yes, but I don't see how use of an (existing
> even) --debug option would warrant any more attention than a new test
> helper, less actually, it's less new code.

I disagree that we can change the expected output of --debug so
quickly, despite warnings in the documentation. Changing that format
or creating a new output format requires cognitive load, and we have
enough of that going on in this area as it is.

>> [...] not in an already-too-large series like this one.
...
> I'm just still perplexed at how you keep bringing up use of an
> internal-only --debug option as "user-facing", and here "already too
> large" when we're talking about a proposed alternate direction that
> would reduce the size.

I'm not saying "patch size" or "code size" but instead thinking of it
in terms of how many decisions need to be made. Changing a builtin
when it's not necessary adds to the complexity of the series and
interrupts its core goals.

Finally, I have mentioned that I will need extra data for testing a
new index format. I don't want to modify the builtin now in a way
that is insufficient for the needs in that future series.

> Just to clarify, upthread in [1] you said:
> 
>     And I recommend that you continue to pursue [these RFC patches] as
>     an independent series, but I'm not going to incorporate them into
>     this one[...]
> 
> So do I understand it right that you're referring to phase IV in your
> opinion being the first point where we'd consider piggy-backing on
> anything in builtin (that "user-facing" dilemma again...)?

I'm saying that if you feel strongly about it, then please pursue the
changes to ls-files any time after this series (but probably after
the next) solidifies. Having the changes be in a separate series allows
time to inspect the behavior change to the builtin in a focused way.
 
> But at that point wouldn't you have your own ideas about some
> user-facing ls-files or other porcelain for this, so I'm not sure where
> to place the encouragement that I continue to pursue that RFC series,
> other than setting a reminder in my calendar for 6-12 months in the
> future :)

Otherwise, I will modify ls-files myself in this 6-12 month timeframe,
based on the established plan to remove the command_requires_full_index
setting.

Thanks,
-Stolee


* [PATCH v5 00/21] Sparse Index: Design, Format, Tests
  2021-03-23 13:44     ` [PATCH v4 " Derrick Stolee via GitGitGadget
                         ` (20 preceding siblings ...)
  2021-03-23 16:16       ` [PATCH v4 00/20] Sparse Index: Design, Format, Tests Elijah Newren
@ 2021-03-30 13:10       ` Derrick Stolee via GitGitGadget
  2021-03-30 13:10         ` [PATCH v5 01/21] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
                           ` (22 more replies)
  21 siblings, 23 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-30 13:10 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

Here is the first full patch series submission coming out of the
sparse-index RFC [1].

[1]
https://lore.kernel.org/git/pull.847.git.1611596533.gitgitgadget@gmail.com/

I won't waste too much space here, because PATCH 1 includes a sizeable
design document that describes the feature, the reasoning behind it, and my
plan for getting this implemented widely throughout the codebase.

There are some new things here that were not in the RFC:

 * Design doc and format updates. (Patch 1)
 * Performance test script. (Patches 2 and 20)

Notably missing in this series from the RFC:

 * The mega-patch inserting ensure_full_index() throughout the codebase.
   That will be a follow-up series to this one.
 * The integrations with git status and git add to demonstrate the improved
   performance. Those will also appear in their own series later.

I plan to keep my latest work in this area in my 'sparse-index/wip' branch
[2]. It includes all of the work from the RFC right now, updated with the
work from this series.

[2] https://github.com/derrickstolee/git/tree/sparse-index/wip


Updates in V5
=============

This version is updated to use an index extension instead of a repository
format extension. Thanks, Szeder! This one change affects the range-diff
quite a bit, so please review those changes carefully.

In particular: git sparse-checkout init --cone --sparse-index now sets a new
index.sparse config option as an indicator that we should attempt writing
the index in sparse form.


Updates in V4
=============

 * Rebased onto the latest copy of ab/read-tree.
 * Updated the design document as per Junio's comments.
 * Updated the submodule handling in the performance test.
 * Followed up on some other review from Ævar, mostly style or commit
   message things.


Updates in V3
=============

For this version, I took Ævar's latest patches and applied them to v2.31.0
and rebased this series on top. It uses his new "read_tree_at()" helper and
the associated changes to the function pointer type.

 * Fixed more typos. Thanks Martin and Elijah!
 * Updated the test_sparse_match() macro to use "$@" instead of $*
 * Added a test that git sparse-checkout init --no-sparse-index rewrites the
   index to be full.


Updates in V2
=============

 * Various typos and awkward grammar are fixed.
 * Cleaned up unnecessary commands in p2000-sparse-operations.sh
 * Added a comment to the sparse_index member of struct index_state.
 * Used tree_type, commit_type, and blob_type in test-read-cache.c.

Thanks, -Stolee

Derrick Stolee (21):
  sparse-index: design doc and format update
  t/perf: add performance test for sparse operations
  t1092: clean up script quoting
  sparse-index: add guard to ensure full index
  sparse-index: implement ensure_full_index()
  t1092: compare sparse-checkout to sparse-index
  test-read-cache: print cache entries with --table
  test-tool: don't force full index
  unpack-trees: ensure full index
  sparse-checkout: hold pattern list in index
  sparse-index: add 'sdir' index extension
  sparse-index: convert from full to sparse
  submodule: sparse-index should not collapse links
  unpack-trees: allow sparse directories
  sparse-index: check index conversion happens
  sparse-index: add index.sparse config option
  sparse-checkout: toggle sparse index from builtin
  sparse-checkout: disable sparse-index
  cache-tree: integrate with sparse directory entries
  sparse-index: loose integration with cache_tree_verify()
  p2000: add sparse-index repos

 Documentation/config/index.txt           |   5 +
 Documentation/git-sparse-checkout.txt    |  14 ++
 Documentation/technical/index-format.txt |  19 ++
 Documentation/technical/sparse-index.txt | 175 ++++++++++++++
 Makefile                                 |   1 +
 builtin/sparse-checkout.c                |  44 +++-
 cache-tree.c                             |  40 ++++
 cache.h                                  |  18 +-
 read-cache.c                             |  44 +++-
 repo-settings.c                          |  15 ++
 repository.c                             |  11 +-
 repository.h                             |   3 +
 sparse-index.c                           | 285 +++++++++++++++++++++++
 sparse-index.h                           |  11 +
 t/README                                 |   3 +
 t/helper/test-read-cache.c               |  66 +++++-
 t/perf/p2000-sparse-operations.sh        | 101 ++++++++
 t/t1091-sparse-checkout-builtin.sh       |  13 ++
 t/t1092-sparse-checkout-compatibility.sh | 143 ++++++++++--
 unpack-trees.c                           |  17 +-
 20 files changed, 988 insertions(+), 40 deletions(-)
 create mode 100644 Documentation/technical/sparse-index.txt
 create mode 100644 sparse-index.c
 create mode 100644 sparse-index.h
 create mode 100755 t/perf/p2000-sparse-operations.sh


base-commit: 47957485b3b731a7860e0554d2bd12c0dce1c75a
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-883%2Fderrickstolee%2Fsparse-index%2Fformat-v5
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-883/derrickstolee/sparse-index/format-v5
Pull-Request: https://github.com/gitgitgadget/git/pull/883

Range-diff vs v4:

  1:  6426a5c60e53 !  1:  7b600d536c6e sparse-index: design doc and format update
     @@ Documentation/technical/sparse-index.txt (new)
      +The only noticeable change in behavior will be that the serialized index
      +file contains sparse-directory entries.
      +
     -+To start, we use a new repository extension, `extensions.sparseIndex`, to
     -+allow inserting sparse-directory entries into indexes with file format
     ++To start, we use a new required index extension, `sdir`, to allow
     ++inserting sparse-directory entries into indexes with file format
      +versions 2, 3, and 4. This prevents Git versions that do not understand
     -+the sparse-index from operating on one, but it also prevents other
     -+operations that do not use the index at all. A new format, index v5, will
     -+be introduced that includes sparse-directory entries by default. It might
     -+also introduce other features that have been considered for improving the
     ++the sparse-index from operating on one, while allowing tools that do not
     ++understand the sparse-index to operate on repositories as long as they do
     ++not interact with the index. A new format, index v5, will be introduced
     ++that includes sparse-directory entries by default. It might also
     ++introduce other features that have been considered for improving the
      +index, as well.
      +
      +Next, consumers of the index will be guarded against operating on a
  2:  7eabc1d0586c =  2:  202253ec82f3 t/perf: add performance test for sparse operations
  3:  c9e21d78ecba =  3:  437a0f144e57 t1092: clean up script quoting
  4:  03cdde756563 =  4:  b7e1bf5c55a7 sparse-index: add guard to ensure full index
  5:  6b3b6d86385d =  5:  e41d55d2cca9 sparse-index: implement ensure_full_index()
  6:  7f67adba0498 =  6:  7bfbfbd17321 t1092: compare sparse-checkout to sparse-index
  7:  7ebd9570b1ad =  7:  a1b8135c0fc8 test-read-cache: print cache entries with --table
  8:  db7bbd06dbcc =  8:  dd84a2a9121b test-tool: don't force full index
  9:  3ddd5e794b5e =  9:  b276d2ed5323 unpack-trees: ensure full index
 10:  7308c87697f1 = 10:  c3651e26dc3a sparse-checkout: hold pattern list in index
  -:  ------------ > 11:  f926cf8b2e01 sparse-index: add 'sdir' index extension
 11:  7c10d653ca6b = 12:  c870ae5e8749 sparse-index: convert from full to sparse
 12:  6db36f33e960 = 13:  bcf0da959ef3 submodule: sparse-index should not collapse links
 13:  d24bd3348d98 = 14:  7191b48237de unpack-trees: allow sparse directories
 14:  08d9f5f3c0d1 = 15:  57be9b4a728b sparse-index: check index conversion happens
 15:  6f38cef196b0 ! 16:  c22b4111e49e sparse-index: create extension for compatibility
     @@ Metadata
      Author: Derrick Stolee <dstolee@microsoft.com>
      
       ## Commit message ##
     -    sparse-index: create extension for compatibility
     +    sparse-index: add index.sparse config option
      
     -    Previously, we enabled the sparse index format only using
     -    GIT_TEST_SPARSE_INDEX=1. This is not a feasible direction for users to
     -    actually select this mode. Further, sparse directory entries are not
     -    understood by the index formats as advertised.
     -
     -    We _could_ add a new index version that explicitly adds these
     -    capabilities, but there are nuances to index formats 2, 3, and 4 that
     -    are still valuable to select as options. Until we add index format
     -    version 5, create a repo extension, "extensions.sparseIndex", that
     -    specifies that the tool reading this repository must understand sparse
     -    directory entries.
     -
     -    This change only encodes the extension and enables it when
     -    GIT_TEST_SPARSE_INDEX=1. Later, we will add a more user-friendly CLI
     -    mechanism.
     +    When enabled, this config option signals that index writes should
     +    attempt to use sparse-directory entries.
      
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
     - ## Documentation/config/extensions.txt ##
     -@@ Documentation/config/extensions.txt: extensions.objectFormat::
     - Note that this setting should only be set by linkgit:git-init[1] or
     - linkgit:git-clone[1].  Trying to change it after initialization will not
     - work and will produce hard-to-diagnose issues.
     + ## Documentation/config/index.txt ##
     +@@ Documentation/config/index.txt: index.recordOffsetTable::
     + 	Defaults to 'true' if index.threads has been explicitly enabled,
     + 	'false' otherwise.
     + 
     ++index.sparse::
     ++	When enabled, write the index using sparse-directory entries. This
     ++	has no effect unless `core.sparseCheckout` and
     ++	`core.sparseCheckoutCone` are both enabled. Defaults to 'false'.
      +
     -+extensions.sparseIndex::
     -+	When combined with `core.sparseCheckout=true` and
     -+	`core.sparseCheckoutCone=true`, the index may contain entries
     -+	corresponding to directories outside of the sparse-checkout
     -+	definition in lieu of containing each path under such directories.
     -+	Versions of Git that do not understand this extension do not
     -+	expect directory entries in the index.
     + index.threads::
     + 	Specifies the number of threads to spawn when loading the index.
     + 	This is meant to reduce index load time on multiprocessor machines.
      
       ## cache.h ##
      @@ cache.h: struct repository_format {
     @@ repo-settings.c: void prepare_repo_settings(struct repository *r)
      +	 * Initialize this as off.
      +	 */
      +	r->settings.sparse_index = 0;
     -+	if (!repo_config_get_bool(r, "extensions.sparseindex", &value) && value)
     ++	if (!repo_config_get_bool(r, "index.sparse", &value) && value)
      +		r->settings.sparse_index = 1;
       }
      
     @@ repository.h: struct repo_settings {
       
       struct repository {
      
     - ## setup.c ##
     -@@ setup.c: static enum extension_result handle_extension(const char *var,
     - 			return error("invalid value for 'extensions.objectformat'");
     - 		data->hash_algo = format;
     - 		return EXTENSION_OK;
     -+	} else if (!strcmp(ext, "sparseindex")) {
     -+		data->sparse_index = 1;
     -+		return EXTENSION_OK;
     - 	}
     - 	return EXTENSION_UNKNOWN;
     - }
     -
       ## sparse-index.c ##
      @@ sparse-index.c: static int convert_to_sparse_rec(struct index_state *istate,
       	return num_converted - start_converted;
     @@ sparse-index.c: static int convert_to_sparse_rec(struct index_state *istate,
      +{
      +	const char *config_path = repo_git_path(repo, "config.worktree");
      +
     -+	if (upgrade_repository_format(1) < 0) {
     -+		warning(_("unable to upgrade repository format to enable sparse-index"));
     -+		return -1;
     -+	}
      +	git_config_set_in_file_gently(config_path,
     -+				      "extensions.sparseIndex",
     ++				      "index.sparse",
      +				      "true");
      +
      +	prepare_repo_settings(repo);
     @@ sparse-index.c: static int convert_to_sparse_rec(struct index_state *istate,
      +
      +	/*
      +	 * The GIT_TEST_SPARSE_INDEX environment variable triggers the
     -+	 * extensions.sparseIndex config variable to be on.
     ++	 * index.sparse config variable to be on.
      +	 */
      +	if (git_env_bool("GIT_TEST_SPARSE_INDEX", 0)) {
      +		int err = enable_sparse_index(istate->repo);
     @@ sparse-index.c: static int convert_to_sparse_rec(struct index_state *istate,
      -	 * GIT_TEST_SPARSE_INDEX environment variable. We will relax
      -	 * this once we have a proper way to opt-in (and later still,
      -	 * opt-out).
     -+	 * Only convert to sparse if extensions.sparseIndex is set.
     ++	 * Only convert to sparse if index.sparse is set.
       	 */
      -	if (!git_env_bool("GIT_TEST_SPARSE_INDEX", 0))
      +	prepare_repo_settings(istate->repo);
 16:  923081e7e079 ! 17:  75fe9b0f57da sparse-checkout: toggle sparse index from builtin
     @@ Documentation/git-sparse-checkout.txt: To avoid interfering with other worktrees
      +that is not completely understood by external tools. If you have trouble
      +with this compatibility, then run `git sparse-checkout init --no-sparse-index`
      +to rewrite your index to not be sparse. Older versions of Git will not
     -+understand the `sparseIndex` repository extension and may fail to interact
     -+with your repository until it is disabled.
     ++understand the sparse directory entries index extension and may fail to
     ++interact with your repository until it is disabled.
       
       'set'::
       	Write a set of patterns to the sparse-checkout file, as given as
     @@ builtin/sparse-checkout.c: static int sparse_checkout_init(int argc, const char
      
       ## sparse-index.c ##
      @@ sparse-index.c: static int convert_to_sparse_rec(struct index_state *istate,
     + 	return num_converted - start_converted;
     + }
       
     - static int enable_sparse_index(struct repository *repo)
     +-static int enable_sparse_index(struct repository *repo)
     ++static int set_index_sparse_config(struct repository *repo, int enable)
       {
      -	const char *config_path = repo_git_path(repo, "config.worktree");
     -+	int res;
     - 
     - 	if (upgrade_repository_format(1) < 0) {
     - 		warning(_("unable to upgrade repository format to enable sparse-index"));
     - 		return -1;
     - 	}
     +-
      -	git_config_set_in_file_gently(config_path,
     --				      "extensions.sparseIndex",
     +-				      "index.sparse",
      -				      "true");
     -+	res = git_config_set_gently("extensions.sparseindex", "true");
     ++	int res;
     ++	char *config_path = repo_git_path(repo, "config.worktree");
     ++	res = git_config_set_in_file_gently(config_path,
     ++					    "index.sparse",
     ++					    enable ? "true" : NULL);
     ++	free(config_path);
       
       	prepare_repo_settings(repo);
       	repo->settings.sparse_index = 1;
     @@ sparse-index.c: static int convert_to_sparse_rec(struct index_state *istate,
      +
      +int set_sparse_index_config(struct repository *repo, int enable)
      +{
     -+	int res;
     -+
     -+	if (enable)
     -+		return enable_sparse_index(repo);
     -+
     -+	/* Don't downgrade repository format, just remove the extension. */
     -+	res = git_config_set_gently("extensions.sparseindex", NULL);
     ++	int res = set_index_sparse_config(repo, enable);
      +
      +	prepare_repo_settings(repo);
     -+	repo->settings.sparse_index = 0;
     ++	repo->settings.sparse_index = enable;
      +	return res;
       }
       
     @@ sparse-index.c: static int convert_to_sparse_rec(struct index_state *istate,
       	    !core_apply_sparse_checkout || !core_sparse_checkout_cone)
       		return 0;
      @@ sparse-index.c: int convert_to_sparse(struct index_state *istate)
     - 		istate->repo = the_repository;
     - 
     - 	/*
     --	 * The GIT_TEST_SPARSE_INDEX environment variable triggers the
     --	 * extensions.sparseIndex config variable to be on.
     -+	 * If GIT_TEST_SPARSE_INDEX=1, then trigger extensions.sparseIndex
     -+	 * to be fully enabled. If GIT_TEST_SPARSE_INDEX=0 (set explicitly),
     -+	 * then purposefully disable the setting.
     + 	 * The GIT_TEST_SPARSE_INDEX environment variable triggers the
     + 	 * index.sparse config variable to be on.
       	 */
      -	if (git_env_bool("GIT_TEST_SPARSE_INDEX", 0)) {
      -		int err = enable_sparse_index(istate->repo);
     @@ sparse-index.c: int convert_to_sparse(struct index_state *istate)
      +		set_sparse_index_config(istate->repo, test_env);
       
       	/*
     - 	 * Only convert to sparse if extensions.sparseIndex is set.
     + 	 * Only convert to sparse if index.sparse is set.
      
       ## sparse-index.h ##
      @@ sparse-index.h: struct index_state;
     @@ t/t1092-sparse-checkout-compatibility.sh: init_repos () {
      -	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout init --cone &&
      -	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep
      +	git -C sparse-index sparse-checkout init --cone --sparse-index &&
     -+	test_cmp_config -C sparse-index true extensions.sparseindex &&
     ++	test_cmp_config -C sparse-index true index.sparse &&
      +	git -C sparse-index sparse-checkout set deep
       }
       
 17:  6f1ad72c390d ! 18:  7f55a232e647 sparse-checkout: disable sparse-index
     @@ t/t1091-sparse-checkout-builtin.sh: test_expect_success 'sparse-checkout disable
       
      +test_expect_success 'sparse-index enabled and disabled' '
      +	git -C repo sparse-checkout init --cone --sparse-index &&
     -+	test_cmp_config -C repo true extensions.sparseIndex &&
     ++	test_cmp_config -C repo true index.sparse &&
      +	test-tool -C repo read-cache --table >cache &&
      +	grep " tree " cache &&
      +
     @@ t/t1091-sparse-checkout-builtin.sh: test_expect_success 'sparse-checkout disable
      +	test-tool -C repo read-cache --table >cache &&
      +	! grep " tree " cache &&
      +	git -C repo config --list >config &&
     -+	! grep extensions.sparseindex config
     ++	! grep index.sparse config
      +'
      +
       test_expect_success 'cone mode: init and set' '
 18:  bd94e6b7d089 = 19:  365901809d9d cache-tree: integrate with sparse directory entries
 19:  e7190376b806 = 20:  9b068c458898 sparse-index: loose integration with cache_tree_verify()
 20:  bcf0a58eb38c = 21:  66602733cc95 p2000: add sparse-index repos

-- 
gitgitgadget


* [PATCH v5 01/21] sparse-index: design doc and format update
  2021-03-30 13:10       ` [PATCH v5 00/21] " Derrick Stolee via GitGitGadget
@ 2021-03-30 13:10         ` Derrick Stolee via GitGitGadget
  2021-03-30 13:10         ` [PATCH v5 02/21] t/perf: add performance test for sparse operations Derrick Stolee via GitGitGadget
                           ` (21 subsequent siblings)
  22 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-30 13:10 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

This begins a long effort to update the index format to allow sparse
directory entries. This should result in a significant improvement to
Git commands when HEAD contains millions of files, but the user has
selected many fewer files to keep in their sparse-checkout definition.
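
As a rough illustration of what a sparse directory entry buys us, here is
a minimal sketch in plain shell and awk. It is not how Git implements the
conversion; `collapse_paths` is a hypothetical helper that takes a single
recursive cone directory and a list of index paths on stdin. Paths inside
the cone, and files directly inside its parent directories, are kept;
every other path is replaced by the shallowest directory outside the cone,
written with a trailing slash. The real conversion has more rules (for
example, directories containing submodules are not collapsed).

```shell
#!/bin/sh
# Hypothetical helper for illustration only; not part of Git.
# usage: collapse_paths <cone-directory> <paths-on-stdin
collapse_paths () {
	awk -v cone="$1" '
	{
		n = split($0, part, "/")
		prefix = ""
		out = $0
		# Walk the leading directories of this path.
		for (i = 1; i < n; i++) {
			prefix = (i == 1) ? part[i] : (prefix "/" part[i])
			if (prefix == cone) {
				# Inside the cone: keep the file entry.
				break
			}
			if (index(cone "/", prefix "/") == 1) {
				# A parent of the cone: keep descending.
				continue
			}
			# First directory outside the cone: collapse to it.
			out = prefix "/"
			break
		}
		if (!(out in seen)) {
			seen[out] = 1
			print out
		}
	}'
}

printf "%s\n" a f1/a f1/f1/a f2/a f2/f1/a f2/f4/f1/a f3/a |
collapse_paths f2/f4/f1
```

With the cone `f2/f4/f1`, the seven input paths shrink to six entries:
`a`, `f1/`, `f2/a`, `f2/f1/`, `f2/f4/f1/a` and `f3/`. The saving grows
with the number of files under each collapsed directory.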

Currently, the index format is only updated in the presence of the
required 'sdir' index extension instead of increasing a file format
version number. This is temporary, and index v5 is part of the plan for
future work in this area.

The design document details many of the reasons for embarking on this
work, and also the plan for completing it safely.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/index-format.txt |   7 +
 Documentation/technical/sparse-index.txt | 175 +++++++++++++++++++++++
 2 files changed, 182 insertions(+)
 create mode 100644 Documentation/technical/sparse-index.txt

diff --git a/Documentation/technical/index-format.txt b/Documentation/technical/index-format.txt
index d363a71c37ec..3b74c05647db 100644
--- a/Documentation/technical/index-format.txt
+++ b/Documentation/technical/index-format.txt
@@ -44,6 +44,13 @@ Git index format
   localization, no special casing of directory separator '/'). Entries
   with the same name are sorted by their stage field.
 
+  An index entry typically represents a file. However, if sparse-checkout
+  is enabled in cone mode (`core.sparseCheckoutCone` is enabled) and the
+  `extensions.sparseIndex` extension is enabled, then the index may
+  contain entries for directories outside of the sparse-checkout definition.
+  These entries have mode `040000`, include the `SKIP_WORKTREE` bit, and
+  the path ends in a directory separator.
+
   32-bit ctime seconds, the last time a file's metadata changed
     this is stat(2) data
 
diff --git a/Documentation/technical/sparse-index.txt b/Documentation/technical/sparse-index.txt
new file mode 100644
index 000000000000..8d3d80804604
--- /dev/null
+++ b/Documentation/technical/sparse-index.txt
@@ -0,0 +1,175 @@
+Git Sparse-Index Design Document
+================================
+
+The sparse-checkout feature allows users to focus a working directory on
+a subset of the files at HEAD. The cone mode patterns, enabled by
+`core.sparseCheckoutCone`, allow for very fast pattern matching to
+discover which files at HEAD belong in the sparse-checkout cone.
+
+Three important scale dimensions for a Git working directory are:
+
+* `HEAD`: How many files are present at `HEAD`?
+
+* Populated: How many files are within the sparse-checkout cone?
+
+* Modified: How many files has the user modified in the working directory?
+
+We will use big-O notation -- O(X) -- to denote how expensive certain
+operations are in terms of these dimensions.
+
+These dimensions are ordered by their magnitude: users (typically) modify
+fewer files than are populated, and we can only populate files at `HEAD`.
+
+Problems occur if there is an extreme imbalance in these dimensions. For
+example, if `HEAD` contains millions of paths but the populated set has
+only tens of thousands, then commands like `git status` and `git add` can
+be dominated by operations that require O(`HEAD`) operations instead of
+O(Populated). Primarily, the cost is in parsing and rewriting the index,
+which is filled primarily with files at `HEAD` that are marked with the
+`SKIP_WORKTREE` bit.
+
+The sparse-index intends to take these commands that read and modify the
+index from O(`HEAD`) to O(Populated). To do this, we need to modify the
+index format in a significant way: add "sparse directory" entries.
+
+With cone mode patterns, it is possible to detect when an entire
+directory will have its contents outside of the sparse-checkout definition.
+Instead of listing all of the files it contains as individual entries, a
+sparse-index contains an entry with the directory name, referencing the
+object ID of the tree at `HEAD` and marked with the `SKIP_WORKTREE` bit.
+If we need to discover the details for paths within that directory, we
+can parse trees to find that list.
+
+At time of writing, sparse-directory entries violate expectations about the
+index format and its in-memory data structure. There are many consumers in
+the codebase that expect to iterate through all of the index entries and
+see only files. In fact, these loops expect to see a reference to every
+staged file. One way to handle this is to parse trees to replace a
+sparse-directory entry with all of the files within that tree as the index
+is loaded. However, parsing trees is slower than parsing the index format,
+so that is a slower operation than if we left the index alone. The plan is
+to make all of these integrations "sparse aware" so this expansion through
+tree parsing is unnecessary and they use fewer resources than when using a
+full index.
+
+The implementation plan below follows four phases to slowly integrate with
+the sparse-index. The intention is to incrementally update Git commands to
+interact safely with the sparse-index without significant slowdowns. This
+may not always be possible, but the hope is that the primary commands that
+users need in their daily work are dramatically improved.
+
+Phase I: Format and initial speedups
+------------------------------------
+
+During this phase, Git learns to enable the sparse-index and safely parse
+one. Protections are put in place so that every consumer of the in-memory
+data structure can operate with its current assumption of every file at
+`HEAD`.
+
+At first, every index parse will call a helper method,
+`ensure_full_index()`, which scans the index for sparse-directory entries
+(pointing to trees) and replaces them with the full list of paths (with
+blob contents) by parsing tree objects. This will be slower in all cases.
+The only noticeable change in behavior will be that the serialized index
+file contains sparse-directory entries.
+
+To start, we use a new required index extension, `sdir`, to allow
+inserting sparse-directory entries into indexes with file format
+versions 2, 3, and 4. This prevents Git versions that do not understand
+the sparse-index from operating on one, while allowing tools that do not
+understand the sparse-index to operate on repositories as long as they do
+not interact with the index. A new format, index v5, will be introduced
+that includes sparse-directory entries by default. It might also
+introduce other features that have been considered for improving the
+index, as well.
+
+Next, consumers of the index will be guarded against operating on a
+sparse-index by inserting calls to `ensure_full_index()` or
+`expand_index_to_path()`. After these guards are in place, we can begin
+leaving sparse-directory entries in the in-memory index structure.
+
+Even after inserting these guards, we will keep expanding sparse-indexes
+for most Git commands using the `command_requires_full_index` repository
+setting. This setting will be on by default and disabled one builtin at a
+time until we have sufficient confidence that all of the index operations
+are properly guarded.
+
+To complete this phase, the commands `git status` and `git add` will be
+integrated with the sparse-index so that they operate with O(Populated)
+performance. They will be carefully tested for operations within and
+outside the sparse-checkout definition.
+
+Phase II: Careful integrations
+------------------------------
+
+This phase focuses on ensuring that all index extensions and APIs work
+well with a sparse-index. This requires significant increases to our test
+coverage, especially for operations that interact with the working
+directory outside of the sparse-checkout definition. Some of these
+behaviors may not be the desirable ones, such as some tests already
+marked for failure in `t1092-sparse-checkout-compatibility.sh`.
+
+The index extensions that may require special integrations are:
+
+* FS Monitor
+* Untracked cache
+
+While integrating with these features, we should look for patterns that
+might lead to better APIs for interacting with the index. Coalescing
+common usage patterns into an API call can reduce the number of places
+where sparse-directories need to be handled carefully.
+
+Phase III: Important command speedups
+-------------------------------------
+
+At this point, the patterns for testing and implementing sparse-directory
+logic should be relatively stable. This phase focuses on updating some of
+the most common builtins that use the index to operate as O(Populated).
+Here is a potential list of commands that could be valuable to integrate
+at this point:
+
+* `git commit`
+* `git checkout`
+* `git merge`
+* `git rebase`
+
+Hopefully, commands such as `git merge` and `git rebase` can benefit
+instead from merge algorithms that do not use the index as a data
+structure, such as the merge-ORT strategy. As these topics mature, we
+may enable the ORT strategy by default for repositories using the
+sparse-index feature.
+
+Along with `git status` and `git add`, these commands cover the majority
+of users' interactions with the working directory. In addition, we can
+integrate with these commands:
+
+* `git grep`
+* `git rm`
+
+These have been proposed as some whose behavior could change when in a
+repo with a sparse-checkout definition. It would be good to include this
+behavior automatically when using a sparse-index. Some clarity is needed
+to make the behavior switch clear to the user.
+
+This phase is the first where parallel work might be possible without too
+many conflicts between topics.
+
+Phase IV: The long tail
+-----------------------
+
+This last phase is less a "phase" and more "the new normal" after all of
+the previous work.
+
+To start, the `command_requires_full_index` option could be removed in
+favor of expanding only when hitting an API guard.
+
+There are many Git commands that could use special attention to operate as
+O(Populated), while some might be so rare that it is acceptable to leave
+them with additional overhead when a sparse-index is present.
+
+Here are some commands that might be useful to update:
+
+* `git sparse-checkout set`
+* `git am`
+* `git clean`
+* `git stash`
-- 
gitgitgadget



* [PATCH v5 02/21] t/perf: add performance test for sparse operations
  2021-03-30 13:10       ` [PATCH v5 00/21] " Derrick Stolee via GitGitGadget
  2021-03-30 13:10         ` [PATCH v5 01/21] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
@ 2021-03-30 13:10         ` Derrick Stolee via GitGitGadget
  2021-03-30 13:10         ` [PATCH v5 03/21] t1092: clean up script quoting Derrick Stolee via GitGitGadget
                           ` (20 subsequent siblings)
  22 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-30 13:10 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Create a test script that takes the default performance test (the Git
codebase) and multiplies it by 256 using four layers of duplicated
trees of width four. This results in nearly one million blob entries in
the index. Then, we can clone this repository with sparse-checkout
patterns that demonstrate four copies of the initial repository. Each
clone will use a different index format or mode so performance can be
tested across the different options.
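
To see where "nearly one million" comes from, here is a small sketch of
the arithmetic. Each level builds a tree holding one blob plus four copies
of the previous tree, so the blob count follows count = 1 + 4 * count.
`count_blobs` is a hypothetical helper, and the starting count of 3900
files is only an assumed round figure for the Git codebase, not a measured
one.

```shell
#!/bin/sh
# Hypothetical helper for illustration only.
# usage: count_blobs <blobs-at-level-0> <levels>
count_blobs () {
	count=$1
	i=0
	while [ "$i" -lt "$2" ]
	do
		# One blob ("a") plus four copies of the previous tree.
		count=$((1 + 4 * count))
		i=$((i + 1))
	done
	echo "$count"
}

# 3900 is an assumed file count for the base repository.
count_blobs 3900 4
```

Under that assumption the result is 998485, that is 256 times the base
count plus 85 extra copies of the blob added at each level.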

Note that the initial repo is stripped of submodules before doing the
copies. This preserves the expected data shape of the sparse index,
because directories containing submodules are not collapsed to a sparse
directory entry.

Run a few Git commands on these clones, especially those that use the
index (status, add, commit).

Here are the results on my Linux machine:

Test
--------------------------------------------------------------
2000.2: git status (full-index-v3)             0.37(0.30+0.09)
2000.3: git status (full-index-v4)             0.39(0.32+0.10)
2000.4: git add -A (full-index-v3)             1.42(1.06+0.20)
2000.5: git add -A (full-index-v4)             1.26(0.98+0.16)
2000.6: git add . (full-index-v3)              1.40(1.04+0.18)
2000.7: git add . (full-index-v4)              1.26(0.98+0.17)
2000.8: git commit -a -m A (full-index-v3)     1.42(1.11+0.16)
2000.9: git commit -a -m A (full-index-v4)     1.33(1.08+0.16)

It is perhaps noteworthy that there is an improvement when using index
version 4. This is because the v3 index uses 108 MiB while the v4
index uses 80 MiB. Since the repeated portions of the directories are
very short (f3/f1/f2, for example) this ratio is less pronounced than in
similarly-sized real repositories.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/perf/p2000-sparse-operations.sh | 84 +++++++++++++++++++++++++++++++
 1 file changed, 84 insertions(+)
 create mode 100755 t/perf/p2000-sparse-operations.sh

diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
new file mode 100755
index 000000000000..dddd527b6330
--- /dev/null
+++ b/t/perf/p2000-sparse-operations.sh
@@ -0,0 +1,84 @@
+#!/bin/sh
+
+test_description="test performance of Git operations using the index"
+
+. ./perf-lib.sh
+
+test_perf_default_repo
+
+SPARSE_CONE=f2/f4/f1
+
+test_expect_success 'setup repo and indexes' '
+	git reset --hard HEAD &&
+
+	# Remove submodules from the example repo, because our
+	# duplication of the entire repo creates an unlikely data shape.
+	if git config --file .gitmodules --get-regexp "submodule.*.path" >modules
+	then
+		git rm $(awk "{print \$2}" modules) &&
+		git commit -m "remove submodules" || return 1
+	fi &&
+
+	echo bogus >a &&
+	cp a b &&
+	git add a b &&
+	git commit -m "level 0" &&
+	BLOB=$(git rev-parse HEAD:a) &&
+	OLD_COMMIT=$(git rev-parse HEAD) &&
+	OLD_TREE=$(git rev-parse HEAD^{tree}) &&
+
+	for i in $(test_seq 1 4)
+	do
+		cat >in <<-EOF &&
+			100755 blob $BLOB	a
+			040000 tree $OLD_TREE	f1
+			040000 tree $OLD_TREE	f2
+			040000 tree $OLD_TREE	f3
+			040000 tree $OLD_TREE	f4
+		EOF
+		NEW_TREE=$(git mktree <in) &&
+		NEW_COMMIT=$(git commit-tree $NEW_TREE -p $OLD_COMMIT -m "level $i") &&
+		OLD_TREE=$NEW_TREE &&
+		OLD_COMMIT=$NEW_COMMIT || return 1
+	done &&
+
+	git sparse-checkout init --cone &&
+	git branch -f wide $OLD_COMMIT &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . full-index-v3 &&
+	(
+		cd full-index-v3 &&
+		git sparse-checkout init --cone &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 3 &&
+		git update-index --index-version=3
+	) &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . full-index-v4 &&
+	(
+		cd full-index-v4 &&
+		git sparse-checkout init --cone &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 4 &&
+		git update-index --index-version=4
+	)
+'
+
+test_perf_on_all () {
+	command="$@"
+	for repo in full-index-v3 full-index-v4
+	do
+		test_perf "$command ($repo)" "
+			(
+				cd $repo &&
+				echo >>$SPARSE_CONE/a &&
+				$command
+			)
+		"
+	done
+}
+
+test_perf_on_all git status
+test_perf_on_all git add -A
+test_perf_on_all git add .
+test_perf_on_all git commit -a -m A
+
+test_done
-- 
gitgitgadget



* [PATCH v5 03/21] t1092: clean up script quoting
  2021-03-30 13:10       ` [PATCH v5 00/21] " Derrick Stolee via GitGitGadget
  2021-03-30 13:10         ` [PATCH v5 01/21] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
  2021-03-30 13:10         ` [PATCH v5 02/21] t/perf: add performance test for sparse operations Derrick Stolee via GitGitGadget
@ 2021-03-30 13:10         ` Derrick Stolee via GitGitGadget
  2021-03-30 13:10         ` [PATCH v5 04/21] sparse-index: add guard to ensure full index Derrick Stolee via GitGitGadget
                           ` (19 subsequent siblings)
  22 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-30 13:10 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

This test was introduced in 19a0acc83e4 (t1092: test interesting
sparse-checkout scenarios, 2021-01-23), but it contains issues with quoting
that were not noticed until starting this follow-up series. The old
mechanism would drop quoting such as in

   test_all_match git commit -m "touch README.md"

The above happened to work because README.md is a file in the
repository, so 'git commit -m touch README.md' would succeed by
accident.

Other cases included quoting for no good reason, so clean that up now.
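
The difference can be reproduced outside of the test suite. This is a
minimal sketch assuming the default IFS; `count_args`, `unquoted` and
`quoted` are hypothetical names used only for the demonstration. An
unquoted $* re-splits every argument on whitespace, while "$@" passes
each argument through unchanged.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.
count_args () {
	echo $#
}

# Mimics the old wrappers: arguments are re-split on whitespace.
unquoted () {
	count_args $*
}

# Mimics the fixed wrappers: each argument is preserved as given.
quoted () {
	count_args "$@"
}

unquoted git commit -m "touch README.md"
quoted git commit -m "touch README.md"
```

The first call reports 5 arguments, because the commit message was split
into "touch" and "README.md"; the second reports 4.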

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t1092-sparse-checkout-compatibility.sh | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 8cd3e5a8d227..3725d3997e70 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -96,20 +96,20 @@ init_repos () {
 run_on_sparse () {
 	(
 		cd sparse-checkout &&
-		$* >../sparse-checkout-out 2>../sparse-checkout-err
+		"$@" >../sparse-checkout-out 2>../sparse-checkout-err
 	)
 }
 
 run_on_all () {
 	(
 		cd full-checkout &&
-		$* >../full-checkout-out 2>../full-checkout-err
+		"$@" >../full-checkout-out 2>../full-checkout-err
 	) &&
-	run_on_sparse $*
+	run_on_sparse "$@"
 }
 
 test_all_match () {
-	run_on_all $* &&
+	run_on_all "$@" &&
 	test_cmp full-checkout-out sparse-checkout-out &&
 	test_cmp full-checkout-err sparse-checkout-err
 }
@@ -119,7 +119,7 @@ test_expect_success 'status with options' '
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git status --porcelain=v2 -z -u &&
 	test_all_match git status --porcelain=v2 -uno &&
-	run_on_all "touch README.md" &&
+	run_on_all touch README.md &&
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git status --porcelain=v2 -z -u &&
 	test_all_match git status --porcelain=v2 -uno &&
@@ -135,7 +135,7 @@ test_expect_success 'add, commit, checkout' '
 	write_script edit-contents <<-\EOF &&
 	echo text >>$1
 	EOF
-	run_on_all "../edit-contents README.md" &&
+	run_on_all ../edit-contents README.md &&
 
 	test_all_match git add README.md &&
 	test_all_match git status --porcelain=v2 &&
@@ -144,7 +144,7 @@ test_expect_success 'add, commit, checkout' '
 	test_all_match git checkout HEAD~1 &&
 	test_all_match git checkout - &&
 
-	run_on_all "../edit-contents README.md" &&
+	run_on_all ../edit-contents README.md &&
 
 	test_all_match git add -A &&
 	test_all_match git status --porcelain=v2 &&
@@ -153,7 +153,7 @@ test_expect_success 'add, commit, checkout' '
 	test_all_match git checkout HEAD~1 &&
 	test_all_match git checkout - &&
 
-	run_on_all "../edit-contents deep/newfile" &&
+	run_on_all ../edit-contents deep/newfile &&
 
 	test_all_match git status --porcelain=v2 -uno &&
 	test_all_match git status --porcelain=v2 &&
@@ -186,7 +186,7 @@ test_expect_success 'diff --staged' '
 	write_script edit-contents <<-\EOF &&
 	echo text >>README.md
 	EOF
-	run_on_all "../edit-contents" &&
+	run_on_all ../edit-contents &&
 
 	test_all_match git diff &&
 	test_all_match git diff --staged &&
@@ -280,7 +280,7 @@ test_expect_success 'clean' '
 	echo bogus >>.gitignore &&
 	run_on_all cp ../.gitignore . &&
 	test_all_match git add .gitignore &&
-	test_all_match git commit -m ignore-bogus-files &&
+	test_all_match git commit -m "ignore bogus files" &&
 
 	run_on_sparse mkdir folder1 &&
 	run_on_all touch folder1/bogus &&
-- 
gitgitgadget



* [PATCH v5 04/21] sparse-index: add guard to ensure full index
  2021-03-30 13:10       ` [PATCH v5 00/21] " Derrick Stolee via GitGitGadget
                           ` (2 preceding siblings ...)
  2021-03-30 13:10         ` [PATCH v5 03/21] t1092: clean up script quoting Derrick Stolee via GitGitGadget
@ 2021-03-30 13:10         ` Derrick Stolee via GitGitGadget
  2021-03-30 13:10         ` [PATCH v5 05/21] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
                           ` (18 subsequent siblings)
  22 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-30 13:10 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Upcoming changes will introduce modifications to the index format that
allow sparse directories. It will be useful to have a mechanism for
converting those sparse index files into full indexes by walking the
tree at those sparse directories. Name this method ensure_full_index()
as it will guarantee that the index is fully expanded.

This method is not implemented yet, and instead we focus on the
scaffolding to declare it and call it at the appropriate time.

Add a 'command_requires_full_index' member to struct repo_settings. This
will be an indicator that we need the index in full mode to do certain
index operations. This starts as being true for every command, then we
will set it to false as some commands integrate with sparse indexes.

If 'command_requires_full_index' is true, then we will immediately
expand a sparse index to a full one upon reading from disk. This
suffices for now, but we will want to add more callers to
ensure_full_index() later.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Makefile        |  1 +
 repo-settings.c |  8 ++++++++
 repository.c    | 11 ++++++++++-
 repository.h    |  2 ++
 sparse-index.c  |  8 ++++++++
 sparse-index.h  |  7 +++++++
 6 files changed, 36 insertions(+), 1 deletion(-)
 create mode 100644 sparse-index.c
 create mode 100644 sparse-index.h

diff --git a/Makefile b/Makefile
index dfb0f1000fa3..89b1d5374107 100644
--- a/Makefile
+++ b/Makefile
@@ -985,6 +985,7 @@ LIB_OBJS += setup.o
 LIB_OBJS += shallow.o
 LIB_OBJS += sideband.o
 LIB_OBJS += sigchain.o
+LIB_OBJS += sparse-index.o
 LIB_OBJS += split-index.o
 LIB_OBJS += stable-qsort.o
 LIB_OBJS += strbuf.o
diff --git a/repo-settings.c b/repo-settings.c
index f7fff0f5ab83..d63569e4041e 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -77,4 +77,12 @@ void prepare_repo_settings(struct repository *r)
 		UPDATE_DEFAULT_BOOL(r->settings.core_untracked_cache, UNTRACKED_CACHE_KEEP);
 
 	UPDATE_DEFAULT_BOOL(r->settings.fetch_negotiation_algorithm, FETCH_NEGOTIATION_DEFAULT);
+
+	/*
+	 * This setting guards all index reads to require a full index
+	 * over a sparse index. After suitable guards are placed in the
+	 * codebase around uses of the index, this setting will be
+	 * removed.
+	 */
+	r->settings.command_requires_full_index = 1;
 }
diff --git a/repository.c b/repository.c
index c98298acd017..a8acae002f71 100644
--- a/repository.c
+++ b/repository.c
@@ -10,6 +10,7 @@
 #include "object.h"
 #include "lockfile.h"
 #include "submodule-config.h"
+#include "sparse-index.h"
 
 /* The main repository */
 static struct repository the_repo;
@@ -261,6 +262,8 @@ void repo_clear(struct repository *repo)
 
 int repo_read_index(struct repository *repo)
 {
+	int res;
+
 	if (!repo->index)
 		repo->index = xcalloc(1, sizeof(*repo->index));
 
@@ -270,7 +273,13 @@ int repo_read_index(struct repository *repo)
 	else if (repo->index->repo != repo)
 		BUG("repo's index should point back at itself");
 
-	return read_index_from(repo->index, repo->index_file, repo->gitdir);
+	res = read_index_from(repo->index, repo->index_file, repo->gitdir);
+
+	prepare_repo_settings(repo);
+	if (repo->settings.command_requires_full_index)
+		ensure_full_index(repo->index);
+
+	return res;
 }
 
 int repo_hold_locked_index(struct repository *repo,
diff --git a/repository.h b/repository.h
index b385ca3c94b6..e06a23015697 100644
--- a/repository.h
+++ b/repository.h
@@ -41,6 +41,8 @@ struct repo_settings {
 	enum fetch_negotiation_setting fetch_negotiation_algorithm;
 
 	int core_multi_pack_index;
+
+	unsigned command_requires_full_index:1;
 };
 
 struct repository {
diff --git a/sparse-index.c b/sparse-index.c
new file mode 100644
index 000000000000..82183ead563b
--- /dev/null
+++ b/sparse-index.c
@@ -0,0 +1,8 @@
+#include "cache.h"
+#include "repository.h"
+#include "sparse-index.h"
+
+void ensure_full_index(struct index_state *istate)
+{
+	/* intentionally left blank */
+}
diff --git a/sparse-index.h b/sparse-index.h
new file mode 100644
index 000000000000..09a20d036c46
--- /dev/null
+++ b/sparse-index.h
@@ -0,0 +1,7 @@
+#ifndef SPARSE_INDEX_H__
+#define SPARSE_INDEX_H__
+
+struct index_state;
+void ensure_full_index(struct index_state *istate);
+
+#endif
-- 
gitgitgadget



* [PATCH v5 05/21] sparse-index: implement ensure_full_index()
  2021-03-30 13:10       ` [PATCH v5 00/21] " Derrick Stolee via GitGitGadget
                           ` (3 preceding siblings ...)
  2021-03-30 13:10         ` [PATCH v5 04/21] sparse-index: add guard to ensure full index Derrick Stolee via GitGitGadget
@ 2021-03-30 13:10         ` Derrick Stolee via GitGitGadget
  2021-03-30 13:10         ` [PATCH v5 06/21] t1092: compare sparse-checkout to sparse-index Derrick Stolee via GitGitGadget
                           ` (17 subsequent siblings)
  22 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-30 13:10 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

We will mark an in-memory index_state as having sparse directory entries
with the sparse_index bit. These currently cannot exist, but we will add
a mechanism for collapsing a full index to a sparse one in a later
change. That will happen at write time, so we must first allow parsing
the format before writing it.

Commands or methods that require a full index in order to operate can
call ensure_full_index() to expand that index in-memory. This requires
parsing trees using that index's repository.

Sparse directory entries have a specific 'ce_mode' value. The macro
S_ISSPARSEDIR(ce->ce_mode) can check if a cache_entry 'ce' has this type.
This ce_mode is not possible with the existing index formats, so we do
not also verify every property of a sparse-directory entry. Those
properties are:

 1. ce->ce_mode == 0040000
 2. ce->ce_flags & CE_SKIP_WORKTREE is true
 3. ce->name[ce->ce_namelen - 1] == '/' (ends in dir separator)
 4. ce->oid references a tree object.

ensure_full_index() enforces these only partially. Any deviation will
cause a warning at minimum or a failure in the worst case.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache.h        | 13 ++++++-
 read-cache.c   |  9 +++++
 sparse-index.c | 98 +++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 118 insertions(+), 2 deletions(-)

diff --git a/cache.h b/cache.h
index bb317abc91fb..136dd496c95d 100644
--- a/cache.h
+++ b/cache.h
@@ -204,6 +204,8 @@ struct cache_entry {
 #error "CE_EXTENDED_FLAGS out of range"
 #endif
 
+#define S_ISSPARSEDIR(m) ((m) == S_IFDIR)
+
 /* Forward structure decls */
 struct pathspec;
 struct child_process;
@@ -319,7 +321,14 @@ struct index_state {
 		 drop_cache_tree : 1,
 		 updated_workdir : 1,
 		 updated_skipworktree : 1,
-		 fsmonitor_has_run_once : 1;
+		 fsmonitor_has_run_once : 1,
+
+		 /*
+		  * sparse_index == 1 when sparse-directory
+		  * entries exist. Requires sparse-checkout
+		  * in cone mode.
+		  */
+		 sparse_index : 1;
 	struct hashmap name_hash;
 	struct hashmap dir_hash;
 	struct object_id oid;
@@ -722,6 +731,8 @@ int read_index_from(struct index_state *, const char *path,
 		    const char *gitdir);
 int is_index_unborn(struct index_state *);
 
+void ensure_full_index(struct index_state *istate);
+
 /* For use with `write_locked_index()`. */
 #define COMMIT_LOCK		(1 << 0)
 #define SKIP_IF_UNCHANGED	(1 << 1)
diff --git a/read-cache.c b/read-cache.c
index 1e9a50c6c734..dd3980c12b53 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -101,6 +101,9 @@ static const char *alternate_index_output;
 
 static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
 {
+	if (S_ISSPARSEDIR(ce->ce_mode))
+		istate->sparse_index = 1;
+
 	istate->cache[nr] = ce;
 	add_name_hash(istate, ce);
 }
@@ -2273,6 +2276,12 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist)
 	trace2_data_intmax("index", the_repository, "read/cache_nr",
 			   istate->cache_nr);
 
+	if (!istate->repo)
+		istate->repo = the_repository;
+	prepare_repo_settings(istate->repo);
+	if (istate->repo->settings.command_requires_full_index)
+		ensure_full_index(istate);
+
 	return istate->cache_nr;
 
 unmap:
diff --git a/sparse-index.c b/sparse-index.c
index 82183ead563b..7095378a1b28 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -1,8 +1,104 @@
 #include "cache.h"
 #include "repository.h"
 #include "sparse-index.h"
+#include "tree.h"
+#include "pathspec.h"
+#include "trace2.h"
+
+static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
+{
+	ALLOC_GROW(istate->cache, nr + 1, istate->cache_alloc);
+
+	istate->cache[nr] = ce;
+	add_name_hash(istate, ce);
+}
+
+static int add_path_to_index(const struct object_id *oid,
+			     struct strbuf *base, const char *path,
+			     unsigned int mode, void *context)
+{
+	struct index_state *istate = (struct index_state *)context;
+	struct cache_entry *ce;
+	size_t len = base->len;
+
+	if (S_ISDIR(mode))
+		return READ_TREE_RECURSIVE;
+
+	strbuf_addstr(base, path);
+
+	ce = make_cache_entry(istate, mode, oid, base->buf, 0, 0);
+	ce->ce_flags |= CE_SKIP_WORKTREE;
+	set_index_entry(istate, istate->cache_nr++, ce);
+
+	strbuf_setlen(base, len);
+	return 0;
+}
 
 void ensure_full_index(struct index_state *istate)
 {
-	/* intentionally left blank */
+	int i;
+	struct index_state *full;
+	struct strbuf base = STRBUF_INIT;
+
+	if (!istate || !istate->sparse_index)
+		return;
+
+	if (!istate->repo)
+		istate->repo = the_repository;
+
+	trace2_region_enter("index", "ensure_full_index", istate->repo);
+
+	/* initialize basics of new index */
+	full = xcalloc(1, sizeof(struct index_state));
+	memcpy(full, istate, sizeof(struct index_state));
+
+	/* then change the necessary things */
+	full->sparse_index = 0;
+	full->cache_alloc = (3 * istate->cache_alloc) / 2;
+	full->cache_nr = 0;
+	ALLOC_ARRAY(full->cache, full->cache_alloc);
+
+	for (i = 0; i < istate->cache_nr; i++) {
+		struct cache_entry *ce = istate->cache[i];
+		struct tree *tree;
+		struct pathspec ps;
+
+		if (!S_ISSPARSEDIR(ce->ce_mode)) {
+			set_index_entry(full, full->cache_nr++, ce);
+			continue;
+		}
+		if (!(ce->ce_flags & CE_SKIP_WORKTREE))
+			warning(_("index entry is a directory, but not sparse (%08x)"),
+				ce->ce_flags);
+
+		/* recursively walk into ce->name */
+		tree = lookup_tree(istate->repo, &ce->oid);
+
+		memset(&ps, 0, sizeof(ps));
+		ps.recursive = 1;
+		ps.has_wildcard = 1;
+		ps.max_depth = -1;
+
+		strbuf_setlen(&base, 0);
+		strbuf_add(&base, ce->name, strlen(ce->name));
+
+		read_tree_at(istate->repo, tree, &base, &ps,
+			     add_path_to_index, full);
+
+		/* free directory entries. full entries are re-used */
+		discard_cache_entry(ce);
+	}
+
+	/* Copy back into original index. */
+	memcpy(&istate->name_hash, &full->name_hash, sizeof(full->name_hash));
+	istate->sparse_index = 0;
+	free(istate->cache);
+	istate->cache = full->cache;
+	istate->cache_nr = full->cache_nr;
+	istate->cache_alloc = full->cache_alloc;
+
+	strbuf_release(&base);
+	free(full);
+
+	trace2_region_leave("index", "ensure_full_index", istate->repo);
 }
-- 
gitgitgadget



* [PATCH v5 06/21] t1092: compare sparse-checkout to sparse-index
  2021-03-30 13:10       ` [PATCH v5 00/21] " Derrick Stolee via GitGitGadget
                           ` (4 preceding siblings ...)
  2021-03-30 13:10         ` [PATCH v5 05/21] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
@ 2021-03-30 13:10         ` Derrick Stolee via GitGitGadget
  2021-03-30 13:10         ` [PATCH v5 07/21] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
                           ` (16 subsequent siblings)
  22 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-30 13:10 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add a new 'sparse-index' repo alongside the 'full-checkout' and
'sparse-checkout' repos in t1092-sparse-checkout-compatibility.sh. Also
extend run_on_sparse to run in the new repo and add a
test_sparse_match helper. These helpers will be used when the sparse
index is implemented.

Add the GIT_TEST_SPARSE_INDEX environment variable to enable the
sparse-index by default. This can be enabled across all tests, but that
will only affect cases where the sparse-checkout feature is enabled.
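
The comparison pattern behind these helpers can be sketched outside the
test suite as follows. This is an illustration only: the directory
names, the DEMO_MODE variable and the run_in and outputs_match helpers
are hypothetical, and plain cmp stands in for test_cmp.

```shell
#!/bin/sh
# Hypothetical stand-alone sketch; t1092 has its own helpers and test_cmp.

# Run a command inside directory $1 with DEMO_MODE=$2 in its environment,
# capturing stdout and stderr in files next to that directory.
run_in () {
	dir=$1 mode=$2
	shift 2
	(
		cd "$dir" &&
		DEMO_MODE=$mode "$@" >"../$dir-out" 2>"../$dir-err"
	)
}

# Run the same command in both directories; succeed only if the
# captured stdout and stderr are identical.
outputs_match () {
	run_in dir-a 0 "$@" &&
	run_in dir-b 1 "$@" &&
	cmp -s dir-a-out dir-b-out &&
	cmp -s dir-a-err dir-b-err
}

work=$(mktemp -d) &&
cd "$work" &&
mkdir dir-a dir-b &&
echo hello >dir-a/file &&
echo hello >dir-b/file

# Same file contents in both directories: the outputs agree.
outputs_match cat file &&
echo "cat: outputs match"

# The per-command assignment is visible to the command, so this differs.
outputs_match sh -c 'echo "$DEMO_MODE"' ||
echo "DEMO_MODE: outputs differ"
```

Setting the variable as a prefix on the command, rather than exporting
it once, keeps each run's setting local to that one invocation.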

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/README                                 |  3 +++
 t/t1092-sparse-checkout-compatibility.sh | 24 ++++++++++++++++++++----
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/t/README b/t/README
index 593d4a4e270c..b98bc563aab5 100644
--- a/t/README
+++ b/t/README
@@ -439,6 +439,9 @@ and "sha256".
 GIT_TEST_WRITE_REV_INDEX=<boolean>, when true enables the
 'pack.writeReverseIndex' setting.
 
+GIT_TEST_SPARSE_INDEX=<boolean>, when true enables index writes to use the
+sparse-index format by default.
+
 Naming Tests
 ------------
 
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 3725d3997e70..de5d8461c993 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -7,6 +7,7 @@ test_description='compare full workdir to sparse workdir'
 test_expect_success 'setup' '
 	git init initial-repo &&
 	(
+		GIT_TEST_SPARSE_INDEX=0 &&
 		cd initial-repo &&
 		echo a >a &&
 		echo "after deep" >e &&
@@ -87,23 +88,32 @@ init_repos () {
 
 	cp -r initial-repo sparse-checkout &&
 	git -C sparse-checkout reset --hard &&
-	git -C sparse-checkout sparse-checkout init --cone &&
+
+	cp -r initial-repo sparse-index &&
+	git -C sparse-index reset --hard &&
 
 	# initialize sparse-checkout definitions
-	git -C sparse-checkout sparse-checkout set deep
+	git -C sparse-checkout sparse-checkout init --cone &&
+	git -C sparse-checkout sparse-checkout set deep &&
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout init --cone &&
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep
 }
 
 run_on_sparse () {
 	(
 		cd sparse-checkout &&
-		"$@" >../sparse-checkout-out 2>../sparse-checkout-err
+		GIT_TEST_SPARSE_INDEX=0 "$@" >../sparse-checkout-out 2>../sparse-checkout-err
+	) &&
+	(
+		cd sparse-index &&
+		GIT_TEST_SPARSE_INDEX=1 "$@" >../sparse-index-out 2>../sparse-index-err
 	)
 }
 
 run_on_all () {
 	(
 		cd full-checkout &&
-		"$@" >../full-checkout-out 2>../full-checkout-err
+		GIT_TEST_SPARSE_INDEX=0 "$@" >../full-checkout-out 2>../full-checkout-err
 	) &&
 	run_on_sparse "$@"
 }
@@ -114,6 +124,12 @@ test_all_match () {
 	test_cmp full-checkout-err sparse-checkout-err
 }
 
+test_sparse_match () {
+	run_on_sparse "$@" &&
+	test_cmp sparse-checkout-out sparse-index-out &&
+	test_cmp sparse-checkout-err sparse-index-err
+}
+
 test_expect_success 'status with options' '
 	init_repos &&
 	test_all_match git status --porcelain=v2 &&
-- 
gitgitgadget



* [PATCH v5 07/21] test-read-cache: print cache entries with --table
  2021-03-30 13:10       ` [PATCH v5 00/21] " Derrick Stolee via GitGitGadget
                           ` (5 preceding siblings ...)
  2021-03-30 13:10         ` [PATCH v5 06/21] t1092: compare sparse-checkout to sparse-index Derrick Stolee via GitGitGadget
@ 2021-03-30 13:10         ` Derrick Stolee via GitGitGadget
  2021-03-30 13:10         ` [PATCH v5 08/21] test-tool: don't force full index Derrick Stolee via GitGitGadget
                           ` (15 subsequent siblings)
  22 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-30 13:10 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

This table is helpful for discovering data in the index to ensure it is
being written correctly, especially as we build and test the
sparse-index. The output format is similar to that of 'git ls-tree', but
the two should not be compared directly. The biggest reason is that
'git ls-tree' includes a tree entry for every subdirectory, even those
that would not appear as a sparse directory in a sparse-index. Further,
'git ls-tree' does not use a trailing directory separator for its tree
rows.

This does not print the stat() information for the blobs. That will be
added in a future change with another option. The tests that are added
in the next few changes care only about the object types and IDs.
However, this future need for full index information justifies the need
for this test helper over extending a user-facing feature, such as 'git
ls-files'.

To make the option parsing slightly more robust, wrap the string
comparisons in a loop adapted from test-dir-iterator.c.

Care must be taken with the final check for the 'cnt' variable. We
continue the expectation that the numerical value is the final argument.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/helper/test-read-cache.c | 55 +++++++++++++++++++++++++++++++-------
 1 file changed, 45 insertions(+), 10 deletions(-)

diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
index 244977a29bdf..6cfd8f2de71c 100644
--- a/t/helper/test-read-cache.c
+++ b/t/helper/test-read-cache.c
@@ -1,36 +1,71 @@
 #include "test-tool.h"
 #include "cache.h"
 #include "config.h"
+#include "blob.h"
+#include "commit.h"
+#include "tree.h"
+
+static void print_cache_entry(struct cache_entry *ce)
+{
+	const char *type;
+	printf("%06o ", ce->ce_mode & 0177777);
+
+	if (S_ISSPARSEDIR(ce->ce_mode))
+		type = tree_type;
+	else if (S_ISGITLINK(ce->ce_mode))
+		type = commit_type;
+	else
+		type = blob_type;
+
+	printf("%s %s\t%s\n",
+	       type,
+	       oid_to_hex(&ce->oid),
+	       ce->name);
+}
+
+static void print_cache(struct index_state *istate)
+{
+	int i;
+	for (i = 0; i < istate->cache_nr; i++)
+		print_cache_entry(istate->cache[i]);
+}
 
 int cmd__read_cache(int argc, const char **argv)
 {
+	struct repository *r = the_repository;
 	int i, cnt = 1;
 	const char *name = NULL;
+	int table = 0;
 
-	if (argc > 1 && skip_prefix(argv[1], "--print-and-refresh=", &name)) {
-		argc--;
-		argv++;
+	for (++argv, --argc; *argv && starts_with(*argv, "--"); ++argv, --argc) {
+		if (skip_prefix(*argv, "--print-and-refresh=", &name))
+			continue;
+		if (!strcmp(*argv, "--table"))
+			table = 1;
 	}
 
-	if (argc == 2)
-		cnt = strtol(argv[1], NULL, 0);
+	if (argc == 1)
+		cnt = strtol(argv[0], NULL, 0);
 	setup_git_directory();
 	git_config(git_default_config, NULL);
+
 	for (i = 0; i < cnt; i++) {
-		read_cache();
+		repo_read_index(r);
 		if (name) {
 			int pos;
 
-			refresh_index(&the_index, REFRESH_QUIET,
+			refresh_index(r->index, REFRESH_QUIET,
 				      NULL, NULL, NULL);
-			pos = index_name_pos(&the_index, name, strlen(name));
+			pos = index_name_pos(r->index, name, strlen(name));
 			if (pos < 0)
 				die("%s not in index", name);
 			printf("%s is%s up to date\n", name,
-			       ce_uptodate(the_index.cache[pos]) ? "" : " not");
+			       ce_uptodate(r->index->cache[pos]) ? "" : " not");
 			write_file(name, "%d\n", i);
 		}
-		discard_cache();
+		if (table)
+			print_cache(r->index);
+		discard_index(r->index);
 	}
 	return 0;
 }
-- 
gitgitgadget



* [PATCH v5 08/21] test-tool: don't force full index
  2021-03-30 13:10       ` [PATCH v5 00/21] " Derrick Stolee via GitGitGadget
                           ` (6 preceding siblings ...)
  2021-03-30 13:10         ` [PATCH v5 07/21] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
@ 2021-03-30 13:10         ` Derrick Stolee via GitGitGadget
  2021-03-30 13:10         ` [PATCH v5 09/21] unpack-trees: ensure " Derrick Stolee via GitGitGadget
                           ` (14 subsequent siblings)
  22 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-30 13:10 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

We will use 'test-tool read-cache --table' to check that a sparse
index is written as part of init_repos. Since we will no longer always
expand a sparse index into a full index, add an '--expand' parameter
that adds a call to ensure_full_index() so we can compare a sparse index
directly against a full index, or at least what the in-memory index
looks like when expanded in this way.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/helper/test-read-cache.c               | 13 ++++++++++++-
 t/t1092-sparse-checkout-compatibility.sh |  5 +++++
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
index 6cfd8f2de71c..b52c174acc7a 100644
--- a/t/helper/test-read-cache.c
+++ b/t/helper/test-read-cache.c
@@ -4,6 +4,7 @@
 #include "blob.h"
 #include "commit.h"
 #include "tree.h"
+#include "sparse-index.h"
 
 static void print_cache_entry(struct cache_entry *ce)
 {
@@ -35,13 +36,19 @@ int cmd__read_cache(int argc, const char **argv)
 	struct repository *r = the_repository;
 	int i, cnt = 1;
 	const char *name = NULL;
-	int table = 0;
+	int table = 0, expand = 0;
+
+	initialize_the_repository();
+	prepare_repo_settings(r);
+	r->settings.command_requires_full_index = 0;
 
 	for (++argv, --argc; *argv && starts_with(*argv, "--"); ++argv, --argc) {
 		if (skip_prefix(*argv, "--print-and-refresh=", &name))
 			continue;
 		if (!strcmp(*argv, "--table"))
 			table = 1;
+		else if (!strcmp(*argv, "--expand"))
+			expand = 1;
 	}
 
 	if (argc == 1)
@@ -51,6 +58,10 @@ int cmd__read_cache(int argc, const char **argv)
 
 	for (i = 0; i < cnt; i++) {
 		repo_read_index(r);
+
+		if (expand)
+			ensure_full_index(r->index);
+
 		if (name) {
 			int pos;
 
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index de5d8461c993..a1aea141c62c 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -130,6 +130,11 @@ test_sparse_match () {
 	test_cmp sparse-checkout-err sparse-index-err
 }
 
+test_expect_success 'expanded in-memory index matches full index' '
+	init_repos &&
+	test_sparse_match test-tool read-cache --expand --table
+'
+
 test_expect_success 'status with options' '
 	init_repos &&
 	test_all_match git status --porcelain=v2 &&
-- 
gitgitgadget



* [PATCH v5 09/21] unpack-trees: ensure full index
  2021-03-30 13:10       ` [PATCH v5 00/21] " Derrick Stolee via GitGitGadget
                           ` (7 preceding siblings ...)
  2021-03-30 13:10         ` [PATCH v5 08/21] test-tool: don't force full index Derrick Stolee via GitGitGadget
@ 2021-03-30 13:10         ` Derrick Stolee via GitGitGadget
  2021-03-30 13:10         ` [PATCH v5 10/21] sparse-checkout: hold pattern list in index Derrick Stolee via GitGitGadget
                           ` (13 subsequent siblings)
  22 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-30 13:10 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The next change will translate full indexes into sparse indexes at write
time. The existing logic provides a way for every sparse index to be
expanded to a full index at read time. However, there are cases where an
index is written and then continues to be used in-memory to perform
further updates.

unpack_trees() is frequently called after such a write. In particular,
commands like 'git reset' do this double-update of the index.

Ensure that we have a full index when entering unpack_trees(), but only
when command_requires_full_index is true. This is always true at the
moment, but we will later relax that after unpack_trees() is updated to
handle sparse directory entries.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 unpack-trees.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/unpack-trees.c b/unpack-trees.c
index f5f668f532d8..4dd99219073a 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -1567,6 +1567,7 @@ static int verify_absent(const struct cache_entry *,
  */
 int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options *o)
 {
+	struct repository *repo = the_repository;
 	int i, ret;
 	static struct cache_entry *dfc;
 	struct pattern_list pl;
@@ -1578,6 +1579,12 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
 	trace_performance_enter();
 	trace2_region_enter("unpack_trees", "unpack_trees", the_repository);
 
+	prepare_repo_settings(repo);
+	if (repo->settings.command_requires_full_index) {
+		ensure_full_index(o->src_index);
+		ensure_full_index(o->dst_index);
+	}
+
 	if (!core_apply_sparse_checkout || !o->update)
 		o->skip_sparse_checkout = 1;
 	if (!o->skip_sparse_checkout && !o->pl) {
-- 
gitgitgadget



* [PATCH v5 10/21] sparse-checkout: hold pattern list in index
  2021-03-30 13:10       ` [PATCH v5 00/21] " Derrick Stolee via GitGitGadget
                           ` (8 preceding siblings ...)
  2021-03-30 13:10         ` [PATCH v5 09/21] unpack-trees: ensure " Derrick Stolee via GitGitGadget
@ 2021-03-30 13:10         ` Derrick Stolee via GitGitGadget
  2021-03-30 13:10         ` [PATCH v5 11/21] sparse-index: add 'sdir' index extension Derrick Stolee via GitGitGadget
                           ` (12 subsequent siblings)
  22 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-30 13:10 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

As we modify the sparse-checkout definition, we perform index operations
on a pattern_list that only exists in-memory. This allows easy backing
out in case the index update fails.

However, if the index write itself cares about the sparse-checkout
pattern set, we need access to that in-memory copy. Place a pointer to
a 'struct pattern_list' in the index so we can access this on-demand.
This will be used in the next change which uses the sparse-checkout
definition to filter out directories that are outside the sparse cone.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/sparse-checkout.c | 17 ++++++++++-------
 cache.h                   |  2 ++
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
index 2306a9ad98e0..e00b82af727b 100644
--- a/builtin/sparse-checkout.c
+++ b/builtin/sparse-checkout.c
@@ -110,6 +110,8 @@ static int update_working_directory(struct pattern_list *pl)
 	if (is_index_unborn(r->index))
 		return UPDATE_SPARSITY_SUCCESS;
 
+	r->index->sparse_checkout_patterns = pl;
+
 	memset(&o, 0, sizeof(o));
 	o.verbose_update = isatty(2);
 	o.update = 1;
@@ -138,6 +140,7 @@ static int update_working_directory(struct pattern_list *pl)
 	else
 		rollback_lock_file(&lock_file);
 
+	r->index->sparse_checkout_patterns = NULL;
 	return result;
 }
 
@@ -517,19 +520,18 @@ static int modify_pattern_list(int argc, const char **argv, enum modify_type m)
 {
 	int result;
 	int changed_config = 0;
-	struct pattern_list pl;
-	memset(&pl, 0, sizeof(pl));
+	struct pattern_list *pl = xcalloc(1, sizeof(*pl));
 
 	switch (m) {
 	case ADD:
 		if (core_sparse_checkout_cone)
-			add_patterns_cone_mode(argc, argv, &pl);
+			add_patterns_cone_mode(argc, argv, pl);
 		else
-			add_patterns_literal(argc, argv, &pl);
+			add_patterns_literal(argc, argv, pl);
 		break;
 
 	case REPLACE:
-		add_patterns_from_input(&pl, argc, argv);
+		add_patterns_from_input(pl, argc, argv);
 		break;
 	}
 
@@ -539,12 +541,13 @@ static int modify_pattern_list(int argc, const char **argv, enum modify_type m)
 		changed_config = 1;
 	}
 
-	result = write_patterns_and_update(&pl);
+	result = write_patterns_and_update(pl);
 
 	if (result && changed_config)
 		set_config(MODE_NO_PATTERNS);
 
-	clear_pattern_list(&pl);
+	clear_pattern_list(pl);
+	free(pl);
 	return result;
 }
 
diff --git a/cache.h b/cache.h
index 136dd496c95d..8c4464420d0a 100644
--- a/cache.h
+++ b/cache.h
@@ -307,6 +307,7 @@ static inline unsigned int canon_mode(unsigned int mode)
 struct split_index;
 struct untracked_cache;
 struct progress;
+struct pattern_list;
 
 struct index_state {
 	struct cache_entry **cache;
@@ -338,6 +339,7 @@ struct index_state {
 	struct mem_pool *ce_mem_pool;
 	struct progress *progress;
 	struct repository *repo;
+	struct pattern_list *sparse_checkout_patterns;
 };
 
 /* Name hashing */
-- 
gitgitgadget



* [PATCH v5 11/21] sparse-index: add 'sdir' index extension
  2021-03-30 13:10       ` [PATCH v5 00/21] " Derrick Stolee via GitGitGadget
                           ` (9 preceding siblings ...)
  2021-03-30 13:10         ` [PATCH v5 10/21] sparse-checkout: hold pattern list in index Derrick Stolee via GitGitGadget
@ 2021-03-30 13:10         ` Derrick Stolee via GitGitGadget
  2021-03-30 13:10         ` [PATCH v5 12/21] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
                           ` (11 subsequent siblings)
  22 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-30 13:10 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The index format does not currently allow for sparse directory entries.
Such entries violate expectations held by older versions of Git and by
third-party tools, which would not understand them. We need an indicator
inside the index file to warn these tools not to interact with a sparse
index unless they are aware of sparse directory entries.

Add a new _required_ index extension, 'sdir', that indicates that the
index may contain sparse directory entries. This allows us to continue
to use the differences in index formats 2, 3, and 4 before we create a
new index version 5 in a later change.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/index-format.txt | 12 ++++++++++++
 read-cache.c                             |  9 +++++++++
 2 files changed, 21 insertions(+)

diff --git a/Documentation/technical/index-format.txt b/Documentation/technical/index-format.txt
index 3b74c05647db..65da0daaa563 100644
--- a/Documentation/technical/index-format.txt
+++ b/Documentation/technical/index-format.txt
@@ -392,3 +392,15 @@ The remaining data of each directory block is grouped by type:
 	in this block of entries.
 
     - 32-bit count of cache entries in this block
+
+== Sparse Directory Entries
+
+  When using sparse-checkout in cone mode, some entire directories within
+  the index can be summarized by pointing to a tree object instead of the
+  entire expanded list of paths within that tree. An index containing such
+  entries is a "sparse index". Index format versions 4 and less were not
+  implemented with such entries in mind. Thus, for these versions, an
+  index containing sparse directory entries will include this extension
+  with signature { 's', 'd', 'i', 'r' }. Like the split-index extension,
+  tools should avoid interacting with a sparse index unless they understand
+  this extension.
diff --git a/read-cache.c b/read-cache.c
index dd3980c12b53..b8f092d1b7eb 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -47,6 +47,7 @@
 #define CACHE_EXT_FSMONITOR 0x46534D4E	  /* "FSMN" */
 #define CACHE_EXT_ENDOFINDEXENTRIES 0x454F4945	/* "EOIE" */
 #define CACHE_EXT_INDEXENTRYOFFSETTABLE 0x49454F54 /* "IEOT" */
+#define CACHE_EXT_SPARSE_DIRECTORIES 0x73646972 /* "sdir" */
 
 /* changes that can be kept in $GIT_DIR/index (basically all extensions) */
 #define EXTMASK (RESOLVE_UNDO_CHANGED | CACHE_TREE_CHANGED | \
@@ -1763,6 +1764,10 @@ static int read_index_extension(struct index_state *istate,
 	case CACHE_EXT_INDEXENTRYOFFSETTABLE:
 		/* already handled in do_read_index() */
 		break;
+	case CACHE_EXT_SPARSE_DIRECTORIES:
+		/* no content, only an indicator */
+		istate->sparse_index = 1;
+		break;
 	default:
 		if (*ext < 'A' || 'Z' < *ext)
 			return error(_("index uses %.4s extension, which we do not understand"),
@@ -3020,6 +3025,10 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 		if (err)
 			return -1;
 	}
+	if (istate->sparse_index) {
+		if (write_index_ext_header(&c, &eoie_c, newfd, CACHE_EXT_SPARSE_DIRECTORIES, 0) < 0)
+			return -1;
+	}
 
 	/*
 	 * CACHE_EXT_ENDOFINDEXENTRIES must be written as the last entry before the SHA1
-- 
gitgitgadget



* [PATCH v5 12/21] sparse-index: convert from full to sparse
  2021-03-30 13:10       ` [PATCH v5 00/21] " Derrick Stolee via GitGitGadget
                           ` (10 preceding siblings ...)
  2021-03-30 13:10         ` [PATCH v5 11/21] sparse-index: add 'sdir' index extension Derrick Stolee via GitGitGadget
@ 2021-03-30 13:10         ` Derrick Stolee via GitGitGadget
  2021-03-30 13:10         ` [PATCH v5 13/21] submodule: sparse-index should not collapse links Derrick Stolee via GitGitGadget
                           ` (10 subsequent siblings)
  22 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-30 13:10 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

If we have a full index, then we can convert it to a sparse index by
replacing directories outside of the sparse cone with sparse directory
entries. The convert_to_sparse() method does this, when the situation is
appropriate.

For now, we avoid converting the index to a sparse index if:

 1. the index is split.
 2. the index is already sparse.
 3. sparse-checkout is disabled.
 4. sparse-checkout does not use cone mode.

Finally, we currently limit the conversion to when the
GIT_TEST_SPARSE_INDEX environment variable is enabled. A mode using Git
config will be added in a later change.

The trickiest thing about this conversion is that we might not be able
to mark a directory as a sparse directory just because it is outside the
sparse cone. There might be unmerged files within that directory, so we
need to look for those. Also, if there is some strange reason why a file
is not marked with CE_SKIP_WORKTREE, then we should give up on
converting that directory. There is still hope that some of its
subdirectories might be able to convert to sparse, so we keep looking
deeper.

The conversion process is assisted by the cache-tree extension. This is
calculated from the full index if it does not already exist. We then
abandon the cache-tree as it no longer applies to the newly-sparse
index. Thus, this cache-tree will be recalculated in every
sparse-full-sparse round-trip until we integrate the cache-tree
extension with the sparse index.

Some Git commands use the index after writing it. For example, 'git add'
will update the index, then write it to disk, then read its entries to
report information. To keep the in-memory index in a full state after
writing, we re-expand it to a full one after the write. This is wasteful
for commands that only write the index and do not read from it again,
but that is only the case until we make those commands "sparse aware."

We can compare the behavior of the sparse-index in
t1092-sparse-checkout-compatibility.sh by using GIT_TEST_SPARSE_INDEX=1
when operating on the 'sparse-index' repo. We can also compare the two
sparse repos directly, such as comparing their indexes (when expanded to
full in the case of the 'sparse-index' repo). We also verify that the
index is actually populated with sparse directory entries.

The 'checkout and reset (mixed)' test is marked for failure when
comparing a sparse repo to a full repo, but we can compare the two
sparse-checkout cases directly to ensure that we are not changing the
behavior when using a sparse index.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache-tree.c                             |   3 +
 cache.h                                  |   2 +
 read-cache.c                             |  26 ++++-
 sparse-index.c                           | 139 +++++++++++++++++++++++
 sparse-index.h                           |   1 +
 t/t1092-sparse-checkout-compatibility.sh |  61 +++++++++-
 6 files changed, 228 insertions(+), 4 deletions(-)

diff --git a/cache-tree.c b/cache-tree.c
index 2fb483d3c083..5f07a39e501e 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -6,6 +6,7 @@
 #include "object-store.h"
 #include "replace-object.h"
 #include "promisor-remote.h"
+#include "sparse-index.h"
 
 #ifndef DEBUG_CACHE_TREE
 #define DEBUG_CACHE_TREE 0
@@ -442,6 +443,8 @@ int cache_tree_update(struct index_state *istate, int flags)
 	if (i)
 		return i;
 
+	ensure_full_index(istate);
+
 	if (!istate->cache_tree)
 		istate->cache_tree = cache_tree();
 
diff --git a/cache.h b/cache.h
index 8c4464420d0a..74b43aaa2bd1 100644
--- a/cache.h
+++ b/cache.h
@@ -251,6 +251,8 @@ static inline unsigned int create_ce_mode(unsigned int mode)
 {
 	if (S_ISLNK(mode))
 		return S_IFLNK;
+	if (S_ISSPARSEDIR(mode))
+		return S_IFDIR;
 	if (S_ISDIR(mode) || S_ISGITLINK(mode))
 		return S_IFGITLINK;
 	return S_IFREG | ce_permissions(mode);
diff --git a/read-cache.c b/read-cache.c
index b8f092d1b7eb..2410e6e0df13 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -25,6 +25,7 @@
 #include "fsmonitor.h"
 #include "thread-utils.h"
 #include "progress.h"
+#include "sparse-index.h"
 
 /* Mask for the name length in ce_flags in the on-disk index */
 
@@ -1003,8 +1004,14 @@ int verify_path(const char *path, unsigned mode)
 
 			c = *path++;
 			if ((c == '.' && !verify_dotfile(path, mode)) ||
-			    is_dir_sep(c) || c == '\0')
+			    is_dir_sep(c))
 				return 0;
+			/*
+			 * allow terminating directory separators for
+			 * sparse directory entries.
+			 */
+			if (c == '\0')
+				return S_ISDIR(mode);
 		} else if (c == '\\' && protect_ntfs) {
 			if (is_ntfs_dotgit(path))
 				return 0;
@@ -3088,6 +3095,14 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l
 				 unsigned flags)
 {
 	int ret;
+	int was_full = !istate->sparse_index;
+
+	ret = convert_to_sparse(istate);
+
+	if (ret) {
+		warning(_("failed to convert to a sparse-index"));
+		return ret;
+	}
 
 	/*
 	 * TODO trace2: replace "the_repository" with the actual repo instance
@@ -3099,6 +3114,9 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l
 	trace2_region_leave_printf("index", "do_write_index", the_repository,
 				   "%s", get_lock_file_path(lock));
 
+	if (was_full)
+		ensure_full_index(istate);
+
 	if (ret)
 		return ret;
 	if (flags & COMMIT_LOCK)
@@ -3189,9 +3207,10 @@ static int write_shared_index(struct index_state *istate,
 			      struct tempfile **temp)
 {
 	struct split_index *si = istate->split_index;
-	int ret;
+	int ret, was_full = !istate->sparse_index;
 
 	move_cache_to_base_index(istate);
+	convert_to_sparse(istate);
 
 	trace2_region_enter_printf("index", "shared/do_write_index",
 				   the_repository, "%s", get_tempfile_path(*temp));
@@ -3199,6 +3218,9 @@ static int write_shared_index(struct index_state *istate,
 	trace2_region_leave_printf("index", "shared/do_write_index",
 				   the_repository, "%s", get_tempfile_path(*temp));
 
+	if (was_full)
+		ensure_full_index(istate);
+
 	if (ret)
 		return ret;
 	ret = adjust_shared_perm(get_tempfile_path(*temp));
diff --git a/sparse-index.c b/sparse-index.c
index 7095378a1b28..619ff7c2e217 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -4,6 +4,145 @@
 #include "tree.h"
 #include "pathspec.h"
 #include "trace2.h"
+#include "cache-tree.h"
+#include "config.h"
+#include "dir.h"
+#include "fsmonitor.h"
+
+static struct cache_entry *construct_sparse_dir_entry(
+				struct index_state *istate,
+				const char *sparse_dir,
+				struct cache_tree *tree)
+{
+	struct cache_entry *de;
+
+	de = make_cache_entry(istate, S_IFDIR, &tree->oid, sparse_dir, 0, 0);
+
+	de->ce_flags |= CE_SKIP_WORKTREE;
+	return de;
+}
+
+/*
+ * Returns the number of entries "inserted" into the index.
+ */
+static int convert_to_sparse_rec(struct index_state *istate,
+				 int num_converted,
+				 int start, int end,
+				 const char *ct_path, size_t ct_pathlen,
+				 struct cache_tree *ct)
+{
+	int i, can_convert = 1;
+	int start_converted = num_converted;
+	enum pattern_match_result match;
+	int dtype;
+	struct strbuf child_path = STRBUF_INIT;
+	struct pattern_list *pl = istate->sparse_checkout_patterns;
+
+	/*
+	 * Is the current path outside of the sparse cone?
+	 * Then check if the region can be replaced by a sparse
+	 * directory entry (everything is sparse and merged).
+	 */
+	match = path_matches_pattern_list(ct_path, ct_pathlen,
+					  NULL, &dtype, pl, istate);
+	if (match != NOT_MATCHED)
+		can_convert = 0;
+
+	for (i = start; can_convert && i < end; i++) {
+		struct cache_entry *ce = istate->cache[i];
+
+		if (ce_stage(ce) ||
+		    !(ce->ce_flags & CE_SKIP_WORKTREE))
+			can_convert = 0;
+	}
+
+	if (can_convert) {
+		struct cache_entry *se;
+		se = construct_sparse_dir_entry(istate, ct_path, ct);
+
+		istate->cache[num_converted++] = se;
+		return 1;
+	}
+
+	for (i = start; i < end; ) {
+		int count, span, pos = -1;
+		const char *base, *slash;
+		struct cache_entry *ce = istate->cache[i];
+
+		/*
+		 * Detect if this is a normal entry outside of any subtree
+		 * entry.
+		 */
+		base = ce->name + ct_pathlen;
+		slash = strchr(base, '/');
+
+		if (slash)
+			pos = cache_tree_subtree_pos(ct, base, slash - base);
+
+		if (pos < 0) {
+			istate->cache[num_converted++] = ce;
+			i++;
+			continue;
+		}
+
+		strbuf_setlen(&child_path, 0);
+		strbuf_add(&child_path, ce->name, slash - ce->name + 1);
+
+		span = ct->down[pos]->cache_tree->entry_count;
+		count = convert_to_sparse_rec(istate,
+					      num_converted, i, i + span,
+					      child_path.buf, child_path.len,
+					      ct->down[pos]->cache_tree);
+		num_converted += count;
+		i += span;
+	}
+
+	strbuf_release(&child_path);
+	return num_converted - start_converted;
+}
+
+int convert_to_sparse(struct index_state *istate)
+{
+	if (istate->split_index || istate->sparse_index ||
+	    !core_apply_sparse_checkout || !core_sparse_checkout_cone)
+		return 0;
+
+	/*
+	 * For now, only create a sparse index with the
+	 * GIT_TEST_SPARSE_INDEX environment variable. We will relax
+	 * this once we have a proper way to opt-in (and later still,
+	 * opt-out).
+	 */
+	if (!git_env_bool("GIT_TEST_SPARSE_INDEX", 0))
+		return 0;
+
+	if (!istate->sparse_checkout_patterns) {
+		istate->sparse_checkout_patterns = xcalloc(1, sizeof(struct pattern_list));
+		if (get_sparse_checkout_patterns(istate->sparse_checkout_patterns) < 0)
+			return 0;
+	}
+
+	if (!istate->sparse_checkout_patterns->use_cone_patterns) {
+		warning(_("attempting to use sparse-index without cone mode"));
+		return -1;
+	}
+
+	if (cache_tree_update(istate, 0)) {
+		warning(_("unable to update cache-tree, staying full"));
+		return -1;
+	}
+
+	remove_fsmonitor(istate);
+
+	trace2_region_enter("index", "convert_to_sparse", istate->repo);
+	istate->cache_nr = convert_to_sparse_rec(istate,
+						 0, 0, istate->cache_nr,
+						 "", 0, istate->cache_tree);
+	istate->drop_cache_tree = 1;
+	istate->sparse_index = 1;
+	trace2_region_leave("index", "convert_to_sparse", istate->repo);
+	return 0;
+}
 
 static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
 {
diff --git a/sparse-index.h b/sparse-index.h
index 09a20d036c46..64380e121d80 100644
--- a/sparse-index.h
+++ b/sparse-index.h
@@ -3,5 +3,6 @@
 
 struct index_state;
 void ensure_full_index(struct index_state *istate);
+int convert_to_sparse(struct index_state *istate);
 
 #endif
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index a1aea141c62c..1e888d195122 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -2,6 +2,11 @@
 
 test_description='compare full workdir to sparse workdir'
 
+# The verify_cache_tree() check is not sparse-aware (yet).
+# So, disable the check until that integration is complete.
+GIT_TEST_CHECK_CACHE_TREE=0
+GIT_TEST_SPLIT_INDEX=0
+
 . ./test-lib.sh
 
 test_expect_success 'setup' '
@@ -121,7 +126,9 @@ run_on_all () {
 test_all_match () {
 	run_on_all "$@" &&
 	test_cmp full-checkout-out sparse-checkout-out &&
-	test_cmp full-checkout-err sparse-checkout-err
+	test_cmp full-checkout-out sparse-index-out &&
+	test_cmp full-checkout-err sparse-checkout-err &&
+	test_cmp full-checkout-err sparse-index-err
 }
 
 test_sparse_match () {
@@ -130,6 +137,38 @@ test_sparse_match () {
 	test_cmp sparse-checkout-err sparse-index-err
 }
 
+test_expect_success 'sparse-index contents' '
+	init_repos &&
+
+	test-tool -C sparse-index read-cache --table >cache &&
+	for dir in folder1 folder2 x
+	do
+		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
+		grep "040000 tree $TREE	$dir/" cache \
+			|| return 1
+	done &&
+
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set folder1 &&
+
+	test-tool -C sparse-index read-cache --table >cache &&
+	for dir in deep folder2 x
+	do
+		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
+		grep "040000 tree $TREE	$dir/" cache \
+			|| return 1
+	done &&
+
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep/deeper1 &&
+
+	test-tool -C sparse-index read-cache --table >cache &&
+	for dir in deep/deeper2 folder1 folder2 x
+	do
+		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
+		grep "040000 tree $TREE	$dir/" cache \
+			|| return 1
+	done
+'
+
 test_expect_success 'expanded in-memory index matches full index' '
 	init_repos &&
 	test_sparse_match test-tool read-cache --expand --table
@@ -137,6 +176,7 @@ test_expect_success 'expanded in-memory index matches full index' '
 
 test_expect_success 'status with options' '
 	init_repos &&
+	test_sparse_match ls &&
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git status --porcelain=v2 -z -u &&
 	test_all_match git status --porcelain=v2 -uno &&
@@ -273,6 +313,17 @@ test_expect_failure 'checkout and reset (mixed)' '
 	test_all_match git reset update-folder2
 '
 
+# Ensure that sparse-index behaves identically to
+# sparse-checkout with a full index.
+test_expect_success 'checkout and reset (mixed) [sparse]' '
+	init_repos &&
+
+	test_sparse_match git checkout -b reset-test update-deep &&
+	test_sparse_match git reset deepest &&
+	test_sparse_match git reset update-folder1 &&
+	test_sparse_match git reset update-folder2
+'
+
 test_expect_success 'merge' '
 	init_repos &&
 
@@ -309,14 +360,20 @@ test_expect_success 'clean' '
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git clean -f &&
 	test_all_match git status --porcelain=v2 &&
+	test_sparse_match ls &&
+	test_sparse_match ls folder1 &&
 
 	test_all_match git clean -xf &&
 	test_all_match git status --porcelain=v2 &&
+	test_sparse_match ls &&
+	test_sparse_match ls folder1 &&
 
 	test_all_match git clean -xdf &&
 	test_all_match git status --porcelain=v2 &&
+	test_sparse_match ls &&
+	test_sparse_match ls folder1 &&
 
-	test_path_is_dir sparse-checkout/folder1
+	test_sparse_match test_path_is_dir folder1
 '
 
 test_done
-- 
gitgitgadget



* [PATCH v5 13/21] submodule: sparse-index should not collapse links
  2021-03-30 13:10       ` [PATCH v5 00/21] " Derrick Stolee via GitGitGadget
                           ` (11 preceding siblings ...)
  2021-03-30 13:10         ` [PATCH v5 12/21] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
@ 2021-03-30 13:10         ` Derrick Stolee via GitGitGadget
  2021-03-30 13:10         ` [PATCH v5 14/21] unpack-trees: allow sparse directories Derrick Stolee via GitGitGadget
                           ` (9 subsequent siblings)
  22 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-30 13:10 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

A submodule is stored as a "Git link" that actually points to a commit
within a submodule. Submodules are populated or not depending on
submodule configuration, not sparse-checkout. To ensure that the
sparse-index feature integrates correctly with submodules, we should not
collapse a directory if there is a Git link within its range.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 sparse-index.c                           |  1 +
 t/t1092-sparse-checkout-compatibility.sh | 17 +++++++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/sparse-index.c b/sparse-index.c
index 619ff7c2e217..7631f7bd00b7 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -52,6 +52,7 @@ static int convert_to_sparse_rec(struct index_state *istate,
 		struct cache_entry *ce = istate->cache[i];
 
 		if (ce_stage(ce) ||
+		    S_ISGITLINK(ce->ce_mode) ||
 		    !(ce->ce_flags & CE_SKIP_WORKTREE))
 			can_convert = 0;
 	}
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 1e888d195122..cba5f89b1e96 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -376,4 +376,21 @@ test_expect_success 'clean' '
 	test_sparse_match test_path_is_dir folder1
 '
 
+test_expect_success 'submodule handling' '
+	init_repos &&
+
+	test_all_match mkdir modules &&
+	test_all_match touch modules/a &&
+	test_all_match git add modules &&
+	test_all_match git commit -m "add modules directory" &&
+
+	run_on_all git submodule add "$(pwd)/initial-repo" modules/sub &&
+	test_all_match git commit -m "add submodule" &&
+
+	# having a submodule prevents "modules" from collapse
+	test-tool -C sparse-index read-cache --table >cache &&
+	grep "100644 blob .*	modules/a" cache &&
+	grep "160000 commit $(git -C initial-repo rev-parse HEAD)	modules/sub" cache
+'
+
 test_done
-- 
gitgitgadget



* [PATCH v5 14/21] unpack-trees: allow sparse directories
  2021-03-30 13:10       ` [PATCH v5 00/21] " Derrick Stolee via GitGitGadget
                           ` (12 preceding siblings ...)
  2021-03-30 13:10         ` [PATCH v5 13/21] submodule: sparse-index should not collapse links Derrick Stolee via GitGitGadget
@ 2021-03-30 13:10         ` Derrick Stolee via GitGitGadget
  2021-03-30 13:10         ` [PATCH v5 15/21] sparse-index: check index conversion happens Derrick Stolee via GitGitGadget
                           ` (8 subsequent siblings)
  22 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-30 13:10 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The index_pos_by_traverse_info() method currently throws a BUG() when a
directory entry exists exactly in the index. We need to consider that it
is possible to have a directory in a sparse index as long as that entry
is itself marked with the skip-worktree bit.

The 'pos' variable is assigned a negative value if an exact match is not
found. Since a directory name can be an exact match, it is no longer an
error to have a nonnegative 'pos' value.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 unpack-trees.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/unpack-trees.c b/unpack-trees.c
index 4dd99219073a..0b888dab2246 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -746,9 +746,13 @@ static int index_pos_by_traverse_info(struct name_entry *names,
 	strbuf_make_traverse_path(&name, info, names->path, names->pathlen);
 	strbuf_addch(&name, '/');
 	pos = index_name_pos(o->src_index, name.buf, name.len);
-	if (pos >= 0)
-		BUG("This is a directory and should not exist in index");
-	pos = -pos - 1;
+	if (pos >= 0) {
+		if (!o->src_index->sparse_index ||
+		    !(o->src_index->cache[pos]->ce_flags & CE_SKIP_WORKTREE))
+			BUG("This is a directory and should not exist in index");
+	} else {
+		pos = -pos - 1;
+	}
 	if (pos >= o->src_index->cache_nr ||
 	    !starts_with(o->src_index->cache[pos]->name, name.buf) ||
 	    (pos > 0 && starts_with(o->src_index->cache[pos-1]->name, name.buf)))
-- 
gitgitgadget



* [PATCH v5 15/21] sparse-index: check index conversion happens
  2021-03-30 13:10       ` [PATCH v5 00/21] " Derrick Stolee via GitGitGadget
                           ` (13 preceding siblings ...)
  2021-03-30 13:10         ` [PATCH v5 14/21] unpack-trees: allow sparse directories Derrick Stolee via GitGitGadget
@ 2021-03-30 13:10         ` Derrick Stolee via GitGitGadget
  2021-03-30 13:10         ` [PATCH v5 16/21] sparse-index: add index.sparse config option Derrick Stolee via GitGitGadget
                           ` (7 subsequent siblings)
  22 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-30 13:10 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add a test case that uses test_region to ensure that we are truly
expanding a sparse index to a full one, then converting back to sparse
when writing the index. As we integrate more Git commands with the
sparse index, we will convert these commands to check that we do _not_
convert the sparse index to a full index and instead stay sparse the
entire time.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t1092-sparse-checkout-compatibility.sh | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index cba5f89b1e96..47f983217852 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -393,4 +393,22 @@ test_expect_success 'submodule handling' '
 	grep "160000 commit $(git -C initial-repo rev-parse HEAD)	modules/sub" cache
 '
 
+test_expect_success 'sparse-index is expanded and converted back' '
+	init_repos &&
+
+	(
+		GIT_TEST_SPARSE_INDEX=1 &&
+		export GIT_TEST_SPARSE_INDEX &&
+		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+			git -C sparse-index -c core.fsmonitor="" reset --hard &&
+		test_region index convert_to_sparse trace2.txt &&
+		test_region index ensure_full_index trace2.txt &&
+
+		rm trace2.txt &&
+		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+			git -C sparse-index -c core.fsmonitor="" status -uno &&
+		test_region index ensure_full_index trace2.txt
+	)
+'
+
 test_done
-- 
gitgitgadget



* [PATCH v5 16/21] sparse-index: add index.sparse config option
  2021-03-30 13:10       ` [PATCH v5 00/21] " Derrick Stolee via GitGitGadget
                           ` (14 preceding siblings ...)
  2021-03-30 13:10         ` [PATCH v5 15/21] sparse-index: check index conversion happens Derrick Stolee via GitGitGadget
@ 2021-03-30 13:10         ` Derrick Stolee via GitGitGadget
  2021-03-30 13:11         ` [PATCH v5 17/21] sparse-checkout: toggle sparse index from builtin Derrick Stolee via GitGitGadget
                           ` (6 subsequent siblings)
  22 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-30 13:10 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When enabled, this config option signals that index writes should
attempt to use sparse-directory entries.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config/index.txt |  5 +++++
 cache.h                        |  1 +
 repo-settings.c                |  7 +++++++
 repository.h                   |  3 ++-
 sparse-index.c                 | 34 +++++++++++++++++++++++++++++-----
 5 files changed, 44 insertions(+), 6 deletions(-)

diff --git a/Documentation/config/index.txt b/Documentation/config/index.txt
index 7cb50b37e98d..75f3a2d10541 100644
--- a/Documentation/config/index.txt
+++ b/Documentation/config/index.txt
@@ -14,6 +14,11 @@ index.recordOffsetTable::
 	Defaults to 'true' if index.threads has been explicitly enabled,
 	'false' otherwise.
 
+index.sparse::
+	When enabled, write the index using sparse-directory entries. This
+	has no effect unless `core.sparseCheckout` and
+	`core.sparseCheckoutCone` are both enabled. Defaults to 'false'.
+
 index.threads::
 	Specifies the number of threads to spawn when loading the index.
 	This is meant to reduce index load time on multiprocessor machines.
diff --git a/cache.h b/cache.h
index 74b43aaa2bd1..8aede373aeb3 100644
--- a/cache.h
+++ b/cache.h
@@ -1059,6 +1059,7 @@ struct repository_format {
 	int worktree_config;
 	int is_bare;
 	int hash_algo;
+	int sparse_index;
 	char *work_tree;
 	struct string_list unknown_extensions;
 	struct string_list v1_only_extensions;
diff --git a/repo-settings.c b/repo-settings.c
index d63569e4041e..0cfe8b787db2 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -85,4 +85,11 @@ void prepare_repo_settings(struct repository *r)
 	 * removed.
 	 */
 	r->settings.command_requires_full_index = 1;
+
+	/*
+	 * Initialize this as off.
+	 */
+	r->settings.sparse_index = 0;
+	if (!repo_config_get_bool(r, "index.sparse", &value) && value)
+		r->settings.sparse_index = 1;
 }
diff --git a/repository.h b/repository.h
index e06a23015697..a45f7520fd9e 100644
--- a/repository.h
+++ b/repository.h
@@ -42,7 +42,8 @@ struct repo_settings {
 
 	int core_multi_pack_index;
 
-	unsigned command_requires_full_index:1;
+	unsigned command_requires_full_index:1,
+		 sparse_index:1;
 };
 
 struct repository {
diff --git a/sparse-index.c b/sparse-index.c
index 7631f7bd00b7..6f4d95d35b1e 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -102,19 +102,43 @@ static int convert_to_sparse_rec(struct index_state *istate,
 	return num_converted - start_converted;
 }
 
+static int enable_sparse_index(struct repository *repo)
+{
+	const char *config_path = repo_git_path(repo, "config.worktree");
+
+	git_config_set_in_file_gently(config_path,
+				      "index.sparse",
+				      "true");
+
+	prepare_repo_settings(repo);
+	repo->settings.sparse_index = 1;
+	return 0;
+}
+
 int convert_to_sparse(struct index_state *istate)
 {
 	if (istate->split_index || istate->sparse_index ||
 	    !core_apply_sparse_checkout || !core_sparse_checkout_cone)
 		return 0;
 
+	if (!istate->repo)
+		istate->repo = the_repository;
+
+	/*
+	 * The GIT_TEST_SPARSE_INDEX environment variable triggers the
+	 * index.sparse config variable to be on.
+	 */
+	if (git_env_bool("GIT_TEST_SPARSE_INDEX", 0)) {
+		int err = enable_sparse_index(istate->repo);
+		if (err < 0)
+			return err;
+	}
+
 	/*
-	 * For now, only create a sparse index with the
-	 * GIT_TEST_SPARSE_INDEX environment variable. We will relax
-	 * this once we have a proper way to opt-in (and later still,
-	 * opt-out).
+	 * Only convert to sparse if index.sparse is set.
 	 */
-	if (!git_env_bool("GIT_TEST_SPARSE_INDEX", 0))
+	prepare_repo_settings(istate->repo);
+	if (!istate->repo->settings.sparse_index)
 		return 0;
 
 	if (!istate->sparse_checkout_patterns) {
-- 
gitgitgadget



* [PATCH v5 17/21] sparse-checkout: toggle sparse index from builtin
  2021-03-30 13:10       ` [PATCH v5 00/21] " Derrick Stolee via GitGitGadget
                           ` (15 preceding siblings ...)
  2021-03-30 13:10         ` [PATCH v5 16/21] sparse-index: add index.sparse config option Derrick Stolee via GitGitGadget
@ 2021-03-30 13:11         ` Derrick Stolee via GitGitGadget
  2021-03-30 13:11         ` [PATCH v5 18/21] sparse-checkout: disable sparse-index Derrick Stolee via GitGitGadget
                           ` (5 subsequent siblings)
  22 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-30 13:11 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The sparse index extension is used to signal that index writes should be
in sparse mode. This was only updated using GIT_TEST_SPARSE_INDEX=1.

Add a '--[no-]sparse-index' option to 'git sparse-checkout init' that
specifies if the sparse index should be used. It also updates the index
to use the correct format, either way. Add a warning in the
documentation that the use of a repository extension might reduce
compatibility with third-party tools. 'git sparse-checkout init' already
sets extension.worktreeConfig, which places most sparse-checkout users
outside of the scope of most third-party tools.

Update t1092-sparse-checkout-compatibility.sh to use this CLI instead of
GIT_TEST_SPARSE_INDEX=1.
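
As a rough illustration of the "either way" behavior, the option is
treated as a three-valued flag: unset, off, or on. The sketch below is
not the builtin's code; sketch_parse_sparse_index_flag() and
sketch_should_rewrite() are hypothetical stand-ins for the OPT_BOOL
handling and the 'init_opts.sparse_index >= 0' check in the patch.

```c
#include <string.h>

/*
 * Hypothetical stand-in for init_opts.sparse_index:
 *   -1 = neither flag given, 0 = --no-sparse-index, 1 = --sparse-index.
 * As with a boolean command-line option, the last flag seen wins.
 */
static int sketch_parse_sparse_index_flag(int argc, const char **argv)
{
	int value = -1; /* start as "unset" so we can tell it apart from "off" */
	int i;

	for (i = 0; i < argc; i++) {
		if (!strcmp(argv[i], "--sparse-index"))
			value = 1;
		else if (!strcmp(argv[i], "--no-sparse-index"))
			value = 0;
	}
	return value;
}

/*
 * The config and the index are only rewritten when the user said
 * something explicit, in either direction.
 */
static int sketch_should_rewrite(int flag)
{
	return flag >= 0;
}
```

With this shape, a plain 'init --cone' leaves an existing index.sparse
value alone, while both '--sparse-index' and '--no-sparse-index' force
a rewrite.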

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-sparse-checkout.txt    | 14 +++++++
 builtin/sparse-checkout.c                | 17 ++++++++-
 sparse-index.c                           | 33 +++++++++++------
 sparse-index.h                           |  3 ++
 t/t1092-sparse-checkout-compatibility.sh | 47 +++++++++++++-----------
 5 files changed, 80 insertions(+), 34 deletions(-)

diff --git a/Documentation/git-sparse-checkout.txt b/Documentation/git-sparse-checkout.txt
index a0eeaeb02ee3..fdcf43f87cb3 100644
--- a/Documentation/git-sparse-checkout.txt
+++ b/Documentation/git-sparse-checkout.txt
@@ -45,6 +45,20 @@ To avoid interfering with other worktrees, it first enables the
 When `--cone` is provided, the `core.sparseCheckoutCone` setting is
 also set, allowing for better performance with a limited set of
 patterns (see 'CONE PATTERN SET' below).
++
+Use the `--[no-]sparse-index` option to toggle the use of the sparse
+index format. This reduces the size of the index to be more closely
+aligned with your sparse-checkout definition. This can have significant
+performance advantages for commands such as `git status` or `git add`.
+This feature is still experimental. Some commands might be slower with
+a sparse index until they are properly integrated with the feature.
++
+**WARNING:** Using a sparse index requires modifying the index in a way
+that is not completely understood by external tools. If you have trouble
+with this compatibility, then run `git sparse-checkout init --no-sparse-index`
+to rewrite your index to not be sparse. Older versions of Git will not
+understand the sparse directory entries index extension and may fail to
+interact with your repository until it is disabled.
 
 'set'::
 	Write a set of patterns to the sparse-checkout file, as given as
diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
index e00b82af727b..ca63e2c64e95 100644
--- a/builtin/sparse-checkout.c
+++ b/builtin/sparse-checkout.c
@@ -14,6 +14,7 @@
 #include "unpack-trees.h"
 #include "wt-status.h"
 #include "quote.h"
+#include "sparse-index.h"
 
 static const char *empty_base = "";
 
@@ -283,12 +284,13 @@ static int set_config(enum sparse_checkout_mode mode)
 }
 
 static char const * const builtin_sparse_checkout_init_usage[] = {
-	N_("git sparse-checkout init [--cone]"),
+	N_("git sparse-checkout init [--cone] [--[no-]sparse-index]"),
 	NULL
 };
 
 static struct sparse_checkout_init_opts {
 	int cone_mode;
+	int sparse_index;
 } init_opts;
 
 static int sparse_checkout_init(int argc, const char **argv)
@@ -303,11 +305,15 @@ static int sparse_checkout_init(int argc, const char **argv)
 	static struct option builtin_sparse_checkout_init_options[] = {
 		OPT_BOOL(0, "cone", &init_opts.cone_mode,
 			 N_("initialize the sparse-checkout in cone mode")),
+		OPT_BOOL(0, "sparse-index", &init_opts.sparse_index,
+			 N_("toggle the use of a sparse index")),
 		OPT_END(),
 	};
 
 	repo_read_index(the_repository);
 
+	init_opts.sparse_index = -1;
+
 	argc = parse_options(argc, argv, NULL,
 			     builtin_sparse_checkout_init_options,
 			     builtin_sparse_checkout_init_usage, 0);
@@ -326,6 +332,15 @@ static int sparse_checkout_init(int argc, const char **argv)
 	sparse_filename = get_sparse_checkout_filename();
 	res = add_patterns_from_file_to_list(sparse_filename, "", 0, &pl, NULL);
 
+	if (init_opts.sparse_index >= 0) {
+		if (set_sparse_index_config(the_repository, init_opts.sparse_index) < 0)
+			die(_("failed to modify sparse-index config"));
+
+		/* force an index rewrite */
+		repo_read_index(the_repository);
+		the_repository->index->updated_workdir = 1;
+	}
+
 	/* If we already have a sparse-checkout file, use it. */
 	if (res >= 0) {
 		free(sparse_filename);
diff --git a/sparse-index.c b/sparse-index.c
index 6f4d95d35b1e..4c73772c6d6c 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -102,21 +102,32 @@ static int convert_to_sparse_rec(struct index_state *istate,
 	return num_converted - start_converted;
 }
 
-static int enable_sparse_index(struct repository *repo)
+static int set_index_sparse_config(struct repository *repo, int enable)
 {
-	const char *config_path = repo_git_path(repo, "config.worktree");
-
-	git_config_set_in_file_gently(config_path,
-				      "index.sparse",
-				      "true");
+	int res;
+	char *config_path = repo_git_path(repo, "config.worktree");
+	res = git_config_set_in_file_gently(config_path,
+					    "index.sparse",
+					    enable ? "true" : NULL);
+	free(config_path);
 
 	prepare_repo_settings(repo);
 	repo->settings.sparse_index = 1;
-	return 0;
+	return res;
+}
+
+int set_sparse_index_config(struct repository *repo, int enable)
+{
+	int res = set_index_sparse_config(repo, enable);
+
+	prepare_repo_settings(repo);
+	repo->settings.sparse_index = enable;
+	return res;
 }
 
 int convert_to_sparse(struct index_state *istate)
 {
+	int test_env;
 	if (istate->split_index || istate->sparse_index ||
 	    !core_apply_sparse_checkout || !core_sparse_checkout_cone)
 		return 0;
@@ -128,11 +139,9 @@ int convert_to_sparse(struct index_state *istate)
 	 * The GIT_TEST_SPARSE_INDEX environment variable triggers the
 	 * index.sparse config variable to be on.
 	 */
-	if (git_env_bool("GIT_TEST_SPARSE_INDEX", 0)) {
-		int err = enable_sparse_index(istate->repo);
-		if (err < 0)
-			return err;
-	}
+	test_env = git_env_bool("GIT_TEST_SPARSE_INDEX", -1);
+	if (test_env >= 0)
+		set_sparse_index_config(istate->repo, test_env);
 
 	/*
 	 * Only convert to sparse if index.sparse is set.
diff --git a/sparse-index.h b/sparse-index.h
index 64380e121d80..39dcc859735e 100644
--- a/sparse-index.h
+++ b/sparse-index.h
@@ -5,4 +5,7 @@ struct index_state;
 void ensure_full_index(struct index_state *istate);
 int convert_to_sparse(struct index_state *istate);
 
+struct repository;
+int set_sparse_index_config(struct repository *repo, int enable);
+
 #endif
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 47f983217852..472c5337de1b 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -6,6 +6,7 @@ test_description='compare full workdir to sparse workdir'
 # So, disable the check until that integration is complete.
 GIT_TEST_CHECK_CACHE_TREE=0
 GIT_TEST_SPLIT_INDEX=0
+GIT_TEST_SPARSE_INDEX=
 
 . ./test-lib.sh
 
@@ -100,25 +101,26 @@ init_repos () {
 	# initialize sparse-checkout definitions
 	git -C sparse-checkout sparse-checkout init --cone &&
 	git -C sparse-checkout sparse-checkout set deep &&
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout init --cone &&
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep
+	git -C sparse-index sparse-checkout init --cone --sparse-index &&
+	test_cmp_config -C sparse-index true index.sparse &&
+	git -C sparse-index sparse-checkout set deep
 }
 
 run_on_sparse () {
 	(
 		cd sparse-checkout &&
-		GIT_TEST_SPARSE_INDEX=0 "$@" >../sparse-checkout-out 2>../sparse-checkout-err
+		"$@" >../sparse-checkout-out 2>../sparse-checkout-err
 	) &&
 	(
 		cd sparse-index &&
-		GIT_TEST_SPARSE_INDEX=1 "$@" >../sparse-index-out 2>../sparse-index-err
+		"$@" >../sparse-index-out 2>../sparse-index-err
 	)
 }
 
 run_on_all () {
 	(
 		cd full-checkout &&
-		GIT_TEST_SPARSE_INDEX=0 "$@" >../full-checkout-out 2>../full-checkout-err
+		"$@" >../full-checkout-out 2>../full-checkout-err
 	) &&
 	run_on_sparse "$@"
 }
@@ -148,7 +150,7 @@ test_expect_success 'sparse-index contents' '
 			|| return 1
 	done &&
 
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set folder1 &&
+	git -C sparse-index sparse-checkout set folder1 &&
 
 	test-tool -C sparse-index read-cache --table >cache &&
 	for dir in deep folder2 x
@@ -158,7 +160,7 @@ test_expect_success 'sparse-index contents' '
 			|| return 1
 	done &&
 
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep/deeper1 &&
+	git -C sparse-index sparse-checkout set deep/deeper1 &&
 
 	test-tool -C sparse-index read-cache --table >cache &&
 	for dir in deep/deeper2 folder1 folder2 x
@@ -166,7 +168,14 @@ test_expect_success 'sparse-index contents' '
 		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
 		grep "040000 tree $TREE	$dir/" cache \
 			|| return 1
-	done
+	done &&
+
+	# Disabling the sparse-index replaces tree entries with full ones
+	git -C sparse-index sparse-checkout init --no-sparse-index &&
+
+	test-tool -C sparse-index read-cache --table >cache &&
+	! grep "040000 tree" cache &&
+	test_sparse_match test-tool read-cache --table
 '
 
 test_expect_success 'expanded in-memory index matches full index' '
@@ -396,19 +405,15 @@ test_expect_success 'submodule handling' '
 test_expect_success 'sparse-index is expanded and converted back' '
 	init_repos &&
 
-	(
-		GIT_TEST_SPARSE_INDEX=1 &&
-		export GIT_TEST_SPARSE_INDEX &&
-		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
-			git -C sparse-index -c core.fsmonitor="" reset --hard &&
-		test_region index convert_to_sparse trace2.txt &&
-		test_region index ensure_full_index trace2.txt &&
-
-		rm trace2.txt &&
-		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
-			git -C sparse-index -c core.fsmonitor="" status -uno &&
-		test_region index ensure_full_index trace2.txt
-	)
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+		git -C sparse-index -c core.fsmonitor="" reset --hard &&
+	test_region index convert_to_sparse trace2.txt &&
+	test_region index ensure_full_index trace2.txt &&
+
+	rm trace2.txt &&
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+		git -C sparse-index -c core.fsmonitor="" status -uno &&
+	test_region index ensure_full_index trace2.txt
 '
 
 test_done
-- 
gitgitgadget



* [PATCH v5 18/21] sparse-checkout: disable sparse-index
  2021-03-30 13:10       ` [PATCH v5 00/21] " Derrick Stolee via GitGitGadget
                           ` (16 preceding siblings ...)
  2021-03-30 13:11         ` [PATCH v5 17/21] sparse-checkout: toggle sparse index from builtin Derrick Stolee via GitGitGadget
@ 2021-03-30 13:11         ` Derrick Stolee via GitGitGadget
  2021-03-30 13:11         ` [PATCH v5 19/21] cache-tree: integrate with sparse directory entries Derrick Stolee via GitGitGadget
                           ` (4 subsequent siblings)
  22 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-30 13:11 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

We use 'git sparse-checkout init --cone --sparse-index' to toggle the
sparse-index feature. It makes sense to also disable it when running
'git sparse-checkout disable'. This is particularly important because it
removes the index.sparse config option, allowing other tools
to use this Git repository again.

This does mean that 'git sparse-checkout init' will not re-enable the
sparse-index feature, even if it was previously enabled.

While testing this feature, I noticed that the sparse-index was not
being written on the first run, but only on the second. This was caught by the
call to 'test-tool read-cache --table'. This requires adjusting some
assignments to core_apply_sparse_checkout and pl.use_cone_patterns in
the sparse_checkout_init() logic.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/sparse-checkout.c          | 10 +++++++++-
 t/t1091-sparse-checkout-builtin.sh | 13 +++++++++++++
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
index ca63e2c64e95..585343fa1972 100644
--- a/builtin/sparse-checkout.c
+++ b/builtin/sparse-checkout.c
@@ -280,6 +280,9 @@ static int set_config(enum sparse_checkout_mode mode)
 				      "core.sparseCheckoutCone",
 				      mode == MODE_CONE_PATTERNS ? "true" : NULL);
 
+	if (mode == MODE_NO_PATTERNS)
+		set_sparse_index_config(the_repository, 0);
+
 	return 0;
 }
 
@@ -341,10 +344,11 @@ static int sparse_checkout_init(int argc, const char **argv)
 		the_repository->index->updated_workdir = 1;
 	}
 
+	core_apply_sparse_checkout = 1;
+
 	/* If we already have a sparse-checkout file, use it. */
 	if (res >= 0) {
 		free(sparse_filename);
-		core_apply_sparse_checkout = 1;
 		return update_working_directory(NULL);
 	}
 
@@ -366,6 +370,7 @@ static int sparse_checkout_init(int argc, const char **argv)
 	add_pattern(strbuf_detach(&pattern, NULL), empty_base, 0, &pl, 0);
 	strbuf_addstr(&pattern, "!/*/");
 	add_pattern(strbuf_detach(&pattern, NULL), empty_base, 0, &pl, 0);
+	pl.use_cone_patterns = init_opts.cone_mode;
 
 	return write_patterns_and_update(&pl);
 }
@@ -632,6 +637,9 @@ static int sparse_checkout_disable(int argc, const char **argv)
 	strbuf_addstr(&match_all, "/*");
 	add_pattern(strbuf_detach(&match_all, NULL), empty_base, 0, &pl, 0);
 
+	prepare_repo_settings(the_repository);
+	the_repository->settings.sparse_index = 0;
+
 	if (update_working_directory(&pl))
 		die(_("error while refreshing working directory"));
 
diff --git a/t/t1091-sparse-checkout-builtin.sh b/t/t1091-sparse-checkout-builtin.sh
index fc64e9ed99f4..38fc8340f5c9 100755
--- a/t/t1091-sparse-checkout-builtin.sh
+++ b/t/t1091-sparse-checkout-builtin.sh
@@ -205,6 +205,19 @@ test_expect_success 'sparse-checkout disable' '
 	check_files repo a deep folder1 folder2
 '
 
+test_expect_success 'sparse-index enabled and disabled' '
+	git -C repo sparse-checkout init --cone --sparse-index &&
+	test_cmp_config -C repo true index.sparse &&
+	test-tool -C repo read-cache --table >cache &&
+	grep " tree " cache &&
+
+	git -C repo sparse-checkout disable &&
+	test-tool -C repo read-cache --table >cache &&
+	! grep " tree " cache &&
+	git -C repo config --list >config &&
+	! grep index.sparse config
+'
+
 test_expect_success 'cone mode: init and set' '
 	git -C repo sparse-checkout init --cone &&
 	git -C repo config --list >config &&
-- 
gitgitgadget



* [PATCH v5 19/21] cache-tree: integrate with sparse directory entries
  2021-03-30 13:10       ` [PATCH v5 00/21] " Derrick Stolee via GitGitGadget
                           ` (17 preceding siblings ...)
  2021-03-30 13:11         ` [PATCH v5 18/21] sparse-checkout: disable sparse-index Derrick Stolee via GitGitGadget
@ 2021-03-30 13:11         ` Derrick Stolee via GitGitGadget
  2021-03-30 13:11         ` [PATCH v5 20/21] sparse-index: loose integration with cache_tree_verify() Derrick Stolee via GitGitGadget
                           ` (3 subsequent siblings)
  22 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-30 13:11 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The cache-tree extension was previously disabled with sparse indexes.
However, the cache-tree is an important performance feature for commands
like 'git status' and 'git add'. Integrate it with sparse directory
entries.

When writing a sparse index, completely clear and recalculate the cache
tree. By starting from scratch, the only integration necessary is to
check if we hit a sparse directory entry and create a leaf of the
cache-tree that has an entry_count of one and no subtrees.
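
The leaf rule can be pictured with a small standalone model. This is
only a sketch under assumptions: the struct and function names are
hypothetical, and it assumes a sparse directory entry is recognized by
mode 040000 and a name ending in '/', as the 'read-cache --table' output
in t1092 suggests.

```c
#include <string.h>

/* Assumed mode of a sparse directory entry (shown as "040000 tree"). */
#define SKETCH_SPARSE_DIR_MODE 0040000

/* Hypothetical, trimmed-down stand-in for a cache entry. */
struct sketch_entry {
	unsigned int mode;
	const char *name;   /* e.g. "folder1/" */
	size_t namelen;
};

/* Hypothetical, trimmed-down stand-in for a cache-tree node. */
struct sketch_cache_tree {
	int entry_count;    /* -1 means "invalid, needs recomputing" */
	int subtree_nr;
};

/*
 * If the first entry of the region is a sparse directory whose name is
 * exactly 'base', mark the node as a leaf covering one entry and report
 * success. Otherwise leave the node untouched.
 */
static int sketch_try_sparse_leaf(struct sketch_cache_tree *it,
				  const struct sketch_entry *first,
				  const char *base, size_t baselen)
{
	if (first->mode == SKETCH_SPARSE_DIR_MODE &&
	    first->namelen == baselen &&
	    !strncmp(first->name, base, baselen)) {
		it->entry_count = 1;
		it->subtree_nr = 0;
		return 1;
	}
	return 0;
}
```

A regular file such as "folder1/a" fails both the mode and the length
comparison, so the normal recursive computation would still apply to it.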

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache-tree.c   | 18 ++++++++++++++++++
 sparse-index.c | 10 +++++++++-
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/cache-tree.c b/cache-tree.c
index 5f07a39e501e..950a9615db8f 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -256,6 +256,24 @@ static int update_one(struct cache_tree *it,
 
 	*skip_count = 0;
 
+	/*
+	 * If the first entry of this region is a sparse directory
+	 * entry corresponding exactly to 'base', then this cache_tree
+	 * struct is a "leaf" in the data structure, pointing to the
+	 * tree OID specified in the entry.
+	 */
+	if (entries > 0) {
+		const struct cache_entry *ce = cache[0];
+
+		if (S_ISSPARSEDIR(ce->ce_mode) &&
+		    ce->ce_namelen == baselen &&
+		    !strncmp(ce->name, base, baselen)) {
+			it->entry_count = 1;
+			oidcpy(&it->oid, &ce->oid);
+			return 1;
+		}
+	}
+
 	if (0 <= it->entry_count && has_object_file(&it->oid))
 		return it->entry_count;
 
diff --git a/sparse-index.c b/sparse-index.c
index 4c73772c6d6c..95ea17174da3 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -172,7 +172,11 @@ int convert_to_sparse(struct index_state *istate)
 	istate->cache_nr = convert_to_sparse_rec(istate,
 						 0, 0, istate->cache_nr,
 						 "", 0, istate->cache_tree);
-	istate->drop_cache_tree = 1;
+
+	/* Clear and recompute the cache-tree */
+	cache_tree_free(&istate->cache_tree);
+	cache_tree_update(istate, 0);
+
 	istate->sparse_index = 1;
 	trace2_region_leave("index", "convert_to_sparse", istate->repo);
 	return 0;
@@ -273,5 +277,9 @@ void ensure_full_index(struct index_state *istate)
 	strbuf_release(&base);
 	free(full);
 
+	/* Clear and recompute the cache-tree */
+	cache_tree_free(&istate->cache_tree);
+	cache_tree_update(istate, 0);
+
 	trace2_region_leave("index", "ensure_full_index", istate->repo);
 }
-- 
gitgitgadget



* [PATCH v5 20/21] sparse-index: loose integration with cache_tree_verify()
  2021-03-30 13:10       ` [PATCH v5 00/21] " Derrick Stolee via GitGitGadget
                           ` (18 preceding siblings ...)
  2021-03-30 13:11         ` [PATCH v5 19/21] cache-tree: integrate with sparse directory entries Derrick Stolee via GitGitGadget
@ 2021-03-30 13:11         ` Derrick Stolee via GitGitGadget
  2021-03-30 13:11         ` [PATCH v5 21/21] p2000: add sparse-index repos Derrick Stolee via GitGitGadget
                           ` (2 subsequent siblings)
  22 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-30 13:11 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The cache_tree_verify() method is run when GIT_TEST_CHECK_CACHE_TREE
is enabled, which it is by default in the test suite. The logic must
be adjusted for the presence of these directory entries.

For now, leave the test as a simple check for whether the directory
entry is sparse. Do not go any further until needed.

This allows us to re-enable GIT_TEST_CHECK_CACHE_TREE in
t1092-sparse-checkout-compatibility.sh. Further,
p2000-sparse-operations.sh uses the test suite and hence this is enabled
for all tests. We need to integrate with it before we run our
performance tests with a sparse-index.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache-tree.c                             | 19 +++++++++++++++++++
 t/t1092-sparse-checkout-compatibility.sh |  3 ---
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/cache-tree.c b/cache-tree.c
index 950a9615db8f..11bf1fcae6e1 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -808,6 +808,19 @@ int cache_tree_matches_traversal(struct cache_tree *root,
 	return 0;
 }
 
+static void verify_one_sparse(struct repository *r,
+			      struct index_state *istate,
+			      struct cache_tree *it,
+			      struct strbuf *path,
+			      int pos)
+{
+	struct cache_entry *ce = istate->cache[pos];
+
+	if (!S_ISSPARSEDIR(ce->ce_mode))
+		BUG("directory '%s' is present in index, but not sparse",
+		    path->buf);
+}
+
 static void verify_one(struct repository *r,
 		       struct index_state *istate,
 		       struct cache_tree *it,
@@ -830,6 +843,12 @@ static void verify_one(struct repository *r,
 
 	if (path->len) {
 		pos = index_name_pos(istate, path->buf, path->len);
+
+		if (pos >= 0) {
+			verify_one_sparse(r, istate, it, path, pos);
+			return;
+		}
+
 		pos = -pos - 1;
 	} else {
 		pos = 0;
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 472c5337de1b..12e6c453024f 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -2,9 +2,6 @@
 
 test_description='compare full workdir to sparse workdir'
 
-# The verify_cache_tree() check is not sparse-aware (yet).
-# So, disable the check until that integration is complete.
-GIT_TEST_CHECK_CACHE_TREE=0
 GIT_TEST_SPLIT_INDEX=0
 GIT_TEST_SPARSE_INDEX=
 
-- 
gitgitgadget



* [PATCH v5 21/21] p2000: add sparse-index repos
  2021-03-30 13:10       ` [PATCH v5 00/21] " Derrick Stolee via GitGitGadget
                           ` (19 preceding siblings ...)
  2021-03-30 13:11         ` [PATCH v5 20/21] sparse-index: loose integration with cache_tree_verify() Derrick Stolee via GitGitGadget
@ 2021-03-30 13:11         ` Derrick Stolee via GitGitGadget
  2021-03-30 20:11         ` [PATCH v5 00/21] Sparse Index: Design, Format, Tests Junio C Hamano
  2021-04-01  4:38         ` Elijah Newren
  22 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-30 13:11 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

p2000-sparse-operations.sh compares different Git commands in
repositories with many files at HEAD but using sparse-checkout to focus
on a small portion of those files.

Add extra copies of the repository that use the sparse-index format so
we can track how that affects the performance of different commands.

At this point, the sparse-index is pure overhead on the CPU front, and
this is measurable in these tests:

Test                                            real(user+sys)
---------------------------------------------------------------
2000.2: git status (full-index-v3)              0.59(0.51+0.12)
2000.3: git status (full-index-v4)              0.59(0.52+0.11)
2000.4: git status (sparse-index-v3)            1.40(1.32+0.12)
2000.5: git status (sparse-index-v4)            1.41(1.36+0.08)
2000.6: git add -A (full-index-v3)              2.32(1.97+0.19)
2000.7: git add -A (full-index-v4)              2.17(1.92+0.14)
2000.8: git add -A (sparse-index-v3)            2.31(2.21+0.15)
2000.9: git add -A (sparse-index-v4)            2.30(2.20+0.13)
2000.10: git add . (full-index-v3)              2.39(2.02+0.20)
2000.11: git add . (full-index-v4)              2.20(1.94+0.16)
2000.12: git add . (sparse-index-v3)            2.36(2.27+0.12)
2000.13: git add . (sparse-index-v4)            2.33(2.21+0.16)
2000.14: git commit -a -m A (full-index-v3)     2.47(2.12+0.20)
2000.15: git commit -a -m A (full-index-v4)     2.26(2.00+0.17)
2000.16: git commit -a -m A (sparse-index-v3)   3.01(2.92+0.16)
2000.17: git commit -a -m A (sparse-index-v4)   3.01(2.94+0.15)

Note that there is very little difference between the v3 and v4 index
formats when the sparse-index is enabled. This is primarily due to the
fact that the relative file sizes are the same, and the command time is
mostly taken up by parsing tree objects to expand the sparse index into
a full one.

With the current file layout, the index file sizes are given by this
table:

       |  full index | sparse index |
       +-------------+--------------+
    v3 |     108 MiB |      1.6 MiB |
    v4 |      80 MiB |      1.2 MiB |
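
The claim that the relative file sizes are the same can be checked with
a little arithmetic on the table. The helper names below are
hypothetical; the numbers are copied from the table above.

```c
/* Index file sizes in MiB, copied from the table in this message. */
struct sketch_index_size {
	const char *version;
	double full_mib;
	double sparse_mib;
};

static const struct sketch_index_size sketch_sizes[] = {
	{ "v3", 108.0, 1.6 },
	{ "v4",  80.0, 1.2 },
};

/* How many times smaller the sparse index is than the full one. */
static double sketch_shrink_factor(const struct sketch_index_size *s)
{
	return s->full_mib / s->sparse_mib;
}

/* Size of the v4 full index relative to the v3 full index (about 0.74). */
static double sketch_v4_over_v3_full(void)
{
	return sketch_sizes[1].full_mib / sketch_sizes[0].full_mib;
}

/* Size of the v4 sparse index relative to the v3 sparse index (0.75). */
static double sketch_v4_over_v3_sparse(void)
{
	return sketch_sizes[1].sparse_mib / sketch_sizes[0].sparse_mib;
}
```

Both formats shrink by a factor of roughly 67, and v4 stays at about
three quarters of the v3 size whether or not the index is sparse.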

Future updates will improve the performance of Git commands when the
index is sparse.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/perf/p2000-sparse-operations.sh | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
index dddd527b6330..94513c977489 100755
--- a/t/perf/p2000-sparse-operations.sh
+++ b/t/perf/p2000-sparse-operations.sh
@@ -59,12 +59,29 @@ test_expect_success 'setup repo and indexes' '
 		git sparse-checkout set $SPARSE_CONE &&
 		git config index.version 4 &&
 		git update-index --index-version=4
+	) &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . sparse-index-v3 &&
+	(
+		cd sparse-index-v3 &&
+		git sparse-checkout init --cone --sparse-index &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 3 &&
+		git update-index --index-version=3
+	) &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . sparse-index-v4 &&
+	(
+		cd sparse-index-v4 &&
+		git sparse-checkout init --cone --sparse-index &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 4 &&
+		git update-index --index-version=4
 	)
 '
 
 test_perf_on_all () {
 	command="$@"
-	for repo in full-index-v3 full-index-v4
+	for repo in full-index-v3 full-index-v4 \
+		    sparse-index-v3 sparse-index-v4
 	do
 		test_perf "$command ($repo)" "
 			(
-- 
gitgitgadget


* Re: [PATCH v5 00/21] Sparse Index: Design, Format, Tests
  2021-03-30 13:10       ` [PATCH v5 00/21] " Derrick Stolee via GitGitGadget
                           ` (20 preceding siblings ...)
  2021-03-30 13:11         ` [PATCH v5 21/21] p2000: add sparse-index repos Derrick Stolee via GitGitGadget
@ 2021-03-30 20:11         ` Junio C Hamano
  2021-03-30 21:31           ` Derrick Stolee
  2021-04-01  4:38         ` Elijah Newren
  22 siblings, 1 reply; 203+ messages in thread
From: Junio C Hamano @ 2021-03-30 20:11 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, newren, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

>      @@ repo-settings.c: void prepare_repo_settings(struct repository *r)
>       +	 * Initialize this as off.
>       +	 */
>       +	r->settings.sparse_index = 0;
>      -+	if (!repo_config_get_bool(r, "extensions.sparseindex", &value) && value)
>      ++	if (!repo_config_get_bool(r, "index.sparse", &value) && value)
>       +		r->settings.sparse_index = 1;
>        }

It would be helpful to have a way for the repository owner to say
"Even if the version of Git may be capable of handling 'sdir'
extension, and my checkout uses sparse-cone settings, I do not want
to use it", and the other way around, i.e. "Even if my checkout
currently does not use sparse-cone settings, do use 'sdir'
extension".  But for that, .sparse_index member may need to be
tristate (i.e. forbidden, enable-if-needed, use-even-unneeded)?
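
To make the three states concrete, here is a rough sketch. Nothing in
it comes from the series; the enum and function names are hypothetical
and only illustrate how such a setting might decide whether to write
the 'sdir' extension.

```c
/* Hypothetical tristate for a .sparse_index-like setting. */
enum sketch_sparse_index_mode {
	SKETCH_SPARSE_FORBIDDEN,     /* never write 'sdir' */
	SKETCH_SPARSE_IF_NEEDED,     /* write it only for cone-mode sparse-checkout */
	SKETCH_SPARSE_EVEN_UNNEEDED  /* write it regardless, e.g. for debugging */
};

/*
 * Returns 1 if the 'sdir' extension would be written under this
 * hypothetical scheme. 'cone_sparse_checkout' is nonzero when the
 * checkout uses sparse-cone settings.
 */
static int sketch_would_write_sdir(enum sketch_sparse_index_mode mode,
				   int cone_sparse_checkout)
{
	switch (mode) {
	case SKETCH_SPARSE_FORBIDDEN:
		return 0;
	case SKETCH_SPARSE_IF_NEEDED:
		return cone_sparse_checkout != 0;
	case SKETCH_SPARSE_EVEN_UNNEEDED:
		return 1;
	}
	return 0;
}
```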

We have a similar setting in index.version; I believe we always
auto-demote 3 down to 2 when extended flags are not used, and
I think "always auto-demote" would be sufficient (iow,
"use-even-unneeded" may not be necessary, even though that might
help debugging).

Thanks.


* Re: [PATCH v5 00/21] Sparse Index: Design, Format, Tests
  2021-03-30 20:11         ` [PATCH v5 00/21] Sparse Index: Design, Format, Tests Junio C Hamano
@ 2021-03-30 21:31           ` Derrick Stolee
  2021-03-30 21:49             ` Junio C Hamano
  2021-04-01  5:59             ` Elijah Newren
  0 siblings, 2 replies; 203+ messages in thread
From: Derrick Stolee @ 2021-03-30 21:31 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, newren, pclouds, jrnieder, Martin Ågren,
	SZEDER Gábor, Ævar Arnfjörð Bjarmason,
	Derrick Stolee

On 3/30/2021 4:11 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>>      @@ repo-settings.c: void prepare_repo_settings(struct repository *r)
>>       +	 * Initialize this as off.
>>       +	 */
>>       +	r->settings.sparse_index = 0;
>>      -+	if (!repo_config_get_bool(r, "extensions.sparseindex", &value) && value)
>>      ++	if (!repo_config_get_bool(r, "index.sparse", &value) && value)
>>       +		r->settings.sparse_index = 1;
>>        }
> 
> It would be helpful to have a way for the repository owner to say
> "Even if the version of Git may be capable of handling 'sdir'
> extension, and my checkout uses sparse-cone settings, I do not want
> to use it", and the other way around, i.e. "Even if my checkout
> currently does not use sparse-cone settings, do use 'sdir'
> extension".  But for that, .sparse_index member may need to be
> tristate (i.e. forbidden, enable-if-needed, use-even-unneeded)?

I believe as presented, index.sparse=false will prevent the sdir
extension from being used. If index.sparse=true, then it will only
be used if sparse-checkout is enabled in cone mode.

I don't see the value in using the 'sdir' extension when not using
sparse-checkout in cone mode (and hence there are no sparse directory
entries in the index). What am I missing?

> We have a similar setting in index.version; I believe we always
> auto-demote 3 down to 2 when extended flags are not used, and
> I think "always auto-demote" would be sufficient (iow,
> "use-even-unneeded" may not be necessary, even though that might
> help debugging).

Yes, the same is happening here: we auto-demote to not use 'sdir'
if the other settings are not configured as well.

There is the rare scenario where these things all occur:

1. index.sparse = true
2. core.sparseCheckout = true
3. core.sparseCheckoutCone = true
4. Every path in the index matches the cone-mode patterns.

In this case, convert_to_sparse() is called and the
istate->sparse_index bit is set, telling do_write_index() to add the
'sdir' extension.
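
As a simplified model of the early-return checks in convert_to_sparse()
as they stand in this series (the struct and function names here are
hypothetical, and the real function does more work):

```c
/* Hypothetical snapshot of the settings that gate the conversion. */
struct sketch_settings {
	int index_sparse;               /* index.sparse */
	int core_sparse_checkout;       /* core.sparseCheckout */
	int core_sparse_checkout_cone;  /* core.sparseCheckoutCone */
	int split_index;                /* index is using the split-index feature */
};

/*
 * Returns 1 if the index would be marked sparse (and so written with
 * 'sdir'). Note that 'collapsed_dirs' is deliberately ignored: even
 * when every path matches the cone patterns and nothing collapses,
 * the result is still 1.
 */
static int sketch_marks_sparse(const struct sketch_settings *s,
			       int collapsed_dirs)
{
	(void)collapsed_dirs; /* not consulted by the current logic */

	if (s->split_index ||
	    !s->core_sparse_checkout ||
	    !s->core_sparse_checkout_cone)
		return 0;
	if (!s->index_sparse)
		return 0;
	return 1;
}
```

Avoiding 'sdir' in the fourth case would amount to also returning 0
when 'collapsed_dirs' is zero.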

This seems like a rare occurrence. Is it still worth adding logic
to avoid 'sdir' when these are all true?

Thanks,
-Stolee


* Re: [PATCH v5 00/21] Sparse Index: Design, Format, Tests
  2021-03-30 21:31           ` Derrick Stolee
@ 2021-03-30 21:49             ` Junio C Hamano
  2021-04-01  5:59             ` Elijah Newren
  1 sibling, 0 replies; 203+ messages in thread
From: Junio C Hamano @ 2021-03-30 21:49 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, newren, pclouds, jrnieder,
	Martin Ågren, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

>> We have a similar setting in index.version; I believe we always
>> auto-demote 3 down to 2 when extended flags are not used, and
>> I think "always auto-demote" would be sufficient (iow,
>> "use-even-unneeded" may not be necessary, even though that might
>> help debugging).
>
> Yes, the same is happening here: we auto-demote to not use 'sdir'
> if the other settings are not configured as well.
>
> There is the rare scenario where these things all occur:
> ...
> This seems like a rare occurrence. Is it still worth adding logic
> to avoid 'sdir' when these are all true?

You'd be the primary one who will be debugging the system while and
after this goes through the stabilization effort, so whichever you
find is more convenient is good enough for us, I guess.

Thanks.

* Re: [PATCH v5 00/21] Sparse Index: Design, Format, Tests
  2021-03-30 13:10       ` [PATCH v5 00/21] " Derrick Stolee via GitGitGadget
                           ` (21 preceding siblings ...)
  2021-03-30 20:11         ` [PATCH v5 00/21] Sparse Index: Design, Format, Tests Junio C Hamano
@ 2021-04-01  4:38         ` Elijah Newren
  22 siblings, 0 replies; 203+ messages in thread
From: Elijah Newren @ 2021-04-01  4:38 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

On Tue, Mar 30, 2021 at 6:11 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> Here is the first full patch series submission coming out of the
> sparse-index RFC [1].
>
> [1]
> https://lore.kernel.org/git/pull.847.git.1611596533.gitgitgadget@gmail.com/
>
> I won't waste too much space here, because PATCH 1 includes a sizeable
> design document that describes the feature, the reasoning behind it, and my
> plan for getting this implemented widely throughout the codebase.
>
> There are some new things here that were not in the RFC:
>
>  * Design doc and format updates. (Patch 1)
>  * Performance test script. (Patches 2 and 20)
>
> Notably missing in this series from the RFC:
>
>  * The mega-patch inserting ensure_full_index() throughout the codebase.
>    That will be a follow-up series to this one.
>  * The integrations with git status and git add to demonstrate the improved
>    performance. Those will also appear in their own series later.
>
> I plan to keep my latest work in this area in my 'sparse-index/wip' branch
> [2]. It includes all of the work from the RFC right now, updated with the
> work from this series.
>
> [2] https://github.com/derrickstolee/git/tree/sparse-index/wip
>
>
> Updates in V5
> =============
>
> This version is updated to use an index extension instead of a repository
> format extension. Thanks, Szeder! This one change affects the range-diff
> quite a bit, so please review those changes carefully.
>
> In particular: git sparse-checkout init --cone --sparse-index now sets a new
> index.sparse config option as an indicator that we should attempt writing
> the index in sparse form.
>
>
> Updates in V4
> =============
>
>  * Rebased onto the latest copy of ab/read-tree.
>  * Updated the design document as per Junio's comments.
>  * Updated the submodule handling in the performance test.
>  * Followed up on some other review from Ævar, mostly style or commit
>    message things.
>
>
> Updates in V3
> =============
>
> For this version, I took Ævar's latest patches and applied them to v2.31.0
> and rebased this series on top. It uses his new "read_tree_at()" helper and
> the associated changes to the function pointer type.
>
>  * Fixed more typos. Thanks Martin and Elijah!
>  * Updated the test_sparse_match() macro to use "$@" instead of $*
>  * Added a test that git sparse-checkout init --no-sparse-index rewrites the
>    index to be full.
>
>
> Updates in V2
> =============
>
>  * Various typos and awkward grammar are fixed.
>  * Cleaned up unnecessary commands in p2000-sparse-operations.sh
>  * Added a comment to the sparse_index member of struct index_state.
>  * Used tree_type, commit_type, and blob_type in test-read-cache.c.
>
> Thanks, -Stolee
>
> Derrick Stolee (21):
>   sparse-index: design doc and format update
>   t/perf: add performance test for sparse operations
>   t1092: clean up script quoting
>   sparse-index: add guard to ensure full index
>   sparse-index: implement ensure_full_index()
>   t1092: compare sparse-checkout to sparse-index
>   test-read-cache: print cache entries with --table
>   test-tool: don't force full index
>   unpack-trees: ensure full index
>   sparse-checkout: hold pattern list in index
>   sparse-index: add 'sdir' index extension
>   sparse-index: convert from full to sparse
>   submodule: sparse-index should not collapse links
>   unpack-trees: allow sparse directories
>   sparse-index: check index conversion happens
>   sparse-index: add index.sparse config option
>   sparse-checkout: toggle sparse index from builtin
>   sparse-checkout: disable sparse-index
>   cache-tree: integrate with sparse directory entries
>   sparse-index: loose integration with cache_tree_verify()
>   p2000: add sparse-index repos
>
>  Documentation/config/index.txt           |   5 +
>  Documentation/git-sparse-checkout.txt    |  14 ++
>  Documentation/technical/index-format.txt |  19 ++
>  Documentation/technical/sparse-index.txt | 175 ++++++++++++++
>  Makefile                                 |   1 +
>  builtin/sparse-checkout.c                |  44 +++-
>  cache-tree.c                             |  40 ++++
>  cache.h                                  |  18 +-
>  read-cache.c                             |  44 +++-
>  repo-settings.c                          |  15 ++
>  repository.c                             |  11 +-
>  repository.h                             |   3 +
>  sparse-index.c                           | 285 +++++++++++++++++++++++
>  sparse-index.h                           |  11 +
>  t/README                                 |   3 +
>  t/helper/test-read-cache.c               |  66 +++++-
>  t/perf/p2000-sparse-operations.sh        | 101 ++++++++
>  t/t1091-sparse-checkout-builtin.sh       |  13 ++
>  t/t1092-sparse-checkout-compatibility.sh | 143 ++++++++++--
>  unpack-trees.c                           |  17 +-
>  20 files changed, 988 insertions(+), 40 deletions(-)
>  create mode 100644 Documentation/technical/sparse-index.txt
>  create mode 100644 sparse-index.c
>  create mode 100644 sparse-index.h
>  create mode 100755 t/perf/p2000-sparse-operations.sh
>
>
> base-commit: 47957485b3b731a7860e0554d2bd12c0dce1c75a
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-883%2Fderrickstolee%2Fsparse-index%2Fformat-v5
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-883/derrickstolee/sparse-index/format-v5
> Pull-Request: https://github.com/gitgitgadget/git/pull/883
>
> Range-diff vs v4:
>
>   1:  6426a5c60e53 !  1:  7b600d536c6e sparse-index: design doc and format update
>      @@ Documentation/technical/sparse-index.txt (new)
>       +The only noticeable change in behavior will be that the serialized index
>       +file contains sparse-directory entries.
>       +
>      -+To start, we use a new repository extension, `extensions.sparseIndex`, to
>      -+allow inserting sparse-directory entries into indexes with file format
>      ++To start, we use a new required index extension, `sdir`, to allow
>      ++inserting sparse-directory entries into indexes with file format
>       +versions 2, 3, and 4. This prevents Git versions that do not understand
>      -+the sparse-index from operating on one, but it also prevents other
>      -+operations that do not use the index at all. A new format, index v5, will
>      -+be introduced that includes sparse-directory entries by default. It might
>      -+also introduce other features that have been considered for improving the
>      ++the sparse-index from operating on one, while allowing tools that do not
>      ++understand the sparse-index to operate on repositories as long as they do
>      ++not interact with the index. A new format, index v5, will be introduced
>      ++that includes sparse-directory entries by default. It might also
>      ++introduce other features that have been considered for improving the
>       +index, as well.
>       +
>       +Next, consumers of the index will be guarded against operating on a
>   2:  7eabc1d0586c =  2:  202253ec82f3 t/perf: add performance test for sparse operations
>   3:  c9e21d78ecba =  3:  437a0f144e57 t1092: clean up script quoting
>   4:  03cdde756563 =  4:  b7e1bf5c55a7 sparse-index: add guard to ensure full index
>   5:  6b3b6d86385d =  5:  e41d55d2cca9 sparse-index: implement ensure_full_index()
>   6:  7f67adba0498 =  6:  7bfbfbd17321 t1092: compare sparse-checkout to sparse-index
>   7:  7ebd9570b1ad =  7:  a1b8135c0fc8 test-read-cache: print cache entries with --table
>   8:  db7bbd06dbcc =  8:  dd84a2a9121b test-tool: don't force full index
>   9:  3ddd5e794b5e =  9:  b276d2ed5323 unpack-trees: ensure full index
>  10:  7308c87697f1 = 10:  c3651e26dc3a sparse-checkout: hold pattern list in index
>   -:  ------------ > 11:  f926cf8b2e01 sparse-index: add 'sdir' index extension
>  11:  7c10d653ca6b = 12:  c870ae5e8749 sparse-index: convert from full to sparse
>  12:  6db36f33e960 = 13:  bcf0da959ef3 submodule: sparse-index should not collapse links
>  13:  d24bd3348d98 = 14:  7191b48237de unpack-trees: allow sparse directories
>  14:  08d9f5f3c0d1 = 15:  57be9b4a728b sparse-index: check index conversion happens
>  15:  6f38cef196b0 ! 16:  c22b4111e49e sparse-index: create extension for compatibility
>      @@ Metadata
>       Author: Derrick Stolee <dstolee@microsoft.com>
>
>        ## Commit message ##
>      -    sparse-index: create extension for compatibility
>      +    sparse-index: add index.sparse config option
>
>      -    Previously, we enabled the sparse index format only using
>      -    GIT_TEST_SPARSE_INDEX=1. This is not a feasible direction for users to
>      -    actually select this mode. Further, sparse directory entries are not
>      -    understood by the index formats as advertised.
>      -
>      -    We _could_ add a new index version that explicitly adds these
>      -    capabilities, but there are nuances to index formats 2, 3, and 4 that
>      -    are still valuable to select as options. Until we add index format
>      -    version 5, create a repo extension, "extensions.sparseIndex", that
>      -    specifies that the tool reading this repository must understand sparse
>      -    directory entries.
>      -
>      -    This change only encodes the extension and enables it when
>      -    GIT_TEST_SPARSE_INDEX=1. Later, we will add a more user-friendly CLI
>      -    mechanism.
>      +    When enabled, this config option signals that index writes should
>      +    attempt to use sparse-directory entries.
>
>           Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>
>      - ## Documentation/config/extensions.txt ##
>      -@@ Documentation/config/extensions.txt: extensions.objectFormat::
>      - Note that this setting should only be set by linkgit:git-init[1] or
>      - linkgit:git-clone[1].  Trying to change it after initialization will not
>      - work and will produce hard-to-diagnose issues.
>      + ## Documentation/config/index.txt ##
>      +@@ Documentation/config/index.txt: index.recordOffsetTable::
>      +  Defaults to 'true' if index.threads has been explicitly enabled,
>      +  'false' otherwise.
>      +
>      ++index.sparse::
>      ++ When enabled, write the index using sparse-directory entries. This
>      ++ has no effect unless `core.sparseCheckout` and
>      ++ `core.sparseCheckoutCone` are both enabled. Defaults to 'false'.
>       +
>      -+extensions.sparseIndex::
>      -+ When combined with `core.sparseCheckout=true` and
>      -+ `core.sparseCheckoutCone=true`, the index may contain entries
>      -+ corresponding to directories outside of the sparse-checkout
>      -+ definition in lieu of containing each path under such directories.
>      -+ Versions of Git that do not understand this extension do not
>      -+ expect directory entries in the index.
>      + index.threads::
>      +  Specifies the number of threads to spawn when loading the index.
>      +  This is meant to reduce index load time on multiprocessor machines.
>
>        ## cache.h ##
>       @@ cache.h: struct repository_format {
>      @@ repo-settings.c: void prepare_repo_settings(struct repository *r)
>       +  * Initialize this as off.
>       +  */
>       + r->settings.sparse_index = 0;
>      -+ if (!repo_config_get_bool(r, "extensions.sparseindex", &value) && value)
>      ++ if (!repo_config_get_bool(r, "index.sparse", &value) && value)
>       +         r->settings.sparse_index = 1;
>        }
>
>      @@ repository.h: struct repo_settings {
>
>        struct repository {
>
>      - ## setup.c ##
>      -@@ setup.c: static enum extension_result handle_extension(const char *var,
>      -                  return error("invalid value for 'extensions.objectformat'");
>      -          data->hash_algo = format;
>      -          return EXTENSION_OK;
>      -+ } else if (!strcmp(ext, "sparseindex")) {
>      -+         data->sparse_index = 1;
>      -+         return EXTENSION_OK;
>      -  }
>      -  return EXTENSION_UNKNOWN;
>      - }
>      -
>        ## sparse-index.c ##
>       @@ sparse-index.c: static int convert_to_sparse_rec(struct index_state *istate,
>         return num_converted - start_converted;
>      @@ sparse-index.c: static int convert_to_sparse_rec(struct index_state *istate,
>       +{
>       + const char *config_path = repo_git_path(repo, "config.worktree");
>       +
>      -+ if (upgrade_repository_format(1) < 0) {
>      -+         warning(_("unable to upgrade repository format to enable sparse-index"));
>      -+         return -1;
>      -+ }
>       + git_config_set_in_file_gently(config_path,
>      -+                               "extensions.sparseIndex",
>      ++                               "index.sparse",
>       +                               "true");
>       +
>       + prepare_repo_settings(repo);
>      @@ sparse-index.c: static int convert_to_sparse_rec(struct index_state *istate,
>       +
>       + /*
>       +  * The GIT_TEST_SPARSE_INDEX environment variable triggers the
>      -+  * extensions.sparseIndex config variable to be on.
>      ++  * index.sparse config variable to be on.
>       +  */
>       + if (git_env_bool("GIT_TEST_SPARSE_INDEX", 0)) {
>       +         int err = enable_sparse_index(istate->repo);
>      @@ sparse-index.c: static int convert_to_sparse_rec(struct index_state *istate,
>       -  * GIT_TEST_SPARSE_INDEX environment variable. We will relax
>       -  * this once we have a proper way to opt-in (and later still,
>       -  * opt-out).
>      -+  * Only convert to sparse if extensions.sparseIndex is set.
>      ++  * Only convert to sparse if index.sparse is set.
>          */
>       - if (!git_env_bool("GIT_TEST_SPARSE_INDEX", 0))
>       + prepare_repo_settings(istate->repo);
>  16:  923081e7e079 ! 17:  75fe9b0f57da sparse-checkout: toggle sparse index from builtin
>      @@ Documentation/git-sparse-checkout.txt: To avoid interfering with other worktrees
>       +that is not completely understood by external tools. If you have trouble
>       +with this compatibility, then run `git sparse-checkout init --no-sparse-index`
>       +to rewrite your index to not be sparse. Older versions of Git will not
>      -+understand the `sparseIndex` repository extension and may fail to interact
>      -+with your repository until it is disabled.
>      ++understand the sparse directory entries index extension and may fail to
>      ++interact with your repository until it is disabled.
>
>        'set'::
>         Write a set of patterns to the sparse-checkout file, as given as
>      @@ builtin/sparse-checkout.c: static int sparse_checkout_init(int argc, const char
>
>        ## sparse-index.c ##
>       @@ sparse-index.c: static int convert_to_sparse_rec(struct index_state *istate,
>      +  return num_converted - start_converted;
>      + }
>
>      - static int enable_sparse_index(struct repository *repo)
>      +-static int enable_sparse_index(struct repository *repo)
>      ++static int set_index_sparse_config(struct repository *repo, int enable)
>        {
>       - const char *config_path = repo_git_path(repo, "config.worktree");
>      -+ int res;
>      -
>      -  if (upgrade_repository_format(1) < 0) {
>      -          warning(_("unable to upgrade repository format to enable sparse-index"));
>      -          return -1;
>      -  }
>      +-
>       - git_config_set_in_file_gently(config_path,
>      --                               "extensions.sparseIndex",
>      +-                               "index.sparse",
>       -                               "true");
>      -+ res = git_config_set_gently("extensions.sparseindex", "true");
>      ++ int res;
>      ++ char *config_path = repo_git_path(repo, "config.worktree");
>      ++ res = git_config_set_in_file_gently(config_path,
>      ++                                     "index.sparse",
>      ++                                     enable ? "true" : NULL);
>      ++ free(config_path);
>
>         prepare_repo_settings(repo);
>         repo->settings.sparse_index = 1;
>      @@ sparse-index.c: static int convert_to_sparse_rec(struct index_state *istate,
>       +
>       +int set_sparse_index_config(struct repository *repo, int enable)
>       +{
>      -+ int res;
>      -+
>      -+ if (enable)
>      -+         return enable_sparse_index(repo);
>      -+
>      -+ /* Don't downgrade repository format, just remove the extension. */
>      -+ res = git_config_set_gently("extensions.sparseindex", NULL);
>      ++ int res = set_index_sparse_config(repo, enable);
>       +
>       + prepare_repo_settings(repo);
>      -+ repo->settings.sparse_index = 0;
>      ++ repo->settings.sparse_index = enable;
>       + return res;
>        }
>
>      @@ sparse-index.c: static int convert_to_sparse_rec(struct index_state *istate,
>             !core_apply_sparse_checkout || !core_sparse_checkout_cone)
>                 return 0;
>       @@ sparse-index.c: int convert_to_sparse(struct index_state *istate)
>      -          istate->repo = the_repository;
>      -
>      -  /*
>      --  * The GIT_TEST_SPARSE_INDEX environment variable triggers the
>      --  * extensions.sparseIndex config variable to be on.
>      -+  * If GIT_TEST_SPARSE_INDEX=1, then trigger extensions.sparseIndex
>      -+  * to be fully enabled. If GIT_TEST_SPARSE_INDEX=0 (set explicitly),
>      -+  * then purposefully disable the setting.
>      +   * The GIT_TEST_SPARSE_INDEX environment variable triggers the
>      +   * index.sparse config variable to be on.
>          */
>       - if (git_env_bool("GIT_TEST_SPARSE_INDEX", 0)) {
>       -         int err = enable_sparse_index(istate->repo);
>      @@ sparse-index.c: int convert_to_sparse(struct index_state *istate)
>       +         set_sparse_index_config(istate->repo, test_env);
>
>         /*
>      -   * Only convert to sparse if extensions.sparseIndex is set.
>      +   * Only convert to sparse if index.sparse is set.
>
>        ## sparse-index.h ##
>       @@ sparse-index.h: struct index_state;
>      @@ t/t1092-sparse-checkout-compatibility.sh: init_repos () {
>       - GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout init --cone &&
>       - GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep
>       + git -C sparse-index sparse-checkout init --cone --sparse-index &&
>      -+ test_cmp_config -C sparse-index true extensions.sparseindex &&
>      ++ test_cmp_config -C sparse-index true index.sparse &&
>       + git -C sparse-index sparse-checkout set deep
>        }
>
>  17:  6f1ad72c390d ! 18:  7f55a232e647 sparse-checkout: disable sparse-index
>      @@ t/t1091-sparse-checkout-builtin.sh: test_expect_success 'sparse-checkout disable
>
>       +test_expect_success 'sparse-index enabled and disabled' '
>       + git -C repo sparse-checkout init --cone --sparse-index &&
>      -+ test_cmp_config -C repo true extensions.sparseIndex &&
>      ++ test_cmp_config -C repo true index.sparse &&
>       + test-tool -C repo read-cache --table >cache &&
>       + grep " tree " cache &&
>       +
>      @@ t/t1091-sparse-checkout-builtin.sh: test_expect_success 'sparse-checkout disable
>       + test-tool -C repo read-cache --table >cache &&
>       + ! grep " tree " cache &&
>       + git -C repo config --list >config &&
>      -+ ! grep extensions.sparseindex config
>      ++ ! grep index.sparse config
>       +'
>       +
>        test_expect_success 'cone mode: init and set' '
>  18:  bd94e6b7d089 = 19:  365901809d9d cache-tree: integrate with sparse directory entries
>  19:  e7190376b806 = 20:  9b068c458898 sparse-index: loose integration with cache_tree_verify()
>  20:  bcf0a58eb38c = 21:  66602733cc95 p2000: add sparse-index repos

I've read through the range-diff and individually read through the new
patch 11.  Perhaps unsurprisingly since you addressed all my feedback
by about round 3, I didn't find any problems with this new version.
Looks good to me.

* Re: [PATCH v5 00/21] Sparse Index: Design, Format, Tests
  2021-03-30 21:31           ` Derrick Stolee
  2021-03-30 21:49             ` Junio C Hamano
@ 2021-04-01  5:59             ` Elijah Newren
  1 sibling, 0 replies; 203+ messages in thread
From: Elijah Newren @ 2021-04-01  5:59 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Junio C Hamano, Derrick Stolee via GitGitGadget,
	Git Mailing List, Nguyễn Thái Ngọc,
	Jonathan Nieder, Martin Ågren, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

On Tue, Mar 30, 2021 at 2:31 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 3/30/2021 4:11 PM, Junio C Hamano wrote:
> > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> >
> >>      @@ repo-settings.c: void prepare_repo_settings(struct repository *r)
> >>       +       * Initialize this as off.
> >>       +       */
> >>       +      r->settings.sparse_index = 0;
> >>      -+      if (!repo_config_get_bool(r, "extensions.sparseindex", &value) && value)
> >>      ++      if (!repo_config_get_bool(r, "index.sparse", &value) && value)
> >>       +              r->settings.sparse_index = 1;
> >>        }
> >
> > It would be helpful to have a way for the repository owner to say
> > "Even if the version of Git may be capable of handling 'sdir'
> > extension, and my checkout uses sparse-cone settings, I do not want
> > to use it", and the other way around, i.e. "Even if my checkout
> > currently does not use sparse-cone settings, do use 'sdir'
> > extension".  But for that, .sparse_index member may need to be
> > tristate (i.e. forbidden, enable-if-needed, use-even-unneeded)?
>
> I believe as presented, index.sparse=false will prevent the sdir
> extension from being used. If index.sparse=true, then it will only
> be used if sparse-checkout is enabled in cone mode.
>
> I don't see the value in using the 'sdir' extension when not using
> sparse-checkout in cone mode (and hence there are no sparse directory
> entries in the index). What am I missing?
>
> > We have a similar setting in index.version; I believe we always
> > auto-demote 3 down to 2 when extended flags are not used, and
> > I think "always auto-demote" would be sufficient (iow,
> > "use-even-unneeded" may not be necessary, even though that might
> > help debugging).
>
> Yes, the same is happening here: we auto-demote to not use 'sdir'
> if the other settings are not configured as well.
>
> There is the rare scenario where these things all occur:
>
> 1. index.sparse = true
> 2. core.sparseCheckout = true
> 3. core.sparseCheckoutCone = true
> 4. Every path in the index matches the cone-mode patterns.
>
> In this case, convert_to_sparse() is called and the istate->sparse
> bit is set, telling do_write_index() to add the 'sdir' extension.
>
> This seems like a rare occurrence. Is it still worth adding logic
> to avoid 'sdir' when these are all true?

I'd agree that this would be very rare; probably indicative of someone
either having a bug in their sparsity patterns or making a simplistic
testcase to see how things operate.

end of thread, other threads:[~2021-04-01  6:01 UTC | newest]

Thread overview: 203+ messages
2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
2021-02-23 20:14 ` [PATCH 01/20] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
2021-02-24  1:13   ` Elijah Newren
2021-02-25 15:29     ` Derrick Stolee
2021-02-25 20:14       ` Elijah Newren
2021-02-23 20:14 ` [PATCH 02/20] t/perf: add performance test for sparse operations Derrick Stolee via GitGitGadget
2021-02-24  2:30   ` Elijah Newren
2021-03-09 20:03     ` Derrick Stolee
2021-02-23 20:14 ` [PATCH 03/20] t1092: clean up script quoting Derrick Stolee via GitGitGadget
2021-02-23 20:14 ` [PATCH 04/20] sparse-index: add guard to ensure full index Derrick Stolee via GitGitGadget
2021-02-24  2:44   ` Elijah Newren
2021-02-23 20:14 ` [PATCH 05/20] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
2021-02-24  3:20   ` Elijah Newren
2021-02-23 20:14 ` [PATCH 06/20] t1092: compare sparse-checkout to sparse-index Derrick Stolee via GitGitGadget
2021-02-25  6:37   ` Elijah Newren
2021-02-23 20:14 ` [PATCH 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
2021-02-25  7:02   ` Elijah Newren
2021-03-09 21:00     ` Derrick Stolee
2021-02-23 20:14 ` [PATCH 08/20] test-tool: don't force full index Derrick Stolee via GitGitGadget
2021-02-23 20:14 ` [PATCH 09/20] unpack-trees: ensure " Derrick Stolee via GitGitGadget
2021-02-23 20:14 ` [PATCH 10/20] sparse-checkout: hold pattern list in index Derrick Stolee via GitGitGadget
2021-02-25  7:14   ` Elijah Newren
2021-02-23 20:14 ` [PATCH 11/20] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
2021-02-25  7:33   ` Elijah Newren
2021-03-09 21:13     ` Derrick Stolee
2021-02-23 20:14 ` [PATCH 12/20] submodule: sparse-index should not collapse links Derrick Stolee via GitGitGadget
2021-02-23 20:14 ` [PATCH 13/20] unpack-trees: allow sparse directories Derrick Stolee via GitGitGadget
2021-02-25  7:40   ` Elijah Newren
2021-03-09 21:35     ` Derrick Stolee
2021-03-09 21:39       ` Elijah Newren
2021-02-23 20:14 ` [PATCH 14/20] sparse-index: check index conversion happens Derrick Stolee via GitGitGadget
2021-02-23 20:14 ` [PATCH 15/20] sparse-index: create extension for compatibility Derrick Stolee via GitGitGadget
2021-02-25  7:45   ` Elijah Newren
2021-03-09 21:45     ` Derrick Stolee
2021-02-23 20:14 ` [PATCH 16/20] sparse-checkout: toggle sparse index from builtin Derrick Stolee via GitGitGadget
2021-02-24 19:11   ` Martin Ågren
2021-03-09 20:52     ` Derrick Stolee
2021-03-09 21:03       ` Elijah Newren
2021-03-09 21:10         ` Derrick Stolee
2021-03-09 21:38           ` Elijah Newren
2021-03-14 20:08       ` Martin Ågren
2021-03-15 13:36         ` Derrick Stolee
2021-02-23 20:14 ` [PATCH 17/20] sparse-checkout: disable sparse-index Derrick Stolee via GitGitGadget
2021-02-27 12:32   ` SZEDER Gábor
2021-03-09 20:20     ` Derrick Stolee
2021-03-10 18:20       ` Derrick Stolee
2021-02-23 20:14 ` [PATCH 18/20] cache-tree: integrate with sparse directory entries Derrick Stolee via GitGitGadget
2021-02-23 20:14 ` [PATCH 19/20] sparse-index: loose integration with cache_tree_verify() Derrick Stolee via GitGitGadget
2021-02-23 20:14 ` [PATCH 20/20] p2000: add sparse-index repos Derrick Stolee via GitGitGadget
2021-02-23 23:49 ` [PATCH 00/20] Sparse Index: Design, Format, Tests Elijah Newren
2021-02-26 21:28   ` Elijah Newren
2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
2021-03-10 19:30   ` [PATCH v2 01/20] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
2021-03-10 22:19     ` Elijah Newren
2021-03-10 19:30   ` [PATCH v2 02/20] t/perf: add performance test for sparse operations Derrick Stolee via GitGitGadget
2021-03-10 19:30   ` [PATCH v2 03/20] t1092: clean up script quoting Derrick Stolee via GitGitGadget
2021-03-10 19:30   ` [PATCH v2 04/20] sparse-index: add guard to ensure full index Derrick Stolee via GitGitGadget
2021-03-10 19:30   ` [PATCH v2 05/20] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
2021-03-12  6:50     ` Junio C Hamano
2021-03-12 13:56       ` Derrick Stolee
2021-03-12 20:08         ` Junio C Hamano
2021-03-12 20:11           ` Derrick Stolee
2021-03-15 23:52             ` Ævar Arnfjörð Bjarmason
2021-03-10 19:30   ` [PATCH v2 06/20] t1092: compare sparse-checkout to sparse-index Derrick Stolee via GitGitGadget
2021-03-10 23:04     ` Elijah Newren
2021-03-11 14:17       ` Derrick Stolee
2021-03-10 19:30   ` [PATCH v2 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
2021-03-10 19:30   ` [PATCH v2 08/20] test-tool: don't force full index Derrick Stolee via GitGitGadget
2021-03-10 19:30   ` [PATCH v2 09/20] unpack-trees: ensure " Derrick Stolee via GitGitGadget
2021-03-10 19:30   ` [PATCH v2 10/20] sparse-checkout: hold pattern list in index Derrick Stolee via GitGitGadget
2021-03-10 19:30   ` [PATCH v2 11/20] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
2021-03-10 23:44     ` Elijah Newren
2021-03-11 14:13       ` Derrick Stolee
2021-03-10 19:30   ` [PATCH v2 12/20] submodule: sparse-index should not collapse links Derrick Stolee via GitGitGadget
2021-03-10 19:30   ` [PATCH v2 13/20] unpack-trees: allow sparse directories Derrick Stolee via GitGitGadget
2021-03-10 19:30   ` [PATCH v2 14/20] sparse-index: check index conversion happens Derrick Stolee via GitGitGadget
2021-03-10 19:30   ` [PATCH v2 15/20] sparse-index: create extension for compatibility Derrick Stolee via GitGitGadget
2021-03-10 19:30   ` [PATCH v2 16/20] sparse-checkout: toggle sparse index from builtin Derrick Stolee via GitGitGadget
2021-03-10 19:31   ` [PATCH v2 17/20] sparse-checkout: disable sparse-index Derrick Stolee via GitGitGadget
2021-03-10 19:31   ` [PATCH v2 18/20] cache-tree: integrate with sparse directory entries Derrick Stolee via GitGitGadget
2021-03-10 19:31   ` [PATCH v2 19/20] sparse-index: loose integration with cache_tree_verify() Derrick Stolee via GitGitGadget
2021-03-10 19:31   ` [PATCH v2 20/20] p2000: add sparse-index repos Derrick Stolee via GitGitGadget
2021-03-11  0:07   ` [PATCH v2 00/20] Sparse Index: Design, Format, Tests Elijah Newren
2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
2021-03-16 16:42     ` [PATCH v3 01/20] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
2021-03-19 23:43       ` Junio C Hamano
2021-03-23 11:16         ` Derrick Stolee
2021-03-23 20:10           ` Junio C Hamano
2021-03-23 20:42             ` Derrick Stolee
2021-03-16 16:42     ` [PATCH v3 02/20] t/perf: add performance test for sparse operations Derrick Stolee via GitGitGadget
2021-03-17  8:41       ` Ævar Arnfjörð Bjarmason
2021-03-17 13:05         ` Derrick Stolee
2021-03-17 13:21           ` Ævar Arnfjörð Bjarmason
2021-03-17 18:02             ` Derrick Stolee
2021-03-16 16:42     ` [PATCH v3 03/20] t1092: clean up script quoting Derrick Stolee via GitGitGadget
2021-03-17  8:47       ` Ævar Arnfjörð Bjarmason
2021-03-16 16:42     ` [PATCH v3 04/20] sparse-index: add guard to ensure full index Derrick Stolee via GitGitGadget
2021-03-16 16:42     ` [PATCH v3 05/20] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
2021-03-17 13:03       ` Ævar Arnfjörð Bjarmason
2021-03-16 16:42     ` [PATCH v3 06/20] t1092: compare sparse-checkout to sparse-index Derrick Stolee via GitGitGadget
2021-03-16 16:42     ` [PATCH v3 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
2021-03-17 13:28       ` [RFC/PATCH 0/5] " Ævar Arnfjörð Bjarmason
2021-03-17 18:28         ` Elijah Newren
2021-03-17 19:46           ` Derrick Stolee
2021-03-17 20:26             ` Elijah Newren
2021-03-17 20:34               ` Derrick Stolee
2021-03-17 13:28       ` [RFC/PATCH 1/5] ls-files: defer read_index() after parse_options() etc Ævar Arnfjörð Bjarmason
2021-03-17 13:28       ` [RFC/PATCH 2/5] ls-files: make "mode" in show_ce() loop a variable Ævar Arnfjörð Bjarmason
2021-03-17 18:11         ` Elijah Newren
2021-03-24  0:46           ` Ævar Arnfjörð Bjarmason
2021-03-17 13:28       ` [RFC/PATCH 3/5] ls-files: add and use a new --sparse option Ævar Arnfjörð Bjarmason
2021-03-17 18:19         ` Elijah Newren
2021-03-17 18:27           ` Ævar Arnfjörð Bjarmason
2021-03-17 18:44             ` Elijah Newren
2021-03-17 20:43         ` Derrick Stolee
2021-03-24  0:52           ` Ævar Arnfjörð Bjarmason
2021-03-17 13:28       ` [RFC/PATCH 4/5] test-tool read-cache: --table is redundant to ls-files Ævar Arnfjörð Bjarmason
2021-03-17 13:28       ` [RFC/PATCH 5/5] test-tool: split up test-tool read-cache Ævar Arnfjörð Bjarmason
2021-03-17 13:32         ` Ævar Arnfjörð Bjarmason
2021-03-16 16:42     ` [PATCH v3 08/20] test-tool: don't force full index Derrick Stolee via GitGitGadget
2021-03-16 16:42     ` [PATCH v3 09/20] unpack-trees: ensure " Derrick Stolee via GitGitGadget
2021-03-16 16:42     ` [PATCH v3 10/20] sparse-checkout: hold pattern list in index Derrick Stolee via GitGitGadget
2021-03-16 16:42     ` [PATCH v3 11/20] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
2021-03-17 13:43       ` Ævar Arnfjörð Bjarmason
2021-03-17 19:55         ` Derrick Stolee
2021-03-18 13:38           ` Derrick Stolee
2021-03-16 16:42     ` [PATCH v3 12/20] submodule: sparse-index should not collapse links Derrick Stolee via GitGitGadget
2021-03-16 16:42     ` [PATCH v3 13/20] unpack-trees: allow sparse directories Derrick Stolee via GitGitGadget
2021-03-17 13:35       ` Ævar Arnfjörð Bjarmason
2021-03-16 16:42     ` [PATCH v3 14/20] sparse-index: check index conversion happens Derrick Stolee via GitGitGadget
2021-03-16 16:42     ` [PATCH v3 15/20] sparse-index: create extension for compatibility Derrick Stolee via GitGitGadget
2021-03-16 16:42     ` [PATCH v3 16/20] sparse-checkout: toggle sparse index from builtin Derrick Stolee via GitGitGadget
2021-03-16 16:43     ` [PATCH v3 17/20] sparse-checkout: disable sparse-index Derrick Stolee via GitGitGadget
2021-03-16 16:43     ` [PATCH v3 18/20] cache-tree: integrate with sparse directory entries Derrick Stolee via GitGitGadget
2021-03-16 16:43     ` [PATCH v3 19/20] sparse-index: loose integration with cache_tree_verify() Derrick Stolee via GitGitGadget
2021-03-16 16:43     ` [PATCH v3 20/20] p2000: add sparse-index repos Derrick Stolee via GitGitGadget
2021-03-16 16:59     ` [PATCH v3 00/20] Sparse Index: Design, Format, Tests Derrick Stolee
2021-03-16 21:18     ` Elijah Newren
2021-03-18 21:50     ` Junio C Hamano
2021-03-19 13:00       ` Derrick Stolee
2021-03-23 13:44     ` [PATCH v4 " Derrick Stolee via GitGitGadget
2021-03-23 13:44       ` [PATCH v4 01/20] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
2021-03-26 20:29         ` SZEDER Gábor
2021-03-28  1:47           ` Junio C Hamano
2021-03-29 14:32             ` Derrick Stolee
2021-03-23 13:44       ` [PATCH v4 02/20] t/perf: add performance test for sparse operations Derrick Stolee via GitGitGadget
2021-03-23 13:44       ` [PATCH v4 03/20] t1092: clean up script quoting Derrick Stolee via GitGitGadget
2021-03-23 13:44       ` [PATCH v4 04/20] sparse-index: add guard to ensure full index Derrick Stolee via GitGitGadget
2021-03-23 13:44       ` [PATCH v4 05/20] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
2021-03-23 13:44       ` [PATCH v4 06/20] t1092: compare sparse-checkout to sparse-index Derrick Stolee via GitGitGadget
2021-03-23 13:44       ` [PATCH v4 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
2021-03-24  1:24         ` Ævar Arnfjörð Bjarmason
2021-03-24 12:33           ` Derrick Stolee
2021-03-25  3:41             ` Ævar Arnfjörð Bjarmason
2021-03-26  0:12               ` Elijah Newren
2021-03-28 15:31                 ` Ævar Arnfjörð Bjarmason
2021-03-29 19:46                   ` Derrick Stolee
2021-03-29 21:44                     ` Junio C Hamano
2021-03-30 11:28                       ` Derrick Stolee
2021-03-29 23:06                     ` Ævar Arnfjörð Bjarmason
2021-03-30 11:41                       ` Derrick Stolee
2021-03-29 22:02                   ` Elijah Newren
2021-03-23 13:44       ` [PATCH v4 08/20] test-tool: don't force full index Derrick Stolee via GitGitGadget
2021-03-23 13:44       ` [PATCH v4 09/20] unpack-trees: ensure " Derrick Stolee via GitGitGadget
2021-03-23 13:44       ` [PATCH v4 10/20] sparse-checkout: hold pattern list in index Derrick Stolee via GitGitGadget
2021-03-23 13:44       ` [PATCH v4 11/20] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
2021-03-23 13:44       ` [PATCH v4 12/20] submodule: sparse-index should not collapse links Derrick Stolee via GitGitGadget
2021-03-23 13:44       ` [PATCH v4 13/20] unpack-trees: allow sparse directories Derrick Stolee via GitGitGadget
2021-03-23 13:44       ` [PATCH v4 14/20] sparse-index: check index conversion happens Derrick Stolee via GitGitGadget
2021-03-23 13:44       ` [PATCH v4 15/20] sparse-index: create extension for compatibility Derrick Stolee via GitGitGadget
2021-03-23 13:44       ` [PATCH v4 16/20] sparse-checkout: toggle sparse index from builtin Derrick Stolee via GitGitGadget
2021-03-23 13:44       ` [PATCH v4 17/20] sparse-checkout: disable sparse-index Derrick Stolee via GitGitGadget
2021-03-23 13:44       ` [PATCH v4 18/20] cache-tree: integrate with sparse directory entries Derrick Stolee via GitGitGadget
2021-03-23 13:44       ` [PATCH v4 19/20] sparse-index: loose integration with cache_tree_verify() Derrick Stolee via GitGitGadget
2021-03-23 13:44       ` [PATCH v4 20/20] p2000: add sparse-index repos Derrick Stolee via GitGitGadget
2021-03-23 16:16       ` [PATCH v4 00/20] Sparse Index: Design, Format, Tests Elijah Newren
2021-03-30 13:10       ` [PATCH v5 00/21] " Derrick Stolee via GitGitGadget
2021-03-30 13:10         ` [PATCH v5 01/21] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
2021-03-30 13:10         ` [PATCH v5 02/21] t/perf: add performance test for sparse operations Derrick Stolee via GitGitGadget
2021-03-30 13:10         ` [PATCH v5 03/21] t1092: clean up script quoting Derrick Stolee via GitGitGadget
2021-03-30 13:10         ` [PATCH v5 04/21] sparse-index: add guard to ensure full index Derrick Stolee via GitGitGadget
2021-03-30 13:10         ` [PATCH v5 05/21] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
2021-03-30 13:10         ` [PATCH v5 06/21] t1092: compare sparse-checkout to sparse-index Derrick Stolee via GitGitGadget
2021-03-30 13:10         ` [PATCH v5 07/21] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
2021-03-30 13:10         ` [PATCH v5 08/21] test-tool: don't force full index Derrick Stolee via GitGitGadget
2021-03-30 13:10         ` [PATCH v5 09/21] unpack-trees: ensure " Derrick Stolee via GitGitGadget
2021-03-30 13:10         ` [PATCH v5 10/21] sparse-checkout: hold pattern list in index Derrick Stolee via GitGitGadget
2021-03-30 13:10         ` [PATCH v5 11/21] sparse-index: add 'sdir' index extension Derrick Stolee via GitGitGadget
2021-03-30 13:10         ` [PATCH v5 12/21] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
2021-03-30 13:10         ` [PATCH v5 13/21] submodule: sparse-index should not collapse links Derrick Stolee via GitGitGadget
2021-03-30 13:10         ` [PATCH v5 14/21] unpack-trees: allow sparse directories Derrick Stolee via GitGitGadget
2021-03-30 13:10         ` [PATCH v5 15/21] sparse-index: check index conversion happens Derrick Stolee via GitGitGadget
2021-03-30 13:10         ` [PATCH v5 16/21] sparse-index: add index.sparse config option Derrick Stolee via GitGitGadget
2021-03-30 13:11         ` [PATCH v5 17/21] sparse-checkout: toggle sparse index from builtin Derrick Stolee via GitGitGadget
2021-03-30 13:11         ` [PATCH v5 18/21] sparse-checkout: disable sparse-index Derrick Stolee via GitGitGadget
2021-03-30 13:11         ` [PATCH v5 19/21] cache-tree: integrate with sparse directory entries Derrick Stolee via GitGitGadget
2021-03-30 13:11         ` [PATCH v5 20/21] sparse-index: loose integration with cache_tree_verify() Derrick Stolee via GitGitGadget
2021-03-30 13:11         ` [PATCH v5 21/21] p2000: add sparse-index repos Derrick Stolee via GitGitGadget
2021-03-30 20:11         ` [PATCH v5 00/21] Sparse Index: Design, Format, Tests Junio C Hamano
2021-03-30 21:31           ` Derrick Stolee
2021-03-30 21:49             ` Junio C Hamano
2021-04-01  5:59             ` Elijah Newren
2021-04-01  4:38         ` Elijah Newren
