git.vger.kernel.org archive mirror
* [PATCH v1 0/2] grep: integrate with sparse index
@ 2022-08-17  7:56 Shaoxuan Yuan
  2022-08-17  7:56 ` [PATCH v1 1/2] builtin/grep.c: add --sparse option Shaoxuan Yuan
                   ` (7 more replies)
  0 siblings, 8 replies; 69+ messages in thread
From: Shaoxuan Yuan @ 2022-08-17  7:56 UTC (permalink / raw)
  To: git; +Cc: derrickstolee, vdye, Shaoxuan Yuan

Integrate `git-grep` with sparse-index and test the performance
improvement.

Note: This series is based on 'next' because the 'rm' series
ede241c715 (rm: integrate with sparse-index, Aug 7th 2022) is in
'next', and the test cases overlap. Basing on top of 'next' makes sure
there are no conflicts, which reduces work for Junio.

Shaoxuan Yuan (2):
  builtin/grep.c: add --sparse option
  builtin/grep.c: integrate with sparse index

 builtin/grep.c                           | 18 +++++++++++++++---
 t/perf/p2000-sparse-operations.sh        |  1 +
 t/t1092-sparse-checkout-compatibility.sh | 17 +++++++++++++++++
 t/t7817-grep-sparse-checkout.sh          | 12 ++++++------
 4 files changed, 39 insertions(+), 9 deletions(-)


base-commit: c19287026c9b940f7f43d34e6dacbd5c34e4a2e0
-- 
2.37.0



* [PATCH v1 1/2] builtin/grep.c: add --sparse option
  2022-08-17  7:56 [PATCH v1 0/2] grep: integrate with sparse index Shaoxuan Yuan
@ 2022-08-17  7:56 ` Shaoxuan Yuan
  2022-08-17 14:12   ` Derrick Stolee
  2022-08-17  7:56 ` [PATCH v1 2/2] builtin/grep.c: integrate with sparse index Shaoxuan Yuan
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 69+ messages in thread
From: Shaoxuan Yuan @ 2022-08-17  7:56 UTC (permalink / raw)
  To: git; +Cc: derrickstolee, vdye, Shaoxuan Yuan

Add a --sparse option to `git-grep`. This option is mainly used to:

If searching in the index (using --cached):

With --sparse, proceed the action when the current cache_entry is
marked with SKIP_WORKTREE bit (the default is to skip this kind of
entry). Before this patch, --cached itself can realize this action.
Adding --sparse here grants the user finer control over sparse
entries. If the user only wants to peak into the index without
caring about sparse entries, --cached should suffice; if the user
wants to peak into the index _and_ cares about sparse entries,
combining --sparse with --cached can address this need.

Suggested-by: Victoria Dye <vdye@github.com>
Signed-off-by: Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>
---
 builtin/grep.c                  | 10 +++++++++-
 t/t7817-grep-sparse-checkout.sh | 12 ++++++------
 2 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index e6bcdf860c..61402e8084 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -96,6 +96,8 @@ static pthread_cond_t cond_result;
 
 static int skip_first_line;
 
+static int grep_sparse = 0;
+
 static void add_work(struct grep_opt *opt, struct grep_source *gs)
 {
 	if (opt->binary != GREP_BINARY_TEXT)
@@ -525,7 +527,11 @@ static int grep_cache(struct grep_opt *opt,
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
 
-		if (!cached && ce_skip_worktree(ce))
+		/*
+		 * If ce is a SKIP_WORKTREE entry, look into it when both
+		 * --sparse and --cached are given.
+		 */
+		if (!(grep_sparse && cached) && ce_skip_worktree(ce))
 			continue;
 
 		strbuf_setlen(&name, name_base_len);
@@ -963,6 +969,8 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 			   PARSE_OPT_NOCOMPLETE),
 		OPT_INTEGER('m', "max-count", &opt.max_count,
 			N_("maximum number of results per file")),
+		OPT_BOOL(0, "sparse", &grep_sparse,
+			 N_("search sparse contents and expand sparse index")),
 		OPT_END()
 	};
 	grep_prefix = prefix;
diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
index eb59564565..ca71f526eb 100755
--- a/t/t7817-grep-sparse-checkout.sh
+++ b/t/t7817-grep-sparse-checkout.sh
@@ -118,13 +118,13 @@ test_expect_success 'grep searches unmerged file despite not matching sparsity p
 	test_cmp expect actual
 '
 
-test_expect_success 'grep --cached searches entries with the SKIP_WORKTREE bit' '
+test_expect_success 'grep --cached and --sparse searches entries with the SKIP_WORKTREE bit' '
 	cat >expect <<-EOF &&
 	a:text
 	b:text
 	dir/c:text
 	EOF
-	git grep --cached "text" >actual &&
+	git grep --cached --sparse "text" >actual &&
 	test_cmp expect actual
 '
 
@@ -143,7 +143,7 @@ test_expect_success 'grep --recurse-submodules honors sparse checkout in submodu
 	test_cmp expect actual
 '
 
-test_expect_success 'grep --recurse-submodules --cached searches entries with the SKIP_WORKTREE bit' '
+test_expect_success 'grep --recurse-submodules --cached and --sparse searches entries with the SKIP_WORKTREE bit' '
 	cat >expect <<-EOF &&
 	a:text
 	b:text
@@ -152,7 +152,7 @@ test_expect_success 'grep --recurse-submodules --cached searches entries with th
 	sub/B/b:text
 	sub2/a:text
 	EOF
-	git grep --recurse-submodules --cached "text" >actual &&
+	git grep --recurse-submodules --cached --sparse "text" >actual &&
 	test_cmp expect actual
 '
 
@@ -166,7 +166,7 @@ test_expect_success 'working tree grep does not search the index with CE_VALID a
 	test_cmp expect actual
 '
 
-test_expect_success 'grep --cached searches index entries with both CE_VALID and SKIP_WORKTREE' '
+test_expect_success 'grep --cached and --sparse searches index entries with both CE_VALID and SKIP_WORKTREE' '
 	cat >expect <<-EOF &&
 	a:text
 	b:text
@@ -174,7 +174,7 @@ test_expect_success 'grep --cached searches index entries with both CE_VALID and
 	EOF
 	test_when_finished "git update-index --no-assume-unchanged b" &&
 	git update-index --assume-unchanged b &&
-	git grep --cached text >actual &&
+	git grep --cached --sparse text >actual &&
 	test_cmp expect actual
 '
 
-- 
2.37.0



* [PATCH v1 2/2] builtin/grep.c: integrate with sparse index
  2022-08-17  7:56 [PATCH v1 0/2] grep: integrate with sparse index Shaoxuan Yuan
  2022-08-17  7:56 ` [PATCH v1 1/2] builtin/grep.c: add --sparse option Shaoxuan Yuan
@ 2022-08-17  7:56 ` Shaoxuan Yuan
  2022-08-17 14:23   ` Derrick Stolee
  2022-08-17 13:46 ` [PATCH v1 0/2] grep: " Derrick Stolee
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 69+ messages in thread
From: Shaoxuan Yuan @ 2022-08-17  7:56 UTC (permalink / raw)
  To: git; +Cc: derrickstolee, vdye, Shaoxuan Yuan

Turn on sparse index and remove ensure_full_index().

Change it to only expand the index when using --sparse.

The p2000 tests demonstrate a ~99.4% execution time reduction for
`git grep` using a sparse index.

Test                                           HEAD~1       HEAD
-----------------------------------------------------------------------------
2000.78: git grep --cached bogus (full-v3)     0.019        0.018  (-5.2%)
2000.79: git grep --cached bogus (full-v4)     0.017        0.016  (-5.8%)
2000.80: git grep --cached bogus (sparse-v3)   0.29         0.0015 (-99.4%)
2000.81: git grep --cached bogus (sparse-v4)   0.30         0.0018 (-99.4%)

Optional reading about performance test results
-----------------------------------------------
Notice that because `git-grep` needs to parse blobs in the index, the
index reading time is minuscule compared to the object parsing time.
And because of this, the p2000 test results cannot clearly reflect the
speedup for index reading: combined with the object parsing time,
the aggregated time difference is extremely close between HEAD~1 and
HEAD.

Hence, the results presented here are not directly extracted from the
p2000 test results. Instead, to make the performance difference more
visible, the test command is manually run with GIT_TRACE2_PERF in the
four repos (full-v3, sparse-v3, full-v4, sparse-v4). The numbers here
are then extracted from the time difference between "region_enter" and
"region_leave" of label "do_read_index".

Helped-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>
---
 builtin/grep.c                           |  8 ++++++--
 t/perf/p2000-sparse-operations.sh        |  1 +
 t/t1092-sparse-checkout-compatibility.sh | 17 +++++++++++++++++
 3 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index 61402e8084..cbaab604fd 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -519,11 +519,15 @@ static int grep_cache(struct grep_opt *opt,
 		strbuf_addstr(&name, repo->submodule_prefix);
 	}
 
+	prepare_repo_settings(repo);
+	repo->settings.command_requires_full_index = 0;
+
 	if (repo_read_index(repo) < 0)
 		die(_("index file corrupt"));
 
-	/* TODO: audit for interaction with sparse-index. */
-	ensure_full_index(repo->index);
+	if (grep_sparse)
+		ensure_full_index(repo->index);
+
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
 
diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
index fce8151d41..9a466fcbbe 100755
--- a/t/perf/p2000-sparse-operations.sh
+++ b/t/perf/p2000-sparse-operations.sh
@@ -124,5 +124,6 @@ test_perf_on_all git read-tree -mu HEAD
 test_perf_on_all git checkout-index -f --all
 test_perf_on_all git update-index --add --remove $SPARSE_CONE/a
 test_perf_on_all "git rm -f $SPARSE_CONE/a && git checkout HEAD -- $SPARSE_CONE/a"
+test_perf_on_all git grep --cached bogus
 
 test_done
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index a6a14c8a21..a9bb6734f6 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -1972,4 +1972,21 @@ test_expect_success 'sparse index is not expanded: rm' '
 	ensure_not_expanded rm -r deep
 '
 
+test_expect_success 'grep expands index using --sparse' '
+	init_repos &&
+
+	# With --sparse and --cached, do not ignore sparse entries and
+	# expand the index.
+	test_all_match git grep --sparse --cached a
+'
+
+test_expect_success 'grep is not expanded' '
+	init_repos &&
+
+	ensure_not_expanded grep a &&
+	ensure_not_expanded grep a -- deep/* &&
+	# grep does not match anything per se, so ! is used
+	ensure_not_expanded ! grep a -- folder1/*
+'
+
 test_done
-- 
2.37.0



* Re: [PATCH v1 0/2] grep: integrate with sparse index
  2022-08-17  7:56 [PATCH v1 0/2] grep: integrate with sparse index Shaoxuan Yuan
  2022-08-17  7:56 ` [PATCH v1 1/2] builtin/grep.c: add --sparse option Shaoxuan Yuan
  2022-08-17  7:56 ` [PATCH v1 2/2] builtin/grep.c: integrate with sparse index Shaoxuan Yuan
@ 2022-08-17 13:46 ` Derrick Stolee
  2022-08-29 23:28 ` [PATCH v2 " Shaoxuan Yuan
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 69+ messages in thread
From: Derrick Stolee @ 2022-08-17 13:46 UTC (permalink / raw)
  To: Shaoxuan Yuan, git; +Cc: vdye

On 8/17/2022 3:56 AM, Shaoxuan Yuan wrote:
> Integrate `git-grep` with sparse-index and test the performance
> improvement.
> 
> Note: This series is based on 'next' because the 'rm' series
> ede241c715 (rm: integrate with sparse-index, Aug 7th 2022) is in
> 'next', and the test cases overlap. Basing on top of 'next' makes sure
> there are no conflicts, which reduces work for Junio.

Do not base things directly on 'next' because that branch can be
completely rewritten and changes can make it difficult to apply your
patches.

Instead, you can base your change directly on the sy/sparse-rm branch.

Please update the base in v2, but I'll take a review of these patches now.

Thanks,
-Stolee


* Re: [PATCH v1 1/2] builtin/grep.c: add --sparse option
  2022-08-17  7:56 ` [PATCH v1 1/2] builtin/grep.c: add --sparse option Shaoxuan Yuan
@ 2022-08-17 14:12   ` Derrick Stolee
  2022-08-17 17:13     ` Junio C Hamano
                       ` (2 more replies)
  0 siblings, 3 replies; 69+ messages in thread
From: Derrick Stolee @ 2022-08-17 14:12 UTC (permalink / raw)
  To: Shaoxuan Yuan, git; +Cc: vdye

On 8/17/2022 3:56 AM, Shaoxuan Yuan wrote:
> Add a --sparse option to `git-grep`. This option is mainly used to:
> 
> If searching in the index (using --cached):
> 
> With --sparse, proceed the action when the current cache_entry is

This phrasing is awkward. It might be better to reframe to describe the
_why_ before the _what_

  When the '--cached' option is used with the 'git grep' command, the
  search is limited to the blobs found in the index, not in the worktree.
  If the user has enabled sparse-checkout, this might present more results
  than they would like, since the files outside of the sparse-checkout are
  unlikely to be important to them.

  Change the default behavior of 'git grep' to focus on the files within
  the sparse-checkout definition. To enable the previous behavior, add a
  '--sparse' option to 'git grep' that triggers the old behavior that
  inspects paths outside of the sparse-checkout definition when paired
  with the '--cached' option.

Or something like that. The documentation updates will also help clarify
what happens when '--cached' is not included. I assume '--sparse' is
ignored, but perhaps it _could_ allow looking at the cached files outside
the sparse-checkout definition. This could make the simpler invocation of
'git grep --sparse <pattern>' be the way that users can search after their
attempt to search the worktree failed.

> marked with SKIP_WORKTREE bit (the default is to skip this kind of
> entry). Before this patch, --cached itself can realize this action.
> Adding --sparse here grants the user finer control over sparse
> entries. If the user only wants to peak into the index without

s/peak/peek/

> caring about sparse entries, --cached should suffice; if the user
> wants to peak into the index _and_ cares about sparse entries,
> combining --sparse with --cached can address this need.
> 
> Suggested-by: Victoria Dye <vdye@github.com>
> Signed-off-by: Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>
> ---
>  builtin/grep.c                  | 10 +++++++++-
>  t/t7817-grep-sparse-checkout.sh | 12 ++++++------
>  2 files changed, 15 insertions(+), 7 deletions(-)

You mentioned in Slack that you missed the documentation of the --sparse
option. Just pointing it out here so we don't forget.

> 
> diff --git a/builtin/grep.c b/builtin/grep.c
> index e6bcdf860c..61402e8084 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -96,6 +96,8 @@ static pthread_cond_t cond_result;
>  
>  static int skip_first_line;
>  
> +static int grep_sparse = 0;
> +

I initially thought it might be good to not define an additional global,
but there are many defined in this file outside of the context and they
are spread out with extra whitespace like this.

>  static void add_work(struct grep_opt *opt, struct grep_source *gs)
>  {
>  	if (opt->binary != GREP_BINARY_TEXT)
> @@ -525,7 +527,11 @@ static int grep_cache(struct grep_opt *opt,
>  	for (nr = 0; nr < repo->index->cache_nr; nr++) {
>  		const struct cache_entry *ce = repo->index->cache[nr];
>  
> -		if (!cached && ce_skip_worktree(ce))

This logic would skip files marked with SKIP_WORKTREE _unless_ --cached
was provided.

> +		/*
> +		 * If ce is a SKIP_WORKTREE entry, look into it when both
> +		 * --sparse and --cached are given.
> +		 */
> +		if (!(grep_sparse && cached) && ce_skip_worktree(ce))
>  			continue;

The logic of this if statement is backwards from the comment because a
true statement means "skip the entry" _not_ "look into it".

	/*
	 * Skip entries with SKIP_WORKTREE unless both --sparse and
	 * --cached are given.
	 */

But again, we might want to consider this alternative:

	/*
	 * Skip entries with SKIP_WORKTREE unless --sparse is given.
	 */
	if (!grep_sparse && ce_skip_worktree(ce))
		continue;

This will require further changes below, specifically this bit:

			/*
			 * If CE_VALID is on, we assume worktree file and its
			 * cache entry are identical, even if worktree file has
			 * been modified, so use cache version instead
			 */
			if (cached || (ce->ce_flags & CE_VALID)) {
				if (ce_stage(ce) || ce_intent_to_add(ce))
					continue;
				hit |= grep_oid(opt, &ce->oid, name.buf,
						 0, name.buf);
			} else {

We need to activate this grep_oid() call also when ce_skip_worktree(ce) is
true. That is, if we want 'git grep --sparse' to extend the search beyond
the worktree and into the sparse entries.
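The three skip conditions discussed above can be compared side by side.
What follows is a minimal standalone sketch, not code from builtin/grep.c:
the function names skip_before(), skip_v1() and skip_alt() are made up for
illustration, and each one returns non-zero when the loop would "continue"
past the entry.

```c
#include <assert.h>
#include <stdio.h>

/* Before the patch: skip sparse entries unless --cached is given. */
int skip_before(int cached, int grep_sparse, int skip_worktree)
{
	(void)grep_sparse; /* the option did not exist yet */
	return !cached && skip_worktree;
}

/* Patch as posted: skip unless both --sparse and --cached are given. */
int skip_v1(int cached, int grep_sparse, int skip_worktree)
{
	return !(grep_sparse && cached) && skip_worktree;
}

/* Suggested alternative: skip unless --sparse is given. */
int skip_alt(int cached, int grep_sparse, int skip_worktree)
{
	(void)cached;
	return !grep_sparse && skip_worktree;
}

/* Print one row per option combination for a SKIP_WORKTREE entry. */
void print_skip_table(void)
{
	int cached, sparse;

	printf("cached sparse | before v1 alt\n");
	for (cached = 0; cached <= 1; cached++) {
		for (sparse = 0; sparse <= 1; sparse++) {
			/* Entries without SKIP_WORKTREE are never skipped. */
			assert(!skip_v1(cached, sparse, 0));
			printf("%6d %6d | %6d %2d %3d\n", cached, sparse,
			       skip_before(cached, sparse, 1),
			       skip_v1(cached, sparse, 1),
			       skip_alt(cached, sparse, 1));
		}
	}
}
```

Calling print_skip_table() from a main() shows where the variants differ:
with --cached alone, the old code searches sparse entries while both new
variants skip them, and with --sparse alone, only the alternative searches
them.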

>  
>  		strbuf_setlen(&name, name_base_len);
> @@ -963,6 +969,8 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
>  			   PARSE_OPT_NOCOMPLETE),
>  		OPT_INTEGER('m', "max-count", &opt.max_count,
>  			N_("maximum number of results per file")),
> +		OPT_BOOL(0, "sparse", &grep_sparse,
> +			 N_("search sparse contents and expand sparse index")),

This "and expand sparse index" is an internal implementation detail, not a
helpful item for the help text. Instead, perhaps:

	"search the contents of files outside the sparse-checkout definition"

(Also, while the sparse index is being expanded right now, I would expect
to not expand the sparse index by the end of the series.)

> -test_expect_success 'grep --cached searches entries with the SKIP_WORKTREE bit' '
> +test_expect_success 'grep --cached and --sparse searches entries with the SKIP_WORKTREE bit' '
>  	cat >expect <<-EOF &&
>  	a:text
>  	b:text
>  	dir/c:text
>  	EOF
> -	git grep --cached "text" >actual &&
> +	git grep --cached --sparse "text" >actual &&
>  	test_cmp expect actual
>  '

Please add a test that demonstrates the change of behavior when only --cached
is provided, not --sparse.

(If you take my suggestion to allow 'git grep --sparse' to do something
different, then also add a test for that case.)

>  
> @@ -143,7 +143,7 @@ test_expect_success 'grep --recurse-submodules honors sparse checkout in submodu
>  	test_cmp expect actual
>  '
>  
> -test_expect_success 'grep --recurse-submodules --cached searches entries with the SKIP_WORKTREE bit' '
> +test_expect_success 'grep --recurse-submodules --cached and --sparse searches entries with the SKIP_WORKTREE bit' '
>  	cat >expect <<-EOF &&
>  	a:text
>  	b:text
> @@ -152,7 +152,7 @@ test_expect_success 'grep --recurse-submodules --cached searches entries with th
>  	sub/B/b:text
>  	sub2/a:text
>  	EOF
> -	git grep --recurse-submodules --cached "text" >actual &&
> +	git grep --recurse-submodules --cached --sparse "text" >actual &&
>  	test_cmp expect actual
>  '
> @@ -166,7 +166,7 @@ test_expect_success 'working tree grep does not search the index with CE_VALID a
>  	test_cmp expect actual
>  '
>  
> -test_expect_success 'grep --cached searches index entries with both CE_VALID and SKIP_WORKTREE' '
> +test_expect_success 'grep --cached and --sparse searches index entries with both CE_VALID and SKIP_WORKTREE' '
>  	cat >expect <<-EOF &&
>  	a:text
>  	b:text
> @@ -174,7 +174,7 @@ test_expect_success 'grep --cached searches index entries with both CE_VALID and
>  	EOF
>  	test_when_finished "git update-index --no-assume-unchanged b" &&
>  	git update-index --assume-unchanged b &&
> -	git grep --cached text >actual &&
> +	git grep --cached --sparse text >actual &&
>  	test_cmp expect actual
>  '

Same with these two tests. Add additional commands that show the change of
behavior when only using '--cached'.

Thanks,
-Stolee


* Re: [PATCH v1 2/2] builtin/grep.c: integrate with sparse index
  2022-08-17  7:56 ` [PATCH v1 2/2] builtin/grep.c: integrate with sparse index Shaoxuan Yuan
@ 2022-08-17 14:23   ` Derrick Stolee
  2022-08-24 21:06     ` Shaoxuan Yuan
  0 siblings, 1 reply; 69+ messages in thread
From: Derrick Stolee @ 2022-08-17 14:23 UTC (permalink / raw)
  To: Shaoxuan Yuan, git; +Cc: vdye

On 8/17/2022 3:56 AM, Shaoxuan Yuan wrote:
> Turn on sparse index and remove ensure_full_index().
> 
> Change it to only expand the index when using --sparse.
> 
> The p2000 tests demonstrate a ~99.4% execution time reduction for
> `git grep` using a sparse index.
> 
> Test                                           HEAD~1       HEAD
> -----------------------------------------------------------------------------
> 2000.78: git grep --cached bogus (full-v3)     0.019        0.018  (-5.2%)
> 2000.79: git grep --cached bogus (full-v4)     0.017        0.016  (-5.8%)
> 2000.80: git grep --cached bogus (sparse-v3)   0.29         0.0015 (-99.4%)
> 2000.81: git grep --cached bogus (sparse-v4)   0.30         0.0018 (-99.4%)

Good results.

I think we could get interesting results even with the --sparse
option if you go another step further (perhaps as a patch after
this one).

> 
> Optional reading about performance test results
> -----------------------------------------------
> Notice that because `git-grep` needs to parse blobs in the index, the
> index reading time is minuscule compared to the object parsing time.
> And because of this, the p2000 test results cannot clearly reflect the
> speedup for index reading: combined with the object parsing time,
> the aggregated time difference is extremely close between HEAD~1 and
> HEAD.
> 
> Hence, the results presented here are not directly extracted from the
> p2000 test results. Instead, to make the performance difference more
> visible, the test command is manually run with GIT_TRACE2_PERF in the
> four repos (full-v3, sparse-v3, full-v4, sparse-v4). The numbers here
> are then extracted from the time difference between "region_enter" and
> "region_leave" of label "do_read_index".

This is a good point, but I don't recommend displaying them as if they
were the output of a "./run HEAD~1 HEAD -- p2000-sparse-operations.sh"
command. Instead, point out that the performance test does not show a
major improvement and instead you have these "Before" and "After" results
from testing manually and extracting trace2 regions.
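For "Before" and "After" numbers taken from trace2 regions, the percentage
is the usual relative reduction, (before - after) / before. A minimal
sketch in C, assuming the do_read_index timings quoted above as inputs;
reduction_pct() and print_reductions() are hypothetical helper names, not
part of git or its perf suite.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of git): relative time reduction, in
 * percent, going from "before" to "after".
 */
double reduction_pct(double before, double after)
{
	assert(before > 0.0);
	return 100.0 * (before - after) / before;
}

/* Print the reduction for the two sparse-index rows quoted above. */
void print_reductions(void)
{
	printf("sparse-v3: -%.2f%%\n", reduction_pct(0.29, 0.0015));
	printf("sparse-v4: -%.2f%%\n", reduction_pct(0.30, 0.0018));
}
```

Both sparse rows come out at roughly 99.4%, which agrees with the figure
given in the commit message.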

> @@ -519,11 +519,15 @@ static int grep_cache(struct grep_opt *opt,
>  		strbuf_addstr(&name, repo->submodule_prefix);
>  	}
>  
> +	prepare_repo_settings(repo);
> +	repo->settings.command_requires_full_index = 0;
> +

The best pattern is to put this in cmd_grep() immediately after parsing
options. This guarantees that we don't parse and expand the index in any
other code path.

>  	if (repo_read_index(repo) < 0)
>  		die(_("index file corrupt"));
>  
> -	/* TODO: audit for interaction with sparse-index. */
> -	ensure_full_index(repo->index);
> +	if (grep_sparse)
> +		ensure_full_index(repo->index);
> +

As mentioned before, this approach is the simplest way to make the case
without --sparse faster, but the case _with_ --sparse will still be slow.
The way to fix this would be to modify this portion of the loop:

	if (S_ISREG(ce->ce_mode) &&
	    match_pathspec(repo->index, pathspec, name.buf, name.len, 0, NULL,
			   S_ISDIR(ce->ce_mode) ||
			   S_ISGITLINK(ce->ce_mode))) {

by adding an initial case

	if (S_ISSPARSEDIR(ce->ce_mode)) {
		hit |= grep_tree(opt, &ce->oid, name.buf, 0, name.buf);
	} else if (S_ISREG(ce->ce_mode) &&
		   match_pathspec(repo->index, pathspec, name.buf, name.len, 0, NULL,
				  S_ISDIR(ce->ce_mode) ||
				  S_ISGITLINK(ce->ce_mode))) {

and appropriately implement "grep_tree()" to walk the tree at ce->oid to
find all matching files within, then call grep_oid() for each of those
paths.

Bonus points if you recognize that the pathspec uses prefix checks that
allow pruning the search space and not parsing all of the trees
recursively. But that can definitely be delayed for a future enhancement.
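The shape of such a grep_tree() walk can be illustrated with a toy model.
This sketch is not git code: struct toy_entry and toy_grep_tree() are
hypothetical stand-ins for tree objects and the proposed function, and a
plain strstr() stands in for the real pattern matching. A real
implementation would presumably read trees from the object database, call
grep_oid() for each blob, and consult the pathspec before descending.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a tree entry: either a blob or a directory. */
struct toy_entry {
	const char *name;
	const char *content;              /* non-NULL for blobs */
	const struct toy_entry *children; /* non-NULL for directories */
	size_t nr_children;
};

/*
 * Walk "e" recursively and count the blobs whose content contains
 * "pattern". "path" is the slash-terminated prefix leading to "e",
 * or "" at the top level. Each hit is printed as "path:content".
 */
int toy_grep_tree(const struct toy_entry *e, const char *path,
		  const char *pattern)
{
	char full[1024];
	int hit = 0;
	size_t i;

	assert(e && path && pattern);

	if (e->children) {
		/* Directory: extend the prefix and recurse into each child. */
		snprintf(full, sizeof(full), "%s%s/", path, e->name);
		for (i = 0; i < e->nr_children; i++)
			hit += toy_grep_tree(&e->children[i], full, pattern);
		return hit;
	}

	/* Blob: this is where the real code would call grep_oid(). */
	if (e->content && strstr(e->content, pattern)) {
		printf("%s%s:%s\n", path, e->name, e->content);
		hit = 1;
	}
	return hit;
}
```

The pruning mentioned above would fit at the top of the directory branch:
if the prefix built in "full" cannot match the pathspec, the walk could
return without visiting the children at all.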

> +test_expect_success 'grep expands index using --sparse' '
> +	init_repos &&
> +
> +	# With --sparse and --cached, do not ignore sparse entries and
> +	# expand the index.
> +	test_all_match git grep --sparse --cached a
> +'

Here, you're testing that the behavior matches, but not testing that the
index expands. (It does describe why you didn't include it in the later
ensure_not_expanded tests.)

> +
> +test_expect_success 'grep is not expanded' '
> +	init_repos &&
> +
> +	ensure_not_expanded grep a &&
> +	ensure_not_expanded grep a -- deep/* &&
> +	# grep does not match anything per se, so ! is used

It can be helpful to say why:

	# All files within the folder1/* pathspec are sparse,
	# so this command does not find any matches.

> +	ensure_not_expanded ! grep a -- folder1/*
> +'

Thanks,
-Stolee


* Re: [PATCH v1 1/2] builtin/grep.c: add --sparse option
  2022-08-17 14:12   ` Derrick Stolee
@ 2022-08-17 17:13     ` Junio C Hamano
  2022-08-17 17:34       ` Victoria Dye
  2022-08-17 17:37     ` Elijah Newren
  2022-08-24 18:20     ` Shaoxuan Yuan
  2 siblings, 1 reply; 69+ messages in thread
From: Junio C Hamano @ 2022-08-17 17:13 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Shaoxuan Yuan, git, vdye

Derrick Stolee <derrickstolee@github.com> writes:

> On 8/17/2022 3:56 AM, Shaoxuan Yuan wrote:
>> Add a --sparse option to `git-grep`. This option is mainly used to:
>> 
>> If searching in the index (using --cached):
>> 
>> With --sparse, proceed the action when the current cache_entry is
>
> This phrasing is awkward. It might be better to reframe to describe the
> _why_ before the _what_

Thanks for an excellent suggestion.  As a project participant, I
could guess the motivation, but couldn't link the parts of the
proposed log message to what I thought was being said X-<.  The
below is much clearer.

>   When the '--cached' option is used with the 'git grep' command, the
>   search is limited to the blobs found in the index, not in the worktree.
>   If the user has enabled sparse-checkout, this might present more results
>   than they would like, since the files outside of the sparse-checkout are
>   unlikely to be important to them.

Great.  As an explanation of the reasoning behind the design
decision, I do not think it is bad to go even stronger than "might
... would like" and assume or declare that those users who use
sparse-checkout are the ones who do NOT want to see the parts of the
tree that are sparsed out.  And based on that assumption, "grep" and
"grep --cached" should not bother reporting hit from the part that
the user is not interested in.

By stating the design and the reasoning behind that decision clearly
like so, we allow future developers to reconsider the earlier design
decision more easily.  In 7 years, they may find that the Git users
in their era use sparse-checkout even when they still care about the
contents in the sparsed out area, in which case the basic assumption
behind this change is no longer valid and would allow them to make
"grep" and "grep --cached" behave differently.

>   Change the default behavior of 'git grep' to focus on the files within
>   the sparse-checkout definition. To enable the previous behavior, add a
>   '--sparse' option to 'git grep' that triggers the old behavior that
>   inspects paths outside of the sparse-checkout definition when paired
>   with the '--cached' option.

Yup.  Is that "--sparse" or "--unsparse"?  We are busting the sparse
boundary and looking for everything, and calling the option to do so
"--sparse" somehow feels counter-intuitive, at least to me.

Thanks.


* Re: [PATCH v1 1/2] builtin/grep.c: add --sparse option
  2022-08-17 17:13     ` Junio C Hamano
@ 2022-08-17 17:34       ` Victoria Dye
  2022-08-17 17:43         ` Derrick Stolee
  0 siblings, 1 reply; 69+ messages in thread
From: Victoria Dye @ 2022-08-17 17:34 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee; +Cc: Shaoxuan Yuan, git

Junio C Hamano wrote:
> Derrick Stolee <derrickstolee@github.com> writes:
> 
>> On 8/17/2022 3:56 AM, Shaoxuan Yuan wrote:
>>> Add a --sparse option to `git-grep`. This option is mainly used to:
>>>
>>> If searching in the index (using --cached):
>>>
>>> With --sparse, proceed the action when the current cache_entry is
>>
>> This phrasing is awkward. It might be better to reframe to describe the
>> _why_ before the _what_
> 
> Thanks for an excellent suggestion.  As a project participant, I
> could guess the motivation, but couldn't link the parts of the
> proposed log message to what I thought was being said X-<.  The
> below is much clearer.
> 
>>   When the '--cached' option is used with the 'git grep' command, the
>>   search is limited to the blobs found in the index, not in the worktree.
>>   If the user has enabled sparse-checkout, this might present more results
>>   than they would like, since the files outside of the sparse-checkout are
>>   unlikely to be important to them.
> 
> Great.  As an explanation of the reasoning behind the design
> decision, I do not think it is bad to go even stronger than "might
> ... would like" and assume or declare that those users who use
> sparse-checkout are the ones who do NOT want to see the parts of the
> tree that are sparsed out.  And based on that assumption, "grep" and
> "grep --cached" should not bother reporting hit from the part that
> the user is not interested in.
> 
> By stating the design and the reasoning behind that decision clearly
> like so, we allow future developers to reconsider the earlier design
> decision more easily.  In 7 years, they may find that the Git users
> in their era use sparse-checkout even when they still care about the
> contents in the sparsed out area, in which case the basic assumption
> behind this change is no longer valid and would allow them to make
> "grep" and "grep --cached" behave differently.
> 
>>   Change the default behavior of 'git grep' to focus on the files within
>>   the sparse-checkout definition. To enable the previous behavior, add a
>>   '--sparse' option to 'git grep' that triggers the old behavior that
>>   inspects paths outside of the sparse-checkout definition when paired
>>   with the '--cached' option.
> 
> Yup.  Is that "--sparse" or "--unsparse"?  We are busting the sparse
> boundary and looking for everything, and calling the option to do so
> "--sparse" somehow feels counter-intuitive, at least to me.

It is a bit unintuitive, but '--sparse' is already used to mean "operate on
SKIP_WORKTREE entries (i.e., pretend the repo isn't a sparse-checkout)" in
both 'add' (0299a69694 (add: implement the --sparse option, 2021-09-24)) and
'rm' (f9786f9b85 (rm: add --sparse option, 2021-09-24)). The
'checkout-index' option '--ignore-skip-worktree-bits' indicates similar
behavior (and is, IMO, similarly confusing with its use of "ignore").

I'm not sure '--unsparse' would fit as an alternative, though, since 'git
grep' isn't really "unsparsifying" the repo (to me, that would imply
updating the index to remove the 'SKIP_WORKTREE' flag). Rather, it's looking
at files that are sparse when, by default, it does not. 

I still like the consistency of '--sparse' with existing similar options in
other commands but, if we want to try something clearer here, maybe
something like '--search-sparse' is more descriptive?

> 
> Thanks.



* Re: [PATCH v1 1/2] builtin/grep.c: add --sparse option
  2022-08-17 14:12   ` Derrick Stolee
  2022-08-17 17:13     ` Junio C Hamano
@ 2022-08-17 17:37     ` Elijah Newren
  2022-08-24 18:20     ` Shaoxuan Yuan
  2 siblings, 0 replies; 69+ messages in thread
From: Elijah Newren @ 2022-08-17 17:37 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Shaoxuan Yuan, Git Mailing List, Victoria Dye

On Wed, Aug 17, 2022 at 7:25 AM Derrick Stolee <derrickstolee@github.com> wrote:
>
> On 8/17/2022 3:56 AM, Shaoxuan Yuan wrote:
> > Add a --sparse option to `git-grep`. This option is mainly used to:
> >
> > If searching in the index (using --cached):
> >
> > With --sparse, proceed the action when the current cache_entry is
>
> This phrasing is awkward. It might be better to reframe to describe the
> _why_ before the _what_
>
>   When the '--cached' option is used with the 'git grep' command, the
>   search is limited to the blobs found in the index, not in the worktree.
>   If the user has enabled sparse-checkout, this might present more results
>   than they would like, since the files outside of the sparse-checkout are
>   unlikely to be important to them.
>
>   Change the default behavior of 'git grep' to focus on the files within
>   the sparse-checkout definition. To enable the previous behavior, add a
>   '--sparse' option to 'git grep' that triggers the old behavior that
>   inspects paths outside of the sparse-checkout definition when paired
>   with the '--cached' option.
>
> Or something like that. The documentation updates will also help clarify
> what happens when '--cached' is not included. I assume '--sparse' is
> ignored, but perhaps it _could_ allow looking at the cached files outside
> the sparse-checkout definition, this could make the simpler invocation of
> 'git grep --sparse <pattern>' be the way that users can search after their
> attempt to search the worktree failed.

In addition to Stolee's comments, isn't this command line confusing?

  $ git grep --cached --sparse   # Do a *dense* search
  $ git grep --cached            # Do a *sparse* search

?

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v1 1/2] builtin/grep.c: add --sparse option
  2022-08-17 17:34       ` Victoria Dye
@ 2022-08-17 17:43         ` Derrick Stolee
  2022-08-17 18:47           ` Junio C Hamano
  0 siblings, 1 reply; 69+ messages in thread
From: Derrick Stolee @ 2022-08-17 17:43 UTC (permalink / raw)
  To: Victoria Dye, Junio C Hamano; +Cc: Shaoxuan Yuan, git

On 8/17/2022 1:34 PM, Victoria Dye wrote:
> Junio C Hamano wrote:
>> Yup.  Is that "--sparse" or "--unsparse"?  We are busting the sparse
>> boundary and looking for everything, and calling the option to do so
>> "--sparse" somehow feels counter-intuitive, at least to me.
> 
> It is a bit unintuitive, but '--sparse' is already used to mean "operate on
> SKIP_WORKTREE entries (i.e., pretend the repo isn't a sparse-checkout)" in
> both 'add' (0299a69694 (add: implement the --sparse option, 2021-09-24)) and
> 'rm' (f9786f9b85 (rm: add --sparse option, 2021-09-24)). The
> 'checkout-index' option '--ignore-skip-worktree-bits' indicates similar
> behavior (and is, IMO, similarly confusing with its use of "ignore").
> 
> I'm not sure '--unsparse' would fit as an alternative, though, since 'git
> grep' isn't really "unsparsifying" the repo (to me, that would imply
> updating the index to remove the 'SKIP_WORKTREE' flag). Rather, it's looking
> at files that are sparse when, by default, it does not. 
> 
> I still like the consistency of '--sparse' with existing similar options in
> other commands but, if we want to try something clearer here, maybe
> something like '--search-sparse' is more descriptive?

My interpretation of '--sparse' is "include skip-worktree paths"
thinking of those paths being "sparse paths".

A too-long version could be '--ignore-sparse-checkout', but I can
understand the confusion where '--sparse' is interpreted as
'--respect-sparse-checkout'.

The existing pattern here means that it isn't Shaoxuan's responsibility
to pick a better name, but if we are interested in changing the name,
then we have some work to replace the previous '--sparse' options with
that name. I could do that replacement, assuming we land on a better name
and are willing to have that change of behavior.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v1 1/2] builtin/grep.c: add --sparse option
  2022-08-17 17:43         ` Derrick Stolee
@ 2022-08-17 18:47           ` Junio C Hamano
  0 siblings, 0 replies; 69+ messages in thread
From: Junio C Hamano @ 2022-08-17 18:47 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Victoria Dye, Shaoxuan Yuan, git

Derrick Stolee <derrickstolee@github.com> writes:

>> It is a bit unintuitive, but '--sparse' is already used to mean "operate on
>> SKIP_WORKTREE entries (i.e., pretend the repo isn't a sparse-checkout)" in
>> both 'add' (0299a69694 (add: implement the --sparse option, 2021-09-24)) and
>> 'rm' (f9786f9b85 (rm: add --sparse option, 2021-09-24)). The
>> 'checkout-index' option '--ignore-skip-worktree-bits' indicates similar
>> behavior (and is, IMO, similarly confusing with its use of "ignore").

OK, I forgot about these precedents.  "ignore skip worktree bits" is
quite a mouthful, but expresses what is going on quite clearly.
Instead of honoring the skip-worktree bit, behave as if they are not
set, so we bust the "sparse" boundary.

> The existing pattern here means that it isn't Shaoxuan's responsibility
> to pick a better name, but if we are interested in changing the name,
> then we have some work to replace the previous '--sparse' options with
> that name. I could do that replacement, assuming we land on a better name
> and are willing to have that change of behavior.

It all depends on how deeply the existing "--sparse" are anchored in
users' minds.  If we have been with them for nearly a year and three
major releases, it is too late to casually "fix" without a proper
transition strategy, I am afraid.  And I am not even sure if it is
worth the trouble.

In any case, let's leave it totally outside the scope of the topic.
As long as we are consistently unintuitive with "--sparse", then I
think we are OK, because users are malleable and can easily get used
to anything as long as it is consistent ;-)

Thanks.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v1 1/2] builtin/grep.c: add --sparse option
  2022-08-17 14:12   ` Derrick Stolee
  2022-08-17 17:13     ` Junio C Hamano
  2022-08-17 17:37     ` Elijah Newren
@ 2022-08-24 18:20     ` Shaoxuan Yuan
  2022-08-24 19:08       ` Derrick Stolee
  2 siblings, 1 reply; 69+ messages in thread
From: Shaoxuan Yuan @ 2022-08-24 18:20 UTC (permalink / raw)
  To: Derrick Stolee, git; +Cc: vdye

Hi reviewers,

I'm back after being busy with relocation :)

On 8/17/2022 10:12 PM, Derrick Stolee wrote:
 > On 8/17/2022 3:56 AM, Shaoxuan Yuan wrote:
 >> Add a --sparse option to `git-grep`. This option is mainly used to:
 >>
 >> If searching in the index (using --cached):
 >>
 >> With --sparse, proceed the action when the current cache_entry is
 >
 > This phrasing is awkward. It might be better to reframe to describe the
 > _why_ before the _what_
 >
 >   When the '--cached' option is used with the 'git grep' command, the
 >   search is limited to the blobs found in the index, not in the worktree.
 >   If the user has enabled sparse-checkout, this might present more results
 >   than they would like, since the files outside of the sparse-checkout are
 >   unlikely to be important to them.
 >
 >   Change the default behavior of 'git grep' to focus on the files within
 >   the sparse-checkout definition. To enable the previous behavior, add a
 >   '--sparse' option to 'git grep' that triggers the old behavior that
 >   inspects paths outside of the sparse-checkout definition when paired
 >   with the '--cached' option.

Good suggestion!

 > Or something like that. The documentation updates will also help clarify
 > what happens when '--cached' is not included. I assume '--sparse' is
 > ignored, but perhaps it _could_ allow looking at the cached files outside
 > the sparse-checkout definition, this could make the simpler invocation of
 > 'git grep --sparse <pattern>' be the way that users can search after their
 > attempt to search the worktree failed.

This simpler version was in my earlier local branch, but later I
decided not to go with it. I found that the difference between these
two approaches is that "--cached --sparse" is more correct in terms of
how Git actually works (because sparsity is a concept in the index),
while "--sparse" is more comfortable for the end user.

I found the former one better here, because it is more self-explanatory,
and thus more info for the user, i.e. "you are now looking at the
index, and Git will also consider files outside of sparse definition."

To be honest, I don't know which one is "better", but I think I'll
keep the current implementation unless something more convincing shows
up later.

 >> marked with SKIP_WORKTREE bit (the default is to skip this kind of
 >> entry). Before this patch, --cached itself can realize this action.
 >> Adding --sparse here grants the user finer control over sparse
 >> entries. If the user only wants to peak into the index without
 >
 > s/peak/peek/
 >
 >> caring about sparse entries, --cached should suffice; if the user
 >> wants to peak into the index _and_ cares about sparse entries,
 >> combining --sparse with --cached can address this need.
 >>
 >> Suggested-by: Victoria Dye <vdye@github.com>
 >> Signed-off-by: Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>
 >> ---
 >>  builtin/grep.c                  | 10 +++++++++-
 >>  t/t7817-grep-sparse-checkout.sh | 12 ++++++------
 >>  2 files changed, 15 insertions(+), 7 deletions(-)
 >
 > You mentioned in Slack that you missed the documentation of the --sparse
 > option. Just pointing it out here so we don't forget.

Will do.

 >>
 >> diff --git a/builtin/grep.c b/builtin/grep.c
 >> index e6bcdf860c..61402e8084 100644
 >> --- a/builtin/grep.c
 >> +++ b/builtin/grep.c
 >> @@ -96,6 +96,8 @@ static pthread_cond_t cond_result;
 >>
 >>  static int skip_first_line;
 >>
 >> +static int grep_sparse = 0;
 >> +
 >
 > I initially thought it might be good to not define an additional global,
 > but there are many defined in this file outside of the context and they
 > are spread out with extra whitespace like this.
 >
 >>  static void add_work(struct grep_opt *opt, struct grep_source *gs)
 >>  {
 >>      if (opt->binary != GREP_BINARY_TEXT)
 >> @@ -525,7 +527,11 @@ static int grep_cache(struct grep_opt *opt,
 >>      for (nr = 0; nr < repo->index->cache_nr; nr++) {
 >>          const struct cache_entry *ce = repo->index->cache[nr];
 >>
 >> -        if (!cached && ce_skip_worktree(ce))
 >
 > This logic would skip files marked with SKIP_WORKTREE _unless_ --cached
 > was provided.
 >
 >> +        /*
 >> +         * If ce is a SKIP_WORKTREE entry, look into it when both
 >> +         * --sparse and --cached are given.
 >> +         */
 >> +        if (!(grep_sparse && cached) && ce_skip_worktree(ce))
 >>              continue;
 >
 > The logic of this if statement is backwards from the comment because a
 > true statement means "skip the entry" _not_ "look into it".
 >
 >     /*
 >      * Skip entries with SKIP_WORKTREE unless both --sparse and
 >      * --cached are given.
 >      */

Got it.

 > But again, we might want to consider this alternative:
 >
 >     /*
 >      * Skip entries with SKIP_WORKTREE unless --sparse is given.
 >      */
 >     if (!grep_sparse && ce_skip_worktree(ce))
 >         continue;
 >
 > This will require further changes below, specifically this bit:
 >
 >             /*
 >              * If CE_VALID is on, we assume worktree file and its
 >              * cache entry are identical, even if worktree file has
 >              * been modified, so use cache version instead
 >              */
 >             if (cached || (ce->ce_flags & CE_VALID)) {
 >                 if (ce_stage(ce) || ce_intent_to_add(ce))
 >                     continue;
 >                 hit |= grep_oid(opt, &ce->oid, name.buf,
 >                          0, name.buf);
 >             } else {
 >
 > We need to activate this grep_oid() call also when ce_skip_worktree(ce) is
 > true. That is, if we want 'git grep --sparse' to extend the search beyond
 > the worktree and into the sparse entries.
 >
 >>
 >>          strbuf_setlen(&name, name_base_len);
 >> @@ -963,6 +969,8 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 >>                 PARSE_OPT_NOCOMPLETE),
 >>          OPT_INTEGER('m', "max-count", &opt.max_count,
 >>              N_("maximum number of results per file")),
 >> +        OPT_BOOL(0, "sparse", &grep_sparse,
 >> +             N_("search sparse contents and expand sparse index")),
 >
 > This "and expand sparse index" is an internal implementation detail, not a
 > helpful item for the help text. Instead, perhaps:
 >
 >     "search the contents of files outside the sparse-checkout definition"

Sounds good!

 > (Also, while the sparse index is being expanded right now, I would expect
 > to not expand the sparse index by the end of the series.)
 >
 >> -test_expect_success 'grep --cached searches entries with the SKIP_WORKTREE bit' '
 >> +test_expect_success 'grep --cached and --sparse searches entries with the SKIP_WORKTREE bit' '
 >>      cat >expect <<-EOF &&
 >>      a:text
 >>      b:text
 >>      dir/c:text
 >>      EOF
 >> -    git grep --cached "text" >actual &&
 >> +    git grep --cached --sparse "text" >actual &&
 >>      test_cmp expect actual
 >>  '
 >
 > Please add a test that demonstrates the change of behavior when only --cached
 > is provided, not --sparse.

Sure!

 > (If you take my suggestion to allow 'git grep --sparse' to do something
 > different, then also add a test for that case.)
 >
 >>
 >> @@ -143,7 +143,7 @@ test_expect_success 'grep --recurse-submodules honors sparse checkout in submodu
 >>      test_cmp expect actual
 >>  '
 >>
 >> -test_expect_success 'grep --recurse-submodules --cached searches entries with the SKIP_WORKTREE bit' '
 >> +test_expect_success 'grep --recurse-submodules --cached and --sparse searches entries with the SKIP_WORKTREE bit' '
 >>      cat >expect <<-EOF &&
 >>      a:text
 >>      b:text
 >> @@ -152,7 +152,7 @@ test_expect_success 'grep --recurse-submodules --cached searches entries with th
 >>      sub/B/b:text
 >>      sub2/a:text
 >>      EOF
 >> -    git grep --recurse-submodules --cached "text" >actual &&
 >> +    git grep --recurse-submodules --cached --sparse "text" >actual &&
 >>      test_cmp expect actual
 >>  '
 >> @@ -166,7 +166,7 @@ test_expect_success 'working tree grep does not search the index with CE_VALID a
 >>      test_cmp expect actual
 >>  '
 >>
 >> -test_expect_success 'grep --cached searches index entries with both CE_VALID and SKIP_WORKTREE' '
 >> +test_expect_success 'grep --cached and --sparse searches index entries with both CE_VALID and SKIP_WORKTREE' '
 >>      cat >expect <<-EOF &&
 >>      a:text
 >>      b:text
 >> @@ -174,7 +174,7 @@ test_expect_success 'grep --cached searches index entries with both CE_VALID and
 >>      EOF
 >>      test_when_finished "git update-index --no-assume-unchanged b" &&
 >>      git update-index --assume-unchanged b &&
 >> -    git grep --cached text >actual &&
 >> +    git grep --cached --sparse text >actual &&
 >>      test_cmp expect actual
 >>  '
 >
 > Same with these two tests. Add additional commands that show the change of
 > behavior when only using '--cached'.

--
Thanks,
Shaoxuan


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v1 1/2] builtin/grep.c: add --sparse option
  2022-08-24 18:20     ` Shaoxuan Yuan
@ 2022-08-24 19:08       ` Derrick Stolee
  0 siblings, 0 replies; 69+ messages in thread
From: Derrick Stolee @ 2022-08-24 19:08 UTC (permalink / raw)
  To: Shaoxuan Yuan, git; +Cc: vdye

On 8/24/2022 2:20 PM, Shaoxuan Yuan wrote:
> Hi reviewers,
> 
> I'm back after being busy with relocation :)

Welcome back! I'm looking forward to overlapping our timezones a bit more.
 
> On 8/17/2022 10:12 PM, Derrick Stolee wrote:
>> On 8/17/2022 3:56 AM, Shaoxuan Yuan wrote:
>>> Add a --sparse option to `git-grep`. This option is mainly used to:

>> Or something like that. The documentation updates will also help clarify
>> what happens when '--cached' is not included. I assume '--sparse' is
>> ignored, but perhaps it _could_ allow looking at the cached files outside
>> the sparse-checkout definition, this could make the simpler invocation of
>> 'git grep --sparse <pattern>' be the way that users can search after their
>> attempt to search the worktree failed.
> 
> This simpler version was in my earlier local branch, but later I
> decided not to go with it. I found that the difference between these
> two approaches is that "--cached --sparse" is more correct in terms of
> how Git actually works (because sparsity is a concept in the index),
> while "--sparse" is more comfortable for the end user.
> 
> I found the former one better here, because it is more self-explanatory,
> and thus more info for the user, i.e. "you are now looking at the
> index, and Git will also consider files outside of sparse definition."
> 
> To be honest, I don't know which one is "better", but I think I'll
> keep the current implementation unless something more convincing shows
> up later.

I think it is fine for you to keep the "--sparse requires --cached"
approach that you have now, since we can always choose to extend
the options to allow --sparse without --cached later.

Thanks,
-Stolee
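
For readers following the subthread, the three skip rules discussed above can
be put side by side in a small standalone sketch. This is an illustration
only, not code from the patch: `struct toy_opts` and the three function names
are hypothetical, and only the boolean expressions are taken from the quoted
diff and review comments.

```c
#include <assert.h>

/* Hypothetical stand-in for the two command-line flags under discussion. */
struct toy_opts {
	int cached;	/* --cached was given */
	int sparse;	/* --sparse was given */
};

/* Before the patch: skip SKIP_WORKTREE entries unless --cached is given. */
static int skip_before_patch(const struct toy_opts *o, int skip_worktree)
{
	assert(o);
	return !o->cached && skip_worktree;
}

/* v1 patch: skip them unless both --sparse and --cached are given. */
static int skip_v1(const struct toy_opts *o, int skip_worktree)
{
	assert(o);
	return !(o->sparse && o->cached) && skip_worktree;
}

/* Alternative raised in review: skip them unless --sparse is given. */
static int skip_alternative(const struct toy_opts *o, int skip_worktree)
{
	assert(o);
	return !o->sparse && skip_worktree;
}
```

With only `cached` set, `skip_before_patch()` returns 0 for a SKIP_WORKTREE
entry while `skip_v1()` returns 1, which is the change in default behavior
the requested tests are meant to demonstrate.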

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v1 2/2] builtin/grep.c: integrate with sparse index
  2022-08-17 14:23   ` Derrick Stolee
@ 2022-08-24 21:06     ` Shaoxuan Yuan
  2022-08-25  0:39       ` Derrick Stolee
  0 siblings, 1 reply; 69+ messages in thread
From: Shaoxuan Yuan @ 2022-08-24 21:06 UTC (permalink / raw)
  To: Derrick Stolee, git; +Cc: vdye

On 8/17/2022 10:23 PM, Derrick Stolee wrote:
 > On 8/17/2022 3:56 AM, Shaoxuan Yuan wrote:
 >> Turn on sparse index and remove ensure_full_index().
 >>
 >> Change it to only expand the index when using --sparse.
 >>
 >> The p2000 tests demonstrate a ~99.4% execution time reduction for
 >> `git grep` using a sparse index.
 >>
 >> Test                                           HEAD~1   HEAD
 >> -----------------------------------------------------------------------------
 >> 2000.78: git grep --cached bogus (full-v3)     0.019    0.018  (-5.2%)
 >> 2000.79: git grep --cached bogus (full-v4)     0.017    0.016  (-5.8%)
 >> 2000.80: git grep --cached bogus (sparse-v3)   0.29     0.0015 (-99.4%)
 >> 2000.81: git grep --cached bogus (sparse-v4)   0.30     0.0018 (-99.4%)
 >
 > Good results.
 >
 > I think we could get interesting results even with the --sparse
 > option if you go another step further (perhaps as a patch after
 > this one).

OK.

 >>
 >> Optional reading about performance test results
 >> -----------------------------------------------
 >> Notice that because `git-grep` needs to parse blobs in the index, the
 >> index reading time is minuscule compared to the object parsing time.
 >> And because of this, the p2000 test results cannot clearly reflect the
 >> speedup for index reading: combined with the object parsing time,
 >> the aggregated time difference is extremely close between HEAD~1 and
 >> HEAD.
 >>
 >> Hence, the results presented here are not directly extracted from the
 >> p2000 test results. Instead, to make the performance difference more
 >> visible, the test command is manually run with GIT_TRACE2_PERF in the
 >> four repos (full-v3, sparse-v3, full-v4, sparse-v4). The numbers here
 >> are then extracted from the time difference between "region_enter" and
 >> "region_leave" of label "do_read_index".
 >
 > This is a good point, but I don't recommend displaying them as if they
 > were the output of a "./run HEAD~1 HEAD -- p2000-sparse-operations.sh"
 > command. Instead, point out that the performance test does not show a
 > major improvement and instead you have these "Before" and "After" results
 > from testing manually and extracting trace2 regions.

OK.

 >> @@ -519,11 +519,15 @@ static int grep_cache(struct grep_opt *opt,
 >>          strbuf_addstr(&name, repo->submodule_prefix);
 >>      }
 >>
 >> +    prepare_repo_settings(repo);
 >> +    repo->settings.command_requires_full_index = 0;
 >> +
 >
 > The best pattern is to put this in cmd_grep() immediately after parsing
 > options. This guarantees that we don't parse and expand the index in any
 > other code path.

Got it.

 >>      if (repo_read_index(repo) < 0)
 >>          die(_("index file corrupt"));
 >>
 >> -    /* TODO: audit for interaction with sparse-index. */
 >> -    ensure_full_index(repo->index);
 >> +    if (grep_sparse)

A side note: this condition should be `grep_sparse && cached`.

 >> +        ensure_full_index(repo->index);
 >> +
 > As mentioned before, this approach is the simplest way to make the case
 > without --sparse faster, but the case _with_ --sparse will still be slow.
 > The way to fix this would be to modify this portion of the loop:

I'm not sure. If --sparse here means we want to expand the index, it
is expected to be slow (ensure_full_index is slow), isn't it?

 >     if (S_ISREG(ce->ce_mode) &&
 >         match_pathspec(repo->index, pathspec, name.buf, name.len, 0, NULL,
 >                S_ISDIR(ce->ce_mode) ||
 >                S_ISGITLINK(ce->ce_mode))) {
 >
 > by adding an initial case
 >
 >     if (S_ISSPARSEDIR(ce->ce_mode)) {
 >         hit |= grep_tree(opt, &ce->oid, name.buf, 0, name.buf);
 >     } else if (S_ISREG(ce->ce_mode) &&
 >            match_pathspec(repo->index, pathspec, name.buf, name.len, 0, NULL,
 >                   S_ISDIR(ce->ce_mode) ||
 >                   S_ISGITLINK(ce->ce_mode))) {
 >
 > and appropriately implement "grep_tree()" to walk the tree at ce->oid to
 > find all matching files within, then call grep_oid() for each of those
 > paths.

Tree walking is faster, yes. So, for this approach to be faster, I
think you are suggesting we should not expand the index, even when
--sparse is given? Instead, we just rely on the tree walking logic,
right?

 > Bonus points if you recognize that the pathspec uses prefix checks that
 > allow pruning the search space and not parsing all of the trees
 > recursively. But that can definitely be delayed for a future enhancement.

OK.

 >> +test_expect_success 'grep expands index using --sparse' '
 >> +    init_repos &&
 >> +
 >> +    # With --sparse and --cached, do not ignore sparse entries and
 >> +    # expand the index.
 >> +    test_all_match git grep --sparse --cached a
 >> +'
 >
 > Here, you're testing that the behavior matches, but not testing that the
 > index expands. (It does describe why you didn't include it in the later
 > ensure_not_expanded tests.)

I was trying to "imply" the index expansion because of the behavior
match. Yes, I think the test should be more explicit.

 >> +
 >> +test_expect_success 'grep is not expanded' '
 >> +    init_repos &&
 >> +
 >> +    ensure_not_expanded grep a &&
 >> +    ensure_not_expanded grep a -- deep/* &&
 >> +    # grep does not match anything per se, so ! is used
 >
 > It can be helpful to say why:
 >
 >     # All files within the folder1/* pathspec are sparse,
 >     # so this command does not find any matches.

OK.

--
Thanks,
Shaoxuan
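
As an aside on the quoted "Before"/"After" numbers: the percentage column is
the relative change between the two timings. A minimal sketch of that
arithmetic, assuming a hypothetical helper named `percent_change()` (it is
not part of Git or of the perf test suite):

```c
#include <assert.h>

/* Relative change from 'before' to 'after', in percent (negative = faster). */
static double percent_change(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}
```

For the sparse-v4 row, `percent_change(0.30, 0.0018)` evaluates to about
-99.4, in line with the quoted table.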



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v1 2/2] builtin/grep.c: integrate with sparse index
  2022-08-24 21:06     ` Shaoxuan Yuan
@ 2022-08-25  0:39       ` Derrick Stolee
  0 siblings, 0 replies; 69+ messages in thread
From: Derrick Stolee @ 2022-08-25  0:39 UTC (permalink / raw)
  To: Shaoxuan Yuan, git; +Cc: vdye

On 8/24/22 5:06 PM, Shaoxuan Yuan wrote:
> On 8/17/2022 10:23 PM, Derrick Stolee wrote:
>> On 8/17/2022 3:56 AM, Shaoxuan Yuan wrote:
>>> Turn on sparse index and remove ensure_full_index().

>>> -    /* TODO: audit for interaction with sparse-index. */
>>> -    ensure_full_index(repo->index);
>>> +    if (grep_sparse)
> 
> A side note: this condition should be `grep_sparse && cached`.
> 
>>> +        ensure_full_index(repo->index);
>>> +
>> As mentioned before, this approach is the simplest way to make the case
>> without --sparse faster, but the case _with_ --sparse will still be slow.
>> The way to fix this would be to modify this portion of the loop:
> 
> I'm not sure. If --sparse here means we want to expand the index, it
> is expected to be slow (ensure_full_index is slow), isn't it?
> 
>>     if (S_ISREG(ce->ce_mode) &&
>>         match_pathspec(repo->index, pathspec, name.buf, name.len, 0, NULL,
>>                S_ISDIR(ce->ce_mode) ||
>>                S_ISGITLINK(ce->ce_mode))) {
>>
>> by adding an initial case
>>
>>     if (S_ISSPARSEDIR(ce->ce_mode)) {
>>         hit |= grep_tree(opt, &ce->oid, name.buf, 0, name.buf);
>>     } else if (S_ISREG(ce->ce_mode) &&
>>            match_pathspec(repo->index, pathspec, name.buf, name.len, 0, NULL,
>>                   S_ISDIR(ce->ce_mode) ||
>>                   S_ISGITLINK(ce->ce_mode))) {
>>
>> and appropriately implement "grep_tree()" to walk the tree at ce->oid to
>> find all matching files within, then call grep_oid() for each of those
>> paths.
> 
> Tree walking is faster, yes. So, for this approach to be faster, I
> think you are suggesting we should not expand the index, even when
> --sparse is given? Instead, we just rely on the tree walking logic,
> right?

Yes. Tree walking is a sizeable portion of the cost of expanding the
index, but we also avoid constructing the new index _and_ we can use
the t1092 tests to show that we are satisfying the behavior without
resorting to ensure_full_index(). It shows that we are doing the "most
correct" thing.

Walking trees also provides the way to speed up when focused on a
pathspec, since maybe the pathspec reduces the scope of the tree
search automatically (from existing tree-walking logic). Expanding
the index means "walk all the trees, then scan all the files" when
there might be better things to do instead.

Thanks,
-Stolee
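
The pruning argument above can be illustrated with a toy model. The sketch
below is not Git's tree-walking API and not the proposed grep_tree(): the
`struct toy_entry` type and the helpers `dir_may_match()` and `toy_walk()`
are hypothetical, and the pathspec is simplified to a plain path prefix. It
only shows why a prefix lets a walk skip whole subtrees, whereas expanding
the index visits everything first.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a tree entry; not Git's real data structure. */
struct toy_entry {
	const char *path;			/* full path; directories end in '/' */
	const struct toy_entry *children;	/* NULL for blobs */
	size_t nr_children;
};

/*
 * Could anything at or below directory 'dir' start with 'prefix'?
 * True when one of the two strings is a prefix of the other.
 */
static int dir_may_match(const char *dir, const char *prefix)
{
	size_t dlen = strlen(dir), plen = strlen(prefix);
	size_t n = dlen < plen ? dlen : plen;

	return !strncmp(dir, prefix, n);
}

/*
 * Count blobs at or below 'e' whose path starts with 'prefix'.
 * '*visited' is incremented once for every entry examined, so callers
 * can see how much of the tree the walk actually touched.
 */
static size_t toy_walk(const struct toy_entry *e, const char *prefix,
		       size_t *visited)
{
	size_t i, hits = 0;

	assert(e && prefix && visited);
	(*visited)++;

	if (!e->children)	/* blob: compare its full path */
		return !strncmp(e->path, prefix, strlen(prefix));

	if (!dir_may_match(e->path, prefix))
		return 0;	/* prune: nothing below can match */

	for (i = 0; i < e->nr_children; i++)
		hits += toy_walk(&e->children[i], prefix, visited);
	return hits;
}
```

With an empty prefix every entry is visited, which mirrors the "walk all the
trees" case; with a prefix such as "folder1/", sibling directories are
examined once and then skipped without descending into them.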

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH v2 0/2] grep: integrate with sparse index
  2022-08-17  7:56 [PATCH v1 0/2] grep: integrate with sparse index Shaoxuan Yuan
                   ` (2 preceding siblings ...)
  2022-08-17 13:46 ` [PATCH v1 0/2] grep: " Derrick Stolee
@ 2022-08-29 23:28 ` Shaoxuan Yuan
  2022-08-29 23:28   ` [PATCH v2 1/2] builtin/grep.c: add --sparse option Shaoxuan Yuan
  2022-08-29 23:28   ` [PATCH v2 2/2] builtin/grep.c: integrate with sparse index Shaoxuan Yuan
  2022-09-01  4:57 ` [PATCH v3 0/3] grep: " Shaoxuan Yuan
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 69+ messages in thread
From: Shaoxuan Yuan @ 2022-08-29 23:28 UTC (permalink / raw)
  To: git; +Cc: derrickstolee, vdye, Shaoxuan Yuan

Integrate `git-grep` with sparse-index and test the performance
improvement.

Changes since v1
----------------

* Rewrite the commit message for "builtin/grep.c: add --sparse option"
  to be clearer.

* Update the documentation (both in-code and man page) for --sparse.

* Add a few tests to test the new behavior (when _only_ --cached is
  supplied).

* Reformat the perf test results so they do not look like direct output
  of the p2000 tests.

* Put the "command_requires_full_index" lines right after parse_options().

* Add a pathspec test in t1092, and reword a few test descriptions.

left-over-bits
--------------

As Derrick suggested here [1], we can use tree traversal, for example
`grep_tree()` in "builtin/grep.c", to grep each sparse directory
rather than expanding the index, which saves some overhead.

However, when testing "specifying a pathspec to limit the scope of
tree walking", Git on my local branch does not show the contents within
the pathspec because of a pathspec mismatch (which is not expected:
when "folder1/*" is used, "folder1/a" fails to match?!).
And when no pathspec is used, Git walks all the trees as
expected, because `all_entries_interesting` is returned for the empty
pathspec.

So I'm convinced that something is wrong with the pathspec matching
logic within "builtin/grep.c", and I'm still working on it [2].

[1] https://lore.kernel.org/git/19dea639-389a-7258-e424-4912bde226df@github.com/
[2] https://github.com/ffyuanda/git/tree/grep/sparse-integration-v2.3-tree-walking

Shaoxuan Yuan (2):
  builtin/grep.c: add --sparse option
  builtin/grep.c: integrate with sparse index

 Documentation/git-grep.txt               |  5 +++-
 builtin/grep.c                           | 20 +++++++++++---
 t/perf/p2000-sparse-operations.sh        |  1 +
 t/t1092-sparse-checkout-compatibility.sh | 18 +++++++++++++
 t/t7817-grep-sparse-checkout.sh          | 34 +++++++++++++++++++-----
 5 files changed, 68 insertions(+), 10 deletions(-)

Range-diff against v1:
1:  bcac4dfc56 ! 1:  27c9341bca builtin/grep.c: add --sparse option
    @@ Metadata
      ## Commit message ##
         builtin/grep.c: add --sparse option
     
    -    Add a --sparse option to `git-grep`. This option is mainly used to:
    +    Add a --sparse option to `git-grep`.
     
    -    If searching in the index (using --cached):
    +    When the '--cached' option is used with the 'git grep' command, the
    +    search is limited to the blobs found in the index, not in the worktree.
    +    If the user has enabled sparse-checkout, this might present more results
    +    than they would like, since the files outside of the sparse-checkout are
    +    unlikely to be important to them.
     
    -    With --sparse, proceed the action when the current cache_entry is
    -    marked with SKIP_WORKTREE bit (the default is to skip this kind of
    -    entry). Before this patch, --cached itself can realize this action.
    -    Adding --sparse here grants the user finer control over sparse
    -    entries. If the user only wants to peak into the index without
    -    caring about sparse entries, --cached should suffice; if the user
    -    wants to peak into the index _and_ cares about sparse entries,
    -    combining --sparse with --cached can address this need.
    +    Change the default behavior of 'git grep' to focus on the files within
    +    the sparse-checkout definition. To enable the previous behavior, add a
    +    '--sparse' option to 'git grep' that triggers the old behavior that
    +    inspects paths outside of the sparse-checkout definition when paired
    +    with the '--cached' option.
     
    +    Helped-by: Derrick Stolee <derrickstolee@github.com>
         Suggested-by: Victoria Dye <vdye@github.com>
         Signed-off-by: Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>
     
    + ## Documentation/git-grep.txt ##
    +@@ Documentation/git-grep.txt: SYNOPSIS
    + 	   [-f <file>] [-e] <pattern>
    + 	   [--and|--or|--not|(|)|-e <pattern>...]
    + 	   [--recurse-submodules] [--parent-basename <basename>]
    +-	   [ [--[no-]exclude-standard] [--cached | --no-index | --untracked] | <tree>...]
    ++	   [ [--[no-]exclude-standard] [--cached [--sparse] | --no-index | --untracked] | <tree>...]
    + 	   [--] [<pathspec>...]
    + 
    + DESCRIPTION
    +@@ Documentation/git-grep.txt: OPTIONS
    + 	Instead of searching tracked files in the working tree, search
    + 	blobs registered in the index file.
    + 
    ++--sparse::
    ++	Use with --cached. Search outside of sparse-checkout definition.
    ++
    + --no-index::
    + 	Search files in the current directory that is not managed by Git.
    + 
    +
      ## builtin/grep.c ##
     @@ builtin/grep.c: static pthread_cond_t cond_result;
      
    @@ builtin/grep.c: static int grep_cache(struct grep_opt *opt,
      
     -		if (!cached && ce_skip_worktree(ce))
     +		/*
    -+		 * If ce is a SKIP_WORKTREE entry, look into it when both
    -+		 * --sparse and --cached are given.
    ++		 * Skip entries with SKIP_WORKTREE unless both --sparse and
    ++		 * --cached are given.
     +		 */
     +		if (!(grep_sparse && cached) && ce_skip_worktree(ce))
      			continue;
    @@ builtin/grep.c: int cmd_grep(int argc, const char **argv, const char *prefix)
      		OPT_INTEGER('m', "max-count", &opt.max_count,
      			N_("maximum number of results per file")),
     +		OPT_BOOL(0, "sparse", &grep_sparse,
    -+			 N_("search sparse contents and expand sparse index")),
    ++			 N_("search the contents of files outside the sparse-checkout definition")),
      		OPT_END()
      	};
      	grep_prefix = prefix;
    @@ t/t7817-grep-sparse-checkout.sh: test_expect_success 'grep searches unmerged fil
      
     -test_expect_success 'grep --cached searches entries with the SKIP_WORKTREE bit' '
     +test_expect_success 'grep --cached and --sparse searches entries with the SKIP_WORKTREE bit' '
    ++	cat >expect <<-EOF &&
    ++	a:text
    ++	EOF
    ++	git grep --cached "text" >actual &&
    ++	test_cmp expect actual &&
    ++
      	cat >expect <<-EOF &&
      	a:text
      	b:text
    @@ t/t7817-grep-sparse-checkout.sh: test_expect_success 'grep --recurse-submodules
      
     -test_expect_success 'grep --recurse-submodules --cached searches entries with the SKIP_WORKTREE bit' '
     +test_expect_success 'grep --recurse-submodules --cached and --sparse searches entries with the SKIP_WORKTREE bit' '
    ++	cat >expect <<-EOF &&
    ++	a:text
    ++	sub/B/b:text
    ++	sub2/a:text
    ++	EOF
    ++	git grep --recurse-submodules --cached "text" >actual &&
    ++	test_cmp expect actual &&
    ++
      	cat >expect <<-EOF &&
      	a:text
      	b:text
    @@ t/t7817-grep-sparse-checkout.sh: test_expect_success 'working tree grep does not
      
     -test_expect_success 'grep --cached searches index entries with both CE_VALID and SKIP_WORKTREE' '
     +test_expect_success 'grep --cached and --sparse searches index entries with both CE_VALID and SKIP_WORKTREE' '
    ++	cat >expect <<-EOF &&
    ++	a:text
    ++	EOF
    ++	test_when_finished "git update-index --no-assume-unchanged b" &&
    ++	git update-index --assume-unchanged b &&
    ++	git grep --cached text >actual &&
    ++	test_cmp expect actual &&
    ++
      	cat >expect <<-EOF &&
      	a:text
      	b:text
2:  48b21afb94 ! 2:  cb16727c05 builtin/grep.c: integrate with sparse index
    @@ Commit message
         The p2000 tests demonstrate a ~99.4% execution time reduction for
         `git grep` using a sparse index.
     
    -    Test                                           HEAD~1       HEAD
    +    Test                                  Before       After
         -----------------------------------------------------------------------------
    -    2000.78: git grep --cached bogus (full-v3)     0.019        0.018  (-5.2%)
    -    2000.79: git grep --cached bogus (full-v4)     0.017        0.016  (-5.8%)
    -    2000.80: git grep --cached bogus (sparse-v3)   0.29         0.0015 (-99.4%)
    -    2000.81: git grep --cached bogus (sparse-v4)   0.30         0.0018 (-99.4%)
    +    git grep --cached bogus (full-v3)     0.019        0.018  (-5.2%)
    +    git grep --cached bogus (full-v4)     0.017        0.016  (-5.8%)
    +    git grep --cached bogus (sparse-v3)   0.29         0.0015 (-99.4%)
    +    git grep --cached bogus (sparse-v4)   0.30         0.0018 (-99.4%)
     
         Optional reading about performance test results
         -----------------------------------------------
    @@ Commit message
     
      ## builtin/grep.c ##
     @@ builtin/grep.c: static int grep_cache(struct grep_opt *opt,
    - 		strbuf_addstr(&name, repo->submodule_prefix);
    - 	}
    - 
    -+	prepare_repo_settings(repo);
    -+	repo->settings.command_requires_full_index = 0;
    -+
      	if (repo_read_index(repo) < 0)
      		die(_("index file corrupt"));
      
    @@ builtin/grep.c: static int grep_cache(struct grep_opt *opt,
      	for (nr = 0; nr < repo->index->cache_nr; nr++) {
      		const struct cache_entry *ce = repo->index->cache[nr];
      
    +@@ builtin/grep.c: int cmd_grep(int argc, const char **argv, const char *prefix)
    + 			     PARSE_OPT_KEEP_DASHDASH |
    + 			     PARSE_OPT_STOP_AT_NON_OPTION);
    + 
    ++	if (the_repository->gitdir) {
    ++		prepare_repo_settings(the_repository);
    ++		the_repository->settings.command_requires_full_index = 0;
    ++	}
    ++
    + 	if (use_index && !startup_info->have_repository) {
    + 		int fallback = 0;
    + 		git_config_get_bool("grep.fallbacktonoindex", &fallback);
     
      ## t/perf/p2000-sparse-operations.sh ##
     @@ t/perf/p2000-sparse-operations.sh: test_perf_on_all git read-tree -mu HEAD
    @@ t/t1092-sparse-checkout-compatibility.sh: test_expect_success 'sparse index is n
      	ensure_not_expanded rm -r deep
      '
      
    -+test_expect_success 'grep expands index using --sparse' '
    ++test_expect_success 'grep with --sparse and --cached' '
     +	init_repos &&
     +
    -+	# With --sparse and --cached, do not ignore sparse entries and
    -+	# expand the index.
    -+	test_all_match git grep --sparse --cached a
    ++	test_all_match git grep --sparse --cached a &&
    ++	test_all_match git grep --sparse --cached a -- "folder1/*"
     +'
     +
     +test_expect_success 'grep is not expanded' '
    @@ t/t1092-sparse-checkout-compatibility.sh: test_expect_success 'sparse index is n
     +
     +	ensure_not_expanded grep a &&
     +	ensure_not_expanded grep a -- deep/* &&
    -+	# grep does not match anything per se, so ! is used
    ++
    ++	# All files within the folder1/* pathspec are sparse,
    ++	# so this command does not find any matches
     +	ensure_not_expanded ! grep a -- folder1/*
     +'
     +

base-commit: 07ee72db0e97b5c233f8ada0abb412248c2f1c6f
-- 
2.37.0



* [PATCH v2 1/2] builtin/grep.c: add --sparse option
  2022-08-29 23:28 ` [PATCH v2 " Shaoxuan Yuan
@ 2022-08-29 23:28   ` Shaoxuan Yuan
  2022-08-29 23:28   ` [PATCH v2 2/2] builtin/grep.c: integrate with sparse index Shaoxuan Yuan
  1 sibling, 0 replies; 69+ messages in thread
From: Shaoxuan Yuan @ 2022-08-29 23:28 UTC (permalink / raw)
  To: git; +Cc: derrickstolee, vdye, Shaoxuan Yuan

Add a --sparse option to `git-grep`.

When the '--cached' option is used with the 'git grep' command, the
search is limited to the blobs found in the index, not in the worktree.
If the user has enabled sparse-checkout, this might present more results
than they would like, since the files outside of the sparse-checkout are
unlikely to be important to them.

Change the default behavior of 'git grep' to focus on the files within
the sparse-checkout definition. To enable the previous behavior, add a
'--sparse' option to 'git grep' that triggers the old behavior that
inspects paths outside of the sparse-checkout definition when paired
with the '--cached' option.
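
The resulting rule can be sketched as a small standalone model. This is
not the code in builtin/grep.c: `toy_entry`, `should_skip()` and
`list_searched()` are hypothetical names used only for illustration, and
only the boolean condition itself is taken from this patch.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for a cache entry; only the flag matters here. */
struct toy_entry {
	const char *path;
	int skip_worktree;	/* models the SKIP_WORKTREE bit */
};

/*
 * Mirrors the condition this patch adds to grep_cache(): an entry
 * outside the sparse-checkout definition is skipped unless both
 * --cached and --sparse are given.
 */
static int should_skip(const struct toy_entry *ce, int cached, int grep_sparse)
{
	assert(ce != NULL);
	return !(grep_sparse && cached) && ce->skip_worktree;
}

/* Print the paths that a search with the given options would visit. */
static void list_searched(const struct toy_entry *entries, int nr,
			  int cached, int grep_sparse)
{
	int i;

	for (i = 0; i < nr; i++)
		if (!should_skip(&entries[i], cached, grep_sparse))
			printf("%s\n", entries[i].path);
}
```

With entries "a" (inside the sparse-checkout definition) and "b" and
"dir/c" (outside it), list_searched() visits only "a" for --cached alone
and all three for --cached --sparse, which is the split the updated
t7817 tests check.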

Helped-by: Derrick Stolee <derrickstolee@github.com>
Suggested-by: Victoria Dye <vdye@github.com>
Signed-off-by: Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>
---
 Documentation/git-grep.txt      |  5 ++++-
 builtin/grep.c                  | 10 +++++++++-
 t/t7817-grep-sparse-checkout.sh | 34 +++++++++++++++++++++++++++------
 3 files changed, 41 insertions(+), 8 deletions(-)

diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index 58d944bd57..bdd3d5b8a6 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -28,7 +28,7 @@ SYNOPSIS
 	   [-f <file>] [-e] <pattern>
 	   [--and|--or|--not|(|)|-e <pattern>...]
 	   [--recurse-submodules] [--parent-basename <basename>]
-	   [ [--[no-]exclude-standard] [--cached | --no-index | --untracked] | <tree>...]
+	   [ [--[no-]exclude-standard] [--cached [--sparse] | --no-index | --untracked] | <tree>...]
 	   [--] [<pathspec>...]
 
 DESCRIPTION
@@ -45,6 +45,9 @@ OPTIONS
 	Instead of searching tracked files in the working tree, search
 	blobs registered in the index file.
 
+--sparse::
+	Use with --cached. Search outside of sparse-checkout definition.
+
 --no-index::
 	Search files in the current directory that is not managed by Git.
 
diff --git a/builtin/grep.c b/builtin/grep.c
index e6bcdf860c..12abd832fa 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -96,6 +96,8 @@ static pthread_cond_t cond_result;
 
 static int skip_first_line;
 
+static int grep_sparse = 0;
+
 static void add_work(struct grep_opt *opt, struct grep_source *gs)
 {
 	if (opt->binary != GREP_BINARY_TEXT)
@@ -525,7 +527,11 @@ static int grep_cache(struct grep_opt *opt,
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
 
-		if (!cached && ce_skip_worktree(ce))
+		/*
+		 * Skip entries with SKIP_WORKTREE unless both --sparse and
+		 * --cached are given.
+		 */
+		if (!(grep_sparse && cached) && ce_skip_worktree(ce))
 			continue;
 
 		strbuf_setlen(&name, name_base_len);
@@ -963,6 +969,8 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 			   PARSE_OPT_NOCOMPLETE),
 		OPT_INTEGER('m', "max-count", &opt.max_count,
 			N_("maximum number of results per file")),
+		OPT_BOOL(0, "sparse", &grep_sparse,
+			 N_("search the contents of files outside the sparse-checkout definition")),
 		OPT_END()
 	};
 	grep_prefix = prefix;
diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
index eb59564565..a9879cc980 100755
--- a/t/t7817-grep-sparse-checkout.sh
+++ b/t/t7817-grep-sparse-checkout.sh
@@ -118,13 +118,19 @@ test_expect_success 'grep searches unmerged file despite not matching sparsity p
 	test_cmp expect actual
 '
 
-test_expect_success 'grep --cached searches entries with the SKIP_WORKTREE bit' '
+test_expect_success 'grep --cached and --sparse searches entries with the SKIP_WORKTREE bit' '
+	cat >expect <<-EOF &&
+	a:text
+	EOF
+	git grep --cached "text" >actual &&
+	test_cmp expect actual &&
+
 	cat >expect <<-EOF &&
 	a:text
 	b:text
 	dir/c:text
 	EOF
-	git grep --cached "text" >actual &&
+	git grep --cached --sparse "text" >actual &&
 	test_cmp expect actual
 '
 
@@ -143,7 +149,15 @@ test_expect_success 'grep --recurse-submodules honors sparse checkout in submodu
 	test_cmp expect actual
 '
 
-test_expect_success 'grep --recurse-submodules --cached searches entries with the SKIP_WORKTREE bit' '
+test_expect_success 'grep --recurse-submodules --cached and --sparse searches entries with the SKIP_WORKTREE bit' '
+	cat >expect <<-EOF &&
+	a:text
+	sub/B/b:text
+	sub2/a:text
+	EOF
+	git grep --recurse-submodules --cached "text" >actual &&
+	test_cmp expect actual &&
+
 	cat >expect <<-EOF &&
 	a:text
 	b:text
@@ -152,7 +166,7 @@ test_expect_success 'grep --recurse-submodules --cached searches entries with th
 	sub/B/b:text
 	sub2/a:text
 	EOF
-	git grep --recurse-submodules --cached "text" >actual &&
+	git grep --recurse-submodules --cached --sparse "text" >actual &&
 	test_cmp expect actual
 '
 
@@ -166,7 +180,15 @@ test_expect_success 'working tree grep does not search the index with CE_VALID a
 	test_cmp expect actual
 '
 
-test_expect_success 'grep --cached searches index entries with both CE_VALID and SKIP_WORKTREE' '
+test_expect_success 'grep --cached and --sparse searches index entries with both CE_VALID and SKIP_WORKTREE' '
+	cat >expect <<-EOF &&
+	a:text
+	EOF
+	test_when_finished "git update-index --no-assume-unchanged b" &&
+	git update-index --assume-unchanged b &&
+	git grep --cached text >actual &&
+	test_cmp expect actual &&
+
 	cat >expect <<-EOF &&
 	a:text
 	b:text
@@ -174,7 +196,7 @@ test_expect_success 'grep --cached searches index entries with both CE_VALID and
 	EOF
 	test_when_finished "git update-index --no-assume-unchanged b" &&
 	git update-index --assume-unchanged b &&
-	git grep --cached text >actual &&
+	git grep --cached --sparse text >actual &&
 	test_cmp expect actual
 '
 
-- 
2.37.0



* [PATCH v2 2/2] builtin/grep.c: integrate with sparse index
  2022-08-29 23:28 ` [PATCH v2 " Shaoxuan Yuan
  2022-08-29 23:28   ` [PATCH v2 1/2] builtin/grep.c: add --sparse option Shaoxuan Yuan
@ 2022-08-29 23:28   ` Shaoxuan Yuan
  2022-08-30 13:45     ` Derrick Stolee
  1 sibling, 1 reply; 69+ messages in thread
From: Shaoxuan Yuan @ 2022-08-29 23:28 UTC (permalink / raw)
  To: git; +Cc: derrickstolee, vdye, Shaoxuan Yuan

Turn on sparse index and remove ensure_full_index().

Change it to only expands the index when using --sparse.

The p2000 tests demonstrate a ~99.4% execution time reduction for
`git grep` using a sparse index.

Test                                  Before       After
-----------------------------------------------------------------------------
git grep --cached bogus (full-v3)     0.019        0.018  (-5.2%)
git grep --cached bogus (full-v4)     0.017        0.016  (-5.8%)
git grep --cached bogus (sparse-v3)   0.29         0.0015 (-99.4%)
git grep --cached bogus (sparse-v4)   0.30         0.0018 (-99.4%)
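
For reference, the percentage column can be reproduced from the two time
columns. The helper below is only a sketch; `percent_change()` is a
hypothetical name and is not part of the perf tooling. Computed this
way, the table's values appear to be truncated rather than rounded to
one decimal place (for example, 0.019 -> 0.018 is about -5.26%).

```c
#include <assert.h>

/*
 * Relative change from `before` to `after`, in percent.
 * A negative result means the "after" time is shorter.
 */
static double percent_change(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}
```

For the sparse-v4 row, percent_change(0.30, 0.0018) is about -99.4.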

Optional reading about performance test results
-----------------------------------------------
Note that because `git-grep` needs to parse blobs in the index, the
index reading time is minuscule compared to the object parsing time.
Because of this, the p2000 test results cannot clearly reflect the
speedup for index reading: combined with the object parsing time,
the aggregate times for HEAD~1 and HEAD are extremely close.

Hence, the results presented here are not directly extracted from the
p2000 test results. Instead, to make the performance difference more
visible, the test command was run manually with GIT_TRACE2_PERF in the
four repos (full-v3, sparse-v3, full-v4, sparse-v4). The numbers here
are the time differences between the "region_enter" and
"region_leave" events of the "do_read_index" label.

Helped-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>
---
 builtin/grep.c                           | 10 ++++++++--
 t/perf/p2000-sparse-operations.sh        |  1 +
 t/t1092-sparse-checkout-compatibility.sh | 18 ++++++++++++++++++
 3 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index 12abd832fa..a0b4dbc1dc 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -522,8 +522,9 @@ static int grep_cache(struct grep_opt *opt,
 	if (repo_read_index(repo) < 0)
 		die(_("index file corrupt"));
 
-	/* TODO: audit for interaction with sparse-index. */
-	ensure_full_index(repo->index);
+	if (grep_sparse)
+		ensure_full_index(repo->index);
+
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
 
@@ -992,6 +993,11 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 			     PARSE_OPT_KEEP_DASHDASH |
 			     PARSE_OPT_STOP_AT_NON_OPTION);
 
+	if (the_repository->gitdir) {
+		prepare_repo_settings(the_repository);
+		the_repository->settings.command_requires_full_index = 0;
+	}
+
 	if (use_index && !startup_info->have_repository) {
 		int fallback = 0;
 		git_config_get_bool("grep.fallbacktonoindex", &fallback);
diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
index fce8151d41..9a466fcbbe 100755
--- a/t/perf/p2000-sparse-operations.sh
+++ b/t/perf/p2000-sparse-operations.sh
@@ -124,5 +124,6 @@ test_perf_on_all git read-tree -mu HEAD
 test_perf_on_all git checkout-index -f --all
 test_perf_on_all git update-index --add --remove $SPARSE_CONE/a
 test_perf_on_all "git rm -f $SPARSE_CONE/a && git checkout HEAD -- $SPARSE_CONE/a"
+test_perf_on_all git grep --cached bogus
 
 test_done
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index a6a14c8a21..270b47840b 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -1972,4 +1972,22 @@ test_expect_success 'sparse index is not expanded: rm' '
 	ensure_not_expanded rm -r deep
 '
 
+test_expect_success 'grep with --sparse and --cached' '
+	init_repos &&
+
+	test_all_match git grep --sparse --cached a &&
+	test_all_match git grep --sparse --cached a -- "folder1/*"
+'
+
+test_expect_success 'grep is not expanded' '
+	init_repos &&
+
+	ensure_not_expanded grep a &&
+	ensure_not_expanded grep a -- deep/* &&
+
+	# All files within the folder1/* pathspec are sparse,
+	# so this command does not find any matches
+	ensure_not_expanded ! grep a -- folder1/*
+'
+
 test_done
-- 
2.37.0



* Re: [PATCH v2 2/2] builtin/grep.c: integrate with sparse index
  2022-08-29 23:28   ` [PATCH v2 2/2] builtin/grep.c: integrate with sparse index Shaoxuan Yuan
@ 2022-08-30 13:45     ` Derrick Stolee
  0 siblings, 0 replies; 69+ messages in thread
From: Derrick Stolee @ 2022-08-30 13:45 UTC (permalink / raw)
  To: Shaoxuan Yuan, git; +Cc: vdye

On 8/29/2022 7:28 PM, Shaoxuan Yuan wrote:
> Turn on sparse index and remove ensure_full_index().
> 
> Change it to only expands the index when using --sparse.

s/expands/expand/

These two sentences should be combined, anyway.

  Enable the sparse index for 'git grep', and only call
  ensure_full_index() when the --sparse argument is provided.

> The p2000 tests demonstrate a ~99.4% execution time reduction for
> `git grep` using a sparse index.
> 
> Test                                  Before       After
> -----------------------------------------------------------------------------
> git grep --cached bogus (full-v3)     0.019        0.018  (-5.2%)
> git grep --cached bogus (full-v4)     0.017        0.016  (-5.8%)
> git grep --cached bogus (sparse-v3)   0.29         0.0015 (-99.4%)
> git grep --cached bogus (sparse-v4)   0.30         0.0018 (-99.4%)

Last time I asked that you not present this in a way that looks like
a performance test, to make it clear that it is not the end-to-end
process time. You removed the test numbers, but it still looks
like end-to-end process time, and you only elaborate after the table.

Instead, you can prepare the reader before the table using
something like this:

  The p2000 tests do not demonstrate a significant improvement,
  because the index read is a small portion of the full process
  time, compared to the blob parsing. The times below reflect the
  time spent in the "do_read_index" trace region as shown using
  GIT_TRACE2_PERF=1.

> -	/* TODO: audit for interaction with sparse-index. */
> -	ensure_full_index(repo->index);
> +	if (grep_sparse)
> +		ensure_full_index(repo->index);
> +

As we've discussed, there are ways to remove even this call, but
that shouldn't hold up this series which is already an improvement.
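
As a rough illustration of what the hunk above changes, here is a toy
model of when the index ends up expanded before the grep_cache() loop.
`toy_settings` and `index_is_expanded()` are hypothetical names, and
the first branch rests on the assumption that a command which still
requires a full index gets one when the index is read.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for the repository settings. */
struct toy_settings {
	int command_requires_full_index;
};

/*
 * Returns 1 if this toy model would expand a sparse index before the
 * grep_cache() loop runs, 0 if sparse directory entries are kept.
 */
static int index_is_expanded(const struct toy_settings *s, int grep_sparse)
{
	assert(s != NULL);

	/* Assumption: a command requiring a full index gets one on read. */
	if (s->command_requires_full_index)
		return 1;

	/* With this patch: ensure_full_index() is called only for --sparse. */
	return grep_sparse != 0;
}
```

In this model the remaining expansion is the grep_sparse case, which is
the call that could still be removed later.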

Thanks,
-Stolee


* [PATCH v3 0/3] grep: integrate with sparse index
  2022-08-17  7:56 [PATCH v1 0/2] grep: integrate with sparse index Shaoxuan Yuan
                   ` (3 preceding siblings ...)
  2022-08-29 23:28 ` [PATCH v2 " Shaoxuan Yuan
@ 2022-09-01  4:57 ` Shaoxuan Yuan
  2022-09-01  4:57   ` [PATCH v3 1/3] builtin/grep.c: add --sparse option Shaoxuan Yuan
                     ` (2 more replies)
  2022-09-03  0:36 ` [PATCH v4 0/3] grep: integrate with sparse index Shaoxuan Yuan
                   ` (2 subsequent siblings)
  7 siblings, 3 replies; 69+ messages in thread
From: Shaoxuan Yuan @ 2022-09-01  4:57 UTC (permalink / raw)
  To: git; +Cc: derrickstolee, vdye, Shaoxuan Yuan

Integrate `git-grep` with sparse-index and test the performance
improvement.

Changes since v2
----------------

* Modify the commit message for "builtin/grep.c: integrate with sparse
  index" to make it obvious that the perf test results are not from
  p2000 tests, but from manual perf runs.

* Add tree-walking logic as an extra (third) patch to improve
  performance when --sparse is used. This resolves the leftover bit
  from v2 [1].

[1] https://lore.kernel.org/git/20220829232843.183711-1-shaoxuan.yuan02@gmail.com/

Changes since v1
----------------

* Rewrite the commit message for "builtin/grep.c: add --sparse option"
  to be clearer.

* Update the documentation (both in-code and man page) for --sparse.

* Add a few tests to test the new behavior (when _only_ --cached is
  supplied).

* Reformat the perf test results so they do not look like they come
  directly from the p2000 tests.

* Put the "command_requires_full_index" lines right after parse_options().

* Add a pathspec test in t1092, and reword some test titles and comments.

Shaoxuan Yuan (3):
  builtin/grep.c: add --sparse option
  builtin/grep.c: integrate with sparse index
  builtin/grep.c: walking tree instead of expanding index with --sparse

 Documentation/git-grep.txt               |  5 ++-
 builtin/grep.c                           | 46 +++++++++++++++++++++---
 t/perf/p2000-sparse-operations.sh        |  1 +
 t/t1092-sparse-checkout-compatibility.sh | 18 ++++++++++
 t/t7817-grep-sparse-checkout.sh          | 34 ++++++++++++++----
 5 files changed, 92 insertions(+), 12 deletions(-)

Range-diff against v2:
1:  ab5ff488a1 = 1:  db1f5a5409 builtin/grep.c: add --sparse option
2:  68c7ecee73 ! 2:  af566c7862 builtin/grep.c: integrate with sparse index
    @@ Commit message
     
         Turn on sparse index and remove ensure_full_index().
     
    -    Change it to only expands the index when using --sparse.
    +    Change it to only expand the index when using --sparse.
     
    -    The p2000 tests demonstrate a ~99.4% execution time reduction for
    +    The p2000 tests do not demonstrate a significant improvement,
    +    because the index read is a small portion of the full process
    +    time, compared to the blob parsing. The times below reflect the
    +    time spent in the "do_read_index" trace region as shown using
    +    GIT_TRACE2_PERF=1.
    +
    +    The tests demonstrate a ~99.4% execution time reduction for
         `git grep` using a sparse index.
     
    -    Test                                  Before       After
    +    Test                                  HEAD~        HEAD
         -----------------------------------------------------------------------------
         git grep --cached bogus (full-v3)     0.019        0.018  (-5.2%)
         git grep --cached bogus (full-v4)     0.017        0.016  (-5.8%)
    @@ builtin/grep.c: int cmd_grep(int argc, const char **argv, const char *prefix)
      		int fallback = 0;
      		git_config_get_bool("grep.fallbacktonoindex", &fallback);
     
    - ## t/perf/p2000-sparse-operations.sh ##
    -@@ t/perf/p2000-sparse-operations.sh: test_perf_on_all git read-tree -mu HEAD
    - test_perf_on_all git checkout-index -f --all
    - test_perf_on_all git update-index --add --remove $SPARSE_CONE/a
    - test_perf_on_all "git rm -f $SPARSE_CONE/a && git checkout HEAD -- $SPARSE_CONE/a"
    -+test_perf_on_all git grep --cached bogus
    - 
    - test_done
    -
      ## t/t1092-sparse-checkout-compatibility.sh ##
     @@ t/t1092-sparse-checkout-compatibility.sh: test_expect_success 'sparse index is not expanded: rm' '
      	ensure_not_expanded rm -r deep
-:  ---------- > 3:  757ac7ddee builtin/grep.c: walking tree instead of expanding index with --sparse

base-commit: d42b38dfb5edf1a7fddd9542d722f91038407819
-- 
2.37.0



* [PATCH v3 1/3] builtin/grep.c: add --sparse option
  2022-09-01  4:57 ` [PATCH v3 0/3] grep: " Shaoxuan Yuan
@ 2022-09-01  4:57   ` Shaoxuan Yuan
  2022-09-01  4:57   ` [PATCH v3 2/3] builtin/grep.c: integrate with sparse index Shaoxuan Yuan
  2022-09-01  4:57   ` [PATCH v3 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse Shaoxuan Yuan
  2 siblings, 0 replies; 69+ messages in thread
From: Shaoxuan Yuan @ 2022-09-01  4:57 UTC (permalink / raw)
  To: git; +Cc: derrickstolee, vdye, Shaoxuan Yuan

Add a --sparse option to `git-grep`.

When the '--cached' option is used with the 'git grep' command, the
search is limited to the blobs found in the index, not in the worktree.
If the user has enabled sparse-checkout, this might present more results
than they would like, since the files outside of the sparse-checkout are
unlikely to be important to them.

Change the default behavior of 'git grep' to focus on the files within
the sparse-checkout definition. To enable the previous behavior, add a
'--sparse' option to 'git grep' that triggers the old behavior that
inspects paths outside of the sparse-checkout definition when paired
with the '--cached' option.

Helped-by: Derrick Stolee <derrickstolee@github.com>
Suggested-by: Victoria Dye <vdye@github.com>
Signed-off-by: Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>
---
 Documentation/git-grep.txt      |  5 ++++-
 builtin/grep.c                  | 10 +++++++++-
 t/t7817-grep-sparse-checkout.sh | 34 +++++++++++++++++++++++++++------
 3 files changed, 41 insertions(+), 8 deletions(-)

diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index 58d944bd57..bdd3d5b8a6 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -28,7 +28,7 @@ SYNOPSIS
 	   [-f <file>] [-e] <pattern>
 	   [--and|--or|--not|(|)|-e <pattern>...]
 	   [--recurse-submodules] [--parent-basename <basename>]
-	   [ [--[no-]exclude-standard] [--cached | --no-index | --untracked] | <tree>...]
+	   [ [--[no-]exclude-standard] [--cached [--sparse] | --no-index | --untracked] | <tree>...]
 	   [--] [<pathspec>...]
 
 DESCRIPTION
@@ -45,6 +45,9 @@ OPTIONS
 	Instead of searching tracked files in the working tree, search
 	blobs registered in the index file.
 
+--sparse::
+	Use with --cached. Search outside of sparse-checkout definition.
+
 --no-index::
 	Search files in the current directory that is not managed by Git.
 
diff --git a/builtin/grep.c b/builtin/grep.c
index e6bcdf860c..12abd832fa 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -96,6 +96,8 @@ static pthread_cond_t cond_result;
 
 static int skip_first_line;
 
+static int grep_sparse = 0;
+
 static void add_work(struct grep_opt *opt, struct grep_source *gs)
 {
 	if (opt->binary != GREP_BINARY_TEXT)
@@ -525,7 +527,11 @@ static int grep_cache(struct grep_opt *opt,
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
 
-		if (!cached && ce_skip_worktree(ce))
+		/*
+		 * Skip entries with SKIP_WORKTREE unless both --sparse and
+		 * --cached are given.
+		 */
+		if (!(grep_sparse && cached) && ce_skip_worktree(ce))
 			continue;
 
 		strbuf_setlen(&name, name_base_len);
@@ -963,6 +969,8 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 			   PARSE_OPT_NOCOMPLETE),
 		OPT_INTEGER('m', "max-count", &opt.max_count,
 			N_("maximum number of results per file")),
+		OPT_BOOL(0, "sparse", &grep_sparse,
+			 N_("search the contents of files outside the sparse-checkout definition")),
 		OPT_END()
 	};
 	grep_prefix = prefix;
diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
index eb59564565..a9879cc980 100755
--- a/t/t7817-grep-sparse-checkout.sh
+++ b/t/t7817-grep-sparse-checkout.sh
@@ -118,13 +118,19 @@ test_expect_success 'grep searches unmerged file despite not matching sparsity p
 	test_cmp expect actual
 '
 
-test_expect_success 'grep --cached searches entries with the SKIP_WORKTREE bit' '
+test_expect_success 'grep --cached and --sparse searches entries with the SKIP_WORKTREE bit' '
+	cat >expect <<-EOF &&
+	a:text
+	EOF
+	git grep --cached "text" >actual &&
+	test_cmp expect actual &&
+
 	cat >expect <<-EOF &&
 	a:text
 	b:text
 	dir/c:text
 	EOF
-	git grep --cached "text" >actual &&
+	git grep --cached --sparse "text" >actual &&
 	test_cmp expect actual
 '
 
@@ -143,7 +149,15 @@ test_expect_success 'grep --recurse-submodules honors sparse checkout in submodu
 	test_cmp expect actual
 '
 
-test_expect_success 'grep --recurse-submodules --cached searches entries with the SKIP_WORKTREE bit' '
+test_expect_success 'grep --recurse-submodules --cached and --sparse searches entries with the SKIP_WORKTREE bit' '
+	cat >expect <<-EOF &&
+	a:text
+	sub/B/b:text
+	sub2/a:text
+	EOF
+	git grep --recurse-submodules --cached "text" >actual &&
+	test_cmp expect actual &&
+
 	cat >expect <<-EOF &&
 	a:text
 	b:text
@@ -152,7 +166,7 @@ test_expect_success 'grep --recurse-submodules --cached searches entries with th
 	sub/B/b:text
 	sub2/a:text
 	EOF
-	git grep --recurse-submodules --cached "text" >actual &&
+	git grep --recurse-submodules --cached --sparse "text" >actual &&
 	test_cmp expect actual
 '
 
@@ -166,7 +180,15 @@ test_expect_success 'working tree grep does not search the index with CE_VALID a
 	test_cmp expect actual
 '
 
-test_expect_success 'grep --cached searches index entries with both CE_VALID and SKIP_WORKTREE' '
+test_expect_success 'grep --cached and --sparse searches index entries with both CE_VALID and SKIP_WORKTREE' '
+	cat >expect <<-EOF &&
+	a:text
+	EOF
+	test_when_finished "git update-index --no-assume-unchanged b" &&
+	git update-index --assume-unchanged b &&
+	git grep --cached text >actual &&
+	test_cmp expect actual &&
+
 	cat >expect <<-EOF &&
 	a:text
 	b:text
@@ -174,7 +196,7 @@ test_expect_success 'grep --cached searches index entries with both CE_VALID and
 	EOF
 	test_when_finished "git update-index --no-assume-unchanged b" &&
 	git update-index --assume-unchanged b &&
-	git grep --cached text >actual &&
+	git grep --cached --sparse text >actual &&
 	test_cmp expect actual
 '
 
-- 
2.37.0



* [PATCH v3 2/3] builtin/grep.c: integrate with sparse index
  2022-09-01  4:57 ` [PATCH v3 0/3] grep: " Shaoxuan Yuan
  2022-09-01  4:57   ` [PATCH v3 1/3] builtin/grep.c: add --sparse option Shaoxuan Yuan
@ 2022-09-01  4:57   ` Shaoxuan Yuan
  2022-09-01  4:57   ` [PATCH v3 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse Shaoxuan Yuan
  2 siblings, 0 replies; 69+ messages in thread
From: Shaoxuan Yuan @ 2022-09-01  4:57 UTC (permalink / raw)
  To: git; +Cc: derrickstolee, vdye, Shaoxuan Yuan

Turn on sparse index and remove ensure_full_index().

Change it to only expand the index when using --sparse.

The p2000 tests do not demonstrate a significant improvement,
because the index read is a small portion of the full process
time, compared to the blob parsing. The times below reflect the
time spent in the "do_read_index" trace region as shown using
GIT_TRACE2_PERF=1.

The tests demonstrate a ~99.4% execution time reduction for
`git grep` using a sparse index.

Test                                  HEAD~        HEAD
-----------------------------------------------------------------------------
git grep --cached bogus (full-v3)     0.019        0.018  (-5.2%)
git grep --cached bogus (full-v4)     0.017        0.016  (-5.8%)
git grep --cached bogus (sparse-v3)   0.29         0.0015 (-99.4%)
git grep --cached bogus (sparse-v4)   0.30         0.0018 (-99.4%)

Optional reading about performance test results
-----------------------------------------------
Note that because `git-grep` needs to parse blobs in the index, the
index reading time is minuscule compared to the object parsing time.
Because of this, the p2000 test results cannot clearly reflect the
speedup for index reading: combined with the object parsing time,
the aggregate times for HEAD~1 and HEAD are extremely close.

Hence, the results presented here are not directly extracted from the
p2000 test results. Instead, to make the performance difference more
visible, the test command was run manually with GIT_TRACE2_PERF in the
four repos (full-v3, sparse-v3, full-v4, sparse-v4). The numbers here
are the time differences between the "region_enter" and
"region_leave" events of the "do_read_index" label.

Helped-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>
---
 builtin/grep.c                           | 10 ++++++++--
 t/t1092-sparse-checkout-compatibility.sh | 18 ++++++++++++++++++
 2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index 12abd832fa..a0b4dbc1dc 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -522,8 +522,9 @@ static int grep_cache(struct grep_opt *opt,
 	if (repo_read_index(repo) < 0)
 		die(_("index file corrupt"));
 
-	/* TODO: audit for interaction with sparse-index. */
-	ensure_full_index(repo->index);
+	if (grep_sparse)
+		ensure_full_index(repo->index);
+
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
 
@@ -992,6 +993,11 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 			     PARSE_OPT_KEEP_DASHDASH |
 			     PARSE_OPT_STOP_AT_NON_OPTION);
 
+	if (the_repository->gitdir) {
+		prepare_repo_settings(the_repository);
+		the_repository->settings.command_requires_full_index = 0;
+	}
+
 	if (use_index && !startup_info->have_repository) {
 		int fallback = 0;
 		git_config_get_bool("grep.fallbacktonoindex", &fallback);
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 0302e36fd6..63becc3138 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -1972,4 +1972,22 @@ test_expect_success 'sparse index is not expanded: rm' '
 	ensure_not_expanded rm -r deep
 '
 
+test_expect_success 'grep with --sparse and --cached' '
+	init_repos &&
+
+	test_all_match git grep --sparse --cached a &&
+	test_all_match git grep --sparse --cached a -- "folder1/*"
+'
+
+test_expect_success 'grep is not expanded' '
+	init_repos &&
+
+	ensure_not_expanded grep a &&
+	ensure_not_expanded grep a -- deep/* &&
+
+	# All files within the folder1/* pathspec are sparse,
+	# so this command does not find any matches
+	ensure_not_expanded ! grep a -- folder1/*
+'
+
 test_done
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v3 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse
  2022-09-01  4:57 ` [PATCH v3 0/3] grep: " Shaoxuan Yuan
  2022-09-01  4:57   ` [PATCH v3 1/3] builtin/grep.c: add --sparse option Shaoxuan Yuan
  2022-09-01  4:57   ` [PATCH v3 2/3] builtin/grep.c: integrate with sparse index Shaoxuan Yuan
@ 2022-09-01  4:57   ` Shaoxuan Yuan
  2022-09-01 17:03     ` Derrick Stolee
                       ` (2 more replies)
  2 siblings, 3 replies; 69+ messages in thread
From: Shaoxuan Yuan @ 2022-09-01  4:57 UTC (permalink / raw)
  To: git; +Cc: derrickstolee, vdye, Shaoxuan Yuan

Before this patch, whenever --sparse is used, `git-grep` utilizes the
ensure_full_index() method to expand the index and search all the
entries. Because this method requires walking all the trees and
constructing the index, it is the slow part within the whole command.

To achieve better performance, this patch uses grep_tree() to search the
sparse directory entries and get rid of the ensure_full_index() method.

Why is grep_tree() a better choice than ensure_full_index()?

1) grep_tree() is as correct as ensure_full_index(). grep_tree() looks
   into every sparse-directory entry (represented by a tree) recursively
   when looping over the index, and the result of doing so matches the
   result of expanding the index.

2) grep_tree() utilizes pathspecs to limit the scope of searching.
   ensure_full_index() always expands the index when --sparse is used,
   which means it always walks all the trees and blobs in the repo,
   regardless of whether the user only wants a subset of the content,
   i.e. is using a pathspec. On the other hand, grep_tree() only
   searches the contents that match the pathspec, and thus possibly
   walks fewer trees.

3) grep_tree() does not construct and copy back a new index, while
   ensure_full_index() does. This also saves some time.
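
The pruning effect described in 2) can be sketched with a toy tree walk.
This is only an illustration under stated assumptions: `struct toy_node`,
`may_match()`, `toy_walk()` and the sample tree are hypothetical names made
up for this sketch, and a plain prefix comparison stands in for Git's real
pathspec matching.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical toy node; this is not Git's struct tree_desc. */
struct toy_node {
	const char *path;            /* full path; directories end with '/' */
	const struct toy_node *kids; /* children, or NULL for a blob */
	size_t nr_kids;
};

/* Sample layout (an assumption for this sketch): three top-level dirs. */
static const struct toy_node deep_kids[] = { { "deep/a", NULL, 0 } };
static const struct toy_node f1_kids[] = {
	{ "f1/a", NULL, 0 },
	{ "f1/b", NULL, 0 },
};
static const struct toy_node f2_kids[] = { { "f2/a", NULL, 0 } };
static const struct toy_node root_kids[] = {
	{ "deep/", deep_kids, 1 },
	{ "f1/", f1_kids, 2 },
	{ "f2/", f2_kids, 1 },
};
static const struct toy_node toy_root = { "", root_kids, 3 };

/*
 * Simplified stand-in for pathspec matching: a blob must start with the
 * prefix; a directory must agree with the prefix on their shared part,
 * so that directories leading to the prefix are still entered.
 */
static int may_match(const struct toy_node *n, const char *prefix)
{
	size_t pl = strlen(prefix), nl = strlen(n->path);

	if (n->kids && nl < pl)
		pl = nl;
	return !strncmp(n->path, prefix, pl);
}

/* Count the nodes visited while searching below 'prefix'. */
static size_t toy_walk(const struct toy_node *node, const char *prefix)
{
	size_t visited = 1, i;

	for (i = 0; i < node->nr_kids; i++) {
		const struct toy_node *kid = &node->kids[i];

		if (!may_match(kid, prefix))
			continue; /* prune: nothing below can match */
		visited += toy_walk(kid, prefix);
	}
	return visited;
}
```

With an empty prefix, toy_walk(&toy_root, "") visits every node, which is
the analogue of searching without a pathspec; with "f1/" it skips the
"deep/" and "f2/" subtrees entirely.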

----------------
Performance test

- Summary:

p2000 tests demonstrate a ~91% execution time reduction for
`git grep --cached --sparse <pattern> -- <pathspec>` using tree-walking
logic.

Test                                                                          HEAD~   HEAD
---------------------------------------------------------------------------------------------------
2000.78: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (full-v3)     0.11    0.09 (≈)
2000.79: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (full-v4)     0.08    0.09 (≈)
2000.80: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (sparse-v3)   0.44    0.04 (-90.9%)
2000.81: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (sparse-v4)   0.46    0.04 (-91.3%)

- Command used for testing:

	git grep --cached --sparse bogus -- f2/f1/f1/builtin/*

The reason for specifying a pathspec is that, if we don't specify a
pathspec, then grep_tree() will walk all the trees and blobs to find the
pattern, and the time consumed doing so is not too different from using
the original ensure_full_index() method, which also spends most of the
time walking trees. However, when a pathspec is specified, this latest
logic will only walk the area of trees enclosed by the pathspec, and the
time consumed is considerably less.

That is, if we don't specify a pathspec, the performance difference [1]
is quite small: both methods walk all the trees and take generally the
same amount of time (even with the index construction time included for
ensure_full_index()).

[1] Performance test result without pathspec:

	Test                                                    HEAD~  HEAD
	-----------------------------------------------------------------------------
	2000.78: git grep --cached --sparse bogus (full-v3)     6.17   5.19 (≈)
	2000.79: git grep --cached --sparse bogus (full-v4)     6.19   5.46 (≈)
	2000.80: git grep --cached --sparse bogus (sparse-v3)   6.57   6.44 (≈)
	2000.81: git grep --cached --sparse bogus (sparse-v4)   6.65   6.28 (≈)

Suggested-by: Derrick Stolee <derrickstolee@github.com>
Helped-by: Derrick Stolee <derrickstolee@github.com>
Helped-by: Victoria Dye <vdye@github.com>
Signed-off-by: Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>
---
 builtin/grep.c                    | 32 ++++++++++++++++++++++++++-----
 t/perf/p2000-sparse-operations.sh |  1 +
 2 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index a0b4dbc1dc..8c0edccd8e 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -522,9 +522,6 @@ static int grep_cache(struct grep_opt *opt,
 	if (repo_read_index(repo) < 0)
 		die(_("index file corrupt"));
 
-	if (grep_sparse)
-		ensure_full_index(repo->index);
-
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
 
@@ -537,8 +534,26 @@ static int grep_cache(struct grep_opt *opt,
 
 		strbuf_setlen(&name, name_base_len);
 		strbuf_addstr(&name, ce->name);
+		if (S_ISSPARSEDIR(ce->ce_mode)) {
+			enum object_type type;
+			struct tree_desc tree;
+			void *data;
+			unsigned long size;
+			struct strbuf base = STRBUF_INIT;
+
+			strbuf_addstr(&base, ce->name);
+
+			data = read_object_file(&ce->oid, &type, &size);
+			init_tree_desc(&tree, data, size);
 
-		if (S_ISREG(ce->ce_mode) &&
+			/*
+			 * sneak in the ce_mode using check_attr parameter
+			 */
+			hit |= grep_tree(opt, pathspec, &tree, &base,
+					 base.len, ce->ce_mode);
+			strbuf_release(&base);
+			free(data);
+		} else if (S_ISREG(ce->ce_mode) &&
 		    match_pathspec(repo->index, pathspec, name.buf, name.len, 0, NULL,
 				   S_ISDIR(ce->ce_mode) ||
 				   S_ISGITLINK(ce->ce_mode))) {
@@ -598,7 +613,14 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 		int te_len = tree_entry_len(&entry);
 
 		if (match != all_entries_interesting) {
-			strbuf_addstr(&name, base->buf + tn_len);
+			if (S_ISSPARSEDIR(check_attr)) {
+				// object is a sparse directory entry
+				strbuf_addbuf(&name, base);
+			} else {
+				// object is a commit or a root tree
+				strbuf_addstr(&name, base->buf + tn_len);
+			}
+
 			match = tree_entry_interesting(repo->index,
 						       &entry, &name,
 						       0, pathspec);
diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
index fce8151d41..a0b71bb3b4 100755
--- a/t/perf/p2000-sparse-operations.sh
+++ b/t/perf/p2000-sparse-operations.sh
@@ -124,5 +124,6 @@ test_perf_on_all git read-tree -mu HEAD
 test_perf_on_all git checkout-index -f --all
 test_perf_on_all git update-index --add --remove $SPARSE_CONE/a
 test_perf_on_all "git rm -f $SPARSE_CONE/a && git checkout HEAD -- $SPARSE_CONE/a"
+test_perf_on_all git grep --cached --sparse bogus -- "f2/f1/f1/builtin/*"
 
 test_done
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse
  2022-09-01  4:57   ` [PATCH v3 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse Shaoxuan Yuan
@ 2022-09-01 17:03     ` Derrick Stolee
  2022-09-01 18:31       ` Shaoxuan Yuan
  2022-09-01 17:17     ` Junio C Hamano
  2022-09-02  3:28     ` Victoria Dye
  2 siblings, 1 reply; 69+ messages in thread
From: Derrick Stolee @ 2022-09-01 17:03 UTC (permalink / raw)
  To: Shaoxuan Yuan, git; +Cc: vdye

On 9/1/2022 12:57 AM, Shaoxuan Yuan wrote: 
> Test                                                                          HEAD~   HEAD
> ---------------------------------------------------------------------------------------------------
> 2000.78: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (full-v3)     0.11    0.09 (≈)
> 2000.79: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (full-v4)     0.08    0.09 (≈)
> 2000.80: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (sparse-v3)   0.44    0.04 (-90.9%)
> 2000.81: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (sparse-v4)   0.46    0.04 (-91.3%)
> 
> - Command used for testing:
> 
> 	git grep --cached --sparse bogus -- f2/f1/f1/builtin/*

It's good to list this command after the table. It allows you to shrink
the table by using "...":

Test                                HEAD~   HEAD
---------------------------------------------------------
2000.78: git grep ... (full-v3)     0.11    0.09 (≈)
2000.79: git grep ... (full-v4)     0.08    0.09 (≈)
2000.80: git grep ... (sparse-v3)   0.44    0.04 (-90.9%)
2000.81: git grep ... (sparse-v4)   0.46    0.04 (-91.3%)

This saves horizontal space without losing clarity. The test numbers help,
too.

>  		strbuf_setlen(&name, name_base_len);
>  		strbuf_addstr(&name, ce->name);
> +		if (S_ISSPARSEDIR(ce->ce_mode)) {
> +			enum object_type type;
> +			struct tree_desc tree;
> +			void *data;
> +			unsigned long size;
> +			struct strbuf base = STRBUF_INIT;
> +
> +			strbuf_addstr(&base, ce->name);
> +
> +			data = read_object_file(&ce->oid, &type, &size);
> +			init_tree_desc(&tree, data, size);
>  
> -		if (S_ISREG(ce->ce_mode) &&
> +			/*
> +			 * sneak in the ce_mode using check_attr parameter
> +			 */
> +			hit |= grep_tree(opt, pathspec, &tree, &base,
> +					 base.len, ce->ce_mode);
> +			strbuf_release(&base);
> +			free(data);
> +		} else if (S_ISREG(ce->ce_mode) &&

I think this is a good setup for transitioning from the index scan
to the tree-walking grep_tree() method. Below, I recommend calling
the method slightly differently, though.

>  		    match_pathspec(repo->index, pathspec, name.buf, name.len, 0, NULL,
>  				   S_ISDIR(ce->ce_mode) ||
>  				   S_ISGITLINK(ce->ce_mode))) {
> @@ -598,7 +613,14 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>  		int te_len = tree_entry_len(&entry);
>  
>  		if (match != all_entries_interesting) {
> -			strbuf_addstr(&name, base->buf + tn_len);
> +			if (S_ISSPARSEDIR(check_attr)) {
> +				// object is a sparse directory entry
> +				strbuf_addbuf(&name, base);
> +			} else {
> +				// object is a commit or a root tree
> +				strbuf_addstr(&name, base->buf + tn_len);
> +			}
> +

I think this is abusing the check_attr too much, since this will also
trigger a different if branch further down the method.

These lines are the same if tn_len is zero, so will it suffice to pass
0 for that length? You are passing base.len when you call it, so maybe
that should be zero?
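
A minimal sketch of that equivalence, outside of Git: appending
`base + tn_len` with a zero offset yields the whole base string, which is
what strbuf_addbuf(&name, base) would append. `toy_match_name()` is a
hypothetical helper for this sketch, and the "HEAD:" prefix used below is
only an assumed example of a tree-name prefix that a nonzero tn_len would
skip.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical stand-in for the name-building step: write the part of
 * 'base' that starts at offset 'tn_len' into 'out', mirroring
 * strbuf_addstr(&name, base->buf + tn_len).
 */
static void toy_match_name(char *out, size_t outsz,
			   const char *base, size_t tn_len)
{
	snprintf(out, outsz, "%s", base + tn_len);
}
```

Calling toy_match_name(buf, sizeof(buf), "folder1/", 0) leaves "folder1/"
in buf, while passing strlen("folder1/") as the offset leaves an empty
string, so the pathspec would be matched against entry names with no
leading directory. With an assumed base of "HEAD:folder1/" and an offset
of strlen("HEAD:"), the result is again "folder1/".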

When I apply this change, all tests pass, so if there _is_ something
different between the two implementations, then it isn't covered by
tests:

diff --git a/builtin/grep.c b/builtin/grep.c
index 8c0edccd8e..fc4adf876a 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -549,8 +549,7 @@ static int grep_cache(struct grep_opt *opt,
 			/*
 			 * sneak in the ce_mode using check_attr parameter
 			 */
-			hit |= grep_tree(opt, pathspec, &tree, &base,
-					 base.len, ce->ce_mode);
+			hit |= grep_tree(opt, pathspec, &tree, &base, 0, 0);
 			strbuf_release(&base);
 			free(data);
 		} else if (S_ISREG(ce->ce_mode) &&
@@ -613,13 +612,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 		int te_len = tree_entry_len(&entry);
 
 		if (match != all_entries_interesting) {
-			if (S_ISSPARSEDIR(check_attr)) {
-				// object is a sparse directory entry
-				strbuf_addbuf(&name, base);
-			} else {
-				// object is a commit or a root tree
-				strbuf_addstr(&name, base->buf + tn_len);
-			}
+			strbuf_addstr(&name, base->buf + tn_len);
 
 			match = tree_entry_interesting(repo->index,
 						       &entry, &name,

> +test_perf_on_all git grep --cached --sparse bogus -- "f2/f1/f1/builtin/*"

We can't use this path in general, because we don't always run the test
using the Git repository as the test repo (see GIT_PERF_[LARGE_]REPO
variables in t/perf/README).

We _can_ however use the structure that we have implied in our construction,
which is to use a path that we know exists and is still outside of the
sparse-checkout cone. Truncating to "f2/f1/f1/*" is sufficient for this.

Modifying the test and running them on my machine, I get:

Test                               HEAD~1            HEAD
----------------------------------------------------------------------------
2000.78: git grep ... (full-v3)    0.19(0.72+0.18)   0.18(0.84+0.13) -5.3%  
2000.79: git grep ... (full-v4)    0.17(0.83+0.16)   0.19(0.84+0.14) +11.8% 
2000.80: git grep ... (sparse-v3)  0.35(1.02+0.13)   0.15(0.85+0.15) -57.1% 
2000.81: git grep ... (sparse-v4)  0.37(1.06+0.12)   0.15(0.89+0.15) -59.5%

So, it's still expensive to do the blob search over a wider pathspec than
the test as you designed it, but this will work for other repos, such as the
Linux kernel:

Test                                HEAD~1             HEAD
------------------------------------------------------------------------------
2000.78: git grep ... (full-v3)     3.16(19.37+2.55)   2.56(15.24+1.76) -19.0%
2000.79: git grep ... (full-v4)     2.97(17.84+2.00)   2.59(15.51+1.89) -12.8%
2000.80: git grep ... (sparse-v3)   8.39(24.74+2.34)   2.13(16.03+1.72) -74.6%
2000.81: git grep ... (sparse-v4)   8.39(24.73+2.40)   2.16(16.14+1.90) -74.3%

Thanks,
-Stolee

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse
  2022-09-01  4:57   ` [PATCH v3 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse Shaoxuan Yuan
  2022-09-01 17:03     ` Derrick Stolee
@ 2022-09-01 17:17     ` Junio C Hamano
  2022-09-01 17:27       ` Junio C Hamano
  2022-09-01 22:36       ` Shaoxuan Yuan
  2022-09-02  3:28     ` Victoria Dye
  2 siblings, 2 replies; 69+ messages in thread
From: Junio C Hamano @ 2022-09-01 17:17 UTC (permalink / raw)
  To: Shaoxuan Yuan; +Cc: git, derrickstolee, vdye

Shaoxuan Yuan <shaoxuan.yuan02@gmail.com> writes:

> Before this patch, whenever --sparse is used, `git-grep` utilizes the
> ensure_full_index() method to expand the index and search all the
> entries. Because this method requires walking all the trees and
> constructing the index, it is the slow part within the whole command.
>
> To achieve better performance, this patch uses grep_tree() to search the
> sparse directory entries and get rid of the ensure_full_index() method.

When you encounter a "sparsedir" (i.e. a tree recorded in index),
you should know the path leading to that directory. Even though I no
longer remember the details of the implementations of grep_$where()
which I did long time ago, I think grep_tree() should know how to
pass the leading path down, as that is the most natural way to
implement the recursive behaviour.  This patch should be able to
piggyback on that.

> @@ -537,8 +534,26 @@ static int grep_cache(struct grep_opt *opt,
>  
>  		strbuf_setlen(&name, name_base_len);
>  		strbuf_addstr(&name, ce->name);
> +		if (S_ISSPARSEDIR(ce->ce_mode)) {
> +			enum object_type type;
> +			struct tree_desc tree;
> +			void *data;
> +			unsigned long size;
> +			struct strbuf base = STRBUF_INIT;
> +
> +			strbuf_addstr(&base, ce->name);
> +
> +			data = read_object_file(&ce->oid, &type, &size);
> +			init_tree_desc(&tree, data, size);
>  
> +			/*
> +			 * sneak in the ce_mode using check_attr parameter
> +			 */
> +			hit |= grep_tree(opt, pathspec, &tree, &base,
> +					 base.len, ce->ce_mode);

OK.  Instead of inventing a new "base" strbuf, we could reuse the
existing name while running grep_tree() and restore it after it
returns, and I suspect that the end result would be more in line
with how grep_cache() uses that "name" buffer for all the cache
entries.  But that is not a correctness issue (it is more about
preventing the code from getting worse).
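
A rough sketch of that reuse-and-restore pattern, assuming a fixed-size
buffer in place of Git's growable strbuf: `struct toy_buf`, `toy_addstr()`,
`toy_setlen()` and `toy_descend()` are hypothetical names for this sketch
and only mimic what strbuf_addstr() and strbuf_setlen() are used for here.

```c
#include <string.h>

/* Hypothetical fixed-size path buffer; Git's struct strbuf grows. */
struct toy_buf {
	char buf[256];
	size_t len;
};

/* Append 's', truncating instead of overflowing the toy buffer. */
static void toy_addstr(struct toy_buf *sb, const char *s)
{
	size_t n = strlen(s);

	if (sb->len + n >= sizeof(sb->buf))
		n = sizeof(sb->buf) - 1 - sb->len;
	memcpy(sb->buf + sb->len, s, n);
	sb->len += n;
	sb->buf[sb->len] = '\0';
}

/* Cut the buffer back to 'len' bytes; the caller keeps len <= sb->len. */
static void toy_setlen(struct toy_buf *sb, size_t len)
{
	sb->len = len;
	sb->buf[len] = '\0';
}

/*
 * Stand-in for a recursive walk: it appends to the shared buffer while
 * working and trims it back before returning, so the caller sees the
 * buffer unchanged. Returns the length of the longest path it built.
 */
static size_t toy_descend(struct toy_buf *name)
{
	size_t old_len = name->len;
	size_t seen;

	toy_addstr(name, "child");
	seen = name->len;
	toy_setlen(name, old_len);
	return seen;
}
```

The caller would append the entry name to the shared buffer, descend, and
then cut the buffer back to its base length for the next cache entry, so
no separate "base" buffer has to be allocated and released per entry.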

> @@ -598,7 +613,14 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>  		int te_len = tree_entry_len(&entry);
>  
>  		if (match != all_entries_interesting) {
> -			strbuf_addstr(&name, base->buf + tn_len);
> +			if (S_ISSPARSEDIR(check_attr)) {
> +				// object is a sparse directory entry

No // comments, please.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse
  2022-09-01 17:17     ` Junio C Hamano
@ 2022-09-01 17:27       ` Junio C Hamano
  2022-09-01 22:49         ` Shaoxuan Yuan
  2022-09-01 22:36       ` Shaoxuan Yuan
  1 sibling, 1 reply; 69+ messages in thread
From: Junio C Hamano @ 2022-09-01 17:27 UTC (permalink / raw)
  To: Shaoxuan Yuan; +Cc: git, derrickstolee, vdye

Junio C Hamano <gitster@pobox.com> writes:

> Shaoxuan Yuan <shaoxuan.yuan02@gmail.com> writes:
>
>> Before this patch, whenever --sparse is used, `git-grep` utilizes the
>> ensure_full_index() method to expand the index and search all the
>> entries. Because this method requires walking all the trees and
>> constructing the index, it is the slow part within the whole command.
>>
>> To achieve better performance, this patch uses grep_tree() to search the
>> sparse directory entries and get rid of the ensure_full_index() method.
>
> When you encounter a "sparsedir" (i.e. a tree recorded in index),
> you should know the path leading to that directory. Even though I no
> longer remember the details of the implementations of grep_$where()
> which I did long time ago, I think grep_tree() should know how to
> pass the leading path down, as that is the most natural way to
> implement the recursive behaviour.  This patch should be able to
> piggyback on that.

To avoid unnecessary scare, the above is just me "thinking aloud",
after reading the proposed log message, and agreeing with the
direction taken by this patch.  Not giving a suggestion to go
different route or anything like that.  I should have said "OK" or
something at the end of the paragraph.

Thanks for working on this topic.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse
  2022-09-01 17:03     ` Derrick Stolee
@ 2022-09-01 18:31       ` Shaoxuan Yuan
  0 siblings, 0 replies; 69+ messages in thread
From: Shaoxuan Yuan @ 2022-09-01 18:31 UTC (permalink / raw)
  To: Derrick Stolee, git; +Cc: vdye

On 9/1/2022 10:03 AM, Derrick Stolee wrote:
 > On 9/1/2022 12:57 AM, Shaoxuan Yuan wrote:
 >> Test                                                                          HEAD~   HEAD
 >> ---------------------------------------------------------------------------------------------------
 >> 2000.78: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (full-v3)     0.11    0.09 (≈)
 >> 2000.79: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (full-v4)     0.08    0.09 (≈)
 >> 2000.80: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (sparse-v3)   0.44    0.04 (-90.9%)
 >> 2000.81: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (sparse-v4)   0.46    0.04 (-91.3%)
 >>
 >> - Command used for testing:
 >>
 >>     git grep --cached --sparse bogus -- f2/f1/f1/builtin/*
 >
 > It's good to list this command after the table. It allows you to shrink
 > the table by using "...":

OK.

 >
 > Test                                HEAD~   HEAD
 > ---------------------------------------------------------
 > 2000.78: git grep ... (full-v3)     0.11    0.09 (≈)
 > 2000.79: git grep ... (full-v4)     0.08    0.09 (≈)
 > 2000.80: git grep ... (sparse-v3)   0.44    0.04 (-90.9%)
 > 2000.81: git grep ... (sparse-v4)   0.46    0.04 (-91.3%)
 >
 > This saves horizontal space without losing clarity. The test numbers help,
 > too.
 >
 >>          strbuf_setlen(&name, name_base_len);
 >>          strbuf_addstr(&name, ce->name);
 >> +        if (S_ISSPARSEDIR(ce->ce_mode)) {
 >> +            enum object_type type;
 >> +            struct tree_desc tree;
 >> +            void *data;
 >> +            unsigned long size;
 >> +            struct strbuf base = STRBUF_INIT;
 >> +
 >> +            strbuf_addstr(&base, ce->name);
 >> +
 >> +            data = read_object_file(&ce->oid, &type, &size);
 >> +            init_tree_desc(&tree, data, size);
 >>
 >> -        if (S_ISREG(ce->ce_mode) &&
 >> +            /*
 >> +             * sneak in the ce_mode using check_attr parameter
 >> +             */
 >> +            hit |= grep_tree(opt, pathspec, &tree, &base,
 >> +                     base.len, ce->ce_mode);
 >> +            strbuf_release(&base);
 >> +            free(data);
 >> +        } else if (S_ISREG(ce->ce_mode) &&
 >
 > I think this is a good setup for transitioning from the index scan
 > to the tree-walking grep_tree() method. Below, I recommend calling
 > the method slightly differently, though.
 >
 >>              match_pathspec(repo->index, pathspec, name.buf, name.len, 0, NULL,
 >>                     S_ISDIR(ce->ce_mode) ||
 >>                     S_ISGITLINK(ce->ce_mode))) {
 >> @@ -598,7 +613,14 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 >>          int te_len = tree_entry_len(&entry);
 >>
 >>          if (match != all_entries_interesting) {
 >> -            strbuf_addstr(&name, base->buf + tn_len);
 >> +            if (S_ISSPARSEDIR(check_attr)) {
 >> +                // object is a sparse directory entry
 >> +                strbuf_addbuf(&name, base);
 >> +            } else {
 >> +                // object is a commit or a root tree
 >> +                strbuf_addstr(&name, base->buf + tn_len);
 >> +            }
 >> +
 >
 > I think this is abusing the check_attr too much, since this will also
 > trigger a different if branch further down the method.

Yeah that's why I wrote "sneak in".

 > These lines are the same if tn_len is zero, so will it suffice to pass
 > 0 for that length? You are passing base.len when you call it, so maybe
 > that should be zero?

Agree.

 > When I apply this change, all tests pass, so if there _is_ something
 > different between the two implementations, then it isn't covered by
 > tests:

I think there is no difference between these two implementations,
at least as far as I intended.

 > diff --git a/builtin/grep.c b/builtin/grep.c
 > index 8c0edccd8e..fc4adf876a 100644
 > --- a/builtin/grep.c
 > +++ b/builtin/grep.c
 > @@ -549,8 +549,7 @@ static int grep_cache(struct grep_opt *opt,
 >              /*
 >               * sneak in the ce_mode using check_attr parameter
 >               */
 > -            hit |= grep_tree(opt, pathspec, &tree, &base,
 > -                     base.len, ce->ce_mode);
 > +            hit |= grep_tree(opt, pathspec, &tree, &base, 0, 0);
 >              strbuf_release(&base);
 >              free(data);
 >          } else if (S_ISREG(ce->ce_mode) &&
 > @@ -613,13 +612,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 >          int te_len = tree_entry_len(&entry);
 >
 >          if (match != all_entries_interesting) {
 > -            if (S_ISSPARSEDIR(check_attr)) {
 > -                // object is a sparse directory entry
 > -                strbuf_addbuf(&name, base);
 > -            } else {
 > -                // object is a commit or a root tree
 > -                strbuf_addstr(&name, base->buf + tn_len);
 > -            }
 > +            strbuf_addstr(&name, base->buf + tn_len);
 >
 >              match = tree_entry_interesting(repo->index,
 >                                 &entry, &name,
 >
 >> +test_perf_on_all git grep --cached --sparse bogus -- "f2/f1/f1/builtin/*"
 >
 > We can't use this path in general, because we don't always run the test
 > using the Git repository as the test repo (see GIT_PERF_[LARGE_]REPO
 > variables in t/perf/README).
 >
 > We _can_ however use the structure that we have implied in our construction,
 > which is to use a path that we know exists and is still outside of the
 > sparse-checkout cone. Truncating to "f2/f1/f1/*" is sufficient for this.

OK.

 > Modifying the test and running them on my machine, I get:
 >
 > Test                               HEAD~1            HEAD
 > ----------------------------------------------------------------------------
 > 2000.78: git grep ... (full-v3)    0.19(0.72+0.18)   0.18(0.84+0.13) -5.3%
 > 2000.79: git grep ... (full-v4)    0.17(0.83+0.16)   0.19(0.84+0.14) +11.8%
 > 2000.80: git grep ... (sparse-v3)  0.35(1.02+0.13)   0.15(0.85+0.15) -57.1%
 > 2000.81: git grep ... (sparse-v4)  0.37(1.06+0.12)   0.15(0.89+0.15) -59.5%
 >
 > So, it's still expensive to do the blob search over a wider pathspec than
 > the test as you designed it, but this will work for other repos, such as the
 > Linux kernel:

Yes, I was trying to use a narrower pathspec to show a difference that
looks better.

 > Test                                HEAD~1             HEAD
 > ------------------------------------------------------------------------------
 > 2000.78: git grep ... (full-v3)     3.16(19.37+2.55)   2.56(15.24+1.76) -19.0%
 > 2000.79: git grep ... (full-v4)     2.97(17.84+2.00)   2.59(15.51+1.89) -12.8%
 > 2000.80: git grep ... (sparse-v3)   8.39(24.74+2.34)   2.13(16.03+1.72) -74.6%
 > 2000.81: git grep ... (sparse-v4)   8.39(24.73+2.40)   2.16(16.14+1.90) -74.3%
 >
 > Thanks,
 > -Stolee

Thanks,
Shaoxuan



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse
  2022-09-01 17:17     ` Junio C Hamano
  2022-09-01 17:27       ` Junio C Hamano
@ 2022-09-01 22:36       ` Shaoxuan Yuan
  1 sibling, 0 replies; 69+ messages in thread
From: Shaoxuan Yuan @ 2022-09-01 22:36 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, derrickstolee, vdye

On 9/1/2022 10:17 AM, Junio C Hamano wrote:
 > Shaoxuan Yuan <shaoxuan.yuan02@gmail.com> writes:
 >
 >> Before this patch, whenever --sparse is used, `git-grep` utilizes the
 >> ensure_full_index() method to expand the index and search all the
 >> entries. Because this method requires walking all the trees and
 >> constructing the index, it is the slow part within the whole command.
 >>
 >> To achieve better performance, this patch uses grep_tree() to search the
 >> sparse directory entries and get rid of the ensure_full_index() method.
 >
 > When you encounter a "sparsedir" (i.e. a tree recorded in index),
 > you should know the path leading to that directory. Even though I no
 > longer remember the details of the implementations of grep_$where()
 > which I did long time ago, I think grep_tree() should know how to
 > pass the leading path down, as that is the most natural way to
 > implement the recursive behaviour.  This patch should be able to
 > piggyback on that.

Yes, though this commit [1] from 6 years ago started to assume that
grep_tree() only accepts a root tree or a commit, so the function fails
to process a tree like "sparsedir". It is the pathspec matching base
that was messed up. Support for a tree that is not at the root level is
added in this series.

[1] 74ed43711fd1cd7ce155d338f87ebe52cb74d9e2

 >> @@ -537,8 +534,26 @@ static int grep_cache(struct grep_opt *opt,
 >>
 >>          strbuf_setlen(&name, name_base_len);
 >>          strbuf_addstr(&name, ce->name);
 >> +        if (S_ISSPARSEDIR(ce->ce_mode)) {
 >> +            enum object_type type;
 >> +            struct tree_desc tree;
 >> +            void *data;
 >> +            unsigned long size;
 >> +            struct strbuf base = STRBUF_INIT;
 >> +
 >> +            strbuf_addstr(&base, ce->name);
 >> +
 >> +            data = read_object_file(&ce->oid, &type, &size);
 >> +            init_tree_desc(&tree, data, size);
 >>
 >> +            /*
 >> +             * sneak in the ce_mode using check_attr parameter
 >> +             */
 >> +            hit |= grep_tree(opt, pathspec, &tree, &base,
 >> +                     base.len, ce->ce_mode);
 >
 > OK.  Instead of inventing a new "base" strbuf, we could reuse
 > existing name while running the grep_tree() and restore it after it
 > returns, and I suspect that the end result would be more in line
 > with how grep_cache() uses that "name" buffer for all the cache
 > entries.  But that is not a correctness issue (it is more about
 > preventing the code from getting worse).

Oh right, thanks for the suggestion!

 >> @@ -598,7 +613,14 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 >>          int te_len = tree_entry_len(&entry);
 >>
 >>          if (match != all_entries_interesting) {
 >> -            strbuf_addstr(&name, base->buf + tn_len);
 >> +            if (S_ISSPARSEDIR(check_attr)) {
 >> +                // object is a sparse directory entry
 >
 > No // comments, please.

OK.

Thanks,
Shaoxuan




^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse
  2022-09-01 17:27       ` Junio C Hamano
@ 2022-09-01 22:49         ` Shaoxuan Yuan
  0 siblings, 0 replies; 69+ messages in thread
From: Shaoxuan Yuan @ 2022-09-01 22:49 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, derrickstolee, vdye

On 9/1/2022 10:27 AM, Junio C Hamano wrote:
 > Junio C Hamano <gitster@pobox.com> writes:
 >
 >> Shaoxuan Yuan <shaoxuan.yuan02@gmail.com> writes:
 >>
 >>> Before this patch, whenever --sparse is used, `git-grep` utilizes the
 >>> ensure_full_index() method to expand the index and search all the
 >>> entries. Because this method requires walking all the trees and
 >>> constructing the index, it is the slow part within the whole command.
 >>>
 >>> To achieve better performance, this patch uses grep_tree() to search the
 >>> sparse directory entries and get rid of the ensure_full_index() method.
 >>
 >> When you encounter a "sparsedir" (i.e. a tree recorded in index),
 >> you should know the path leading to that directory. Even though I no
 >> longer remember the details of the implementations of grep_$where()
 >> which I did long time ago, I think grep_tree() should know how to
 >> pass the leading path down, as that is the most natural way to
 >> implement the recursive behaviour.  This patch should be able to
 >> piggyback on that.
 >
 > To avoid unnecessary scare, the above is just me "thinking aloud",
 > after reading the proposed log message, and agreeing with the
 > direction taken by this patch.  Not giving a suggestion to go
 > different route or anything like that.  I should have said "OK" or
 > something at the end of the paragraph.
 >
 > Thanks for working on this topic.

Thank you! :-)


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse
  2022-09-01  4:57   ` [PATCH v3 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse Shaoxuan Yuan
  2022-09-01 17:03     ` Derrick Stolee
  2022-09-01 17:17     ` Junio C Hamano
@ 2022-09-02  3:28     ` Victoria Dye
  2022-09-02 18:47       ` Shaoxuan Yuan
  2 siblings, 1 reply; 69+ messages in thread
From: Victoria Dye @ 2022-09-02  3:28 UTC (permalink / raw)
  To: Shaoxuan Yuan, git; +Cc: derrickstolee

Shaoxuan Yuan wrote:
> Before this patch, whenever --sparse is used, `git-grep` utilizes the
> ensure_full_index() method to expand the index and search all the
> entries. Because this method requires walking all the trees and
> constructing the index, it is the slow part within the whole command.
> 
> To achieve better performance, this patch uses grep_tree() to search the
> sparse directory entries and get rid of the ensure_full_index() method.
> 
> Why grep_tree() is a better choice over ensure_full_index()?
> 
> 1) grep_tree() is as correct as ensure_full_index(). grep_tree() looks
>    into every sparse-directory entry (represented by a tree) recursively
>    when looping over the index, and the result of doing so matches the
>    result of expanding the index.
> 
> 2) grep_tree() utilizes pathspecs to limit the scope of searching.
>    ensure_full_index() always expands the index when --sparse is used,
>    that means it will always walk all the trees and blobs in the repo
>    without caring if the user only wants a subset of the content, i.e.
>    using a pathspec. On the other hand, grep_tree() will only search
>    the contents that match the pathspec, and thus possibly walking fewer
>    trees.
> 
> 3) grep_tree() does not construct and copy back a new index, while
>    ensure_full_index() does. This also saves some time.

Would you mind adding some 'ensure_not_expanded' cases to 't1092' to codify
this (probably in the 'grep is not expanded' test created in patch 2)? If
I'm understanding this patch correctly, you've updated 'git grep' so that it
*never* needs to expand the index. In that case, it would be good to
exercise a bunch of 'git grep' options (pathspecs inside and outside the
sparse cone, wildcard pathspecs, etc.) to confirm that.

> 
> ----------------
> Performance test
> 
> - Summary:
> 
> p2000 tests demonstrate a ~91% execution time reduction for
> `git grep --cached --sparse <pattern> -- <pathspec>` using tree-walking
> logic.
> 
> Test                                                                          HEAD~   HEAD
> ---------------------------------------------------------------------------------------------------
> 2000.78: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (full-v3)     0.11    0.09 (≈)
> 2000.79: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (full-v4)     0.08    0.09 (≈)
> 2000.80: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (sparse-v3)   0.44    0.04 (-90.9%)
> 2000.81: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (sparse-v4)   0.46    0.04 (-91.3%)

These are fantastic results!

> 
> - Command used for testing:
> 
> 	git grep --cached --sparse bogus -- f2/f1/f1/builtin/*
> 
> The reason for specifying a pathspec is that, if we don't specify a
> pathspec, then grep_tree() will walk all the trees and blobs to find the
> pattern, and the time consumed doing so is not too different from using
> the original ensure_full_index() method, which also spends most of the
> time walking trees. However, when a pathspec is specified, this latest
> logic will only walk the area of trees enclosed by the pathspec, and the
> time consumed is reasonably a lot less.
> 
> That is, if we don't specify a pathspec, the performance difference [1]
> is quite small: both methods walk all the trees and take generally same
> amount of time (even with the index construction time included for
> ensure_full_index()).

This makes sense, thanks for the thorough explanation of the results.

> 
> [1] Performance test result without pathspec:
> 
> 	Test                                                    HEAD~  HEAD
> 	-----------------------------------------------------------------------------
> 	2000.78: git grep --cached --sparse bogus (full-v3)     6.17   5.19 (≈)
> 	2000.79: git grep --cached --sparse bogus (full-v4)     6.19   5.46 (≈)
> 	2000.80: git grep --cached --sparse bogus (sparse-v3)   6.57   6.44 (≈)
> 	2000.81: git grep --cached --sparse bogus (sparse-v4)   6.65   6.28 (≈)
> 
> Suggested-by: Derrick Stolee <derrickstolee@github.com>
> Helped-by: Derrick Stolee <derrickstolee@github.com>
> Helped-by: Victoria Dye <vdye@github.com>
> Signed-off-by: Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>
> ---
>  builtin/grep.c                    | 32 ++++++++++++++++++++++++++-----
>  t/perf/p2000-sparse-operations.sh |  1 +
>  2 files changed, 28 insertions(+), 5 deletions(-)
> 
> diff --git a/builtin/grep.c b/builtin/grep.c
> index a0b4dbc1dc..8c0edccd8e 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -522,9 +522,6 @@ static int grep_cache(struct grep_opt *opt,
>  	if (repo_read_index(repo) < 0)
>  		die(_("index file corrupt"));
>  
> -	if (grep_sparse)
> -		ensure_full_index(repo->index);
> -
>  	for (nr = 0; nr < repo->index->cache_nr; nr++) {
>  		const struct cache_entry *ce = repo->index->cache[nr];
>  
> @@ -537,8 +534,26 @@ static int grep_cache(struct grep_opt *opt,
>  
>  		strbuf_setlen(&name, name_base_len);
>  		strbuf_addstr(&name, ce->name);
> +		if (S_ISSPARSEDIR(ce->ce_mode)) {
> +			enum object_type type;
> +			struct tree_desc tree;
> +			void *data;
> +			unsigned long size;
> +			struct strbuf base = STRBUF_INIT;
> +
> +			strbuf_addstr(&base, ce->name);
> +
> +			data = read_object_file(&ce->oid, &type, &size);
> +			init_tree_desc(&tree, data, size);
>  
> -		if (S_ISREG(ce->ce_mode) &&
> +			/*
> +			 * sneak in the ce_mode using check_attr parameter
> +			 */
> +			hit |= grep_tree(opt, pathspec, &tree, &base,
> +					 base.len, ce->ce_mode);
> +			strbuf_release(&base);
> +			free(data);
> +		} else if (S_ISREG(ce->ce_mode) &&
>  		    match_pathspec(repo->index, pathspec, name.buf, name.len, 0, NULL,
>  				   S_ISDIR(ce->ce_mode) ||
>  				   S_ISGITLINK(ce->ce_mode))) {
> @@ -598,7 +613,14 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>  		int te_len = tree_entry_len(&entry);
>  
>  		if (match != all_entries_interesting) {
> -			strbuf_addstr(&name, base->buf + tn_len);
> +			if (S_ISSPARSEDIR(check_attr)) {
> +				// object is a sparse directory entry
> +				strbuf_addbuf(&name, base);
> +			} else {
> +				// object is a commit or a root tree
> +				strbuf_addstr(&name, base->buf + tn_len);
> +			}

Hmm, I'm not entirely sure I follow what's going on with 'name'. I'll try to
talk myself through it.

Stepping back a bit in the context of 'grep_tree()': the goal of the
function is, given a tree descriptor 'tree', to recursively scan the tree to
find any 'grep' matches within items matching 'pathspec'. It is also called
with a strbuf 'base', a length 'tn_len', and a boolean 'check_attr'; it's
not immediately clear to me what those args are or what they do. What I can
see is that:

- 'check_attr' is true iff the "tree" being grepped is actually a commit. 
- both non-recursive callers ('grep_object()' and 'grep_submodule()') call
  'grep_tree()' with 'tn_len == base.len'.

Stepping into 'grep_tree()', we iterate over the entries *inside of* 'tree'.
We assign the length of the tree entry's path to 'te_len'. Notably, a tree
entry's path is *not* the path from the root of the repo to the entry - it's
just the filename of the entry (e.g., for entry 'folder1/a', the path is
'a').

Next, we skip the first 'tn_len' characters of 'base->buf' and assign that
value to 'name'. Because 'tn_len == base.len', for this first iteration,
it's an empty string. Then, we check if the tree entry is interesting with
path 'name'. But 'name' is an empty string, so 'tree_entry_interesting()'
thinks the tree entry is at the root of the repository, even if it isn't!

At this point, I think I've figured out what the deal with 'base' is. Before
this patch, the only non-recursive callers were 'grep_object()' and
'grep_submodule()'. In the former case, 'base' is either "<objectname>:" or
empty; in the latter, it's the path to the submodule. Both of those are
things you'd want to skip to get the correct path to the tree entry for
'tree_entry_interesting()', but that isn't true in your case; you need the
path from the repository root to your tree for 'tree_entry_interesting()' to
work properly.
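
To make the effect of 'tn_len' concrete, here is a minimal standalone sketch
in plain C. It is not Git code: 'path_for_pathspec()' and 'entry_matches()'
are hypothetical helpers made up for illustration. They assume that the
directory part checked against the pathspec is simply 'base->buf + tn_len',
and they reduce pathspec matching to a plain prefix comparison.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical stand-in for the directory prefix that ends up in 'name':
 * everything in 'base' after the first 'tn_len' bytes.
 */
static const char *path_for_pathspec(const char *base, size_t tn_len)
{
	assert(tn_len <= strlen(base));
	return base + tn_len;
}

/*
 * Toy replacement for pathspec matching (an assumption, much simpler than
 * the real thing): does "<dir><entry>" start with 'prefix'?
 */
static int entry_matches(const char *dir, const char *entry,
			 const char *prefix)
{
	char full[256];

	snprintf(full, sizeof(full), "%s%s", dir, entry);
	return strncmp(full, prefix, strlen(prefix)) == 0;
}
```

With base "folder1/" and tn_len == 8, entry 'a' is compared as just "a", so a
pathspec prefix of "folder1/" does not match. With tn_len == 0 it is compared
as "folder1/a", and it does.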

Based on all of that, I *think* you can drop the 'check_attr' changes to
'grep_tree()' and update how you provide 'base' and 'tn_len' so
1) 'base' is the path to the tree root, and 2) 'tn_len' is 0 so that full
path is provided to 'tree_entry_interesting()':

----->8----->8----->8----->8----->8----->8----->8----->8----->8----->8-----
diff --git a/builtin/grep.c b/builtin/grep.c
index 8c0edccd8e..85c83190f1 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -546,11 +546,7 @@ static int grep_cache(struct grep_opt *opt,
 			data = read_object_file(&ce->oid, &type, &size);
 			init_tree_desc(&tree, data, size);
 
-			/*
-			 * sneak in the ce_mode using check_attr parameter
-			 */
-			hit |= grep_tree(opt, pathspec, &tree, &base,
-					 base.len, ce->ce_mode);
+			hit |= grep_tree(opt, pathspec, &tree, &base, 0, 0);
 			strbuf_release(&base);
 			free(data);
 		} else if (S_ISREG(ce->ce_mode) &&
@@ -613,14 +609,6 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 		int te_len = tree_entry_len(&entry);
 
 		if (match != all_entries_interesting) {
-			if (S_ISSPARSEDIR(check_attr)) {
-				// object is a sparse directory entry
-				strbuf_addbuf(&name, base);
-			} else {
-				// object is a commit or a root tree
-				strbuf_addstr(&name, base->buf + tn_len);
-			}
-
 			match = tree_entry_interesting(repo->index,
 						       &entry, &name,
 						       0, pathspec);
-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<----- 

I still find all of this confusing, and it's possible I'm still not properly
understanding how 'name' and 'tn_len' are supposed to be used. Regardless, I
*am* fairly certain that finding the right values for those args is
going to be the cleanest (and least fragile) way to handle sparse
directories, rather than using the 'check_attr' arg for something it isn't.

It might take some time + lots of debugging/experimenting, but it's really
important that the implementation you settle on is something you (and,
ideally, the readers of your patches) confidently and completely understand,
rather than something that seems to work but doesn't have a clear
explanation. As always, I'm happy to help if you'd like another set of eyes
on the problem!

> +
>  			match = tree_entry_interesting(repo->index,
>  						       &entry, &name,
>  						       0, pathspec);
> diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
> index fce8151d41..a0b71bb3b4 100755
> --- a/t/perf/p2000-sparse-operations.sh
> +++ b/t/perf/p2000-sparse-operations.sh
> @@ -124,5 +124,6 @@ test_perf_on_all git read-tree -mu HEAD
>  test_perf_on_all git checkout-index -f --all
>  test_perf_on_all git update-index --add --remove $SPARSE_CONE/a
>  test_perf_on_all "git rm -f $SPARSE_CONE/a && git checkout HEAD -- $SPARSE_CONE/a"
> +test_perf_on_all git grep --cached --sparse bogus -- "f2/f1/f1/builtin/*"
>  
>  test_done


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse
  2022-09-02  3:28     ` Victoria Dye
@ 2022-09-02 18:47       ` Shaoxuan Yuan
  0 siblings, 0 replies; 69+ messages in thread
From: Shaoxuan Yuan @ 2022-09-02 18:47 UTC (permalink / raw)
  To: Victoria Dye, git; +Cc: derrickstolee

On 9/1/2022 8:28 PM, Victoria Dye wrote:
> Shaoxuan Yuan wrote:
>> Before this patch, whenever --sparse is used, `git-grep` utilizes the
>> ensure_full_index() method to expand the index and search all the
>> entries. Because this method requires walking all the trees and
>> constructing the index, it is the slow part within the whole command.
>>
>> To achieve better performance, this patch uses grep_tree() to search the
>> sparse directory entries and get rid of the ensure_full_index() method.
>>
>> Why grep_tree() is a better choice over ensure_full_index()?
>>
>> 1) grep_tree() is as correct as ensure_full_index(). grep_tree() looks
>>    into every sparse-directory entry (represented by a tree) recursively
>>    when looping over the index, and the result of doing so matches the
>>    result of expanding the index.
>>
>> 2) grep_tree() utilizes pathspecs to limit the scope of searching.
>>    ensure_full_index() always expands the index when --sparse is used,
>>    that means it will always walk all the trees and blobs in the repo
>>    without caring if the user only wants a subset of the content, i.e.
>>    using a pathspec. On the other hand, grep_tree() will only search
>>    the contents that match the pathspec, and thus possibly walking fewer
>>    trees.
>>
>> 3) grep_tree() does not construct and copy back a new index, while
>>    ensure_full_index() does. This also saves some time.
> 
> Would you mind adding some 'ensure_not_expanded' cases to 't1092' to codify
> this (probably in the 'grep is not expanded' test created in patch 2)? If
> I'm understanding this patch correctly, you've updated 'git grep' so that it
> *never* needs to expand the index. In that case, it would be good to
> exercise a bunch of 'git grep' options (pathspecs inside and outside the
> sparse cone, wildcard pathspecs, etc.) to confirm that.

Sure!

>>
>> ----------------
>> Performance test
>>
>> - Summary:
>>
>> p2000 tests demonstrate a ~91% execution time reduction for
>> `git grep --cached --sparse <pattern> -- <pathspec>` using tree-walking
>> logic.
>>
>> Test                                                                          HEAD~   HEAD
>> ---------------------------------------------------------------------------------------------------
>> 2000.78: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (full-v3)     0.11    0.09 (≈)
>> 2000.79: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (full-v4)     0.08    0.09 (≈)
>> 2000.80: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (sparse-v3)   0.44    0.04 (-90.9%)
>> 2000.81: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (sparse-v4)   0.46    0.04 (-91.3%)
> 
> These are fantastic results!
> 
>>
>> - Command used for testing:
>>
>> 	git grep --cached --sparse bogus -- f2/f1/f1/builtin/*
>>
>> The reason for specifying a pathspec is that, if we don't specify a
>> pathspec, then grep_tree() will walk all the trees and blobs to find the
>> pattern, and the time consumed doing so is not too different from using
>> the original ensure_full_index() method, which also spends most of the
>> time walking trees. However, when a pathspec is specified, this latest
>> logic will only walk the area of trees enclosed by the pathspec, and the
>> time consumed is reasonably a lot less.
>>
>> That is, if we don't specify a pathspec, the performance difference [1]
>> is quite small: both methods walk all the trees and take generally same
>> amount of time (even with the index construction time included for
>> ensure_full_index()).
> 
> This makes sense, thanks for the thorough explanation of the results.
> 
>>
>> [1] Performance test result without pathspec:
>>
>> 	Test                                                    HEAD~  HEAD
>> 	-----------------------------------------------------------------------------
>> 	2000.78: git grep --cached --sparse bogus (full-v3)     6.17   5.19 (≈)
>> 	2000.79: git grep --cached --sparse bogus (full-v4)     6.19   5.46 (≈)
>> 	2000.80: git grep --cached --sparse bogus (sparse-v3)   6.57   6.44 (≈)
>> 	2000.81: git grep --cached --sparse bogus (sparse-v4)   6.65   6.28 (≈)
>>
>> Suggested-by: Derrick Stolee <derrickstolee@github.com>
>> Helped-by: Derrick Stolee <derrickstolee@github.com>
>> Helped-by: Victoria Dye <vdye@github.com>
>> Signed-off-by: Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>
>> ---
>>  builtin/grep.c                    | 32 ++++++++++++++++++++++++++-----
>>  t/perf/p2000-sparse-operations.sh |  1 +
>>  2 files changed, 28 insertions(+), 5 deletions(-)
>>
>> diff --git a/builtin/grep.c b/builtin/grep.c
>> index a0b4dbc1dc..8c0edccd8e 100644
>> --- a/builtin/grep.c
>> +++ b/builtin/grep.c
>> @@ -522,9 +522,6 @@ static int grep_cache(struct grep_opt *opt,
>>  	if (repo_read_index(repo) < 0)
>>  		die(_("index file corrupt"));
>>  
>> -	if (grep_sparse)
>> -		ensure_full_index(repo->index);
>> -
>>  	for (nr = 0; nr < repo->index->cache_nr; nr++) {
>>  		const struct cache_entry *ce = repo->index->cache[nr];
>>  
>> @@ -537,8 +534,26 @@ static int grep_cache(struct grep_opt *opt,
>>  
>>  		strbuf_setlen(&name, name_base_len);
>>  		strbuf_addstr(&name, ce->name);
>> +		if (S_ISSPARSEDIR(ce->ce_mode)) {
>> +			enum object_type type;
>> +			struct tree_desc tree;
>> +			void *data;
>> +			unsigned long size;
>> +			struct strbuf base = STRBUF_INIT;
>> +
>> +			strbuf_addstr(&base, ce->name);
>> +
>> +			data = read_object_file(&ce->oid, &type, &size);
>> +			init_tree_desc(&tree, data, size);
>>  
>> -		if (S_ISREG(ce->ce_mode) &&
>> +			/*
>> +			 * sneak in the ce_mode using check_attr parameter
>> +			 */
>> +			hit |= grep_tree(opt, pathspec, &tree, &base,
>> +					 base.len, ce->ce_mode);
>> +			strbuf_release(&base);
>> +			free(data);
>> +		} else if (S_ISREG(ce->ce_mode) &&
>>  		    match_pathspec(repo->index, pathspec, name.buf, name.len, 0, NULL,
>>  				   S_ISDIR(ce->ce_mode) ||
>>  				   S_ISGITLINK(ce->ce_mode))) {
>> @@ -598,7 +613,14 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>>  		int te_len = tree_entry_len(&entry);
>>  
>>  		if (match != all_entries_interesting) {
>> -			strbuf_addstr(&name, base->buf + tn_len);
>> +			if (S_ISSPARSEDIR(check_attr)) {
>> +				// object is a sparse directory entry
>> +				strbuf_addbuf(&name, base);
>> +			} else {
>> +				// object is a commit or a root tree
>> +				strbuf_addstr(&name, base->buf + tn_len);
>> +			}
> 
> Hmm, I'm not entirely sure I follow what's going on with 'name'. I'll try to
> talk myself through it.
> 
> Stepping back a bit in the context of 'grep_tree()': the goal of the
> function is, given a tree descriptor 'tree', to recursively scan the tree to
> find any 'grep' matches within items matching 'pathspec'. It is also called
> with a strbuf 'base', a length 'tn_len', and a boolean 'check_attr'; it's
> not immediately clear to me what those args are or what they do. What I can

I was confused for quite a while about the meaning of these args, too.

I think 'base' is the object's ref or SHA, e.g. HEAD, HEAD~, or a <SHA>.
Before this patch, the object was expected to be a root tree or a
commit. I _think_ 'base' can also be "<submodule>/", e.g. "sub/" when
grepping a submodule.

'tn_len' stands for "tree_name_len"?

'check_attr', as you wrote below, is for "commit or not", at least that
was its only use case before this patch.

> see is that:
> 
> - 'check_attr' is true iff the "tree" being grepped is actually a commit. 

I think this is correct. Though as Derrick Stolee said here [1], this
patch is abusing the 'check_attr' (passing 'ce_mode' through it), and if
that caused any confusion, my apologies.

[1]
https://lore.kernel.org/git/e74b326d-ce4a-31c3-5424-e35858cdb569@github.com

> - both non-recursive callers ('grep_object()' and 'grep_submodule()') call
>   'grep_tree()' with 'tn_len == base.len'.
> 
> Stepping into 'grep_tree()', we iterate over the entries *inside of* 'tree'.
> We assign the length of the tree entry's path to 'te_len'. Notably, a tree
> entry's path is *not* the path from the root of the repo to the entry - it's
> just the filename of the entry (e.g., for entry 'folder1/a', the path is
> 'a').

Yes.

> Next, we skip the first 'tn_len' characters of 'base->buf' and assign that
> value to 'name'. Because 'tn_len == base.len', for this first iteration,
> it's an empty string. Then, we check if the tree entry is interesting with
> path 'name'. But 'name' is an empty string, so 'tree_entry_interesting()'
> thinks the tree entry is at the root of the repository, even if it isn't!

Yes, that is the reason why it kept ignoring sub-root-level trees: the
pathspec can never match a tree that is not at root level if this
root-level assumption exists.

> At this point, I think I've figured out what the deal with 'base' is. Before
> this patch, the only non-recursive callers were 'grep_object()' and
> 'grep_submodule()'. In the former case, 'base' is either "<objectname>:" or
> empty; in the latter, it's the path to the submodule. Both of those are
> things you'd want to skip to get the correct

Yep, this resonates with my reply above!

> path to the tree entry for 'tree_entry_interesting()', but it isn't true in
> your case; you need the path from the repository root to your tree for
> 'tree_entry_interesting()' to work properly. 

Well said! I think this phrasing is very accurate.

> Based on all of that, I *think* you can drop the 'check_attr' changes to
> 'grep_tree()' and update how you provide 'base' and 'tn_len' so
> 1) 'base' is the path to the tree root, and 2) 'tn_len' is 0 so that full
> path is provided to 'tree_entry_interesting()':
> 
> ----->8----->8----->8----->8----->8----->8----->8----->8----->8----->8-----
> diff --git a/builtin/grep.c b/builtin/grep.c
> index 8c0edccd8e..85c83190f1 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -546,11 +546,7 @@ static int grep_cache(struct grep_opt *opt,
>  			data = read_object_file(&ce->oid, &type, &size);
>  			init_tree_desc(&tree, data, size);
>  
> -			/*
> -			 * sneak in the ce_mode using check_attr parameter
> -			 */
> -			hit |= grep_tree(opt, pathspec, &tree, &base,
> -					 base.len, ce->ce_mode);
> +			hit |= grep_tree(opt, pathspec, &tree, &base, 0, 0);
>  			strbuf_release(&base);
>  			free(data);
>  		} else if (S_ISREG(ce->ce_mode) &&
> @@ -613,14 +609,6 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>  		int te_len = tree_entry_len(&entry);
>  
>  		if (match != all_entries_interesting) {
> -			if (S_ISSPARSEDIR(check_attr)) {
> -				// object is a sparse directory entry
> -				strbuf_addbuf(&name, base);
> -			} else {
> -				// object is a commit or a root tree
> -				strbuf_addstr(&name, base->buf + tn_len);
> -			}
> -
>  			match = tree_entry_interesting(repo->index,
>  						       &entry, &name,
>  						       0, pathspec);
> -----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<----- 

Thank you for this diff! I think this is also what Derrick suggested in
his review. In fact, this approach is more to the root of the problem:
the expected format of the path/base.

> I still find all of this confusing, and it's possible I'm still not properly
> understanding how 'name' and 'tn_len' are supposed to be used. Regardless, I
> *am* fairly certain that finding the right values for those args is
> going to be the cleanest (and least fragile) way to handle sparse
> directories, rather than using the 'check_attr' arg for something it isn't.

Right.

> It might take some time + lots of debugging/experimenting, but it's really
> important that the implementation you settle on is something you (and,
> ideally, the readers of your patches) confidently and completely understand,
> rather than something that seems to work but doesn't have a clear
> explanation. As always, I'm happy to help if you'd like another set of eyes
> on the problem!

Right. I admit that the approach I was taking is pretty shady. The way
suggested by you and Derrick is more explainable and to-the-point.
Lesson learned!

Thanks,
Shaoxuan


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH v4 0/3] grep: integrate with sparse index
  2022-08-17  7:56 [PATCH v1 0/2] grep: integrate with sparse index Shaoxuan Yuan
                   ` (4 preceding siblings ...)
  2022-09-01  4:57 ` [PATCH v3 0/3] grep: " Shaoxuan Yuan
@ 2022-09-03  0:36 ` Shaoxuan Yuan
  2022-09-03  0:36   ` [PATCH v4 1/3] builtin/grep.c: add --sparse option Shaoxuan Yuan
                     ` (2 more replies)
  2022-09-08  0:18 ` [PATCH v5 0/3] grep: integrate with sparse index Shaoxuan Yuan
  2022-09-23  4:18 ` [PATCH v6 0/1] grep: integrate with sparse index Shaoxuan Yuan
  7 siblings, 3 replies; 69+ messages in thread
From: Shaoxuan Yuan @ 2022-09-03  0:36 UTC (permalink / raw)
  To: git; +Cc: derrickstolee, vdye, gitster, Shaoxuan Yuan

Integrate `git-grep` with sparse-index and test the performance
improvement.

Changes since v3
----------------
* Shorten the perf result tables in commit message.

* Update the commit message to reflect the changes in the commit.

* Update the commit message to indicate the performance improvement
  is dependent on the pathspec.

* Stop passing `ce_mode` through `check_attr`. Instead, set the
  `base_len` to 0, which makes the code easier to reason about and no
  longer abuses `check_attr`.

* Drop the separate `base` strbuf. Pass the existing `name` to
  `grep_tree()` instead, and reset it back to `ce->name` after
  `grep_tree()` returns.

* Update the p2000 test to use a more general pathspec for better
  compatibility (i.e. do not use git repository specific pathspec).

* Add tests to t1092 'grep is not expanded' to verify the change
  brought by "builtin/grep.c: walking tree instead of expanding index
  with --sparse": the index *never* expands.
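
The "reuse `name`, then reset it" item above can be pictured with the
following standalone sketch. It is only an illustration under stated
assumptions: `struct tiny_buf` and its helpers are hypothetical stand-ins
for a string buffer, not Git's strbuf API, and `walk_clobbering()` merely
pretends that the callee may leave the shared buffer modified.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, fixed-size stand-in for a growable string buffer. */
struct tiny_buf {
	char buf[256];
	size_t len;
};

/* Replace the buffer contents with 's'. */
static void tiny_set(struct tiny_buf *b, const char *s)
{
	size_t n = strlen(s);

	assert(n < sizeof(b->buf));
	memcpy(b->buf, s, n + 1);
	b->len = n;
}

/* Append 's' to the buffer. */
static void tiny_add(struct tiny_buf *b, const char *s)
{
	size_t n = strlen(s);

	assert(b->len + n < sizeof(b->buf));
	memcpy(b->buf + b->len, s, n + 1);
	b->len += n;
}

/* Pretend tree walk (assumption): appends to the shared buffer and leaves it. */
static void walk_clobbering(struct tiny_buf *name)
{
	tiny_add(name, "deep/file.c");
}

/* Caller pattern: lend 'name' to the walk, then put the entry name back. */
static void walk_and_restore(struct tiny_buf *name, const char *ce_name)
{
	walk_clobbering(name);
	tiny_set(name, ce_name);
}
```

The point of the pattern is that the caller does not rely on the callee to
leave the buffer untouched; it re-establishes the entry name itself before
the rest of the loop body uses it.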

Changes since v2
----------------

* Modify the commit message for "builtin/grep.c: integrate with sparse
  index" to make it obvious that the perf test results are not from
  p2000 tests, but from manual perf runs.

* Add tree-walking logic as an extra (the third) patch to improve the
  performance when --sparse is used. This resolved the left-over-bit
  in v2 [1].

[1] https://lore.kernel.org/git/20220829232843.183711-1-shaoxuan.yuan02@gmail.com/

Changes since v1
----------------

* Rewrite the commit message for "builtin/grep.c: add --sparse option"
  to be clearer.

* Update the documentation (both in-code and man page) for --sparse.

* Add a few tests to test the new behavior (when _only_ --cached is
  supplied).

* Reformat the perf test results to not look like directly from p2000
  tests.

* Put the "command_requires_full_index" lines right after parse_options().

* Add a pathspec test in t1092, and reword a few test documentations.

Shaoxuan Yuan (3):
  builtin/grep.c: add --sparse option
  builtin/grep.c: integrate with sparse index
  builtin/grep.c: walking tree instead of expanding index with --sparse

 Documentation/git-grep.txt               |  5 +++-
 builtin/grep.c                           | 31 ++++++++++++++++++---
 t/perf/p2000-sparse-operations.sh        |  1 +
 t/t1092-sparse-checkout-compatibility.sh | 26 ++++++++++++++++++
 t/t7817-grep-sparse-checkout.sh          | 34 +++++++++++++++++++-----
 5 files changed, 86 insertions(+), 11 deletions(-)

Range-diff against v3:
1:  1fa8c62d95 ! 1:  f1d8271a9b builtin/grep.c: add --sparse option
    @@ Commit message
         inspects paths outside of the sparse-checkout definition when paired
         with the '--cached' option.
     
    -    Helped-by: Derrick Stolee <derrickstolee@github.com>
         Suggested-by: Victoria Dye <vdye@github.com>
    +    Helped-by: Derrick Stolee <derrickstolee@github.com>
    +    Helped-by: Victoria Dye <vdye@github.com>
         Signed-off-by: Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>
     
      ## Documentation/git-grep.txt ##
2:  ce4fba3c35 ! 2:  7aa4b8bc81 builtin/grep.c: integrate with sparse index
    @@ Commit message
         are then extracted from the time difference between "region_enter" and
         "region_leave" of label "do_read_index".
     
    +    Helped-by: Victoria Dye <vdye@github.com>
         Helped-by: Derrick Stolee <derrickstolee@github.com>
         Signed-off-by: Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>
     
3:  240883aa11 ! 3:  6a2e753a19 builtin/grep.c: walking tree instead of expanding index with --sparse
    @@ Commit message
     
         - Summary:
     
    -    p2000 tests demonstrate a ~91% execution time reduction for
    -    `git grep --cached --sparse <pattern> -- <pathspec>` using tree-walking
    -    logic.
    -
    -    Test                                                                          HEAD~   HEAD
    -    ---------------------------------------------------------------------------------------------------
    -    2000.78: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (full-v3)     0.11    0.09 (≈)
    -    2000.79: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (full-v4)     0.08    0.09 (≈)
    -    2000.80: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (sparse-v3)   0.44    0.04 (-90.9%)
    -    2000.81: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (sparse-v4)   0.46    0.04 (-91.3%)
    +    p2000 tests demonstrate a ~71% execution time reduction for
    +    `git grep --cached --sparse bogus -- "f2/f1/f1/*"` using tree-walking
    +    logic. However, notice that this result varies depending on the pathspec
    +    given. See below "Command used for testing" for more details.
    +
    +    Test                              HEAD~   HEAD
    +    -------------------------------------------------------
    +    2000.78: git grep ... (full-v3)   0.35    0.39 (≈)
    +    2000.79: git grep ... (full-v4)   0.36    0.30 (≈)
    +    2000.80: git grep ... (sparse-v3) 0.88    0.23 (-73.8%)
    +    2000.81: git grep ... (sparse-v4) 0.83    0.26 (-68.6%)
     
         - Command used for testing:
     
    -            git grep --cached --sparse bogus -- f2/f1/f1/builtin/*
    +            git grep --cached --sparse bogus -- "f2/f1/f1/*"
     
         The reason for specifying a pathspec is that, if we don't specify a
         pathspec, then grep_tree() will walk all the trees and blobs to find the
    @@ Commit message
         logic will only walk the area of trees enclosed by the pathspec, and the
         time consumed is reasonably a lot less.
     
    +    Generally speaking, because the performance gain is achieved by walking
    +    fewer trees, which are specified by the pathspec, the HEAD time v.s.
    +    HEAD~ time in sparse-v[3|4], should be proportional to
    +    "pathspec enclosed area" v.s. "all area", respectively. Namely, the
    +    wider the <pathspec> is encompassing, the less the performance
    +    difference between HEAD~ and HEAD, and vice versa.
    +
         That is, if we don't specify a pathspec, the performance difference [1]
    -    is quite small: both methods walk all the trees and take generally same
    -    amount of time (even with the index construction time included for
    +    is indistinguishable: both methods walk all the trees and take generally
    +    same amount of time (even with the index construction time included for
         ensure_full_index()).
     
    -    [1] Performance test result without pathspec:
    +    [1] Performance test result without pathspec (hence walking all trees):
    +
    +            Command used:
    +
    +                    git grep --cached --sparse bogus
     
    -            Test                                                    HEAD~  HEAD
    -            -----------------------------------------------------------------------------
    -            2000.78: git grep --cached --sparse bogus (full-v3)     6.17   5.19 (≈)
    -            2000.79: git grep --cached --sparse bogus (full-v4)     6.19   5.46 (≈)
    -            2000.80: git grep --cached --sparse bogus (sparse-v3)   6.57   6.44 (≈)
    -            2000.81: git grep --cached --sparse bogus (sparse-v4)   6.65   6.28 (≈)
    +            Test                                HEAD~  HEAD
    +            ---------------------------------------------------
    +            2000.78: git grep ... (full-v3)     6.17   5.19 (≈)
    +            2000.79: git grep ... (full-v4)     6.19   5.46 (≈)
    +            2000.80: git grep ... (sparse-v3)   6.57   6.44 (≈)
    +            2000.81: git grep ... (sparse-v4)   6.65   6.28 (≈)
     
         Suggested-by: Derrick Stolee <derrickstolee@github.com>
         Helped-by: Derrick Stolee <derrickstolee@github.com>
    @@ builtin/grep.c: static int grep_cache(struct grep_opt *opt,
     +			struct tree_desc tree;
     +			void *data;
     +			unsigned long size;
    -+			struct strbuf base = STRBUF_INIT;
    -+
    -+			strbuf_addstr(&base, ce->name);
     +
     +			data = read_object_file(&ce->oid, &type, &size);
     +			init_tree_desc(&tree, data, size);
      
     -		if (S_ISREG(ce->ce_mode) &&
    -+			/*
    -+			 * sneak in the ce_mode using check_attr parameter
    -+			 */
    -+			hit |= grep_tree(opt, pathspec, &tree, &base,
    -+					 base.len, ce->ce_mode);
    -+			strbuf_release(&base);
    ++			hit |= grep_tree(opt, pathspec, &tree, &name, 0, 0);
    ++			strbuf_reset(&name);
    ++			strbuf_addstr(&name, ce->name);
     +			free(data);
     +		} else if (S_ISREG(ce->ce_mode) &&
      		    match_pathspec(repo->index, pathspec, name.buf, name.len, 0, NULL,
      				   S_ISDIR(ce->ce_mode) ||
      				   S_ISGITLINK(ce->ce_mode))) {
    -@@ builtin/grep.c: static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
    - 		int te_len = tree_entry_len(&entry);
    - 
    - 		if (match != all_entries_interesting) {
    --			strbuf_addstr(&name, base->buf + tn_len);
    -+			if (S_ISSPARSEDIR(check_attr)) {
    -+				// object is a sparse directory entry
    -+				strbuf_addbuf(&name, base);
    -+			} else {
    -+				// object is a commit or a root tree
    -+				strbuf_addstr(&name, base->buf + tn_len);
    -+			}
    -+
    - 			match = tree_entry_interesting(repo->index,
    - 						       &entry, &name,
    - 						       0, pathspec);
     
      ## t/perf/p2000-sparse-operations.sh ##
     @@ t/perf/p2000-sparse-operations.sh: test_perf_on_all git read-tree -mu HEAD
      test_perf_on_all git checkout-index -f --all
      test_perf_on_all git update-index --add --remove $SPARSE_CONE/a
      test_perf_on_all "git rm -f $SPARSE_CONE/a && git checkout HEAD -- $SPARSE_CONE/a"
    -+test_perf_on_all git grep --cached --sparse bogus -- "f2/f1/f1/builtin/*"
    ++test_perf_on_all git grep --cached --sparse bogus -- "f2/f1/f1/*"
    + 
    + test_done
    +
    + ## t/t1092-sparse-checkout-compatibility.sh ##
    +@@ t/t1092-sparse-checkout-compatibility.sh: test_expect_success 'grep is not expanded' '
    + 
    + 	# All files within the folder1/* pathspec are sparse,
    + 	# so this command does not find any matches
    +-	ensure_not_expanded ! grep a -- folder1/*
    ++	ensure_not_expanded ! grep a -- folder1/* &&
    ++
    ++	# test out-of-cone pathspec with or without wildcard
    ++	ensure_not_expanded grep --sparse --cached a -- "folder1/a" &&
    ++	ensure_not_expanded grep --sparse --cached a -- "folder1/*" &&
    ++
    ++	# test in-cone pathspec with or without wildcard
    ++	ensure_not_expanded grep --sparse --cached a -- "deep/a" &&
    ++	ensure_not_expanded grep --sparse --cached a -- "deep/*"
    + '
      
      test_done

base-commit: be1a02a17ede4082a86dfbfee0f54f345e8b43ac
-- 
2.37.0



* [PATCH v4 1/3] builtin/grep.c: add --sparse option
  2022-09-03  0:36 ` [PATCH v4 0/3] grep: integrate with sparse index Shaoxuan Yuan
@ 2022-09-03  0:36   ` Shaoxuan Yuan
  2022-09-03  0:36   ` [PATCH v4 2/3] builtin/grep.c: integrate with sparse index Shaoxuan Yuan
  2022-09-03  0:36   ` [PATCH v4 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse Shaoxuan Yuan
  2 siblings, 0 replies; 69+ messages in thread
From: Shaoxuan Yuan @ 2022-09-03  0:36 UTC (permalink / raw)
  To: git; +Cc: derrickstolee, vdye, gitster, Shaoxuan Yuan

Add a --sparse option to `git-grep`.

When the '--cached' option is used with the 'git grep' command, the
search is limited to the blobs found in the index, not in the worktree.
If the user has enabled sparse-checkout, this might present more results
than they would like, since the files outside of the sparse-checkout are
unlikely to be important to them.

Change the default behavior of 'git grep' to focus on the files within
the sparse-checkout definition. To restore the previous behavior, add a
'--sparse' option to 'git grep' that, when paired with the '--cached'
option, also inspects paths outside of the sparse-checkout definition.
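
As a rough illustration of the rule above (a sketch only: skip_entry() and
check_skip_entry_table() are hypothetical names that do not exist in
builtin/grep.c, and the booleans stand in for the real option flags and
for ce_skip_worktree(ce)), the skip decision for one index entry can be
written as:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for the check in grep_cache(): returns true when
 * an index entry should be skipped. Entries marked SKIP_WORKTREE are
 * searched only when both --sparse and --cached are given.
 */
static bool skip_entry(bool grep_sparse, bool cached, bool skip_worktree)
{
	return !(grep_sparse && cached) && skip_worktree;
}

/* Spell out the interesting rows of the truth table. */
static void check_skip_entry_table(void)
{
	assert(!skip_entry(true, true, true));   /* --cached --sparse: searched */
	assert(skip_entry(false, true, true));   /* --cached only: skipped */
	assert(skip_entry(true, false, true));   /* --sparse only: skipped */
	assert(skip_entry(false, false, true));  /* neither: skipped */
	assert(!skip_entry(false, true, false)); /* in-cone entry: searched */
}
```

In this sketch, only the combination of --cached and --sparse lets
SKIP_WORKTREE entries through; --sparse by itself changes nothing.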

Suggested-by: Victoria Dye <vdye@github.com>
Helped-by: Derrick Stolee <derrickstolee@github.com>
Helped-by: Victoria Dye <vdye@github.com>
Signed-off-by: Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>
---
 Documentation/git-grep.txt      |  5 ++++-
 builtin/grep.c                  | 10 +++++++++-
 t/t7817-grep-sparse-checkout.sh | 34 +++++++++++++++++++++++++++------
 3 files changed, 41 insertions(+), 8 deletions(-)

diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index 58d944bd57..bdd3d5b8a6 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -28,7 +28,7 @@ SYNOPSIS
 	   [-f <file>] [-e] <pattern>
 	   [--and|--or|--not|(|)|-e <pattern>...]
 	   [--recurse-submodules] [--parent-basename <basename>]
-	   [ [--[no-]exclude-standard] [--cached | --no-index | --untracked] | <tree>...]
+	   [ [--[no-]exclude-standard] [--cached [--sparse] | --no-index | --untracked] | <tree>...]
 	   [--] [<pathspec>...]
 
 DESCRIPTION
@@ -45,6 +45,9 @@ OPTIONS
 	Instead of searching tracked files in the working tree, search
 	blobs registered in the index file.
 
+--sparse::
+	Use with --cached. Search outside of sparse-checkout definition.
+
 --no-index::
 	Search files in the current directory that is not managed by Git.
 
diff --git a/builtin/grep.c b/builtin/grep.c
index e6bcdf860c..12abd832fa 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -96,6 +96,8 @@ static pthread_cond_t cond_result;
 
 static int skip_first_line;
 
+static int grep_sparse = 0;
+
 static void add_work(struct grep_opt *opt, struct grep_source *gs)
 {
 	if (opt->binary != GREP_BINARY_TEXT)
@@ -525,7 +527,11 @@ static int grep_cache(struct grep_opt *opt,
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
 
-		if (!cached && ce_skip_worktree(ce))
+		/*
+		 * Skip entries with SKIP_WORKTREE unless both --sparse and
+		 * --cached are given.
+		 */
+		if (!(grep_sparse && cached) && ce_skip_worktree(ce))
 			continue;
 
 		strbuf_setlen(&name, name_base_len);
@@ -963,6 +969,8 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 			   PARSE_OPT_NOCOMPLETE),
 		OPT_INTEGER('m', "max-count", &opt.max_count,
 			N_("maximum number of results per file")),
+		OPT_BOOL(0, "sparse", &grep_sparse,
+			 N_("search the contents of files outside the sparse-checkout definition")),
 		OPT_END()
 	};
 	grep_prefix = prefix;
diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
index eb59564565..a9879cc980 100755
--- a/t/t7817-grep-sparse-checkout.sh
+++ b/t/t7817-grep-sparse-checkout.sh
@@ -118,13 +118,19 @@ test_expect_success 'grep searches unmerged file despite not matching sparsity p
 	test_cmp expect actual
 '
 
-test_expect_success 'grep --cached searches entries with the SKIP_WORKTREE bit' '
+test_expect_success 'grep --cached and --sparse searches entries with the SKIP_WORKTREE bit' '
+	cat >expect <<-EOF &&
+	a:text
+	EOF
+	git grep --cached "text" >actual &&
+	test_cmp expect actual &&
+
 	cat >expect <<-EOF &&
 	a:text
 	b:text
 	dir/c:text
 	EOF
-	git grep --cached "text" >actual &&
+	git grep --cached --sparse "text" >actual &&
 	test_cmp expect actual
 '
 
@@ -143,7 +149,15 @@ test_expect_success 'grep --recurse-submodules honors sparse checkout in submodu
 	test_cmp expect actual
 '
 
-test_expect_success 'grep --recurse-submodules --cached searches entries with the SKIP_WORKTREE bit' '
+test_expect_success 'grep --recurse-submodules --cached and --sparse searches entries with the SKIP_WORKTREE bit' '
+	cat >expect <<-EOF &&
+	a:text
+	sub/B/b:text
+	sub2/a:text
+	EOF
+	git grep --recurse-submodules --cached "text" >actual &&
+	test_cmp expect actual &&
+
 	cat >expect <<-EOF &&
 	a:text
 	b:text
@@ -152,7 +166,7 @@ test_expect_success 'grep --recurse-submodules --cached searches entries with th
 	sub/B/b:text
 	sub2/a:text
 	EOF
-	git grep --recurse-submodules --cached "text" >actual &&
+	git grep --recurse-submodules --cached --sparse "text" >actual &&
 	test_cmp expect actual
 '
 
@@ -166,7 +180,15 @@ test_expect_success 'working tree grep does not search the index with CE_VALID a
 	test_cmp expect actual
 '
 
-test_expect_success 'grep --cached searches index entries with both CE_VALID and SKIP_WORKTREE' '
+test_expect_success 'grep --cached and --sparse searches index entries with both CE_VALID and SKIP_WORKTREE' '
+	cat >expect <<-EOF &&
+	a:text
+	EOF
+	test_when_finished "git update-index --no-assume-unchanged b" &&
+	git update-index --assume-unchanged b &&
+	git grep --cached text >actual &&
+	test_cmp expect actual &&
+
 	cat >expect <<-EOF &&
 	a:text
 	b:text
@@ -174,7 +196,7 @@ test_expect_success 'grep --cached searches index entries with both CE_VALID and
 	EOF
 	test_when_finished "git update-index --no-assume-unchanged b" &&
 	git update-index --assume-unchanged b &&
-	git grep --cached text >actual &&
+	git grep --cached --sparse text >actual &&
 	test_cmp expect actual
 '
 
-- 
2.37.0



* [PATCH v4 2/3] builtin/grep.c: integrate with sparse index
  2022-09-03  0:36 ` [PATCH v4 0/3] grep: integrate with sparse index Shaoxuan Yuan
  2022-09-03  0:36   ` [PATCH v4 1/3] builtin/grep.c: add --sparse option Shaoxuan Yuan
@ 2022-09-03  0:36   ` Shaoxuan Yuan
  2022-09-03  0:36   ` [PATCH v4 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse Shaoxuan Yuan
  2 siblings, 0 replies; 69+ messages in thread
From: Shaoxuan Yuan @ 2022-09-03  0:36 UTC (permalink / raw)
  To: git; +Cc: derrickstolee, vdye, gitster, Shaoxuan Yuan

Turn on sparse index and remove ensure_full_index().

Change it to only expand the index when using --sparse.

The p2000 tests do not demonstrate a significant improvement,
because the index read is a small portion of the full process
time, compared to the blob parsing. The times below reflect the
time spent in the "do_read_index" trace region as shown using
GIT_TRACE2_PERF=1.

These measurements show a ~99.4% reduction in index read time for
`git grep` when using a sparse index.

Test                                  HEAD~        HEAD
-----------------------------------------------------------------------------
git grep --cached bogus (full-v3)     0.019        0.018  (-5.2%)
git grep --cached bogus (full-v4)     0.017        0.016  (-5.8%)
git grep --cached bogus (sparse-v3)   0.29         0.0015 (-99.4%)
git grep --cached bogus (sparse-v4)   0.30         0.0018 (-99.4%)

Optional reading about performance test results
-----------------------------------------------
Notice that because `git-grep` needs to parse blobs in the index, the
index reading time is minuscule compared to the object parsing time.
Because of this, the p2000 test results cannot clearly reflect the
speedup for index reading: combined with the object parsing time,
the aggregated time difference between HEAD~1 and HEAD is extremely
small.

Hence, the results presented here are not directly extracted from the
p2000 test results. Instead, to make the performance difference more
visible, the test command was manually run with GIT_TRACE2_PERF in the
four repos (full-v3, sparse-v3, full-v4, sparse-v4). The numbers here
are then extracted from the time difference between "region_enter" and
"region_leave" of label "do_read_index".

Helped-by: Victoria Dye <vdye@github.com>
Helped-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>
---
 builtin/grep.c                           | 10 ++++++++--
 t/t1092-sparse-checkout-compatibility.sh | 18 ++++++++++++++++++
 2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index 12abd832fa..a0b4dbc1dc 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -522,8 +522,9 @@ static int grep_cache(struct grep_opt *opt,
 	if (repo_read_index(repo) < 0)
 		die(_("index file corrupt"));
 
-	/* TODO: audit for interaction with sparse-index. */
-	ensure_full_index(repo->index);
+	if (grep_sparse)
+		ensure_full_index(repo->index);
+
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
 
@@ -992,6 +993,11 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 			     PARSE_OPT_KEEP_DASHDASH |
 			     PARSE_OPT_STOP_AT_NON_OPTION);
 
+	if (the_repository->gitdir) {
+		prepare_repo_settings(the_repository);
+		the_repository->settings.command_requires_full_index = 0;
+	}
+
 	if (use_index && !startup_info->have_repository) {
 		int fallback = 0;
 		git_config_get_bool("grep.fallbacktonoindex", &fallback);
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 0302e36fd6..63becc3138 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -1972,4 +1972,22 @@ test_expect_success 'sparse index is not expanded: rm' '
 	ensure_not_expanded rm -r deep
 '
 
+test_expect_success 'grep with --sparse and --cached' '
+	init_repos &&
+
+	test_all_match git grep --sparse --cached a &&
+	test_all_match git grep --sparse --cached a -- "folder1/*"
+'
+
+test_expect_success 'grep is not expanded' '
+	init_repos &&
+
+	ensure_not_expanded grep a &&
+	ensure_not_expanded grep a -- deep/* &&
+
+	# All files within the folder1/* pathspec are sparse,
+	# so this command does not find any matches
+	ensure_not_expanded ! grep a -- folder1/*
+'
+
 test_done
-- 
2.37.0



* [PATCH v4 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse
  2022-09-03  0:36 ` [PATCH v4 0/3] grep: integrate with sparse index Shaoxuan Yuan
  2022-09-03  0:36   ` [PATCH v4 1/3] builtin/grep.c: add --sparse option Shaoxuan Yuan
  2022-09-03  0:36   ` [PATCH v4 2/3] builtin/grep.c: integrate with sparse index Shaoxuan Yuan
@ 2022-09-03  0:36   ` Shaoxuan Yuan
  2022-09-03  4:39     ` Junio C Hamano
  2 siblings, 1 reply; 69+ messages in thread
From: Shaoxuan Yuan @ 2022-09-03  0:36 UTC (permalink / raw)
  To: git; +Cc: derrickstolee, vdye, gitster, Shaoxuan Yuan

Before this patch, whenever --sparse is used, `git-grep` utilizes the
ensure_full_index() method to expand the index and search all the
entries. Because this method requires walking all the trees and
constructing the index, it is the slow part within the whole command.

To achieve better performance, this patch uses grep_tree() to search the
sparse directory entries and get rid of the ensure_full_index() method.

Why is grep_tree() a better choice than ensure_full_index()?

1) grep_tree() is as correct as ensure_full_index(). grep_tree() looks
   into every sparse-directory entry (represented by a tree) recursively
   when looping over the index, and the result of doing so matches the
   result of expanding the index.

2) grep_tree() utilizes pathspecs to limit the scope of searching.
   ensure_full_index() always expands the index when --sparse is used,
   which means it always walks all the trees and blobs in the repo,
   even if the user only wants a subset of the content (i.e. passes a
   pathspec). grep_tree(), on the other hand, only searches the
   contents that match the pathspec, and thus possibly walks fewer
   trees.

3) grep_tree() does not construct and copy back a new index, while
   ensure_full_index() does. This also saves some time.
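
Point 2 can be pictured with a toy model. This is only a sketch under
stated assumptions, not git code: struct toy_tree, may_match() and
count_walked() are hypothetical, the directory layout is made up, and a
plain directory prefix stands in for a real pathspec such as
"f2/f1/f1/*". The walk only descends into directories that lead to, or
lie inside, the prefix:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy directory node; path ends in '/', "" is the root. */
struct toy_tree {
	const char *path;
	const struct toy_tree *children;
	size_t nr;
};

/* Made-up layout: f1/, f2/{f1/{f1/,f2/},f2/} */
static const struct toy_tree level3[] = {
	{ "f2/f1/f1/", NULL, 0 }, { "f2/f1/f2/", NULL, 0 }
};
static const struct toy_tree level2[] = {
	{ "f2/f1/", level3, 2 }, { "f2/f2/", NULL, 0 }
};
static const struct toy_tree level1[] = {
	{ "f1/", NULL, 0 }, { "f2/", level2, 2 }
};
static const struct toy_tree root = { "", level1, 2 };

/*
 * A directory is worth entering if it is an ancestor of the prefix or
 * lies inside it, i.e. the two strings agree on their common length.
 */
static int may_match(const char *dir, const char *prefix)
{
	size_t dl = strlen(dir), pl = strlen(prefix);

	return !strncmp(dir, prefix, dl < pl ? dl : pl);
}

/* Count the trees visited when the walk is limited by the prefix. */
static size_t count_walked(const struct toy_tree *t, const char *prefix)
{
	size_t i, walked = 1;

	assert(t != NULL);
	for (i = 0; i < t->nr; i++)
		if (may_match(t->children[i].path, prefix))
			walked += count_walked(&t->children[i], prefix);
	return walked;
}
```

With an empty prefix every tree in the toy layout is visited, which
mirrors what expanding the whole index has to do; with the prefix
"f2/f1/f1/" only the trees along that path are visited.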

----------------
Performance test

- Summary:

p2000 tests demonstrate a ~71% execution time reduction for
`git grep --cached --sparse bogus -- "f2/f1/f1/*"` using tree-walking
logic. However, notice that this result varies depending on the pathspec
given. See "Command used for testing" below for more details.

Test                              HEAD~   HEAD
-------------------------------------------------------
2000.78: git grep ... (full-v3)   0.35    0.39 (≈)
2000.79: git grep ... (full-v4)   0.36    0.30 (≈)
2000.80: git grep ... (sparse-v3) 0.88    0.23 (-73.8%)
2000.81: git grep ... (sparse-v4) 0.83    0.26 (-68.6%)

- Command used for testing:

	git grep --cached --sparse bogus -- "f2/f1/f1/*"

The reason for specifying a pathspec is that, without one, grep_tree()
walks all the trees and blobs to find the pattern, and the time consumed
doing so is not too different from using the original
ensure_full_index() method, which also spends most of the time walking
trees. However, when a pathspec is specified, the new logic only walks
the trees enclosed by the pathspec, and takes considerably less time.

Generally speaking, because the performance gain is achieved by walking
fewer trees, as selected by the pathspec, the ratio of HEAD time to
HEAD~ time in sparse-v[3|4] should be proportional to the ratio of
"pathspec enclosed area" to "all area". Namely, the more the <pathspec>
encompasses, the smaller the performance difference between HEAD~ and
HEAD, and vice versa.

That is, if we don't specify a pathspec, the two methods perform
indistinguishably [1]: both walk all the trees and take roughly the
same amount of time (even with the index construction time included for
ensure_full_index()).

[1] Performance test result without pathspec (hence walking all trees):

	Command used:

		git grep --cached --sparse bogus

	Test                                HEAD~  HEAD
	---------------------------------------------------
	2000.78: git grep ... (full-v3)     6.17   5.19 (≈)
	2000.79: git grep ... (full-v4)     6.19   5.46 (≈)
	2000.80: git grep ... (sparse-v3)   6.57   6.44 (≈)
	2000.81: git grep ... (sparse-v4)   6.65   6.28 (≈)

Suggested-by: Derrick Stolee <derrickstolee@github.com>
Helped-by: Derrick Stolee <derrickstolee@github.com>
Helped-by: Victoria Dye <vdye@github.com>
Signed-off-by: Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>
---
 builtin/grep.c                           | 17 +++++++++++++----
 t/perf/p2000-sparse-operations.sh        |  1 +
 t/t1092-sparse-checkout-compatibility.sh | 10 +++++++++-
 3 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index a0b4dbc1dc..d8c086abff 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -522,9 +522,6 @@ static int grep_cache(struct grep_opt *opt,
 	if (repo_read_index(repo) < 0)
 		die(_("index file corrupt"));
 
-	if (grep_sparse)
-		ensure_full_index(repo->index);
-
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
 
@@ -537,8 +534,20 @@ static int grep_cache(struct grep_opt *opt,
 
 		strbuf_setlen(&name, name_base_len);
 		strbuf_addstr(&name, ce->name);
+		if (S_ISSPARSEDIR(ce->ce_mode)) {
+			enum object_type type;
+			struct tree_desc tree;
+			void *data;
+			unsigned long size;
+
+			data = read_object_file(&ce->oid, &type, &size);
+			init_tree_desc(&tree, data, size);
 
-		if (S_ISREG(ce->ce_mode) &&
+			hit |= grep_tree(opt, pathspec, &tree, &name, 0, 0);
+			strbuf_reset(&name);
+			strbuf_addstr(&name, ce->name);
+			free(data);
+		} else if (S_ISREG(ce->ce_mode) &&
 		    match_pathspec(repo->index, pathspec, name.buf, name.len, 0, NULL,
 				   S_ISDIR(ce->ce_mode) ||
 				   S_ISGITLINK(ce->ce_mode))) {
diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
index fce8151d41..3242cfe91a 100755
--- a/t/perf/p2000-sparse-operations.sh
+++ b/t/perf/p2000-sparse-operations.sh
@@ -124,5 +124,6 @@ test_perf_on_all git read-tree -mu HEAD
 test_perf_on_all git checkout-index -f --all
 test_perf_on_all git update-index --add --remove $SPARSE_CONE/a
 test_perf_on_all "git rm -f $SPARSE_CONE/a && git checkout HEAD -- $SPARSE_CONE/a"
+test_perf_on_all git grep --cached --sparse bogus -- "f2/f1/f1/*"
 
 test_done
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 63becc3138..56e4614276 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -1987,7 +1987,15 @@ test_expect_success 'grep is not expanded' '
 
 	# All files within the folder1/* pathspec are sparse,
 	# so this command does not find any matches
-	ensure_not_expanded ! grep a -- folder1/*
+	ensure_not_expanded ! grep a -- folder1/* &&
+
+	# test out-of-cone pathspec with or without wildcard
+	ensure_not_expanded grep --sparse --cached a -- "folder1/a" &&
+	ensure_not_expanded grep --sparse --cached a -- "folder1/*" &&
+
+	# test in-cone pathspec with or without wildcard
+	ensure_not_expanded grep --sparse --cached a -- "deep/a" &&
+	ensure_not_expanded grep --sparse --cached a -- "deep/*"
 '
 
 test_done
-- 
2.37.0



* Re: [PATCH v4 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse
  2022-09-03  0:36   ` [PATCH v4 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse Shaoxuan Yuan
@ 2022-09-03  4:39     ` Junio C Hamano
  2022-09-08  0:24       ` Shaoxuan Yuan
  0 siblings, 1 reply; 69+ messages in thread
From: Junio C Hamano @ 2022-09-03  4:39 UTC (permalink / raw)
  To: Shaoxuan Yuan; +Cc: git, derrickstolee, vdye

Shaoxuan Yuan <shaoxuan.yuan02@gmail.com> writes:

> @@ -537,8 +534,20 @@ static int grep_cache(struct grep_opt *opt,
>  
>  		strbuf_setlen(&name, name_base_len);
>  		strbuf_addstr(&name, ce->name);
> +		if (S_ISSPARSEDIR(ce->ce_mode)) {
> +			enum object_type type;
> +			struct tree_desc tree;
> +			void *data;
> +			unsigned long size;
> +
> +			data = read_object_file(&ce->oid, &type, &size);
> +			init_tree_desc(&tree, data, size);
>  
> -		if (S_ISREG(ce->ce_mode) &&
> +			hit |= grep_tree(opt, pathspec, &tree, &name, 0, 0);
> +			strbuf_reset(&name);

Is this correct?

I would have expected that this would chomp to name_base_len, just
like what the code before this if/elseif cascade did.

There needs to be a test that is run with repo->submodule_prefix != NULL
to uncover issues like this, perhaps?
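
A small sketch of the concern, assuming the name buffer starts with a
prefix of name_base_len bytes (for example a submodule prefix). The
struct namebuf type and its helpers are hypothetical stand-ins written
for this illustration, not git's strbuf API:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a strbuf: fixed-size buffer plus length. */
struct namebuf {
	char buf[256];
	size_t len;
};

/* Truncate (or initialize) the buffer to len bytes. */
static void namebuf_setlen(struct namebuf *nb, size_t len)
{
	assert(len < sizeof(nb->buf));
	nb->len = len;
	nb->buf[len] = '\0';
}

/* Append a NUL-terminated string. */
static void namebuf_addstr(struct namebuf *nb, const char *s)
{
	size_t n = strlen(s);

	assert(nb->len + n < sizeof(nb->buf));
	memcpy(nb->buf + nb->len, s, n + 1);
	nb->len += n;
}

/*
 * Rebuild the entry name after a tree walk: keep the first keep_len
 * bytes, then append the entry name again.
 */
static const char *rebuild_name(struct namebuf *nb, size_t keep_len,
				const char *ce_name)
{
	namebuf_setlen(nb, keep_len);
	namebuf_addstr(nb, ce_name);
	return nb->buf;
}
```

If the buffer held "sub/folder1/" with a 4-byte prefix, keeping 4 bytes
rebuilds "sub/folder1/", while keeping 0 bytes (a full reset) yields
"folder1/" and loses the prefix.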

> +			strbuf_addstr(&name, ce->name);
> +			free(data);
> +		} else if (S_ISREG(ce->ce_mode) &&
>  		    match_pathspec(repo->index, pathspec, name.buf, name.len, 0, NULL,
>  				   S_ISDIR(ce->ce_mode) ||
>  				   S_ISGITLINK(ce->ce_mode))) {
> diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
> index fce8151d41..3242cfe91a 100755
> --- a/t/perf/p2000-sparse-operations.sh
> +++ b/t/perf/p2000-sparse-operations.sh
> @@ -124,5 +124,6 @@ test_perf_on_all git read-tree -mu HEAD
>  test_perf_on_all git checkout-index -f --all
>  test_perf_on_all git update-index --add --remove $SPARSE_CONE/a
>  test_perf_on_all "git rm -f $SPARSE_CONE/a && git checkout HEAD -- $SPARSE_CONE/a"
> +test_perf_on_all git grep --cached --sparse bogus -- "f2/f1/f1/*"
>  
>  test_done
> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> index 63becc3138..56e4614276 100755
> --- a/t/t1092-sparse-checkout-compatibility.sh
> +++ b/t/t1092-sparse-checkout-compatibility.sh
> @@ -1987,7 +1987,15 @@ test_expect_success 'grep is not expanded' '
>  
>  	# All files within the folder1/* pathspec are sparse,
>  	# so this command does not find any matches
> -	ensure_not_expanded ! grep a -- folder1/*
> +	ensure_not_expanded ! grep a -- folder1/* &&
> +
> +	# test out-of-cone pathspec with or without wildcard
> +	ensure_not_expanded grep --sparse --cached a -- "folder1/a" &&
> +	ensure_not_expanded grep --sparse --cached a -- "folder1/*" &&
> +
> +	# test in-cone pathspec with or without wildcard
> +	ensure_not_expanded grep --sparse --cached a -- "deep/a" &&
> +	ensure_not_expanded grep --sparse --cached a -- "deep/*"
>  '
>  
>  test_done


* [PATCH v5 0/3] grep: integrate with sparse index
  2022-08-17  7:56 [PATCH v1 0/2] grep: integrate with sparse index Shaoxuan Yuan
                   ` (5 preceding siblings ...)
  2022-09-03  0:36 ` [PATCH v4 0/3] grep: integrate with sparse index Shaoxuan Yuan
@ 2022-09-08  0:18 ` Shaoxuan Yuan
  2022-09-08  0:18   ` [PATCH v5 1/3] builtin/grep.c: add --sparse option Shaoxuan Yuan
                     ` (2 more replies)
  2022-09-23  4:18 ` [PATCH v6 0/1] grep: integrate with sparse index Shaoxuan Yuan
  7 siblings, 3 replies; 69+ messages in thread
From: Shaoxuan Yuan @ 2022-09-08  0:18 UTC (permalink / raw)
  To: shaoxuan.yuan02; +Cc: derrickstolee, vdye, git, gitster

Integrate `git-grep` with sparse-index and test the performance
improvement.

Changes since v4
----------------
* Reset the length of `struct strbuf name` back to `name_base_len`,
  instead of 0, after `grep_tree()` returns.

* Add test cases in t1092 for `grep` recursing into submodules.

* Add a few NEEDSWORK comments to explain the current problem with
  submodules.

Changes since v3
----------------
* Shorten the perf result tables in commit message.

* Update the commit message to reflect the changes in the commit.

* Update the commit message to indicate the performance improvement
  is dependent on the pathspec.

* Stop passing `ce_mode` through `check_attr`. Instead, set the
  `base_len` to 0, which is easier to reason about and avoids abusing
  `check_attr`.

* Remove another invention of `base`. Use the existing `name` as the
  argument for `grep_tree()`, and reset it back to `ce->name` after
  `grep_tree()` returns.

* Update the p2000 test to use a more general pathspec for better
  compatibility (i.e. do not use git repository specific pathspec).

* Add tests to t1092 'grep is not expanded' to verify the change
  brought by "builtin/grep.c: walking tree instead of expanding index
  with --sparse": the index *never* expands.

Changes since v2
----------------

* Modify the commit message for "builtin/grep.c: integrate with sparse
  index" to make it obvious that the perf test results are not from
  p2000 tests, but from manual perf runs.

* Add tree-walking logic as an extra (the third) patch to improve the
  performance when --sparse is used. This resolved the left-over-bit
  in v2 [1].

[1] https://lore.kernel.org/git/20220829232843.183711-1-shaoxuan.yuan02@gmail.com/

Changes since v1
----------------

* Rewrite the commit message for "builtin/grep.c: add --sparse option"
  to be clearer.

* Update the documentation (both in-code and man page) for --sparse.

* Add a few tests to test the new behavior (when _only_ --cached is
  supplied).

* Reformat the perf test results so they do not look like they come
  directly from the p2000 tests.

* Put the "command_requires_full_index" lines right after parse_options().

* Add a pathspec test in t1092, and reword a few test descriptions.

Shaoxuan Yuan (3):
  builtin/grep.c: add --sparse option
  builtin/grep.c: integrate with sparse index
  builtin/grep.c: walking tree instead of expanding index with --sparse

 Documentation/git-grep.txt               |  5 +-
 builtin/grep.c                           | 58 +++++++++++++++++--
 t/perf/p2000-sparse-operations.sh        |  1 +
 t/t1092-sparse-checkout-compatibility.sh | 72 ++++++++++++++++++++++++
 t/t7817-grep-sparse-checkout.sh          | 34 +++++++++--
 5 files changed, 159 insertions(+), 11 deletions(-)

Range-diff against v4:
1:  00a8b3a68e = 1:  c3d33e487c builtin/grep.c: add --sparse option
2:  3e0786722c = 2:  c5366f51b8 builtin/grep.c: integrate with sparse index
3:  81afe2fcb3 ! 3:  52bb802eae builtin/grep.c: walking tree instead of expanding index with --sparse
    @@ Commit message
                 2000.80: git grep ... (sparse-v3)   6.57   6.44 (≈)
                 2000.81: git grep ... (sparse-v4)   6.65   6.28 (≈)
     
    +    --------------------------
    +    NEEDSWORK about submodules
    +
    +    There are a few NEEDSWORKs that belong to improvements beyond this
    +    topic. See the NEEDSWORK in builtin/grep.c::grep_submodule() for
    +    more context. The other two NEEDSWORKs in t1092 are also related.
    +
         Suggested-by: Derrick Stolee <derrickstolee@github.com>
         Helped-by: Derrick Stolee <derrickstolee@github.com>
         Helped-by: Victoria Dye <vdye@github.com>
         Signed-off-by: Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>
     
      ## builtin/grep.c ##
    +@@ builtin/grep.c: static int grep_submodule(struct grep_opt *opt,
    + 	 * subrepo's odbs to the in-memory alternates list.
    + 	 */
    + 	obj_read_lock();
    ++
    ++	/*
    ++	 * NEEDSWORK: when reading a submodule, the sparsity settings in the
    ++	 * superproject are incorrectly forgotten or misused. For example:
    ++	 *
    ++	 * 1. "command_requires_full_index"
    ++	 * 	When this setting is turned on for `grep`, only the superproject
    ++	 *	knows it. All the submodules are read with their own configs
    ++	 *	and get prepare_repo_settings()'d. Therefore, these submodules
    ++	 *	"forget" the sparse-index feature switch. As a result, the index
    ++	 *	of these submodules are expanded unexpectedly.
    ++	 *
    ++	 * 2. "core_apply_sparse_checkout"
    ++	 *	When running `grep` in the superproject, this setting is
    ++	 *	populated using the superproject's configs. However, once
    ++	 *	initialized, this config is globally accessible and is read by
    ++	 *	prepare_repo_settings() for the submodules. For instance, if a
    ++	 *	submodule is using a sparse-checkout, however, the superproject
    ++	 *	is not, the result is that the config from the superproject will
    ++	 *	dictate the behavior for the submodule, making it "forget" its
    ++	 *	sparse-checkout state.
    ++	 *
    ++	 * 3. "core_sparse_checkout_cone"
    ++	 *	ditto.
    ++	 *
    ++	 * Note that this list is not exhaustive.
    ++	 */
    + 	repo_read_gitmodules(subrepo, 0);
    + 
    + 	/*
     @@ builtin/grep.c: static int grep_cache(struct grep_opt *opt,
      	if (repo_read_index(repo) < 0)
      		die(_("index file corrupt"));
    @@ builtin/grep.c: static int grep_cache(struct grep_opt *opt,
      
     -		if (S_ISREG(ce->ce_mode) &&
     +			hit |= grep_tree(opt, pathspec, &tree, &name, 0, 0);
    -+			strbuf_reset(&name);
    ++			strbuf_setlen(&name, name_base_len);
     +			strbuf_addstr(&name, ce->name);
     +			free(data);
     +		} else if (S_ISREG(ce->ce_mode) &&
    @@ t/perf/p2000-sparse-operations.sh: test_perf_on_all git read-tree -mu HEAD
      test_done
     
      ## t/t1092-sparse-checkout-compatibility.sh ##
    +@@ t/t1092-sparse-checkout-compatibility.sh: init_repos () {
    + 	git -C sparse-index sparse-checkout set deep
    + }
    + 
    ++init_repos_as_submodules () {
    ++	git reset --hard &&
    ++	init_repos &&
    ++	git submodule add ./full-checkout &&
    ++	git submodule add ./sparse-checkout &&
    ++	git submodule add ./sparse-index &&
    ++
    ++	git submodule status >actual &&
    ++	grep full-checkout actual &&
    ++	grep sparse-checkout actual &&
    ++	grep sparse-index actual
    ++}
    ++
    + run_on_sparse () {
    + 	(
    + 		cd sparse-checkout &&
     @@ t/t1092-sparse-checkout-compatibility.sh: test_expect_success 'grep is not expanded' '
      
      	# All files within the folder1/* pathspec are sparse,
    @@ t/t1092-sparse-checkout-compatibility.sh: test_expect_success 'grep is not expan
     +	# test in-cone pathspec with or without wildcard
     +	ensure_not_expanded grep --sparse --cached a -- "deep/a" &&
     +	ensure_not_expanded grep --sparse --cached a -- "deep/*"
    ++'
    ++
    ++# NEEDSWORK: when running `grep` in the superproject with --recurse-submodules,
    ++# Git expands the index of the submodules unexpectedly. Even though `grep`
    ++# builtin is marked as "command_requires_full_index = 0", this config is only
    ++# useful for the superproject. Namely, the submodules have their own configs,
    ++# which are _not_ populated by the one-time sparse-index feature switch.
    ++test_expect_failure 'grep within submodules is not expanded' '
    ++	init_repos_as_submodules &&
    ++
    ++	# do not use ensure_not_expanded() here, because `grep` should be
    ++	# run in the superproject, not in "./sparse-index"
    ++	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
    ++	git grep --sparse --cached --recurse-submodules a -- "*/folder1/*" &&
    ++	test_region ! index ensure_full_index trace2.txt
    ++'
    ++
    ++# NEEDSWORK: this test is not actually testing the code. The design purpose
    ++# of this test is to verify the grep result when the submodules are using a
    ++# sparse-index. Namely, we want "folder1/" as a tree (a sparse directory); but
    ++# because of the index expansion, we are now grepping the "folder1/a" blob.
    ++# Because of the problem stated above 'grep within submodules is not expanded',
    ++# we don't have the ideal test environment yet.
    ++test_expect_success 'grep sparse directory within submodules' '
    ++	init_repos_as_submodules &&
    ++
    ++	cat >expect <<-\EOF &&
    ++	full-checkout/folder1/a:a
    ++	sparse-checkout/folder1/a:a
    ++	sparse-index/folder1/a:a
    ++	EOF
    ++	git grep --sparse --cached --recurse-submodules a -- "*/folder1/*" >actual &&
    ++	test_cmp actual expect
      '
      
      test_done

base-commit: 79f2338b3746d23454308648b2491e5beba4beff
-- 
2.37.0



* [PATCH v5 1/3] builtin/grep.c: add --sparse option
  2022-09-08  0:18 ` [PATCH v5 0/3] grep: integrate with sparse index Shaoxuan Yuan
@ 2022-09-08  0:18   ` Shaoxuan Yuan
  2022-09-10  1:07     ` Victoria Dye
  2022-09-14  6:08     ` Elijah Newren
  2022-09-08  0:18   ` [PATCH v5 2/3] builtin/grep.c: integrate with sparse index Shaoxuan Yuan
  2022-09-08  0:18   ` [PATCH v5 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse Shaoxuan Yuan
  2 siblings, 2 replies; 69+ messages in thread
From: Shaoxuan Yuan @ 2022-09-08  0:18 UTC (permalink / raw)
  To: shaoxuan.yuan02; +Cc: derrickstolee, vdye, git, gitster

Add a --sparse option to `git-grep`.

When the '--cached' option is used with the 'git grep' command, the
search is limited to the blobs found in the index, not in the worktree.
If the user has enabled sparse-checkout, this might present more results
than they would like, since the files outside of the sparse-checkout are
unlikely to be important to them.

Change the default behavior of 'git grep' to focus on the files within
the sparse-checkout definition. To restore the previous behavior, add a
'--sparse' option to 'git grep' that, when paired with the '--cached'
option, also inspects paths outside of the sparse-checkout definition.
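
The resulting rule can be summarized with a small truth-table sketch.
This is only an illustration of the condition used in grep_cache();
the helper name skip_entry() is hypothetical and does not exist in
builtin/grep.c.

```c
#include <stdbool.h>

/*
 * Hypothetical helper mirroring the condition in grep_cache():
 * an index entry is skipped when it carries the SKIP_WORKTREE bit,
 * unless both --sparse and --cached were given.
 */
bool skip_entry(bool grep_sparse, bool cached, bool skip_worktree)
{
	return !(grep_sparse && cached) && skip_worktree;
}
```

With this rule, `--cached` alone skips SKIP_WORKTREE entries, and
`--sparse` has no effect unless `--cached` is also present.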

Suggested-by: Victoria Dye <vdye@github.com>
Helped-by: Derrick Stolee <derrickstolee@github.com>
Helped-by: Victoria Dye <vdye@github.com>
Signed-off-by: Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>
---
 Documentation/git-grep.txt      |  5 ++++-
 builtin/grep.c                  | 10 +++++++++-
 t/t7817-grep-sparse-checkout.sh | 34 +++++++++++++++++++++++++++------
 3 files changed, 41 insertions(+), 8 deletions(-)

diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index 58d944bd57..bdd3d5b8a6 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -28,7 +28,7 @@ SYNOPSIS
 	   [-f <file>] [-e] <pattern>
 	   [--and|--or|--not|(|)|-e <pattern>...]
 	   [--recurse-submodules] [--parent-basename <basename>]
-	   [ [--[no-]exclude-standard] [--cached | --no-index | --untracked] | <tree>...]
+	   [ [--[no-]exclude-standard] [--cached [--sparse] | --no-index | --untracked] | <tree>...]
 	   [--] [<pathspec>...]
 
 DESCRIPTION
@@ -45,6 +45,9 @@ OPTIONS
 	Instead of searching tracked files in the working tree, search
 	blobs registered in the index file.
 
+--sparse::
+	Use with --cached. Search outside of sparse-checkout definition.
+
 --no-index::
 	Search files in the current directory that is not managed by Git.
 
diff --git a/builtin/grep.c b/builtin/grep.c
index e6bcdf860c..12abd832fa 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -96,6 +96,8 @@ static pthread_cond_t cond_result;
 
 static int skip_first_line;
 
+static int grep_sparse = 0;
+
 static void add_work(struct grep_opt *opt, struct grep_source *gs)
 {
 	if (opt->binary != GREP_BINARY_TEXT)
@@ -525,7 +527,11 @@ static int grep_cache(struct grep_opt *opt,
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
 
-		if (!cached && ce_skip_worktree(ce))
+		/*
+		 * Skip entries with SKIP_WORKTREE unless both --sparse and
+		 * --cached are given.
+		 */
+		if (!(grep_sparse && cached) && ce_skip_worktree(ce))
 			continue;
 
 		strbuf_setlen(&name, name_base_len);
@@ -963,6 +969,8 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 			   PARSE_OPT_NOCOMPLETE),
 		OPT_INTEGER('m', "max-count", &opt.max_count,
 			N_("maximum number of results per file")),
+		OPT_BOOL(0, "sparse", &grep_sparse,
+			 N_("search the contents of files outside the sparse-checkout definition")),
 		OPT_END()
 	};
 	grep_prefix = prefix;
diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
index eb59564565..a9879cc980 100755
--- a/t/t7817-grep-sparse-checkout.sh
+++ b/t/t7817-grep-sparse-checkout.sh
@@ -118,13 +118,19 @@ test_expect_success 'grep searches unmerged file despite not matching sparsity p
 	test_cmp expect actual
 '
 
-test_expect_success 'grep --cached searches entries with the SKIP_WORKTREE bit' '
+test_expect_success 'grep --cached and --sparse searches entries with the SKIP_WORKTREE bit' '
+	cat >expect <<-EOF &&
+	a:text
+	EOF
+	git grep --cached "text" >actual &&
+	test_cmp expect actual &&
+
 	cat >expect <<-EOF &&
 	a:text
 	b:text
 	dir/c:text
 	EOF
-	git grep --cached "text" >actual &&
+	git grep --cached --sparse "text" >actual &&
 	test_cmp expect actual
 '
 
@@ -143,7 +149,15 @@ test_expect_success 'grep --recurse-submodules honors sparse checkout in submodu
 	test_cmp expect actual
 '
 
-test_expect_success 'grep --recurse-submodules --cached searches entries with the SKIP_WORKTREE bit' '
+test_expect_success 'grep --recurse-submodules --cached and --sparse searches entries with the SKIP_WORKTREE bit' '
+	cat >expect <<-EOF &&
+	a:text
+	sub/B/b:text
+	sub2/a:text
+	EOF
+	git grep --recurse-submodules --cached "text" >actual &&
+	test_cmp expect actual &&
+
 	cat >expect <<-EOF &&
 	a:text
 	b:text
@@ -152,7 +166,7 @@ test_expect_success 'grep --recurse-submodules --cached searches entries with th
 	sub/B/b:text
 	sub2/a:text
 	EOF
-	git grep --recurse-submodules --cached "text" >actual &&
+	git grep --recurse-submodules --cached --sparse "text" >actual &&
 	test_cmp expect actual
 '
 
@@ -166,7 +180,15 @@ test_expect_success 'working tree grep does not search the index with CE_VALID a
 	test_cmp expect actual
 '
 
-test_expect_success 'grep --cached searches index entries with both CE_VALID and SKIP_WORKTREE' '
+test_expect_success 'grep --cached and --sparse searches index entries with both CE_VALID and SKIP_WORKTREE' '
+	cat >expect <<-EOF &&
+	a:text
+	EOF
+	test_when_finished "git update-index --no-assume-unchanged b" &&
+	git update-index --assume-unchanged b &&
+	git grep --cached text >actual &&
+	test_cmp expect actual &&
+
 	cat >expect <<-EOF &&
 	a:text
 	b:text
@@ -174,7 +196,7 @@ test_expect_success 'grep --cached searches index entries with both CE_VALID and
 	EOF
 	test_when_finished "git update-index --no-assume-unchanged b" &&
 	git update-index --assume-unchanged b &&
-	git grep --cached text >actual &&
+	git grep --cached --sparse text >actual &&
 	test_cmp expect actual
 '
 
-- 
2.37.0



* [PATCH v5 2/3] builtin/grep.c: integrate with sparse index
  2022-09-08  0:18 ` [PATCH v5 0/3] grep: integrate with sparse index Shaoxuan Yuan
  2022-09-08  0:18   ` [PATCH v5 1/3] builtin/grep.c: add --sparse option Shaoxuan Yuan
@ 2022-09-08  0:18   ` Shaoxuan Yuan
  2022-09-08  0:18   ` [PATCH v5 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse Shaoxuan Yuan
  2 siblings, 0 replies; 69+ messages in thread
From: Shaoxuan Yuan @ 2022-09-08  0:18 UTC (permalink / raw)
  To: shaoxuan.yuan02; +Cc: derrickstolee, vdye, git, gitster

Turn on sparse index and remove ensure_full_index().

Change it to only expand the index when using --sparse.

The p2000 tests do not demonstrate a significant improvement,
because the index read is a small portion of the full process
time, compared to the blob parsing. The times below reflect the
time spent in the "do_read_index" trace region as shown using
GIT_TRACE2_PERF=1.

These measurements show a ~99.4% reduction in index read time for
`git grep --cached` using a sparse index.

Test                                  HEAD~        HEAD
-----------------------------------------------------------------------------
git grep --cached bogus (full-v3)     0.019        0.018  (-5.2%)
git grep --cached bogus (full-v4)     0.017        0.016  (-5.8%)
git grep --cached bogus (sparse-v3)   0.29         0.0015 (-99.4%)
git grep --cached bogus (sparse-v4)   0.30         0.0018 (-99.4%)
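
The percentages in the table above can be approximately recomputed
from the two timing columns. A minimal sketch follows; percent_change()
is a hypothetical helper, not part of the perf suite. Small differences
from the listed values are likely, since the table shows rounded
timings.

```c
/*
 * Relative change from `before` (HEAD~) to `after` (HEAD), in percent.
 * A negative result means HEAD is faster.
 */
double percent_change(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

For example, percent_change(0.30, 0.0018) is roughly -99.4, in line
with the sparse-v4 row.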

Optional reading about performance test results
-----------------------------------------------
Notice that because `git-grep` needs to parse blobs in the index, the
index reading time is minuscule compared to the object parsing time.
Because of this, the p2000 test results cannot clearly reflect the
speedup for index reading: combined with the object parsing time,
the aggregated times for HEAD~1 and HEAD are extremely close.

Hence, the results presented here are not directly extracted from the
p2000 test results. Instead, to make the performance difference more
visible, the test command is manually run with GIT_TRACE2_PERF in the
four repos (full-v3, sparse-v3, full-v4, sparse-v4). The numbers here
are then extracted from the time difference between "region_enter" and
"region_leave" of label "do_read_index".

Helped-by: Victoria Dye <vdye@github.com>
Helped-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>
---
 builtin/grep.c                           | 10 ++++++++--
 t/t1092-sparse-checkout-compatibility.sh | 18 ++++++++++++++++++
 2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index 12abd832fa..a0b4dbc1dc 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -522,8 +522,9 @@ static int grep_cache(struct grep_opt *opt,
 	if (repo_read_index(repo) < 0)
 		die(_("index file corrupt"));
 
-	/* TODO: audit for interaction with sparse-index. */
-	ensure_full_index(repo->index);
+	if (grep_sparse)
+		ensure_full_index(repo->index);
+
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
 
@@ -992,6 +993,11 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 			     PARSE_OPT_KEEP_DASHDASH |
 			     PARSE_OPT_STOP_AT_NON_OPTION);
 
+	if (the_repository->gitdir) {
+		prepare_repo_settings(the_repository);
+		the_repository->settings.command_requires_full_index = 0;
+	}
+
 	if (use_index && !startup_info->have_repository) {
 		int fallback = 0;
 		git_config_get_bool("grep.fallbacktonoindex", &fallback);
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 0302e36fd6..63becc3138 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -1972,4 +1972,22 @@ test_expect_success 'sparse index is not expanded: rm' '
 	ensure_not_expanded rm -r deep
 '
 
+test_expect_success 'grep with --sparse and --cached' '
+	init_repos &&
+
+	test_all_match git grep --sparse --cached a &&
+	test_all_match git grep --sparse --cached a -- "folder1/*"
+'
+
+test_expect_success 'grep is not expanded' '
+	init_repos &&
+
+	ensure_not_expanded grep a &&
+	ensure_not_expanded grep a -- deep/* &&
+
+	# All files within the folder1/* pathspec are sparse,
+	# so this command does not find any matches
+	ensure_not_expanded ! grep a -- folder1/*
+'
+
 test_done
-- 
2.37.0



* [PATCH v5 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse
  2022-09-08  0:18 ` [PATCH v5 0/3] grep: integrate with sparse index Shaoxuan Yuan
  2022-09-08  0:18   ` [PATCH v5 1/3] builtin/grep.c: add --sparse option Shaoxuan Yuan
  2022-09-08  0:18   ` [PATCH v5 2/3] builtin/grep.c: integrate with sparse index Shaoxuan Yuan
@ 2022-09-08  0:18   ` Shaoxuan Yuan
  2022-09-08 17:59     ` Junio C Hamano
  2022-09-10  2:04     ` Victoria Dye
  2 siblings, 2 replies; 69+ messages in thread
From: Shaoxuan Yuan @ 2022-09-08  0:18 UTC (permalink / raw)
  To: shaoxuan.yuan02; +Cc: derrickstolee, vdye, git, gitster

Before this patch, whenever --sparse is used, `git-grep` utilizes the
ensure_full_index() method to expand the index and search all the
entries. Because this method requires walking all the trees and
constructing the index, it is the slow part within the whole command.

To achieve better performance, this patch uses grep_tree() to search the
sparse directory entries and get rid of the ensure_full_index() method.

Why is grep_tree() a better choice than ensure_full_index()?

1) grep_tree() is as correct as ensure_full_index(). grep_tree() looks
   into every sparse-directory entry (represented by a tree) recursively
   when looping over the index, and the result of doing so matches the
   result of expanding the index.

2) grep_tree() utilizes pathspecs to limit the scope of searching.
   ensure_full_index() always expands the index when --sparse is used,
   which means it always walks all the trees and blobs in the repo,
   regardless of whether the user only wants a subset of the content,
   i.e. is using a pathspec. grep_tree(), on the other hand, only
   searches the contents that match the pathspec, and thus possibly
   walks fewer trees.

3) grep_tree() does not construct and copy back a new index, while
   ensure_full_index() does. This also saves some time.
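
Point 2 can be illustrated with a toy model. This is a sketch under
strong simplifying assumptions, not Git's code: struct toy_tree,
walk_all() and walk_limited() are hypothetical names, and real pathspec
matching (wildcards, magic, etc.) is reduced here to a plain directory
prefix.

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a tree object: a directory and its subdirectories. */
struct toy_tree {
	const char *path;                 /* e.g. "f2/f1/", "" for the root */
	const struct toy_tree **children;
	size_t nr_children;
};

/* Full expansion: every tree is visited, whatever the user asked for. */
size_t walk_all(const struct toy_tree *t)
{
	size_t visited = 1;
	for (size_t i = 0; i < t->nr_children; i++)
		visited += walk_all(t->children[i]);
	return visited;
}

/*
 * A tree can only contain matches if its path is a prefix of the
 * pathspec directory, or the pathspec directory is a prefix of its path.
 */
int may_match(const char *tree_path, const char *prefix)
{
	size_t tl = strlen(tree_path), pl = strlen(prefix);
	return !strncmp(tree_path, prefix, tl < pl ? tl : pl);
}

/* Pathspec-limited walk: prune subtrees that cannot match. */
size_t walk_limited(const struct toy_tree *t, const char *prefix)
{
	size_t visited = 1;
	if (!may_match(t->path, prefix))
		return 0;
	for (size_t i = 0; i < t->nr_children; i++)
		visited += walk_limited(t->children[i], prefix);
	return visited;
}

/* A six-tree toy repository: "", f1/, f2/, f2/f1/, f2/f1/f1/, f2/f2/. */
static const struct toy_tree f2_f1_f1 = { "f2/f1/f1/", NULL, 0 };
static const struct toy_tree *f2_f1_kids[] = { &f2_f1_f1 };
static const struct toy_tree f2_f1 = { "f2/f1/", f2_f1_kids, 1 };
static const struct toy_tree f2_f2 = { "f2/f2/", NULL, 0 };
static const struct toy_tree *f2_kids[] = { &f2_f1, &f2_f2 };
static const struct toy_tree f2 = { "f2/", f2_kids, 2 };
static const struct toy_tree f1 = { "f1/", NULL, 0 };
static const struct toy_tree *root_kids[] = { &f1, &f2 };
const struct toy_tree toy_root = { "", root_kids, 2 };
```

In this toy repository, walk_all(&toy_root) visits all six trees, while
walk_limited(&toy_root, "f2/f1/f1/") visits only the four trees on the
path to the requested directory. With an empty prefix, both walks visit
the same trees, which matches the "no pathspec" case described below.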

----------------
Performance test

- Summary:

p2000 tests demonstrate a ~71% execution time reduction for
`git grep --cached --sparse bogus -- "f2/f1/f1/*"` using tree-walking
logic. However, notice that this result varies depending on the pathspec
given. See "Command used for testing" below for more details.

Test                              HEAD~   HEAD
-------------------------------------------------------
2000.78: git grep ... (full-v3)   0.35    0.39 (≈)
2000.79: git grep ... (full-v4)   0.36    0.30 (≈)
2000.80: git grep ... (sparse-v3) 0.88    0.23 (-73.8%)
2000.81: git grep ... (sparse-v4) 0.83    0.26 (-68.6%)

- Command used for testing:

	git grep --cached --sparse bogus -- "f2/f1/f1/*"

The reason for specifying a pathspec is that, without one, grep_tree()
walks all the trees and blobs to find the pattern, and the time consumed
doing so is not too different from using the original
ensure_full_index() method, which also spends most of the time walking
trees. However, when a pathspec is specified, the new logic only walks
the trees enclosed by the pathspec, and the time consumed is
considerably lower.

Generally speaking, because the performance gain is achieved by walking
fewer trees, as selected by the pathspec, the ratio of HEAD time to
HEAD~ time in sparse-v[3|4] should be roughly proportional to the ratio
of "pathspec enclosed area" to "all area". Namely, the more the
<pathspec> encompasses, the smaller the performance difference between
HEAD~ and HEAD, and vice versa.

That is, if we don't specify a pathspec, the performance difference [1]
is indistinguishable: both methods walk all the trees and take roughly
the same amount of time (even with the index construction time included
for ensure_full_index()).

[1] Performance test result without pathspec (hence walking all trees):

	Command used:

		git grep --cached --sparse bogus

	Test                                HEAD~  HEAD
	---------------------------------------------------
	2000.78: git grep ... (full-v3)     6.17   5.19 (≈)
	2000.79: git grep ... (full-v4)     6.19   5.46 (≈)
	2000.80: git grep ... (sparse-v3)   6.57   6.44 (≈)
	2000.81: git grep ... (sparse-v4)   6.65   6.28 (≈)

--------------------------
NEEDSWORK about submodules

There are a few NEEDSWORKs that describe improvements beyond the scope
of this topic. See the NEEDSWORK in builtin/grep.c::grep_submodule() for
more context. The other two NEEDSWORKs in t1092 are also related.

Suggested-by: Derrick Stolee <derrickstolee@github.com>
Helped-by: Derrick Stolee <derrickstolee@github.com>
Helped-by: Victoria Dye <vdye@github.com>
Signed-off-by: Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>
---
 builtin/grep.c                           | 44 +++++++++++++++++--
 t/perf/p2000-sparse-operations.sh        |  1 +
 t/t1092-sparse-checkout-compatibility.sh | 56 +++++++++++++++++++++++-
 3 files changed, 96 insertions(+), 5 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index a0b4dbc1dc..9a01932253 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -460,6 +460,33 @@ static int grep_submodule(struct grep_opt *opt,
 	 * subrepo's odbs to the in-memory alternates list.
 	 */
 	obj_read_lock();
+
+	/*
+	 * NEEDSWORK: when reading a submodule, the sparsity settings in the
+	 * superproject are incorrectly forgotten or misused. For example:
+	 *
+	 * 1. "command_requires_full_index"
+	 * 	When this setting is turned on for `grep`, only the superproject
+	 *	knows it. All the submodules are read with their own configs
+	 *	and get prepare_repo_settings()'d. Therefore, these submodules
+	 *	"forget" the sparse-index feature switch. As a result, the index
+	 *	of these submodules are expanded unexpectedly.
+	 *
+	 * 2. "core_apply_sparse_checkout"
+	 *	When running `grep` in the superproject, this setting is
+	 *	populated using the superproject's configs. However, once
+	 *	initialized, this config is globally accessible and is read by
+	 *	prepare_repo_settings() for the submodules. For instance, if a
+	 *	submodule is using a sparse-checkout, however, the superproject
+	 *	is not, the result is that the config from the superproject will
+	 *	dictate the behavior for the submodule, making it "forget" its
+	 *	sparse-checkout state.
+	 *
+	 * 3. "core_sparse_checkout_cone"
+	 *	ditto.
+	 *
+	 * Note that this list is not exhaustive.
+	 */
 	repo_read_gitmodules(subrepo, 0);
 
 	/*
@@ -522,9 +549,6 @@ static int grep_cache(struct grep_opt *opt,
 	if (repo_read_index(repo) < 0)
 		die(_("index file corrupt"));
 
-	if (grep_sparse)
-		ensure_full_index(repo->index);
-
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
 
@@ -537,8 +561,20 @@ static int grep_cache(struct grep_opt *opt,
 
 		strbuf_setlen(&name, name_base_len);
 		strbuf_addstr(&name, ce->name);
+		if (S_ISSPARSEDIR(ce->ce_mode)) {
+			enum object_type type;
+			struct tree_desc tree;
+			void *data;
+			unsigned long size;
+
+			data = read_object_file(&ce->oid, &type, &size);
+			init_tree_desc(&tree, data, size);
 
-		if (S_ISREG(ce->ce_mode) &&
+			hit |= grep_tree(opt, pathspec, &tree, &name, 0, 0);
+			strbuf_setlen(&name, name_base_len);
+			strbuf_addstr(&name, ce->name);
+			free(data);
+		} else if (S_ISREG(ce->ce_mode) &&
 		    match_pathspec(repo->index, pathspec, name.buf, name.len, 0, NULL,
 				   S_ISDIR(ce->ce_mode) ||
 				   S_ISGITLINK(ce->ce_mode))) {
diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
index fce8151d41..3242cfe91a 100755
--- a/t/perf/p2000-sparse-operations.sh
+++ b/t/perf/p2000-sparse-operations.sh
@@ -124,5 +124,6 @@ test_perf_on_all git read-tree -mu HEAD
 test_perf_on_all git checkout-index -f --all
 test_perf_on_all git update-index --add --remove $SPARSE_CONE/a
 test_perf_on_all "git rm -f $SPARSE_CONE/a && git checkout HEAD -- $SPARSE_CONE/a"
+test_perf_on_all git grep --cached --sparse bogus -- "f2/f1/f1/*"
 
 test_done
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 63becc3138..fda05faadf 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -162,6 +162,19 @@ init_repos () {
 	git -C sparse-index sparse-checkout set deep
 }
 
+init_repos_as_submodules () {
+	git reset --hard &&
+	init_repos &&
+	git submodule add ./full-checkout &&
+	git submodule add ./sparse-checkout &&
+	git submodule add ./sparse-index &&
+
+	git submodule status >actual &&
+	grep full-checkout actual &&
+	grep sparse-checkout actual &&
+	grep sparse-index actual
+}
+
 run_on_sparse () {
 	(
 		cd sparse-checkout &&
@@ -1987,7 +2000,48 @@ test_expect_success 'grep is not expanded' '
 
 	# All files within the folder1/* pathspec are sparse,
 	# so this command does not find any matches
-	ensure_not_expanded ! grep a -- folder1/*
+	ensure_not_expanded ! grep a -- folder1/* &&
+
+	# test out-of-cone pathspec with or without wildcard
+	ensure_not_expanded grep --sparse --cached a -- "folder1/a" &&
+	ensure_not_expanded grep --sparse --cached a -- "folder1/*" &&
+
+	# test in-cone pathspec with or without wildcard
+	ensure_not_expanded grep --sparse --cached a -- "deep/a" &&
+	ensure_not_expanded grep --sparse --cached a -- "deep/*"
+'
+
+# NEEDSWORK: when running `grep` in the superproject with --recurse-submodules,
+# Git expands the index of the submodules unexpectedly. Even though `grep`
+# builtin is marked as "command_requires_full_index = 0", this config is only
+# useful for the superproject. Namely, the submodules have their own configs,
+# which are _not_ populated by the one-time sparse-index feature switch.
+test_expect_failure 'grep within submodules is not expanded' '
+	init_repos_as_submodules &&
+
+	# do not use ensure_not_expanded() here, because `grep` should be
+	# run in the superproject, not in "./sparse-index"
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
+	git grep --sparse --cached --recurse-submodules a -- "*/folder1/*" &&
+	test_region ! index ensure_full_index trace2.txt
+'
+
+# NEEDSWORK: this test is not actually testing the code. The design purpose
+# of this test is to verify the grep result when the submodules are using a
+# sparse-index. Namely, we want "folder1/" as a tree (a sparse directory); but
+# because of the index expansion, we are now grepping the "folder1/a" blob.
+# Because of the problem stated above 'grep within submodules is not expanded',
+# we don't have the ideal test environment yet.
+test_expect_success 'grep sparse directory within submodules' '
+	init_repos_as_submodules &&
+
+	cat >expect <<-\EOF &&
+	full-checkout/folder1/a:a
+	sparse-checkout/folder1/a:a
+	sparse-index/folder1/a:a
+	EOF
+	git grep --sparse --cached --recurse-submodules a -- "*/folder1/*" >actual &&
+	test_cmp actual expect
 '
 
 test_done
-- 
2.37.0



* Re: [PATCH v4 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse
  2022-09-03  4:39     ` Junio C Hamano
@ 2022-09-08  0:24       ` Shaoxuan Yuan
  0 siblings, 0 replies; 69+ messages in thread
From: Shaoxuan Yuan @ 2022-09-08  0:24 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, derrickstolee, vdye

On 9/2/2022 9:39 PM, Junio C Hamano wrote:
> Shaoxuan Yuan <shaoxuan.yuan02@gmail.com> writes:
> 
>> @@ -537,8 +534,20 @@ static int grep_cache(struct grep_opt *opt,
>>  
>>  		strbuf_setlen(&name, name_base_len);
>>  		strbuf_addstr(&name, ce->name);
>> +		if (S_ISSPARSEDIR(ce->ce_mode)) {
>> +			enum object_type type;
>> +			struct tree_desc tree;
>> +			void *data;
>> +			unsigned long size;
>> +
>> +			data = read_object_file(&ce->oid, &type, &size);
>> +			init_tree_desc(&tree, data, size);
>>  
>> -		if (S_ISREG(ce->ce_mode) &&
>> +			hit |= grep_tree(opt, pathspec, &tree, &name, 0, 0);
>> +			strbuf_reset(&name);
> 
> Is this correct?
> 
> I would have expected that this would chomp to name_base_len, just
> like what the code before this if/elseif cascade did.

OK.

> 
> There needs a test that is run with repo->submodule_prefix != NULL
> to uncover issues like this, perhaps?

I'm sorry that I forgot to directly reply to this. But I have sent a v5
[1] based on your suggestions here. Thanks for the review!

[1]
https://lore.kernel.org/git/20220908001854.206789-1-shaoxuan.yuan02@gmail.com/

Thanks,
Shaoxuan


* Re: [PATCH v5 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse
  2022-09-08  0:18   ` [PATCH v5 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse Shaoxuan Yuan
@ 2022-09-08 17:59     ` Junio C Hamano
  2022-09-08 20:46       ` Derrick Stolee
  2022-09-10  2:04     ` Victoria Dye
  1 sibling, 1 reply; 69+ messages in thread
From: Junio C Hamano @ 2022-09-08 17:59 UTC (permalink / raw)
  To: Shaoxuan Yuan; +Cc: derrickstolee, vdye, git

Shaoxuan Yuan <shaoxuan.yuan02@gmail.com> writes:

> +
> +	/*
> +	 * NEEDSWORK: when reading a submodule, the sparsity settings in the
> +	 * superproject are incorrectly forgotten or misused. For example:
> +	 *
> +	 * 1. "command_requires_full_index"
> +	 * 	When this setting is turned on for `grep`, only the superproject
> +	 *	knows it. All the submodules are read with their own configs
> +	 *	and get prepare_repo_settings()'d. Therefore, these submodules
> +	 *	"forget" the sparse-index feature switch. As a result, the index
> +	 *	of these submodules are expanded unexpectedly.

Is this fundamental, or is it just that this version of the patch is
incomplete in that it still does not propagate the bit from
the_repository->settings to submodule's settings?  Should a change
to propagate the bit be included for this topic to be complete?

To put it another way, when grep with this version of the patch
recurses into a submodule, does it work correctly even without
flipping command_requires_full_index on in the "struct repository"
instance for the submodule?  If so, then the NEEDSWORK above may be
just performance issue.  If it behaves incorrectly, then it means
we cannot safely make "git grep" aware of sparse index yet.  It is
hard to tell which one you meant in the above.

I think the same question needs to be asked for other points
(omitted from quoting) in this list.

Thanks.


* Re: [PATCH v5 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse
  2022-09-08 17:59     ` Junio C Hamano
@ 2022-09-08 20:46       ` Derrick Stolee
  2022-09-08 20:56         ` Junio C Hamano
  2022-09-13 17:23         ` Junio C Hamano
  0 siblings, 2 replies; 69+ messages in thread
From: Derrick Stolee @ 2022-09-08 20:46 UTC (permalink / raw)
  To: Junio C Hamano, Shaoxuan Yuan; +Cc: vdye, git

On 9/8/2022 1:59 PM, Junio C Hamano wrote:
> Shaoxuan Yuan <shaoxuan.yuan02@gmail.com> writes:
> 
>> +
>> +	/*
>> +	 * NEEDSWORK: when reading a submodule, the sparsity settings in the
>> +	 * superproject are incorrectly forgotten or misused. For example:
>> +	 *
>> +	 * 1. "command_requires_full_index"
>> +	 * 	When this setting is turned on for `grep`, only the superproject
>> +	 *	knows it. All the submodules are read with their own configs
>> +	 *	and get prepare_repo_settings()'d. Therefore, these submodules
>> +	 *	"forget" the sparse-index feature switch. As a result, the index
>> +	 *	of these submodules are expanded unexpectedly.
> 
> Is this fundamental, or is it just this version of the patch is
> incomplete in that it still does not propagate the bit from
> the_repository->settings to submodule's settings?  Should a change
> to propagate the bit be included for this topic to be complete?
> 
> To put it another way, when grep with this version of the patch
> recurses into a submodule, does it work correctly even without
> flipping command_requires_full_index on in the "struct repository"
> instance for the submodule?  If so, then the NEEDSWORK above may be
> just performance issue.  If it behaves incorrectly, then it means
> we cannot safely make "git grep" aware of sparse index yet.  It is
> hard to tell which one you meant in the above.
> 
> I think the same question needs to be asked for other points
> (omitted from quoting) in this list.

I think this comment is misplaced. It should either be contained in
the commit message or placed closer to this diff hunk:

>> @@ -537,8 +561,20 @@ static int grep_cache(struct grep_opt *opt,
>>  
>>  		strbuf_setlen(&name, name_base_len);
>>  		strbuf_addstr(&name, ce->name);
>> +		if (S_ISSPARSEDIR(ce->ce_mode)) {
>> +			enum object_type type;
>> +			struct tree_desc tree;
>> +			void *data;
>> +			unsigned long size;
>> +
>> +			data = read_object_file(&ce->oid, &type, &size);
>> +			init_tree_desc(&tree, data, size);
>>  
>> -		if (S_ISREG(ce->ce_mode) &&
>> +			hit |= grep_tree(opt, pathspec, &tree, &name, 0, 0);
>> +			strbuf_setlen(&name, name_base_len);
>> +			strbuf_addstr(&name, ce->name);
>> +			free(data);
>> +		} else if (S_ISREG(ce->ce_mode) &&

The conclusion we were trying to reach is that you (Junio) correctly
identified a bug in how we were calling grep_tree() in this hunk in
its v4 form.
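
To make the v4 vs. v5 difference concrete, here is a toy sketch. It
does not use Git's strbuf API; struct toy_buf and the helpers below are
hypothetical stand-ins. It assumes the name buffer starts out holding a
submodule prefix of length name_base_len, as it would when
repo->submodule_prefix is non-NULL.

```c
#include <string.h>

/* Toy stand-in for a strbuf, with a fixed capacity for brevity. */
struct toy_buf {
	char buf[256];
	size_t len;
};

/* Truncate to `len` bytes (len == 0 behaves like a full reset). */
void toy_setlen(struct toy_buf *sb, size_t len)
{
	sb->len = len;
	sb->buf[len] = '\0';
}

/* Append a string; silently ignore input that would not fit. */
void toy_addstr(struct toy_buf *sb, const char *s)
{
	size_t n = strlen(s);
	if (sb->len + n >= sizeof(sb->buf))
		return;
	memcpy(sb->buf + sb->len, s, n + 1);
	sb->len += n;
}

/*
 * Rebuild the entry name after the tree walk.
 * keep_prefix == 0 models the v4 hunk (reset to length 0),
 * keep_prefix != 0 models the v5 hunk (truncate to name_base_len).
 * Returns a pointer to a static buffer, valid until the next call.
 */
const char *name_after_walk(const char *prefix, const char *ce_name,
			    int keep_prefix)
{
	static struct toy_buf name;
	size_t name_base_len;

	toy_setlen(&name, 0);
	toy_addstr(&name, prefix);
	name_base_len = name.len;

	toy_addstr(&name, ce_name);	/* name passed to the tree walk */

	toy_setlen(&name, keep_prefix ? name_base_len : 0);
	toy_addstr(&name, ce_name);
	return name.buf;
}
```

With an empty prefix the two variants agree, which is why the problem
only shows up when a submodule prefix is present: with prefix "sub/"
and entry "folder1/", the v4-style reset yields "folder1/" while the
v5-style truncation yields "sub/folder1/".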

HOWEVER: it "doesn't matter" because the sparse index doesn't work
at all within a submodule. Specifically, if a super-repo does not
enable sparse-checkout, but the submodule _does_, then we don't
know how Git will behave currently. Shaoxuan's comment goes on to
explain why the situation is fraught:

* command_requires_full_index is set in a builtin only for the
  top-level project, so when we traverse into a submodule, we don't
  re-check if the current builtin has integrated with sparse index
  and expand a sparse index to a full one.

* core_apply_sparse_checkout is a global not even associated with
  a repository struct. What happens when a super project is not
  sparse but a submodule is? Or vice-versa? I honestly don't know,
  and it will require testing to find out.
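
The second concern can be pictured with a toy model. This sketches the
worry as the NEEDSWORK comment describes it, not how Git is known to
behave; struct toy_repo and all toy_* names are hypothetical and do not
correspond to Git's actual structs.

```c
#include <stdbool.h>

/* Hypothetical, simplified repository model. */
struct toy_repo {
	const char *name;
	bool sparse_checkout;	/* what this repo's own config says */
};

/* Process-wide flag, populated once from the top-level repository. */
static bool toy_apply_sparse_global;

void toy_init_global(const struct toy_repo *top)
{
	toy_apply_sparse_global = top->sparse_checkout;
}

/* Consulting the global ignores which repository is being asked about. */
bool toy_sparse_via_global(const struct toy_repo *repo)
{
	(void)repo;
	return toy_apply_sparse_global;
}

/* A per-repository lookup would answer for the repository itself. */
bool toy_sparse_per_repo(const struct toy_repo *repo)
{
	return repo->sparse_checkout;
}

const struct toy_repo toy_super = { "super", false };
const struct toy_repo toy_sub = { "sub", true };
```

In this model, after toy_init_global(&toy_super), the global lookup
reports the submodule as non-sparse even though its own setting says
otherwise. Whether Git really ends up in that state is exactly what
needs testing.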

Shaoxuan's comment is attempting to list the reasons why submodules
do not currently work with sparse-index, and specifically that we
can add tests that _should_ exercise this code in a meaningful way,
but because of the current limitations of the codebase, the code
isn't actually exercised in that scenario.

In order to actually create a test that demonstrates how submodules
and sparse-checkout work with this logic, we need to do some serious
refactoring of the sparse-checkout logic to care about the repository
struct, along with some other concerns specifically around the sparse
index. This doesn't seem appropriate for the GSoC timeline or even for
just this topic.

Victoria and I have noted this issue down and will try to find time
to investigate further, with a target of being able to actually
exercise this grep_tree() call within a sparse index in a submodule,
giving us full confidence that name_base_len is the correct value to
put in that parameter.

Thanks,
-Stolee



* Re: [PATCH v5 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse
  2022-09-08 20:46       ` Derrick Stolee
@ 2022-09-08 20:56         ` Junio C Hamano
  2022-09-08 21:06           ` Shaoxuan Yuan
  2022-09-09 12:49           ` Derrick Stolee
  2022-09-13 17:23         ` Junio C Hamano
  1 sibling, 2 replies; 69+ messages in thread
From: Junio C Hamano @ 2022-09-08 20:56 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Shaoxuan Yuan, vdye, git

Derrick Stolee <derrickstolee@github.com> writes:

> HOWEVER: it "doesn't matter" because the sparse index doesn't work
> at all within a submodule. Specifically, if a super-repo does not
> enable sparse-checkout, but the submodule _does_, then we don't
> know how Git will behave currently. His reasonings go on to explain
> why the situation is fraught:
>
> * command_requires_full_index is set in a builtin only for the
>   top-level project, so when we traverse into a submodule, we don't
>   re-check if the current builtin has integrated with sparse index
>   and expand a sparse index to a full one.

Correct.  

Is it sufficient to propagate the bit from the_repository->settings
to repo->settings of the submodule, or are there more things needed
to fix it?

> * core_apply_sparse_checkout is a global not even associated with
>   a repository struct. What happens when a super project is not
>   sparse but a submodule is? Or vice-versa? I honestly don't know,
>   and it will require testing to find out.

Naïvely, I would think that we should just treat a non-sparse case
as a mere specialization where the sparse cone covers everything,
but there may be pitfalls.
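
The "specialization" view can be sketched roughly as below.  This is a
toy model under stated assumptions, not Git's cone-mode matcher:
toy_path_in_cone() is a hypothetical name, and it only checks whether a
path lives under one of the listed directories.  An empty list stands
for a non-sparse checkout, where the cone covers everything.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy membership test, NOT Git's real cone-mode matching.
 * A path is "in cone" when it lives under one of the listed
 * directories.  nr == 0 models a non-sparse checkout, i.e. a cone
 * that covers the whole tree.
 */
int toy_path_in_cone(const char *const *dirs, size_t nr, const char *path)
{
	size_t i;

	assert(path != NULL);

	/* Non-sparse case: treated as "everything is in the cone". */
	if (nr == 0)
		return 1;

	for (i = 0; i < nr; i++) {
		size_t len = strlen(dirs[i]);

		/* Require a '/' after the prefix so "deep" does not match "deeper/a". */
		if (!strncmp(path, dirs[i], len) && path[len] == '/')
			return 1;
	}
	return 0;
}
```

With no directories listed, every path is reported as in-cone, so the
sparse and non-sparse cases share one code path in this model.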

> Shaoxuan's comment is attempting to list the reasons why submodules
> do not currently work with sparse-index,

"do not currently work" in a sense that it produces wrong result, or
does it just expand the in-core index unnecessarily before applying the
pathspec, producing the right result in an inefficient way?

If it is "functionally broken", is there a simple way out to give us
correct result even if it becomes less efficient?  Like "we scan the
index and we see we have some submodules---so we disable the sparse
handling"?
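
A minimal sketch of that fallback, assuming a simplified index model
(struct toy_entry and both function names are hypothetical; only the
gitlink mode value 0160000 is taken from Git itself):

```c
#include <assert.h>
#include <stddef.h>

#define TOY_S_IFMT      0170000u
#define TOY_S_IFGITLINK 0160000u	/* mode Git records for a submodule entry */

/* Hypothetical stand-in for an index entry; not Git's struct cache_entry. */
struct toy_entry {
	unsigned int mode;
	const char *path;
};

/* Returns 1 if any entry is a gitlink, i.e. the index records a submodule. */
int toy_index_has_submodules(const struct toy_entry *entries, size_t nr)
{
	size_t i;

	assert(entries != NULL || nr == 0);

	for (i = 0; i < nr; i++)
		if ((entries[i].mode & TOY_S_IFMT) == TOY_S_IFGITLINK)
			return 1;
	return 0;
}

/* Conservative policy: keep the sparse handling only when no submodule is seen. */
int toy_use_sparse_handling(const struct toy_entry *entries, size_t nr)
{
	return !toy_index_has_submodules(entries, nr);
}
```

This trades efficiency for safety: a single gitlink anywhere in the
index switches the whole command back to the non-sparse path.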

> Victoria and I have noted this issue down and will try to find time
> to investigate further, with a target of being able to actually
> exercise this grep_tree() call within a sparse index in a submodule,
> giving us full confidence that name_base_len is the correct value to
> put in that parameter.

OK.


* Re: [PATCH v5 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse
  2022-09-08 20:56         ` Junio C Hamano
@ 2022-09-08 21:06           ` Shaoxuan Yuan
  2022-09-09 12:49           ` Derrick Stolee
  1 sibling, 0 replies; 69+ messages in thread
From: Shaoxuan Yuan @ 2022-09-08 21:06 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee; +Cc: vdye, git

On 9/8/2022 1:56 PM, Junio C Hamano wrote:
>> Shaoxuan's comment is attempting to list the reasons why submodules
>> do not currently work with sparse-index,
> 
> "do not currently work" in a sense that it produces wrong result, or
> it just expands in-core index unnecessarily before applying pathspec
> to produce the right result in an inefficient way?

It's the latter. The index is expanded unnecessarily, but the results
are correct. The other problem is that there are no sparse directories
in an expanded index, so we cannot test how the grep_tree() approach
(introduced in the third patch) works within submodules.


* Re: [PATCH v5 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse
  2022-09-08 20:56         ` Junio C Hamano
  2022-09-08 21:06           ` Shaoxuan Yuan
@ 2022-09-09 12:49           ` Derrick Stolee
  1 sibling, 0 replies; 69+ messages in thread
From: Derrick Stolee @ 2022-09-09 12:49 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Shaoxuan Yuan, vdye, git

On 9/8/2022 4:56 PM, Junio C Hamano wrote:
> Derrick Stolee <derrickstolee@github.com> writes:
> 
>> HOWEVER: it "doesn't matter" because the sparse index doesn't work
>> at all within a submodule. Specifically, if a super-repo does not
>> enable sparse-checkout, but the submodule _does_, then we don't
>> know how Git will behave currently. His reasonings go on to explain
>> why the situation is fraught:
>>
>> * command_requires_full_index is set in a builtin only for the
>>   top-level project, so when we traverse into a submodule, we don't
>>   re-check if the current builtin has integrated with sparse index
>>   and expand a sparse index to a full one.
> 
> Correct.  
> 
> Is it sufficient to propagate the bit from the_repository->settings
> to repo->settings of the submodule, or is there more things needed
> to fix it?

Likely that would suffice, but before we do that, we need to add a
lot of tests to be sure our previous sparse index integrations do
the right thing when within submodules.
 
>> * core_apply_sparse_checkout is a global not even associated with
>>   a repository struct. What happens when a super project is not
>>   sparse but a submodule is? Or vice-versa? I honestly don't know,
>>   and it will require testing to find out.
> 
> Naïvely, I would think that we should just treat a non-sparse case
> as a mere specialization where the sparse cone covers everything,
> but there may be pitfalls.

I worry about how this works if the super-project and the submodule
differ in the core.sparseCheckout config, but both have sparse-checkout
files. Will one or the other cause the sparse-checkout patterns to be
enabled despite the repo-local config? I honestly have no idea, and I
don't think we have tests that protect this scenario. That's the
direction in which I would start this investigation.

Thanks,
-Stolee


* Re: [PATCH v5 1/3] builtin/grep.c: add --sparse option
  2022-09-08  0:18   ` [PATCH v5 1/3] builtin/grep.c: add --sparse option Shaoxuan Yuan
@ 2022-09-10  1:07     ` Victoria Dye
  2022-09-14  6:08     ` Elijah Newren
  1 sibling, 0 replies; 69+ messages in thread
From: Victoria Dye @ 2022-09-10  1:07 UTC (permalink / raw)
  To: Shaoxuan Yuan; +Cc: derrickstolee, git, gitster

Shaoxuan Yuan wrote:
> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> index eb59564565..a9879cc980 100755
> --- a/t/t7817-grep-sparse-checkout.sh
> +++ b/t/t7817-grep-sparse-checkout.sh
> @@ -118,13 +118,19 @@ test_expect_success 'grep searches unmerged file despite not matching sparsity p
>  	test_cmp expect actual
>  '
>  
> -test_expect_success 'grep --cached searches entries with the SKIP_WORKTREE bit' '
> +test_expect_success 'grep --cached and --sparse searches entries with the SKIP_WORKTREE bit' '
> +	cat >expect <<-EOF &&
> +	a:text
> +	EOF
> +	git grep --cached "text" >actual &&
> +	test_cmp expect actual &&
> +
>  	cat >expect <<-EOF &&
>  	a:text
>  	b:text
>  	dir/c:text
>  	EOF
> -	git grep --cached "text" >actual &&
> +	git grep --cached --sparse "text" >actual &&
>  	test_cmp expect actual
>  '

At first, seeing that all the test titles were changed from "grep --cached
<does something>" to "grep --cached and --sparse <does something>", I was
going to suggest that 'git grep --cached' (without '--sparse') should
receive some new tests in addition to updating existing ones (which now
require '--sparse' to work as before).

However, looking at the actual content of the tests like the one above, I
can see that you've added cases demonstrating the expected difference in
behavior between 'grep --cached' and 'grep --cached --sparse'. I can't think
of a clearer way to name the tests, though, so this looks okay to me.
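
The difference between the two invocations comes down to the skip
condition the patch puts in grep_cache().  A minimal sketch of that
condition as a stand-alone helper (should_skip_entry() is a hypothetical
name; plain ints stand in for the option variables and for the result of
ce_skip_worktree(ce)):

```c
#include <assert.h>

/*
 * Hypothetical stand-alone model of the check in grep_cache().
 * Returns 1 when an index entry should be skipped.
 * skip_worktree stands in for ce_skip_worktree(ce).
 */
int should_skip_entry(int grep_sparse, int cached, int skip_worktree)
{
	assert(skip_worktree == 0 || skip_worktree == 1);

	/*
	 * Same condition as the patch: skip SKIP_WORKTREE entries
	 * unless both --sparse and --cached are given.
	 */
	return !(grep_sparse && cached) && skip_worktree;
}
```

In this model, '--cached' alone skips an entry carrying the
SKIP_WORKTREE bit, '--cached --sparse' searches it, and '--sparse'
without '--cached' changes nothing.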

The rest of the patch (namely, the implementation of '--sparse' and
corresponding documentation) looked good as well - I didn't have anything
specific to note on that.


* Re: [PATCH v5 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse
  2022-09-08  0:18   ` [PATCH v5 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse Shaoxuan Yuan
  2022-09-08 17:59     ` Junio C Hamano
@ 2022-09-10  2:04     ` Victoria Dye
  1 sibling, 0 replies; 69+ messages in thread
From: Victoria Dye @ 2022-09-10  2:04 UTC (permalink / raw)
  To: Shaoxuan Yuan; +Cc: derrickstolee, git, gitster

Shaoxuan Yuan wrote:
> +
> +	/*
> +	 * NEEDSWORK: when reading a submodule, the sparsity settings in the
> +	 * superproject are incorrectly forgotten or misused. For example:
> +	 *
> +	 * 1. "command_requires_full_index"
> +	 * 	When this setting is turned on for `grep`, only the superproject
> +	 *	knows it. All the submodules are read with their own configs
> +	 *	and get prepare_repo_settings()'d. Therefore, these submodules
> +	 *	"forget" the sparse-index feature switch. As a result, the index
> +	 *	of these submodules are expanded unexpectedly.
> +	 *
> +	 * 2. "core_apply_sparse_checkout"
> +	 *	When running `grep` in the superproject, this setting is
> +	 *	populated using the superproject's configs. However, once
> +	 *	initialized, this config is globally accessible and is read by
> +	 *	prepare_repo_settings() for the submodules. For instance, if a
> +	 *	submodule is using a sparse-checkout, however, the superproject
> +	 *	is not, the result is that the config from the superproject will
> +	 *	dictate the behavior for the submodule, making it "forget" its
> +	 *	sparse-checkout state.
> +	 *
> +	 * 3. "core_sparse_checkout_cone"
> +	 *	ditto.

These are interesting observations, thank you for describing the behavior in
detail.

- #1 might seem like an easy fix - since 'command_requires_full_index' is
  tied to the command (not properties of the repo), the logical thing to do
  would be to propagate the value from the superproject to the subproject.
  However, that fix will undoubtedly expose lots of places where we're not
  handling the sparse index correctly in submodules. Since this isn't a
  problem introduced by your patch series, I'm content leaving this for a
  later series.
- #2 is an odd situation, but I'm guessing that the effect here will be
  minimal (since, regardless of the 'core_*' sparse-checkout globals,
  'SKIP_WORKTREE' will still be applied to - and respected on - entries in
  the index). It's more worrisome for commands that recurse submodules and
  *write* the index (e.g., 'git read-tree'), but that's also outside the
  scope of this series.
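
To make #1 concrete, here is a toy model of the "forgetting" and of the
propagation fix.  It is a sketch under stated assumptions, not Git's
code: the toy_* structs and functions are hypothetical, and the default
of requiring a full index is assumed from the behavior described in the
NEEDSWORK comment.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; these are NOT Git's real structs. */
struct toy_settings {
	int command_requires_full_index;
};

struct toy_repo {
	struct toy_settings settings;
};

/* Assumed default: a freshly prepared repository requires a full index. */
void toy_prepare_settings(struct toy_repo *r)
{
	assert(r != NULL);
	r->settings.command_requires_full_index = 1;
}

/*
 * Prepare a submodule.  Without propagation it keeps the default and so
 * "forgets" that the running command integrated with the sparse index.
 * With propagation it copies the superproject's value.
 */
void toy_prepare_submodule(const struct toy_repo *super,
			   struct toy_repo *sub, int propagate)
{
	assert(super != NULL);
	toy_prepare_settings(sub);
	if (propagate)
		sub->settings.command_requires_full_index =
			super->settings.command_requires_full_index;
}
```

In this model the bit belongs to the command rather than to either
repository, which is why copying it from the superproject is the
natural fix.  It says nothing about whether the code that then runs on
the submodule's sparse index is ready for it.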

Given this information, I think your approach is (for the time being) a safe
one. Beyond the submodule issues, I'm happy with the rest of your
'grep_tree()' updates.

> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> index 63becc3138..fda05faadf 100755
> --- a/t/t1092-sparse-checkout-compatibility.sh
> +++ b/t/t1092-sparse-checkout-compatibility.sh
> @@ -1987,7 +2000,48 @@ test_expect_success 'grep is not expanded' '
>  
>  	# All files within the folder1/* pathspec are sparse,
>  	# so this command does not find any matches
> -	ensure_not_expanded ! grep a -- folder1/*
> +	ensure_not_expanded ! grep a -- folder1/* &&
> +
> +	# test out-of-cone pathspec with or without wildcard
> +	ensure_not_expanded grep --sparse --cached a -- "folder1/a" &&
> +	ensure_not_expanded grep --sparse --cached a -- "folder1/*" &&
> +
> +	# test in-cone pathspec with or without wildcard
> +	ensure_not_expanded grep --sparse --cached a -- "deep/a" &&
> +	ensure_not_expanded grep --sparse --cached a -- "deep/*"

Thanks for the new tests (re: [1])! 

[1] https://lore.kernel.org/git/4b65d7dc-e711-43a6-8763-62be79a3e4a9@github.com/

> +'
> +
> +# NEEDSWORK: when running `grep` in the superproject with --recurse-submodules,
> +# Git expands the index of the submodules unexpectedly. Even though `grep`
> +# builtin is marked as "command_requires_full_index = 0", this config is only
> +# useful for the superproject. Namely, the submodules have their own configs,
> +# which are _not_ populated by the one-time sparse-index feature switch.
> +test_expect_failure 'grep within submodules is not expanded' '
> +	init_repos_as_submodules &&
> +
> +	# do not use ensure_not_expanded() here, becasue `grep` should be
> +	# run in the superproject, not in "./sparse-index"
> +	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
> +	git grep --sparse --cached --recurse-submodules a -- "*/folder1/*" &&
> +	test_region ! index ensure_full_index trace2.txt
> +'

So this test is *only* demonstrating that the submodules' indexes are
expanded (incorrectly, hence the 'test_expect_failure'); it doesn't show
that 'git grep' returns the correct results...

> +
> +# NEEDSWORK: this test is not actually testing the code. The design purpose
> +# of this test is to verify the grep result when the submodules are using a
> +# sparse-index. Namely, we want "folder1/" as a tree (a sparse directory); but
> +# because of the index expansion, we are now grepping the "folder1/a" blob.
> +# Because of the problem stated above 'grep within submodules is not expanded',
> +# we don't have the ideal test environment yet.
> +test_expect_success 'grep sparse directory within submodules' '
> +	init_repos_as_submodules &&
> +
> +	cat >expect <<-\EOF &&
> +	full-checkout/folder1/a:a
> +	sparse-checkout/folder1/a:a
> +	sparse-index/folder1/a:a
> +	EOF
> +	git grep --sparse --cached --recurse-submodules a -- "*/folder1/*" >actual &&
> +	test_cmp actual expect
>  '

...but this test *does* show that those results are correct. I think it was
a good decision to keep the two separate, since only the index expansion
behavior is wrong (thus warranting the 'test_expect_failure'). The output of
'git grep' is still what we want it to be, so it gets a
'test_expect_success'.

>  
>  test_done



* Re: [PATCH v5 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse
  2022-09-08 20:46       ` Derrick Stolee
  2022-09-08 20:56         ` Junio C Hamano
@ 2022-09-13 17:23         ` Junio C Hamano
  1 sibling, 0 replies; 69+ messages in thread
From: Junio C Hamano @ 2022-09-13 17:23 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Shaoxuan Yuan, vdye, git

Derrick Stolee <derrickstolee@github.com> writes:

> On 9/8/2022 1:59 PM, Junio C Hamano wrote:
>> Shaoxuan Yuan <shaoxuan.yuan02@gmail.com> writes:
>> 
>>> +
>>> +	/*
>>> +	 * NEEDSWORK: when reading a submodule, the sparsity settings in the
>>> +	 * superproject are incorrectly forgotten or misused. For example:
>>> +	 *
>>> +	 * 1. "command_requires_full_index"
>>> +	 * 	When this setting is turned on for `grep`, only the superproject
>>> +	 *	knows it. All the submodules are read with their own configs
>>> +	 *	and get prepare_repo_settings()'d. Therefore, these submodules
>>> +	 *	"forget" the sparse-index feature switch. As a result, the index
>>> +	 *	of these submodules are expanded unexpectedly.
>>  ...
> I think this comment is misplaced. It should either be contained in
> the commit message or placed closer to this diff hunk:

OK, so given what you wrote below, except for such a minor
shuffling, the current series is ready to go?

Thanks.

> ...
> Shaoxuan's comment is attempting to list the reasons why submodules
> do not currently work with sparse-index, and specifically that we
> can add tests that _should_ exercise this code in a meaningful way,
> but because of the current limitations of the codebase, the code
> isn't actually exercised in that scenario.
>
> In order to actually create a test that demonstrates how submodules
> and sparse-checkout work with this logic, we need to do some serious
> refactoring of the sparse-checkout logic to care about the repository
> struct, along with some other concerns specifically around the sparse
> index. This doesn't seem appropriate for the GSoC timeline or even for
> just this topic.



* Re: [PATCH v5 1/3] builtin/grep.c: add --sparse option
  2022-09-08  0:18   ` [PATCH v5 1/3] builtin/grep.c: add --sparse option Shaoxuan Yuan
  2022-09-10  1:07     ` Victoria Dye
@ 2022-09-14  6:08     ` Elijah Newren
  2022-09-15  2:57       ` Junio C Hamano
                         ` (2 more replies)
  1 sibling, 3 replies; 69+ messages in thread
From: Elijah Newren @ 2022-09-14  6:08 UTC (permalink / raw)
  To: Shaoxuan Yuan
  Cc: Derrick Stolee, Victoria Dye, Git Mailing List, Junio C Hamano

Hi Shaoxuan,

Please note that it's customary to cc folks who have commented on
previous versions of your patch series when you re-roll.

On Wed, Sep 7, 2022 at 5:28 PM Shaoxuan Yuan <shaoxuan.yuan02@gmail.com> wrote:
>
> Add a --sparse option to `git-grep`.

It's awesome you're working on this.  Adding more of "behavior A"
(restricting querying commands to the sparse cone) is something I've
wanted for a long time.

I think most of your code is beneficial, but I do have some issues
with the high-level direction you are taking, which may require
some tweaks...

> When the '--cached' option is used with the 'git grep' command, the
> search is limited to the blobs found in the index, not in the worktree.
> If the user has enabled sparse-checkout, this might present more results
> than they would like, since the files outside of the sparse-checkout are
> unlikely to be important to them.

"files outside of the sparse-checkout are unlikely to be important to
[users]" is certainly an issue.  But it's *much* wider than this.
Beyond `grep --cached`, it also affects `grep REVISION`, `log`, `diff
[REVISION]`, and related things...perhaps even something like `blame`.
I think all those other commands probably deserve a mode where they
restrict output to the view associated with the user's cone.  I've
brought that up before[1].  I was skeptical of making it the default,
because it'd probably take a long time to implement it everywhere.
Slowly changing defaults of all commands over many git releases seems
like a poor strategy, but I'm afraid that's what it looks like we are
doing here.

I'm also worried that slowly changing the defaults without a
high-level plan will lead to users struggling to figure out what
flag(s) to pass.  Are we going to be stuck in a situation where users
have to remember that for a dense search, they use one flag for `grep
--cached`, a different one for  `grep [REVISION]`, no flag is needed
for `diff [REVISION]`, but yet a different flag is needed for `git
log`?

I'm also curious whether there shouldn't be a config option for
something like this, so folks don't have to specify it with every
invocation.  In particular, while I certainly have users that want to
just query git for information about the part of the history they are
interested in, there are other users who are fully aware they are
working in a bigger repository and want to search for additional
things to add to their sparse-checkout and predominantly use grep for
things like that.  They have even documented that `git grep --cached
<TERM>` can be used in sparse-checkouts for this purpose...and have
been using that for a few years.  (I did warn them at the time that
there was a risk they'd have to change their command, but it's still
going to be a behavioral change they might not expect.)  Further, when
I brought up changing the behavior of commands during sparse-checkouts
to limit to files matching the sparsity paths in that old thread at
[1], Stolee was a bit skeptical of making that the default.  That
suggests, at least, that two independent groups of users would want to
use the non-sparse searching frequently, and frequently enough that
they'd appreciate a config option.

I also brought up in that old thread that perhaps we want to avoid
adding a flag to every subcommand, and instead just having a
git-global flag for triggering this type of behavior.  (e.g. `git
--no-restrict grep --cached ...` or `git --dense grep --cached ...`).

[1] https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/
and the responses to that email.

> Change the default behavior of 'git grep' to focus on the files within
> the sparse-checkout definition. To enable the previous behavior, add a
> '--sparse' option to 'git grep' that triggers the old behavior that
> inspects paths outside of the sparse-checkout definition when paired
> with the '--cached' option.

I still think the flag name of `--sparse` is totally backwards and
highly confusing for the described behavior.  I missed Stolee's email
at the time (wasn't cc'ed) where he brought up that "--sparse" had
already been added to "git-add" and "git-rm", but in those cases the
commands aren't querying and I just don't see how they lead to the
same level of user confusion.  This one seems glaringly wrong to me
and both Junio and I flagged it on v1 when we first saw it.  (Perhaps
it also helps that, in the add/rm cases, a user is often given an
error message with the suggested flag to use, which just doesn't make
sense here either.)  If there is concern that this flag should be the
same as add and rm, then I think we need to do the backward
compatibility dance and fix add and rm by adding an alias over there
so that grep's flag won't be so confusing.

I really don't want to have to deal with the backward compatibility
headache of telling users: "'git grep --sparse' means do a non-sparse
search, for backward compatibility reasons.  Here's the flag you should
really use..."

> Suggested-by: Victoria Dye <vdye@github.com>
> Helped-by: Derrick Stolee <derrickstolee@github.com>
> Helped-by: Victoria Dye <vdye@github.com>
> Signed-off-by: Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>
> ---
>  Documentation/git-grep.txt      |  5 ++++-
>  builtin/grep.c                  | 10 +++++++++-
>  t/t7817-grep-sparse-checkout.sh | 34 +++++++++++++++++++++++++++------
>  3 files changed, 41 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
> index 58d944bd57..bdd3d5b8a6 100644
> --- a/Documentation/git-grep.txt
> +++ b/Documentation/git-grep.txt
> @@ -28,7 +28,7 @@ SYNOPSIS
>            [-f <file>] [-e] <pattern>
>            [--and|--or|--not|(|)|-e <pattern>...]
>            [--recurse-submodules] [--parent-basename <basename>]
> -          [ [--[no-]exclude-standard] [--cached | --no-index | --untracked] | <tree>...]
> +          [ [--[no-]exclude-standard] [--cached [--sparse] | --no-index | --untracked] | <tree>...]
>            [--] [<pathspec>...]
>
>  DESCRIPTION
> @@ -45,6 +45,9 @@ OPTIONS
>         Instead of searching tracked files in the working tree, search
>         blobs registered in the index file.
>
> +--sparse::
> +       Use with --cached. Search outside of sparse-checkout definition.
> +
>  --no-index::
>         Search files in the current directory that is not managed by Git.
>
> diff --git a/builtin/grep.c b/builtin/grep.c
> index e6bcdf860c..12abd832fa 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -96,6 +96,8 @@ static pthread_cond_t cond_result;
>
>  static int skip_first_line;
>
> +static int grep_sparse = 0;
> +
>  static void add_work(struct grep_opt *opt, struct grep_source *gs)
>  {
>         if (opt->binary != GREP_BINARY_TEXT)
> @@ -525,7 +527,11 @@ static int grep_cache(struct grep_opt *opt,
>         for (nr = 0; nr < repo->index->cache_nr; nr++) {
>                 const struct cache_entry *ce = repo->index->cache[nr];
>
> -               if (!cached && ce_skip_worktree(ce))
> +               /*
> +                * Skip entries with SKIP_WORKTREE unless both --sparse and
> +                * --cached are given.
> +                */
> +               if (!(grep_sparse && cached) && ce_skip_worktree(ce))
>                         continue;
>
>                 strbuf_setlen(&name, name_base_len);
> @@ -963,6 +969,8 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
>                            PARSE_OPT_NOCOMPLETE),
>                 OPT_INTEGER('m', "max-count", &opt.max_count,
>                         N_("maximum number of results per file")),
> +               OPT_BOOL(0, "sparse", &grep_sparse,
> +                        N_("search the contents of files outside the sparse-checkout definition")),
>                 OPT_END()
>         };
>         grep_prefix = prefix;
> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> index eb59564565..a9879cc980 100755
> --- a/t/t7817-grep-sparse-checkout.sh
> +++ b/t/t7817-grep-sparse-checkout.sh
> @@ -118,13 +118,19 @@ test_expect_success 'grep searches unmerged file despite not matching sparsity p
>         test_cmp expect actual
>  '
>
> -test_expect_success 'grep --cached searches entries with the SKIP_WORKTREE bit' '
> +test_expect_success 'grep --cached and --sparse searches entries with the SKIP_WORKTREE bit' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       EOF
> +       git grep --cached "text" >actual &&
> +       test_cmp expect actual &&
> +
>         cat >expect <<-EOF &&
>         a:text
>         b:text
>         dir/c:text
>         EOF
> -       git grep --cached "text" >actual &&
> +       git grep --cached --sparse "text" >actual &&
>         test_cmp expect actual
>  '
>
> @@ -143,7 +149,15 @@ test_expect_success 'grep --recurse-submodules honors sparse checkout in submodu
>         test_cmp expect actual
>  '
>
> -test_expect_success 'grep --recurse-submodules --cached searches entries with the SKIP_WORKTREE bit' '
> +test_expect_success 'grep --recurse-submodules --cached and --sparse searches entries with the SKIP_WORKTREE bit' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       sub/B/b:text
> +       sub2/a:text
> +       EOF
> +       git grep --recurse-submodules --cached "text" >actual &&
> +       test_cmp expect actual &&
> +
>         cat >expect <<-EOF &&
>         a:text
>         b:text
> @@ -152,7 +166,7 @@ test_expect_success 'grep --recurse-submodules --cached searches entries with th
>         sub/B/b:text
>         sub2/a:text
>         EOF
> -       git grep --recurse-submodules --cached "text" >actual &&
> +       git grep --recurse-submodules --cached --sparse "text" >actual &&
>         test_cmp expect actual
>  '
>
> @@ -166,7 +180,15 @@ test_expect_success 'working tree grep does not search the index with CE_VALID a
>         test_cmp expect actual
>  '
>
> -test_expect_success 'grep --cached searches index entries with both CE_VALID and SKIP_WORKTREE' '
> +test_expect_success 'grep --cached and --sparse searches index entries with both CE_VALID and SKIP_WORKTREE' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       EOF
> +       test_when_finished "git update-index --no-assume-unchanged b" &&
> +       git update-index --assume-unchanged b &&
> +       git grep --cached text >actual &&
> +       test_cmp expect actual &&
> +
>         cat >expect <<-EOF &&
>         a:text
>         b:text
> @@ -174,7 +196,7 @@ test_expect_success 'grep --cached searches index entries with both CE_VALID and
>         EOF
>         test_when_finished "git update-index --no-assume-unchanged b" &&
>         git update-index --assume-unchanged b &&
> -       git grep --cached text >actual &&
> +       git grep --cached --sparse text >actual &&
>         test_cmp expect actual
>  '
>
> --
> 2.37.0

I read over this patch and the other two patches.  Other than things
like variable names propagating the sparse/dense confusion, and the
high level goals already discussed, I didn't spot any other issues.


* Re: [PATCH v5 1/3] builtin/grep.c: add --sparse option
  2022-09-14  6:08     ` Elijah Newren
@ 2022-09-15  2:57       ` Junio C Hamano
  2022-09-18  2:14         ` Elijah Newren
  2022-09-17  3:34       ` Shaoxuan Yuan
  2022-09-17  3:45       ` Shaoxuan Yuan
  2 siblings, 1 reply; 69+ messages in thread
From: Junio C Hamano @ 2022-09-15  2:57 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Shaoxuan Yuan, Derrick Stolee, Victoria Dye, Git Mailing List

Elijah Newren <newren@gmail.com> writes:

> ... I think all those other commands probably deserve a mode where they
> restrict output to the view associated with the user's cone.  I've
> brought that up before[1].  I was skeptical of making it the default,
> because it'd probably take a long time to implement it everywhere.
> Slowly changing defaults of all commands over many git releases seems
> like a poor strategy, but I'm afraid that's what it looks like we are
> doing here.
>
> I'm also worried that slowly changing the defaults without a
> high-level plan will lead to users struggling to figure out what
> flag(s) to pass.  Are we going to be stuck in a situation where users
> have to remember that for a dense search, they use one flag for `grep
> --cached`, a different one for  `grep [REVISION]`, no flag is needed
> for `diff [REVISION]`, but yet a different flag is needed for `git
> log`?

In short, the default should be "everywhere in tree, regardless of
the current sparse-checkout settings", with commands opting into
implementing "limit only to sparse-checkout settings" as an option,
at least initially, with an eye to possibly flip the default later
when all commands support that position but not before?

I think that is a reasonable position to take.  I lean towards the
default of limiting the operations to inside sparse cone(s) for all
subcommands once all subcommands have learned to do so, but I also
agree that using that default only for the subcommands that have
learned it so far, a set that will grow over time, would be way too
confusing for our users.

By the way, I briefly wondered if "limit to sparse-checkout setting"
can be done by introducing a fake "attribute" and using the "attr"
pathspec magic, but it would probably be a bad match, and a separate
option would be more appropriate.

>> Change the default behavior of 'git grep' to focus on the files within
>> the sparse-checkout definition. To enable the previous behavior, add a
>> '--sparse' option to 'git grep' that triggers the old behavior that
>> inspects paths outside of the sparse-checkout definition when paired
>> with the '--cached' option.
>
> I still think the flag name of `--sparse` is totally backwards and
> highly confusing for the described behavior.

Yeah, regardless of which between "--sparse" and "--no-sparse"
should be the default, I am in 100% agreement that "--sparse"
meaning "affect things both inside and outside the sparse cones" is
totally backwards.

How strongly ingrained is this UI mistake?  I have a feeling that
this may be something we still can undo and redo relatively easily,
i.e. "--sparse" may be that "limit to sparse-checkout setting"
option, not "--no-sparse".


* Re: [PATCH v5 1/3] builtin/grep.c: add --sparse option
  2022-09-14  6:08     ` Elijah Newren
  2022-09-15  2:57       ` Junio C Hamano
@ 2022-09-17  3:34       ` Shaoxuan Yuan
  2022-09-18  4:24         ` Elijah Newren
  2022-09-17  3:45       ` Shaoxuan Yuan
  2 siblings, 1 reply; 69+ messages in thread
From: Shaoxuan Yuan @ 2022-09-17  3:34 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Derrick Stolee, Victoria Dye, Git Mailing List, Junio C Hamano

On 9/13/2022 11:08 PM, Elijah Newren wrote:
> Hi Shaoxuan,
> 
> Please note that it's customary to cc folks who have commented on
> previous versions of your patch series when you re-roll.

Hi Elijah,

Sorry for the delay, I didn't have my computer with me during Merge 2022
and couldn't respond.

I'm sorry that I somehow lost you along the way :(

> On Wed, Sep 7, 2022 at 5:28 PM Shaoxuan Yuan <shaoxuan.yuan02@gmail.com> wrote:
>>
>> Add a --sparse option to `git-grep`.
> 
> It's awesome you're working on this.  Adding more of "behavior A"
> (restricting querying commands to the sparse cone) is something I've
> wanted for a long time.

Thanks :)

> I think most of your code is beneficial, but I do have some issues
> with high level direction you were implementing, which may require
> some tweaks...

OK.

>> When the '--cached' option is used with the 'git grep' command, the
>> search is limited to the blobs found in the index, not in the worktree.
>> If the user has enabled sparse-checkout, this might present more results
>> than they would like, since the files outside of the sparse-checkout are
>> unlikely to be important to them.
> 
> "files outside of the sparse-checkout are unlikely to be important to
> [users]" is certainly an issue.  But it's *much* wider than this.
> Beyond `grep --cached`, it also affects `grep REVISION`, `log`, `diff
> [REVISION]`, and related things...perhaps even something like `blame`.

Agree. Keep reading...

> I think all those other commands probably deserve a mode where they
> restrict output to the view associated with the user's cone.  I've

Agree.

> brought that up before[1].  I was skeptical of making it the default,
> because it'd probably take a long time to implement it everywhere.
> Slowly changing defaults of all commands over many git releases seems
> like a poor strategy, but I'm afraid that's what it looks like we are
> doing here.

True.

> I'm also worried that slowly changing the defaults without a
> high-level plan will lead to users struggling to figure out what
> flag(s) to pass.  Are we going to be stuck in a situation where users
> have to remember that for a dense search, they use one flag for `grep
> --cached`, a different one for  `grep [REVISION]`, no flag is needed
> for `diff [REVISION]`, but yet a different flag is needed for `git
> log`?

I think the inconsistency is certainly unsettling.

> I'm also curious whether there shouldn't be a config option for
> something like this, so folks don't have to specify it with every
> invocation.  In particular, while I certainly have users that want to
> just query git for information about the part of the history they are
> interested in, there are other users who are fully aware they are
> working in a bigger repository and want to search for additional
> things to add to their sparse-checkout and predominantly use grep for
> things like that.  They have even documented that `git grep --cached
> <TERM>` can be used in sparse-checkouts for this purpose...and have
> been using that for a few years.  (I did warn them at the time that
> there was a risk they'd have to change their command, but it's still
> going to be a behavioral change they might not expect.)  Further, when
> I brought up changing the behavior of commands during sparse-checkouts
> to limit to files matching the sparsity paths in that old thread at
> [1], Stolee was a bit skeptical of making that the default.  That
> suggests, at least, that two independent groups of users would want to
> use the non-sparse searching frequently, and frequently enough that
> they'd appreciate a config option.

A config option sounds good. Though I think

1. If this option is for global behavior: users may be better off turning
off sparse-checkout if they want a config to do things densely everywhere.

2. If this option is for a single subcommand (e.g. 'grep'): I don't have
many thoughts here. It certainly can be nice for users who need to do
non-sparse searching frequently. This design, if necessary, should
belong to a patch where this config is added for every single subcommand?

> I also brought up in that old thread that perhaps we want to avoid
> adding a flag to every subcommand, and instead just having a
> git-global flag for triggering this type of behavior.  (e.g. `git
> --no-restrict grep --cached ...` or `git --dense grep --cached ...`).

This looks more like the answer to me. It gives users peace of mind if
they don't have to worry about whether a subcommand is sparse-aware, and
how their behaviors may differ. Though we still may need to update the
actual behavior in each subcommand over an extended period of time
(though may not be difficult?), which you mentioned above "seems like a
poor strategy".

> [1] https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/
> and the responses to that email
>
>> Change the default behavior of 'git grep' to focus on the files within
>> the sparse-checkout definition. To enable the previous behavior, add a
>> '--sparse' option to 'git grep' that triggers the old behavior that
>> inspects paths outside of the sparse-checkout definition when paired
>> with the '--cached' option.
> 
> I still think the flag name of `--sparse` is totally backwards and
> highly confusing for the described behavior.  I missed Stolee's email
> at the time (wasn't cc'ed) where he brought up that "--sparse" had
> already been added to "git-add" and "git-rm", but in those cases the
> commands aren't querying and I just don't see how they lead to the
> same level of user confusion.  This one seems glaringly wrong to me
> and both Junio and I flagged it on v1 when we first saw it.  (Perhaps
> it also helps that for the add/rm cases, that a user is often given an
> error message with the suggested flag to use, which just doesn't make
> sense here either.)  If there is concern that this flag should be the
> same as add and rm, then I think we need to do the backward
> compatibility dance and fix add and rm by adding an alias over there
> so that grep's flag won't be so confusing.

I guess I'm using "--sparse" here because "add", "rm" and "mv" all imply
that "when operating on a sparse path, ignores/warns unless '--sparse'
is used". I take it as an analogy so "when searching a sparse path,
ignores/warns unless '--sparse' is used". As the idea that "Git does
*not* care about sparse contents unless '--[no-]sparse' is specified" is sort
of established through the implementations in "add", "rm", or "mv", I
don't see a big problem using "--sparse" here.

I *think*, as long as the users are informed that the default is to
ignore things outside of the sparse-checkout definition, and they have
to do something (using "--sparse" or a potential better name) to
override the default, we are safe to use a name that is famous (i.e.
"--sparse") even though its literal meaning is not perfectly descriptive.

One outlier I do find confusing though, is the "--sparse" option from
"git-ls-files". Without it, Git expands the index and shows everything
outside of sparse-checkout definition, which seems a bit controversial.

...

> 
> I read over this patch and the other two patches.  Other than things
> like variable names propagating the sparse/dense confusion, and the
> high level goals already discussed, I didn't spot any other issues.

Thanks,
Shaoxuan


* Re: [PATCH v5 1/3] builtin/grep.c: add --sparse option
  2022-09-14  6:08     ` Elijah Newren
  2022-09-15  2:57       ` Junio C Hamano
  2022-09-17  3:34       ` Shaoxuan Yuan
@ 2022-09-17  3:45       ` Shaoxuan Yuan
  2 siblings, 0 replies; 69+ messages in thread
From: Shaoxuan Yuan @ 2022-09-17  3:45 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Derrick Stolee, Victoria Dye, Git Mailing List, Junio C Hamano

On 9/13/2022 11:08 PM, Elijah Newren wrote:

...

I think we are now at a point to make this UI decision, which may not be
easily (and should not be?) reverted once it's made in this patch.

So, is "--sparse" what we want for "grep", and even for "rm", "add", or "mv"?

Love to hear from other contributors :)

Thanks,
Shaoxuan


* Re: [PATCH v5 1/3] builtin/grep.c: add --sparse option
  2022-09-15  2:57       ` Junio C Hamano
@ 2022-09-18  2:14         ` Elijah Newren
  2022-09-18 19:52           ` Victoria Dye
  0 siblings, 1 reply; 69+ messages in thread
From: Elijah Newren @ 2022-09-18  2:14 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Shaoxuan Yuan, Derrick Stolee, Victoria Dye, Git Mailing List

On Wed, Sep 14, 2022 at 7:57 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > ... I think all those other commands probably deserve a mode where they
> > restrict output to the view associated with the user's cone.  I've
> > brought that up before[1].  I was skeptical of making it the default,
> > because it'd probably take a long time to implement it everywhere.
> > Slowly changing defaults of all commands over many git releases seems
> > like a poor strategy, but I'm afraid that's what it looks like we are
> > doing here.
> >
> > I'm also worried that slowly changing the defaults without a
> > high-level plan will lead to users struggling to figure out what
> > flag(s) to pass.  Are we going to be stuck in a situation where users
> > have to remember that for a dense search, they use one flag for `grep
> > --cached`, a different one for  `grep [REVISION]`, no flag is needed
> > for `diff [REVISION]`, but yet a different flag is needed for `git
> > log`?
>
> In short, the default should be "everywhere in tree, regardless of
> the current sparse-checkout settings", with commands opting into
> implementing "limit only to sparse-checkout settings" as an option,
> at least initially, with an eye to possibly flip the default later
> when all commands support that position but not before?
>
> I think that is a reasonable position to take.  I lean towards the
> default of limiting the operations to inside sparse cone(s) for all
> subcommands when all subcommands learn to be capable to do so, but I
> also agree that using that default only for subcommands that
> have learned to do, which will happen over time, would be way too
> confusing for our users.
>
> By the way, I briefly wondered if "limit to sparse-checkout setting"
> can be done by introducing a fake "attribute" and using the "attr"
> pathspec magic, but it may probably be a bad match, and separate
> option would be more appropriate.
>
> >> Change the default behavior of 'git grep' to focus on the files within
> >> the sparse-checkout definition. To enable the previous behavior, add a
> >> '--sparse' option to 'git grep' that triggers the old behavior that
> >> inspects paths outside of the sparse-checkout definition when paired
> >> with the '--cached' option.
> >
> > I still think the flag name of `--sparse` is totally backwards and
> > highly confusing for the described behavior.
>
> Yeah, regardless of which between "--sparse" and "--no-sparse"
> should be the default, I am in 100% agreement that "--sparse"
> meaning "affect things both inside and outside the sparse cones" is
> totally backwards.
>
> How strongly ingrained is this UI mistake?  I have a feeling that
> this may be something we still can undo and redo relatively easily,
> i.e. "--sparse" may be that "limit to sparse-checkout setting"
> option, not "--no-sparse".

It's gotten into a few commands, but I agree it seems like something
we can still undo.

In fact, not all uses of `--sparse` are backwards; two commands (clone
& ls-files) use `--sparse` to mean limit to sparsity specification.
There are three commands that use `--sparse` in a potentially
confusing or backwards way, though one is new to this cycle and isn't
even documented.  In more detail...

== clone --sparse ==

For clone, `--sparse` definitely means limit to the sparsity patterns.
That's the meaning we want.

== ls-files --sparse ==

For ls-files, the meaning of `--sparse` is "do not recurse into sparse
directory entries in order to print the traditional ls-files output,
just print the sparse directory entry".  So, I'd say that also has the
meaning we want; it's for restricting rather than expanding.

This one is also interesting in that it is the only command in the
list about querying for information rather than modifying the
worktree/index, and is thus the best precedent for grep.

If grep behaved similarly to ls-files, it would suggest that
Shaoxuan's series should default to searching the whole index (the
opposite of what his current series does) and that --sparse would be
used to restrict to the sparsity patterns (also the opposite of the
meaning for his flag).

== add --sparse ==

For add, `--sparse` affects the behavior of untracked files.  Its
usage allows untracked files to be added to the index despite the file
normally being outside the sparsity patterns.  There are two ways for
users to view this:
  * The file added is now tracked, and is present (or "checked-out").
Thus, the new file is part of the user's "sparse checkout" now.
Perhaps the flag makes sense viewed from this light?  (I had actually
looked at it this way previously).
  * We used the `--sparse` flag to allow git-add to operate on
something outside of the normal sparsity patterns.  The flag is
backwards.

It might be worth noting that the reason this flag was added was that
users are likely to be surprised later when some other command runs
and causes the file to vanish when they update the working tree to
match the sparsity patterns.

== rm --sparse ==

For rm, `--sparse` allows files to be removed from the index despite
normally being outside the sparsity patterns.  There's also a couple
ways to view this:
  * Any file being removed is not going to be part of the sparse
checkout anymore.  Thus there is no meaning to `--sparse`, but git-add
used it as a safety check to avoid surprises by operating outside the
normal patterns so perhaps we re-use that?
  * We used the `--sparse` flag to allow rm to operate on something
outside the normal sparsity patterns.  The flag is backwards.

Much like add, it might be worth noting that this flag was added for
cases like `git rm '*.jpg'` -- users probably only want such
expressions to operate on their sparse-checkout and they could be
negatively surprised by also removing stuff elsewhere.

== mv --sparse ==

For mv, `--sparse` feels like it's stretching the logic used for
`git-add` and isn't so clear that it could make sense anymore.  The
connection might be that when it moves files outside the sparsity
specification, it actually leaves them materialized, so in that sense
you could argue the files are still part of the sparse checkout, but
I'd say we're stretching that a bit.

However, the `mv` changes were made earlier this same cycle and aren't
part of a release yet.  It doesn't feel like this should be setting a
precedent for how grep should behave.  Especially since it's a
modification command, and grep is a querying command; ls-files seems
like a better precedent.

Also, the `--sparse` flag was not documented for mv for whatever reason.

== Overall ==

For existing querying commands (just ls-files), `--sparse` already
means restrict to the sparse cone.  If we keep using the existing flag
names, grep should follow suit.

For existing modification commands already released (add, rm), the
fact that the command is modifying actually gives a different way to
interpret things such that it's not clear `--sparse` was even a
problem.  However, perhaps the name of the flag is bad just because
there are multiple ways to view it and those who view it one way will
see it as counter-intuitive.

== Flag rename? ==

There's another reason to potentially rename the flag.  We already
have `--sparse` and `--dense` flags for rev-list and friends.  So,
when we want to enable those other commands to restrict to the
sparsity patterns, we probably need a different name.  So, perhaps, we
should rename our `--sparse/--dense` to `--restrict/--no-restrict`.
Such a rename would also likely clear up the ambiguity about which way
to interpret the command for the add & rm commands (though it'd pick
the second one and suggest we were using the wrong name after all).

(There are also two other commands that use `--sparse` -- pack-objects
and show-branch, though in a much different way and neither would ever
be affected by our new --sparse/--dense/--restrict/--no-restrict
flags.)

Other names are also possible.  Any suggestions?

== global flag vs subcommand flags ==

Do we want to make --[no-]restrict a flag for each subcommand, or just
make it a global git flag?  I kind of think it'd make sense to do the
latter.

== Defaults ==

As discussed before, we probably want querying commands (ls-files,
grep, log, etc.) to default to --no-restrict for now, since we are
otherwise slowly changing the defaults.  We may want to swap that
default in the future.

However, for modification commands, I think we want the default to be
--restrict, regardless of the default for querying commands.  There
are some potentially very negative surprises for users if we don't,
and those surprises will be delayed rather than occur at the time the
user runs the command.  In fact, those negative surprises are likely
why those commands were the first to gain an option controlling
whether they operated on paths outside the sparsity specification.
(Also, the modification commands print a warning if they could have
affected other files but didn't due to the default of restricting, so
I think we have their default correct, even if the flag name is
suboptimal.)


* Re: [PATCH v5 1/3] builtin/grep.c: add --sparse option
  2022-09-17  3:34       ` Shaoxuan Yuan
@ 2022-09-18  4:24         ` Elijah Newren
  2022-09-19  4:13           ` Shaoxuan Yuan
  0 siblings, 1 reply; 69+ messages in thread
From: Elijah Newren @ 2022-09-18  4:24 UTC (permalink / raw)
  To: Shaoxuan Yuan
  Cc: Derrick Stolee, Victoria Dye, Git Mailing List, Junio C Hamano

On Fri, Sep 16, 2022 at 8:34 PM Shaoxuan Yuan <shaoxuan.yuan02@gmail.com> wrote:
>
> > I'm also curious whether there shouldn't be a config option for
> > something like this, so folks don't have to specify it with every
> > invocation.  In particular, while I certainly have users that want to
> > just query git for information about the part of the history they are
> > interested in, there are other users who are fully aware they are
> > working in a bigger repository and want to search for additional
> > things to add to their sparse-checkout and predominantly use grep for
> > things like that.  They have even documented that `git grep --cached
> > <TERM>` can be used in sparse-checkouts for this purpose...and have
> > been using that for a few years.  (I did warn them at the time that
> > there was a risk they'd have to change their command, but it's still
> > going to be a behavioral change they might not expect.)  Further, when
> > I brought up changing the behavior of commands during sparse-checkouts
> > to limit to files matching the sparsity paths in that old thread at
> > [1], Stolee was a bit skeptical of making that the default.  That
> > suggests, at least, that two independent groups of users would want to
> > use the non-sparse searching frequently, and frequently enough that
> > they'd appreciate a config option.
>
> A config option sounds good. Though I think
>
> 1. If this option is for global behavior: users may be better off turning
> off sparse-checkout if they want a config to do things densely everywhere.

Sorry, it sounds like I haven't explained the usecases to you very
well.  Let me try again.

There are people who want to do everything densely, as you say, and
those folks can just turn off sparse-checkout or not use it in the
first place.  Git has traditionally catered to these folks just fine.
However, it's not a subset of interest for this discussion and wasn't
what I was talking about.

There are (at least) two different usecases for people wanting to use
sparse-checkouts; I have users that fall under each category:


1) Working on a repository subset; users are _only_ interested in that subset.

This usecase is very poorly supported in Git right now, but I think
you understand it so I'll only briefly describe it.

These folks might know there are other things in the repository, but
don't care.  Not only should the working tree be sparse, but grep,
log, diff, etc. should be restricted to the subset of the tree they
are interested in.

Restricting operations to the sparsity specification is also important
for marrying partial clones with sparse checkouts while allowing
disconnected development.  Without such a restrict-to-sparsity-paths
feature, the partial clones will attempt to download objects the first
time they try to grep an old revision, or do log with a glob path.
The download will fail, causing the operation to fail, and break the
ability of the user to work in a disconnected manner.


2) The working directory is sparse, but users are working in a larger whole.

Stolee described this usecase this way[2]:

"I'm also focused on users that know that they are a part of a larger
whole. They know they are operating on a large repository but focus on
what they need to contribute their part. I expect multiple "roles" to
use very different, almost disjoint parts of the codebase. Some other
"architect" users operate across the entire tree or hop between different
sections of the codebase as necessary. In this situation, I'm wary of
scoping too many features to the sparse-checkout definition, especially
"git log," as it can be too confusing to have their view of the codebase
depend on your "point of view.""

[2] https://lore.kernel.org/git/1a1e33f6-3514-9afc-0a28-5a6b85bd8014@gmail.com/

I describe it very similarly, but I'd like to point out something
additional around this usecase and how it can be influenced by
dependencies.  The first cut for sparse-checkouts is usually the
directories you are interested in plus what those directories depend
upon within your repository.  But there's a monkey wrench here: if you
have integration tests, they invert the hierarchy: to run integration
tests, you need not only what you are interested in and its
dependencies, you also need everything that depends upon what you are
interested in or that depends upon one of your dependencies...AND you
need all the dependencies of that expanded group.  That can easily
change your sparse-checkout into a nearly dense one.  Naturally, that
tends to kill the benefits of sparse-checkouts.  There are a couple
solutions to this conundrum: either avoid grabbing dependencies (maybe
have built versions of your dependencies pulled from a CI cache
somewhere), or say that users shouldn't run integration tests directly
and instead do it on the CI server when they submit a code review.  Or
do both.  Regardless of whether you stub out your dependencies or stub
out the things that depend upon you, there is certainly a reason to
want to query and be aware of those other parts of the repository.
Thus, sparse-checkouts can be used to limit what you directly build
and modify, but these users do not want it to limit their queries of
history.


Once users pick either the first or the second usecase, they often
stick within it.  For either group, regardless of what Git's default
is, needing to specify an additional flag for *every*
grep/log/diff/etc. they run would just be a total annoyance.  Neither
wants a dense worktree, but one side wants a dense history query while
the other wants a sparse one.  Different groups should be able to
configure the default that works well for them, much like we allow
users to configure whether they want "git pull" to rebase or merge.

> 2. If this option is for a single subcommand (e.g. 'grep'): I don't have
> many thoughts here. It certainly can be nice for users who need to do
> non-sparse searching frequently. This design, if necessary, should
> belong to a patch where this config is added for every single subcommand?
>
> > I also brought up in that old thread that perhaps we want to avoid
> > adding a flag to every subcommand, and instead just having a
> > git-global flag for triggering this type of behavior.  (e.g. `git
> > --no-restrict grep --cached ...` or `git --dense grep --cached ...`).
>
> This looks more like the answer to me. It gives users peace of mind if
> they don't have to worry about whether a subcommand is sparse-aware, and
> how their behaviors may differ. Though we still may need to update the
> actual behavior in each subcommand over an extended period of time
> (though may not be difficult?), which you mentioned above "seems like a
> poor strategy".
>
> > [1] https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/
> > and the responses to that email
> >
> >> Change the default behavior of 'git grep' to focus on the files within
> >> the sparse-checkout definition. To enable the previous behavior, add a
> >> '--sparse' option to 'git grep' that triggers the old behavior that
> >> inspects paths outside of the sparse-checkout definition when paired
> >> with the '--cached' option.
> >
> > I still think the flag name of `--sparse` is totally backwards and
> > highly confusing for the described behavior.  I missed Stolee's email
> > at the time (wasn't cc'ed) where he brought up that "--sparse" had
> > already been added to "git-add" and "git-rm", but in those cases the
> > commands aren't querying and I just don't see how they lead to the
> > same level of user confusion.  This one seems glaringly wrong to me
> > and both Junio and I flagged it on v1 when we first saw it.  (Perhaps
> > it also helps that for the add/rm cases, that a user is often given an
> > error message with the suggested flag to use, which just doesn't make
> > sense here either.)  If there is concern that this flag should be the
> > same as add and rm, then I think we need to do the backward
> > compatibility dance and fix add and rm by adding an alias over there
> > so that grep's flag won't be so confusing.
>
> I guess I'm using "--sparse" here because "add", "rm" and "mv" all imply
> that "when operating on a sparse path, ignores/warns unless '--sparse'
> is used". I take it as an analogy so "when searching a sparse path,
> ignores/warns unless '--sparse' is used". As the idea that "Git does
> *not* care about sparse contents unless '--[no-]sparse' is specified" is sort
> of established through the implementations in "add", "rm", or "mv", I
> don't see a big problem using "--sparse" here.

Well, I do.

In addition to just being utterly backwards and confusing in the
context of grep:
  * Both `clone` and `ls-files` use `--sparse` to mean to limit things
to the sparsity cone, so we're already kinda split-brained.
  * grep is more like ls-files (both being querying functions) than
add/rm/mv, so should really follow its lead instead of the one from
add/rm/mv.
  * There's another way to interpret `--sparse` for `add` and `rm`
such that it makes sense (at least to me); see my other email to Junio
in this thread.
  * `mv` is indeed using it backward, but the `mv` change is new to
this cycle (and undocumented) so I'm not sure it counts as much of a
precedent yet.

> I *think*, as long as the users are informed that the default is to
> ignore things outside of the sparse-checkout definition, and they have
> to do something (using "--sparse" or a potential better name) to
> override the default, we are safe to use a name that is famous (i.e.
> "--sparse") even though its literal meaning is not perfectly descriptive.
>
> One outlier I do find confusing though, is the "--sparse" option from
> "git-ls-files". Without it, Git expands the index and shows everything
> outside of sparse-checkout definition, which seems a bit controversial.

Nah, that perfectly matches the expectation of users in the second
usecase above -- querying (ls-files/grep/log/diff) defaults to
non-restricted history, modifying (add/rm/mv) defaults to restricted
paths but warns if the arguments could have matched something else,
and the working tree is restricted to sparse paths.  It doesn't seem
too controversial to me, even if it's not what we want for the
long-term default.

The defaults for the first usecase would be defaulting to restricted
paths for everything, and perhaps not warn if arguments to a modifying
command could have matched something else.


Anyway, hope that helps you understand my perspective and framing.


* Re: [PATCH v5 1/3] builtin/grep.c: add --sparse option
  2022-09-18  2:14         ` Elijah Newren
@ 2022-09-18 19:52           ` Victoria Dye
  2022-09-19  1:23             ` Junio C Hamano
                               ` (3 more replies)
  0 siblings, 4 replies; 69+ messages in thread
From: Victoria Dye @ 2022-09-18 19:52 UTC (permalink / raw)
  To: Elijah Newren, Junio C Hamano
  Cc: Shaoxuan Yuan, Derrick Stolee, Git Mailing List

Elijah Newren wrote:
> == Overall ==
> 
> For existing querying commands (just ls-files), `--sparse` already
> means restrict to the sparse cone.  If we keep using the existing flag
> names, grep should follow suit.
> 
> For existing modification commands already released (add, rm), the
> fact that the command is modifying actually gives a different way to
> interpret things such that it's not clear `--sparse` was even a
> problem.  However, perhaps the name of the flag is bad just because
> there are multiple ways to view it and those who view it one way will
> see it as counter-intuitive.
> 
> == Flag rename? ==
> 
> There's another reason to potentially rename the flag.  We already
> have `--sparse` and `--dense` flags for rev-list and friends.  So,
> when we want to enable those other commands to restrict to the
> sparsity patterns, we probably need a different name.  So, perhaps, we
> should rename our `--sparse/--dense` to `--restrict/--no-restrict`.
> Such a rename would also likely clear up the ambiguity about which way
> to interpret the command for the add & rm commands (though it'd pick
> the second one and suggest we were using the wrong name after all).
> 
> (There are also two other commands that use `--sparse` -- pack-objects
> and show-branch, though in a much different way and neither would ever
> be affected by our new --sparse/--dense/--restrict/--no-restrict
> flags.)
> 
> Other names are also possible.  Any suggestions?
> 
> == global flag vs subcommand flags ==
> 
> Do we want to make --[no-]restrict a flag for each subcommand, or just
> make it a global git flag?  I kind of think it'd make sense to do the
> latter.
> 
> == Defaults ==
> 
> As discussed before, we probably want querying commands (ls-files,
> grep, log, etc.) to default to --no-restrict for now, since we are
> otherwise slowly changing the defaults.  We may want to swap that
> default in the future.
> 
> However, for modification commands, I think we want the default to be
> --restrict, regardless of the default for querying commands.  There
> are some potentially very negative surprises for users if we don't,
> and those surprises will be delayed rather than occur at the time the
> user runs the command.  In fact, those negative surprises are likely
> why those commands were the first to gain an option controlling
> whether they operated on paths outside the sparsity specification.
> (Also, the modification commands print a warning if they could have
> affected other files but didn't due to the default of restricting, so
> I think we have their default correct, even if the flag name is
> suboptimal.)

One of the things I've found myself a bit frustrated with while working on
these sparse index integrations is that we haven't had a clear set of
guidelines for times when we need to make UI/UX changes relating to
'sparse-checkout' compatibility. I think what you've outlined here is a good
start to a larger discussion on the topic, but in the middle of this series
might not be the best place for that discussion (at least in terms of
preserving for later reference). 

Elijah, would you be interested in compiling your thoughts into a document
in 'Documentation/technical'? If not, Stolee or I could do it. If we could
settle on some guidelines (option names, behavior, etc.) for better
incorporating 'sparse-checkout' support into existing commands, it'd make
future sparse index work substantially easier for everyone involved.

As for this series, I think the best way to move the sparse index work along
is to drop this patch ("builtin/grep.c: add --sparse option") altogether.
Shaoxuan's updates in patch 3 [1] make 'git grep' sparse index-compatible
for *all* invocations (not just those without '--sparse'), so we don't need
the new option for sparse index compatibility. It can then be re-introduced
later (possibly modified) in a series dedicated to unifying the
sparse-checkout UX.

[1] https://lore.kernel.org/git/20220908001854.206789-4-shaoxuan.yuan02@gmail.com/


* Re: [PATCH v5 1/3] builtin/grep.c: add --sparse option
  2022-09-18 19:52           ` Victoria Dye
@ 2022-09-19  1:23             ` Junio C Hamano
  2022-09-19  4:27             ` Shaoxuan Yuan
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 69+ messages in thread
From: Junio C Hamano @ 2022-09-19  1:23 UTC (permalink / raw)
  To: Victoria Dye
  Cc: Elijah Newren, Shaoxuan Yuan, Derrick Stolee, Git Mailing List

Victoria Dye <vdye@github.com> writes:

> Elijah Newren wrote:
> ...
>> However, for modification commands, I think we want the default to be
>> --restrict, regardless of the default for querying commands.  There
>> are some potentially very negative surprises for users if we don't,
>> and those surprises will be delayed rather than occur at the time the
>> user runs the command.  In fact, those negative surprises are likely
>> why those commands were the first to gain an option controlling
>> whether they operated on paths outside the sparsity specification.
>> (Also, the modification commands print a warning if they could have
>> affected other files but didn't due to the default of restricting, so
>> I think we have their default correct, even if the flag name is
>> suboptimal.)
>
> One of the things I've found myself a bit frustrated with while working on
> these sparse index integrations is that we haven't had a clear set of
> guidelines for times when we need to make UI/UX changes relating to
> 'sparse-checkout' compatibility. I think what you've outlined here is a good
> start to a larger discussion on the topic, but in the middle of this series
> might not be the best place for that discussion (at least in terms of
> preserving for later reference). 

Yup, I think we were a bit too quick to add the "hide outside sparse
cones" feature without first coming up with a reasonable guideline
that is designed to keep things consistent.

It might have been nice if we did this "make X sparse checkout
aware" effort in two separate steps.  The first step will not change
any behaviour, i.e. no optional or default "hide outside sparse
cones" at all, just "we do not upfront expand the index fully;
instead as we discover we need to inspect the contents in a
subdirectory that is compacted to a tree in the index, we lazily
expand it" as performance optimization.  And once we made sure we
taught all commands that used to expand the index fully upfront not
to do so, we do the "guideline" design for UI to "hide outside
sparse cones", and add that feature to the commands in the second
step.
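
As a rough illustration of that first step (lazy expansion only, no
change in what is shown), here is a toy Python model. It is not git's
implementation: `TREES`, `lazy_grep`, and the trailing-slash convention
for sparse directory entries are all hypothetical names invented for
this sketch.

```python
# Toy model, NOT git's code. A "sparse index" maps paths to blob contents
# and collapses each out-of-cone directory into one entry ending in "/".
# TREES stands in for the object database that holds the collapsed trees.
TREES = {
    "docs/": {"docs/readme.txt": "hello sparse", "docs/api.txt": "grep me"},
    "vendor/": {"vendor/lib.c": "grep me too"},
}


def lazy_grep(sparse_index, term, prefix=""):
    """Search for term, expanding a sparse directory only when needed.

    Returns (hits, expanded), where expanded lists the sparse directory
    entries that actually had to be walked.
    """
    hits, expanded = [], []
    for path, value in sparse_index.items():
        if path.endswith("/"):
            # Sparse directory entry: skip it unless the prefix could
            # match something underneath it.
            if not (path.startswith(prefix) or prefix.startswith(path)):
                continue
            expanded.append(path)
            candidates = TREES[path]
        else:
            candidates = {path: value}
        for p, content in candidates.items():
            if p.startswith(prefix) and term in content:
                hits.append(p)
    return hits, expanded


sparse_index = {
    "docs/": None,
    "src/main.c": "int main(void) { return 0; }",
    "vendor/": None,
}

# Limited to docs/: only that directory is walked.
hits, expanded = lazy_grep(sparse_index, "grep me", prefix="docs/")

# No prefix: every sparse directory is walked, but only on demand.
all_hits, all_expanded = lazy_grep(sparse_index, "grep me")
```

The results are the same as searching a fully expanded index; the only
difference is how much gets walked to produce them.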

Unfortunately we all get excited too much when we find a new shiny
toy, and we ended up getting ahead of ourselves before designing a
consistent end user experience.  But better late than never ;-)

> As for this series, I think the best way to move the sparse index work along
> is to drop this patch ("builtin/grep.c: add --sparse option") altogether.

Does that roughly correspond to the first step in my "It would have
been nice if we did these in two steps" above?  That would be a
sensible thing to do, as it would mean fewer surprises for the users, I
hope.

Thanks.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v5 1/3] builtin/grep.c: add --sparse option
  2022-09-18  4:24         ` Elijah Newren
@ 2022-09-19  4:13           ` Shaoxuan Yuan
  0 siblings, 0 replies; 69+ messages in thread
From: Shaoxuan Yuan @ 2022-09-19  4:13 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Derrick Stolee, Victoria Dye, Git Mailing List, Junio C Hamano

On 9/17/2022 9:24 PM, Elijah Newren wrote:
> On Fri, Sep 16, 2022 at 8:34 PM Shaoxuan Yuan <shaoxuan.yuan02@gmail.com> wrote:
>>
>>> I'm also curious whether there shouldn't be a config option for
>>> something like this, so folks don't have to specify it with every
>>> invocation.  In particular, while I certainly have users that want to
>>> just query git for information about the part of the history they are
>>> interested in, there are other users who are fully aware they are
>>> working in a bigger repository and want to search for additional
>>> things to add to their sparse-checkout and predominantly use grep for
>>> things like that.  They have even documented that `git grep --cached
>>> <TERM>` can be used in sparse-checkouts for this purpose...and have
>>> been using that for a few years.  (I did warn them at the time that
>>> there was a risk they'd have to change their command, but it's still
>>> going to be a behavioral change they might not expect.)  Further, when
>>> I brought up changing the behavior of commands during sparse-checkouts
>>> to limit to files matching the sparsity paths in that old thread at
>>> [1], Stolee was a bit skeptical of making that the default.  That
>>> suggests, at least, that two independent groups of users would want to
>>> use the non-sparse searching frequently, and frequently enough that
>>> they'd appreciate a config option.
>>
>> A config option sounds good. Though I think
>>
>> 1. If this option is for global behavior: users may be better off turning
>> off sparse-checkout if they want a config to do things densely everywhere.
> 
> Sorry, it sounds like I haven't explained the usecases to you very
> well.  Let me try again.
> 
> There are people who want to do everything densely, as you say, and
> those folks can just turn off sparse-checkout or not use it in the
> first place.  Git has traditionally catered to these folks just fine.
> However, it's not a subset of interest for this discussion and wasn't
> what I was talking about.

OK, reading...

> There are (at least) two different usecases for people wanting to use
> sparse-checkouts; I have users that fall under each category:
> 
> 
> 1) Working on a repository subset; users are _only_ interested in that subset.
> 
> This usecase is very poorly supported in Git right now, but I think
> you understand it so I'll only briefly describe it.
> 
> These folks might know there are other things in the repository, but
> don't care.  Not only should the working tree be sparse, but grep,
> log, diff, etc. should be restricted to the subset of the tree they
> are interested in.

Right, this is the usecase I am familiar with.

> Restricting operations to the sparsity specification is also important
> for marrying partial clones with sparse checkouts while allowing
> disconnected development.  Without such a restrict-to-sparsity-paths
> feature, the partial clones will attempt to download objects the first
> time they try to grep an old revision, or do log with a glob path.
> The download will fail, causing the operation to fail, and break the
> ability of the user to work in a disconnected manner.

OK, I'm still learning about partial clone, didn't get a chance to look
at it. Will try to figure out what this means :)

> 2) The working directory is sparse, but users are working in a larger whole.
> 
> Stolee described this usecase this way[2]:
> 
> "I'm also focused on users that know that they are a part of a larger
> whole. They know they are operating on a large repository but focus on
> what they need to contribute their part. I expect multiple "roles" to
> use very different, almost disjoint parts of the codebase. Some other
> "architect" users operate across the entire tree or hop between different
> sections of the codebase as necessary. In this situation, I'm wary of
> scoping too many features to the sparse-checkout definition, especially
> "git log," as it can be too confusing to have their view of the codebase
> depend on your "point of view.""
> 
> [2] https://lore.kernel.org/git/1a1e33f6-3514-9afc-0a28-5a6b85bd8014@gmail.com/
> 
> I describe it very similarly, but I'd like to point out something
> additional around this usecase and how it can be influenced by
> dependencies.  The first cut for sparse-checkouts is usually the
> directories you are interested in plus what those directories depend
> upon within your repository.  But there's a monkey wrench here: if you
> have integration tests, they invert the hierarchy: to run integration
> tests, you need not only what you are interested in and its
> dependencies, you also need everything that depends upon what you are
> interested in or that depends upon one of your dependencies...AND you
> need all the dependencies of that expanded group.  That can easily
> change your sparse-checkout into a nearly dense one.  Naturally, that
> tends to kill the benefits of sparse-checkouts.  There are a couple
> solutions to this conundrum: either avoid grabbing dependencies (maybe
> have built versions of your dependencies pulled from a CI cache
> somewhere), or say that users shouldn't run integration tests directly
> and instead do it on the CI server when they submit a code review.  Or
> do both.  Regardless of whether you stub out your dependencies or stub
> out the things that depend upon you, there is certainly a reason to
> want to query and be aware of those other parts of the repository.
> Thus, sparse-checkouts can be used to limit what you directly build
> and modify, but these users do not want it to limit their queries of
> history.
> 
> 
> Once users pick either the first or the second usecase, they often
> stick within it.  For either group, regardless of what Git's default
> is, needing to specify an additional flag for *every*
> grep/log/diff/etc. they run would just be a total annoyance.  Neither
> wants a dense worktree, but one side wants a dense history query while
> the other wants a sparse one.  Different groups should be able to
> configure the default that works well for them, much like we allow
> users to configure whether they want "git pull" to rebase or merge.

OK, now I get it:

Case A: users only interested in a subset, so they need only sparse
history and a sparse worktree.

v.s.

Case B: users work within a subset but need a larger context, so they
need a dense history/query (that's why we should let grep default to
--no-restrict, as you suggested?), though still a sparse worktree.
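
A minimal Python sketch of the two cases, assuming a toy index in which
each entry carries a SKIP_WORKTREE-like flag. `IndexEntry`, `grep_index`,
and the `restrict` parameter are hypothetical names for illustration
only; they are not git's data structures or options.

```python
from dataclasses import dataclass


@dataclass
class IndexEntry:
    path: str
    content: str
    skip_worktree: bool  # True when the path is outside the sparsity paths


def grep_index(entries, term, restrict):
    """Return the paths whose content contains term.

    restrict=True  models Case A: entries outside the sparsity paths
                   are ignored.
    restrict=False models Case B: every index entry is searched.
    """
    return [
        e.path
        for e in entries
        if term in e.content and not (restrict and e.skip_worktree)
    ]


index = [
    IndexEntry("in-cone/a.c", "int foo(void);", False),
    IndexEntry("out-of-cone/b.c", "int foo(void) { return 0; }", True),
]

case_a = grep_index(index, "foo", restrict=True)   # sparse query
case_b = grep_index(index, "foo", restrict=False)  # dense query
```

In both cases the worktree itself stays sparse; only the scope of the
query differs.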

> 
>> 2. If this option is for a single subcommand (e.g. 'grep'): I don't have
>> many thoughts here. It certainly can be nice for users who need to do
>> non-sparse searching frequently. This design, if necessary, should
>> belong to a patch where this config is added for every single subcommand?
>>
>>> I also brought up in that old thread that perhaps we want to avoid
>>> adding a flag to every subcommand, and instead just having a
>>> git-global flag for triggering this type of behavior.  (e.g. `git
>>> --no-restrict grep --cached ...` or `git --dense grep --cached ...`).
>>
>> This looks more like the answer to me. It's a peace of mind for users if
>> they don't have to worry about whether a subcommand is sparse-aware, and
>> how their behaviors may differ. Though we still may need to update the
>> actual behavior in each subcommand over an extended period of time
>> (though may not be difficult?), which you mentioned above "seems like a
>> poor strategy".
>>
>>> [1] https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/
>>> and the responses to that email
>>>
>>>> Change the default behavior of 'git grep' to focus on the files within
>>>> the sparse-checkout definition. To enable the previous behavior, add a
>>>> '--sparse' option to 'git grep' that triggers the old behavior that
>>>> inspects paths outside of the sparse-checkout definition when paired
>>>> with the '--cached' option.
>>>
>>> I still think the flag name of `--sparse` is totally backwards and
>>> highly confusing for the described behavior.  I missed Stolee's email
>>> at the time (wasn't cc'ed) where he brought up that "--sparse" had
>>> already been added to "git-add" and "git-rm", but in those cases the
>>> commands aren't querying and I just don't see how they lead to the
>>> same level of user confusion.  This one seems glaringly wrong to me
>>> and both Junio and I flagged it on v1 when we first saw it.  (Perhaps
>>> it also helps that for the add/rm cases, that a user is often given an
>>> error message with the suggested flag to use, which just doesn't make
>>> sense here either.)  If there is concern that this flag should be the
>>> same as add and rm, then I think we need to do the backward
>>> compatibility dance and fix add and rm by adding an alias over there
>>> so that grep's flag won't be so confusing.
>>
>> I guess I'm using "--sparse" here because "add", "rm" and "mv" all imply
>> that "when operating on a sparse path, ignores/warns unless '--sparse'
>> is used". I take it as an analogy so "when searching a sparse path,
>> ignores/warns unless '--sparse' is used". As the idea that "Git does
>> *not* care about sparse contents unless '--[no-]sparse' is specified" is sort
>> of established through the implementations in "add", "rm", or "mv", I
>> don't see a big problem using "--sparse" here.
> 
> Well, I do.
> 
> In addition to just being utterly backwards and confusing in the
> context of grep:
>   * Both `clone` and `ls-files` use `--sparse` to mean to limit things
> to the sparsity cone, so we're already kinda split-brained.

Agree.

>   * grep is more like ls-files (both being querying functions) than
> add/rm/mv, so should really follow its lead instead of the one from
> add/rm/mv.

Agree.

>   * There's another way to interpret `--sparse` for `add` and `rm`
> such that it makes sense (at least to me); see my other email to Junio
> in this thread.

According to the spirit of your points, I think they should be
defaulting to --restrict (a rename perhaps) right now.

>   * `mv` is indeed using it backward, but the `mv` change is new to
> this cycle (and undocumented) so I'm not sure it counts as much of a
> precedent yet.

Oops, I was making the modifications to `mv` and forgot to add
documentation to it. Though the --sparse of `mv` was not documented
before I touched it. Perhaps it can be added later if we are going to
rename --sparse/--dense to --restrict/--no-restrict.

>> I *think*, as long as the users are informed that the default is to
>> ignore things outside of the sparse-checkout definition, and they have
>> to do something (using "--sparse" or a potential better name) to
>> override the default, we are safe to use a name that is famous (i.e.
>> "--sparse") even though its literal meaning is not perfectly descriptive.
>>
>> One outlier I do find confusing though, is the "--sparse" option from
>> "git-ls-files". Without it, Git expands the index and shows everything
>> outside of sparse-checkout definition, which seems a bit controversial.
> 
> Nah, that perfectly matches the expectation of users in the second
> usecase above -- querying (ls-files/grep/log/diff) defaults to
> non-restricted history, modifying (add/rm/mv) defaults to restricted
> paths but warns if the arguments could have matched something else,
> and the working tree is restricted to sparse paths.  It doesn't seem
> too controversial to me, even if it's not what we want for the
> long-term default.

OK. After the reasoning you gave above, now the --sparse of ls-files
looks good.
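
The modifying-command default described in the quoted text (restrict to
the sparsity paths, but warn when an argument could have matched
something outside them) can be sketched the same way. This is a toy
Python model, not how `git add` is implemented; `add_paths` and
`allow_outside` are hypothetical names.

```python
def add_paths(requested, outside_paths, allow_outside=False):
    """Toy model of a modifying command in a sparse-checkout.

    Returns (staged, warnings). By default, paths outside the sparsity
    specification are skipped with a warning instead of being touched.
    """
    staged, warnings = [], []
    for path in requested:
        if path in outside_paths and not allow_outside:
            warnings.append(
                f"skipped {path}: outside the sparsity specification"
            )
        else:
            staged.append(path)
    return staged, warnings


outside = {"out-of-cone/b.c"}
request = ["in-cone/a.c", "out-of-cone/b.c"]

# Default: restricted, with a warning about the skipped path.
staged, warnings = add_paths(request, outside)

# Explicit override: operate on everything that was asked for.
staged_all, warnings_all = add_paths(request, outside, allow_outside=True)
```

The warning is what makes the restricted default safe: the user learns
at once that something was skipped, rather than being surprised later.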

> 
> The defaults for the first usecase would be defaulting to restricted
> paths for everything, and perhaps not warn if arguments to a modifying
> command could have matched something else.
> 
> 
> Anyway, hope that helps you understand my perspective and framing.

Thanks for the explanations, now I get it and agree with your points :)

Thanks,
Shaoxuan

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v5 1/3] builtin/grep.c: add --sparse option
  2022-09-18 19:52           ` Victoria Dye
  2022-09-19  1:23             ` Junio C Hamano
@ 2022-09-19  4:27             ` Shaoxuan Yuan
  2022-09-19 11:03             ` Ævar Arnfjörð Bjarmason
  2022-09-20  7:13             ` Elijah Newren
  3 siblings, 0 replies; 69+ messages in thread
From: Shaoxuan Yuan @ 2022-09-19  4:27 UTC (permalink / raw)
  To: Victoria Dye, Elijah Newren, Junio C Hamano
  Cc: Derrick Stolee, Git Mailing List

Hi Victoria, :-)

On 9/18/2022 12:52 PM, Victoria Dye wrote:
> Elijah Newren wrote:
>> == Overall ==
>>
>> For existing querying commands (just ls-files), `--sparse` already
>> means restrict to the sparse cone.  If we keep using the existing flag
>> names, grep should follow suit.
>>
>> For existing modification commands already released (add, rm), the
>> fact that the command is modifying actually gives a different way to
>> interpret things such that it's not clear `--sparse` was even a
>> problem.  However, perhaps the name of the flag is bad just because
>> there are multiple ways to view it and those who view it one way will
>> see it as counter-intuitive.
>>
>> == Flag rename? ==
>>
>> There's another reason to potentially rename the flag.  We already
>> have `--sparse` and `--dense` flags for rev-list and friends.  So,
>> when we want to enable those other commands to restrict to the
>> sparsity patterns, we probably need a different name.  So, perhaps, we
>> should rename our `--sparse/--dense` to `--restrict/--no-restrict`.
>> Such a rename would also likely clear up the ambiguity about which way
>> to interpret the command for the add & rm commands (though it'd pick
>> the second one and suggest we were using the wrong name after all).
>>
>> (There are also two other commands that use `--sparse` -- pack-objects
>> and show-branch, though in a much different way and neither would ever
>> be affected by our new --sparse/--dense/--restrict/--no-restrict
>> flags.)
>>
>> Other names are also possible.  Any suggestions?
>>
>> == global flag vs subcommand flags ==
>>
>> Do we want to make --[no-]restrict a flag for each subcommand, or just
>> make it a global git flag?  I kind of think it'd make sense to do the
>> latter
>>
>> == Defaults ==
>>
>> As discussed before, we probably want querying commands (ls-files,
>> grep, log, etc.) to default to --no-restrict for now, since we are
>> otherwise slowly changing the defaults.  We may want to swap that
>> default in the future.
>>
>> However, for modification commands, I think we want the default to be
>> --restrict, regardless of the default for querying commands.  There
>> are some potentially very negative surprises for users if we don't,
>> and those surprises will be delayed rather than occur at the time the
>> user runs the command.  In fact, those negative surprises are likely
>> why those commands were the first to gain an option controlling
>> whether they operated on paths outside the sparsity specification.
>> (Also, the modification commands print a warning if they could have
>> affected other files but didn't due to the default of restricting, so
>> I think we have their default correct, even if the flag name is
>> suboptimal.)
> 
> One of the things I've found myself a bit frustrated with while working on
> these sparse index integrations is that we haven't had a clear set of
> guidelines for times when we need to make UI/UX changes relating to
> 'sparse-checkout' compatibility. I think what you've outlined here is a good
> start to a larger discussion on the topic, but in the middle of this series
> might not be the best place for that discussion (at least in terms of
> preserving for later reference). 
> 
> Elijah, would you be interested in compiling your thoughts into a document
> in 'Documentation/technical'? If not, Stolee or I could do it. If we could
> settle on some guidelines (option names, behavior, etc.) for better
> incorporating 'sparse-checkout' support into existing commands, it'd make
> future sparse index work substantially easier for everyone involved.

This sounds good! I am always confused about the inconsistency of the
meaning of "--sparse" across a variety of commands. A guideline
definitely corrects prior integrations and helps future ones.

> As for this series, I think the best way to move the sparse index work along
> is to drop this patch ("builtin/grep.c: add --sparse option") altogether.
> Shaoxuan's updates in patch 3 [1] make 'git grep' sparse index-compatible
> for *all* invocations (not just those without '--sparse'), so we don't need
> the new option for sparse index compatibility. It can then be re-introduced
> later (possibly modified) in a series dedicated to unifying the
> sparse-checkout UX.

Are you suggesting that we should still follow the original "use --cached
to search within the index and show SKIP_WORKTREE entries found"? I'm
asking because the tests in the second patch [2] are still using the
lately-introduced "--sparse". If yes, then I think it sounds good to
re-introduce the (potentially) modified UI in the future :-).

[2]
https://lore.kernel.org/git/20220908001854.206789-3-shaoxuan.yuan02@gmail.com/

> 
> [1] https://lore.kernel.org/git/20220908001854.206789-4-shaoxuan.yuan02@gmail.com/

Thanks,
Shaoxuan

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v5 1/3] builtin/grep.c: add --sparse option
  2022-09-18 19:52           ` Victoria Dye
  2022-09-19  1:23             ` Junio C Hamano
  2022-09-19  4:27             ` Shaoxuan Yuan
@ 2022-09-19 11:03             ` Ævar Arnfjörð Bjarmason
  2022-09-20  7:13             ` Elijah Newren
  3 siblings, 0 replies; 69+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-09-19 11:03 UTC (permalink / raw)
  To: Victoria Dye
  Cc: Elijah Newren, Junio C Hamano, Shaoxuan Yuan, Derrick Stolee,
	Git Mailing List


On Sun, Sep 18 2022, Victoria Dye wrote:

> Elijah, would you be interested in compiling your thoughts into a document
> in 'Documentation/technical'? If not, Stolee or I could do it. If we could
> settle on some guidelines (option names, behavior, etc.) for better
> incorporating 'sparse-checkout' support into existing commands, it'd make
> future sparse index work substantially easier for everyone involved.

This sounds good. I'd just like to suggest that incorporating a table
similar to the one I made for checkout/switch would be useful for
such documentation:

	https://lore.kernel.org/git/211021.86wnm6l1ip.gmgdl@evledraar.gmail.com/

We ended up dropping the ball on that topic, but for cross-command UX I
think it's a very useful way to present how a "meta option", or an
option shared across many commands is expected to behave.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v5 1/3] builtin/grep.c: add --sparse option
  2022-09-18 19:52           ` Victoria Dye
                               ` (2 preceding siblings ...)
  2022-09-19 11:03             ` Ævar Arnfjörð Bjarmason
@ 2022-09-20  7:13             ` Elijah Newren
  3 siblings, 0 replies; 69+ messages in thread
From: Elijah Newren @ 2022-09-20  7:13 UTC (permalink / raw)
  To: Victoria Dye
  Cc: Junio C Hamano, Shaoxuan Yuan, Derrick Stolee, Git Mailing List

On Sun, Sep 18, 2022 at 12:52 PM Victoria Dye <vdye@github.com> wrote:
>
> Elijah Newren wrote:
> > == Overall ==
> >
> > For existing querying commands (just ls-files), `--sparse` already
> > means restrict to the sparse cone.  If we keep using the existing flag
> > names, grep should follow suit.
> >
> > For existing modification commands already released (add, rm), the
> > fact that the command is modifying actually gives a different way to
> > interpret things such that it's not clear `--sparse` was even a
> > problem.  However, perhaps the name of the flag is bad just because
> > there are multiple ways to view it and those who view it one way will
> > see it as counter-intuitive.
> >
> > == Flag rename? ==
> >
> > There's another reason to potentially rename the flag.  We already
> > have `--sparse` and `--dense` flags for rev-list and friends.  So,
> > when we want to enable those other commands to restrict to the
> > sparsity patterns, we probably need a different name.  So, perhaps, we
> > should rename our `--sparse/--dense` to `--restrict/--no-restrict`.
> > Such a rename would also likely clear up the ambiguity about which way
> > to interpret the command for the add & rm commands (though it'd pick
> > the second one and suggest we were using the wrong name after all).
> >
> > (There are also two other commands that use `--sparse` -- pack-objects
> > and show-branch, though in a much different way and neither would ever
> > be affected by our new --sparse/--dense/--restrict/--no-restrict
> > flags.)
> >
> > Other names are also possible.  Any suggestions?
> >
> > == global flag vs subcommand flags ==
> >
> > Do we want to make --[no-]restrict a flag for each subcommand, or just
> > make it a global git flag?  I kind of think it'd make sense to do the
> > latter
> >
> > == Defaults ==
> >
> > As discussed before, we probably want querying commands (ls-files,
> > grep, log, etc.) to default to --no-restrict for now, since we are
> > otherwise slowly changing the defaults.  We may want to swap that
> > default in the future.
> >
> > However, for modification commands, I think we want the default to be
> > --restrict, regardless of the default for querying commands.  There
> > are some potentially very negative surprises for users if we don't,
> > and those surprises will be delayed rather than occur at the time the
> > user runs the command.  In fact, those negative surprises are likely
> > why those commands were the first to gain an option controlling
> > whether they operated on paths outside the sparsity specification.
> > (Also, the modification commands print a warning if they could have
> > affected other files but didn't due to the default of restricting, so
> > I think we have their default correct, even if the flag name is
> > suboptimal.)
>
> One of the things I've found myself a bit frustrated with while working on
> these sparse index integrations is that we haven't had a clear set of
> guidelines for times when we need to make UI/UX changes relating to
> 'sparse-checkout' compatibility. I think what you've outlined here is a good
> start to a larger discussion on the topic, but in the middle of this series
> might not be the best place for that discussion (at least in terms of
> preserving for later reference).

Yeah, that's fair, and I apologize for the problems.  I should have
pushed for a resolution and/or documentation of these issues at some
point; particularly since I was the one to bring it up in the first
place.  Between Stolee asking us to defer for a year-ish on UI/UX
changes in sparse-checkout while he got sparse-index into place, and
various other things coming up in the meantime, I just didn't get back
to it.  I probably should have, especially since we also had other
similar discussions going back to when git-sparse-checkout was first
introduced, but we've often focused on just solving the next subset of
usecases that were within reach rather than getting a bigger design
document.  Knowing that these kinds of issues were lurking was part of
the reason I insisted on having the big scary warning in the docs:

"""
THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR, AND THE BEHAVIOR OF OTHER
COMMANDS IN THE PRESENCE OF SPARSE-CHECKOUTS, WILL LIKELY CHANGE IN
THE FUTURE.
"""

I'm glad I at least had the foresight to insist on that small measure...  :-)

> Elijah, would you be interested in compiling your thoughts into a document
> in 'Documentation/technical'? If not, Stolee or I could do it. If we could
> settle on some guidelines (option names, behavior, etc.) for better
> incorporating 'sparse-checkout' support into existing commands, it'd make
> future sparse index work substantially easier for everyone involved.

Sure, I'll take a stab at it this week.

> As for this series, I think the best way to move the sparse index work along
> is to drop this patch ("builtin/grep.c: add --sparse option") altogether.
> Shaoxuan's updates in patch 3 [1] make 'git grep' sparse index-compatible
> for *all* invocations (not just those without '--sparse'), so we don't need
> the new option for sparse index compatibility. It can then be re-introduced
> later (possibly modified) in a series dedicated to unifying the
> sparse-checkout UX.

Seems reasonable.

> [1] https://lore.kernel.org/git/20220908001854.206789-4-shaoxuan.yuan02@gmail.com/

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH v6 0/1] grep: integrate with sparse index
  2022-08-17  7:56 [PATCH v1 0/2] grep: integrate with sparse index Shaoxuan Yuan
                   ` (6 preceding siblings ...)
  2022-09-08  0:18 ` [PATCH v5 0/3] grep: integrate with sparse index Shaoxuan Yuan
@ 2022-09-23  4:18 ` Shaoxuan Yuan
  2022-09-23  4:18   ` [PATCH v6 1/1] builtin/grep.c: " Shaoxuan Yuan
                     ` (2 more replies)
  7 siblings, 3 replies; 69+ messages in thread
From: Shaoxuan Yuan @ 2022-09-23  4:18 UTC (permalink / raw)
  To: git; +Cc: derrickstolee, gitster, vdye, newren, avarab, Shaoxuan Yuan

Integrate `git-grep` with sparse-index and test the performance
improvement.

Changes since v5
----------------

* Drop the `--sparse` option patch and edit corresponding tests. 
  We can wait until a better name is decided to replace `--sparse`.

* Modify the commit message, in particular removing the `--sparse`
  occurrences.

Changes since v4
----------------
* Reset the length of `struct strbuf name` back to `name_base_len`,
  instead of 0, after `grep_tree()` returns.

* Add test cases in t1092 for `grep` recursing into submodules.

* Add a few NEEDSWORK to explain the current problem with submodules.

Changes since v3
----------------
* Shorten the perf result tables in commit message.

* Update the commit message to reflect the changes in the commit.

* Update the commit message to indicate the performance improvement
  is dependent on the pathspec.

* Stop passing `ce_mode` through `check_attr`. Instead, set the
  `base_len` to 0 to make the code more reasonable and less of an
  abuse of `check_attr`.

* Remove another invention of `base`. Use the existing `name` as the
  argument for `grep_tree()`, and reset it back to `ce->name` after
  `grep_tree()` returns.

* Update the p2000 test to use a more general pathspec for better
  compatibility (i.e. do not use git repository specific pathspec).

* Add tests to t1092 'grep is not expanded' to verify the change
  brought by "builtin/grep.c: walking tree instead of expanding index
  with --sparse": the index *never* expands.

Changes since v2
----------------

* Modify the commit message for "builtin/grep.c: integrate with sparse
  index" to make it obvious that the perf test results are not from
  p2000 tests, but from manual perf runs.

* Add tree-walking logic as an extra (the third) patch to improve the
  performance when --sparse is used. This resolved the left-over-bit
  in v2 [1].

[1] https://lore.kernel.org/git/20220829232843.183711-1-shaoxuan.yuan02@gmail.com/

Changes since v1
----------------

* Rewrite the commit message for "builtin/grep.c: add --sparse option"
  to be clearer.

* Update the documentation (both in-code and man page) for --sparse.

* Add a few tests to test the new behavior (when _only_ --cached is
  supplied).

* Reformat the perf test results to not look like directly from p2000
  tests.

* Put the "command_requires_full_index" lines right after parse_options().

* Add a pathspec test in t1092, and reword some test documentation.

Shaoxuan Yuan (1):
  builtin/grep.c: integrate with sparse index

 builtin/grep.c                           | 48 +++++++++++++++-
 t/perf/p2000-sparse-operations.sh        |  1 +
 t/t1092-sparse-checkout-compatibility.sh | 72 ++++++++++++++++++++++++
 3 files changed, 118 insertions(+), 3 deletions(-)

Range-diff against v5:
1:  1d00d23bf9 < -:  ---------- builtin/grep.c: add --sparse option
2:  926b8d2462 < -:  ---------- builtin/grep.c: integrate with sparse index
3:  18b65034fe ! 1:  8604111d74 builtin/grep.c: walking tree instead of expanding index with --sparse
    @@ Metadata
     Author: Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>
     
      ## Commit message ##
    -    builtin/grep.c: walking tree instead of expanding index with --sparse
    +    builtin/grep.c: integrate with sparse index
     
    -    Before this patch, whenever --sparse is used, `git-grep` utilizes the
    -    ensure_full_index() method to expand the index and search all the
    -    entries. Because this method requires walking all the trees and
    -    constructing the index, it is the slow part within the whole command.
    +    Turn on sparse index and remove ensure_full_index().
    +
    +    Before this patch, `git-grep` utilizes the ensure_full_index() method to
    +    expand the index and search all the entries. Because this method
    +    requires walking all the trees and constructing the index, it is the
    +    slow part within the whole command.
     
         To achieve better performance, this patch uses grep_tree() to search the
         sparse directory entries and get rid of the ensure_full_index() method.
    @@ Commit message
            result of expanding the index.
     
         2) grep_tree() utilizes pathspecs to limit the scope of searching.
    -       ensure_full_index() always expands the index when --sparse is used,
    -       that means it will always walk all the trees and blobs in the repo
    -       without caring if the user only wants a subset of the content, i.e.
    -       using a pathspec. On the other hand, grep_tree() will only search
    -       the contents that match the pathspec, and thus possibly walking fewer
    -       trees.
    +       ensure_full_index() always expands the index, which means it will
    +       always walk all the trees and blobs in the repo without caring if
    +       the user only wants a subset of the content, i.e. using a pathspec.
    +       On the other hand, grep_tree() will only search the contents that
    +       match the pathspec, and thus possibly walks fewer trees.
     
         3) grep_tree() does not construct and copy back a new index, while
            ensure_full_index() does. This also saves some time.
    @@ Commit message
         - Summary:
     
         p2000 tests demonstrate a ~71% execution time reduction for
    -    `git grep --cached --sparse bogus -- "f2/f1/f1/*"` using tree-walking
    -    logic. However, notice that this result varies depending on the pathspec
    +    `git grep --cached bogus -- "f2/f1/f1/*"` using tree-walking logic.
    +    However, notice that this result varies depending on the pathspec
         given. See below "Command used for testing" for more details.
     
         Test                              HEAD~   HEAD
    @@ Commit message
     
         - Command used for testing:
     
    -            git grep --cached --sparse bogus -- "f2/f1/f1/*"
    +            git grep --cached bogus -- "f2/f1/f1/*"
     
         The reason for specifying a pathspec is that, if we don't specify a
         pathspec, then grep_tree() will walk all the trees and blobs to find the
    @@ Commit message
     
                 Command used:
     
    -                    git grep --cached --sparse bogus
    +                    git grep --cached bogus
     
                 Test                                HEAD~  HEAD
                 ---------------------------------------------------
    @@ Commit message
         Suggested-by: Derrick Stolee <derrickstolee@github.com>
         Helped-by: Derrick Stolee <derrickstolee@github.com>
         Helped-by: Victoria Dye <vdye@github.com>
    +    Helped-by: Elijah Newren <newren@gmail.com>
         Signed-off-by: Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>
     
      ## builtin/grep.c ##
    @@ builtin/grep.c: static int grep_cache(struct grep_opt *opt,
      	if (repo_read_index(repo) < 0)
      		die(_("index file corrupt"));
      
    --	if (grep_sparse)
    --		ensure_full_index(repo->index);
    --
    +-	/* TODO: audit for interaction with sparse-index. */
    +-	ensure_full_index(repo->index);
      	for (nr = 0; nr < repo->index->cache_nr; nr++) {
      		const struct cache_entry *ce = repo->index->cache[nr];
      
    @@ builtin/grep.c: static int grep_cache(struct grep_opt *opt,
     +			struct tree_desc tree;
     +			void *data;
     +			unsigned long size;
    -+
    -+			data = read_object_file(&ce->oid, &type, &size);
    -+			init_tree_desc(&tree, data, size);
      
     -		if (S_ISREG(ce->ce_mode) &&
    ++			data = read_object_file(&ce->oid, &type, &size);
    ++			init_tree_desc(&tree, data, size);
    ++
     +			hit |= grep_tree(opt, pathspec, &tree, &name, 0, 0);
     +			strbuf_setlen(&name, name_base_len);
     +			strbuf_addstr(&name, ce->name);
    @@ builtin/grep.c: static int grep_cache(struct grep_opt *opt,
      		    match_pathspec(repo->index, pathspec, name.buf, name.len, 0, NULL,
      				   S_ISDIR(ce->ce_mode) ||
      				   S_ISGITLINK(ce->ce_mode))) {
    +@@ builtin/grep.c: int cmd_grep(int argc, const char **argv, const char *prefix)
    + 			     PARSE_OPT_KEEP_DASHDASH |
    + 			     PARSE_OPT_STOP_AT_NON_OPTION);
    + 
    ++	if (the_repository->gitdir) {
    ++		prepare_repo_settings(the_repository);
    ++		the_repository->settings.command_requires_full_index = 0;
    ++	}
    ++
    + 	if (use_index && !startup_info->have_repository) {
    + 		int fallback = 0;
    + 		git_config_get_bool("grep.fallbacktonoindex", &fallback);
     
      ## t/perf/p2000-sparse-operations.sh ##
     @@ t/perf/p2000-sparse-operations.sh: test_perf_on_all git read-tree -mu HEAD
    @@ t/t1092-sparse-checkout-compatibility.sh: init_repos () {
      run_on_sparse () {
      	(
      		cd sparse-checkout &&
    -@@ t/t1092-sparse-checkout-compatibility.sh: test_expect_success 'grep is not expanded' '
    +@@ t/t1092-sparse-checkout-compatibility.sh: test_expect_success 'sparse index is not expanded: rm' '
    + 	ensure_not_expanded rm -r deep
    + '
      
    - 	# All files within the folder1/* pathspec are sparse,
    - 	# so this command does not find any matches
    --	ensure_not_expanded ! grep a -- folder1/*
    ++test_expect_success 'grep with and --cached' '
    ++	init_repos &&
    ++
    ++	test_all_match git grep --cached a &&
    ++	test_all_match git grep --cached a -- "folder1/*"
    ++'
    ++
    ++test_expect_success 'grep is not expanded' '
    ++	init_repos &&
    ++
    ++	ensure_not_expanded grep a &&
    ++	ensure_not_expanded grep a -- deep/* &&
    ++
    ++	# All files within the folder1/* pathspec are sparse,
    ++	# so this command does not find any matches
     +	ensure_not_expanded ! grep a -- folder1/* &&
     +
     +	# test out-of-cone pathspec with or without wildcard
    -+	ensure_not_expanded grep --sparse --cached a -- "folder1/a" &&
    -+	ensure_not_expanded grep --sparse --cached a -- "folder1/*" &&
    ++	ensure_not_expanded grep --cached a -- "folder1/a" &&
    ++	ensure_not_expanded grep --cached a -- "folder1/*" &&
     +
     +	# test in-cone pathspec with or without wildcard
    -+	ensure_not_expanded grep --sparse --cached a -- "deep/a" &&
    -+	ensure_not_expanded grep --sparse --cached a -- "deep/*"
    ++	ensure_not_expanded grep --cached a -- "deep/a" &&
    ++	ensure_not_expanded grep --cached a -- "deep/*"
     +'
     +
     +# NEEDSWORK: when running `grep` in the superproject with --recurse-submodules,
    @@ t/t1092-sparse-checkout-compatibility.sh: test_expect_success 'grep is not expan
     +	# do not use ensure_not_expanded() here, because `grep` should be
     +	# run in the superproject, not in "./sparse-index"
     +	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
    -+	git grep --sparse --cached --recurse-submodules a -- "*/folder1/*" &&
    ++	git grep --cached --recurse-submodules a -- "*/folder1/*" &&
     +	test_region ! index ensure_full_index trace2.txt
     +'
     +
    @@ t/t1092-sparse-checkout-compatibility.sh: test_expect_success 'grep is not expan
     +	sparse-checkout/folder1/a:a
     +	sparse-index/folder1/a:a
     +	EOF
    -+	git grep --sparse --cached --recurse-submodules a -- "*/folder1/*" >actual &&
    ++	git grep --cached --recurse-submodules a -- "*/folder1/*" >actual &&
     +	test_cmp actual expect
    - '
    - 
    ++'
    ++
      test_done

base-commit: 1b3d6e17fe83eb6f79ffbac2f2c61bbf1eaef5f8
-- 
2.37.0



* [PATCH v6 1/1] builtin/grep.c: integrate with sparse index
  2022-09-23  4:18 ` [PATCH v6 0/1] grep: integrate with sparse index Shaoxuan Yuan
@ 2022-09-23  4:18   ` Shaoxuan Yuan
  2022-09-23 16:40     ` Junio C Hamano
  2022-09-23 16:58     ` Junio C Hamano
  2022-09-23 14:13   ` [PATCH v6 0/1] grep: " Derrick Stolee
  2022-09-23 16:01   ` Victoria Dye
  2 siblings, 2 replies; 69+ messages in thread
From: Shaoxuan Yuan @ 2022-09-23  4:18 UTC (permalink / raw)
  To: git; +Cc: derrickstolee, gitster, vdye, newren, avarab, Shaoxuan Yuan

Turn on sparse index and remove ensure_full_index().

Before this patch, `git-grep` utilizes the ensure_full_index() method to
expand the index and search all the entries. Because this method
requires walking all the trees and constructing the index, it is the
slow part within the whole command.

To achieve better performance, this patch uses grep_tree() to search the
sparse directory entries and get rid of the ensure_full_index() method.

Why is grep_tree() a better choice than ensure_full_index()?

1) grep_tree() is as correct as ensure_full_index(). grep_tree() looks
   into every sparse-directory entry (represented by a tree) recursively
   when looping over the index, and the result of doing so matches the
   result of expanding the index.

2) grep_tree() utilizes pathspecs to limit the scope of searching.
   ensure_full_index() always expands the index, which means it will
   always walk all the trees and blobs in the repo without caring if
   the user only wants a subset of the content, i.e. using a pathspec.
   On the other hand, grep_tree() will only search the contents that
   match the pathspec, and thus possibly walks fewer trees.

3) grep_tree() does not construct and copy back a new index, while
   ensure_full_index() does. This also saves some time.
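
As a rough illustration of point 2), the toy model below uses ordinary
directories in place of tree objects and counts how many would be
visited with and without a path restriction. It is only a sketch under
that assumption: count_dirs and the directory names are made up for this
example, nothing here calls into Git, and a real walk also has to
descend from the sparse directory entry down to the matching subtree.

```shell
# count_dirs DIR: print how many directories exist at or below DIR.
# In this toy model each directory stands in for one tree object.
count_dirs () {
	find "$1" -type d | wc -l | tr -d ' '
}

# Build a small hierarchy in a temporary location (hypothetical layout).
demo=$(mktemp -d)
mkdir -p "$demo/f1/f1" "$demo/f1/f2" \
	"$demo/f2/f1/f1" "$demo/f2/f1/f2" "$demo/f2/f2"

all=$(count_dirs "$demo")              # unrestricted walk: every "tree"
scoped=$(count_dirs "$demo/f2/f1/f1")  # walk limited to one subtree

echo "all=$all scoped=$scoped"
rm -rf "$demo"
```

The unrestricted count grows with the size of the whole hierarchy,
while the restricted count depends only on the selected subtree, which
is the effect the pathspec has on the tree walk.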

----------------
Performance test

- Summary:

p2000 tests demonstrate a ~71% execution time reduction for
`git grep --cached bogus -- "f2/f1/f1/*"` using tree-walking logic.
However, notice that this result varies depending on the pathspec
given. See below "Command used for testing" for more details.

Test                              HEAD~   HEAD
-------------------------------------------------------
2000.78: git grep ... (full-v3)   0.35    0.39 (≈)
2000.79: git grep ... (full-v4)   0.36    0.30 (≈)
2000.80: git grep ... (sparse-v3) 0.88    0.23 (-73.8%)
2000.81: git grep ... (sparse-v4) 0.83    0.26 (-68.6%)
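
For readers who want to check the percentages, the relative change can
be recomputed from the rounded timings shown in the table. The helper
below is a sketch (pct_change is a name invented here, not part of the
perf suite). With the rounded values it prints -73.9% and -68.7%,
slightly off from the table's -73.8% and -68.6%, presumably because the
table was derived from unrounded timings. The mean of the two is about
71%, which matches the summary.

```shell
# pct_change OLD NEW: print the relative change from OLD to NEW as a
# signed percentage with one decimal place.
pct_change () {
	LC_ALL=C awk -v old="$1" -v new="$2" \
		'BEGIN { printf "%+.1f%%\n", (new - old) / old * 100 }'
}

pct_change 0.88 0.23	# sparse-v3 row
pct_change 0.83 0.26	# sparse-v4 row
```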

- Command used for testing:

	git grep --cached bogus -- "f2/f1/f1/*"

The reason for specifying a pathspec is that, if we don't specify a
pathspec, then grep_tree() will walk all the trees and blobs to find the
pattern, and the time consumed doing so is not too different from using
the original ensure_full_index() method, which also spends most of the
time walking trees. However, when a pathspec is specified, this latest
logic will only walk the area of trees enclosed by the pathspec, and the
time consumed is reasonably a lot less.

Generally speaking, because the performance gain is achieved by walking
fewer trees, namely those selected by the pathspec, the ratio of HEAD
time to HEAD~ time in sparse-v[3|4] should be proportional to the ratio
of "pathspec enclosed area" to "all area". That is, the more the
<pathspec> encompasses, the smaller the performance difference between
HEAD~ and HEAD, and vice versa.

That is, if we don't specify a pathspec, the performance difference [1]
is indistinguishable: both methods walk all the trees and take generally
the same amount of time (even with the index construction time included for
ensure_full_index()).

[1] Performance test result without pathspec (hence walking all trees):

	Command used:

		git grep --cached bogus

	Test                                HEAD~  HEAD
	---------------------------------------------------
	2000.78: git grep ... (full-v3)     6.17   5.19 (≈)
	2000.79: git grep ... (full-v4)     6.19   5.46 (≈)
	2000.80: git grep ... (sparse-v3)   6.57   6.44 (≈)
	2000.81: git grep ... (sparse-v4)   6.65   6.28 (≈)

--------------------------
NEEDSWORK about submodules

There are a few NEEDSWORKs that belong to improvements beyond this
topic. See the NEEDSWORK in builtin/grep.c::grep_submodule() for
more context. The other two NEEDSWORKs in t1092 are also related.

Suggested-by: Derrick Stolee <derrickstolee@github.com>
Helped-by: Derrick Stolee <derrickstolee@github.com>
Helped-by: Victoria Dye <vdye@github.com>
Helped-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>
---
 builtin/grep.c                           | 48 +++++++++++++++-
 t/perf/p2000-sparse-operations.sh        |  1 +
 t/t1092-sparse-checkout-compatibility.sh | 72 ++++++++++++++++++++++++
 3 files changed, 118 insertions(+), 3 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index e6bcdf860c..5fa927d4e2 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -458,6 +458,33 @@ static int grep_submodule(struct grep_opt *opt,
 	 * subrepo's odbs to the in-memory alternates list.
 	 */
 	obj_read_lock();
+
+	/*
+	 * NEEDSWORK: when reading a submodule, the sparsity settings in the
+	 * superproject are incorrectly forgotten or misused. For example:
+	 *
+	 * 1. "command_requires_full_index"
+	 * 	When this setting is turned on for `grep`, only the superproject
+	 *	knows it. All the submodules are read with their own configs
+	 *	and get prepare_repo_settings()'d. Therefore, these submodules
+	 *	"forget" the sparse-index feature switch. As a result, the index
+	 *	of these submodules are expanded unexpectedly.
+	 *
+	 * 2. "core_apply_sparse_checkout"
+	 *	When running `grep` in the superproject, this setting is
+	 *	populated using the superproject's configs. However, once
+	 *	initialized, this config is globally accessible and is read by
+	 *	prepare_repo_settings() for the submodules. For instance, if a
+	 *	submodule is using a sparse-checkout, however, the superproject
+	 *	is not, the result is that the config from the superproject will
+	 *	dictate the behavior for the submodule, making it "forget" its
+	 *	sparse-checkout state.
+	 *
+	 * 3. "core_sparse_checkout_cone"
+	 *	ditto.
+	 *
+	 * Note that this list is not exhaustive.
+	 */
 	repo_read_gitmodules(subrepo, 0);
 
 	/*
@@ -520,8 +547,6 @@ static int grep_cache(struct grep_opt *opt,
 	if (repo_read_index(repo) < 0)
 		die(_("index file corrupt"));
 
-	/* TODO: audit for interaction with sparse-index. */
-	ensure_full_index(repo->index);
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
 
@@ -530,8 +555,20 @@ static int grep_cache(struct grep_opt *opt,
 
 		strbuf_setlen(&name, name_base_len);
 		strbuf_addstr(&name, ce->name);
+		if (S_ISSPARSEDIR(ce->ce_mode)) {
+			enum object_type type;
+			struct tree_desc tree;
+			void *data;
+			unsigned long size;
 
-		if (S_ISREG(ce->ce_mode) &&
+			data = read_object_file(&ce->oid, &type, &size);
+			init_tree_desc(&tree, data, size);
+
+			hit |= grep_tree(opt, pathspec, &tree, &name, 0, 0);
+			strbuf_setlen(&name, name_base_len);
+			strbuf_addstr(&name, ce->name);
+			free(data);
+		} else if (S_ISREG(ce->ce_mode) &&
 		    match_pathspec(repo->index, pathspec, name.buf, name.len, 0, NULL,
 				   S_ISDIR(ce->ce_mode) ||
 				   S_ISGITLINK(ce->ce_mode))) {
@@ -984,6 +1021,11 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 			     PARSE_OPT_KEEP_DASHDASH |
 			     PARSE_OPT_STOP_AT_NON_OPTION);
 
+	if (the_repository->gitdir) {
+		prepare_repo_settings(the_repository);
+		the_repository->settings.command_requires_full_index = 0;
+	}
+
 	if (use_index && !startup_info->have_repository) {
 		int fallback = 0;
 		git_config_get_bool("grep.fallbacktonoindex", &fallback);
diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
index fce8151d41..3242cfe91a 100755
--- a/t/perf/p2000-sparse-operations.sh
+++ b/t/perf/p2000-sparse-operations.sh
@@ -124,5 +124,6 @@ test_perf_on_all git read-tree -mu HEAD
 test_perf_on_all git checkout-index -f --all
 test_perf_on_all git update-index --add --remove $SPARSE_CONE/a
 test_perf_on_all "git rm -f $SPARSE_CONE/a && git checkout HEAD -- $SPARSE_CONE/a"
+test_perf_on_all git grep --cached bogus -- "f2/f1/f1/*"
 
 test_done
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index b9350c075c..711b52fb46 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -162,6 +162,19 @@ init_repos () {
 	git -C sparse-index sparse-checkout set deep
 }
 
+init_repos_as_submodules () {
+	git reset --hard &&
+	init_repos &&
+	git submodule add ./full-checkout &&
+	git submodule add ./sparse-checkout &&
+	git submodule add ./sparse-index &&
+
+	git submodule status >actual &&
+	grep full-checkout actual &&
+	grep sparse-checkout actual &&
+	grep sparse-index actual
+}
+
 run_on_sparse () {
 	(
 		cd sparse-checkout &&
@@ -1981,4 +1994,63 @@ test_expect_success 'sparse index is not expanded: rm' '
 	ensure_not_expanded rm -r deep
 '
 
+test_expect_success 'grep with and --cached' '
+	init_repos &&
+
+	test_all_match git grep --cached a &&
+	test_all_match git grep --cached a -- "folder1/*"
+'
+
+test_expect_success 'grep is not expanded' '
+	init_repos &&
+
+	ensure_not_expanded grep a &&
+	ensure_not_expanded grep a -- deep/* &&
+
+	# All files within the folder1/* pathspec are sparse,
+	# so this command does not find any matches
+	ensure_not_expanded ! grep a -- folder1/* &&
+
+	# test out-of-cone pathspec with or without wildcard
+	ensure_not_expanded grep --cached a -- "folder1/a" &&
+	ensure_not_expanded grep --cached a -- "folder1/*" &&
+
+	# test in-cone pathspec with or without wildcard
+	ensure_not_expanded grep --cached a -- "deep/a" &&
+	ensure_not_expanded grep --cached a -- "deep/*"
+'
+
+# NEEDSWORK: when running `grep` in the superproject with --recurse-submodules,
+# Git expands the index of the submodules unexpectedly. Even though `grep`
+# builtin is marked as "command_requires_full_index = 0", this config is only
+# useful for the superproject. Namely, the submodules have their own configs,
+# which are _not_ populated by the one-time sparse-index feature switch.
+test_expect_failure 'grep within submodules is not expanded' '
+	init_repos_as_submodules &&
+
+	# do not use ensure_not_expanded() here, because `grep` should be
+	# run in the superproject, not in "./sparse-index"
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
+	git grep --cached --recurse-submodules a -- "*/folder1/*" &&
+	test_region ! index ensure_full_index trace2.txt
+'
+
+# NEEDSWORK: this test is not actually testing the code. The design purpose
+# of this test is to verify the grep result when the submodules are using a
+# sparse-index. Namely, we want "folder1/" as a tree (a sparse directory); but
+# because of the index expansion, we are now grepping the "folder1/a" blob.
+# Because of the problem stated above 'grep within submodules is not expanded',
+# we don't have the ideal test environment yet.
+test_expect_success 'grep sparse directory within submodules' '
+	init_repos_as_submodules &&
+
+	cat >expect <<-\EOF &&
+	full-checkout/folder1/a:a
+	sparse-checkout/folder1/a:a
+	sparse-index/folder1/a:a
+	EOF
+	git grep --cached --recurse-submodules a -- "*/folder1/*" >actual &&
+	test_cmp actual expect
+'
+
 test_done
-- 
2.37.0



* Re: [PATCH v6 0/1] grep: integrate with sparse index
  2022-09-23  4:18 ` [PATCH v6 0/1] grep: integrate with sparse index Shaoxuan Yuan
  2022-09-23  4:18   ` [PATCH v6 1/1] builtin/grep.c: " Shaoxuan Yuan
@ 2022-09-23 14:13   ` Derrick Stolee
  2022-09-23 16:01   ` Victoria Dye
  2 siblings, 0 replies; 69+ messages in thread
From: Derrick Stolee @ 2022-09-23 14:13 UTC (permalink / raw)
  To: Shaoxuan Yuan, git; +Cc: gitster, vdye, newren, avarab

On 9/23/2022 12:18 AM, Shaoxuan Yuan wrote:
> Integrate `git-grep` with sparse-index and test the performance
> improvement.
> 
> Changes since v5
> ----------------
> 
> * Drop the `--sparse` option patch and edit corresponding tests. 
>   We can wait until a better name is decided to replace `--sparse`.
> 
> * Modify the commit message, especially get rid of the `--sparse`
>   occurrences.

It's nice that now that you are calling grep_tree() when reaching a
sparse directory entry, you can still have all of the ensure_not_expanded
tests work even without --sparse.

There is definitely room for improving the user experience to focus on
the sparse cone by implementing a replacement for --sparse in the future,
especially for users with partial clones.

But this patch stands on its own. Thank you for your hard work here.

Thanks,
-Stolee


* Re: [PATCH v6 0/1] grep: integrate with sparse index
  2022-09-23  4:18 ` [PATCH v6 0/1] grep: integrate with sparse index Shaoxuan Yuan
  2022-09-23  4:18   ` [PATCH v6 1/1] builtin/grep.c: " Shaoxuan Yuan
  2022-09-23 14:13   ` [PATCH v6 0/1] grep: " Derrick Stolee
@ 2022-09-23 16:01   ` Victoria Dye
  2022-09-23 17:08     ` Junio C Hamano
  2 siblings, 1 reply; 69+ messages in thread
From: Victoria Dye @ 2022-09-23 16:01 UTC (permalink / raw)
  To: Shaoxuan Yuan, git; +Cc: derrickstolee, gitster, newren, avarab

Shaoxuan Yuan wrote:
> Integrate `git-grep` with sparse-index and test the performance
> improvement.
> 
> Changes since v5
> ----------------
> 
> * Drop the `--sparse` option patch and edit corresponding tests. 
>   We can wait until a better name is decided to replace `--sparse`.
> 
> * Modify the commit message, especially get rid of the `--sparse`
>   occurrences.
> 

Thanks for the update! Everything in this patch is either part of the
previous version's patch 3 or comes from the tests & sparse index enabling
of the previous patch 2. The resulting patch enables the sparse index for
all usage of '--cached', and avoids any user option changes. 

All that to say, this version looks good to me!

Thanks!
- Victoria


* Re: [PATCH v6 1/1] builtin/grep.c: integrate with sparse index
  2022-09-23  4:18   ` [PATCH v6 1/1] builtin/grep.c: " Shaoxuan Yuan
@ 2022-09-23 16:40     ` Junio C Hamano
  2022-09-23 16:58     ` Junio C Hamano
  1 sibling, 0 replies; 69+ messages in thread
From: Junio C Hamano @ 2022-09-23 16:40 UTC (permalink / raw)
  To: Shaoxuan Yuan; +Cc: git, derrickstolee, vdye, newren, avarab

Shaoxuan Yuan <shaoxuan.yuan02@gmail.com> writes:

> - Command used for testing:
>
> 	git grep --cached bogus -- "f2/f1/f1/*"
>
> The reason for specifying a pathspec is that, if we don't specify a
> pathspec, then grep_tree() will walk all the trees and blobs to find the
> pattern, and the time consumed doing so is not too different from using
> the original ensure_full_index() method, which also spends most of the
> time walking trees. However, when a pathspec is specified, this latest
> logic will only walk the area of trees enclosed by the pathspec, and the
> time consumed is reasonably a lot less.

Good.  So without pathspec, we lazily populate the index and catch
matches even from outside the sparse cone.  We punt to "implicitly"
apply the sparse cone(s) as a pathspec that limits the hits to the
paths in the sparse cone(s).

> That is, if we don't specify a pathspec, the performance difference [1]
> is indistinguishable: both methods walk all the trees and take generally
> the same amount of time (even with the index construction time included for
> ensure_full_index()).

Good.


* Re: [PATCH v6 1/1] builtin/grep.c: integrate with sparse index
  2022-09-23  4:18   ` [PATCH v6 1/1] builtin/grep.c: " Shaoxuan Yuan
  2022-09-23 16:40     ` Junio C Hamano
@ 2022-09-23 16:58     ` Junio C Hamano
  2022-09-26 17:28       ` Junio C Hamano
  1 sibling, 1 reply; 69+ messages in thread
From: Junio C Hamano @ 2022-09-23 16:58 UTC (permalink / raw)
  To: Shaoxuan Yuan; +Cc: git, derrickstolee, vdye, newren, avarab

Shaoxuan Yuan <shaoxuan.yuan02@gmail.com> writes:

> +test_expect_success 'grep with and --cached' '

"with and --cached"?  "with and without --cached" is probably a good
thing to test but you may need to add tests for "with" case, too?

> +	init_repos &&
> +
> +	test_all_match git grep --cached a &&
> +	test_all_match git grep --cached a -- "folder1/*"
> +'

The above is very relevant for the purpose of ...

> -	/* TODO: audit for interaction with sparse-index. */
> -	ensure_full_index(repo->index);

... auditing.  Run the command with pathspecs that specify areas
inside and outside the sparse cone(s) and ensure the results match
those in a non-sparse-index, with test_all_match().

As to the lack of the tests WITHOUT "--cached", I suspect that it is
omitted because there are no checked-out copies to grep in, but I
suspect that it is papering over a buggy design.  If we do not by
default limit the operation only to paths inside sparse cone(s),
shouldn't we be treating the paths outside as if they exist with the
same contents as they are in the index (and unmodified)?  If we take
the position that "working tree files on paths outside the sparse
cone(s) do not exist", "git diff" would need to say that they are
all removed to be consistent, which probably is not what we want to
see.

> +test_expect_success 'grep is not expanded' '
> +	init_repos &&
> +
> +	ensure_not_expanded grep a &&
> +	ensure_not_expanded grep a -- deep/* &&
> +
> +	# All files within the folder1/* pathspec are sparse,
> +	# so this command does not find any matches
> +	ensure_not_expanded ! grep a -- folder1/* &&
> +
> +	# test out-of-cone pathspec with or without wildcard
> +	ensure_not_expanded grep --cached a -- "folder1/a" &&
> +	ensure_not_expanded grep --cached a -- "folder1/*" &&
> +
> +	# test in-cone pathspec with or without wildcard
> +	ensure_not_expanded grep --cached a -- "deep/a" &&
> +	ensure_not_expanded grep --cached a -- "deep/*"
> +'

It is not wrong per-se, but I am not sure how relevant these tests
are.

The implementation of ensure_not_expanded very intimately knows
that a call to ensure_full_index() is the one we are trying to avoid
(and we do not even detect if another way to fully expand the index
is invented and used), and we know we are removing the only call to
the function in "git grep".

Thanks.
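
To make the point about ensure_not_expanded concrete: the check amounts
to running the command with GIT_TRACE2_EVENT set and then looking for an
ensure_full_index region in the log, as the
`test_region ! index ensure_full_index trace2.txt` line in the patch
does by hand. The sketch below mimics that check on a hand-written log.
has_region and the sample events are simplified stand-ins assumed for
this example, not the test library's actual code or real trace2 output.

```shell
# has_region CATEGORY LABEL FILE: succeed if FILE contains a
# "region_enter" event for the given category and label.
has_region () {
	grep -q "\"region_enter\".*\"category\":\"$1\",\"label\":\"$2\"" "$3"
}

# A fabricated, heavily trimmed event log (illustrative only).
log=$(mktemp)
cat >"$log" <<\EOF
{"event":"start","argv":["git","grep","--cached","a"]}
{"event":"region_enter","category":"index","label":"do_read_index"}
{"event":"region_leave","category":"index","label":"do_read_index"}
EOF

if has_region index ensure_full_index "$log"
then
	echo "index was expanded"
else
	echo "index was not expanded"
fi
rm -f "$log"
```

Because the match is on one specific label, an expansion performed
through some other code path would go unnoticed, which is the
limitation described above.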


* Re: [PATCH v6 0/1] grep: integrate with sparse index
  2022-09-23 16:01   ` Victoria Dye
@ 2022-09-23 17:08     ` Junio C Hamano
  0 siblings, 0 replies; 69+ messages in thread
From: Junio C Hamano @ 2022-09-23 17:08 UTC (permalink / raw)
  To: Victoria Dye; +Cc: Shaoxuan Yuan, git, derrickstolee, newren, avarab

Victoria Dye <vdye@github.com> writes:

> Shaoxuan Yuan wrote:
>> Integrate `git-grep` with sparse-index and test the performance
>> improvement.
>> 
>> Changes since v5
>> ----------------
>> 
>> * Drop the `--sparse` option patch and edit corresponding tests. 
>>   We can wait until a better name is decided to replace `--sparse`.
>> 
>> * Modify the commit message, especially get rid of the `--sparse`
>>   occurrences.
>> 
>
> Thanks for the update! Everything in this patch is either part of the
> previous version's patch 3 or comes from the tests & sparse index enabling
> of the previous patch 2. The resulting patch enables the sparse index for
> all usage of '--cached', and avoids any user option changes. 
>
> All that to say, this version looks good to me!

Thanks, all.  Captured but outside the upcoming release so expect
that it will be slow to merge into any of the integration branches.


* Re: [PATCH v6 1/1] builtin/grep.c: integrate with sparse index
  2022-09-23 16:58     ` Junio C Hamano
@ 2022-09-26 17:28       ` Junio C Hamano
  0 siblings, 0 replies; 69+ messages in thread
From: Junio C Hamano @ 2022-09-26 17:28 UTC (permalink / raw)
  To: Shaoxuan Yuan; +Cc: git, derrickstolee, vdye, newren, avarab

Junio C Hamano <gitster@pobox.com> writes:

> Shaoxuan Yuan <shaoxuan.yuan02@gmail.com> writes:
>
>> +test_expect_success 'grep with and --cached' '
>
> "with and --cached"?  "with and without --cached" is probably a good
> thing to test but you may need to add tests for "with" case, too?

I meant "for WITHOUT case, too", but ...

>> +	init_repos &&
>> +
>> +	test_all_match git grep --cached a &&
>> +	test_all_match git grep --cached a -- "folder1/*"
>> +'
>
> The above is very relevant for the purpose of ...
>
>> -	/* TODO: audit for interaction with sparse-index. */
>> -	ensure_full_index(repo->index);
>
> ... auditing.  Run the command with pathspecs that specify areas
> inside and outside the sparse cone(s) and ensure the results match
> those in a non-sparse-index, with test_all_match().
>
> As to the lack of the tests WITHOUT "--cached", I suspect that it is
> omitted because there are no checked-out copies to grep in, but I
> suspect that it is papering over a buggy design.

... in light of the recent "sparse-checkout.txt: ... directions"
document patch by Elijah

  http://lore.kernel.org/git/pull.1367.git.1664064588846.gitgitgadget@gmail.com/

I think I was quite mistaken.  The guiding principle should not be
to pretend that the paths stubbed out with sparse checkout mechanism
are unchanged from HEAD.  It should be to pretend that they do not
exist and they never existed.

So it is perfectly expected that the output with and without
"--cached" are different.  The former (without an option to ignore
paths outside the sparse checkout even for in-repository data)
should find stuff from in-tree, while the latter should look for
things only in the checked out files.
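
The difference described above can be seen in a throwaway repository.
This is a sketch that assumes a `git` binary with the sparse-checkout
builtin and the behavior discussed in this thread, where working tree
searches skip paths outside the sparse checkout. The repository layout
and the identity values are invented for the demo.

```shell
# Create a scratch repository with one in-cone and one out-of-cone file.
repo=$(mktemp -d)
git init -q "$repo"
mkdir "$repo/deep" "$repo/folder1"
echo a >"$repo/deep/a"
echo a >"$repo/folder1/a"
git -C "$repo" add .
git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
	commit -q -m initial

# Restrict the checkout to deep/; folder1/ leaves the working tree.
git -C "$repo" sparse-checkout set deep

worktree_hits=$(git -C "$repo" grep a)		# checked-out files only
index_hits=$(git -C "$repo" grep --cached a)	# everything in the index

printf 'without --cached:\n%s\n' "$worktree_hits"
printf 'with --cached:\n%s\n' "$index_hits"
```

With the behavior described in this thread, the first search should
report only deep/a, while the second should also report folder1/a from
the index.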


end of thread

Thread overview: 69+ messages
2022-08-17  7:56 [PATCH v1 0/2] grep: integrate with sparse index Shaoxuan Yuan
2022-08-17  7:56 ` [PATCH v1 1/2] builtin/grep.c: add --sparse option Shaoxuan Yuan
2022-08-17 14:12   ` Derrick Stolee
2022-08-17 17:13     ` Junio C Hamano
2022-08-17 17:34       ` Victoria Dye
2022-08-17 17:43         ` Derrick Stolee
2022-08-17 18:47           ` Junio C Hamano
2022-08-17 17:37     ` Elijah Newren
2022-08-24 18:20     ` Shaoxuan Yuan
2022-08-24 19:08       ` Derrick Stolee
2022-08-17  7:56 ` [PATCH v1 2/2] builtin/grep.c: integrate with sparse index Shaoxuan Yuan
2022-08-17 14:23   ` Derrick Stolee
2022-08-24 21:06     ` Shaoxuan Yuan
2022-08-25  0:39       ` Derrick Stolee
2022-08-17 13:46 ` [PATCH v1 0/2] grep: " Derrick Stolee
2022-08-29 23:28 ` [PATCH v2 " Shaoxuan Yuan
2022-08-29 23:28   ` [PATCH v2 1/2] builtin/grep.c: add --sparse option Shaoxuan Yuan
2022-08-29 23:28   ` [PATCH v2 2/2] builtin/grep.c: integrate with sparse index Shaoxuan Yuan
2022-08-30 13:45     ` Derrick Stolee
2022-09-01  4:57 ` [PATCH v3 0/3] grep: " Shaoxuan Yuan
2022-09-01  4:57   ` [PATCH v3 1/3] builtin/grep.c: add --sparse option Shaoxuan Yuan
2022-09-01  4:57   ` [PATCH v3 2/3] builtin/grep.c: integrate with sparse index Shaoxuan Yuan
2022-09-01  4:57   ` [PATCH v3 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse Shaoxuan Yuan
2022-09-01 17:03     ` Derrick Stolee
2022-09-01 18:31       ` Shaoxuan Yuan
2022-09-01 17:17     ` Junio C Hamano
2022-09-01 17:27       ` Junio C Hamano
2022-09-01 22:49         ` Shaoxuan Yuan
2022-09-01 22:36       ` Shaoxuan Yuan
2022-09-02  3:28     ` Victoria Dye
2022-09-02 18:47       ` Shaoxuan Yuan
2022-09-03  0:36 ` [PATCH v4 0/3] grep: integrate with sparse index Shaoxuan Yuan
2022-09-03  0:36   ` [PATCH v4 1/3] builtin/grep.c: add --sparse option Shaoxuan Yuan
2022-09-03  0:36   ` [PATCH v4 2/3] builtin/grep.c: integrate with sparse index Shaoxuan Yuan
2022-09-03  0:36   ` [PATCH v4 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse Shaoxuan Yuan
2022-09-03  4:39     ` Junio C Hamano
2022-09-08  0:24       ` Shaoxuan Yuan
2022-09-08  0:18 ` [PATCH v5 0/3] grep: integrate with sparse index Shaoxuan Yuan
2022-09-08  0:18   ` [PATCH v5 1/3] builtin/grep.c: add --sparse option Shaoxuan Yuan
2022-09-10  1:07     ` Victoria Dye
2022-09-14  6:08     ` Elijah Newren
2022-09-15  2:57       ` Junio C Hamano
2022-09-18  2:14         ` Elijah Newren
2022-09-18 19:52           ` Victoria Dye
2022-09-19  1:23             ` Junio C Hamano
2022-09-19  4:27             ` Shaoxuan Yuan
2022-09-19 11:03             ` Ævar Arnfjörð Bjarmason
2022-09-20  7:13             ` Elijah Newren
2022-09-17  3:34       ` Shaoxuan Yuan
2022-09-18  4:24         ` Elijah Newren
2022-09-19  4:13           ` Shaoxuan Yuan
2022-09-17  3:45       ` Shaoxuan Yuan
2022-09-08  0:18   ` [PATCH v5 2/3] builtin/grep.c: integrate with sparse index Shaoxuan Yuan
2022-09-08  0:18   ` [PATCH v5 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse Shaoxuan Yuan
2022-09-08 17:59     ` Junio C Hamano
2022-09-08 20:46       ` Derrick Stolee
2022-09-08 20:56         ` Junio C Hamano
2022-09-08 21:06           ` Shaoxuan Yuan
2022-09-09 12:49           ` Derrick Stolee
2022-09-13 17:23         ` Junio C Hamano
2022-09-10  2:04     ` Victoria Dye
2022-09-23  4:18 ` [PATCH v6 0/1] grep: integrate with sparse index Shaoxuan Yuan
2022-09-23  4:18   ` [PATCH v6 1/1] builtin/grep.c: " Shaoxuan Yuan
2022-09-23 16:40     ` Junio C Hamano
2022-09-23 16:58     ` Junio C Hamano
2022-09-26 17:28       ` Junio C Hamano
2022-09-23 14:13   ` [PATCH v6 0/1] grep: " Derrick Stolee
2022-09-23 16:01   ` Victoria Dye
2022-09-23 17:08     ` Junio C Hamano
