Git Mailing List Archive on lore.kernel.org
 help / color / Atom feed
* [RFC PATCH 0/3] grep: honor sparse checkout and add option to ignore it
@ 2020-03-24  6:04 Matheus Tavares
  2020-03-24  6:11 ` [RFC PATCH 1/3] doc: grep: unify info on configuration variables Matheus Tavares
                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Matheus Tavares @ 2020-03-24  6:04 UTC (permalink / raw)
  To: git; +Cc: dstolee, newren

This series is based on the discussions we had some months ago[1], about
git-grep not currently honoring the sparsity patterns. To summarize, the
idea is that, since a sparse checkout is used to limit the set of files
in which users are interested, git-grep should, by default, only search
within this boundary.  But it would be good to also have an
'--ignore-sparsity' option, to restore the old behavior when needed, as
there are also valid use cases for it. The following patches seek to
address these suggestions. The first patch is not really related, it is
a cleanup, used by the third one.

[1]: https://lore.kernel.org/git/CAHd-oW7e5qCuxZLBeVDq+Th3E+E4+P8=WzJfK8WcG2yz=n_nag@mail.gmail.com/t/#u

Matheus Tavares (3):
  doc: grep: unify info on configuration variables
  grep: honor sparse checkout patterns
  grep: add option to ignore sparsity patterns

 Documentation/config/grep.txt    |  10 ++-
 Documentation/git-grep.txt       |  40 +++-------
 builtin/grep.c                   |  36 ++++++++-
 t/t7011-skip-worktree-reading.sh |   9 ---
 t/t7817-grep-sparse-checkout.sh  | 130 +++++++++++++++++++++++++++++++
 5 files changed, 180 insertions(+), 45 deletions(-)
 create mode 100755 t/t7817-grep-sparse-checkout.sh

-- 
2.25.1


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC PATCH 1/3] doc: grep: unify info on configuration variables
  2020-03-24  6:04 [RFC PATCH 0/3] grep: honor sparse checkout and add option to ignore it Matheus Tavares
@ 2020-03-24  6:11 ` Matheus Tavares
  2020-03-24  7:57   ` Elijah Newren
  2020-03-24  6:12 ` [RFC PATCH 2/3] grep: honor sparse checkout patterns Matheus Tavares
  2020-03-24  6:13 ` [RFC PATCH 3/3] grep: add option to ignore sparsity patterns Matheus Tavares
  2 siblings, 1 reply; 26+ messages in thread
From: Matheus Tavares @ 2020-03-24  6:11 UTC (permalink / raw)
  To: git; +Cc: dstolee, newren, sandals

Explanations about the configuration variables for git-grep are
duplicated in "Documentation/git-grep.txt" and
"Documentation/config/grep.txt". Let's unify the information in the
second file and include it in the first.

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
 Documentation/config/grep.txt |  7 +++++--
 Documentation/git-grep.txt    | 35 +++++------------------------------
 2 files changed, 10 insertions(+), 32 deletions(-)

diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
index 44abe45a7c..76689771aa 100644
--- a/Documentation/config/grep.txt
+++ b/Documentation/config/grep.txt
@@ -16,8 +16,11 @@ grep.extendedRegexp::
 	other than 'default'.
 
 grep.threads::
-	Number of grep worker threads to use.
-	See `grep.threads` in linkgit:git-grep[1] for more information.
+	Number of grep worker threads to use. See `--threads` in
+	linkgit:git-grep[1] for more information.
+
+grep.fullName::
+	If set to true, enable `--full-name` option by default.
 
 grep.fallbackToNoIndex::
 	If set to true, fall back to git grep --no-index if git grep
diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index ddb6acc025..97e25d7b1b 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -41,34 +41,7 @@ characters.  An empty string as search expression matches all lines.
 CONFIGURATION
 -------------
 
-grep.lineNumber::
-	If set to true, enable `-n` option by default.
-
-grep.column::
-	If set to true, enable the `--column` option by default.
-
-grep.patternType::
-	Set the default matching behavior. Using a value of 'basic', 'extended',
-	'fixed', or 'perl' will enable the `--basic-regexp`, `--extended-regexp`,
-	`--fixed-strings`, or `--perl-regexp` option accordingly, while the
-	value 'default' will return to the default matching behavior.
-
-grep.extendedRegexp::
-	If set to true, enable `--extended-regexp` option by default. This
-	option is ignored when the `grep.patternType` option is set to a value
-	other than 'default'.
-
-grep.threads::
-	Number of grep worker threads to use. If unset (or set to 0), Git will
-	use as many threads as the number of logical cores available.
-
-grep.fullName::
-	If set to true, enable `--full-name` option by default.
-
-grep.fallbackToNoIndex::
-	If set to true, fall back to git grep --no-index if git grep
-	is executed outside of a git repository.  Defaults to false.
-
+include::config/grep.txt[]
 
 OPTIONS
 -------
@@ -267,8 +240,10 @@ providing this option will cause it to die.
 	found.
 
 --threads <num>::
-	Number of grep worker threads to use.
-	See `grep.threads` in 'CONFIGURATION' for more information.
+	Number of grep worker threads to use. If not provided (or set to
+	0), Git will use as many worker threads as the number of logical
+	cores available. The default value can also be set with the
+	`grep.threads` configuration (see linkgit:git-config[1]).
 
 -f <file>::
 	Read patterns from <file>, one per line.
-- 
2.25.1


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC PATCH 2/3] grep: honor sparse checkout patterns
  2020-03-24  6:04 [RFC PATCH 0/3] grep: honor sparse checkout and add option to ignore it Matheus Tavares
  2020-03-24  6:11 ` [RFC PATCH 1/3] doc: grep: unify info on configuration variables Matheus Tavares
@ 2020-03-24  6:12 ` Matheus Tavares
  2020-03-24  7:15   ` Elijah Newren
  2020-03-24  6:13 ` [RFC PATCH 3/3] grep: add option to ignore sparsity patterns Matheus Tavares
  2 siblings, 1 reply; 26+ messages in thread
From: Matheus Tavares @ 2020-03-24  6:12 UTC (permalink / raw)
  To: git; +Cc: dstolee, newren, sandals, stefanbeller

One of the main uses for a sparse checkout is to allow users to focus on
the subset of files in a repository in which they are interested. But
git-grep currently ignores the sparsity patterns and report all matches
found outside this subset, which kind of goes in the oposity direction.
Let's fix that, making it honor the sparsity boundaries for every
grepping case:

- git grep in worktree
- git grep --cached
- git grep $REVISION
- git grep --untracked and git grep --no-index (which already respect
  sparse checkout boundaries)

This is also what some users reported[1] they would want as the default
behavior.

Note: for `git grep $REVISION`, we will choose to honor the sparsity
patterns only when $REVISION is a commit-ish object. The reason is that,
for a tree, we don't know whether it represents the root of a
repository or a subtree. So we wouldn't be able to correctly match it
against the sparsity patterns. E.g. suppose we have a repository with
these two sparsity rules: "/*" and "!/a"; and the following structure:

/
| - a (file)
| - d (dir)
    | - a (file)

If `git grep $REVISION` were to honor the sparsity patterns for every
object type, when grepping the /d tree, we would wrongly ignore the /d/a
file. This happens because we wouldn't know it resides in /d and
therefore it would wrongly match the pattern "!/a". Furthermore, for a
search in a blob object, we wouldn't even have a path to check the
patterns against. So, let's ignore the sparsity patterns when grepping
non-commit-ish objects (tags to commits should be fine).

Finally, the old behavior is still desirable for some use cases. So the
next patch will add an option to allow restoring it when needed.

[1]: https://lore.kernel.org/git/CABPp-BGuFhDwWZBRaD3nA8ui46wor-4=Ha1G1oApsfF8KNpfGQ@mail.gmail.com/

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---

Something I'm not entirely sure in this patch is how we implement the
mechanism to honor sparsity for the `git grep <commit-ish>` case (which
is treated in the grep_tree() function). Currently, the patch looks for
an index entry that matches the path, and then checks its skip_worktree
bit. But this operation is perfomed in O(log(N)); N being the number of
index entries. If there are many entries (and no so many sparsity
patterns), maybe a better approach would be to try matching the path
directly against the sparsity patterns. This would be O(M) in the number
of patterns, and it could be done, in builtin/grep.c, with a function
like the following:

static struct pattern_list sparsity_patterns;
static int sparsity_patterns_initialized = 0;
static enum pattern_match_result path_matches_sparsity_patterns(
					const char *path, int pathlen,
					const char *basename,
					struct repository *repo)
{
	int dtype = DT_UNKNOWN;

	if (!sparsity_patterns_initialized) {
		char *sparse_file = git_pathdup("info/sparse-checkout");
		int ret;

		memset(&sparsity_patterns, 0, sizeof(sparsity_patterns));
		sparsity_patterns.use_cone_patterns = core_sparse_checkout_cone;
		ret = add_patterns_from_file_to_list(sparse_file, "", 0,
						     &sparsity_patterns, NULL);
		free(sparse_file);

		if (ret < 0)
			die(_("failed to load sparse-checkout patterns"));
		sparsity_patterns_initialized = 1;
	}

	return path_matches_pattern_list(path, pathlen, basename, &dtype,
					 &sparsity_patterns, repo->index);
}

Also, if I understand correctly, the index doesn't hold paths to dirs,
right? So even if a complete dir is excluded from sparse checkout, we
still have to check all its subentries, only to discover that they
should all be skipped from the search. However, if we were to check
against the sparsity patterns directly (e.g. with the function above),
we could skip such directories together with all their entries.

Oh, and there is also the case of a commit whose tree paths are not in
the index (maybe manually created objects?). For such commits, with the
index lookup approach, we would have to fall back on ignoring the
sparsity rules. I'm not sure if that would be OK, though.

Any thoughts on these two approaches (looking up the skip_worktree bit
in the index or directly matching against sparsity patterns), will be
highly appreciated. (Note that it only concerns the `git grep
<commit-ish>` case. The other cases already iterate thought the index, so
there is no O(log(N)) extra complexity).

 builtin/grep.c                   | 29 ++++++++---
 t/t7011-skip-worktree-reading.sh |  9 ----
 t/t7817-grep-sparse-checkout.sh  | 88 ++++++++++++++++++++++++++++++++
 3 files changed, 111 insertions(+), 15 deletions(-)
 create mode 100755 t/t7817-grep-sparse-checkout.sh

diff --git a/builtin/grep.c b/builtin/grep.c
index 99e2685090..52ec72a036 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -388,7 +388,7 @@ static int grep_cache(struct grep_opt *opt,
 		      const struct pathspec *pathspec, int cached);
 static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 		     struct tree_desc *tree, struct strbuf *base, int tn_len,
-		     int check_attr);
+		     int from_commit);
 
 static int grep_submodule(struct grep_opt *opt,
 			  const struct pathspec *pathspec,
@@ -486,6 +486,10 @@ static int grep_cache(struct grep_opt *opt,
 
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
+
+		if (ce_skip_worktree(ce))
+			continue;
+
 		strbuf_setlen(&name, name_base_len);
 		strbuf_addstr(&name, ce->name);
 
@@ -498,8 +502,7 @@ static int grep_cache(struct grep_opt *opt,
 			 * cache entry are identical, even if worktree file has
 			 * been modified, so use cache version instead
 			 */
-			if (cached || (ce->ce_flags & CE_VALID) ||
-			    ce_skip_worktree(ce)) {
+			if (cached || (ce->ce_flags & CE_VALID)) {
 				if (ce_stage(ce) || ce_intent_to_add(ce))
 					continue;
 				hit |= grep_oid(opt, &ce->oid, name.buf,
@@ -532,7 +535,7 @@ static int grep_cache(struct grep_opt *opt,
 
 static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 		     struct tree_desc *tree, struct strbuf *base, int tn_len,
-		     int check_attr)
+		     int from_commit)
 {
 	struct repository *repo = opt->repo;
 	int hit = 0;
@@ -546,6 +549,9 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 		name_base_len = name.len;
 	}
 
+	if (from_commit && repo_read_index(repo) < 0)
+		die(_("index file corrupt"));
+
 	while (tree_entry(tree, &entry)) {
 		int te_len = tree_entry_len(&entry);
 
@@ -564,9 +570,20 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 
 		strbuf_add(base, entry.path, te_len);
 
+		if (from_commit) {
+			int pos = index_name_pos(repo->index,
+						 base->buf + tn_len,
+						 base->len - tn_len);
+			if (pos >= 0 &&
+			    ce_skip_worktree(repo->index->cache[pos])) {
+				strbuf_setlen(base, old_baselen);
+				continue;
+			}
+		}
+
 		if (S_ISREG(entry.mode)) {
 			hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
-					 check_attr ? base->buf + tn_len : NULL);
+					from_commit ? base->buf + tn_len : NULL);
 		} else if (S_ISDIR(entry.mode)) {
 			enum object_type type;
 			struct tree_desc sub;
@@ -581,7 +598,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 			strbuf_addch(base, '/');
 			init_tree_desc(&sub, data, size);
 			hit |= grep_tree(opt, pathspec, &sub, base, tn_len,
-					 check_attr);
+					 from_commit);
 			free(data);
 		} else if (recurse_submodules && S_ISGITLINK(entry.mode)) {
 			hit |= grep_submodule(opt, pathspec, &entry.oid,
diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh
index 37525cae3a..26852586ac 100755
--- a/t/t7011-skip-worktree-reading.sh
+++ b/t/t7011-skip-worktree-reading.sh
@@ -109,15 +109,6 @@ test_expect_success 'ls-files --modified' '
 	test -z "$(git ls-files -m)"
 '
 
-test_expect_success 'grep with skip-worktree file' '
-	git update-index --no-skip-worktree 1 &&
-	echo test > 1 &&
-	git update-index 1 &&
-	git update-index --skip-worktree 1 &&
-	rm 1 &&
-	test "$(git grep --no-ext-grep test)" = "1:test"
-'
-
 echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A	1" > expected
 test_expect_success 'diff-index does not examine skip-worktree absent entries' '
 	setup_absent &&
diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
new file mode 100755
index 0000000000..fccf44e829
--- /dev/null
+++ b/t/t7817-grep-sparse-checkout.sh
@@ -0,0 +1,88 @@
+#!/bin/sh
+
+test_description='grep in sparse checkout
+
+This test creates the following dir structure:
+.
+| - a
+| - b
+| - dir
+    | - c
+
+Only "a" should be present due to the sparse checkout patterns:
+"/*", "!/b" and "!/dir".
+'
+
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+	echo "text" >a &&
+	echo "text" >b &&
+	mkdir dir &&
+	echo "text" >dir/c &&
+	git add a b dir &&
+	git commit -m "initial commit" &&
+	git tag -am t-commit t-commit HEAD &&
+	tree=$(git rev-parse HEAD^{tree}) &&
+	git tag -am t-tree t-tree $tree &&
+	cat >.git/info/sparse-checkout <<-EOF &&
+	/*
+	!/b
+	!/dir
+	EOF
+	git sparse-checkout init &&
+	test_path_is_missing b &&
+	test_path_is_missing dir &&
+	test_path_is_file a
+'
+
+test_expect_success 'grep in working tree should honor sparse checkout' '
+	cat >expect <<-EOF &&
+	a:text
+	EOF
+	git grep "text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep --cached should honor sparse checkout' '
+	cat >expect <<-EOF &&
+	a:text
+	EOF
+	git grep --cached "text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep <commit-ish> should honor sparse checkout' '
+	commit=$(git rev-parse HEAD) &&
+	cat >expect_commit <<-EOF &&
+	$commit:a:text
+	EOF
+	cat >expect_t-commit <<-EOF &&
+	t-commit:a:text
+	EOF
+	git grep "text" $commit >actual_commit &&
+	test_cmp expect_commit actual_commit &&
+	git grep "text" t-commit >actual_t-commit &&
+	test_cmp expect_t-commit actual_t-commit
+'
+
+test_expect_success 'grep <tree-ish> should search outside sparse checkout' '
+	commit=$(git rev-parse HEAD) &&
+	tree=$(git rev-parse HEAD^{tree}) &&
+	cat >expect_tree <<-EOF &&
+	$tree:a:text
+	$tree:b:text
+	$tree:dir/c:text
+	EOF
+	cat >expect_t-tree <<-EOF &&
+	t-tree:a:text
+	t-tree:b:text
+	t-tree:dir/c:text
+	EOF
+	git grep "text" $tree >actual_tree &&
+	test_cmp expect_tree actual_tree &&
+	git grep "text" t-tree >actual_t-tree &&
+	test_cmp expect_t-tree actual_t-tree
+'
+
+test_done
-- 
2.25.1


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-24  6:04 [RFC PATCH 0/3] grep: honor sparse checkout and add option to ignore it Matheus Tavares
  2020-03-24  6:11 ` [RFC PATCH 1/3] doc: grep: unify info on configuration variables Matheus Tavares
  2020-03-24  6:12 ` [RFC PATCH 2/3] grep: honor sparse checkout patterns Matheus Tavares
@ 2020-03-24  6:13 ` Matheus Tavares
  2020-03-24  7:54   ` Elijah Newren
  2 siblings, 1 reply; 26+ messages in thread
From: Matheus Tavares @ 2020-03-24  6:13 UTC (permalink / raw)
  To: git; +Cc: dstolee, newren, pclouds

In the last commit, git-grep learned to honor sparsity patterns. For
some use cases, however, it may be desirable to search outside the
sparse checkout. So add the '--ignore-sparsity' option, which restores
the old behavior. Also add the grep.ignoreSparsity configuration, to
allow setting this behavior by default.

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---

Note: I still have to make --ignore-sparsity be able to work together
with --untracked. Unfortunatelly, this won't be as simple because the
codeflow taken by --untracked goes to grep_directory() which just
iterates the working tree, without looking the index entries. So I will
have to either: make --untracked use grep_cache(), and grep the
untracked files later; or try matching the working tree paths against
the sparsity patterns, without looking for the skip_worktree bit in
the index (as I mentioned in the previous patch's comments). Any
preferences regarding these two approaches? (or other suggestions?)

 Documentation/config/grep.txt   |  3 +++
 Documentation/git-grep.txt      |  5 ++++
 builtin/grep.c                  | 19 +++++++++++----
 t/t7817-grep-sparse-checkout.sh | 42 +++++++++++++++++++++++++++++++++
 4 files changed, 65 insertions(+), 4 deletions(-)

diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
index 76689771aa..c1d49484c8 100644
--- a/Documentation/config/grep.txt
+++ b/Documentation/config/grep.txt
@@ -25,3 +25,6 @@ grep.fullName::
 grep.fallbackToNoIndex::
 	If set to true, fall back to git grep --no-index if git grep
 	is executed outside of a git repository.  Defaults to false.
+
+grep.ignoreSparsity::
+	If set to true, enable `--ignore-sparsity` by default.
diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index 97e25d7b1b..5c5c66c056 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -65,6 +65,11 @@ OPTIONS
 	mechanism.  Only useful when searching files in the current directory
 	with `--no-index`.
 
+--ignore-sparsity::
+	In a sparse checked out repository (see linkgit:git-sparse-checkout[1]),
+	also search in files that are outside the sparse checkout. This option
+	cannot be used with --no-index or --untracked.
+
 --recurse-submodules::
 	Recursively search in each submodule that has been initialized and
 	checked out in the repository.  When used in combination with the
diff --git a/builtin/grep.c b/builtin/grep.c
index 52ec72a036..17eae3edd6 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -33,6 +33,8 @@ static char const * const grep_usage[] = {
 
 static int recurse_submodules;
 
+static int ignore_sparsity = 0;
+
 static int num_threads;
 
 static pthread_t *threads;
@@ -292,6 +294,9 @@ static int grep_cmd_config(const char *var, const char *value, void *cb)
 	if (!strcmp(var, "submodule.recurse"))
 		recurse_submodules = git_config_bool(var, value);
 
+	if (!strcmp(var, "grep.ignoresparsity"))
+		ignore_sparsity = git_config_bool(var, value);
+
 	return st;
 }
 
@@ -487,7 +492,7 @@ static int grep_cache(struct grep_opt *opt,
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
 
-		if (ce_skip_worktree(ce))
+		if (!ignore_sparsity && ce_skip_worktree(ce))
 			continue;
 
 		strbuf_setlen(&name, name_base_len);
@@ -502,7 +507,8 @@ static int grep_cache(struct grep_opt *opt,
 			 * cache entry are identical, even if worktree file has
 			 * been modified, so use cache version instead
 			 */
-			if (cached || (ce->ce_flags & CE_VALID)) {
+			if (cached || (ce->ce_flags & CE_VALID) ||
+			    ce_skip_worktree(ce)) {
 				if (ce_stage(ce) || ce_intent_to_add(ce))
 					continue;
 				hit |= grep_oid(opt, &ce->oid, name.buf,
@@ -549,7 +555,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 		name_base_len = name.len;
 	}
 
-	if (from_commit && repo_read_index(repo) < 0)
+	if (!ignore_sparsity && from_commit && repo_read_index(repo) < 0)
 		die(_("index file corrupt"));
 
 	while (tree_entry(tree, &entry)) {
@@ -570,7 +576,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 
 		strbuf_add(base, entry.path, te_len);
 
-		if (from_commit) {
+		if (!ignore_sparsity && from_commit) {
 			int pos = index_name_pos(repo->index,
 						 base->buf + tn_len,
 						 base->len - tn_len);
@@ -932,6 +938,8 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 		OPT_BOOL_F(0, "ext-grep", &external_grep_allowed__ignored,
 			   N_("allow calling of grep(1) (ignored by this build)"),
 			   PARSE_OPT_NOCOMPLETE),
+		OPT_BOOL(0, "ignore-sparsity", &ignore_sparsity,
+			 N_("also search in files outside the sparse checkout")),
 		OPT_END()
 	};
 
@@ -1073,6 +1081,9 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 	if (recurse_submodules && untracked)
 		die(_("--untracked not supported with --recurse-submodules"));
 
+	if (ignore_sparsity && (!use_index || untracked))
+		die(_("--no-index or --untracked cannot be used with --ignore-sparsity"));
+
 	if (show_in_pager) {
 		if (num_threads > 1)
 			warning(_("invalid option combination, ignoring --threads"));
diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
index fccf44e829..1891ddea57 100755
--- a/t/t7817-grep-sparse-checkout.sh
+++ b/t/t7817-grep-sparse-checkout.sh
@@ -85,4 +85,46 @@ test_expect_success 'grep <tree-ish> should search outside sparse checkout' '
 	test_cmp expect_t-tree actual_t-tree
 '
 
+for cmd in 'git grep --ignore-sparsity' 'git -c grep.ignoreSparsity grep' \
+	   'git -c grep.ignoreSparsity=false grep --ignore-sparsity'
+do
+	test_expect_success "$cmd should search outside sparse checkout" '
+		cat >expect <<-EOF &&
+		a:text
+		b:text
+		dir/c:text
+		EOF
+		$cmd "text" >actual &&
+		test_cmp expect actual
+	'
+
+	test_expect_success "$cmd --cached should search outside sparse checkout" '
+		cat >expect <<-EOF &&
+		a:text
+		b:text
+		dir/c:text
+		EOF
+		$cmd --cached "text" >actual &&
+		test_cmp expect actual
+	'
+
+	test_expect_success "$cmd <commit-ish> should search outside sparse checkout" '
+		commit=$(git rev-parse HEAD) &&
+		cat >expect_commit <<-EOF &&
+		$commit:a:text
+		$commit:b:text
+		$commit:dir/c:text
+		EOF
+		cat >expect_t-commit <<-EOF &&
+		t-commit:a:text
+		t-commit:b:text
+		t-commit:dir/c:text
+		EOF
+		$cmd "text" $commit >actual_commit &&
+		test_cmp expect_commit actual_commit &&
+		$cmd "text" t-commit >actual_t-commit &&
+		test_cmp expect_t-commit actual_t-commit
+	'
+done
+
 test_done
-- 
2.25.1


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 2/3] grep: honor sparse checkout patterns
  2020-03-24  6:12 ` [RFC PATCH 2/3] grep: honor sparse checkout patterns Matheus Tavares
@ 2020-03-24  7:15   ` Elijah Newren
  2020-03-24 15:12     ` Derrick Stolee
  2020-03-24 22:55     ` Matheus Tavares Bernardino
  0 siblings, 2 replies; 26+ messages in thread
From: Elijah Newren @ 2020-03-24  7:15 UTC (permalink / raw)
  To: Matheus Tavares
  Cc: Git Mailing List, Derrick Stolee, brian m. carlson, Stefan Beller

Hi Matheus,

On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
<matheus.bernardino@usp.br> wrote:
>
> One of the main uses for a sparse checkout is to allow users to focus on
> the subset of files in a repository in which they are interested. But
> git-grep currently ignores the sparsity patterns and report all matches
> found outside this subset, which kind of goes in the oposity direction.
> Let's fix that, making it honor the sparsity boundaries for every
> grepping case:
>
> - git grep in worktree
> - git grep --cached
> - git grep $REVISION

Wahoo!  This is great.

> - git grep --untracked and git grep --no-index (which already respect
>   sparse checkout boundaries)
>
> This is also what some users reported[1] they would want as the default
> behavior.
>
> Note: for `git grep $REVISION`, we will choose to honor the sparsity
> patterns only when $REVISION is a commit-ish object. The reason is that,

Makes sense.

> for a tree, we don't know whether it represents the root of a
> repository or a subtree. So we wouldn't be able to correctly match it
> against the sparsity patterns. E.g. suppose we have a repository with
> these two sparsity rules: "/*" and "!/a"; and the following structure:
>
> /
> | - a (file)
> | - d (dir)
>     | - a (file)
>
> If `git grep $REVISION` were to honor the sparsity patterns for every
> object type, when grepping the /d tree, we would wrongly ignore the /d/a
> file. This happens because we wouldn't know it resides in /d and
> therefore it would wrongly match the pattern "!/a". Furthermore, for a
> search in a blob object, we wouldn't even have a path to check the
> patterns against. So, let's ignore the sparsity patterns when grepping
> non-commit-ish objects (tags to commits should be fine).
>
> Finally, the old behavior is still desirable for some use cases. So the
> next patch will add an option to allow restoring it when needed.
>
> [1]: https://lore.kernel.org/git/CABPp-BGuFhDwWZBRaD3nA8ui46wor-4=Ha1G1oApsfF8KNpfGQ@mail.gmail.com/
>
> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> ---
>
> Something I'm not entirely sure in this patch is how we implement the
> mechanism to honor sparsity for the `git grep <commit-ish>` case (which
> is treated in the grep_tree() function). Currently, the patch looks for
> an index entry that matches the path, and then checks its skip_worktree

As you discuss below, checking the index is both wrong _and_ costly.
You should use the sparsity patterns; Stolee did a lot of work to make
those correspond to simple hashes you could check to determine whether
to even walk into a subdirectory.  So, O(1).  Yeah, that's "only" cone
mode but the non-cone sparsity patterns were a performance nightmare
waiting to rear its ugly head.  We should just try to encourage
everyone to move to cone mode, or accept the slowness they get without
it.

> bit. But this operation is perfomed in O(log(N)); N being the number of
> index entries. If there are many entries (and no so many sparsity
> patterns), maybe a better approach would be to try matching the path
> directly against the sparsity patterns. This would be O(M) in the number
> of patterns, and it could be done, in builtin/grep.c, with a function
> like the following:
>
> static struct pattern_list sparsity_patterns;
> static int sparsity_patterns_initialized = 0;
> static enum pattern_match_result path_matches_sparsity_patterns(
>                                         const char *path, int pathlen,
>                                         const char *basename,
>                                         struct repository *repo)
> {
>         int dtype = DT_UNKNOWN;
>
>         if (!sparsity_patterns_initialized) {
>                 char *sparse_file = git_pathdup("info/sparse-checkout");
>                 int ret;
>
>                 memset(&sparsity_patterns, 0, sizeof(sparsity_patterns));
>                 sparsity_patterns.use_cone_patterns = core_sparse_checkout_cone;
>                 ret = add_patterns_from_file_to_list(sparse_file, "", 0,
>                                                      &sparsity_patterns, NULL);
>                 free(sparse_file);
>
>                 if (ret < 0)
>                         die(_("failed to load sparse-checkout patterns"));
>                 sparsity_patterns_initialized = 1;
>         }
>
>         return path_matches_pattern_list(path, pathlen, basename, &dtype,
>                                          &sparsity_patterns, repo->index);
> }
>
> Also, if I understand correctly, the index doesn't hold paths to dirs,
> right? So even if a complete dir is excluded from sparse checkout, we
> still have to check all its subentries, only to discover that they
> should all be skipped from the search. However, if we were to check
> against the sparsity patterns directly (e.g. with the function above),
> we could skip such directories together with all their entries.
>
> Oh, and there is also the case of a commit whose tree paths are not in
> the index (maybe manually created objects?). For such commits, with the
> index lookup approach, we would have to fall back on ignoring the
> sparsity rules. I'm not sure if that would be OK, though.
>
> Any thoughts on these two approaches (looking up the skip_worktree bit
> in the index or directly matching against sparsity patterns), will be
> highly appreciated. (Note that it only concerns the `git grep
> <commit-ish>` case. The other cases already iterate thought the index, so
> there is no O(log(N)) extra complexity).
>
>  builtin/grep.c                   | 29 ++++++++---
>  t/t7011-skip-worktree-reading.sh |  9 ----
>  t/t7817-grep-sparse-checkout.sh  | 88 ++++++++++++++++++++++++++++++++
>  3 files changed, 111 insertions(+), 15 deletions(-)
>  create mode 100755 t/t7817-grep-sparse-checkout.sh
>
> diff --git a/builtin/grep.c b/builtin/grep.c
> index 99e2685090..52ec72a036 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -388,7 +388,7 @@ static int grep_cache(struct grep_opt *opt,
>                       const struct pathspec *pathspec, int cached);
>  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>                      struct tree_desc *tree, struct strbuf *base, int tn_len,
> -                    int check_attr);
> +                    int from_commit);

I'm not familiar with grep.c and have to admit I don't know what
"check_attr" means.  Slightly surprised to see you replace it, but
maybe reading the rest will explain...

>
>  static int grep_submodule(struct grep_opt *opt,
>                           const struct pathspec *pathspec,
> @@ -486,6 +486,10 @@ static int grep_cache(struct grep_opt *opt,
>
>         for (nr = 0; nr < repo->index->cache_nr; nr++) {
>                 const struct cache_entry *ce = repo->index->cache[nr];
> +
> +               if (ce_skip_worktree(ce))
> +                       continue;
> +

Looks good for the case where we are grepping through what's cached.

>                 strbuf_setlen(&name, name_base_len);
>                 strbuf_addstr(&name, ce->name);
>
> @@ -498,8 +502,7 @@ static int grep_cache(struct grep_opt *opt,
>                          * cache entry are identical, even if worktree file has
>                          * been modified, so use cache version instead
>                          */
> -                       if (cached || (ce->ce_flags & CE_VALID) ||
> -                           ce_skip_worktree(ce)) {
> +                       if (cached || (ce->ce_flags & CE_VALID)) {

I had the same change when I was trying to hack something like this
patch into place but only handled the worktree case before realized it
was a bit bigger job.

>                                 if (ce_stage(ce) || ce_intent_to_add(ce))
>                                         continue;
>                                 hit |= grep_oid(opt, &ce->oid, name.buf,
> @@ -532,7 +535,7 @@ static int grep_cache(struct grep_opt *opt,
>
>  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>                      struct tree_desc *tree, struct strbuf *base, int tn_len,
> -                    int check_attr)
> +                    int from_commit)
>  {
>         struct repository *repo = opt->repo;
>         int hit = 0;
> @@ -546,6 +549,9 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>                 name_base_len = name.len;
>         }
>
> +       if (from_commit && repo_read_index(repo) < 0)
> +               die(_("index file corrupt"));
> +

As above, I don't think we should need to read the index.  We should
compare to sparsity patterns, which in the important case (cone mode)
simplifies to a hash lookup as we walk directories.

>         while (tree_entry(tree, &entry)) {
>                 int te_len = tree_entry_len(&entry);
>
> @@ -564,9 +570,20 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>
>                 strbuf_add(base, entry.path, te_len);
>
> +               if (from_commit) {
> +                       int pos = index_name_pos(repo->index,
> +                                                base->buf + tn_len,
> +                                                base->len - tn_len);
> +                       if (pos >= 0 &&
> +                           ce_skip_worktree(repo->index->cache[pos])) {
> +                               strbuf_setlen(base, old_baselen);
> +                               continue;
> +                       }
> +               }
> +
>                 if (S_ISREG(entry.mode)) {
>                         hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
> -                                        check_attr ? base->buf + tn_len : NULL);
> +                                       from_commit ? base->buf + tn_len : NULL);

Sadly, this doesn't help me understand check_attr or from_commit.
Could you clue me in a bit?

>                 } else if (S_ISDIR(entry.mode)) {
>                         enum object_type type;
>                         struct tree_desc sub;
> @@ -581,7 +598,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>                         strbuf_addch(base, '/');
>                         init_tree_desc(&sub, data, size);
>                         hit |= grep_tree(opt, pathspec, &sub, base, tn_len,
> -                                        check_attr);
> +                                        from_commit);

Same.

>                         free(data);
>                 } else if (recurse_submodules && S_ISGITLINK(entry.mode)) {
>                         hit |= grep_submodule(opt, pathspec, &entry.oid,
> diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh
> index 37525cae3a..26852586ac 100755
> --- a/t/t7011-skip-worktree-reading.sh
> +++ b/t/t7011-skip-worktree-reading.sh
> @@ -109,15 +109,6 @@ test_expect_success 'ls-files --modified' '
>         test -z "$(git ls-files -m)"
>  '
>
> -test_expect_success 'grep with skip-worktree file' '
> -       git update-index --no-skip-worktree 1 &&
> -       echo test > 1 &&
> -       git update-index 1 &&
> -       git update-index --skip-worktree 1 &&
> -       rm 1 &&
> -       test "$(git grep --no-ext-grep test)" = "1:test"
> -'
> -
>  echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A   1" > expected
>  test_expect_success 'diff-index does not examine skip-worktree absent entries' '
>         setup_absent &&
> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> new file mode 100755
> index 0000000000..fccf44e829
> --- /dev/null
> +++ b/t/t7817-grep-sparse-checkout.sh
> @@ -0,0 +1,88 @@
> +#!/bin/sh
> +
> +test_description='grep in sparse checkout
> +
> +This test creates the following dir structure:
> +.
> +| - a
> +| - b
> +| - dir
> +    | - c
> +
> +Only "a" should be present due to the sparse checkout patterns:
> +"/*", "!/b" and "!/dir".
> +'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'setup' '
> +       echo "text" >a &&
> +       echo "text" >b &&
> +       mkdir dir &&
> +       echo "text" >dir/c &&
> +       git add a b dir &&
> +       git commit -m "initial commit" &&
> +       git tag -am t-commit t-commit HEAD &&
> +       tree=$(git rev-parse HEAD^{tree}) &&
> +       git tag -am t-tree t-tree $tree &&
> +       cat >.git/info/sparse-checkout <<-EOF &&
> +       /*
> +       !/b
> +       !/dir
> +       EOF
> +       git sparse-checkout init &&

Using `git sparse-checkout init` but then manually writing to
.git/info/sparse-checkout?  Seems like it'd make more sense to use
`git sparse-checkout set` than writing the patterns directly yourself.
Also, would prefer to have the examples use cone mode (even if you
have to add subdirectories), as it makes the testcase a bit easier to
read and more performant, though neither is a big deal.

> +       test_path_is_missing b &&
> +       test_path_is_missing dir &&
> +       test_path_is_file a
> +'
> +
> +test_expect_success 'grep in working tree should honor sparse checkout' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       EOF
> +       git grep "text" >actual &&
> +       test_cmp expect actual
> +'
> +
> +test_expect_success 'grep --cached should honor sparse checkout' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       EOF
> +       git grep --cached "text" >actual &&
> +       test_cmp expect actual
> +'
> +
> +test_expect_success 'grep <commit-ish> should honor sparse checkout' '
> +       commit=$(git rev-parse HEAD) &&
> +       cat >expect_commit <<-EOF &&
> +       $commit:a:text
> +       EOF
> +       cat >expect_t-commit <<-EOF &&
> +       t-commit:a:text
> +       EOF
> +       git grep "text" $commit >actual_commit &&
> +       test_cmp expect_commit actual_commit &&
> +       git grep "text" t-commit >actual_t-commit &&
> +       test_cmp expect_t-commit actual_t-commit
> +'
> +
> +test_expect_success 'grep <tree-ish> should search outside sparse checkout' '

I think the test is fine but the title seems misleading.  "outside"
and "inside" aren't defined because <tree-ish> isn't known to be
rooted, meaning we have no way to apply the sparsity patterns.  So
perhaps just 'grep <tree-ish> should ignore sparsity patterns'?

> +       commit=$(git rev-parse HEAD) &&
> +       tree=$(git rev-parse HEAD^{tree}) &&
> +       cat >expect_tree <<-EOF &&
> +       $tree:a:text
> +       $tree:b:text
> +       $tree:dir/c:text
> +       EOF
> +       cat >expect_t-tree <<-EOF &&
> +       t-tree:a:text
> +       t-tree:b:text
> +       t-tree:dir/c:text
> +       EOF
> +       git grep "text" $tree >actual_tree &&
> +       test_cmp expect_tree actual_tree &&
> +       git grep "text" t-tree >actual_t-tree &&
> +       test_cmp expect_t-tree actual_t-tree
> +'
> +
> +test_done
> --
> 2.25.1

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-24  6:13 ` [RFC PATCH 3/3] grep: add option to ignore sparsity patterns Matheus Tavares
@ 2020-03-24  7:54   ` Elijah Newren
  2020-03-24 18:30     ` Junio C Hamano
  2020-03-25 23:15     ` Matheus Tavares Bernardino
  0 siblings, 2 replies; 26+ messages in thread
From: Elijah Newren @ 2020-03-24  7:54 UTC (permalink / raw)
  To: Matheus Tavares
  Cc: Git Mailing List, Derrick Stolee, Nguyễn Thái Ngọc

On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
<matheus.bernardino@usp.br> wrote:
>
> In the last commit, git-grep learned to honor sparsity patterns. For
> some use cases, however, it may be desirable to search outside the
> sparse checkout. So add the '--ignore-sparsity' option, which restores
> the old behavior. Also add the grep.ignoreSparsity configuration, to
> allow setting this behavior by default.

Should `--ignore-sparsity` be a global git option rather than a
grep-specific one?  Also, should grep.ignoreSparsity rather be
core.ignoreSparsity or core.searchOutsideSparsePaths or something?  In
particular, I want a world where:

* Someone can do a "sparse" clone that is NOT just about
sparse-checkout but also about partial clone.  In particular, it makes
use of partial clones to download only the history for the sparsity
paths, and does a sparse-checkout --cone to get those checked out.
(Or, perhaps, defaults to just downloading history for the toplevel
dir, much like `sparse-checkout init --cone`, and then when the user
runs `sparse-checkout set $dir1 $dir2 ...` then it downloads the extra
bits).
* grep, diff, log, shortlog, blame, bisect (and maybe others) all by
default make use of the sparsity patterns to limit their output (but
can all use whatever flag(s) are added here to search outside the
sparsity pattern cones).  This helps users feel they are in a smaller
repo and searching just their area of interest, and it avoids partial
clones downloading blobs unnecessarily.  Nice for the user, and nice
for the system.
* worktrees behave nicer; when creating a new one it inherits the
sparsity patterns of the parent (again to avoid partail clones having
to download everything, and let users continue working on their area
of interest, though they can disable sparse checkouts at any time, of
course).  Still would like Junio's feedback on this one.
* rebase, merge, cherry-pick, etc. (all via the merge machiner) have
smarter tree-merging logic such that when trees are unchanged on one
or both sides of history, we take advantage of the subset of those
cases where we can avoid traversing into subtrees but can resolve the
merge at the tree level.  This is a performance optimization even when
you have all trees and blob available, but an even more important one
if you don't want partial clones to suddenly have to download
unnecessary objects.  I have ideas and am working on this as part of
merge-ort.

> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> ---
>
> Note: I still have to make --ignore-sparsity be able to work together
> with --untracked. Unfortunatelly, this won't be as simple because the
> codeflow taken by --untracked goes to grep_directory() which just
> iterates the working tree, without looking the index entries. So I will
> have to either: make --untracked use grep_cache(), and grep the
> untracked files later; or try matching the working tree paths against
> the sparsity patterns, without looking for the skip_worktree bit in
> the index (as I mentioned in the previous patch's comments). Any
> preferences regarding these two approaches? (or other suggestions?)

Hmm.  So, 'tracked' in git is the idea that we are keeping information
about specific files.  'sparse-checkout' is the idea that we have a
subset of those that we can work with without materializing all the
other tracked files; it's clearly a subset of the realm of 'tracked'.
'untracked' is about getting everything outside the set of 'tracked'
files, which to me means it is clearly outside the set of sparsity
paths too (and thus you could take --untracked as implying
--ignore-sparsity, though whether you do might not matter in practice
because of the items I'll discuss next).  Of course, I am also
assuming `--untracked` is incompatible with --cached or specifying
revisions or trees (based on it's definiton of "In addition to
searching in the tracked files in the *working tree*, search also in
untracked files." -- emphasis added.)  If the incompatibility of
--untracked and --cached/REVSIONS/TREES is not enforced, we may want
to look into erroring out if they are given together.  Once we do, we
don't have to worry about grep_cache() at all in the case of
--untracked and shouldn't.  Files with the skip_worktree bit won't
exist in the working directory, and thus won't be searched (this is
what makes --untracked imply --ignore-sparsity not really matter).

In short: With --untracked you are grepping ALL (non-ignored) files in
the working directory -- either because they are both tracked and in
the sparsity paths (anything tracked that isn't in the sparsity paths
has the skip_worktree bit and thus isn't present), or because it is an
untracked file.  [And this may be what grep_directory() already does.]

Does that make sense?

>  Documentation/config/grep.txt   |  3 +++
>  Documentation/git-grep.txt      |  5 ++++
>  builtin/grep.c                  | 19 +++++++++++----
>  t/t7817-grep-sparse-checkout.sh | 42 +++++++++++++++++++++++++++++++++
>  4 files changed, 65 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
> index 76689771aa..c1d49484c8 100644
> --- a/Documentation/config/grep.txt
> +++ b/Documentation/config/grep.txt
> @@ -25,3 +25,6 @@ grep.fullName::
>  grep.fallbackToNoIndex::
>         If set to true, fall back to git grep --no-index if git grep
>         is executed outside of a git repository.  Defaults to false.
> +
> +grep.ignoreSparsity::
> +       If set to true, enable `--ignore-sparsity` by default.
> diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
> index 97e25d7b1b..5c5c66c056 100644
> --- a/Documentation/git-grep.txt
> +++ b/Documentation/git-grep.txt
> @@ -65,6 +65,11 @@ OPTIONS
>         mechanism.  Only useful when searching files in the current directory
>         with `--no-index`.
>
> +--ignore-sparsity::
> +       In a sparse checked out repository (see linkgit:git-sparse-checkout[1]),
> +       also search in files that are outside the sparse checkout. This option
> +       cannot be used with --no-index or --untracked.

If they are outside the sparse checkout, then they are not present on
disk -- so what is this outside stuff that is being searched?  Perhaps
clarify that this is only useful in combination with
--cached/REVISION/TREE, where there do exist paths outside the
sparsity patterns that become relevant?

>  --recurse-submodules::
>         Recursively search in each submodule that has been initialized and
>         checked out in the repository.  When used in combination with the
> diff --git a/builtin/grep.c b/builtin/grep.c
> index 52ec72a036..17eae3edd6 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -33,6 +33,8 @@ static char const * const grep_usage[] = {
>
>  static int recurse_submodules;
>
> +static int ignore_sparsity = 0;
> +
>  static int num_threads;
>
>  static pthread_t *threads;
> @@ -292,6 +294,9 @@ static int grep_cmd_config(const char *var, const char *value, void *cb)
>         if (!strcmp(var, "submodule.recurse"))
>                 recurse_submodules = git_config_bool(var, value);
>
> +       if (!strcmp(var, "grep.ignoresparsity"))
> +               ignore_sparsity = git_config_bool(var, value);
> +
>         return st;
>  }
>
> @@ -487,7 +492,7 @@ static int grep_cache(struct grep_opt *opt,
>         for (nr = 0; nr < repo->index->cache_nr; nr++) {
>                 const struct cache_entry *ce = repo->index->cache[nr];
>
> -               if (ce_skip_worktree(ce))
> +               if (!ignore_sparsity && ce_skip_worktree(ce))

Oh boy on the double negatives...maybe we want to rename this flag somehow?

>                         continue;
>
>                 strbuf_setlen(&name, name_base_len);
> @@ -502,7 +507,8 @@ static int grep_cache(struct grep_opt *opt,
>                          * cache entry are identical, even if worktree file has
>                          * been modified, so use cache version instead
>                          */
> -                       if (cached || (ce->ce_flags & CE_VALID)) {
> +                       if (cached || (ce->ce_flags & CE_VALID) ||
> +                           ce_skip_worktree(ce)) {
>                                 if (ce_stage(ce) || ce_intent_to_add(ce))
>                                         continue;
>                                 hit |= grep_oid(opt, &ce->oid, name.buf,
> @@ -549,7 +555,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>                 name_base_len = name.len;
>         }
>
> -       if (from_commit && repo_read_index(repo) < 0)
> +       if (!ignore_sparsity && from_commit && repo_read_index(repo) < 0)
>                 die(_("index file corrupt"));
>
>         while (tree_entry(tree, &entry)) {
> @@ -570,7 +576,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>
>                 strbuf_add(base, entry.path, te_len);
>
> -               if (from_commit) {
> +               if (!ignore_sparsity && from_commit) {
>                         int pos = index_name_pos(repo->index,
>                                                  base->buf + tn_len,
>                                                  base->len - tn_len);
> @@ -932,6 +938,8 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
>                 OPT_BOOL_F(0, "ext-grep", &external_grep_allowed__ignored,
>                            N_("allow calling of grep(1) (ignored by this build)"),
>                            PARSE_OPT_NOCOMPLETE),
> +               OPT_BOOL(0, "ignore-sparsity", &ignore_sparsity,
> +                        N_("also search in files outside the sparse checkout")),
>                 OPT_END()
>         };
>
> @@ -1073,6 +1081,9 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
>         if (recurse_submodules && untracked)
>                 die(_("--untracked not supported with --recurse-submodules"));
>
> +       if (ignore_sparsity && (!use_index || untracked))
> +               die(_("--no-index or --untracked cannot be used with --ignore-sparsity"));
> +
>         if (show_in_pager) {
>                 if (num_threads > 1)
>                         warning(_("invalid option combination, ignoring --threads"));
> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> index fccf44e829..1891ddea57 100755
> --- a/t/t7817-grep-sparse-checkout.sh
> +++ b/t/t7817-grep-sparse-checkout.sh
> @@ -85,4 +85,46 @@ test_expect_success 'grep <tree-ish> should search outside sparse checkout' '
>         test_cmp expect_t-tree actual_t-tree
>  '
>
> +for cmd in 'git grep --ignore-sparsity' 'git -c grep.ignoreSparsity grep' \
> +          'git -c grep.ignoreSparsity=false grep --ignore-sparsity'
> +do
> +       test_expect_success "$cmd should search outside sparse checkout" '
> +               cat >expect <<-EOF &&
> +               a:text
> +               b:text
> +               dir/c:text
> +               EOF
> +               $cmd "text" >actual &&
> +               test_cmp expect actual
> +       '
> +
> +       test_expect_success "$cmd --cached should search outside sparse checkout" '
> +               cat >expect <<-EOF &&
> +               a:text
> +               b:text
> +               dir/c:text
> +               EOF
> +               $cmd --cached "text" >actual &&
> +               test_cmp expect actual
> +       '
> +
> +       test_expect_success "$cmd <commit-ish> should search outside sparse checkout" '
> +               commit=$(git rev-parse HEAD) &&
> +               cat >expect_commit <<-EOF &&
> +               $commit:a:text
> +               $commit:b:text
> +               $commit:dir/c:text
> +               EOF
> +               cat >expect_t-commit <<-EOF &&
> +               t-commit:a:text
> +               t-commit:b:text
> +               t-commit:dir/c:text
> +               EOF
> +               $cmd "text" $commit >actual_commit &&
> +               test_cmp expect_commit actual_commit &&
> +               $cmd "text" t-commit >actual_t-commit &&
> +               test_cmp expect_t-commit actual_t-commit
> +       '
> +done
> +
>  test_done
> --
> 2.25.1

I think there are several things that we need to straighten out first
and will affect a lot of this patch quite a bit:
* The feedback from the previous patch that the revision handling
should use sparsity patterns rather than ce_skip_worktree() is going
to affect this patch a fair amount.
* I think the fact that --ignore-sparsity is meaningless without
--cached or a REVISION or TREE may also affect things.
* The decision about how to globally name and set the
"ignore-sparsity" bit without requiring users to set it for each and
every subcommand will change this patch a bit too.


I'm super excited to see work in this area.  I hope I'm not
discouraging you by attempting to provide what I think is the bigger
picture I'd like us to work towards.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 1/3] doc: grep: unify info on configuration variables
  2020-03-24  6:11 ` [RFC PATCH 1/3] doc: grep: unify info on configuration variables Matheus Tavares
@ 2020-03-24  7:57   ` Elijah Newren
  2020-03-24 21:26     ` Junio C Hamano
  0 siblings, 1 reply; 26+ messages in thread
From: Elijah Newren @ 2020-03-24  7:57 UTC (permalink / raw)
  To: Matheus Tavares; +Cc: Git Mailing List, Derrick Stolee, brian m. carlson

On Mon, Mar 23, 2020 at 11:11 PM Matheus Tavares
<matheus.bernardino@usp.br> wrote:
>
> Explanations about the configuration variables for git-grep are
> duplicated in "Documentation/git-grep.txt" and
> "Documentation/config/grep.txt". Let's unify the information in the
> second file and include it in the first.
>
> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> ---
>  Documentation/config/grep.txt |  7 +++++--
>  Documentation/git-grep.txt    | 35 +++++------------------------------
>  2 files changed, 10 insertions(+), 32 deletions(-)
>
> diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
> index 44abe45a7c..76689771aa 100644
> --- a/Documentation/config/grep.txt
> +++ b/Documentation/config/grep.txt
> @@ -16,8 +16,11 @@ grep.extendedRegexp::
>         other than 'default'.
>
>  grep.threads::
> -       Number of grep worker threads to use.
> -       See `grep.threads` in linkgit:git-grep[1] for more information.
> +       Number of grep worker threads to use. See `--threads` in
> +       linkgit:git-grep[1] for more information.
> +
> +grep.fullName::
> +       If set to true, enable `--full-name` option by default.
>
>  grep.fallbackToNoIndex::
>         If set to true, fall back to git grep --no-index if git grep
> diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
> index ddb6acc025..97e25d7b1b 100644
> --- a/Documentation/git-grep.txt
> +++ b/Documentation/git-grep.txt
> @@ -41,34 +41,7 @@ characters.  An empty string as search expression matches all lines.
>  CONFIGURATION
>  -------------
>
> -grep.lineNumber::
> -       If set to true, enable `-n` option by default.
> -
> -grep.column::
> -       If set to true, enable the `--column` option by default.
> -
> -grep.patternType::
> -       Set the default matching behavior. Using a value of 'basic', 'extended',
> -       'fixed', or 'perl' will enable the `--basic-regexp`, `--extended-regexp`,
> -       `--fixed-strings`, or `--perl-regexp` option accordingly, while the
> -       value 'default' will return to the default matching behavior.
> -
> -grep.extendedRegexp::
> -       If set to true, enable `--extended-regexp` option by default. This
> -       option is ignored when the `grep.patternType` option is set to a value
> -       other than 'default'.
> -
> -grep.threads::
> -       Number of grep worker threads to use. If unset (or set to 0), Git will
> -       use as many threads as the number of logical cores available.
> -
> -grep.fullName::
> -       If set to true, enable `--full-name` option by default.
> -
> -grep.fallbackToNoIndex::
> -       If set to true, fall back to git grep --no-index if git grep
> -       is executed outside of a git repository.  Defaults to false.
> -
> +include::config/grep.txt[]
>
>  OPTIONS
>  -------
> @@ -267,8 +240,10 @@ providing this option will cause it to die.
>         found.
>
>  --threads <num>::
> -       Number of grep worker threads to use.
> -       See `grep.threads` in 'CONFIGURATION' for more information.
> +       Number of grep worker threads to use. If not provided (or set to
> +       0), Git will use as many worker threads as the number of logical
> +       cores available. The default value can also be set with the
> +       `grep.threads` configuration (see linkgit:git-config[1]).

I'm possibly showing my ignorance here, but doesn't the
"include::config/grep.txt[]" you added above mean that the user
doesn't have to see an external manpage but can see the definition
earlier within this same manpage?

>
>  -f <file>::
>         Read patterns from <file>, one per line.
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 2/3] grep: honor sparse checkout patterns
  2020-03-24  7:15   ` Elijah Newren
@ 2020-03-24 15:12     ` Derrick Stolee
  2020-03-24 16:16       ` Elijah Newren
  2020-03-24 23:01       ` Matheus Tavares Bernardino
  2020-03-24 22:55     ` Matheus Tavares Bernardino
  1 sibling, 2 replies; 26+ messages in thread
From: Derrick Stolee @ 2020-03-24 15:12 UTC (permalink / raw)
  To: Elijah Newren, Matheus Tavares
  Cc: Git Mailing List, Derrick Stolee, brian m. carlson, Stefan Beller

On 3/24/2020 3:15 AM, Elijah Newren wrote:
> Hi Matheus,
> 
> On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
> <matheus.bernardino@usp.br> wrote:
>>
>> One of the main uses for a sparse checkout is to allow users to focus on
>> the subset of files in a repository in which they are interested. But
>> git-grep currently ignores the sparsity patterns and report all matches
>> found outside this subset, which kind of goes in the oposity direction.
>> Let's fix that, making it honor the sparsity boundaries for every
>> grepping case:
>>
>> - git grep in worktree
>> - git grep --cached
>> - git grep $REVISION
> 
> Wahoo!  This is great.

I am also excited. Also thrilled to see the option to get the old
behavior in the next patch.

>> Something I'm not entirely sure in this patch is how we implement the
>> mechanism to honor sparsity for the `git grep <commit-ish>` case (which
>> is treated in the grep_tree() function). Currently, the patch looks for
>> an index entry that matches the path, and then checks its skip_worktree
> 
> As you discuss below, checking the index is both wrong _and_ costly.

I'm not sure why checking the index is _wrong_, but I agree about the
performance cost.

> You should use the sparsity patterns; Stolee did a lot of work to make
> those correspond to simple hashes you could check to determine whether
> to even walk into a subdirectory.  So, O(1).  Yeah, that's "only" cone
> mode but the non-cone sparsity patterns were a performance nightmare
> waiting to rear its ugly head.  We should just try to encourage
> everyone to move to cone mode, or accept the slowness they get without
> it.
> 
>> bit. But this operation is perfomed in O(log(N)); N being the number of
>> index entries. If there are many entries (and no so many sparsity
>> patterns), maybe a better approach would be to try matching the path
>> directly against the sparsity patterns. This would be O(M) in the number
>> of patterns, and it could be done, in builtin/grep.c, with a function
>> like the following:
>>
>> static struct pattern_list sparsity_patterns;
>> static int sparsity_patterns_initialized = 0;
>> static enum pattern_match_result path_matches_sparsity_patterns(
>>                                         const char *path, int pathlen,
>>                                         const char *basename,
>>                                         struct repository *repo)
>> {
>>         int dtype = DT_UNKNOWN;
>>
>>         if (!sparsity_patterns_initialized) {
>>                 char *sparse_file = git_pathdup("info/sparse-checkout");
>>                 int ret;
>>
>>                 memset(&sparsity_patterns, 0, sizeof(sparsity_patterns));
>>                 sparsity_patterns.use_cone_patterns = core_sparse_checkout_cone;
>>                 ret = add_patterns_from_file_to_list(sparse_file, "", 0,
>>                                                      &sparsity_patterns, NULL);
>>                 free(sparse_file);
>>
>>                 if (ret < 0)
>>                         die(_("failed to load sparse-checkout patterns"));
>>                 sparsity_patterns_initialized = 1;
>>         }
>>
>>         return path_matches_pattern_list(path, pathlen, basename, &dtype,
>>                                          &sparsity_patterns, repo->index);
>> }
>>
>> Also, if I understand correctly, the index doesn't hold paths to dirs,
>> right? So even if a complete dir is excluded from sparse checkout, we
>> still have to check all its subentries, only to discover that they
>> should all be skipped from the search. However, if we were to check
>> against the sparsity patterns directly (e.g. with the function above),
>> we could skip such directories together with all their entries.

When in cone mode, we can check if a directory is one of these three
modes:

1. Completely contained in the cone (recursive match)
2. Completely outside the cone
3. Neither. Keep matching subdirectories. (parent match)

The clear_ce_flags() code in dir.c includes the matching algorithms
for this. Hopefully you can re-use a lot of it. You may need to extract
some methods to use them from the grep code.

>> Oh, and there is also the case of a commit whose tree paths are not in
>> the index (maybe manually created objects?). For such commits, with the
>> index lookup approach, we would have to fall back on ignoring the
>> sparsity rules. I'm not sure if that would be OK, though.
>>
>> Any thoughts on these two approaches (looking up the skip_worktree bit
>> in the index or directly matching against sparsity patterns), will be
>> highly appreciated. (Note that it only concerns the `git grep
>> <commit-ish>` case. The other cases already iterate thought the index, so
>> there is no O(log(N)) extra complexity).
>>
>>  builtin/grep.c                   | 29 ++++++++---
>>  t/t7011-skip-worktree-reading.sh |  9 ----
>>  t/t7817-grep-sparse-checkout.sh  | 88 ++++++++++++++++++++++++++++++++
>>  3 files changed, 111 insertions(+), 15 deletions(-)
>>  create mode 100755 t/t7817-grep-sparse-checkout.sh
>>
>> diff --git a/builtin/grep.c b/builtin/grep.c
>> index 99e2685090..52ec72a036 100644
>> --- a/builtin/grep.c
>> +++ b/builtin/grep.c
>> @@ -388,7 +388,7 @@ static int grep_cache(struct grep_opt *opt,
>>                       const struct pathspec *pathspec, int cached);
>>  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>>                      struct tree_desc *tree, struct strbuf *base, int tn_len,
>> -                    int check_attr);
>> +                    int from_commit);
> 
> I'm not familiar with grep.c and have to admit I don't know what
> "check_attr" means.  Slightly surprised to see you replace it, but
> maybe reading the rest will explain...
> 
>>
>>  static int grep_submodule(struct grep_opt *opt,
>>                           const struct pathspec *pathspec,
>> @@ -486,6 +486,10 @@ static int grep_cache(struct grep_opt *opt,
>>
>>         for (nr = 0; nr < repo->index->cache_nr; nr++) {
>>                 const struct cache_entry *ce = repo->index->cache[nr];
>> +
>> +               if (ce_skip_worktree(ce))
>> +                       continue;
>> +
> 
> Looks good for the case where we are grepping through what's cached.
> 
>>                 strbuf_setlen(&name, name_base_len);
>>                 strbuf_addstr(&name, ce->name);
>>
>> @@ -498,8 +502,7 @@ static int grep_cache(struct grep_opt *opt,
>>                          * cache entry are identical, even if worktree file has
>>                          * been modified, so use cache version instead
>>                          */
>> -                       if (cached || (ce->ce_flags & CE_VALID) ||
>> -                           ce_skip_worktree(ce)) {
>> +                       if (cached || (ce->ce_flags & CE_VALID)) {
> 
> I had the same change when I was trying to hack something like this
> patch into place but only handled the worktree case before realized it
> was a bit bigger job.
> 
>>                                 if (ce_stage(ce) || ce_intent_to_add(ce))
>>                                         continue;
>>                                 hit |= grep_oid(opt, &ce->oid, name.buf,
>> @@ -532,7 +535,7 @@ static int grep_cache(struct grep_opt *opt,
>>
>>  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>>                      struct tree_desc *tree, struct strbuf *base, int tn_len,
>> -                    int check_attr)
>> +                    int from_commit)
>>  {
>>         struct repository *repo = opt->repo;
>>         int hit = 0;
>> @@ -546,6 +549,9 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>>                 name_base_len = name.len;
>>         }
>>
>> +       if (from_commit && repo_read_index(repo) < 0)
>> +               die(_("index file corrupt"));
>> +
> 
> As above, I don't think we should need to read the index.  We should
> compare to sparsity patterns, which in the important case (cone mode)
> simplifies to a hash lookup as we walk directories.
> 
>>         while (tree_entry(tree, &entry)) {
>>                 int te_len = tree_entry_len(&entry);
>>
>> @@ -564,9 +570,20 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>>
>>                 strbuf_add(base, entry.path, te_len);
>>
>> +               if (from_commit) {
>> +                       int pos = index_name_pos(repo->index,
>> +                                                base->buf + tn_len,
>> +                                                base->len - tn_len);
>> +                       if (pos >= 0 &&
>> +                           ce_skip_worktree(repo->index->cache[pos])) {
>> +                               strbuf_setlen(base, old_baselen);
>> +                               continue;
>> +                       }
>> +               }
>> +
>>                 if (S_ISREG(entry.mode)) {
>>                         hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
>> -                                        check_attr ? base->buf + tn_len : NULL);
>> +                                       from_commit ? base->buf + tn_len : NULL);
> 
> Sadly, this doesn't help me understand check_attr or from_commit.
> Could you clue me in a bit?

Yeah, Elijah and I know the sparse-checkout code quite well, but are
unfamiliar with grep. Let's all expand our knowledge!

>>                 } else if (S_ISDIR(entry.mode)) {
>>                         enum object_type type;
>>                         struct tree_desc sub;
>> @@ -581,7 +598,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>>                         strbuf_addch(base, '/');
>>                         init_tree_desc(&sub, data, size);
>>                         hit |= grep_tree(opt, pathspec, &sub, base, tn_len,
>> -                                        check_attr);
>> +                                        from_commit);
> 
> Same.
> 
>>                         free(data);
>>                 } else if (recurse_submodules && S_ISGITLINK(entry.mode)) {
>>                         hit |= grep_submodule(opt, pathspec, &entry.oid,
>> diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh
>> index 37525cae3a..26852586ac 100755
>> --- a/t/t7011-skip-worktree-reading.sh
>> +++ b/t/t7011-skip-worktree-reading.sh
>> @@ -109,15 +109,6 @@ test_expect_success 'ls-files --modified' '
>>         test -z "$(git ls-files -m)"
>>  '
>>
>> -test_expect_success 'grep with skip-worktree file' '
>> -       git update-index --no-skip-worktree 1 &&
>> -       echo test > 1 &&
>> -       git update-index 1 &&
>> -       git update-index --skip-worktree 1 &&
>> -       rm 1 &&
>> -       test "$(git grep --no-ext-grep test)" = "1:test"
>> -'
>> -
>>  echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A   1" > expected
>>  test_expect_success 'diff-index does not examine skip-worktree absent entries' '
>>         setup_absent &&
>> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
>> new file mode 100755
>> index 0000000000..fccf44e829
>> --- /dev/null
>> +++ b/t/t7817-grep-sparse-checkout.sh
>> @@ -0,0 +1,88 @@
>> +#!/bin/sh
>> +
>> +test_description='grep in sparse checkout
>> +
>> +This test creates the following dir structure:
>> +.
>> +| - a
>> +| - b
>> +| - dir
>> +    | - c
>> +
>> +Only "a" should be present due to the sparse checkout patterns:
>> +"/*", "!/b" and "!/dir".
>> +'
>> +
>> +. ./test-lib.sh
>> +
>> +test_expect_success 'setup' '
>> +       echo "text" >a &&
>> +       echo "text" >b &&
>> +       mkdir dir &&
>> +       echo "text" >dir/c &&
>> +       git add a b dir &&
>> +       git commit -m "initial commit" &&
>> +       git tag -am t-commit t-commit HEAD &&
>> +       tree=$(git rev-parse HEAD^{tree}) &&
>> +       git tag -am t-tree t-tree $tree &&
>> +       cat >.git/info/sparse-checkout <<-EOF &&
>> +       /*
>> +       !/b
>> +       !/dir
>> +       EOF
>> +       git sparse-checkout init &&
> 
> Using `git sparse-checkout init` but then manually writing to
> .git/info/sparse-checkout?  Seems like it'd make more sense to use
> `git sparse-checkout set` than writing the patterns directly yourself.
> Also, would prefer to have the examples use cone mode (even if you
> have to add subdirectories), as it makes the testcase a bit easier to
> read and more performant, though neither is a big deal.

I agree that we should use the builtin so your test script is less
brittle to potential back-end changes to sparse-checkout (none planned).

I do recommend having at least one test with non-cone mode patterns,
especially if you are checking the pattern-matching yourself instead of
relying on the index.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 2/3] grep: honor sparse checkout patterns
  2020-03-24 15:12     ` Derrick Stolee
@ 2020-03-24 16:16       ` Elijah Newren
  2020-03-24 17:02         ` Derrick Stolee
  2020-03-24 23:01       ` Matheus Tavares Bernardino
  1 sibling, 1 reply; 26+ messages in thread
From: Elijah Newren @ 2020-03-24 16:16 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Matheus Tavares, Git Mailing List, Derrick Stolee,
	brian m. carlson, Stefan Beller

On Tue, Mar 24, 2020 at 8:12 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 3/24/2020 3:15 AM, Elijah Newren wrote:
> > Hi Matheus,
> >
> > On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
...
> >> Something I'm not entirely sure in this patch is how we implement the
> >> mechanism to honor sparsity for the `git grep <commit-ish>` case (which
> >> is treated in the grep_tree() function). Currently, the patch looks for
> >> an index entry that matches the path, and then checks its skip_worktree
> >
> > As you discuss below, checking the index is both wrong _and_ costly.
>
> I'm not sure why checking the index is _wrong_, but I agree about the
> performance cost.

Let's say there are two directories, dir1 and dir2.  Over time, there
have existed a total of six files:
   dir1/{a,b,c}
   dir2/{d,e,f}
At the current time, there are only four files in the index:
   dir1/{a,b}
   dir2/{d,e}
And the user has done a `git sparse-checkout set dir2` and then at
some point later run `git grep OTHERCOMMIT foobar`.  What happens?

Well, since we're in a sparse checkout, we should only search the
relevant paths within OTHERCOMMIT for "foobar".  Let's say we attempt
to figure out the "relevant paths" using the index.  We can tell that
dir1/a and dir2/a are marked as SKIP_WORKTREE so we don't search them.
dir1/c is untracked -- what do we do with it?  Include it?  Exclude
it?  Carrying on with the other files, dir2/d and dir2/e are tracked
and !SKIP_WORKTREE so we search them.  dir2/f is untracked -- what do
we do with it?  Include it?  Exclude it?

We're left without the necessary information to tell whether we should
search OTHERCOMMIT's dir1/c and dir2/f if we consult the index.  Any
decision we make is going to be wrong for one of the two paths.

If we instead do not attempt to consult the index (which corresponds
to a version close to HEAD) in order to ask questions about the
completely different OTHERCOMMIT, but instead use the sparsity
patterns to query whether those files/directories are interesting,
then we get the right answer.  The index can only be consulted for the
right answer in the case of --cached; in all other cases (including
OTHERCOMMIT == HEAD), we should use the sparsity patterns.  In fact,
we could also use the sparsity patterns in the case of --cached, it's
just that for that one particular case consulting the index will also
give the right answer.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 2/3] grep: honor sparse checkout patterns
  2020-03-24 16:16       ` Elijah Newren
@ 2020-03-24 17:02         ` Derrick Stolee
  0 siblings, 0 replies; 26+ messages in thread
From: Derrick Stolee @ 2020-03-24 17:02 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Matheus Tavares, Git Mailing List, Derrick Stolee,
	brian m. carlson, Stefan Beller

On 3/24/2020 12:16 PM, Elijah Newren wrote:
> On Tue, Mar 24, 2020 at 8:12 AM Derrick Stolee <stolee@gmail.com> wrote:
>>
>> On 3/24/2020 3:15 AM, Elijah Newren wrote:
>>> Hi Matheus,
>>>
>>> On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
> ...
>>>> Something I'm not entirely sure in this patch is how we implement the
>>>> mechanism to honor sparsity for the `git grep <commit-ish>` case (which
>>>> is treated in the grep_tree() function). Currently, the patch looks for
>>>> an index entry that matches the path, and then checks its skip_worktree
>>>
>>> As you discuss below, checking the index is both wrong _and_ costly.
>>
>> I'm not sure why checking the index is _wrong_, but I agree about the
>> performance cost.
> 
> Let's say there are two directories, dir1 and dir2.  Over time, there
> have existed a total of six files:
>    dir1/{a,b,c}
>    dir2/{d,e,f}
> At the current time, there are only four files in the index:
>    dir1/{a,b}
>    dir2/{d,e}
> And the user has done a `git sparse-checkout set dir2` and then at
> some point later run `git grep OTHERCOMMIT foobar`.  What happens?
> 
> Well, since we're in a sparse checkout, we should only search the
> relevant paths within OTHERCOMMIT for "foobar".  Let's say we attempt
> to figure out the "relevant paths" using the index.  We can tell that
> dir1/a and dir2/a are marked as SKIP_WORKTREE so we don't search them.
> dir1/c is untracked -- what do we do with it?  Include it?  Exclude
> it?  Carrying on with the other files, dir2/d and dir2/e are tracked
> and !SKIP_WORKTREE so we search them.  dir2/f is untracked -- what do
> we do with it?  Include it?  Exclude it?
> 
> We're left without the necessary information to tell whether we should
> search OTHERCOMMIT's dir1/c and dir2/f if we consult the index.  Any
> decision we make is going to be wrong for one of the two paths.
> 
> If we instead do not attempt to consult the index (which corresponds
> to a version close to HEAD) in order to ask questions about the
> completely different OTHERCOMMIT, but instead use the sparsity
> patterns to query whether those files/directories are interesting,
> then we get the right answer.  The index can only be consulted for the
> right answer in the case of --cached; in all other cases (including
> OTHERCOMMIT == HEAD), we should use the sparsity patterns.  In fact,
> we could also use the sparsity patterns in the case of --cached, it's
> just that for that one particular case consulting the index will also
> give the right answer.

Thanks! This helps a lot.

-Stolee


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-24  7:54   ` Elijah Newren
@ 2020-03-24 18:30     ` Junio C Hamano
  2020-03-24 19:07       ` Elijah Newren
  2020-03-30  3:23       ` Matheus Tavares Bernardino
  2020-03-25 23:15     ` Matheus Tavares Bernardino
  1 sibling, 2 replies; 26+ messages in thread
From: Junio C Hamano @ 2020-03-24 18:30 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Matheus Tavares, Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc

Elijah Newren <newren@gmail.com> writes:

> On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
> <matheus.bernardino@usp.br> wrote:
>>
>> In the last commit, git-grep learned to honor sparsity patterns. For
>> some use cases, however, it may be desirable to search outside the
>> sparse checkout. So add the '--ignore-sparsity' option, which restores
>> the old behavior. Also add the grep.ignoreSparsity configuration, to
>> allow setting this behavior by default.
>
> Should `--ignore-sparsity` be a global git option rather than a
> grep-specific one?  Also, should grep.ignoreSparsity rather be
> core.ignoreSparsity or core.searchOutsideSparsePaths or something?

Great question.  I think "git diff" with various options would also
want to optionally be able to be confined within the sparse cone, or
checking the entire world by lazily fetching outside the sparsity.

> * grep, diff, log, shortlog, blame, bisect (and maybe others) all by
> default make use of the sparsity patterns to limit their output (but
> can all use whatever flag(s) are added here to search outside the
> sparsity pattern cones).  This helps users feel they are in a smaller
> repo and searching just their area of interest, and it avoids partial
> clones downloading blobs unnecessarily.  Nice for the user, and nice
> for the system.

I am not sure which one should be the default.  From historical
point of view that sparse stuff was done as an optimization to omit
initial work and lazily give the whole world, I may have slight
preference to the "we pretend that you have everything, just some
parts may be slower to come to you" world view to be the default,
with an option to limit the view to whatever sparsity you initially
set up.  Regardless of the choice of the default, it would be a good
idea to make the subcommands consistently offer the same default and
allow the non-default views with the same UI.



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-24 18:30     ` Junio C Hamano
@ 2020-03-24 19:07       ` Elijah Newren
  2020-03-25 20:18         ` Junio C Hamano
  2020-03-30  3:23       ` Matheus Tavares Bernardino
  1 sibling, 1 reply; 26+ messages in thread
From: Elijah Newren @ 2020-03-24 19:07 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Matheus Tavares, Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc

On Tue, Mar 24, 2020 at 11:30 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
> > <matheus.bernardino@usp.br> wrote:
> >>
> >> In the last commit, git-grep learned to honor sparsity patterns. For
> >> some use cases, however, it may be desirable to search outside the
> >> sparse checkout. So add the '--ignore-sparsity' option, which restores
> >> the old behavior. Also add the grep.ignoreSparsity configuration, to
> >> allow setting this behavior by default.
> >
> > Should `--ignore-sparsity` be a global git option rather than a
> > grep-specific one?  Also, should grep.ignoreSparsity rather be
> > core.ignoreSparsity or core.searchOutsideSparsePaths or something?
>
> Great question.  I think "git diff" with various options would also
> want to optionally be able to be confined within the sparse cone, or
> checking the entire world by lazily fetching outside the sparsity.
>
> > * grep, diff, log, shortlog, blame, bisect (and maybe others) all by
> > default make use of the sparsity patterns to limit their output (but
> > can all use whatever flag(s) are added here to search outside the
> > sparsity pattern cones).  This helps users feel they are in a smaller
> > repo and searching just their area of interest, and it avoids partial
> > clones downloading blobs unnecessarily.  Nice for the user, and nice
> > for the system.
>
> I am not sure which one should be the default.  From historical
> point of view that sparse stuff was done as an optimization to omit
> initial work and lazily give the whole world, I may have slight
> preference to the "we pretend that you have everything, just some
> parts may be slower to come to you" world view to be the default,
> with an option to limit the view to whatever sparsity you initially
> set up.

It sounds like you are describing partial clone rather than sparse
checkout?  Or perhaps you're trying to blur the distinction,
suggesting the two should be used together, with the partial clone
machinery learning to download history within the specified sparse
cones?

>  Regardless of the choice of the default, it would be a good
> idea to make the subcommands consistently offer the same default and
> allow the non-default views with the same UI.

Agreed.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 1/3] doc: grep: unify info on configuration variables
  2020-03-24  7:57   ` Elijah Newren
@ 2020-03-24 21:26     ` Junio C Hamano
  2020-03-24 23:38       ` Matheus Tavares
  0 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2020-03-24 21:26 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Matheus Tavares, Git Mailing List, Derrick Stolee, brian m. carlson

Elijah Newren <newren@gmail.com> writes:

>> diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
>> index 44abe45a7c..76689771aa 100644
>> --- a/Documentation/config/grep.txt
>> +++ b/Documentation/config/grep.txt
>> @@ -16,8 +16,11 @@ grep.extendedRegexp::
>> ...
>> +       Number of grep worker threads to use. See `--threads` in
>> +       linkgit:git-grep[1] for more information.
>> ...
>> diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
>> index ddb6acc025..97e25d7b1b 100644
>> --- a/Documentation/git-grep.txt
>> +++ b/Documentation/git-grep.txt
>> @@ -41,34 +41,7 @@ characters.  An empty string as search expression matches all lines.
>> ...
>> +include::config/grep.txt[]
>> ...
>>  --threads <num>::
>> -       Number of grep worker threads to use.
>> -       See `grep.threads` in 'CONFIGURATION' for more information.
>> +       Number of grep worker threads to use. If not provided (or set to
>> +       0), Git will use as many worker threads as the number of logical
>> +       cores available. The default value can also be set with the
>> +       `grep.threads` configuration (see linkgit:git-config[1]).
>
> I'm possibly showing my ignorance here, but doesn't the
> "include::config/grep.txt[]" you added above mean that the user
> doesn't have to see an external manpage but can see the definition
> earlier within this same manpage?

I think so.  Also, the new reference "See `--threads` in git-grep"
added to grep.threads to config/grep.txt would become somewhat
redundant in the context of "git grep --help" (only "See --threads"
is relevant when it appears in this same manual page).

Readers who finds the reference in "git config --help" still needs
to see that --threads is an option to git-grep, though.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 2/3] grep: honor sparse checkout patterns
  2020-03-24  7:15   ` Elijah Newren
  2020-03-24 15:12     ` Derrick Stolee
@ 2020-03-24 22:55     ` Matheus Tavares Bernardino
  1 sibling, 0 replies; 26+ messages in thread
From: Matheus Tavares Bernardino @ 2020-03-24 22:55 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Derrick Stolee, brian m. carlson, Stefan Beller

On Tue, Mar 24, 2020 at 4:15 AM Elijah Newren <newren@gmail.com> wrote:
>
> On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
> <matheus.bernardino@usp.br> wrote:
> >
> > Something I'm not entirely sure in this patch is how we implement the
> > mechanism to honor sparsity for the `git grep <commit-ish>` case (which
> > is treated in the grep_tree() function). Currently, the patch looks for
> > an index entry that matches the path, and then checks its skip_worktree
>
> As you discuss below, checking the index is both wrong _and_ costly.
> You should use the sparsity patterns; Stolee did a lot of work to make
> those correspond to simple hashes you could check to determine whether
> to even walk into a subdirectory.  So, O(1).  Yeah, that's "only" cone
> mode but the non-cone sparsity patterns were a performance nightmare
> waiting to rear its ugly head.  We should just try to encourage
> everyone to move to cone mode, or accept the slowness they get without
> it.

OK, makes sense. And your reply to Stolee, later in this thread, made
it clearer for me why checking the index is not only costly but also
wrong. Thanks for the great explanation! I will use the sparsity
patterns directly, in the next iteration.

> > diff --git a/builtin/grep.c b/builtin/grep.c
> > index 99e2685090..52ec72a036 100644
> > --- a/builtin/grep.c
> > +++ b/builtin/grep.c
> > @@ -388,7 +388,7 @@ static int grep_cache(struct grep_opt *opt,
> >                       const struct pathspec *pathspec, int cached);
> >  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> >                      struct tree_desc *tree, struct strbuf *base, int tn_len,
> > -                    int check_attr);
> > +                    int from_commit);
>
> I'm not familiar with grep.c and have to admit I don't know what
> "check_attr" means. Slightly surprised to see you replace it, but
> maybe reading the rest will explain...
...
>>                 if (S_ISREG(entry.mode)) {
>>                         hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
>> -                                        check_attr ? base->buf + tn_len : NULL);
>> +                                       from_commit ? base->buf + tn_len : NULL);
>
> Sadly, this doesn't help me understand check_attr or from_commit.
> Could you clue me in a bit?

Sure! The grep machinery can optionally look the .gitattributes file,
to see if a given path has a "diff" attribute assigned to it. This
attribute points to a diff driver in .gitconfig, which can specify
many things, such as whether the path should be treated as a binary or
not. The "check_attr" flag passed to grep_tree() tells the grep
machinery if it should perform this attribute lookup for the paths in
the given tree.

I decided to replace it with "from_commit" because the only times we
want an attribute lookup when grepping a tree, is when it comes from a
commit. I.e., when the tree is the root. (The reasoning goes in the
same lines as for why we only check sparsity patterns in git-grep for
commit-ish objects: we cannot check pattern matching for trees which
we are not sure to be rooted). Since "knowing if the tree is a root or
not" is useful in grep_tree() for both sparsity checks and attribute
checks, I thought we could use a single "from_commit" variable instead
of "check_attr" and "check_sparsity", which would always have matching
values. But on second thought, I could maybe rename the variable to
something as "is_root_tree" or add a comment explaining the usage of
"from_commit".

(I'm not a big fan of "is_root_tree", thought, because we could give a
root tree to grep_tree() but not really know it.)

> > diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> > new file mode 100755
> > index 0000000000..fccf44e829
> > --- /dev/null
> > +++ b/t/t7817-grep-sparse-checkout.sh
...
> > +test_expect_success 'setup' '
> > +       echo "text" >a &&
> > +       echo "text" >b &&
> > +       mkdir dir &&
> > +       echo "text" >dir/c &&
> > +       git add a b dir &&
> > +       git commit -m "initial commit" &&
> > +       git tag -am t-commit t-commit HEAD &&
> > +       tree=$(git rev-parse HEAD^{tree}) &&
> > +       git tag -am t-tree t-tree $tree &&
> > +       cat >.git/info/sparse-checkout <<-EOF &&
> > +       /*
> > +       !/b
> > +       !/dir
> > +       EOF
> > +       git sparse-checkout init &&
>
> Using `git sparse-checkout init` but then manually writing to
> .git/info/sparse-checkout?  Seems like it'd make more sense to use
> `git sparse-checkout set` than writing the patterns directly yourself.
> Also, would prefer to have the examples use cone mode (even if you
> have to add subdirectories), as it makes the testcase a bit easier to
> read and more performant, though neither is a big deal.

OK, I will make use of the builtin here. I will also use the cone mode
(and leave one test without it, as Stolee suggested later in this
thread).

> > +test_expect_success 'grep <tree-ish> should search outside sparse checkout' '
>
> I think the test is fine but the title seems misleading.  "outside"
> and "inside" aren't defined because <tree-ish> isn't known to be
> rooted, meaning we have no way to apply the sparsity patterns.  So
> perhaps just 'grep <tree-ish> should ignore sparsity patterns'?

Right! "should ignore sparsity patterns" is a much better name, thanks.

Thanks a lot for the thoughtful review and comments!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 2/3] grep: honor sparse checkout patterns
  2020-03-24 15:12     ` Derrick Stolee
  2020-03-24 16:16       ` Elijah Newren
@ 2020-03-24 23:01       ` Matheus Tavares Bernardino
  1 sibling, 0 replies; 26+ messages in thread
From: Matheus Tavares Bernardino @ 2020-03-24 23:01 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren, Git Mailing List, Derrick Stolee,
	brian m. carlson, Stefan Beller

On Tue, Mar 24, 2020 at 12:12 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 3/24/2020 3:15 AM, Elijah Newren wrote:
> >
> > On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
> > <matheus.bernardino@usp.br> wrote:
> >>
> >> Also, if I understand correctly, the index doesn't hold paths to dirs,
> >> right? So even if a complete dir is excluded from sparse checkout, we
> >> still have to check all its subentries, only to discover that they
> >> should all be skipped from the search. However, if we were to check
> >> against the sparsity patterns directly (e.g. with the function above),
> >> we could skip such directories together with all their entries.
>
> When in cone mode, we can check if a directory is one of these three
> modes:
>
> 1. Completely contained in the cone (recursive match)
> 2. Completely outside the cone
> 3. Neither. Keep matching subdirectories. (parent match)
>
> The clear_ce_flags() code in dir.c includes the matching algorithms
> for this. Hopefully you can re-use a lot of it. You may need to extract
> some methods to use them from the grep code.

Thanks for the pointer! I will take a look at the code in dir.c.

> >> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> >> new file mode 100755
> >> index 0000000000..fccf44e829
...
> >> +       cat >.git/info/sparse-checkout <<-EOF &&
> >> +       /*
> >> +       !/b
> >> +       !/dir
> >> +       EOF
> >> +       git sparse-checkout init &&
> >
> > Using `git sparse-checkout init` but then manually writing to
> > .git/info/sparse-checkout?  Seems like it'd make more sense to use
> > `git sparse-checkout set` than writing the patterns directly yourself.
> > Also, would prefer to have the examples use cone mode (even if you
> > have to add subdirectories), as it makes the testcase a bit easier to
> > read and more performant, though neither is a big deal.
>
> I agree that we should use the builtin so your test script is less
> brittle to potential back-end changes to sparse-checkout (none planned).

Makes sense!

> I do recommend having at least one test with non-cone mode patterns,
> especially if you are checking the pattern-matching yourself instead of
> relying on the index.

OK, I will leave at least one test with non-cone patterns then. Thanks
for the comments!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 1/3] doc: grep: unify info on configuration variables
  2020-03-24 21:26     ` Junio C Hamano
@ 2020-03-24 23:38       ` Matheus Tavares
  0 siblings, 0 replies; 26+ messages in thread
From: Matheus Tavares @ 2020-03-24 23:38 UTC (permalink / raw)
  To: gitster; +Cc: dstolee, git, matheus.bernardino, newren, sandals

On Tue, Mar 24, 2020 at 6:26 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> >> diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
> >> index 44abe45a7c..76689771aa 100644
> >> --- a/Documentation/config/grep.txt
> >> +++ b/Documentation/config/grep.txt
> >> @@ -16,8 +16,11 @@ grep.extendedRegexp::
> >> ...
> >> +       Number of grep worker threads to use. See `--threads` in
> >> +       linkgit:git-grep[1] for more information.
> >> ...
> >> diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
> >> index ddb6acc025..97e25d7b1b 100644
> >> --- a/Documentation/git-grep.txt
> >> +++ b/Documentation/git-grep.txt
> >> @@ -41,34 +41,7 @@ characters.  An empty string as search expression matches all lines.
> >> ...
> >> +include::config/grep.txt[]
> >> ...
> >>  --threads <num>::
> >> -       Number of grep worker threads to use.
> >> -       See `grep.threads` in 'CONFIGURATION' for more information.
> >> +       Number of grep worker threads to use. If not provided (or set to
> >> +       0), Git will use as many worker threads as the number of logical
> >> +       cores available. The default value can also be set with the
> >> +       `grep.threads` configuration (see linkgit:git-config[1]).
> >
> > I'm possibly showing my ignorance here, but doesn't the
> > "include::config/grep.txt[]" you added above mean that the user
> > doesn't have to see an external manpage but can see the definition
> > earlier within this same manpage?

You are right. I added the "(see linkgit:git-config[1])" here more as a
reference to the config system itself (for a user that is possibly not familiar
with git-config). But if this is not necessary, we can remove the reference.

> I think so.  Also, the new reference "See `--threads` in git-grep"
> added to grep.threads to config/grep.txt would become somewhat
> redundant in the context of "git grep --help" (only "See --threads"
> is relevant when it appears in this same manual page).

Thanks for pointing that out. I think we can solve this issue with the
following:

diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
index c1d49484c8..ac06db4206 100644
--- a/Documentation/config/grep.txt
+++ b/Documentation/config/grep.txt
@@ -16,8 +16,11 @@ grep.extendedRegexp::
 	other than 'default'.

 grep.threads::
-	Number of grep worker threads to use. See `--threads` in
-	linkgit:git-grep[1] for more information.
+	Number of grep worker threads to use. See `--threads`
+ifndef::git-grep[]
+	in linkgit:git-grep[1]
+endif::git-grep[]
+	for more information.

 grep.fullName::
 	If set to true, enable `--full-name` option by default.
diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index 5c5c66c056..192aab4cba 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -41,6 +41,7 @@ characters.  An empty string as search expression matches all lines.
 CONFIGURATION
 -------------

+:git-grep: 1
 include::config/grep.txt[]

 OPTIONS

I will add these changes in v2.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-24 19:07       ` Elijah Newren
@ 2020-03-25 20:18         ` Junio C Hamano
  0 siblings, 0 replies; 26+ messages in thread
From: Junio C Hamano @ 2020-03-25 20:18 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Matheus Tavares, Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc

Elijah Newren <newren@gmail.com> writes:

> It sounds like you are describing partial clone rather than sparse
> checkout?  Or perhaps you're trying to blur the distinction,
> suggesting the two should be used together, with the partial clone
> machinery learning to download history within the specified sparse
> cones?

Yeah, I guess it is a little bit of both ;-)

>>  Regardless of the choice of the default, it would be a good
>> idea to make the subcommands consistently offer the same default and
>> allow the non-default views with the same UI.
>
> Agreed.

Yup, thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-24  7:54   ` Elijah Newren
  2020-03-24 18:30     ` Junio C Hamano
@ 2020-03-25 23:15     ` Matheus Tavares Bernardino
  2020-03-26  6:02       ` Elijah Newren
  1 sibling, 1 reply; 26+ messages in thread
From: Matheus Tavares Bernardino @ 2020-03-25 23:15 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc, Junio C Hamano

On Tue, Mar 24, 2020 at 4:55 AM Elijah Newren <newren@gmail.com> wrote:
>
> On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
> <matheus.bernardino@usp.br> wrote:
>
> > Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> > ---
> >
> > Note: I still have to make --ignore-sparsity be able to work together
> > with --untracked. Unfortunatelly, this won't be as simple because the
> > codeflow taken by --untracked goes to grep_directory() which just
> > iterates the working tree, without looking the index entries. So I will
> > have to either: make --untracked use grep_cache(), and grep the
> > untracked files later; or try matching the working tree paths against
> > the sparsity patterns, without looking for the skip_worktree bit in
> > the index (as I mentioned in the previous patch's comments). Any
> > preferences regarding these two approaches? (or other suggestions?)
>
> Hmm.  So, 'tracked' in git is the idea that we are keeping information
> about specific files.  'sparse-checkout' is the idea that we have a
> subset of those that we can work with without materializing all the
> other tracked files; it's clearly a subset of the realm of 'tracked'.
> 'untracked' is about getting everything outside the set of 'tracked'
> files, which to me means it is clearly outside the set of sparsity
> paths too (and thus you could take --untracked as implying
> --ignore-sparsity, though whether you do might not matter in practice
> because of the items I'll discuss next). Of course, I am also
> assuming `--untracked` is incompatible with --cached or specifying
> revisions or trees (based on it's definiton of "In addition to
> searching in the tracked files in the *working tree*, search also in
> untracked files." -- emphasis added.)

Hm, I see the point now, but I'm still a little confused: The "in the
working tree" section of the definition would exclude non checked out
files, right? However, git-grep's description says "Look for specified
patterns in the tracked files *in the work tree*", and it still
searches non checked out files (loading them from the cache, even when
--cache is not given). I know that's exactly what we are trying to
change with this patchset, but we will still give the
--ignore-sparsity option to allow the old behavior when needed (unless
we prohibit using --ignore-sparsity without --cached or $REV). I guess
my doubt is whether the problem is in the implementation of the
working tree grep, which considers non checked out files, or in the
docs, which say "tracked files *in the work tree*".

I tend to go with the latter, since using `git grep --ignore-sparsity`
in a sparse checked out working tree, to grep not present files as
well, kind of makes sense to me. And if the problem is indeed in the
docs, then I think we should also allow --ignore-sparsity when
grepping with --untracked, since it's an analogous case.

> If the incompatibility of
> --untracked and --cached/REVSIONS/TREES is not enforced, we may want
> to look into erroring out if they are given together.  Once we do, we
> don't have to worry about grep_cache() at all in the case of
> --untracked and shouldn't.  Files with the skip_worktree bit won't
> exist in the working directory, and thus won't be searched (this is
> what makes --untracked imply --ignore-sparsity not really matter).
>
> In short: With --untracked you are grepping ALL (non-ignored) files in
> the working directory -- either because they are both tracked and in
> the sparsity paths (anything tracked that isn't in the sparsity paths
> has the skip_worktree bit and thus isn't present), or because it is an
> untracked file.  [And this may be what grep_directory() already does.]
>
> Does that make sense?

It does, and thanks for a very detailed explanation. But as I
mentioned before, I'm a little uncertain about --untracked implying
--ignore-spasity. The commit that added --untracked (0a93fb8) says:

"grep --untracked" would find the specified patterns from files in
untracked files in addition to its usual behaviour of finding them in
the tracked files

So, in my mind, it feels like --untracked wasn't meant to limit the
search to "all non-ignored files in the working directory", but to add
untracked files to the search (which could also contain tracked but
non checked out files). Wouldn't the "all non-ignored files in the
working directory" case be the use of --no-index?

> > diff --git a/builtin/grep.c b/builtin/grep.c
> > index 52ec72a036..17eae3edd6 100644
> > --- a/builtin/grep.c
> > +++ b/builtin/grep.c
...
> >
> > @@ -487,7 +492,7 @@ static int grep_cache(struct grep_opt *opt,
> >         for (nr = 0; nr < repo->index->cache_nr; nr++) {
> >                 const struct cache_entry *ce = repo->index->cache[nr];
> >
> > -               if (ce_skip_worktree(ce))
> > +               if (!ignore_sparsity && ce_skip_worktree(ce))
>
> Oh boy on the double negatives...maybe we want to rename this flag somehow?

Yeah, I also thought about that, but couldn't come up with a better
name myself... My alternatives were all too verbose.

...
> I'm super excited to see work in this area.  I hope I'm not
> discouraging you by attempting to provide what I think is the bigger
> picture I'd like us to work towards.

Not at all! :) Thanks a lot for the bigger picture and other
explanations. They help me understand the long-term goals and make
better decisions now.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-25 23:15     ` Matheus Tavares Bernardino
@ 2020-03-26  6:02       ` Elijah Newren
  2020-03-27 15:51         ` Junio C Hamano
  2020-03-30  1:12         ` Matheus Tavares Bernardino
  0 siblings, 2 replies; 26+ messages in thread
From: Elijah Newren @ 2020-03-26  6:02 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc, Junio C Hamano

Hi Matheus!

On Wed, Mar 25, 2020 at 4:15 PM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> On Tue, Mar 24, 2020 at 4:55 AM Elijah Newren <newren@gmail.com> wrote:
> >
> > On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
> > <matheus.bernardino@usp.br> wrote:
> >
> > > Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> > > ---
> > >
> > > Note: I still have to make --ignore-sparsity be able to work together
> > > with --untracked. Unfortunatelly, this won't be as simple because the
> > > codeflow taken by --untracked goes to grep_directory() which just
> > > iterates the working tree, without looking the index entries. So I will
> > > have to either: make --untracked use grep_cache(), and grep the
> > > untracked files later; or try matching the working tree paths against
> > > the sparsity patterns, without looking for the skip_worktree bit in
> > > the index (as I mentioned in the previous patch's comments). Any
> > > preferences regarding these two approaches? (or other suggestions?)
> >
> > Hmm.  So, 'tracked' in git is the idea that we are keeping information
> > about specific files.  'sparse-checkout' is the idea that we have a
> > subset of those that we can work with without materializing all the
> > other tracked files; it's clearly a subset of the realm of 'tracked'.
> > 'untracked' is about getting everything outside the set of 'tracked'
> > files, which to me means it is clearly outside the set of sparsity
> > paths too (and thus you could take --untracked as implying
> > --ignore-sparsity, though whether you do might not matter in practice
> > because of the items I'll discuss next). Of course, I am also
> > assuming `--untracked` is incompatible with --cached or specifying
> > revisions or trees (based on it's definiton of "In addition to
> > searching in the tracked files in the *working tree*, search also in
> > untracked files." -- emphasis added.)
>
> Hm, I see the point now, but I'm still a little confused: The "in the
> working tree" section of the definition would exclude non checked out
> files, right? However, git-grep's description says "Look for specified
> patterns in the tracked files *in the work tree*", and it still
> searches non checked out files (loading them from the cache, even when
> --cache is not given). I know that's exactly what we are trying to

I really respect Duy and he does some amazing work and I wish he were
still active in git, but the SKIP_WORKTREE stuff wasn't his best work
and even he downplayed it: "In my defense it was one of my first
contribution when I was naiver...I'd love to hear how sparse checkout
could be improved, or even replaced."[0]

I've seen enough egregiously confusing cases and enough
difficult-to-recover-from cases with the implementation of the
SKIP_WORKTREE handling that I think it is dangerous to assume behavior
you see with it is intended design.  A year and a half ago, I read all
available docs to figure out how to sparsify and de-sparsify, and read
them several times but was still confused.  If I could only figure it
out with great difficulty, a lot of google searching, and even trying
to look at the code, what chance did "normal" users stand?  To add
more flavor to that argument, let me cite [1] (the three paragraphs
starting with "Playing with sparse-checkout, it feels to me like a
half-baked feature"), [2], as well as good chunks of [3], [4], and
[5].

[0] https://lore.kernel.org/git/CACsJy8ArUXD0cF2vQAVnzM_AGto2k2yQTFuTO7PhP4ffHM8dVQ@mail.gmail.com/
[1] https://lore.kernel.org/git/CABPp-BFKf2N6TYzCCneRwWUektMzRMnHLZ8JT64q=MGj5WQZkA@mail.gmail.com/
[2] https://lore.kernel.org/git/CABPp-BGE-m_UFfUt_moXG-YR=ZW8hMzMwraD7fkFV-+sEHw36w@mail.gmail.com/
[3] https://lore.kernel.org/git/pull.316.git.gitgitgadget@gmail.com/
[4] https://lore.kernel.org/git/pull.513.git.1579029962.gitgitgadget@gmail.com/
[5] https://lore.kernel.org/git/a46439c8536f912ad4a1e1751852cf477d3d7dc7.1584813609.git.gitgitgadget@gmail.com/

But let me try to explain it all below from first principles in a way
that will hopefully make sense why falling back to loading from the
cache when --cached is not given is just flat wrong.  The explanation
from first principles should also help explain --untracked a bit
better, and when there are decisions about whether to use sparsity
patterns.

> change with this patchset, but we will still give the
> --ignore-sparsity option to allow the old behavior when needed (unless
> we prohibit using --ignore-sparsity without --cached or $REV). I guess
> my doubt is whether the problem is in the implementation of the
> working tree grep, which considers non checked out files, or in the
> docs, which say "tracked files *in the work tree*".
>
> I tend to go with the latter, since using `git grep --ignore-sparsity`
> in a sparse checked out working tree, to grep not present files as
> well, kind of makes sense to me. And if the problem is indeed in the
> docs, then I think we should also allow --ignore-sparsity when
> grepping with --untracked, since it's an analogous case.

It's probably not a surprise to you given what I've already said above
to hear me say that the docs are correct in this case.  But not only
are the docs correct, I'll go even further and claim that falling back
to the cache when --cached is not passed is indefensible and leads to
surprises and contradictions.  But instead of just claiming that, let
me try to spell out a bit better why I believe that from first
principles, though:

There were previously three types of files for git:
  * tracked
  * ignored
  * untracked
where:
  * tracked was defined as "recorded in index"
  * ignored was defined as "a file which is not tracked and which
matches an ignore rule (.gitignore, .git/info/exclude, etc.)"
  * untracked was defined as "all other files present in the working directory".
With the SKIP_WORKTREE bit and sparse-checkouts, we actually have four
types because we split the "tracked" category into two:
  * tracked and matches the sparsity patterns (implies it will be
missing from the working directory as the SKIP_WORKTREE bit is set)
  * tracked and does not match the sparsity patterns (implies it will
be present in the working directory, as the SKIP_WORKTREE bit is not
set)
But let's ignore the splitting of the tracked type for a minute as
well as everything else related to sparseness.  Let's just look at how
grep was designed.

git grep has traditionally been about searching "tracked files in the
work tree" as you highlighted (and note that sparsity bits came four
years later in 2009, so cannot undercut that claim).  If the user has
made edits to files and hasn't staged them, grep would search those
working tree files with their edits, not old cached versions of those
files.  People were told that git grep was a great way to just search
relevant stuff (instead of normal grep which would look through build
results and random big files in your working directory that you
weren't even tracking).  Then in 2011 grep gained options like
--untracked to extend the search in the working tree to also include
untracked files, and added --no-exclude-standard (which is "only
useful with --untracked") so that people had a way to search *all*
files in the working tree (tracked, untracked, and ignored files).
(Note: no mechanism was provided for searching tracked and ignored
files without untracked as far as I can tell, though I don't see why
that would make sense.)  git-grep also gained options like --no-index
so that it could be used in a directory that wasn't tracked by git at
all -- it turns out people liked git-grep better than normal grep (I
think it got colorization first?), even for things that weren't being
tracked by git.  But again, all these cases were about searching files
that existed in the working tree.

Of course, people sometimes wanted to search a version other than what
existed in the working tree.  And thus options like --cached or
specifying a REVISION were added early on.

Sometimes, code that wasn't meant to be used together accidentally is
used together or the docs suggest they can be used together.  In 2010,
someone had to clarify that --cached was incompatible with <tree>; not
sure why someone would attempt to use them together, but that's the
type of accident that is easy to have in the implementation or docs
because it doesn't even occur to people who understand the design and
the data structures why anyone would attempt that.  Inevitably,
someone comes along who doesn't understand the underlying data
structures or design or terminology and tries incompatible options
together...and then gets surprised.  (Side note: I think this kind of
issues occurs fairly frequently, so I'm unlikely to assume options
were meant to be supported together based solely on a lack of logic
that would throw an error when both are specified.  We could probably
add a bunch of useful microprojects around checking for flags that
should be incompatible and making sure git throws errors when both are
specified.  We had lots of cases in rebase, for example, where if
users happened to specify two flags then one would just be silently
ignored.)

REVISION and --cached are not just incompatible with each other; each
is incompatible with all three of --untracked, --no-index, and
--no-exclude-standard.  This is because REVISION and --cached are
about picking some version other than what exists in the working tree
to search through, while those other options are all intended for when
we are searching through files in the working tree (and in particular,
exist to extend how many files in the working tree we look through).

One more useful case to consider before we start adding SKIP_WORKTREE
into the mix.  Let's say that you have three files:
   fileA
   fileB
   fileC
and all of them are tracked.  You have made edits to fileA and fileB,
and ran 'rm fileC' (NOT 'git rm fileC', i.e. the deletion is not
staged).  Now, you run 'git grep mystring'.  Quick question: Which
files are searched for 'mystring'?  Well...
  * REVISION and --cached were left out of the git grep command, so
working tree files should be searched, not staged versions or versions
from other commits
  * No flags like --untracked or --no-exclude-standard were included,
so only tracked files in the working tree should be searched
  * There are two files in the working tree, both tracked: fileA and fileB.
So, this searches fileA and fileB.  In particular: NO VERSION of fileC
is searched.  fileC may be tracked/cached, but we don't search any
version of that file, because this particular command line is about
searching the working directory and fileC is not in the working
directory.  To the best of my knowledge, git grep has always behaved
that way.


Users understand the idea of searching the working copy vs. the index
vs. "old" (or different) versions of the repository.  They also
understand that when searching the working copy, by default a subset
of the files are searched.  Tell me: given all this information here,
what possible explanation is there for SKIP_WORKTREE entries to be
translated into searches of the cache when --cached is not specified?
Please square that away with the fact that 'rm fileC' results in fileC
NOT being searched.

It's just completely, utterly wrong.

Also, hopefully this helps answer your question about --untracked and
skip_worktree.  --untracked is only useful when searching through the
working tree, and is entirely about adding the "untracked" category to
the things we search.  The skip_worktree bit is about adding more
granularity to the "tracked" category.  The two are thus entirely
orthogonal and --untracked shouldn't change behavior at all in the
face of sparse checkouts.

And I also think it explains more when the sparsity patterns and
--ignore-sparsity-patterns flags even matter.  The division of working
tree files which were tracked into two subsets (those that match
sparsity patterns and those that don't) didn't matter because only one
of those two sets existed and could be searched.  So the question is,
when can the sparsity pattern divide a set of files into two subsets
where both are non-empty?  And the answer is when --cached or REVISION
is specified.  This is the case Junio recently brought up and said
that there are good reasons users might want to limit to just the
paths that match the sparsity patterns, and other reasons when users
might want to search everything[6].  So, both cases need to be
supported fairly easily, and this will be true for several commands
besides just grep.

[6] https://lore.kernel.org/git/xmqq7dz938sc.fsf@gitster.c.googlers.com/

> > If the incompatibility of
> > --untracked and --cached/REVSIONS/TREES is not enforced, we may want
> > to look into erroring out if they are given together.  Once we do, we
> > don't have to worry about grep_cache() at all in the case of
> > --untracked and shouldn't.  Files with the skip_worktree bit won't
> > exist in the working directory, and thus won't be searched (this is
> > what makes --untracked imply --ignore-sparsity not really matter).
> >
> > In short: With --untracked you are grepping ALL (non-ignored) files in
> > the working directory -- either because they are both tracked and in
> > the sparsity paths (anything tracked that isn't in the sparsity paths
> > has the skip_worktree bit and thus isn't present), or because it is an
> > untracked file.  [And this may be what grep_directory() already does.]
> >
> > Does that make sense?
>
> It does, and thanks for a very detailed explanation. But as I
> mentioned before, I'm a little uncertain about --untracked implying
> --ignore-sparsity. The commit that added --untracked (0a93fb8) says:
>
> "grep --untracked" would find the specified patterns from files in
> untracked files in addition to its usual behaviour of finding them in
> the tracked files
>
> So, in my mind, it feels like --untracked wasn't meant to limit the
> search to "all non-ignored files in the working directory", but to add
> untracked files to the search (which could also contain tracked but
> non checked out files). Wouldn't the "all non-ignored files in the
> working directory" case be the use of --no-index?

--no-index is specifically designed for when the directory isn't
tracked by git at all.  It would be equivalent, though, to saying we
wanted to search all files in the working copy regardless of whether
they are tracked, untracked, or ignored, i.e. equivalent to specifying
both --untracked and --no-exclude-standard.

And you were right to be uncertain about --untracked implying
--ignore-sparsity; --untracked is completely orthogonal to sparsity.
(However, it wouldn't much matter if it did imply that option or if it
implied its opposite: --untracked implies we are only looking at the
working directory files, and thus we aren't even going to check the
sparsity patterns, we'll just check which files exist in the working
directory.  `git sparse-checkout reapply` will care about the sparsity
patterns and possibly add files to the working copy or remove some,
but grep certainly shouldn't be having a side effect like that; it
should just search the directory as it exists.)

> > > diff --git a/builtin/grep.c b/builtin/grep.c
> > > index 52ec72a036..17eae3edd6 100644
> > > --- a/builtin/grep.c
> > > +++ b/builtin/grep.c
> ...
> > >
> > > @@ -487,7 +492,7 @@ static int grep_cache(struct grep_opt *opt,
> > >         for (nr = 0; nr < repo->index->cache_nr; nr++) {
> > >                 const struct cache_entry *ce = repo->index->cache[nr];
> > >
> > > -               if (ce_skip_worktree(ce))
> > > +               if (!ignore_sparsity && ce_skip_worktree(ce))
> >
> > Oh boy on the double negatives...maybe we want to rename this flag somehow?
>
> Yeah, I also thought about that, but couldn't come up with a better
> name myself... My alternatives were all too verbose.
>
> ...
> > I'm super excited to see work in this area.  I hope I'm not
> > discouraging you by attempting to provide what I think is the bigger
> > picture I'd like us to work towards.
>
> Not at all! :) Thanks a lot for the bigger picture and other
> explanations. They help me understand the long-term goals and make
> better decisions now.

Hope this email helps too.  I've composed it over about 4 different
sessions with various interruptions, so there's a good chance all my
edits and loss of train of thought might have made something murky.
Let me know which part(s) are confusing and I'll try to clarify.

Elijah

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-26  6:02       ` Elijah Newren
@ 2020-03-27 15:51         ` Junio C Hamano
  2020-03-27 19:01           ` Elijah Newren
  2020-03-30  1:12         ` Matheus Tavares Bernardino
  1 sibling, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2020-03-27 15:51 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Matheus Tavares Bernardino, Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc

Elijah Newren <newren@gmail.com> writes:

> Sometimes, code that wasn't meant to be used together accidentally is
> used together or the docs suggest they can be used together.  ...
> ... but that's the
> type of accident that is easy to have in the implementation or docs
> because it doesn't even occur to people who understand the design and
> the data structures why anyone would attempt that.

The above is not limited to "git grep", but you said so clearly what
I have felt, without being able to express myself in a satisfactory
manner, for the last 10 years.

> ... (Side note: I think this kind of
> issues occurs fairly frequently, so I'm unlikely to assume options
> were meant to be supported together based solely on a lack of logic
> that would throw an error when both are specified.

Amen to that.

By the way, and I am so sorry to making the main issue of the
discussion into a mere "by the way" point, but if I understand your
message correctly, the primary conclusion in there is that a file
that is not in the working tree, if the sparsity pattern tells us
that it should not be checked out to the working tree, should not be
sought in the index instead.  I think I agree with that conclusion.

I however have some disagreement on a minor point, though.

"git grep -e '<pattern>' master" looks for the pattern in the commit
at the tip of the master branch.  "git grep -e '<pattern>' master
pu" does so in these two commits.  I do not think it is conceptually
wrong to allow "git grep -e '<pattern>' --cached master pu" to look
for three "commits", i.e. those two commits that already exist, plus
the one you would be creating if you were to "git commit" right now.
Similarly, I do not see a reason why we should forbid looking for
the same pattern in the tracked files in the working tree at the
same time we check tree object(s) and/or the index.

At least in principle.

There are two practical issues that makes these combinations
problematic, but I do not think they are insurmountable.

 - Once you give an object on the command line, there is no syntax
   to let you say "oh, by the way, I want the working tree as well".
   If you are looking in the index, the working tree, and optionally
   in some objects, "--index" instead of "--cached" would be the
   standard way to tell the command "I want to affect both the index
   and the working tree", but there is no way to say "I want only
   tracked files in the working tree and these objects searched".
   We'd need a new syntax to express it if we wanted to allow the
   combination.

 - The lines found in the working tree and in the index are prefixed
   by the filename, while they are prefixed by the tree's name and a
   colon.  When output for the working tree and the index are
   combined, we cannot tell where each hit came from.  We need to
   change the output to allow us to tell them apart, by
   e.g. prefixing "<worktree>:" and "<index>:" in a way similar to
   we use "<revision>:".

Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-27 15:51         ` Junio C Hamano
@ 2020-03-27 19:01           ` Elijah Newren
  0 siblings, 0 replies; 26+ messages in thread
From: Elijah Newren @ 2020-03-27 19:01 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Matheus Tavares Bernardino, Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc

On Fri, Mar 27, 2020 at 8:51 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > Sometimes, code that wasn't meant to be used together accidentally is
> > used together or the docs suggest they can be used together.  ...
> > ... but that's the
> > type of accident that is easy to have in the implementation or docs
> > because it doesn't even occur to people who understand the design and
> > the data structures why anyone would attempt that.
>
> The above is not limited to "git grep", but you said so clearly what
> I have felt, without being able to express myself in a satisfactory
> manner, for the last 10 years.
>
> > ... (Side note: I think this kind of
> > issues occurs fairly frequently, so I'm unlikely to assume options
> > were meant to be supported together based solely on a lack of logic
> > that would throw an error when both are specified.
>
> Amen to that.
>
> By the way, and I am so sorry to making the main issue of the
> discussion into a mere "by the way" point, but if I understand your
> message correctly, the primary conclusion in there is that a file
> that is not in the working tree, if the sparsity pattern tells us
> that it should not be checked out to the working tree, should not be
> sought in the index instead.  I think I agree with that conclusion.

Cool.

> I however have some disagreement on a minor point, though.
>
> "git grep -e '<pattern>' master" looks for the pattern in the commit
> at the tip of the master branch.  "git grep -e '<pattern>' master
> pu" does so in these two commits.  I do not think it is conceptually
> wrong to allow "git grep -e '<pattern>' --cached master pu" to look
> for three "commits", i.e. those two commits that already exist, plus
> the one you would be creating if you were to "git commit" right now.
> Similarly, I do not see a reason why we should forbid looking for
> the same pattern in the tracked files in the working tree at the
> same time we check tree object(s) and/or the index.
>
> At least in principle.
>
> There are two practical issues that makes these combinations
> problematic, but I do not think they are insurmountable.
>
>  - Once you give an object on the command line, there is no syntax
>    to let you say "oh, by the way, I want the working tree as well".
>    If you are looking in the index, the working tree, and optionally
>    in some objects, "--index" instead of "--cached" would be the
>    standard way to tell the command "I want to affect both the index
>    and the working tree", but there is no way to say "I want only
>    tracked files in the working tree and these objects searched".
>    We'd need a new syntax to express it if we wanted to allow the
>    combination.
>
>  - The lines found in the working tree and in the index are prefixed
>    by the filename, while they are prefixed by the tree's name and a
>    colon.  When output for the working tree and the index are
>    combined, we cannot tell where each hit came from.  We need to
>    change the output to allow us to tell them apart, by
>    e.g. prefixing "<worktree>:" and "<index>:" in a way similar to
>    we use "<revision>:".
>
> Thanks.

Ah, so you're saying that even though --cached and REVISION are
incompatible today, that's not fundamental and we could conceivably
let them or even more options be used together in the future and you
even highlight how it could be made to sensibly work.  I agree with
what you say here: _if_ there is a way for users to explicitly specify
that they want to search multiple versions (whether that is revisions
or the index or the working tree), _and_ we have a way to distinguish
which version we found the results from, then (and only then) it'd
make sense to search the complete set of files from each of those
versions and show the results for the matches we found.

That differs in multiple important ways from the SKIP_WORKTREE
behavior I was railing against, and I think what you propose as a
possibility in contrast would make sense.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-26  6:02       ` Elijah Newren
  2020-03-27 15:51         ` Junio C Hamano
@ 2020-03-30  1:12         ` Matheus Tavares Bernardino
  2020-03-31 16:48           ` Elijah Newren
  1 sibling, 1 reply; 26+ messages in thread
From: Matheus Tavares Bernardino @ 2020-03-30  1:12 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc, Junio C Hamano

On Thu, Mar 26, 2020 at 3:02 AM Elijah Newren <newren@gmail.com> wrote:
>
> Hi Matheus!

Hi, Elijah.

First of all, thanks for taking the time to go over these topics in
great detail. I must say it's much clearer for me now.

> On Wed, Mar 25, 2020 at 4:15 PM Matheus Tavares Bernardino
> <matheus.bernardino@usp.br> wrote:
> >
[...]
> One more useful case to consider before we start adding SKIP_WORKTREE
> into the mix.  Let's say that you have three files:
>    fileA
>   fileB
>    fileC
> and all of them are tracked.  You have made edits to fileA and fileB,
> and ran 'rm fileC' (NOT 'git rm fileC', i.e. the deletion is not
> staged).  Now, you run 'git grep mystring'.  Quick question: Which
> files are searched for 'mystring'?  Well...
>   * REVISION and --cached were left out of the git grep command, so
> working tree files should be searched, not staged versions or versions
> from other commits
>  * No flags like --untracked or --no-exclude-standard were included,
> so only tracked files in the working tree should be searched
>   * There are two files in the working tree, both tracked: fileA and fileB.
> So, this searches fileA and fileB.  In particular: NO VERSION of fileC
> is searched.  fileC may be tracked/cached, but we don't search any
> version of that file, because this particular command line is about
> searching the working directory and fileC is not in the working
> directory.  To the best of my knowledge, git grep has always behaved
> that way.
>
> Users understand the idea of searching the working copy vs. the index
> vs. "old" (or different) versions of the repository.  They also
> understand that when searching the working copy, by default a subset
> of the files are searched.  Tell me: given all this information here,
> what possible explanation is there for SKIP_WORKTREE entries to be
> translated into searches of the cache when --cached is not specified?
> Please square that away with the fact that 'rm fileC' results in fileC
> NOT being searched.
>
> It's just completely, utterly wrong.

Makes sense, thanks. I agree that we shouldn't fall back to the cache
when searching the working tree.

> Also, hopefully this helps answer your question about --untracked and
> skip_worktree.  --untracked is only useful when searching through the
> working tree, and is entirely about adding the "untracked" category to
> the things we search.  The skip_worktree bit is about adding more
> granularity to the "tracked" category.  The two are thus entirely
> orthogonal and --untracked shouldn't change behavior at all in the
> face of sparse checkouts.

Thanks, your explanation clarified the issue I had. I see now why
--untracked and --ignore-sparsity don't make sense together.

It also made me think about the combination of --cached and
--untracked which, IIUC, should be prohibited. I will add a patch in
v2, making git-grep error out in this case.

> And I also think it explains more when the sparsity patterns and
> --ignore-sparsity-patterns flags even matter.  The division of working
> tree files which were tracked into two subsets (those that match
> sparsity patterns and those that don't) didn't matter because only one
> of those two sets existed and could be searched.  So the question is,
> when can the sparsity pattern divide a set of files into two subsets
> where both are non-empty?  And the answer is when --cached or REVISION
> is specified.

Makes sense. I will add in --ignore-sparsity's description that it is
only relevant with --cached or REVISION, as you previously suggested.
When it is used outside of these cases, though, I think we could just
warn that --ignore-sparsity will be discarded (to avoid erroring out
when users have grep.ignoreSparsity enabled).

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-24 18:30     ` Junio C Hamano
  2020-03-24 19:07       ` Elijah Newren
@ 2020-03-30  3:23       ` Matheus Tavares Bernardino
  2020-03-31 19:12         ` Elijah Newren
  1 sibling, 1 reply; 26+ messages in thread
From: Matheus Tavares Bernardino @ 2020-03-30  3:23 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Elijah Newren, Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc

On Tue, Mar 24, 2020 at 3:30 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
> > <matheus.bernardino@usp.br> wrote:
> >>
> >> In the last commit, git-grep learned to honor sparsity patterns. For
> >> some use cases, however, it may be desirable to search outside the
> >> sparse checkout. So add the '--ignore-sparsity' option, which restores
> >> the old behavior. Also add the grep.ignoreSparsity configuration, to
> >> allow setting this behavior by default.
> >
> > Should `--ignore-sparsity` be a global git option rather than a
> > grep-specific one?  Also, should grep.ignoreSparsity rather be
> > core.ignoreSparsity or core.searchOutsideSparsePaths or something?
>
> Great question.  I think "git diff" with various options would also
> want to optionally be able to be confined within the sparse cone, or
> checking the entire world by lazily fetching outside the sparsity.
[...]
> Regardless of the choice of the default, it would be a good
> idea to make the subcommands consistently offer the same default and
> allow the non-default views with the same UI.

Yeah, it seems like a sensible path. Regarding implementation, there
is the question that Elijah raised, of whether to use a global git
option or separate but consistent options for each subcommand. I don't
have much experience with sparse checkout to argument for one or
another, so I would like to hear what others have to say about it.

A question that comes to my mind regarding the global git option is:
will --ignore-sparsity (or whichever name we choose for it [1]) be
sufficient for all subcommands? Or may some of them require additional
options for command-specific behaviors concerning sparsity patterns?
Also, would it be OK if we just ignored the option in commands that do
not operate differently in sparse checkouts (maybe, fetch, branch and
send-email, for example)? And would it make sense to allow
constructions such as `git --ignore-sparsity checkout` or even `git
--ignore-sparsity sparse-checkout ...`?

[1]: Does anyone have suggestions for the option/config name? The best
I could come up with so far (without being too verbose) is
--no-sparsity-constraints. But I fear this might sound generic. As
Elijah already mentioned, --ignore-sparsity is not good either, as it
introduces double negatives in code...

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-30  1:12         ` Matheus Tavares Bernardino
@ 2020-03-31 16:48           ` Elijah Newren
  0 siblings, 0 replies; 26+ messages in thread
From: Elijah Newren @ 2020-03-31 16:48 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc, Junio C Hamano

On Sun, Mar 29, 2020 at 6:13 PM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> On Thu, Mar 26, 2020 at 3:02 AM Elijah Newren <newren@gmail.com> wrote:
> >
> > Hi Matheus!
>
> Hi, Elijah.
>
> First of all, thanks for taking the time to go over these topics in
> great detail. I must say it's much clearer for me now.
>
> > On Wed, Mar 25, 2020 at 4:15 PM Matheus Tavares Bernardino
> > <matheus.bernardino@usp.br> wrote:
> > >
> [...]
> > One more useful case to consider before we start adding SKIP_WORKTREE
> > into the mix.  Let's say that you have three files:
> >    fileA
> >   fileB
> >    fileC
> > and all of them are tracked.  You have made edits to fileA and fileB,
> > and ran 'rm fileC' (NOT 'git rm fileC', i.e. the deletion is not
> > staged).  Now, you run 'git grep mystring'.  Quick question: Which
> > files are searched for 'mystring'?  Well...
> >   * REVISION and --cached were left out of the git grep command, so
> > working tree files should be searched, not staged versions or versions
> > from other commits
> >  * No flags like --untracked or --no-exclude-standard were included,
> > so only tracked files in the working tree should be searched
> >   * There are two files in the working tree, both tracked: fileA and fileB.
> > So, this searches fileA and fileB.  In particular: NO VERSION of fileC
> > is searched.  fileC may be tracked/cached, but we don't search any
> > version of that file, because this particular command line is about
> > searching the working directory and fileC is not in the working
> > directory.  To the best of my knowledge, git grep has always behaved
> > that way.
> >
> > Users understand the idea of searching the working copy vs. the index
> > vs. "old" (or different) versions of the repository.  They also
> > understand that when searching the working copy, by default a subset
> > of the files are searched.  Tell me: given all this information here,
> > what possible explanation is there for SKIP_WORKTREE entries to be
> > translated into searches of the cache when --cached is not specified?
> > Please square that away with the fact that 'rm fileC' results in fileC
> > NOT being searched.
> >
> > It's just completely, utterly wrong.
>
> Makes sense, thanks. I agree that we shouldn't fall back to the cache
> when searching the working tree.
>
> > Also, hopefully this helps answer your question about --untracked and
> > skip_worktree.  --untracked is only useful when searching through the
> > working tree, and is entirely about adding the "untracked" category to
> > the things we search.  The skip_worktree bit is about adding more
> > granularity to the "tracked" category.  The two are thus entirely
> > orthogonal and --untracked shouldn't change behavior at all in the
> > face of sparse checkouts.
>
> Thanks, your explanation clarified the issue I had. I see now why
> --untracked and --ignore-sparsity don't make sense together.
>
> It also made me think about the combination of --cached and
> --untracked which, IIUC, should be prohibited. I will add a patch in
> v2, making git-grep error out in this case.
>
> > And I also think it explains more when the sparsity patterns and
> > --ignore-sparsity-patterns flags even matter.  The division of working
> > tree files which were tracked into two subsets (those that match
> > sparsity patterns and those that don't) didn't matter because only one
> > of those two sets existed and could be searched.  So the question is,
> > when can the sparsity pattern divide a set of files into two subsets
> > where both are non-empty?  And the answer is when --cached or REVISION
> > is specified.
>
> Makes sense. I will add in --ignore-sparsity's description that it is
> only relevant with --cached or REVISION, as you previously suggested.
> When it is used outside of these cases, though, I think we could just
> warn that --ignore-sparsity will be discarded (to avoid erroring out
> when users have grep.ignoreSparsity enabled).

Not grep.ignoreSparsity but core.ignoreSparsity or core.$WHATEVER  ;-)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-30  3:23       ` Matheus Tavares Bernardino
@ 2020-03-31 19:12         ` Elijah Newren
  2020-03-31 20:02           ` Derrick Stolee
  0 siblings, 1 reply; 26+ messages in thread
From: Elijah Newren @ 2020-03-31 19:12 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: Junio C Hamano, Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc, Jonathan Tan

// adding Jonathan Tan to cc based on the fact that we keep bringing
up partial clones and how it relates...

On Sun, Mar 29, 2020 at 8:23 PM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> On Tue, Mar 24, 2020 at 3:30 PM Junio C Hamano <gitster@pobox.com> wrote:
> >
> > Elijah Newren <newren@gmail.com> writes:
> >
> > > On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
> > > <matheus.bernardino@usp.br> wrote:
> > >>
> > >> In the last commit, git-grep learned to honor sparsity patterns. For
> > >> some use cases, however, it may be desirable to search outside the
> > >> sparse checkout. So add the '--ignore-sparsity' option, which restores
> > >> the old behavior. Also add the grep.ignoreSparsity configuration, to
> > >> allow setting this behavior by default.
> > >
> > > Should `--ignore-sparsity` be a global git option rather than a
> > > grep-specific one?  Also, should grep.ignoreSparsity rather be
> > > core.ignoreSparsity or core.searchOutsideSparsePaths or something?
> >
> > Great question.  I think "git diff" with various options would also
> > want to optionally be able to be confined within the sparse cone, or
> > checking the entire world by lazily fetching outside the sparsity.
> [...]
> > Regardless of the choice of the default, it would be a good
> > idea to make the subcommands consistently offer the same default and
> > allow the non-default views with the same UI.
>
> Yeah, it seems like a sensible path. Regarding implementation, there
> is the question that Elijah raised, of whether to use a global git
> option or separate but consistent options for each subcommand. I don't
> have much experience with sparse checkout to argument for one or
> another, so I would like to hear what others have to say about it.
>
> A question that comes to my mind regarding the global git option is:
> will --ignore-sparsity (or whichever name we choose for it [1]) be
> sufficient for all subcommands? Or may some of them require additional
> options for command-specific behaviors concerning sparsity patterns?
> Also, would it be OK if we just ignored the option in commands that do
> not operate differently in sparse checkouts (maybe, fetch, branch and
> send-email, for example)? And would it make sense to allow
> constructions such as `git --ignore-sparsity checkout` or even `git
> --ignore-sparsity sparse-checkout ...`?

I think the same option would probably be sufficient for all
subcommands, though I have a minor question about the merge machinery
(below).  And generally, I think it would be unusual for people to
pass the command line flag; I suspect most would set a config option
for most cases and then only occasionally override it on the command
line.  Since that config option would always be set, I'd expect
commands that are unaffected to just ignore it (much like both "git -c
merge.detectRenames=true fetch" and "git --work-tree=othertree fetch"
will both ignore the irrelevant options rather than trying to detect
that they were specified and error out).

> [1]: Does anyone have suggestions for the option/config name? The best
> I could come up with so far (without being too verbose) is
> --no-sparsity-constraints. But I fear this might sound generic. As
> Elijah already mentioned, --ignore-sparsity is not good either, as it
> introduces double negatives in code...

Does verbosity matter that much?  I think people would set it in
config, and tab completion would make it pretty easy to complete in
any event.

Anyway, maybe it will help if I provide a very rough first draft of
what changes we could introduce to Documentation/config/core.txt, and
then ask a bunch of my own questions about it below:

"""
core.restrictToSparsePaths::
        Only meaningful in conjuntion with core.sparseCheckoutCone.
        This option extends sparse checkouts (which limit which paths
        are written to the worktree), so that output and operations
        are also limited to the sparsity paths where possible and
        implemented.  The purpose of this option is to (1) focus
        output for the user on the portion of the repository that is
        of interest to them, and (2) enable potentially dramatic
        performance improvements, especially in conjunction with
        partial clones.
+
When this option is true, git commands such as log, diff, and grep may
limit their output to the directories specified by the sparse cone, or
to the intersection of those paths and any (like `*.c) that the user
might also specify on the command line.  (Note that this limit for
diff and grep only becomes relevant with --cached or when specifying a
REVISION, since a search of the working tree will automatically be
limited to the sparse paths that are present.)  Also, commands like
bisect may only select commits which modify paths within the sparsity
cone.  The merge machinery may use the sparse paths as a heuristic to
avoid trying to detect renames from within the sparsity cone to
outside the sparsity cone when at least one side of history only
touches paths within the sparsity cone (this can make the merge
machinery faster, but may risk modify/delete conflicts since upstream
can rename a file within the sparsity paths to a location outside
them).  Commands which export, integrity check, or create history will
always operate on full trees (e.g. fast-export, format-patch, fsck,
commit, etc.), unaffected by any sparsity patterns.
"""

Several questions here, of course:

  * do people like or hate the name?  indifferent?  have alternate ideas?
  * should we restrict this to core.sparseCheckoutCone as I suggested
above or also allow people to do it with core.sparseCheckout without
the cone mode?  I think attempting to weld partial clones together
with core.sparseCheckout is crazy, so I'm tempted to just make it be
specific to cone mode and to push people to use it.  But I'm
interested in thoughts on the matter.
  * should worktrees be affected?  (I've been an advocate of new
worktrees inheriting the sparse patterns of the worktree in use at the
time the new worktree was created.  Junio once suggested he didn't
like that and that worktrees should start out dense.  That seems
problematic to me in big repos with partial clones and sparse chckouts
in use.  Perhaps dense new worktrees is the behavior you get when
core.restrictToSparsePaths is false?)
  * does my idea for the merge machinery make folks uncomfortable?
Should that be a different option?  Being able to do trivial *tree*
merges for the huge portion of the tree outside the sparsity paths
would be a huge win, especially with partial clones, but it certainly
is different.  Then again, microsoft has disabled rename detection
entirely based on it being too expensive, so perhaps the idea of
rename-detection-within-your-cone-if-you-really-didn't-modify-anything-outside-the-cone-on-your-side-of-history
is a reasonable middle ground between off and on for rename detection.
  * what should the default be?  Junio suggested elsewhere[1] that
sparse-checkouts and partial clones should probably be welded together
(with partial clones downloading just history in the sparsity paths by
default), in which case having this option be true would be useful.
But it may also be slightly weird because it'll probably take us a
while to implement this; while the big warning in
git-sparse-checkout.txt certainly allows this:
        THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR, AND THE BEHAVIOR OF OTHER
        COMMANDS IN THE PRESENCE OF SPARSE-CHECKOUTS, WILL LIKELY CHANGE IN
        THE FUTURE.
It may still be slightly weird that the default behavior of commands
in the presence of sparse-checkouts changes release to release until
we get it all implemented.

[1] https://lore.kernel.org/git/xmqqh7ycw5lc.fsf@gitster.c.googlers.com/

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-31 19:12         ` Elijah Newren
@ 2020-03-31 20:02           ` Derrick Stolee
  0 siblings, 0 replies; 26+ messages in thread
From: Derrick Stolee @ 2020-03-31 20:02 UTC (permalink / raw)
  To: Elijah Newren, Matheus Tavares Bernardino
  Cc: Junio C Hamano, Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc, Jonathan Tan

On 3/31/2020 3:12 PM, Elijah Newren wrote:
> // adding Jonathan Tan to cc based on the fact that we keep bringing
> up partial clones and how it relates...
> 
> On Sun, Mar 29, 2020 at 8:23 PM Matheus Tavares Bernardino
> <matheus.bernardino@usp.br> wrote:
>>
>> On Tue, Mar 24, 2020 at 3:30 PM Junio C Hamano <gitster@pobox.com> wrote:
>>>
>>> Elijah Newren <newren@gmail.com> writes:
>>>
>>>> On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
>>>> <matheus.bernardino@usp.br> wrote:
>>>>>
>>>>> In the last commit, git-grep learned to honor sparsity patterns. For
>>>>> some use cases, however, it may be desirable to search outside the
>>>>> sparse checkout. So add the '--ignore-sparsity' option, which restores
>>>>> the old behavior. Also add the grep.ignoreSparsity configuration, to
>>>>> allow setting this behavior by default.
>>>>
>>>> Should `--ignore-sparsity` be a global git option rather than a
>>>> grep-specific one?  Also, should grep.ignoreSparsity rather be
>>>> core.ignoreSparsity or core.searchOutsideSparsePaths or something?
>>>
>>> Great question.  I think "git diff" with various options would also
>>> want to optionally be able to be confined within the sparse cone, or
>>> checking the entire world by lazily fetching outside the sparsity.
>> [...]
>>> Regardless of the choice of the default, it would be a good
>>> idea to make the subcommands consistently offer the same default and
>>> allow the non-default views with the same UI.
>>
>> Yeah, it seems like a sensible path. Regarding implementation, there
>> is the question that Elijah raised, of whether to use a global git
>> option or separate but consistent options for each subcommand. I don't
>> have much experience with sparse checkout to argument for one or
>> another, so I would like to hear what others have to say about it.
>>
>> A question that comes to my mind regarding the global git option is:
>> will --ignore-sparsity (or whichever name we choose for it [1]) be
>> sufficient for all subcommands? Or may some of them require additional
>> options for command-specific behaviors concerning sparsity patterns?
>> Also, would it be OK if we just ignored the option in commands that do
>> not operate differently in sparse checkouts (maybe, fetch, branch and
>> send-email, for example)? And would it make sense to allow
>> constructions such as `git --ignore-sparsity checkout` or even `git
>> --ignore-sparsity sparse-checkout ...`?
> 
> I think the same option would probably be sufficient for all
> subcommands, though I have a minor question about the merge machinery
> (below).  And generally, I think it would be unusual for people to
> pass the command line flag; I suspect most would set a config option
> for most cases and then only occasionally override it on the command
> line.  Since that config option would always be set, I'd expect
> commands that are unaffected to just ignore it (much like both "git -c
> merge.detectRenames=true fetch" and "git --work-tree=othertree fetch"
> will both ignore the irrelevant options rather than trying to detect
> that they were specified and error out).
> 
>> [1]: Does anyone have suggestions for the option/config name? The best
>> I could come up with so far (without being too verbose) is
>> --no-sparsity-constraints. But I fear this might sound generic. As
>> Elijah already mentioned, --ignore-sparsity is not good either, as it
>> introduces double negatives in code...
> 
> Does verbosity matter that much?  I think people would set it in
> config, and tab completion would make it pretty easy to complete in
> any event.
> 
> Anyway, maybe it will help if I provide a very rough first draft of
> what changes we could introduce to Documentation/config/core.txt, and
> then ask a bunch of my own questions about it below:
> 
> """
> core.restrictToSparsePaths::
>         Only meaningful in conjuntion with core.sparseCheckoutCone.
>         This option extends sparse checkouts (which limit which paths
>         are written to the worktree), so that output and operations
>         are also limited to the sparsity paths where possible and
>         implemented.  The purpose of this option is to (1) focus
>         output for the user on the portion of the repository that is
>         of interest to them, and (2) enable potentially dramatic
>         performance improvements, especially in conjunction with
>         partial clones.
> +
> When this option is true, git commands such as log, diff, and grep may
> limit their output to the directories specified by the sparse cone, or
> to the intersection of those paths and any (like `*.c) that the user
> might also specify on the command line.  (Note that this limit for
> diff and grep only becomes relevant with --cached or when specifying a
> REVISION, since a search of the working tree will automatically be
> limited to the sparse paths that are present.)  Also, commands like
> bisect may only select commits which modify paths within the sparsity
> cone.  The merge machinery may use the sparse paths as a heuristic to
> avoid trying to detect renames from within the sparsity cone to
> outside the sparsity cone when at least one side of history only
> touches paths within the sparsity cone (this can make the merge
> machinery faster, but may risk modify/delete conflicts since upstream
> can rename a file within the sparsity paths to a location outside
> them).  Commands which export, integrity check, or create history will
> always operate on full trees (e.g. fast-export, format-patch, fsck,
> commit, etc.), unaffected by any sparsity patterns.
> """
> 
> Several questions here, of course:
> 
>   * do people like or hate the name?  indifferent?  have alternate ideas?

It's probably time to create a 'sparse-checkout' config space. That
would allow

	sparse-checkout.restrictGrep = true

as an option. Or a more general

	sparse-checkout.restrictCommands = true

to make it clear that it affects multiple commands.

>   * should we restrict this to core.sparseCheckoutCone as I suggested
> above or also allow people to do it with core.sparseCheckout without
> the cone mode?  I think attempting to weld partial clones together
> with core.sparseCheckout is crazy, so I'm tempted to just make it be
> specific to cone mode and to push people to use it.  But I'm
> interested in thoughts on the matter.

Personally, I prefer cone mode and think it covers 99% of cases.
However, there are some who are using a big directory full of large
binaries and relying on file-prefix matches to get only the big
binaries they need. Until they restructure their repositories to
take advantage of cone mode, we should be considerate of the full
sparse-checkout specification when possible.

>   * should worktrees be affected?  (I've been an advocate of new
> worktrees inheriting the sparse patterns of the worktree in use at the
> time the new worktree was created.  Junio once suggested he didn't
> like that and that worktrees should start out dense.  That seems
> problematic to me in big repos with partial clones and sparse chckouts
> in use.  Perhaps dense new worktrees is the behavior you get when
> core.restrictToSparsePaths is false?)

We should probably consider a `--sparse` option for `git worktree add`
so we can allow interested users to add worktrees that initialize to
a sparse-checkout. Optionally create a config option that would copy
the sparse-checkout file from the current repo to the worktree.

>   * does my idea for the merge machinery make folks uncomfortable?
> Should that be a different option?  Being able to do trivial *tree*
> merges for the huge portion of the tree outside the sparsity paths
> would be a huge win, especially with partial clones, but it certainly
> is different.  Then again, microsoft has disabled rename detection
> entirely based on it being too expensive, so perhaps the idea of
> rename-detection-within-your-cone-if-you-really-didn't-modify-anything-outside-the-cone-on-your-side-of-history
> is a reasonable middle ground between off and on for rename detection.

The part where you say " when at least one side of history only
touches paths within the sparsity cone" makes me want to entertain
the idea if it can be done cleanly.

I'm more concerned about the "git bisect" logic being restricted to
the cone, since that is such an open-ended command for what is
considered "good" or "bad".

>   * what should the default be?  Junio suggested elsewhere[1] that
> sparse-checkouts and partial clones should probably be welded together
> (with partial clones downloading just history in the sparsity paths by
> default), in which case having this option be true would be useful.

My opinion on this is as follows: filtering blobs based on sparse-
checkout patterns does not filter enough, and filtering trees based
on sparse-checkout patterns filters too much. The costs are just
flipped: having extra trees is not a huge problem but recovering from
a "tree miss" is problematic. Having extra blobs is painful, but
recovering from a "blob miss" is not a big deal.

> But it may also be slightly weird because it'll probably take us a
> while to implement this; while the big warning in
> git-sparse-checkout.txt certainly allows this:
>         THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR, AND THE BEHAVIOR OF OTHER
>         COMMANDS IN THE PRESENCE OF SPARSE-CHECKOUTS, WILL LIKELY CHANGE IN
>         THE FUTURE.
> It may still be slightly weird that the default behavior of commands
> in the presence of sparse-checkouts changes release to release until
> we get it all implemented.

I appreciate that we put that warning at the top. We will be
able to do more experimental things with the feature because
of it. The idea I'm toying with is to have "git clone --sparse"
set core.sparseCheckoutCone = true.

Also, if we are creating the "sparse-checkout.*" config space,
we should "rename" core.sparseCheckoutCone to sparse-checkout.coneMode
or something. We would need to support both for a while, for sure.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, back to index

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-24  6:04 [RFC PATCH 0/3] grep: honor sparse checkout and add option to ignore it Matheus Tavares
2020-03-24  6:11 ` [RFC PATCH 1/3] doc: grep: unify info on configuration variables Matheus Tavares
2020-03-24  7:57   ` Elijah Newren
2020-03-24 21:26     ` Junio C Hamano
2020-03-24 23:38       ` Matheus Tavares
2020-03-24  6:12 ` [RFC PATCH 2/3] grep: honor sparse checkout patterns Matheus Tavares
2020-03-24  7:15   ` Elijah Newren
2020-03-24 15:12     ` Derrick Stolee
2020-03-24 16:16       ` Elijah Newren
2020-03-24 17:02         ` Derrick Stolee
2020-03-24 23:01       ` Matheus Tavares Bernardino
2020-03-24 22:55     ` Matheus Tavares Bernardino
2020-03-24  6:13 ` [RFC PATCH 3/3] grep: add option to ignore sparsity patterns Matheus Tavares
2020-03-24  7:54   ` Elijah Newren
2020-03-24 18:30     ` Junio C Hamano
2020-03-24 19:07       ` Elijah Newren
2020-03-25 20:18         ` Junio C Hamano
2020-03-30  3:23       ` Matheus Tavares Bernardino
2020-03-31 19:12         ` Elijah Newren
2020-03-31 20:02           ` Derrick Stolee
2020-03-25 23:15     ` Matheus Tavares Bernardino
2020-03-26  6:02       ` Elijah Newren
2020-03-27 15:51         ` Junio C Hamano
2020-03-27 19:01           ` Elijah Newren
2020-03-30  1:12         ` Matheus Tavares Bernardino
2020-03-31 16:48           ` Elijah Newren

Git Mailing List Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/git/0 git/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 git git/ https://lore.kernel.org/git \
		git@vger.kernel.org
	public-inbox-index git

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.git


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git