git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Elijah Newren <newren@gmail.com>
To: Matheus Tavares <matheus.bernardino@usp.br>
Cc: "Git Mailing List" <git@vger.kernel.org>,
	"Derrick Stolee" <dstolee@microsoft.com>,
	"Nguyễn Thái Ngọc" <pclouds@gmail.com>
Subject: Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
Date: Tue, 24 Mar 2020 00:54:54 -0700	[thread overview]
Message-ID: <CABPp-BEbNCYk0pCuEDQ_ViB2=varJPBsVODxNvJs0EVRyBqjBg@mail.gmail.com> (raw)
In-Reply-To: <a76242ecfa69cf29995bd859305dc2ab1bc1a221.1585027716.git.matheus.bernardino@usp.br>

On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
<matheus.bernardino@usp.br> wrote:
>
> In the last commit, git-grep learned to honor sparsity patterns. For
> some use cases, however, it may be desirable to search outside the
> sparse checkout. So add the '--ignore-sparsity' option, which restores
> the old behavior. Also add the grep.ignoreSparsity configuration, to
> allow setting this behavior by default.

Should `--ignore-sparsity` be a global git option rather than a
grep-specific one?  Also, should grep.ignoreSparsity rather be
core.ignoreSparsity or core.searchOutsideSparsePaths or something?  In
particular, I want a world where:

* Someone can do a "sparse" clone that is NOT just about
sparse-checkout but also about partial clone.  In particular, it makes
use of partial clones to download only the history for the sparsity
paths, and does a sparse-checkout --cone to get those checked out.
(Or, perhaps, defaults to just downloading history for the toplevel
dir, much like `sparse-checkout init --cone`, and then when the user
runs `sparse-checkout set $dir1 $dir2 ...` then it downloads the extra
bits).
* grep, diff, log, shortlog, blame, bisect (and maybe others) all by
default make use of the sparsity patterns to limit their output (but
can all use whatever flag(s) are added here to search outside the
sparsity pattern cones).  This helps users feel they are in a smaller
repo and searching just their area of interest, and it avoids partial
clones downloading blobs unnecessarily.  Nice for the user, and nice
for the system.
* worktrees behave nicer; when creating a new one it inherits the
sparsity patterns of the parent (again to avoid partail clones having
to download everything, and let users continue working on their area
of interest, though they can disable sparse checkouts at any time, of
course).  Still would like Junio's feedback on this one.
* rebase, merge, cherry-pick, etc. (all via the merge machiner) have
smarter tree-merging logic such that when trees are unchanged on one
or both sides of history, we take advantage of the subset of those
cases where we can avoid traversing into subtrees but can resolve the
merge at the tree level.  This is a performance optimization even when
you have all trees and blob available, but an even more important one
if you don't want partial clones to suddenly have to download
unnecessary objects.  I have ideas and am working on this as part of
merge-ort.

> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> ---
>
> Note: I still have to make --ignore-sparsity be able to work together
> with --untracked. Unfortunatelly, this won't be as simple because the
> codeflow taken by --untracked goes to grep_directory() which just
> iterates the working tree, without looking the index entries. So I will
> have to either: make --untracked use grep_cache(), and grep the
> untracked files later; or try matching the working tree paths against
> the sparsity patterns, without looking for the skip_worktree bit in
> the index (as I mentioned in the previous patch's comments). Any
> preferences regarding these two approaches? (or other suggestions?)

Hmm.  So, 'tracked' in git is the idea that we are keeping information
about specific files.  'sparse-checkout' is the idea that we have a
subset of those that we can work with without materializing all the
other tracked files; it's clearly a subset of the realm of 'tracked'.
'untracked' is about getting everything outside the set of 'tracked'
files, which to me means it is clearly outside the set of sparsity
paths too (and thus you could take --untracked as implying
--ignore-sparsity, though whether you do might not matter in practice
because of the items I'll discuss next).  Of course, I am also
assuming `--untracked` is incompatible with --cached or specifying
revisions or trees (based on it's definiton of "In addition to
searching in the tracked files in the *working tree*, search also in
untracked files." -- emphasis added.)  If the incompatibility of
--untracked and --cached/REVSIONS/TREES is not enforced, we may want
to look into erroring out if they are given together.  Once we do, we
don't have to worry about grep_cache() at all in the case of
--untracked and shouldn't.  Files with the skip_worktree bit won't
exist in the working directory, and thus won't be searched (this is
what makes --untracked imply --ignore-sparsity not really matter).

In short: With --untracked you are grepping ALL (non-ignored) files in
the working directory -- either because they are both tracked and in
the sparsity paths (anything tracked that isn't in the sparsity paths
has the skip_worktree bit and thus isn't present), or because it is an
untracked file.  [And this may be what grep_directory() already does.]

Does that make sense?

>  Documentation/config/grep.txt   |  3 +++
>  Documentation/git-grep.txt      |  5 ++++
>  builtin/grep.c                  | 19 +++++++++++----
>  t/t7817-grep-sparse-checkout.sh | 42 +++++++++++++++++++++++++++++++++
>  4 files changed, 65 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
> index 76689771aa..c1d49484c8 100644
> --- a/Documentation/config/grep.txt
> +++ b/Documentation/config/grep.txt
> @@ -25,3 +25,6 @@ grep.fullName::
>  grep.fallbackToNoIndex::
>         If set to true, fall back to git grep --no-index if git grep
>         is executed outside of a git repository.  Defaults to false.
> +
> +grep.ignoreSparsity::
> +       If set to true, enable `--ignore-sparsity` by default.
> diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
> index 97e25d7b1b..5c5c66c056 100644
> --- a/Documentation/git-grep.txt
> +++ b/Documentation/git-grep.txt
> @@ -65,6 +65,11 @@ OPTIONS
>         mechanism.  Only useful when searching files in the current directory
>         with `--no-index`.
>
> +--ignore-sparsity::
> +       In a sparse checked out repository (see linkgit:git-sparse-checkout[1]),
> +       also search in files that are outside the sparse checkout. This option
> +       cannot be used with --no-index or --untracked.

If they are outside the sparse checkout, then they are not present on
disk -- so what is this outside stuff that is being searched?  Perhaps
clarify that this is only useful in combination with
--cached/REVISION/TREE, where there do exist paths outside the
sparsity patterns that become relevant?

>  --recurse-submodules::
>         Recursively search in each submodule that has been initialized and
>         checked out in the repository.  When used in combination with the
> diff --git a/builtin/grep.c b/builtin/grep.c
> index 52ec72a036..17eae3edd6 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -33,6 +33,8 @@ static char const * const grep_usage[] = {
>
>  static int recurse_submodules;
>
> +static int ignore_sparsity = 0;
> +
>  static int num_threads;
>
>  static pthread_t *threads;
> @@ -292,6 +294,9 @@ static int grep_cmd_config(const char *var, const char *value, void *cb)
>         if (!strcmp(var, "submodule.recurse"))
>                 recurse_submodules = git_config_bool(var, value);
>
> +       if (!strcmp(var, "grep.ignoresparsity"))
> +               ignore_sparsity = git_config_bool(var, value);
> +
>         return st;
>  }
>
> @@ -487,7 +492,7 @@ static int grep_cache(struct grep_opt *opt,
>         for (nr = 0; nr < repo->index->cache_nr; nr++) {
>                 const struct cache_entry *ce = repo->index->cache[nr];
>
> -               if (ce_skip_worktree(ce))
> +               if (!ignore_sparsity && ce_skip_worktree(ce))

Oh boy on the double negatives...maybe we want to rename this flag somehow?

>                         continue;
>
>                 strbuf_setlen(&name, name_base_len);
> @@ -502,7 +507,8 @@ static int grep_cache(struct grep_opt *opt,
>                          * cache entry are identical, even if worktree file has
>                          * been modified, so use cache version instead
>                          */
> -                       if (cached || (ce->ce_flags & CE_VALID)) {
> +                       if (cached || (ce->ce_flags & CE_VALID) ||
> +                           ce_skip_worktree(ce)) {
>                                 if (ce_stage(ce) || ce_intent_to_add(ce))
>                                         continue;
>                                 hit |= grep_oid(opt, &ce->oid, name.buf,
> @@ -549,7 +555,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>                 name_base_len = name.len;
>         }
>
> -       if (from_commit && repo_read_index(repo) < 0)
> +       if (!ignore_sparsity && from_commit && repo_read_index(repo) < 0)
>                 die(_("index file corrupt"));
>
>         while (tree_entry(tree, &entry)) {
> @@ -570,7 +576,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>
>                 strbuf_add(base, entry.path, te_len);
>
> -               if (from_commit) {
> +               if (!ignore_sparsity && from_commit) {
>                         int pos = index_name_pos(repo->index,
>                                                  base->buf + tn_len,
>                                                  base->len - tn_len);
> @@ -932,6 +938,8 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
>                 OPT_BOOL_F(0, "ext-grep", &external_grep_allowed__ignored,
>                            N_("allow calling of grep(1) (ignored by this build)"),
>                            PARSE_OPT_NOCOMPLETE),
> +               OPT_BOOL(0, "ignore-sparsity", &ignore_sparsity,
> +                        N_("also search in files outside the sparse checkout")),
>                 OPT_END()
>         };
>
> @@ -1073,6 +1081,9 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
>         if (recurse_submodules && untracked)
>                 die(_("--untracked not supported with --recurse-submodules"));
>
> +       if (ignore_sparsity && (!use_index || untracked))
> +               die(_("--no-index or --untracked cannot be used with --ignore-sparsity"));
> +
>         if (show_in_pager) {
>                 if (num_threads > 1)
>                         warning(_("invalid option combination, ignoring --threads"));
> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> index fccf44e829..1891ddea57 100755
> --- a/t/t7817-grep-sparse-checkout.sh
> +++ b/t/t7817-grep-sparse-checkout.sh
> @@ -85,4 +85,46 @@ test_expect_success 'grep <tree-ish> should search outside sparse checkout' '
>         test_cmp expect_t-tree actual_t-tree
>  '
>
> +for cmd in 'git grep --ignore-sparsity' 'git -c grep.ignoreSparsity grep' \
> +          'git -c grep.ignoreSparsity=false grep --ignore-sparsity'
> +do
> +       test_expect_success "$cmd should search outside sparse checkout" '
> +               cat >expect <<-EOF &&
> +               a:text
> +               b:text
> +               dir/c:text
> +               EOF
> +               $cmd "text" >actual &&
> +               test_cmp expect actual
> +       '
> +
> +       test_expect_success "$cmd --cached should search outside sparse checkout" '
> +               cat >expect <<-EOF &&
> +               a:text
> +               b:text
> +               dir/c:text
> +               EOF
> +               $cmd --cached "text" >actual &&
> +               test_cmp expect actual
> +       '
> +
> +       test_expect_success "$cmd <commit-ish> should search outside sparse checkout" '
> +               commit=$(git rev-parse HEAD) &&
> +               cat >expect_commit <<-EOF &&
> +               $commit:a:text
> +               $commit:b:text
> +               $commit:dir/c:text
> +               EOF
> +               cat >expect_t-commit <<-EOF &&
> +               t-commit:a:text
> +               t-commit:b:text
> +               t-commit:dir/c:text
> +               EOF
> +               $cmd "text" $commit >actual_commit &&
> +               test_cmp expect_commit actual_commit &&
> +               $cmd "text" t-commit >actual_t-commit &&
> +               test_cmp expect_t-commit actual_t-commit
> +       '
> +done
> +
>  test_done
> --
> 2.25.1

I think there are several things that we need to straighten out first
and will affect a lot of this patch quite a bit:
* The feedback from the previous patch that the revision handling
should use sparsity patterns rather than ce_skip_worktree() is going
to affect this patch a fair amount.
* I think the fact that --ignore-sparsity is meaningless without
--cached or a REVISION or TREE may also affect things.
* The decision about how to globally name and set the
"ignore-sparsity" bit without requiring users to set it for each and
every subcommand will change this patch a bit too.


I'm super excited to see work in this area.  I hope I'm not
discouraging you by attempting to provide what I think is the bigger
picture I'd like us to work towards.

  reply	other threads:[~2020-03-24  7:55 UTC|newest]

Thread overview: 123+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-24  6:04 [RFC PATCH 0/3] grep: honor sparse checkout and add option to ignore it Matheus Tavares
2020-03-24  6:11 ` [RFC PATCH 1/3] doc: grep: unify info on configuration variables Matheus Tavares
2020-03-24  7:57   ` Elijah Newren
2020-03-24 21:26     ` Junio C Hamano
2020-03-24 23:38       ` Matheus Tavares
2020-03-24  6:12 ` [RFC PATCH 2/3] grep: honor sparse checkout patterns Matheus Tavares
2020-03-24  7:15   ` Elijah Newren
2020-03-24 15:12     ` Derrick Stolee
2020-03-24 16:16       ` Elijah Newren
2020-03-24 17:02         ` Derrick Stolee
2020-03-24 23:01       ` Matheus Tavares Bernardino
2020-03-24 22:55     ` Matheus Tavares Bernardino
2020-04-21  2:10       ` Matheus Tavares Bernardino
2020-04-21  3:08         ` Elijah Newren
2020-04-22 12:08           ` Derrick Stolee
2020-04-23  6:09           ` Matheus Tavares Bernardino
2020-03-24  6:13 ` [RFC PATCH 3/3] grep: add option to ignore sparsity patterns Matheus Tavares
2020-03-24  7:54   ` Elijah Newren [this message]
2020-03-24 18:30     ` Junio C Hamano
2020-03-24 19:07       ` Elijah Newren
2020-03-25 20:18         ` Junio C Hamano
2020-03-30  3:23       ` Matheus Tavares Bernardino
2020-03-31 19:12         ` Elijah Newren
2020-03-31 20:02           ` Derrick Stolee
2020-04-27 17:15             ` Matheus Tavares Bernardino
2020-04-29 16:46               ` Elijah Newren
2020-04-29 17:21             ` Elijah Newren
2020-03-25 23:15     ` Matheus Tavares Bernardino
2020-03-26  6:02       ` Elijah Newren
2020-03-27 15:51         ` Junio C Hamano
2020-03-27 19:01           ` Elijah Newren
2020-03-30  1:12         ` Matheus Tavares Bernardino
2020-03-31 16:48           ` Elijah Newren
2020-05-10  0:41 ` [RFC PATCH v2 0/4] grep: honor sparse checkout and add option to ignore it Matheus Tavares
2020-05-10  0:41   ` [RFC PATCH v2 1/4] doc: grep: unify info on configuration variables Matheus Tavares
2020-05-10  0:41   ` [RFC PATCH v2 2/4] config: load the correct config.worktree file Matheus Tavares
2020-05-11 19:10     ` Junio C Hamano
2020-05-12 22:55       ` Matheus Tavares Bernardino
2020-05-12 23:22         ` Junio C Hamano
2020-05-10  0:41   ` [RFC PATCH v2 3/4] grep: honor sparse checkout patterns Matheus Tavares
2020-05-11 19:35     ` Junio C Hamano
2020-05-13  0:05       ` Matheus Tavares Bernardino
2020-05-13  0:17         ` Junio C Hamano
2020-05-21  7:26           ` Elijah Newren
2020-05-21 17:35             ` Matheus Tavares Bernardino
2020-05-21 17:52               ` Elijah Newren
2020-05-22  5:49                 ` Matheus Tavares Bernardino
2020-05-22 14:26                   ` Elijah Newren
2020-05-22 15:36                     ` Elijah Newren
2020-05-22 20:54                       ` Matheus Tavares Bernardino
2020-05-22 21:06                         ` Elijah Newren
2020-06-10 11:40                     ` Derrick Stolee
2020-06-10 16:22                       ` Matheus Tavares Bernardino
2020-06-10 17:42                         ` Derrick Stolee
2020-06-10 18:14                           ` Matheus Tavares Bernardino
2020-06-10 20:12                         ` Elijah Newren
2020-06-10 19:58                       ` Elijah Newren
2020-05-21  7:36       ` Elijah Newren
2020-05-10  0:41   ` [RFC PATCH v2 4/4] config: add setting to ignore sparsity patterns in some cmds Matheus Tavares
2020-05-10  4:23     ` Matheus Tavares Bernardino
2020-05-21 17:18       ` Elijah Newren
2020-05-21  7:09     ` Elijah Newren
2020-05-28  1:12   ` [PATCH v3 0/5] grep: honor sparse checkout and add option to ignore it Matheus Tavares
2020-05-28  1:12     ` [PATCH v3 1/5] doc: grep: unify info on configuration variables Matheus Tavares
2020-05-28  1:13     ` [PATCH v3 2/5] t/helper/test-config: return exit codes consistently Matheus Tavares
2020-05-30 14:29       ` Elijah Newren
2020-06-01  4:36         ` Matheus Tavares Bernardino
2020-05-28  1:13     ` [PATCH v3 3/5] config: correctly read worktree configs in submodules Matheus Tavares
2020-05-30 14:49       ` Elijah Newren
2020-06-01  4:38         ` Matheus Tavares Bernardino
2020-05-28  1:13     ` [PATCH v3 4/5] grep: honor sparse checkout patterns Matheus Tavares
2020-05-30 15:48       ` Elijah Newren
2020-06-01  4:44         ` Matheus Tavares Bernardino
2020-06-03  2:38           ` Elijah Newren
2020-06-10 17:08             ` Matheus Tavares Bernardino
2020-05-28  1:13     ` [PATCH v3 5/5] config: add setting to ignore sparsity patterns in some cmds Matheus Tavares
2020-05-30 16:18       ` Elijah Newren
2020-06-01  4:45         ` Matheus Tavares Bernardino
2020-06-03  2:39           ` Elijah Newren
2020-06-10 21:15             ` Matheus Tavares Bernardino
2020-06-11  0:35               ` Elijah Newren
2020-06-12 15:44     ` [PATCH v4 0/6] grep: honor sparse checkout and add option to ignore it Matheus Tavares
2020-06-12 15:44       ` [PATCH v4 1/6] doc: grep: unify info on configuration variables Matheus Tavares
2020-06-12 15:45       ` [PATCH v4 2/6] t/helper/test-config: return exit codes consistently Matheus Tavares
2020-06-12 15:45       ` [PATCH v4 3/6] t/helper/test-config: facilitate addition of new cli options Matheus Tavares
2020-06-12 15:45       ` [PATCH v4 4/6] config: correctly read worktree configs in submodules Matheus Tavares
2020-06-16 19:13         ` Elijah Newren
2020-06-21 16:05           ` Matheus Tavares Bernardino
2020-09-01  2:41         ` Jonathan Nieder
2020-09-01 21:44           ` Matheus Tavares Bernardino
2020-06-12 15:45       ` [PATCH v4 5/6] grep: honor sparse checkout patterns Matheus Tavares
2020-06-12 15:45       ` [PATCH v4 6/6] config: add setting to ignore sparsity patterns in some cmds Matheus Tavares
2020-06-16 22:31       ` [PATCH v4 0/6] grep: honor sparse checkout and add option to ignore it Elijah Newren
2020-09-02  6:17       ` [PATCH v5 0/8] " Matheus Tavares
2020-09-02  6:17         ` [PATCH v5 1/8] doc: grep: unify info on configuration variables Matheus Tavares
2020-09-02  6:17         ` [PATCH v5 2/8] t1308-config-set: avoid false positives when using test-config Matheus Tavares
2020-09-02  6:57           ` Eric Sunshine
2020-09-02 16:16             ` Matheus Tavares Bernardino
2020-09-02 16:38               ` Eric Sunshine
2020-09-02  6:17         ` [PATCH v5 3/8] t/helper/test-config: be consistent with exit codes Matheus Tavares
2020-09-02  6:17         ` [PATCH v5 4/8] t/helper/test-config: check argc before accessing argv Matheus Tavares
2020-09-02  7:18           ` Eric Sunshine
2020-09-02  6:17         ` [PATCH v5 5/8] t/helper/test-config: unify exit labels Matheus Tavares
2020-09-02  7:30           ` Eric Sunshine
2020-09-02  6:17         ` [PATCH v5 6/8] config: correctly read worktree configs in submodules Matheus Tavares
2020-09-02 20:15           ` Jonathan Nieder
2020-09-09 13:04             ` Matheus Tavares Bernardino
2020-09-09 23:32               ` Jonathan Nieder
2020-09-02  6:17         ` [PATCH v5 7/8] grep: honor sparse checkout patterns Matheus Tavares
2020-09-02  6:17         ` [PATCH v5 8/8] config: add setting to ignore sparsity patterns in some cmds Matheus Tavares
2020-09-10 17:21         ` [PATCH v6 0/9] grep: honor sparse checkout and add option to ignore it Matheus Tavares
2020-09-10 17:21           ` [PATCH v6 1/9] doc: grep: unify info on configuration variables Matheus Tavares
2020-09-10 17:21           ` [PATCH v6 2/9] t1308-config-set: avoid false positives when using test-config Matheus Tavares
2020-09-10 17:21           ` [PATCH v6 3/9] t/helper/test-config: be consistent with exit codes Matheus Tavares
2020-09-10 17:21           ` [PATCH v6 4/9] t/helper/test-config: diagnose missing arguments Matheus Tavares
2020-09-10 17:21           ` [PATCH v6 5/9] t/helper/test-config: unify exit labels Matheus Tavares
2020-09-10 17:21           ` [PATCH v6 6/9] config: make do_git_config_sequence receive a 'struct repository' Matheus Tavares
2020-09-10 17:21           ` [PATCH v6 7/9] config: correctly read worktree configs in submodules Matheus Tavares
2020-09-10 17:21           ` [PATCH v6 8/9] grep: honor sparse checkout patterns Matheus Tavares
2020-09-10 17:21           ` [PATCH v6 9/9] config: add setting to ignore sparsity patterns in some cmds Matheus Tavares
2021-02-09 21:33           ` [PATCH v7] grep: honor sparse-checkout on working tree searches Matheus Tavares
2021-02-09 23:23             ` Junio C Hamano
2021-02-10  6:12               ` Elijah Newren

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CABPp-BEbNCYk0pCuEDQ_ViB2=varJPBsVODxNvJs0EVRyBqjBg@mail.gmail.com' \
    --to=newren@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=matheus.bernardino@usp.br \
    --cc=pclouds@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).