All of lore.kernel.org
 help / color / mirror / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: Elijah Newren <newren@gmail.com>,
	Matheus Tavares <matheus.bernardino@usp.br>
Cc: Git Mailing List <git@vger.kernel.org>,
	Derrick Stolee <dstolee@microsoft.com>,
	"brian m. carlson" <sandals@crustytoothpaste.net>,
	Stefan Beller <stefanbeller@gmail.com>
Subject: Re: [RFC PATCH 2/3] grep: honor sparse checkout patterns
Date: Tue, 24 Mar 2020 11:12:50 -0400	[thread overview]
Message-ID: <59c04216-8dd9-cbbf-a869-a65ed8ca6e0a@gmail.com> (raw)
In-Reply-To: <CABPp-BHbrGGjV_22kwTERn19RaWk73_Y6tzWnjwO9u4isCRpVg@mail.gmail.com>

On 3/24/2020 3:15 AM, Elijah Newren wrote:
> Hi Matheus,
> 
> On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
> <matheus.bernardino@usp.br> wrote:
>>
>> One of the main uses for a sparse checkout is to allow users to focus on
>> the subset of files in a repository in which they are interested. But
>> git-grep currently ignores the sparsity patterns and report all matches
>> found outside this subset, which kind of goes in the oposity direction.
>> Let's fix that, making it honor the sparsity boundaries for every
>> grepping case:
>>
>> - git grep in worktree
>> - git grep --cached
>> - git grep $REVISION
> 
> Wahoo!  This is great.

I am also excited. Also thrilled to see the option to get the old
behavior in the next patch.

>> Something I'm not entirely sure in this patch is how we implement the
>> mechanism to honor sparsity for the `git grep <commit-ish>` case (which
>> is treated in the grep_tree() function). Currently, the patch looks for
>> an index entry that matches the path, and then checks its skip_worktree
> 
> As you discuss below, checking the index is both wrong _and_ costly.

I'm not sure why checking the index is _wrong_, but I agree about the
performance cost.

> You should use the sparsity patterns; Stolee did a lot of work to make
> those correspond to simple hashes you could check to determine whether
> to even walk into a subdirectory.  So, O(1).  Yeah, that's "only" cone
> mode but the non-cone sparsity patterns were a performance nightmare
> waiting to rear its ugly head.  We should just try to encourage
> everyone to move to cone mode, or accept the slowness they get without
> it.
> 
>> bit. But this operation is perfomed in O(log(N)); N being the number of
>> index entries. If there are many entries (and no so many sparsity
>> patterns), maybe a better approach would be to try matching the path
>> directly against the sparsity patterns. This would be O(M) in the number
>> of patterns, and it could be done, in builtin/grep.c, with a function
>> like the following:
>>
>> static struct pattern_list sparsity_patterns;
>> static int sparsity_patterns_initialized = 0;
>> static enum pattern_match_result path_matches_sparsity_patterns(
>>                                         const char *path, int pathlen,
>>                                         const char *basename,
>>                                         struct repository *repo)
>> {
>>         int dtype = DT_UNKNOWN;
>>
>>         if (!sparsity_patterns_initialized) {
>>                 char *sparse_file = git_pathdup("info/sparse-checkout");
>>                 int ret;
>>
>>                 memset(&sparsity_patterns, 0, sizeof(sparsity_patterns));
>>                 sparsity_patterns.use_cone_patterns = core_sparse_checkout_cone;
>>                 ret = add_patterns_from_file_to_list(sparse_file, "", 0,
>>                                                      &sparsity_patterns, NULL);
>>                 free(sparse_file);
>>
>>                 if (ret < 0)
>>                         die(_("failed to load sparse-checkout patterns"));
>>                 sparsity_patterns_initialized = 1;
>>         }
>>
>>         return path_matches_pattern_list(path, pathlen, basename, &dtype,
>>                                          &sparsity_patterns, repo->index);
>> }
>>
>> Also, if I understand correctly, the index doesn't hold paths to dirs,
>> right? So even if a complete dir is excluded from sparse checkout, we
>> still have to check all its subentries, only to discover that they
>> should all be skipped from the search. However, if we were to check
>> against the sparsity patterns directly (e.g. with the function above),
>> we could skip such directories together with all their entries.

When in cone mode, we can check if a directory is one of these three
modes:

1. Completely contained in the cone (recursive match)
2. Completely outside the cone
3. Neither. Keep matching subdirectories. (parent match)

The clear_ce_flags() code in dir.c includes the matching algorithms
for this. Hopefully you can re-use a lot of it. You may need to extract
some methods to use them from the grep code.

>> Oh, and there is also the case of a commit whose tree paths are not in
>> the index (maybe manually created objects?). For such commits, with the
>> index lookup approach, we would have to fall back on ignoring the
>> sparsity rules. I'm not sure if that would be OK, though.
>>
>> Any thoughts on these two approaches (looking up the skip_worktree bit
>> in the index or directly matching against sparsity patterns), will be
>> highly appreciated. (Note that it only concerns the `git grep
>> <commit-ish>` case. The other cases already iterate thought the index, so
>> there is no O(log(N)) extra complexity).
>>
>>  builtin/grep.c                   | 29 ++++++++---
>>  t/t7011-skip-worktree-reading.sh |  9 ----
>>  t/t7817-grep-sparse-checkout.sh  | 88 ++++++++++++++++++++++++++++++++
>>  3 files changed, 111 insertions(+), 15 deletions(-)
>>  create mode 100755 t/t7817-grep-sparse-checkout.sh
>>
>> diff --git a/builtin/grep.c b/builtin/grep.c
>> index 99e2685090..52ec72a036 100644
>> --- a/builtin/grep.c
>> +++ b/builtin/grep.c
>> @@ -388,7 +388,7 @@ static int grep_cache(struct grep_opt *opt,
>>                       const struct pathspec *pathspec, int cached);
>>  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>>                      struct tree_desc *tree, struct strbuf *base, int tn_len,
>> -                    int check_attr);
>> +                    int from_commit);
> 
> I'm not familiar with grep.c and have to admit I don't know what
> "check_attr" means.  Slightly surprised to see you replace it, but
> maybe reading the rest will explain...
> 
>>
>>  static int grep_submodule(struct grep_opt *opt,
>>                           const struct pathspec *pathspec,
>> @@ -486,6 +486,10 @@ static int grep_cache(struct grep_opt *opt,
>>
>>         for (nr = 0; nr < repo->index->cache_nr; nr++) {
>>                 const struct cache_entry *ce = repo->index->cache[nr];
>> +
>> +               if (ce_skip_worktree(ce))
>> +                       continue;
>> +
> 
> Looks good for the case where we are grepping through what's cached.
> 
>>                 strbuf_setlen(&name, name_base_len);
>>                 strbuf_addstr(&name, ce->name);
>>
>> @@ -498,8 +502,7 @@ static int grep_cache(struct grep_opt *opt,
>>                          * cache entry are identical, even if worktree file has
>>                          * been modified, so use cache version instead
>>                          */
>> -                       if (cached || (ce->ce_flags & CE_VALID) ||
>> -                           ce_skip_worktree(ce)) {
>> +                       if (cached || (ce->ce_flags & CE_VALID)) {
> 
> I had the same change when I was trying to hack something like this
> patch into place but only handled the worktree case before realized it
> was a bit bigger job.
> 
>>                                 if (ce_stage(ce) || ce_intent_to_add(ce))
>>                                         continue;
>>                                 hit |= grep_oid(opt, &ce->oid, name.buf,
>> @@ -532,7 +535,7 @@ static int grep_cache(struct grep_opt *opt,
>>
>>  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>>                      struct tree_desc *tree, struct strbuf *base, int tn_len,
>> -                    int check_attr)
>> +                    int from_commit)
>>  {
>>         struct repository *repo = opt->repo;
>>         int hit = 0;
>> @@ -546,6 +549,9 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>>                 name_base_len = name.len;
>>         }
>>
>> +       if (from_commit && repo_read_index(repo) < 0)
>> +               die(_("index file corrupt"));
>> +
> 
> As above, I don't think we should need to read the index.  We should
> compare to sparsity patterns, which in the important case (cone mode)
> simplifies to a hash lookup as we walk directories.
> 
>>         while (tree_entry(tree, &entry)) {
>>                 int te_len = tree_entry_len(&entry);
>>
>> @@ -564,9 +570,20 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>>
>>                 strbuf_add(base, entry.path, te_len);
>>
>> +               if (from_commit) {
>> +                       int pos = index_name_pos(repo->index,
>> +                                                base->buf + tn_len,
>> +                                                base->len - tn_len);
>> +                       if (pos >= 0 &&
>> +                           ce_skip_worktree(repo->index->cache[pos])) {
>> +                               strbuf_setlen(base, old_baselen);
>> +                               continue;
>> +                       }
>> +               }
>> +
>>                 if (S_ISREG(entry.mode)) {
>>                         hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
>> -                                        check_attr ? base->buf + tn_len : NULL);
>> +                                       from_commit ? base->buf + tn_len : NULL);
> 
> Sadly, this doesn't help me understand check_attr or from_commit.
> Could you clue me in a bit?

Yeah, Elijah and I know the sparse-checkout code quite well, but are
unfamiliar with grep. Let's all expand our knowledge!

>>                 } else if (S_ISDIR(entry.mode)) {
>>                         enum object_type type;
>>                         struct tree_desc sub;
>> @@ -581,7 +598,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>>                         strbuf_addch(base, '/');
>>                         init_tree_desc(&sub, data, size);
>>                         hit |= grep_tree(opt, pathspec, &sub, base, tn_len,
>> -                                        check_attr);
>> +                                        from_commit);
> 
> Same.
> 
>>                         free(data);
>>                 } else if (recurse_submodules && S_ISGITLINK(entry.mode)) {
>>                         hit |= grep_submodule(opt, pathspec, &entry.oid,
>> diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh
>> index 37525cae3a..26852586ac 100755
>> --- a/t/t7011-skip-worktree-reading.sh
>> +++ b/t/t7011-skip-worktree-reading.sh
>> @@ -109,15 +109,6 @@ test_expect_success 'ls-files --modified' '
>>         test -z "$(git ls-files -m)"
>>  '
>>
>> -test_expect_success 'grep with skip-worktree file' '
>> -       git update-index --no-skip-worktree 1 &&
>> -       echo test > 1 &&
>> -       git update-index 1 &&
>> -       git update-index --skip-worktree 1 &&
>> -       rm 1 &&
>> -       test "$(git grep --no-ext-grep test)" = "1:test"
>> -'
>> -
>>  echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A   1" > expected
>>  test_expect_success 'diff-index does not examine skip-worktree absent entries' '
>>         setup_absent &&
>> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
>> new file mode 100755
>> index 0000000000..fccf44e829
>> --- /dev/null
>> +++ b/t/t7817-grep-sparse-checkout.sh
>> @@ -0,0 +1,88 @@
>> +#!/bin/sh
>> +
>> +test_description='grep in sparse checkout
>> +
>> +This test creates the following dir structure:
>> +.
>> +| - a
>> +| - b
>> +| - dir
>> +    | - c
>> +
>> +Only "a" should be present due to the sparse checkout patterns:
>> +"/*", "!/b" and "!/dir".
>> +'
>> +
>> +. ./test-lib.sh
>> +
>> +test_expect_success 'setup' '
>> +       echo "text" >a &&
>> +       echo "text" >b &&
>> +       mkdir dir &&
>> +       echo "text" >dir/c &&
>> +       git add a b dir &&
>> +       git commit -m "initial commit" &&
>> +       git tag -am t-commit t-commit HEAD &&
>> +       tree=$(git rev-parse HEAD^{tree}) &&
>> +       git tag -am t-tree t-tree $tree &&
>> +       cat >.git/info/sparse-checkout <<-EOF &&
>> +       /*
>> +       !/b
>> +       !/dir
>> +       EOF
>> +       git sparse-checkout init &&
> 
> Using `git sparse-checkout init` but then manually writing to
> .git/info/sparse-checkout?  Seems like it'd make more sense to use
> `git sparse-checkout set` than writing the patterns directly yourself.
> Also, would prefer to have the examples use cone mode (even if you
> have to add subdirectories), as it makes the testcase a bit easier to
> read and more performant, though neither is a big deal.

I agree that we should use the builtin so your test script is less
brittle to potential back-end changes to sparse-checkout (none planned).

I do recommend having at least one test with non-cone mode patterns,
especially if you are checking the pattern-matching yourself instead of
relying on the index.

Thanks,
-Stolee

  reply	other threads:[~2020-03-24 15:12 UTC|newest]

Thread overview: 123+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-24  6:04 [RFC PATCH 0/3] grep: honor sparse checkout and add option to ignore it Matheus Tavares
2020-03-24  6:11 ` [RFC PATCH 1/3] doc: grep: unify info on configuration variables Matheus Tavares
2020-03-24  7:57   ` Elijah Newren
2020-03-24 21:26     ` Junio C Hamano
2020-03-24 23:38       ` Matheus Tavares
2020-03-24  6:12 ` [RFC PATCH 2/3] grep: honor sparse checkout patterns Matheus Tavares
2020-03-24  7:15   ` Elijah Newren
2020-03-24 15:12     ` Derrick Stolee [this message]
2020-03-24 16:16       ` Elijah Newren
2020-03-24 17:02         ` Derrick Stolee
2020-03-24 23:01       ` Matheus Tavares Bernardino
2020-03-24 22:55     ` Matheus Tavares Bernardino
2020-04-21  2:10       ` Matheus Tavares Bernardino
2020-04-21  3:08         ` Elijah Newren
2020-04-22 12:08           ` Derrick Stolee
2020-04-23  6:09           ` Matheus Tavares Bernardino
2020-03-24  6:13 ` [RFC PATCH 3/3] grep: add option to ignore sparsity patterns Matheus Tavares
2020-03-24  7:54   ` Elijah Newren
2020-03-24 18:30     ` Junio C Hamano
2020-03-24 19:07       ` Elijah Newren
2020-03-25 20:18         ` Junio C Hamano
2020-03-30  3:23       ` Matheus Tavares Bernardino
2020-03-31 19:12         ` Elijah Newren
2020-03-31 20:02           ` Derrick Stolee
2020-04-27 17:15             ` Matheus Tavares Bernardino
2020-04-29 16:46               ` Elijah Newren
2020-04-29 17:21             ` Elijah Newren
2020-03-25 23:15     ` Matheus Tavares Bernardino
2020-03-26  6:02       ` Elijah Newren
2020-03-27 15:51         ` Junio C Hamano
2020-03-27 19:01           ` Elijah Newren
2020-03-30  1:12         ` Matheus Tavares Bernardino
2020-03-31 16:48           ` Elijah Newren
2020-05-10  0:41 ` [RFC PATCH v2 0/4] grep: honor sparse checkout and add option to ignore it Matheus Tavares
2020-05-10  0:41   ` [RFC PATCH v2 1/4] doc: grep: unify info on configuration variables Matheus Tavares
2020-05-10  0:41   ` [RFC PATCH v2 2/4] config: load the correct config.worktree file Matheus Tavares
2020-05-11 19:10     ` Junio C Hamano
2020-05-12 22:55       ` Matheus Tavares Bernardino
2020-05-12 23:22         ` Junio C Hamano
2020-05-10  0:41   ` [RFC PATCH v2 3/4] grep: honor sparse checkout patterns Matheus Tavares
2020-05-11 19:35     ` Junio C Hamano
2020-05-13  0:05       ` Matheus Tavares Bernardino
2020-05-13  0:17         ` Junio C Hamano
2020-05-21  7:26           ` Elijah Newren
2020-05-21 17:35             ` Matheus Tavares Bernardino
2020-05-21 17:52               ` Elijah Newren
2020-05-22  5:49                 ` Matheus Tavares Bernardino
2020-05-22 14:26                   ` Elijah Newren
2020-05-22 15:36                     ` Elijah Newren
2020-05-22 20:54                       ` Matheus Tavares Bernardino
2020-05-22 21:06                         ` Elijah Newren
2020-06-10 11:40                     ` Derrick Stolee
2020-06-10 16:22                       ` Matheus Tavares Bernardino
2020-06-10 17:42                         ` Derrick Stolee
2020-06-10 18:14                           ` Matheus Tavares Bernardino
2020-06-10 20:12                         ` Elijah Newren
2020-06-10 19:58                       ` Elijah Newren
2020-05-21  7:36       ` Elijah Newren
2020-05-10  0:41   ` [RFC PATCH v2 4/4] config: add setting to ignore sparsity patterns in some cmds Matheus Tavares
2020-05-10  4:23     ` Matheus Tavares Bernardino
2020-05-21 17:18       ` Elijah Newren
2020-05-21  7:09     ` Elijah Newren
2020-05-28  1:12   ` [PATCH v3 0/5] grep: honor sparse checkout and add option to ignore it Matheus Tavares
2020-05-28  1:12     ` [PATCH v3 1/5] doc: grep: unify info on configuration variables Matheus Tavares
2020-05-28  1:13     ` [PATCH v3 2/5] t/helper/test-config: return exit codes consistently Matheus Tavares
2020-05-30 14:29       ` Elijah Newren
2020-06-01  4:36         ` Matheus Tavares Bernardino
2020-05-28  1:13     ` [PATCH v3 3/5] config: correctly read worktree configs in submodules Matheus Tavares
2020-05-30 14:49       ` Elijah Newren
2020-06-01  4:38         ` Matheus Tavares Bernardino
2020-05-28  1:13     ` [PATCH v3 4/5] grep: honor sparse checkout patterns Matheus Tavares
2020-05-30 15:48       ` Elijah Newren
2020-06-01  4:44         ` Matheus Tavares Bernardino
2020-06-03  2:38           ` Elijah Newren
2020-06-10 17:08             ` Matheus Tavares Bernardino
2020-05-28  1:13     ` [PATCH v3 5/5] config: add setting to ignore sparsity patterns in some cmds Matheus Tavares
2020-05-30 16:18       ` Elijah Newren
2020-06-01  4:45         ` Matheus Tavares Bernardino
2020-06-03  2:39           ` Elijah Newren
2020-06-10 21:15             ` Matheus Tavares Bernardino
2020-06-11  0:35               ` Elijah Newren
2020-06-12 15:44     ` [PATCH v4 0/6] grep: honor sparse checkout and add option to ignore it Matheus Tavares
2020-06-12 15:44       ` [PATCH v4 1/6] doc: grep: unify info on configuration variables Matheus Tavares
2020-06-12 15:45       ` [PATCH v4 2/6] t/helper/test-config: return exit codes consistently Matheus Tavares
2020-06-12 15:45       ` [PATCH v4 3/6] t/helper/test-config: facilitate addition of new cli options Matheus Tavares
2020-06-12 15:45       ` [PATCH v4 4/6] config: correctly read worktree configs in submodules Matheus Tavares
2020-06-16 19:13         ` Elijah Newren
2020-06-21 16:05           ` Matheus Tavares Bernardino
2020-09-01  2:41         ` Jonathan Nieder
2020-09-01 21:44           ` Matheus Tavares Bernardino
2020-06-12 15:45       ` [PATCH v4 5/6] grep: honor sparse checkout patterns Matheus Tavares
2020-06-12 15:45       ` [PATCH v4 6/6] config: add setting to ignore sparsity patterns in some cmds Matheus Tavares
2020-06-16 22:31       ` [PATCH v4 0/6] grep: honor sparse checkout and add option to ignore it Elijah Newren
2020-09-02  6:17       ` [PATCH v5 0/8] " Matheus Tavares
2020-09-02  6:17         ` [PATCH v5 1/8] doc: grep: unify info on configuration variables Matheus Tavares
2020-09-02  6:17         ` [PATCH v5 2/8] t1308-config-set: avoid false positives when using test-config Matheus Tavares
2020-09-02  6:57           ` Eric Sunshine
2020-09-02 16:16             ` Matheus Tavares Bernardino
2020-09-02 16:38               ` Eric Sunshine
2020-09-02  6:17         ` [PATCH v5 3/8] t/helper/test-config: be consistent with exit codes Matheus Tavares
2020-09-02  6:17         ` [PATCH v5 4/8] t/helper/test-config: check argc before accessing argv Matheus Tavares
2020-09-02  7:18           ` Eric Sunshine
2020-09-02  6:17         ` [PATCH v5 5/8] t/helper/test-config: unify exit labels Matheus Tavares
2020-09-02  7:30           ` Eric Sunshine
2020-09-02  6:17         ` [PATCH v5 6/8] config: correctly read worktree configs in submodules Matheus Tavares
2020-09-02 20:15           ` Jonathan Nieder
2020-09-09 13:04             ` Matheus Tavares Bernardino
2020-09-09 23:32               ` Jonathan Nieder
2020-09-02  6:17         ` [PATCH v5 7/8] grep: honor sparse checkout patterns Matheus Tavares
2020-09-02  6:17         ` [PATCH v5 8/8] config: add setting to ignore sparsity patterns in some cmds Matheus Tavares
2020-09-10 17:21         ` [PATCH v6 0/9] grep: honor sparse checkout and add option to ignore it Matheus Tavares
2020-09-10 17:21           ` [PATCH v6 1/9] doc: grep: unify info on configuration variables Matheus Tavares
2020-09-10 17:21           ` [PATCH v6 2/9] t1308-config-set: avoid false positives when using test-config Matheus Tavares
2020-09-10 17:21           ` [PATCH v6 3/9] t/helper/test-config: be consistent with exit codes Matheus Tavares
2020-09-10 17:21           ` [PATCH v6 4/9] t/helper/test-config: diagnose missing arguments Matheus Tavares
2020-09-10 17:21           ` [PATCH v6 5/9] t/helper/test-config: unify exit labels Matheus Tavares
2020-09-10 17:21           ` [PATCH v6 6/9] config: make do_git_config_sequence receive a 'struct repository' Matheus Tavares
2020-09-10 17:21           ` [PATCH v6 7/9] config: correctly read worktree configs in submodules Matheus Tavares
2020-09-10 17:21           ` [PATCH v6 8/9] grep: honor sparse checkout patterns Matheus Tavares
2020-09-10 17:21           ` [PATCH v6 9/9] config: add setting to ignore sparsity patterns in some cmds Matheus Tavares
2021-02-09 21:33           ` [PATCH v7] grep: honor sparse-checkout on working tree searches Matheus Tavares
2021-02-09 23:23             ` Junio C Hamano
2021-02-10  6:12               ` Elijah Newren

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=59c04216-8dd9-cbbf-a869-a65ed8ca6e0a@gmail.com \
    --to=stolee@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=matheus.bernardino@usp.br \
    --cc=newren@gmail.com \
    --cc=sandals@crustytoothpaste.net \
    --cc=stefanbeller@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.