git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: Elijah Newren via GitGitGadget <gitgitgadget@gmail.com>,
	git@vger.kernel.org
Cc: "Martin Melka" <martin.melka@gmail.com>,
	"SZEDER Gábor" <szeder.dev@gmail.com>,
	"Samuel Lijin" <sxlijin@gmail.com>,
	"Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>,
	"Elijah Newren" <newren@gmail.com>
Subject: Re: [PATCH v3 7/7] dir: replace exponential algorithm with a linear one, fix untracked cache
Date: Thu, 26 Mar 2020 09:13:51 -0400	[thread overview]
Message-ID: <a81cb70f-1a3a-6f0f-1500-66e63138a848@gmail.com> (raw)
In-Reply-To: <6cfca619e2c5bc37d1847d49496d1be42f4061ce.1585164718.git.gitgitgadget@gmail.com>

On 3/25/2020 3:31 PM, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>
> 
> dir's read_directory_recursive() naturally operates recursively in order
> to walk the directory tree.  Treating of directories is sometimes weird
> because there are so many different permutations about how to handle
> directories.  Some examples:
> 
>    * 'git ls-files -o --directory' only needs to know that a directory
>      itself is untracked; it doesn't need to recurse into it to see what
>      is underneath.
> 
>    * 'git status' needs to recurse into an untracked directory, but only
>      to determine whether or not it is empty.  If there are no files
>      underneath, the directory itself will be omitted from the output.
>      If it is not empty, only the directory will be listed.
> 
>    * 'git status --ignored' needs to recurse into untracked directories
>      and report all the ignored entries and then report the directory as
>      untracked -- UNLESS all the entries under the directory are
>      ignored, in which case we don't print any of the entries under the
>      directory and just report the directory itself as ignored.  (Note
>      that although this forces us to walk all untracked files underneath
>      the directory as well, we strip them from the output, except for
>      users like 'git clean' who also set DIR_KEEP_TRACKED_CONTENTS.)
> 
>    * For 'git clean', we may need to recurse into a directory that
>      doesn't match any specified pathspecs, if it's possible that there
>      is an entry underneath the directory that can match one of the
>      pathspecs.  In such a case, we need to be careful to omit the
>      directory itself from the list of paths (see commit 404ebceda01c
>      ("dir: also check directories for matching pathspecs", 2019-09-17))
> 
> Part of the tension noted above is that the treatment of a directory can
> change based on the files within it, and based on the various settings
> in dir->flags.  Trying to keep this in mind while reading over the code,
> it is easy to think in terms of "treat_directory() tells us what to do
> with a directory, and read_directory_recursive() is the thing that
> recurses".  Since we need to look into a directory to know how to treat
> it, though, it is quite easy to decide to (also) recurse into the
> directory from treat_directory() by adding a read_directory_recursive()
> call.  Adding such a call is actually fine, IF we make sure that
> read_directory_recursive() does not also recurse into that same
> directory.
> 
> Unfortunately, commit df5bcdf83aeb ("dir: recurse into untracked dirs
> for ignored files", 2017-05-18), added exactly such a case to the code,
> meaning we'd have two calls to read_directory_recursive() for an
> untracked directory.  So, if we had a file named
>    one/two/three/four/five/somefile.txt
> and nothing in one/ was tracked, then 'git status --ignored' would
> call read_directory_recursive() twice on the directory 'one/', and
> each of those would call read_directory_recursive() twice on the
> directory 'one/two/', and so on until read_directory_recursive() was
> called 2^5 times for 'one/two/three/four/five/'.
> 
> Avoid calling read_directory_recursive() twice per level by moving a
> lot of the special logic into treat_directory().
> 
> Since dir.c is somewhat complex, extra cruft built up around this over
> time.  While trying to unravel it, I noticed several instances where the
> first call to read_directory_recursive() would return e.g.
> path_untracked for some directory and a later one would return e.g.
> path_none, despite the fact that the directory clearly should have been
> considered untracked.  The code happened to work due to the side-effect
> from the first invocation of adding untracked entries to dir->entries;
> this allowed it to get the correct output despite the supposed override
> in return value by the later call.
> 
> I am somewhat concerned that there are still bugs and maybe even
> testcases with the wrong expectation.

For my part, I recently set up draft PRs to test the 'next' branch in
Scalar [1] and VFS for Git [2]. I'll create a Git installer using these
patches as well so I can run our functional test suite for a little extra
check of the behavior here.

[1] https://github.com/microsoft/scalar/pull/354/files
[2] https://github.com/microsoft/VFSForGit/pull/1645

>  I have tried to carefully
> document treat_directory() since it becomes more complex after this
> change (though much of this complexity came from elsewhere that probably
> deserved better comments to begin with).

I do enjoy your warning comments.

> However, much of my work felt
> more like a game of whackamole while attempting to make the code match
> the existing regression tests than an attempt to create an
> implementation that matched some clear design.  That seems wrong to me,
> but the rules of existing behavior had so many special cases that I had
> a hard time coming up with some overarching rules about what correct
> behavior is for all cases, forcing me to hope that the regression tests
> are correct and sufficient.  Such a hope seems likely to be ill-founded,
> given my experience with dir.c-related testcases in the last few months:
> 
>   Examples where the documentation was hard to parse or even just wrong:
>    * 3aca58045f4f (git-clean.txt: do not claim we will delete files with
>                    -n/--dry-run, 2019-09-17)
>    * 09487f2cbad3 (clean: avoid removing untracked files in a nested git
>                    repository, 2019-09-17)
>    * e86bbcf987fa (clean: disambiguate the definition of -d, 2019-09-17)
>   Examples where testcases were declared wrong and changed:
>    * 09487f2cbad3 (clean: avoid removing untracked files in a nested git
>                    repository, 2019-09-17)
>    * e86bbcf987fa (clean: disambiguate the definition of -d, 2019-09-17)
>    * a2b13367fe55 (Revert "dir.c: make 'git-status --ignored' work within
>                    leading directories", 2019-12-10)
>   Examples where testcases were clearly inadequate:
>    * 502c386ff944 (t7300-clean: demonstrate deleting nested repo with an
>                    ignored file breakage, 2019-08-25)
>    * 7541cc530239 (t7300: add testcases showing failure to clean specified
>                    pathspecs, 2019-09-17)
>    * a5e916c7453b (dir: fix off-by-one error in match_pathspec_item,
>                    2019-09-17)
>    * 404ebceda01c (dir: also check directories for matching pathspecs,
>                    2019-09-17)
>    * 09487f2cbad3 (clean: avoid removing untracked files in a nested git
>                    repository, 2019-09-17)
>    * e86bbcf987fa (clean: disambiguate the definition of -d, 2019-09-17)
>    * 452efd11fbf6 (t3011: demonstrate directory traversal failures,
>                    2019-12-10)
>    * b9670c1f5e6b (dir: fix checks on common prefix directory, 2019-12-19)
>   Examples where "correct behavior" was unclear to everyone:
>     https://lore.kernel.org/git/20190905154735.29784-1-newren@gmail.com/
>   Other commits of note:
>    * 902b90cf42bc (clean: fix theoretical path corruption, 2019-09-17)

Thanks for all of these pointers. These will be helpful if we ever do find
a regression that bisects to this patch.

> However, on the positive side, it does make the code much faster.  For
> the following simple shell loop in an empty repository:
> 
>   for depth in $(seq 10 25)
>   do
>     dirs=$(for i in $(seq 1 $depth) ; do printf 'dir/' ; done)
>     rm -rf dir
>     mkdir -p $dirs
>     >$dirs/untracked-file
>     /usr/bin/time --format="$depth: %e" git status --ignored >/dev/null
>   done
> 
> I saw the following timings, in seconds (note that the numbers are a
> little noisy from run-to-run, but the trend is very clear with every
> run):
> 
>     10: 0.03
>     11: 0.05
>     12: 0.08
>     13: 0.19
>     14: 0.29
>     15: 0.50
>     16: 1.05
>     17: 2.11
>     18: 4.11
>     19: 8.60
>     20: 17.55
>     21: 33.87
>     22: 68.71
>     23: 140.05
>     24: 274.45
>     25: 551.15

These are still impressive.

> For the above run, using strace I can look for the number of untracked
> directories opened and can verify that it matches the expected
> 2^($depth+1)-2 (the sum of 2^1 + 2^2 + 2^3 + ... + 2^$depth).
> 
> After this fix, with strace I can verify that the number of untracked
> directories that are opened drops to just $depth, and the timings all
> drop to 0.00.  In fact, it isn't until a depth of 190 nested directories
> that it sometimes starts reporting a time of 0.01 seconds and doesn't
> consistently report 0.01 seconds until there are 240 nested directories.
> The previous code would have taken
>   17.55 * 2^220 / (60*60*24*365) = 9.4 * 10^59 YEARS
> to have completed the 240 nested directories case.  It's not often
> that you get to speed something up by a factor of 3*10^69.
> 
> Finally, this also fixes the untracked cache, as noted by the test fixes
> in t7063.  Unfortunately, it does so by passing stop_at_first_file to
> close_cached_dir() in order to disable the caching of whether
> directories were empty (this caching was only relevant for directories
> that we knew we didn't need to walk all the entries under but just
> needed to know whether the directory had any entries within it in order
> to know if the directory itself should be marked as path_none or
> path_untracked).  I'm not convinced that disabling the is-the-dir-empty
> check is necessary; there is probably some way to still cache that and
> not get erroneous results.  However, I have not figured out how to do
> so.  If I revert the change to close_cached_dir() in this patch (thus
> continuing to cache cases where stop_at_first_file is true meaning we
> continue to cache whether directories are empty), then the untracked
> cache breakage in t7063 becomes more prevalant.  With my change to
> close_cached_dir() and the other changes to avoid traversing directories
> 2^n times in this patch, I not only avoid making the untracked_cache
> breakage in t7063 worse but actually fix the existing breakage.  Update
> the test results in t7063 to no longer expect check_only cache entries,
> to reflect that we have to do a bit more work in terms of how many
> directories we have to open, and to reflect that we fixed the 1/3 of
> tests that were broken in that testsuite.

We use the untracked cache in Scalar, so we should get some coverage
of that, too.

I'll let you know when the tests are done, and then do a review.

-Stolee

  reply	other threads:[~2020-03-26 13:13 UTC|newest]

Thread overview: 76+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-01-29 22:03 [PATCH 0/6] Avoid multiple recursive calls for same path in read_directory_recursive() Elijah Newren via GitGitGadget
2020-01-29 22:03 ` [PATCH 1/6] dir: consolidate treat_path() and treat_one_path() Elijah Newren via GitGitGadget
2020-01-29 22:03 ` [PATCH 2/6] dir: fix broken comment Elijah Newren via GitGitGadget
2020-01-29 22:03 ` [PATCH 3/6] dir: fix confusion based on variable tense Elijah Newren via GitGitGadget
2020-01-30 15:20   ` Derrick Stolee
2020-01-31 18:04   ` SZEDER Gábor
2020-01-31 18:17     ` Elijah Newren
2020-01-29 22:03 ` [PATCH 4/6] dir: move setting of nested_repo next to its actual usage Elijah Newren via GitGitGadget
2020-01-30 15:33   ` Derrick Stolee
2020-01-30 15:45     ` Elijah Newren
2020-01-30 16:00       ` Derrick Stolee
2020-01-30 16:10         ` Derrick Stolee
2020-01-30 16:20           ` Elijah Newren
2020-01-30 18:17             ` Derrick Stolee
2020-01-29 22:03 ` [PATCH 5/6] dir: replace exponential algorithm with a linear one Elijah Newren via GitGitGadget
2020-01-30 15:55   ` Derrick Stolee
2020-01-30 17:13     ` Elijah Newren
2020-01-30 17:45       ` Elijah Newren
2020-01-31 17:13   ` SZEDER Gábor
2020-01-31 17:47     ` Elijah Newren
2020-01-29 22:03 ` [PATCH 6/6] t7063: blindly accept diffs Elijah Newren via GitGitGadget
2020-01-31 18:31 ` [PATCH v2 0/6] Avoid multiple recursive calls for same path in read_directory_recursive() Elijah Newren via GitGitGadget
2020-01-31 18:31   ` [PATCH v2 1/6] dir: consolidate treat_path() and treat_one_path() Elijah Newren via GitGitGadget
2020-01-31 18:31   ` [PATCH v2 2/6] dir: fix broken comment Elijah Newren via GitGitGadget
2020-01-31 18:31   ` [PATCH v2 3/6] dir: fix confusion based on variable tense Elijah Newren via GitGitGadget
2020-01-31 18:31   ` [PATCH v2 4/6] dir: refactor treat_directory to clarify control flow Derrick Stolee via GitGitGadget
2020-01-31 18:31   ` [PATCH v2 5/6] dir: replace exponential algorithm with a linear one Elijah Newren via GitGitGadget
2020-01-31 18:31   ` [PATCH v2 6/6] t7063: blindly accept diffs Elijah Newren via GitGitGadget
2020-03-25 19:31   ` [PATCH v3 0/7] Avoid multiple recursive calls for same path in read_directory_recursive() Elijah Newren via GitGitGadget
2020-03-25 19:31     ` [PATCH v3 1/7] t7063: correct broken test expectation Elijah Newren via GitGitGadget
2020-03-26 13:02       ` Derrick Stolee
2020-03-26 21:18         ` Elijah Newren
2020-03-25 19:31     ` [PATCH v3 2/7] dir: fix simple typo in comment Elijah Newren via GitGitGadget
2020-03-25 19:31     ` [PATCH v3 3/7] dir: consolidate treat_path() and treat_one_path() Elijah Newren via GitGitGadget
2020-03-25 19:31     ` [PATCH v3 4/7] dir: fix broken comment Elijah Newren via GitGitGadget
2020-03-25 19:31     ` [PATCH v3 5/7] dir: fix confusion based on variable tense Elijah Newren via GitGitGadget
2020-03-25 19:31     ` [PATCH v3 6/7] dir: refactor treat_directory to clarify control flow Derrick Stolee via GitGitGadget
2020-03-25 19:31     ` [PATCH v3 7/7] dir: replace exponential algorithm with a linear one, fix untracked cache Elijah Newren via GitGitGadget
2020-03-26 13:13       ` Derrick Stolee [this message]
2020-03-26 21:27     ` [PATCH v4 0/7] Avoid multiple recursive calls for same path in read_directory_recursive() Elijah Newren via GitGitGadget
2020-03-26 21:27       ` [PATCH v4 1/7] t7063: more thorough status checking Elijah Newren via GitGitGadget
2020-03-27 13:09         ` Derrick Stolee
2020-03-29 18:18           ` Junio C Hamano
2020-03-31 20:15             ` Elijah Newren
2020-03-26 21:27       ` [PATCH v4 2/7] dir: fix simple typo in comment Elijah Newren via GitGitGadget
2020-03-26 21:27       ` [PATCH v4 3/7] dir: consolidate treat_path() and treat_one_path() Elijah Newren via GitGitGadget
2020-03-26 21:27       ` [PATCH v4 4/7] dir: fix broken comment Elijah Newren via GitGitGadget
2020-03-26 21:27       ` [PATCH v4 5/7] dir: fix confusion based on variable tense Elijah Newren via GitGitGadget
2020-03-26 21:27       ` [PATCH v4 6/7] dir: refactor treat_directory to clarify control flow Derrick Stolee via GitGitGadget
2020-03-26 21:27       ` [PATCH v4 7/7] dir: replace exponential algorithm with a linear one Elijah Newren via GitGitGadget
2020-03-27 13:13       ` [PATCH v4 0/7] Avoid multiple recursive calls for same path in read_directory_recursive() Derrick Stolee
2020-03-28 17:33         ` Elijah Newren
2020-03-29 18:20           ` Junio C Hamano
2020-04-01  4:17       ` [PATCH v5 00/12] " Elijah Newren via GitGitGadget
2020-04-01  4:17         ` [PATCH v5 01/12] t7063: more thorough status checking Elijah Newren via GitGitGadget
2020-04-01  4:17         ` [PATCH v5 02/12] t3000: add more testcases testing a variety of ls-files issues Elijah Newren via GitGitGadget
2020-04-01  4:17         ` [PATCH v5 03/12] dir: fix simple typo in comment Elijah Newren via GitGitGadget
2020-04-01  4:17         ` [PATCH v5 04/12] dir: consolidate treat_path() and treat_one_path() Elijah Newren via GitGitGadget
2020-04-01  4:17         ` [PATCH v5 05/12] dir: fix broken comment Elijah Newren via GitGitGadget
2020-04-01  4:17         ` [PATCH v5 06/12] dir: fix confusion based on variable tense Elijah Newren via GitGitGadget
2020-04-01  4:17         ` [PATCH v5 07/12] dir: refactor treat_directory to clarify control flow Derrick Stolee via GitGitGadget
2020-04-01  4:17         ` [PATCH v5 08/12] dir: replace exponential algorithm with a linear one Elijah Newren via GitGitGadget
2020-04-01 13:57           ` Derrick Stolee
2020-04-01 15:59             ` Elijah Newren
2020-04-01  4:17         ` [PATCH v5 09/12] dir: include DIR_KEEP_UNTRACKED_CONTENTS handling in treat_directory() Elijah Newren via GitGitGadget
2020-04-01  4:17         ` [PATCH v5 10/12] dir: replace double pathspec matching with single " Elijah Newren via GitGitGadget
2020-04-01  4:17         ` [PATCH v5 11/12] Fix error-prone fill_directory() API; make it only return matches Elijah Newren via GitGitGadget
2020-07-19  6:33           ` Andreas Schwab
2020-07-19 12:39             ` Martin Ågren
2020-07-20 15:25               ` Elijah Newren
2020-07-20 18:45                 ` [PATCH] dir: check pathspecs before returning `path_excluded` Martin Ågren
2020-07-20 18:49                   ` Elijah Newren
2020-07-20 18:51                     ` Martin Ågren
2020-07-20 20:25                   ` Junio C Hamano
2020-07-20 18:58                 ` [PATCH v5 11/12] Fix error-prone fill_directory() API; make it only return matches Junio C Hamano
2020-04-01  4:17         ` [PATCH v5 12/12] completion: fix 'git add' on paths under an untracked directory Elijah Newren via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=a81cb70f-1a3a-6f0f-1500-66e63138a848@gmail.com \
    --to=stolee@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=martin.melka@gmail.com \
    --cc=newren@gmail.com \
    --cc=pclouds@gmail.com \
    --cc=sxlijin@gmail.com \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).