Git Mailing List Archive on lore.kernel.org
 help / color / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: Elijah Newren via GitGitGadget <gitgitgadget@gmail.com>,
	git@vger.kernel.org
Cc: "Martin Melka" <martin.melka@gmail.com>,
	"SZEDER Gábor" <szeder.dev@gmail.com>,
	"Samuel Lijin" <sxlijin@gmail.com>,
	"Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>,
	"Elijah Newren" <newren@gmail.com>,
	Kevin.Willford@microsoft.com
Subject: Re: [PATCH 5/6] dir: replace exponential algorithm with a linear one
Date: Thu, 30 Jan 2020 10:55:50 -0500
Message-ID: <59b5b766-29e2-a709-b407-56bf6ea60b47@gmail.com> (raw)
In-Reply-To: <40b378e7adbbff5ecfd95fd888465fd0f99791c8.1580335424.git.gitgitgadget@gmail.com>

I am very enticed by the subject!

On 1/29/2020 5:03 PM, Elijah Newren via GitGitGadget wrote:
> Unfortunately, commit df5bcdf83aeb ("dir: recurse into untracked dirs
> for ignored files", 2017-05-18), added exactly such a case to the code,

I was disappointed that the commit you mention did not add a test for
the new behavior, but then found a test change in the following commit
fb89888849 (dir: hide untracked contents of untracked dirs, 2017-05-18).
This makes me feel better that your changes are less likely to un-do
the intention of df5bcdf83aeb.

> meaning we'd have two calls to read_directory_recursive() for an
> untracked directory.  So, if we had a file named
>    one/two/three/four/five/somefile.txt
> and nothing in one/ was tracked, then 'git status --ignored' would
> call read_directory_recursive() twice on the directory 'one/', and
> each of those would call read_directory_recursive() twice on the
> directory 'one/two/', and so on until read_directory_recursive() was
> called 2^5 times for 'one/two/three/four/five/'.

Wow! Good find. "Accidentally exponential" is a lot worse than
"accidentally quadratic". At least the N here _usually_ does not
grow too quickly, but the constant here (lstat-ing directories and
files) is significant enough that 2^3 or 2^4 is enough to notice
the difference.

> Avoid calling read_directory_recursive() twice per level by moving a
> lot of the special logic into treat_directory().
> 
> Since dir.c is somewhat complex, extra cruft built up around this over
> time.  While trying to unravel it, I noticed several instances where the
> first call to read_directory_recursive() would return e.g.
> path_untracked for a some directory and a later one would return e.g.
> path_none, and the code relied on the side-effect of the first adding
> untracked entries to dir->entries in order to get the correct output
> despite the supposed override in return value by the later call.
>
> I am somewhat concerned that there are still bugs and maybe even
> testcases with the wrong expectation.  I have tried to carefully
> document treat_directory() since it becomes more complex after this
> change (though much of this complexity came from elsewhere that probably
> deserved better comments to begin with).  However, much of my work felt
> more like a game of whackamole while attempting to make the code match
> the existing regression tests than an attempt to create an
> implementation that matched some clear design.  That seems wrong to me,
> but the rules of existing behavior had so many special cases that I had
> a hard time coming up with some overarching rules about what correct
> behavior is for all cases, forcing me to hope that the regression tests
> are correct and sufficient.  (I'll note that this turmoil makes working
> with dir.c extremely unpleasant for me; I keep hoping it'll get better,
> but it never seems to.)

Keep fighting the good fight! It appears that some of our most-important
code has these complicated cases and side-effects because it has grown
so organically over time. It's unlikely that someone _could_ rewrite it
to avoid that pain, as dir.c contains a lot of accumulated knowledge from
the many special-cases Git handles. I suppose the only thing we can do
is try to write as many detailed tests as possible.

> However, on the positive side, it does make the code much faster.  For
> the following simple shell loop in an empty repository:
> 
>   for depth in $(seq 10 25)
>   do
>     dirs=$(for i in $(seq 1 $depth) ; do printf 'dir/' ; done)
>     rm -rf dir
>     mkdir -p $dirs
>     >$dirs/untracked-file
>     /usr/bin/time --format="$depth: %e" git status --ignored >/dev/null
>   done
> 
> I saw the following timings, in seconds (note that the numbers are a
> little noisy from run-to-run, but the trend is very clear with every
> run):
> 
>     10: 0.03
>     11: 0.05
>     12: 0.08
>     13: 0.19
>     14: 0.29
>     15: 0.50
>     16: 1.05
>     17: 2.11
>     18: 4.11
>     19: 8.60
>     20: 17.55
>     21: 33.87
>     22: 68.71
>     23: 140.05
>     24: 274.45
>     25: 551.15

Are these timings on Linux? I imagine that the timings will increase
much more quickly on Windows.

> After this fix, those drop to:
> 
>     10: 0.00
...
>     25: 0.00

Nice. I wonder if presenting these 0.00 values as a table is worth
the space? At least the effect is dramatic.

> In fact, it isn't until a depth of 190 nested directories that it
> sometimes starts reporting a time of 0.01 seconds and doesn't
> consistently report 0.01 seconds until there are 240 nested directories.
> The previous code would have taken
>   17.55 * 2^220 / (60*60*24*365) = 9.4 * 10^59 YEARS
> to have completed the 240 nested directories case.  It's not often
> that you get to speed something up by a factor of 3*10^69.

Awesome.

> WARNING: This change breaks t7063.  I don't know whether that is to be expected
> (I now intentionally visit untracked directories differently so naturally the
> untracked cache should change), or if I've broken something.  I'm hoping to get
> an untracked cache expert to chime in...

I suppose that when the untracked cache is enabled, your expectation that we
do not need to recurse into an untracked directory is incorrect: we actually
want to explore that directory. Is there a mode we can check to see if we
are REALLY REALLY collecting _all_ untracked paths? Perhaps we need to create
one?

I'm CC'ing Kevin Willford because he is more familiar with the Git index
than me, and perhaps the untracked cache in particular.

> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
>  dir.c | 151 ++++++++++++++++++++++++++++++++++++++++------------------
>  1 file changed, 105 insertions(+), 46 deletions(-)
> 
> diff --git a/dir.c b/dir.c
> index ef3307718a..aaf038a9c4 100644
> --- a/dir.c
> +++ b/dir.c
> @@ -1659,7 +1659,13 @@ static enum path_treatment treat_directory(struct dir_struct *dir,
>  	const char *dirname, int len, int baselen, int excluded,
>  	const struct pathspec *pathspec)
>  {
> -	int nested_repo;
> +	/*
> +	 * WARNING: From this function, you can return path_recurse or you
> +	 *          can call read_directory_recursive() (or neither), but
> +	 *          you CAN'T DO BOTH.
> +	 */
> +	enum path_treatment state;
> +	int nested_repo, old_ignored_nr, stop_early;
>  
>  	/* The "len-1" is to strip the final '/' */
>  	switch (directory_exists_in_index(istate, dirname, len-1)) {
> @@ -1713,18 +1719,101 @@ static enum path_treatment treat_directory(struct dir_struct *dir,
>  
>  	/* This is the "show_other_directories" case */
>  
> -	if (!(dir->flags & DIR_HIDE_EMPTY_DIRECTORIES))
> +	/*
> +	 * We only need to recurse into untracked/ignored directories if
> +	 * either of the following bits is set:
> +	 *   - DIR_SHOW_IGNORED_TOO (because then we need to determine if
> +	 *                           there are ignored directories below)
> +	 *   - DIR_HIDE_EMPTY_DIRECTORIES (because we have to determine if
> +	 *                                 the directory is empty)

Perhaps here is where you could also have a DIR_LIST_ALL_UNTRACKED
flag to ensure the untracked cache loads all untracked paths?

> +	 */
> +	if (!(dir->flags & (DIR_SHOW_IGNORED_TOO | DIR_HIDE_EMPTY_DIRECTORIES)))
>  		return excluded ? path_excluded : path_untracked;
>  
> +	/*
> +	 * If we only want to determine if dirname is empty, then we can
> +	 * stop at the first file we find underneath that directory rather
> +	 * than continuing to recurse beyond it.  If DIR_SHOW_IGNORED_TOO
> +	 * is set, then we want MORE than just determining if dirname is
> +	 * empty.
> +	 */
> +	stop_early = ((dir->flags & DIR_HIDE_EMPTY_DIRECTORIES) &&
> +		      !(dir->flags & DIR_SHOW_IGNORED_TOO));
> +
> +	/*
> +	 * If /every/ file within an untracked directory is ignored, then
> +	 * we want to treat the directory as ignored (for e.g. status
> +	 * --porcelain), without listing the individual ignored files
> +	 * underneath.  To do so, we'll save the current ignored_nr, and
> +	 * pop all the ones added after it if it turns out the entire
> +	 * directory is ignored.

Here is a question for an untracked cache expert: Do we store ignored
paths in the untracked cache?

> +	 */
> +	old_ignored_nr = dir->ignored_nr;
> +
> +	/* Actually recurse into dirname now, we'll fixup the state later. */
>  	untracked = lookup_untracked(dir->untracked, untracked,
>  				     dirname + baselen, len - baselen);
> +	state = read_directory_recursive(dir, istate, dirname, len, untracked,
> +					 stop_early, stop_early, pathspec);
> +
> +	/* There are a variety of reasons we may need to fixup the state... */
> +	if (state == path_excluded) {
> +		int i;
> +
> +		/*
> +		 * When stop_early is set, read_directory_recursive() will
> +		 * never return path_untracked regardless of whether
> +		 * underlying paths were untracked or ignored (because
> +		 * returning early means it excluded some paths, or
> +		 * something like that -- see commit 5aaa7fd39aaf ("Improve
> +		 * performance of git status --ignored", 2017-09-18)).
> +		 * However, we're not really concerned with the status of
> +		 * files under the directory, we just wanted to know
> +		 * whether the directory was empty (state == path_none) or
> +		 * not (state == path_excluded), and if not, we'd return
> +		 * our original status based on whether the untracked
> +		 * directory matched an exclusion pattern.
> +		 */
> +		if (stop_early)
> +			state = excluded ? path_excluded : path_untracked;
> +
> +		else {
> +			/*
> +			 * When
> +			 *     !stop_early && state == path_excluded
> +			 * then all paths under dirname were ignored.  For
> +			 * this case, git status --porcelain wants to just
> +			 * list the directory itself as ignored and not
> +			 * list the individual paths underneath.  Remove
> +			 * the individual paths underneath.
> +			 */
> +			for (i = old_ignored_nr + 1; i<dir->ignored_nr; ++i)
> +				free(dir->ignored[i]);
> +			dir->ignored_nr = old_ignored_nr;
> +		}
> +	}
>  
>  	/*
> -	 * If this is an excluded directory, then we only need to check if
> -	 * the directory contains any files.
> +	 * If there is nothing under the current directory and we are not
> +	 * hiding empty directories, then we need to report on the
> +	 * untracked or ignored status of the directory itself.
>  	 */
> -	return read_directory_recursive(dir, istate, dirname, len,
> -					untracked, 1, excluded, pathspec);
> +	if (state == path_none && !(dir->flags & DIR_HIDE_EMPTY_DIRECTORIES))
> +		state = excluded ? path_excluded : path_untracked;
> +
> +	/*
> +	 * We can recurse into untracked directories that don't match any
> +	 * of the given pathspecs when some file underneath the directory
> +	 * might match one of the pathspecs.  If so, we should make sure
> +	 * to note that the directory itself did not match.
> +	 */
> +	if (pathspec &&
> +	    !match_pathspec(istate, pathspec, dirname, len,
> +			    0 /* prefix */, NULL,
> +			    0 /* do NOT special case dirs */))
> +		state = path_none;
> +
> +	return state;
>  }

This is certainly a substantial change, and I'm not able to read it
carefully right now. I hope to return to it soon, but hopefully I've
pointed out some places that may lead you to resolve your untracked
cache issues.

Thanks,
-Stolee


  reply index

Thread overview: 68+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-01-29 22:03 [PATCH 0/6] Avoid multiple recursive calls for same path in read_directory_recursive() Elijah Newren via GitGitGadget
2020-01-29 22:03 ` [PATCH 1/6] dir: consolidate treat_path() and treat_one_path() Elijah Newren via GitGitGadget
2020-01-29 22:03 ` [PATCH 2/6] dir: fix broken comment Elijah Newren via GitGitGadget
2020-01-29 22:03 ` [PATCH 3/6] dir: fix confusion based on variable tense Elijah Newren via GitGitGadget
2020-01-30 15:20   ` Derrick Stolee
2020-01-31 18:04   ` SZEDER Gábor
2020-01-31 18:17     ` Elijah Newren
2020-01-29 22:03 ` [PATCH 4/6] dir: move setting of nested_repo next to its actual usage Elijah Newren via GitGitGadget
2020-01-30 15:33   ` Derrick Stolee
2020-01-30 15:45     ` Elijah Newren
2020-01-30 16:00       ` Derrick Stolee
2020-01-30 16:10         ` Derrick Stolee
2020-01-30 16:20           ` Elijah Newren
2020-01-30 18:17             ` Derrick Stolee
2020-01-29 22:03 ` [PATCH 5/6] dir: replace exponential algorithm with a linear one Elijah Newren via GitGitGadget
2020-01-30 15:55   ` Derrick Stolee [this message]
2020-01-30 17:13     ` Elijah Newren
2020-01-30 17:45       ` Elijah Newren
2020-01-31 17:13   ` SZEDER Gábor
2020-01-31 17:47     ` Elijah Newren
2020-01-29 22:03 ` [PATCH 6/6] t7063: blindly accept diffs Elijah Newren via GitGitGadget
2020-01-31 18:31 ` [PATCH v2 0/6] Avoid multiple recursive calls for same path in read_directory_recursive() Elijah Newren via GitGitGadget
2020-01-31 18:31   ` [PATCH v2 1/6] dir: consolidate treat_path() and treat_one_path() Elijah Newren via GitGitGadget
2020-01-31 18:31   ` [PATCH v2 2/6] dir: fix broken comment Elijah Newren via GitGitGadget
2020-01-31 18:31   ` [PATCH v2 3/6] dir: fix confusion based on variable tense Elijah Newren via GitGitGadget
2020-01-31 18:31   ` [PATCH v2 4/6] dir: refactor treat_directory to clarify control flow Derrick Stolee via GitGitGadget
2020-01-31 18:31   ` [PATCH v2 5/6] dir: replace exponential algorithm with a linear one Elijah Newren via GitGitGadget
2020-01-31 18:31   ` [PATCH v2 6/6] t7063: blindly accept diffs Elijah Newren via GitGitGadget
2020-03-25 19:31   ` [PATCH v3 0/7] Avoid multiple recursive calls for same path in read_directory_recursive() Elijah Newren via GitGitGadget
2020-03-25 19:31     ` [PATCH v3 1/7] t7063: correct broken test expectation Elijah Newren via GitGitGadget
2020-03-26 13:02       ` Derrick Stolee
2020-03-26 21:18         ` Elijah Newren
2020-03-25 19:31     ` [PATCH v3 2/7] dir: fix simple typo in comment Elijah Newren via GitGitGadget
2020-03-25 19:31     ` [PATCH v3 3/7] dir: consolidate treat_path() and treat_one_path() Elijah Newren via GitGitGadget
2020-03-25 19:31     ` [PATCH v3 4/7] dir: fix broken comment Elijah Newren via GitGitGadget
2020-03-25 19:31     ` [PATCH v3 5/7] dir: fix confusion based on variable tense Elijah Newren via GitGitGadget
2020-03-25 19:31     ` [PATCH v3 6/7] dir: refactor treat_directory to clarify control flow Derrick Stolee via GitGitGadget
2020-03-25 19:31     ` [PATCH v3 7/7] dir: replace exponential algorithm with a linear one, fix untracked cache Elijah Newren via GitGitGadget
2020-03-26 13:13       ` Derrick Stolee
2020-03-26 21:27     ` [PATCH v4 0/7] Avoid multiple recursive calls for same path in read_directory_recursive() Elijah Newren via GitGitGadget
2020-03-26 21:27       ` [PATCH v4 1/7] t7063: more thorough status checking Elijah Newren via GitGitGadget
2020-03-27 13:09         ` Derrick Stolee
2020-03-29 18:18           ` Junio C Hamano
2020-03-31 20:15             ` Elijah Newren
2020-03-26 21:27       ` [PATCH v4 2/7] dir: fix simple typo in comment Elijah Newren via GitGitGadget
2020-03-26 21:27       ` [PATCH v4 3/7] dir: consolidate treat_path() and treat_one_path() Elijah Newren via GitGitGadget
2020-03-26 21:27       ` [PATCH v4 4/7] dir: fix broken comment Elijah Newren via GitGitGadget
2020-03-26 21:27       ` [PATCH v4 5/7] dir: fix confusion based on variable tense Elijah Newren via GitGitGadget
2020-03-26 21:27       ` [PATCH v4 6/7] dir: refactor treat_directory to clarify control flow Derrick Stolee via GitGitGadget
2020-03-26 21:27       ` [PATCH v4 7/7] dir: replace exponential algorithm with a linear one Elijah Newren via GitGitGadget
2020-03-27 13:13       ` [PATCH v4 0/7] Avoid multiple recursive calls for same path in read_directory_recursive() Derrick Stolee
2020-03-28 17:33         ` Elijah Newren
2020-03-29 18:20           ` Junio C Hamano
2020-04-01  4:17       ` [PATCH v5 00/12] " Elijah Newren via GitGitGadget
2020-04-01  4:17         ` [PATCH v5 01/12] t7063: more thorough status checking Elijah Newren via GitGitGadget
2020-04-01  4:17         ` [PATCH v5 02/12] t3000: add more testcases testing a variety of ls-files issues Elijah Newren via GitGitGadget
2020-04-01  4:17         ` [PATCH v5 03/12] dir: fix simple typo in comment Elijah Newren via GitGitGadget
2020-04-01  4:17         ` [PATCH v5 04/12] dir: consolidate treat_path() and treat_one_path() Elijah Newren via GitGitGadget
2020-04-01  4:17         ` [PATCH v5 05/12] dir: fix broken comment Elijah Newren via GitGitGadget
2020-04-01  4:17         ` [PATCH v5 06/12] dir: fix confusion based on variable tense Elijah Newren via GitGitGadget
2020-04-01  4:17         ` [PATCH v5 07/12] dir: refactor treat_directory to clarify control flow Derrick Stolee via GitGitGadget
2020-04-01  4:17         ` [PATCH v5 08/12] dir: replace exponential algorithm with a linear one Elijah Newren via GitGitGadget
2020-04-01 13:57           ` Derrick Stolee
2020-04-01 15:59             ` Elijah Newren
2020-04-01  4:17         ` [PATCH v5 09/12] dir: include DIR_KEEP_UNTRACKED_CONTENTS handling in treat_directory() Elijah Newren via GitGitGadget
2020-04-01  4:17         ` [PATCH v5 10/12] dir: replace double pathspec matching with single " Elijah Newren via GitGitGadget
2020-04-01  4:17         ` [PATCH v5 11/12] Fix error-prone fill_directory() API; make it only return matches Elijah Newren via GitGitGadget
2020-04-01  4:17         ` [PATCH v5 12/12] completion: fix 'git add' on paths under an untracked directory Elijah Newren via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=59b5b766-29e2-a709-b407-56bf6ea60b47@gmail.com \
    --to=stolee@gmail.com \
    --cc=Kevin.Willford@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=martin.melka@gmail.com \
    --cc=newren@gmail.com \
    --cc=pclouds@gmail.com \
    --cc=sxlijin@gmail.com \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Git Mailing List Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/git/0 git/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 git git/ https://lore.kernel.org/git \
		git@vger.kernel.org
	public-inbox-index git

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.git


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git