Git Mailing List Archive on lore.kernel.org
 help / color / Atom feed
From: Elijah Newren <newren@gmail.com>
To: Derrick Stolee <stolee@gmail.com>
Cc: "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com>,
	"Git Mailing List" <git@vger.kernel.org>,
	"Martin Melka" <martin.melka@gmail.com>,
	"SZEDER Gábor" <szeder.dev@gmail.com>,
	"Samuel Lijin" <sxlijin@gmail.com>,
	"Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>,
	Kevin.Willford@microsoft.com
Subject: Re: [PATCH 5/6] dir: replace exponential algorithm with a linear one
Date: Thu, 30 Jan 2020 09:13:21 -0800
Message-ID: <CABPp-BEQ5s=+6Rnb-A+pdEaoPXxfo-hMSegSe1eai=RE74A3Og@mail.gmail.com> (raw)
In-Reply-To: <59b5b766-29e2-a709-b407-56bf6ea60b47@gmail.com>

On Thu, Jan 30, 2020 at 7:55 AM Derrick Stolee <stolee@gmail.com> wrote:
> > However, on the positive side, it does make the code much faster.  For
> > the following simple shell loop in an empty repository:
> >
> >   for depth in $(seq 10 25)
> >   do
> >     dirs=$(for i in $(seq 1 $depth) ; do printf 'dir/' ; done)
> >     rm -rf dir
> >     mkdir -p $dirs
> >     >$dirs/untracked-file
> >     /usr/bin/time --format="$depth: %e" git status --ignored >/dev/null
> >   done
> >
> > I saw the following timings, in seconds (note that the numbers are a
> > little noisy from run-to-run, but the trend is very clear with every
> > run):
> >
> >     10: 0.03
> >     11: 0.05
> >     12: 0.08
> >     13: 0.19
> >     14: 0.29
> >     15: 0.50
> >     16: 1.05
> >     17: 2.11
> >     18: 4.11
> >     19: 8.60
> >     20: 17.55
> >     21: 33.87
> >     22: 68.71
> >     23: 140.05
> >     24: 274.45
> >     25: 551.15
>
> Are these timings on Linux? I imagine that the timings will increase
> much more quickly on Windows.

Yes, on Linux, with an SSD for the hard drive in this case (though I
suspect OS caching of the directories would probably eliminate any
differences between an SSD and a spinny disk since the same
directories are visited so many times).

> > After this fix, those drop to:
> >
> >     10: 0.00
> ...
> >     25: 0.00
>
> Nice. I wonder if presenting these 0.00 values as a table is worth
> the space? At least the effect is dramatic.

I first considered a table, but then noted it didn't match the code
snippet I provided and was worried I'd have to spend more time
explaining how I post-processed the output from two runs than we'd
gain from compressing the number of lines of the commit message.
Assuming reader time was more valuable, I opted to just keep the two
snippets of output.

> > WARNING: This change breaks t7063.  I don't know whether that is to be expected
> > (I now intentionally visit untracked directories differently so naturally the
> > untracked cache should change), or if I've broken something.  I'm hoping to get
> > an untracked cache expert to chime in...
>
> I suppose that when the untracked cache is enabled, your expectation that we
> do not need to recurse into an untracked directory is incorrect: we actually
> want to explore that directory. Is there a mode we can check to see if we
> are REALLY REALLY collecting _all_ untracked paths? Perhaps we need to create
> one?

I don't think I made any significant changes about using the untracked
cache versus traversing; the primary differences should be that I
traverse each directory once instead of 2^N times.  However, the
previous code would traverse with both check_only=0 and check_only=1,
and to avoid the whole 2^N thing I only traverse once.  That
fundamentally means I only won't traverse with both settings of that
flag.

The output in t7063 seems to suggest to me that the check_only flag
matters to what the untracked-cache stores ("check_only" literally
appears as part of the expected output), and the output also suggests
that the untracked-cache is recording when entries are visited
multiple times somehow.  Or maybe I'm just totally misunderstanding
the expected output in t7063.  I really have no clue about that stuff.

> I'm CC'ing Kevin Willford because he is more familiar with the Git index
> than me, and perhaps the untracked cache in particular.

Getting another set of eyes, even if they only know enough to provide
hunches or guesses, would be very welcome.

> > Signed-off-by: Elijah Newren <newren@gmail.com>
> > ---
> >  dir.c | 151 ++++++++++++++++++++++++++++++++++++++++------------------
> >  1 file changed, 105 insertions(+), 46 deletions(-)
> >
> > diff --git a/dir.c b/dir.c
> > index ef3307718a..aaf038a9c4 100644
> > --- a/dir.c
> > +++ b/dir.c
> > @@ -1659,7 +1659,13 @@ static enum path_treatment treat_directory(struct dir_struct *dir,
> >       const char *dirname, int len, int baselen, int excluded,
> >       const struct pathspec *pathspec)
> >  {
> > -     int nested_repo;
> > +     /*
> > +      * WARNING: From this function, you can return path_recurse or you
> > +      *          can call read_directory_recursive() (or neither), but
> > +      *          you CAN'T DO BOTH.
> > +      */
> > +     enum path_treatment state;
> > +     int nested_repo, old_ignored_nr, stop_early;
> >
> >       /* The "len-1" is to strip the final '/' */
> >       switch (directory_exists_in_index(istate, dirname, len-1)) {
> > @@ -1713,18 +1719,101 @@ static enum path_treatment treat_directory(struct dir_struct *dir,
> >
> >       /* This is the "show_other_directories" case */
> >
> > -     if (!(dir->flags & DIR_HIDE_EMPTY_DIRECTORIES))
> > +     /*
> > +      * We only need to recurse into untracked/ignored directories if
> > +      * either of the following bits is set:
> > +      *   - DIR_SHOW_IGNORED_TOO (because then we need to determine if
> > +      *                           there are ignored directories below)
> > +      *   - DIR_HIDE_EMPTY_DIRECTORIES (because we have to determine if
> > +      *                                 the directory is empty)
>
> Perhaps here is where you could also have a DIR_LIST_ALL_UNTRACKED
> flag to ensure the untracked cache loads all untracked paths?

Do you mean DIR_KEEP_UNTRACKED_CONTENTS (which is documented in dir.h
as only having meaning when DIR_SHOW_IGNORED_TOO is also set, and thus
caused me to not list it separately)?

Speaking of DIR_KEEP_UNTRACKED_CONTENTS, though, its handling as a
post-processing step in read_directory() is now inconsistent with how
we handle squashing a directory full of ignores into just marking the
containing directory as ignored.  I think I should move the
read_directory() logic for DIR_KEEP_UNTRACKED_CONTENTS to
treat_directory() and use another counter similar to old_ignored_nr.
It should be more efficient that way, too.

>
> > +      */
> > +     if (!(dir->flags & (DIR_SHOW_IGNORED_TOO | DIR_HIDE_EMPTY_DIRECTORIES)))
> >               return excluded ? path_excluded : path_untracked;
> >
> > +     /*
> > +      * If we only want to determine if dirname is empty, then we can
> > +      * stop at the first file we find underneath that directory rather
> > +      * than continuing to recurse beyond it.  If DIR_SHOW_IGNORED_TOO
> > +      * is set, then we want MORE than just determining if dirname is
> > +      * empty.
> > +      */
> > +     stop_early = ((dir->flags & DIR_HIDE_EMPTY_DIRECTORIES) &&
> > +                   !(dir->flags & DIR_SHOW_IGNORED_TOO));
> > +
> > +     /*
> > +      * If /every/ file within an untracked directory is ignored, then
> > +      * we want to treat the directory as ignored (for e.g. status
> > +      * --porcelain), without listing the individual ignored files
> > +      * underneath.  To do so, we'll save the current ignored_nr, and
> > +      * pop all the ones added after it if it turns out the entire
> > +      * directory is ignored.
>
> Here is a question for an untracked cache expert: Do we store ignored
> paths in the untracked cache?

According to 0dcb8d7fe0ec ("untracked cache: record .gitignore
information and dir hierarchy", 2015-03-08), no:

    This cached output is about untracked files only, not ignored files
    because the number of tracked files is usually small, so small cache
    overhead, while the number of ignored files could go really high
    (e.g. *.o files mixing with source code).

...unless, of course, someone came along later and changed the design goals.

[...]

> This is certainly a substantial change, and I'm not able to read it
> carefully right now. I hope to return to it soon, but hopefully I've
> pointed out some places that may lead you to resolve your untracked
> cache issues.

Yeah, it's pretty hard to reason about; personally I needed lots of
dumps of state during traversals just to partially make sense of it.

I had dumps of output from both before and after my changes printing
out return values of treat_directory() and paths and a bunch of other
stuff and was doing lots of comparisons (and repeatedly did this for
many, many different testcases with different toplevel git commands).
It was particularly annoying that the old stuff would traverse
everything 2^N times, half the time with check_only on and half the
time with it off.  It would return different state values for the same
path from different calls, often depending on the side effects of
dir.entries having had more entries added by the first recursion to
get the right output, despite the fact that the "wrong" state was
returned by treat_directory() for later visits to the same path (e.g.
path_untracked returned for the first time it was visited, then
path_none later, and it was a case where path_untracked was correct in
my view).

Despite those difficulties, having an extra set of eyes try to reason
about it and pointing out anything that looks amiss or even that just
looks hard to understand would be very welcome.

  reply index

Thread overview: 68+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-01-29 22:03 [PATCH 0/6] Avoid multiple recursive calls for same path in read_directory_recursive() Elijah Newren via GitGitGadget
2020-01-29 22:03 ` [PATCH 1/6] dir: consolidate treat_path() and treat_one_path() Elijah Newren via GitGitGadget
2020-01-29 22:03 ` [PATCH 2/6] dir: fix broken comment Elijah Newren via GitGitGadget
2020-01-29 22:03 ` [PATCH 3/6] dir: fix confusion based on variable tense Elijah Newren via GitGitGadget
2020-01-30 15:20   ` Derrick Stolee
2020-01-31 18:04   ` SZEDER Gábor
2020-01-31 18:17     ` Elijah Newren
2020-01-29 22:03 ` [PATCH 4/6] dir: move setting of nested_repo next to its actual usage Elijah Newren via GitGitGadget
2020-01-30 15:33   ` Derrick Stolee
2020-01-30 15:45     ` Elijah Newren
2020-01-30 16:00       ` Derrick Stolee
2020-01-30 16:10         ` Derrick Stolee
2020-01-30 16:20           ` Elijah Newren
2020-01-30 18:17             ` Derrick Stolee
2020-01-29 22:03 ` [PATCH 5/6] dir: replace exponential algorithm with a linear one Elijah Newren via GitGitGadget
2020-01-30 15:55   ` Derrick Stolee
2020-01-30 17:13     ` Elijah Newren [this message]
2020-01-30 17:45       ` Elijah Newren
2020-01-31 17:13   ` SZEDER Gábor
2020-01-31 17:47     ` Elijah Newren
2020-01-29 22:03 ` [PATCH 6/6] t7063: blindly accept diffs Elijah Newren via GitGitGadget
2020-01-31 18:31 ` [PATCH v2 0/6] Avoid multiple recursive calls for same path in read_directory_recursive() Elijah Newren via GitGitGadget
2020-01-31 18:31   ` [PATCH v2 1/6] dir: consolidate treat_path() and treat_one_path() Elijah Newren via GitGitGadget
2020-01-31 18:31   ` [PATCH v2 2/6] dir: fix broken comment Elijah Newren via GitGitGadget
2020-01-31 18:31   ` [PATCH v2 3/6] dir: fix confusion based on variable tense Elijah Newren via GitGitGadget
2020-01-31 18:31   ` [PATCH v2 4/6] dir: refactor treat_directory to clarify control flow Derrick Stolee via GitGitGadget
2020-01-31 18:31   ` [PATCH v2 5/6] dir: replace exponential algorithm with a linear one Elijah Newren via GitGitGadget
2020-01-31 18:31   ` [PATCH v2 6/6] t7063: blindly accept diffs Elijah Newren via GitGitGadget
2020-03-25 19:31   ` [PATCH v3 0/7] Avoid multiple recursive calls for same path in read_directory_recursive() Elijah Newren via GitGitGadget
2020-03-25 19:31     ` [PATCH v3 1/7] t7063: correct broken test expectation Elijah Newren via GitGitGadget
2020-03-26 13:02       ` Derrick Stolee
2020-03-26 21:18         ` Elijah Newren
2020-03-25 19:31     ` [PATCH v3 2/7] dir: fix simple typo in comment Elijah Newren via GitGitGadget
2020-03-25 19:31     ` [PATCH v3 3/7] dir: consolidate treat_path() and treat_one_path() Elijah Newren via GitGitGadget
2020-03-25 19:31     ` [PATCH v3 4/7] dir: fix broken comment Elijah Newren via GitGitGadget
2020-03-25 19:31     ` [PATCH v3 5/7] dir: fix confusion based on variable tense Elijah Newren via GitGitGadget
2020-03-25 19:31     ` [PATCH v3 6/7] dir: refactor treat_directory to clarify control flow Derrick Stolee via GitGitGadget
2020-03-25 19:31     ` [PATCH v3 7/7] dir: replace exponential algorithm with a linear one, fix untracked cache Elijah Newren via GitGitGadget
2020-03-26 13:13       ` Derrick Stolee
2020-03-26 21:27     ` [PATCH v4 0/7] Avoid multiple recursive calls for same path in read_directory_recursive() Elijah Newren via GitGitGadget
2020-03-26 21:27       ` [PATCH v4 1/7] t7063: more thorough status checking Elijah Newren via GitGitGadget
2020-03-27 13:09         ` Derrick Stolee
2020-03-29 18:18           ` Junio C Hamano
2020-03-31 20:15             ` Elijah Newren
2020-03-26 21:27       ` [PATCH v4 2/7] dir: fix simple typo in comment Elijah Newren via GitGitGadget
2020-03-26 21:27       ` [PATCH v4 3/7] dir: consolidate treat_path() and treat_one_path() Elijah Newren via GitGitGadget
2020-03-26 21:27       ` [PATCH v4 4/7] dir: fix broken comment Elijah Newren via GitGitGadget
2020-03-26 21:27       ` [PATCH v4 5/7] dir: fix confusion based on variable tense Elijah Newren via GitGitGadget
2020-03-26 21:27       ` [PATCH v4 6/7] dir: refactor treat_directory to clarify control flow Derrick Stolee via GitGitGadget
2020-03-26 21:27       ` [PATCH v4 7/7] dir: replace exponential algorithm with a linear one Elijah Newren via GitGitGadget
2020-03-27 13:13       ` [PATCH v4 0/7] Avoid multiple recursive calls for same path in read_directory_recursive() Derrick Stolee
2020-03-28 17:33         ` Elijah Newren
2020-03-29 18:20           ` Junio C Hamano
2020-04-01  4:17       ` [PATCH v5 00/12] " Elijah Newren via GitGitGadget
2020-04-01  4:17         ` [PATCH v5 01/12] t7063: more thorough status checking Elijah Newren via GitGitGadget
2020-04-01  4:17         ` [PATCH v5 02/12] t3000: add more testcases testing a variety of ls-files issues Elijah Newren via GitGitGadget
2020-04-01  4:17         ` [PATCH v5 03/12] dir: fix simple typo in comment Elijah Newren via GitGitGadget
2020-04-01  4:17         ` [PATCH v5 04/12] dir: consolidate treat_path() and treat_one_path() Elijah Newren via GitGitGadget
2020-04-01  4:17         ` [PATCH v5 05/12] dir: fix broken comment Elijah Newren via GitGitGadget
2020-04-01  4:17         ` [PATCH v5 06/12] dir: fix confusion based on variable tense Elijah Newren via GitGitGadget
2020-04-01  4:17         ` [PATCH v5 07/12] dir: refactor treat_directory to clarify control flow Derrick Stolee via GitGitGadget
2020-04-01  4:17         ` [PATCH v5 08/12] dir: replace exponential algorithm with a linear one Elijah Newren via GitGitGadget
2020-04-01 13:57           ` Derrick Stolee
2020-04-01 15:59             ` Elijah Newren
2020-04-01  4:17         ` [PATCH v5 09/12] dir: include DIR_KEEP_UNTRACKED_CONTENTS handling in treat_directory() Elijah Newren via GitGitGadget
2020-04-01  4:17         ` [PATCH v5 10/12] dir: replace double pathspec matching with single " Elijah Newren via GitGitGadget
2020-04-01  4:17         ` [PATCH v5 11/12] Fix error-prone fill_directory() API; make it only return matches Elijah Newren via GitGitGadget
2020-04-01  4:17         ` [PATCH v5 12/12] completion: fix 'git add' on paths under an untracked directory Elijah Newren via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CABPp-BEQ5s=+6Rnb-A+pdEaoPXxfo-hMSegSe1eai=RE74A3Og@mail.gmail.com' \
    --to=newren@gmail.com \
    --cc=Kevin.Willford@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=martin.melka@gmail.com \
    --cc=pclouds@gmail.com \
    --cc=stolee@gmail.com \
    --cc=sxlijin@gmail.com \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Git Mailing List Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/git/0 git/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 git git/ https://lore.kernel.org/git \
		git@vger.kernel.org
	public-inbox-index git

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.git


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git