git.vger.kernel.org archive mirror
* Question about fsmonitor and --untracked-files=all
@ 2020-09-22 11:35 Tao Klerks
  2020-09-23 10:40 ` Johannes Schindelin
  0 siblings, 1 reply; 3+ messages in thread
From: Tao Klerks @ 2020-09-22 11:35 UTC
  To: git

Hi folks,

I've got a couple of questions about the "fsmonitor" functionality,
untracked files, and multithreading.

Background:

In a repo with:
 * A couple hundred thousand tracked files, and a couple hundred
thousand .gitignored files, across a few thousand directories
 * The --untracked-cache setting, tested and working
 * core.fsmonitor set up with watchman (with the sample integration
script from January)
 * Git version 2.27.0.windows.1

"git status" takes about 2s
"git status --untracked-files=all" takes about 20s

When I turn off "core.fsmonitor", the numbers change to something like:
"git status": 8s
"git status --untracked-files=all": 9s

Using Windows' "procmon" to observe git.exe's behavior from outside, I
think I've understood a couple of things that surprise me:
1. When you specify "--untracked-files=all", git scans the entire
folder tree regardless of the "fsmonitor" hook.
2. When you specify the "fsmonitor" hook, git does any
filesystem scanning in a single-threaded fashion (as opposed to
multi-threaded without "fsmonitor" / normally).

These two things combine so that with "fsmonitor" set, normal
command-line git status performance is great, but the performance in
tools that eagerly look for untracked files (like "Git Extensions" on
Windows) actually suffers: it takes twice as long to run the 'git -c
diff.ignoreSubModules=none status --porcelain=2 -z
--untracked-files=all' command that this UI wants (and blocks on, when
you go to a commit dialog).

Questions:

1. Is there a reason "--untracked-files=all" causes a full directory
tree scan even with the "fsmonitor" hook active, or is this
accidental?
2. Assuming that the full directory tree scan is indeed necessary even
with "fsmonitor" (when requesting all untracked files), could it be
made multithreaded?

(My apologies for the simplistic "outside-in" observations; I don't
feel qualified to attempt to understand the git source code.)

Thanks for any help understanding the optimization opportunities here!

Tao Klerks


* Re: Question about fsmonitor and --untracked-files=all
  2020-09-22 11:35 Question about fsmonitor and --untracked-files=all Tao Klerks
@ 2020-09-23 10:40 ` Johannes Schindelin
  2020-09-24 12:14   ` Tao Klerks
  0 siblings, 1 reply; 3+ messages in thread
From: Johannes Schindelin @ 2020-09-23 10:40 UTC
  To: Tao Klerks; +Cc: git

Hi Tao,

On Tue, 22 Sep 2020, Tao Klerks wrote:

> Questions:
>
> 1. Is there a reason "--untracked-files=all" causes a full directory
> tree scan even with the "fsmonitor" hook active, or is this
> accidental?

I have a hunch that this might be related to a performance hack we have in
Git for Windows: did you enable FSCache perchance?

If so, I _suspect_ that turning it off would accelerate `git status
--untracked-files=all`.

Ciao,
Johannes



* Re: Question about fsmonitor and --untracked-files=all
  2020-09-23 10:40 ` Johannes Schindelin
@ 2020-09-24 12:14   ` Tao Klerks
  0 siblings, 0 replies; 3+ messages in thread
From: Tao Klerks @ 2020-09-24 12:14 UTC
  To: Johannes Schindelin; +Cc: git

Hi Johannes,

Thanks for the tip. Unfortunately, it doesn't seem to have had any
positive effect.

With "git config core.fscache false", everything takes longer except a
simple "git status" with both the fsmonitor and the untrackedCache
enabled (in which case I guess nothing ends up needing the
filesystem). That combination (fsmonitor enabled, untrackedCache
enabled, and a plain "git status") is the *only* one I've found so far
that doesn't force a directory scan. *When* there is a directory scan
(because of "--untracked-files=all", or because the fsmonitor is
disabled, or because the untrackedCache is disabled), having fscache
disabled makes things significantly slower (from 20% slower to double
the time, depending on the exact combination).
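
The combinations above can be compared systematically with a small
timing harness. What follows is only a sketch under stated
assumptions, not the script used for the numbers in this thread: the
function names and the hook path argument are hypothetical,
core.fscache is assumed to exist only in Git for Windows, and a
warm-up run is assumed to be enough to absorb one-off costs after a
settings change.

```python
import itertools
import subprocess
import time


def setting_combinations(hook_path):
    """Return every (config, status_args) combination mentioned in the thread.

    hook_path is a hypothetical path to the fsmonitor hook script; a config
    value of None means "unset the key".
    """
    combos = []
    for fsmonitor, ucache, fscache, all_untracked in itertools.product(
            (True, False), repeat=4):
        config = {
            "core.fsmonitor": hook_path if fsmonitor else None,
            "core.untrackedCache": "true" if ucache else "false",
            "core.fscache": "true" if fscache else "false",  # Git for Windows only
        }
        status_args = ["--untracked-files=all"] if all_untracked else []
        combos.append((config, status_args))
    return combos


def time_status(repo, config, status_args, runs=3):
    """Apply config to repo, warm up once, return the best wall time in seconds."""
    for key, value in config.items():
        if value is None:
            # Exit status is ignored on purpose: the key may already be unset.
            subprocess.run(["git", "-C", repo, "config", "--unset", key])
        else:
            subprocess.run(["git", "-C", repo, "config", key, value], check=True)
    cmd = ["git", "-C", repo, "status"] + status_args
    # Warm-up run, so one-off costs after a settings change are not measured.
    subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
        timings.append(time.perf_counter() - start)
    return min(timings)
```

Calling time_status(repo, config, status_args) for each of the 16
entries returned by setting_combinations() gives one timing per
combination; the repo path and hook path are whatever applies locally.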

I tried to stumble my way around some of the source code, and I
suspect I've found at least one explanation: The untracked cache
appears to be ignored when "--untracked-files=all" is specified, and
this appears to be intentional:
* In wt-status.c#wt_status_collect_untracked(), the "dir.flags" are
updated to include "DIR_SHOW_OTHER_DIRECTORIES" only when the
"SHOW_ALL_UNTRACKED_FILES" arg is *not* detected
* In later logic nested in dir.c#validate_untracked_cache(), the
absence of the "DIR_SHOW_OTHER_DIRECTORIES" flag causes the
validation to fail and, up one level in read_directory(), this causes
the untracked structure to be discarded

The relevant comment in "validate_untracked_cache()" says "See
treat_directory(), case index_nonexistent. Without this
[DIR_SHOW_OTHER_DIRECTORIES] flag, we may need to also cache .git file
content for the resolve_gitlink_ref() call, which we don't.". I can't
claim to understand the comment, the relationship to gitlinks, etc :(
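
To make the suspected control flow concrete, here is a toy model in
Python. It is a simplification, not the git source: the Python names
are hypothetical stand-ins for the C identifiers, and it assumes
(following the "Without this flag" wording of the quoted comment) that
the untracked cache is rejected whenever DIR_SHOW_OTHER_DIRECTORIES is
absent, and that the flag is absent exactly when all untracked files
are requested.

```python
from enum import Flag, auto


class DirFlags(Flag):
    """Toy stand-in for the dir.flags bits mentioned in the thread."""
    NONE = 0
    SHOW_OTHER_DIRECTORIES = auto()
    HIDE_EMPTY_DIRECTORIES = auto()


def collect_untracked_flags(show_all_untracked):
    """Model of the flag setup attributed to wt_status_collect_untracked()."""
    flags = DirFlags.NONE
    if not show_all_untracked:
        # Assumption: the flags are only added when NOT listing all files.
        flags |= DirFlags.SHOW_OTHER_DIRECTORIES | DirFlags.HIDE_EMPTY_DIRECTORIES
    return flags


def untracked_cache_usable(flags, cache_present=True):
    """Model of the check attributed to validate_untracked_cache()."""
    if not cache_present:
        return False
    # Assumption: without this flag the cache fails validation.
    return bool(flags & DirFlags.SHOW_OTHER_DIRECTORIES)


def needs_full_scan(show_all_untracked, cache_present=True):
    """True when the model falls back to scanning the directory tree."""
    flags = collect_untracked_flags(show_all_untracked)
    return not untracked_cache_usable(flags, cache_present)
```

In this model needs_full_scan(True) is always true, which matches the
observation that "--untracked-files=all" forces a directory scan even
when the untracked cache is enabled.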

Does this look like something solvable? It looks like supporting the
untrackedCache even with "--untracked-files=all" would make a
(potentially) large difference to git status performance in some
workflows with fsmonitor enabled.

(all that said, I still haven't understood why the presence of the
fsmonitor hook makes the difference, in terms of behavior, between
*multi-threaded* directory tree scanning for all directory contents
(without the fsmonitor), and *single-threaded* directory scanning for
untracked files specifically (with the fsmonitor))

Thanks for looking; any further thoughts will of course be most appreciated!

Tao Klerks

