git.vger.kernel.org archive mirror
From: Tao Klerks <tao@klerks.biz>
To: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Cc: git@vger.kernel.org
Subject: Re: Question about fsmonitor and --untracked-files=all
Date: Thu, 24 Sep 2020 14:14:56 +0200
Message-ID: <CAPMMpohJicVeCaKsPvommYbGEH-D1V02TTMaiVTV8ux+9z9vkQ@mail.gmail.com>
In-Reply-To: <nycvar.QRO.7.76.6.2009231238560.5061@tvgsbejvaqbjf.bet>

Hi Johannes,

Thanks for the tip - unfortunately, that doesn't seem to have had any
positive effect.

With "git config core.fscache false", everything takes longer except a
plain "git status" with both the fsmonitor and the untrackedCache
enabled (in which case I guess nothing ends up needing the filesystem).
That combination is the *only* one I've found so far that doesn't force
a directory scan. Whenever there *is* a directory scan (because of
"--untracked-files=all", or because the fsmonitor is disabled, or
because the untrackedCache is disabled), having fscache disabled makes
things significantly slower: anywhere from 20% slower to double the
time, depending on the exact combination.
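
For anyone wanting to reproduce this kind of comparison, here is a
minimal Python sketch of a timing matrix over the settings mentioned
above. It is not the procedure actually used for the numbers in this
thread. The function names and the hook path argument are hypothetical,
and it assumes that an empty "core.fsmonitor" value disables the hook
and that "core.fscache" is honoured (Git for Windows only).

```python
import itertools
import subprocess
import time


def build_status_command(fsmonitor_hook, untracked_cache, fscache, untracked_all):
    """Return the argv for one `git status` run with per-invocation config overrides."""
    cmd = [
        "git",
        # Assumption: an empty core.fsmonitor value disables the hook.
        "-c", "core.fsmonitor=" + (fsmonitor_hook or ""),
        "-c", "core.untrackedCache=" + ("true" if untracked_cache else "false"),
        # core.fscache only exists in Git for Windows.
        "-c", "core.fscache=" + ("true" if fscache else "false"),
        "status",
    ]
    if untracked_all:
        cmd.append("--untracked-files=all")
    return cmd


def all_combinations(fsmonitor_hook):
    """Yield the 16 argv lists: fsmonitor x untrackedCache x fscache x untracked mode."""
    for use_fsmonitor, ucache, fscache, untracked_all in itertools.product(
        (True, False), repeat=4
    ):
        hook = fsmonitor_hook if use_fsmonitor else None
        yield build_status_command(hook, ucache, fscache, untracked_all)


def time_command(cmd, repo_dir, runs=3):
    """Run cmd `runs` times in repo_dir and return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, cwd=repo_dir, stdout=subprocess.DEVNULL, check=True)
        best = min(best, time.perf_counter() - start)
    return best
```

Passing each list from all_combinations(hook_path) to time_command()
with the repository path gives one timing per combination. Taking the
fastest of several runs is a deliberate choice: the first run after
toggling a setting may rewrite the index and would otherwise skew the
result.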

I tried to stumble my way around some of the source code, and I
suspect I've found at least one explanation: The untracked cache
appears to be ignored when "--untracked-files=all" is specified, and
this appears to be intentional:
* In wt-status.c#wt_status_collect_untracked(), the "dir.flags" are
updated to include "DIR_SHOW_OTHER_DIRECTORIES" only when the mode is
*not* "SHOW_ALL_UNTRACKED_FILES", so with "--untracked-files=all" the
flag is left unset
* In later logic nested in dir.c#validate_untracked_cache(), the
absence of the "DIR_SHOW_OTHER_DIRECTORIES" flag causes the
validation to fail and, up one level in read_directory(), this causes
the untracked structure to be discarded

The relevant comment in "validate_untracked_cache()" says "See
treat_directory(), case index_nonexistent. Without this
[DIR_SHOW_OTHER_DIRECTORIES] flag, we may need to also cache .git file
content for the resolve_gitlink_ref() call, which we don't.". I can't
claim to understand the comment, the relationship to gitlinks, etc :(
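
To make the interaction between those two functions easier to follow,
here is a small Python model of it. This is a simplified sketch, not
git's actual C code. The flag names mirror the ones mentioned above,
but the numeric values and the function names are made up for
illustration. It assumes, following the "Without this flag" wording of
the comment, that the cache is rejected when
"DIR_SHOW_OTHER_DIRECTORIES" is absent, and that the flag is only set
for modes other than "all".

```python
from enum import IntFlag


class DirFlags(IntFlag):
    # Names mirror the flags discussed above; the values are arbitrary.
    SHOW_OTHER_DIRECTORIES = 1
    HIDE_EMPTY_DIRECTORIES = 2


def status_dir_flags(show_all_untracked):
    """Model of the flag setup in wt_status_collect_untracked()."""
    flags = DirFlags(0)
    if not show_all_untracked:
        # Assumption: only the non-"all" modes add these flags.
        flags |= DirFlags.SHOW_OTHER_DIRECTORIES | DirFlags.HIDE_EMPTY_DIRECTORIES
    return flags


def untracked_cache_usable(flags):
    """Model of the single flag check in validate_untracked_cache() discussed here."""
    # Assumption: without SHOW_OTHER_DIRECTORIES the cache is rejected,
    # and the caller then discards the untracked structure.
    return bool(flags & DirFlags.SHOW_OTHER_DIRECTORIES)
```

In this model untracked_cache_usable(status_dir_flags(False)) is True
and untracked_cache_usable(status_dir_flags(True)) is False, which
matches the observed behaviour: "--untracked-files=all" always ends up
bypassing the untracked cache.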

Does this look like something solvable? It looks like supporting the
untrackedCache even with "--untracked-files=all" would make a
(potentially) large difference to git status performance in some
workflows with fsmonitor enabled.

(all that said, I still haven't understood why the presence of the
fsmonitor hook makes the difference, in terms of behavior, between
*multi-threaded* directory tree scanning for all directory contents
(without the fsmonitor), and *single-threaded* directory scanning for
untracked files specifically (with the fsmonitor))

Thanks for looking; any further thoughts will of course be most appreciated!

Tao Klerks

On Wed, Sep 23, 2020 at 4:42 PM Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
>
> Hi Tao,
>
> On Tue, 22 Sep 2020, Tao Klerks wrote:
>
> > I've got a couple questions about the "fsmonitor" functionality,
> > untracked files, and multithreading.
> >
> > Background:
> >
> > In a repo with:
> >  * A couple hundred thousand tracked files, and a couple hundred
> > thousand .gitignored files, across a few thousand directories
> >  * The --untracked-cache setting, tested and working
> >  * core.fsmonitor set up with watchman (with the sample integration
> > script from january)
> >  * Git version 2.27.0.windows.1
> >
> > "git status" takes about 2s
> > "git status --untracked-files=all" takes about 20s
> >
> > When I turn off "core.fsmonitor", the numbers change to something like:
> > "git status": 8s
> > "git status --untracked-files=all": 9s
> >
> > Using windows' "procmon" to observe git.exe's behavior from outside, I
> > think I've understood a couple things that surprise me:
> > 1. when you specify "--untracked-files=all", git scans the entire
> > folder tree regardless of the "fsmonitor" hook
> > 2. when you specify the "fsmonitor" hook, git does any
> > filesystem-scanning in a single-threaded fashion (as opposed to
> > multi-threaded without "fsmonitor" / normally)
> >
> > These two things combine so that with "fsmonitor" set, normal
> > command-line git status performance is great, but the performance in
> > tools that eagerly look for untracked files (like "Git Extensions" on
> > windows) actually suffers - it takes twice as long to run the 'git -c
> > diff.ignoreSubModules=none status --porcelain=2 -z
> > --untracked-files=all' command that this UI wants (and blocks on, when
> > you go to a commit dialog).
> >
> > Questions:
> >
> > 1. Is there a reason "--untracked-files=all" causes a full directory
> > tree scan even with the "fsmonitor" hook active, or is this
> > accidental?
>
> I have a hunch that this might be related to a performance hack we have in
> Git for Windows: did you enable FSCache perchance?
>
> If so, I _suspect_ that turning it off would accelerate `git status
> --untracked-files=all`.
>
> Ciao,
> Johannes
>
> > 2. Assuming that the full directory tree scan is indeed necessary even
> > with "fsmonitor" (when requesting all untracked files), could it be
> > made multithreaded?
> >
> > (my apologies for the simplistic "outside-in" observations; I don't
> > feel qualified to attempt to understand the git source code)
> >
> > Thanks for any help understanding the optimization opportunities here!
> >
> > Tao Klerks
> >

Thread overview: 3+ messages
2020-09-22 11:35 Question about fsmonitor and --untracked-files=all Tao Klerks
2020-09-23 10:40 ` Johannes Schindelin
2020-09-24 12:14   ` Tao Klerks [this message]
