* inotify to minimize stat() calls
@ 2013-02-08 21:10 Ramkumar Ramachandra
  2013-02-08 22:15 ` Junio C Hamano
  2013-02-14 15:16 ` Ævar Arnfjörð Bjarmason
  0 siblings, 2 replies; 88+ messages in thread
From: Ramkumar Ramachandra @ 2013-02-08 21:10 UTC (permalink / raw)
  To: Git List

Hi,

For large repositories, many simple git commands like `git status`
take a while to respond.  I understand that this is because of the large
number of stat() calls needed to figure out which files have changed.  I
overheard that Mercurial wants to solve this problem using inotify,
but the idea bothers me because it's not portable.  Will Git ever
consider using inotify on Linux?  What is the downside?

Ram

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: inotify to minimize stat() calls
  2013-02-08 21:10 inotify to minimize stat() calls Ramkumar Ramachandra
@ 2013-02-08 22:15 ` Junio C Hamano
  2013-02-08 22:45   ` Junio C Hamano
  2013-02-14 15:16 ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 88+ messages in thread
From: Junio C Hamano @ 2013-02-08 22:15 UTC (permalink / raw)
  To: Ramkumar Ramachandra; +Cc: Git List

Ramkumar Ramachandra <artagnon@gmail.com> writes:

> ...  Will Git ever
> consider using inotify on Linux?  What is the downside?

I think this has come up from time to time, but my understanding is
that nobody thought things through to find a good layer in the
codebase to interface to an external daemon that listens to inotify
events yet.  It is not something like "somebody decreed that we
would never consider because of such and such downsides."  We are
not there yet.


* Re: inotify to minimize stat() calls
  2013-02-08 22:15 ` Junio C Hamano
@ 2013-02-08 22:45   ` Junio C Hamano
  2013-02-09  2:10     ` Duy Nguyen
  2013-02-09  2:56     ` Junio C Hamano
  0 siblings, 2 replies; 88+ messages in thread
From: Junio C Hamano @ 2013-02-08 22:45 UTC (permalink / raw)
  To: Ramkumar Ramachandra; +Cc: Git List

Junio C Hamano <gitster@pobox.com> writes:

> Ramkumar Ramachandra <artagnon@gmail.com> writes:
>
>> ...  Will Git ever
>> consider using inotify on Linux?  What is the downside?
>
> I think this has come up from time to time, but my understanding is
> that nobody thought things through to find a good layer in the
> codebase to interface to an external daemon that listens to inotify
> events yet.  It is not something like "somebody decreed that we
> would never consider because of such and such downsides."  We are
> not there yet.

I checked read-cache.c and preload-index.c code.  To get the
discussion rolling, I think something like the outline below may be
a good starting point and a feasible weekend hack for somebody
competent:

 * At the beginning of preload_index(), instead of spawning the
   worker thread and doing the lstat() check ourselves, we open a
   socket to our daemon (see below) that watches this repository and
   make a request for lstat update.  The request will contain:

    - The SHA1 checksum of the index file we just read (to ensure
      that we and our daemon share the same baseline to
      communicate); and

    - the pathspec data.

   Our daemon, if it already has fresh data available, will give
   us a list of <path, lstat result>.  Our main process runs a loop
   that is equivalent to what preload_thread() runs but uses the
   lstat() data we obtained from the daemon.  If our daemon says it
   does not have fresh data (or somehow our daemon is dead), we do
   the work ourselves.

 * Our daemon watches the index file and the working tree, and
   waits for the above consumer.  First it reads the index (and
   remembers what it read), and whenever an inotify event comes,
   does the lstat() and remembers the result.  It never writes
   to the index, and does not hold the index lock.  Whenever the
   index file changes, it needs to reload the index, and discard
   lstat() data it already has for paths that are lost from the
   updated index.
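
A minimal sketch of the client-side decision in the outline above.  The
types `struct watch_request` and `struct watch_state` and the function
`can_use_daemon()` are hypothetical names used only for illustration;
none of them exists in git.  The client trusts the daemon only when it
is reachable, reports fresh data, and was looking at the same index
checksum; otherwise it falls back to doing the lstat() work itself.

```c
#include <string.h>

#define SHA1_RAWSZ 20 /* raw SHA-1 length in bytes */

/* Hypothetical request sent by the client; not an existing git structure. */
struct watch_request {
	unsigned char index_sha1[SHA1_RAWSZ]; /* checksum of the index we just read */
	const char *pathspec;                 /* paths the command cares about */
};

/* Hypothetical view of what the daemon knows. */
struct watch_state {
	unsigned char index_sha1[SHA1_RAWSZ]; /* index the daemon last loaded */
	int fresh;                            /* nonzero if the daemon's data is current */
};

/*
 * Returns 1 if the daemon's answer can be used, 0 if the client must
 * fall back to calling lstat() itself.  A NULL state stands for a dead
 * or unreachable daemon.
 */
static int can_use_daemon(const struct watch_request *req,
			  const struct watch_state *st)
{
	if (!st)
		return 0;
	if (!st->fresh)
		return 0;
	/* Both sides must share the same baseline index. */
	return !memcmp(req->index_sha1, st->index_sha1, SHA1_RAWSZ);
}
```

The checksum comparison is what keeps a daemon that has not yet reloaded
a rewritten index from answering with stale data.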


* Re: inotify to minimize stat() calls
  2013-02-08 22:45   ` Junio C Hamano
@ 2013-02-09  2:10     ` Duy Nguyen
  2013-02-09  2:37       ` Junio C Hamano
  2013-02-09  2:56     ` Junio C Hamano
  1 sibling, 1 reply; 88+ messages in thread
From: Duy Nguyen @ 2013-02-09  2:10 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Ramkumar Ramachandra, Git List

On Sat, Feb 9, 2013 at 5:45 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Junio C Hamano <gitster@pobox.com> writes:
>
>> Ramkumar Ramachandra <artagnon@gmail.com> writes:
>>
>>> ...  Will Git ever
>>> consider using inotify on Linux?  What is the downside?
>>
>> I think this has come up from time to time, but my understanding is
>> that nobody thought things through to find a good layer in the
>> codebase to interface to an external daemon that listens to inotify
>> events yet.  It is not something like "somebody decreed that we
>> would never consider because of such and such downsides."  We are
>> not there yet.
>
> I checked read-cache.c and preload-index.c code.  To get the
> discussion rolling, I think something like the outline below may be
> a good starting point and a feasible weekend hack for somebody
> competent:
>
>  * At the beginning of preload_index(), instead of spawning the
>    worker thread and doing the lstat() check ourselves, we open a
>    socket to our daemon (see below) that watches this repository and

Can we replace "open a socket to our daemon" with "open a special file
in .git to get stat data written by our daemon"? TCP/IP socket means
system-wide daemon, not attractive. UNIX socket is not available on
Windows (although there may be named pipe, I don't know).

>    make a request for lstat update.  The request will contain:
>
>     - The SHA1 checksum of the index file we just read (to ensure
>       that we and our daemon share the same baseline to
>       communicate); and
>
>     - the pathspec data.
>
>    Our daemon, if it already has fresh data available, will give
>    us a list of <path, lstat result>.  Our main process runs a loop
>    that is equivalent to what preload_thread() runs but uses the
>    lstat() data we obtained from the daemon.  If our daemon says it
>    does not have fresh data (or somehow our daemon is dead), we do
>    the work ourselves.
>
>  * Our daemon watches the index file and the working tree, and
>    waits for the above consumer.  First it reads the index (and
>    remembers what it read), and whenever an inotify event comes,
>    does the lstat() and remembers the result.  It never writes
>    to the index, and does not hold the index lock.  Whenever the
>    index file changes, it needs to reload the index, and discard
>    lstat() data it already has for paths that are lost from the
>    updated index.


-- 
Duy


* Re: inotify to minimize stat() calls
  2013-02-09  2:10     ` Duy Nguyen
@ 2013-02-09  2:37       ` Junio C Hamano
  0 siblings, 0 replies; 88+ messages in thread
From: Junio C Hamano @ 2013-02-09  2:37 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Ramkumar Ramachandra, Git List

Duy Nguyen <pclouds@gmail.com> writes:

> Can we replace "open a socket to our daemon" with "open a special file
> in .git to get stat data written by our daemon"? TCP/IP socket means
> system-wide daemon, not attractive. UNIX socket is not available on
> Windows (although there may be named pipe, I don't know).

I do not think TCP/IP socket is too bad (you have to be able to read
the index file to be able to ask questions to the daemon to begin
with, so you must have list of paths already; the answer from the
daemon would not leak anything more sensitive than you can already
know), and UNIX domain socket is not too bad either.

Just like the implementation detail of the daemon itself may differ
on platforms (does Windows have the identical inotify interface?  I
doubt it), I expect the RPC mechanism between the daemon and the
client would be platform dependent.  So take that "open a socket" as
a generic way to say "have these two communicate with some magic",
nothing more.


* Re: inotify to minimize stat() calls
  2013-02-08 22:45   ` Junio C Hamano
  2013-02-09  2:10     ` Duy Nguyen
@ 2013-02-09  2:56     ` Junio C Hamano
  2013-02-09  3:36       ` Robert Zeh
  2013-02-09 11:32       ` Ramkumar Ramachandra
  1 sibling, 2 replies; 88+ messages in thread
From: Junio C Hamano @ 2013-02-09  2:56 UTC (permalink / raw)
  To: Git List; +Cc: Duy Nguyen, Ramkumar Ramachandra

Junio C Hamano <gitster@pobox.com> writes:

> I checked read-cache.c and preload-index.c code.  To get the
> discussion rolling, I think something like the outline below may be
> a good starting point and a feasible weekend hack for somebody
> competent:
>
>  * At the beginning of preload_index(), instead of spawning the
>    worker thread and doing the lstat() check ourselves, we open a
>    socket to our daemon (see below) that watches this repository and
>    make a request for lstat update.  The request will contain:
>
>     - The SHA1 checksum of the index file we just read (to ensure
>       that we and our daemon share the same baseline to
>       communicate); and
>
>     - the pathspec data.
>
>    Our daemon, if it already has fresh data available, will give
>    us a list of <path, lstat result>.  Our main process runs a loop
>    that is equivalent to what preload_thread() runs but uses the
>    lstat() data we obtained from the daemon.  If our daemon says it
>    does not have fresh data (or somehow our daemon is dead), we do
>    the work ourselves.
>
>  * Our daemon watches the index file and the working tree, and
>    waits for the above consumer.  First it reads the index (and
>    remembers what it read), and whenever an inotify event comes,
>    does the lstat() and remembers the result.  It never writes
>    to the index, and does not hold the index lock.  Whenever the
>    index file changes, it needs to reload the index, and discard
>    lstat() data it already has for paths that are lost from the
>    updated index.

I left the details unsaid in the above because I thought it was
fairly obvious from the nature of the "outline", but let me spend a
few more lines to avoid confusion.

 - The way the daemon "watches" the changes to the working tree and
   the index may well be very platform dependent.  I said "inotify"
   above, but the mechanism does not have to be inotify.

 - The channel over which the daemon and the client communicate would
   also be system dependent.  A UNIX domain socket in $GIT_DIR/ with a
   well-known name would be one possibility but it does not have to
   be the only option.

 - The data given from the daemon to the client does not have to
   include full lstat() information.  They start from the same index
   info, and the only thing preload_index() wants to know is for
   which paths it should call ce_mark_uptodate(ce), so the answer
   given by our daemon can be a list of paths.
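
A minimal sketch of the last point, assuming the daemon replies with a
sorted list of paths it considers unchanged.  `struct entry` is a
simplified stand-in for git's cache entry and `apply_clean_list()` is a
hypothetical helper; setting `uptodate` plays the role of
ce_mark_uptodate(ce).

```c
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for an index entry (an assumption, not git's struct). */
struct entry {
	const char *path;
	int uptodate;
};

/* bsearch() comparator: key is a path, elem points at a path pointer. */
static int cmp_path(const void *key, const void *elem)
{
	return strcmp((const char *)key, *(const char *const *)elem);
}

/*
 * 'clean' is the strcmp()-sorted list of paths the daemon reported as
 * unchanged.  Mark the matching entries up to date and return how many
 * entries still need an lstat() by the client.
 */
static size_t apply_clean_list(struct entry *ents, size_t nr,
			       const char *const *clean, size_t nr_clean)
{
	size_t i, remaining = 0;

	for (i = 0; i < nr; i++) {
		if (bsearch(ents[i].path, clean, nr_clean,
			    sizeof(*clean), cmp_path))
			ents[i].uptodate = 1;
		else
			remaining++;
	}
	return remaining;
}
```

Entries the daemon does not mention are simply left alone, so the
existing lstat() path still covers them.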


* Re: inotify to minimize stat() calls
  2013-02-09  2:56     ` Junio C Hamano
@ 2013-02-09  3:36       ` Robert Zeh
  2013-02-09 12:05         ` Ramkumar Ramachandra
  2013-02-09 11:32       ` Ramkumar Ramachandra
  1 sibling, 1 reply; 88+ messages in thread
From: Robert Zeh @ 2013-02-09  3:36 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git List, Duy Nguyen, Ramkumar Ramachandra

The delay for commands like git status is much worse on Windows than on Linux; for my workflow I would be happy with a Windows-only implementation.

From the description so far, I have some questions: how does the daemon get started and stopped?  Is there one per repository?  That seems to be implied by putting the UNIX domain socket in $GIT_DIR.  Could we automatically reject connections from anything other than localhost when using TCP?

Robert Zeh

On Feb 8, 2013, at 8:56 PM, Junio C Hamano <gitster@pobox.com> wrote:

> Junio C Hamano <gitster@pobox.com> writes:
> 
>> I checked read-cache.c and preload-index.c code.  To get the
>> discussion rolling, I think something like the outline below may be
>> a good starting point and a feasible weekend hack for somebody
>> competent:
>> 
>> * At the beginning of preload_index(), instead of spawning the
>>   worker thread and doing the lstat() check ourselves, we open a
>>   socket to our daemon (see below) that watches this repository and
>>   make a request for lstat update.  The request will contain:
>> 
>>    - The SHA1 checksum of the index file we just read (to ensure
>>      that we and our daemon share the same baseline to
>>      communicate); and
>> 
>>    - the pathspec data.
>> 
>>   Our daemon, if it already has fresh data available, will give
>>   us a list of <path, lstat result>.  Our main process runs a loop
>>   that is equivalent to what preload_thread() runs but uses the
>>   lstat() data we obtained from the daemon.  If our daemon says it
>>   does not have fresh data (or somehow our daemon is dead), we do
>>   the work ourselves.
>> 
>> * Our daemon watches the index file and the working tree, and
>>   waits for the above consumer.  First it reads the index (and
>>   remembers what it read), and whenever an inotify event comes,
>>   does the lstat() and remembers the result.  It never writes
>>   to the index, and does not hold the index lock.  Whenever the
>>   index file changes, it needs to reload the index, and discard
>>   lstat() data it already has for paths that are lost from the
>>   updated index.
> 
> I left the details unsaid in the above because I thought it was
> fairly obvious from the nature of the "outline", but let me spend a
> few more lines to avoid confusion.
> 
> - The way the daemon "watches" the changes to the working tree and
>   the index may well be very platform dependent.  I said "inotify"
>   above, but the mechanism does not have to be inotify.
> 
> - The channel over which the daemon and the client communicate would
>   also be system dependent.  A UNIX domain socket in $GIT_DIR/ with a
>   well-known name would be one possibility but it does not have to
>   be the only option.
> 
> - The data given from the daemon to the client does not have to
>   include full lstat() information.  They start from the same index
>   info, and the only thing preload_index() wants to know is for
>   which paths it should call ce_mark_uptodate(ce), so the answer
>   given by our daemon can be a list of paths.


* Re: inotify to minimize stat() calls
  2013-02-09  2:56     ` Junio C Hamano
  2013-02-09  3:36       ` Robert Zeh
@ 2013-02-09 11:32       ` Ramkumar Ramachandra
  1 sibling, 0 replies; 88+ messages in thread
From: Ramkumar Ramachandra @ 2013-02-09 11:32 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git List, Duy Nguyen

Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
>
>> I checked read-cache.c and preload-index.c code.  To get the
>> discussion rolling, I think something like the outline below may be
>> a good starting point and a feasible weekend hack for somebody
>> competent:
>>
>>  * At the beginning of preload_index(), instead of spawning the
>>    worker thread and doing the lstat() check ourselves, we open a
>>    socket to our daemon (see below) that watches this repository and
>>    make a request for lstat update.  The request will contain:
>>
>>     - The SHA1 checksum of the index file we just read (to ensure
>>       that we and our daemon share the same baseline to
>>       communicate); and
>>
>>     - the pathspec data.
>>
>>    Our daemon, if it already has fresh data available, will give
>>    us a list of <path, lstat result>.  Our main process runs a loop
>>    that is equivalent to what preload_thread() runs but uses the
>>    lstat() data we obtained from the daemon.  If our daemon says it
>>    does not have fresh data (or somehow our daemon is dead), we do
>>    the work ourselves.
>>
>>  * Our daemon watches the index file and the working tree, and
>>    waits for the above consumer.  First it reads the index (and
>>    remembers what it read), and whenever an inotify event comes,
>>    does the lstat() and remembers the result.  It never writes
>>    to the index, and does not hold the index lock.  Whenever the
>>    index file changes, it needs to reload the index, and discard
>>    lstat() data it already has for paths that are lost from the
>>    updated index.
>
> I left the details unsaid in the above because I thought it was
> fairly obvious from the nature of the "outline", but let me spend a
> few more lines to avoid confusion.
>
>  - The way the daemon "watches" the changes to the working tree and
>    the index may well be very platform dependent.  I said "inotify"
>    above, but the mechanism does not have to be inotify.

Is the BSD kernel's inotify the same as the one on Linux?  Must we
design something that's generic enough from the start?

More importantly, do you know of a platform-independent inotify
implementation in C?  A quick Googling turned up QFileSystemWatcher
[1], a part of Qt.

[1]: http://qt-project.org/doc/qt-4.8/qfilesystemwatcher.html

>  - The channel over which the daemon and the client communicate would
>    also be system dependent.  A UNIX domain socket in $GIT_DIR/ with a
>    well-known name would be one possibility but it does not have to
>    be the only option.

UNIX domain sockets are also preferred because we'd never want to
connect to a watch daemon over the network?

Then the communication channel code also has to be generic enough.

>  - The data given from the daemon to the client does not have to
>    include full lstat() information.  They start from the same index
>    info, and the only thing preload_index() wants to know is for
>    which paths it should call ce_mark_uptodate(ce), so the answer
>    given by our daemon can be a list of paths.

Right.


* Re: inotify to minimize stat() calls
  2013-02-09  3:36       ` Robert Zeh
@ 2013-02-09 12:05         ` Ramkumar Ramachandra
  2013-02-09 12:11           ` Ramkumar Ramachandra
                             ` (2 more replies)
  0 siblings, 3 replies; 88+ messages in thread
From: Ramkumar Ramachandra @ 2013-02-09 12:05 UTC (permalink / raw)
  To: Robert Zeh; +Cc: Junio C Hamano, Git List, Duy Nguyen

Robert Zeh wrote:
> From the description so far, I have some questions: how does the daemon get started and stopped?  Is there one per repository ...

What about getting systemd to watch everything for us?  Then we can
just have one daemon reporting filesystem changes over one global
socket.  Its API should be the inotify subset:

   systemd_add_watch
   systemd_remove_watch

Except systemd_add_watch also accepts a UNIX socket to send lstat
events to.  Our preload_index() is just reduced to making one
systemd_add_watch() call the very first time and updating the index as
necessary.  Now, what about desktops with huge uptimes (like mine)?
Won't they get polluted with too many useless watches over time?
Simple: timeout.  If nobody reads from the UNIX socket for two hours
after a systemd_add_watch, execute systemd_remove_watch automatically.

Someone must implement a similar daemon on other platforms reporting
information in exactly the same way (although with different
internals).  IP sockets are system-wide and all platforms have them,
so the communication channel is also standardized.

This is much better than Junio's suggestion to study possible
implementations on all platforms and designing a generic daemon/
communication channel.  That's no weekend project.


* Re: inotify to minimize stat() calls
  2013-02-09 12:05         ` Ramkumar Ramachandra
@ 2013-02-09 12:11           ` Ramkumar Ramachandra
  2013-02-09 12:53           ` Ramkumar Ramachandra
  2013-02-09 19:35           ` Junio C Hamano
  2 siblings, 0 replies; 88+ messages in thread
From: Ramkumar Ramachandra @ 2013-02-09 12:11 UTC (permalink / raw)
  To: Robert Zeh; +Cc: Junio C Hamano, Git List, Duy Nguyen

Ramkumar Ramachandra wrote:
> What about getting systemd to watch everything for us?  Then we can
> just have one daemon reporting filesystem changes over one global
> socket.  Its API should be the inotify subset:

Er, not one global socket: many little sockets as described later.
(The idea was just forming while I was writing this paragraph)


* Re: inotify to minimize stat() calls
  2013-02-09 12:05         ` Ramkumar Ramachandra
  2013-02-09 12:11           ` Ramkumar Ramachandra
@ 2013-02-09 12:53           ` Ramkumar Ramachandra
  2013-02-09 12:59             ` Duy Nguyen
  2013-02-09 19:35           ` Junio C Hamano
  2 siblings, 1 reply; 88+ messages in thread
From: Ramkumar Ramachandra @ 2013-02-09 12:53 UTC (permalink / raw)
  To: Robert Zeh; +Cc: Junio C Hamano, Git List, Duy Nguyen

Ramkumar Ramachandra wrote:
> What about getting systemd to watch everything for us?

systemd is the perfect candidate!  It already has an inotify watcher:
see systemd.path(5).  Can't be used as-is because it spawns processes
on events, which is a non-scalable design.  Secondly, it uses static
.path files to define the rules which is no good for us.  So, we need
to add an API to it, and ask it to report events over IP sockets.  The
API part is simple too, because it already has a DBUS API for many
things [1]; it's just a matter of extending it.

Yes, I know.  This introduces dbus as an additional optional
non-portable dependency.  Do you have suggestions for alternatives
that aren't complicated?

[1]: http://www.freedesktop.org/wiki/Software/systemd/dbus


* Re: inotify to minimize stat() calls
  2013-02-09 12:53           ` Ramkumar Ramachandra
@ 2013-02-09 12:59             ` Duy Nguyen
  2013-02-09 17:10               ` Ramkumar Ramachandra
  0 siblings, 1 reply; 88+ messages in thread
From: Duy Nguyen @ 2013-02-09 12:59 UTC (permalink / raw)
  To: Ramkumar Ramachandra; +Cc: Robert Zeh, Junio C Hamano, Git List

On Sat, Feb 9, 2013 at 7:53 PM, Ramkumar Ramachandra <artagnon@gmail.com> wrote:
> Ramkumar Ramachandra wrote:
>> What about getting systemd to watch everything for us?
>
> systemd is the perfect candidate!

How about this as a start? I did not really check what it does, but it
does not look complicated enough to pull systemd in.

http://article.gmane.org/gmane.comp.version-control.git/151934

You may want to search the mail archive. This topic has come up a few
times before, there may be other similar patches.
-- 
Duy


* Re: inotify to minimize stat() calls
  2013-02-09 12:59             ` Duy Nguyen
@ 2013-02-09 17:10               ` Ramkumar Ramachandra
  2013-02-09 18:56                 ` Ramkumar Ramachandra
  2013-02-10  5:24                 ` Duy Nguyen
  0 siblings, 2 replies; 88+ messages in thread
From: Ramkumar Ramachandra @ 2013-02-09 17:10 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Robert Zeh, Junio C Hamano, Git List, finnag

Duy Nguyen wrote:
> How about this as a start? I did not really check what it does, but it
> does not look complicated enough to pull systemd in.
>
> http://article.gmane.org/gmane.comp.version-control.git/151934

Clever hack.  I didn't know that there was a switch called
core.ignoreStat which will disable automatic lstat() calls altogether.
So, Finn advises that we set this switch and run igit instead of git.
There's a git-inotify-daemon which runs inotifywait with -m forever,
updating a modified_files hash.  When it is sent a TERM from igit
(which is what happens immediately upon execution), it writes all this
collected information about modified files to a named pipe that igit
passes to it.  igit then does a git update-index --assume-unchanged
--stdin to read the data from the pipe.  Towards the end of its life,
igit starts up a fresh git-inotify-daemon for future invocations.

Finn notes in the commit message that it offers no speedup, because
.gitignore files in every directory still have to be read.  I think
this is silly: we really should be caching .gitignore, and touching it
only when lstat() reports that the file has changed.
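
A minimal sketch of that caching idea, assuming a hypothetical
`struct ignore_cache` record per .gitignore file (nothing like it is
claimed to exist in git).  The file is re-read only when lstat() shows a
different mtime, size, or inode than the one seen at parse time.

```c
#include <sys/stat.h>

/* Hypothetical cache record for one .gitignore file. */
struct ignore_cache {
	int valid;      /* nonzero once the file has been parsed */
	struct stat st; /* lstat() data seen at parse time */
};

/*
 * Returns 1 if 'path' must be (re)parsed, 0 if the cached parse is
 * still usable.  A missing file invalidates the cache.
 */
static int ignore_needs_reload(struct ignore_cache *c, const char *path)
{
	struct stat st;

	if (lstat(path, &st) < 0) {
		c->valid = 0;
		return 1;
	}
	if (c->valid &&
	    st.st_mtime == c->st.st_mtime &&
	    st.st_size == c->st.st_size &&
	    st.st_ino == c->st.st_ino)
		return 0;
	/* Remember what we saw so the next call can skip the parse. */
	c->st = st;
	c->valid = 1;
	return 1;
}
```

This still costs one lstat() per .gitignore, and a real implementation
would have to think about timestamp granularity: an edit that keeps the
size and lands within the same second would go unnoticed here.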

As far as a real implementation that we'd want to merge into git.git
is concerned, I have a few comments:
Running multiple daemons on-the-fly for monitoring filesystem changes
is not elegant at all.  Keeping track of the state of so many loose
daemons is a hard problem: how do we ensure any semblance of
reliability without that?  Systemd is a very big improvement over the
legacy of a hundred loose shell scripts that SysVInit demanded.  It
monitors and babysits daemons; it uses cgroups to even kill
misbehaving daemons.  I can inspect running daemons at any time, and
have a uniform way to start/ stop/ restart them.

Okay, now you're asking me to consider a system-wide daemon
independent of systemd.  It has to run with root privileges so it has
access to everyone's repositories, which means that people have to
trust it beyond doubt.  What does it do?  It has a generic API to
watch filesystem paths and report events over an IP socket.  Do you
think that this will only be useful to git?  Every other version
control system (and presumably many other pieces of software) will
want to use it.  One huge downside I see of making this part of
systemd is Ubuntu.  They've decided not to use systemd for some
unfathomable reason.

Really, the elephant in the room right now seems to be .gitignore.
Until that is fixed, there is really no use of writing this inotify
daemon, no?  Can someone enlighten me on how exactly .gitignore files
are processed?

> You may want to search the mail archive. This topic has come up a few
> times before, there may be other similar patches.

The thread you linked me to is a 2010 email, and now it's 2013.  We've
been silent about inotify for three years?

Thanks for your inputs, Duy.


* Re: inotify to minimize stat() calls
  2013-02-09 17:10               ` Ramkumar Ramachandra
@ 2013-02-09 18:56                 ` Ramkumar Ramachandra
  2013-02-10  5:24                 ` Duy Nguyen
  1 sibling, 0 replies; 88+ messages in thread
From: Ramkumar Ramachandra @ 2013-02-09 18:56 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Robert Zeh, Junio C Hamano, Git List, Finn Arne Gangstad

Ramkumar Ramachandra wrote:
> Okay, now you're asking me to consider a system-wide daemon
> independent of systemd.  It has to run with root privileges so it has
> access to everyone's repositories, which means that people have to
> trust it beyond doubt.  What does it do?  It has a generic API to
> watch filesystem paths and report events over an IP socket.  Do you
> think that this will only be useful to git?  Every other version
> control system (and presumably many other pieces of software) will
> want to use it.  One huge downside I see of making this part of
> systemd is Ubuntu.  They've decided not to use systemd for some
> unfathomable reason.

After some thought, I've decided that extending systemd is not the way
to go.  And the dbus API is really an overkill.  Writing a simple
system-wide daemon shouldn't be a challenge; the hard part is getting
git to use it properly.


* Re: inotify to minimize stat() calls
  2013-02-09 12:05         ` Ramkumar Ramachandra
  2013-02-09 12:11           ` Ramkumar Ramachandra
  2013-02-09 12:53           ` Ramkumar Ramachandra
@ 2013-02-09 19:35           ` Junio C Hamano
  2013-02-10 19:03             ` Robert Zeh
  2 siblings, 1 reply; 88+ messages in thread
From: Junio C Hamano @ 2013-02-09 19:35 UTC (permalink / raw)
  To: Ramkumar Ramachandra; +Cc: Robert Zeh, Git List, Duy Nguyen

Ramkumar Ramachandra <artagnon@gmail.com> writes:

> This is much better than Junio's suggestion to study possible
> implementations on all platforms and designing a generic daemon/
> communication channel.  That's no weekend project.

It appears that you misunderstood what I wrote.  That was not "here
is a design; I want it in my system.  Go implement it".

It was "If somebody wants to discuss it but does not know where to
begin, doing a small experiment like this and reporting how well it
worked here may be one way to do so.", nothing more.


* Re: inotify to minimize stat() calls
  2013-02-09 17:10               ` Ramkumar Ramachandra
  2013-02-09 18:56                 ` Ramkumar Ramachandra
@ 2013-02-10  5:24                 ` Duy Nguyen
  2013-02-10 11:17                   ` Duy Nguyen
  2013-02-19 13:16                   ` Drew Northup
  1 sibling, 2 replies; 88+ messages in thread
From: Duy Nguyen @ 2013-02-10  5:24 UTC (permalink / raw)
  To: Ramkumar Ramachandra; +Cc: Robert Zeh, Junio C Hamano, Git List, finnag

On Sun, Feb 10, 2013 at 12:10 AM, Ramkumar Ramachandra
<artagnon@gmail.com> wrote:
> Finn notes in the commit message that it offers no speedup, because
> .gitignore files in every directory still have to be read.  I think
> this is silly: we really should be caching .gitignore, and touching it
> only when lstat() reports that the file has changed.
>
> ...
>
> Really, the elephant in the room right now seems to be .gitignore.
> Until that is fixed, there is really no use of writing this inotify
> daemon, no?  Can someone enlighten me on how exactly .gitignore files
> are processed?

.gitignore is a different issue. I think it's mainly used with
read_directory/fill_directory to collect ignored files (or not-ignored
files). And it's not always used (well, status and add do, but diff
should not). I think we need to measure how much mass lstat
elimination gains us (especially on big repos) and how much
.gitignore/.gitattributes caching does. I don't think .gitignore has
such a big impact though. strace on git.git tells me "git status"
issues about 2500 lstat calls, and just 740 open+getdents calls (out
of 3800 syscalls in total). I will think about whether we can do
something about .gitignore/.gitattributes.
-- 
Duy


* Re: inotify to minimize stat() calls
  2013-02-10  5:24                 ` Duy Nguyen
@ 2013-02-10 11:17                   ` Duy Nguyen
  2013-02-10 11:22                     ` Duy Nguyen
                                       ` (3 more replies)
  2013-02-19 13:16                   ` Drew Northup
  1 sibling, 4 replies; 88+ messages in thread
From: Duy Nguyen @ 2013-02-10 11:17 UTC (permalink / raw)
  To: Ramkumar Ramachandra; +Cc: Robert Zeh, Junio C Hamano, Git List, finnag

On Sun, Feb 10, 2013 at 12:24:58PM +0700, Duy Nguyen wrote:
> On Sun, Feb 10, 2013 at 12:10 AM, Ramkumar Ramachandra
> <artagnon@gmail.com> wrote:
> > Finn notes in the commit message that it offers no speedup, because
> > .gitignore files in every directory still have to be read.  I think
> > this is silly: we really should be caching .gitignore, and touching it
> > only when lstat() reports that the file has changed.
> >
> > ...
> >
> > Really, the elephant in the room right now seems to be .gitignore.
> > Until that is fixed, there is really no use of writing this inotify
> > daemon, no?  Can someone enlighten me on how exactly .gitignore files
> > are processed?
>
> .gitignore is a different issue. I think it's mainly used with
> read_directory/fill_directory to collect ignored files (or not-ignored
> files). And it's not always used (well, status and add does, but diff
> should not). I think wee need to measure how much mass lstat
> elimination gains us (especially on big repos) and how much
> .gitignore/.gitattributes caching does.

OK let's count. I start with a "standard" repository, linux-2.6. This
is the number from strace -T on "git status" (*). The first column is
accumulated time, the second the number of syscalls.

top syscalls sorted     top syscalls sorted
by acc. time            by number
----------------------------------------------
0.401906 40950 lstat    0.401906 40950 lstat
0.190484 5343 getdents  0.150055 5374 open
0.150055 5374 open      0.190484 5343 getdents
0.074843 2806 close     0.074843 2806 close
0.003216 157 read       0.003216 157 read

The following patch pretends every entry is up to date without calling
lstat. With the patch applied, lstat disappears, which shows that the
refresh code is the cause of the mass lstat calls:

0.185347 5343 getdents  0.144173 5374 open
0.144173 5374 open      0.185347 5343 getdents
0.071844 2806 close     0.071844 2806 close
0.004918 135 brk        0.003378 157 read
0.003378 157 read       0.004918 135 brk

-- 8< --
diff --git a/read-cache.c b/read-cache.c
index 827ae55..94d8ed8 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1018,6 +1018,10 @@ static struct cache_entry *refresh_cache_ent(struct index_state *istate,
 	if (ce_uptodate(ce))
 		return ce;

+#if 1
+	ce_mark_uptodate(ce);
+	return ce;
+#endif
 	/*
 	 * CE_VALID or CE_SKIP_WORKTREE means the user promised us
 	 * that the change to the work tree does not matter and told
-- 8< --
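
For readers unfamiliar with the refresh step: in spirit, it compares stat
data recorded earlier with a fresh lstat() of each path, which is where the
per-file syscall comes from. The following is a minimal sketch in plain C,
not git's actual code. struct cached_stat, cache_stat() and
entry_is_uptodate() are hypothetical names, and a real index entry
compares more fields than these three.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical, simplified stand-in for the stat data kept per index entry. */
struct cached_stat {
	time_t mtime;
	off_t size;
	ino_t ino;
};

/* Record the current stat data for path; returns 0 on success, -1 on error. */
int cache_stat(const char *path, struct cached_stat *cs)
{
	struct stat st;

	assert(path != NULL && cs != NULL);
	if (lstat(path, &st) < 0)
		return -1;
	cs->mtime = st.st_mtime;
	cs->size = st.st_size;
	cs->ino = st.st_ino;
	return 0;
}

/*
 * Returns 1 if path still matches the cached data, 0 if it differs,
 * and -1 if it cannot be stat'ed (for example because it was deleted).
 * This is the one lstat() per entry that the patch above skips.
 */
int entry_is_uptodate(const char *path, const struct cached_stat *cs)
{
	struct stat st;

	assert(path != NULL && cs != NULL);
	if (lstat(path, &st) < 0)
		return -1;
	return st.st_mtime == cs->mtime &&
	       st.st_size == cs->size &&
	       st.st_ino == cs->ino;
}
```

Calling entry_is_uptodate() once per tracked path is what produces tens of
thousands of lstat calls on a worktree of this size.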

The following patch eliminates untracked search code. As we can see,
open+getdents also disappears with this patch:

0.462909 40950 lstat   0.462909 40950 lstat
0.003417 129 brk       0.003417 129 brk
0.000762 53 read       0.000762 53 read
0.000720 36 open       0.000720 36 open
0.000544 12 munmap     0.000454 33 close

So from the syscall point of view, we know which code issues most of
them. Let's see how much time we gain with these patches, which
approximates the gain from inotify support. This time I measure on
gentoo-x86.git [1] because it has a really big worktree (100k
files):

        unmodified  read-cache.c  dir.c     both
real    0m0.550s    0m0.479s      0m0.287s  0m0.213s
user    0m0.305s    0m0.315s      0m0.201s  0m0.182s
sys     0m0.240s    0m0.157s      0m0.084s  0m0.030s

and the syscall picture on gentoo-x86.git:

1.106615 101942 lstat    1.106615 101942 lstat
0.667235 47083 getdents  0.641604 47114 open
0.641604 47114 open      0.667235 47083 getdents
0.286711 23573 close     0.286711 23573 close
0.005842 350 brk         0.005842 350 brk

We can see that short-circuiting the untracked code gives a bigger gain
than skipping the index refresh. So I have to agree that .gitignore may
be the big elephant in this particular case.

Bear in mind though this is Linux, where lstat is fast. On systems
with slow lstat, these timings could look very different due to the
large number of lstat calls compared to open+getdents. I would really
like to see similar numbers on Windows.

read_directory/fill_directory code is mostly used by "git add" (not
with -u) and "git status", while refresh code is executed in add,
checkout, commit/status, diff, merge. So while the gain is smaller,
reducing lstat calls could benefit more cases.

A relatively slow "git add" is acceptable. "git status" should be
fast. Although in my workflow, I do "git diff [--stat] [--cached]"
much more often than "git status" so relatively slow "git status" does
not hurt me much. But people may do it differently.

On speeding up read_directory with inotify support: I haven't thought
it through, but I think we could save in .git (or get via a socket) a
list of untracked files, regardless of ignore status, with help from
inotify. When this list is verified valid, read_directory could be
modified to traverse the tree using this list (plus the index) instead
of opendir+readdir. Not sure how the change might look though.
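
To make the inotify side of this idea concrete, here is a rough,
Linux-only sketch of how a watcher could learn about newly created names
in one directory. It is an illustration under assumptions, not a proposed
design: watch_directory(), drain_events(), note_created() and struct
created_names are hypothetical names, the watch is not recursive, and
event-queue overflow (IN_Q_OVERFLOW) is not handled.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/inotify.h>
#include <sys/stat.h>	/* mkdir(), handy for trying this on a scratch directory */
#include <unistd.h>

/* Hypothetical accumulator: how many names were created, and the last one. */
struct created_names {
	int nr;
	char last[256];
};

typedef void (*event_fn)(const struct inotify_event *ev, void *data);

/* Watch one directory (not its subdirectories); returns the fd or -1. */
int watch_directory(const char *dir)
{
	int fd;

	assert(dir != NULL);
	fd = inotify_init1(IN_NONBLOCK);
	if (fd < 0)
		return -1;
	if (inotify_add_watch(fd, dir, IN_CREATE | IN_DELETE | IN_MODIFY |
			      IN_MOVED_FROM | IN_MOVED_TO) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Example callback: remember names reported by IN_CREATE events. */
void note_created(const struct inotify_event *ev, void *data)
{
	struct created_names *cn = data;

	if ((ev->mask & IN_CREATE) && ev->len > 0) {
		snprintf(cn->last, sizeof(cn->last), "%s", ev->name);
		cn->nr++;
	}
}

/*
 * Pass every pending event to fn without blocking.
 * Returns the number of events seen, or -1 on a read error.
 */
int drain_events(int fd, event_fn fn, void *data)
{
	char buf[4096]
		__attribute__((aligned(__alignof__(struct inotify_event))));
	int seen = 0;

	for (;;) {
		ssize_t len = read(fd, buf, sizeof(buf));
		char *p;

		if (len < 0) {
			if (errno == EINTR)
				continue;
			return errno == EAGAIN ? seen : -1;
		}
		if (len == 0)
			return seen;
		for (p = buf; p < buf + len; ) {
			const struct inotify_event *ev =
				(const struct inotify_event *)p;

			fn(ev, data);
			seen++;
			p += sizeof(*ev) + ev->len;
		}
	}
}
```

A daemon built along these lines would need one watch per directory in the
worktree and a way to tell git that its list is still valid, which is the
part that remains open above.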


[1] http://git-exp.overlays.gentoo.org/gitweb/?p=exp/gentoo-x86.git;a=summary

(*) the script to produce those numbers is

-- 8< --
#!/bin/sh

export LANG=C
strace -T "$@" 2>&1 >/dev/null |
	sed 's/\(^[^(]*\)(.*<\([0-9.]*\)>$/\1 \2/' |
	awk '{
	  sec[$1]+=$2;
	  count[$1]++;
	}
	END {
	  for (i in sec)
	    printf("%f %d %s\n", sec[i], count[i], i);
	  }' >/tmp/s

sort -nr /tmp/s | head -n5
sort -nrk2 /tmp/s | head -n5
-- 8< --

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: inotify to minimize stat() calls
  2013-02-10 11:17                   ` Duy Nguyen
@ 2013-02-10 11:22                     ` Duy Nguyen
  2013-02-10 20:16                       ` Junio C Hamano
  2013-02-10 13:26                     ` inotify to minimize stat() calls demerphq
                                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 88+ messages in thread
From: Duy Nguyen @ 2013-02-10 11:22 UTC (permalink / raw)
  To: Ramkumar Ramachandra; +Cc: Robert Zeh, Junio C Hamano, Git List, finnag

On Sun, Feb 10, 2013 at 06:17:32PM +0700, Duy Nguyen wrote:
> The following patch eliminates untracked search code. As we can see,
> open+getdents also disappears with this patch:
> 
> 0.462909 40950 lstat   0.462909 40950 lstat
> 0.003417 129 brk       0.003417 129 brk
> 0.000762 53 read       0.000762 53 read
> 0.000720 36 open       0.000720 36 open
> 0.000544 12 munmap     0.000454 33 close

.. and the patch is missing:

-- 8< --
diff --git a/dir.c b/dir.c
index 57394e4..1963c6f 100644
--- a/dir.c
+++ b/dir.c
@@ -1439,8 +1439,10 @@ int read_directory(struct dir_struct *dir, const char *path, int len, const char
 		return dir->nr;
 
 	simplify = create_simplify(pathspec);
+#if 0
 	if (!len || treat_leading_path(dir, path, len, simplify))
 		read_directory_recursive(dir, path, len, 0, simplify);
+#endif
 	free_simplify(simplify);
 	qsort(dir->entries, dir->nr, sizeof(struct dir_entry *), cmp_name);
 	qsort(dir->ignored, dir->ignored_nr, sizeof(struct dir_entry *), cmp_name);
-- 8< --

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: inotify to minimize stat() calls
  2013-02-10 11:17                   ` Duy Nguyen
  2013-02-10 11:22                     ` Duy Nguyen
@ 2013-02-10 13:26                     ` demerphq
  2013-02-10 15:35                       ` Duy Nguyen
  2013-02-14 14:36                       ` Magnus Bäck
  2013-02-10 16:45                     ` Ramkumar Ramachandra
  2013-02-10 16:58                     ` Erik Faye-Lund
  3 siblings, 2 replies; 88+ messages in thread
From: demerphq @ 2013-02-10 13:26 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Ramkumar Ramachandra, Robert Zeh, Junio C Hamano, Git List, finnag

On 10 February 2013 12:17, Duy Nguyen <pclouds@gmail.com> wrote:
> Bear in mind though this is Linux, where lstat is fast. On systems
> with slow lstat, these timings could look very different due to the
> large number of lstat calls compared to open+getdents. I really like
> to see similar numbers on Windows.

Is windows stat really so slow? I encountered this perception in
windows Perl in the past, and I know that on windows Perl stat
*appears* slow compared to *nix, because in order to satisfy the full
*nix stat interface, specifically the nlink field, it must open and
close the file*. As of 5.10 this can be disabled by setting a magic
var ${^WIN32_SLOPPY_STAT} to a true value, which makes a significant
improvement to the performance of the Perl level stat implementation.
I would not be surprised if the cygwin implementation of stat() has
the same issue as Perl did, and that stat appears much slower than it
actually need be if you don't care about the nlink field.

Yves
* http://perl5.git.perl.org/perl.git/blob/HEAD:/win32/win32.c#l1492

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: inotify to minimize stat() calls
  2013-02-10 13:26                     ` inotify to minimize stat() calls demerphq
@ 2013-02-10 15:35                       ` Duy Nguyen
  2013-02-14 14:36                       ` Magnus Bäck
  1 sibling, 0 replies; 88+ messages in thread
From: Duy Nguyen @ 2013-02-10 15:35 UTC (permalink / raw)
  To: demerphq
  Cc: Ramkumar Ramachandra, Robert Zeh, Junio C Hamano, Git List, finnag

On Sun, Feb 10, 2013 at 8:26 PM, demerphq <demerphq@gmail.com> wrote:
> On 10 February 2013 12:17, Duy Nguyen <pclouds@gmail.com> wrote:
>> Bear in mind though this is Linux, where lstat is fast. On systems
>> with slow lstat, these timings could look very different due to the
>> large number of lstat calls compared to open+getdents. I really like
>> to see similar numbers on Windows.
>
> Is windows stat really so slow?

I can't say. I haven't used Windows for months (and git on Windows for years).

> I encountered this perception in
> windows Perl in the past, and I know that on windows Perl stat
> *appears* slow compared to *nix, because in order to satisfy the full
> *nix stat interface, specifically the nlink field, it must open and
> close the file*. As of 5.10 this can be disabled by setting a magic
> var ${^WIN32_SLOPPY_STAT} to a true value, which makes a significant
> improvement to the performance of the Perl level stat implementation.
> I would not be surprised if the cygwin implementation of stat() has
> the same issue as Perl did, and that stat appears much slower than it
> actually need be if you don't care about the nlink field.

The native port of git uses get_file_attr (in
compat/mingw.c:do_lstat()) to simulate lstat and always sets nlink to
1. I assume this means git does not care about nlink field. I don't
know about cygwin though.

> Yves
> * http://perl5.git.perl.org/perl.git/blob/HEAD:/win32/win32.c#l1492
-- 
Duy

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: inotify to minimize stat() calls
  2013-02-10 11:17                   ` Duy Nguyen
  2013-02-10 11:22                     ` Duy Nguyen
  2013-02-10 13:26                     ` inotify to minimize stat() calls demerphq
@ 2013-02-10 16:45                     ` Ramkumar Ramachandra
  2013-02-11  3:03                       ` Duy Nguyen
  2013-02-10 16:58                     ` Erik Faye-Lund
  3 siblings, 1 reply; 88+ messages in thread
From: Ramkumar Ramachandra @ 2013-02-10 16:45 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Robert Zeh, Junio C Hamano, Git List, finnag

On Sun, Feb 10, 2013 at 4:47 PM, Duy Nguyen <pclouds@gmail.com> wrote:
> On Sun, Feb 10, 2013 at 12:24:58PM +0700, Duy Nguyen wrote:
>> On Sun, Feb 10, 2013 at 12:10 AM, Ramkumar Ramachandra
>> <artagnon@gmail.com> wrote:
>> > Finn notes in the commit message that it offers no speedup, because
>> > .gitignore files in every directory still have to be read.  I think
>> > this is silly: we really should be caching .gitignore, and touching it
>> > only when lstat() reports that the file has changed.
>> >
>> > ...
>> >
>> > Really, the elephant in the room right now seems to be .gitignore.
>> > Until that is fixed, there is really no use of writing this inotify
>> > daemon, no?  Can someone enlighten me on how exactly .gitignore files
>> > are processed?
>>
>> .gitignore is a different issue. I think it's mainly used with
>> read_directory/fill_directory to collect ignored files (or not-ignored
>> files). And it's not always used (well, status and add does, but diff
>> should not). I think wee need to measure how much mass lstat
>> elimination gains us (especially on big repos) and how much
>> .gitignore/.gitattributes caching does.
>
> OK let's count. I start with a "standard" repository, linux-2.6. This
> is the number from strace -T on "git status" (*). The first column is
> accumulated time, the second the number of syscalls.
>
> top syscalls sorted     top syscalls sorted
> by acc. time            by number
> ----------------------------------------------
> 0.401906 40950 lstat    0.401906 40950 lstat
> 0.190484 5343 getdents  0.150055 5374 open
> 0.150055 5374 open      0.190484 5343 getdents
> 0.074843 2806 close     0.074843 2806 close
> 0.003216 157 read       0.003216 157 read
>
> The following patch pretends every entry is uptodate without
> lstat. With the patch, we can see refresh code is the cause of mass
> lstat, as lstat disappears:
>
> 0.185347 5343 getdents  0.144173 5374 open
> 0.144173 5374 open      0.185347 5343 getdents
> 0.071844 2806 close     0.071844 2806 close
> 0.004918 135 brk        0.003378 157 read
> 0.003378 157 read       0.004918 135 brk

Okay, we're saving 40k lstat() calls.

> -- 8< --
> diff --git a/read-cache.c b/read-cache.c
> index 827ae55..94d8ed8 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -1018,6 +1018,10 @@ static struct cache_entry *refresh_cache_ent(struct index_state *istate,
>         if (ce_uptodate(ce))
>                 return ce;
>
> +#if 1
> +       ce_mark_uptodate(ce);
> +       return ce;
> +#endif
>         /*
>          * CE_VALID or CE_SKIP_WORKTREE means the user promised us
>          * that the change to the work tree does not matter and told
> -- 8< --

So you're skipping the rest of refresh_cache_ent(), which contains our
lstat() and returning immediately.  Instead of marking paths with the
"assume unchanged" bit, as core.ignoreStat does, you're directly
attacking the function that refreshes the index and bypassing the
lstat() call.  How are they different?  read-cache.c:1030 checks
ce->flags & CE_VALID (which is set in read-cache.c:88 if
assume_unchanged) and bypasses the lstat() call anyway.  So why didn't
you just set core.ignoreStat for your test?

> The following patch eliminates untracked search code. As we can see,
> open+getdents also disappears with this patch:
>
> 0.462909 40950 lstat   0.462909 40950 lstat
> 0.003417 129 brk       0.003417 129 brk
> 0.000762 53 read       0.000762 53 read
> 0.000720 36 open       0.000720 36 open
> 0.000544 12 munmap     0.000454 33 close

Okay, 5k open and 5k getdents calls are gone, but what does this mean?

Pulling the patch from your next email to figure this out:
> -- 8< --
> diff --git a/dir.c b/dir.c
> index 57394e4..1963c6f 100644
> --- a/dir.c
> +++ b/dir.c
> @@ -1439,8 +1439,10 @@ int read_directory(struct dir_struct *dir, const char *path, int len, const char
>                 return dir->nr;
>
>         simplify = create_simplify(pathspec);
> +#if 0
>         if (!len || treat_leading_path(dir, path, len, simplify))
>                 read_directory_recursive(dir, path, len, 0, simplify);
> +#endif
>         free_simplify(simplify);
>         qsort(dir->entries, dir->nr, sizeof(struct dir_entry *), cmp_name);
>         qsort(dir->ignored, dir->ignored_nr, sizeof(struct dir_entry *), cmp_name);
> -- 8< --

Ah, read_directory(), from the .gitignore/ exclude angle.  Yes,
read_directory() seems to be the main culprit there, from my reading
of Documentation/technical/api-directory-listing.txt.

So, what did you do?  You short-circuited the function into never
executing read_directory_recursive(), so the opendir() and readdir()
are gone.  I'm confused about what this means: will new directories
fail to appear as "untracked" now?  Either way, I understand that
you've factored out the .gitignore/ excludes.  Let's look at the
timings now.

> So from syscalls point of view, we know what code issues most of
> them. Let's see how much time we gain be these patches, which is an
> approximate of the gain by inotify support. This time I measure on
> gentoo-x86.git [1] because this one has really big worktree (100k
> files)
>
>         unmodified  read-cache.c  dir.c     both
> real    0m0.550s    0m0.479s      0m0.287s  0m0.213s
> user    0m0.305s    0m0.315s      0m0.201s  0m0.182s
> sys     0m0.240s    0m0.157s      0m0.084s  0m0.030s

So, the .gitignore/ exclude does seem to be the elephant in the room,
after all!  There are only minor gains from not updating the index.

> and the syscall picture on gentoo-x86.git:
>
> 1.106615 101942 lstat    1.106615 101942 lstat
> 0.667235 47083 getdents  0.641604 47114 open
> 0.641604 47114 open      0.667235 47083 getdents
> 0.286711 23573 close     0.286711 23573 close
> 0.005842 350 brk         0.005842 350 brk

The lstat and getdents/open counts are higher than in linux-2.6.git
here, but the profit in eliminating lstat is still very small.

> We can see that shortcuting untracked code gives bigger gain than
> index refresh code. So I have to agree that .gitignore may be the big
> elephant in this particular case.

Yes :)

> Bear in mind though this is Linux, where lstat is fast. On systems
> with slow lstat, these timings could look very different due to the
> large number of lstat calls compared to open+getdents. I really like
> to see similar numbers on Windows.

I see.

> read_directory/fill_directory code is mostly used by "git add" (not
> with -u) and "git status", while refresh code is executed in add,
> checkout, commit/status, diff, merge. So while smaller gain, reducing
> lstat calls could benefit in more cases.

Good point, although my major complaint with big repositories is
status and diff.  Checkout isn't too bad if the branches don't diverge
much, add is fast enough, but diff is a big problem.

> A relatively slow "git add" is acceptable. "git status" should be
> fast. Although in my workflow, I do "git diff [--stat] [--cached]"
> much more often than "git status" so relatively slow "git status" does
> not hurt me much. But people may do it differently.

Hm.

> On speeding up read_directory with inotify support. I haven't thought
> it through, but I think we could save (or get it via socket) a list of
> untracked files in .git, regardless ignore status, with the help from
> inotify. When this list is verified valid, read_directory could be
> modified to traverse the tree using this list (plus the index) instead
> of opendir+readdir. Not sure how the change might look though.

We could even tell if .gitignore has changed with inotify support, and
tell exactly when we need to update our path treatment.  And yes, we
can have inotify to tell us about new files and directories directly,
so we can traverse them.  I think we should go ahead with the
system-wide inotify daemon for now: its design needs to be discussed,
so that it's generic enough for all our usecases.  I'm thinking there
should be at least two distinct calls:
1. Report all changed paths, for use with read-cache.c.
2. Report only new paths, for use with dir.c.
Or can we figure out which of the changed paths are new ourselves?
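
On that last question, one possible answer is that the caller can work it
out itself: if the daemon only reports changed paths, any reported path
that is not in the sorted list of tracked names is new. A small sketch of
that idea follows. cmp_path(), is_new_path() and filter_new_paths() are
hypothetical names, and the tracked array stands in for the index, assumed
here to be sorted with strcmp() ordering.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort()/bsearch() comparator for an array of "const char *" names. */
int cmp_path(const void *a, const void *b)
{
	return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Returns 1 if path is absent from the sorted array tracked[0..nr). */
int is_new_path(const char **tracked, size_t nr, const char *path)
{
	assert(tracked != NULL && path != NULL);
	return bsearch(&path, tracked, nr, sizeof(*tracked), cmp_path) == NULL;
}

/*
 * Copy the entries of changed[] that are not tracked into out[], which
 * must have room for nr_changed pointers. Returns the number copied.
 */
size_t filter_new_paths(const char **tracked, size_t nr_tracked,
			const char **changed, size_t nr_changed,
			const char **out)
{
	size_t i, n = 0;

	for (i = 0; i < nr_changed; i++)
		if (is_new_path(tracked, nr_tracked, changed[i]))
			out[n++] = changed[i];
	return n;
}
```

With something like this, a single "report all changed paths" call might be
enough: read-cache.c would use the whole list and dir.c only the filtered
part.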

Kudos for your great work on getting these numbers, Duy!

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: inotify to minimize stat() calls
  2013-02-10 11:17                   ` Duy Nguyen
                                       ` (2 preceding siblings ...)
  2013-02-10 16:45                     ` Ramkumar Ramachandra
@ 2013-02-10 16:58                     ` Erik Faye-Lund
  2013-02-11  3:53                       ` Duy Nguyen
  3 siblings, 1 reply; 88+ messages in thread
From: Erik Faye-Lund @ 2013-02-10 16:58 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Ramkumar Ramachandra, Robert Zeh, Junio C Hamano, Git List, finnag

On Sun, Feb 10, 2013 at 12:17 PM, Duy Nguyen <pclouds@gmail.com> wrote:
> On Sun, Feb 10, 2013 at 12:24:58PM +0700, Duy Nguyen wrote:
>> On Sun, Feb 10, 2013 at 12:10 AM, Ramkumar Ramachandra
>> <artagnon@gmail.com> wrote:
>> > Finn notes in the commit message that it offers no speedup, because
>> > .gitignore files in every directory still have to be read.  I think
>> > this is silly: we really should be caching .gitignore, and touching it
>> > only when lstat() reports that the file has changed.
>> >
>> > ...
>> >
>> > Really, the elephant in the room right now seems to be .gitignore.
>> > Until that is fixed, there is really no use of writing this inotify
>> > daemon, no?  Can someone enlighten me on how exactly .gitignore files
>> > are processed?
>>
>> .gitignore is a different issue. I think it's mainly used with
>> read_directory/fill_directory to collect ignored files (or not-ignored
>> files). And it's not always used (well, status and add does, but diff
>> should not). I think wee need to measure how much mass lstat
>> elimination gains us (especially on big repos) and how much
>> .gitignore/.gitattributes caching does.
>
> OK let's count. I start with a "standard" repository, linux-2.6. This
> is the number from strace -T on "git status" (*). The first column is
> accumulated time, the second the number of syscalls.
>
> top syscalls sorted     top syscalls sorted
> by acc. time            by number
> ----------------------------------------------
> 0.401906 40950 lstat    0.401906 40950 lstat
> 0.190484 5343 getdents  0.150055 5374 open
> 0.150055 5374 open      0.190484 5343 getdents
> 0.074843 2806 close     0.074843 2806 close
> 0.003216 157 read       0.003216 157 read
>
> The following patch pretends every entry is uptodate without
> lstat. With the patch, we can see refresh code is the cause of mass
> lstat, as lstat disappears:
>
> 0.185347 5343 getdents  0.144173 5374 open
> 0.144173 5374 open      0.185347 5343 getdents
> 0.071844 2806 close     0.071844 2806 close
> 0.004918 135 brk        0.003378 157 read
> 0.003378 157 read       0.004918 135 brk
>
> -- 8< --
> diff --git a/read-cache.c b/read-cache.c
> index 827ae55..94d8ed8 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -1018,6 +1018,10 @@ static struct cache_entry *refresh_cache_ent(struct index_state *istate,
>         if (ce_uptodate(ce))
>                 return ce;
>
> +#if 1
> +       ce_mark_uptodate(ce);
> +       return ce;
> +#endif
>         /*
>          * CE_VALID or CE_SKIP_WORKTREE means the user promised us
>          * that the change to the work tree does not matter and told
> -- 8< --
>
> The following patch eliminates untracked search code. As we can see,
> open+getdents also disappears with this patch:
>
> 0.462909 40950 lstat   0.462909 40950 lstat
> 0.003417 129 brk       0.003417 129 brk
> 0.000762 53 read       0.000762 53 read
> 0.000720 36 open       0.000720 36 open
> 0.000544 12 munmap     0.000454 33 close
>
> So from syscalls point of view, we know what code issues most of
> them. Let's see how much time we gain be these patches, which is an
> approximate of the gain by inotify support. This time I measure on
> gentoo-x86.git [1] because this one has really big worktree (100k
> files)
>
>         unmodified  read-cache.c  dir.c     both
> real    0m0.550s    0m0.479s      0m0.287s  0m0.213s
> user    0m0.305s    0m0.315s      0m0.201s  0m0.182s
> sys     0m0.240s    0m0.157s      0m0.084s  0m0.030s
>
> and the syscall picture on gentoo-x86.git:
>
> 1.106615 101942 lstat    1.106615 101942 lstat
> 0.667235 47083 getdents  0.641604 47114 open
> 0.641604 47114 open      0.667235 47083 getdents
> 0.286711 23573 close     0.286711 23573 close
> 0.005842 350 brk         0.005842 350 brk
>
> We can see that shortcuting untracked code gives bigger gain than
> index refresh code. So I have to agree that .gitignore may be the big
> elephant in this particular case.
>
> Bear in mind though this is Linux, where lstat is fast. On systems
> with slow lstat, these timings could look very different due to the
> large number of lstat calls compared to open+getdents. I really like
> to see similar numbers on Windows.

Karsten Blees has done something similar-ish on Windows, and he posted
the results here:

https://groups.google.com/forum/#!topic/msysgit/fL_jykUmUNE/discussion

I also seem to remember him doing a ReadDirectoryChangesW version, but
I don't remember what happened with that.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: inotify to minimize stat() calls
  2013-02-09 19:35           ` Junio C Hamano
@ 2013-02-10 19:03             ` Robert Zeh
  2013-02-10 19:26               ` Martin Fick
                                 ` (2 more replies)
  0 siblings, 3 replies; 88+ messages in thread
From: Robert Zeh @ 2013-02-10 19:03 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Ramkumar Ramachandra, Git List, Duy Nguyen

On Sat, Feb 9, 2013 at 1:35 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Ramkumar Ramachandra <artagnon@gmail.com> writes:
>
>> This is much better than Junio's suggestion to study possible
>> implementations on all platforms and designing a generic daemon/
>> communication channel.  That's no weekend project.
>
> It appears that you misunderstood what I wrote.  That was not "here
> is a design; I want it in my system.  Go implemment it".
>
> It was "If somebody wants to discuss it but does not know where to
> begin, doing a small experiment like this and reporting how well it
> worked here may be one way to do so.", nothing more.

What if instead of communicating over a socket, the daemon
dumped a file containing all of the lstat information after git
wrote a file? By definition the daemon should know about file writes.

There would be no network communication, which I think would make
things more secure. It would simplify the rendezvous by insisting on
well known locations in $GIT_DIR.

Robert Zeh

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: inotify to minimize stat() calls
  2013-02-10 19:03             ` Robert Zeh
@ 2013-02-10 19:26               ` Martin Fick
  2013-02-10 20:18                 ` Robert Zeh
  2013-02-11  3:21               ` Duy Nguyen
  2013-04-24 17:20               ` [PATCH] " Robert Zeh
  2 siblings, 1 reply; 88+ messages in thread
From: Martin Fick @ 2013-02-10 19:26 UTC (permalink / raw)
  To: Robert Zeh; +Cc: Junio C Hamano, Ramkumar Ramachandra, Git List, Duy Nguyen

On Sunday, February 10, 2013 12:03:00 pm Robert Zeh wrote:
> On Sat, Feb 9, 2013 at 1:35 PM, Junio C Hamano 
<gitster@pobox.com> wrote:
> > Ramkumar Ramachandra <artagnon@gmail.com> writes:
> >> This is much better than Junio's suggestion to study
> >> possible implementations on all platforms and
> >> designing a generic daemon/ communication channel. 
> >> That's no weekend project.
> > 
> > It appears that you misunderstood what I wrote.  That
> > was not "here is a design; I want it in my system.  Go
> > implemment it".
> > 
> > It was "If somebody wants to discuss it but does not
> > know where to begin, doing a small experiment like
> > this and reporting how well it worked here may be one
> > way to do so.", nothing more.
> 
> What if instead of communicating over a socket, the
> daemon dumped a file containing all of the lstat
> information after git wrote a file? By definition the
> daemon should know about file writes.

But git doesn't, how will it know when the file is written?
Will it use inotify, or poll (kind of defeats the point)?

-Martin

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: inotify to minimize stat() calls
  2013-02-10 11:22                     ` Duy Nguyen
@ 2013-02-10 20:16                       ` Junio C Hamano
  2013-02-11  2:56                         ` Duy Nguyen
  0 siblings, 1 reply; 88+ messages in thread
From: Junio C Hamano @ 2013-02-10 20:16 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Ramkumar Ramachandra, Robert Zeh, Git List, finnag

Duy Nguyen <pclouds@gmail.com> writes:

> On Sun, Feb 10, 2013 at 06:17:32PM +0700, Duy Nguyen wrote:
>> The following patch eliminates untracked search code. As we can see,
>> open+getdents also disappears with this patch:
>> 
>> 0.462909 40950 lstat   0.462909 40950 lstat
>> 0.003417 129 brk       0.003417 129 brk
>> 0.000762 53 read       0.000762 53 read
>> 0.000720 36 open       0.000720 36 open
>> 0.000544 12 munmap     0.000454 33 close
>
> .. and the patch is missing:
>
> -- 8< --
> diff --git a/dir.c b/dir.c
> index 57394e4..1963c6f 100644
> --- a/dir.c
> +++ b/dir.c
> @@ -1439,8 +1439,10 @@ int read_directory(struct dir_struct *dir, const char *path, int len, const char
>  		return dir->nr;
>  
>  	simplify = create_simplify(pathspec);
> +#if 0
>  	if (!len || treat_leading_path(dir, path, len, simplify))
>  		read_directory_recursive(dir, path, len, 0, simplify);
> +#endif

The other "lstat()" experiment was a very interesting one, but this
is not yet an interesting experiment to see where in the "ignore"
codepath we are spending time.

We know that we can tell wt_status_collect_untracked() not to bother
with the untracked or ignored files with !s->show_untracked_files
already, but I think the more interesting question is if we can show
the untracked files with less overhead.

If we want to show untracked files, it is a given that we need to
read directories to see what paths there are on the filesystem. Is
the opendir/readdir cost dominating in the process? Are we spending
a lot of time sifting the result of opendir/readdir via the ignore
mechanism? Is reading the "ignore" files costing us much to prime
the ignore mechanism?

If readdir cost is dominant, then that makes "cache gitignore" a
nonsense proposition, I think.  If you really want to "cache"
something, you need to have somebody (i.e. a daemon) who constantly
keeps an eye on the filesystem changes and can respond with the up
to date result directly to fill_directory().  I somehow doubt that
it is a direction we would want to go in, though.
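
One way to isolate the opendir/readdir part of that question is to time a
bare directory walk that does no ignore processing at all. The sketch below
is such a walk and is only an assumption about how one might measure it;
it is not how dir.c works. count_worktree_entries() is a hypothetical
name, and it uses lstat() to detect directories for portability, which adds
a syscall per entry that a d_type-based walk could avoid.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Count the entries below dir, recursing into subdirectories and
 * skipping ".git". Returns -1 if dir itself cannot be opened.
 */
long count_worktree_entries(const char *dir)
{
	DIR *d;
	struct dirent *de;
	long total = 0;

	assert(dir != NULL);
	d = opendir(dir);
	if (!d)
		return -1;
	while ((de = readdir(d)) != NULL) {
		char path[4096];
		struct stat st;

		if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, "..") ||
		    !strcmp(de->d_name, ".git"))
			continue;
		if (snprintf(path, sizeof(path), "%s/%s", dir, de->d_name) >=
		    (int)sizeof(path))
			continue;	/* path too long for this sketch */
		if (lstat(path, &st) < 0)
			continue;	/* entry vanished while walking */
		total++;
		if (S_ISDIR(st.st_mode)) {
			long sub = count_worktree_entries(path);

			if (sub > 0)
				total += sub;
		}
	}
	closedir(d);
	return total;
}
```

Comparing the run time of such a walk with the time "git status" spends in
the untracked code would give a rough upper bound on what caching the
ignore rules alone could save.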

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: inotify to minimize stat() calls
  2013-02-10 19:26               ` Martin Fick
@ 2013-02-10 20:18                 ` Robert Zeh
  0 siblings, 0 replies; 88+ messages in thread
From: Robert Zeh @ 2013-02-10 20:18 UTC (permalink / raw)
  To: Martin Fick; +Cc: Junio C Hamano, Ramkumar Ramachandra, Git List, Duy Nguyen



On Feb 10, 2013, at 1:26 PM, Martin Fick <mfick@codeaurora.org> wrote:

> On Sunday, February 10, 2013 12:03:00 pm Robert Zeh wrote:
>> On Sat, Feb 9, 2013 at 1:35 PM, Junio C Hamano
> <gitster@pobox.com> wrote:
>>> Ramkumar Ramachandra <artagnon@gmail.com> writes:
>>>> This is much better than Junio's suggestion to study
>>>> possible implementations on all platforms and
>>>> designing a generic daemon/ communication channel. 
>>>> That's no weekend project.
>>> 
>>> It appears that you misunderstood what I wrote.  That
>>> was not "here is a design; I want it in my system.  Go
>>> implemment it".
>>> 
>>> It was "If somebody wants to discuss it but does not
>>> know where to begin, doing a small experiment like
>>> this and reporting how well it worked here may be one
>>> way to do so.", nothing more.
>> 
>> What if instead of communicating over a socket, the
>> daemon dumped a file containing all of the lstat
>> information after git wrote a file? By definition the
>> daemon should know about file writes.
> 
> But git doesn't, how will it know when the file is written?
> Will it use inotify, or poll (kind of defeats the point)?
> 
> -Martin

I was thinking it would loop on calls to stat for the file with a timeout; this is no different than what we would want to do over a socket in that we would need timeouts for network reads.  But we would only be calling stat on one file, instead of the entire repo. 

I think we can set things up so the file read is atomic, which means we can ignore the case of a daemon crashing midway through a conversation. 
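
A sketch of how the two halves could look, under assumptions of my own: the
daemon writes to a temporary name and rename()s it into place, so a reader
never sees a half-written file, and git polls for the file with stat() and
a timeout. write_file_atomically() and wait_for_file() are hypothetical
names, the ".tmp" suffix and the 10 ms poll interval are arbitrary, and the
sketch assumes the reader removes the file once it has consumed it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>

/* Daemon side: write data to "<path>.tmp", then rename() it over path. */
int write_file_atomically(const char *path, const char *data)
{
	char tmp[4096];
	size_t len;
	FILE *f;

	assert(path != NULL && data != NULL);
	if (snprintf(tmp, sizeof(tmp), "%s.tmp", path) >= (int)sizeof(tmp))
		return -1;
	f = fopen(tmp, "w");
	if (!f)
		return -1;
	len = strlen(data);
	if (fwrite(data, 1, len, f) != len) {
		fclose(f);
		remove(tmp);
		return -1;
	}
	if (fclose(f) != 0 || rename(tmp, path) < 0) {
		remove(tmp);
		return -1;
	}
	return 0;
}

/*
 * Git side: poll with stat() until path exists or about timeout_ms
 * milliseconds have passed. Returns 0 if the file showed up, -1 if not.
 */
int wait_for_file(const char *path, int timeout_ms)
{
	const struct timespec pause = { 0, 10 * 1000 * 1000 };	/* 10 ms */
	int waited_ms = 0;
	struct stat st;

	assert(path != NULL);
	for (;;) {
		if (stat(path, &st) == 0)
			return 0;
		if (waited_ms >= timeout_ms)
			return -1;
		nanosleep(&pause, NULL);
		waited_ms += 10;
	}
}
```

The rename() only gives atomicity when the temporary file and the final
file are on the same filesystem, which holds if both live in $GIT_DIR.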

Robert

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: inotify to minimize stat() calls
  2013-02-10 20:16                       ` Junio C Hamano
@ 2013-02-11  2:56                         ` Duy Nguyen
  2013-02-11 11:12                           ` Duy Nguyen
  2013-03-07 22:16                           ` Torsten Bögershausen
  0 siblings, 2 replies; 88+ messages in thread
From: Duy Nguyen @ 2013-02-11  2:56 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Ramkumar Ramachandra, Robert Zeh, Git List, finnag

On Mon, Feb 11, 2013 at 3:16 AM, Junio C Hamano <gitster@pobox.com> wrote:
> The other "lstat()" experiment was a very interesting one, but this
> is not yet an interesting experiment to see where in the "ignore"
> codepath we are spending times.
>
> We know that we can tell wt_status_collect_untracked() not to bother
> with the untracked or ignored files with !s->show_untracked_files
> already, but I think the more interesting question is if we can show
> the untracked files with less overhead.
>
> If we want to show untracked files, it is a given that we need to
> read directories to see what paths there are on the filesystem. Is
> the opendir/readdir cost dominating in the process? Are we spending
> a lot of time sifting the result of opendir/readdir via the ignore
> mechanism? Is reading the "ignore" files costing us much to prime
> the ignore mechanism?
>
> If readdir cost is dominant, then that makes "cache gitignore" a
> nonsense proposition, I think.  If you really want to "cache"
> something, you need to have somebody (i.e. a daemon) who constantly
> keeps an eye on the filesystem changes and can respond with the up
> to date result directly to fill_directory().  I somehow doubt that
> it is a direction we would want to go in, though.

Yeah, it did not just cut out syscall cost; I also cut a lot of
user-space processing (plus .gitignore content access). From the
timings I posted earlier,

>         unmodified  dir.c
> real    0m0.550s    0m0.287s
> user    0m0.305s    0m0.201s
> sys     0m0.240s    0m0.084s

sys time is reduced from 0.24s to 0.08s, so readdir+opendir definitely
has something to do with it (and perhaps reading .gitignore). But it
also reduces user time from 0.305 to 0.201s. I don't think avoiding
readdir+opendir will bring us this gain. It's probably the cost of
matching .gitignore. I'll try to replace opendir+readdir with a
no-syscall version. At this point "untracked caching" sounds more
feasible (and less complex) than ".gitignore caching".
-- 
Duy


* Re: inotify to minimize stat() calls
  2013-02-10 16:45                     ` Ramkumar Ramachandra
@ 2013-02-11  3:03                       ` Duy Nguyen
  0 siblings, 0 replies; 88+ messages in thread
From: Duy Nguyen @ 2013-02-11  3:03 UTC (permalink / raw)
  To: Ramkumar Ramachandra; +Cc: Robert Zeh, Junio C Hamano, Git List, finnag

On Sun, Feb 10, 2013 at 11:45 PM, Ramkumar Ramachandra
<artagnon@gmail.com> wrote:
> So you're skipping the rest of refresh_cache_ent(), which contains our
> lstat() and returning immediately.  Instead of marking paths with the
> "assume unchanged" bit, as core.ignoreStat does, you're directly
> attacking the function that refreshes the index and bypassing the
> lstat() call.  How are they different?  read-cache.c:1030 checks
> ce->flags & CE_VALID (which is set in read-cache.c:88 if
> assume_unchanged) and bypasses the lstat() call anyway.  So why didn't
> you just set core.ignoreStat for your test?

It just did not occur to me that core.ignoreStat does the same.

> Ah, read_directory(), from the .gitignore/ exclude angle.  Yes,
> read_directory() seems to be the main culprit there, from my reading
> of Documentation/technical/api-directory-listing.txt.
>
> So, what did you do?  You short-circuited the function into never
> executing read_directory_recursive(), so the opendir() and readdir()
> are gone.  I'm confused about what this means: will new directories
> fail to appear as "untracked" now?

No, read_directory returns the list of untracked/ignored files.
Returning empty lists means no untracked nor ignored files.
-- 
Duy


* Re: inotify to minimize stat() calls
  2013-02-10 19:03             ` Robert Zeh
  2013-02-10 19:26               ` Martin Fick
@ 2013-02-11  3:21               ` Duy Nguyen
  2013-02-11 14:13                 ` Robert Zeh
  2013-04-24 17:20               ` [PATCH] " Robert Zeh
  2 siblings, 1 reply; 88+ messages in thread
From: Duy Nguyen @ 2013-02-11  3:21 UTC (permalink / raw)
  To: Robert Zeh; +Cc: Junio C Hamano, Ramkumar Ramachandra, Git List

On Mon, Feb 11, 2013 at 2:03 AM, Robert Zeh <robert.allan.zeh@gmail.com> wrote:
> On Sat, Feb 9, 2013 at 1:35 PM, Junio C Hamano <gitster@pobox.com> wrote:
>> Ramkumar Ramachandra <artagnon@gmail.com> writes:
>>
>>> This is much better than Junio's suggestion to study possible
>>> implementations on all platforms and designing a generic daemon/
>>> communication channel.  That's no weekend project.
>>
>> It appears that you misunderstood what I wrote.  That was not "here
>> is a design; I want it in my system.  Go implement it".
>>
>> It was "If somebody wants to discuss it but does not know where to
>> begin, doing a small experiment like this and reporting how well it
>> worked here may be one way to do so.", nothing more.
>
> What if instead of communicating over a socket, the daemon
> dumped a file containing all of the lstat information after git
> wrote a file? By definition the daemon should know about file writes.
>
> There would be no network communication, which I think would make
> things more secure. It would simplify the rendezvous by insisting on
> well known locations in $GIT_DIR.

We need some sort of interactive communication to the daemon anyway,
to validate that the information is up to date. Assume that a user makes
some changes to his worktree before starting the daemon: git needs to
know that what the daemon provides does not represent a complete
file-change picture, and it had better refresh the index the old way
once, then trust the daemon.

I think we could solve that by storing a "session id", provided by the
daemon, in .git/index. If the session id is not present (or does not
match what the current daemon gives), refresh the old way. After
refreshing, it may ask the daemon for a new session id and store it.
Next time if the session id is still valid, trust the daemon's data.
This session id should be different every time the daemon restarts for
this to work.
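
The session id check could look roughly like the sketch below. All
names here (struct index_session, choose_refresh(), record_session(),
SESSION_ID_LEN) are made up for illustration; nothing like this exists
in git, and how the id would actually be stored in .git/index is left
open.

```c
#include <assert.h>
#include <string.h>

#define SESSION_ID_LEN 32	/* hypothetical fixed size, including NUL */

/* Hypothetical: what git would remember alongside the index. */
struct index_session {
	char id[SESSION_ID_LEN];	/* empty string: no session recorded */
};

enum refresh_mode { REFRESH_FULL, REFRESH_TRUST_DAEMON };

/* Decide whether the daemon's data may be trusted for this run. */
static enum refresh_mode choose_refresh(const struct index_session *stored,
					const char *daemon_id)
{
	if (!daemon_id || !*daemon_id)
		return REFRESH_FULL;	/* no daemon answered */
	if (!stored->id[0])
		return REFRESH_FULL;	/* never synced with any daemon */
	if (strcmp(stored->id, daemon_id))
		return REFRESH_FULL;	/* daemon restarted since last sync */
	return REFRESH_TRUST_DAEMON;
}

/* Called after a full refresh, so the next run can trust the daemon. */
static void record_session(struct index_session *stored,
			   const char *daemon_id)
{
	assert(strlen(daemon_id) < SESSION_ID_LEN);
	strcpy(stored->id, daemon_id);
}
```

Every mismatch degrades to the old full refresh, so a stale or missing
id can only cost time, never correctness.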
-- 
Duy


* Re: inotify to minimize stat() calls
  2013-02-10 16:58                     ` Erik Faye-Lund
@ 2013-02-11  3:53                       ` Duy Nguyen
  2013-02-12 20:48                         ` Karsten Blees
  0 siblings, 1 reply; 88+ messages in thread
From: Duy Nguyen @ 2013-02-11  3:53 UTC (permalink / raw)
  To: kusmabite
  Cc: Ramkumar Ramachandra, Robert Zeh, Junio C Hamano, Git List,
	finnag, Karsten Blees

On Sun, Feb 10, 2013 at 11:58 PM, Erik Faye-Lund <kusmabite@gmail.com> wrote:
> Karsten Blees has done something similar-ish on Windows, and he posted
> the results here:
>
> https://groups.google.com/forum/#!topic/msysgit/fL_jykUmUNE/discussion
>
> I also seem to remember him doing a ReadDirectoryChangesW version, but
> I don't remember what happened with that.

Thanks. I came across that but did not remember. For one thing, we
know the inotify alternative for Windows: ReadDirectoryChangesW.

But the meat of the patch is not about that function. In fact it's
dropped in fscache-v3 [1]. It seems that doing
FindFirstFile/FindNextFile for an entire directory, caching the results
and using them to simulate lstat() is faster on Windows. Sounds similar
to preload-index. And because the directory listing is cached anyway,
opendir/readdir is replaced to read from the cache instead of opening
the directory again.

So it is orthogonal to using ReadDirectoryChangesW/inotify to
further reduce the system calls.

I copy "git status"'s (impressive) numbers from fscache-v0 here for
those who are interested:

preload | -u  | normal | cached | gain
--------+-----+--------+--------+------
false   | all | 25.144 | 3.055  |  8.2
false   | no  | 22.822 | 1.748  | 12.8
true    | all |  9.234 | 2.179  |  4.2
true    | no  |  6.833 | 0.955  |  7.2

[1] https://github.com/kblees/git/commit/35f319609aa046d2350db32d3afa1fa44920e880
-- 
Duy


* Re: inotify to minimize stat() calls
  2013-02-11  2:56                         ` Duy Nguyen
@ 2013-02-11 11:12                           ` Duy Nguyen
  2013-03-07 22:16                           ` Torsten Bögershausen
  1 sibling, 0 replies; 88+ messages in thread
From: Duy Nguyen @ 2013-02-11 11:12 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Ramkumar Ramachandra, Robert Zeh, Git List, finnag

On Mon, Feb 11, 2013 at 9:56 AM, Duy Nguyen <pclouds@gmail.com> wrote:
> Yeah, it did not just cut out syscall cost; I also cut a lot of
> user-space processing (plus .gitignore content access). From the
> timings I posted earlier,
>
>>         unmodified  dir.c
>> real    0m0.550s    0m0.287s
>> user    0m0.305s    0m0.201s
>> sys     0m0.240s    0m0.084s
>
> sys time is reduced from 0.24s to 0.08s, so readdir+opendir definitely
> has something to do with it (and perhaps reading .gitignore). But it
> also reduces user time from 0.305 to 0.201s. I don't think avoiding
> readdir+opendir will bring us this gain. It's probably the cost of
> matching .gitignore. I'll try to replace opendir+readdir with a
> no-syscall version. At this point "untracked caching" sounds more
> feasible (and less complex) than ".gitignore caching".

And this is read_directory's timing breakdown (again, "git status" on
gentoo-x86.git, built with -O2 on x86-64 if I did not mention before)

opendir   = 0.030s
readdir   = 0.083s
closedir  = 0.020s
{open,read,close}dir = 0.132s
treat_path           = 0.094s (172534 times)
dir_add_name         = 0.050s (101917 times)
read_directory       = 0.292s
# On branch master
nothing to commit, working directory clean

real    0m0.511s
user    0m0.347s
sys     0m0.157s

Instrumentation is done with gettimeofday. Without gettimeofday calls
inside read_directory_recursive, read_directory takes 0.267s (iow,
gettimeofday cost is about 0.03s). {open,read,close}dir + treat_path +
dir_add_name + gettimeofday add up quite close to 0.292s (strbuf_*
takes just about 0.005s)
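
For anyone who wants to repeat this kind of measurement, the
instrumentation can be as small as the sketch below. The names (struct
timer, now_seconds(), timer_add_since()) are invented for this example
and are not the code I used; gettimeofday() is the only real API
involved.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical accumulator: one per code section being measured. */
struct timer {
	double total;		/* seconds spent in the section so far */
	unsigned long calls;	/* how many times the section ran */
};

/* Current wall-clock time in seconds, as a double. */
static double now_seconds(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return tv.tv_sec + tv.tv_usec / 1e6;
}

/* Charge the time elapsed since `start` to timer `t`. */
static void timer_add_since(struct timer *t, double start)
{
	double elapsed = now_seconds() - start;

	assert(t != NULL);
	if (elapsed > 0)	/* the wall clock may step backwards */
		t->total += elapsed;
	t->calls++;
}
```

A section is then wrapped as "start = now_seconds(); treat_path(...);
timer_add_since(&treat_path_timer, start);". With a couple of hundred
thousand calls the gettimeofday() pairs themselves become measurable,
which is the overhead mentioned above.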

Eliminating xxxdir syscalls may save us 0.132s (or less, we need to
pay to get the information elsewhere).

Because my worktree is clean, dir_add_name spends all 0.05s in
cache_name_exists(). If we somehow know the input path is not a
tracked entry, we could avoid cache_name_exists() and save 0.05s.

If we do the "untracked cache", the number of treat_path calls should
be much lower. In this particular case of gentoo-x86, I'd expect no
more than a dozen untracked files, which cuts down treat_path and
dir_add_name's time to near zero. On a normal repository like git.git,
with about 1075 untracked files and 2552 tracked files, we should be
able to save 2/3 to 1/2 of treat_path calls.
-- 
Duy


* Re: inotify to minimize stat() calls
  2013-02-11  3:21               ` Duy Nguyen
@ 2013-02-11 14:13                 ` Robert Zeh
  2013-02-19  9:57                   ` Ramkumar Ramachandra
  0 siblings, 1 reply; 88+ messages in thread
From: Robert Zeh @ 2013-02-11 14:13 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Junio C Hamano, Ramkumar Ramachandra, Git List

On Sun, Feb 10, 2013 at 9:21 PM, Duy Nguyen <pclouds@gmail.com> wrote:
> On Mon, Feb 11, 2013 at 2:03 AM, Robert Zeh <robert.allan.zeh@gmail.com> wrote:
>> On Sat, Feb 9, 2013 at 1:35 PM, Junio C Hamano <gitster@pobox.com> wrote:
>>> Ramkumar Ramachandra <artagnon@gmail.com> writes:
>>>
>>>> This is much better than Junio's suggestion to study possible
>>>> implementations on all platforms and designing a generic daemon/
>>>> communication channel.  That's no weekend project.
>>>
>>> It appears that you misunderstood what I wrote.  That was not "here
>>> is a design; I want it in my system.  Go implement it".
>>>
>>> It was "If somebody wants to discuss it but does not know where to
>>> begin, doing a small experiment like this and reporting how well it
>>> worked here may be one way to do so.", nothing more.
>>
>> What if instead of communicating over a socket, the daemon
>> dumped a file containing all of the lstat information after git
>> wrote a file? By definition the daemon should know about file writes.
>>
>> There would be no network communication, which I think would make
>> things more secure. It would simplify the rendezvous by insisting on
>> well known locations in $GIT_DIR.
>
> We need some sort of interactive communication to the daemon anyway,
> to validate that the information is up to date. Assume that a user makes
> some changes to his worktree before starting the daemon: git needs to
> know that what the daemon provides does not represent a complete
> file-change picture, and it had better refresh the index the old way
> once, then trust the daemon.
>
> I think we could solve that by storing a "session id", provided by the
> daemon, in .git/index. If the session id is not present (or does not
> match what the current daemon gives), refresh the old way. After
> refreshing, it may ask the daemon for a new session id and store it.
> Next time if the session id is still valid, trust the daemon's data.
> This session id should be different every time the daemon restarts for
> this to work.

I think we could do this without interactive communication,
if we did the following:
   1) The daemon waits to see $GIT_DIR/lstat_request, and atomically
       writes out $GIT_DIR/lstat_cache.  By atomically I mean that it writes
       things out to a temporary file, and then does a rename.

   2) The client erases $GIT_DIR/lstat_cache, and writes
      $GIT_DIR/lstat_request.

I think this is better than socket based communication because there
are fewer places to check for failures.
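
The "temporary file, then rename" step in 1) could be sketched as
follows. This is only an illustration: write_file_atomically() is a
made-up name, the ".tmp.XXXXXX" suffix is an arbitrary choice, and the
sketch deliberately skips fsync(), so it protects readers against a
daemon that dies mid-write, not against power loss.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: publish `len` bytes at `path` atomically.
 * The data is written to a temporary file next to `path`; rename()
 * then replaces `path` in one step, so a reader sees either the old
 * file or the complete new one, never a partial write.
 */
static int write_file_atomically(const char *path, const void *data,
				 size_t len)
{
	char tmp[4096];
	const char *p = data;
	int n, fd;

	assert(path != NULL);
	n = snprintf(tmp, sizeof(tmp), "%s.tmp.XXXXXX", path);
	if (n < 0 || (size_t)n >= sizeof(tmp))
		return -1;	/* path too long for the buffer */
	fd = mkstemp(tmp);	/* same directory, so rename() stays local */
	if (fd < 0)
		return -1;
	while (len > 0) {
		ssize_t written = write(fd, p, len);

		if (written < 0) {
			if (errno == EINTR)
				continue;	/* interrupted: retry */
			close(fd);
			unlink(tmp);
			return -1;
		}
		p += written;
		len -= (size_t)written;
	}
	if (close(fd) < 0 || rename(tmp, path) < 0) {
		unlink(tmp);
		return -1;
	}
	return 0;
}
```

The daemon would call this with $GIT_DIR/lstat_cache as the path; the
client only ever opens the final name, which is what makes the read
side safe without any locking.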

Robert


* Re: inotify to minimize stat() calls
  2013-02-11  3:53                       ` Duy Nguyen
@ 2013-02-12 20:48                         ` Karsten Blees
  2013-02-13 10:06                           ` Duy Nguyen
                                             ` (2 more replies)
  0 siblings, 3 replies; 88+ messages in thread
From: Karsten Blees @ 2013-02-12 20:48 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: kusmabite, Ramkumar Ramachandra, Robert Zeh, Junio C Hamano,
	Git List, finnag

Am 11.02.2013 04:53, schrieb Duy Nguyen:
> On Sun, Feb 10, 2013 at 11:58 PM, Erik Faye-Lund <kusmabite@gmail.com> wrote:
>> Karsten Blees has done something similar-ish on Windows, and he posted
>> the results here:
>>
>> https://groups.google.com/forum/#!topic/msysgit/fL_jykUmUNE/discussion
>>

The new hashtable implementation in fscache [1] supports O(1) removal
and has no mingw dependencies. It might come in handy for anyone trying
to implement an inotify daemon.

[1] https://github.com/kblees/git/commit/f7eb85c2

>> I also seem to remember him doing a ReadDirectoryChangesW version, but
>> I don't remember what happened with that.
> 
> Thanks. I came across that but did not remember. For one thing, we
> know the inotify alternative for Windows: ReadDirectoryChangesW.
> 

I dropped ReadDirectoryChangesW because maintaining a 'live' file
system cache became more and more complicated. For example, according
to the MSDN docs, ReadDirectoryChangesW *may* report short DOS 8.3
names (i.e. "PROGRA~1" instead of "Program Files"), so a correct and
fast cache implementation would have to be indexed by long *and* short
names...

Another problem was that the 'live' cache had quite a negative
performance impact on mutating git commands (checkout, reset...). An
inotify daemon running as a background process (not in-process like
fscache) will probably affect everyone who modifies the working copy,
e.g. by running 'make' or the test suite. This should be considered in
the design.

> I copy "git status"'s (impressive) numbers from fscache-v0 here for
> those who are interested:
> 
> preload | -u  | normal | cached | gain
> --------+-----+--------+--------+------
> false   | all | 25.144 | 3.055  |  8.2
> false   | no  | 22.822 | 1.748  | 12.8
> true    | all |  9.234 | 2.179  |  4.2
> true    | no  |  6.833 | 0.955  |  7.2
> 

Note that I wasn't able to reproduce such bad 'normal' values in later
tests; I guess disk fragmentation and/or a virus scanner must have
tricked me on that day. Gain factors of 2.5 to 5 are more realistic.


However, the difference between git status -uall and -uno was always
about 1.3 s in all fscache versions, even though
opendir/readdir/closedir was served entirely from the cache. I added a
bit of performance tracing to find the cause, and I think most of the
time spent in wt_status_collect_untracked can be eliminated:

1.) 0.939 s is spent in dir.c/excluded (i.e. checking .gitignore). This
check is done for *every* file in the working copy, including files in
the index. Checking the index first could eliminate most of that, i.e.:

(Note: patches are for discussion only, I'm aware that they may have unintended side effects...)

@@ -1097,6 +1097,8 @@ static enum path_treatment treat_path(struct dir_struct *dir,
                return path_ignored;
        strbuf_setlen(path, baselen);
        strbuf_addstr(path, de->d_name);
+       if (cache_name_exists(path->buf, path->len, ignore_case))
+               return path_ignored;
        if (simplify_away(path->buf, path->len, simplify))
                return path_ignored;
---


2.) 0.135 s is spent in name-hash.c/hash_index_entry_directories,
reindexing the same directories over and over again. In the end, the
hashtable contains 939k directory entries, even though the WebKit test
repo only has 7k directories. Checking if a directory entry already
exists could reduce that, i.e.:

@@ -53,14 +55,23 @@ static void hash_index_entry_directories(struct index_state *istate, struct cach
 	unsigned int hash;
 	void **pos;
 	double t = ticks();
+	struct cache_entry *ce2;
+	int len = ce_namelen(ce);
 
-	const char *ptr = ce->name;
-	while (*ptr) {
-		while (*ptr && *ptr != '/')
-			++ptr;
-		if (*ptr == '/') {
-			++ptr;
-			hash = hash_name(ce->name, ptr - ce->name);
+	while (len > 0) {
+		while (len > 0 && ce->name[len - 1] != '/')
+			len--;
+		if (len > 0) {
+			hash = hash_name(ce->name, len);
+			ce2 = lookup_hash(hash, &istate->name_hash);
+			while (ce2) {
+				if (same_name(ce2, ce->name, len, ignore_case)) {
+					add_since(t, &hash_dirs);
+					return;
+				}
+				ce2 = ce2->dir_next;
+			}
+			len--;
 			pos = insert_hash(hash, ce, &istate->name_hash);
 			if (pos) {
 				ce->dir_next = *pos;
---


Tests were done with the WebKit repo (~200k files, ~7k directories, 15
.gitignore files, ~100 entries in the root .gitignore). Instrumented
code can be found here:
https://github.com/kblees/git/tree/kb/git-status-performance-tracing

Here are the performance traces of 'git status -s -uall':

Before patches:

trace: at builtin/commit.c:1221, time: 0.523429 s: cmd_status/read_cache_preload
trace: at builtin/commit.c:1223, time: 0.00403477 s: cmd_status/refresh_index
trace: at builtin/commit.c:1231, time: 0.00318494 s: cmd_status/hold_locked_index
trace: at wt-status.c:539, time: 0.00527396 s: wt_status_collect_changes_worktree
trace: at wt-status.c:544, time: 0.00545771 s: wt_status_collect_changes
trace: at wt-status.c:546, time: 1.286 s: wt_status_collect_untracked
trace: at builtin/commit.c:1233, time: 1.29852 s: cmd_status/wt_status_collect
trace: at dir.c:1540, time: 0.00170986 s: read_directory_recursive/strbuf_add
trace: at dir.c:1541, time: 0.00623972 s: read_directory_recursive/opendir
trace: at dir.c:1542, time: 0.00517881 s: read_directory_recursive/readdir
trace: at dir.c:1543, time: 0.992936 s: read_directory_recursive/treat_path
trace: at dir.c:1544, time: 0.277942 s: read_directory_recursive/dir_add_name
trace: at dir.c:1545, time: 0.0014594 s: read_directory_recursive/close
trace: at dir.c:1546, time: 0.939349 s: treat_one_path/excluded
trace: at dir.c:1547, time: 0.0050811 s: treat_one_path/dir_add_ignored
trace: at dir.c:1548, time: 0.00515875 s: treat_one_path/get_dtype
trace: at dir.c:1549, time: 0.00329322 s: treat_one_path/treat_directory
trace: at dir.c:1550, time: 0.222969 s: excluded/prep_exclude
trace: at dir.c:1551, time: 0.00443398 s: excluded/excluded_from_list[EXC_CMDL]
trace: at dir.c:1552, time: 0.699602 s: excluded/excluded_from_list[EXC_DIRS]
trace: at dir.c:1553, time: 0.00475736 s: excluded/excluded_from_list[EXC_FILE]
trace: at read-cache.c:460, time: 0.00967987 s: index_name_pos
trace: at name-hash.c:213, time: 0.190481 s: lazy_init_name_hash
trace: at name-hash.c:216, time: 0.135248 s: hash_index_entry_directories (938865 entries)
trace: at name-hash.c:217, time: 0.0806647 s: index_name_exists
trace: at compat/mingw.c:2137, time: 1.97424 s: command: c:\git\msysgit\git\git-status.exe -s -uall


After patches:

trace: at builtin/commit.c:1221, time: 0.517511 s: cmd_status/read_cache_preload
trace: at builtin/commit.c:1223, time: 0.00405227 s: cmd_status/refresh_index
trace: at builtin/commit.c:1231, time: 0.00322796 s: cmd_status/hold_locked_index
trace: at wt-status.c:539, time: 0.00530057 s: wt_status_collect_changes_worktree
trace: at wt-status.c:544, time: 0.00546062 s: wt_status_collect_changes
trace: at wt-status.c:546, time: 0.322799 s: wt_status_collect_untracked
trace: at builtin/commit.c:1233, time: 0.33536 s: cmd_status/wt_status_collect
trace: at dir.c:1542, time: 0.00120529 s: read_directory_recursive/strbuf_add
trace: at dir.c:1543, time: 0.00476647 s: read_directory_recursive/opendir
trace: at dir.c:1544, time: 0.00502022 s: read_directory_recursive/readdir
trace: at dir.c:1545, time: 0.310515 s: read_directory_recursive/treat_path
trace: at dir.c:1546, time: 0 s: read_directory_recursive/dir_add_name
trace: at dir.c:1547, time: 0.000831234 s: read_directory_recursive/close
trace: at dir.c:1548, time: 0.0668582 s: treat_one_path/excluded
trace: at dir.c:1549, time: 0.000173174 s: treat_one_path/dir_add_ignored
trace: at dir.c:1550, time: 0.000174267 s: treat_one_path/get_dtype
trace: at dir.c:1551, time: 0.00315468 s: treat_one_path/treat_directory
trace: at dir.c:1552, time: 0.039733 s: excluded/prep_exclude
trace: at dir.c:1553, time: 0.000185205 s: excluded/excluded_from_list[EXC_CMDL]
trace: at dir.c:1554, time: 0.0264496 s: excluded/excluded_from_list[EXC_DIRS]
trace: at dir.c:1555, time: 0.000170622 s: excluded/excluded_from_list[EXC_FILE]
trace: at read-cache.c:460, time: 0.00260636 s: index_name_pos
trace: at name-hash.c:224, time: 0.126637 s: lazy_init_name_hash
trace: at name-hash.c:227, time: 0.0500866 s: hash_index_entry_directories (7152 entries)
trace: at name-hash.c:228, time: 0.0790143 s: index_name_exists
trace: at compat/mingw.c:2137, time: 1.00595 s: command: c:\git\msysgit\git\git-status.exe -s -uall


* Re: inotify to minimize stat() calls
  2013-02-12 20:48                         ` Karsten Blees
@ 2013-02-13 10:06                           ` Duy Nguyen
  2013-02-13 12:15                           ` Duy Nguyen
  2013-02-19  9:49                           ` inotify to minimize stat() calls Ramkumar Ramachandra
  2 siblings, 0 replies; 88+ messages in thread
From: Duy Nguyen @ 2013-02-13 10:06 UTC (permalink / raw)
  To: blees
  Cc: kusmabite, Ramkumar Ramachandra, Robert Zeh, Junio C Hamano,
	Git List, finnag

On Tue, Feb 12, 2013 at 09:48:18PM +0100, Karsten Blees wrote:

> However, the difference between git status -uall and -uno was always
> about 1.3 s in all fscache versions, even though
> opendir/readdir/closedir was served entirely from the cache. I added
> a bit of performance tracing to find the cause, and I think most of
> the time spent in wt_status_collect_untracked can be eliminated:
> 
> 1.) 0.939 s is spent in dir.c/excluded (i.e. checking
> .gitignore). This check is done for *every* file in the working
> copy, including files in the index. Checking the index first could
> eliminate most of that, i.e.:
> 
> (Note: patches are for discussion only, I'm aware that they may have
> unintended side effects...)
>
> @@ -1097,6 +1097,8 @@ static enum path_treatment treat_path(struct dir_struct *dir,
>                 return path_ignored;
>         strbuf_setlen(path, baselen);
>         strbuf_addstr(path, de->d_name);
> +       if (cache_name_exists(path->buf, path->len, ignore_case))
> +               return path_ignored;
>         if (simplify_away(path->buf, path->len, simplify))
>                 return path_ignored;

The patch below passes the test suite for me and still does the same
thing. On my Linux box, running "git status" on gentoo-x86.git with
this patch saves 0.05s (0.548s without the patch, 0.505s with the
patch, best of 20 runs).

And I just realized gentoo-x86.git does not have any .gitignore. On
webkit.git, it cuts "git status" time from 1.121s down to
0.762s. Unless I'm mistaken, "git add" should see the same benefit in
the normal case too. Good finding!

-- 8< --
diff --git a/dir.c b/dir.c
index 57394e4..4b4cf60 100644
--- a/dir.c
+++ b/dir.c
@@ -1244,7 +1244,19 @@ static enum path_treatment treat_one_path(struct dir_struct *dir,
 					  const struct path_simplify *simplify,
 					  int dtype, struct dirent *de)
 {
-	int exclude = is_excluded(dir, path->buf, &dtype);
+	int exclude;
+
+	if (dtype == DT_UNKNOWN)
+		dtype = get_dtype(de, path->buf, path->len);
+
+	if (!(dir->flags & DIR_SHOW_IGNORED) &&
+	    !(dir->flags & DIR_COLLECT_IGNORED) &&
+	    dtype == DT_REG &&
+	    cache_name_exists(path->buf, path->len, ignore_case))
+		return path_ignored;
+
+	exclude = is_excluded(dir, path->buf, &dtype);
+
 	if (exclude && (dir->flags & DIR_COLLECT_IGNORED)
 	    && exclude_matches_pathspec(path->buf, path->len, simplify))
 		dir_add_ignored(dir, path->buf, path->len);
@@ -1256,9 +1268,6 @@ static enum path_treatment treat_one_path(struct dir_struct *dir,
 	if (exclude && !(dir->flags & DIR_SHOW_IGNORED))
 		return path_ignored;
 
-	if (dtype == DT_UNKNOWN)
-		dtype = get_dtype(de, path->buf, path->len);
-
 	switch (dtype) {
 	default:
 		return path_ignored;
-- 8< --


* Re: inotify to minimize stat() calls
  2013-02-12 20:48                         ` Karsten Blees
  2013-02-13 10:06                           ` Duy Nguyen
@ 2013-02-13 12:15                           ` Duy Nguyen
  2013-02-13 18:18                             ` Jeff King
  2013-02-19  9:49                           ` inotify to minimize stat() calls Ramkumar Ramachandra
  2 siblings, 1 reply; 88+ messages in thread
From: Duy Nguyen @ 2013-02-13 12:15 UTC (permalink / raw)
  To: blees
  Cc: kusmabite, Ramkumar Ramachandra, Robert Zeh, Junio C Hamano,
	Git List, finnag, Jeff King

On Wed, Feb 13, 2013 at 3:48 AM, Karsten Blees <karsten.blees@gmail.com> wrote:
> 2.) 0.135 s is spent in name-hash.c/hash_index_entry_directories, reindexing the same directories over and over again. In the end, the hashtable contains 939k directory entries, even though the WebKit test repo only has 7k directories. Checking if a directory entry already exists could reduce that, i.e.:

This function is only used when core.ignorecase = true. I probably
won't be able to test this, so I'll leave this to other people who
care about ignorecase.

This function used to have lookup_hash, but it was removed by Jeff in
2548183 (fix phantom untracked files when core.ignorecase is set -
2011-10-06). There's a looong commit message which I'm too lazy to
read. Anybody who works on this should though.


> @@ -53,14 +55,23 @@ static void hash_index_entry_directories(struct index_state *istate, struct cach
>         unsigned int hash;
>         void **pos;
>         double t = ticks();
> +       struct cache_entry *ce2;
> +       int len = ce_namelen(ce);
>
> -       const char *ptr = ce->name;
> -       while (*ptr) {
> -               while (*ptr && *ptr != '/')
> -                       ++ptr;
> -               if (*ptr == '/') {
> -                       ++ptr;
> -                       hash = hash_name(ce->name, ptr - ce->name);
> +       while (len > 0) {
> +               while (len > 0 && ce->name[len - 1] != '/')
> +                       len--;
> +               if (len > 0) {
> +                       hash = hash_name(ce->name, len);
> +                       ce2 = lookup_hash(hash, &istate->name_hash);
> +                       while (ce2) {
> +                               if (same_name(ce2, ce->name, len, ignore_case)) {
> +                                       add_since(t, &hash_dirs);
> +                                       return;
> +                               }
> +                               ce2 = ce2->dir_next;
> +                       }
> +                       len--;
>                         pos = insert_hash(hash, ce, &istate->name_hash);
>                         if (pos) {
>                                 ce->dir_next = *pos;
-- 
Duy


* Re: inotify to minimize stat() calls
  2013-02-13 12:15                           ` Duy Nguyen
@ 2013-02-13 18:18                             ` Jeff King
  2013-02-13 19:47                               ` Jeff King
  2013-02-13 20:25                               ` Karsten Blees
  0 siblings, 2 replies; 88+ messages in thread
From: Jeff King @ 2013-02-13 18:18 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: blees, kusmabite, Ramkumar Ramachandra, Robert Zeh,
	Junio C Hamano, Git List, finnag

On Wed, Feb 13, 2013 at 07:15:47PM +0700, Nguyen Thai Ngoc Duy wrote:

> On Wed, Feb 13, 2013 at 3:48 AM, Karsten Blees <karsten.blees@gmail.com> wrote:
> > 2.) 0.135 s is spent in name-hash.c/hash_index_entry_directories, reindexing the same directories over and over again. In the end, the hashtable contains 939k directory entries, even though the WebKit test repo only has 7k directories. Checking if a directory entry already exists could reduce that, i.e.:
> 
> This function is only used when core.ignorecase = true. I probably
> won't be able to test this, so I'll leave this to other people who
> care about ignorecase.
> 
> This function used to have lookup_hash, but it was removed by Jeff in
> 2548183 (fix phantom untracked files when core.ignorecase is set -
> 2011-10-06). There's a looong commit message which I'm too lazy to
> read. Anybody who works on this should though.

Yeah, the problem that commit tried to solve is that linking to a single
cache entry through the hash is not enough, because we may remove cache
items. Imagine you have "dir/one" and "dir/two", and you add them to the
in-memory index in that order. The original code hashed "dir/" and
inserted a link to the "dir/one" cache entry. When it came time to put
in the "dir/two" entry, we noticed that there was already a "dir/" entry
and did nothing. Then later, if we remove "dir/one", we do so by marking
it with CE_UNHASHED. So a later query for "dir/" will see "nope, nothing
here that wasn't CE_UNHASHED", which is wrong. We never recorded that
"dir/two" existed under the hash for "dir/", so we can't know about it.

My patch just stores the cache_entry for both under the "dir/" hash.
As Karsten noticed, that can lead to a large number of hash entries,
because adding "some/deep/hierarchy/with/files" will add 4 directory
entries for just that single file. Moreover, looking at it again, I
don't think my patch produces the right behavior: we have a single
dir_next pointer, even though the same cache_entry may appear under many
directory hashes. So the cache_entries that hash to "dir/foo/" and those
that hash to "dir/bar/" may get confused, because they will also both be
found under "dir/", and both try to create a linked list from the
dir_next pointer.
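
To see where the entry count comes from, here is a small standalone
sketch of the prefix walk (next_dir_prefix() is a made-up helper, not a
function in name-hash.c). Every leading directory of every index entry
gets its own hash insertion, so the table holds the sum of all path
depths (the 938865 entries in Karsten's trace) rather than the number
of distinct directories (7152 after his patch).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: given the length of the previous directory
 * prefix of `path` (0 to start), return the length of the next one,
 * including its trailing '/', or 0 when no deeper directory is left.
 */
static size_t next_dir_prefix(const char *path, size_t prev_len)
{
	const char *slash;

	assert(prev_len <= strlen(path));
	slash = strchr(path + prev_len, '/');
	return slash ? (size_t)(slash - path) + 1 : 0;
}
```

Walking "some/deep/hierarchy/with/files" this way yields "some/",
"some/deep/", "some/deep/hierarchy/" and "some/deep/hierarchy/with/",
which are the 4 directory entries mentioned above.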

Looking at Karsten's patch, it seems like it will not add a cache entry
if there is one of the same name. But I'm not sure if that is right, as
the old one might be CE_UNHASHED (or it might get removed later). You
actually want to be able to find each cache_entry that has a file under
the directory at the hash of that directory, so you can make sure it is
still valid.

And of course that still leaves the existing correctness problem I
mentioned above.

I think the best way forward is to actually create a separate hash table
for the directory lookups. I note that we only care about these entries
in directory_exists_in_index_icase, which is really about whether
something is there, versus what exactly is there. So could we maybe get
by with a separate hash table that stores a count of entries at each
directory, and increment/decrement the count when we add/remove entries?
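
Roughly, the counting idea could look like the following standalone
sketch. Everything here is an assumption for illustration: the toy_*
names, the fixed-size array standing in for a real hash table, and the
choice to store directory names with their trailing slash. It is not a
proposed implementation, just the bookkeeping in miniature.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_DIRS 64
#define TOY_MAX_NAME 128

struct toy_dir_count {
	char name[TOY_MAX_NAME];	/* directory name, trailing '/' included */
	size_t namelen;
	int nr;				/* number of live entries below it */
};

struct toy_dir_table {
	struct toy_dir_count dirs[TOY_MAX_DIRS];
	size_t used;
};

/* Case-insensitive comparison of two names of equal length. */
static int toy_same_name(const char *a, const char *b, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
			return 0;
	return 1;
}

static struct toy_dir_count *toy_find(struct toy_dir_table *t,
				      const char *name, size_t len)
{
	size_t i;

	for (i = 0; i < t->used; i++)
		if (t->dirs[i].namelen == len &&
		    toy_same_name(t->dirs[i].name, name, len))
			return &t->dirs[i];
	return NULL;
}

/* Visit every leading directory of path ("a/", "a/b/", ...) and bump
 * its count up (delta > 0) or down (delta < 0). */
static void toy_update(struct toy_dir_table *t, const char *path, int delta)
{
	const char *p;

	for (p = path; (p = strchr(p, '/')) != NULL; p++) {
		size_t len = (size_t)(p - path) + 1;	/* keep the slash */
		struct toy_dir_count *d = toy_find(t, path, len);

		if (!d && delta > 0 &&
		    t->used < TOY_MAX_DIRS && len < TOY_MAX_NAME) {
			d = &t->dirs[t->used++];
			memcpy(d->name, path, len);
			d->name[len] = '\0';
			d->namelen = len;
			d->nr = 0;
		}
		if (d && (delta > 0 || d->nr > 0))
			d->nr += delta;
	}
}

/* A directory "exists" while at least one live entry is counted under it. */
static int toy_dir_in_index(struct toy_dir_table *t, const char *name)
{
	struct toy_dir_count *d = toy_find(t, name, strlen(name));

	return d && d->nr > 0;
}
```

With this, removing "dir/one" after adding both "dir/one" and "dir/two"
leaves the count for "dir/" at 1, so the directory is still found, and
it only disappears once "dir/two" is removed as well.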

The biggest problem I see with that is that we do indeed care a little
bit what is at the directory: we check the mode to see if it is a gitdir
or not. But I think we can maybe sneak around that: gitdirs have actual
entries in the index, whereas the directories do not. So we would find
them via index_name_exists; anything that is not there, but _is_ in the
special directory hash would therefore be a directory.

I realize it got pretty esoteric there in the middle. I'll see if I can
work up a patch that expresses what I'm thinking.

-Peff

* Re: inotify to minimize stat() calls
  2013-02-13 18:18                             ` Jeff King
@ 2013-02-13 19:47                               ` Jeff King
  2013-02-13 20:25                               ` Karsten Blees
  1 sibling, 0 replies; 88+ messages in thread
From: Jeff King @ 2013-02-13 19:47 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: blees, kusmabite, Ramkumar Ramachandra, Robert Zeh,
	Junio C Hamano, Git List, finnag

On Wed, Feb 13, 2013 at 01:18:51PM -0500, Jeff King wrote:

> I think the best way forward is to actually create a separate hash table
> for the directory lookups. I note that we only care about these entries
> in directory_exists_in_index_icase, which is really about whether
> something is there, versus what exactly is there. So could we maybe get
> by with a separate hash table that stores a count of entries at each
> directory, and increment/decrement the count when we add/remove entries?
> 
> The biggest problem I see with that is that we do indeed care a little
> bit what is at the directory: we check the mode to see if it is a gitdir
> or not. But I think we can maybe sneak around that: gitdirs have actual
> entries in the index, whereas the directories do not. So we would find
> them via index_name_exists; anything that is not there, but _is_ in the
> special directory hash would therefore be a directory.
> 
> I realize it got pretty esoteric there in the middle. I'll see if I can
> work up a patch that expresses what I'm thinking.

So here's a patch. It's mostly meant to illustrate what I'm thinking,
and I have no clue if it introduces regressions. It does pass the test
suite, but we have virtually no ignorecase tests.  It seems to behave
sanely when I set core.ignorecase on my Linux box, but I have no idea
what it will do on a real case-insensitive system (nor even, to be
honest, what kinds of scenarios should be tested for the dir-hashing
stuff).

---
diff --git a/cache.h b/cache.h
index e493563..6630a35 100644
--- a/cache.h
+++ b/cache.h
@@ -131,7 +131,6 @@ struct cache_entry {
 	unsigned int ce_namelen;
 	unsigned char sha1[20];
 	struct cache_entry *next;
-	struct cache_entry *dir_next;
 	char name[FLEX_ARRAY]; /* more */
 };
 
@@ -267,26 +266,14 @@ extern void add_name_hash(struct index_state *istate, struct cache_entry *ce);
 	unsigned name_hash_initialized : 1,
 		 initialized : 1;
 	struct hash_table name_hash;
+	struct hash_table dir_hash;
 };
 
 extern struct index_state the_index;
 
 /* Name hashing */
 extern void add_name_hash(struct index_state *istate, struct cache_entry *ce);
-/*
- * We don't actually *remove* it, we can just mark it invalid so that
- * we won't find it in lookups.
- *
- * Not only would we have to search the lists (simple enough), but
- * we'd also have to rehash other hash buckets in case this makes the
- * hash bucket empty (common). So it's much better to just mark
- * it.
- */
-static inline void remove_name_hash(struct cache_entry *ce)
-{
-	ce->ce_flags |= CE_UNHASHED;
-}
-
+extern void remove_name_hash(struct index_state *istate, struct cache_entry *ce);
 
 #ifndef NO_THE_INDEX_COMPATIBILITY_MACROS
 #define active_cache (the_index.cache)
@@ -443,6 +430,7 @@ extern struct cache_entry *index_name_exists(struct index_state *istate, const c
 extern int unmerged_index(const struct index_state *);
 extern int verify_path(const char *path);
 extern struct cache_entry *index_name_exists(struct index_state *istate, const char *name, int namelen, int igncase);
+extern int index_icase_dir_exists(struct index_state *istate, const char *name, int namelen);
 extern int index_name_pos(const struct index_state *, const char *name, int namelen);
 #define ADD_CACHE_OK_TO_ADD 1		/* Ok to add */
 #define ADD_CACHE_OK_TO_REPLACE 2	/* Ok to replace file/directory */
diff --git a/dir.c b/dir.c
index 57394e4..f73ac34 100644
--- a/dir.c
+++ b/dir.c
@@ -927,29 +927,27 @@ static enum exist_status directory_exists_in_index_icase(const char *dirname, in
  */
 static enum exist_status directory_exists_in_index_icase(const char *dirname, int len)
 {
-	struct cache_entry *ce = index_name_exists(&the_index, dirname, len + 1, ignore_case);
-	unsigned char endchar;
-
-	if (!ce)
-		return index_nonexistent;
-	endchar = ce->name[len];
+	struct cache_entry *ce = index_name_exists(&the_index, dirname, len, ignore_case);
 
 	/*
-	 * The cache_entry structure returned will contain this dirname
-	 * and possibly additional path components.
+	 * We found something in the index, which means it is either an actual
+	 * file, or a gitdir.
 	 */
-	if (endchar == '/')
-		return index_directory;
+	if (ce) {
+	    if (S_ISGITLINK(ce->ce_mode))
+		    return index_gitdir;
+	    /* We call a file "index_nonexistent" here, because the caller is
+	     * asking about a directory.  */
+	    return index_nonexistent;
+	}
 
 	/*
-	 * If there are no additional path components, then this cache_entry
-	 * represents a submodule.  Submodules, despite being directories,
-	 * are stored in the cache without a closing slash.
+	 * Otherwise, it might be a leading path of something that is in the
+	 * index. We can look it up in the special dir hash.
 	 */
-	if (!endchar && S_ISGITLINK(ce->ce_mode))
-		return index_gitdir;
+	if (index_icase_dir_exists(&the_index, dirname, len))
+		return index_directory;
 
-	/* This should never be hit, but it exists just in case. */
 	return index_nonexistent;
 }
 
diff --git a/name-hash.c b/name-hash.c
index d8d25c2..de8239f 100644
--- a/name-hash.c
+++ b/name-hash.c
@@ -32,37 +32,88 @@ static void hash_index_entry_directories(struct index_state *istate, struct cach
 	return hash;
 }
 
-static void hash_index_entry_directories(struct index_state *istate, struct cache_entry *ce)
+struct dir_hash_entry {
+	struct dir_hash_entry *next;
+	int nr;
+	unsigned int namelen;
+	char name[FLEX_ARRAY];
+};
+
+static struct dir_hash_entry *find_dir_hash(struct hash_table *t,
+					    const char *name,
+					    unsigned int namelen)
+{
+	unsigned int hash = hash_name(name, namelen);
+	struct dir_hash_entry *ent;
+
+	for (ent = lookup_hash(hash, t); ent; ent = ent->next) {
+		if (ent->namelen == namelen &&
+		    !strncasecmp(ent->name, name, namelen))
+			return ent;
+	}
+	return NULL;
+}
+
+static struct dir_hash_entry *find_or_create_dir_hash(struct hash_table *t,
+						      const char *name,
+						      unsigned int namelen)
+{
+	struct dir_hash_entry *ent;
+
+	ent = find_dir_hash(t, name, namelen);
+	if (!ent) {
+		void **pos;
+
+		ent = xcalloc(sizeof(*ent) + namelen + 1, 1);
+		memcpy(ent->name, name, namelen);
+		ent->namelen = namelen;
+
+		pos = insert_hash(hash_name(name, namelen), ent, t);
+		if (pos) {
+			ent->next = *pos;
+			*pos = ent;
+		}
+	}
+
+	return ent;
+}
+
+static void hash_index_entry_directories(struct index_state *istate,
+					 struct cache_entry *ce,
+					 int add)
 {
 	/*
-	 * Throw each directory component in the hash for quick lookup
+	 * Throw each directory component into a hash for quick lookup
 	 * during a git status. Directory components are stored with their
 	 * closing slash.  Despite submodules being a directory, they never
 	 * reach this point, because they are stored without a closing slash
-	 * in the cache.
-	 *
-	 * Note that the cache_entry stored with the directory does not
-	 * represent the directory itself.  It is a pointer to an existing
-	 * filename, and its only purpose is to represent existence of the
-	 * directory in the cache.  It is very possible multiple directory
-	 * hash entries may point to the same cache_entry.
+	 * in the cache. This means we don't need to know anything about
+	 * what is stored at a particular directory, just that it is a leading
+	 * directory component of something else. Which means we can get away
+	 * with storing a count instead of a complete
 	 */
-	unsigned int hash;
-	void **pos;
-
 	const char *ptr = ce->name;
 	while (*ptr) {
 		while (*ptr && *ptr != '/')
 			++ptr;
 		if (*ptr == '/') {
-			++ptr;
-			hash = hash_name(ce->name, ptr - ce->name);
-			pos = insert_hash(hash, ce, &istate->name_hash);
-			if (pos) {
-				ce->dir_next = *pos;
-				*pos = ce;
+			struct dir_hash_entry *ent;
+
+			if (add) {
+				ent = find_or_create_dir_hash(&istate->dir_hash,
+							      ce->name,
+							      ptr - ce->name);
+				ent->nr++;
+			}
+			else {
+				ent = find_dir_hash(&istate->dir_hash,
+						    ce->name,
+						    ptr - ce->name);
+				if (ent)
+					ent->nr--;
 			}
 		}
+		ptr++;
 	}
 }
 
@@ -74,7 +125,7 @@ static void hash_index_entry(struct index_state *istate, struct cache_entry *ce)
 	if (ce->ce_flags & CE_HASHED)
 		return;
 	ce->ce_flags |= CE_HASHED;
-	ce->next = ce->dir_next = NULL;
+	ce->next = NULL;
 	hash = hash_name(ce->name, ce_namelen(ce));
 	pos = insert_hash(hash, ce, &istate->name_hash);
 	if (pos) {
@@ -83,7 +134,7 @@ static void hash_index_entry(struct index_state *istate, struct cache_entry *ce)
 	}
 
 	if (ignore_case)
-		hash_index_entry_directories(istate, ce);
+		hash_index_entry_directories(istate, ce, 1);
 }
 
 static void lazy_init_name_hash(struct index_state *istate)
@@ -104,6 +155,22 @@ void add_name_hash(struct index_state *istate, struct cache_entry *ce)
 		hash_index_entry(istate, ce);
 }
 
+/*
+ * We don't actually *remove* it, we can just mark it invalid so that
+ * we won't find it in lookups.
+ *
+ * Not only would we have to search the lists (simple enough), but
+ * we'd also have to rehash other hash buckets in case this makes the
+ * hash bucket empty (common). So it's much better to just mark
+ * it.
+ */
+void remove_name_hash(struct index_state *istate, struct cache_entry *ce)
+{
+	ce->ce_flags |= CE_UNHASHED;
+	if (istate->dir_hash.nr)
+		hash_index_entry_directories(istate, ce, 0);
+}
+
 static int slow_same_name(const char *name1, int len1, const char *name2, int len2)
 {
 	if (len1 != len2)
@@ -137,18 +204,7 @@ static int same_name(const struct cache_entry *ce, const char *name, int namelen
 	if (!icase)
 		return 0;
 
-	/*
-	 * If the entry we're comparing is a filename (no trailing slash), then compare
-	 * the lengths exactly.
-	 */
-	if (name[namelen - 1] != '/')
-		return slow_same_name(name, namelen, ce->name, len);
-
-	/*
-	 * For a directory, we point to an arbitrary cache_entry filename.  Just
-	 * make sure the directory portion matches.
-	 */
-	return slow_same_name(name, namelen, ce->name, namelen < len ? namelen : len);
+	return slow_same_name(name, namelen, ce->name, len);
 }
 
 struct cache_entry *index_name_exists(struct index_state *istate, const char *name, int namelen, int icase)
@@ -164,10 +220,7 @@ struct cache_entry *index_name_exists(struct index_state *istate, const char *na
 			if (same_name(ce, name, namelen, icase))
 				return ce;
 		}
-		if (icase && name[namelen - 1] == '/')
-			ce = ce->dir_next;
-		else
-			ce = ce->next;
+		ce = ce->next;
 	}
 
 	/*
@@ -188,3 +241,11 @@ struct cache_entry *index_name_exists(struct index_state *istate, const char *na
 	}
 	return NULL;
 }
+
+int index_icase_dir_exists(struct index_state *istate, const char *name, int namelen)
+{
+	struct dir_hash_entry *ent;
+
+	ent = find_dir_hash(&istate->dir_hash, name, namelen);
+	return ent && ent->nr;
+}
diff --git a/read-cache.c b/read-cache.c
index 827ae55..116c25c 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -46,7 +46,7 @@ static void replace_index_entry(struct index_state *istate, int nr, struct cache
 {
 	struct cache_entry *old = istate->cache[nr];
 
-	remove_name_hash(old);
+	remove_name_hash(istate, old);
 	set_index_entry(istate, nr, ce);
 	istate->cache_changed = 1;
 }
@@ -460,7 +460,7 @@ int remove_index_entry_at(struct index_state *istate, int pos)
 	struct cache_entry *ce = istate->cache[pos];
 
 	record_resolve_undo(istate, ce);
-	remove_name_hash(ce);
+	remove_name_hash(istate, ce);
 	istate->cache_changed = 1;
 	istate->cache_nr--;
 	if (pos >= istate->cache_nr)
@@ -483,7 +483,7 @@ void remove_marked_cache_entries(struct index_state *istate)
 
 	for (i = j = 0; i < istate->cache_nr; i++) {
 		if (ce_array[i]->ce_flags & CE_REMOVE)
-			remove_name_hash(ce_array[i]);
+			remove_name_hash(istate, ce_array[i]);
 		else
 			ce_array[j++] = ce_array[i];
 	}

* Re: inotify to minimize stat() calls
  2013-02-13 18:18                             ` Jeff King
  2013-02-13 19:47                               ` Jeff King
@ 2013-02-13 20:25                               ` Karsten Blees
  2013-02-13 22:55                                 ` Jeff King
  1 sibling, 1 reply; 88+ messages in thread
From: Karsten Blees @ 2013-02-13 20:25 UTC (permalink / raw)
  To: Jeff King
  Cc: Duy Nguyen, kusmabite, Ramkumar Ramachandra, Robert Zeh,
	Junio C Hamano, Git List, finnag

On 13.02.2013 19:18, Jeff King wrote:
> Moreover, looking at it again, I
> don't think my patch produces the right behavior: we have a single
> dir_next pointer, even though the same ce_entry may appear under many
> directory hashes. So the cache_entries that has to "dir/foo/" and those
> that hash to "dir/bar/" may get confused, because they will also both be
> found under "dir/", and both try to create a linked list from the
> dir_next pointer.
> 

Indeed. In the worst case, this causes an endless loop if ce->dir_next == ce:
---8<---
mkdir V
mkdir V/XQANY
mkdir WURZAUP
touch V/XQANY/test
git init
git config core.ignorecase true
git add .
git status
---8<---
Note: "V/", "V/XQANY/" and "WURZAUP/" all have the same hash_name. Although I found those strange values by brute force, hash collisions in 32 bit values are not that uncommon in real life :-)
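
How the self-reference comes about can be shown with a small standalone
model. This is a sketch under assumptions, not the real name-hash code:
toy_ce and the toy_* functions are invented for illustration, and a
plain pointer stands in for one hash bucket. The point is that chaining
the same entry into the same bucket twice (once per colliding directory
prefix) makes its dir_next point at itself.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a cache_entry with the single shared dir_next link. */
struct toy_ce {
	const char *name;
	struct toy_ce *dir_next;
};

/* Chain ce into a bucket the way the old directory hashing did: if the
 * bucket is already occupied, ce becomes the new head and links to the
 * previous head. */
static void toy_bucket_insert(struct toy_ce **bucket, struct toy_ce *ce)
{
	if (*bucket) {
		ce->dir_next = *bucket;
		*bucket = ce;
	} else {
		*bucket = ce;
	}
}

/* Follow dir_next links; give up and return -1 if the chain has not
 * ended after `limit` entries (i.e. it is almost certainly a cycle). */
static int toy_chain_length(const struct toy_ce *head, int limit)
{
	int n = 0;

	while (head) {
		if (n == limit)
			return -1;
		n++;
		head = head->dir_next;
	}
	return n;
}
```

Inserting the entry for "V/XQANY/test" once for "V/" and once more for
"V/XQANY/" into the same bucket leaves ce->dir_next == ce, and any walk
along dir_next never terminates without a limit like the one above.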

> Looking at Karsten's patch, it seems like it will not add a cache entry
> if there is one of the same name. But I'm not sure if that is right, as
> the old one might be CE_UNHASHED (or it might get removed later). You
> actually want to be able to find each cache_entry that has a file under
> the directory at the hash of that directory, so you can make sure it is
> still valid.
> 

Yes, the patch was just meant to show potential performance savings; I didn't consider CE_UNHASHED at all.

> I think the best way forward is to actually create a separate hash table
> for the directory lookups. I note that we only care about these entries
> in directory_exists_in_index_icase, which is really about whether
> something is there, versus what exactly is there. So could we maybe get
> by with a separate hash table that stores a count of entries at each
> directory, and increment/decrement the count when we add/remove entries?
> 

Alternatively, we could simply create normal cache_entries for the directories that are linked via ce->next, but have a trailing '/' in their name?

Reference counting sounds good to me, at least better than allocating directory entries per cache entry * parent dirs.

* Re: inotify to minimize stat() calls
  2013-02-13 20:25                               ` Karsten Blees
@ 2013-02-13 22:55                                 ` Jeff King
  2013-02-14  0:48                                   ` Karsten Blees
  0 siblings, 1 reply; 88+ messages in thread
From: Jeff King @ 2013-02-13 22:55 UTC (permalink / raw)
  To: blees
  Cc: Duy Nguyen, kusmabite, Ramkumar Ramachandra, Robert Zeh,
	Junio C Hamano, Git List, finnag

On Wed, Feb 13, 2013 at 09:25:59PM +0100, Karsten Blees wrote:

> Am 13.02.2013 19:18, schrieb Jeff King:
> > Moreover, looking at it again, I
> > don't think my patch produces the right behavior: we have a single
> > dir_next pointer, even though the same ce_entry may appear under many
> > directory hashes. So the cache_entries that has to "dir/foo/" and those
> > that hash to "dir/bar/" may get confused, because they will also both be
> > found under "dir/", and both try to create a linked list from the
> > dir_next pointer.
> > 
> 
> Indeed. In the worst case, this causes an endless loop if ce->dir_next == ce
> ---8<---
> mkdir V
> mkdir V/XQANY
> mkdir WURZAUP
> touch V/XQANY/test
> git init
> git config core.ignorecase true
> git add .
> git status
> ---8<---

Great, thanks for the test case. I can trivially replicate the endless
loop. The patch I sent earlier fixes that. So it's at least a step in
the (possible) right direction. I'm slightly concerned that there is
some other case that is expecting the directories in the main hash, but
I think I got them all.

> Note: "V/", "V/XQANY/" and "WURZAUP/" all have the same hash_name.
> Although I found those strange values by brute force, hash collisions
> in 32 bit values are not that uncommon in real life :-)

Cute. :)

> Alternatively, we could simply create normal cache_entries for the
> directories that are linked via ce->next, but have a trailing '/' in
> their name?
>
> Reference counting sounds good to me, at least better than allocating
> directory entries per cache entry * parent dirs.

I think that is more or less what my patch does, but it splits the
ref-counted fake cache_entries out into a separate hash of "struct
dir_hash_entry" rather than storing it in the regular hash. Which IMHO
is a bit cleaner for two reasons:

  1. You do not have to pay the memory price of storing fake
     cache_entries (the name+refcount struct for each directory is much
     smaller than a real cache_entry).

  2. It makes the code a bit simpler, as you do not have to do any
     "check for trailing /" magic on the result of index_name_exists to
     determine if it is a "real" name or just a fake dir entry.

-Peff

* Re: inotify to minimize stat() calls
  2013-02-13 22:55                                 ` Jeff King
@ 2013-02-14  0:48                                   ` Karsten Blees
  2013-02-27 14:45                                     ` [PATCH] name-hash.c: fix endless loop with core.ignorecase=true Karsten Blees
  0 siblings, 1 reply; 88+ messages in thread
From: Karsten Blees @ 2013-02-14  0:48 UTC (permalink / raw)
  To: Jeff King
  Cc: Duy Nguyen, kusmabite, Ramkumar Ramachandra, Robert Zeh,
	Junio C Hamano, Git List, finnag

On 13.02.2013 23:55, Jeff King wrote:
> On Wed, Feb 13, 2013 at 09:25:59PM +0100, Karsten Blees wrote:
> 
>> Alternatively, we could simply create normal cache_entries for the
>> directories that are linked via ce->next, but have a trailing '/' in
>> their name?
>>
>> Reference counting sounds good to me, at least better than allocating
>> directory entries per cache entry * parent dirs.
> 
> I think that is more or less what my patch does, but it splits the
> ref-counted fake cache_entries out into a separate hash of "struct
> dir_hash_entry" rather than storing it in the regular hash. Which IMHO
> is a bit cleaner for two reasons:
> 
>   1. You do not have to pay the memory price of storing fake
>      cache_entries (the name+refcount struct for each directory is much
>      smaller than a real cache_entry).
> 

Yes, but considering the small number of directories compared to files, I think this is a relatively small price to pay.

>   2. It makes the code a bit simpler, as you do not have to do any
>      "check for trailing /" magic on the result of index_name_exists to
>      determine if it is a "real" name or just a fake dir entry.
> 

True for dir.c. On the other hand, you need a lot of new find / find_or_create logic in name-hash.c.

Just to illustrate what I mean, here's a quick sketch (there's still a segfault somewhere, but I don't have time to debug right now...).

Note that hash_index_entry_directories works from right to left - if the immediate parent directory is there, there's no need to check the parent's parent.

cache_entry.dir points to the parent directory so that we don't need to look up all path components for reference counting when adding / removing entries.

As directory entries are 'real' cache_entries, we can reuse the existing index_name_exists and hash_index_entry code.

I feel slightly guilty for abusing ce_size as a reference counter...well :-)
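
The parent-pointer reference counting described above can also be looked
at in isolation. The following is a minimal sketch with made-up names
(toy_node, toy_attach, toy_remove); the refs field plays the role that
ce_size is borrowed for, and hashing and name lookup are left out
entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a cache_entry that knows its parent directory. */
struct toy_node {
	struct toy_node *dir;	/* parent directory node, NULL at top level */
	int refs;		/* number of live children */
	int unhashed;		/* plays the role of CE_UNHASHED */
};

/* Link child under parent and count the reference. */
static void toy_attach(struct toy_node *child, struct toy_node *parent)
{
	child->dir = parent;
	child->unhashed = 0;
	if (parent) {
		parent->unhashed = 0;
		parent->refs++;
	}
}

/* Mark n as removed; if that drops the parent's count to zero, the
 * parent is removed too, and so on up the chain. */
static void toy_remove(struct toy_node *n)
{
	n->unhashed = 1;
	if (n->dir && --n->dir->refs == 0)
		toy_remove(n->dir);
}
```

Only the immediate parent is touched on each add or remove; ancestors
further up change only when a directory entry itself appears or goes
away.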

---
 cache.h     |  4 +++-
 name-hash.c | 80 ++++++++++++++++++++++++++++---------------------------------
 2 files changed, 39 insertions(+), 45 deletions(-)

diff --git a/cache.h b/cache.h
index 665b512..2bc1372 100644
--- a/cache.h
+++ b/cache.h
@@ -131,7 +131,7 @@ struct cache_entry {
 	unsigned int ce_namelen;
 	unsigned char sha1[20];
 	struct cache_entry *next;
-	struct cache_entry *dir_next;
+	struct cache_entry *dir;
 	char name[FLEX_ARRAY]; /* more */
 };
 
@@ -285,6 +285,8 @@ extern void add_name_hash(struct index_state *istate, struct cache_entry *ce);
 static inline void remove_name_hash(struct cache_entry *ce)
 {
 	ce->ce_flags |= CE_UNHASHED;
+	if (ce->dir && !(--ce->dir->ce_size))
+		remove_name_hash(ce->dir);
 }
 
 
diff --git a/name-hash.c b/name-hash.c
index d8d25c2..01e8320 100644
--- a/name-hash.c
+++ b/name-hash.c
@@ -32,6 +32,9 @@ static unsigned int hash_name(const char *name, int namelen)
 	return hash;
 }
 
+static struct cache_entry *lookup_index_entry(struct index_state *istate, const char *name, int namelen, int icase);
+static void hash_index_entry(struct index_state *istate, struct cache_entry *ce);
+
 static void hash_index_entry_directories(struct index_state *istate, struct cache_entry *ce)
 {
 	/*
@@ -40,30 +43,25 @@ static void hash_index_entry_directories(struct index_state *istate, struct cach
 	 * closing slash.  Despite submodules being a directory, they never
 	 * reach this point, because they are stored without a closing slash
 	 * in the cache.
-	 *
-	 * Note that the cache_entry stored with the directory does not
-	 * represent the directory itself.  It is a pointer to an existing
-	 * filename, and its only purpose is to represent existence of the
-	 * directory in the cache.  It is very possible multiple directory
-	 * hash entries may point to the same cache_entry.
 	 */
-	unsigned int hash;
-	void **pos;
+	int len = ce_namelen(ce);
+	if (len && ce->name[len - 1] == '/')
+		len--;
+	while (len && ce->name[len - 1] != '/')
+		len--;
+	if (!len)
+		return;
 
-	const char *ptr = ce->name;
-	while (*ptr) {
-		while (*ptr && *ptr != '/')
-			++ptr;
-		if (*ptr == '/') {
-			++ptr;
-			hash = hash_name(ce->name, ptr - ce->name);
-			pos = insert_hash(hash, ce, &istate->name_hash);
-			if (pos) {
-				ce->dir_next = *pos;
-				*pos = ce;
-			}
-		}
+	ce->dir = lookup_index_entry(istate, ce->name, len, ignore_case);
+	if (!ce->dir) {
+		ce->dir = xcalloc(1, cache_entry_size(len));
+		memcpy(ce->dir->name, ce->name, len);
+		ce->dir->ce_namelen = len;
+		ce->dir->name[len] = 0;
+		hash_index_entry(istate, ce->dir);
 	}
+	ce->dir->ce_flags &= ~CE_UNHASHED;
+	ce->dir->ce_size++;
 }
 
 static void hash_index_entry(struct index_state *istate, struct cache_entry *ce)
@@ -74,7 +72,7 @@ static void hash_index_entry(struct index_state *istate, struct cache_entry *ce)
 	if (ce->ce_flags & CE_HASHED)
 		return;
 	ce->ce_flags |= CE_HASHED;
-	ce->next = ce->dir_next = NULL;
+	ce->next = ce->dir = NULL;
 	hash = hash_name(ce->name, ce_namelen(ce));
 	pos = insert_hash(hash, ce, &istate->name_hash);
 	if (pos) {
@@ -137,38 +135,32 @@ static int same_name(const struct cache_entry *ce, const char *name, int namelen
 	if (!icase)
 		return 0;
 
-	/*
-	 * If the entry we're comparing is a filename (no trailing slash), then compare
-	 * the lengths exactly.
-	 */
-	if (name[namelen - 1] != '/')
-		return slow_same_name(name, namelen, ce->name, len);
-
-	/*
-	 * For a directory, we point to an arbitrary cache_entry filename.  Just
-	 * make sure the directory portion matches.
-	 */
-	return slow_same_name(name, namelen, ce->name, namelen < len ? namelen : len);
+	return slow_same_name(name, namelen, ce->name, len);
 }
 
-struct cache_entry *index_name_exists(struct index_state *istate, const char *name, int namelen, int icase)
+static struct cache_entry *lookup_index_entry(struct index_state *istate, const char *name, int namelen, int icase)
 {
 	unsigned int hash = hash_name(name, namelen);
-	struct cache_entry *ce;
-
-	lazy_init_name_hash(istate);
-	ce = lookup_hash(hash, &istate->name_hash);
+	struct cache_entry *ce = lookup_hash(hash, &istate->name_hash);
 
 	while (ce) {
 		if (!(ce->ce_flags & CE_UNHASHED)) {
 			if (same_name(ce, name, namelen, icase))
 				return ce;
 		}
-		if (icase && name[namelen - 1] == '/')
-			ce = ce->dir_next;
-		else
-			ce = ce->next;
+		ce = ce->next;
 	}
+	return NULL;
+}
+
+struct cache_entry *index_name_exists(struct index_state *istate, const char *name, int namelen, int icase)
+{
+	struct cache_entry *ce;
+
+	lazy_init_name_hash(istate);
+	ce = lookup_index_entry(istate, name, namelen, icase);
+	if (ce)
+		return ce;
 
 	/*
 	 * Might be a submodule.  Despite submodules being directories,
@@ -182,7 +174,7 @@ struct cache_entry *index_name_exists(struct index_state *istate, const char *na
 	 * true.
 	 */
 	if (icase && name[namelen - 1] == '/') {
-		ce = index_name_exists(istate, name, namelen - 1, icase);
+		ce = lookup_index_entry(istate, name, namelen - 1, icase);
 		if (ce && S_ISGITLINK(ce->ce_mode))
 			return ce;
 	}

* Re: inotify to minimize stat() calls
  2013-02-10 13:26                     ` inotify to minimize stat() calls demerphq
  2013-02-10 15:35                       ` Duy Nguyen
@ 2013-02-14 14:36                       ` Magnus Bäck
  1 sibling, 0 replies; 88+ messages in thread
From: Magnus Bäck @ 2013-02-14 14:36 UTC (permalink / raw)
  To: demerphq
  Cc: Duy Nguyen, Ramkumar Ramachandra, Robert Zeh, Junio C Hamano,
	Git List, finnag

On Sunday, February 10, 2013 at 08:26 EST,
     demerphq <demerphq@gmail.com> wrote:

> Is windows stat really so slow?

Well, the problem is that there is no such thing as "Windows stat" :-)

> I encountered this perception in windows Perl in the past, and I know
> that on windows Perl stat *appears* slow compared to *nix, because in
> order to satisfy the full *nix stat interface, specifically the nlink
> field, it must open and close the file*. As of 5.10 this can be
> disabled by setting a magic var ${^WIN32_SLOPPY_STAT} to a true value,
> which makes a significant improvement to the performance of the Perl
> level stat implementation.  I would not be surprised if the cygwin
> implementation of stat() has the same issue as Perl did, and that stat
> appears much slower than it actually need be if you don't care about
> the nlink field.

I suggested a few years ago that FindFirstFile() be used to implement
stat() since it's way faster than opening and closing the file, but
FindFirstFile() apparently produces unreliable mtime results when DST
shifts are involved.

http://thread.gmane.org/gmane.comp.version-control.git/114041
(The reference link in Johannes Sixt's first email is broken, but I'm
sure the information can be dug up.)

Based on a quick look it seems GetFileAttributesEx() is still used for
mingw and cygwin Git.

-- 
Magnus Bäck
baeck@google.com

* Re: inotify to minimize stat() calls
  2013-02-08 21:10 inotify to minimize stat() calls Ramkumar Ramachandra
  2013-02-08 22:15 ` Junio C Hamano
@ 2013-02-14 15:16 ` Ævar Arnfjörð Bjarmason
  2013-02-14 16:31   ` Junio C Hamano
  2013-02-19  9:40   ` Ramkumar Ramachandra
  1 sibling, 2 replies; 88+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2013-02-14 15:16 UTC (permalink / raw)
  To: Ramkumar Ramachandra; +Cc: Git List

On Fri, Feb 8, 2013 at 10:10 PM, Ramkumar Ramachandra
<artagnon@gmail.com> wrote:
> For large repositories, many simple git commands like `git status`
> take a while to respond.  I understand that this is because of large
> number of stat() calls to figure out which files were changed.  I
> overheard that Mercurial wants to solve this problem using itnotify,
> but the idea bothers me because it's not portable.  Will Git ever
> consider using inotify on Linux?  What is the downside?

There's one relatively easy sub-task of this that I haven't seen
mentioned: Improving the speed of interactive rebase on large (as in
lots of checked out files) repositories.

That's the single biggest thing that bothers me when I use Git with
large repos, not the speed of "git status". When you "git rebase -i
HEAD~100", rearrange some patches, and save the TODO list, it takes, say,
0.5-1s for each patch to be applied, but at least 10x less than that
on a small repository. E.g. try this on linux-2.6.git vs. some small
project with a few dozen files.

I looked into this a long while ago and remember that rebase was
doing something like a git status for every commit that it made to
check the dirtiness.

This could be vastly improved by having an unsafe option to git-rebase
where it just assumes that the starting state + whatever it wrote out
is the current state, i.e. it would break if someone snuck up on your
checkout during an interactive rebase and changed a file, but the
common case of the user having exclusive access to the repo and
waiting for the rebase would be much faster.

* Re: inotify to minimize stat() calls
  2013-02-14 15:16 ` Ævar Arnfjörð Bjarmason
@ 2013-02-14 16:31   ` Junio C Hamano
  2013-02-19  9:40   ` Ramkumar Ramachandra
  1 sibling, 0 replies; 88+ messages in thread
From: Junio C Hamano @ 2013-02-14 16:31 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Ramkumar Ramachandra, Git List

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> I looked into this a long while ago and remembered that rebase was
> doing something like a git status for every commit that it made to
> check the dirtyness.
>
> This could be vastly improved by having an unsafe option to git-rebase
> where it just assumes that the starting state + whatever it wrote out
> is the current state, i.e. it would break if someone stuck up on your
> checkout during an interactive rebase and changed a file,...

You could make it a lot safer than "just assumes", and the result
may become generally usable, I think.  For example, you can set a
"magic" bit somewhere in $GIT_DIR/rebase-i while you are in "I am
doing pick/pick/pick and the user will not interfere with me" mode, and
clear that bit upon "rebase --continue".  And you cheat only while
that "magic" bit is set.

* Re: inotify to minimize stat() calls
  2013-02-14 15:16 ` Ævar Arnfjörð Bjarmason
  2013-02-14 16:31   ` Junio C Hamano
@ 2013-02-19  9:40   ` Ramkumar Ramachandra
  1 sibling, 0 replies; 88+ messages in thread
From: Ramkumar Ramachandra @ 2013-02-19  9:40 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Git List, Junio C Hamano

Ævar Arnfjörð Bjarmason wrote:
> On Fri, Feb 8, 2013 at 10:10 PM, Ramkumar Ramachandra
> <artagnon@gmail.com> wrote:
>> For large repositories, many simple git commands like `git status`
>> take a while to respond.  I understand that this is because of large
>> number of stat() calls to figure out which files were changed.  I
>> overheard that Mercurial wants to solve this problem using itnotify,
>> but the idea bothers me because it's not portable.  Will Git ever
>> consider using inotify on Linux?  What is the downside?
>
> There's one relatively easy sub-task of this that I haven't seen
> mentioned: Improving the speed of interactive rebase on large (as in
> lots of checked out files) repositories.
>
> That's the single biggest thing that bothers me when I use Git with
> large repos, not the speed of "git status". When you "git rebase -i
> HEAD~100" re-arrange some patches and save the TODO list it takes say
> 0.5-1s for each patch to be applied, but at least 10x less than that
> on a small repository. E.g. try this on linux-2.6.git v.s. some small
> project with a few dozen files.
>
> I looked into this a long while ago and remembered that rebase was
> doing something like a git status for every commit that it made to
> check the dirtyness.

What is it really doing?  I think the main culprit is
require_clean_work_tree() from git-sh-setup.sh, and that is only run
in the `--continue` and `exec` codepaths.


* Re: inotify to minimize stat() calls
  2013-02-12 20:48                         ` Karsten Blees
  2013-02-13 10:06                           ` Duy Nguyen
  2013-02-13 12:15                           ` Duy Nguyen
@ 2013-02-19  9:49                           ` Ramkumar Ramachandra
  2013-02-19 14:25                             ` Karsten Blees
  2 siblings, 1 reply; 88+ messages in thread
From: Ramkumar Ramachandra @ 2013-02-19  9:49 UTC (permalink / raw)
  To: blees; +Cc: Duy Nguyen, kusmabite, Robert Zeh, Junio C Hamano, Git List, finnag

Karsten Blees wrote:
> Am 11.02.2013 04:53, schrieb Duy Nguyen:
>> On Sun, Feb 10, 2013 at 11:58 PM, Erik Faye-Lund <kusmabite@gmail.com> wrote:
>>> Karsten Blees has done something similar-ish on Windows, and he posted
>>> the results here:
>>>
>>> https://groups.google.com/forum/#!topic/msysgit/fL_jykUmUNE/discussion
>>>
>
> The new hashtable implementation in fscache [1] supports O(1) removal and has no mingw dependencies - might come in handy for anyone trying to implement an inotify daemon.
>
> [1] https://github.com/kblees/git/commit/f7eb85c2

Thanks!  I'm cherry-picking this.  Why didn't you propose replacing
hash.{c,h} with this in git.git though?

>>> I also seem to remember him doing a ReadDirectoryChangesW version, but
>>> I don't remember what happened with that.
>>
>> Thanks. I came across that but did not remember. For one thing, we
>> know the inotify alternative for Windows: ReadDirectoryChangesW.
>>
>
> I dropped ReadDirectoryChangesW because maintaining a 'live' file system cache became more and more complicated. For example, according to MSDN docs, ReadDirectoryChangesW *may* report short DOS 8.3 names (i.e. "PROGRA~1" instead of "Program Files"), so a correct and fast cache implementation would have to be indexed by long *and* short names...
>
> Another problem was that the 'live' cache had quite negative performance impact on mutating git commands (checkout, reset...). An inotify daemon running as a background process (not in-process as fscache) will probably affect everyone that modifies the working copy, e.g. running 'make' or the test-suite. This should be considered in the design.

Yes, an external daemon would report creation of *.o files, from the
compile, for instance.  We need a way for it to be filtered at the
daemon itself, so git isn't burdened with the filtering.


* Re: inotify to minimize stat() calls
  2013-02-11 14:13                 ` Robert Zeh
@ 2013-02-19  9:57                   ` Ramkumar Ramachandra
  0 siblings, 0 replies; 88+ messages in thread
From: Ramkumar Ramachandra @ 2013-02-19  9:57 UTC (permalink / raw)
  To: Robert Zeh; +Cc: Duy Nguyen, Junio C Hamano, Git List

Robert Zeh wrote:
> On Sun, Feb 10, 2013 at 9:21 PM, Duy Nguyen <pclouds@gmail.com> wrote:
>> On Mon, Feb 11, 2013 at 2:03 AM, Robert Zeh <robert.allan.zeh@gmail.com> wrote:
>>> On Sat, Feb 9, 2013 at 1:35 PM, Junio C Hamano <gitster@pobox.com> wrote:
>>>> Ramkumar Ramachandra <artagnon@gmail.com> writes:
>>>>
>>>>> This is much better than Junio's suggestion to study possible
>>>>> implementations on all platforms and designing a generic daemon/
>>>>> communication channel.  That's no weekend project.
>>>>
>>>> It appears that you misunderstood what I wrote.  That was not "here
>>>> is a design; I want it in my system.  Go implement it".
>>>>
>>>> It was "If somebody wants to discuss it but does not know where to
>>>> begin, doing a small experiment like this and reporting how well it
>>>> worked here may be one way to do so.", nothing more.
>>>
>>> What if instead of communicating over a socket, the daemon
>>> dumped a file containing all of the lstat information after git
>>> wrote a file? By definition the daemon should know about file writes.
>>>
>>> There would be no network communication, which I think would make
>>> things more secure. It would simplify the rendezvous by insisting on
>>> well known locations in $GIT_DIR.
>>
>> We need some sort of interactive communication to the daemon anyway,
>> to validate that the information is up to date. Assume that a user makes
>> some changes to his worktree before starting the daemon, git needs to
>> know that what the daemon provides does not represent a complete
>> file-change picture and it better refreshes the index the old way
>> once, then trust the daemon.
>>
>> I think we could solve that by storing a "session id", provided by the
>> daemon, in .git/index. If the session id is not present (or does not
>> match what the current daemon gives), refresh the old way. After
>> refreshing, it may ask the daemon for new session id and store it.
>> Next time if the session id is still valid, trust the daemon's data.
>> This session id should be different every time the daemon restarts for
>> this to work.
>
> I think we could do this without interactive communication,
> if we did the following:
>    1) The Daemon waits to see $GIT_DIR/lstat_request, and atomically
>        writes out $GIT_DIR/lstat_cache.  By atomically I mean that it writes
>        things out to a temporary file, and then does a rename.
>
>    2) The client erases $GIT_DIR/lstat_cache, and writes
>       $GIT_DIR/lstat_request
>
> I think this is better than socket based communication because there
> are fewer places to check
> for failures.
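
For reference, the "temporary file, then rename" step in 1) could look
roughly like the sketch below.  It is only an illustration under simple
assumptions: the function name and the fixed ".tmp" suffix are made up, and
a real daemon would likely want a unique temporary name and an fsync()
before the rename.

```c
#include <stdio.h>

/*
 * Write data to path so that a reader sees either the old file or the
 * complete new one, never a partially written file.
 */
int write_file_atomically(const char *path, const void *data, size_t len)
{
	char tmp[4096];
	FILE *f;
	int n = snprintf(tmp, sizeof(tmp), "%s.tmp", path);

	if (n < 0 || (size_t)n >= sizeof(tmp))
		return -1;
	f = fopen(tmp, "wb");
	if (!f)
		return -1;
	if (fwrite(data, 1, len, f) != len) {
		fclose(f);
		remove(tmp);
		return -1;
	}
	if (fclose(f)) {
		remove(tmp);
		return -1;
	}
	/* rename() replaces the destination in one step on POSIX systems. */
	if (rename(tmp, path)) {
		remove(tmp);
		return -1;
	}
	return 0;
}
```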

My main problem with file-based solutions is this: how will the daemon
accumulate inotify change events over time and report them in a batch
to a git application that is spawned?  Will it append to the
.git/inotify_changes file every time there's a change?  Wouldn't you
prefer to accumulate the events in memory and report them over a socket
upon explicit request, to minimize I/O?
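
To make the in-memory alternative concrete, here is a rough sketch of a
daemon-side change log that records each changed path once and hands over
the whole batch on request.  It leaves out inotify and the socket entirely;
`change_log`, `log_change` and `drain_changes` are made-up names, and the
linear duplicate scan is only there to keep the example short.

```c
#include <stdlib.h>
#include <string.h>

struct change_log {
	char **paths;
	size_t nr, alloc;
};

/*
 * Record a changed path.  Repeated events for the same path are
 * coalesced.  Returns 1 if added, 0 if already present, -1 on error.
 */
int log_change(struct change_log *log, const char *path)
{
	size_t i;
	char *copy;

	for (i = 0; i < log->nr; i++)
		if (!strcmp(log->paths[i], path))
			return 0;
	if (log->nr == log->alloc) {
		size_t alloc = log->alloc ? 2 * log->alloc : 16;
		char **paths = realloc(log->paths, alloc * sizeof(*paths));
		if (!paths)
			return -1;
		log->paths = paths;
		log->alloc = alloc;
	}
	copy = malloc(strlen(path) + 1);
	if (!copy)
		return -1;
	strcpy(copy, path);
	log->paths[log->nr++] = copy;
	return 1;
}

/*
 * Hand the accumulated batch to the caller (who would send it to the
 * requesting git process) and start over with an empty log.  The
 * caller owns the returned array and its strings.
 */
char **drain_changes(struct change_log *log, size_t *nr)
{
	char **paths = log->paths;

	*nr = log->nr;
	log->paths = NULL;
	log->nr = log->alloc = 0;
	return paths;
}
```

With something like this, a burst of writes to the same file costs one
entry, and nothing touches the disk until a git process actually asks.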


* Re: inotify to minimize stat() calls
  2013-02-10  5:24                 ` Duy Nguyen
  2013-02-10 11:17                   ` Duy Nguyen
@ 2013-02-19 13:16                   ` Drew Northup
  2013-02-19 13:47                     ` Duy Nguyen
  1 sibling, 1 reply; 88+ messages in thread
From: Drew Northup @ 2013-02-19 13:16 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Ramkumar Ramachandra, Robert Zeh, Junio C Hamano, Git List, finnag

On Sun, Feb 10, 2013 at 12:24 AM, Duy Nguyen <pclouds@gmail.com> wrote:
> On Sun, Feb 10, 2013 at 12:10 AM, Ramkumar Ramachandra
> <artagnon@gmail.com> wrote:
>> Finn notes in the commit message that it offers no speedup, because
>> .gitignore files in every directory still have to be read.  I think
>> this is silly: we really should be caching .gitignore, and touching it
>> only when lstat() reports that the file has changed.
>> ...
>> Really, the elephant in the room right now seems to be .gitignore.
>> Until that is fixed, there is really no use of writing this inotify
>> daemon, no?  Can someone enlighten me on how exactly .gitignore files
>> are processed?
>
> .gitignore is a different issue. I think it's mainly used with
> read_directory/fill_directory to collect ignored files (or not-ignored
> files). And it's not always used (well, status and add do, but diff
> should not). I think we need to measure how much mass lstat
> elimination gains us (especially on big repos) and how much
> .gitignore/.gitattributes caching does. I don't think .gitignore has
> such a big impact though. strace on git.git tells me "git status"
> issues about 2500 lstat calls, and just 740 open+getdents calls (out
> of 3800 syscalls in total). I will think if we can do something about
> .gitignore/.gitattributes.
> --
> Duy

Duy,
Did your testing turn up anything about the amount of time spent
parsing the .gitignore/.gitattributes files? Not the syscall count,
but the actual time spent running the parser (which I presume is
largely CPU-bound). The other notable bit of information to know would
be how much time is spent applying what has been parsed out of those
files to the content of the tree. Both will give a clear signal of the
prominence of those segments of code versus others elsewhere in the
"git status" flow path. That information will tell us more clearly what,
if anything, it is worth keeping a cache of and what form that cache
should take.

-- 
-Drew Northup
--------------------------------------------------------------
"As opposed to vegetable or mineral error?"
-John Pescatore, SANS NewsBites Vol. 12 Num. 59


* Re: inotify to minimize stat() calls
  2013-02-19 13:16                   ` Drew Northup
@ 2013-02-19 13:47                     ` Duy Nguyen
  0 siblings, 0 replies; 88+ messages in thread
From: Duy Nguyen @ 2013-02-19 13:47 UTC (permalink / raw)
  To: Drew Northup
  Cc: Ramkumar Ramachandra, Robert Zeh, Junio C Hamano, Git List, finnag

On Tue, Feb 19, 2013 at 8:16 PM, Drew Northup <n1xim.email@gmail.com> wrote:
> Did your testing turn up anything about the amount of time spent
> parsing the .gitignore/.gitattributes files? Not the syscall count,
> but the actual time spent running the parser (which I presume is
> largely CPU-bound). The other notable bit of information to know would
> be how much time is spent applying what has been parsed out of those
> files to the content of the tree. Both will give a clear signal of the
> prominence of those segments of code versus others elsewhere in the
> "git status" flow path. That information will tell us more clearly what,
> if anything, it is worth keeping a cache of and what form that cache
> should take.

Not specifically parsing, but we do waste CPU on
.gitignore/.gitattributes stuff. See

http://thread.gmane.org/gmane.comp.version-control.git/216347/focus=216381

Other measurements (which led to the above patch):

http://thread.gmane.org/gmane.comp.version-control.git/215820/focus=215900
http://thread.gmane.org/gmane.comp.version-control.git/215820/focus=216029
http://thread.gmane.org/gmane.comp.version-control.git/215820/focus=216195

So far we could reduce lstat, {open,read,close}dir syscalls with the
help of inotify, which saves time. I'm not sure if we should cache the
list of untracked-but-not-ignored files. It cuts down cpu time on
.gitignore but invalidation could be complicated.
-- 
Duy


* Re: inotify to minimize stat() calls
  2013-02-19  9:49                           ` inotify to minimize stat() calls Ramkumar Ramachandra
@ 2013-02-19 14:25                             ` Karsten Blees
  0 siblings, 0 replies; 88+ messages in thread
From: Karsten Blees @ 2013-02-19 14:25 UTC (permalink / raw)
  To: Ramkumar Ramachandra
  Cc: Duy Nguyen, kusmabite, Robert Zeh, Junio C Hamano, Git List, finnag

Am 19.02.2013 10:49, schrieb Ramkumar Ramachandra:
> Karsten Blees wrote:
>> Am 11.02.2013 04:53, schrieb Duy Nguyen:
>>> On Sun, Feb 10, 2013 at 11:58 PM, Erik Faye-Lund <kusmabite@gmail.com> wrote:
>>>> Karsten Blees has done something similar-ish on Windows, and he posted
>>>> the results here:
>>>>
>>>> https://groups.google.com/forum/#!topic/msysgit/fL_jykUmUNE/discussion
>>>>
>>
>> The new hashtable implementation in fscache [1] supports O(1) removal and has no mingw dependencies - might come in handy for anyone trying to implement an inotify daemon.
>>
>> [1] https://github.com/kblees/git/commit/f7eb85c2
> 
> Thanks!  I'm cherry-picking this.  Why didn't you propose replacing
> hash.{c,h} with this in git.git though?
> 

I was planning to, but didn't find the time yet to adapt existing hash.[ch] uses to the new version, and there's not much use adding four more files of dead code. If someone else could jump in here that would be great.

Note that there's another t0007 now, so t/t0007-hashmap.sh needs to be renamed.

>>>> I also seem to remember him doing a ReadDirectoryChangesW version, but
>>>> I don't remember what happened with that.
>>>
>>> Thanks. I came across that but did not remember. For one thing, we
>>> know the inotify alternative for Windows: ReadDirectoryChangesW.
>>>
>>
>> I dropped ReadDirectoryChangesW because maintaining a 'live' file system cache became more and more complicated. For example, according to MSDN docs, ReadDirectoryChangesW *may* report short DOS 8.3 names (i.e. "PROGRA~1" instead of "Program Files"), so a correct and fast cache implementation would have to be indexed by long *and* short names...
>>
>> Another problem was that the 'live' cache had quite negative performance impact on mutating git commands (checkout, reset...). An inotify daemon running as a background process (not in-process as fscache) will probably affect everyone that modifies the working copy, e.g. running 'make' or the test-suite. This should be considered in the design.
> 
> Yes, an external daemon would report creation of *.o files, from the
> compile, for instance.  We need a way for it to be filtered at the
> daemon itself, so git isn't burdened with the filtering.
> 

...and this filtering should affect foreground processes as little as possible. For example, gaining 1 s per git-status is counter-productive if compile time increases by 10 s because the daemon re-reads .gitignore files for every new *.o.
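
A sketch of what cheap daemon-side filtering might look like, assuming the
ignore patterns are loaded once and then only matched per event.  It uses
plain fnmatch(3) on the basename, which is far simpler than real .gitignore
semantics (no negation, no directory-anchored patterns); the struct and
function names are invented for the example.

```c
#include <fnmatch.h>
#include <stddef.h>
#include <string.h>

/*
 * Patterns are loaded once, e.g. at daemon start or when a .gitignore
 * file itself changes, never once per event.
 */
struct ignore_cache {
	const char **patterns;
	size_t nr;
};

/* Return 1 if the event for this path can be dropped inside the daemon. */
int event_is_ignored(const struct ignore_cache *cache, const char *path)
{
	const char *base = strrchr(path, '/');
	size_t i;

	base = base ? base + 1 : path;
	for (i = 0; i < cache->nr; i++)
		if (!fnmatch(cache->patterns[i], base, 0))
			return 1;
	return 0;
}
```

The per-event cost is then a handful of string matches, with no file I/O on
the path that 'make' exercises thousands of times.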


* [PATCH] name-hash.c: fix endless loop with core.ignorecase=true
  2013-02-14  0:48                                   ` Karsten Blees
@ 2013-02-27 14:45                                     ` Karsten Blees
  2013-02-27 16:53                                       ` Junio C Hamano
  0 siblings, 1 reply; 88+ messages in thread
From: Karsten Blees @ 2013-02-27 14:45 UTC (permalink / raw)
  To: Jeff King
  Cc: Karsten Blees, Duy Nguyen, kusmabite, Ramkumar Ramachandra,
	Robert Zeh, Junio C Hamano, Git List, finnag

With core.ignorecase=true, name-hash.c builds a case insensitive index of
all tracked directories. Currently, the existing cache entry structures are
added multiple times to the same hashtable (with different name lengths and
hash codes). However, there's only one dir_next pointer, which gets
completely messed up in case of hash collisions. In the worst case, this
causes an endless loop if ce == ce->dir_next:

---8<---
# "V/", "V/XQANY/" and "WURZAUP/" all have the same hash_name
mkdir V
mkdir V/XQANY
mkdir WURZAUP
touch V/XQANY/test
git init
git config core.ignorecase true
git add .
git status
---8<---

Use a separate hashtable and separate structures for the directory index
so that each directory entry has its own next pointer. Use reference
counting to track which directory entry contains files.

There are only slight changes to the name-hash.c API:
- new free_name_hash() used by read_cache.c::discard_index()
- remove_name_hash() takes an additional index_state parameter
- index_name_exists() for a directory (trailing '/') may return a cache
  entry that has been removed (CE_UNHASHED). This is not a problem as the
  return value is only used to check if the directory exists (dir.c) or to
  normalize casing of directory names (read-cache.c).

Getting rid of cache_entry.dir_next reduces memory consumption, especially
with core.ignorecase=false (which doesn't use that member at all).

With core.ignorecase=true, building the directory index is slightly faster
as we add / check the parent directory first (instead of going through all
directory levels for each file in the index). E.g. with WebKit (~200k
files, ~7k dirs), time taken in lazy_init_name_hash is reduced from 176ms
to 130ms.

Signed-off-by: Karsten Blees <blees@dcon.de>
---
Also available here:
https://github.com/kblees/git/tree/kb/name-hash-fix-endless-loop
git pull git://github.com/kblees/git.git kb/name-hash-fix-endless-loop

This combines the pros of the patches suggested by Jeff and me:
- reduced memory usage due to smaller dir_entry and cache_entry structs
- faster indexing due to right-to-left directory lookup
- marginal API changes, i.e. less impact on the rest of git

Test suite runs clean on msysgit and Linux.

Have fun,
Karsten

 cache.h      |  17 ++-----
 name-hash.c  | 164 +++++++++++++++++++++++++++++++++++++++++++----------------
 read-cache.c |   9 ++--
 3 files changed, 126 insertions(+), 64 deletions(-)

diff --git a/cache.h b/cache.h
index e493563..898e346 100644
--- a/cache.h
+++ b/cache.h
@@ -131,7 +131,6 @@ struct cache_entry {
 	unsigned int ce_namelen;
 	unsigned char sha1[20];
 	struct cache_entry *next;
-	struct cache_entry *dir_next;
 	char name[FLEX_ARRAY]; /* more */
 };
 
@@ -267,25 +266,15 @@ struct index_state {
 	unsigned name_hash_initialized : 1,
 		 initialized : 1;
 	struct hash_table name_hash;
+	struct hash_table dir_hash;
 };
 
 extern struct index_state the_index;
 
 /* Name hashing */
 extern void add_name_hash(struct index_state *istate, struct cache_entry *ce);
-/*
- * We don't actually *remove* it, we can just mark it invalid so that
- * we won't find it in lookups.
- *
- * Not only would we have to search the lists (simple enough), but
- * we'd also have to rehash other hash buckets in case this makes the
- * hash bucket empty (common). So it's much better to just mark
- * it.
- */
-static inline void remove_name_hash(struct cache_entry *ce)
-{
-	ce->ce_flags |= CE_UNHASHED;
-}
+extern void remove_name_hash(struct index_state *istate, struct cache_entry *ce);
+extern void free_name_hash(struct index_state *istate);
 
 
 #ifndef NO_THE_INDEX_COMPATIBILITY_MACROS
diff --git a/name-hash.c b/name-hash.c
index 942c459..6b130e1 100644
--- a/name-hash.c
+++ b/name-hash.c
@@ -32,38 +32,75 @@ static unsigned int hash_name(const char *name, int namelen)
 	return hash;
 }
 
-static void hash_index_entry_directories(struct index_state *istate, struct cache_entry *ce)
+struct dir_entry {
+	struct dir_entry *next;
+	struct dir_entry *parent;
+	struct cache_entry *ce;
+	int nr;
+	unsigned int namelen;
+};
+
+static struct dir_entry *find_dir_entry(struct index_state *istate,
+		const char *name, unsigned int namelen)
+{
+	unsigned int hash = hash_name(name, namelen);
+	struct dir_entry *dir;
+
+	for (dir = lookup_hash(hash, &istate->dir_hash); dir; dir = dir->next)
+		if (dir->namelen == namelen &&
+		    !strncasecmp(dir->ce->name, name, namelen))
+			return dir;
+	return NULL;
+}
+
+static struct dir_entry *hash_dir_entry(struct index_state *istate,
+		struct cache_entry *ce, int namelen, int add)
 {
 	/*
 	 * Throw each directory component in the hash for quick lookup
 	 * during a git status. Directory components are stored with their
-	 * closing slash.  Despite submodules being a directory, they never
-	 * reach this point, because they are stored without a closing slash
-	 * in the cache.
-	 *
-	 * Note that the cache_entry stored with the directory does not
-	 * represent the directory itself.  It is a pointer to an existing
-	 * filename, and its only purpose is to represent existence of the
-	 * directory in the cache.  It is very possible multiple directory
-	 * hash entries may point to the same cache_entry.
+	 * closing slash.
 	 */
-	unsigned int hash;
-	void **pos;
+	struct dir_entry *dir, *p;
+
+	/* get length of parent directory */
+	while (namelen > 0 && !is_dir_sep(ce->name[namelen - 1]))
+		namelen--;
+	if (namelen <= 0)
+		return NULL;
+
+	/* lookup existing entry for that directory */
+	dir = find_dir_entry(istate, ce->name, namelen);
+	if (add && !dir) {
+		/* not found, create it and add to hash table */
+		void **pdir;
+		unsigned int hash = hash_name(ce->name, namelen);
 
-	const char *ptr = ce->name;
-	while (*ptr) {
-		while (*ptr && *ptr != '/')
-			++ptr;
-		if (*ptr == '/') {
-			++ptr;
-			hash = hash_name(ce->name, ptr - ce->name);
-			pos = insert_hash(hash, ce, &istate->name_hash);
-			if (pos) {
-				ce->dir_next = *pos;
-				*pos = ce;
-			}
+		dir = xcalloc(1, sizeof(struct dir_entry));
+		dir->namelen = namelen;
+		dir->ce = ce;
+
+		pdir = insert_hash(hash, dir, &istate->dir_hash);
+		if (pdir) {
+			dir->next = *pdir;
+			*pdir = dir;
 		}
+
+		/* recursively add missing parent directories */
+		dir->parent = hash_dir_entry(istate, ce, namelen - 1, add);
 	}
+
+	/* add or release reference to this entry (and parents if 0) */
+	p = dir;
+	if (add) {
+		while (p && !(p->nr++))
+			p = p->parent;
+	} else {
+		while (p && p->nr && !(--p->nr))
+			p = p->parent;
+	}
+
+	return dir;
 }
 
 static void hash_index_entry(struct index_state *istate, struct cache_entry *ce)
@@ -74,7 +111,7 @@ static void hash_index_entry(struct index_state *istate, struct cache_entry *ce)
 	if (ce->ce_flags & CE_HASHED)
 		return;
 	ce->ce_flags |= CE_HASHED;
-	ce->next = ce->dir_next = NULL;
+	ce->next = NULL;
 	hash = hash_name(ce->name, ce_namelen(ce));
 	pos = insert_hash(hash, ce, &istate->name_hash);
 	if (pos) {
@@ -82,8 +119,8 @@ static void hash_index_entry(struct index_state *istate, struct cache_entry *ce)
 		*pos = ce;
 	}
 
-	if (ignore_case)
-		hash_index_entry_directories(istate, ce);
+	if (ignore_case && !(ce->ce_flags & CE_UNHASHED))
+		hash_dir_entry(istate, ce, ce_namelen(ce), 1);
 }
 
 static void lazy_init_name_hash(struct index_state *istate)
@@ -99,11 +136,33 @@ static void lazy_init_name_hash(struct index_state *istate)
 
 void add_name_hash(struct index_state *istate, struct cache_entry *ce)
 {
+	/* if already hashed, add reference to directory entries */
+	if (ignore_case && (ce->ce_flags & CE_STATE_MASK) == CE_STATE_MASK)
+		hash_dir_entry(istate, ce, ce_namelen(ce), 1);
+
 	ce->ce_flags &= ~CE_UNHASHED;
 	if (istate->name_hash_initialized)
 		hash_index_entry(istate, ce);
 }
 
+/*
+ * We don't actually *remove* it, we can just mark it invalid so that
+ * we won't find it in lookups.
+ *
+ * Not only would we have to search the lists (simple enough), but
+ * we'd also have to rehash other hash buckets in case this makes the
+ * hash bucket empty (common). So it's much better to just mark
+ * it.
+ */
+void remove_name_hash(struct index_state *istate, struct cache_entry *ce)
+{
+	/* if already hashed, release reference to directory entries */
+	if (ignore_case && (ce->ce_flags & CE_STATE_MASK) == CE_HASHED)
+		hash_dir_entry(istate, ce, ce_namelen(ce), 0);
+
+	ce->ce_flags |= CE_UNHASHED;
+}
+
 static int slow_same_name(const char *name1, int len1, const char *name2, int len2)
 {
 	if (len1 != len2)
@@ -137,18 +196,7 @@ static int same_name(const struct cache_entry *ce, const char *name, int namelen
 	if (!icase)
 		return 0;
 
-	/*
-	 * If the entry we're comparing is a filename (no trailing slash), then compare
-	 * the lengths exactly.
-	 */
-	if (name[namelen - 1] != '/')
-		return slow_same_name(name, namelen, ce->name, len);
-
-	/*
-	 * For a directory, we point to an arbitrary cache_entry filename.  Just
-	 * make sure the directory portion matches.
-	 */
-	return slow_same_name(name, namelen, ce->name, namelen < len ? namelen : len);
+	return slow_same_name(name, namelen, ce->name, len);
 }
 
 struct cache_entry *index_name_exists(struct index_state *istate, const char *name, int namelen, int icase)
@@ -164,16 +212,14 @@ struct cache_entry *index_name_exists(struct index_state *istate, const char *na
 			if (same_name(ce, name, namelen, icase))
 				return ce;
 		}
-		if (icase && name[namelen - 1] == '/')
-			ce = ce->dir_next;
-		else
-			ce = ce->next;
+		ce = ce->next;
 	}
 
 	/*
-	 * Might be a submodule.  Despite submodules being directories,
+	 * When looking for a directory (trailing '/'), it might be a
+	 * submodule or a directory. Despite submodules being directories,
 	 * they are stored in the name hash without a closing slash.
-	 * When ignore_case is 1, directories are stored in the name hash
+	 * When ignore_case is 1, directories are stored in a separate hash
 	 * with their closing slash.
 	 *
 	 * The side effect of this storage technique is we have need to
@@ -182,9 +228,37 @@ struct cache_entry *index_name_exists(struct index_state *istate, const char *na
 	 * true.
 	 */
 	if (icase && name[namelen - 1] == '/') {
+		struct dir_entry *dir = find_dir_entry(istate, name, namelen);
+		if (dir && dir->nr)
+			return dir->ce;
+
 		ce = index_name_exists(istate, name, namelen - 1, icase);
 		if (ce && S_ISGITLINK(ce->ce_mode))
 			return ce;
 	}
 	return NULL;
 }
+
+static int free_dir_entry(void *entry, void *unused)
+{
+	struct dir_entry *dir = entry;
+	while (dir) {
+		struct dir_entry *next = dir->next;
+		free(dir);
+		dir = next;
+	}
+	return 0;
+}
+
+void free_name_hash(struct index_state *istate)
+{
+	if (!istate->name_hash_initialized)
+		return;
+	istate->name_hash_initialized = 0;
+	if (ignore_case)
+		/* free directory entries */
+		for_each_hash(&istate->dir_hash, free_dir_entry, NULL);
+
+	free_hash(&istate->name_hash);
+	free_hash(&istate->dir_hash);
+}
diff --git a/read-cache.c b/read-cache.c
index 827ae55..47eb9d8 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -46,7 +46,7 @@ static void replace_index_entry(struct index_state *istate, int nr, struct cache
 {
 	struct cache_entry *old = istate->cache[nr];
 
-	remove_name_hash(old);
+	remove_name_hash(istate, old);
 	set_index_entry(istate, nr, ce);
 	istate->cache_changed = 1;
 }
@@ -460,7 +460,7 @@ int remove_index_entry_at(struct index_state *istate, int pos)
 	struct cache_entry *ce = istate->cache[pos];
 
 	record_resolve_undo(istate, ce);
-	remove_name_hash(ce);
+	remove_name_hash(istate, ce);
 	istate->cache_changed = 1;
 	istate->cache_nr--;
 	if (pos >= istate->cache_nr)
@@ -483,7 +483,7 @@ void remove_marked_cache_entries(struct index_state *istate)
 
 	for (i = j = 0; i < istate->cache_nr; i++) {
 		if (ce_array[i]->ce_flags & CE_REMOVE)
-			remove_name_hash(ce_array[i]);
+			remove_name_hash(istate, ce_array[i]);
 		else
 			ce_array[j++] = ce_array[i];
 	}
@@ -1515,8 +1515,7 @@ int discard_index(struct index_state *istate)
 	istate->cache_changed = 0;
 	istate->timestamp.sec = 0;
 	istate->timestamp.nsec = 0;
-	istate->name_hash_initialized = 0;
-	free_hash(&istate->name_hash);
+	free_name_hash(istate);
 	cache_tree_free(&(istate->cache_tree));
 	istate->initialized = 0;
 
-- 
1.8.1.2.7986.g6e98809.dirty


* Re: [PATCH] name-hash.c: fix endless loop with core.ignorecase=true
  2013-02-27 14:45                                     ` [PATCH] name-hash.c: fix endless loop with core.ignorecase=true Karsten Blees
@ 2013-02-27 16:53                                       ` Junio C Hamano
  2013-02-27 21:52                                         ` Karsten Blees
  0 siblings, 1 reply; 88+ messages in thread
From: Junio C Hamano @ 2013-02-27 16:53 UTC (permalink / raw)
  To: Karsten Blees
  Cc: Jeff King, Duy Nguyen, kusmabite, Ramkumar Ramachandra,
	Robert Zeh, Git List, finnag

Karsten Blees <karsten.blees@gmail.com> writes:

> With core.ignorecase=true, name-hash.c builds a case insensitive index of
> all tracked directories. Currently, the existing cache entry structures are
> added multiple times to the same hashtable (with different name lengths and
> hash codes). However, there's only one dir_next pointer, which gets
> completely messed up in case of hash collisions. In the worst case, this
> causes an endless loop if ce == ce->dir_next:
>
> ---8<---
> # "V/", "V/XQANY/" and "WURZAUP/" all have the same hash_name
> mkdir V
> mkdir V/XQANY
> mkdir WURZAUP
> touch V/XQANY/test
> git init
> git config core.ignorecase true
> git add .
> git status
> ---8<---

Instead of using the scissors mark to confuse "am -c", indenting
this block would have been easier on later readers.

Also it is somewhat a shame that we do not use the above sample
collisions in a new test case.
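
For anyone wondering why the collision turns into a loop rather than just a
wrong lookup, the failure can be shown in isolation.  The following is a
simplified stand-in, not the name-hash.c code: it assumes a bucket is just a
head pointer and that the same entry is linked in twice through its single
chain pointer, as happens when "V/" and "V/XQANY/" hash alike and both are
represented by the cache entry for V/XQANY/test.

```c
#include <stddef.h>

struct entry {
	const char *name;
	struct entry *dir_next;	/* the one and only chain pointer */
};

/* Prepend e to the chain at *bucket, as the old code did on a collision. */
void chain_insert(struct entry **bucket, struct entry *e)
{
	e->dir_next = *bucket;
	*bucket = e;
}

/* Walk at most limit links; return 1 if the chain ends, 0 if it does not. */
int chain_terminates(const struct entry *e, int limit)
{
	while (e && limit-- > 0)
		e = e->dir_next;
	return e == NULL;
}
```

Inserting the same entry a second time makes `e->dir_next` point at `e`
itself, which is exactly the `ce == ce->dir_next` condition from the commit
message.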

> +static struct dir_entry *hash_dir_entry(struct index_state *istate,
> +		struct cache_entry *ce, int namelen, int add)
>  {
>  	/*
>  	 * Throw each directory component in the hash for quick lookup
>  	 * during a git status. Directory components are stored with their
> -	 * closing slash.  Despite submodules being a directory, they never
> -	 * reach this point, because they are stored without a closing slash
> -	 * in the cache.

Is the description of submodule no longer relevant?

> -	 * Note that the cache_entry stored with the directory does not
> -	 * represent the directory itself.  It is a pointer to an existing
> -	 * filename, and its only purpose is to represent existence of the
> -	 * directory in the cache.  It is very possible multiple directory
> -	 * hash entries may point to the same cache_entry.

Is this paragraph no longer relevant?  It seems to me that it still
holds true, given how dir->ce points at the given ce.

> +	 * closing slash.
>  	 */
> +	struct dir_entry *dir, *p;
> +
> +	/* get length of parent directory */
> +	while (namelen > 0 && !is_dir_sep(ce->name[namelen - 1]))
> +		namelen--;
> +	if (namelen <= 0)
> +		return NULL;
> +
> +	/* lookup existing entry for that directory */
> +	dir = find_dir_entry(istate, ce->name, namelen);
> +	if (add && !dir) {
> ...
>  	}
> +
> +	/* add or release reference to this entry (and parents if 0) */
> +	p = dir;
> +	if (add) {
> +		while (p && !(p->nr++))
> +			p = p->parent;
> +	} else {
> +		while (p && p->nr && !(--p->nr))
> +			p = p->parent;
> +	}

Can we free the entry when its refcnt goes down to zero?  If yes, is
it worth doing so?
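
To make the counting easier to follow, here are the two loops above lifted
into a standalone sketch.  The struct is trimmed down and the function names
are mine, not the patch's.  As I read it, nr counts the files directly
inside a directory plus the subdirectories that are themselves in use, and
an entry whose count drops to zero stays allocated; lookups simply treat it
as absent via the `dir && dir->nr` check.

```c
#include <stddef.h>

struct dir {
	struct dir *parent;
	int nr;	/* files directly inside + subdirectories in use */
};

/* A directory that just came into use takes one reference on its parent. */
void dir_ref(struct dir *d)
{
	while (d && !(d->nr++))
		d = d->parent;
}

/* A directory that just became unused drops its reference on the parent. */
void dir_unref(struct dir *d)
{
	while (d && d->nr && !(--d->nr))
		d = d->parent;
}
```

Freeing at zero would additionally require unlinking the entry from its
hash chain, which this sketch (like the patch) does not attempt.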

> +
> +	return dir;
>  }
>  
>  static void hash_index_entry(struct index_state *istate, struct cache_entry *ce)
> @@ -74,7 +111,7 @@ static void hash_index_entry(struct index_state *istate, struct cache_entry *ce)
>  	if (ce->ce_flags & CE_HASHED)
>  		return;
>  	ce->ce_flags |= CE_HASHED;
> -	ce->next = ce->dir_next = NULL;
> +	ce->next = NULL;
>  	hash = hash_name(ce->name, ce_namelen(ce));
>  	pos = insert_hash(hash, ce, &istate->name_hash);
>  	if (pos) {
> @@ -82,8 +119,8 @@ static void hash_index_entry(struct index_state *istate, struct cache_entry *ce)
>  		*pos = ce;
>  	}
>  
> -	if (ignore_case)
> -		hash_index_entry_directories(istate, ce);
> +	if (ignore_case && !(ce->ce_flags & CE_UNHASHED))
> +		hash_dir_entry(istate, ce, ce_namelen(ce), 1);
>  }
>  
>  static void lazy_init_name_hash(struct index_state *istate)
> @@ -99,11 +136,33 @@ static void lazy_init_name_hash(struct index_state *istate)
>  
>  void add_name_hash(struct index_state *istate, struct cache_entry *ce)
>  {
> +	/* if already hashed, add reference to directory entries */
> +	if (ignore_case && (ce->ce_flags & CE_STATE_MASK) == CE_STATE_MASK)
> +		hash_dir_entry(istate, ce, ce_namelen(ce), 1);

Instead of a single function with "are we adding or removing?"
parameter, it would be a lot easier to read the callers if they are
wrapped in two helpers, add_dir_entry() and del_dir_entry() or
something, especially when the add=[0|1] parameter is constant for
each and every callsite (i.e. the codeflow determines it, not the
data).

>  	ce->ce_flags &= ~CE_UNHASHED;
>  	if (istate->name_hash_initialized)
>  		hash_index_entry(istate, ce);
>  }
>  
> +/*
> + * We don't actually *remove* it, we can just mark it invalid so that
> + * we won't find it in lookups.
> + *
> + * Not only would we have to search the lists (simple enough), but
> + * we'd also have to rehash other hash buckets in case this makes the
> + * hash bucket empty (common). So it's much better to just mark
> + * it.
> + */
> +void remove_name_hash(struct index_state *istate, struct cache_entry *ce)
> +{
> +	/* if already hashed, release reference to directory entries */
> +	if (ignore_case && (ce->ce_flags & CE_STATE_MASK) == CE_HASHED)
> +		hash_dir_entry(istate, ce, ce_namelen(ce), 0);

And here as well.

> +
> +	ce->ce_flags |= CE_UNHASHED;
> +}
> +
>  static int slow_same_name(const char *name1, int len1, const char *name2, int len2)
>  {
>  	if (len1 != len2)
> @@ -137,18 +196,7 @@ static int same_name(const struct cache_entry *ce, const char *name, int namelen
>  	if (!icase)
>  		return 0;
>  
> -	/*
> -	 * If the entry we're comparing is a filename (no trailing slash), then compare
> -	 * the lengths exactly.
> -	 */
> -	if (name[namelen - 1] != '/')
> -		return slow_same_name(name, namelen, ce->name, len);
> -
> -	/*
> -	 * For a directory, we point to an arbitrary cache_entry filename.  Just
> -	 * make sure the directory portion matches.
> -	 */
> -	return slow_same_name(name, namelen, ce->name, namelen < len ? namelen : len);
> +	return slow_same_name(name, namelen, ce->name, len);

Hmph, what is this change about?  Nobody calls same_name() with a
directory name anymore or something?

Thanks.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH] name-hash.c: fix endless loop with core.ignorecase=true
  2013-02-27 16:53                                       ` Junio C Hamano
@ 2013-02-27 21:52                                         ` Karsten Blees
  2013-02-27 23:57                                           ` [PATCH v2] " Karsten Blees
  0 siblings, 1 reply; 88+ messages in thread
From: Karsten Blees @ 2013-02-27 21:52 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Duy Nguyen, kusmabite, Ramkumar Ramachandra,
	Robert Zeh, Git List, finnag

Am 27.02.2013 17:53, schrieb Junio C Hamano:
> Karsten Blees <karsten.blees@gmail.com> writes:
> 
>> With core.ignorecase=true, name-hash.c builds a case insensitive index of
>> all tracked directories. Currently, the existing cache entry structures are
>> added multiple times to the same hashtable (with different name lengths and
>> hash codes). However, there's only one dir_next pointer, which gets
>> completely messed up in case of hash collisions. In the worst case, this
>> causes an endless loop if ce == ce->dir_next:
>>
>> ---8<---
>> # "V/", "V/XQANY/" and "WURZAUP/" all have the same hash_name
>> mkdir V
>> mkdir V/XQANY
>> mkdir WURZAUP
>> touch V/XQANY/test
>> git init
>> git config core.ignorecase true
>> git add .
>> git status
>> ---8<---
> 
> Instead of using the scissors mark to confuse "am -c", indenting
> this block would have been easier to later readers.
> 
> Also it is somewhat a shame that we do not use the above sample
> collisions in a new test case.
> 

Duly noted.

Is there a way to run 'git status' with a timeout? A test that doesn't complete (instead of failing) isn't nice...

>> +static struct dir_entry *hash_dir_entry(struct index_state *istate,
>> +		struct cache_entry *ce, int namelen, int add)
>>  {
>>  	/*
>>  	 * Throw each directory component in the hash for quick lookup
>>  	 * during a git status. Directory components are stored with their
>> -	 * closing slash.  Despite submodules being a directory, they never
>> -	 * reach this point, because they are stored without a closing slash
>> -	 * in the cache.
> 
> Is the description of submodule no longer relevant?
> 
>> -	 * Note that the cache_entry stored with the directory does not
>> -	 * represent the directory itself.  It is a pointer to an existing
>> -	 * filename, and its only purpose is to represent existence of the
>> -	 * directory in the cache.  It is very possible multiple directory
>> -	 * hash entries may point to the same cache_entry.
> 
> Is this paragraph no longer relevant?  It seems to me that it still
> holds true, given the way how dir->ce points at the given ce.
> 

I interpreted this as an explanation of why it was safe to add the same cache_entry to the same name_hash multiple times. Now that we have separate dir_entries and index_state.dir_hash, that's no longer a problem. But rereading that paragraph, it is still mostly true (except for the 'existence' part, which is solved by reference counting).

>> +	 * closing slash.
>>  	 */
>> +	struct dir_entry *dir, *p;
>> +
>> +	/* get length of parent directory */
>> +	while (namelen > 0 && !is_dir_sep(ce->name[namelen - 1]))
>> +		namelen--;
>> +	if (namelen <= 0)
>> +		return NULL;
>> +
>> +	/* lookup existing entry for that directory */
>> +	dir = find_dir_entry(istate, ce->name, namelen);
>> +	if (add && !dir) {
>> ...
>>  	}
>> +
>> +	/* add or release reference to this entry (and parents if 0) */
>> +	p = dir;
>> +	if (add) {
>> +		while (p && !(p->nr++))
>> +			p = p->parent;
>> +	} else {
>> +		while (p && p->nr && !(--p->nr))
>> +			p = p->parent;
>> +	}
> 
> Can we free the entry when its refcnt goes down to zero?  If yes, is
> it worth doing so?
> 

There's no remove_hash in hash.[ch], and dir_entry.next may point to another dir_entry with the same hash code, so we must not free the memory (same problem as CE_UNHASHED).
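
To illustrate the counting scheme in isolation, here is a minimal sketch. It is not the code from the patch: struct dir_node and the dir_node_get() / dir_node_put() / dir_node_is_active() helpers are hypothetical names, and the hash table is left out entirely. A reference is propagated to the parent only when a count goes from 0 to 1 (or from 1 to 0), and an entry whose count reaches zero stays linked in its bucket chain.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the patch's struct dir_entry. */
struct dir_node {
	struct dir_node *next;   /* next entry in the same hash bucket */
	struct dir_node *parent; /* enclosing directory, NULL at top level */
	int nr;                  /* active files plus active subdirectories */
};

/* Take a reference; walk up only on a 0 -> 1 transition. */
static void dir_node_get(struct dir_node *dir)
{
	while (dir && !(dir->nr++))
		dir = dir->parent;
}

/* Drop a reference; walk up only on a 1 -> 0 transition. */
static void dir_node_put(struct dir_node *dir)
{
	while (dir && dir->nr && !(--dir->nr))
		dir = dir->parent;
}

/*
 * Entries are never unlinked or freed individually, so a lookup has
 * to check the count instead of relying on the entry being absent.
 */
static int dir_node_is_active(const struct dir_node *dir)
{
	return dir && dir->nr > 0;
}
```

Because a dead entry is only marked by nr == 0, the next pointers of its neighbours in the bucket stay valid, which is the same trade-off as CE_UNHASHED for cache entries.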

>> +
>> +	return dir;
>>  }
>>  
>>  static void hash_index_entry(struct index_state *istate, struct cache_entry *ce)
>> @@ -74,7 +111,7 @@ static void hash_index_entry(struct index_state *istate, struct cache_entry *ce)
>>  	if (ce->ce_flags & CE_HASHED)
>>  		return;
>>  	ce->ce_flags |= CE_HASHED;
>> -	ce->next = ce->dir_next = NULL;
>> +	ce->next = NULL;
>>  	hash = hash_name(ce->name, ce_namelen(ce));
>>  	pos = insert_hash(hash, ce, &istate->name_hash);
>>  	if (pos) {
>> @@ -82,8 +119,8 @@ static void hash_index_entry(struct index_state *istate, struct cache_entry *ce)
>>  		*pos = ce;
>>  	}
>>  
>> -	if (ignore_case)
>> -		hash_index_entry_directories(istate, ce);
>> +	if (ignore_case && !(ce->ce_flags & CE_UNHASHED))
>> +		hash_dir_entry(istate, ce, ce_namelen(ce), 1);
>>  }
>>  
>>  static void lazy_init_name_hash(struct index_state *istate)
>> @@ -99,11 +136,33 @@ static void lazy_init_name_hash(struct index_state *istate)
>>  
>>  void add_name_hash(struct index_state *istate, struct cache_entry *ce)
>>  {
>> +	/* if already hashed, add reference to directory entries */
>> +	if (ignore_case && (ce->ce_flags & CE_STATE_MASK) == CE_STATE_MASK)
>> +		hash_dir_entry(istate, ce, ce_namelen(ce), 1);
> 
> Instead of a single function with "are we adding or removing?"
> parameter, it would be a lot easier to read the callers if they are
> wrapped in two helpers, add_dir_entry() and del_dir_entry() or
> something, especially when the add=[0|1] parameter is constant for
> each and every callsite (i.e. the codeflow determines it, not the
> data).
> 

OK

>>  	ce->ce_flags &= ~CE_UNHASHED;
>>  	if (istate->name_hash_initialized)
>>  		hash_index_entry(istate, ce);
>>  }
>>  
>> +/*
>> + * We don't actually *remove* it, we can just mark it invalid so that
>> + * we won't find it in lookups.
>> + *
>> + * Not only would we have to search the lists (simple enough), but
>> + * we'd also have to rehash other hash buckets in case this makes the
>> + * hash bucket empty (common). So it's much better to just mark
>> + * it.
>> + */
>> +void remove_name_hash(struct index_state *istate, struct cache_entry *ce)
>> +{
>> +	/* if already hashed, release reference to directory entries */
>> +	if (ignore_case && (ce->ce_flags & CE_STATE_MASK) == CE_HASHED)
>> +		hash_dir_entry(istate, ce, ce_namelen(ce), 0);
> 
> And here as well.
> 
>> +
>> +	ce->ce_flags |= CE_UNHASHED;
>> +}
>> +
>>  static int slow_same_name(const char *name1, int len1, const char *name2, int len2)
>>  {
>>  	if (len1 != len2)
>> @@ -137,18 +196,7 @@ static int same_name(const struct cache_entry *ce, const char *name, int namelen
>>  	if (!icase)
>>  		return 0;
>>  
>> -	/*
>> -	 * If the entry we're comparing is a filename (no trailing slash), then compare
>> -	 * the lengths exactly.
>> -	 */
>> -	if (name[namelen - 1] != '/')
>> -		return slow_same_name(name, namelen, ce->name, len);
>> -
>> -	/*
>> -	 * For a directory, we point to an arbitrary cache_entry filename.  Just
>> -	 * make sure the directory portion matches.
>> -	 */
>> -	return slow_same_name(name, namelen, ce->name, namelen < len ? namelen : len);
>> +	return slow_same_name(name, namelen, ce->name, len);
> 
> Hmph, what is this change about?  Nobody calls same_name() with a
> directory name anymore or something?
> 

dir_entries (with trailing /) are in index_state.dir_hash, so we wouldn't expect to find anything in index_state.name_hash, especially not a cache_entry. find_dir_entry simply uses strncasecmp, as we only do directory indexing with core.ignorecase=true.
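
As a rough sketch of that lookup (again with hypothetical names, struct dir_name and find_dir_name(), and with the bucket passed in directly instead of going through lookup_hash()): the stored name is borrowed from a longer path, so the comparison must be bounded by the stored length, and the lengths must match exactly before strncasecmp() is called.

```c
#include <stddef.h>
#include <strings.h> /* strncasecmp() (POSIX) */

/* Hypothetical stand-in for a directory entry in one hash bucket. */
struct dir_name {
	struct dir_name *next; /* next entry with the same hash code */
	const char *name;      /* borrowed path; only namelen bytes count */
	size_t namelen;        /* length including the trailing '/' */
};

/* Walk one bucket chain and compare directory names ignoring case. */
static struct dir_name *find_dir_name(struct dir_name *bucket,
		const char *name, size_t namelen)
{
	struct dir_name *dir;

	for (dir = bucket; dir; dir = dir->next)
		if (dir->namelen == namelen &&
		    !strncasecmp(dir->name, name, namelen))
			return dir;
	return NULL;
}
```

The exact length check is what lets colliding names such as "V/" and "V/XQANY/" share a bucket without being mistaken for each other.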

> Thanks.
> 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* [PATCH v2] name-hash.c: fix endless loop with core.ignorecase=true
  2013-02-27 21:52                                         ` Karsten Blees
@ 2013-02-27 23:57                                           ` Karsten Blees
  2013-02-28  0:27                                             ` Junio C Hamano
  0 siblings, 1 reply; 88+ messages in thread
From: Karsten Blees @ 2013-02-27 23:57 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Duy Nguyen, kusmabite, Ramkumar Ramachandra,
	Robert Zeh, Git List, finnag

With core.ignorecase=true, name-hash.c builds a case insensitive index of
all tracked directories. Currently, the existing cache entry structures are
added multiple times to the same hashtable (with different name lengths and
hash codes). However, there's only one dir_next pointer, which gets
completely messed up in case of hash collisions. In the worst case, this
causes an endless loop if ce == ce->dir_next (see t7062).

Use a separate hashtable and separate structures for the directory index
so that each directory entry has its own next pointer. Use reference
counting to track which directory entry contains files.

There are only slight changes to the name-hash.c API:
- new free_name_hash() used by read-cache.c::discard_index()
- remove_name_hash() takes an additional index_state parameter
- index_name_exists() for a directory (trailing '/') may return a cache
  entry that has been removed (CE_UNHASHED). This is not a problem as the
  return value is only used to check if the directory exists (dir.c) or to
  normalize casing of directory names (read-cache.c).

Getting rid of cache_entry.dir_next reduces memory consumption, especially
with core.ignorecase=false (which doesn't use that member at all).

With core.ignorecase=true, building the directory index is slightly faster
as we add / check the parent directory first (instead of going through all
directory levels for each file in the index). E.g. with WebKit (~200k
files, ~7k dirs), time spent in lazy_init_name_hash is reduced from 176ms
to 130ms.

Signed-off-by: Karsten Blees <blees@dcon.de>
---
Also available here:
https://github.com/kblees/git/tree/kb/name-hash-fix-endless-loop-v2
git pull git://github.com/kblees/git.git kb/name-hash-fix-endless-loop-v2

 cache.h                        |  17 +---
 name-hash.c                    | 182 +++++++++++++++++++++++++++++++----------
 read-cache.c                   |   9 +-
 t/t7062-wtstatus-ignorecase.sh |  20 +++++
 4 files changed, 166 insertions(+), 62 deletions(-)
 create mode 100755 t/t7062-wtstatus-ignorecase.sh

diff --git a/cache.h b/cache.h
index e493563..898e346 100644
--- a/cache.h
+++ b/cache.h
@@ -131,7 +131,6 @@ struct cache_entry {
 	unsigned int ce_namelen;
 	unsigned char sha1[20];
 	struct cache_entry *next;
-	struct cache_entry *dir_next;
 	char name[FLEX_ARRAY]; /* more */
 };
 
@@ -267,25 +266,15 @@ struct index_state {
 	unsigned name_hash_initialized : 1,
 		 initialized : 1;
 	struct hash_table name_hash;
+	struct hash_table dir_hash;
 };
 
 extern struct index_state the_index;
 
 /* Name hashing */
 extern void add_name_hash(struct index_state *istate, struct cache_entry *ce);
-/*
- * We don't actually *remove* it, we can just mark it invalid so that
- * we won't find it in lookups.
- *
- * Not only would we have to search the lists (simple enough), but
- * we'd also have to rehash other hash buckets in case this makes the
- * hash bucket empty (common). So it's much better to just mark
- * it.
- */
-static inline void remove_name_hash(struct cache_entry *ce)
-{
-	ce->ce_flags |= CE_UNHASHED;
-}
+extern void remove_name_hash(struct index_state *istate, struct cache_entry *ce);
+extern void free_name_hash(struct index_state *istate);
 
 
 #ifndef NO_THE_INDEX_COMPATIBILITY_MACROS
diff --git a/name-hash.c b/name-hash.c
index 942c459..6d7e198 100644
--- a/name-hash.c
+++ b/name-hash.c
@@ -32,38 +32,96 @@ static unsigned int hash_name(const char *name, int namelen)
 	return hash;
 }
 
-static void hash_index_entry_directories(struct index_state *istate, struct cache_entry *ce)
+struct dir_entry {
+	struct dir_entry *next;
+	struct dir_entry *parent;
+	struct cache_entry *ce;
+	int nr;
+	unsigned int namelen;
+};
+
+static struct dir_entry *find_dir_entry(struct index_state *istate,
+		const char *name, unsigned int namelen)
+{
+	unsigned int hash = hash_name(name, namelen);
+	struct dir_entry *dir;
+
+	for (dir = lookup_hash(hash, &istate->dir_hash); dir; dir = dir->next)
+		if (dir->namelen == namelen &&
+		    !strncasecmp(dir->ce->name, name, namelen))
+			return dir;
+	return NULL;
+}
+
+static struct dir_entry *hash_dir_entry(struct index_state *istate,
+		struct cache_entry *ce, int namelen)
 {
 	/*
 	 * Throw each directory component in the hash for quick lookup
 	 * during a git status. Directory components are stored with their
 	 * closing slash.  Despite submodules being a directory, they never
 	 * reach this point, because they are stored without a closing slash
-	 * in the cache.
+	 * in index_state.name_hash (as ordinary cache_entries).
 	 *
-	 * Note that the cache_entry stored with the directory does not
-	 * represent the directory itself.  It is a pointer to an existing
-	 * filename, and its only purpose is to represent existence of the
-	 * directory in the cache.  It is very possible multiple directory
-	 * hash entries may point to the same cache_entry.
+	 * Note that the cache_entry stored with the dir_entry merely
+	 * supplies the name of the directory (up to dir_entry.namelen). We
+	 * track the number of 'active' files in a directory in dir_entry.nr,
+	 * so we can tell if the directory is still relevant, e.g. for git
+	 * status. However, if cache_entries are removed, we cannot pinpoint
+	 * an exact cache_entry that's still active. It is very possible that
+	 * multiple dir_entries point to the same cache_entry.
 	 */
-	unsigned int hash;
-	void **pos;
+	struct dir_entry *dir;
+
+	/* get length of parent directory */
+	while (namelen > 0 && !is_dir_sep(ce->name[namelen - 1]))
+		namelen--;
+	if (namelen <= 0)
+		return NULL;
+
+	/* lookup existing entry for that directory */
+	dir = find_dir_entry(istate, ce->name, namelen);
+	if (!dir) {
+		/* not found, create it and add to hash table */
+		void **pdir;
+		unsigned int hash = hash_name(ce->name, namelen);
 
-	const char *ptr = ce->name;
-	while (*ptr) {
-		while (*ptr && *ptr != '/')
-			++ptr;
-		if (*ptr == '/') {
-			++ptr;
-			hash = hash_name(ce->name, ptr - ce->name);
-			pos = insert_hash(hash, ce, &istate->name_hash);
-			if (pos) {
-				ce->dir_next = *pos;
-				*pos = ce;
-			}
+		dir = xcalloc(1, sizeof(struct dir_entry));
+		dir->namelen = namelen;
+		dir->ce = ce;
+
+		pdir = insert_hash(hash, dir, &istate->dir_hash);
+		if (pdir) {
+			dir->next = *pdir;
+			*pdir = dir;
 		}
+
+		/* recursively add missing parent directories */
+		dir->parent = hash_dir_entry(istate, ce, namelen - 1);
 	}
+	return dir;
+}
+
+static void add_dir_entry(struct index_state *istate, struct cache_entry *ce)
+{
+	/* Add reference to the directory entry (and parents if 0). */
+	struct dir_entry *dir = hash_dir_entry(istate, ce, ce_namelen(ce));
+	while (dir && !(dir->nr++))
+		dir = dir->parent;
+}
+
+static void remove_dir_entry(struct index_state *istate, struct cache_entry *ce)
+{
+	/*
+	 * Release reference to the directory entry (and parents if 0).
+	 *
+	 * Note: we do not remove / free the entry because there's no
+	 * hash.[ch]::remove_hash and dir->next may point to other entries
+	 * that are still valid, so we must not free the memory.
+	 */
+	struct dir_entry *dir = hash_dir_entry(istate, ce, ce_namelen(ce));
+	while (dir && dir->nr && !(--dir->nr))
+		dir = dir->parent;
 }
 
 static void hash_index_entry(struct index_state *istate, struct cache_entry *ce)
@@ -74,7 +132,7 @@ static void hash_index_entry(struct index_state *istate, struct cache_entry *ce)
 	if (ce->ce_flags & CE_HASHED)
 		return;
 	ce->ce_flags |= CE_HASHED;
-	ce->next = ce->dir_next = NULL;
+	ce->next = NULL;
 	hash = hash_name(ce->name, ce_namelen(ce));
 	pos = insert_hash(hash, ce, &istate->name_hash);
 	if (pos) {
@@ -82,8 +140,8 @@ static void hash_index_entry(struct index_state *istate, struct cache_entry *ce)
 		*pos = ce;
 	}
 
-	if (ignore_case)
-		hash_index_entry_directories(istate, ce);
+	if (ignore_case && !(ce->ce_flags & CE_UNHASHED))
+		add_dir_entry(istate, ce);
 }
 
 static void lazy_init_name_hash(struct index_state *istate)
@@ -99,11 +157,33 @@ static void lazy_init_name_hash(struct index_state *istate)
 
 void add_name_hash(struct index_state *istate, struct cache_entry *ce)
 {
+	/* if already hashed, add reference to directory entries */
+	if (ignore_case && (ce->ce_flags & CE_STATE_MASK) == CE_STATE_MASK)
+		add_dir_entry(istate, ce);
+
 	ce->ce_flags &= ~CE_UNHASHED;
 	if (istate->name_hash_initialized)
 		hash_index_entry(istate, ce);
 }
 
+/*
+ * We don't actually *remove* it, we can just mark it invalid so that
+ * we won't find it in lookups.
+ *
+ * Not only would we have to search the lists (simple enough), but
+ * we'd also have to rehash other hash buckets in case this makes the
+ * hash bucket empty (common). So it's much better to just mark
+ * it.
+ */
+void remove_name_hash(struct index_state *istate, struct cache_entry *ce)
+{
+	/* if already hashed, release reference to directory entries */
+	if (ignore_case && (ce->ce_flags & CE_STATE_MASK) == CE_HASHED)
+		remove_dir_entry(istate, ce);
+
+	ce->ce_flags |= CE_UNHASHED;
+}
+
 static int slow_same_name(const char *name1, int len1, const char *name2, int len2)
 {
 	if (len1 != len2)
@@ -137,18 +217,7 @@ static int same_name(const struct cache_entry *ce, const char *name, int namelen
 	if (!icase)
 		return 0;
 
-	/*
-	 * If the entry we're comparing is a filename (no trailing slash), then compare
-	 * the lengths exactly.
-	 */
-	if (name[namelen - 1] != '/')
-		return slow_same_name(name, namelen, ce->name, len);
-
-	/*
-	 * For a directory, we point to an arbitrary cache_entry filename.  Just
-	 * make sure the directory portion matches.
-	 */
-	return slow_same_name(name, namelen, ce->name, namelen < len ? namelen : len);
+	return slow_same_name(name, namelen, ce->name, len);
 }
 
 struct cache_entry *index_name_exists(struct index_state *istate, const char *name, int namelen, int icase)
@@ -164,27 +233,54 @@ struct cache_entry *index_name_exists(struct index_state *istate, const char *na
 			if (same_name(ce, name, namelen, icase))
 				return ce;
 		}
-		if (icase && name[namelen - 1] == '/')
-			ce = ce->dir_next;
-		else
-			ce = ce->next;
+		ce = ce->next;
 	}
 
 	/*
-	 * Might be a submodule.  Despite submodules being directories,
+	 * When looking for a directory (trailing '/'), it might be a
+	 * submodule or a directory. Despite submodules being directories,
 	 * they are stored in the name hash without a closing slash.
-	 * When ignore_case is 1, directories are stored in the name hash
-	 * with their closing slash.
+	 * When ignore_case is 1, directories are stored in a separate hash
+	 * table *with* their closing slash.
 	 *
 	 * The side effect of this storage technique is we have need to
+	 * lookup the directory in a separate hash table, and if not found
 	 * remove the slash from name and perform the lookup again without
 	 * the slash.  If a match is made, S_ISGITLINK(ce->mode) will be
 	 * true.
 	 */
 	if (icase && name[namelen - 1] == '/') {
+		struct dir_entry *dir = find_dir_entry(istate, name, namelen);
+		if (dir && dir->nr)
+			return dir->ce;
+
 		ce = index_name_exists(istate, name, namelen - 1, icase);
 		if (ce && S_ISGITLINK(ce->ce_mode))
 			return ce;
 	}
 	return NULL;
 }
+
+static int free_dir_entry(void *entry, void *unused)
+{
+	struct dir_entry *dir = entry;
+	while (dir) {
+		struct dir_entry *next = dir->next;
+		free(dir);
+		dir = next;
+	}
+	return 0;
+}
+
+void free_name_hash(struct index_state *istate)
+{
+	if (!istate->name_hash_initialized)
+		return;
+	istate->name_hash_initialized = 0;
+	if (ignore_case)
+		/* free directory entries */
+		for_each_hash(&istate->dir_hash, free_dir_entry, NULL);
+
+	free_hash(&istate->name_hash);
+	free_hash(&istate->dir_hash);
+}
diff --git a/read-cache.c b/read-cache.c
index 827ae55..47eb9d8 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -46,7 +46,7 @@ static void replace_index_entry(struct index_state *istate, int nr, struct cache
 {
 	struct cache_entry *old = istate->cache[nr];
 
-	remove_name_hash(old);
+	remove_name_hash(istate, old);
 	set_index_entry(istate, nr, ce);
 	istate->cache_changed = 1;
 }
@@ -460,7 +460,7 @@ int remove_index_entry_at(struct index_state *istate, int pos)
 	struct cache_entry *ce = istate->cache[pos];
 
 	record_resolve_undo(istate, ce);
-	remove_name_hash(ce);
+	remove_name_hash(istate, ce);
 	istate->cache_changed = 1;
 	istate->cache_nr--;
 	if (pos >= istate->cache_nr)
@@ -483,7 +483,7 @@ void remove_marked_cache_entries(struct index_state *istate)
 
 	for (i = j = 0; i < istate->cache_nr; i++) {
 		if (ce_array[i]->ce_flags & CE_REMOVE)
-			remove_name_hash(ce_array[i]);
+			remove_name_hash(istate, ce_array[i]);
 		else
 			ce_array[j++] = ce_array[i];
 	}
@@ -1515,8 +1515,7 @@ int discard_index(struct index_state *istate)
 	istate->cache_changed = 0;
 	istate->timestamp.sec = 0;
 	istate->timestamp.nsec = 0;
-	istate->name_hash_initialized = 0;
-	free_hash(&istate->name_hash);
+	free_name_hash(istate);
 	cache_tree_free(&(istate->cache_tree));
 	istate->initialized = 0;
 
diff --git a/t/t7062-wtstatus-ignorecase.sh b/t/t7062-wtstatus-ignorecase.sh
new file mode 100755
index 0000000..73709db
--- /dev/null
+++ b/t/t7062-wtstatus-ignorecase.sh
@@ -0,0 +1,20 @@
+#!/bin/sh
+
+test_description='git-status with core.ignorecase=true'
+
+. ./test-lib.sh
+
+test_expect_success 'status with hash collisions' '
+	# note: "V/", "V/XQANY/" and "WURZAUP/" produce the same hash code
+	# in name-hash.c::hash_name
+	mkdir V &&
+	mkdir V/XQANY &&
+	mkdir WURZAUP &&
+	touch V/XQANY/test &&
+	git config core.ignorecase true &&
+	git add . &&
+	# test is successful if git status completes (no endless loop)
+	git status
+'
+
+test_done
-- 
1.8.1.2.7987.g4a34b82

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [PATCH v2] name-hash.c: fix endless loop with core.ignorecase=true
  2013-02-27 23:57                                           ` [PATCH v2] " Karsten Blees
@ 2013-02-28  0:27                                             ` Junio C Hamano
  0 siblings, 0 replies; 88+ messages in thread
From: Junio C Hamano @ 2013-02-28  0:27 UTC (permalink / raw)
  To: Karsten Blees
  Cc: Jeff King, Duy Nguyen, kusmabite, Ramkumar Ramachandra,
	Robert Zeh, Git List, finnag

Karsten Blees <karsten.blees@gmail.com> writes:

> With core.ignorecase=true, name-hash.c builds a case insensitive index of
> all tracked directories. Currently, the existing cache entry structures are
> added multiple times to the same hashtable (with different name lengths and
> hash codes). However, there's only one dir_next pointer, which gets
> completely messed up in case of hash collisions. In the worst case, this
> causes an endless loop if ce == ce->dir_next (see t7062).
>
> Use a separate hashtable and separate structures for the directory index
> so that each directory entry has its own next pointer. Use reference
> counting to track which directory entry contains files.
>
> There are only slight changes to the name-hash.c API:
> - new free_name_hash() used by read-cache.c::discard_index()
> - remove_name_hash() takes an additional index_state parameter
> - index_name_exists() for a directory (trailing '/') may return a cache
>   entry that has been removed (CE_UNHASHED). This is not a problem as the
>   return value is only used to check if the directory exists (dir.c) or to
>   normalize casing of directory names (read-cache.c).
>
> Getting rid of cache_entry.dir_next reduces memory consumption, especially
> with core.ignorecase=false (which doesn't use that member at all).
>
> With core.ignorecase=true, building the directory index is slightly faster
> as we add / check the parent directory first (instead of going through all
> directory levels for each file in the index). E.g. with WebKit (~200k
> files, ~7k dirs), time spent in lazy_init_name_hash is reduced from 176ms
> to 130ms.
>
> Signed-off-by: Karsten Blees <blees@dcon.de>
> ---

One thing that still puzzles me is what guarantee we have on the
lifetime of these ce's that are borrowed by these dir_hash entries.
There are a few places where we call free(ce) around "aliased"
entries (only happens with ignore_case set).  I do not think it is a
new issue (we used to borrow a ce to represent a directory in the
name_hash by using the leading prefix of its name anyway, and this
patch only changes which hash table is used to hold it), and I do
not think it will be an issue for case sensitive systems, so I would
stop being worried about it for now, though ;-)

Thanks, will replace and queue.


> diff --git a/t/t7062-wtstatus-ignorecase.sh b/t/t7062-wtstatus-ignorecase.sh
> new file mode 100755
> index 0000000..73709db
> --- /dev/null
> +++ b/t/t7062-wtstatus-ignorecase.sh
> @@ -0,0 +1,20 @@
> +#!/bin/sh
> +
> +test_description='git-status with core.ignorecase=true'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'status with hash collisions' '
> +	# note: "V/", "V/XQANY/" and "WURZAUP/" produce the same hash code
> +	# in name-hash.c::hash_name
> +	mkdir V &&
> +	mkdir V/XQANY &&
> +	mkdir WURZAUP &&
> +	touch V/XQANY/test &&
> +	git config core.ignorecase true &&
> +	git add . &&
> +	# test is successful if git status completes (no endless loop)
> +	git status
> +'
> +
> +test_done

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: inotify to minimize stat() calls
  2013-02-11  2:56                         ` Duy Nguyen
  2013-02-11 11:12                           ` Duy Nguyen
@ 2013-03-07 22:16                           ` Torsten Bögershausen
  2013-03-08  0:04                             ` Junio C Hamano
  1 sibling, 1 reply; 88+ messages in thread
From: Torsten Bögershausen @ 2013-03-07 22:16 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Junio C Hamano, Ramkumar Ramachandra, Robert Zeh, Git List,
	finnag, Torsten Bögershausen

On 11.02.13 03:56, Duy Nguyen wrote:
> On Mon, Feb 11, 2013 at 3:16 AM, Junio C Hamano <gitster@pobox.com> wrote:
>> The other "lstat()" experiment was a very interesting one, but this
>> is not yet an interesting experiment to see where in the "ignore"
>> codepath we are spending times.
>>
>> We know that we can tell wt_status_collect_untracked() not to bother
>> with the untracked or ignored files with !s->show_untracked_files
>> already, but I think the more interesting question is if we can show
>> the untracked files with less overhead.
>>
>> If we want to show untracked files, it is a given that we need to
>> read directories to see what paths there are on the filesystem. Is
>> the opendir/readdir cost dominating in the process? Are we spending
>> a lot of time sifting the result of opendir/readdir via the ignore
>> mechanism? Is reading the "ignore" files costing us much to prime
>> the ignore mechanism?
>>
>> If readdir cost is dominant, then that makes "cache gitignore" a
>> nonsense proposition, I think.  If you really want to "cache"
>> something, you need to have somebody (i.e. a daemon) who constantly
>> keeps an eye on the filesystem changes and can respond with the up
>> to date result directly to fill_directory().  I somehow doubt that
>> it is a direction we would want to go in, though.
> 
> Yeah, it did not only cut out syscall cost, I also cut a lot of user-space
> processing (plus .gitignore content access). From the timings I posted
> earlier,
> 
>>         unmodified  dir.c
>> real    0m0.550s    0m0.287s
>> user    0m0.305s    0m0.201s
>> sys     0m0.240s    0m0.084s
> 
> sys time is reduced from 0.24s to 0.08s, so readdir+opendir definitely
> has something to do with it (and perhaps reading .gitignore). But it
> also reduces user time from 0.305 to 0.201s. I don't think avoiding
> readdir+opendir will bring us this gain. It's probably the cost of
> matching .gitignore. I'll try to replace opendir+readdir with a
> no-syscall version. At this point "untracked caching" sounds more
> feasible (and less complex) than ".gitignore caching".
> 
Thanks to Duy for the measurements and patches.
I took the liberty of converting the patched dir.c into a
"runtime configurable" git status option.
I'm not sure if the following copy-and-paste work applies
(it is based on Git 1.8.1.3), but the time spent for
"git status --changed-only" is basically half the time of
"git status", similar to what Duy has measured.
I did a test both on a Linux box and on Mac OS.

The speedup is so impressive that I am tempted to submit a patch similar
to the following. What do we think about it?
/Torsten




-- >8 --

[PATCH] git status: add option changed-only
git status may be run faster if
- we only check if files are changed which are already known to git.
- we don't check if there are untracked files.

"git status --changed-only" (or the short form "git status -c")

will only check for changed files which are already known to git,
and which are in the index.

The call to read_directory_recursive() is skipped and untracked files
in the working tree are not reported.

Inspired-by: Duy Nguyen <pclouds@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 builtin/commit.c | 2 ++
 dir.c            | 5 +++--
 dir.h            | 3 ++-
 wt-status.c      | 3 +++
 wt-status.h      | 1 +
 5 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/builtin/commit.c b/builtin/commit.c
index d6dd3df..6a5ba11 100644
--- a/builtin/commit.c
+++ b/builtin/commit.c
@@ -1158,6 +1158,8 @@ int cmd_status(int argc, const char **argv, const char *prefix)
 	unsigned char sha1[20];
 	static struct option builtin_status_options[] = {
 		OPT__VERBOSE(&verbose, N_("be verbose")),
+		OPT_BOOLEAN('c', "changed-only", &s.check_changed_only,
+			    N_("Ignore untracked files. Check if files known to git are modified")),
 		OPT_SET_INT('s', "short", &status_format,
 			    N_("show status concisely"), STATUS_FORMAT_SHORT),
 		OPT_BOOLEAN('b', "branch", &s.show_branch,
diff --git a/dir.c b/dir.c
index a473ca2..555b652 100644
--- a/dir.c
+++ b/dir.c
@@ -1274,8 +1274,9 @@ int read_directory(struct dir_struct *dir, const char *path, int len, const char
 		return dir->nr;
 
 	simplify = create_simplify(pathspec);
-	if (!len || treat_leading_path(dir, path, len, simplify))
-		read_directory_recursive(dir, path, len, 0, simplify);
+	if (!(dir->flags & DIR_CHECK_CHANGED_ONLY) &&
+	    (!len || treat_leading_path(dir, path, len, simplify)))
+		read_directory_recursive(dir, path, len, 0, simplify);
 	free_simplify(simplify);
 	qsort(dir->entries, dir->nr, sizeof(struct dir_entry *), cmp_name);
 	qsort(dir->ignored, dir->ignored_nr, sizeof(struct dir_entry *), cmp_name);
diff --git a/dir.h b/dir.h
index f5c89e3..1a915a7 100644
--- a/dir.h
+++ b/dir.h
@@ -41,7 +41,8 @@ struct dir_struct {
 		DIR_SHOW_OTHER_DIRECTORIES = 1<<1,
 		DIR_HIDE_EMPTY_DIRECTORIES = 1<<2,
 		DIR_NO_GITLINKS = 1<<3,
-		DIR_COLLECT_IGNORED = 1<<4
+		DIR_COLLECT_IGNORED = 1<<4,
+		DIR_CHECK_CHANGED_ONLY = 1<<5
 	} flags;
 	struct dir_entry **entries;
 	struct dir_entry **ignored;
diff --git a/wt-status.c b/wt-status.c
index d7cfe8f..b315785 100644
--- a/wt-status.c
+++ b/wt-status.c
@@ -503,6 +503,9 @@ static void wt_status_collect_untracked(struct wt_status *s)
 	if (s->show_untracked_files != SHOW_ALL_UNTRACKED_FILES)
 		dir.flags |=
 			DIR_SHOW_OTHER_DIRECTORIES | DIR_HIDE_EMPTY_DIRECTORIES;
+	if (s->check_changed_only)
+		dir.flags |= DIR_CHECK_CHANGED_ONLY;
+
 	setup_standard_excludes(&dir);
 
 	fill_directory(&dir, s->pathspec);
diff --git a/wt-status.h b/wt-status.h
index 236b41f..7eb0115 100644
--- a/wt-status.h
+++ b/wt-status.h
@@ -47,6 +47,7 @@ struct wt_status {
 	const char **pathspec;
 	int verbose;
 	int amend;
+	int check_changed_only;
 	enum commit_whence whence;
 	int nowarn;
 	int use_color;
-- 
1.8.2.rc2

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: inotify to minimize stat() calls
  2013-03-07 22:16                           ` Torsten Bögershausen
@ 2013-03-08  0:04                             ` Junio C Hamano
  2013-03-08  7:01                               ` Torsten Bögershausen
  0 siblings, 1 reply; 88+ messages in thread
From: Junio C Hamano @ 2013-03-08  0:04 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Duy Nguyen, Ramkumar Ramachandra, Robert Zeh, Git List, finnag

Torsten Bögershausen <tboegi@web.de> writes:

> diff --git a/builtin/commit.c b/builtin/commit.c
> index d6dd3df..6a5ba11 100644
> --- a/builtin/commit.c
> +++ b/builtin/commit.c
> @@ -1158,6 +1158,8 @@ int cmd_status(int argc, const char **argv, const char *prefix)
>  	unsigned char sha1[20];
>  	static struct option builtin_status_options[] = {
>  		OPT__VERBOSE(&verbose, N_("be verbose")),
> +		OPT_BOOLEAN('c', "changed-only", &s.check_changed_only,
> +			    N_("Ignore untracked files. Check if files known to git are modified")),

Doesn't this make one wonder why a separate bit and implementation
is necessary to say "I am not interested in untracked files" when
"-uno" option is already there?

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: inotify to minimize stat() calls
  2013-03-08  0:04                             ` Junio C Hamano
@ 2013-03-08  7:01                               ` Torsten Bögershausen
  2013-03-08  8:15                                 ` Junio C Hamano
  0 siblings, 1 reply; 88+ messages in thread
From: Torsten Bögershausen @ 2013-03-08  7:01 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Torsten Bögershausen, Duy Nguyen, Ramkumar Ramachandra,
	Robert Zeh, Git List, finnag

On 08.03.13 01:04, Junio C Hamano wrote:
> Torsten Bögershausen <tboegi@web.de> writes:
> 
>> diff --git a/builtin/commit.c b/builtin/commit.c
>> index d6dd3df..6a5ba11 100644
>> --- a/builtin/commit.c
>> +++ b/builtin/commit.c
>> @@ -1158,6 +1158,8 @@ int cmd_status(int argc, const char **argv, const char *prefix)
>>  	unsigned char sha1[20];
>>  	static struct option builtin_status_options[] = {
>>  		OPT__VERBOSE(&verbose, N_("be verbose")),
>> +		OPT_BOOLEAN('c', "changed-only", &s.check_changed_only,
>> +			    N_("Ignore untracked files. Check if files known to git are modified")),
> 
> Doesn't this make one wonder why a separate bit and implementation
> is necessary to say "I am not interested in untracked files" when
> "-uno" option is already there?
Thanks Junio,
this is good news.
I must admit that I wasn't aware of "git status -uno".

Thinking about it, how many git users are aware of the speed penalty
when running git status to find out which (tracked) files they had changed?

Or to put it the other way, when a developer wants a quick overview
about the files she changed, then git status -uno may be a good and fast friend.

Does it make sense to stress that somehow in the documentation?

diff --git a/Documentation/git-status.txt b/Documentation/git-status.txt
index 9f1ef9a..360d813 100644
--- a/Documentation/git-status.txt
+++ b/Documentation/git-status.txt
@@ -51,13 +51,18 @@ default is 'normal', i.e. show untracked files and directori
 +
 The possible options are:
 +
-       - 'no'     - Show no untracked files
+       - 'no'     - Show no untracked files (this is fastest)
        - 'normal' - Shows untracked files and directories
        - 'all'    - Also shows individual files in untracked directories.
 +
 The default can be changed using the status.showUntrackedFiles
 configuration variable documented in linkgit:git-config[1].
 
++
+Note: Searching for untracked files or directories may take some time.
+A fast way to get a status of files tracked by git is to use
+'git status -uno'
+

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: inotify to minimize stat() calls
  2013-03-08  7:01                               ` Torsten Bögershausen
@ 2013-03-08  8:15                                 ` Junio C Hamano
  2013-03-08  9:24                                   ` Torsten Bögershausen
  2013-03-08 10:53                                   ` Duy Nguyen
  0 siblings, 2 replies; 88+ messages in thread
From: Junio C Hamano @ 2013-03-08  8:15 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Duy Nguyen, Ramkumar Ramachandra, Robert Zeh, Git List, finnag

Torsten Bögershausen <tboegi@web.de> writes:

>> Doesn't this make one wonder why a separate bit and implementation
>> is necessary to say "I am not interested in untracked files" when
>> "-uno" option is already there?
> ...
> I must admit that I wasn't aware of "git status -uno".

Not so fast.  I did not ask you "Why do you need a new one to solve
the same problem -uno already solves?"

> Thinking about it, how many git users are aware of the speed penalty
> when running git status to find out which (tracked) files they had changed?
>
> Or to put it the other way, when a developer wants a quick overview
> about the files she changed, then git status -uno may be a good and fast friend.
>
> Does it make sense to stress that somehow in the documentation?
>
> diff --git a/Documentation/git-status.txt b/Documentation/git-status.txt
> index 9f1ef9a..360d813 100644
> --- a/Documentation/git-status.txt
> +++ b/Documentation/git-status.txt
> @@ -51,13 +51,18 @@ default is 'normal', i.e. show untracked files and directori
>  +
>  The possible options are:
>  +
> -       - 'no'     - Show no untracked files
> +       - 'no'     - Show no untracked files (this is fastest)

There is a trade-off around the use of -uno between safety and
performance.  The default is not to use -uno so that you will not
forget to add a file you newly created (i.e safety).  You would pay
for the safety with the cost to find such untracked files (i.e.
performance).

I suspect that the documentation was written with the assumption
that at least for the people who are reading this part of the
documentation, the trade-off is obvious.  In order to find more
information, you naturally need to spend more cycles.

If the trade-off is not so obvious, however, I do not object at all
to describing it. But if we are to do so, I do object to mentioning
only one side of the trade-off.  People who choose "fastest" need
to be made very aware that they are disabling "safety".

That brings us back to the "Why a separate implementation when -uno
is there?" question.

Your patch adds new code; it does not just enable the same logic as
what the existing -uno does with a new flag.  Does the new code compute
different things?  Does it find more stuff by spending extra cycles?
Does it find less stuff by being extra fast?

These questions are important.

If the new option strikes the trade-off between safety and
performance at a point different from the point where the existing
-uno option does, it _might_ still be worth adding as a separate
option.  I didn't get that impression when I saw the patch, but I
admit that I did not follow the code carefully myself.

That is the reason why I was wondering why a separate bit and
implementation had to be added by the patch.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: inotify to minimize stat() calls
  2013-03-08  8:15                                 ` Junio C Hamano
@ 2013-03-08  9:24                                   ` Torsten Bögershausen
  2013-03-08 10:53                                   ` Duy Nguyen
  1 sibling, 0 replies; 88+ messages in thread
From: Torsten Bögershausen @ 2013-03-08  9:24 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Torsten Bögershausen, Duy Nguyen, Ramkumar Ramachandra,
	Robert Zeh, Git List, finnag

On 08.03.13 09:15, Junio C Hamano wrote:
> Torsten Bögershausen <tboegi@web.de> writes:
> 
>>> Doesn't this make one wonder why a separate bit and implementation
>>> is necessary to say "I am not interested in untracked files" when
>>> "-uno" option is already there?
>> ...
>> I must admit that I wasn't aware of "git status -uno".
> 
> Not so fast.  I did not ask you "Why do you need a new one to solve
> the same problem -uno already solves?"
> 
>> Thinking about it, how many git users are aware of the speed penalty
>> when running git status to find out which (tracked) files they had changed?
>>
>> Or to put it the other way, when a developer wants a quick overview
>> about the files she changed, then git status -uno may be a good and fast friend.
>>
>> Does it make sense to stress that somehow in the documentation?
>>
>> diff --git a/Documentation/git-status.txt b/Documentation/git-status.txt
>> index 9f1ef9a..360d813 100644
>> --- a/Documentation/git-status.txt
>> +++ b/Documentation/git-status.txt
>> @@ -51,13 +51,18 @@ default is 'normal', i.e. show untracked files and directori
>>  +
>>  The possible options are:
>>  +
>> -       - 'no'     - Show no untracked files
>> +       - 'no'     - Show no untracked files (this is fastest)
> 
> There is a trade-off around the use of -uno between safety and
> performance.  The default is not to use -uno so that you will not
> forget to add a file you newly created (i.e safety).  You would pay
> for the safety with the cost to find such untracked files (i.e.
> performance).
> 
> I suspect that the documentation was written with the assumption
> that at least for the people who are reading this part of the
> documentation, the trade-off is obvious.  In order to find more
> information, you naturally need to spend more cycles.
> 
> If the trade-off is not so obvious, however, I do not object at all
> to describing it. But if we are to do so, I do object to mentioning
> only one side of the trade-off.  People who choose "fastest" need
> to be made very aware that they are disabling "safety".
> 
> That brings us back to the "Why a separate implementation when -uno
> is there?" question.
[...]
The short version:
The -uno option does exactly what the -c option intended to do ;-)
(The code path to disable the "expensive" call to read_directory_recursive()
in dir.c is slightly different).
Benchmarks (again, sorry for the noise) show that -uno and -c are equally fast.
Running git status 5 times on a linux tree and taking the best of 5:

git status
real    0m0.697s

git status -uno
real    0m0.291s

(with the patch) git status -c
real    0m0.289s


These are not really scientific numbers, but all in all we have motivation enough to drop
the "git status -c" patch completely.

My feeling is still that the suggested documentation "this is fastest" is not a good choice either.
Let me try to come up with a better suggestion.
/Torsten
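
The best-of-five measurement described above can be sketched in C as
follows. This is only an illustration: now_ms(), best_of() and
sample_workload() are hypothetical names, not part of git, and the
workload is a stand-in for actually running "git status". It uses
gettimeofday(), which is wall-clock time and can jump if the system
clock is adjusted.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* Wall-clock time in milliseconds, via gettimeofday(). */
static uint64_t now_ms(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return (uint64_t)tv.tv_sec * 1000 + (uint64_t)tv.tv_usec / 1000;
}

/* Hypothetical stand-in; a real run would invoke "git status" here. */
static void sample_workload(void)
{
	volatile unsigned long sink = 0;
	unsigned long i;

	for (i = 0; i < 100000; i++)
		sink += i;
}

/*
 * Run fn() `runs` times and return the shortest elapsed time in ms.
 * Returns UINT64_MAX when runs <= 0, i.e. nothing was measured.
 */
static uint64_t best_of(int runs, void (*fn)(void))
{
	uint64_t best = UINT64_MAX;
	int i;

	for (i = 0; i < runs; i++) {
		uint64_t start = now_ms();
		uint64_t elapsed;

		fn();
		elapsed = now_ms() - start;
		if (elapsed < best)
			best = elapsed;
	}
	return best;
}
```

Taking the minimum rather than the mean is one way to reduce the
influence of cold-cache runs, which is presumably why the numbers
above report the best of five.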
 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: inotify to minimize stat() calls
  2013-03-08  8:15                                 ` Junio C Hamano
  2013-03-08  9:24                                   ` Torsten Bögershausen
@ 2013-03-08 10:53                                   ` Duy Nguyen
  2013-03-10  8:23                                     ` Ramkumar Ramachandra
  2013-03-13 12:59                                     ` [PATCH] status: hint the user about -uno if read_directory takes too long Nguyễn Thái Ngọc Duy
  1 sibling, 2 replies; 88+ messages in thread
From: Duy Nguyen @ 2013-03-08 10:53 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Torsten Bögershausen, Ramkumar Ramachandra, Robert Zeh,
	Git List, finnag

On Fri, Mar 8, 2013 at 3:15 PM, Junio C Hamano <gitster@pobox.com> wrote:
>>  The possible options are:
>>  +
>> -       - 'no'     - Show no untracked files
>> +       - 'no'     - Show no untracked files (this is fastest)
>
> There is a trade-off around the use of -uno between safety and
> performance.  The default is not to use -uno so that you will not
> forget to add a file you newly created (i.e safety).  You would pay
> for the safety with the cost to find such untracked files (i.e.
> performance).
>
> I suspect that the documentation was written with the assumption
> that at least for the people who are reading this part of the
> documentation, the trade-off is obvious.  In order to find more
> information, you naturally need to spend more cycles.
>
> If the trade-off is not so obvious, however, I do not object at all
> to describing it. But if we are to do so, I do object to mentioning
> only one side of the trade-off.  People who choose "fastest" need
> to be made very aware that they are disabling "safety".

On the topic of trading off, I was thinking about a new -uauto as the
default that behaves like -uall if it takes less than a certain amount
of time (e.g. 0.5 seconds); if it exceeds that limit, the operation is
aborted (i.e. it turns into -uno). The safety net is still there: "git
status" advises using -u to show full information.

Or a less intrusive approach: measure the time and advise the user to
(read the doc and) use -uno.

But it's probably worth waiting for the first cut of inotify support
from Ram. It's better with inotify anyway.
-- 
Duy
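
The -uauto idea above amounts to a directory scan with a time budget.
A minimal sketch, assuming a flat list of candidate paths instead of
git's real recursive walk; scan_with_budget(), ignore_path() and the
scan_result enum are hypothetical names used only for illustration:

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

enum scan_result { SCAN_COMPLETE, SCAN_ABORTED };

/* Wall-clock time in milliseconds, via gettimeofday(). */
static uint64_t now_ms(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return (uint64_t)tv.tv_sec * 1000 + (uint64_t)tv.tv_usec / 1000;
}

/* Hypothetical per-path callback; real code would classify the path. */
static void ignore_path(const char *path)
{
	(void)path;
}

/*
 * Visit paths until all are handled or budget_ms has elapsed.
 * *visited receives the number of paths handled before returning.
 */
static enum scan_result scan_with_budget(const char **paths, size_t nr,
					 uint64_t budget_ms,
					 void (*visit)(const char *),
					 size_t *visited)
{
	uint64_t start = now_ms();
	size_t i;

	for (i = 0; i < nr; i++) {
		if (now_ms() - start >= budget_ms) {
			*visited = i;
			return SCAN_ABORTED;	/* caller falls back to -uno-like output */
		}
		visit(paths[i]);
	}
	*visited = nr;
	return SCAN_COMPLETE;
}
```

On SCAN_ABORTED the caller would have to discard the partial result
and tell the user that untracked files were not listed; otherwise the
output silently changes meaning from run to run.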

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: inotify to minimize stat() calls
  2013-03-08 10:53                                   ` Duy Nguyen
@ 2013-03-10  8:23                                     ` Ramkumar Ramachandra
  2013-03-13 12:59                                     ` [PATCH] status: hint the user about -uno if read_directory takes too long Nguyễn Thái Ngọc Duy
  1 sibling, 0 replies; 88+ messages in thread
From: Ramkumar Ramachandra @ 2013-03-10  8:23 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Junio C Hamano, Torsten Bögershausen, Robert Zeh, Git List, finnag

Duy Nguyen wrote:
> On Fri, Mar 8, 2013 at 3:15 PM, Junio C Hamano <gitster@pobox.com> wrote:
>>>  The possible options are:
>>>  +
>>> -       - 'no'     - Show no untracked files
>>> +       - 'no'     - Show no untracked files (this is fastest)
>>
>> There is a trade-off around the use of -uno between safety and
>> performance.  The default is not to use -uno so that you will not
>> forget to add a file you newly created (i.e safety).  You would pay
>> for the safety with the cost to find such untracked files (i.e.
>> performance).
>>
>> I suspect that the documentation was written with the assumption
>> that at least for the people who are reading this part of the
>> documentation, the trade-off is obvious.  In order to find more
>> information, you naturally need to spend more cycles.
>>
>> If the trade-off is not so obvious, however, I do not object at all
>> to describing it. But if we are to do so, I do object to mentioning
>> only one side of the trade-off.  People who choose "fastest" need
>> to be made very aware that they are disabling "safety".
>
> On the topic of trading off, I was thinking about a new -uauto as the
> default that behaves like -uall if it takes less than a certain amount
> of time (e.g. 0.5 seconds); if it exceeds that limit, the operation is
> aborted (i.e. it turns into -uno). The safety net is still there: "git
> status" advises using -u to show full information.

Ugh, this is too opaque; the user has no idea whether untracked files
are being counted or not.

> Or a less intrusive approach: measure the time and advise the user to
> (read the doc and) use -uno.

I just learnt about -uno myself, from this thread.  At best, it's a
stopgap until we get inotify support.

> But it's probably worth waiting for the first cut of inotify support
> from Ram. It's better with inotify anyway.

This is quite urgent in my opinion.  One of git's primary tasks is to
quickly tell me what changed in the repository, and inotify is the
perfect way to do this.
I'll try to get the first cut out quickly, so we can immediately
correct any fundamental design flaws.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* [PATCH] status: hint the user about -uno if read_directory takes too long
  2013-03-08 10:53                                   ` Duy Nguyen
  2013-03-10  8:23                                     ` Ramkumar Ramachandra
@ 2013-03-13 12:59                                     ` Nguyễn Thái Ngọc Duy
  2013-03-13 15:21                                       ` Torsten Bögershausen
  2013-03-13 16:16                                       ` Junio C Hamano
  1 sibling, 2 replies; 88+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2013-03-13 12:59 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, tboegi, artagnon, robert.allan.zeh, finnag,
	Nguyễn Thái Ngọc Duy

This patch attempts to advertise -uno to the users who tolerate slow
"git status" on large repositories (or slow machines/disks). The 2
seconds limit is quite arbitrary but is probably long enough to start
using -uno.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 Documentation/config.txt |  4 ++++
 advice.c                 |  2 ++
 advice.h                 |  1 +
 t/t7060-wtstatus.sh      |  2 ++
 t/t7508-status.sh        |  4 ++++
 t/t7512-status-help.sh   |  1 +
 wt-status.c              | 20 +++++++++++++++++++-
 wt-status.h              |  1 +
 8 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index bbba728..e91d06f 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -178,6 +178,10 @@ advice.*::
 		the template shown when writing commit messages in
 		linkgit:git-commit[1], and in the help message shown
 		by linkgit:git-checkout[1] when switching branch.
+	statusUno::
+		If collecting untracked files in linkgit:git-status[1]
+		takes more than 2 seconds, hint the user that the option
+		`-uno` could be used to stop collecting untracked files.
 	commitBeforeMerge::
 		Advice shown when linkgit:git-merge[1] refuses to
 		merge to avoid overwriting local changes.
diff --git a/advice.c b/advice.c
index 780f58d..72b5c66 100644
--- a/advice.c
+++ b/advice.c
@@ -8,6 +8,7 @@ int advice_push_already_exists = 1;
 int advice_push_fetch_first = 1;
 int advice_push_needs_force = 1;
 int advice_status_hints = 1;
+int advice_status_uno = 1;
 int advice_commit_before_merge = 1;
 int advice_resolve_conflict = 1;
 int advice_implicit_identity = 1;
@@ -25,6 +26,7 @@ static struct {
 	{ "pushfetchfirst", &advice_push_fetch_first },
 	{ "pushneedsforce", &advice_push_needs_force },
 	{ "statushints", &advice_status_hints },
+	{ "statusuno", &advice_status_uno },
 	{ "commitbeforemerge", &advice_commit_before_merge },
 	{ "resolveconflict", &advice_resolve_conflict },
 	{ "implicitidentity", &advice_implicit_identity },
diff --git a/advice.h b/advice.h
index fad36df..d7e03be 100644
--- a/advice.h
+++ b/advice.h
@@ -11,6 +11,7 @@ extern int advice_push_already_exists;
 extern int advice_push_fetch_first;
 extern int advice_push_needs_force;
 extern int advice_status_hints;
+extern int advice_status_uno;
 extern int advice_commit_before_merge;
 extern int advice_resolve_conflict;
 extern int advice_implicit_identity;
diff --git a/t/t7060-wtstatus.sh b/t/t7060-wtstatus.sh
index f4f38a5..dd340d5 100755
--- a/t/t7060-wtstatus.sh
+++ b/t/t7060-wtstatus.sh
@@ -5,6 +5,7 @@ test_description='basic work tree status reporting'
 . ./test-lib.sh
 
 test_expect_success setup '
+	git config advice.statusuno false &&
 	test_commit A &&
 	test_commit B oneside added &&
 	git checkout A^0 &&
@@ -46,6 +47,7 @@ test_expect_success 'M/D conflict does not segfault' '
 	(
 		cd mdconflict &&
 		git init &&
+		git config advice.statusuno false &&
 		test_commit initial foo "" &&
 		test_commit modify foo foo &&
 		git checkout -b side HEAD^ &&
diff --git a/t/t7508-status.sh b/t/t7508-status.sh
index a79c032..9d6e4db 100755
--- a/t/t7508-status.sh
+++ b/t/t7508-status.sh
@@ -8,11 +8,13 @@ test_description='git status'
 . ./test-lib.sh
 
 test_expect_success 'status -h in broken repository' '
+	git config advice.statusuno false &&
 	mkdir broken &&
 	test_when_finished "rm -fr broken" &&
 	(
 		cd broken &&
 		git init &&
+		git config advice.statusuno false &&
 		echo "[status] showuntrackedfiles = CORRUPT" >>.git/config &&
 		test_expect_code 129 git status -h >usage 2>&1
 	) &&
@@ -25,6 +27,7 @@ test_expect_success 'commit -h in broken repository' '
 	(
 		cd broken &&
 		git init &&
+		git config advice.statusuno false &&
 		echo "[status] showuntrackedfiles = CORRUPT" >>.git/config &&
 		test_expect_code 129 git commit -h >usage 2>&1
 	) &&
@@ -780,6 +783,7 @@ test_expect_success 'status refreshes the index' '
 test_expect_success 'setup status submodule summary' '
 	test_create_repo sm && (
 		cd sm &&
+		git config advice.statusuno false &&
 		>foo &&
 		git add foo &&
 		git commit -m "Add foo"
diff --git a/t/t7512-status-help.sh b/t/t7512-status-help.sh
index d2da89a..033a1b3 100755
--- a/t/t7512-status-help.sh
+++ b/t/t7512-status-help.sh
@@ -14,6 +14,7 @@ test_description='git status advice'
 set_fake_editor
 
 test_expect_success 'prepare for conflicts' '
+	git config advice.statusuno false &&
 	test_commit init main.txt init &&
 	git branch conflicts &&
 	test_commit on_master main.txt on_master &&
diff --git a/wt-status.c b/wt-status.c
index ef405d0..6fde08b 100644
--- a/wt-status.c
+++ b/wt-status.c
@@ -540,7 +540,16 @@ void wt_status_collect(struct wt_status *s)
 		wt_status_collect_changes_initial(s);
 	else
 		wt_status_collect_changes_index(s);
-	wt_status_collect_untracked(s);
+	if (s->show_untracked_files && advice_status_uno) {
+		struct timeval tv1, tv2;
+		gettimeofday(&tv1, NULL);
+		wt_status_collect_untracked(s);
+		gettimeofday(&tv2, NULL);
+		s->untracked_in_ms =
+			(uint64_t)tv2.tv_sec * 1000 + tv2.tv_usec / 1000 -
+			((uint64_t)tv1.tv_sec * 1000 + tv1.tv_usec / 1000);
+	} else
+		wt_status_collect_untracked(s);
 }
 
 static void wt_status_print_unmerged(struct wt_status *s)
@@ -1097,6 +1106,15 @@ void wt_status_print(struct wt_status *s)
 		wt_status_print_other(s, &s->untracked, _("Untracked files"), "add");
 		if (s->show_ignored_files)
 			wt_status_print_other(s, &s->ignored, _("Ignored files"), "add -f");
+		if (advice_status_uno && s->untracked_in_ms > 2000) {
+			status_printf_ln(s, GIT_COLOR_NORMAL,
+					 _("It took %.2f seconds to collect untracked files."),
+					 (float)s->untracked_in_ms / 1000);
+			status_printf_ln(s, GIT_COLOR_NORMAL,
+					 _("If it happens often, you may want to use option -uno"));
+			status_printf_ln(s, GIT_COLOR_NORMAL,
+					 _("to speed up by stopping displaying untracked files"));
+		}
 	} else if (s->commitable)
 		status_printf_ln(s, GIT_COLOR_NORMAL, _("Untracked files not listed%s"),
 			advice_status_hints
diff --git a/wt-status.h b/wt-status.h
index 81e1dcf..74208c0 100644
--- a/wt-status.h
+++ b/wt-status.h
@@ -69,6 +69,7 @@ struct wt_status {
 	struct string_list change;
 	struct string_list untracked;
 	struct string_list ignored;
+	uint32_t untracked_in_ms;
 };
 
 struct wt_status_state {
-- 
1.8.1.2.536.gf441e6d

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [PATCH] status: hint the user about -uno if read_directory takes too long
  2013-03-13 12:59                                     ` [PATCH] status: hint the user about -uno if read_directory takes too long Nguyễn Thái Ngọc Duy
@ 2013-03-13 15:21                                       ` Torsten Bögershausen
  2013-03-13 16:16                                       ` Junio C Hamano
  1 sibling, 0 replies; 88+ messages in thread
From: Torsten Bögershausen @ 2013-03-13 15:21 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy
  Cc: git, Junio C Hamano, tboegi, artagnon, robert.allan.zeh, finnag

On 13.03.13 13:59, Nguyễn Thái Ngọc Duy wrote:
> This patch attempts to advertise -uno to the users who tolerate slow
> "git status" on large repositories (or slow machines/disks). The 2
> seconds limit is quite arbitrary but is probably long enough to start
> using -uno.
>
> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> ---
>  Documentation/config.txt |  4 ++++
>  advice.c                 |  2 ++
>  advice.h                 |  1 +
>  t/t7060-wtstatus.sh      |  2 ++
>  t/t7508-status.sh        |  4 ++++
>  t/t7512-status-help.sh   |  1 +
>  wt-status.c              | 20 +++++++++++++++++++-
>  wt-status.h              |  1 +
>  8 files changed, 34 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index bbba728..e91d06f 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -178,6 +178,10 @@ advice.*::
>  		the template shown when writing commit messages in
>  		linkgit:git-commit[1], and in the help message shown
>  		by linkgit:git-checkout[1] when switching branch.
> +	statusUno::
> +		If collecting untracked files in linkgit:git-status[1]
> +		takes more than 2 seconds, hint the user that the option
> +		`-uno` could be used to stop collecting untracked files.
Thanks, I like the idea
could we make a "de-Luxe" version where

statusUno is an integer, counting in milliseconds?

/Torsten
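
One possible reading of the "integer in milliseconds" suggestion,
sketched in C. The semantics here are assumptions made for the example
and are not taken from any patch in this thread: "true" keeps the
2-second default, "false" or 0 disables the hint, and anything that is
not a plain non-negative integer falls back to the default.
parse_hint_threshold() and should_hint() are hypothetical helpers, not
git functions.

```c
#include <stdlib.h>
#include <string.h>

#define DEFAULT_HINT_MS 2000UL

/* Return the hint threshold in ms; 0 means "never hint". */
static unsigned long parse_hint_threshold(const char *value)
{
	char *end;
	unsigned long ms;

	if (!value || !strcmp(value, "true"))
		return DEFAULT_HINT_MS;
	if (!strcmp(value, "false"))
		return 0;
	/* Only accept values that start with a digit (no sign, no space). */
	if (*value < '0' || *value > '9')
		return DEFAULT_HINT_MS;
	ms = strtoul(value, &end, 10);
	if (*end != '\0')
		return DEFAULT_HINT_MS;	/* trailing garbage: use the default */
	return ms;
}

/* Hint only when enabled and the scan was slower than the threshold. */
static int should_hint(unsigned long threshold_ms, unsigned long elapsed_ms)
{
	return threshold_ms != 0 && elapsed_ms > threshold_ms;
}
```

Keeping "true" and "false" valid would let the variable continue to
work like the other boolean advice.* settings.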

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH] status: hint the user about -uno if read_directory takes too long
  2013-03-13 12:59                                     ` [PATCH] status: hint the user about -uno if read_directory takes too long Nguyễn Thái Ngọc Duy
  2013-03-13 15:21                                       ` Torsten Bögershausen
@ 2013-03-13 16:16                                       ` Junio C Hamano
  2013-03-14 10:22                                         ` Duy Nguyen
  1 sibling, 1 reply; 88+ messages in thread
From: Junio C Hamano @ 2013-03-13 16:16 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy
  Cc: git, tboegi, artagnon, robert.allan.zeh, finnag

Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:

> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index bbba728..e91d06f 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -178,6 +178,10 @@ advice.*::
>  		the template shown when writing commit messages in
>  		linkgit:git-commit[1], and in the help message shown
>  		by linkgit:git-checkout[1] when switching branch.
> +	statusUno::
> +		If collecting untracked files in linkgit:git-status[1]
> +		takes more than 2 seconds, hint the user that the option
> +		`-uno` could be used to stop collecting untracked files.

It looks to me that the way this paragraph conveys information is
vastly different from all the others in the section.  The section
begins with "by setting their corresponding variables to false
various advice messages can be squelched; here is the list of
variables and which advice message each of them controls", so the
description should be in "variable:: which advice message" form.

The noise this introduces to the test suite is a bit irritating and
makes us think twice if this is really a good change.

> diff --git a/wt-status.c b/wt-status.c
> index ef405d0..6fde08b 100644
> --- a/wt-status.c
> +++ b/wt-status.c
> @@ -540,7 +540,16 @@ void wt_status_collect(struct wt_status *s)
>  		wt_status_collect_changes_initial(s);
>  	else
>  		wt_status_collect_changes_index(s);
> -	wt_status_collect_untracked(s);
> +	if (s->show_untracked_files && advice_status_uno) {
> +		struct timeval tv1, tv2;
> +		gettimeofday(&tv1, NULL);
> +		wt_status_collect_untracked(s);
> +		gettimeofday(&tv2, NULL);
> +		s->untracked_in_ms =
> +			(uint64_t)tv2.tv_sec * 1000 + tv2.tv_usec / 1000 -
> +			((uint64_t)tv1.tv_sec * 1000 + tv1.tv_usec / 1000);
> +	} else
> +		wt_status_collect_untracked(s);
>  }

This is not wrong per-se but it took me two reads to spot that this
is not "if advice is active, do the timer but do not collect;
otherwise do just collect as before".  I wonder if we can structure
the code a bit better to make the timing bit less loud.
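
One way the quoted hunk could be restructured so that the collection
call appears exactly once is sketched below. It is self-contained and
uses stand-ins: struct status_sketch, collect_untracked() and
collect() are hypothetical and only mimic the fields and calls the
patch touches, and the elapsed time is kept in a uint64_t here rather
than the uint32_t field the patch adds.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* Hypothetical stand-in for the parts of struct wt_status used here. */
struct status_sketch {
	int show_untracked_files;
	uint64_t untracked_in_ms;
	int collect_calls;	/* bumped by the stub below */
};

static int advice_status_uno = 1;

/* Stub for wt_status_collect_untracked(); just counts invocations. */
static void collect_untracked(struct status_sketch *s)
{
	s->collect_calls++;
}

/* Wall-clock time in milliseconds, via gettimeofday(). */
static uint64_t now_ms(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return (uint64_t)tv.tv_sec * 1000 + (uint64_t)tv.tv_usec / 1000;
}

static void collect(struct status_sketch *s)
{
	int timed = s->show_untracked_files && advice_status_uno;
	uint64_t start = 0;

	if (timed)
		start = now_ms();
	collect_untracked(s);	/* always runs, timed or not */
	if (timed)
		s->untracked_in_ms = now_ms() - start;
}
```

With the timing wrapped around a single unconditional call, a reader
no longer has to check that both branches of the if/else collect.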

>  static void wt_status_print_unmerged(struct wt_status *s)
> @@ -1097,6 +1106,15 @@ void wt_status_print(struct wt_status *s)
>  		wt_status_print_other(s, &s->untracked, _("Untracked files"), "add");
>  		if (s->show_ignored_files)
>  			wt_status_print_other(s, &s->ignored, _("Ignored files"), "add -f");
> +		if (advice_status_uno && s->untracked_in_ms > 2000) {
> +			status_printf_ln(s, GIT_COLOR_NORMAL,
> +					 _("It took %.2f seconds to collect untracked files."),
> +					 (float)s->untracked_in_ms / 1000);
> +			status_printf_ln(s, GIT_COLOR_NORMAL,
> +					 _("If it happens often, you may want to use option -uno"));
> +			status_printf_ln(s, GIT_COLOR_NORMAL,
> +					 _("to speed up by stopping displaying untracked files"));
> +		}

"to speed up by stopping displaying untracked files" does not look
like giving a balanced suggestion.  It is increasing the risk of
forgetting about newly created files the user may want to add, but
the risk is not properly warned.

I tend to agree that the new advice would help users if phrased in
the right way.  Do we want them in COLOR_NORMAL, or do we want to make
them stand out a bit more (do we have COLOR_BLINK ;-)?

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH] status: hint the user about -uno if read_directory takes too long
  2013-03-13 16:16                                       ` Junio C Hamano
@ 2013-03-14 10:22                                         ` Duy Nguyen
  2013-03-14 15:05                                           ` Junio C Hamano
  2013-03-16  1:51                                           ` Duy Nguyen
  0 siblings, 2 replies; 88+ messages in thread
From: Duy Nguyen @ 2013-03-14 10:22 UTC (permalink / raw)
  To: Junio C Hamano, tboegi; +Cc: git, artagnon, robert.allan.zeh, finnag

On Wed, Mar 13, 2013 at 10:21 PM, Torsten Bögershausen <tboegi@web.de> wrote:
>> +     statusUno::
>> +             If collecting untracked files in linkgit:git-status[1]
>> +             takes more than 2 seconds, hint the user that the option
>> +             `-uno` could be used to stop collecting untracked files.
> Thanks, I like the idea
> could we make a "de-Luxe" version where
>
> statusUno is an integer, counting in milliseconds?

No problem.

On Wed, Mar 13, 2013 at 11:16 PM, Junio C Hamano <gitster@pobox.com> wrote:
> The noise this introduces to the test suite is a bit irritating and
> makes us think twice if this is really a good change.

I originally thought of two options, this or add an env flag in git
binary that turns this off in the test suite. The latter did not sound
good. But I forgot that we set a fake $HOME in the test suite, we
could disable this in $HOME/.gitconfig, less clutter in individual
tests.

>>  static void wt_status_print_unmerged(struct wt_status *s)
>> +             if (advice_status_uno && s->untracked_in_ms > 2000) {
>> +                     status_printf_ln(s, GIT_COLOR_NORMAL,
>> +                                      _("It took %.2f seconds to collect untracked files."),
>> +                                      (float)s->untracked_in_ms / 1000);
>> +                     status_printf_ln(s, GIT_COLOR_NORMAL,
>> +                                      _("If it happens often, you may want to use option -uno"));
>> +                     status_printf_ln(s, GIT_COLOR_NORMAL,
>> +                                      _("to speed up by stopping displaying untracked files"));
>> +             }
>
> "to speed up by stopping displaying untracked files" does not look
> like giving a balanced suggestion.  It is increasing the risk of
> forgetting about newly created files the user may want to add, but
> the risk is not properly warned.

How about "It took X ms to collect untracked files.\nCheck out the
option -u for a potential speedup"? I deliberately hide "no" so that
the user cannot blindly type and run it without reading document
first. We can give full explanation and warning there in the document.

> I tend to agree that the new advice would help users if phrased in a
> right way.  Do we want them in COLOR_NORMAL, or do we want to make
> them stand out a bit more (do we have COLOR_BLINK ;-)?

There will be false positives (cold cache for example). So yeah,
something that stands out more is good, but it should not catch too much
attention. We're currently using red and green in status output. Maybe
this one can take blue.

PS. What about advertising index v4? I sent a patch some time ago to
put an advice in git-clone. I think it's a good place, but we could
place it somewhere else..
-- 
Duy

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH] status: hint the user about -uno if read_directory takes too long
  2013-03-14 10:22                                         ` Duy Nguyen
@ 2013-03-14 15:05                                           ` Junio C Hamano
  2013-03-15 12:30                                             ` Duy Nguyen
  2013-03-16  1:51                                           ` Duy Nguyen
  1 sibling, 1 reply; 88+ messages in thread
From: Junio C Hamano @ 2013-03-14 15:05 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: tboegi, git, artagnon, robert.allan.zeh, finnag

Duy Nguyen <pclouds@gmail.com> writes:

> On Wed, Mar 13, 2013 at 10:21 PM, Torsten Bögershausen <tboegi@web.de> wrote:
>>> +     statusUno::
>>> +             If collecting untracked files in linkgit:git-status[1]
>>> +             takes more than 2 seconds, hint the user that the option
>>> +             `-uno` could be used to stop collecting untracked files.
>> Thanks, I like the idea
>> could we make a "de-Luxe" version where
>>
>> statusUno is an integer, counting in milliseconds?
>
> No problem.

A huge problem, as it breaks consistency and more importantly, the
suggestion misses the entire point of what "advice.*" variables are.

"advice.*" variables are bools that indicate "Have I learned this
somewhat tricky feature and/or characteristics of Git yet or do I
still need a reminder?"  There is no room for "I still need a
reminder if it takes more than N seconds".  You either already have
got it, or you haven't.
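
To illustrate the point, here is a minimal sketch of what "advice
variables are bools" means in practice. It is modelled loosely on the
name/pointer table visible in the advice.c hunks in this thread, but it
is not Git's actual code: set_advice() and the two-entry table are
assumptions made for the example. Each entry is a plain on/off flag, so
there is nowhere to store a threshold such as "N milliseconds".

```c
#include <stddef.h>
#include <string.h>

/* 1 means "I still need the reminder", 0 means "I have learned this". */
int advice_status_hints = 1;
int advice_status_u_option = 1;

/* Illustrative subset of a name -> flag table (not the full list). */
static struct {
	const char *name;
	int *preference;
} advice_config[] = {
	{ "statushints", &advice_status_hints },
	{ "statusuoption", &advice_status_u_option },
};

/*
 * Hypothetical helper: set the flag for "advice.<name>".
 * Any non-zero value is normalized to 1, so only on/off can be stored.
 * Returns 0 on success, -1 if the name is unknown.
 */
int set_advice(const char *name, int enabled)
{
	size_t i;

	for (i = 0; i < sizeof(advice_config) / sizeof(advice_config[0]); i++) {
		if (!strcmp(name, advice_config[i].name)) {
			*advice_config[i].preference = !!enabled;
			return 0;
		}
	}
	return -1;
}
```

Calling set_advice("statusuoption", 2000) in this sketch would simply
store 1; the "2000" is lost, which is why a millisecond value does not
fit the advice.* namespace.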

>> "to speed up by stopping displaying untracked files" does not look
>> like giving a balanced suggestion.  It is increasing the risk of
>> forgetting about newly created files the user may want to add, but
>> the risk is not properly warned.
>
> How about "It took X ms to collect untracked files.\nCheck out the
> option -u for a potential speedup"? I deliberately hide "no" so that
> the user cannot blindly type and run it without reading document
> first. We can give full explanation and warning there in the document.

But it makes the advise much less useful to introduce more levels of
indirections, no?

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH] status: hint the user about -uno if read_directory takes too long
  2013-03-14 15:05                                           ` Junio C Hamano
@ 2013-03-15 12:30                                             ` Duy Nguyen
  2013-03-15 15:52                                               ` Torsten Bögershausen
  0 siblings, 1 reply; 88+ messages in thread
From: Duy Nguyen @ 2013-03-15 12:30 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: tboegi, git, artagnon, robert.allan.zeh, finnag

On Thu, Mar 14, 2013 at 10:05 PM, Junio C Hamano <gitster@pobox.com> wrote:
>>> "to speed up by stopping displaying untracked files" does not look
>>> like giving a balanced suggestion.  It is increasing the risk of
>>> forgetting about newly created files the user may want to add, but
>>> the risk is not properly warned.
>>
>> How about "It took X ms to collect untracked files.\nCheck out the
>> option -u for a potential speedup"? I deliberately hide "no" so that
>> the user cannot blindly type and run it without reading document
>> first. We can give full explanation and warning there in the document.
>
> But it makes the advise much less useful to introduce more levels of
> indirections, no?

To me the message's value is the pointer to -uno that not many people
know about. And I don't want it to be too verbose as there'll be false
positives (cold cache, busy disks, low memory..), 2-3 lines should be
max. So indirections are not a concern. You want to speed up, you need
to pay some time. Anyway how do you put it to suggest -uno in
git-status with all the implications?
-- 
Duy

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH] status: hint the user about -uno if read_directory takes too long
  2013-03-15 12:30                                             ` Duy Nguyen
@ 2013-03-15 15:52                                               ` Torsten Bögershausen
  2013-03-15 15:57                                                 ` Ramkumar Ramachandra
  2013-03-15 16:53                                                 ` Junio C Hamano
  0 siblings, 2 replies; 88+ messages in thread
From: Torsten Bögershausen @ 2013-03-15 15:52 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Junio C Hamano, tboegi, git, artagnon, robert.allan.zeh, finnag

On 03/15/2013 01:30 PM, Duy Nguyen wrote:
> On Thu, Mar 14, 2013 at 10:05 PM, Junio C Hamano<gitster@pobox.com>  wrote:
>>>> "to speed up by stopping displaying untracked files" does not look
>>>> like giving a balanced suggestion.  It is increasing the risk of
>>>> forgetting about newly created files the user may want to add, but
>>>> the risk is not properly warned.
>>> How about "It took X ms to collect untracked files.\nCheck out the
>>> option -u for a potential speedup"? I deliberately hide "no" so that
>>> the user cannot blindly type and run it without reading document
>>> first. We can give full explanation and warning there in the document.
>> But it makes the advise much less useful to introduce more levels of
>> indirections, no?
> To me the message's value is the pointer to -uno that not many people
> know about. And I don't want it to be too verbose as there'll be false
> positives (cold cache, busy disks, low memory..), 2-3 lines should be
> max. So indirections are not a concern. You want to speed up, you need
> to pay some time. Anyway how do you put it to suggest -uno in
> git-status with all the implications?
I was thinking about the documentation; the best patch so far may look
like this. What do we think?
/Torsten


-- >8 --

[PATCH] git status: Document that git status -uno is faster

In some repostories users expere that "git status" command takes long time.
The command spends some time searching the file system for untracked files.
Document that searching for untracked file may take some time, and docuemnt
the option -uno better.

Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
  Documentation/git-status.txt | 7 +++++++
  1 file changed, 7 insertions(+)

diff --git a/Documentation/git-status.txt b/Documentation/git-status.txt
index 0412c40..fd36bbd 100644
--- a/Documentation/git-status.txt
+++ b/Documentation/git-status.txt
@@ -58,6 +58,13 @@ The possible options are:
  The default can be changed using the status.showUntrackedFiles
  configuration variable documented in linkgit:git-config[1].

++
+Note: Searching the file system for untracked files may take some time.
+git status -uno is faster than git status -uall.
+There is a trade-off around the use of -uno between safety and performance.
+The default is not to use -uno so that you will not forget to add a file you newly created (i.e safety).
+You would pay for the safety with the cost to find such untracked files (i.e. performance).
+
  --ignore-submodules[=<when>]::
      Ignore changes to submodules when looking for changes. <when> can be
      either "none", "untracked", "dirty" or "all", which is the default.
-- 
1.8.2.rc3.16.gce432ca

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [PATCH] status: hint the user about -uno if read_directory takes too long
  2013-03-15 15:52                                               ` Torsten Bögershausen
@ 2013-03-15 15:57                                                 ` Ramkumar Ramachandra
  2013-03-15 16:53                                                 ` Junio C Hamano
  1 sibling, 0 replies; 88+ messages in thread
From: Ramkumar Ramachandra @ 2013-03-15 15:57 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Duy Nguyen, Junio C Hamano, git, robert.allan.zeh, finnag

Torsten Bögershausen wrote:
> [PATCH] git status: Document that git status -uno is faster

Yes.  I like this patch.

> In some repostories users expere that "git status" command takes long time.
> The command spends some time searching the file system for untracked files.
> Document that searching for untracked file may take some time, and docuemnt
> the option -uno better.

Please correct the typos in the commit message.

> Signed-off-by: Torsten Bögershausen <tboegi@web.de>
> ---
>  Documentation/git-status.txt | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/Documentation/git-status.txt b/Documentation/git-status.txt
> index 0412c40..fd36bbd 100644
> --- a/Documentation/git-status.txt
> +++ b/Documentation/git-status.txt
> @@ -58,6 +58,13 @@ The possible options are:
>  The default can be changed using the status.showUntrackedFiles
>  configuration variable documented in linkgit:git-config[1].
>
> ++
> +Note: Searching the file system for untracked files may take some time.
> +git status -uno is faster than git status -uall.
> +There is a trade-off around the use of -uno between safety and performance.
> +The default is not to use -uno so that you will not forget to add a file
> you newly created (i.e safety).
> +You would pay for the safety with the cost to find such untracked files
> (i.e. performance).

Good writeup.  What -uno does is already documented, so you've
explained the trade-off.
Why didn't you just wrap the paragraph to 80 columns though?

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH] status: hint the user about -uno if read_directory takes too long
  2013-03-15 15:52                                               ` Torsten Bögershausen
  2013-03-15 15:57                                                 ` Ramkumar Ramachandra
@ 2013-03-15 16:53                                                 ` Junio C Hamano
  2013-03-15 17:41                                                   ` Torsten Bögershausen
  1 sibling, 1 reply; 88+ messages in thread
From: Junio C Hamano @ 2013-03-15 16:53 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Duy Nguyen, git, artagnon, robert.allan.zeh, finnag

Torsten Bögershausen <tboegi@web.de> writes:

> [PATCH] git status: Document that git status -uno is faster
>
> In some repostories users expere that "git status" command takes long time.

expere???  Certainly you did not mean "expect".  "observe",
"experience", or "see", perhaps?

> The command spends some time searching the file system for untracked files.
> Document that searching for untracked file may take some time, and docuemnt
> the option -uno better.

Good intentions.

> Signed-off-by: Torsten Bögershausen <tboegi@web.de>
> ---
>  Documentation/git-status.txt | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/Documentation/git-status.txt b/Documentation/git-status.txt
> index 0412c40..fd36bbd 100644
> --- a/Documentation/git-status.txt
> +++ b/Documentation/git-status.txt
> @@ -58,6 +58,13 @@ The possible options are:
>  The default can be changed using the status.showUntrackedFiles
>  configuration variable documented in linkgit:git-config[1].
>
> ++
> +Note: Searching the file system for untracked files may take some time.
> +git status -uno is faster than git status -uall.
> +There is a trade-off around the use of -uno between safety and performance.
> +The default is not to use -uno so that you will not forget to add a
> file you newly created (i.e safety).
> +You would pay for the safety with the cost to find such untracked
> files (i.e. performance).
> +

The second sentence looks out of flow, and the last sentence, while
technically not incorrect, is unclear what it is trying to convey in
the larger picture.

Perhaps it is just me.

In any case, I think it is a good idea to explain the reason why the
user might want to use a non-default setting, and the criteria the
user may want to base the choice on (which is the gist of your
addition), and it is a good idea to do so _before_ saying "The
default can be changed using ...".

How about this?

 Documentation/git-status.txt | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/Documentation/git-status.txt b/Documentation/git-status.txt
index 0412c40..9046df9 100644
--- a/Documentation/git-status.txt
+++ b/Documentation/git-status.txt
@@ -46,15 +46,21 @@ OPTIONS
 	Show untracked files.
 +
 The mode parameter is optional (defaults to 'all'), and is used to
-specify the handling of untracked files; when -u is not used, the
-default is 'normal', i.e. show untracked files and directories.
+specify the handling of untracked files.
 +
 The possible options are:
 +
-	- 'no'     - Show no untracked files
-	- 'normal' - Shows untracked files and directories
+	- 'no'     - Show no untracked files.
+	- 'normal' - Shows untracked files and directories.
 	- 'all'    - Also shows individual files in untracked directories.
 +
+When `-u` option is not used, untracked files and directories are
+shown (i.e. the same as specifying `normal`), to help you avoid
+forgetting to add newly created files.  Because it takes extra work
+to find untracked files in the filesystem, this mode may take some
+time in a large working tree.  You can use `no` to have `git status`
+return more quickly without showing untracked files.
++
 The default can be changed using the status.showUntrackedFiles
 configuration variable documented in linkgit:git-config[1].
 

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [PATCH] status: hint the user about -uno if read_directory takes too long
  2013-03-15 16:53                                                 ` Junio C Hamano
@ 2013-03-15 17:41                                                   ` Torsten Bögershausen
  2013-03-15 20:06                                                     ` Junio C Hamano
  0 siblings, 1 reply; 88+ messages in thread
From: Torsten Bögershausen @ 2013-03-15 17:41 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Torsten Bögershausen, Duy Nguyen, git, artagnon,
	robert.allan.zeh, finnag

On 03/15/2013 05:53 PM, Junio C Hamano wrote:
> Torsten Bögershausen<tboegi@web.de>  writes:
>
>> [PATCH] git status: Document that git status -uno is faster
>>
>> In some repostories users expere that "git status" command takes long time.
> expere???  Certainly you did not mean "expect".  "observe",
> "experience", or "see", perhaps?
>
>> The command spends some time searching the file system for untracked files.
>> Document that searching for untracked file may take some time, and docuemnt
>> the option -uno better.
> Good intentions.
>
>> Signed-off-by: Torsten Bögershausen<tboegi@web.de>
>> ---
>>   Documentation/git-status.txt | 7 +++++++
>>   1 file changed, 7 insertions(+)
>>
>> diff --git a/Documentation/git-status.txt b/Documentation/git-status.txt
>> index 0412c40..fd36bbd 100644
>> --- a/Documentation/git-status.txt
>> +++ b/Documentation/git-status.txt
>> @@ -58,6 +58,13 @@ The possible options are:
>>   The default can be changed using the status.showUntrackedFiles
>>   configuration variable documented in linkgit:git-config[1].
>>
>> ++
>> +Note: Searching the file system for untracked files may take some time.
>> +git status -uno is faster than git status -uall.
>> +There is a trade-off around the use of -uno between safety and performance.
>> +The default is not to use -uno so that you will not forget to add a
>> file you newly created (i.e safety).
>> +You would pay for the safety with the cost to find such untracked
>> files (i.e. performance).
>> +
> The second sentence looks out of flow, and the last sentence, while
> technically not incorrect, is unclear what it is trying to convey in
> the larger picture.
>
> Perhaps it is just me.
>
> In any case, I think it is a good idea to explain the reason why the
> user might want to use a non-default setting, and the criteria the
> user may want to base the choice on (which is the gist of your
> addition), and it is a good idea to do so _before_ saying "The
> default can be changed using ...".
>
> How about this?
>
>   Documentation/git-status.txt | 14 ++++++++++----
>   1 file changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/git-status.txt b/Documentation/git-status.txt
> index 0412c40..9046df9 100644
> --- a/Documentation/git-status.txt
> +++ b/Documentation/git-status.txt
> @@ -46,15 +46,21 @@ OPTIONS
>   	Show untracked files.
>   +
>   The mode parameter is optional (defaults to 'all'), and is used to
> -specify the handling of untracked files; when -u is not used, the
> -default is 'normal', i.e. show untracked files and directories.
> +specify the handling of untracked files.
>   +
>   The possible options are:
>   +
> -	- 'no'     - Show no untracked files
> -	- 'normal' - Shows untracked files and directories
> +	- 'no'     - Show no untracked files.
> +	- 'normal' - Shows untracked files and directories.
>   	- 'all'    - Also shows individual files in untracked directories.
>   +
> +When `-u` option is not used, untracked files and directories are
> +shown (i.e. the same as specifying `normal`), to help you avoid
> +forgetting to add newly created files.  Because it takes extra work
> +to find untracked files in the filesystem, this mode may take some
> +time in a large working tree.  You can use `no` to have `git status`
(Small nit: extra space before the "You" in the line above)

Thanks, I like that much better than mine
(and expere is probably a word not yet invented)
/Torsten

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH] status: hint the user about -uno if read_directory takes too long
  2013-03-15 17:41                                                   ` Torsten Bögershausen
@ 2013-03-15 20:06                                                     ` Junio C Hamano
  2013-03-15 21:14                                                       ` Torsten Bögershausen
  0 siblings, 1 reply; 88+ messages in thread
From: Junio C Hamano @ 2013-03-15 20:06 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Duy Nguyen, git, artagnon, robert.allan.zeh, finnag

Torsten Bögershausen <tboegi@web.de> writes:

> Thanks, I like that much better than mine
> (and expere is probably a word not yet invented)

OK, then how about redoing Duy's patch like this on top?

I've moved the timing collection from the caller to callee, and I
think the result is more readable.  The message looked easier to see
with a leading blank line, so I added one.
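
For readers who want to see the timing arithmetic in isolation, here is
a small standalone sketch of the same idea: sample gettimeofday() before
and after the work, convert both samples to milliseconds, and compare
the difference against the 2-second threshold. The helper names
(timeval_to_ms, elapsed_ms, should_hint) are made up for this example
and do not appear in the patch; the clamp to zero is also an addition,
since gettimeofday() reads the wall clock and that can step backwards.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* Convert a timeval to whole milliseconds, widening before multiplying. */
uint64_t timeval_to_ms(const struct timeval *tv)
{
	return (uint64_t)tv->tv_sec * 1000 + (uint64_t)tv->tv_usec / 1000;
}

/* Milliseconds between two samples; 0 if the clock appears to go backwards. */
uint64_t elapsed_ms(const struct timeval *begin, const struct timeval *end)
{
	uint64_t b = timeval_to_ms(begin);
	uint64_t e = timeval_to_ms(end);

	return e > b ? e - b : 0;
}

/* The hint is shown only if advice is enabled and more than 2000 ms passed. */
int should_hint(int advice_enabled, uint64_t ms)
{
	return advice_enabled && 2000 < ms;
}
```

A caller would take one struct timeval with gettimeofday(&t_begin, NULL)
before the directory walk, a second one afterwards, and pass
elapsed_ms(&t_begin, &t_end) to should_hint().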

-- >8 --
From: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Date: Wed, 13 Mar 2013 19:59:16 +0700
Subject: [PATCH] status: advise to consider use of -u when read_directory takes too long

Introduce advice.statusUoption to suggest considering use of -u to
strike different trade-off when it took more than 2 seconds to
enumerate untracked/ignored files.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/config.txt |  4 ++++
 advice.c                 |  2 ++
 advice.h                 |  1 +
 t/t7060-wtstatus.sh      |  1 +
 t/t7508-status.sh        |  1 +
 t/t7512-status-help.sh   |  1 +
 wt-status.c              | 21 +++++++++++++++++++++
 wt-status.h              |  1 +
 8 files changed, 32 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index d1de857..a16eda5 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -163,6 +163,10 @@ advice.*::
 		state in the output of linkgit:git-status[1] and in
 		the template shown when writing commit messages in
 		linkgit:git-commit[1].
+	statusUoption::
+		Advise to consider using the `-u` option to linkgit:git-status[1]
+		when the command takes more than 2 seconds to enumerate untracked
+		files.
 	commitBeforeMerge::
 		Advice shown when linkgit:git-merge[1] refuses to
 		merge to avoid overwriting local changes.
diff --git a/advice.c b/advice.c
index edfbd4a..015011f 100644
--- a/advice.c
+++ b/advice.c
@@ -5,6 +5,7 @@ int advice_push_non_ff_current = 1;
 int advice_push_non_ff_default = 1;
 int advice_push_non_ff_matching = 1;
 int advice_status_hints = 1;
+int advice_status_u_option = 1;
 int advice_commit_before_merge = 1;
 int advice_resolve_conflict = 1;
 int advice_implicit_identity = 1;
@@ -19,6 +20,7 @@ static struct {
 	{ "pushnonffdefault", &advice_push_non_ff_default },
 	{ "pushnonffmatching", &advice_push_non_ff_matching },
 	{ "statushints", &advice_status_hints },
+	{ "statusuoption", &advice_status_u_option },
 	{ "commitbeforemerge", &advice_commit_before_merge },
 	{ "resolveconflict", &advice_resolve_conflict },
 	{ "implicitidentity", &advice_implicit_identity },
diff --git a/advice.h b/advice.h
index f3cdbbf..e3e665d 100644
--- a/advice.h
+++ b/advice.h
@@ -8,6 +8,7 @@ extern int advice_push_non_ff_current;
 extern int advice_push_non_ff_default;
 extern int advice_push_non_ff_matching;
 extern int advice_status_hints;
+extern int advice_status_u_option;
 extern int advice_commit_before_merge;
 extern int advice_resolve_conflict;
 extern int advice_implicit_identity;
diff --git a/t/t7060-wtstatus.sh b/t/t7060-wtstatus.sh
index f4f38a5..52ef06b 100755
--- a/t/t7060-wtstatus.sh
+++ b/t/t7060-wtstatus.sh
@@ -5,6 +5,7 @@ test_description='basic work tree status reporting'
 . ./test-lib.sh
 
 test_expect_success setup '
+	git config --global advice.statusuoption false &&
 	test_commit A &&
 	test_commit B oneside added &&
 	git checkout A^0 &&
diff --git a/t/t7508-status.sh b/t/t7508-status.sh
index e313ef1..15e063a 100755
--- a/t/t7508-status.sh
+++ b/t/t7508-status.sh
@@ -8,6 +8,7 @@ test_description='git status'
 . ./test-lib.sh
 
 test_expect_success 'status -h in broken repository' '
+	git config --global advice.statusuoption false &&
 	mkdir broken &&
 	test_when_finished "rm -fr broken" &&
 	(
diff --git a/t/t7512-status-help.sh b/t/t7512-status-help.sh
index b3f6eb9..2d53e03 100755
--- a/t/t7512-status-help.sh
+++ b/t/t7512-status-help.sh
@@ -14,6 +14,7 @@ test_description='git status advices'
 set_fake_editor
 
 test_expect_success 'prepare for conflicts' '
+	git config --global advice.statusuoption false &&
 	test_commit init main.txt init &&
 	git branch conflicts &&
 	test_commit on_master main.txt on_master &&
diff --git a/wt-status.c b/wt-status.c
index 2a9658b..6e75468 100644
--- a/wt-status.c
+++ b/wt-status.c
@@ -496,9 +496,14 @@ static void wt_status_collect_untracked(struct wt_status *s)
 {
 	int i;
 	struct dir_struct dir;
+	struct timeval t_begin;
 
 	if (!s->show_untracked_files)
 		return;
+
+	if (advice_status_u_option)
+		gettimeofday(&t_begin, NULL);
+
 	memset(&dir, 0, sizeof(dir));
 	if (s->show_untracked_files != SHOW_ALL_UNTRACKED_FILES)
 		dir.flags |=
@@ -528,6 +533,14 @@ static void wt_status_collect_untracked(struct wt_status *s)
 	}
 
 	free(dir.entries);
+
+	if (advice_status_u_option) {
+		struct timeval t_end;
+		gettimeofday(&t_end, NULL);
+		s->untracked_in_ms =
+			(uint64_t)t_end.tv_sec * 1000 + t_end.tv_usec / 1000 -
+			((uint64_t)t_begin.tv_sec * 1000 + t_begin.tv_usec / 1000);
+	}
 }
 
 void wt_status_collect(struct wt_status *s)
@@ -1011,6 +1024,14 @@ void wt_status_print(struct wt_status *s)
 		wt_status_print_other(s, &s->untracked, _("Untracked files"), "add");
 		if (s->show_ignored_files)
 			wt_status_print_other(s, &s->ignored, _("Ignored files"), "add -f");
+		if (advice_status_u_option && 2000 < s->untracked_in_ms) {
+			status_printf_ln(s, GIT_COLOR_NORMAL, "");
+			status_printf_ln(s, GIT_COLOR_NORMAL,
+				 _("It took %.2f seconds to enumerate untracked files."),
+				 s->untracked_in_ms / 1000.0);
+			status_printf_ln(s, GIT_COLOR_NORMAL,
+				 _("Consider the -u option for a possible speed-up?"));
+		}
 	} else if (s->commitable)
 		status_printf_ln(s, GIT_COLOR_NORMAL, _("Untracked files not listed%s"),
 			advice_status_hints
diff --git a/wt-status.h b/wt-status.h
index 236b41f..09420d0 100644
--- a/wt-status.h
+++ b/wt-status.h
@@ -69,6 +69,7 @@ struct wt_status {
 	struct string_list change;
 	struct string_list untracked;
 	struct string_list ignored;
+	uint32_t untracked_in_ms;
 };
 
 struct wt_status_state {
-- 
1.8.2-279-g744670c

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [PATCH] status: hint the user about -uno if read_directory takes too long
  2013-03-15 20:06                                                     ` Junio C Hamano
@ 2013-03-15 21:14                                                       ` Torsten Bögershausen
  2013-03-15 21:59                                                         ` Junio C Hamano
  0 siblings, 1 reply; 88+ messages in thread
From: Torsten Bögershausen @ 2013-03-15 21:14 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Torsten Bögershausen, Duy Nguyen, git, artagnon,
	robert.allan.zeh, finnag

On 15.03.13 21:06, Junio C Hamano wrote:
> Torsten Bögershausen <tboegi@web.de> writes:
> 
>> Thanks, I like that much better than mine
>> (and expere is probably a word not yet invented)
> OK, then how about redoing Duy's patch like this on top?
> 
> I've moved the timing collection from the caller to callee, and I
> think the result is more readable.  The message looked easier to see
> with a leading blank line, so I added one.
> 
> -- >8 --
> From: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> Date: Wed, 13 Mar 2013 19:59:16 +0700
> Subject: [PATCH] status: advise to consider use of -u when read_directory takes too long
> 
> Introduce advice.statusUoption to suggest considering use of -u to
> strike different trade-off when it took more than 2 seconds to
> enumerate untracked/ignored files.
> 
> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> Signed-off-by: Junio C Hamano <gitster@pobox.com>
> ---
>  Documentation/config.txt |  4 ++++
>  advice.c                 |  2 ++
>  advice.h                 |  1 +
>  t/t7060-wtstatus.sh      |  1 +
>  t/t7508-status.sh        |  1 +
>  t/t7512-status-help.sh   |  1 +
>  wt-status.c              | 21 +++++++++++++++++++++
>  wt-status.h              |  1 +
>  8 files changed, 32 insertions(+)
> 
> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index d1de857..a16eda5 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -163,6 +163,10 @@ advice.*::
>  		state in the output of linkgit:git-status[1] and in
>  		the template shown when writing commit messages in
>  		linkgit:git-commit[1].
> +	statusUoption::
> +		Advise to consider using the `-u` option to linkgit:git-status[1]
> +		when the command takes more than 2 seconds to enumerate untracked
> +		files.
>  	commitBeforeMerge::
>  		Advice shown when linkgit:git-merge[1] refuses to
>  		merge to avoid overwriting local changes.
> diff --git a/advice.c b/advice.c
> index edfbd4a..015011f 100644
> --- a/advice.c
> +++ b/advice.c
> @@ -5,6 +5,7 @@ int advice_push_non_ff_current = 1;
>  int advice_push_non_ff_default = 1;
>  int advice_push_non_ff_matching = 1;
>  int advice_status_hints = 1;
> +int advice_status_u_option = 1;
>  int advice_commit_before_merge = 1;
>  int advice_resolve_conflict = 1;
>  int advice_implicit_identity = 1;
> @@ -19,6 +20,7 @@ static struct {
>  	{ "pushnonffdefault", &advice_push_non_ff_default },
>  	{ "pushnonffmatching", &advice_push_non_ff_matching },
>  	{ "statushints", &advice_status_hints },
> +	{ "statusuoption", &advice_status_u_option },
>  	{ "commitbeforemerge", &advice_commit_before_merge },
>  	{ "resolveconflict", &advice_resolve_conflict },
>  	{ "implicitidentity", &advice_implicit_identity },
> diff --git a/advice.h b/advice.h
> index f3cdbbf..e3e665d 100644
> --- a/advice.h
> +++ b/advice.h
> @@ -8,6 +8,7 @@ extern int advice_push_non_ff_current;
>  extern int advice_push_non_ff_default;
>  extern int advice_push_non_ff_matching;
>  extern int advice_status_hints;
> +extern int advice_status_u_option;
>  extern int advice_commit_before_merge;
>  extern int advice_resolve_conflict;
>  extern int advice_implicit_identity;
> diff --git a/t/t7060-wtstatus.sh b/t/t7060-wtstatus.sh
> index f4f38a5..52ef06b 100755
> --- a/t/t7060-wtstatus.sh
> +++ b/t/t7060-wtstatus.sh
> @@ -5,6 +5,7 @@ test_description='basic work tree status reporting'
>  . ./test-lib.sh
>  
>  test_expect_success setup '
> +	git config --global advice.statusuoption false &&
>  	test_commit A &&
>  	test_commit B oneside added &&
>  	git checkout A^0 &&
> diff --git a/t/t7508-status.sh b/t/t7508-status.sh
> index e313ef1..15e063a 100755
> --- a/t/t7508-status.sh
> +++ b/t/t7508-status.sh
> @@ -8,6 +8,7 @@ test_description='git status'
>  . ./test-lib.sh
>  
>  test_expect_success 'status -h in broken repository' '
> +	git config --global advice.statusuoption false &&
>  	mkdir broken &&
>  	test_when_finished "rm -fr broken" &&
>  	(
> diff --git a/t/t7512-status-help.sh b/t/t7512-status-help.sh
> index b3f6eb9..2d53e03 100755
> --- a/t/t7512-status-help.sh
> +++ b/t/t7512-status-help.sh
> @@ -14,6 +14,7 @@ test_description='git status advices'
>  set_fake_editor
>  
>  test_expect_success 'prepare for conflicts' '
> +	git config --global advice.statusuoption false &&
>  	test_commit init main.txt init &&
>  	git branch conflicts &&
>  	test_commit on_master main.txt on_master &&
> diff --git a/wt-status.c b/wt-status.c
> index 2a9658b..6e75468 100644
> --- a/wt-status.c
> +++ b/wt-status.c
> @@ -496,9 +496,14 @@ static void wt_status_collect_untracked(struct wt_status *s)
>  {
>  	int i;
>  	struct dir_struct dir;
> +	struct timeval t_begin;
>  
>  	if (!s->show_untracked_files)
>  		return;
> +
> +	if (advice_status_u_option)
> +		gettimeofday(&t_begin, NULL);
> +
>  	memset(&dir, 0, sizeof(dir));
>  	if (s->show_untracked_files != SHOW_ALL_UNTRACKED_FILES)
>  		dir.flags |=
> @@ -528,6 +533,14 @@ static void wt_status_collect_untracked(struct wt_status *s)
>  	}
>  
>  	free(dir.entries);
> +
> +	if (advice_status_u_option) {
> +		struct timeval t_end;
> +		gettimeofday(&t_end, NULL);
> +		s->untracked_in_ms =
> +			(uint64_t)t_end.tv_sec * 1000 + t_end.tv_usec / 1000 -
> +			((uint64_t)t_begin.tv_sec * 1000 + t_begin.tv_usec / 1000);
> +	}
>  }
>  
>  void wt_status_collect(struct wt_status *s)
> @@ -1011,6 +1024,14 @@ void wt_status_print(struct wt_status *s)
>  		wt_status_print_other(s, &s->untracked, _("Untracked files"), "add");
>  		if (s->show_ignored_files)
>  			wt_status_print_other(s, &s->ignored, _("Ignored files"), "add -f");
> +		if (advice_status_u_option && 2000 < s->untracked_in_ms) {
> +			status_printf_ln(s, GIT_COLOR_NORMAL, "");
> +			status_printf_ln(s, GIT_COLOR_NORMAL,
> +				 _("It took %.2f seconds to enumerate untracked files."),
> +				 s->untracked_in_ms / 1000.0);
> +			status_printf_ln(s, GIT_COLOR_NORMAL,
> +				 _("Consider the -u option for a possible speed-up?"));
> +		}
>  	} else if (s->commitable)
>  		status_printf_ln(s, GIT_COLOR_NORMAL, _("Untracked files not listed%s"),
>  			advice_status_hints
> diff --git a/wt-status.h b/wt-status.h
> index 236b41f..09420d0 100644
> --- a/wt-status.h
> +++ b/wt-status.h
> @@ -69,6 +69,7 @@ struct wt_status {
>  	struct string_list change;
>  	struct string_list untracked;
>  	struct string_list ignored;
> +	uint32_t untracked_in_ms;
>  };
>  
>  struct wt_status_state {
> -- 
> 1.8.2-279-g744670c
> 
Thanks, that looks good to me:

# It took 2.58 seconds to enumerate untracked files.
# Consider the -u option for a possible speed-up?

But:
If I follow the advice as given and use "git status -u", the result is the same.


Thinking out loud, would it be better to say:

# It took 2.58 seconds to search for untracked files.
# Consider the -uno option for a possible speed-up?

or

# It took 2.58 seconds to search for untracked files.
# Consider the -u option for a possible speed-up?
# Please see git help status

/Torsten

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH] status: hint the user about -uno if read_directory takes too long
  2013-03-15 21:14                                                       ` Torsten Bögershausen
@ 2013-03-15 21:59                                                         ` Junio C Hamano
  2013-03-16  7:21                                                           ` Torsten Bögershausen
  0 siblings, 1 reply; 88+ messages in thread
From: Junio C Hamano @ 2013-03-15 21:59 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Duy Nguyen, git, artagnon, robert.allan.zeh, finnag

Torsten Bögershausen <tboegi@web.de> writes:

> Thanks, that looks good to me:
>
> # It took 2.58 seconds to enumerate untracked files.
> # Consider the -u option for a possible speed-up?
>
> But:
> If I follow the advice as is given and use "git status -u", the result is the same.

Yeah, that was taken from

    http://thread.gmane.org/gmane.comp.version-control.git/215820/focus=218125

to which I said something about "more levels of indirections".  This
episode shows that even a user who was very well aware of the issue
did not follow a single level of indirection.

> If I think loud, would it be better to say:
>
> # It took 2.58 seconds to search for untracked files.
> # Consider the -uno option for a possible speed-up?
>
> or
>
> # It took 2.58 seconds to search for untracked files.
> # Consider the -u option for a possible speed-up?
> # Please see git help status

The former actively hurts the users, but the latter would be good,
given that your documentation update clarifies the trade-off.

Or we can be more explicit and say

# It took 2.58 seconds to search for untracked files.  'status -uno'
# may speed it up, but you have to be careful not to forget to add
# new files yourself (see 'git help status').

or something.
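
The timing-and-threshold logic behind this hint can be sketched on its own.
This is only an illustration, assuming the 2000 ms threshold from the patch
under discussion; elapsed_ms() and format_slow_scan_hint() are hypothetical
names that do not appear in the patch.

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>

/* Hypothetical helper: milliseconds elapsed between two timestamps. */
static uint64_t elapsed_ms(const struct timeval *begin,
			   const struct timeval *end)
{
	uint64_t b = (uint64_t)begin->tv_sec * 1000 + begin->tv_usec / 1000;
	uint64_t e = (uint64_t)end->tv_sec * 1000 + end->tv_usec / 1000;

	return e - b;
}

/*
 * Hypothetical helper: format the hint into buf when the scan took longer
 * than 2000 ms (the threshold used in the patch).
 * Returns 1 when a hint was written, 0 otherwise.
 */
static int format_slow_scan_hint(char *buf, size_t len, uint64_t ms)
{
	if (ms <= 2000)
		return 0;
	snprintf(buf, len,
		 "It took %.2f seconds to search for untracked files.  'status -uno'\n"
		 "may speed it up, but you have to be careful not to forget to add\n"
		 "new files yourself (see 'git help status').",
		 ms / 1000.0);
	return 1;
}
```

A caller would take one gettimeofday() reading before the directory scan and
one after it, and pass both to elapsed_ms().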

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH] status: hint the user about -uno if read_directory takes too long
  2013-03-14 10:22                                         ` Duy Nguyen
  2013-03-14 15:05                                           ` Junio C Hamano
@ 2013-03-16  1:51                                           ` Duy Nguyen
  1 sibling, 0 replies; 88+ messages in thread
From: Duy Nguyen @ 2013-03-16  1:51 UTC (permalink / raw)
  To: Junio C Hamano, tboegi; +Cc: git, artagnon, robert.allan.zeh, finnag

On Thu, Mar 14, 2013 at 5:22 PM, Duy Nguyen <pclouds@gmail.com> wrote:
> On Wed, Mar 13, 2013 at 11:16 PM, Junio C Hamano <gitster@pobox.com> wrote:
>> The noise this introduces to the test suite is a bit irritating and
>> makes us think twice about whether this is really a good change.
>
> I originally thought of two options, this or add an env flag in git
> binary that turns this off in the test suite. The latter did not sound
> good. But I forgot that we set a fake $HOME in the test suite, we
> could disable this in $HOME/.gitconfig, less clutter in individual
> tests.

FWIW, adding this to $HOME/.gitconfig by default in test-lib.sh does not
work. Elsewhere we check the output of "git config --list", and the new
global config key makes those tests fail.
-- 
Duy

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH] status: hint the user about -uno if read_directory takes too long
  2013-03-15 21:59                                                         ` Junio C Hamano
@ 2013-03-16  7:21                                                           ` Torsten Bögershausen
  2013-03-17  4:47                                                             ` Junio C Hamano
  0 siblings, 1 reply; 88+ messages in thread
From: Torsten Bögershausen @ 2013-03-16  7:21 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Torsten Bögershausen, Duy Nguyen, git, artagnon,
	robert.allan.zeh, finnag

On 15.03.13 22:59, Junio C Hamano wrote:
> Torsten Bögershausen <tboegi@web.de> writes:
> 
>> Thanks, that looks good to me:
>>
>> # It took 2.58 seconds to enumerate untracked files.
>> # Consider the -u option for a possible speed-up?
>>
>> But:
>> If I follow the advice as is given and use "git status -u", the result is the same.
> 
> Yeah, that was taken from
> 
>     http://thread.gmane.org/gmane.comp.version-control.git/215820/focus=218125
> 
> to which I said something about "more levels of indirections".  This
> episode shows that even a user who was very well aware of the issue
> did not follow a single level of indirection.
> 
>> If I think loud, would it be better to say:
>>
>> # It took 2.58 seconds to search for untracked files.
>> # Consider the -uno option for a possible speed-up?
>>
>> or
>>
>> # It took 2.58 seconds to search for untracked files.
>> # Consider the -u option for a possible speed-up?
>> # Please see git help status
> 
> The former actively hurts the users, but the latter would be good,
> given that your documentation updates clarifies the trade off.
> 
> Or we can be more explicit and say
> 
> # It took 2.58 seconds to search for untracked files.  'status -uno'
> # may speed it up, but you have to be careful not to forget to add
> # new files yourself (see 'git help status').
> 
Thanks, that looks good to me.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH] status: hint the user about -uno if read_directory takes too long
  2013-03-16  7:21                                                           ` Torsten Bögershausen
@ 2013-03-17  4:47                                                             ` Junio C Hamano
  0 siblings, 0 replies; 88+ messages in thread
From: Junio C Hamano @ 2013-03-17  4:47 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Duy Nguyen, git, artagnon, robert.allan.zeh, finnag

Torsten Bögershausen <tboegi@web.de> writes:

>> Or we can be more explicit and say
>> 
>> # It took 2.58 seconds to search for untracked files.  'status -uno'
>> # may speed it up, but you have to be careful not to forget to add
>> # new files yourself (see 'git help status').
>> 
> Thanks, that looks good to me.

OK, then I'll squash this in to the version queued to 'pu', but we
should start thinking about merging these multi-line messages into
one multi-line string that is split by the output layer to help the
localization folks, using something like strbuf_commented_addf() and
strbuf_add_commented_lines().
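
As a rough illustration of that idea, the translator would see one multi-line
string, and an output layer would split it into '#'-prefixed lines. The
function below is a hypothetical, self-contained stand-in written for this
sketch; it is not Git's strbuf API and makes no claim about how
strbuf_add_commented_lines() is implemented.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical stand-in for an output layer: copy msg into out, prefixing
 * every line with "# " (or a bare "#" for an empty line).
 * Returns the number of lines emitted.
 */
static int add_commented_lines(char *out, size_t outlen, const char *msg)
{
	size_t used = 0;
	int lines = 0;

	if (outlen)
		out[0] = '\0';
	while (*msg) {
		const char *eol = strchr(msg, '\n');
		size_t len = eol ? (size_t)(eol - msg) : strlen(msg);
		int n = snprintf(out + used, outlen - used, "%s%.*s\n",
				 len ? "# " : "#", (int)len, msg);

		if (n < 0 || (size_t)n >= outlen - used) {
			/* Output buffer full: drop the partial line and stop. */
			if (used < outlen)
				out[used] = '\0';
			break;
		}
		used += (size_t)n;
		lines++;
		msg += len + (eol ? 1 : 0);
	}
	return lines;
}
```

With this shape, the three status_printf_ln() calls in the patch below could
in principle collapse into one translatable string containing embedded
newlines.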

diff --git a/wt-status.c b/wt-status.c
index 6e75468..53c2222 100644
--- a/wt-status.c
+++ b/wt-status.c
@@ -1027,10 +1027,14 @@ void wt_status_print(struct wt_status *s)
 		if (advice_status_u_option && 2000 < s->untracked_in_ms) {
 			status_printf_ln(s, GIT_COLOR_NORMAL, "");
 			status_printf_ln(s, GIT_COLOR_NORMAL,
-				 _("It took %.2f seconds to enumerate untracked files."),
+				 _("It took %.2f seconds to enumerate untracked files."
+				   "  'status -uno'"),
 				 s->untracked_in_ms / 1000.0);
 			status_printf_ln(s, GIT_COLOR_NORMAL,
-				 _("Consider the -u option for a possible speed-up?"));
+				 _("may speed it up, but you have to be careful not"
+				   " to forget to add"));
+			status_printf_ln(s, GIT_COLOR_NORMAL,
+				 _("new files yourself (see 'git help status')."));
 		}
 	} else if (s->commitable)
 		status_printf_ln(s, GIT_COLOR_NORMAL, _("Untracked files not listed%s"),

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [PATCH] inotify to minimize stat() calls
  2013-02-10 19:03             ` Robert Zeh
  2013-02-10 19:26               ` Martin Fick
  2013-02-11  3:21               ` Duy Nguyen
@ 2013-04-24 17:20               ` Robert Zeh
  2013-04-24 21:32                 ` Duy Nguyen
                                   ` (2 more replies)
  2 siblings, 3 replies; 88+ messages in thread
From: Robert Zeh @ 2013-04-24 17:20 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Ramkumar Ramachandra, Git List, Duy Nguyen

Here is a patch that creates a daemon that tracks file
state with inotify, writes it out to a file upon request,
and changes most of the calls to stat to use said cache.

It has bugs, but I figured it would be smarter to see
if the approach was acceptable at all before spending the
time to root the bugs out.

I've implemented the communication with a file, and not a socket, 
because I think implementing a socket is going to create
security issues on multiuser systems.  For example, would a
socket allow stat information to cross user boundaries?
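
The file-based handoff relies on writing the snapshot to a temporary file and
then renaming it into place, so a reader sees either the previous snapshot or
the complete new one. A minimal sketch of that pattern, assuming both paths
are on the same filesystem; publish_snapshot() is a hypothetical name used
only for this illustration:

```c
#include <stdio.h>

/*
 * Hypothetical helper: write data to tmp_path, then rename it over
 * final_path. Returns 0 on success, -1 on failure.
 */
static int publish_snapshot(const char *tmp_path, const char *final_path,
			    const void *data, size_t len)
{
	FILE *fp = fopen(tmp_path, "wb");

	if (!fp)
		return -1;
	if (fwrite(data, 1, len, fp) != len) {
		fclose(fp);
		remove(tmp_path);
		return -1;
	}
	/* fclose() flushes stdio buffers; a failure here means lost data. */
	if (fclose(fp) != 0) {
		remove(tmp_path);
		return -1;
	}
	if (rename(tmp_path, final_path) != 0) {
		remove(tmp_path);
		return -1;
	}
	return 0;
}
```

Because the snapshot is an ordinary file, access to it is governed by the
usual file permissions of the repository.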


Most stat calls are redirected to a cache maintained by a daemon
that tracks file system state via inotify.

Signed-off-by: Robert Zeh <robert.allan.zeh@gmail.com>
---
  abspath.c            |   9 ++-
  bisect.c             |   3 +-
  check-racy.c         |   2 +-
  combine-diff.c       |   3 +-
  command-list.txt     |   1 +
  config.c             |   3 +-
  copy.c               |   3 +-
  diff-lib.c           |   3 +-
  diff-no-index.c      |   3 +-
  diff.c               |   9 ++-
  diffcore-order.c     |   3 +-
  dir.c                |   4 +-
  filechange-cache.c   | 203 +++++++++++++++++++++++++++++++++++++++++++++++++++
  filechange-cache.h   |  20 +++++
  filechange-daemon.c  | 164 +++++++++++++++++++++++++++++++++++++++++
  filechange-printer.c |  13 ++++
  git.c                |  27 +++++++
  ll-merge.c           |   3 +-
  merge-recursive.c    |   5 +-
  name-hash.c          |   3 +-
  name-hash.h          |   1 +
  notes-merge.c        |   3 +-
  path.c               |   5 +-
  read-cache.c         |  11 +--
  rerere.c             |   7 +-
  setup.c              |   5 +-
  test-chmtime.c       |   2 +-
  test-wildmatch.c     |   2 +-
  unpack-trees.c       |   6 +-
  29 files changed, 486 insertions(+), 40 deletions(-)
  create mode 100644 filechange-cache.c
  create mode 100644 filechange-cache.h
  create mode 100644 filechange-daemon.c
  create mode 100644 filechange-printer.c
  create mode 100644 name-hash.h

diff --git a/abspath.c b/abspath.c
index 40cdc46..798c005 100644
--- a/abspath.c
+++ b/abspath.c
@@ -1,3 +1,4 @@
+#include "filechange-cache.h"
  #include "cache.h"

  /*
@@ -8,7 +9,7 @@
  int is_directory(const char *path)
  {
  	struct stat st;
-	return (!stat(path, &st) && S_ISDIR(st.st_mode));
+	return (!cached_stat(path, &st) && S_ISDIR(st.st_mode));
  }

  /* We allow "recursive" symbolic links. Only within reason, though. */
@@ -117,7 +118,7 @@ static const char *real_path_internal(const char *path, int die_on_error)
  			last_elem = NULL;
  		}

-		if (!lstat(buf, &st) && S_ISLNK(st.st_mode)) {
+		if (!cached_lstat(buf, &st) && S_ISLNK(st.st_mode)) {
  			ssize_t len = readlink(buf, next_buf, PATH_MAX);
  			if (len < 0) {
  				if (die_on_error)
@@ -167,9 +168,9 @@ static const char *get_pwd_cwd(void)
  		return NULL;
  	pwd = getenv("PWD");
  	if (pwd && strcmp(pwd, cwd)) {
-		stat(cwd, &cwd_stat);
+		cached_stat(cwd, &cwd_stat);
  		if ((cwd_stat.st_dev || cwd_stat.st_ino) &&
-		    !stat(pwd, &pwd_stat) &&
+		    !cached_stat(pwd, &pwd_stat) &&
  		    pwd_stat.st_dev == cwd_stat.st_dev &&
  		    pwd_stat.st_ino == cwd_stat.st_ino) {
  			strlcpy(cwd, pwd, PATH_MAX);
diff --git a/bisect.c b/bisect.c
index bd1b7b5..d4b1af7 100644
--- a/bisect.c
+++ b/bisect.c
@@ -1,6 +1,7 @@
  #include "cache.h"
  #include "commit.h"
  #include "diff.h"
+#include "filechange-cache.h"
  #include "revision.h"
  #include "refs.h"
  #include "list-objects.h"
@@ -649,7 +650,7 @@ static int is_expected_rev(const unsigned char *sha1)
  	FILE *fp;
  	int res = 0;

-	if (stat(filename, &st) || !S_ISREG(st.st_mode))
+	if (cached_stat(filename, &st) || !S_ISREG(st.st_mode))
  		return 0;

  	fp = fopen(filename, "r");
diff --git a/check-racy.c b/check-racy.c
index 00d92a1..c54be01 100644
--- a/check-racy.c
+++ b/check-racy.c
@@ -11,7 +11,7 @@ int main(int ac, char **av)
  		struct cache_entry *ce = active_cache[i];
  		struct stat st;

-		if (lstat(ce->name, &st)) {
+		if (cached_lstat(ce->name, &st)) {
  			error("lstat(%s): %s", ce->name, strerror(errno));
  			continue;
  		}
diff --git a/combine-diff.c b/combine-diff.c
index 35d41cd..b6a09a5 100644
--- a/combine-diff.c
+++ b/combine-diff.c
@@ -3,6 +3,7 @@
  #include "blob.h"
  #include "diff.h"
  #include "diffcore.h"
+#include "filechange-cache.h"
  #include "quote.h"
  #include "xdiff-interface.h"
  #include "log-tree.h"
@@ -806,7 +807,7 @@ static void show_patch_diff(struct combine_diff_path *elem, int num_parent,
  		struct stat st;
  		int fd = -1;

-		if (lstat(elem->path, &st) < 0)
+		if (cached_lstat(elem->path, &st) < 0)
  			goto deleted_file;

  		if (S_ISLNK(st.st_mode)) {
diff --git a/command-list.txt b/command-list.txt
index bf83303..9dec5e1 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -29,6 +29,7 @@ git-count-objects                       ancillaryinterrogators
  git-credential                          purehelpers
  git-credential-cache                    purehelpers
  git-credential-store                    purehelpers
+git-filechange-daemon			purehelpers
  git-cvsexportcommit                     foreignscminterface
  git-cvsimport                           foreignscminterface
  git-cvsserver                           foreignscminterface
diff --git a/config.c b/config.c
index aefd80b..99749fe 100644
--- a/config.c
+++ b/config.c
@@ -7,6 +7,7 @@
   */
  #include "cache.h"
  #include "exec_cmd.h"
+#include "filechange-cache.h"
  #include "strbuf.h"
  #include "quote.h"

@@ -1436,7 +1437,7 @@ int git_config_set_multivar_in_file(const char *config_filename,
  			goto out_free;
  		}

-		fstat(in_fd, &st);
+		cached_fstat(in_fd, &st);
  		contents_sz = xsize_t(st.st_size);
  		contents = xmmap(NULL, contents_sz, PROT_READ,
  			MAP_PRIVATE, in_fd, 0);
diff --git a/copy.c b/copy.c
index a7f58fd..972fabe 100644
--- a/copy.c
+++ b/copy.c
@@ -1,3 +1,4 @@
+#include "filechange-cache.h"
  #include "cache.h"

  int copy_fd(int ifd, int ofd)
@@ -39,7 +40,7 @@ static int copy_times(const char *dst, const char *src)
  {
  	struct stat st;
  	struct utimbuf times;
-	if (stat(src, &st) < 0)
+	if (cached_stat(src, &st) < 0)
  		return -1;
  	times.actime = st.st_atime;
  	times.modtime = st.st_mtime;
diff --git a/diff-lib.c b/diff-lib.c
index f35de0f..8d5a005 100644
--- a/diff-lib.c
+++ b/diff-lib.c
@@ -2,6 +2,7 @@
   * Copyright (C) 2005 Junio C Hamano
   */
  #include "cache.h"
+#include "filechange-cache.h"
  #include "quote.h"
  #include "commit.h"
  #include "diff.h"
@@ -27,7 +28,7 @@
   */
  static int check_removed(const struct cache_entry *ce, struct stat *st)
  {
-	if (lstat(ce->name, st) < 0) {
+	if (cached_lstat(ce->name, st) < 0) {
  		if (errno != ENOENT && errno != ENOTDIR)
  			return -1;
  		return 1;
diff --git a/diff-no-index.c b/diff-no-index.c
index 74da659..d3fb354 100644
--- a/diff-no-index.c
+++ b/diff-no-index.c
@@ -7,6 +7,7 @@
  #include "cache.h"
  #include "color.h"
  #include "commit.h"
+#include "filechange-cache.h"
  #include "blob.h"
  #include "tag.h"
  #include "diff.h"
@@ -51,7 +52,7 @@ static int get_mode(const char *path, int *mode)
  #endif
  	else if (path == file_from_standard_input)
  		*mode = create_ce_mode(0666);
-	else if (lstat(path, &st))
+	else if (cached_lstat(path, &st))
  		return error("Could not access '%s'", path);
  	else
  		*mode = st.st_mode;
diff --git a/diff.c b/diff.c
index 156fec4..a5be122 100644
--- a/diff.c
+++ b/diff.c
@@ -5,6 +5,7 @@
  #include "quote.h"
  #include "diff.h"
  #include "diffcore.h"
+#include "filechange-cache.h"
  #include "delta.h"
  #include "xdiff-interface.h"
  #include "color.h"
@@ -2629,7 +2630,7 @@ static int reuse_worktree_file(const char *name, const unsigned char *sha1, int
  	 * If ce matches the file in the work tree, we can reuse it.
  	 */
  	if (ce_uptodate(ce) ||
-	    (!lstat(name, &st) && !ce_match_stat(ce, &st, 0)))
+	    (!cached_lstat(name, &st) && !ce_match_stat(ce, &st, 0)))
  		return 1;

  	return 0;
@@ -2684,7 +2685,7 @@ int diff_populate_filespec(struct diff_filespec *s, int size_only)
  		struct stat st;
  		int fd;

-		if (lstat(s->path, &st) < 0) {
+		if (cached_lstat(s->path, &st) < 0) {
  			if (errno == ENOENT) {
  			err_empty:
  				err = -1;
@@ -2826,7 +2827,7 @@ static struct diff_tempfile *prepare_temp_file(const char *name,
  	if (!one->sha1_valid ||
  	    reuse_worktree_file(name, one->sha1, 1)) {
  		struct stat st;
-		if (lstat(name, &st) < 0) {
+		if (cached_lstat(name, &st) < 0) {
  			if (errno == ENOENT)
  				goto not_a_valid_file;
  			die_errno("stat(%s)", name);
@@ -3043,7 +3044,7 @@ static void diff_fill_sha1_info(struct diff_filespec *one)
  				hashcpy(one->sha1, null_sha1);
  				return;
  			}
-			if (lstat(one->path, &st) < 0)
+			if (cached_lstat(one->path, &st) < 0)
  				die_errno("stat '%s'", one->path);
  			if (index_path(one->sha1, one->path, &st, 0))
  				die("cannot hash %s", one->path);
diff --git a/diffcore-order.c b/diffcore-order.c
index 23e9385..636be01 100644
--- a/diffcore-order.c
+++ b/diffcore-order.c
@@ -4,6 +4,7 @@
  #include "cache.h"
  #include "diff.h"
  #include "diffcore.h"
+#include "filechange-cache.h"

  static char **order;
  static int order_cnt;
@@ -22,7 +23,7 @@ static void prepare_order(const char *orderfile)
  	fd = open(orderfile, O_RDONLY);
  	if (fd < 0)
  		return;
-	if (fstat(fd, &st)) {
+	if (cached_fstat(fd, &st)) {
  		close(fd);
  		return;
  	}
diff --git a/dir.c b/dir.c
index 57394e4..a67a592 100644
--- a/dir.c
+++ b/dir.c
@@ -476,7 +476,7 @@ int add_excludes_from_file_to_list(const char *fname,
  	char *buf, *entry;

  	fd = open(fname, O_RDONLY);
-	if (fd < 0 || fstat(fd, &st) < 0) {
+	if (fd < 0 || cached_fstat(fd, &st) < 0) {
  		if (errno != ENOENT)
  			warn_on_inaccessible(fname);
  		if (0 <= fd)
@@ -1551,7 +1551,7 @@ static int remove_dir_recurse(struct strbuf *path, int flag, int *kept_up)

  		strbuf_setlen(path, len);
  		strbuf_addstr(path, e->d_name);
-		if (lstat(path->buf, &st))
+		if (cached_lstat(path->buf, &st))
  			; /* fall thru */
  		else if (S_ISDIR(st.st_mode)) {
  			if (!remove_dir_recurse(path, flag, &kept_down))
diff --git a/filechange-cache.c b/filechange-cache.c
new file mode 100644
index 0000000..80c698f
--- /dev/null
+++ b/filechange-cache.c
@@ -0,0 +1,203 @@
+#include <unistd.h>
+#include <stdio.h>
+#include "builtin.h"
+#include "hash.h"
+#include "name-hash.h"
+#include "strbuf.h"
+#include "filechange-cache.h"
+
+
+static struct hash_table stat_cache;
+static const int CACHE_ENTRY_FILE_SIZE =
+	sizeof(struct stat) + /* sizeof(stat_cache_entry.st) */
+	sizeof(int); /* sizeof(stat_cache_entry.stat_return) */
+
+static void insert_stat_cache_entry(const char *path,
+				    struct stat_cache_entry *new_entry);
+
+void setup_stat_cache()
+{
+	init_hash(&stat_cache);
+}
+
+static int write_stat_cache_entry(void *void_stat_cache_entry, void *void_fp)
+{
+	FILE *fp = (FILE*)(void_fp);
+	const struct stat_cache_entry *entry =
+		(struct stat_cache_entry*)(void_stat_cache_entry);
+
+	for (; entry; entry = entry->next) {
+		if (fprintf(fp, "%s\n", entry->path) < 0)
+			die_errno("Unable to write to %s",
+				  git_path("WT_STATUS_TMP"));
+		if (fwrite(&entry->stat_return,
+			   CACHE_ENTRY_FILE_SIZE, 1, fp) != 1)
+			die_errno("Unable to write to %s",
+				  git_path("WT_STATUS_TMP"));
+	}
+	return 0;
+}
+
+void write_stat_cache()
+{
+	const char *status_tmp = git_path("WT_STATUS_TMP");
+	const char *status_output = git_path("WT_STATUS");
+	FILE *fp = fopen(status_tmp, "w");
+	if (!fp)
+		die_errno("Unable to create %s", status_tmp);
+	if (fprintf(fp, "version_format=1\n") < 0)
+		die_errno("Unable to write to %s", status_tmp);
+	for_each_hash(&stat_cache, write_stat_cache_entry, fp);
+	if (fclose(fp) < 0)
+		die_errno("Unable to close %s", status_tmp);
+	if (rename(status_tmp, status_output) < 0)
+		die_errno("Unable to rename %s to %s", status_tmp,
+			  status_output);
+}
+
+static void read_stat_cache_file()
+{
+	struct strbuf line = STRBUF_INIT;
+	const char *status_output = git_path("WT_STATUS");
+	int read_version = 0;
+
+	FILE *fp = fopen(status_output, "r");
+	if (!fp)
+		die_errno("Unable to read %s", status_output);
+	
+	if (strbuf_getline(&line, fp, '\n') != EOF) {
+		sscanf(line.buf, "version_format=%d\n", &read_version);
+		if (read_version != 1) {
+			die("Expected version 1 of stat_cache file");
+		}
+	}
+
+	while (strbuf_getline(&line, fp, '\n') != EOF) {
+		struct stat_cache_entry *entry =
+			(struct stat_cache_entry*)(xcalloc(1, sizeof(*entry)));
+		entry->path = xstrdup(line.buf);
+		if (fread(&entry->stat_return,
+			  CACHE_ENTRY_FILE_SIZE, 1, fp) != 1) {
+			die_errno("Unable to read stat_cache file");
+		}
+		insert_stat_cache_entry(entry->path, entry);
+	}
+
+	strbuf_release(&line);
+}
+
+static int request_stat_cache_file()
+{
+	int count = 0;
+	int stat_return_code = 0;
+	const char *request_path = git_path("REQUEST_WT_STATUS");
+	const char *status_output = git_path("WT_STATUS");
+	struct stat stat_buf;
+
+	char buffer[1] = { 0 };
+	FILE *fp = NULL;
+
+	if (0 != stat(request_path, &stat_buf))
+		return 0;
+	
+	if (unlink(status_output) != 0 && (errno != ENOENT))
+		die_errno("Unable to remove %s", status_output);
+
+	fp = fopen(request_path, "w");
+	if (!fp) {
+		die_errno("Unable to open %s", request_path);
+	}
+
+	if (fwrite(&buffer, 0, 0, fp) != 0)
+		die_errno("Unable to write to %s", request_path);
+	
+	for (count = 0;
+	     (count < 10) &&
+		     ((stat_return_code = stat(status_output, &stat_buf)) != 0) &&
+		     (errno == ENOENT);
+	     count++) {
+		usleep(1000);
+	}
+
+	return stat_return_code == 0;
+}
+
+void read_stat_cache()
+{
+	static int read_cache = 1;
+	if (read_cache && request_stat_cache_file()) {
+		read_stat_cache_file();
+		read_cache = 0;
+	}
+	read_cache = 0;
+}
+
+struct stat_cache_entry *get_stat_cache_entry(const char *path)
+{
+	const unsigned int hash = hash_name(path, strlen(path));
+	struct stat_cache_entry *entry = NULL;
+	for(entry = lookup_hash(hash, &stat_cache); entry;
+	    entry = entry->next) {
+		if (!strcmp(path, entry->path)) return entry;
+	}
+	return NULL;
+}
+
+static void insert_stat_cache_entry(const char *path,
+				    struct stat_cache_entry *new_entry)
+{
+	assert(get_stat_cache_entry(path) == NULL);
+
+	void **insert_result =
+		insert_hash(hash_name(path, strlen(path)), (void*)new_entry,
+			    &stat_cache);
+	if (!insert_result) return;
+	struct stat_cache_entry *existing_entry =
+		(struct stat_cache_entry*)(*insert_result);
+	while(existing_entry->next) {
+		existing_entry = existing_entry->next;
+	}
+	assert(!existing_entry->next);
+	existing_entry->next = new_entry;
+}
+
+void update_stat_cache(const char *path)
+{
+	struct stat_cache_entry *entry = get_stat_cache_entry(path);
+	if (!entry) {
+		entry = (struct stat_cache_entry*)(xcalloc(1, sizeof(*entry)));
+		entry->path = xstrdup(path);
+		insert_stat_cache_entry(path, entry);
+	}
+	
+	entry->stat_return = lstat(path, &entry->st);
+}
+
+int cached_stat(const char *path, struct stat *buf)
+{
+	return stat(path, buf);
+}
+
+int cached_fstat(int fd, struct stat *buf)
+{
+	return fstat(fd, buf);
+}
+
+int cached_lstat(const char *path, struct stat *buf)
+{
+	int stat_return_value = 0;
+	struct stat_cache_entry *entry = 0;
+
+	read_stat_cache();
+
+	entry = get_stat_cache_entry(path);
+
+	stat_return_value = lstat(path, buf);
+	
+	if (entry && (stat_return_value != entry->stat_return) &&
+	    (memcmp(&entry->st, buf, sizeof(*buf)))) {
+		abort();
+	}
+	
+	return stat_return_value;
+}
diff --git a/filechange-cache.h b/filechange-cache.h
new file mode 100644
index 0000000..75a9f79
--- /dev/null
+++ b/filechange-cache.h
@@ -0,0 +1,20 @@
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+struct stat_cache_entry {
+	const char *path;
+	struct stat_cache_entry *next;
+	int stat_return;
+	struct stat st;
+};
+
+extern void write_stat_cache();
+extern void read_stat_cache();
+extern void setup_stat_cache();
+extern struct stat_cache_entry *get_stat_cache_entry(const char *path);
+extern void update_stat_cache(const char *path);
+
+extern int cached_stat(const char *path, struct stat *buf);
+extern int cached_fstat(int fd, struct stat *buf);
+extern int cached_lstat(const char *path, struct stat *buf);
diff --git a/filechange-daemon.c b/filechange-daemon.c
new file mode 100644
index 0000000..df6f0d3
--- /dev/null
+++ b/filechange-daemon.c
@@ -0,0 +1,164 @@
+#include <stdio.h>
+#include <libgen.h>
+#include <sys/inotify.h>
+
+#include "filechange-cache.h"
+#include "builtin.h"
+#include "dir.h"
+#include "hash.h"
+
+static int request_watch_descriptor = -1;
+static int root_directory_watch_descriptor = -1;
+
+static void setup_environment()
+{
+	setup_stat_cache();
+}
+
+static int setup_inotify()
+{
+	int inotify_fd = inotify_init();
+	if (inotify_fd < 0) {
+		die_errno("Unable to create inotify watch");
+	}
+	return inotify_fd;
+}
+
+static void restart()
+{
+
+}
+
+
+static void watch_control(int inotify_fd)
+{
+	struct stat stat_buf;
+	const char *request_path = git_path("REQUEST_WT_STATUS");
+
+	if ((stat(request_path, &stat_buf) == -1) && (errno == ENOENT)) {
+		FILE *out = fopen(request_path, "w");
+		if (out == NULL)
+			die_errno("Unable to create %s", request_path);
+	}
+
+	request_watch_descriptor = inotify_add_watch(inotify_fd,
+						     request_path, IN_MODIFY);
+	
+	if (request_watch_descriptor < 0)
+		die_errno("Unable to watch %s", request_path);
+}
+
+static void watch_file(int inotify_fd, const char *path)
+{
+	int watch_descriptor = 0;
+	char *path_copy = xstrdup(path);
+	char *dir = dirname(path_copy);
+	const int interest_set =
+		IN_MODIFY  | IN_DELETE | IN_CREATE  |
+		IN_DELETE_SELF | IN_MOVE_SELF |
+		IN_MOVED_TO;
+
+	watch_descriptor = inotify_add_watch(inotify_fd, dir, interest_set);
+	if (watch_descriptor < 0)
+		die_errno("Unable to create inotify watch for %s", dir);
+
+	watch_descriptor = inotify_add_watch(inotify_fd, path, interest_set);
+	if (watch_descriptor < 0)
+		die_errno("Unable to create inotify watch for %s", path);
+	update_stat_cache(path);
+
+	free(path_copy);
+}
+
+static void watch_directory(int inotify_fd)
+{
+	char buf[PATH_MAX];
+
+	if (!getcwd(buf, sizeof(buf)))
+		die_errno("Unable to get current directory");
+
+	int i = 0;
+	struct dir_struct dir;
+	const char *pathspec[2] = { buf, NULL };
+
+	memset(&dir, 0, sizeof(dir));
+	setup_standard_excludes(&dir);
+
+	fill_directory(&dir, pathspec);
+	for(i = 0; i < dir.nr; i++) {
+		struct dir_entry *ent = dir.entries[i];
+		watch_file(inotify_fd, ent->name);
+		free(ent);
+	}
+
+	free(dir.entries);
+	free(dir.ignored);
+}
+
+static void watch_root_directory(int inotify_fd)
+{
+	char buf[PATH_MAX];
+
+	if (!getcwd(buf, sizeof(buf)))
+		die_errno("Unable to get current directory");
+
+	root_directory_watch_descriptor =
+		inotify_add_watch(inotify_fd, buf, IN_DELETE);
+	if (root_directory_watch_descriptor < 0)
+		die_errno("Unable to watch %s directory", buf);
+}
+
+#define INOTIFY_EVENT_SIZE  (sizeof (struct inotify_event)  + PATH_MAX + 1)
+
+static void remove_request_file(void)
+{
+	const char *request_path = git_path("REQUEST_WT_STATUS");
+	if (unlink(request_path)) {
+		die_errno("Unable to remove %s on exit",
+			  request_path);
+	}
+}
+
+static void loop(int inotify_fd)
+{
+	char buffer[INOTIFY_EVENT_SIZE * 10];
+	int length = 0;
+	
+	while (1) {
+		int i = 0;
+		length = read(inotify_fd, buffer, sizeof(buffer));
+		for(i = 0; i < length; ) {
+			struct inotify_event *event =
+				(struct inotify_event*)(buffer+i);
+			/* printf("event: %d %x %d %s\n", event->wd, event->mask,
+			   event->len, event->name); */
+			if (request_watch_descriptor == event->wd) {
+				write_stat_cache();
+			} else if (root_directory_watch_descriptor
+				   == event->wd) {
+				printf("root directory died!\n");
+				exit(0);
+			} else if (event->mask & IN_Q_OVERFLOW) {
+				restart();
+			} else if (event->mask & IN_MODIFY) {
+				if (event->len)
+					update_stat_cache(event->name);
+			}
+			
+			i += sizeof(struct inotify_event) + event->len;
+		}
+	}
+}
+
+int main(int argc, const char **argv)
+{
+	const int inotify_fd = setup_inotify();
+
+	atexit(remove_request_file);
+	setup_environment();
+	watch_control(inotify_fd);
+	watch_root_directory(inotify_fd);
+	watch_directory(inotify_fd);
+	loop(inotify_fd);
+	return 0;
+}
diff --git a/filechange-printer.c b/filechange-printer.c
new file mode 100644
index 0000000..fe43d80
--- /dev/null
+++ b/filechange-printer.c
@@ -0,0 +1,13 @@
+#include <stdio.h>
+#include "filechange-cache.h"
+
+int main()
+{
+	struct stat_cache_entry *entry = NULL;
+	const char *missing = "t/t7201-co.sh";
+	read_stat_cache();
+	
+	entry = get_stat_cache_entry(missing);
+	printf("%p\n", entry);
+	return 0;
+}
diff --git a/git.c b/git.c
index b10c18b..ea92a65 100644
--- a/git.c
+++ b/git.c
@@ -504,6 +504,31 @@ static int run_argv(int *argcp, const char ***argv)
  }


+static void fork_filechange_daemon()
+{
+	struct stat stat_buf;
+	FILE *log = fopen("/tmp/foo.txt", "a");
+	fprintf(log, "cwd = %s\n", get_current_dir_name());
+
+	if (stat(git_path("REQUEST_WT_STATUS"), &stat_buf) == -1) {
+		pid_t child = 0;
+
+		child = fork();
+		fprintf(log, "starting %d\n", (int)child);
+		if (!child) {
+			fclose(log);
+			execl("/home/razeh/src/git/git-filechange-daemon",
+			      "/home/razeh/src/git/git-filechange-daemon",
+			      get_current_dir_name(),
+			      (char*) NULL);
+			die_errno("Unable to launch file change daemon");
+		}
+	} else {
+		fprintf(log, "already running\n");
+	}
+
+}
+
  int main(int argc, const char **argv)
  {
  	const char *cmd;
@@ -558,6 +583,8 @@ int main(int argc, const char **argv)
  	 */
  	setup_path();

+	fork_filechange_daemon();
+
  	while (1) {
  		static int done_help = 0;
  		static int was_alias = 0;
diff --git a/ll-merge.c b/ll-merge.c
index fb61ea6..7ced2bb 100644
--- a/ll-merge.c
+++ b/ll-merge.c
@@ -6,6 +6,7 @@

  #include "cache.h"
  #include "attr.h"
+#include "filechange-cache.h"
  #include "xdiff-interface.h"
  #include "run-command.h"
  #include "ll-merge.h"
@@ -195,7 +196,7 @@ static int ll_ext_merge(const struct ll_merge_driver *fn,
  	fd = open(temp[1], O_RDONLY);
  	if (fd < 0)
  		goto bad;
-	if (fstat(fd, &st))
+	if (cached_fstat(fd, &st))
  		goto close_bad;
  	result->size = st.st_size;
  	result->ptr = xmalloc(result->size + 1);
diff --git a/merge-recursive.c b/merge-recursive.c
index ea9dbd3..7d371d6 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -12,6 +12,7 @@
  #include "tree-walk.h"
  #include "diff.h"
  #include "diffcore.h"
+#include "filechange-cache.h"
  #include "tag.h"
  #include "unpack-trees.h"
  #include "string-list.h"
@@ -606,7 +607,7 @@ static char *unique_path(struct merge_options *o, const char *path, const char *
  			*p = '_';
  	while (string_list_has_string(&o->current_file_set, newpath) ||
  	       string_list_has_string(&o->current_directory_set, newpath) ||
-	       lstat(newpath, &st) == 0)
+	       cached_lstat(newpath, &st) == 0)
  		sprintf(p, "_%d", suffix++);

  	string_list_insert(&o->current_file_set, newpath);
@@ -634,7 +635,7 @@ static int dir_in_way(const char *path, int check_working_copy)
  	}

  	free(dirpath);
-	return check_working_copy && !lstat(path, &st) && S_ISDIR(st.st_mode);
+	return check_working_copy && !cached_lstat(path, &st) && S_ISDIR(st.st_mode);
  }

  static int was_tracked(const char *path)
diff --git a/name-hash.c b/name-hash.c
index d8d25c2..d88185f 100644
--- a/name-hash.c
+++ b/name-hash.c
@@ -7,6 +7,7 @@
   */
  #define NO_THE_INDEX_COMPATIBILITY_MACROS
  #include "cache.h"
+#include "name-hash.h"

  /*
   * This removes bit 5 if bit 6 is set.
@@ -20,7 +21,7 @@ static inline unsigned char icase_hash(unsigned char c)
  	return c & ~((c & 0x40) >> 1);
  }

-static unsigned int hash_name(const char *name, int namelen)
+unsigned int hash_name(const char *name, int namelen)
  {
  	unsigned int hash = 0x123;

diff --git a/name-hash.h b/name-hash.h
new file mode 100644
index 0000000..3355d94
--- /dev/null
+++ b/name-hash.h
@@ -0,0 +1 @@
+extern unsigned int hash_name(const char *name, int namelen);
diff --git a/notes-merge.c b/notes-merge.c
index 0f67bd3..f792f83 100644
--- a/notes-merge.c
+++ b/notes-merge.c
@@ -3,6 +3,7 @@
  #include "refs.h"
  #include "diff.h"
  #include "diffcore.h"
+#include "filechange-cache.h"
  #include "xdiff-interface.h"
  #include "ll-merge.h"
  #include "dir.h"
@@ -731,7 +732,7 @@ int notes_merge_commit(struct notes_merge_options *o,

  		strbuf_addstr(&path, e->d_name);
  		/* write file as blob, and add to partial_tree */
-		if (stat(path.buf, &st))
+		if (cached_stat(path.buf, &st))
  			die_errno("Failed to stat '%s'", path.buf);
  		if (index_path(blob_sha1, path.buf, &st, HASH_WRITE_OBJECT))
  			die("Failed to write blob object from '%s'", path.buf);
diff --git a/path.c b/path.c
index d3d3f8b..6844d2d 100644
--- a/path.c
+++ b/path.c
@@ -11,6 +11,7 @@
   * which is what it's designed for.
   */
  #include "cache.h"
+#include "filechange-cache.h"
  #include "strbuf.h"
  #include "string-list.h"

@@ -360,7 +361,7 @@ const char *enter_repo(const char *path, int strict)
  		for (i = 0; suffix[i]; i++) {
  			struct stat st;
  			strcpy(used_path + len, suffix[i]);
-			if (!stat(used_path, &st) &&
+			if (!cached_stat(used_path, &st) &&
  			    (S_ISREG(st.st_mode) ||
  			    (S_ISDIR(st.st_mode) && is_git_directory(used_path)))) {
  				strcat(validated_path, suffix[i]);
@@ -400,7 +401,7 @@ int set_shared_perm(const char *path, int mode)
  		return 0;
  	}
  	if (!mode) {
-		if (lstat(path, &st) < 0)
+		if (cached_lstat(path, &st) < 0)
  			return -1;
  		mode = st.st_mode;
  		orig_mode = mode;
diff --git a/read-cache.c b/read-cache.c
index 827ae55..508ddc1 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -8,6 +8,7 @@
  #include "cache-tree.h"
  #include "refs.h"
  #include "dir.h"
+#include "filechange-cache.h"
  #include "tree.h"
  #include "commit.h"
  #include "blob.h"
@@ -672,7 +673,7 @@ int add_to_index(struct index_state *istate, const char *path, struct stat *st,
  int add_file_to_index(struct index_state *istate, const char *path, int flags)
  {
  	struct stat st;
-	if (lstat(path, &st))
+	if (cached_lstat(path, &st))
  		die_errno("unable to stat '%s'", path);
  	return add_to_index(istate, path, &st, flags);
  }
@@ -1032,7 +1033,7 @@ static struct cache_entry *refresh_cache_ent(struct index_state *istate,
  		return ce;
  	}

-	if (lstat(ce->name, &st) < 0) {
+	if (cached_lstat(ce->name, &st) < 0) {
  		if (err)
  			*err = errno;
  		return NULL;
@@ -1430,7 +1431,7 @@ int read_index_from(struct index_state *istate, const char *path)
  		die_errno("index file open failed");
  	}

-	if (fstat(fd, &st))
+	if (cached_fstat(fd, &st))
  		die_errno("cannot stat the open index");

  	mmap_size = xsize_t(st.st_size);
@@ -1618,7 +1619,7 @@ static void ce_smudge_racily_clean_entry(struct cache_entry *ce)
  	 */
  	struct stat st;

-	if (lstat(ce->name, &st) < 0)
+	if (cached_lstat(ce->name, &st) < 0)
  		return;
  	if (ce_match_stat_basic(ce, &st))
  		return;
@@ -1830,7 +1831,7 @@ int write_index(struct index_state *istate, int newfd)
  			return -1;
  	}

-	if (ce_flush(&c, newfd) || fstat(newfd, &st))
+	if (ce_flush(&c, newfd) || cached_fstat(newfd, &st))
  		return -1;
  	istate->timestamp.sec = (unsigned int)st.st_mtime;
  	istate->timestamp.nsec = ST_MTIME_NSEC(st);
diff --git a/rerere.c b/rerere.c
index a6a5cd5..5115d0e 100644
--- a/rerere.c
+++ b/rerere.c
@@ -1,4 +1,5 @@
  #include "cache.h"
+#include "filechange-cache.h"
  #include "string-list.h"
  #include "rerere.h"
  #include "xdiff-interface.h"
@@ -28,7 +29,7 @@ const char *rerere_path(const char *hex, const char *file)
  static int has_rerere_resolution(const char *hex)
  {
  	struct stat st;
-	return !stat(rerere_path(hex, "postimage"), &st);
+	return !cached_stat(rerere_path(hex, "postimage"), &st);
  }

  static void read_rr(struct string_list *rr)
@@ -681,13 +682,13 @@ int rerere_forget(const char **pathspec)
  static time_t rerere_created_at(const char *name)
  {
  	struct stat st;
-	return stat(rerere_path(name, "preimage"), &st) ? (time_t) 0 : st.st_mtime;
+	return cached_stat(rerere_path(name, "preimage"), &st) ? (time_t) 0 : st.st_mtime;
  }

  static time_t rerere_last_used_at(const char *name)
  {
  	struct stat st;
-	return stat(rerere_path(name, "postimage"), &st) ? (time_t) 0 : st.st_mtime;
+	return cached_stat(rerere_path(name, "postimage"), &st) ? (time_t) 0 : st.st_mtime;
  }

  static void unlink_rr_item(const char *name)
diff --git a/setup.c b/setup.c
index 2e1521b..690987a 100644
--- a/setup.c
+++ b/setup.c
@@ -1,5 +1,6 @@
  #include "cache.h"
  #include "dir.h"
+#include "filechange-cache.h"
  #include "string-list.h"

  static int inside_git_dir = -1;
@@ -74,7 +75,7 @@ int check_filename(const char *prefix, const char *arg)
  		name = prefix_filename(prefix, strlen(prefix), arg);
  	else
  		name = arg;
-	if (!lstat(name, &st))
+	if (!cached_lstat(name, &st))
  		return 1; /* file exists */
  	if (errno == ENOENT || errno == ENOTDIR)
  		return 0; /* file does not exist */
@@ -638,7 +639,7 @@ static const char *setup_nongit(const char *cwd, int *nongit_ok)
  static dev_t get_device_or_die(const char *path, const char *prefix, int prefix_len)
  {
  	struct stat buf;
-	if (stat(path, &buf)) {
+	if (cached_stat(path, &buf)) {
  		die_errno("failed to stat '%*s%s%s'",
  				prefix_len,
  				prefix ? prefix : "",
diff --git a/test-chmtime.c b/test-chmtime.c
index 92713d1..bb5f22a 100644
--- a/test-chmtime.c
+++ b/test-chmtime.c
@@ -81,7 +81,7 @@ int main(int argc, const char *argv[])
  		struct stat sb;
  		struct utimbuf utb;

-		if (stat(argv[i], &sb) < 0) {
+		if (cached_stat(argv[i], &sb) < 0) {
  			fprintf(stderr, "Failed to stat %s: %s\n",
  			        argv[i], strerror(errno));
  			return -1;
diff --git a/test-wildmatch.c b/test-wildmatch.c
index a3e2643..838ff69 100644
--- a/test-wildmatch.c
+++ b/test-wildmatch.c
@@ -19,7 +19,7 @@ static int perf(int ac, char **av)
  	if (lang && strcmp(lang, "C"))
  		die("Please test it on C locale.");

-	if ((fd = open(file, O_RDONLY)) == -1 || fstat(fd, &st))
+	if ((fd = open(file, O_RDONLY)) == -1 || cached_fstat(fd, &st))
  		die_errno("file open");

  	buffer = xmalloc(st.st_size + 2);
diff --git a/unpack-trees.c b/unpack-trees.c
index 09e53df..fc20be4 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -1430,13 +1430,13 @@ static int verify_absent_1(struct cache_entry *ce,
  		char path[PATH_MAX + 1];
  		memcpy(path, ce->name, len);
  		path[len] = 0;
-		if (lstat(path, &st))
+		if (cached_lstat(path, &st))
  			return error("cannot stat '%s': %s", path,
  					strerror(errno));

  		return check_ok_to_remove(path, len, DT_UNKNOWN, NULL, &st,
  				error_type, o);
-	} else if (lstat(ce->name, &st)) {
+	} else if (cached_lstat(ce->name, &st)) {
  		if (errno != ENOENT)
  			return error("cannot stat '%s': %s", ce->name,
  				     strerror(errno));
@@ -1838,7 +1838,7 @@ int oneway_merge(struct cache_entry **src, struct unpack_trees_options *o)
  		int update = 0;
  		if (o->reset && o->update && !ce_uptodate(old) && !ce_skip_worktree(old)) {
  			struct stat st;
-			if (lstat(old->name, &st) ||
+			if (cached_lstat(old->name, &st) ||
  			    ie_match_stat(o->src_index, old, &st, CE_MATCH_IGNORE_VALID|CE_MATCH_IGNORE_SKIP_WORKTREE))
  				update |= CE_UPDATE;
  		}
-- 
1.8.2.rc0.29.g3a0aba8.dirty

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [PATCH] inotify to minimize stat() calls
  2013-04-24 17:20               ` [PATCH] " Robert Zeh
@ 2013-04-24 21:32                 ` Duy Nguyen
  2013-04-25 19:44                   ` Robert Zeh
  2013-04-25  8:18                 ` Thomas Rast
  2013-04-27 23:56                 ` Duy Nguyen
  2 siblings, 1 reply; 88+ messages in thread
From: Duy Nguyen @ 2013-04-24 21:32 UTC (permalink / raw)
  To: Robert Zeh; +Cc: Junio C Hamano, Ramkumar Ramachandra, Git List

On Thu, Apr 25, 2013 at 3:20 AM, Robert Zeh <robert.allan.zeh@gmail.com> wrote:
> Here is a patch that creates a daemon that tracks file
> state with inotify, writes it out to a file upon request,
> and changes most of the calls to stat to use said cache.
>
> It has bugs, but I figured it would be smarter to see
> if the approach was acceptable at all before spending the
> time to root the bugs out.

Any preliminary performance numbers? How does it do compared to
no-inotify version? When only a few files are changed? When half the
repo is changed?

> I've implemented the communication with a file, and not a socket, because I
> think implementing a socket is going to create
> security issues on multiuser systems.  For example, would a
> socket allow stat information to cross user boundaries?

I think UNIX sockets on Linux at least respect file permissions. But
unix(7) follows with "This behavior differs from many BSD-derived
systems which ignore permissions for Unix sockets". Sigh.

>  abspath.c            |   9 ++-
>  bisect.c             |   3 +-
>  check-racy.c         |   2 +-
>  combine-diff.c       |   3 +-
>  command-list.txt     |   1 +
>  config.c             |   3 +-
>  copy.c               |   3 +-
>  diff-lib.c           |   3 +-
>  diff-no-index.c      |   3 +-
>  diff.c               |   9 ++-
>  diffcore-order.c     |   3 +-
>  dir.c                |   4 +-
>  filechange-cache.c   | 203
> +++++++++++++++++++++++++++++++++++++++++++++++++++
>  filechange-cache.h   |  20 +++++
>  filechange-daemon.c  | 164 +++++++++++++++++++++++++++++++++++++++++
>  filechange-printer.c |  13 ++++
>  git.c                |  27 +++++++
>  ll-merge.c           |   3 +-
>  merge-recursive.c    |   5 +-
>  name-hash.c          |   3 +-
>  name-hash.h          |   1 +
>  notes-merge.c        |   3 +-
>  path.c               |   5 +-
>  read-cache.c         |  11 +--
>  rerere.c             |   7 +-
>  setup.c              |   5 +-
>  test-chmtime.c       |   2 +-
>  test-wildmatch.c     |   2 +-
>  unpack-trees.c       |   6 +-
>  29 files changed, 486 insertions(+), 40 deletions(-)
>  create mode 100644 filechange-cache.c
>  create mode 100644 filechange-cache.h
>  create mode 100644 filechange-daemon.c
>  create mode 100644 filechange-printer.c
>  create mode 100644 name-hash.h

Can you just replace lstat/stat with cached_lstat/stat inside
git-compat-util.h and not touch all files at once? I think you may
need to deal with paths outside working directory. But because you're
using lookup table, that should be no problem.
--
Duy

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH] inotify to minimize stat() calls
  2013-04-24 17:20               ` [PATCH] " Robert Zeh
  2013-04-24 21:32                 ` Duy Nguyen
@ 2013-04-25  8:18                 ` Thomas Rast
  2013-04-25 19:37                   ` Robert Zeh
  2013-04-27 23:56                 ` Duy Nguyen
  2 siblings, 1 reply; 88+ messages in thread
From: Thomas Rast @ 2013-04-25  8:18 UTC (permalink / raw)
  To: Robert Zeh; +Cc: Junio C Hamano, Ramkumar Ramachandra, Git List, Duy Nguyen

Robert Zeh <robert.allan.zeh@gmail.com> writes:

> Here is a patch that creates a daemon that tracks file
> state with inotify, writes it out to a file upon request,
> and changes most of the calls to stat to use said cache.
>
> It has bugs, but I figured it would be smarter to see
> if the approach was acceptable at all before spending the
> time to root the bugs out.

Thanks for tackling this; it's probably about time we got inotify
support :-(

> I've implemented the communication with a file, and not a socket,
> because I think implementing a socket is going to create
> security issues on multiuser systems.  For example, would a
> socket allow stat information to cross user boundaries?

This ties in with an issue discussed in an earlier thread:

  http://thread.gmane.org/gmane.comp.version-control.git/217817/focus=218307

The conclusion there was that the default limits are set such that it is
not feasible to run one daemon per repository (that would quickly hit
the limits when e.g. iterating all repos in a typical android tree using
repo).

So whatever you use for communication needs to work as a global daemon.

I'd just trust the SSH folks to know about security; on my system
ssh-agent creates

  /tmp/ssh-RANDOMSTRING/agent.PID

where the directory has mode 0700, and the file is a unix socket with
mode 0600.  That should make doubly sure that no other user can open the
socket.
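
The ssh-agent layout described above can be sketched roughly as follows.
This is only an illustration under stated assumptions: the
"git-filechange" prefix, the "daemon.PID" file name and the function
name are made up here and do not come from the patch.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Create a listening unix socket inside a fresh private directory.
 * Returns the listening fd and copies the socket path to path_out,
 * or returns -1 on failure (the directory is not cleaned up on error
 * in this sketch).
 */
static int make_private_socket(char *path_out, size_t path_len)
{
	char dir[] = "/tmp/git-filechange-XXXXXX";	/* hypothetical name */
	struct sockaddr_un addr;
	mode_t old_umask;
	int fd, n;

	/* mkdtemp() creates the directory with mode 0700. */
	if (!mkdtemp(dir))
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	n = snprintf(addr.sun_path, sizeof(addr.sun_path), "%s/daemon.%d",
		     dir, (int)getpid());
	if (n < 0 || (size_t)n >= sizeof(addr.sun_path))
		return -1;

	fd = socket(AF_UNIX, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	/* bind() creates the socket node; the umask limits it to 0600. */
	old_umask = umask(0177);
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(fd, 5) < 0) {
		umask(old_umask);
		close(fd);
		return -1;
	}
	umask(old_umask);

	snprintf(path_out, path_len, "%s", addr.sun_path);
	return fd;
}
```

The 0700 directory is what protects the socket on systems that ignore
permissions on the socket node itself; the 0600 mode on the node is the
second layer.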

>  filechange-cache.c   | 203
> +++++++++++++++++++++++++++++++++++++++++++++++++++

Is your MUA wrapping the patch?

> +static void watch_directory(int inotify_fd)
> +{
> +	char buf[PATH_MAX];
> +
> +	if (!getcwd(buf, sizeof(buf)))
> +		die_errno("Unable to get current directory");
> +
> +	int i = 0;
> +	struct dir_struct dir;
> +	const char *pathspec[1] = { buf, NULL };
> +
> +	memset(&dir, 0, sizeof(dir));
> +	setup_standard_excludes(&dir);
> +
> +	fill_directory(&dir, pathspec);
> +	for(i = 0; i < dir.nr; i++) {
> +		struct dir_entry *ent = dir.entries[i];
> +		watch_file(inotify_fd, ent->name);
> +		free(ent);
> +	}

I don't get this bit.  The lstat() are run over all files listed in the
index.  So shouldn't your daemon watch exactly those (or rather, all
dirnames of such files)?

The actual directory contents are only needed to find untracked files,
and there would be a lot of complication surrounding that, so I suggest
saving that for later (and for now measuring the speedup with 'git
status -uno'!).

For example, you'd have to actually watch and re-read all .gitignore
files, and the .git/info/exclude, and the core.excludesfile, to see if
your notion of an ignored file became stale.

Also, you seem to call watch_directory() only on the current(?) dir, but
you need to recursively set up watches for all directories in the
repository.
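
A minimal sketch of what recursive setup of directory watches could look
like, assuming a plain readdir() walk that skips ".git".  The function
name and the event mask are illustrative choices, not taken from the
patch, and the sketch ignores the race where a directory is created
between the watch being added and the directory being read.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/inotify.h>
#include <sys/stat.h>

/* Events of interest for a directory watch (an illustrative choice). */
#define WATCH_MASK (IN_MODIFY | IN_CREATE | IN_DELETE | \
		    IN_MOVED_FROM | IN_MOVED_TO | IN_DELETE_SELF)

/*
 * Add an inotify watch on 'dir' and on every directory below it,
 * skipping ".git".  Returns the number of watches added, or -1 on error.
 */
static int watch_tree(int inotify_fd, const char *dir)
{
	char path[PATH_MAX];
	struct dirent *de;
	struct stat st;
	int count = 1;
	DIR *d;

	if (inotify_add_watch(inotify_fd, dir, WATCH_MASK) < 0)
		return -1;

	d = opendir(dir);
	if (!d)
		return -1;

	while ((de = readdir(d)) != NULL) {
		int n, sub;

		if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, "..") ||
		    !strcmp(de->d_name, ".git"))
			continue;
		n = snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
		if (n < 0 || (size_t)n >= sizeof(path))
			continue;	/* path too long; skipped in this sketch */
		/* lstat() so that symlinks to directories are not followed */
		if (lstat(path, &st) < 0 || !S_ISDIR(st.st_mode))
			continue;
		sub = watch_tree(inotify_fd, path);
		if (sub < 0) {
			closedir(d);
			return -1;
		}
		count += sub;
	}
	closedir(d);
	return count;
}
```

This uses one watch per directory rather than one per file, which is
the "all dirnames of such files" variant suggested above.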

> +	while (1) {
> +		int i = 0;
> +		length = read(inotify_fd, buffer, sizeof(buffer));
> +		for(i = 0; i < length; ) {
> +			struct inotify_event *event =
> +				(struct inotify_event*)(buffer+i);
> +			/* printf("event: %d %x %d %s\n", event->wd, event->mask,
> +			   event->len, event->name); */
> +			if (request_watch_descriptor == event->wd) {
> +				write_stat_cache();
> +			} else if (root_directory_watch_descriptor
> +				   == event->wd) {
> +				printf("root directory died!\n");
> +				exit(0);
> +			} else if (event->mask & IN_Q_OVERFLOW) {
> +				restart();

Good.

> +			} else if (event->mask & IN_MODIFY) {
> +				if (event->len)
> +					update_stat_cache(event->name);
> +			}

So whenever a file changes, you stat() it.  That's good for simplicity
now, but I suspect it will provide some optimization opportunities
later.
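
For reference, a sketch of how a buffer returned by read() on the
inotify fd can be walked record by record.  Each record is a fixed
header followed by 'len' bytes of name, so the offset has to advance by
sizeof(struct inotify_event) + len.  The struct and function names are
hypothetical, and the classification only mirrors the overflow and
IN_MODIFY branches of the loop quoted above.

```c
#include <stddef.h>
#include <string.h>
#include <sys/inotify.h>

struct event_summary {
	int modified;	/* IN_MODIFY events that carried a file name */
	int overflowed;	/* set when the kernel event queue overflowed */
};

/*
 * Walk 'len' bytes of inotify records in 'buf' (which must be suitably
 * aligned for struct inotify_event).  Returns the number of complete
 * records seen and updates 'out'.
 */
static int summarize_events(const char *buf, size_t len,
			    struct event_summary *out)
{
	size_t off = 0;
	int n = 0;

	while (off + sizeof(struct inotify_event) <= len) {
		const struct inotify_event *ev =
			(const struct inotify_event *)(buf + off);
		size_t step = sizeof(*ev) + ev->len;

		if (off + step > len)
			break;	/* truncated record; stop */
		if (ev->mask & IN_Q_OVERFLOW)
			out->overflowed = 1;	/* cache must be rebuilt */
		else if ((ev->mask & IN_MODIFY) && ev->len)
			out->modified++;	/* ev->name names the file */
		off += step;
		n++;
	}
	return n;
}
```

A caller would check that read() returned a positive length before
passing the buffer in, since a negative return would otherwise be
converted to a huge size_t.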


On some design aspects, I'd want:

* a toggle to run the test suite with the daemons, or without

* if you go with a user-wide daemon, a way to ensure that the test-suite
  daemon is not the same as my "real" daemon, and make sure it is killed
  after the test runs finish

* a test that triggers IN_Q_OVERFLOW, e.g. by sending SIGSTOP and doing
  a large repository operation

* a test that renames directories

The last one is just based on my personal experience with messing with
inotify; renaming directories is the "hard" case for that API.  We may
already cover this in the test suite, or we may not; but it must be
tested.

Other than that last point, focus your tests not on small tests but on
the test suite.  It would seem rather unlikely to me that you could
manage to pass the entire test suite with this daemon active but broken.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH] inotify to minimize stat() calls
  2013-04-25  8:18                 ` Thomas Rast
@ 2013-04-25 19:37                   ` Robert Zeh
  2013-04-25 19:59                     ` Thomas Rast
  0 siblings, 1 reply; 88+ messages in thread
From: Robert Zeh @ 2013-04-25 19:37 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Junio C Hamano, Ramkumar Ramachandra, Git List, Duy Nguyen

On Thu, Apr 25, 2013 at 3:18 AM, Thomas Rast <trast@inf.ethz.ch> wrote:
>
> Robert Zeh <robert.allan.zeh@gmail.com> writes:
>
> > Here is a patch that creates a daemon that tracks file
> > state with inotify, writes it out to a file upon request,
> > and changes most of the calls to stat to use said cache.
> >
> > It has bugs, but I figured it would be smarter to see
> > if the approach was acceptable at all before spending the
> > time to root the bugs out.
>
> Thanks for tackling this; it's probably about time we got inotify
> support :-(

> > I've implemented the communication with a file, and not a socket,
> > because I think implementing a socket is going to create
> > security issues on multiuser systems.  For example, would a
> > socket allow stat information to cross user boundaries?
>
> This ties in with an issue discussed in an earlier thread:
>
>   http://thread.gmane.org/gmane.comp.version-control.git/217817/focus=218307
>
> The conclusion there was that the default limits are set such that it is
> not feasible to run one daemon per repository (that would quickly hit
> the limits when e.g. iterating all repos in a typical android tree using
> repo).
>
> So whatever you use for communication needs to work as a global daemon.
>
> I'd just trust the SSH folks to know about security; on my system
> ssh-agent creates
>
>   /tmp/ssh-RANDOMSTRING/agent.PID
>
> where the directory has mode 0700, and the file is a unix socket with
> mode 0600.  That should make doubly sure that no other user can open the
> socket.
>
> >  filechange-cache.c   | 203
> > +++++++++++++++++++++++++++++++++++++++++++++++++++
>
> Is your MUA wrapping the patch?

Almost certainly.  I'll double check before I send off the next patch.

> > +static void watch_directory(int inotify_fd)
> > +{
> > +     char buf[PATH_MAX];
> > +
> > +     if (!getcwd(buf, sizeof(buf)))
> > +             die_errno("Unable to get current directory");
> > +
> > +     int i = 0;
> > +     struct dir_struct dir;
> > +     const char *pathspec[1] = { buf, NULL };
> > +
> > +     memset(&dir, 0, sizeof(dir));
> > +     setup_standard_excludes(&dir);
> > +
> > +     fill_directory(&dir, pathspec);
> > +     for(i = 0; i < dir.nr; i++) {
> > +             struct dir_entry *ent = dir.entries[i];
> > +             watch_file(inotify_fd, ent->name);
> > +             free(ent);
> > +     }
>
> I don't get this bit.  The lstat() are run over all files listed in the
> index.  So shouldn't your daemon watch exactly those (or rather, all
> dirnames of such files)?
I believe that fill_directory is handling watching only files in the index.
I had some problems a while back when I was only watching the
directory with some of the inotify structures coming back empty, which
is why I started watching each individual file.

> The actual directory contents are only needed to find untracked files,
> and there would be a lot of complication surrounding that, so I suggest
> saving that for later (and for now measuring the speedup with 'git
> status -uno'!).
The speed up test is a good idea.

> For example, you'd have to actually watch and re-read all .gitignore
> files, and the .git/info/exclude, and the core.excludesfile, to see if
> your notion of an ignored file became stale.
The thought in the back of my head was to simply have the daemon
restart if one of those files changed, under the assumption that a
restart wasn't that expensive, and that it would be complicated to check.


> Also, you seem to call watch_directory() only on the current(?) dir, but
> you need to recursively set up watches for all directories in the
> repository.

I'm calling fill_directory to get the list of files to watch; it appears to
be handling the recursion for me.  It also appears to be handling filtering
out all of the untracked files, etc.

> > +     while (1) {
> > +             int i = 0;
> > +             length = read(inotify_fd, buffer, sizeof(buffer));
> > +             for(i = 0; i < length; ) {
> > +                     struct inotify_event *event =
> > +                             (struct inotify_event*)(buffer+i);
> > +                     /* printf("event: %d %x %d %s\n", event->wd, event->mask,
> > +                        event->len, event->name); */
> > +                     if (request_watch_descriptor == event->wd) {
> > +                             write_stat_cache();
> > +                     } else if (root_directory_watch_descriptor
> > +                                == event->wd) {
> > +                             printf("root directory died!\n");
> > +                             exit(0);
> > +                     } else if (event->mask & IN_Q_OVERFLOW) {
> > +                             restart();
>
> Good.
>
> > +                     } else if (event->mask & IN_MODIFY) {
> > +                             if (event->len)
> > +                                     update_stat_cache(event->name);
> > +                     }
>
> So whenever a file changes, you stat() it.  That's good for simplicity
> now, but I suspect it will provide some optimization opportunities
> later.
I figured it would be a good idea to get things working, and then worry
about optimization later :-)

>
> On some design aspects, I'd want:
>
> * a toggle to run the test suite with the daemons, or without
Yeap.
> * if you go with a user-wide daemon, a way to ensure that the test-suite
>   daemon is not the same as my "real" daemon, and make sure it is killed
>   after the test runs finish
I'm assuming a command line argument that points a daemon at a port would be the
way to handle that.

> * a test that triggers IN_Q_OVERFLOW, e.g. by sending SIGSTOP and doing
>   a large repository operation
Yeap.  I think you'd want some way to verify (through a log file?)
that the overflow happened.

> * a test that renames directories
Yeap.

> The last one is just based on my personal experience with messing with
> inotify; renaming directories is the "hard" case for that API.  We may
> already cover this in the test suite, or we may not; but it must be
> tested.
>
> Other than that last point, focus your tests not on small tests but on
> the test suite.  It would seem rather unlikely to me that you could
> manage to pass the entire test suite with this daemon active but broken.
I've had some experiences where the test suite passes with the daemon active,
but not populating the cache.
> --
> Thomas Rast
> trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH] inotify to minimize stat() calls
  2013-04-24 21:32                 ` Duy Nguyen
@ 2013-04-25 19:44                   ` Robert Zeh
  2013-04-25 21:20                     ` Duy Nguyen
  0 siblings, 1 reply; 88+ messages in thread
From: Robert Zeh @ 2013-04-25 19:44 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Junio C Hamano, Ramkumar Ramachandra, Git List

On Wed, Apr 24, 2013 at 4:32 PM, Duy Nguyen <pclouds@gmail.com> wrote:
> On Thu, Apr 25, 2013 at 3:20 AM, Robert Zeh <robert.allan.zeh@gmail.com> wrote:
>> Here is a patch that creates a daemon that tracks file
>> state with inotify, writes it out to a file upon request,
>> and changes most of the calls to stat to use said cache.
>>
>> It has bugs, but I figured it would be smarter to see
>> if the approach was acceptable at all before spending the
>> time to root the bugs out.
>
> Any preliminary performance numbers? How does it do compared to
> no-inotify version? When only a few files are changed? When half the
> repo is changed?

No numbers yet; I'm still working on correctness.  What I posted does
not pass all of the tests.

I like your ideas for performance tests.  My testing setup is
a VirtualBox instance on MacOS, so I'm not convinced that my numbers
will be meaningful.  The one thing I can report is that running the daemon
doesn't affect compilation performance.

The real win for this type of cache is Windows, where the file system
is slow.

>> I've implemented the communication with a file, and not a socket, because I
>> think implementing a socket is going to create
>> security issues on multiuser systems.  For example, would a
>> socket allow stat information to cross user boundaries?
>
> I think UNIX sockets on Linux at least respect file permissions. But
> unix(7) follows with "This behavior differs from many BSD-derived
> systems which ignore permissions for Unix sockets". Sigh.
>
>>  abspath.c            |   9 ++-
>>  bisect.c             |   3 +-
>>  check-racy.c         |   2 +-
>>  combine-diff.c       |   3 +-
>>  command-list.txt     |   1 +
>>  config.c             |   3 +-
>>  copy.c               |   3 +-
>>  diff-lib.c           |   3 +-
>>  diff-no-index.c      |   3 +-
>>  diff.c               |   9 ++-
>>  diffcore-order.c     |   3 +-
>>  dir.c                |   4 +-
>>  filechange-cache.c   | 203
>> +++++++++++++++++++++++++++++++++++++++++++++++++++
>>  filechange-cache.h   |  20 +++++
>>  filechange-daemon.c  | 164 +++++++++++++++++++++++++++++++++++++++++
>>  filechange-printer.c |  13 ++++
>>  git.c                |  27 +++++++
>>  ll-merge.c           |   3 +-
>>  merge-recursive.c    |   5 +-
>>  name-hash.c          |   3 +-
>>  name-hash.h          |   1 +
>>  notes-merge.c        |   3 +-
>>  path.c               |   5 +-
>>  read-cache.c         |  11 +--
>>  rerere.c             |   7 +-
>>  setup.c              |   5 +-
>>  test-chmtime.c       |   2 +-
>>  test-wildmatch.c     |   2 +-
>>  unpack-trees.c       |   6 +-
>>  29 files changed, 486 insertions(+), 40 deletions(-)
>>  create mode 100644 filechange-cache.c
>>  create mode 100644 filechange-cache.h
>>  create mode 100644 filechange-daemon.c
>>  create mode 100644 filechange-printer.c
>>  create mode 100644 name-hash.h
>
> Can you just replace lstat/stat with cached_lstat/stat inside
> git-compat-util.h and not touch all files at once? I think you may
> need to deal with paths outside working directory. But because you're
> using lookup table, that should be no problem.

That's a good idea; but there are a few places where you want to call
the uncached stat because calling the cache leads to recursion or
you bump into things that haven't been set up yet.  Any ideas how to
handle that?

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH] inotify to minimize stat() calls
  2013-04-25 19:37                   ` Robert Zeh
@ 2013-04-25 19:59                     ` Thomas Rast
  2013-04-27 13:51                       ` Thomas Rast
  0 siblings, 1 reply; 88+ messages in thread
From: Thomas Rast @ 2013-04-25 19:59 UTC (permalink / raw)
  To: Robert Zeh; +Cc: Junio C Hamano, Ramkumar Ramachandra, Git List, Duy Nguyen

Robert Zeh <robert.allan.zeh@gmail.com> writes:

> On Thu, Apr 25, 2013 at 3:18 AM, Thomas Rast <trast@inf.ethz.ch> wrote:
>>
>> I don't get this bit.  The lstat() are run over all files listed in the
>> index.  So shouldn't your daemon watch exactly those (or rather, all
>> dirnames of such files)?
> I believe that fill_directory is handling watching only files in the index.
> I had some problems a while back when I was only watching the
> directory with some of the inotify structures coming back empty, which
> is why I started watching each individual file.

This probably doesn't scale well enough.  For example on my system the
maximum number of watches I can set[1] is 64k.  linux.git contains 38k
files and the total number of files in all repos of an android clone I
have lying around is almost 300k.
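
To make the arithmetic concrete, here is a small sketch that reads the
per-user watch limit from /proc/sys/fs/inotify/max_user_watches and
checks whether a given number of watches would fit.  The function names
are made up for illustration; only the /proc path and the figures in
the comments come from this message.

```c
#include <stdio.h>

/* Returns the per-user inotify watch limit, or -1 if it cannot be read. */
static long read_max_user_watches(void)
{
	FILE *f = fopen("/proc/sys/fs/inotify/max_user_watches", "r");
	long value;

	if (!f)
		return -1;
	if (fscanf(f, "%ld", &value) != 1)
		value = -1;
	fclose(f);
	return value;
}

/*
 * Returns 1 if 'needed' watches fit within 'limit', else 0.
 * With a 64k limit, one watch per file fits for about 38k files
 * but not for about 300k files.
 */
static int watches_fit(long needed, long limit)
{
	return limit > 0 && needed <= limit;
}
```

Note that the limit is per user, so every repository watched by the
same user draws from the same budget.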

Can you clarify what went wrong if you only watch directories?  After
all the events should be the same, except that you need to reassemble
the actual filename from the 'name' field in inotify_event and the
directory name associated with the watch descriptor.
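
The reassembly step could look roughly like this, assuming the daemon
keeps its own table from watch descriptor to directory name (inotify
itself does not hand the directory back).  The table layout and the
function names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical table mapping inotify watch descriptors to directories. */
struct watch_table {
	char **dirs;	/* dirs[wd] is the watched directory, or NULL */
	int alloc;	/* number of slots in dirs */
};

/* Remember that watch descriptor 'wd' belongs to directory 'dir'. */
static int watch_table_set(struct watch_table *t, int wd, const char *dir)
{
	if (wd < 0)
		return -1;
	if (wd >= t->alloc) {
		int n = t->alloc ? t->alloc : 16;
		char **p;

		while (n <= wd)
			n *= 2;
		p = realloc(t->dirs, n * sizeof(*p));
		if (!p)
			return -1;
		/* new slots start out empty */
		memset(p + t->alloc, 0, (n - t->alloc) * sizeof(*p));
		t->dirs = p;
		t->alloc = n;
	}
	free(t->dirs[wd]);
	t->dirs[wd] = strdup(dir);
	return t->dirs[wd] ? 0 : -1;
}

/*
 * Build "<directory of wd>/<name>" in 'out'.  'wd' and 'name' would come
 * from the wd and name fields of struct inotify_event.  Returns 0 on
 * success, -1 if the wd is unknown or the result does not fit.
 */
static int event_path(const struct watch_table *t, int wd, const char *name,
		      char *out, size_t len)
{
	int n;

	if (wd < 0 || wd >= t->alloc || !t->dirs[wd])
		return -1;
	n = snprintf(out, len, "%s/%s", t->dirs[wd], name);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A directory rename would require updating the stored names for the
renamed directory and everything below it, which this sketch does not
attempt.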

I'll keep the rest of your mail for another reply ;-)

[1]  /proc/sys/fs/inotify/max_user_watches

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH] inotify to minimize stat() calls
  2013-04-25 19:44                   ` Robert Zeh
@ 2013-04-25 21:20                     ` Duy Nguyen
  2013-04-26 15:35                       ` Robert Zeh
  0 siblings, 1 reply; 88+ messages in thread
From: Duy Nguyen @ 2013-04-25 21:20 UTC (permalink / raw)
  To: Robert Zeh; +Cc: Junio C Hamano, Ramkumar Ramachandra, Git List

On Fri, Apr 26, 2013 at 2:44 AM, Robert Zeh <robert.allan.zeh@gmail.com> wrote:
>> Can you just replace lstat/stat with cached_lstat/stat inside
>> git-compat-util.h and not touch all files at once? I think you may
>> need to deal with paths outside working directory. But because you're
>> using lookup table, that should be no problem.
>
> That's a good idea; but there are a few places where you want to call
> the uncached stat because calling the cache leads to recursion or
> you bump into things that haven't been set up yet.  Any ideas how to
> handle that?

On second thought, no, my idea was stupid. We only need to optimize
lstat for certain cases and naming it cached_lstat is much clearer. I
suspect read-cache.c and maybe dir.c and unpack-trees.c are the only
places that need cached_lstat. Other places should not issue many
lstats and we don't need to touch them.
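
As an illustration of the explicit-wrapper approach, here is a sketch
of a lookup that consults a table first and falls back to the real
lstat() on a miss.  Everything in it is an assumption made for the
example: the patch's filechange-cache.c is not reproduced here, so the
names (sketch_cached_lstat, stat_cache_put), the hash function and the
table layout are all hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

#define CACHE_BUCKETS 1024

struct stat_entry {
	struct stat_entry *next;
	struct stat st;
	char *path;
};

static struct stat_entry *stat_cache[CACHE_BUCKETS];
static unsigned long cache_hits, cache_misses;

/* Simple string hash, used only for this sketch. */
static unsigned int path_hash(const char *path)
{
	unsigned int h = 5381;

	while (*path)
		h = h * 33 + (unsigned char)*path++;
	return h % CACHE_BUCKETS;
}

/* Record stat data for 'path'; a newer entry shadows an older one. */
static int stat_cache_put(const char *path, const struct stat *st)
{
	unsigned int h = path_hash(path);
	struct stat_entry *e = malloc(sizeof(*e));

	if (!e)
		return -1;
	e->path = strdup(path);
	if (!e->path) {
		free(e);
		return -1;
	}
	e->st = *st;
	e->next = stat_cache[h];
	stat_cache[h] = e;
	return 0;
}

/* Answer from the table when possible, otherwise call lstat(). */
static int sketch_cached_lstat(const char *path, struct stat *st)
{
	struct stat_entry *e;

	for (e = stat_cache[path_hash(path)]; e; e = e->next) {
		if (!strcmp(e->path, path)) {
			*st = e->st;
			cache_hits++;
			return 0;
		}
	}
	cache_misses++;
	return lstat(path, st);
}
```

Because the fallback is the plain system call, callers in the few hot
paths can switch to the wrapper while everything else keeps calling
lstat() directly.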
--
Duy

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH] inotify to minimize stat() calls
  2013-04-25 21:20                     ` Duy Nguyen
@ 2013-04-26 15:35                       ` Robert Zeh
  0 siblings, 0 replies; 88+ messages in thread
From: Robert Zeh @ 2013-04-26 15:35 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Junio C Hamano, Ramkumar Ramachandra, Git List

On Thu, Apr 25, 2013 at 4:20 PM, Duy Nguyen <pclouds@gmail.com> wrote:
> On Fri, Apr 26, 2013 at 2:44 AM, Robert Zeh <robert.allan.zeh@gmail.com> wrote:
>>> Can you just replace lstat/stat with cached_lstat/stat inside
>>> git-compat-util.h and not touch all files at once? I think you may
>>> need to deal with paths outside working directory. But because you're
>>> using lookup table, that should be no problem.
>>
>> That's a good idea; but there are a few places where you want to call
>> the uncached stat because calling the cache leads to recursion or
>> you bump into things that haven't been set up yet.  Any ideas how to
>> handle that?
>
> On second thought, no, my idea was stupid. We only need to optimize
> lstat for certain cases and naming it cached_lstat is much clearer. I
> suspect read-cache.c and maybe dir.c and unpack-trees.c are the only
> places that need cached_lstat. Other places should not issue many
> lstats and we don't need to touch them.

OK.  The only reason I did it for all of them was that it was a simple search
and replace, and I didn't know how often lstat was called from various
locations.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH] inotify to minimize stat() calls
  2013-04-25 19:59                     ` Thomas Rast
@ 2013-04-27 13:51                       ` Thomas Rast
  0 siblings, 0 replies; 88+ messages in thread
From: Thomas Rast @ 2013-04-27 13:51 UTC (permalink / raw)
  To: Robert Zeh; +Cc: Junio C Hamano, Ramkumar Ramachandra, Git List, Duy Nguyen

Thomas Rast <trast@inf.ethz.ch> writes:

> Robert Zeh <robert.allan.zeh@gmail.com> writes:
>
>> On Thu, Apr 25, 2013 at 3:18 AM, Thomas Rast <trast@inf.ethz.ch> wrote:
>>>
>>> I don't get this bit.  The lstat() are run over all files listed in the
>>> index.  So shouldn't your daemon watch exactly those (or rather, all
>>> dirnames of such files)?
>> I believe that fill_directory is handling watching only files in the index.
>> I had some problems a while back when I was only watching the
>> directory with some of the inotify structures coming back empty, which
>> is why I started watching each individual file.
>
> This probably doesn't scale well enough.  For example on my system the
> maximum number of watches I can set[1] is 64k.  linux.git contains 38k
> files and the total number of files in all repos of an android clone I
> have lying around is almost 300k.

[I just sent something similar as a reply to a mail that I then noticed
was sent off-list, but I meant it to be public.]

I just had a change of heart.  It's probably better for the early work
if you make a very controllable, single-repository daemon.  Perhaps one
that only starts on demand (by running some git command) and runs until
again killed on demand.

That way it's much easier to test, and integrate as an option in the
test suite.  And for single-repo-minded people, like (I guess?) the
kernel and webkit folks, this should already provide some benefit.

The per-user daemon complication can come later; we know even at this
point that it will have to be done *eventually*, but let's go one step
at a time.
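
To make the watch-limit concern quoted above concrete, here is a rough
sketch (not part of any patch in this thread) of how a daemon might
check whether the number of watches it needs fits under the per-user
inotify limit. The procfs path is the standard Linux one; the function
names read_inotify_watch_limit() and watches_fit() are made up for
illustration, and the policy of treating an unreadable limit as "does
not fit" is an assumption.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read the per-user inotify watch limit from a procfs file (Linux only),
 * normally /proc/sys/fs/inotify/max_user_watches.
 * Returns the limit, or -1 if the file cannot be read or parsed.
 */
long read_inotify_watch_limit(const char *procfs_path)
{
	FILE *fp;
	long limit = -1;

	assert(procfs_path != NULL);
	fp = fopen(procfs_path, "r");
	if (!fp)
		return -1;
	if (fscanf(fp, "%ld", &limit) != 1)
		limit = -1;
	fclose(fp);
	return limit;
}

/*
 * Decide whether nr_watches objects (files or directories) can be
 * watched without exceeding the limit.  An unknown (negative) limit is
 * treated as "does not fit" so that the caller falls back to plain
 * lstat() instead of relying on an incomplete set of watches.
 */
int watches_fit(long nr_watches, long limit)
{
	return limit >= 0 && nr_watches <= limit;
}
```

With a limit of 65536, the 38k files of linux.git would fit even with
one watch per file, while the roughly 300k files of the android clone
would not, which is the argument for watching directories rather than
individual files.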

-- 
Thomas Rast
trast@{inf,student}.ethz.ch


* Re: [PATCH] inotify to minimize stat() calls
  2013-04-24 17:20               ` [PATCH] " Robert Zeh
  2013-04-24 21:32                 ` Duy Nguyen
  2013-04-25  8:18                 ` Thomas Rast
@ 2013-04-27 23:56                 ` Duy Nguyen
       [not found]                   ` <CAKXa9=r2A7UeBV2s2H3wVGdPkS1zZ9huNJhtvTC-p0S5Ed12xA@mail.gmail.com>
  2 siblings, 1 reply; 88+ messages in thread
From: Duy Nguyen @ 2013-04-27 23:56 UTC (permalink / raw)
  To: Robert Zeh; +Cc: Junio C Hamano, Ramkumar Ramachandra, Git List

On Thu, Apr 25, 2013 at 12:20 AM, Robert Zeh <robert.allan.zeh@gmail.com> wrote:
> +int cached_lstat(const char *path, struct stat *buf)
> +{
> +       int stat_return_value = 0;
> +       struct stat_cache_entry *entry = 0;
> +
> +       read_stat_cache();
> +
> +       entry = get_stat_cache_entry(path);
> +
> +       stat_return_value = lstat(path, buf);
> +
> +       if (entry && (stat_return_value != entry->stat_return) &&
> +           (memcpy(&entry->st, buf, sizeof(*buf)))) {
> +               abort();
> +       }
> +
> +       return stat_return_value;
> +}

I must be missing something. If you always do lstat() in
cached_lstat(), what's the point of the cache? If you worry about
integrity (in the abort case), it'll be easier if you just record and
send paths from the daemon to git. Then you do lstat at one place
(git). This function may become more complex if we still want to watch
a worktree way bigger than the inotify limit. But I guess for now we could
just exit the daemon early in that case and fall back to normal lstat.
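
A minimal sketch of that idea, assuming the daemon hands git a sorted
list of changed paths: git then calls lstat() only for paths on the
list and trusts the index for everything else. When the daemon is not
running or has given up (for example because it hit the inotify limit),
every path gets a normal lstat(). The struct changed_paths layout and
the name needs_lstat() are hypothetical, not taken from the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical: paths the daemon reported as changed, sorted by strcmp(). */
struct changed_paths {
	const char **paths;
	size_t nr;
	int valid;	/* 0 if the daemon is not running or gave up */
};

/* bsearch() comparator: key is a string, elem points at a string pointer. */
static int cmp_path(const void *key, const void *elem)
{
	return strcmp((const char *)key, *(const char *const *)elem);
}

/*
 * Return 1 if git must lstat() this path itself, 0 if the daemon's
 * report says it is unchanged.
 */
int needs_lstat(const struct changed_paths *cp, const char *path)
{
	assert(path != NULL);
	if (!cp || !cp->valid)
		return 1;	/* no daemon data: fall back to normal lstat() */
	if (!cp->nr)
		return 0;	/* daemon is healthy and saw no changes */
	return bsearch(path, cp->paths, cp->nr,
		       sizeof(*cp->paths), cmp_path) != NULL;
}
```

This keeps the lstat() call in one place (git) and leaves the daemon
with nothing to do but record paths.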
--
Duy


* Re: inotify to minimize stat() calls
       [not found]                   ` <CAKXa9=r2A7UeBV2s2H3wVGdPkS1zZ9huNJhtvTC-p0S5Ed12xA@mail.gmail.com>
@ 2013-04-30  0:27                     ` Duy Nguyen
  0 siblings, 0 replies; 88+ messages in thread
From: Duy Nguyen @ 2013-04-30  0:27 UTC (permalink / raw)
  To: Robert Zeh; +Cc: Junio C Hamano, Ramkumar Ramachandra, Git List

On Tue, Apr 30, 2013 at 1:05 AM, Robert Zeh <robert.allan.zeh@gmail.com> wrote:
> The call to lstat is only there for testing and should not be in there for
> the final version. Is there an easy way to only enable it for tests?

The usual trick is to invent a new GIT_ environment variable. Then check
it and do something different. Then you can set the env in tests only.
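
Something along these lines, as a sketch only: the variable name
GIT_TEST_STAT_CACHE_VERIFY is made up for illustration and is not an
existing Git setting, and the helper names are hypothetical too.
cached_lstat() would run the extra lstat() and the comparison only when
stat_cache_verify_enabled() returns true.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical variable name, not an existing Git setting. */
#define STAT_CACHE_VERIFY_ENV "GIT_TEST_STAT_CACHE_VERIFY"

/*
 * Interpret an environment value as a flag: unset, empty and "0" mean
 * off, anything else means on.
 */
int env_flag_is_true(const char *value)
{
	return value && *value && strcmp(value, "0") != 0;
}

/* Return 1 if the test-only verification of the stat cache is requested. */
int stat_cache_verify_enabled(void)
{
	return env_flag_is_true(getenv(STAT_CACHE_VERIFY_ENV));
}
```

The test scripts would then export the variable, and normal runs never
pay for the second lstat().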
--
Duy


