Git Mailing List Archive on lore.kernel.org
From: Ben Peart <peartben@gmail.com>
To: git@vger.kernel.org
Cc: gitster@pobox.com, benpeart@microsoft.com, pclouds@gmail.com,
	johannes.schindelin@gmx.de, David.Turner@twosigma.com,
	peff@peff.net
Subject: [PATCH v2 0/6] Fast git status via a file system watcher
Date: Thu, 18 May 2017 16:13:27 -0400
Message-ID: <20170518201333.13088-1-benpeart@microsoft.com> (raw)


Changes from V1 include:

 - add 64 bit endianness helper get_be64 in compat/bswap.h
 - switch to using get_be helpers when reading index extension
 - fix leak of strbuf in refresh_by_fsmonitor
 - rename update_istate and clean up parameter list
 - make t/t7519-status-fsmonitor.sh executable
 - update test cases to ensure fsmonitor extension is being used
 - update commit messages and titles


Goal
~~~~
 
Today, git must check existing files to see if there have been changes
and scan the working directory looking for new, untracked files.  As the
number of files and folders in the working directory increases, these
checks can become very expensive: their cost is O(# files in working
directory).

Given the number of new or modified files is typically a very small
percentage of the total number of files, it would be much more
performant if git only had to check files and folders that potentially
had changes. This reduces the cost to O(# modified files).

This patch series makes it possible to optionally add a hook process
that can return the set of files that may have been changed since the
requested time.  Git can then use this to limit its scan to only those
files and folders that potentially have changes.

Design
~~~~~~

A new git hook (query-fsmonitor) must exist and be enabled
(core.fsmonitor=true).  The hook takes a time_t formatted as a string
and writes to stdout all files that have been modified since the
requested time.
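
To make the hook contract concrete, here is a minimal sketch in Python.
It is not the sample hook from this series: it answers the query by
walking the tree and comparing mtimes, which costs as much as the scan
the series is trying to avoid, so it only illustrates the input and
output.  The function names are hypothetical, and the one-path-per-line
output is an assumption, since the text above does not specify a
separator.

```python
import os
import sys


def files_modified_since(root, since):
    """Return '/'-separated paths under root whose mtime is >= since.

    Using >= over-reports on purpose: reporting an unchanged file is
    harmless, while missing a changed one is not.
    """
    changed = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into the repository metadata directory.
        if ".git" in dirnames:
            dirnames.remove(".git")
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            try:
                if os.stat(full).st_mtime >= since:
                    changed.append(rel)
            except OSError:
                # The file vanished between listing and stat; report it.
                changed.append(rel)
    return changed


def run_hook(argv, out=sys.stdout):
    """Hook entry point: argv[1] is the time_t formatted as a string."""
    since = int(argv[1])
    for path in files_modified_since(os.getcwd(), since):
        out.write(path + "\n")  # assumed separator: newline
```

A script installed as the hook would end with run_hook(sys.argv).  A
real hook would ask a file system watcher for the change list instead of
walking the tree.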

A new 'fsmonitor' index extension has been added to store the time the
fsmonitor hook was last queried and an EWAH bitmap of the current
'fsmonitor-dirty' files. Unmarked entries are 'fsmonitor-clean', marked
entries are 'fsmonitor-dirty'.
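
As a rough picture of what the extension carries, the sketch below
serializes a last-query time and one dirty bit per index entry.  The
layout is invented for illustration (a 64-bit big-endian time, a 32-bit
entry count, then an uncompressed bitmap); the real extension uses an
EWAH-compressed bitmap and its exact format is given in the
index-format.txt patch, not here.  Function names are hypothetical.

```python
import struct


def pack_fsmonitor(last_query, dirty_flags):
    """Serialize (time, dirty bits) in a made-up layout, NOT git's format."""
    bitmap = bytearray((len(dirty_flags) + 7) // 8)
    for i, dirty in enumerate(dirty_flags):
        if dirty:
            bitmap[i // 8] |= 1 << (i % 8)
    # ">QI": big-endian unsigned 64-bit time, then unsigned 32-bit count.
    return struct.pack(">QI", last_query, len(dirty_flags)) + bytes(bitmap)


def unpack_fsmonitor(data):
    """Inverse of pack_fsmonitor: return (last_query, list of bools)."""
    last_query, count = struct.unpack_from(">QI", data, 0)
    bitmap = data[struct.calcsize(">QI"):]
    flags = [bool((bitmap[i // 8] >> (i % 8)) & 1) for i in range(count)]
    return last_query, flags
```

A set bit means 'fsmonitor-dirty', a clear bit means 'fsmonitor-clean',
matching the convention described above.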

As needed, git will call the query-fsmonitor hook proc for the set of
changes since the index was last updated. Git then uses this set of
files along with the list saved in the fsmonitor index extension to flag
the potentially dirty index and untracked cache entries.  

refresh_index() and valid_cached_dir() are updated so that any entry not
flagged as potentially dirty is not checked as it cannot have any
changes. This saves all the work of checking files and folders for
changes that are already known to be clean.

If git finds that some entries are marked 'fsmonitor-dirty' but are
really unchanged (e.g. the file was changed, then reverted), git clears
the marking in the extension. If git adds or updates an index entry, it
is marked 'fsmonitor-dirty' to ensure it is checked for changes.
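
The marking rules above can be modelled with plain sets.  This is a
simplified sketch, not the code in fsmonitor.c: the index is a dict from
path to a recorded signature, current_signature is a hypothetical
callback standing in for a stat of the working-tree file, and the
untracked cache (which is where a reported untracked path would matter)
is left out.

```python
def mark_dirty(dirty, index, reported):
    """Merge saved dirty entries with hook-reported paths that are tracked."""
    return set(dirty) | (set(reported) & set(index))


def update_entry(index, dirty, path, signature):
    """Adding or updating an index entry always marks it dirty."""
    index[path] = signature
    dirty.add(path)


def refresh(index, dirty, current_signature):
    """Check only dirty entries; return the ones that really differ.

    Entries that turn out to be unchanged are dropped from the result,
    which corresponds to clearing their mark in the extension.
    """
    still_dirty = set()
    for path in dirty:
        if current_signature(path) != index[path]:
            still_dirty.add(path)
    return still_dirty
```

The point of the model is that refresh never calls current_signature for
a clean entry, so the number of checks follows the size of the dirty set
rather than the size of the index.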

The code is conservative: in case of any error (missing index
extension, error from hook, etc.) it falls back to the normal logic of
checking everything.
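
The fallback can be sketched as follows.  This is an illustration under
assumptions, not the series' implementation: hook_cmd is a hypothetical
argv list for the hook, the output is assumed to be one path per line,
and saved_dirty is None when the index has no fsmonitor extension.

```python
import subprocess


def query_hook(hook_cmd, since):
    """Run the hook; return its paths, or None if anything goes wrong."""
    try:
        result = subprocess.run(
            hook_cmd + [str(since)],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            check=True,  # a non-zero exit status counts as an error
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return [p for p in result.stdout.decode().splitlines() if p]


def paths_to_check(all_paths, saved_dirty, hook_cmd, since):
    """Return the paths that must be examined for changes."""
    if saved_dirty is None:
        return set(all_paths)  # no extension: check everything
    reported = query_hook(hook_cmd, since)
    if reported is None:
        return set(all_paths)  # hook failed: check everything
    return set(saved_dirty) | set(reported)
```

Every failure path returns the full set, so a broken or missing hook can
make the check slower but should not make it miss a change.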

A sample hook is provided in query-fsmonitor.sample to integrate with
the cross platform Watchman file watching service
https://facebook.github.io/watchman/


Performance
~~~~~~~~~~~

The performance wins of this model are pretty dramatic. Each test was
run 3 times and averaged.  "Files" is the number of files in the working
directory.  Tests were done with a cold file system cache as well as
with a warm file system cache on a HDD.  SSD speeds were typically about
10x faster than the HDD.  Typical real world results would fall
somewhere between these extremes. 

*--------------------------------------------------------*
| Repo on HDD | Cache | fsmonitor=false | fsmonitor=true |
*--------------------------------------------------------*
| 3K Files    | Cold  |           0.77s |          0.55s |
+--------------------------------------------------------+
| 100K Files  | Cold  |          38.76s |          2.17s |
+--------------------------------------------------------+
| 3M Files    | Cold  |         421.55s |         18.57s |
+--------------------------------------------------------+
| 3K Files    | Warm  |           0.05s |          0.24s |
+--------------------------------------------------------+
| 100K Files  | Warm  |           1.13s |          0.40s |
+--------------------------------------------------------+
| 3M Files    | Warm  |          59.33s |          4.19s |
+--------------------------------------------------------+

Note that with the smallest repo, warm times actually increase (0.05s
to 0.24s) as the overhead of calling the hook, watchman and perl
outweighs the savings of not scanning the working directory.
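
For readers who want the ratios rather than the raw times, the short
script below derives the speedup for each row of the table above.  The
timings are copied from the table; nothing else is measured here.

```python
# (files, cache) -> (seconds with fsmonitor=false, seconds with fsmonitor=true)
timings = {
    ("3K", "Cold"): (0.77, 0.55),
    ("100K", "Cold"): (38.76, 2.17),
    ("3M", "Cold"): (421.55, 18.57),
    ("3K", "Warm"): (0.05, 0.24),
    ("100K", "Warm"): (1.13, 0.40),
    ("3M", "Warm"): (59.33, 4.19),
}

# A ratio above 1 means fsmonitor=true was faster; below 1 means slower.
speedup = {key: off / on for key, (off, on) in timings.items()}

for (files, cache), ratio in speedup.items():
    print(f"{files:>4} files, {cache.lower()} cache: {ratio:.1f}x")
```

The only ratio below 1 is the 3K-file warm-cache row, which is the
regression noted above.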


Credits
~~~~~~~

Idea taken and code refactored from 
http://public-inbox.org/git/1466914464-10358-1-git-send-email-novalis@novalis.org/

Current version as a fork of GFW on GitHub here: 
https://github.com/benpeart/git-for-windows/tree/fsmonitor

Ben Peart (6):
  bswap: add 64 bit endianness helper get_be64
  dir: make lookup_untracked() available outside of dir.c
  fsmonitor: teach git to optionally utilize a file system monitor to
    speed up detecting new or changed files.
  fsmonitor: add test cases for fsmonitor extension
  fsmonitor: add documentation for the fsmonitor extension.
  fsmonitor: add a sample query-fsmonitor hook script for Watchman

 Documentation/config.txt                 |   7 +
 Documentation/githooks.txt               |  23 +++
 Documentation/technical/index-format.txt |  18 +++
 Makefile                                 |   1 +
 builtin/update-index.c                   |   1 +
 cache.h                                  |   5 +
 compat/bswap.h                           |   4 +
 config.c                                 |   5 +
 dir.c                                    |  15 +-
 dir.h                                    |   5 +
 entry.c                                  |   1 +
 environment.c                            |   1 +
 fsmonitor.c                              | 231 +++++++++++++++++++++++++++++++
 fsmonitor.h                              |   9 ++
 read-cache.c                             |  28 +++-
 t/t7519-status-fsmonitor.sh              | 153 ++++++++++++++++++++
 templates/hooks--query-fsmonitor.sample  |  27 ++++
 unpack-trees.c                           |   1 +
 18 files changed, 532 insertions(+), 3 deletions(-)
 create mode 100644 fsmonitor.c
 create mode 100644 fsmonitor.h
 create mode 100755 t/t7519-status-fsmonitor.sh
 create mode 100644 templates/hooks--query-fsmonitor.sample

-- 
2.13.0.windows.1.6.g4597375fc3


Thread overview: 30+ messages
2017-05-18 20:13 Ben Peart [this message]
2017-05-18 20:13 ` [PATCH v2 1/6] bswap: add 64 bit endianness helper get_be64 Ben Peart
2017-05-18 20:13 ` [PATCH v2 2/6] dir: make lookup_untracked() available outside of dir.c Ben Peart
2017-05-18 20:13 ` [PATCH v2 3/6] fsmonitor: teach git to optionally utilize a file system monitor to speed up detecting new or changed files Ben Peart
2017-05-19 15:33   ` Ben Peart
2017-05-20 10:41     ` Junio C Hamano
2017-05-24 12:30   ` Christian Couder
2017-05-18 20:13 ` [PATCH v2 4/6] fsmonitor: add test cases for fsmonitor extension Ben Peart
2017-05-20 16:55   ` Torsten Bögershausen
2017-05-18 20:13 ` [PATCH v2 5/6] fsmonitor: add documentation for the " Ben Peart
2017-05-20 11:28   ` Junio C Hamano
2017-05-20 12:10   ` Ævar Arnfjörð Bjarmason
2017-05-22 16:18     ` Ben Peart
2017-05-22 17:28       ` Ævar Arnfjörð Bjarmason
2017-05-25 13:49         ` Ben Peart
2017-05-18 20:13 ` [PATCH v2 6/6] fsmonitor: add a sample query-fsmonitor hook script for Watchman Ben Peart
2017-05-24 13:12   ` Christian Couder
2017-05-26  9:47     ` Ævar Arnfjörð Bjarmason
2017-05-26 16:02       ` Ben Peart
2017-05-25 21:05   ` Ævar Arnfjörð Bjarmason
2017-05-24 10:54 ` [PATCH v2 0/6] Fast git status via a file system watcher Christian Couder
2017-05-25 13:55   ` Ben Peart
2017-05-27  6:57     ` Christian Couder
2017-05-30 18:05       ` Ben Peart
2017-05-30 20:33         ` Christian Couder
2017-05-30 23:11           ` Ben Peart
2017-05-31  7:37             ` Christian Couder
2017-05-31  7:59     ` Christian Couder
2017-05-31 13:37       ` Ben Peart
2017-05-31 14:10         ` Ævar Arnfjörð Bjarmason
