Re: File monitor problem

From: Wez Furlong <wez@fb.com>
To: Amir Goldstein <amir73il@gmail.com>, Jan Kara <jack@suse.cz>
Cc: Mo Re Ra <more7.rev@gmail.com>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: File monitor problem
Date: Wed, 11 Dec 2019 22:06:13 +0000
Message-ID: <8486261f-9cf2-e14e-c425-d9df7ba7b277@fb.com>
In-Reply-To: <CAOQ4uximwdf37JdVVfHuM_bxk=X7pz21hnT3thk01oDs_npfhw@mail.gmail.com>

On 12/10/19 12:49, Amir Goldstein wrote:

> [cc: Watchman maintainer]

Hi, I'm the Watchman creator and maintainer, and I also work on a FUSE 
based virtual filesystem called EdenFS that works with the source 
control systems that we use at Facebook.

I don't have much context on fanotify yet, but I do have a lot of 
practical experience with Watchman on various operating systems with 
very large recursive directory trees.

Amir asked me to participate in this discussion, and I think it's 
probably helpful to give a little bit of context on how we deal with 
some of the different watcher interfaces, and also how we see the 
consumers of Watchman making use of this sort of data.  There are tens 
of Watchman-consuming applications in common use inside FB, and a long 
tail of ad-hoc consumers that are not on my radar.

I don't want to flood you with data that may not feel relevant so I'm 
going to try to summarize some key points; I'd be happy to elaborate if 
you'd like more context!  These are written out as numbered statements 
to make it easier to reference in further discussion, and are not 
intended to be taken as any kind of prescriptive manifesto!

1. Humans think in terms of filenames.  Applications largely only care 
about filenames.  It's rare (it came up as a hypothetical for only one 
integrating application at FB in the past several years) that they care 
about optimizing for the various rename cases so long as they get 
notified that the old name is no longer visible in the filesystem and 
that a new name is now visible elsewhere in the portion of the 
filesystem that they are watching.

2. Application authors don't want to deal with the complexities of file 
watching; they just want to reliably know if/when a named file has 
changed.  Rename cookies and overflow events are too difficult for most 
applications to handle correctly, if at all.

3. Overflow events are painful to deal with.  In Watchman we deal with 
inotify overflow by re-crawling and examining the directory structure to 
re-synchronize with the filesystem state.  For very large trees this can 
take a very long time.

4. Partially related to 3., restarting the watchman server is an 
expensive event because we have to re-crawl everything to re-create the 
directory watches with inotify.  If the system provided a recursive 
watch function and some kind of a change journal that told watchman a 
set of N directories to crawl (where N < M, the overall number of 
directories) and we had a stable identifier for files, then we could 
persist state across restarts and cheaply re-synchronize.

5. This is also related to 3. and 4.  We use btrfs subvolumes in our CI to 
snapshot large repos and make them available to jobs running in 
different containers potentially on different hosts.  If the journal 
mechanism from 4. were available in this situation it would make it 
super cheap to bring up watchman in those environments.

6. A downside to recursive watches on macOS is that fseventsd has very 
limited ability to add exceptions.  A common pattern at FB is that the 
buck build system maintains a build artifacts directory called 
`buck-out` in the repo.  On Linux we can ignore change notifications for 
this directory with zero cost by simply not registering it with 
inotify.  On macOS, the kernel interface allows for a maximum of 8 
exclusions.  The rest of the changes are delivered to fseventsd which 
stats and records everything in a sqlite database.  This is a 
performance hotspot for us because the number of excluded directories in 
a repo exceeds 8, and the uninteresting bulky build artifact writes then 
need to transit fseventsd and into watchman before we can decide to 
ignore them.

7. Windows has a journal mechanism that could potentially be used as 
suggested in 4. above, but it requires privileged access.  I happen to 
know from someone at MS who worked on a similar system that there is 
also a way to access a subset of this data that doesn't require 
privileged access, but that isn't documented.  I mention this because 
elsewhere in this thread is a discussion about privileged access to 
similar sounding information.

8. Related to 6. and 7., if there is a privileged system daemon to act 
as the interface between userspace<->kernel, care needs to be taken to 
avoid the sort of performance hotspot we see on macOS with 6. above.
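
To make the idea in 4. a little more concrete, here is a rough Python 
sketch of a restart that re-crawls only the directories a change journal 
reports as touched since a persisted position.  Nothing like this journal 
exists in the kernel today as far as this thread establishes; the journal 
record shape, `PersistedState`, `dirs_to_recrawl` and `resync` are all 
hypothetical names invented for illustration, and the sketch keys on 
directory paths rather than the stable file identifier that 4. asks for.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Set, Tuple

# Hypothetical journal record: (monotonic position, directory that changed).
JournalEntry = Tuple[int, str]


@dataclass
class PersistedState:
    """Hypothetical state a watcher saves before it shuts down."""
    journal_position: int = 0
    # directory path -> entry names seen at the last crawl of that directory
    known: Dict[str, Set[str]] = field(default_factory=dict)


def dirs_to_recrawl(journal: Iterable[JournalEntry], since: int) -> List[str]:
    """Return the directories touched after `since`, deduplicated, in order."""
    ordered: Dict[str, None] = {}
    for position, directory in journal:
        if position > since:
            ordered.setdefault(directory, None)
    return list(ordered)


def resync(
    state: PersistedState,
    journal: List[JournalEntry],
    list_dir: Callable[[str], Iterable[str]],
) -> List[str]:
    """Re-crawl only the N changed directories instead of all M."""
    changed = dirs_to_recrawl(journal, state.journal_position)
    for directory in changed:
        state.known[directory] = set(list_dir(directory))
    # Remember how far we have read so the next restart starts from here.
    state.journal_position = max(
        [state.journal_position] + [pos for pos, _ in journal]
    )
    return changed
```

In a real daemon `list_dir` would be something like `os.listdir`; passing 
it in as a parameter just keeps the sketch testable against a fake tree.  
Directories not named in the journal are never visited, which is where 
the saving over a full re-crawl would come from.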

OK, hopefully that doesn't feel too off the mark!  I don't think 
everything above needs to be handled directly at the kernel interface.  
Some of these details could be handled on the userspace side, either by 
a daemon (eg: watchman) or a suitably well designed client library 
(although that can make it difficult to consume in some programming 
environments).
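
As one illustration of a detail that could live on the userspace side, 
the sketch below (Python) folds raw watcher events into the name-level 
view described in 1. and 2.: a rename becomes "old name vanished, new 
name appeared", the rename cookie is ignored, and an overflow is reduced 
to a single "re-crawl needed" flag.  The `RawEvent` and `NameChanges` 
types and the event kind strings are assumptions made up for this 
example; they are not Watchman's API or the inotify event structure.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class RawEvent:
    """Hypothetical raw event as a watcher backend might deliver it."""
    kind: str  # "create", "modify", "delete", "moved_from", "moved_to", "overflow"
    path: str = ""
    cookie: Optional[int] = None  # rename cookie; deliberately unused below


@dataclass
class NameChanges:
    """What a typical consumer wants: which names came, went, or changed."""
    appeared: Set[str] = field(default_factory=set)
    vanished: Set[str] = field(default_factory=set)
    modified: Set[str] = field(default_factory=set)
    needs_recrawl: bool = False


def summarize(events: Iterable[RawEvent]) -> NameChanges:
    """Collapse a batch of raw events into name-level changes."""
    out = NameChanges()
    for ev in events:
        if ev.kind == "overflow":
            # Events were lost, so the only safe answer is to look again.
            out.needs_recrawl = True
        elif ev.kind in ("create", "moved_to"):
            out.vanished.discard(ev.path)
            out.appeared.add(ev.path)
        elif ev.kind in ("delete", "moved_from"):
            out.appeared.discard(ev.path)
            out.modified.discard(ev.path)
            out.vanished.add(ev.path)
        elif ev.kind == "modify":
            # A name that just appeared is already going to be examined.
            if ev.path not in out.appeared:
                out.modified.add(ev.path)
        else:
            raise ValueError(f"unknown event kind: {ev.kind!r}")
    return out
```

A consumer of `summarize` never sees a cookie or has to pair up the two 
halves of a rename, at the price of not being able to tell a rename from 
an unrelated delete plus create.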

--Wez