From: Amir Goldstein <amir73il@gmail.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: [RFC][PATCH] fanotify: introduce filesystem view mark
Date: Wed, 25 Nov 2020 14:34:16 +0200 [thread overview]
Message-ID: <CAOQ4uxiaaQ9X8EBS-bd2DNMdg7ezNoRXCRvu+4idikx67OFbQQ@mail.gmail.com> (raw)
In-Reply-To: <20201125110156.GB16944@quack2.suse.cz>
On Wed, Nov 25, 2020 at 1:01 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 24-11-20 16:47:41, Amir Goldstein wrote:
> > On Tue, Nov 24, 2020 at 3:49 PM Jan Kara <jack@suse.cz> wrote:
> > > On Mon 09-11-20 20:00:16, Amir Goldstein wrote:
> > > > A filesystem view is a subtree of a filesystem accessible from a specific
> > > > mount point. When marking an FS view, user expects to get events on all
> > > > inodes that are accessible from the marked mount, even if the events
> > > > were generated from another mount.
> > > >
> > > > In particular, the events such as FAN_CREATE, FAN_MOVE, FAN_DELETE that
> > > > are not delivered to a mount mark can be delivered to an FS view mark.
> > > >
> > > > One example of a filesystem view is btrfs subvolume, which cannot be
> > > > marked with a regular filesystem mark.
> > > >
> > > > Another example of a filesystem view is a bind mount, not on the root of
> > > > the filesystem, such as the bind mounts used for containers.
> > > >
> > > > A filesystem view mark is composed of a heads sb mark and an sb_view mark.
> > > > The filesystem view mark is connected to the head sb mark and the head
> > > > sb mark is connected to the sb object. The mask of the head sb mask is
> > > > a cumulative mask of all the associated sb_view mark masks.
> > > >
> > > > Filesystem view marks cannot co-exist with a regular filesystem mark on
> > > > the same filesystem.
> > > >
> > > > When an event is generated on the head sb mark, fsnotify iterates the
> > > > list of associated sb_view marks and filter events that happen outside
> > > > of the sb_view mount's root.
> > > >
> > > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > >
> > > I gave this just a high-level look (no detailed review) and here are my
> > > thoughts:
> > >
> > > 1) I like the functionality. IMO this is what a lot of people really want
> > > when looking for "filesystem wide fs monitoring".
> > >
> > > 2) I don't quite like the API you propose though. IMO it exposes details of
> > > implementation in the API. I'd rather like to have API the same as for
> > > mount marks but with a dedicated mark type flag in the API - like
> > > FAN_MARK_FILESYSTEM_SUBTREE (or we can keep VIEW if you like it but I think
> > > the less terms the better ;).
> >
> > Sure, FAN_MARK_FS_VIEW is a dedicated mark type.
> > The fact that is it a bitwise OR of MOUNT and FILESYSTEM is just a fun fact.
> > Sorry if that wasn't clear.
> > FAN_MARK_FILESYSTEM_SUBTREE sounds better for uapi.
> >
> > But I suppose you also meant that we should not limit the subtree root
> > to bind mount points?
> >
> > The reason I used a reference to mnt for a sb_view and not dentry
> > is because we have fsnotify_clear_marks_by_mount() callback to
> > handle cleanup of the sb_view marks (which I left as TODO).
> >
> > Alternatively, we can play cache pinning games with the subtree root dentry
> > like the case with inode mark, but I didn't want to get into that nor did I know
> > if we should - if subtree mark requires CAP_SYS_ADMIN anyway, why not
> > require a bind mount as its target, which is something much more visible to
> > admins.
>
> Yeah, I don't have problems with bind mounts in particular. Just I was
> thinking that concievably we could make these marks less priviledged (just
> with CAP_DAC_SEARCH or so) and then mountpoints may be unnecessarily
> restricting. I don't think pinning of subtree root dentry would be
> problematic as such - inode marks pin the inode anyway, this is not
> substantially different - if we can make it work reliably...
>
I'm going to need your help for that ;-)
> In fact I was considering for a while that we could even make subtree
> watches completely unpriviledged - when we walk the dir tree anyway, we
> could also check permissions along the way. Due to locking this would be
> difficult to do when generating the event but it might be actually doable
> if we perform the permission check when reporting the event to userspace.
> Just a food for thought...
Maybe... but there are some other advantages to restricting to mount.
One is that with btrfs subvolumes conn->fsid can actually cache the
subvolume's fsid and we could remove the restriction of -EXDEV
error of FAN_MARK_FILESYSTEM on subvolume.
Another has to do with performance optimization (see below).
>
> > > Also I think this is going to get expensive
> > > (e.g. imagine each write to page cache having to traverse potentially deep
> > > tree hierarchy potentially multiple times - once for each subtree). My
> > > suspicion should be verified with actual performance measurement but if I'm
> > > right and the overhead is indeed noticeable, I think we'll need to employ
> > > some caching or so to avoid repeated lookups...
> >
> > It's true, but here is a different angle to analyse the overhead - claim:
> > "If users don't have kernel subtree mark, they will use filesystem mark
> > and filter is userspace". If that claim is true, than one can argue that
> > this is fair - let the listener process pay the CPU overhead which can be
> > contained inside a cgroup and not everyone else. But what is the cost that
> > everyone else will be paying in that case?
> > Everyone will still pay the cost of the fanotify backend callback including
> > allocate, pack and queue the event.
> > The question then becomes, what is cheaper? The d_ancestor() traversal
> > or all the fanotify backend handler code?
> > Note that the former can be lockless and the latter does take locks.
>
> I understand and it's a fair point that queueing of the event is not cheap
> either so I'd be interested in the numbers. Something like how deep subtree
> walk is similar to a cost of queueing event. Or similarly checking of how many
> subtree watches is similarly costly as queueing of one event?
>
The cost shouldn't actually be sensitive to the number of subtree watches.
We should never do more than a single ancestor traversal per event up to
it's sb root and for each ancestor we should be able to check with O(1) if
a subtree watch exists on that inode/dentry.
If we stay with the bind mount restriction then we can use the check
d_mountpoint() on every ancestor which practically reduces the cost per
ancestor to zero in most cases.
> > I have a pretty good bet on the answer, but as you say only actual performance
> > benchmarks can tell.
> >
> > From my experience, real life fsevent listeners do not listen on FAN_MODIFY
> > but they listen on FAN_CLOSE_WRITE, because the the former is too noisy.
>
> Agreed.
>
> > The best case scenario is that users step forward to say that they want to
> > use fanotify but need the subtree filterting and can provide us with real life
> > performance measurement of the userspace vs. kernel alternatives (in terms
> > of penalty on the rest of the system).
>
> With the cost of having to go to userspace and there do essentially the same
> subtree walk as you do in the kernel, I have no doubt what's going to be
> faster (by orders of magnitude). What I'm somewhat uncertain is whether the
> subtree check is OK at the time of event generation or whether it should
> better be moved to the moment when we are about to report the event to
> userspace (when the cost of the subtree check goes to the process
> interested in the event which is fine - but as you properly note we already
> had to pay the cost of queueing the event so it isn't clear this is a win
> even for the processes generating events).
>
I think the only semantics that make sense are:
* If all object's present and past ancestors were always under subtree -
guaranty to get an event
* If all object's present and past ancestors were never under subtree -
guaranty to not get an event
* Otherwise - undefined
So I think it is ok to check for subtree on event generation time.
Thanks,
Amir.
Thanks,
Amir.
next prev parent reply other threads:[~2020-11-25 12:34 UTC|newest]
Thread overview: 38+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-11-09 18:00 [RFC][PATCH] fanotify: introduce filesystem view mark Amir Goldstein
2020-11-10 5:07 ` Amir Goldstein
2020-11-17 7:09 ` [fanotify] a23a7dc576: unixbench.score -3.7% regression kernel test robot
2020-11-24 13:49 ` [RFC][PATCH] fanotify: introduce filesystem view mark Jan Kara
2020-11-24 14:47 ` Amir Goldstein
2020-11-25 11:01 ` Jan Kara
2020-11-25 12:34 ` Amir Goldstein [this message]
2020-11-26 11:10 ` Jan Kara
2020-11-26 11:50 ` Amir Goldstein
2020-11-26 3:42 ` Amir Goldstein
2020-11-26 11:17 ` Jan Kara
2021-04-28 18:28 ` Amir Goldstein
2021-05-03 16:53 ` Jan Kara
2021-05-03 18:44 ` Amir Goldstein
2021-05-05 12:28 ` Jan Kara
2021-05-05 14:24 ` Christian Brauner
2021-05-05 14:42 ` Amir Goldstein
2021-05-05 14:56 ` Christian Brauner
2021-05-10 10:13 ` Jan Kara
2021-05-10 11:37 ` Amir Goldstein
2021-05-10 14:21 ` Jan Kara
2021-05-10 15:08 ` Amir Goldstein
2021-05-10 15:27 ` Jan Kara
2021-05-12 13:07 ` Christian Brauner
2021-05-12 13:34 ` Jan Kara
2021-05-12 16:15 ` Christian Brauner
2021-05-12 15:26 ` Christian Brauner
2021-05-13 10:55 ` Jan Kara
2021-05-14 13:56 ` Christian Brauner
2021-05-15 14:28 ` Amir Goldstein
2021-05-17 9:09 ` Jan Kara
2021-05-17 12:45 ` Amir Goldstein
2021-05-17 13:07 ` Jan Kara
2021-05-18 10:11 ` Christian Brauner
2021-05-18 16:02 ` Amir Goldstein
2021-05-19 9:31 ` Christian Brauner
2021-05-12 16:11 ` Christian Brauner
2021-05-05 13:25 ` Christian Brauner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAOQ4uxiaaQ9X8EBS-bd2DNMdg7ezNoRXCRvu+4idikx67OFbQQ@mail.gmail.com \
--to=amir73il@gmail.com \
--cc=jack@suse.cz \
--cc=linux-fsdevel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).