Linux-Fsdevel Archive on lore.kernel.org
 help / color / Atom feed
From: Jan Kara <jack@suse.cz>
To: Amir Goldstein <amir73il@gmail.com>
Cc: Jan Kara <jack@suse.cz>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Christian Brauner <christian.brauner@ubuntu.com>
Subject: Re: [RFC][PATCH] fanotify: introduce filesystem view mark
Date: Mon, 3 May 2021 18:53:15 +0200
Message-ID: <20210503165315.GE2994@quack2.suse.cz> (raw)
In-Reply-To: <CAOQ4uxgt1Cx5jx3L6iaDvbzCWPv=fcMgLaa9ODkiu9h718MkwQ@mail.gmail.com>

On Wed 28-04-21 21:28:18, Amir Goldstein wrote:
> On Thu, Nov 26, 2020 at 1:17 PM Jan Kara <jack@suse.cz> wrote:
> > On Thu 26-11-20 05:42:01, Amir Goldstein wrote:
> > > On Wed, Nov 25, 2020 at 1:01 PM Jan Kara <jack@suse.cz> wrote:
> > > > On Tue 24-11-20 16:47:41, Amir Goldstein wrote:
> > > > > On Tue, Nov 24, 2020 at 3:49 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > On Mon 09-11-20 20:00:16, Amir Goldstein wrote:
> > > > > > > A filesystem view is a subtree of a filesystem accessible from a specific
> > > > > > > mount point.  When marking an FS view, user expects to get events on all
> > > > > > > inodes that are accessible from the marked mount, even if the events
> > > > > > > were generated from another mount.
> > > > > > >
> > > > > > > In particular, the events such as FAN_CREATE, FAN_MOVE, FAN_DELETE that
> > > > > > > are not delivered to a mount mark can be delivered to an FS view mark.
> > > > > > >
> > > > > > > One example of a filesystem view is btrfs subvolume, which cannot be
> > > > > > > marked with a regular filesystem mark.
> > > > > > >
> > > > > > > Another example of a filesystem view is a bind mount, not on the root of
> > > > > > > the filesystem, such as the bind mounts used for containers.
> > > > > > >
> > > > > > > A filesystem view mark is composed of a heads sb mark and an sb_view mark.
> > > > > > > The filesystem view mark is connected to the head sb mark and the head
> > > > > > > sb mark is connected to the sb object. The mask of the head sb mask is
> > > > > > > a cumulative mask of all the associated sb_view mark masks.
> > > > > > >
> > > > > > > Filesystem view marks cannot co-exist with a regular filesystem mark on
> > > > > > > the same filesystem.
> > > > > > >
> > > > > > > When an event is generated on the head sb mark, fsnotify iterates the
> > > > > > > list of associated sb_view marks and filter events that happen outside
> > > > > > > of the sb_view mount's root.
> > > > > > >
> > > > > > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > > > > >
> > > > > > I gave this just a high-level look (no detailed review) and here are my
> > > > > > thoughts:
> > > > > >
> > > > > > 1) I like the functionality. IMO this is what a lot of people really want
> > > > > > when looking for "filesystem wide fs monitoring".
> > > > > >
> > > > > > 2) I don't quite like the API you propose though. IMO it exposes details of
> > > > > > implementation in the API. I'd rather like to have API the same as for
> > > > > > mount marks but with a dedicated mark type flag in the API - like
> > > > > > FAN_MARK_FILESYSTEM_SUBTREE (or we can keep VIEW if you like it but I think
> > > > > > the less terms the better ;).
> > > > >
> > > > > Sure, FAN_MARK_FS_VIEW is a dedicated mark type.
> > > > > The fact that is it a bitwise OR of MOUNT and FILESYSTEM is just a fun fact.
> > > > > Sorry if that wasn't clear.
> > > > > FAN_MARK_FILESYSTEM_SUBTREE sounds better for uapi.
> > > > >
> > > > > But I suppose you also meant that we should not limit the subtree root
> > > > > to bind mount points?
> > > > >
> > > > > The reason I used a reference to mnt for a sb_view and not dentry
> > > > > is because we have fsnotify_clear_marks_by_mount() callback to
> > > > > handle cleanup of the sb_view marks (which I left as TODO).
> > > > >
> > > > > Alternatively, we can play cache pinning games with the subtree root dentry
> > > > > like the case with inode mark, but I didn't want to get into that nor did I know
> > > > > if we should - if subtree mark requires CAP_SYS_ADMIN anyway, why not
> > > > > require a bind mount as its target, which is something much more visible to
> > > > > admins.
> > > >
> > > > Yeah, I don't have problems with bind mounts in particular. Just I was
> > > > thinking that concievably we could make these marks less priviledged (just
> > > > with CAP_DAC_SEARCH or so) and then mountpoints may be unnecessarily
> > > > restricting. I don't think pinning of subtree root dentry would be
> > > > problematic as such - inode marks pin the inode anyway, this is not
> > > > substantially different - if we can make it work reliably...
> > > >
> > > > In fact I was considering for a while that we could even make subtree
> > > > watches completely unpriviledged - when we walk the dir tree anyway, we
> > > > could also check permissions along the way. Due to locking this would be
> > > > difficult to do when generating the event but it might be actually doable
> > > > if we perform the permission check when reporting the event to userspace.
> > > > Just a food for thought...
> > > >
> > >
> > > I think unprivileged subtree watches are something nice for the future, but
> > > for these FS_VIEW (or whatnot) marks, there is a lower hanging opportunity -
> > > make them require privileges relative to userns.
> >
> > Agreed, that's a middle step.
> >
> > > We don't need to relax that right from the start and it may requires some
> > > more work, but it could allow  unprivileged container user to set a
> > > filesystem-like watch on a filesystem where user is privileged relative
> > > to s_user_ns and that is a big win already.
> >
> > Yep, I'd prefer to separate these two problems. I.e., first handle the
> > subtree watches on their own (just keeping in mind we might want to make
> > them less priviledged eventually), when that it working, we can look in all
> > the implications of making fanotify accessible to less priviledged tasks.
> >
> > > It may also be possible in the future to allow setting this mark on a
> > > "unserns contained" mount - I'm not exactly sure of the details of idmapped
> > > mounts [1], but if mount has a userns associated with it to map fs uids then
> > > in theory we can check the view-ability of the event either at event read time
> > > or at event generation time - it requires that all ancestors have uid/gid that
> > > are *mapped* to the mount userns and nothing else, because we know
> > > that the listener process has CAP_DAC_SEARCH (or more) in the target
> > > userns.
> >
> > Event read is *much* simpler for permission checks IMO. First due to
> > locking necessary for permission checks (i_rwsem, xattr locks etc.), second
> > so that you don't have to mess with credentials used for checking.
> >
> 
> Jan,
> 
> I've lost track of all the "subtree mark" related threads ;-)

Yeah, me as well :)

> Getting back to this old thread, because the "fs view" concept that
> it presented is very close to two POCs I tried out recently which leverage
> the availability of mnt_userns in most of the call sites for fsnotify hooks.
> 
> The first POC was replacing the is_subtree() check with in_userns()
> which is far less expensive:
> 
> https://github.com/amir73il/linux/commits/fanotify_in_userns
> 
> This approach reduces the cost of check per mark, but there could
> still be a significant number of sb marks to iterate for every fs op
> in every container.
> 
> The second POC is based off the first POC but takes the reverse
> approach - instead of marking the sb object and filtering by userns,
> it places a mark on the userns object and filters by sb:
> 
> https://github.com/amir73il/linux/commits/fanotify_idmapped
> 
> The common use case is a single host filesystem which is
> idmapped via individual userns objects to many containers,
> so normally, fs operations inside containers would have to
> iterate a single mark.
> 
> I am well aware of your comments about trying to implement full
> blown subtree marks (up this very thread), but the userns-sb
> join approach is so much more low hanging than full blown
> subtree marks. And as a by-product, it very naturally provides
> the correct capability checks so users inside containers are
> able to "watch their world".
> 
> Patches to allow resolving file handles inside userns with the
> needed permission checks are also available on the POC branch,
> which makes the solution a lot more useful.
> 
> In that last POC, I introduced an explicit uapi flag
> FAN_MARK_IDMAPPED in combination with
> FAN_MARK_FILESYSTEM it provides the new capability.
> This is equivalent to a new mark type, it was just an aesthetic
> decision.

So in principle, I have no problem with allowing mount marks for ns-capable
processes. Also FAN_MARK_FILESYSTEM marks filtered by originating namespace
look OK to me (although if we extended mount marks to support directory
events as you try elsewhere, would there be still be a compeling usecase for
this?).

My main concern is creating a sane API so that if we expand the
functionality in the future we won't create a mess out of all
possibilities.

So I think there are two, relatively orthogonal decicions to make:

1) How the API should look like? For mounts there's no question I guess.
It's a mount mark as any other and we just relax the permission checks.
For FAN_MARK_FILESYSTEM marks we have to me more careful - I think
restricting mark to events generated only from a particular userns has to
be an explicit flag when adding the mark. Otherwise process that is
CAP_SYS_ADMIN in init_user_ns has no way of using these ns-filtered marks.
But this is also the reason why I'd like to think twice before adding this
event filtering if we can cover similar usecases by expanding mount marks
capabilities instead (it would certainly better fit overall API design).

2) Whether to internally attach marks to sb or to userns and how to
efficiently process them when generating events. This is an internal
decision of fsnotify and so I'm not concerned too much about it. We can
always tweak it in the future if the usecases show the CPU overhead is
significant. E.g. we could attach filtered marks to sb but hash it by
userns (or have rbtree ordered by userns in sb) to lower the CPU overhead
if there will be many sb marks expected. Attaching to userns as you suggest
in POC2 is certainly an option as well although I guess I sligthly prefer
to keep things in the sb so that we don't have to create yet another place
to attach marks to and all the handling associated with that.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

  reply index

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-11-09 18:00 Amir Goldstein
2020-11-10  5:07 ` Amir Goldstein
2020-11-17  7:09 ` [fanotify] a23a7dc576: unixbench.score -3.7% regression kernel test robot
2020-11-24 13:49 ` [RFC][PATCH] fanotify: introduce filesystem view mark Jan Kara
2020-11-24 14:47   ` Amir Goldstein
2020-11-25 11:01     ` Jan Kara
2020-11-25 12:34       ` Amir Goldstein
2020-11-26 11:10         ` Jan Kara
2020-11-26 11:50           ` Amir Goldstein
2020-11-26  3:42       ` Amir Goldstein
2020-11-26 11:17         ` Jan Kara
2021-04-28 18:28           ` Amir Goldstein
2021-05-03 16:53             ` Jan Kara [this message]
2021-05-03 18:44               ` Amir Goldstein
2021-05-05 12:28                 ` Jan Kara
2021-05-05 14:24                   ` Christian Brauner
2021-05-05 14:42                     ` Amir Goldstein
2021-05-05 14:56                       ` Christian Brauner
2021-05-10 10:13                     ` Jan Kara
2021-05-10 11:37                       ` Amir Goldstein
2021-05-10 14:21                         ` Jan Kara
2021-05-10 15:08                           ` Amir Goldstein
2021-05-10 15:27                             ` Jan Kara
2021-05-12 13:07                             ` Christian Brauner
2021-05-12 13:34                               ` Jan Kara
2021-05-12 16:15                                 ` Christian Brauner
2021-05-12 15:26                         ` Christian Brauner
2021-05-13 10:55                           ` Jan Kara
2021-05-14 13:56                             ` Christian Brauner
2021-05-15 14:28                               ` Amir Goldstein
2021-05-17  9:09                                 ` Jan Kara
2021-05-17 12:45                                   ` Amir Goldstein
2021-05-17 13:07                                     ` Jan Kara
2021-05-18 10:11                                 ` Christian Brauner
2021-05-18 16:02                                   ` Amir Goldstein
2021-05-19  9:31                                     ` Christian Brauner
2021-05-12 16:11                         ` Christian Brauner
2021-05-05 13:25               ` Christian Brauner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210503165315.GE2994@quack2.suse.cz \
    --to=jack@suse.cz \
    --cc=amir73il@gmail.com \
    --cc=christian.brauner@ubuntu.com \
    --cc=linux-fsdevel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-Fsdevel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-fsdevel/0 linux-fsdevel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-fsdevel linux-fsdevel/ https://lore.kernel.org/linux-fsdevel \
		linux-fsdevel@vger.kernel.org
	public-inbox-index linux-fsdevel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-fsdevel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git