From: Jan Kara <jack@suse.cz>
To: Amir Goldstein <amir73il@gmail.com>
Cc: Jan Kara <jack@suse.cz>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
Christian Brauner <christian.brauner@ubuntu.com>
Subject: Re: [RFC][PATCH] fanotify: introduce filesystem view mark
Date: Wed, 5 May 2021 14:28:15 +0200 [thread overview]
Message-ID: <20210505122815.GD29867@quack2.suse.cz> (raw)
In-Reply-To: <CAOQ4uxgy0DUEUo810m=bnLuHNbs60FLFPUUw8PLq9jJ8VTFD8g@mail.gmail.com>
On Mon 03-05-21 21:44:22, Amir Goldstein wrote:
> > > Getting back to this old thread, because the "fs view" concept that
> > > it presented is very close to two POCs I tried out recently which leverage
> > > the availability of mnt_userns in most of the call sites for fsnotify hooks.
> > >
> > > The first POC was replacing the is_subtree() check with in_userns()
> > > which is far less expensive:
> > >
> > > https://github.com/amir73il/linux/commits/fanotify_in_userns
> > >
> > > This approach reduces the cost of check per mark, but there could
> > > still be a significant number of sb marks to iterate for every fs op
> > > in every container.
> > >
> > > The second POC is based off the first POC but takes the reverse
> > > approach - instead of marking the sb object and filtering by userns,
> > > it places a mark on the userns object and filters by sb:
> > >
> > > https://github.com/amir73il/linux/commits/fanotify_idmapped
> > >
> > > The common use case is a single host filesystem which is
> > > idmapped via individual userns objects to many containers,
> > > so normally, fs operations inside containers would have to
> > > iterate a single mark.
> > >
> > > I am well aware of your comments about trying to implement full
> > > blown subtree marks (up this very thread), but the userns-sb
> > > join approach is so much more low hanging than full blown
> > > subtree marks. And as a by-product, it very naturally provides
> > > the correct capability checks so users inside containers are
> > > able to "watch their world".
> > >
> > > Patches to allow resolving file handles inside userns with the
> > > needed permission checks are also available on the POC branch,
> > > which makes the solution a lot more useful.
> > >
> > > In that last POC, I introduced an explicit uapi flag
> > > FAN_MARK_IDMAPPED in combination with
> > > FAN_MARK_FILESYSTEM it provides the new capability.
> > > This is equivalent to a new mark type, it was just an aesthetic
> > > decision.
> >
> > So in principle, I have no problem with allowing mount marks for ns-capable
> > processes. Also FAN_MARK_FILESYSTEM marks filtered by originating namespace
> > look OK to me (although if we extended mount marks to support directory
> > events as you try elsewhere, would there be still be a compeling usecase for
> > this?).
>
> In my opinion it would. This is the reason why I stopped that direction.
> The difference between FAN_MARK_FILESYSTEM|FAN_MARK_IDMAPPED
> and FAN_MARK_MOUNT is that the latter can be easily "escaped" by creating
> a bind mount or cloning a mount ns while the former is "sticky" to all additions
> to the mount tree that happen below the idmapped mount.
As far as I understood Christian, he was specifically interested in mount
events for container runtimes because filtering by 'mount' was desirable
for his usecase. But maybe I misunderstood. Christian? Also if you have
FAN_MARK_FILESYSTEM mark filtered by namespace, you still will not see
events to your shared filesystem generated from another namespace. So
"escaping" is just a matter of creating new namespace and mounting fs
there?
> That is a key difference that can allow running system services that use sb
> marks inside containers and actually be useful.
> "All" the system service needs to do in order to become idmapped aware
> is to check the path it is marking in /proc/self/mounts (or via syscall) and
> set the FAN_MARK_IDMAPPED flag.
> Everything else "just works" the same as in init user ns.
>
> > My main concern is creating a sane API so that if we expand the
> > functionality in the future we won't create a mess out of all
> > possibilities.
> >
>
> Agreed.
> If and when I post these patches, I will include the complete vision
> for the API to show where this fits it.
OK, thanks.
> > So I think there are two, relatively orthogonal decicions to make:
> >
> > 1) How the API should look like? For mounts there's no question I guess.
> > It's a mount mark as any other and we just relax the permission checks.
>
> Right.
>
> > For FAN_MARK_FILESYSTEM marks we have to me more careful - I think
> > restricting mark to events generated only from a particular userns has to
> > be an explicit flag when adding the mark. Otherwise process that is
> > CAP_SYS_ADMIN in init_user_ns has no way of using these ns-filtered marks.
>
> True. That's the reason I added the explicit flag in POC2.
>
> > But this is also the reason why I'd like to think twice before adding this
> > event filtering if we can cover similar usecases by expanding mount marks
> > capabilities instead (it would certainly better fit overall API design).
> >
>
> I explained above why that would not be good enough IMO.
> I think that expanding mount marks to support more events is nice for
> the unified APIs, but it is not nice enough IMO to justify the efforts
> related to
> promoting the vfs API changes against resistance and testing all the affected
> filesystems.
Understood. I agree if extending mount marks isn't enough for the container
usecase, then it's probably not worth the effort at this point.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
next prev parent reply other threads:[~2021-05-05 12:28 UTC|newest]
Thread overview: 38+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-11-09 18:00 [RFC][PATCH] fanotify: introduce filesystem view mark Amir Goldstein
2020-11-10 5:07 ` Amir Goldstein
2020-11-17 7:09 ` [fanotify] a23a7dc576: unixbench.score -3.7% regression kernel test robot
2020-11-24 13:49 ` [RFC][PATCH] fanotify: introduce filesystem view mark Jan Kara
2020-11-24 14:47 ` Amir Goldstein
2020-11-25 11:01 ` Jan Kara
2020-11-25 12:34 ` Amir Goldstein
2020-11-26 11:10 ` Jan Kara
2020-11-26 11:50 ` Amir Goldstein
2020-11-26 3:42 ` Amir Goldstein
2020-11-26 11:17 ` Jan Kara
2021-04-28 18:28 ` Amir Goldstein
2021-05-03 16:53 ` Jan Kara
2021-05-03 18:44 ` Amir Goldstein
2021-05-05 12:28 ` Jan Kara [this message]
2021-05-05 14:24 ` Christian Brauner
2021-05-05 14:42 ` Amir Goldstein
2021-05-05 14:56 ` Christian Brauner
2021-05-10 10:13 ` Jan Kara
2021-05-10 11:37 ` Amir Goldstein
2021-05-10 14:21 ` Jan Kara
2021-05-10 15:08 ` Amir Goldstein
2021-05-10 15:27 ` Jan Kara
2021-05-12 13:07 ` Christian Brauner
2021-05-12 13:34 ` Jan Kara
2021-05-12 16:15 ` Christian Brauner
2021-05-12 15:26 ` Christian Brauner
2021-05-13 10:55 ` Jan Kara
2021-05-14 13:56 ` Christian Brauner
2021-05-15 14:28 ` Amir Goldstein
2021-05-17 9:09 ` Jan Kara
2021-05-17 12:45 ` Amir Goldstein
2021-05-17 13:07 ` Jan Kara
2021-05-18 10:11 ` Christian Brauner
2021-05-18 16:02 ` Amir Goldstein
2021-05-19 9:31 ` Christian Brauner
2021-05-12 16:11 ` Christian Brauner
2021-05-05 13:25 ` Christian Brauner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20210505122815.GD29867@quack2.suse.cz \
--to=jack@suse.cz \
--cc=amir73il@gmail.com \
--cc=christian.brauner@ubuntu.com \
--cc=linux-fsdevel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).