Linux-Fsdevel Archive on lore.kernel.org
From: Christian Brauner <christian.brauner@ubuntu.com>
To: Amir Goldstein <amir73il@gmail.com>
Cc: Jan Kara <jack@suse.cz>, linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: [RFC][PATCH] fanotify: introduce filesystem view mark
Date: Wed, 5 May 2021 16:56:46 +0200
Message-ID: <20210505145646.2bjlujl46qgf6qtd@wittgenstein>
In-Reply-To: <CAOQ4uxiab7M8i5E9shfCS=Rof1vCK0Jpfi-A2_3LPnK=_gFT9g@mail.gmail.com>

On Wed, May 05, 2021 at 05:42:46PM +0300, Amir Goldstein wrote:
> On Wed, May 5, 2021 at 5:24 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > On Wed, May 05, 2021 at 02:28:15PM +0200, Jan Kara wrote:
> > > On Mon 03-05-21 21:44:22, Amir Goldstein wrote:
> > > > > > Getting back to this old thread, because the "fs view" concept that
> > > > > > it presented is very close to two POCs I tried out recently which leverage
> > > > > > the availability of mnt_userns in most of the call sites for fsnotify hooks.
> > > > > >
> > > > > > The first POC was replacing the is_subtree() check with in_userns()
> > > > > > which is far less expensive:
> > > > > >
> > > > > > https://github.com/amir73il/linux/commits/fanotify_in_userns
> > > > > >
> > > > > > This approach reduces the cost of check per mark, but there could
> > > > > > still be a significant number of sb marks to iterate for every fs op
> > > > > > in every container.
> > > > > >
> > > > > > The second POC is based off the first POC but takes the reverse
> > > > > > approach - instead of marking the sb object and filtering by userns,
> > > > > > it places a mark on the userns object and filters by sb:
> > > > > >
> > > > > > https://github.com/amir73il/linux/commits/fanotify_idmapped
> > > > > >
> > > > > > The common use case is a single host filesystem which is
> > > > > > idmapped via individual userns objects to many containers,
> > > > > > so normally, fs operations inside containers would have to
> > > > > > iterate a single mark.
> > > > > >
> > > > > > I am well aware of your comments about trying to implement full
> > > > > > blown subtree marks (up this very thread), but the userns-sb
> > > > > > join approach is so much more low hanging than full blown
> > > > > > subtree marks. And as a by-product, it very naturally provides
> > > > > > the correct capability checks so users inside containers are
> > > > > > able to "watch their world".
> > > > > >
> > > > > > Patches to allow resolving file handles inside userns with the
> > > > > > needed permission checks are also available on the POC branch,
> > > > > > which makes the solution a lot more useful.
> > > > > >
> > > > > > > In that last POC, I introduced an explicit uapi flag,
> > > > > > > FAN_MARK_IDMAPPED; in combination with
> > > > > > > FAN_MARK_FILESYSTEM it provides the new capability.
> > > > > > This is equivalent to a new mark type, it was just an aesthetic
> > > > > > decision.
> > > > >
> > > > > So in principle, I have no problem with allowing mount marks for ns-capable
> > > > > processes. Also FAN_MARK_FILESYSTEM marks filtered by originating namespace
> > > > > look OK to me (although if we extended mount marks to support directory
> > > > > > events as you try elsewhere, would there still be a compelling use case for
> > > > > this?).
> > > >
> > > > In my opinion it would. This is the reason why I stopped that direction.
> > > > The difference between FAN_MARK_FILESYSTEM|FAN_MARK_IDMAPPED
> > > > and FAN_MARK_MOUNT is that the latter can be easily "escaped" by creating
> > > > a bind mount or cloning a mount ns while the former is "sticky" to all additions
> > > > to the mount tree that happen below the idmapped mount.
> > >
> > > As far as I understood Christian, he was specifically interested in mount
> > > events for container runtimes because filtering by 'mount' was desirable
> > > for his usecase. But maybe I misunderstood. Christian? Also if you have
> >
> > I discussed this with Amir about two weeks ago. For container runtimes
> > Amir's idea of generating events based on the userns the fsnotify
> > instance was created in is actually quite clever because it gives a way
> > for the container to receive events for all filesystems and idmapped
> > mounts if its userns is attached to it. The model as we discussed it -
> > Amir, please tell me if I'm wrong - is that you'd be setting up an
> > fsnotify watch in a given userns and you'd be seeing events from all
> > superblocks that have the caller's userns as s_user_ns and all mounts
> > that have the caller's userns as mnt_userns. I think that's safe.
> 
> Not sure if we want to get events from all the fs mounted in this userns.
> We do not want events from proc/sys/debug fs which are mounted inside
> the userns.
> 
> My POC does not implement a watch for ALL fs in userns, it implements
> only a filtered watch by userns-sb pair.

Right, that was me being sloppy. What I meant was "all marked
filesystems".

> 
> >
> > The reason why I found mount marks to be so compelling initially was
> > that they also work in cases where the caller is not in the userns that
> > is attached to the mnt (Similar to how you don't need to be in the
> > s_user_ns of the superblock you attached a filesystem mark to.).
> > That's not per se a container use-case though as the container will
> > almost always be in the userns that is attached to the mount (They don't
> > have to be of course just as with s_user_ns. You can very well be clever
> > and make a superblock be visible outside of the mounter's userns.).
> >
> > In addition the mount mark seemed to offer more granularity as the
> > caller can specifically select what they want to monitor. But I don't
> > think that justifies the complexity of the implementation that we would
> > need to push for.
> >
> > > FAN_MARK_FILESYSTEM mark filtered by namespace, you still will not see
> > > events to your shared filesystem generated from another namespace. So
> > > "escaping" is just a matter of creating new namespace and mounting fs
> > > there?
> >
> > Hm, that depends on the implementation. If Amir is using in_userns()
> > then the caller would be seeing events for their own userns and all
> > descendant userns. Since userns are hierarchical a container creating a
> > new userns wouldn't be able to "escape" the notifications.
> >
> 
> Not seeing events generated from another userns idmapped mount is a feature.

Sure, it just depends on the implementation. :) I agree with you.
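
To make the in_userns() point quoted above concrete, here is a small
userspace model of the hierarchy check. It is a sketch, not the kernel code:
struct toy_userns and toy_in_userns() are names invented for this example,
and only the parent pointer and nesting level are modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a user namespace: just enough to walk the hierarchy. */
struct toy_userns {
	const struct toy_userns *parent; /* NULL for the initial namespace */
	int level;                       /* 0 for the initial namespace */
};

/* True if 'child' is 'ancestor' itself or one of its descendants. */
bool toy_in_userns(const struct toy_userns *ancestor,
		   const struct toy_userns *child)
{
	const struct toy_userns *ns;

	assert(ancestor && child);
	/* Climb from 'child' until we are at or above the ancestor's level. */
	for (ns = child; ns && ns->level > ancestor->level; ns = ns->parent)
		;
	return ns == ancestor;
}
```

A namespace created inside the container sits below the container's userns
in this walk, so it still matches; a sibling container at the same level
does not.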

> FAN_MARK_FILESYSTEM gets events generated on fs from anywhere.
> FAN_MARK_MOUNT gets events generated on fs only via a specific mount.
> The idmapped fs mark is in between - get all events on fs via any mount inside
> a specific container (and all its descendants).
> 
> Escaping is not possible from within the container. In order to generate events
> that are not via a mount that is idmapped to the container userns, the host
> would need to provide access to a non-idmapped mount into the container and
> that would be a container management problem, not an fsnotify problem.
> 
> Christian, please correct me if I am wrong.

Yep, totally correct.
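
The three scopes Amir lists above can be modelled in a few lines of
userspace C. This is only an illustration of the matching rules as described
in this thread, not the POC implementation; every type and function name
here (toy_ns, toy_event, toy_mark, toy_mark_matches) is invented, and
superblocks and mounts are reduced to integer ids.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_scope { TOY_FILESYSTEM, TOY_MOUNT, TOY_IDMAPPED_FS };

/* Toy user namespace: only the parent link matters here. */
struct toy_ns {
	const struct toy_ns *parent; /* NULL for the host namespace */
};

/* An event carries the sb, the mount it came through, and that mount's userns. */
struct toy_event {
	int sb_id;
	int mnt_id;
	const struct toy_ns *mnt_userns;
};

struct toy_mark {
	enum toy_scope scope;
	int sb_id;                   /* used by FILESYSTEM and IDMAPPED_FS */
	int mnt_id;                  /* used by MOUNT */
	const struct toy_ns *userns; /* used by IDMAPPED_FS */
};

/* True if 'ns' is 'ancestor' or one of its descendants. */
static bool toy_same_or_descendant(const struct toy_ns *ancestor,
				   const struct toy_ns *ns)
{
	for (; ns; ns = ns->parent)
		if (ns == ancestor)
			return true;
	return false;
}

bool toy_mark_matches(const struct toy_mark *m, const struct toy_event *e)
{
	assert(m && e);
	switch (m->scope) {
	case TOY_FILESYSTEM:  /* events on the fs from anywhere */
		return e->sb_id == m->sb_id;
	case TOY_MOUNT:       /* events only via one specific mount */
		return e->mnt_id == m->mnt_id;
	case TOY_IDMAPPED_FS: /* events on the fs via any mount idmapped to
			       * the container userns or a descendant */
		return e->sb_id == m->sb_id &&
		       toy_same_or_descendant(m->userns, e->mnt_userns);
	}
	return false;
}
```

In this model an event that arrives through a host (non-idmapped) mount never
matches the idmapped mark, which is the "container management problem" case
described above.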

Christian

