linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Christian Brauner <christian.brauner@ubuntu.com>
To: Amir Goldstein <amir73il@gmail.com>
Cc: Jan Kara <jack@suse.cz>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Miklos Szeredi <miklos@szeredi.hu>
Subject: Re: [RFC][PATCH] fanotify: introduce filesystem view mark
Date: Tue, 18 May 2021 12:11:35 +0200	[thread overview]
Message-ID: <20210518101135.jrldavggoibfpjhs@wittgenstein> (raw)
In-Reply-To: <CAOQ4uxgngZjBseOC_qYtxjZ_J4Rc50_Y7G+CSSpJznKBXvSU5A@mail.gmail.com>

On Sat, May 15, 2021 at 05:28:27PM +0300, Amir Goldstein wrote:
> On Fri, May 14, 2021 at 4:56 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > On Thu, May 13, 2021 at 12:55:26PM +0200, Jan Kara wrote:
> > > On Wed 12-05-21 17:26:25, Christian Brauner wrote:
> > > > On Mon, May 10, 2021 at 02:37:59PM +0300, Amir Goldstein wrote:
> > > > > On Mon, May 10, 2021 at 1:13 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > OK, so this feature would effectively allow sb-wide watching of events that
> > > > > > are generated from within the container (or its descendants). That sounds
> > > > > > useful. Just one question: If there's some part of a filesystem, that is
> > > > > > accesible by multiple containers (and thus multiple namespaces), or if
> > > > > > there's some change done to the filesystem say by container management SW,
> > > > > > then event for this change won't be visible inside the container (despite
> > > > > > that the fs change itself will be visible).
> > > > >
> > > > > That is correct.
> > > > > FYI, a privileged user can already mount an overlayfs in order to indirectly
> > > > > open and write to a file.
> > > > >
> > > > > Because overlayfs opens the underlying file FMODE_NONOTIFY this will
> > > > > hide OPEN/ACCESS/MODIFY/CLOSE events also for inode/sb marks.
> > > > > Since 459c7c565ac3 ("ovl: unprivieged mounts"), so can unprivileged users.
> > > > >
> > > > > I wonder if that is a problem that we need to fix...
> > > > >
> > > > > > This is kind of a similar
> > > > > > problem to the one we had with mount marks and why sb marks were created.
> > > > > > So aren't we just repeating the mistake with mount marks? Because it seems
> > > > > > to me that more often than not, applications are interested in getting
> > > > > > notification when what they can actually access within the fs has changed
> > > > > > (and this is what they actually get with the inode marks) and they don't
> > > > > > care that much where the change came from... Do you have some idea how
> > > > > > frequent are such cross-ns filesystem changes?
> > > > >
> > > > > The use case surely exist, the question is whether this use case will be
> > > > > handled by a single idmapped userns or multiple userns.
> > > > >
> > > > > You see, we simplified the discussion to an idmapped mount that uses
> > > > > the same userns and the userns the container processes are associated
> > > > > with, but in fact, container A can use userns A container B userns B and they
> > > > > can both access a shared idmapped mount mapped with userns AB.
> > > > >
> > > > > I think at this point in time, there are only ideas about how the shared data
> > > > > case would be managed, but Christian should know better than me.
> > > >
> > > > I think there are two major immediate container use-cases right now that
> > > > are already actively used:
> > > > 1. idmapped rootfs
> > > > A container manager wants to avoid recursively chowning the rootfs or
> > > > image for a container. To this end an idmapped mount is created. The
> > > > idmapped mount can either share the same userns as the container itself
> > > > or a separate userns can be used. What people use depends on their
> > > > concept of a container.
> > > > For example, systemd has merged support for idmapping a containers
> > > > rootfs in [1]. The systemd approach to containers never puts the
> > > > container itself in control of most things including most of its mounts.
> > > > That is very much the approach of having it be a rather tightly managed
> > > > system. Specifically, this means that systemd currently uses a separate
> > > > userns to idmap.
> > > > In contrast other container managers usually treat the container as a
> > > > mostly separate system and put it in charge of all its mounts. This
> > > > means the userns used for the idmapped mount will be the same as the
> > > > container runs in (see [2]).
> > >
> > > OK, thanks for explanation. So to make fanotify idmap-filtered marks work
> > > for systemd-style containers we would indeed need what Amir proposed -
> > > i.e., the container manager intercepts fanotify_mark calls and decides
> > > which namespace to setup the mark in as there's no sufficient priviledge
> > > within the container to do that AFAIU.
> >
> > Yes, that's how that would work.
> >
> > >
> > > > 2. data sharing among containers or among the host and containers etc.
> > > > The most common use-case is to share data from the host with the
> > > > container such as a download folder or the Linux folder on ChromeOS.
> > > > Most container managers will simly re-use the container's userns for
> > > > that too. More complex cases arise where data is shared between
> > > > containers with different idmappings then often a separate userns will
> > > > have to be used.
> > >
> > > OK, but if say on ChromeOS you copy something to the Linux folder by app A
> > > (say file manager) and containerized app B (say browser) watches that mount
> >
> > For ChromeOS it is currently somewhat simple since they currently only
> > allow a single container by default. So everytime you start an app in
> > the container it's the same app so they all write to the Linux Files
> > folder through the same container. (I'm glossing over a range of details
> > but that's not really relevant to the general spirit of the example.).
> >
> >
> > > for changes with idmap-filtered mark, then it won't see notification for
> > > those changes because A presumably runs in a different namespace than B, am
> > > I imagining this right? So mark which filters events based on namespace of
> > > the originating process won't be usable for such usecase AFAICT.
> >
> > Idmap filtered marks won't cover that use-case as envisioned now. Though
> > I'm not sure they really need to as the semantics are related to mount
> > marks.
> 
> We really need to refer to those as filesystem marks. They are definitely
> NOT mount marks. We are trying to design a better API that will not share
> as many flaws with mount marks...
> 
> > A mount mark would allow you to receive events based on the
> > originating mount. If two mounts A and B are separate but expose the
> > same files you wouldn't see events caused by B if you're watching A.
> > Similarly you would only see events from mounts that have been delegated
> > to you through the idmapped userns. I find this acceptable especially if
> > clearly documented.
> >
> 
> The way I see it, we should delegate all the decisions over to userspace,
> but I agree that the current "simple" proposal may not provide a good
> enough answer to the case of a subtree that is shared with the host.

I was focussed on what happens if you set an idmapped filtered mark for
a container for a set of files that is exposed to another container via
another idmapped mount. And it seemed to me that it was ok if the
container A doesn't see events from container B.

You seem to be looking at this from the host's perspective right now
which is interesting as well.

> 
> IMO, it should be a container manager decision whether changes done by
> the host are:
> a) Not visible to containerized application

Yes, that seems ok.

> b) Watched in host via recursive inode watches
> c) Watched in host by filesystem mark filtered in userspace
> d) Watched in host by an "noop" idmapped mount in host, through
>      which all relevant apps in host access the shared folder

So b)-d) are concerned with the host getting notifcations for changes
done from any container that uses a given set of files possibly through
different mounts.

> 
> We can later provide the option of "subtree filtered filesystem mark"
> which can be choice (e). It will incur performance overhead on the system
> that is higher than option (d) but lower than option (c).
> 
> In the end, it all depends on the individual use case.
> 
> The way forward as I see it is:
> 1. Need to write a proposal where FAN_MARK_IDMAPPED is the
>     first step towards a wider API that also includes subtree marks -
>     whether we end up implementing subtree mark or not
> 2. Need to make sure that setting up N idmapped marks
>     does not have O(N) performance overhead on the system
> 3. Need to make sure that the idmapped marks proposal is deemed
>     desired by concrete container manager project and use cases
> 
> If there are no objections to this roadmap, I will prepare the
> proposal on my next context switch.

Yes, sounds good.

Christian


  parent reply	other threads:[~2021-05-18 10:11 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-11-09 18:00 [RFC][PATCH] fanotify: introduce filesystem view mark Amir Goldstein
2020-11-10  5:07 ` Amir Goldstein
2020-11-17  7:09 ` [fanotify] a23a7dc576: unixbench.score -3.7% regression kernel test robot
2020-11-24 13:49 ` [RFC][PATCH] fanotify: introduce filesystem view mark Jan Kara
2020-11-24 14:47   ` Amir Goldstein
2020-11-25 11:01     ` Jan Kara
2020-11-25 12:34       ` Amir Goldstein
2020-11-26 11:10         ` Jan Kara
2020-11-26 11:50           ` Amir Goldstein
2020-11-26  3:42       ` Amir Goldstein
2020-11-26 11:17         ` Jan Kara
2021-04-28 18:28           ` Amir Goldstein
2021-05-03 16:53             ` Jan Kara
2021-05-03 18:44               ` Amir Goldstein
2021-05-05 12:28                 ` Jan Kara
2021-05-05 14:24                   ` Christian Brauner
2021-05-05 14:42                     ` Amir Goldstein
2021-05-05 14:56                       ` Christian Brauner
2021-05-10 10:13                     ` Jan Kara
2021-05-10 11:37                       ` Amir Goldstein
2021-05-10 14:21                         ` Jan Kara
2021-05-10 15:08                           ` Amir Goldstein
2021-05-10 15:27                             ` Jan Kara
2021-05-12 13:07                             ` Christian Brauner
2021-05-12 13:34                               ` Jan Kara
2021-05-12 16:15                                 ` Christian Brauner
2021-05-12 15:26                         ` Christian Brauner
2021-05-13 10:55                           ` Jan Kara
2021-05-14 13:56                             ` Christian Brauner
2021-05-15 14:28                               ` Amir Goldstein
2021-05-17  9:09                                 ` Jan Kara
2021-05-17 12:45                                   ` Amir Goldstein
2021-05-17 13:07                                     ` Jan Kara
2021-05-18 10:11                                 ` Christian Brauner [this message]
2021-05-18 16:02                                   ` Amir Goldstein
2021-05-19  9:31                                     ` Christian Brauner
2021-05-12 16:11                         ` Christian Brauner
2021-05-05 13:25               ` Christian Brauner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210518101135.jrldavggoibfpjhs@wittgenstein \
    --to=christian.brauner@ubuntu.com \
    --cc=amir73il@gmail.com \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=miklos@szeredi.hu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).