containers.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
From: Christian Brauner <christian.brauner@ubuntu.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Lennart Poettering <lennart@poettering.net>,
	Mimi Zohar <zohar@linux.ibm.com>,
	David Howells <dhowells@redhat.com>,
	Andreas Dilger <adilger.kernel@dilger.ca>,
	containers@lists.linux-foundation.org,
	Tycho Andersen <tycho@tycho.ws>,
	Miklos Szeredi <miklos@szeredi.hu>,
	smbarber@chromium.org, Christoph Hellwig <hch@infradead.org>,
	linux-ext4@vger.kernel.org, Mrunal Patel <mpatel@redhat.com>,
	Kees Cook <keescook@chromium.org>, Arnd Bergmann <arnd@arndb.de>,
	Jann Horn <jannh@google.com>,
	selinux@vger.kernel.org, Josh Triplett <josh@joshtriplett.org>,
	linux-fsdevel@vger.kernel.org,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Andy Lutomirski <luto@kernel.org>,
	OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>,
	Geoffrey Thomas <geofft@ldpreload.com>,
	James Bottomley <James.Bottomley@hansenpartnership.com>,
	John Johansen <john.johansen@canonical.com>,
	Theodore Tso <tytso@mit.edu>,
	Seth Forshee <seth.forshee@canonical.com>,
	Dmitry Kasatkin <dmitry.kasatkin@gmail.com>,
	Stephen Smalley <stephen.smalley.work@gmail.com>,
	Jonathan Corbet <corbet@lwn.net>,
	linux-unionfs@vger.kernel.org,
	linux-security-module@vger.kernel.org, linux-audit@redhat.com,
	"Eric W. Biederman" <ebiederm@xmission.com>,
	linux-api@vger.kernel.org,
	Casey Schaufler <casey@schaufler-ca.com>,
	Alban Crequy <alban@kinvolk.io>,
	linux-integrity@vger.kernel.org, Todd Kjos <tkjos@google.com>
Subject: Re: [PATCH 00/34] fs: idmapped mounts
Date: Thu, 29 Oct 2020 17:19:20 +0100	[thread overview]
Message-ID: <20201029161920.zp7p3335x3q2a36e@wittgenstein> (raw)
In-Reply-To: <20201029022733.GB306023@dread.disaster.area>

On Thu, Oct 29, 2020 at 01:27:33PM +1100, Dave Chinner wrote:
> On Thu, Oct 29, 2020 at 01:32:18AM +0100, Christian Brauner wrote:
> > Hey everyone,
> > 
> > I vanished for a little while to focus on this work here so sorry for
> > not being available by mail for a while.
> > 
> > Since quite a long time we have issues with sharing mounts between
> > multiple unprivileged containers with different id mappings, sharing a
> > rootfs between multiple containers with different id mappings, and also
> > sharing regular directories and filesystems between users with different
> > uids and gids. The latter use-cases have become even more important with
> > the availability and adoption of systemd-homed (cf. [1]) to implement
> > portable home directories.
> > 
> > The solutions we have tried and proposed so far include the introduction
> > of fsid mappings, a tiny overlay based filesystem, and an approach to
> > call override creds in the vfs. None of these solutions have covered all
> > of the above use-cases.
> > 
> > The solution proposed here has it's origins in multiple discussions
> > during Linux Plumbers 2017 during and after the end of the containers
> > microconference.
> > To the best of my knowledge this involved Aleksa, Stéphane, Eric, David,
> > James, and myself. A variant of the solution proposed here has also been
> > discussed, again to the best of my knowledge, after a Linux conference
> > in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017
> > after Linux Plumbers.
> > I've taken the time to finally implement a working version of this
> > solution over the last weeks to the best of my abilities. Tycho has
> > signed up for this sligthly crazy endeavour as well and he has helped
> > with the conversion of the xattr codepaths.
> > 
> > The core idea is to make idmappings a property of struct vfsmount
> > instead of tying it to a process being inside of a user namespace which
> > has been the case for all other proposed approaches.
> > It means that idmappings become a property of bind-mounts, i.e. each
> > bind-mount can have a separate idmapping. This has the obvious advantage
> > that idmapped mounts can be created inside of the initial user
> > namespace, i.e. on the host itself instead of requiring the caller to be
> > located inside of a user namespace. This enables such use-cases as e.g.
> > making a usb stick available in multiple locations with different
> > idmappings (see the vfat port that is part of this patch series).
> > 
> > The vfsmount struct gains a new struct user_namespace member. The
> > idmapping of the user namespace becomes the idmapping of the mount. A
> > caller that is either privileged with respect to the user namespace of
> > the superblock of the underlying filesystem or a caller that is
> > privileged with respect to the user namespace a mount has been idmapped
> > with can create a new bind-mount and mark it with a user namespace. The
> > user namespace the mount will be marked with can be specified by passing
> > a file descriptor refering to the user namespace as an argument to the
> > new mount_setattr() syscall together with the new MOUNT_ATTR_IDMAP flag.
> > By default vfsmounts are marked with the initial user namespace and no
> > behavioral or performance changes should be observed. All mapping
> > operations are nops for the initial user namespace.
> > 
> > When a file/inode is accessed through an idmapped mount the i_uid and
> > i_gid of the inode will be remapped according to the user namespace the
> > mount has been marked with. When a new object is created based on the
> > fsuid and fsgid of the caller they will similarly be remapped according
> > to the user namespace of the mount they care created from.
> > 
> > This means the user namespace of the mount needs to be passed down into
> > a few relevant inode_operations. This mostly includes inode operations
> > that create filesystem objects or change file attributes.
> 
> That's really quite ... messy.

I don't agree. It's changes all across the vfs but it's not hacky in any
way since it cleanly passes down an additional argument (I'm biased of
course.). 

> 
> Maybe I'm missing something, but if you have the user_ns to be used
> for the VFS operation we are about to execute then why can't we use
> the same model as current_fsuid/current_fsgid() for passing the
> filesystem credentials down to the filesystem operations?  i.e.
> attach it to the current->cred->fs_userns, and then the filesystem
> code that actually needs to know the current userns can call
> current_fs_user_ns() instead of current_user_ns().  i.e.
> 
> #define current_fs_user_ns()	\
> 	(current->cred->fs_userns ? current->cred->fs_userns \
> 				  : current->cred->userns)
> 
> At this point, the filesystem will now always have the correct
> userns it is supposed to use for mapping the uid/gid, right?

Thanks for this interesting idea! I have some troubles with it though.

This approach (always) seemed conceptually wrong to me. Like Tycho said
somewhere else this basically would act like a global variable which
isn't great.

There's also a substantial difference between in that the current fsuid
and fsgid are an actual property of the callers creds so to have them in
there makes perfect sense. But the user namespace of the vfsmount is a
property of the mount and as such glueing it to the callers creds when
calling into the vfs is just weird and I would very much like to avoid
this. If inode's wouldn't have an i_sb member we wouldn't suddenly start
to pass down the s_user_ns via the callers creds to the filesystems.

I'm also not fond of having to call prepare_creds() and override_creds()
all across the vfs. It's messy and prepare_creds() is especially
problematic during RCU pathwalk where we can't call it. We could
in path_init() at the start of every every lookup operation call
prepare_creds() and then override them when we need to switch the
fs_userns global variable and then put_creds() at the end of every path
walk in terminate_walk(). But this means penalizing every lookup
operations with an additional prepare_creds() which needs to be called
at least once, I think. Then during lookup we would need to
override/change this new global fs_userns variable potentially at each
mountpoint crossing to switch back to the correct fs_userns for idmapped
and non-idmapped mounts. We'd also need to rearrange a bunch of
terminate_walk() calls and we would like end up with a comparable amount
of changes only that they would indeed be more messy since we're
strapping fs_userns to the caller's creds.

I an alternative might be to have a combined approach where you pass the
user namespace around in the vfs but when calling into the filesystem
use the fs_userns global variable approach but I would very much prefer
to avoid this and instead cleanly pass down the user namespace
correctly. That's more work, it'll take longer but I and others are
around to see these changes through.

> 
> Also, if we are passing work off to worker threads, duplicating
> the current creds will capture this information and won't leave
> random landmines where stuff doesn't work as it should because the
> worker thread is unaware of the userns that it is supposed to be
> doing filesytsem operations under...

That seems like a problem that can be handled by simply keeping the
userns around similar to how we need to already keep creds around.

Christian
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers

  reply	other threads:[~2020-10-29 16:20 UTC|newest]

Thread overview: 71+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
2020-10-29  0:32 ` [PATCH 01/34] namespace: take lock_mount_hash() directly when changing flags Christian Brauner
2020-11-01 14:41   ` Christoph Hellwig
2020-11-02 13:33     ` Christian Brauner
2020-10-29  0:32 ` [PATCH 02/34] namespace: only take read lock in do_reconfigure_mnt() Christian Brauner
2020-10-29  0:32 ` [PATCH 03/34] fs: add mount_setattr() Christian Brauner
2020-11-01 14:42   ` Christoph Hellwig
2020-11-02 13:34     ` Christian Brauner
2020-10-29  0:32 ` [PATCH 04/34] tests: add mount_setattr() selftests Christian Brauner
2020-10-29  0:32 ` [PATCH 05/34] fs: introduce MOUNT_ATTR_IDMAP Christian Brauner
2020-11-01 14:45   ` Christoph Hellwig
2020-11-02 13:29     ` Christian Brauner
2020-10-29  0:32 ` [PATCH 06/34] fs: add id translation helpers Christian Brauner
2020-11-01 14:46   ` Christoph Hellwig
2020-11-02 13:25     ` Christian Brauner
2020-10-29  0:32 ` [PATCH 07/34] capability: handle idmapped mounts Christian Brauner
2020-11-01 14:48   ` Christoph Hellwig
2020-11-02 13:23     ` Christian Brauner
2020-10-29  0:32 ` [PATCH 08/34] namei: add idmapped mount aware permission helpers Christian Brauner
2020-10-29  0:32 ` [PATCH 09/34] inode: add idmapped mount aware init and " Christian Brauner
2020-10-29  0:32 ` [PATCH 10/34] attr: handle idmapped mounts Christian Brauner
2020-10-29  0:32 ` [PATCH 11/34] acl: " Christian Brauner
2020-10-29  0:32 ` [PATCH 12/34] xattr: " Christian Brauner
2020-10-29  0:32 ` [PATCH 13/34] selftests: add idmapped mounts xattr selftest Christian Brauner
2020-10-29  0:32 ` [PATCH 14/34] commoncap: handle idmapped mounts Christian Brauner
2020-10-29  0:32 ` [PATCH 15/34] stat: add mapped_generic_fillattr() Christian Brauner
2020-10-29  0:32 ` [PATCH 16/34] namei: handle idmapped mounts in may_*() helpers Christian Brauner
2020-10-29  0:32 ` [PATCH 17/34] namei: introduce struct renamedata Christian Brauner
2020-10-29  0:32 ` [PATCH 18/34] namei: prepare for idmapped mounts Christian Brauner
2020-10-29  0:32 ` [PATCH 19/34] namei: add lookup helpers with idmapped mounts aware permission checking Christian Brauner
2020-10-29  0:32 ` [PATCH 20/34] open: handle idmapped mounts in do_truncate() Christian Brauner
2020-10-29  0:32 ` [PATCH 21/34] open: handle idmapped mounts Christian Brauner
2020-10-29  0:32 ` [PATCH 22/34] af_unix: " Christian Brauner
2020-10-29  0:32 ` [PATCH 23/34] utimes: " Christian Brauner
2020-10-29  0:32 ` [PATCH 24/34] would_dump: " Christian Brauner
2020-10-29  0:32 ` [PATCH 25/34] exec: " Christian Brauner
2020-10-29  0:32 ` [PATCH 26/34] fs: add helpers for idmap mounts Christian Brauner
2020-10-29  0:32 ` [PATCH 27/34] apparmor: handle idmapped mounts Christian Brauner
2020-10-29  0:32 ` [PATCH 28/34] audit: " Christian Brauner
2020-10-29  0:32 ` [PATCH 29/34] ima: " Christian Brauner
2020-10-29  0:32 ` [PATCH 30/34] ext4: support " Christian Brauner
2020-10-29  0:32 ` [PATCH 31/34] expfs: handle " Christian Brauner
2020-10-29  0:32 ` [PATCH 32/34] overlayfs: handle idmapped lower directories Christian Brauner
2020-10-30 11:10   ` Amir Goldstein
2020-10-30 11:52     ` Christian Brauner
2020-10-29  0:32 ` [PATCH 33/34] overlayfs: handle idmapped merged mounts Christian Brauner
2020-10-30  9:57   ` Amir Goldstein
2020-10-29  0:32 ` [PATCH 34/34] fat: handle idmapped mounts Christian Brauner
2020-10-29  2:27 ` [PATCH 00/34] fs: " Dave Chinner
2020-10-29 16:19   ` Christian Brauner [this message]
2020-10-29 21:06     ` Tycho Andersen
2020-10-29  7:20 ` Sargun Dhillon
2020-10-29 15:47 ` Eric W. Biederman
2020-10-29 15:51   ` Aleksa Sarai
2020-10-29 16:37     ` Eric W. Biederman
2020-10-30  2:18       ` Serge E. Hallyn
2020-10-30 15:07       ` Seth Forshee
2020-10-30 16:03         ` Serge E. Hallyn
2020-11-03 14:10       ` Alban Crequy
2020-10-29 16:05   ` Lennart Poettering
2020-10-29 16:36     ` Sargun Dhillon
2020-10-29 16:54     ` Eric W. Biederman
2020-10-29 16:12   ` Tycho Andersen
2020-10-29 16:23     ` Serge E. Hallyn
2020-10-29 16:44     ` Eric W. Biederman
2020-10-29 18:04       ` Stéphane Graber
2020-10-29 21:03       ` Tycho Andersen
2020-10-29 21:58 ` Andy Lutomirski
2020-10-30 12:01   ` Christian Brauner
2020-10-30 16:17     ` Serge E. Hallyn
2020-10-31 17:43     ` Andy Lutomirski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201029161920.zp7p3335x3q2a36e@wittgenstein \
    --to=christian.brauner@ubuntu.com \
    --cc=James.Bottomley@hansenpartnership.com \
    --cc=adilger.kernel@dilger.ca \
    --cc=alban@kinvolk.io \
    --cc=arnd@arndb.de \
    --cc=casey@schaufler-ca.com \
    --cc=containers@lists.linux-foundation.org \
    --cc=corbet@lwn.net \
    --cc=david@fromorbit.com \
    --cc=dhowells@redhat.com \
    --cc=dmitry.kasatkin@gmail.com \
    --cc=ebiederm@xmission.com \
    --cc=geofft@ldpreload.com \
    --cc=hch@infradead.org \
    --cc=hirofumi@mail.parknet.co.jp \
    --cc=jannh@google.com \
    --cc=john.johansen@canonical.com \
    --cc=josh@joshtriplett.org \
    --cc=keescook@chromium.org \
    --cc=lennart@poettering.net \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-audit@redhat.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-integrity@vger.kernel.org \
    --cc=linux-security-module@vger.kernel.org \
    --cc=linux-unionfs@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=miklos@szeredi.hu \
    --cc=mpatel@redhat.com \
    --cc=selinux@vger.kernel.org \
    --cc=seth.forshee@canonical.com \
    --cc=smbarber@chromium.org \
    --cc=stephen.smalley.work@gmail.com \
    --cc=tkjos@google.com \
    --cc=tycho@tycho.ws \
    --cc=tytso@mit.edu \
    --cc=viro@zeniv.linux.org.uk \
    --cc=zohar@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).