Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

From: James Bottomley <James.Bottomley@HansenPartnership.com>
To: Djalal Harouni <tixxdz@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>,
	Chris Mason <clm@fb.com>,
	tytso@mit.edu, Serge Hallyn <serge.hallyn@canonical.com>,
	Josh Triplett <josh@joshtriplett.org>,
	"Eric W. Biederman" <ebiederm@xmission.com>,
	Andy Lutomirski <luto@kernel.org>,
	Seth Forshee <seth.forshee@canonical.com>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org,
	Dongsu Park <dongsu@endocode.com>,
	David Herrmann <dh.herrmann@googlemail.com>,
	Miklos Szeredi <mszeredi@redhat.com>,
	Alban Crequy <alban.crequy@gmail.com>,
	Dave Chinner <david@fromorbit.com>
Subject: Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
Date: Thu, 12 May 2016 15:24:12 -0700	[thread overview]
Message-ID: <1463091852.2380.72.camel@HansenPartnership.com> (raw)
In-Reply-To: <20160512195552.GB2859@dztty>

On Thu, 2016-05-12 at 20:55 +0100, Djalal Harouni wrote:
> On Wed, May 11, 2016 at 11:33:38AM -0700, James Bottomley wrote:
> > On Wed, 2016-05-11 at 17:42 +0100, Djalal Harouni wrote:
> > > On Tue, May 10, 2016 at 04:36:56PM -0700, James Bottomley wrote:
> [...]
> > > Hmm anyway you are mounting this on behalf of filesystems, so if
> > > you 
> > > add the recursive thing, you will just probably make everything 
> > > worse, by making any /proc, /sys dentry that's under that path 
> > > shiftable, and unprivileged users can just create user namespaces
> > > and 
> > > read /proc/* and all the other stuff that doesn't have capable() 
> > > related to the init_user_ns host...
> > 
> > That's up to the admin who does the shifting.  Recursive would be
> > an
> > option if added.
> 
> Hmm, not sure if you get my point... you just made it an admin 
> problem where admins want to mount an image downloaded verify it and 
> use it for their container with /proc...! that's another problem!

You can't allow unprivileged containers to shift uids on arbitrary
filesystems, so the admin always has to do something for the initial
setup.

> > >   what if you have paths like /filesystem0/uidshiftedY/dir,
> > > /filesystem0/uidshiftedX/dir , /filesystem0/notshifted/dir 
> > > where some of them are also bind mounts that point to same dentry
> > > ?
> > 
> > Without recursive semantics, you see the underlying inode.  With 
> > them, you see the upper vfsmnts.  Shiftfs isn't idempotent, so you 
> > would need to be careful about nesting.  However, that's an admin
> > problem.
> > 
> > > Also, you create a totally new user namespace interface here! by 
> > > making your own new interface we just lose the notion of 
> > > init_user_ns and its children and mapping ?
> > 
> > I don't quite understand this; the only use of the init_user_ns is 
> > the capable(CAP_SYS_ADMIN) in fill_super which is how only the real
> > admin can mount at a shifted uid/gid.  Otherwise, there's no need 
> > to see into the userns because filesystems see the kuid_t/kgid_t 
> > which is what I'm shifting.
> > 
> > > I'm not sure of the implication of all this... your user 
> > > namespace mapping is not related at all to init_user_ns! it seems 
> > > that it has its own init_user_ns ?   does a capable() check now 
> > > on a shifted filesystem relates to that and hence to your mapping 
> > > or to the real init_user_ns ?
> > 
> > capable(CAP_SYS_ADMIN) == ns_capable(&init_user_ns, CAP_SYS_ADMIN)
> > 
> > Or is there a misunderstanding here about how user namespaces work
> > inside the kernel?  The design is that the ID shift is done as you
> > cross the kernel boundary, so a filesystem, being usually all in
> > -kernel operating via the VFS interfaces, ideally never needs to 
> > make any from_kuid/make_kuid calls.  However, there are ways 
> > filesystems can send data across the kernel/user bounary outside of 
> > the usual vfs interfaces (ioctls being the most usual one) so in 
> > that specific code, they have to do the kuid_t to uid_t changes 
> > themselves.  Shiftfs never sends data to the user outside of the 
> > VFS so it never needs to do this and can operate entirely on
> > kuid_ts.
> > 
> > > > There's a bit of an open question of whether it should have vfs
> > > > changes: the way the struct file f_inode and f_ops are hijacked 
> > > > is a bit nasty and perhaps d_select_inode() could be made a bit
> > > > cleverer to help us here instead.
> > > 
> > > I'm not sure if this PoC works... but you sure you didn't 
> > > introduce a serious vulnerability here ? you use a new mapping 
> > > and you update current_fsuid() creds up, which is global on any 
> > > fs operation, so may be: lets operate on any inode, update our 
> > > current_fsuid()... and access the rest of *unshifted filesystems*
> > > ... !?
> > 
> > The credentials are per thread, so it's a standard way of doing
> > credential shifting and no other threads of execution in the same 
> > task get access. As long as you bound the override_creds()/revert_c
> > reds() pairs within the kernel, you're safe.
> 
> No, and here sorry I mean shifted.
> 
> current_fsuid() is global through all fs operations which means it
> crosses user namespaces... it was safe the days of only init_user_ns,
> not anymore... You give a mapping inside containers to fsuid where 
> they don't want to have it... this allows to operate on inodes inside
> other containers... update current_fsuid() even if we want that user 
> to be nobody inside the container... and later it can access the 
> inodes of the shifted fs... and by same current of course...

OK, I still don't understand what you're getting at.  There are three
per-thread uids: uid, euid and fsuid (real, effective and filesystem). 
 They're all either settable via syscall or inherited on fork.  They're
all kernel side, meaning they're kuid_t.  Their values stay invariant
as you move through namespaces.  They change (and get mapped according
to the current user namespace setting) when you call set[fe]uid() So
when I enter a user namespace with mapping

0 100000 1000

and call setuid(0) (which sets all three). they all pick up the kuid_t
of 100000.  This means that writing a file inside the user namespace
after calling setuid(0) appears as real uid 100000 on the medium even
though if I call getuid() from the namespace, I get back 0.  What
shiftfs does is hijack temporarily the kernel fsuid/fsgid for
permission checks, so you can remap to any old uid on the medium
(although usually you'd pass in uidmap=0:100000:1000") it maps back
from kuid_t 100000 to kuid_t 0, which is why the container can now read
and write the underlying medium at on-media id 0 even through root
inside the container has kuid_t 100000.  There's no permanent change of
fsuid and it stays at its invariant value for the thread except as a
temporary measure to do the permission checks on the underlying of the
shifted filesystem.

> > > The worst thing is that current_fsuid() does not follow now the
> > > /proc/self/uid_map interface! this is a serious vulnerability and 
> > > a mix of the current semantics... it's updated but using other
> > > rules...?
> > 
> > current_fsuid() is aready mapped via the userns; it's already a 
> > kuid_t at its final value.  Shifting that is what you want to remap
> > underlying volume uid/gid's.  The uidmap/gidmap inputs to this are 
> > shifts on the final underlying uid/gids.
> 
> => some points:
> Changing setfsuid() its interfaces and rules... or an indrect way to
> break another syscall...

There is no change to setfsuid().

> The userns used for *mapping* is totatly different and not standard..
> . losing "init_user_ns and its decendents userns *semantics*...", a
> yet a totatly unlinked mapping...

There is no user namespace mapping at all.  This is a simple shift,
kernel side, of uids and gids at their kuid_t values.

> Breaking current_uid(),current_euid(),current_fsuid() which are
> mapped but in *different* user namespaces... hence different values
> inside namespaces... you can change your userns mapping but that
> current_fsuid specific one will always be remapped to some other 
> value inside even if you don't want it... It crosses user 
> namespaces...  uid and euid are remapped according to /proc/self/uid_
> map, fsuid is remapped according to this new interface...
> 
> Hard coding the mapping, nested containers/apps may *share* fsuid and
> can't get rid of it even if they change the inside userns mapping to
> disable, split, reduce mapped users or offer better isolation they
> can't... no way to make private inodes inside containers if they 
> share the final fsuid, inside container mapping is ignored...
> ...

OK, I think there's a misunderstanding about how credential overrides
work.  They're not permanent changes to the credentials, they're
temporary ones to get stuff done within the kernel at a temporary
privilege.  You can make credentials permanent if you go through
prepare_creds()/commit_creds(), but for making them temporary you do
prepare_creds()/override_creds() and then revert_creds() once you're
done using them.

If you want to see a current use of this, try fs/open.c:faccessat. 
 What it's doing is temporarily overriding fsuid with the real uid to
check the permissions before reverting the credentials and returning to
the user.

James

> > the privileged ids down to 100000, but I have a volume which still 
> > has realids, I can mount that volume using shiftfs with
> > uidmap=0:100000:1000 and it will allow this userns to read and 
> > write the volume through its remapped ids.
> > 
> > > For overlayfs I did write an expriment but for me it's not an 
> > > overlayfs or another new filesystem problem... we are 
> > > manipulating UID/GID identities...
> > > 
> > > It would have been better if you did send this as a separate 
> > > thread. It was a vfs:userns RFC fix which if we continue we turn 
> > > it into a complicated thing! implement another new light 
> > > filesystem with userns... (overlayfs...)
> > > 
> > > Will follow up if the appropriate thread is created, not here, I 
> > > guess it's ok ?
> > 
> > Well, I can resend the patch as a separate thread when I've fixed 
> > some of the problems viro pointed out.
> > 
> > James
> > 
> > > > James
> > > > 
> > > 
> > > Thank you for your feedback!
> > > 
> > > 
> > 
> 
> Thanks!
>