From: James Bottomley <James.Bottomley@HansenPartnership.com>
To: Djalal Harouni <tixxdz@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>,
Chris Mason <clm@fb.com>,
tytso@mit.edu, Serge Hallyn <serge.hallyn@canonical.com>,
Josh Triplett <josh@joshtriplett.org>,
"Eric W. Biederman" <ebiederm@xmission.com>,
Andy Lutomirski <luto@kernel.org>,
Seth Forshee <seth.forshee@canonical.com>,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-security-module@vger.kernel.org,
Dongsu Park <dongsu@endocode.com>,
David Herrmann <dh.herrmann@googlemail.com>,
Miklos Szeredi <mszeredi@redhat.com>,
Alban Crequy <alban.crequy@gmail.com>,
Dave Chinner <david@fromorbit.com>
Subject: Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
Date: Thu, 12 May 2016 15:24:12 -0700 [thread overview]
Message-ID: <1463091852.2380.72.camel@HansenPartnership.com> (raw)
In-Reply-To: <20160512195552.GB2859@dztty>
On Thu, 2016-05-12 at 20:55 +0100, Djalal Harouni wrote:
> On Wed, May 11, 2016 at 11:33:38AM -0700, James Bottomley wrote:
> > On Wed, 2016-05-11 at 17:42 +0100, Djalal Harouni wrote:
> > > On Tue, May 10, 2016 at 04:36:56PM -0700, James Bottomley wrote:
> [...]
> > > Hmm anyway you are mounting this on behalf of filesystems, so if
> > > you
> > > add the recursive thing, you will just probably make everything
> > > worse, by making any /proc, /sys dentry that's under that path
> > > shiftable, and unprivileged users can just create user namespaces
> > > and
> > > read /proc/* and all the other stuff that doesn't have capable()
> > > related to the init_user_ns host...
> >
> > That's up to the admin who does the shifting. Recursive would be
> > an
> > option if added.
>
> Hmm, not sure if you get my point... you just made it an admin
> problem where admins want to mount an image downloaded verify it and
> use it for their container with /proc...! that's another problem!
You can't allow unprivileged containers to shift uids on arbitrary
filesystems, so the admin always has to do something for the initial
setup.
> > > what if you have paths like /filesystem0/uidshiftedY/dir,
> > > /filesystem0/uidshiftedX/dir , /filesystem0/notshifted/dir
> > > where some of them are also bind mounts that point to same dentry
> > > ?
> >
> > Without recursive semantics, you see the underlying inode. With
> > them, you see the upper vfsmnts. Shiftfs isn't idempotent, so you
> > would need to be careful about nesting. However, that's an admin
> > problem.
> >
> > > Also, you create a totally new user namespace interface here! by
> > > making your own new interface we just lose the notion of
> > > init_user_ns and its children and mapping ?
> >
> > I don't quite understand this; the only use of the init_user_ns is
> > the capable(CAP_SYS_ADMIN) in fill_super which is how only the real
> > admin can mount at a shifted uid/gid. Otherwise, there's no need
> > to see into the userns because filesystems see the kuid_t/kgid_t
> > which is what I'm shifting.
> >
> > > I'm not sure of the implication of all this... your user
> > > namespace mapping is not related at all to init_user_ns! it seems
> > > that it has its own init_user_ns ? does a capable() check now
> > > on a shifted filesystem relates to that and hence to your mapping
> > > or to the real init_user_ns ?
> >
> > capable(CAP_SYS_ADMIN) == ns_capable(&init_user_ns, CAP_SYS_ADMIN)
> >
> > Or is there a misunderstanding here about how user namespaces work
> > inside the kernel? The design is that the ID shift is done as you
> > cross the kernel boundary, so a filesystem, being usually all in
> > -kernel operating via the VFS interfaces, ideally never needs to
> > make any from_kuid/make_kuid calls. However, there are ways
> > filesystems can send data across the kernel/user bounary outside of
> > the usual vfs interfaces (ioctls being the most usual one) so in
> > that specific code, they have to do the kuid_t to uid_t changes
> > themselves. Shiftfs never sends data to the user outside of the
> > VFS so it never needs to do this and can operate entirely on
> > kuid_ts.
> >
> > > > There's a bit of an open question of whether it should have vfs
> > > > changes: the way the struct file f_inode and f_ops are hijacked
> > > > is a bit nasty and perhaps d_select_inode() could be made a bit
> > > > cleverer to help us here instead.
> > >
> > > I'm not sure if this PoC works... but you sure you didn't
> > > introduce a serious vulnerability here ? you use a new mapping
> > > and you update current_fsuid() creds up, which is global on any
> > > fs operation, so may be: lets operate on any inode, update our
> > > current_fsuid()... and access the rest of *unshifted filesystems*
> > > ... !?
> >
> > The credentials are per thread, so it's a standard way of doing
> > credential shifting and no other threads of execution in the same
> > task get access. As long as you bound the override_creds()/revert_c
> > reds() pairs within the kernel, you're safe.
>
> No, and here sorry I mean shifted.
>
> current_fsuid() is global through all fs operations which means it
> crosses user namespaces... it was safe the days of only init_user_ns,
> not anymore... You give a mapping inside containers to fsuid where
> they don't want to have it... this allows to operate on inodes inside
> other containers... update current_fsuid() even if we want that user
> to be nobody inside the container... and later it can access the
> inodes of the shifted fs... and by same current of course...
OK, I still don't understand what you're getting at. There are three
per-thread uids: uid, euid and fsuid (real, effective and filesystem).
They're all either settable via syscall or inherited on fork. They're
all kernel side, meaning they're kuid_t. Their values stay invariant
as you move through namespaces. They change (and get mapped according
to the current user namespace setting) when you call set[fe]uid() So
when I enter a user namespace with mapping
0 100000 1000
and call setuid(0) (which sets all three). they all pick up the kuid_t
of 100000. This means that writing a file inside the user namespace
after calling setuid(0) appears as real uid 100000 on the medium even
though if I call getuid() from the namespace, I get back 0. What
shiftfs does is hijack temporarily the kernel fsuid/fsgid for
permission checks, so you can remap to any old uid on the medium
(although usually you'd pass in uidmap=0:100000:1000") it maps back
from kuid_t 100000 to kuid_t 0, which is why the container can now read
and write the underlying medium at on-media id 0 even through root
inside the container has kuid_t 100000. There's no permanent change of
fsuid and it stays at its invariant value for the thread except as a
temporary measure to do the permission checks on the underlying of the
shifted filesystem.
> > > The worst thing is that current_fsuid() does not follow now the
> > > /proc/self/uid_map interface! this is a serious vulnerability and
> > > a mix of the current semantics... it's updated but using other
> > > rules...?
> >
> > current_fsuid() is aready mapped via the userns; it's already a
> > kuid_t at its final value. Shifting that is what you want to remap
> > underlying volume uid/gid's. The uidmap/gidmap inputs to this are
> > shifts on the final underlying uid/gids.
>
> => some points:
> Changing setfsuid() its interfaces and rules... or an indrect way to
> break another syscall...
There is no change to setfsuid().
> The userns used for *mapping* is totatly different and not standard..
> . losing "init_user_ns and its decendents userns *semantics*...", a
> yet a totatly unlinked mapping...
There is no user namespace mapping at all. This is a simple shift,
kernel side, of uids and gids at their kuid_t values.
> Breaking current_uid(),current_euid(),current_fsuid() which are
> mapped but in *different* user namespaces... hence different values
> inside namespaces... you can change your userns mapping but that
> current_fsuid specific one will always be remapped to some other
> value inside even if you don't want it... It crosses user
> namespaces... uid and euid are remapped according to /proc/self/uid_
> map, fsuid is remapped according to this new interface...
>
> Hard coding the mapping, nested containers/apps may *share* fsuid and
> can't get rid of it even if they change the inside userns mapping to
> disable, split, reduce mapped users or offer better isolation they
> can't... no way to make private inodes inside containers if they
> share the final fsuid, inside container mapping is ignored...
> ...
OK, I think there's a misunderstanding about how credential overrides
work. They're not permanent changes to the credentials, they're
temporary ones to get stuff done within the kernel at a temporary
privilege. You can make credentials permanent if you go through
prepare_creds()/commit_creds(), but for making them temporary you do
prepare_creds()/override_creds() and then revert_creds() once you're
done using them.
If you want to see a current use of this, try fs/open.c:faccessat.
What it's doing is temporarily overriding fsuid with the real uid to
check the permissions before reverting the credentials and returning to
the user.
James
> > the privileged ids down to 100000, but I have a volume which still
> > has realids, I can mount that volume using shiftfs with
> > uidmap=0:100000:1000 and it will allow this userns to read and
> > write the volume through its remapped ids.
> >
> > > For overlayfs I did write an expriment but for me it's not an
> > > overlayfs or another new filesystem problem... we are
> > > manipulating UID/GID identities...
> > >
> > > It would have been better if you did send this as a separate
> > > thread. It was a vfs:userns RFC fix which if we continue we turn
> > > it into a complicated thing! implement another new light
> > > filesystem with userns... (overlayfs...)
> > >
> > > Will follow up if the appropriate thread is created, not here, I
> > > guess it's ok ?
> >
> > Well, I can resend the patch as a separate thread when I've fixed
> > some of the problems viro pointed out.
> >
> > James
> >
> > > > James
> > > >
> > >
> > > Thank you for your feedback!
> > >
> > >
> >
>
> Thanks!
>
next prev parent reply other threads:[~2016-05-12 22:24 UTC|newest]
Thread overview: 49+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-05-04 14:26 [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems Djalal Harouni
2016-05-04 14:26 ` [RFC v2 PATCH 1/8] VFS: add CLONE_MNTNS_SHIFT_UIDGID flag to allow mounts to shift their UIDs/GIDs Djalal Harouni
2016-05-04 14:26 ` [RFC v2 PATCH 2/8] VFS:uidshift: add flags and helpers to shift UIDs and GIDs to virtual view Djalal Harouni
2016-05-04 14:26 ` [RFC v2 PATCH 3/8] fs: Treat foreign mounts as nosuid Djalal Harouni
2016-05-04 23:19 ` Serge Hallyn
2016-05-05 13:05 ` Seth Forshee
2016-05-05 22:40 ` Djalal Harouni
2016-05-04 14:26 ` [RFC v2 PATCH 4/8] VFS:userns: shift UID/GID to virtual view during permission access Djalal Harouni
2016-05-04 14:26 ` [RFC v2 PATCH 5/8] VFS:userns: add helpers to shift UIDs and GIDs into on-disk view Djalal Harouni
2016-05-04 14:26 ` [RFC v2 PATCH 6/8] VFS:userns: shift UID/GID to on-disk view before any write to disk Djalal Harouni
2016-05-04 14:26 ` [RFC v2 PATCH 7/8] ext4: add support for vfs_shift_uids and vfs_shift_gids mount options Djalal Harouni
2016-05-04 14:26 ` [RFC v2 PATCH 8/8] btrfs: " Djalal Harouni
2016-05-04 16:34 ` [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems Josh Triplett
2016-05-04 21:06 ` James Bottomley
2016-05-05 7:36 ` Djalal Harouni
2016-05-05 11:56 ` James Bottomley
2016-05-05 21:49 ` Djalal Harouni
2016-05-05 22:08 ` James Bottomley
2016-05-10 23:36 ` James Bottomley
2016-05-11 0:38 ` Al Viro
2016-05-11 0:53 ` Al Viro
2016-05-11 3:47 ` James Bottomley
2016-05-11 16:42 ` Djalal Harouni
2016-05-11 18:33 ` James Bottomley
2016-05-12 19:55 ` Djalal Harouni
2016-05-12 22:24 ` James Bottomley [this message]
2016-05-14 9:53 ` Djalal Harouni
2016-05-14 13:46 ` James Bottomley
2016-05-15 2:21 ` Eric W. Biederman
2016-05-15 15:04 ` James Bottomley
2016-05-16 14:12 ` Seth Forshee
2016-05-16 16:42 ` Eric W. Biederman
2016-05-16 18:25 ` Seth Forshee
2016-05-16 19:13 ` James Bottomley
2016-05-17 22:40 ` Eric W. Biederman
2016-05-17 11:42 ` Djalal Harouni
2016-05-17 15:42 ` Djalal Harouni
2016-05-04 23:30 ` Serge Hallyn
2016-05-06 14:38 ` Djalal Harouni
2016-05-09 16:26 ` Serge Hallyn
2016-05-10 10:33 ` Djalal Harouni
2016-05-05 0:23 ` Dave Chinner
2016-05-05 1:44 ` Andy Lutomirski
2016-05-05 2:25 ` Dave Chinner
2016-05-05 3:29 ` Andy Lutomirski
2016-05-05 22:34 ` Djalal Harouni
2016-05-05 22:24 ` Djalal Harouni
2016-05-06 2:50 ` Dave Chinner
2016-05-12 19:47 ` Djalal Harouni
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1463091852.2380.72.camel@HansenPartnership.com \
--to=james.bottomley@hansenpartnership.com \
--cc=alban.crequy@gmail.com \
--cc=clm@fb.com \
--cc=david@fromorbit.com \
--cc=dh.herrmann@googlemail.com \
--cc=dongsu@endocode.com \
--cc=ebiederm@xmission.com \
--cc=josh@joshtriplett.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-security-module@vger.kernel.org \
--cc=luto@kernel.org \
--cc=mszeredi@redhat.com \
--cc=serge.hallyn@canonical.com \
--cc=seth.forshee@canonical.com \
--cc=tixxdz@gmail.com \
--cc=tytso@mit.edu \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).