From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-fsdevel-owner@vger.kernel.org>
Received: from mail-wm0-f65.google.com ([74.125.82.65]:33524 "EHLO
	mail-wm0-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751627AbcEQLmV (ORCPT
	<rfc822;linux-fsdevel@vger.kernel.org>);
	Tue, 17 May 2016 07:42:21 -0400
Date: Tue, 17 May 2016 12:42:04 +0100
From: Djalal Harouni <tixxdz@gmail.com>
To: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Chris Mason <clm@fb.com>, tytso@mit.edu,
	Serge Hallyn <serge.hallyn@canonical.com>,
	Josh Triplett <josh@joshtriplett.org>,
	Andy Lutomirski <luto@kernel.org>,
	Seth Forshee <seth.forshee@canonical.com>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org,
	Dongsu Park <dongsu@endocode.com>,
	David Herrmann <dh.herrmann@googlemail.com>,
	Miklos Szeredi <mszeredi@redhat.com>,
	Alban Crequy <alban.crequy@gmail.com>,
	Dave Chinner <david@fromorbit.com>
Subject: Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
Message-ID: <20160517114204.GA7808@dztty.fritz.box>
References: <20160505214957.GA3071@dztty>
 <1462486085.2289.23.camel@HansenPartnership.com>
 <1462923416.14896.10.camel@HansenPartnership.com>
 <20160511164247.GA9908@dztty.fritz.box>
 <1462991618.2356.55.camel@HansenPartnership.com>
 <20160512195552.GB2859@dztty>
 <1463091852.2380.72.camel@HansenPartnership.com>
 <20160514095303.GA3476@dztty>
 <1463233614.2355.20.camel@HansenPartnership.com>
 <87twi0giws.fsf@x220.int.ebiederm.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <87twi0giws.fsf@x220.int.ebiederm.org>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

Hi Eric,

On Sat, May 14, 2016 at 09:21:55PM -0500, Eric W. Biederman wrote:
> James Bottomley <James.Bottomley@HansenPartnership.com> writes:
> 
> > On Sat, 2016-05-14 at 10:53 +0100, Djalal Harouni wrote:
> 
> Just a couple of quick comments from a very high level design point.
> 
> - I think a shiftfs is valuable in the same way that overlayfs is
>   valuable.
> 
>   Esepcially in the Docker case where a lot of containers want a shared
>   base image (for efficiency), but it is desirable to run those
>   containers in different user namespaces for safety.
> 
> - It is also the plan to make it possible to mount a filesystem where
>   the uids and gids of that filesystem on disk do not have a one to one
>   mapping to kernel uids and gids.  99% of the work has already be done,
>   for all filesystem except XFS.
> 
>   That said there are some significant issues to work through, before
>   something like that can be enabled.
> 
>   * Handling of uids/gids on disk that don't map into a kuid/kgid.
>   * Safety from poisoned filesystem images.
> 
>   I have slowly been working with Seth Forshee on these issues as
>   the last thing I want is to introduce more security bugs right now.
>   Seth being a braver man than I am has already merged his changes into
>   the Ubuntu kernel.
> 
>   Right now we are targeting fuse, because fuse is already designed to
>   handle poisoned filesystem images.  So to safely enable this kind of
>   mapping for fuse is not a giant step.

Alright!

>   The big thing from my point of view is to get the VFS interfaces
>   correct so that the VFS handles all of the weird cases that come up
>   with uids and gids that don't map, and any other weird cases.  Keeping
>   the weird bits out of the filesystems.

Indeed, I totally agree here.


> James, Djalal  I regert I have not been able to read through either of
> your patches cloesely yet.  From a high level view I believe there are
> use cases for both approaches, and the use cases do not necessarily
> overlap.
> 
> Djalal I think you are seeing the upsides and not the practical dangers
> of poisoned filesystem images.

Thanks for your reply Eric, I will let you sleep on the approach. Yes
it's totatly different thing, I think you can consider it as a first
step to use filesystems inside user namespace safely. Real root is still
the only one who mounts and sets the mount namespace shift flag that can
be inherited by unprivlieged userns users.. So real root is *still* in
control of things. The solution is flexible. At the same time you have
the fuse patches for ones that want to use it for unprivileged mounts, and
later and it depends on the future and the state of art or how things
are and improve...

The real problem seems poisoned filesystem images, ok I agree. However
this series considers at the moment only real root is the one who has to
mount filesystems that will be used for user namespaces.

So nothing real changes, just consider it like this:
1) root of init_user_ns mounts filesystems with mount shift flags and
create shift mount namespace.
2) then give access for inodes that have inode->{uid/gid} that match
the inside mapping of the calling process. This is like real root doing
recursive chown of files to give rwx permission but without hitting the
real disk. Every thing is virtual.

So nothing really changes for poisoned filesystems since unprivileged
users can't mount them, only real is able to do so, and he can verify
the image before doing so...

Now, the problem that I can see is if there is some special inodes
related to these filesystems and host resources that are marked 0400
only for real root, in this case we have to add the needed capability
check, capable in init_user_ns. For ioctl I guess they are already safe
since they should have the appropriate capable check, but I will check
it of course.

Now, as Seth has been working with fuse mounts, and I guess they will be
merged, I will of course check with him so everything is synced and that
this approach will continue to work after his patches are merged.


> James I think you are missing the fact that all filesystems already have
> the make_kuid and make_kgid calls right where the data comes off disk,
> and the from_kuid and from_kgid calls right where the on-disk data is
> being created just before it goes on disk.  Which means that the actual
> impact on filesystems of the translation is trivial.
> 
> Where the actual impact of filesystems is much higher is the
> infrastructure needed to ensure poisoned filesystem images do not cause
> a kernel compromise.  That extends to the filesystem testing and code
> review process beyond and is more than just a kernel problem.  Hardening
> that attack surface of the disk side of filesystems is difficult
> especially when not impacting filesystem performance.
> 
> 
> So I don't think it makes sense to frame this as an either/or situation.
> I think there is a need for both solutions.
> 
> Djalal if you could work with Seth I think that would be very useful.  I
> know I am dragging my heels there but I really hope I can dig in and get
> everything reviewed and merged soonish.

Alright!


> James if you could see shiftfs with a different set of merits than what
> to Djalal is doing I think that would be useful.  As it would allow
> everyone to concentrate on getting the bugs out of their solutions.
> 
> 
> 
> That said I am not certain shiftfs makes sense without Seth's patches to
> handle the weird cases at the VFS level.    What do you do with uids and
> gids that don't map?  You can reinvent how to handle the strange cases
> in shfitfs or we can work on solving this problem at the VFS level so
> people don't have to go through the error prone work of reinventing
> solutions.
> 
> 
> The big ugly nasty in all of this is that we are fundamentally dealing
> with uids and gids which are security identifiers.  Practically any bug
> is exploitable and CVE worthy.  So it make sense to tread very
> carefully.  Even with care it can takes months if not years to get
> the number of bugs down to a level where you are not the favorite target
> of people looking for exploitable kernel bugs.

I totally share this concern, that's why this RFC was designed like this,
when you have time please check it, thanks!

Here just for the record, I had a series that works with overlayfs that
updated current_fsuid() to match inodes to give access, and later drop
it for another better solution, but in the end I'm pretty sure that
this should be handled inside VFS, and do not mess with creds or
current_fsuid since they are global values, they cross user namespaces.

> Eric

-- 
Djalal Harouni
http://opendz.org