From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-fsdevel-owner@vger.kernel.org>
Received: from out01.mta.xmission.com ([166.70.13.231]:45955 "EHLO
	out01.mta.xmission.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751906AbcEQWv0 (ORCPT
	<rfc822;linux-fsdevel@vger.kernel.org>);
	Tue, 17 May 2016 18:51:26 -0400
From: ebiederm@xmission.com (Eric W. Biederman)
To: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Djalal Harouni <tixxdz@gmail.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Chris Mason <clm@fb.com>, tytso@mit.edu,
	Serge Hallyn <serge.hallyn@canonical.com>,
	Josh Triplett <josh@joshtriplett.org>,
	Andy Lutomirski <luto@kernel.org>,
	Seth Forshee <seth.forshee@canonical.com>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org,
	Dongsu Park <dongsu@endocode.com>,
	David Herrmann <dh.herrmann@googlemail.com>,
	Miklos Szeredi <mszeredi@redhat.com>,
	Alban Crequy <alban.crequy@gmail.com>,
	Dave Chinner <david@fromorbit.com>
In-Reply-To: <1463425996.4101.14.camel@HansenPartnership.com> (James
	Bottomley's message of "Mon, 16 May 2016 15:13:16 -0400")
References: <1462395979.14310.133.camel@HansenPartnership.com>
	<20160505073636.GA3357@dztty>
	<1462449388.2419.27.camel@HansenPartnership.com>
	<20160505214957.GA3071@dztty>
	<1462486085.2289.23.camel@HansenPartnership.com>
	<1462923416.14896.10.camel@HansenPartnership.com>
	<20160511164247.GA9908@dztty.fritz.box>
	<1462991618.2356.55.camel@HansenPartnership.com>
	<20160512195552.GB2859@dztty>
	<1463091852.2380.72.camel@HansenPartnership.com>
	<20160514095303.GA3476@dztty>
	<1463233614.2355.20.camel@HansenPartnership.com>
	<87twi0giws.fsf@x220.int.ebiederm.org>
	<1463425996.4101.14.camel@HansenPartnership.com>
Date: Tue, 17 May 2016 17:40:12 -0500
Message-ID: <87lh38xq9f.fsf@x220.int.ebiederm.org>
MIME-Version: 1.0
Content-Type: text/plain
Subject: Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

James Bottomley <James.Bottomley@HansenPartnership.com> writes:

> On Sat, 2016-05-14 at 21:21 -0500, Eric W. Biederman wrote:
>> James Bottomley <James.Bottomley@HansenPartnership.com> writes:
>> 
>> > On Sat, 2016-05-14 at 10:53 +0100, Djalal Harouni wrote:
>> 
>> Just a couple of quick comments from a very high level design point.
>> 
>> - I think a shiftfs is valuable in the same way that overlayfs is
>>   valuable.
>> 
>>   Esepcially in the Docker case where a lot of containers want a shared
>>   base image (for efficiency), but it is desirable to run those
>>   containers in different user namespaces for safety.
>> 
>> - It is also the plan to make it possible to mount a filesystem where
>>   the uids and gids of that filesystem on disk do not have a one to one
>>   mapping to kernel uids and gids.  99% of the work has already be done,
>>   for all filesystem except XFS.
>
> Can you elaborate a bit more on why we want to do this?  I think only
> having a single shift of uid_t to kuid_t across the kernel to user
> boundary is a nice feature of user namespaces.  Architecturally, it's
> not such a big thing to do it as the data goes on to the disk as well,
> but what's the use case for it?

fuse/nfs or just plain sanity.  As the data comes off disk we convert it
into the kernel internal form kuid_t and kgid_t.   For shiftfs this
would be converting the uids when they come from your underlying
filesystem to the upper level vfs abstractions.

Converting to the kernel form for a filesystem such as fuse that is does
all that is necessary to keep evil users from breaking the kernel means
that we call allow users in a user namespace to mount fuse themselves.
Supply whatever uids and gids they want in the fuse messages.  If the
uids/gids don't map from the mounting users user namespace into the
kernel then we set inode->i_uid to INVALID_UID.

That is all we ask of a filesystem, and we are sorting out the rest in
the VFS as nothing sets INVALID_UID in inode->i_uid today.


>>   That said there are some significant issues to work through, before
>>   something like that can be enabled.
>> 
>>   * Handling of uids/gids on disk that don't map into a kuid/kgid.
>
> So I think this is nicely handled in the capability checks in
> generic_permission() (capable_wrt_inode_uidgid()) is there a need to
> make it more complex (and thus more error prone)?

No just a need to handle INVALID_UID, and INVALID_GID which we don't
handle today.

>>   * Safety from poisoned filesystem images.
>
> By poisoned FS image, you mean an image over whose internal data the
> user has control?  The basic problem of how do we give users write
> access to data devices they can then cause to be mounted as
> filesystems?

Yes.  For fuse except for uids and gids this is already solved for most
other filesystems it is a whole new world of horror.

The general case of evil usb devices (think android) that look like
block devices but can return whatever they want already exists in the
wild.

>>   I have slowly been working with Seth Forshee on these issues as
>>   the last thing I want is to introduce more security bugs right now.
>>   Seth being a braver man than I am has already merged his changes into
>>   the Ubuntu kernel.
>> 
>>   Right now we are targeting fuse, because fuse is already designed to
>>   handle poisoned filesystem images.  So to safely enable this kind of
>>   mapping for fuse is not a giant step.
>> 
>>   The big thing from my point of view is to get the VFS interfaces
>>   correct so that the VFS handles all of the weird cases that come up
>>   with uids and gids that don't map, and any other weird cases.  Keeping
>>   the weird bits out of the filesystems.
>
> If by VFS interfaces, you mean where we've already got the mapping 
> confined, absolutely.

Yes.  It is just making certain we handle INVALID_UID and INVALID_GID
that results from a mapping failure.  As we don't handle that in 4.6.0.

>> James I think you are missing the fact that all filesystems already 
>> have the make_kuid and make_kgid calls right where the data comes off
>> disk,
>
> I beg to differ: they certainly don't.  The underlying filesystem
> populates the inode in ->lookup with the data off the disk which goes
> into the inode as a kuid_t/kgid_t  It remains forever in the inode as
> that.  We convert it as it goes out of the kernel in the stat calls
> (actually stat.c:cp_old/new_stat())

They do.  i_uid_write calls make_kuid to map the in comming uid from
disk into a kuid_t.  That is all I was referring to.

The only thing I am looking at infrastructure wise it to make it so that
we cleanly handle when the first parameter to make_kuid is not
&init_user_ns.  That is the core point of Seths work.

>>  and the from_kuid and from_kgid calls right where the on-disk data
>> is being created just before it goes on disk.  Which means that the
>> actual impact on filesystems of the translation is trivial.
>
> Are you looking at a different tree from me?  I'm actually just looking
> at Linus git head.

Take a look at i_uid_read and i_gid_read.  They are inline functions in
fs.h

Eric