From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
    id S1754530AbcEPTNY (ORCPT ); Mon, 16 May 2016 15:13:24 -0400
Received: from bedivere.hansenpartnership.com ([66.63.167.143]:46970
    "EHLO bedivere.hansenpartnership.com" rhost-flags-OK-OK-OK-OK)
    by vger.kernel.org with ESMTP id S1753874AbcEPTNW (ORCPT );
    Mon, 16 May 2016 15:13:22 -0400
Message-ID: <1463425996.4101.14.camel@HansenPartnership.com>
Subject: Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
From: James Bottomley
To: "Eric W. Biederman"
Cc: Djalal Harouni, Alexander Viro, Chris Mason, tytso@mit.edu,
    Serge Hallyn, Josh Triplett, Andy Lutomirski, Seth Forshee,
    linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-security-module@vger.kernel.org, Dongsu Park, David Herrmann,
    Miklos Szeredi, Alban Crequy, Dave Chinner
Date: Mon, 16 May 2016 15:13:16 -0400
In-Reply-To: <87twi0giws.fsf@x220.int.ebiederm.org>
References: <1462395979.14310.133.camel@HansenPartnership.com>
    <20160505073636.GA3357@dztty>
    <1462449388.2419.27.camel@HansenPartnership.com>
    <20160505214957.GA3071@dztty>
    <1462486085.2289.23.camel@HansenPartnership.com>
    <1462923416.14896.10.camel@HansenPartnership.com>
    <20160511164247.GA9908@dztty.fritz.box>
    <1462991618.2356.55.camel@HansenPartnership.com>
    <20160512195552.GB2859@dztty>
    <1463091852.2380.72.camel@HansenPartnership.com>
    <20160514095303.GA3476@dztty>
    <1463233614.2355.20.camel@HansenPartnership.com>
    <87twi0giws.fsf@x220.int.ebiederm.org>
Content-Type: text/plain; charset="UTF-8"
X-Mailer: Evolution 3.16.5
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Sat, 2016-05-14 at 21:21 -0500, Eric W. Biederman wrote:
> James Bottomley writes:
>
> > On Sat, 2016-05-14 at 10:53 +0100, Djalal Harouni wrote:
>
> Just a couple of quick comments from a very high level design point.
>
> - I think a shiftfs is valuable in the same way that overlayfs is
>   valuable.
>
>   Especially in the Docker case where a lot of containers want a shared
>   base image (for efficiency), but it is desirable to run those
>   containers in different user namespaces for safety.
>
> - It is also the plan to make it possible to mount a filesystem where
>   the uids and gids of that filesystem on disk do not have a one to one
>   mapping to kernel uids and gids. 99% of the work has already been
>   done, for all filesystems except XFS.

Can you elaborate a bit more on why we want to do this? I think only
having a single shift of uid_t to kuid_t across the kernel-to-user
boundary is a nice feature of user namespaces. Architecturally, it's
not such a big thing to do it as the data goes on to the disk as well,
but what's the use case for it?

> That said there are some significant issues to work through, before
> something like that can be enabled.
>
> * Handling of uids/gids on disk that don't map into a kuid/kgid.

So I think this is nicely handled by the capability checks in
generic_permission() (capable_wrt_inode_uidgid()); is there a need to
make it more complex (and thus more error prone)?

> * Safety from poisoned filesystem images.

By poisoned FS image, you mean an image over whose internal data the
user has control? That is, the basic problem of how we give users write
access to data devices that they can then cause to be mounted as
filesystems?

> I have slowly been working with Seth Forshee on these issues as
> the last thing I want is to introduce more security bugs right now.
> Seth being a braver man than I am has already merged his changes into
> the Ubuntu kernel.
>
> Right now we are targeting fuse, because fuse is already designed to
> handle poisoned filesystem images. So to safely enable this kind of
> mapping for fuse is not a giant step.
>
> The big thing from my point of view is to get the VFS interfaces
> correct so that the VFS handles all of the weird cases that come up
> with uids and gids that don't map, and any other weird cases. Keeping
> the weird bits out of the filesystems.

If by VFS interfaces, you mean where we've already got the mapping
confined, absolutely.

> James I think you are missing the fact that all filesystems already
> have the make_kuid and make_kgid calls right where the data comes off
> disk,

I beg to differ: they certainly don't. The underlying filesystem
populates the inode in ->lookup with the data off the disk, which goes
into the inode as a kuid_t/kgid_t. It remains in the inode in that form
forever. We convert it as it goes out of the kernel in the stat calls
(actually stat.c:cp_old/new_stat()).

> and the from_kuid and from_kgid calls right where the on-disk data
> is being created just before it goes on disk. Which means that the
> actual impact on filesystems of the translation is trivial.

Are you looking at a different tree from me? I'm actually just looking
at Linus' git head.

James
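
For readers following the make_kuid()/from_kuid() argument above, here
is a minimal userspace sketch of the two directions of id translation
being discussed. It is not kernel code and takes no side on where in
the tree the conversions happen: the model_* names, the struct layout
and the invalid-id sentinel are all hypothetical stand-ins, assuming
only that a user namespace maps contiguous id ranges the way a uid_map
line (inside id, outside id, count) does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel for "no mapping exists" (illustration only). */
#define MODEL_INVALID_ID ((uint32_t)-1)

/*
 * One mapping range, modelled on a uid_map line: ids
 * [first, first + count) as seen inside the namespace correspond to
 * [lower_first, lower_first + count) as stored internally.
 */
struct model_extent {
    uint32_t first;        /* first id inside the namespace */
    uint32_t lower_first;  /* first internal ("kernel") id */
    uint32_t count;        /* length of the range */
};

/* Hypothetical stand-in for a user namespace: just a list of ranges. */
struct model_userns {
    const struct model_extent *extents;
    size_t nr_extents;
};

/* Namespace-visible id -> internal id (the make_kuid() direction). */
uint32_t model_make_kuid(const struct model_userns *ns, uint32_t uid)
{
    assert(ns != NULL);
    for (size_t i = 0; i < ns->nr_extents; i++) {
        const struct model_extent *e = &ns->extents[i];

        if (uid >= e->first && uid - e->first < e->count)
            return e->lower_first + (uid - e->first);
    }
    return MODEL_INVALID_ID;  /* id has no mapping in this namespace */
}

/* Internal id -> namespace-visible id (the from_kuid() direction). */
uint32_t model_from_kuid(const struct model_userns *ns, uint32_t kuid)
{
    assert(ns != NULL);
    for (size_t i = 0; i < ns->nr_extents; i++) {
        const struct model_extent *e = &ns->extents[i];

        if (kuid >= e->lower_first && kuid - e->lower_first < e->count)
            return e->first + (kuid - e->lower_first);
    }
    return MODEL_INVALID_ID;  /* internal id is not visible here */
}
```

With a single range { 0, 100000, 65536 }, id 1000 inside the namespace
translates to internal id 101000 and back again, while internal id 0
has no representation inside the namespace at all. That last case is
the "ids that don't map" problem raised in the quoted text.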