From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f65.google.com ([74.125.82.65]:33524 "EHLO mail-wm0-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751627AbcEQLmV (ORCPT ); Tue, 17 May 2016 07:42:21 -0400 Date: Tue, 17 May 2016 12:42:04 +0100 From: Djalal Harouni To: "Eric W. Biederman" Cc: James Bottomley , Alexander Viro , Chris Mason , tytso@mit.edu, Serge Hallyn , Josh Triplett , Andy Lutomirski , Seth Forshee , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-security-module@vger.kernel.org, Dongsu Park , David Herrmann , Miklos Szeredi , Alban Crequy , Dave Chinner Subject: Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems Message-ID: <20160517114204.GA7808@dztty.fritz.box> References: <20160505214957.GA3071@dztty> <1462486085.2289.23.camel@HansenPartnership.com> <1462923416.14896.10.camel@HansenPartnership.com> <20160511164247.GA9908@dztty.fritz.box> <1462991618.2356.55.camel@HansenPartnership.com> <20160512195552.GB2859@dztty> <1463091852.2380.72.camel@HansenPartnership.com> <20160514095303.GA3476@dztty> <1463233614.2355.20.camel@HansenPartnership.com> <87twi0giws.fsf@x220.int.ebiederm.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <87twi0giws.fsf@x220.int.ebiederm.org> Sender: linux-fsdevel-owner@vger.kernel.org List-ID: Hi Eric, On Sat, May 14, 2016 at 09:21:55PM -0500, Eric W. Biederman wrote: > James Bottomley writes: > > > On Sat, 2016-05-14 at 10:53 +0100, Djalal Harouni wrote: > > Just a couple of quick comments from a very high level design point. > > - I think a shiftfs is valuable in the same way that overlayfs is > valuable. > > Esepcially in the Docker case where a lot of containers want a shared > base image (for efficiency), but it is desirable to run those > containers in different user namespaces for safety. > > - It is also the plan to make it possible to mount a filesystem where > the uids and gids of that filesystem on disk do not have a one to one > mapping to kernel uids and gids. 99% of the work has already be done, > for all filesystem except XFS. > > That said there are some significant issues to work through, before > something like that can be enabled. > > * Handling of uids/gids on disk that don't map into a kuid/kgid. > * Safety from poisoned filesystem images. > > I have slowly been working with Seth Forshee on these issues as > the last thing I want is to introduce more security bugs right now. > Seth being a braver man than I am has already merged his changes into > the Ubuntu kernel. > > Right now we are targeting fuse, because fuse is already designed to > handle poisoned filesystem images. So to safely enable this kind of > mapping for fuse is not a giant step. Alright! > The big thing from my point of view is to get the VFS interfaces > correct so that the VFS handles all of the weird cases that come up > with uids and gids that don't map, and any other weird cases. Keeping > the weird bits out of the filesystems. Indeed, I totally agree here. > James, Djalal I regert I have not been able to read through either of > your patches cloesely yet. From a high level view I believe there are > use cases for both approaches, and the use cases do not necessarily > overlap. > > Djalal I think you are seeing the upsides and not the practical dangers > of poisoned filesystem images. Thanks for your reply Eric, I will let you sleep on the approach. Yes it's totatly different thing, I think you can consider it as a first step to use filesystems inside user namespace safely. Real root is still the only one who mounts and sets the mount namespace shift flag that can be inherited by unprivlieged userns users.. So real root is *still* in control of things. The solution is flexible. At the same time you have the fuse patches for ones that want to use it for unprivileged mounts, and later and it depends on the future and the state of art or how things are and improve... The real problem seems poisoned filesystem images, ok I agree. However this series considers at the moment only real root is the one who has to mount filesystems that will be used for user namespaces. So nothing real changes, just consider it like this: 1) root of init_user_ns mounts filesystems with mount shift flags and create shift mount namespace. 2) then give access for inodes that have inode->{uid/gid} that match the inside mapping of the calling process. This is like real root doing recursive chown of files to give rwx permission but without hitting the real disk. Every thing is virtual. So nothing really changes for poisoned filesystems since unprivileged users can't mount them, only real is able to do so, and he can verify the image before doing so... Now, the problem that I can see is if there is some special inodes related to these filesystems and host resources that are marked 0400 only for real root, in this case we have to add the needed capability check, capable in init_user_ns. For ioctl I guess they are already safe since they should have the appropriate capable check, but I will check it of course. Now, as Seth has been working with fuse mounts, and I guess they will be merged, I will of course check with him so everything is synced and that this approach will continue to work after his patches are merged. > James I think you are missing the fact that all filesystems already have > the make_kuid and make_kgid calls right where the data comes off disk, > and the from_kuid and from_kgid calls right where the on-disk data is > being created just before it goes on disk. Which means that the actual > impact on filesystems of the translation is trivial. > > Where the actual impact of filesystems is much higher is the > infrastructure needed to ensure poisoned filesystem images do not cause > a kernel compromise. That extends to the filesystem testing and code > review process beyond and is more than just a kernel problem. Hardening > that attack surface of the disk side of filesystems is difficult > especially when not impacting filesystem performance. > > > So I don't think it makes sense to frame this as an either/or situation. > I think there is a need for both solutions. > > Djalal if you could work with Seth I think that would be very useful. I > know I am dragging my heels there but I really hope I can dig in and get > everything reviewed and merged soonish. Alright! > James if you could see shiftfs with a different set of merits than what > to Djalal is doing I think that would be useful. As it would allow > everyone to concentrate on getting the bugs out of their solutions. > > > > That said I am not certain shiftfs makes sense without Seth's patches to > handle the weird cases at the VFS level. What do you do with uids and > gids that don't map? You can reinvent how to handle the strange cases > in shfitfs or we can work on solving this problem at the VFS level so > people don't have to go through the error prone work of reinventing > solutions. > > > The big ugly nasty in all of this is that we are fundamentally dealing > with uids and gids which are security identifiers. Practically any bug > is exploitable and CVE worthy. So it make sense to tread very > carefully. Even with care it can takes months if not years to get > the number of bugs down to a level where you are not the favorite target > of people looking for exploitable kernel bugs. I totally share this concern, that's why this RFC was designed like this, when you have time please check it, thanks! Here just for the record, I had a series that works with overlayfs that updated current_fsuid() to match inodes to give access, and later drop it for another better solution, but in the end I'm pretty sure that this should be handled inside VFS, and do not mess with creds or current_fsuid since they are global values, they cross user namespaces. > Eric -- Djalal Harouni http://opendz.org