From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: from youngberry.canonical.com ([91.189.89.112]:52566 "EHLO
	youngberry.canonical.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751018AbcEIQ1I (ORCPT );
	Mon, 9 May 2016 12:27:08 -0400
Date: Mon, 9 May 2016 16:26:30 +0000
From: Serge Hallyn
To: Djalal Harouni <tixxdz@gmail.com>
Cc: Alexander Viro, Chris Mason, tytso@mit.edu, Serge Hallyn,
	Josh Triplett, "Eric W. Biederman", Andy Lutomirski, Seth Forshee,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Dave Chinner
Subject: Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
Message-ID: <20160509162630.GC30629@ubuntumail>
References: <1462372014-3786-1-git-send-email-tixxdz@gmail.com>
	<20160504233009.GB17801@ubuntumail>
	<20160506143836.GA6815@dztty.fritz.box>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160506143836.GA6815@dztty.fritz.box>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

Quoting Djalal Harouni (tixxdz@gmail.com):
> Hi,
>
> On Wed, May 04, 2016 at 11:30:09PM +0000, Serge Hallyn wrote:
> > Quoting Djalal Harouni (tixxdz@gmail.com):
> > > This is version 2 of the VFS:userns support portable root filesystems
> > > RFC. Changes since version 1:
> > >
> > >  * Update documentation and remove some ambiguity about the feature.
> > >    Based on Josh Triplett comments.
> > >  * Use a new email address to send the RFC :-)
> > >
> > >
> > > This RFC tries to explore how to support filesystem operations inside
> > > user namespaces using only the VFS and a per mount namespace solution.
> > > This allows us to take advantage of user namespace separation without
> > > introducing any change at the filesystem level. All of this is handled
> > > through the virtual view of mount namespaces.
> >
> > Given your use case, is there any way we could work in some tradeoffs
> > to protect the host? What I'm thinking is that containers can all
> > share devices uid-mapped at will, but any device mounted with uid
> > shifting cannot be used by the initial user namespace. Or maybe just
> > made non-executable in that case, as you'll need enough access to the
> > fs to set up the containers you want to run.
> >
> > So if /dev/sda1 is your host /, you have to use /dev/sda2 as the
> > container rootfs source. Mount it under /containers with uid
> > shifting. Now all containers, regardless of uid mappings, see the
> > shifted fs contents. But the host root cannot be tricked by files on
> > it, as /dev/sda2 is non-executable as far as it is concerned.
>
> Of course the whole setup relies on the container manager setting up
> the right mount namespace, clean mounts, etc., then pivot_root, boot
> or whatever...
>
> Now I guess we can achieve what you want with MS_SLAVE|MS_REC on / ?
>
> You create new mount/pid/... namespaces with shift flags, but you are
> still in init_user_ns; you remount your / with MS_SLAVE|MS_REC, then
> you create new mount/pid namespaces with the shift flag (two mount
> namespaces here if you don't want to race between setting the
> MS_SLAVE flag and creating the mount namespace and you don't trust
> other processes... or if you want the same nested setup...)
>
> This second new secure mount namespace will be the one that you use
> to set up the container: device nodes, loops, the filesystems that
> you want in the container (probably with shift options), and also
> filesystems that you can't mount inside user namespaces or don't want
> to show up or propagate into the host. You may also want to umount
> things, or remount to change mount options, etc. Anyway, call it the
> cleaning of the mount namespace.
>
> Now during this phase, when you mount and prepare these filesystems,
> mount them with the noexec flag first and remount exec later, or
> delay the mounting until just before you do a new
> clone(CLONE_NEWUSER...). During this phase the container manager
> should get the device that you want shared from input or arguments,
> and it will only mount and prepare it inside the new mount namespaces
> or containers, and make sure that it will never be propagated back...
>
> After clone(CLONE_NEWUSER|CLONE_NEWNS|CLONE_MNTNS_SHIFT_UIDGID), set
> up the user namespace mapping; I guess you drop capabilities, do
> setuid() or whatever and start the PID 1 or the app of the container.
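
(Aside, to check I follow that last step: I'd expect the tail end of
the manager's flow to look roughly like the sketch below.  Untested,
written against my reading of your series; CLONE_MNTNS_SHIFT_UIDGID is
your new flag with a placeholder value here, the 100000:65536 mapping
is made up, and all the mount namespace cleaning you describe would
happen between the clone and the exec.)

/* Untested sketch of the clone + map + setuid + exec step above.
 * CLONE_MNTNS_SHIFT_UIDGID comes from this patch series; the value
 * below is a placeholder, not the real one.  The clone() argument
 * order is the x86_64 one.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/wait.h>

#ifndef CLONE_MNTNS_SHIFT_UIDGID
#define CLONE_MNTNS_SHIFT_UIDGID 0x1000000000UL	/* placeholder value */
#endif

static void write_file(const char *path, const char *what)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0 || write(fd, what, strlen(what)) < 0) {
		perror(path);
		exit(1);
	}
	close(fd);
}

int main(void)
{
	int pipefd[2];
	char path[64], c;
	pid_t pid;

	if (pipe(pipefd))
		exit(1);

	/* fork-style clone: new user ns plus new (shifted) mount ns */
	pid = syscall(SYS_clone, CLONE_NEWUSER | CLONE_NEWNS |
		      CLONE_MNTNS_SHIFT_UIDGID | SIGCHLD, NULL);
	if (pid < 0) {
		perror("clone");
		exit(1);
	}

	if (pid > 0) {
		/* manager: privileged in init_user_ns, so it can fill
		 * in both maps without the setgroups dance */
		snprintf(path, sizeof(path), "/proc/%d/uid_map", pid);
		write_file(path, "0 100000 65536");
		snprintf(path, sizeof(path), "/proc/%d/gid_map", pid);
		write_file(path, "0 100000 65536");
		close(pipefd[1]);	/* EOF tells the child to go on */
		waitpid(pid, NULL, 0);
		return 0;
	}

	/* child: wait until the maps are written, become container
	 * root, then start the container's PID 1 */
	close(pipefd[1]);
	if (read(pipefd[0], &c, 1) < 0)
		exit(1);
	if (setgid(0) || setuid(0)) {
		perror("setuid");
		exit(1);
	}
	execl("/sbin/init", "init", (char *)NULL);
	perror("execl");
	return 1;
}
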
> Now, and not to confuse Dave further, since he doesn't like the idea
> of a shared backing device, and me neither for obvious reasons! the
> shared device should not be used for a rootfs; maybe for read-only
> shared user data, or shared config, that's it... but for real rootfs
> they should have their own *different* backing device! unless you
> know what you are doing hehe. I don't want to confuse people, and I
> just lack time; I will also respond to Dave's email.

Yes. We're saying slightly different things. You're saying that the
admin should assign different backing stores for containers. I'm
saying perhaps the kernel should enforce that, because $leaks.

Let's say the host admin did a perfect setup of a container with
shifted uids. Now he wants to run a quick ps in the container... and
he does it in a way that leaks a /proc/pid reference into the
container, so that (evil) container root can use /proc/pid/root/ to
get a toehold into the host /. Does he now have shifted access to
that?

I think if we say "this blockdev will have shifted uids in
/proc/$pid/ns/user", then immediately that blockdev becomes
not-readable (or not-executable) in any namespace which does not have
/proc/$pid/ns/user as an ancestor. With the obvious check, as in
write-versus-execute exclusion, that you cannot mark the blockdev
shifted if an ancestor user_ns already has a file open for execute.

BTW, perhaps I should do this in a separate email, but here is how I
would expect to use this:

1. Using zfs: I create a bare (unshifted) rootfs fs1. When I want to
   create a new container, I zfs clone fs1 to fs2, and let the
   container use fs2 shifted. No danger to fs1 since fs2 is CoW. Same
   with btrfs.

2. Using overlay: I create a bare (unshifted) rootfs fs1. When I want
   to create a new container, I mount fs1 read-only and shifted as the
   base layer, then fs2 as the rw layer (see the sketch at the end of
   this mail).

The point here is that the zfs clone plus container start takes (for
a 600-800M rootfs) about 0.5 seconds on my laptop, while the act of
shifting all the uids takes another 2 seconds. So being able to do
this without manually shifting would be a huge improvement for cases
(e.g. docker) where you do lots and lots of quick deploys.

> > Just a thought.
>
> You think it will solve the case ?
>
> Thanks for your comments!
> --
> Djalal Harouni
> http://opendz.org
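
P.S. To make point 2 above concrete: as I understand the per mount
namespace approach, the shift comes from the mount namespace itself
rather than from a per-mount option, so the overlay setup is just an
ordinary overlayfs mount done from inside a mount namespace cloned
with the shift flag.  Rough sketch only; /fs1, /fs2 and /containers/c1
are made-up paths:

/* Rough sketch of the overlay variant from point 2.  Assumes we are
 * already inside a mount namespace created with the series' shift
 * flag, so the mounts below present the shifted view.
 */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* fs1: the pristine rootfs, shared read-only lower layer;
	 * fs2: this container's private upper and work directories */
	const char *opts =
		"lowerdir=/fs1,upperdir=/fs2/upper,workdir=/fs2/work";

	if (mount("overlay", "/containers/c1/rootfs", "overlay", 0, opts)) {
		perror("mount overlay");
		return 1;
	}

	/* ... pivot_root into /containers/c1/rootfs and exec init ... */
	return 0;
}

No cow clone and no uid-shifting pass over the tree: the per-container
setup cost is the namespace clone plus this one mount, which is
exactly where the extra 2 seconds above would go away.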