From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-fsdevel-owner@vger.kernel.org>
Received: from youngberry.canonical.com ([91.189.89.112]:35335 "EHLO
        youngberry.canonical.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1755076AbeFRQDc (ORCPT
        <rfc822;linux-fsdevel@vger.kernel.org>);
        Mon, 18 Jun 2018 12:03:32 -0400
Received: from mail-it0-f69.google.com ([209.85.214.69])
        by youngberry.canonical.com with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16)
        (Exim 4.76)
        (envelope-from <seth.forshee@canonical.com>)
        id 1fUwcs-0007zj-Sw
        for linux-fsdevel@vger.kernel.org; Mon, 18 Jun 2018 16:03:31 +0000
Received: by mail-it0-f69.google.com with SMTP id c7-v6so8650188itd.7
        for <linux-fsdevel@vger.kernel.org>; Mon, 18 Jun 2018 09:03:30 -0700 (PDT)
Date: Mon, 18 Jun 2018 11:03:28 -0500
From: Seth Forshee <seth.forshee@canonical.com>
To: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>,
        linux-fsdevel@vger.kernel.org,
        containers@lists.linux-foundation.org,
        Tyler Hicks <tyler.hicks@canonical.com>,
        Christian Brauner <christian.brauner@canonical.com>
Subject: Re: shiftfs status and future development
Message-ID: <20180618160328.GR30028@ubuntu-xps13>
References: <20180614184448.GC30028@ubuntu-xps13>
 <20180615135638.GA29299@mail.hallyn.com>
 <20180615145917.GF30028@ubuntu-xps13>
 <1529118185.4048.46.camel@HansenPartnership.com>
 <20180618134032.GP30028@ubuntu-xps13>
 <1529333819.4021.4.camel@HansenPartnership.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <1529333819.4021.4.camel@HansenPartnership.com>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

On Mon, Jun 18, 2018 at 07:56:59AM -0700, James Bottomley wrote:
> On Mon, 2018-06-18 at 08:40 -0500, Seth Forshee wrote:
> > On Fri, Jun 15, 2018 at 08:03:05PM -0700, James Bottomley wrote:
> > > On Fri, 2018-06-15 at 09:59 -0500, Seth Forshee wrote:
> > > > On Fri, Jun 15, 2018 at 08:56:38AM -0500, Serge E. Hallyn wrote:
> > > > > Quoting Seth Forshee (seth.forshee@canonical.com):
> > > > > > I wanted to inquire about the current status of shiftfs and
> > > > > > the plans for it moving forward. We'd like to have this
> > > > > > > functionality available for use in lxd, and I'm interested in
> > > > > > helping with development (or picking up development if it's
> > > > > > stalled).
> > > > > > 
> > > > > > To start, is anyone still working on shiftfs or similar
> > > > > > functionality? I haven't found it in any git tree on
> > > > > > kernel.org, and as far as mailing list activity the last
> > > > > > submission I can find is [1]. Is there anything newer than
> > > > > > this?
> > > > > > 
> > > > > > Based on past mailing list discussions, it seems like there
> > > > > > was still debate as to whether this feature should be an
> > > > > > overlay filesystem or something supported at the vfs level.
> > > > > > Was this ever resolved?
> > > > > > 
> > > > > > Thanks,
> > > > > > Seth
> > > > > > 
> > > > > > [1]
> > > > > > > http://lkml.kernel.org/r/1487638025.2337.49.camel@HansenPartnership.com
> > > > > 
> > > > > Hey Seth,
> > > > > 
> > > > > I haven't heard anything in a long time.  But if this is going
> > > > > to pick back up, can we come up with a detailed set of goals
> > > > > and requirements?
> > > 
> > > That would actually help.
> > > 
> > > > I was planning to follow up later with some discussion of
> > > > requirements. Here are some of ours:
> > > > 
> > > >  - Supports any id maps possible for a user namespace
> > > 
> > > Could you clarify: right at the moment, it basically reverses the
> > > > namespace ID mapping when it goes on to the filesystem using the
> > > > superblock user namespace, so, in theory you can have an arbitrary
> > > > mapping simply by changing the s_user_ns.  The problem here is that
> > > > you don't have a lot of tools for manipulating the s_user_ns.
> > 
> > For our purposes the way you're shifting with s_user_ns works fine. I
> > know that Serge would prefer a more arbitrary shift so that an
> > > arbitrary, unprivileged range in the source fs could be used (e.g. use
> > ids 100000 - 101000 in the source instead of 0 - 1000), and my
> > thoughts on that are quoted below.
> 
> The original (v1) shiftfs did simply take a range of ids to shift as an
> argument.  However, that one could only be set up by root and Eric
> expressed a desire that it use the s_user_ns.

I like using s_user_ns too, just pointing out that it does likely
preclude using a shifted source.
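
A rough userspace sketch of the distinction, not kernel code: the
extent layout is modeled loosely on the kernel's struct uid_gid_extent,
and every function name and id range below is an assumption chosen for
illustration. With the s_user_ns approach, the id as seen inside the
namespace is used directly in the source fs. A "shifted source" would
need a second map that places those ids somewhere else on disk.

```c
#include <stddef.h>
#include <stdint.h>

#define ID_INVALID UINT32_MAX

/* One extent of an id map (illustrative, not the kernel structure). */
struct id_extent {
	uint32_t first;       /* first id as seen inside the namespace */
	uint32_t lower_first; /* corresponding first id outside of it */
	uint32_t count;       /* number of consecutive ids covered */
};

/* Translate an outside id to the id seen inside the namespace. */
uint32_t id_to_ns(const struct id_extent *map, size_t n, uint32_t outer)
{
	for (size_t i = 0; i < n; i++)
		if (outer >= map[i].lower_first &&
		    outer - map[i].lower_first < map[i].count)
			return map[i].first + (outer - map[i].lower_first);
	return ID_INVALID;
}

/* Translate an id seen inside the namespace to the outside id. */
uint32_t id_from_ns(const struct id_extent *map, size_t n, uint32_t inner)
{
	for (size_t i = 0; i < n; i++)
		if (inner >= map[i].first &&
		    inner - map[i].first < map[i].count)
			return map[i].lower_first + (inner - map[i].first);
	return ID_INVALID;
}

/* s_user_ns style: the namespace id is used as-is in the source fs. */
uint32_t shift_via_s_user_ns(const struct id_extent *ns_map, size_t n,
			     uint32_t kid)
{
	return id_to_ns(ns_map, n, kid);
}

/* Hypothetical shifted source: a second map relocates ids on disk. */
uint32_t shift_with_source_map(const struct id_extent *ns_map, size_t n,
			       const struct id_extent *src_map, size_t m,
			       uint32_t kid)
{
	uint32_t nsid = id_to_ns(ns_map, n, kid);

	return nsid == ID_INVALID ? ID_INVALID : id_from_ns(src_map, m, nsid);
}
```

With a namespace map of 0 -> 200000 and a source map of 0 -> 100000
covering 1001 ids, a task with kernel id 200500 would write as id 500
in the first case and as id 100500 in the second.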

> > > >  - Does not break inotify
> > > 
> > > I don't expect it does, but I haven't checked.
> > 
> > I haven't checked either; I'm planning to do so soon. This is a
> > concern that was expressed to me by others, I think because inotify
> > doesn't work with overlayfs.
> 
> I think shiftfs does work simply because it doesn't really do overlays,
> so lots of stuff that doesn't work with overlays does work with it.
> 
> > > >  - Passes accurate disk usage and source information from the
> > > >    "underlay"
> > > 
> > > mounts of this type don't currently show up in df
> > > 
> > > >  - Works with a variety of filesystems (ext4, xfs, btrfs, etc.)
> > > 
> > > yes
> > > 
> > > >  - Works with nested containers
> > > 
> > > yes
> > 
> > I'd say not so much:
> > 
> >         /* to mark a mount point, must be real root */
> >         if (ssi->mark && !capable(CAP_SYS_ADMIN))
> >                 goto out;
> > 
> > So within a container I cannot mark a range to be shiftfs-mountable
> > within a container I create. I'd argue that as long as a user has
> > CAP_SYS_ADMIN towards sb->s_user_ns for the source filesystem it
> > should be safe to allow this as it implies privileges wrt all ids
> > found in the source mount. This will likely lead to stacked shiftfs
> > mounts, not sure yet whether or not this works in the current code.
> 
> Um, I think we have different definitions of "works with nested
> containers".

Ultimately what I mean is that it should work the same way in a
container as in the host, given that the container has the necessary
capabilities towards the source subtree that it wants to id shift. So my
container should be able to mark a subtree over which I have
ns_capable(sb->s_user_ns, CAP_SYS_ADMIN) and then create a container
that can shiftfs-mount that subtree.
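
The difference between the two checks can be modeled in userspace
roughly as follows. This is a simplified sketch of the capability walk,
not the kernel implementation: the struct layouts, the single
cap_sys_admin flag, and the model_* names are assumptions made for
illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-in for a user namespace (illustrative only). */
struct user_ns {
	const struct user_ns *parent; /* NULL for the initial namespace */
	uint32_t owner;               /* kernel id of the creating task */
};

/* Minimal stand-in for task credentials (illustrative only). */
struct cred {
	const struct user_ns *user_ns; /* namespace the task lives in */
	uint32_t euid;                 /* effective kernel id */
	bool cap_sys_admin;            /* CAP_SYS_ADMIN in its own namespace */
};

/*
 * Model of ns_capable(target, CAP_SYS_ADMIN): walk from the target
 * namespace towards the root. The task is capable if it holds the
 * capability in its own namespace and that namespace is the target or
 * one of its ancestors, or if it owns a direct child namespace on the
 * path.
 */
bool model_ns_capable(const struct cred *c, const struct user_ns *target)
{
	for (const struct user_ns *ns = target; ns; ns = ns->parent) {
		if (ns == c->user_ns)
			return c->cap_sys_admin;
		if (ns->parent == c->user_ns && ns->owner == c->euid)
			return true;
	}
	return false;
}

/* Model of capable(CAP_SYS_ADMIN): the same check against the root. */
bool model_capable(const struct cred *c, const struct user_ns *init_ns)
{
	return model_ns_capable(c, init_ns);
}
```

In this model, root inside a container fails model_capable() but passes
model_ns_capable() whenever the target is the container's own namespace
or one nested below it, which is the case where sb->s_user_ns belongs
to the container.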

> Recall that for a nested container the s_user_ns is also
> nested, so we shift all the way back to the uid in the root.  That
> means if the check for marking is not capable(CAP_SYS_ADMIN) then an
> unprivileged user would be able to gain root write access by setting up
> a nested shift.

This is true. However, real root already delegated the ability to write
as root within a subtree to the container by marking that subtree.
Since that container can already write as root to that subtree, what
problem is created by letting it create a nested container that can also
write to that subtree as root? Either the host already set things up so
that writes as root to that subtree are not an issue, or it didn't and
you already have a problem.

For non-shiftfs filesystems, inodes cannot be owned by ids that are not
mapped in s_user_ns, so the nested container could not write as any id
not already under the control of the first-level container.
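
This containment can be illustrated with a small userspace model. It is
a sketch under stated assumptions, not kernel code: the extent layout,
the function names, and the example ranges are made up, and it relies
on the rule that the outside ids of a nested map must themselves be
mapped in the parent namespace.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ID_INVALID UINT32_MAX

/* One extent of an id map (illustrative, not the kernel structure). */
struct id_extent {
	uint32_t first;       /* first id as seen inside the namespace */
	uint32_t lower_first; /* corresponding first id in the parent */
	uint32_t count;       /* number of consecutive ids covered */
};

/* Translate an id seen inside a namespace to the id in its parent. */
uint32_t id_to_parent(const struct id_extent *map, size_t n, uint32_t inner)
{
	for (size_t i = 0; i < n; i++)
		if (inner >= map[i].first &&
		    inner - map[i].first < map[i].count)
			return map[i].lower_first + (inner - map[i].first);
	return ID_INVALID;
}

/* Resolve a nested-container id to a host id through both levels. */
uint32_t nested_to_host(const struct id_extent *nested, size_t n,
			const struct id_extent *first, size_t m,
			uint32_t id)
{
	uint32_t mid = id_to_parent(nested, n, id);

	return mid == ID_INVALID ? ID_INVALID : id_to_parent(first, m, mid);
}

/* True if a host id falls inside the range delegated to a namespace. */
bool host_id_delegated(const struct id_extent *map, size_t n, uint32_t host)
{
	for (size_t i = 0; i < n; i++)
		if (host >= map[i].lower_first &&
		    host - map[i].lower_first < map[i].count)
			return true;
	return false;
}
```

For example, with a first-level map of 0 -> 100000 (65536 ids) and a
nested map of 0 -> 1000 (1000 ids), nested id 0 resolves to host id
101000, which lies inside the range already delegated to the
first-level container. Host id 0 is never reachable this way.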

> If your definition of nested means we only shift back
> one level of user_ns nesting then this could become ns_capable(), so I
> think we need to add "what is the desired nesting behaviour?" to the
> questions to be answered by the requirements.

No, I think in the shiftfs-over-shiftfs use case it does shift all the
way back to the host. My argument is that this isn't actually a problem.