From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-fsdevel-owner@vger.kernel.org>
Received: from h2.hallyn.com ([78.46.35.8]:57914 "EHLO mail.hallyn.com"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S933569AbeGCQyv (ORCPT <rfc822;linux-fsdevel@vger.kernel.org>);
        Tue, 3 Jul 2018 12:54:51 -0400
Date: Tue, 3 Jul 2018 11:54:50 -0500
From: "Serge E. Hallyn" <serge@hallyn.com>
To: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Amir Goldstein <amir73il@gmail.com>,
        linux-fsdevel <linux-fsdevel@vger.kernel.org>,
        Seth Forshee <seth.forshee@canonical.com>,
        Linux Containers <containers@lists.linux-foundation.org>,
        Tyler Hicks <tyler.hicks@canonical.com>,
        Christian Brauner <christian.brauner@canonical.com>
Subject: Re: shiftfs status and future development
Message-ID: <20180703165450.GB22894@mail.hallyn.com>
References: <20180614184448.GC30028@ubuntu-xps13>
 <20180615135638.GA29299@mail.hallyn.com>
 <20180615145917.GF30028@ubuntu-xps13>
 <1529118185.4048.46.camel@HansenPartnership.com>
 <20180618134032.GP30028@ubuntu-xps13>
 <1529333819.4021.4.camel@HansenPartnership.com>
 <CAOQ4uxiLrhgwc1uNDmL4+DisWFB6EkfeYfybZ-s7db5RzvXZ2g@mail.gmail.com>
 <1530085696.4243.5.camel@HansenPartnership.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <1530085696.4243.5.camel@HansenPartnership.com>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

Quoting James Bottomley (James.Bottomley@HansenPartnership.com):
> On Mon, 2018-06-18 at 20:11 +0300, Amir Goldstein wrote:
> > On Mon, Jun 18, 2018 at 5:56 PM, James Bottomley
> > <James.Bottomley@hansenpartnership.com> wrote:
> > [...]
> > > > > > �- Does not break inotify
> > > > > 
> > > > > I don't expect it does, but I haven't checked.
> > > > 
> > > > I haven't checked either; I'm planning to do so soon. This is a
> > > > concern that was expressed to me by others, I think because
> > > > inotify doesn't work with overlayfs.
> > > 
> > > I think shiftfs does work simply because it doesn't really do
> > > overlays, so lots of stuff that doesn't work with overlays does
> > > work with it.
> > > 
> > 
> > I'm afraid shiftfs suffers from the same problems that the old naiive
> > overlayfs inode implementation suffered from.
> > 
> > This problem is demonstrated with LTP tests inotify08 inotify09.
> > shiftfs_new_inode() is called on every lookup, so inotify watch
> > may be set on an inode object, then dentry is evicted from cache
> > and then all events on new dentry are not reported on the watched
> > inode. You will need to implement hashed inodes to solve it.
> > Can be done as overlay does - hashing by real inode pointer.
> > 
> > This is just one of those subtle things about stacked fs and there
> > may be other in present and more in future - if we don't have a
> > shared code base for the two stacked fs, I wager you are going to end
> > up "cherry picking" fixes often.
> > 
> > IMO, an important question to ask is, since both shiftfs and
> > overlayfs are strongly coupled with container use cases, are there
> > users that are interested in both layering AND shifting? on the same
> > "mark"? If the answer is yes, then this may be an argument in favor
> > of integrating at least some of shittfs functionality into overlayfs.
> 
> My container use case is interested in shifting but not layering.  Even
> the docker use case would only mix the two with the overlay graph
> driver.  There seem to be quite a few clouds using non overlayfs graph
> drivers (the dm one being the most popular).
> 
> > Another argument is that shiftfs itself takes the maximum allowed
> > 2 levels of s_stack_depth for it's 2 mounts, so it is actually not
> > possible with current VFS limitation to combine shiftfs with
> > overlayfs.
> 
> That's an artificial, not an inherent, restriction that was introduced
> to keep the call stack small.  It can be increased or even eliminated
> (although then we'd risk a real run off the end of the kernel stack
> problem).
> 
> > This could be solved relatively easily by adding "-o mark" support
> > to overlayfs and allowing to mount shiftfs also over "marked"
> > overlayfs inside container.
> 
> Can we please decided whether the temporary mark, as implemented in the
> current patch set or a more permanent security.<something> xattr type
> mark is preferred for this?  It's an important question that's been
> asked, but we have no resolution on.

I think something permanent is mandatory.  Otherwise users may be able
to induce a reboot into a state where the temp mark isn't made.  A
security.<something> xattr has the problem that an older kernel may not
know about it.

Two possibilities which have been mentioned before:

1. just demand that the *source* idmap doesn't start at 0.  Ideally it
would be something like 100k uids starting at 100k.  The kernel would
refuse to do a shiftfs mount if the source idmap includes uid 0.

I suppose the "let-them-shoot-themselves-in-the-foot" approach would be
to just strongly recommend using such a source uid mapping, but not
enforce it.

2. Enforce that the base directory have perms 700 for shiftfs-mount to
be allowed.  So /var/lib/shiftfs/base-rootfs might be root-owned with
700 perms.  Root then can shiftfs-mount it to /container1 uid-shifted to
100000:0:100000.  Yes, root could stupidly change the perms later, but
failing that uid 1000 can never exploit a setuid-root script under
/var/lib/shiftfs/base-rootfs

-serge