From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=zRJA=NN=vger.kernel.org=linux-kernel-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-8.2 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS,
	INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_PASS,URIBL_BLOCKED,
	USER_AGENT_MUTT autolearn=ham autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id D33D1C32789
	for <linux-kernel@archiver.kernel.org>; Fri,  2 Nov 2018 12:44:09 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.kernel.org (Postfix) with ESMTP id 9C1D52081B
	for <linux-kernel@archiver.kernel.org>; Fri,  2 Nov 2018 12:44:09 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 9C1D52081B
Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=canonical.com
Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=linux-kernel-owner@vger.kernel.org
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1727244AbeKBVvL (ORCPT
        <rfc822;linux-kernel@archiver.kernel.org>);
        Fri, 2 Nov 2018 17:51:11 -0400
Received: from youngberry.canonical.com ([91.189.89.112]:44416 "EHLO
        youngberry.canonical.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726104AbeKBVvL (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Fri, 2 Nov 2018 17:51:11 -0400
Received: from mail-io1-f72.google.com ([209.85.166.72])
        by youngberry.canonical.com with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16)
        (Exim 4.76)
        (envelope-from <seth.forshee@canonical.com>)
        id 1gIYo0-0007Fp-D4
        for linux-kernel@vger.kernel.org; Fri, 02 Nov 2018 12:44:04 +0000
Received: by mail-io1-f72.google.com with SMTP id o7-v6so1786422ioh.22
        for <linux-kernel@vger.kernel.org>; Fri, 02 Nov 2018 05:44:04 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:date:from:to:cc:subject:message-id:references
         :mime-version:content-disposition:in-reply-to:user-agent;
        bh=LiTGJy/yASg4Kf9qswzmezhHG46xu34E64ej2cVtOOo=;
        b=Ll52BzLw4FfeHgw/KCj4EcCbBkubKKYP+FUUo//0R7rBZ7y9NEA7QJ+FQ2x165HnmA
         ckLCnM87ETYHn7JxpeSDXJSuCZ4La/dogxEOBsL1tS3bl3OhpQkaccbMWqyoxBKsln46
         j5vs6p+3dVkR9T11JRvLk0r7RhuhGKbWp2ffjMDUuUYqoDMMqZJzyRP989I5SrgkrtpC
         qLzXOQOC/esXBOvvMqq91Bq7yH71LUHEih6EW/O0Qfg1gxhnvvC7rTtW+Z8sIG5Iy/CV
         sSQz1rNOi9Hc/MH86/u3W5TqspJQDRzOrrcdBykkOWf4U6L4PaynpBoilktPP6ZKvHBe
         XBPg==
X-Gm-Message-State: AGRZ1gLfB+butPTkF6ZPkPLIxOhv2ONnT/aAtJV6We52hMWoP03OMR3E
        yZz3EpfIdsvSb/ZLqOdyhidonuZHaPa9yAcqhys+0hWWZIXX6aWQ40pD+3l4eXftHIbXudXWt2Z
        W9bUtB50vdMztl6ZXuJgcpLs7T3SDDd215XnatHeWPg==
X-Received: by 2002:a6b:5f13:: with SMTP id t19-v6mr9382106iob.268.1541162643196;
        Fri, 02 Nov 2018 05:44:03 -0700 (PDT)
X-Google-Smtp-Source: AJdET5e1uSk85/rp2XA5KJPT09n497bELW0OGom2wHfsGNQTQegNIxF2g7B1H2doJUMh7Krju/ZE2A==
X-Received: by 2002:a6b:5f13:: with SMTP id t19-v6mr9382076iob.268.1541162642629;
        Fri, 02 Nov 2018 05:44:02 -0700 (PDT)
Received: from localhost ([2605:a601:ac7:2a20:7c8b:4047:a2ef:69cd])
        by smtp.gmail.com with ESMTPSA id v134-v6sm13546918ita.23.2018.11.02.05.44.01
        (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256);
        Fri, 02 Nov 2018 05:44:01 -0700 (PDT)
Date:   Fri, 2 Nov 2018 07:44:00 -0500
From:   Seth Forshee <seth.forshee@canonical.com>
To:     Amir Goldstein <amir73il@gmail.com>
Cc:     linux-fsdevel <linux-fsdevel@vger.kernel.org>,
        linux-kernel <linux-kernel@vger.kernel.org>,
        Linux Containers <containers@lists.linux-foundation.org>,
        James Bottomley <James.Bottomley@hansenpartnership.com>,
        overlayfs <linux-unionfs@vger.kernel.org>
Subject: Re: [RFC PATCH 6/6] shiftfs: support nested shiftfs mounts
Message-ID: <20181102124400.GB29262@ubuntu-xps13>
References: <20181101214856.4563-1-seth.forshee@canonical.com>
 <20181101214856.4563-7-seth.forshee@canonical.com>
 <CAOQ4uxhkH+0tchHkrgvBZGYzeLD2fGGgtkKstt3JmhPDaGW_dg@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CAOQ4uxhkH+0tchHkrgvBZGYzeLD2fGGgtkKstt3JmhPDaGW_dg@mail.gmail.com>
User-Agent: Mutt/1.10.1 (2018-07-13)
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Nov 02, 2018 at 12:02:45PM +0200, Amir Goldstein wrote:
> On Thu, Nov 1, 2018 at 11:49 PM Seth Forshee <seth.forshee@canonical.com> wrote:
> >
> > shiftfs mounts cannot be nested for two reasons -- global
> > CAP_SYS_ADMIN is required to set up a mark mount, and a single
> > functional shiftfs mount meets the filesystem stacking depth
> > limit.
> >
> > The CAP_SYS_ADMIN requirement can be relaxed. All of the kernel
> > ids in a mount must be within that mount's s_user_ns, so all that
> > is needed is CAP_SYS_ADMIN within that s_user_ns.
> >
> > The stack depth issue can be worked around with a couple of
> > adjustments. First, a mark mount doesn't really need to count
> > against the stacking depth as it doesn't contribute to the call
> > stack depth during filesystem operations. Therefore the mount
> > over the mark mount only needs to count as one more than the
> > lower filesystems stack depth.
> 
> That's true, but it also highlights the point that the "mark" sb is
> completely unneeded and it really is just a nice little hack.
> All the information it really stores is a lower mount reference,
> a lower root dentry and a declarative statement "I am shiftable!".

Seems I should have saved some of the things I said in my previous
response for this one. As you no doubt gleaned from that email, I agree
with this.

> Come to think about it, "container shiftable" really isn't that different from
> NFS export with "no_root_squash" and auto mounted USB drive.
> I mean the shifting itself is different of course, but the
> declaration, not so much.
> If I am allowing sudoers on another machine to mess with root owned
> files visible
> on my machine, I am pretty much have the same issues as container admins
> accessing root owned files on my init_user_ns filesystem. In all those cases,
> I'd better not be executing suid binaries from the untrusted "external" source.
> 
> Instead of mounting a dummy filesystem to make the declaration, you could
> get the same thing with:
>    mount(path, path, "none", MS_BIND | MS_EXTERN | MS_NOEXEC)
> and all you need to do is add MS_EXTERN (MS_SHIFTABLE MS_UNTRUSTED
> or whatnot)  constant to uapi and manage to come up good man page description.
> 
> Then users could actually mount a filesystem in init_user_ns MS_EXTERN and
> avoid the extra bind mount step (for a full filesystem tree export).
> Declaring a mounted image MS_EXTERN has merits on its own even without
> containers and shitfs, for example with pluggable storage. Other LSMs could make
> good use of that declaration.

I'm missing how we figure out the target user ns in this scheme. We need
a context with privileges towards the source path's s_user_ns to say
it's okay to shift ids for the files under the source path, and then we
need a target user ns for the id shifts. Currently the target is
current_user_ns when the final shiftfs mount is created.

So, how do we determine the target s_user_ns in your scheme?

> 
> >
> > Second, when the lower mount is shiftfs we can just skip down to
> > that mount's lower filesystem and shift ids relative to that.
> > There is no reason to shift ids twice, and the lower path has
> > already been marked safe for id shifting by a user privileged
> > towards all ids in that mount's user ns.
> >
> > Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> > ---
> >  fs/shiftfs.c | 68 +++++++++++++++++++++++++++++++++++-----------------
> >  1 file changed, 46 insertions(+), 22 deletions(-)
> >
> > diff --git a/fs/shiftfs.c b/fs/shiftfs.c
> > index b19af7b2fe75..008ace2842b9 100644
> > --- a/fs/shiftfs.c
> > +++ b/fs/shiftfs.c
> > @@ -930,7 +930,7 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
> >         struct shiftfs_data *data = raw_data;
> >         char *name = kstrdup(data->path, GFP_KERNEL);
> >         int err = -ENOMEM;
> > -       struct shiftfs_super_info *ssi = NULL;
> > +       struct shiftfs_super_info *ssi = NULL, *mp_ssi;
> >         struct path path;
> >         struct dentry *dentry;
> >
> > @@ -946,11 +946,7 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
> >         if (err)
> >                 goto out;
> >
> > -       /* to mark a mount point, must be real root */
> > -       if (ssi->mark && !capable(CAP_SYS_ADMIN))
> > -               goto out;
> > -
> > -       /* else to mount a mark, must be userns admin */
> > +       /* to mount a mark, must be userns admin */
> >         if (!ssi->mark && !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> >                 goto out;
> 
> Isn't this check performed by vfs anyway? i.e. in mount_nodev() -> sget()

Yeah, I noticed that too. I left it in for the moment to emphasize the
change I was making, but it can be removed.

> 
> >
> > @@ -962,41 +958,66 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
> >
> >         if (!S_ISDIR(path.dentry->d_inode->i_mode)) {
> >                 err = -ENOTDIR;
> > -               goto out_put;
> > -       }
> > -
> > -       sb->s_stack_depth = path.dentry->d_sb->s_stack_depth + 1;
> > -       if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
> > -               printk(KERN_ERR "shiftfs: maximum stacking depth exceeded\n");
> > -               err = -EINVAL;
> > -               goto out_put;
> > +               goto out_put_path;
> >         }
> >
> >         if (ssi->mark) {
> > +               struct super_block *lower_sb = path.mnt->mnt_sb;
> > +
> > +               /* to mark a mount point, must root wrt lower s_user_ns */
> > +               if (!ns_capable(lower_sb->s_user_ns, CAP_SYS_ADMIN))
> > +                       goto out_put_path;
> > +
> > +
> >                 /*
> >                  * this part is visible unshifted, so make sure no
> >                  * executables that could be used to give suid
> >                  * privileges
> >                  */
> >                 sb->s_iflags = SB_I_NOEXEC;
> 
> As commented on cover letter, why allow access to any files besides root at all?
> In fact, the only justification for a dummy sb (instead of bind mount with
> MS_EXTERN flag) would be in order to override inode operations with noop ops
> to prevent access to unshifted files from within container.

Summarizing my response to the other message, if the mark mount is kept
(and I would prefer that it isn't kept) that seems reasonable to me.

Thanks,
Seth