From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1751504AbdEJAs3 (ORCPT <rfc822;w@1wt.eu>);
        Tue, 9 May 2017 20:48:29 -0400
Received: from out03.mta.xmission.com ([166.70.13.233]:48270 "EHLO
        out03.mta.xmission.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1750780AbdEJAs1 (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Tue, 9 May 2017 20:48:27 -0400
From: ebiederm@xmission.com (Eric W. Biederman)
To: Andrey Vagin <avagin@openvz.org>
Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>,
        LKML <linux-kernel@vger.kernel.org>,
        Linux API <linux-api@vger.kernel.org>,
        "criu\@openvz.org" <criu@openvz.org>,
        Alexander Viro <viro@zeniv.linux.org.uk>
References: <20170428051831.20084-1-avagin@openvz.org>
        <CANaxB-y9W6E_6W70BPWduTcZ+A3u=w9ZLw2dvdrfe-gYcDvKhQ@mail.gmail.com>
Date: Tue, 09 May 2017 19:42:00 -0500
In-Reply-To: <CANaxB-y9W6E_6W70BPWduTcZ+A3u=w9ZLw2dvdrfe-gYcDvKhQ@mail.gmail.com>
        (Andrey Vagin's message of "Tue, 9 May 2017 10:36:46 -0700")
Message-ID: <87mvalr19j.fsf@xmission.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
X-XM-SPF: eid=1d8Fng-00080s-5T;;;mid=<87mvalr19j.fsf@xmission.com>;;;hst=in02.mta.xmission.com;;;ip=97.121.81.159;;;frm=ebiederm@xmission.com;;;spf=neutral
X-XM-AID: U2FsdGVkX190DmezNKVBMT6z7tEqMoZeJ41WgmPmlMU=
X-SA-Exim-Connect-IP: 97.121.81.159
X-SA-Exim-Mail-From: ebiederm@xmission.com
X-Spam-Report: * -1.0 ALL_TRUSTED Passed through trusted hosts only via SMTP
        *  0.0 TVD_RCVD_IP Message was received from an IP address
        *  0.7 XMSubLong Long Subject
        *  0.0 T_TM2_M_HEADER_IN_MSG BODY: No description available.
        *  0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60%
        *      [score: 0.5000]
        * -0.0 DCC_CHECK_NEGATIVE Not listed in DCC
        *      [sa02 1397; Body=1 Fuz1=1 Fuz2=1]
        *  0.1 XMSolicitRefs_0 Weightloss drug
X-Spam-DCC: XMission; sa02 1397; Body=1 Fuz1=1 Fuz2=1 
X-Spam-Combo: ;Andrey Vagin <avagin@openvz.org>
X-Spam-Relay-Country: 
X-Spam-Timing: total 5782 ms - load_scoreonly_sql: 0.07 (0.0%),
        signal_user_changed: 7 (0.1%), b_tie_ro: 6 (0.1%), parse: 2.4 (0.0%),
        extract_message_metadata: 46 (0.8%), get_uri_detail_list: 10 (0.2%),
        tests_pri_-1000: 20 (0.3%), tests_pri_-950: 2.4 (0.0%), tests_pri_-900: 1.81
        (0.0%), tests_pri_-400: 59 (1.0%), check_bayes: 57 (1.0%), b_tokenize: 27
        (0.5%), b_tok_get_all: 14 (0.2%), b_comp_prob: 8 (0.1%), b_tok_touch_all: 3.6
        (0.1%), b_finish: 0.87 (0.0%), tests_pri_0: 2271 (39.3%),
        check_dkim_signature: 1.25 (0.0%), check_dkim_adsp: 5 (0.1%), tests_pri_500:
        3364 (58.2%), poll_dns_idle: 3348 (57.9%), rewrite_mail: 0.00 (0.0%)
Subject: Re: [PATCH] mnt: allow to add a mount into an existing group
X-Spam-Flag: No
X-SA-Exim-Version: 4.2.1 (built Thu, 05 May 2016 13:38:54 -0600)
X-SA-Exim-Scanned: Yes (on in02.mta.xmission.com)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Andrey Vagin <avagin@openvz.org> writes:

> On Tue, Jan 24, 2017 at 02:03:23PM +1300, Eric W. Biederman wrote:
>> Andrei Vagin <avagin@openvz.org> writes:
>>
>> > Now a shared group can be only inherited from a source mount.
>> > This patch adds an ability to add a mount into an existing shared
>> > group.
>>
>> This sounds like a lot of the discussion on bind mounts accross
>> namespaces.  I am going to stay out of this for a bit until
>> we resolve my latest patch.
>
> Hi Eric,
>
> Your patches about shadow/side mounts were committed, can we resume
> the discussion about this patch?

We can.

Although personally I would rather get back to figuring out how
we can reduce the horrible time complexity of the worst case
for umount -l.

> As for security, a mount can be added to a shared group, only if a
> caller has CAP_SYS_ADMIN in namespaces of both mounts, so an
> unprivileged user can't create a shared mount with a parent mount
> namespace. If a user has CAP_SYS_ADMIN, I don't see a reason to
> restrict him to create shared mounts between namespaces, even if they
> are in different user-namespaces.

Can they create loops in mount propagation trees that we can not create
today?  It feels like that would be very simple to do with an interface
like this.

A loop in a mount propagation tree would be an absolute disaster.

I remember Al Viro had some very firm ideas about bind mounts from
foreign namespaces.  That I have never take the time to understand.
I suspect whatever objections he had will apply in this case.  Or else
this code might be made unnecessary by allowing bind mounts between
mount namespaces.


> Now I look at volume drivers in container services (like docker and
> kubernetes) and I think this functionality can be useful for them too.
> Now it is impossible to run a plugin driver in unprivileged containers
> (with sub-userns), because a container has to have a shared mount with
> a mount namespace where the service is running. The idea of these
> plugins is that a service requests a volume mount from a plugin and
> then starts a container with this volume, so the service need to have
> a way to get a mount from a service.

Interesting.  I personally think the checkpoint/restart use case
is more compelling.

>> Eric
>
>
> On Thu, Apr 27, 2017 at 10:18 PM, Andrei Vagin <avagin@openvz.org> wrote:
>> Now a shared group can be only inherited from a source mount.
>> This patch adds an ability to add a mount into an existing shared
>> group.
>>
>> mount(source, target, NULL, MS_SET_GROUP, NULL)
>>
>> mount() with the MS_SET_GROUP flag adds the "target" mount into a group
>> of the "source" mount. The calling process has to have the CAP_SYS_ADMIN
>> capability in namespaces of these mounts. The source and the target
>> mounts have to have the same super block.
>>
>> This new functionality together with "mnt: Tuck mounts under others
>> instead of creating shadow/side mounts." allows CRIU to dump and restore
>> any set of mount namespaces.
>>
>> Currently we have a lot of issues about dumping and restoring mount
>> namespaces. The bigest problem is that we can't construct mount trees
>> directly due to several reasons:
>> * groups can't be set, they can be only inherited
>> * file systems has to be mounted from the specified user namespaces
>> * the mount() syscall doesn't just create one mount -- the mount is
>>   also propagated to all members of a parent group
>> * umount() doesn't detach mounts from all members of a group
>>   (mounts with children are not umounted)
>> * mounts are propagated underneath of existing mounts
>> * mount() doesn't allow to make bind-mounts between two namespaces
>> * processes can have opened file descriptors to overmounted files
>>
>> All these operations are non-trivial, making the task of restoring
>> a mount namespace practically unsolvable for reasonable time. The
>> proposed change allows to restore a mount namespace in a direct
>> manner, without any super complex logic.
>>
>> Cc: Eric W. Biederman <ebiederm@xmission.com>
>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>> Signed-off-by: Andrei Vagin <avagin@openvz.org>
>> ---
>>  fs/namespace.c          | 66 ++++++++++++++++++++++++++++++++++++++++++++++---
>>  include/uapi/linux/fs.h |  6 +++++
>>  2 files changed, 68 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/namespace.c b/fs/namespace.c
>> index cc1375ef..3bf0cd2 100644
>> --- a/fs/namespace.c
>> +++ b/fs/namespace.c
>> @@ -2355,6 +2355,57 @@ static inline int tree_contains_unbindable(struct mount *mnt)
>>         return 0;
>>  }
>>
>> +static int do_set_group(struct path *path, const char *sibling_name)
>> +{
>> +       struct mount *sibling, *mnt;
>> +       struct path sibling_path;
>> +       int err;
>> +
>> +       if (!sibling_name || !*sibling_name)
>> +               return -EINVAL;
>> +
>> +       err = kern_path(sibling_name, LOOKUP_FOLLOW, &sibling_path);
>> +       if (err)
>> +               return err;
>> +
>> +       sibling = real_mount(sibling_path.mnt);
>> +       mnt = real_mount(path->mnt);
>> +
>> +       namespace_lock();
>> +
>> +       err = -EPERM;
>> +       if (!sibling->mnt_ns ||
>> +           !ns_capable(sibling->mnt_ns->user_ns, CAP_SYS_ADMIN))
>> +               goto out_unlock;
>> +
>> +       err = -EINVAL;
>> +       if (sibling->mnt.mnt_sb != mnt->mnt.mnt_sb)
>> +               goto out_unlock;
>> +
>> +       if (IS_MNT_SHARED(mnt) || IS_MNT_SLAVE(mnt))
>> +               goto out_unlock;
>> +
>> +       if (IS_MNT_SLAVE(sibling)) {
>> +               struct mount *m = sibling->mnt_master;
>> +
>> +               list_add(&mnt->mnt_slave, &m->mnt_slave_list);
>> +               mnt->mnt_master = m;
>> +       }
>> +
>> +       if (IS_MNT_SHARED(sibling)) {
>> +               mnt->mnt_group_id = sibling->mnt_group_id;
>> +               list_add(&mnt->mnt_share, &sibling->mnt_share);
>> +               set_mnt_shared(mnt);
>> +       }
>> +
>> +       err = 0;
>> +out_unlock:
>> +       namespace_unlock();
>> +
>> +       path_put(&sibling_path);
>> +       return err;
>> +}
>> +
>>  static int do_move_mount(struct path *path, const char *old_name)
>>  {
>>         struct path old_path, parent_path;
>> @@ -2769,6 +2820,7 @@ long do_mount(const char *dev_name, const char __user *dir_name,
>>         struct path path;
>>         int retval = 0;
>>         int mnt_flags = 0;
>> +       unsigned long cmd;
>>
>>         /* Discard magic */
>>         if ((flags & MS_MGC_MSK) == MS_MGC_VAL)
>> @@ -2820,19 +2872,25 @@ long do_mount(const char *dev_name, const char __user *dir_name,
>>                 mnt_flags |= path.mnt->mnt_flags & MNT_ATIME_MASK;
>>         }
>>
>> +       cmd = flags & (MS_REMOUNT | MS_BIND |
>> +                      MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE |
>> +                      MS_MOVE | MS_SET_GROUP);
>> +
>>         flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE | MS_BORN |
>>                    MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT |
>>                    MS_STRICTATIME | MS_NOREMOTELOCK | MS_SUBMOUNT);
>>
>> -       if (flags & MS_REMOUNT)
>> +       if (cmd & MS_REMOUNT)
>>                 retval = do_remount(&path, flags & ~MS_REMOUNT, mnt_flags,
>>                                     data_page);
>> -       else if (flags & MS_BIND)
>> +       else if (cmd & MS_BIND)
>>                 retval = do_loopback(&path, dev_name, flags & MS_REC);
>> -       else if (flags & (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
>> +       else if (cmd & (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
>>                 retval = do_change_type(&path, flags);
>> -       else if (flags & MS_MOVE)
>> +       else if (cmd & MS_MOVE)
>>                 retval = do_move_mount(&path, dev_name);
>> +       else if (cmd & MS_SET_GROUP)
>> +               retval = do_set_group(&path, dev_name);
>>         else
>>                 retval = do_new_mount(&path, type_page, flags, mnt_flags,
>>                                       dev_name, data_page);
>> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
>> index 048a85e..33423aa 100644
>> --- a/include/uapi/linux/fs.h
>> +++ b/include/uapi/linux/fs.h
>> @@ -131,6 +131,12 @@ struct inodes_stat_t {
>>  #define MS_STRICTATIME (1<<24) /* Always perform atime updates */
>>  #define MS_LAZYTIME    (1<<25) /* Update the on-disk [acm]times lazily */
>>
>> +/*
>> + * Here are commands and flags. Commands are handled in do_mount()
>> + * and can intersect with kernel internal flags.
>> + */
>> +#define MS_SET_GROUP   (1<<26) /* Add a mount into a shared group */
>> +
>>  /* These sb flags are internal to the kernel */
>>  #define MS_SUBMOUNT     (1<<26)
>>  #define MS_NOREMOTELOCK        (1<<27)
>> --
>> 2.9.3
>>