linux-api.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
To: Alexander Viro <viro@zeniv.linux.org.uk>,
	"Eric W . Biederman" <ebiederm@xmission.com>
Cc: Pavel Tikhomirov <snorcht@gmail.com>,
	linux-fsdevel@vger.kernel.org,
	lkml <linux-kernel@vger.kernel.org>,
	linux-api@vger.kernel.org, crml <criu@openvz.org>,
	Andrei Vagin <avagin@gmail.com>
Subject: Re: [CRIU] [PATCH] mnt: allow to add a mount into an existing group
Date: Wed, 24 Mar 2021 17:52:08 +0300	[thread overview]
Message-ID: <96115c40-3fc7-e3d6-ecd8-1be2969e5ff4@virtuozzo.com> (raw)
In-Reply-To: <aba1e14c-8af8-e171-dbf8-c9000ddccb70@virtuozzo.com>

Adding Andrew to CC with the right email.

On 3/23/21 3:59 PM, Pavel Tikhomirov wrote:
> Hi! Can we restart the discussion on this topic?
> 
> In CRIU we need to be able to dump/restore all mount trees of system 
> container (CT). CT can have anything inside - users which create their 
> custom mounts configuration, systemd with custom mount namespaces for 
> it's services, nested application containers inside the CT with their 
> own mount namespaces, and all mounts in CT mount trees can be grouped by 
> sharing groupes (e.g. same shared_id + master_id pair), and those groups 
> can depend one from another forming a tree structure of sharing groups.
> 
> 1) Imagine that we have this sharing group tree (in format (shared_id, 
> master_id), 0 means no sharing, we don't care about actual mounts for 
> now only master-slave dependencies between sharing groups):
> 
> (1,0)
>    |- (2,1)
>    |- (3,1)
>         |- (4,3)
>              |- (0,4)
> 
> The main problem of restoring mounts is the fact that sharing groups 
> currently can be only inherited, e.g. if you have one mount (first) with 
> shared_id = x, master_id = y, the only way to get another mount with 
> (x,y) is to create a bindmount from the first mount. Also to create 
> mount (y,z) from mount (x,y) one should also first inherit (x,y) via 
> bindmount and than change to (y,z).
> 
> This means that mentioned above tree puts restriction on the mounts 
> creation order, one need to have at least one mount for each of sharing 
> groups (1,0), (3,1) and (4,3) before creating the first mount of the 
> sharing group (0,4).
> 
> But what if we want to mount (restore) actual mounts in this mount tree 
> "reverse" order:
> 
> mntid    parent    mountpoint    (shared_id, master_id)
> 101    0    /tmp        (0,4)
> 102    101    /tmp        (4,3)
> 103    102    /tmp        (3,1)
> 104    103    /tmp        (1,0)
> 
> Mount 104's sharing group should be created before mount 101, 102 and 
> 103 sharing groups, but mount 104 should be created after those mounts. 
> One can actually prepare this setup (on mainstream kernel) by 
> pre-creating sharing groups elsewhere and then binding to /tmp in proper 
> order with careful unmounting of propagations (see test.sh attached):
> 
> [root@snorch propagation-tests]# bash ../test.sh
> ------------
> 960 1120 0:56 / /tmp/propagation-tests/tmp rw,relatime master:452 - 
> tmpfs propagation-tests-src rw,inode64
> 958 960 0:56 / /tmp/propagation-tests/tmp/sub rw,relatime shared:452 
> master:451 - tmpfs propagation-tests-src rw,inode64
> 961 958 0:56 / /tmp/propagation-tests/tmp/sub/sub rw,relatime shared:451 
> master:433 - tmpfs propagation-tests-src rw,inode64
> 963 961 0:56 / /tmp/propagation-tests/tmp/sub/sub/sub rw,relatime 
> shared:433 - tmpfs propagation-tests-src rw,inode64
> ------------
> 
> But this "pre-creating" from test.sh is not universal at all and only 
> works for this simple case. CRIU does not know anything about the 
> history of mount creation for system container, it also does not know 
> anything about any temporary mounts which were used and then removed. So 
> understanding the proper order is almost impossible like Andrew says.
> 
> I've also prepared a presentation on Linux Plumbers last year about how 
> much problems propagation brings to mounts restore in CRIU, you can take 
> a look here https://www.linuxplumbersconf.org/event/7/contributions/640/
> 
> 2) Propagation creates tons of mounts
> 3) Mount reparenting
> 4) "Mount trap"
> 5) "Non-uniform" propagation
> 6) “Cross-namespace” sharing groups
> 
> Allowing to create mounts private first and create sharing groups later 
> and copy sharing groups later instead of inheriting them resolves all 
> the problems with propagation at once.
> 
> One can take a look on the implementation of sharing group restore in 
> CRIU if we have this (mnt: allow to add a mount into an existing group) 
> patch applied: 
> https://github.com/Snorch/criu/blob/bebbded98128ec787950fa8365a6c74ce6a3b2cb/criu/mount-v2.c#L898 
> 
> 
> Obviously this does not solve all the problems with mounts I know about 
> but it's a big step forward in properly supporting them in CRIU. We 
> already have this tested in Virtuozzo for almost a year and it works nice.
> 
> Notes:
> 
> - There is another idea, but I should say early that I don't like it, 
> because with it restoring mounts with criu would be still super complex. 
> We can add extra flag to mount/move_mount syscall to disable propagation 
> temporary so that CRIU can restore the mount tree without problems 2-5, 
> also we can now create cross-namespace bindmounts with 
> (copy_tree+move_mount) to solve 6. But this solution does not help much 
> with problem 1 - ordering and the need of temporary mounts. As you can 
> see in test.sh you would still need to think hard to solve different 
> similar configurations of reverse order between mounts and sharing groups.
> 
> - We can actually prohibit cross-namespace MS_SET_GROUP if you like. (If 
> both namespaces are non abstract.) We can use open_tree to create a copy 
> of the mount with the same sharing group and only then copy sharing from 
> the copy while being in proper mountns.
> 
> - We still need it:
> 
>  > this code might be made unnecessary by allowing bind mounts between
>  > mount namespaces.
> 
> No, because of problem 1. Guessing right order would be still to complex.
> 
> - This approach does not allow creation of any "bad" trees.
> 
>  > Can they create loops in mount propagation trees that we can not 
> create today?
> 
> There would be no loops in "sharing groups tree" for sure, as this new 
> MS_SET_GROUP only adds one _private_ mount to one group (without moving 
> between groups), the tree itself is unchanged after mount(MS_SET_GROUP).
> 
> - Probably mount syscall is not the right place for MS_SET_GROUP. I see 
> new syscall mount_setattr, first I thought reworking MS_SET_GROUP to be 
> a part of it, but interface of mount_setattr for copying is not 
> convenient. Probably we can add MS_SET_GROUP flag to move_mount which 
> has exactly what we want path to source and destination relative to fd:
> 
> static inline int move_mount(int from_dfd, const char *from_pathname,
>                               int to_dfd, const char *to_pathname,
>                               unsigned int flags)
> 
> As in mount-v2 now I had to use proc hacks to access mounts at dfd:
> 
> https://github.com/Snorch/criu/blob/bebbded98128ec787950fa8365a6c74ce6a3b2cb/criu/mount-v2.c#L923 
> 
> 
> - I hope that we still have a chance for MS_SET_GROUP, this way we can 
> port mount-v2 to mainstream CRIU.
> 

-- 
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.

      reply	other threads:[~2021-03-24 14:53 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-04-28  5:18 [PATCH] mnt: allow to add a mount into an existing group Andrei Vagin
     [not found] ` <20170428051831.20084-1-avagin-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2017-05-09 17:36   ` Andrey Vagin
     [not found]     ` <CANaxB-y9W6E_6W70BPWduTcZ+A3u=w9ZLw2dvdrfe-gYcDvKhQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-05-10  0:42       ` Eric W. Biederman
2017-05-10 23:58         ` Andrei Vagin
2021-03-23 12:59           ` [CRIU] " Pavel Tikhomirov
2021-03-24 14:52             ` Pavel Tikhomirov [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=96115c40-3fc7-e3d6-ecd8-1be2969e5ff4@virtuozzo.com \
    --to=ptikhomirov@virtuozzo.com \
    --cc=avagin@gmail.com \
    --cc=criu@openvz.org \
    --cc=ebiederm@xmission.com \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=snorcht@gmail.com \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).