* [PATCH v3 1/2] move_mount: allow to add a mount into an existing group
@ 2021-07-14 13:27 Pavel Tikhomirov
  2021-07-14 13:27 ` [PATCH v3 2/2] tests: add move_mount(MOVE_MOUNT_SET_GROUP) selftest Pavel Tikhomirov
  2021-07-14 14:43 ` [PATCH v3 1/2] move_mount: allow to add a mount into an existing group Christian Brauner
  0 siblings, 2 replies; 4+ messages in thread
From: Pavel Tikhomirov @ 2021-07-14 13:27 UTC
  To: Christian Brauner, linux-fsdevel
  Cc: Pavel Tikhomirov, Eric W . Biederman, Alexander Viro,
	Mattias Nissler, Aleksa Sarai, Andrei Vagin, linux-api, lkml

Previously, a sharing group (a pair of shared and master ids) could only
be inherited when a mount was created via bind mount. This patch adds
the ability to add an existing private mount into an existing sharing
group.

With this functionality one can first create the desired mount tree from
only private mounts (without having to care about undesired mount
propagation or the mount creation order implied by sharing group
dependencies), and then set up any desired mount sharing between the
mounts in that tree as needed.

This allows CRIU to restore any set of mount namespaces, mount trees and
sharing group trees for a container.

We have many issues with restoring mounts in CRIU related to sharing
groups and propagation:
- a sharing group order that is reversed relative to the mount tree
  order requires complex mount reordering, which mostly also implies
  using some temporary mounts
  (please see https://lkml.org/lkml/2021/3/23/569 for more info)
- the mount() syscall creates tons of mounts due to propagation
- mount re-parenting due to propagation
- "Mount Trap" due to propagation
- "Non Uniform" propagation, meaning that with different tricks with
  mount order and temporary children-"lock" mounts one can create mount
  trees which can't be restored without those tricks
  (see https://www.linuxplumbersconf.org/event/7/contributions/640/)

With this new functionality we can resolve all the problems with
propagation at once.
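
As an illustration of the workflow described above (build the tree from
private mounts first, then copy sharing onto them), here is a minimal
userspace sketch. It is not part of the patch. The helper name
copy_sharing() is hypothetical, MOVE_MOUNT_SET_GROUP uses the value
proposed in this series, and the fallback syscall number 429 is an
assumption that holds for x86-64 and the generic syscall table.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>		/* AT_FDCWD */
#include <sys/syscall.h>
#include <unistd.h>

/* Flag value as proposed by this patch. */
#ifndef MOVE_MOUNT_SET_GROUP
#define MOVE_MOUNT_SET_GROUP 0x00000100
#endif

/* Assumption: 429 is move_mount on x86-64 and the generic table. */
#ifndef SYS_move_mount
#define SYS_move_mount 429
#endif

/*
 * Hypothetical helper: add the private mount at "to" to the sharing
 * group of the mount at "from". Both paths must be mount roots.
 * Returns 0 on success and -errno on failure.
 */
static int copy_sharing(const char *from, const char *to)
{
	long ret;

	assert(from && to);

	ret = syscall(SYS_move_mount, AT_FDCWD, from, AT_FDCWD, to,
		      MOVE_MOUNT_SET_GROUP);

	return ret < 0 ? -errno : 0;
}
```

A restorer would call copy_sharing() once per mount after the whole
tree exists. On a kernel without this patch the call is expected to
fail (the flag is outside MOVE_MOUNT__MASK), so callers should treat a
negative return as "not supported or not permitted".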

Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Mattias Nissler <mnissler@chromium.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-api@vger.kernel.org
Cc: lkml <linux-kernel@vger.kernel.org>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

---
This is a rework of "mnt: allow to add a mount into an existing group"
patch from Andrei. https://lkml.org/lkml/2017/4/28/20

The new do_set_group is similar to do_move_mount, but with some of the
restrictions of do_move_mount removed, for the following reasons:

1) Allow setting a sharing group "cross-namespace". If we allowed the
operation only on mounts from the current+anon mount namespace, one
would still be able to do
setns(from_mntns) + open_tree(from, OPEN_TREE_CLONE) + setns(to_mntns) +
move_mount(anon, to, MOVE_MOUNT_SET_GROUP) to set the sharing group of a
mount in a different mount namespace than the source mount. But with
this approach we would need to create an anon mount namespace and a
mount copy each time, which is just a waste of resources. So instead
let's just check that we are allowed to modify both mount namespaces
(which looks equivalent to what the setns calls and open_tree check).

2) Skip the checks in do_move_mount and open_tree which only apply to
actually moving a mount. We don't need to check for MNT_LOCKED,
d_is_dir matching, unbindable, nsfs loops and ancestor relation as we
don't move mounts.
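
For the cross-namespace case, a caller can hold file descriptors that
were opened on the two mount roots inside their respective namespaces
and pass them with empty paths, which is also what the selftest in 2/2
does. The sketch below is only an illustration: the helper names and
PATCHED_MOVE_MOUNT_MASK are hypothetical, and the syscall number
fallback is an assumption (429 on x86-64 and the generic table).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MOVE_MOUNT_F_EMPTY_PATH
#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004
#endif
#ifndef MOVE_MOUNT_T_EMPTY_PATH
#define MOVE_MOUNT_T_EMPTY_PATH 0x00000040
#endif
/* Flag value as proposed by this patch. */
#ifndef MOVE_MOUNT_SET_GROUP
#define MOVE_MOUNT_SET_GROUP 0x00000100
#endif
/* Assumption: 429 is move_mount on x86-64 and the generic table. */
#ifndef SYS_move_mount
#define SYS_move_mount 429
#endif

/* Value of MOVE_MOUNT__MASK after this patch (hypothetical local name). */
#define PATCHED_MOVE_MOUNT_MASK 0x00000177u

/* Flags for "copy sharing between two mounts referenced by fd". */
static unsigned int set_group_fd_flags(void)
{
	unsigned int flags = MOVE_MOUNT_SET_GROUP |
			     MOVE_MOUNT_F_EMPTY_PATH |
			     MOVE_MOUNT_T_EMPTY_PATH;

	/* Every bit must be accepted by the patched flag mask. */
	assert((flags & ~PATCHED_MOVE_MOUNT_MASK) == 0);
	return flags;
}

/*
 * from_fd and to_fd are expected to be fds opened on the mount roots,
 * possibly in different mount namespaces. Returns 0 or -errno.
 */
static int copy_sharing_by_fd(int from_fd, int to_fd)
{
	long ret = syscall(SYS_move_mount, from_fd, "", to_fd, "",
			   set_group_fd_flags());

	return ret < 0 ? -errno : 0;
}
```

With this form no setns() or open_tree(OPEN_TREE_CLONE) round trip is
needed; the kernel side only checks CAP_SYS_ADMIN in the user
namespaces that own both mount namespaces.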

Also let's add some new checks:

1) Don't allow copying sharing from a mount with a narrow root to a
mount with a wider root, so that the user does not gain the power to
receive more propagations than the user already has. (proposed by
Andrei)

2) Don't allow copying sharing from a mount with locked children, for
the same reason: the user shouldn't see propagations to areas
overmounted by locked mounts (if the user could not already do so
before the sharing adjustment).

3) If From is private for MOVE_MOUNT_SET_GROUP, let's report an error
instead of just doing nothing, so that the user knows that there is
probably some logical usage error. (proposed by Christian)
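
To make the resulting error codes easier to follow, here is a toy
userspace model of the order in which do_set_group() applies its
checks. It is not kernel code: struct mnt_model, its fields and
set_group_check() are hypothetical stand-ins for the kernel
predicates named in the comments.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-mount facts do_set_group() reads. */
struct mnt_model {
	bool mounted;		/* is_mounted() */
	bool ns_admin;		/* ns_capable(..., CAP_SYS_ADMIN) */
	bool path_is_root;	/* path->dentry == mnt->mnt_root */
	int sb_id;		/* stands in for mnt.mnt_sb identity */
	bool shared;		/* IS_MNT_SHARED() */
	bool slave;		/* IS_MNT_SLAVE() */
};

/* Mirrors the check order of do_set_group() in this patch. */
static int set_group_check(const struct mnt_model *from,
			   const struct mnt_model *to,
			   bool to_root_under_from_root,
			   bool from_has_locked_children)
{
	assert(from && to);

	if (!from->mounted || !to->mounted)
		return -EINVAL;
	if (!from->ns_admin || !to->ns_admin)
		return -EPERM;
	if (!from->path_is_root || !to->path_is_root)
		return -EINVAL;
	if (from->sb_id != to->sb_id)
		return -EINVAL;
	if (!to_root_under_from_root)		/* is_subdir() */
		return -EINVAL;
	if (from_has_locked_children)		/* has_locked_children() */
		return -EINVAL;
	if (to->shared || to->slave)		/* To must be private */
		return -EINVAL;
	if (!from->shared && !from->slave)	/* From must not be private */
		return -EINVAL;

	return 0;
}
```

In this model only the capability check yields -EPERM; every other
rejected combination is reported as -EINVAL, matching the patch.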

Security note: there would be no (new) loops in the sharing groups
tree, because this new move_mount(MOVE_MOUNT_SET_GROUP) operation only
adds one _private_ mount to one group (without moving mounts between
groups); the sharing groups tree itself stays unchanged after it.

In Virtuozzo we have a "mount-v2" implementation, based on the original
kernel patch from Andrei, which has been tested for almost a year and
actually decreased the number of bugs with mounts a lot. One can take a
look at the implementation of sharing group restore in CRIU in
"mount-v2" here:

https://src.openvz.org/projects/OVZ/repos/criu/browse/criu/mount-v2.c#898

This works almost the same with the current version of the patch if we
replace mount(MS_SET_GROUP) with move_mount(MOVE_MOUNT_SET_GROUP).
Please see the super-draft port for mainstream criu, which at least
passes the non-user-namespaced mount tests (zdtm.py --mounts-v2 -f ns).

https://github.com/Snorch/criu/commits/mount-v2-poc

v2: Solve the problem mentioned by Andrei:
- check mnt_root of "to" is in the sub-tree of mnt_root of "from"
- also check "from" has no locked mounts in subroot of "to"
v3: Add checks:
- check paths to be mount root dentries
- return EINVAL if From is private (no sharing to copy)

---
 fs/namespace.c             | 75 +++++++++++++++++++++++++++++++++++++-
 include/uapi/linux/mount.h |  3 +-
 2 files changed, 76 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index ab4174a3c802..a7828e695e03 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2684,6 +2684,76 @@ static bool check_for_nsfs_mounts(struct mount *subtree)
 	return ret;
 }
 
+static int do_set_group(struct path *from_path, struct path *to_path)
+{
+	struct mount *from, *to;
+	int err;
+
+	from = real_mount(from_path->mnt);
+	to = real_mount(to_path->mnt);
+
+	namespace_lock();
+
+	err = -EINVAL;
+	/* To and From must be mounted */
+	if (!is_mounted(&from->mnt))
+		goto out;
+	if (!is_mounted(&to->mnt))
+		goto out;
+
+	err = -EPERM;
+	/* We should be allowed to modify mount namespaces of both mounts */
+	if (!ns_capable(from->mnt_ns->user_ns, CAP_SYS_ADMIN))
+		goto out;
+	if (!ns_capable(to->mnt_ns->user_ns, CAP_SYS_ADMIN))
+		goto out;
+
+	err = -EINVAL;
+	/* To and From paths should be mount roots */
+	if (from_path->dentry != from_path->mnt->mnt_root)
+		goto out;
+	if (to_path->dentry != to_path->mnt->mnt_root)
+		goto out;
+
+	/* Setting sharing groups is only allowed across same superblock */
+	if (from->mnt.mnt_sb != to->mnt.mnt_sb)
+		goto out;
+
+	/* From mount root should be wider than To mount root */
+	if (!is_subdir(to->mnt.mnt_root, from->mnt.mnt_root))
+		goto out;
+
+	/* From mount should not have locked children in place of To's root */
+	if (has_locked_children(from, to->mnt.mnt_root))
+		goto out;
+
+	/* Setting sharing groups is only allowed on private mounts */
+	if (IS_MNT_SHARED(to) || IS_MNT_SLAVE(to))
+		goto out;
+
+	/* From should not be private */
+	if (!IS_MNT_SHARED(from) && !IS_MNT_SLAVE(from))
+		goto out;
+
+	if (IS_MNT_SLAVE(from)) {
+		struct mount *m = from->mnt_master;
+
+		list_add(&to->mnt_slave, &m->mnt_slave_list);
+		to->mnt_master = m;
+	}
+
+	if (IS_MNT_SHARED(from)) {
+		to->mnt_group_id = from->mnt_group_id;
+		list_add(&to->mnt_share, &from->mnt_share);
+		set_mnt_shared(to);
+	}
+
+	err = 0;
+out:
+	namespace_unlock();
+	return err;
+}
+
 static int do_move_mount(struct path *old_path, struct path *new_path)
 {
 	struct mnt_namespace *ns;
@@ -3669,7 +3739,10 @@ SYSCALL_DEFINE5(move_mount,
 	if (ret < 0)
 		goto out_to;
 
-	ret = do_move_mount(&from_path, &to_path);
+	if (flags & MOVE_MOUNT_SET_GROUP)
+		ret = do_set_group(&from_path, &to_path);
+	else
+		ret = do_move_mount(&from_path, &to_path);
 
 out_to:
 	path_put(&to_path);
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
index dd7a166fdf9c..4d93967f8aea 100644
--- a/include/uapi/linux/mount.h
+++ b/include/uapi/linux/mount.h
@@ -73,7 +73,8 @@
 #define MOVE_MOUNT_T_SYMLINKS		0x00000010 /* Follow symlinks on to path */
 #define MOVE_MOUNT_T_AUTOMOUNTS		0x00000020 /* Follow automounts on to path */
 #define MOVE_MOUNT_T_EMPTY_PATH		0x00000040 /* Empty to path permitted */
-#define MOVE_MOUNT__MASK		0x00000077
+#define MOVE_MOUNT_SET_GROUP		0x00000100 /* Set sharing group instead */
+#define MOVE_MOUNT__MASK		0x00000177
 
 /*
  * fsopen() flags.
-- 
2.31.1



* [PATCH v3 2/2] tests: add move_mount(MOVE_MOUNT_SET_GROUP) selftest
  2021-07-14 13:27 [PATCH v3 1/2] move_mount: allow to add a mount into an existing group Pavel Tikhomirov
@ 2021-07-14 13:27 ` Pavel Tikhomirov
  2021-07-14 14:43 ` [PATCH v3 1/2] move_mount: allow to add a mount into an existing group Christian Brauner
  1 sibling, 0 replies; 4+ messages in thread
From: Pavel Tikhomirov @ 2021-07-14 13:27 UTC
  To: Christian Brauner, linux-fsdevel; +Cc: Pavel Tikhomirov

Add a simple selftest for move_mount(MOVE_MOUNT_SET_GROUP). It tests
that, while in the parent userns, one can copy sharing from a mount in
one nested mntns (owned by a nested userns) to a mount in another
nested mntns (owned by another nested userns).

  TAP version 13
  1..1
  # Starting 1 tests from 2 test cases.
  #  RUN           move_mount_set_group.complex_sharing_copying ...
  #            OK  move_mount_set_group.complex_sharing_copying
  ok 1 move_mount_set_group.complex_sharing_copying
  # PASSED: 1 / 1 tests passed.
  # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

---
I took the mount_setattr test as an example. I'm not too experienced in
selftests, so I hope I'm not doing something wrong here.

I implemented the testcase with the way I plan to use this interface in
criu in mind, so it does not simply copy sharing between two nearby
mounts but also adds some userns+mntns-es to test cross-namespace
copying.
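
The selftest verifies the result by looking for "shared:" in
/proc/self/mountinfo. As a related illustration, the sketch below
extracts both ids of the sharing group pair (shared and master) from a
single mountinfo line. It is not part of the selftest; the function
name and the convention that 0 means "tag absent" are assumptions of
this sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse the optional fields of one /proc/<pid>/mountinfo line.
 * On success returns 0 and stores the ids from "shared:N" and
 * "master:N", or 0 for a tag that is absent. Returns -1 when the line
 * has too few fields or lacks the "-" separator.
 */
static int parse_sharing_ids(const char *line, int *shared_id,
			     int *master_id)
{
	const char *p = line;
	int field, id;

	assert(line && shared_id && master_id);
	*shared_id = 0;
	*master_id = 0;

	/* Skip the six fixed fields that precede the optional fields. */
	for (field = 0; field < 6; field++) {
		p = strchr(p, ' ');
		if (!p)
			return -1;
		p++;
	}

	/* Optional fields run until the "- " separator. */
	while (*p && strncmp(p, "- ", 2) != 0) {
		if (sscanf(p, "shared:%d", &id) == 1)
			*shared_id = id;
		if (sscanf(p, "master:%d", &id) == 1)
			*master_id = id;

		p = strchr(p, ' ');
		if (!p)
			return -1;
		p++;
	}

	return strncmp(p, "- ", 2) == 0 ? 0 : -1;
}
```

After a successful MOVE_MOUNT_SET_GROUP, the To mount is expected to
report the same shared and master ids as the From mount.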

Note: One can also test MOVE_MOUNT_SET_GROUP via zdtm tests on criu
mount-v2 POC: https://github.com/Snorch/criu/commits/mount-v2-poc

v3: add some test

---
 tools/testing/selftests/Makefile              |   1 +
 .../selftests/move_mount_set_group/.gitignore |   1 +
 .../selftests/move_mount_set_group/Makefile   |   7 +
 .../selftests/move_mount_set_group/config     |   1 +
 .../move_mount_set_group_test.c               | 375 ++++++++++++++++++
 5 files changed, 385 insertions(+)
 create mode 100644 tools/testing/selftests/move_mount_set_group/.gitignore
 create mode 100644 tools/testing/selftests/move_mount_set_group/Makefile
 create mode 100644 tools/testing/selftests/move_mount_set_group/config
 create mode 100644 tools/testing/selftests/move_mount_set_group/move_mount_set_group_test.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index fb010a35d61a..dd0388eab94d 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -35,6 +35,7 @@ TARGETS += memory-hotplug
 TARGETS += mincore
 TARGETS += mount
 TARGETS += mount_setattr
+TARGETS += move_mount_set_group
 TARGETS += mqueue
 TARGETS += nci
 TARGETS += net
diff --git a/tools/testing/selftests/move_mount_set_group/.gitignore b/tools/testing/selftests/move_mount_set_group/.gitignore
new file mode 100644
index 000000000000..f5e339268720
--- /dev/null
+++ b/tools/testing/selftests/move_mount_set_group/.gitignore
@@ -0,0 +1 @@
+move_mount_set_group_test
diff --git a/tools/testing/selftests/move_mount_set_group/Makefile b/tools/testing/selftests/move_mount_set_group/Makefile
new file mode 100644
index 000000000000..80c2d86812b0
--- /dev/null
+++ b/tools/testing/selftests/move_mount_set_group/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0
+# Makefile for mount selftests.
+CFLAGS = -g -I../../../../usr/include/ -Wall -O2
+
+TEST_GEN_FILES += move_mount_set_group_test
+
+include ../lib.mk
diff --git a/tools/testing/selftests/move_mount_set_group/config b/tools/testing/selftests/move_mount_set_group/config
new file mode 100644
index 000000000000..416bd53ce982
--- /dev/null
+++ b/tools/testing/selftests/move_mount_set_group/config
@@ -0,0 +1 @@
+CONFIG_USER_NS=y
diff --git a/tools/testing/selftests/move_mount_set_group/move_mount_set_group_test.c b/tools/testing/selftests/move_mount_set_group/move_mount_set_group_test.c
new file mode 100644
index 000000000000..ca0c0c2db991
--- /dev/null
+++ b/tools/testing/selftests/move_mount_set_group/move_mount_set_group_test.c
@@ -0,0 +1,375 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <sched.h>
+#include <stdio.h>
+#include <errno.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/mount.h>
+#include <sys/wait.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <stdbool.h>
+#include <stdarg.h>
+#include <sys/syscall.h>
+
+#include "../kselftest_harness.h"
+
+#ifndef CLONE_NEWNS
+#define CLONE_NEWNS 0x00020000
+#endif
+
+#ifndef CLONE_NEWUSER
+#define CLONE_NEWUSER 0x10000000
+#endif
+
+#ifndef MS_SHARED
+#define MS_SHARED (1 << 20)
+#endif
+
+#ifndef MS_PRIVATE
+#define MS_PRIVATE (1<<18)
+#endif
+
+#ifndef MOVE_MOUNT_SET_GROUP
+#define MOVE_MOUNT_SET_GROUP 0x00000100
+#endif
+
+#ifndef MOVE_MOUNT_F_EMPTY_PATH
+#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004
+#endif
+
+#ifndef MOVE_MOUNT_T_EMPTY_PATH
+#define MOVE_MOUNT_T_EMPTY_PATH 0x00000040
+#endif
+
+static ssize_t write_nointr(int fd, const void *buf, size_t count)
+{
+	ssize_t ret;
+
+	do {
+		ret = write(fd, buf, count);
+	} while (ret < 0 && errno == EINTR);
+
+	return ret;
+}
+
+static int write_file(const char *path, const void *buf, size_t count)
+{
+	int fd;
+	ssize_t ret;
+
+	fd = open(path, O_WRONLY | O_CLOEXEC | O_NOCTTY | O_NOFOLLOW);
+	if (fd < 0)
+		return -1;
+
+	ret = write_nointr(fd, buf, count);
+	close(fd);
+	if (ret < 0 || (size_t)ret != count)
+		return -1;
+
+	return 0;
+}
+
+static int create_and_enter_userns(void)
+{
+	uid_t uid;
+	gid_t gid;
+	char map[100];
+
+	uid = getuid();
+	gid = getgid();
+
+	if (unshare(CLONE_NEWUSER))
+		return -1;
+
+	if (write_file("/proc/self/setgroups", "deny", sizeof("deny") - 1) &&
+	    errno != ENOENT)
+		return -1;
+
+	snprintf(map, sizeof(map), "0 %d 1", uid);
+	if (write_file("/proc/self/uid_map", map, strlen(map)))
+		return -1;
+
+
+	snprintf(map, sizeof(map), "0 %d 1", gid);
+	if (write_file("/proc/self/gid_map", map, strlen(map)))
+		return -1;
+
+	if (setgid(0))
+		return -1;
+
+	if (setuid(0))
+		return -1;
+
+	return 0;
+}
+
+static int prepare_unpriv_mountns(void)
+{
+	if (create_and_enter_userns())
+		return -1;
+
+	if (unshare(CLONE_NEWNS))
+		return -1;
+
+	if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, 0))
+		return -1;
+
+	return 0;
+}
+
+static char *get_field(char *src, int nfields)
+{
+	int i;
+	char *p = src;
+
+	for (i = 0; i < nfields; i++) {
+		while (*p && *p != ' ' && *p != '\t')
+			p++;
+
+		if (!*p)
+			break;
+
+		p++;
+	}
+
+	return p;
+}
+
+static void null_endofword(char *word)
+{
+	while (*word && *word != ' ' && *word != '\t')
+		word++;
+	*word = '\0';
+}
+
+static bool is_shared_mount(const char *path)
+{
+	size_t len = 0;
+	char *line = NULL;
+	FILE *f = NULL;
+	bool shared = false;
+
+	f = fopen("/proc/self/mountinfo", "re");
+	if (!f)
+		return false;
+
+	while (getline(&line, &len, f) != -1) {
+		char *opts, *target;
+
+		/* get_field() never returns NULL, so no NULL checks */
+		target = get_field(line, 4);
+
+		opts = get_field(target, 2);
+
+		null_endofword(target);
+
+		if (strcmp(target, path) != 0)
+			continue;
+
+		null_endofword(opts);
+		if (strstr(opts, "shared:")) {
+			shared = true;
+			break;
+		}
+	}
+
+	free(line);
+	fclose(f);
+
+	return shared;
+}
+
+/* Attempt to de-conflict with the selftests tree. */
+#ifndef SKIP
+#define SKIP(s, ...)	XFAIL(s, ##__VA_ARGS__)
+#endif
+
+#define SET_GROUP_FROM	"/tmp/move_mount_set_group_supported_from"
+#define SET_GROUP_TO	"/tmp/move_mount_set_group_supported_to"
+
+static int move_mount_set_group_supported(void)
+{
+	int ret;
+
+	if (mount("testing", "/tmp", "tmpfs", MS_NOATIME | MS_NODEV,
+		  "size=100000,mode=700"))
+		return -1;
+
+	if (mount(NULL, "/tmp", NULL, MS_PRIVATE, 0))
+		return -1;
+
+	if (mkdir(SET_GROUP_FROM, 0777))
+		return -1;
+
+	if (mkdir(SET_GROUP_TO, 0777))
+		return -1;
+
+	if (mount("testing", SET_GROUP_FROM, "tmpfs", MS_NOATIME | MS_NODEV,
+		  "size=100000,mode=700"))
+		return -1;
+
+	if (mount(SET_GROUP_FROM, SET_GROUP_TO, NULL, MS_BIND, NULL))
+		return -1;
+
+	if (mount(NULL, SET_GROUP_FROM, NULL, MS_SHARED, 0))
+		return -1;
+
+	ret = syscall(SYS_move_mount, AT_FDCWD, SET_GROUP_FROM,
+		      AT_FDCWD, SET_GROUP_TO, MOVE_MOUNT_SET_GROUP);
+	umount2("/tmp", MNT_DETACH);
+
+	return ret < 0 ? false : true;
+}
+
+FIXTURE(move_mount_set_group) {
+};
+
+#define SET_GROUP_A "/tmp/A"
+
+FIXTURE_SETUP(move_mount_set_group)
+{
+	int ret;
+
+	ASSERT_EQ(prepare_unpriv_mountns(), 0);
+
+	ret = move_mount_set_group_supported();
+	ASSERT_GE(ret, 0);
+	if (!ret)
+		SKIP(return, "move_mount(MOVE_MOUNT_SET_GROUP) is not supported");
+
+	umount2("/tmp", MNT_DETACH);
+
+	ASSERT_EQ(mount("testing", "/tmp", "tmpfs", MS_NOATIME | MS_NODEV,
+			"size=100000,mode=700"), 0);
+
+	ASSERT_EQ(mkdir(SET_GROUP_A, 0777), 0);
+
+	ASSERT_EQ(mount("testing", SET_GROUP_A, "tmpfs", MS_NOATIME | MS_NODEV,
+			"size=100000,mode=700"), 0);
+}
+
+FIXTURE_TEARDOWN(move_mount_set_group)
+{
+	int ret;
+
+	ret = move_mount_set_group_supported();
+	ASSERT_GE(ret, 0);
+	if (!ret)
+		SKIP(return, "move_mount(MOVE_MOUNT_SET_GROUP) is not supported");
+
+	umount2("/tmp", MNT_DETACH);
+}
+
+#define __STACK_SIZE (8 * 1024 * 1024)
+static pid_t do_clone(int (*fn)(void *), void *arg, int flags)
+{
+	void *stack;
+
+	stack = malloc(__STACK_SIZE);
+	if (!stack)
+		return -ENOMEM;
+
+#ifdef __ia64__
+	return __clone2(fn, stack, __STACK_SIZE, flags | SIGCHLD, arg, NULL);
+#else
+	return clone(fn, stack + __STACK_SIZE, flags | SIGCHLD, arg, NULL);
+#endif
+}
+
+static int wait_for_pid(pid_t pid)
+{
+	int status, ret;
+
+again:
+	ret = waitpid(pid, &status, 0);
+	if (ret == -1) {
+		if (errno == EINTR)
+			goto again;
+
+		return -1;
+	}
+
+	if (!WIFEXITED(status))
+		return -1;
+
+	return WEXITSTATUS(status);
+}
+
+struct child_args {
+	int unsfd;
+	int mntnsfd;
+	bool shared;
+	int mntfd;
+};
+
+static int get_nestedns_mount_cb(void *data)
+{
+	struct child_args *ca = (struct child_args *)data;
+	int ret;
+
+	ret = prepare_unpriv_mountns();
+	if (ret)
+		return 1;
+
+	if (ca->shared) {
+		ret = mount(NULL, SET_GROUP_A, NULL, MS_SHARED, 0);
+		if (ret)
+			return 1;
+	}
+
+	ret = open("/proc/self/ns/user", O_RDONLY);
+	if (ret < 0)
+		return 1;
+	ca->unsfd = ret;
+
+	ret = open("/proc/self/ns/mnt", O_RDONLY);
+	if (ret < 0)
+		return 1;
+	ca->mntnsfd = ret;
+
+	ret = open(SET_GROUP_A, O_RDONLY);
+	if (ret < 0)
+		return 1;
+	ca->mntfd = ret;
+
+	return 0;
+}
+
+TEST_F(move_mount_set_group, complex_sharing_copying)
+{
+	struct child_args ca_from = {
+		.shared = true,
+	};
+	struct child_args ca_to = {
+		.shared = false,
+	};
+	pid_t pid;
+	int ret;
+
+	ret = move_mount_set_group_supported();
+	ASSERT_GE(ret, 0);
+	if (!ret)
+		SKIP(return, "move_mount(MOVE_MOUNT_SET_GROUP) is not supported");
+
+	pid = do_clone(get_nestedns_mount_cb, &ca_from, CLONE_VFORK | CLONE_VM | CLONE_FILES);
+	ASSERT_GT(pid, 0);
+	ASSERT_EQ(wait_for_pid(pid), 0);
+
+	pid = do_clone(get_nestedns_mount_cb, &ca_to, CLONE_VFORK | CLONE_VM | CLONE_FILES);
+	ASSERT_GT(pid, 0);
+	ASSERT_EQ(wait_for_pid(pid), 0);
+
+	ASSERT_EQ(syscall(SYS_move_mount, ca_from.mntfd, "",
+			  ca_to.mntfd, "", MOVE_MOUNT_SET_GROUP
+			  | MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_T_EMPTY_PATH),
+		  0);
+
+	ASSERT_EQ(setns(ca_to.mntnsfd, CLONE_NEWNS), 0);
+	ASSERT_EQ(is_shared_mount(SET_GROUP_A), 1);
+}
+
+TEST_HARNESS_MAIN
-- 
2.31.1



* Re: [PATCH v3 1/2] move_mount: allow to add a mount into an existing group
  2021-07-14 13:27 [PATCH v3 1/2] move_mount: allow to add a mount into an existing group Pavel Tikhomirov
  2021-07-14 13:27 ` [PATCH v3 2/2] tests: add move_mount(MOVE_MOUNT_SET_GROUP) selftest Pavel Tikhomirov
@ 2021-07-14 14:43 ` Christian Brauner
  2021-07-14 16:12   ` Pavel Tikhomirov
  1 sibling, 1 reply; 4+ messages in thread
From: Christian Brauner @ 2021-07-14 14:43 UTC
  To: Pavel Tikhomirov
  Cc: linux-fsdevel, Eric W . Biederman, Alexander Viro,
	Mattias Nissler, Aleksa Sarai, Andrei Vagin, linux-api, lkml

On Wed, Jul 14, 2021 at 04:27:53PM +0300, Pavel Tikhomirov wrote:
> Previously a sharing group (shared and master ids pair) can be only
> inherited when mount is created via bindmount. This patch adds an
> ability to add an existing private mount into an existing sharing group.
> 
> With this functionality one can first create the desired mount tree from
> only private mounts (without the need to care about undesired mount
> propagation or mount creation order implied by sharing group
> dependencies), and next then setup any desired mount sharing between
> those mounts in tree as needed.
> 
> This allows CRIU to restore any set of mount namespaces, mount trees and
> sharing group trees for a container.
> 
> We have many issues with restoring mounts in CRIU related to sharing
> groups and propagation:
> - reverse sharing groups vs mount tree order requires complex mounts
>   reordering which mostly implies also using some temporary mounts
> (please see https://lkml.org/lkml/2021/3/23/569 for more info)
> 
> - mount() syscall creates tons of mounts due to propagation
> - mount re-parenting due to propagation
> - "Mount Trap" due to propagation
> - "Non Uniform" propagation, meaning that with different tricks with
>   mount order and temporary children-"lock" mounts one can create mount
>   trees which can't be restored without those tricks
> (see https://www.linuxplumbersconf.org/event/7/contributions/640/)
> 
> With this new functionality we can resolve all the problems with
> propagation at once.
> 
> Link: https://lore.kernel.org/r/20210714132754.94633-1-ptikhomirov@virtuozzo.com
> Cc: Eric W. Biederman <ebiederm@xmission.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Christian Brauner <christian.brauner@ubuntu.com>
> Cc: Mattias Nissler <mnissler@chromium.org>
> Cc: Aleksa Sarai <cyphar@cyphar.com>
> Cc: Andrei Vagin <avagin@gmail.com>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-api@vger.kernel.org
> Cc: lkml <linux-kernel@vger.kernel.org>
> Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
> ---
> This is a rework of "mnt: allow to add a mount into an existing group"
> patch from Andrei. https://lkml.org/lkml/2017/4/28/20
> 
> New do_set_group is similar to do_move_mount, but with some restrictions
> of do_move_mount removed and that's why:
> 
> 1) Allow "cross-namespace" sharing group set. If we allow operation only
> with mounts from current+anon mount namespace one would still be able to
> setns(from_mntns) + open_tree(from, OPEN_TREE_CLONE) + setns(to_mntns) +
> move_mount(anon, to, MOVE_MOUNT_SET_GROUP) to set sharing group to mount
> in different mount namespace with source mount. But with this approach
> we would need to create anon mount namespace and mount copy each time,
> which is just a waste of resources. So instead lets just check if we are
> allowed to modify both mount namespaces (which looks equivalent to what
> setns-es and open_tree check).
> 
> 2) Skip checks wich only apply to actually moving mount which we have in
> do_move_mount and open_tree. We don't need to check for MNT_LOCKED,
> d_is_dir matching, unbindable, nsfs loops and ancestor relation as we
> don't move mounts.
> 
> Also let's add some new checks:
> 
> 1) Don't allow to copy sharing from mount with narrow root to a wider
> root, so that user does not have power to receive more propagations when
> user already has. (proposed by Andrei)
> 
> 2) Don't allow to copy sharing from mount with locked children for the
> same reason, as user shouldn't see propagations to areas overmounted by
> locked mounts (if the user could not already do it before sharing
> adjustment).
> 
> 3) If From is private for MOVE_MOUNT_SET_GROUP let's report an error
> instead of just doing nothing, so that the user knows that there is
> probably some logical usage error. (proposed by Christian)
> 
> Security note: there would be no (new) loops in sharing groups tree,
> because this new move_mount(MOVE_MOUNT_SET_GROUP) operation only adds
> one _private_ mount to one group (without moving between groups), the
> sharing groups tree itself stays unchanged after it.
> 
> In Virtuozzo we have "mount-v2" implementation, based with the original
> kernel patch from Andrei, tested for almost a year and it actually
> decreased number of bugs with mounts a lot. One can take a look on the
> implementation of sharing group restore in CRIU in "mount-v2" here:
> 
> https://src.openvz.org/projects/OVZ/repos/criu/browse/criu/mount-v2.c#898
> 
> This works almost the same with current version of patch if we replace
> mount(MS_SET_GROUP) to move_mount(MOVE_MOUNT_SET_GROUP), please see
> super-draft port for mainstream criu, this at least passes
> non-user-namespaced mount tests (zdtm.py --mounts-v2 -f ns).
> 
> https://github.com/Snorch/criu/commits/mount-v2-poc
> 
> v2: Solve the problem mentioned by Andrei:
> - check mnt_root of "to" is in the sub-tree of mnt_root of "from"
> - also check "from" has no locked mounts in subroot of "to"
> v3: Add checks:
> - check paths to be mount root dentries
> - return EINVAL if From is private (no sharing to copy)
> 
> ---
>  fs/namespace.c             | 75 +++++++++++++++++++++++++++++++++++++-
>  include/uapi/linux/mount.h |  3 +-
>  2 files changed, 76 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index ab4174a3c802..a7828e695e03 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2684,6 +2684,76 @@ static bool check_for_nsfs_mounts(struct mount *subtree)
>  	return ret;
>  }
>  
> +static int do_set_group(struct path *from_path, struct path *to_path)
> +{
> +	struct mount *from, *to;
> +	int err;
> +
> +	from = real_mount(from_path->mnt);
> +	to = real_mount(to_path->mnt);
> +
> +	namespace_lock();
> +
> +	err = -EINVAL;
> +	/* To and From must be mounted */
> +	if (!is_mounted(&from->mnt))
> +		goto out;
> +	if (!is_mounted(&to->mnt))
> +		goto out;
> +
> +	err = -EPERM;
> +	/* We should be allowed to modify mount namespaces of both mounts */
> +	if (!ns_capable(from->mnt_ns->user_ns, CAP_SYS_ADMIN))
> +		goto out;
> +	if (!ns_capable(to->mnt_ns->user_ns, CAP_SYS_ADMIN))
> +		goto out;
> +
> +	err = -EINVAL;
> +	/* To and From paths should be mount roots */
> +	if (from_path->dentry != from_path->mnt->mnt_root)
> +		goto out;
> +	if (to_path->dentry != to_path->mnt->mnt_root)
> +		goto out;
> +
> +	/* Setting sharing groups is only allowed across same superblock */
> +	if (from->mnt.mnt_sb != to->mnt.mnt_sb)
> +		goto out;
> +
> +	/* From mount root should be wider than To mount root */
> +	if (!is_subdir(to->mnt.mnt_root, from->mnt.mnt_root))
> +		goto out;
> +
> +	/* From mount should not have locked children in place of To's root */
> +	if (has_locked_children(from, to->mnt.mnt_root))
> +		goto out;
> +
> +	/* Setting sharing groups is only allowed on private mounts */
> +	if (IS_MNT_SHARED(to) || IS_MNT_SLAVE(to))
> +		goto out;
> +
> +	/* From should not be private */
> +	if (!IS_MNT_SHARED(from) && !IS_MNT_SLAVE(from))
> +		goto out;
> +
> +	if (IS_MNT_SLAVE(from)) {
> +		struct mount *m = from->mnt_master;
> +
> +		list_add(&to->mnt_slave, &m->mnt_slave_list);
> +		to->mnt_master = m;
> +	}
> +
> +	if (IS_MNT_SHARED(from)) {
> +		to->mnt_group_id = from->mnt_group_id;
> +		list_add(&to->mnt_share, &from->mnt_share);
> +		set_mnt_shared(to);

Thanks for the updated patch.

Note, set_mnt_shared() writes to the vfsmount so this should be:

if (IS_MNT_SHARED(from)) {
	to->mnt_group_id = from->mnt_group_id;
	list_add(&to->mnt_share, &from->mnt_share);
	lock_mount_hash();
	set_mnt_shared(to);
	unlock_mount_hash();
}

> +	}
> +
> +	err = 0;
> +out:
> +	namespace_unlock();
> +	return err;
> +}
> +
>  static int do_move_mount(struct path *old_path, struct path *new_path)
>  {
>  	struct mnt_namespace *ns;
> @@ -3669,7 +3739,10 @@ SYSCALL_DEFINE5(move_mount,
>  	if (ret < 0)
>  		goto out_to;
>  
> -	ret = do_move_mount(&from_path, &to_path);
> +	if (flags & MOVE_MOUNT_SET_GROUP)
> +		ret = do_set_group(&from_path, &to_path);
> +	else
> +		ret = do_move_mount(&from_path, &to_path);
>  
>  out_to:
>  	path_put(&to_path);
> diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
> index dd7a166fdf9c..4d93967f8aea 100644
> --- a/include/uapi/linux/mount.h
> +++ b/include/uapi/linux/mount.h
> @@ -73,7 +73,8 @@
>  #define MOVE_MOUNT_T_SYMLINKS		0x00000010 /* Follow symlinks on to path */
>  #define MOVE_MOUNT_T_AUTOMOUNTS		0x00000020 /* Follow automounts on to path */
>  #define MOVE_MOUNT_T_EMPTY_PATH		0x00000040 /* Empty to path permitted */
> -#define MOVE_MOUNT__MASK		0x00000077
> +#define MOVE_MOUNT_SET_GROUP		0x00000100 /* Set sharing group instead */
> +#define MOVE_MOUNT__MASK		0x00000177
>  
>  /*
>   * fsopen() flags.
> -- 
> 2.31.1


* Re: [PATCH v3 1/2] move_mount: allow to add a mount into an existing group
  2021-07-14 14:43 ` [PATCH v3 1/2] move_mount: allow to add a mount into an existing group Christian Brauner
@ 2021-07-14 16:12   ` Pavel Tikhomirov
  0 siblings, 0 replies; 4+ messages in thread
From: Pavel Tikhomirov @ 2021-07-14 16:12 UTC
  To: Christian Brauner
  Cc: linux-fsdevel, Eric W . Biederman, Alexander Viro,
	Mattias Nissler, Aleksa Sarai, Andrei Vagin, linux-api, lkml



On 14.07.2021 17:43, Christian Brauner wrote:
> On Wed, Jul 14, 2021 at 04:27:53PM +0300, Pavel Tikhomirov wrote:
>> Previously a sharing group (shared and master ids pair) can be only
>> inherited when mount is created via bindmount. This patch adds an
>> ability to add an existing private mount into an existing sharing group.
>>
>> With this functionality one can first create the desired mount tree from
>> only private mounts (without the need to care about undesired mount
>> propagation or mount creation order implied by sharing group
>> dependencies), and next then setup any desired mount sharing between
>> those mounts in tree as needed.
>>
>> This allows CRIU to restore any set of mount namespaces, mount trees and
>> sharing group trees for a container.
>>
>> We have many issues with restoring mounts in CRIU related to sharing
>> groups and propagation:
>> - reverse sharing groups vs mount tree order requires complex mounts
>>    reordering which mostly implies also using some temporary mounts
>> (please see https://lkml.org/lkml/2021/3/23/569 for more info)
>>
>> - mount() syscall creates tons of mounts due to propagation
>> - mount re-parenting due to propagation
>> - "Mount Trap" due to propagation
>> - "Non Uniform" propagation, meaning that with different tricks with
>>    mount order and temporary children-"lock" mounts one can create mount
>>    trees which can't be restored without those tricks
>> (see https://www.linuxplumbersconf.org/event/7/contributions/640/)
>>
>> With this new functionality we can resolve all the problems with
>> propagation at once.
>>
>> Link: https://lore.kernel.org/r/20210714132754.94633-1-ptikhomirov@virtuozzo.com
>> Cc: Eric W. Biederman <ebiederm@xmission.com>
>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>> Cc: Christian Brauner <christian.brauner@ubuntu.com>
>> Cc: Mattias Nissler <mnissler@chromium.org>
>> Cc: Aleksa Sarai <cyphar@cyphar.com>
>> Cc: Andrei Vagin <avagin@gmail.com>
>> Cc: linux-fsdevel@vger.kernel.org
>> Cc: linux-api@vger.kernel.org
>> Cc: lkml <linux-kernel@vger.kernel.org>
>> Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
>> ---
>> This is a rework of "mnt: allow to add a mount into an existing group"
>> patch from Andrei. https://lkml.org/lkml/2017/4/28/20
>>
>> The new do_set_group is similar to do_move_mount, but with some restrictions
>> of do_move_mount removed, and here is why:
>>
>> 1) Allow "cross-namespace" sharing group set. If we allow operation only
>> with mounts from current+anon mount namespace one would still be able to
>> setns(from_mntns) + open_tree(from, OPEN_TREE_CLONE) + setns(to_mntns) +
>> move_mount(anon, to, MOVE_MOUNT_SET_GROUP) to set sharing group to mount
>> in different mount namespace with source mount. But with this approach
>> we would need to create anon mount namespace and mount copy each time,
>> which is just a waste of resources. So instead let's just check if we are
>> allowed to modify both mount namespaces (which looks equivalent to what
>> setns-es and open_tree check).
>>
>> 2) Skip checks which only apply to actually moving a mount, which we have in
>> do_move_mount and open_tree. We don't need to check for MNT_LOCKED,
>> d_is_dir matching, unbindable, nsfs loops and ancestor relation as we
>> don't move mounts.
>>
>> Also let's add some new checks:
>>
>> 1) Don't allow copying sharing from a mount with a narrow root to a wider
>> root, so that the user cannot receive more propagations than the
>> user already has. (proposed by Andrei)
>>
>> 2) Don't allow to copy sharing from mount with locked children for the
>> same reason, as user shouldn't see propagations to areas overmounted by
>> locked mounts (if the user could not already do it before sharing
>> adjustment).
>>
>> 3) If From is private for MOVE_MOUNT_SET_GROUP let's report an error
>> instead of just doing nothing, so that the user knows that there is
>> probably some logical usage error. (proposed by Christian)
>>
>> Security note: there would be no (new) loops in sharing groups tree,
>> because this new move_mount(MOVE_MOUNT_SET_GROUP) operation only adds
>> one _private_ mount to one group (without moving between groups), the
>> sharing groups tree itself stays unchanged after it.
>>
>> In Virtuozzo we have a "mount-v2" implementation, based on the original
>> kernel patch from Andrei, tested for almost a year, and it actually
>> decreased the number of bugs with mounts a lot. One can take a look at the
>> implementation of sharing group restore in CRIU in "mount-v2" here:
>>
>> https://src.openvz.org/projects/OVZ/repos/criu/browse/criu/mount-v2.c#898
>>
>> This works almost the same with the current version of the patch if we replace
>> mount(MS_SET_GROUP) with move_mount(MOVE_MOUNT_SET_GROUP); please see the
>> super-draft port for mainstream criu, which at least passes
>> non-user-namespaced mount tests (zdtm.py --mounts-v2 -f ns).
>>
>> https://github.com/Snorch/criu/commits/mount-v2-poc
>>
>> v2: Solve the problem mentioned by Andrei:
>> - check mnt_root of "to" is in the sub-tree of mnt_root of "from"
>> - also check "from" has no locked mounts in subroot of "to"
>> v3: Add checks:
>> - check paths to be mount root dentries
>> - return EINVAL if From is private (no sharing to copy)
>>
>> ---
>>   fs/namespace.c             | 75 +++++++++++++++++++++++++++++++++++++-
>>   include/uapi/linux/mount.h |  3 +-
>>   2 files changed, 76 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/namespace.c b/fs/namespace.c
>> index ab4174a3c802..a7828e695e03 100644
>> --- a/fs/namespace.c
>> +++ b/fs/namespace.c
>> @@ -2684,6 +2684,76 @@ static bool check_for_nsfs_mounts(struct mount *subtree)
>>   	return ret;
>>   }
>>   
>> +static int do_set_group(struct path *from_path, struct path *to_path)
>> +{
>> +	struct mount *from, *to;
>> +	int err;
>> +
>> +	from = real_mount(from_path->mnt);
>> +	to = real_mount(to_path->mnt);
>> +
>> +	namespace_lock();
>> +
>> +	err = -EINVAL;
>> +	/* To and From must be mounted */
>> +	if (!is_mounted(&from->mnt))
>> +		goto out;
>> +	if (!is_mounted(&to->mnt))
>> +		goto out;
>> +
>> +	err = -EPERM;
>> +	/* We should be allowed to modify mount namespaces of both mounts */
>> +	if (!ns_capable(from->mnt_ns->user_ns, CAP_SYS_ADMIN))
>> +		goto out;
>> +	if (!ns_capable(to->mnt_ns->user_ns, CAP_SYS_ADMIN))
>> +		goto out;
>> +
>> +	err = -EINVAL;
>> +	/* To and From paths should be mount roots */
>> +	if (from_path->dentry != from_path->mnt->mnt_root)
>> +		goto out;
>> +	if (to_path->dentry != to_path->mnt->mnt_root)
>> +		goto out;
>> +
>> +	/* Setting sharing groups is only allowed across same superblock */
>> +	if (from->mnt.mnt_sb != to->mnt.mnt_sb)
>> +		goto out;
>> +
>> +	/* From mount root should be wider than To mount root */
>> +	if (!is_subdir(to->mnt.mnt_root, from->mnt.mnt_root))
>> +		goto out;
>> +
>> +	/* From mount should not have locked children in place of To's root */
>> +	if (has_locked_children(from, to->mnt.mnt_root))
>> +		goto out;
>> +
>> +	/* Setting sharing groups is only allowed on private mounts */
>> +	if (IS_MNT_SHARED(to) || IS_MNT_SLAVE(to))
>> +		goto out;
>> +
>> +	/* From should not be private */
>> +	if (!IS_MNT_SHARED(from) && !IS_MNT_SLAVE(from))
>> +		goto out;
>> +
>> +	if (IS_MNT_SLAVE(from)) {
>> +		struct mount *m = from->mnt_master;
>> +
>> +		list_add(&to->mnt_slave, &m->mnt_slave_list);
>> +		to->mnt_master = m;
>> +	}
>> +
>> +	if (IS_MNT_SHARED(from)) {
>> +		to->mnt_group_id = from->mnt_group_id;
>> +		list_add(&to->mnt_share, &from->mnt_share);
>> +		set_mnt_shared(to);
> 
> Thanks for the updated patch.
> 
> Note, set_mnt_shared() writes to the vfsmount so this should be:
> 
> if (IS_MNT_SHARED(from)) {
> 	to->mnt_group_id = from->mnt_group_id;
> 	list_add(&to->mnt_share, &from->mnt_share);
> 	lock_mount_hash();
> 	set_mnt_shared(to);
> 	unlock_mount_hash();
> }

Thanks for pointing that out! Sent v4 to fix it.
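
For readers following the thread from userspace, here is a minimal sketch of
how the new flag might be invoked, assuming a kernel that carries this patch.
The helper name set_mount_group() is hypothetical. The flag value is copied
from the uapi hunk quoted below, and the fallback syscall number 429 is an
assumption that holds for x86-64 and the asm-generic table but not for every
architecture.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>        /* AT_FDCWD */
#include <sys/syscall.h>  /* SYS_move_mount, when the libc headers provide it */
#include <unistd.h>       /* syscall() */

/* Flag value taken from the include/uapi/linux/mount.h hunk in this patch. */
#ifndef MOVE_MOUNT_SET_GROUP
#define MOVE_MOUNT_SET_GROUP 0x00000100
#endif

/* Assumption: 429 is the number on x86-64 and asm-generic architectures. */
#ifndef SYS_move_mount
#define SYS_move_mount 429
#endif

/*
 * Hypothetical helper: ask the kernel to put the private mount at `to`
 * into the sharing group of the mount at `from`.
 *
 * Per the patch, both paths must be mount roots on the same superblock,
 * `to` must be private and `from` must be shared or slave.
 *
 * Returns 0 on success or a negative errno value on failure.
 */
static int set_mount_group(const char *from, const char *to)
{
	long ret = syscall(SYS_move_mount, AT_FDCWD, from, AT_FDCWD, to,
			   MOVE_MOUNT_SET_GROUP);

	return ret < 0 ? -errno : 0;
}
```

On a kernel without the patch the flag falls outside MOVE_MOUNT__MASK, so a
privileged caller would get -EINVAL rather than an actual move.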

> 
>> +	}
>> +
>> +	err = 0;
>> +out:
>> +	namespace_unlock();
>> +	return err;
>> +}
>> +
>>   static int do_move_mount(struct path *old_path, struct path *new_path)
>>   {
>>   	struct mnt_namespace *ns;
>> @@ -3669,7 +3739,10 @@ SYSCALL_DEFINE5(move_mount,
>>   	if (ret < 0)
>>   		goto out_to;
>>   
>> -	ret = do_move_mount(&from_path, &to_path);
>> +	if (flags & MOVE_MOUNT_SET_GROUP)
>> +		ret = do_set_group(&from_path, &to_path);
>> +	else
>> +		ret = do_move_mount(&from_path, &to_path);
>>   
>>   out_to:
>>   	path_put(&to_path);
>> diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
>> index dd7a166fdf9c..4d93967f8aea 100644
>> --- a/include/uapi/linux/mount.h
>> +++ b/include/uapi/linux/mount.h
>> @@ -73,7 +73,8 @@
>>   #define MOVE_MOUNT_T_SYMLINKS		0x00000010 /* Follow symlinks on to path */
>>   #define MOVE_MOUNT_T_AUTOMOUNTS		0x00000020 /* Follow automounts on to path */
>>   #define MOVE_MOUNT_T_EMPTY_PATH		0x00000040 /* Empty to path permitted */
>> -#define MOVE_MOUNT__MASK		0x00000077
>> +#define MOVE_MOUNT_SET_GROUP		0x00000100 /* Set sharing group instead */
>> +#define MOVE_MOUNT__MASK		0x00000177
>>   
>>   /*
>>    * fsopen() flags.
>> -- 
>> 2.31.1

-- 
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.
