linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v7 1/2] fork: extend clone3() to support setting a PID
@ 2019-11-11 13:17 Adrian Reber
  2019-11-11 13:17 ` [PATCH v7 2/2] selftests: add tests for clone3() Adrian Reber
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Adrian Reber @ 2019-11-11 13:17 UTC (permalink / raw)
  To: Christian Brauner, Eric Biederman, Pavel Emelyanov, Jann Horn,
	Oleg Nesterov, Dmitry Safonov
  Cc: linux-kernel, Andrei Vagin, Mike Rapoport, Radostin Stoyanov,
	Adrian Reber

The main motivation to add set_tid to clone3() is CRIU.

To restore a process with the same PID/TID CRIU currently uses
/proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
ns_last_pid and then (quickly) does a clone(). This works most of the
time, but it is racy. It is also slow as it requires multiple syscalls.

Extending clone3() to support *set_tid makes it possible restore a
process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
race free (as long as the desired PID/TID is available).

This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
on clone3() with *set_tid as they are currently in place for ns_last_pid.

The original version of this change was using a single value for
set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
decided to change set_tid to an array to enable setting the PID of a
process in multiple PID namespaces at the same time. If a process is
created in a PID namespace it is possible to influence the PID inside
and outside of the PID namespace. Details also in the corresponding
selftest.

To create a process with the following PIDs:

      PID NS level         Requested PID
        0 (host)              31496
        1                        42
        2                         1

For that example the two newly introduced parameters to struct
clone_args (set_tid and set_tid_size) would need to be:

  set_tid[0] = 1;
  set_tid[1] = 42;
  set_tid[2] = 31496;
  set_tid_size = 3;

If only the PIDs of the two innermost nested PID namespaces should be
defined it would look like this:

  set_tid[0] = 1;
  set_tid[1] = 42;
  set_tid_size = 2;

The PID of the newly created process would then be the next available
free PID in the PID namespace level 0 (host) and 42 in the PID namespace
at level 1 and the PID of the process in the innermost PID namespace
would be 1.

The set_tid array is used to specify the PID of a process starting
from the innermost nested PID namespaces up to set_tid_size PID namespaces.

set_tid_size cannot be larger then the current PID namespace level.

Signed-off-by: Adrian Reber <areber@redhat.com>
---
v2:
 - Removed (size < sizeof(struct clone_args)) as discussed with
   Christian and Dmitry
 - Added comment to ((set_tid != 1) && idr_get_cursor() <= 1) (Oleg)
 - Use idr_alloc() instead of idr_alloc_cyclic() (Oleg)

v3:
 - Return EEXIST if PID is already in use (Christian)
 - Drop CLONE_SET_TID (Christian and Oleg)
 - Use idr_is_empty() instead of idr_get_cursor() (Oleg)
 - Handle different `struct clone_args` sizes (Dmitry)

v4:
 - Rework struct size check with defines (Christian)
 - Reduce number of set_tid checks (Oleg)
 - Less parentheses and more robust code (Oleg)
 - Do ns_capable() on correct user_ns (Oleg, Christian)

v5:
 - make set_tid checks earlier in alloc_pid() (Christian)
 - remove unnecessary comment and struct size check (Christian)

v6:
 - remove CLONE_SET_TID from description (Christian)
 - add clone3() tests for different clone_args sizes (Christian)
 - move more set_tid checks to alloc_pid() (Oleg)
 - make all set_tid checks lockless (Oleg)

v7:
 - changed set_tid to be an array to set the PID of a process
   in multiple nested PID namespaces at the same time as discussed
   at LPC 2019 (container MC)
---
 include/linux/pid.h           |  3 ++-
 include/linux/pid_namespace.h |  2 ++
 include/linux/sched/task.h    |  3 +++
 include/uapi/linux/sched.h    |  2 ++
 kernel/fork.c                 | 20 +++++++++++++-
 kernel/pid.c                  | 50 ++++++++++++++++++++++++++++++-----
 kernel/pid_namespace.c        |  2 --
 7 files changed, 71 insertions(+), 11 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 9645b1194c98..034b7df25888 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -120,7 +120,8 @@ extern struct pid *find_vpid(int nr);
 extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
+			     size_t set_tid_size);
 extern void free_pid(struct pid *pid);
 extern void disable_pid_allocation(struct pid_namespace *ns);
 
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 49538b172483..2ed6af88794b 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -12,6 +12,8 @@
 #include <linux/ns_common.h>
 #include <linux/idr.h>
 
+/* MAX_PID_NS_LEVEL is needed for limiting size of 'struct pid' */
+#define MAX_PID_NS_LEVEL 32
 
 struct fs_pin;
 
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 4b1c3b664f51..f1879884238e 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -26,6 +26,9 @@ struct kernel_clone_args {
 	unsigned long stack;
 	unsigned long stack_size;
 	unsigned long tls;
+	pid_t *set_tid;
+	/* Number of elements in *set_tid */
+	size_t set_tid_size;
 };
 
 /*
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 25b4fa00bad1..13f74c40a629 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -72,6 +72,8 @@ struct clone_args {
 	__aligned_u64 stack;
 	__aligned_u64 stack_size;
 	__aligned_u64 tls;
+	__aligned_u64 set_tid;
+	__aligned_u64 set_tid_size;
 };
 #endif
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 55af6931c6ec..c70b9224997d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2026,7 +2026,8 @@ static __latent_entropy struct task_struct *copy_process(
 	stackleak_task_init(p);
 
 	if (pid != &init_struct_pid) {
-		pid = alloc_pid(p->nsproxy->pid_ns_for_children);
+		pid = alloc_pid(p->nsproxy->pid_ns_for_children, args->set_tid,
+				args->set_tid_size);
 		if (IS_ERR(pid)) {
 			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_thread;
@@ -2527,6 +2528,7 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
 					      struct clone_args __user *uargs,
 					      size_t usize)
 {
+	int i;
 	int err;
 	struct clone_args args;
 
@@ -2539,6 +2541,9 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
 	if (err)
 		return err;
 
+	if (unlikely(args.set_tid_size > MAX_PID_NS_LEVEL))
+		return -E2BIG;
+
 	/*
 	 * Verify that higher 32bits of exit_signal are unset and that
 	 * it is a valid signal
@@ -2556,8 +2561,17 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
 		.stack		= args.stack,
 		.stack_size	= args.stack_size,
 		.tls		= args.tls,
+		.set_tid	= kargs->set_tid,
+		.set_tid_size	= args.set_tid_size,
 	};
 
+	for (i = 0; i < args.set_tid_size; i++) {
+		if (copy_from_user(&kargs->set_tid[i],
+		    u64_to_user_ptr(args.set_tid + (i * sizeof(args.set_tid))),
+		    sizeof(pid_t)))
+			return -EFAULT;
+	}
+
 	return 0;
 }
 
@@ -2631,6 +2645,10 @@ SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
 	int err;
 
 	struct kernel_clone_args kargs;
+	pid_t set_tid[MAX_PID_NS_LEVEL];
+
+	memset(set_tid, 0, sizeof(set_tid));
+	kargs.set_tid = set_tid;
 
 	err = copy_clone_args_from_user(&kargs, uargs, size);
 	if (err)
diff --git a/kernel/pid.c b/kernel/pid.c
index 0a9f2e437217..82b3f91c131c 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -157,7 +157,8 @@ void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
+		      size_t set_tid_size)
 {
 	struct pid *pid;
 	enum pid_type type;
@@ -166,6 +167,17 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 	struct upid *upid;
 	int retval = -ENOMEM;
 
+	/*
+	 * set_tid_size contains the size of the set_tid array. Starting at
+	 * the most nested currently active PID namespace it tells alloc_pid()
+	 * which PID to set for a process in that most nested PID namespace
+	 * up to set_tid_size PID namespaces. It does not have to set the PID
+	 * for a process in all nested PID namespaces but set_tid_size must
+	 * never be greater than the current ns->level + 1.
+	 */
+	if (set_tid_size > ns->level + 1)
+		return ERR_PTR(-EINVAL);
+
 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
 	if (!pid)
 		return ERR_PTR(retval);
@@ -175,6 +187,18 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 	for (i = ns->level; i >= 0; i--) {
 		int pid_min = 1;
+		int t_pos = 0;
+
+		if (set_tid_size) {
+			t_pos = - (i - ns->level);
+			if (set_tid[t_pos] < 1 || set_tid[t_pos] >= pid_max)
+				return ERR_PTR(-EINVAL);
+			/* Also fail if a PID != 1 is requested and no PID 1 exists */
+			if (set_tid[t_pos] != 1 && !tmp->child_reaper)
+				return ERR_PTR(-EINVAL);
+			if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
+				return ERR_PTR(-EPERM);
+		}
 
 		idr_preload(GFP_KERNEL);
 		spin_lock_irq(&pidmap_lock);
@@ -186,12 +210,24 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 		if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
 			pid_min = RESERVED_PIDS;
 
-		/*
-		 * Store a null pointer so find_pid_ns does not find
-		 * a partially initialized PID (see below).
-		 */
-		nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
-				      pid_max, GFP_ATOMIC);
+		if (set_tid_size) {
+			nr = idr_alloc(&tmp->idr, NULL, set_tid[t_pos],
+				       set_tid[t_pos] + 1, GFP_ATOMIC);
+			/*
+			 * If ENOSPC is returned it means that the PID is
+			 * alreay in use. Return EEXIST in that case.
+			 */
+			if (nr == -ENOSPC)
+				nr = -EEXIST;
+			set_tid_size--;
+		} else {
+			/*
+			 * Store a null pointer so find_pid_ns does not find
+			 * a partially initialized PID (see below).
+			 */
+			nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
+					      pid_max, GFP_ATOMIC);
+		}
 		spin_unlock_irq(&pidmap_lock);
 		idr_preload_end();
 
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index a6a79f85c81a..d40017e79ebe 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -26,8 +26,6 @@
 
 static DEFINE_MUTEX(pid_caches_mutex);
 static struct kmem_cache *pid_ns_cachep;
-/* MAX_PID_NS_LEVEL is needed for limiting size of 'struct pid' */
-#define MAX_PID_NS_LEVEL 32
 /* Write once array, filled from the beginning. */
 static struct kmem_cache *pid_cache[MAX_PID_NS_LEVEL];
 
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v7 2/2] selftests: add tests for clone3()
  2019-11-11 13:17 [PATCH v7 1/2] fork: extend clone3() to support setting a PID Adrian Reber
@ 2019-11-11 13:17 ` Adrian Reber
  2019-11-11 15:25 ` [PATCH v7 1/2] fork: extend clone3() to support setting a PID Oleg Nesterov
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Adrian Reber @ 2019-11-11 13:17 UTC (permalink / raw)
  To: Christian Brauner, Eric Biederman, Pavel Emelyanov, Jann Horn,
	Oleg Nesterov, Dmitry Safonov
  Cc: linux-kernel, Andrei Vagin, Mike Rapoport, Radostin Stoyanov,
	Adrian Reber

This tests clone3() with *set_tid to see if all desired PIDs are working
as expected. The tests are trying multiple invalid input parameters as
well as creating processes while specifying a certain PID in multiple
PID namespaces at the same time.

Signed-off-by: Adrian Reber <areber@redhat.com>
---
 tools/testing/selftests/clone3/.gitignore     |   1 +
 tools/testing/selftests/clone3/Makefile       |   2 +-
 .../testing/selftests/clone3/clone3_set_tid.c | 345 ++++++++++++++++++
 3 files changed, 347 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/clone3/clone3_set_tid.c

diff --git a/tools/testing/selftests/clone3/.gitignore b/tools/testing/selftests/clone3/.gitignore
index 85d9d3ba2524..d56c3c49d869 100644
--- a/tools/testing/selftests/clone3/.gitignore
+++ b/tools/testing/selftests/clone3/.gitignore
@@ -1 +1,2 @@
 clone3
+clone3_set_tid
diff --git a/tools/testing/selftests/clone3/Makefile b/tools/testing/selftests/clone3/Makefile
index ea922c014ae4..2d292545ca8e 100644
--- a/tools/testing/selftests/clone3/Makefile
+++ b/tools/testing/selftests/clone3/Makefile
@@ -2,6 +2,6 @@
 
 CFLAGS += -I../../../../usr/include/
 
-TEST_GEN_PROGS := clone3
+TEST_GEN_PROGS := clone3 clone3_set_tid
 
 include ../lib.mk
diff --git a/tools/testing/selftests/clone3/clone3_set_tid.c b/tools/testing/selftests/clone3/clone3_set_tid.c
new file mode 100644
index 000000000000..51196901ccc4
--- /dev/null
+++ b/tools/testing/selftests/clone3/clone3_set_tid.c
@@ -0,0 +1,345 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Based on Christian Brauner's clone3() example.
+ * These tests are assuming to be running in the host's
+ * PID namespace.
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <linux/types.h>
+#include <linux/sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/un.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include <sched.h>
+
+#include "../kselftest.h"
+
+#ifndef MAX_PID_NS_LEVEL
+#define MAX_PID_NS_LEVEL 32
+#endif
+
+static int pipe_1[2];
+static int pipe_2[2];
+
+static pid_t raw_clone(struct clone_args *args)
+{
+	return syscall(__NR_clone3, args, sizeof(struct clone_args));
+}
+
+static int call_clone3_set_tid(__u64 *set_tid,
+			       size_t set_tid_size,
+			       int flags,
+			       int expected_pid,
+			       int wait_for_it)
+{
+	int status;
+	int ret = 0;
+	pid_t pid = -1;
+	struct clone_args args = {0};
+
+	args.flags = flags;
+	args.exit_signal = SIGCHLD;
+	args.set_tid = (__u64)set_tid;
+	args.set_tid_size = set_tid_size;
+
+	pid = raw_clone(&args);
+	if (pid < 0) {
+		ksft_print_msg("%s - Failed to create new process\n",
+			       strerror(errno));
+		return -errno;
+	}
+
+	if (pid == 0) {
+		char tmp = 0;
+		ksft_print_msg("I am the child, my PID is %d (expected %d)\n",
+			       getpid(), set_tid[0]);
+		if (wait_for_it) {
+			ksft_print_msg("[%d] Child is ready and waiting\n", getpid());
+			/* Signal the parent that the child is ready */
+			close(pipe_1[0]);
+			write(pipe_1[1], &tmp, 1);
+			close(pipe_1[1]);
+			close(pipe_2[1]);
+			read(pipe_2[0], &tmp, 1);
+			close(pipe_2[0]);
+		}
+
+		if (set_tid[0] != getpid())
+			_exit(EXIT_FAILURE);
+		_exit(EXIT_SUCCESS);
+	}
+
+	if (expected_pid == 0 || expected_pid == pid)
+		ksft_print_msg("I am the parent (%d). My child's pid is %d\n",
+			       getpid(), pid);
+	else {
+		ksft_print_msg(
+			"Expected child pid %d does not match actual pid %d\n",
+			expected_pid, pid);
+		ret = -1;
+	}
+
+	if (wait(&status) < 0) {
+		ksft_print_msg("Child returned %s\n", strerror(errno));
+		return -errno;
+	}
+	if (WEXITSTATUS(status))
+		return WEXITSTATUS(status);
+
+	return ret;
+}
+
+static void test_clone3_set_tid(__u64 *set_tid,
+				size_t set_tid_size,
+				int flags,
+				int expected,
+				int expected_pid,
+				int wait_for_it)
+{
+	int ret;
+
+	ksft_print_msg(
+		"[%d] Trying clone3() with CLONE_SET_TID to %d and 0x%x\n",
+		getpid(), set_tid[0], flags);
+	ret = call_clone3_set_tid(set_tid, set_tid_size, flags, expected_pid,
+				  wait_for_it);
+	ksft_print_msg(
+		"[%d] clone3() with CLONE_SET_TID %d says :%d - expected %d\n",
+		getpid(), set_tid[0], ret, expected);
+	if (ret != expected)
+		ksft_test_result_fail(
+			"[%d] Result (%d) is different than expected (%d)\n",
+			getpid(), ret, expected);
+	else
+		ksft_test_result_pass("[%d] Result (%d) matches expectation (%d)\n",
+			getpid(), ret, expected);
+}
+int main(int argc, char *argv[])
+{
+	FILE *f;
+	char buf;
+	pid_t pid;
+	pid_t ns1;
+	pid_t ns2;
+	pid_t ns3;
+	int status;
+	char *proc;
+	int ret = -1;
+	pid_t ns_pid;
+	int pid_max = 0;
+	uid_t uid = getuid();
+	char line[1024] = {0};
+	__aligned_u64 set_tid[MAX_PID_NS_LEVEL * 2];
+	__aligned_u64 set_tid_small[1];
+
+	if (pipe(pipe_1) == -1 || pipe(pipe_2))
+		 ksft_exit_fail_msg("pipe() failed\n");
+
+	ksft_print_header();
+	ksft_set_plan(27);
+
+	f = fopen("/proc/sys/kernel/pid_max", "r");
+	if (f == NULL)
+		ksft_exit_fail_msg(
+			"%s - Could not open /proc/sys/kernel/pid_max\n",
+			strerror(errno));
+	fscanf(f, "%d", &pid_max);
+	fclose(f);
+	ksft_print_msg("/proc/sys/kernel/pid_max %d\n", pid_max);
+
+	/* Try invalid settings */
+	memset(&set_tid, 0, sizeof(set_tid));
+	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL + 1, 0, -E2BIG, 0, 0);
+	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 2, 0, -E2BIG, 0, 0);
+	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 2 + 1, 0, -E2BIG, 0, 0);
+	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 42, 0, -E2BIG, 0, 0);
+
+	/* small set_tid array, but maximum set_tid_size */
+	/* Find the current active PID */
+	pid = fork();
+	if (pid == 0) {
+		ksft_print_msg("Child has PID %d\n", getpid());
+		_exit(EXIT_SUCCESS);
+	}
+	(void)wait(NULL);
+	/* After the child has finished, its PID should be free. */
+	set_tid_small[0] = pid;
+	/*
+	 * There is a chance that this can return -EFAULT as the actual
+	 * set_tid array has only one entry, but we are telling the kernel
+	 * that it has the size MAX_PID_NS_LEVEL. This could lead to a
+	 * situation where copy_from_user() fails. So far it always
+	 * succeeds and copies random data (whatever is after set_tid_small).
+	 */
+	test_clone3_set_tid(set_tid_small, MAX_PID_NS_LEVEL, 0, -EINVAL, 0, 0);
+
+	/*
+	 * This can actually work if this test running in a MAX_PID_NS_LEVEL - 1
+	 * nested PID namespace.
+	 */
+	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL - 1, 0, -EINVAL, 0, 0);
+
+	memset(&set_tid, 0xff, sizeof(set_tid));
+	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL + 1, 0, -E2BIG, 0, 0);
+	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 2, 0, -E2BIG, 0, 0);
+	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 2 + 1, 0, -E2BIG, 0, 0);
+	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 42, 0, -E2BIG, 0, 0);
+	/*
+	 * This can actually work if this test running in a MAX_PID_NS_LEVEL - 1
+	 * nested PID namespace.
+	 */
+	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL - 1, 0, -EINVAL, 0, 0);
+
+	memset(&set_tid, 0, sizeof(set_tid));
+	/* Try with an invalid PID */
+	set_tid[0] = 0;
+	test_clone3_set_tid(set_tid, 1, 0, -EINVAL, 0, 0);
+	set_tid[0] = -1;
+	test_clone3_set_tid(set_tid, 1, 0, -EINVAL, 0, 0);
+	/* Claim that the set_tid array actually contains 2 elements. */
+	test_clone3_set_tid(set_tid, 2, 0, -EINVAL, 0, 0);
+	/* Try it in a new PID namespace */
+	if (uid == 0)
+		test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EINVAL, 0, 0);
+	else
+		test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EPERM, 0, 0);
+
+	/*
+	 * Try with a valid PID (1) but as non-root. This should fail
+	 * with -EPERM if running in the initial user namespace.
+	 * As root it should tell us -EEXIST.
+	 */
+	set_tid[0] = 1;
+	if (uid == 0)
+		test_clone3_set_tid(set_tid, 1, 0, -EEXIST, 0, 0);
+	else
+		test_clone3_set_tid(set_tid, 1, 0, -EPERM, 0, 0);
+
+	/* Try it in a new PID namespace */
+	if (uid == 0)
+		test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, 0, 0, 0);
+	else
+		test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EPERM, 0, 0);
+
+	/* pid_max should fail everywhere */
+	set_tid[0] = pid_max;
+	test_clone3_set_tid(set_tid, 1, 0, -EINVAL, 0, 0);
+	if (uid == 0)
+		test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EINVAL, 0, 0);
+	else
+		test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EPERM, 0, 0);
+
+	if (uid != 0) {
+		/*
+		 * All remaining tests require root. Tell the framework
+		 * that all those tests are skipped as non-root.
+		 */
+		ksft_cnt.ksft_xskip += ksft_plan - ksft_test_num();
+		goto out;
+	}
+
+	/* Find the current active PID */
+	pid = fork();
+	if (pid == 0) {
+		ksft_print_msg("Child has PID %d\n", getpid());
+		usleep(500);
+		_exit(EXIT_SUCCESS);
+	}
+	(void)wait(NULL);
+	/* After the child has finished, its PID should be free. */
+	set_tid[0] = pid;
+	test_clone3_set_tid(set_tid, 1, 0, 0, 0, 0);
+	/* This should fail as there is no PID 1 in that namespace */
+	test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EINVAL, 0, 0);
+	set_tid[0] = 1;
+	set_tid[1] = pid;
+	test_clone3_set_tid(set_tid, 2, CLONE_NEWPID, 0, pid, 0);
+
+	ksft_print_msg("unshare PID namespace\n");
+	unshare(CLONE_NEWPID);
+	set_tid[0] = pid;
+	/* This should fail as there is no PID 1 in that namespace */
+	test_clone3_set_tid(set_tid, 1, 0, -EINVAL, 0, 0);
+
+	/* Let's create a PID 1 */
+	ns_pid = fork();
+	if (ns_pid == 0) {
+		ksft_print_msg("Child in PID namespace has PID %d\n", getpid());
+		set_tid[0] = 2;
+		test_clone3_set_tid(set_tid, 1, 0, 0, 2, 0);
+		set_tid[0] = 1;
+		set_tid[1] = 42;
+		set_tid[2] = pid;
+		/*
+		 * This should fail as there are not enough active PID
+		 * namespaces. Again assuming this is running in the host's
+		 * PID namespace. Not yet nested.
+		 */
+		test_clone3_set_tid(set_tid, 4, CLONE_NEWPID, -EINVAL, 0, 0);
+		/*
+		 * This should work and from the parent we should see
+		 * something like 'NSpid:	pid	42	1'.
+		 */
+		test_clone3_set_tid(set_tid, 3, CLONE_NEWPID, 0, 42, 1);
+		_exit(ksft_cnt.ksft_pass);
+	}
+
+	close(pipe_1[1]);
+	close(pipe_2[0]);
+	while (read(pipe_1[0], &buf, 1) > 0) {
+		ksft_print_msg("[%d] Child is ready and waiting\n", getpid());
+		break;
+	}
+
+	asprintf(&proc, "/proc/%d/status", pid);
+	f = fopen(proc, "r");
+	if (f == NULL)
+		ksft_exit_fail_msg(
+			"%s - Could not open %s\n",
+			strerror(errno), proc);
+	while (fgets(line, 1024, f)) {
+		if (strstr(line, "NSpid")) {
+			/* Verify that all generated PIDs are as expected. */
+			sscanf(line, "NSpid:\t%d\t%d\t%d", &ns3, &ns2, &ns1);
+			break;
+		}
+	}
+	fclose(f);
+	free(proc);
+	close(pipe_2[0]);
+	/* Tell the clone3()'d child to finish. */
+	write(pipe_2[1], &buf, 1);
+	close(pipe_2[1]);
+
+	if (wait(&status) < 0) {
+		ksft_print_msg("Child returned %s\n", strerror(errno));
+		ret = -errno;
+		goto out;
+	}
+	if (WEXITSTATUS(status))
+		/*
+		 * Update the number of total tests with the tests from the
+		 * child processes.
+		 */
+		ksft_cnt.ksft_pass = WEXITSTATUS(status);
+
+	if (ns3 == pid && ns2 == 42 && ns1 == 1)
+		ksft_test_result_pass(
+			"PIDs in all namespaces as expected (%d,%d,%d)\n",
+			ns3, ns2, ns1);
+	else
+		ksft_test_result_fail(
+			"PIDs in all namespaces not as expected (%d,%d,%d)\n",
+			ns3, ns2, ns1);
+out:
+	ret = 0;
+
+	return !ret ? ksft_exit_pass() : ksft_exit_fail();
+}
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v7 1/2] fork: extend clone3() to support setting a PID
  2019-11-11 13:17 [PATCH v7 1/2] fork: extend clone3() to support setting a PID Adrian Reber
  2019-11-11 13:17 ` [PATCH v7 2/2] selftests: add tests for clone3() Adrian Reber
@ 2019-11-11 15:25 ` Oleg Nesterov
  2019-11-11 15:40   ` Adrian Reber
  2019-11-11 16:12 ` Christian Brauner
  2019-11-11 20:41 ` Rasmus Villemoes
  3 siblings, 1 reply; 12+ messages in thread
From: Oleg Nesterov @ 2019-11-11 15:25 UTC (permalink / raw)
  To: Adrian Reber
  Cc: Christian Brauner, Eric Biederman, Pavel Emelyanov, Jann Horn,
	Dmitry Safonov, linux-kernel, Andrei Vagin, Mike Rapoport,
	Radostin Stoyanov

On 11/11, Adrian Reber wrote:
>
> v7:
>  - changed set_tid to be an array to set the PID of a process
>    in multiple nested PID namespaces at the same time as discussed
>    at LPC 2019 (container MC)

cough... iirc you convinced me this is not needed when we discussed
the previous version ;) Nevermind, probably my memory fools me.

So far I only have some cosmetic nits,

> @@ -175,6 +187,18 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>
>  	for (i = ns->level; i >= 0; i--) {
>  		int pid_min = 1;
> +		int t_pos = 0;
                    ^^^^^

I won't insist, but I'd suggest to cache set_tid[t_pos] instead to make
the code a bit more simple.

> @@ -186,12 +210,24 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>  		if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
>  			pid_min = RESERVED_PIDS;

You can probably move this code into the "else" branch below.

IOW, something like


	for (i = ns->level; i >= 0; i--) {
		int xxx = 0;

		if (set_tid_size) {
			int pos = ns->level - i;

			xxx = set_tid[pos];
			if (xxx < 1 || xxx >= pid_max)
				return ERR_PTR(-EINVAL);
			/* Also fail if a PID != 1 is requested and no PID 1 exists */
			if (xxx != 1 && !tmp->child_reaper)
				return ERR_PTR(-EINVAL);
			if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
				return ERR_PTR(-EPERM);
			set_tid_size--;
		}

		idr_preload(GFP_KERNEL);
		spin_lock_irq(&pidmap_lock);

		if (xxx) {
			nr = idr_alloc(&tmp->idr, NULL, xxx, xxx + 1,
					GFP_ATOMIC);
			/*
			 * If ENOSPC is returned it means that the PID is
			 * alreay in use. Return EEXIST in that case.
			 */
			if (nr == -ENOSPC)
				nr = -EEXIST;
		} else {
			int pid_min = 1;
			/*
			 * init really needs pid 1, but after reaching the
			 * maximum wrap back to RESERVED_PIDS
			 */
			if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
				pid_min = RESERVED_PIDS;
			/*
			 * Store a null pointer so find_pid_ns does not find
			 * a partially initialized PID (see below).
			 */
			nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
					      pid_max, GFP_ATOMIC);
		}

		...

This way only the "if (set_tid_size)" block has to play with set_tid_size/set_tid.

note also that this way we can easily allow set_tid[some_level] == 0, we can
simply do

	-	if (xxx < 1 || xxx >= pid_max)
	+	if (xxx < 0 || xxx >= pid_max)

although I don't think this is really useful.

Oleg.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v7 1/2] fork: extend clone3() to support setting a PID
  2019-11-11 15:25 ` [PATCH v7 1/2] fork: extend clone3() to support setting a PID Oleg Nesterov
@ 2019-11-11 15:40   ` Adrian Reber
  2019-11-11 16:14     ` Christian Brauner
  0 siblings, 1 reply; 12+ messages in thread
From: Adrian Reber @ 2019-11-11 15:40 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Christian Brauner, Eric Biederman, Pavel Emelyanov, Jann Horn,
	Dmitry Safonov, linux-kernel, Andrei Vagin, Mike Rapoport,
	Radostin Stoyanov

On Mon, Nov 11, 2019 at 04:25:15PM +0100, Oleg Nesterov wrote:
> On 11/11, Adrian Reber wrote:
> >
> > v7:
> >  - changed set_tid to be an array to set the PID of a process
> >    in multiple nested PID namespaces at the same time as discussed
> >    at LPC 2019 (container MC)
> 
> cough... iirc you convinced me this is not needed when we discussed
> the previous version ;) Nevermind, probably my memory fools me.

You are right. You suggested the same thing and we didn't listen ;)

> So far I only have some cosmetic nits,

Thanks for the quick review. I will try to apply your suggestions.

> > @@ -175,6 +187,18 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> >
> >  	for (i = ns->level; i >= 0; i--) {
> >  		int pid_min = 1;
> > +		int t_pos = 0;
>                     ^^^^^
> 
> I won't insist, but I'd suggest to cache set_tid[t_pos] instead to make
> the code a bit more simple.
> 
> > @@ -186,12 +210,24 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> >  		if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
> >  			pid_min = RESERVED_PIDS;
> 
> You can probably move this code into the "else" branch below.
> 
> IOW, something like
> 
> 
> 	for (i = ns->level; i >= 0; i--) {
> 		int xxx = 0;
> 
> 		if (set_tid_size) {
> 			int pos = ns->level - i;
> 
> 			xxx = set_tid[pos];
> 			if (xxx < 1 || xxx >= pid_max)
> 				return ERR_PTR(-EINVAL);
> 			/* Also fail if a PID != 1 is requested and no PID 1 exists */
> 			if (xxx != 1 && !tmp->child_reaper)
> 				return ERR_PTR(-EINVAL);
> 			if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
> 				return ERR_PTR(-EPERM);
> 			set_tid_size--;
> 		}
> 
> 		idr_preload(GFP_KERNEL);
> 		spin_lock_irq(&pidmap_lock);
> 
> 		if (xxx) {
> 			nr = idr_alloc(&tmp->idr, NULL, xxx, xxx + 1,
> 					GFP_ATOMIC);
> 			/*
> 			 * If ENOSPC is returned it means that the PID is
> 			 * alreay in use. Return EEXIST in that case.
> 			 */
> 			if (nr == -ENOSPC)
> 				nr = -EEXIST;
> 		} else {
> 			int pid_min = 1;
> 			/*
> 			 * init really needs pid 1, but after reaching the
> 			 * maximum wrap back to RESERVED_PIDS
> 			 */
> 			if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
> 				pid_min = RESERVED_PIDS;
> 			/*
> 			 * Store a null pointer so find_pid_ns does not find
> 			 * a partially initialized PID (see below).
> 			 */
> 			nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
> 					      pid_max, GFP_ATOMIC);
> 		}
> 
> 		...
> 
> This way only the "if (set_tid_size)" block has to play with set_tid_size/set_tid.
> 
> note also that this way we can easily allow set_tid[some_level] == 0, we can
> simply do
> 
> 	-	if (xxx < 1 || xxx >= pid_max)
> 	+	if (xxx < 0 || xxx >= pid_max)
> 
> although I don't think this is really useful.

Yes. I explicitly didn't allow 0 as a PID as I didn't thought it would
be useful (or maybe even valid).

		Adrian


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v7 1/2] fork: extend clone3() to support setting a PID
  2019-11-11 13:17 [PATCH v7 1/2] fork: extend clone3() to support setting a PID Adrian Reber
  2019-11-11 13:17 ` [PATCH v7 2/2] selftests: add tests for clone3() Adrian Reber
  2019-11-11 15:25 ` [PATCH v7 1/2] fork: extend clone3() to support setting a PID Oleg Nesterov
@ 2019-11-11 16:12 ` Christian Brauner
  2019-11-11 20:41 ` Rasmus Villemoes
  3 siblings, 0 replies; 12+ messages in thread
From: Christian Brauner @ 2019-11-11 16:12 UTC (permalink / raw)
  To: Adrian Reber
  Cc: Eric Biederman, Pavel Emelyanov, Jann Horn, Oleg Nesterov,
	Dmitry Safonov, linux-kernel, Andrei Vagin, Mike Rapoport,
	Radostin Stoyanov

On Mon, Nov 11, 2019 at 02:17:03PM +0100, Adrian Reber wrote:
> The main motivation to add set_tid to clone3() is CRIU.
> 
> To restore a process with the same PID/TID CRIU currently uses
> /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
> ns_last_pid and then (quickly) does a clone(). This works most of the
> time, but it is racy. It is also slow as it requires multiple syscalls.
> 
> Extending clone3() to support *set_tid makes it possible restore a
> process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
> race free (as long as the desired PID/TID is available).
> 
> This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
> on clone3() with *set_tid as they are currently in place for ns_last_pid.
> 
> The original version of this change was using a single value for
> set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
> decided to change set_tid to an array to enable setting the PID of a
> process in multiple PID namespaces at the same time. If a process is
> created in a PID namespace it is possible to influence the PID inside
> and outside of the PID namespace. Details also in the corresponding
> selftest.
> 
> To create a process with the following PIDs:
> 
>       PID NS level         Requested PID
>         0 (host)              31496
>         1                        42
>         2                         1
> 
> For that example the two newly introduced parameters to struct
> clone_args (set_tid and set_tid_size) would need to be:
> 
>   set_tid[0] = 1;
>   set_tid[1] = 42;
>   set_tid[2] = 31496;
>   set_tid_size = 3;
> 
> If only the PIDs of the two innermost nested PID namespaces should be
> defined it would look like this:
> 
>   set_tid[0] = 1;
>   set_tid[1] = 42;
>   set_tid_size = 2;
> 
> The PID of the newly created process would then be the next available
> free PID in the PID namespace level 0 (host) and 42 in the PID namespace
> at level 1 and the PID of the process in the innermost PID namespace
> would be 1.

I'm not a fan of this array-based patch tbh, especially since afaict
CRIU has only vague plans to support restoring nested pid namespaces but
I won't stand in the way. :)

> 
> The set_tid array is used to specify the PID of a process starting
> from the innermost nested PID namespaces up to set_tid_size PID namespaces.
> 
> set_tid_size cannot be larger then the current PID namespace level.
> 
> Signed-off-by: Adrian Reber <areber@redhat.com>
> ---
> v2:
>  - Removed (size < sizeof(struct clone_args)) as discussed with
>    Christian and Dmitry
>  - Added comment to ((set_tid != 1) && idr_get_cursor() <= 1) (Oleg)
>  - Use idr_alloc() instead of idr_alloc_cyclic() (Oleg)
> 
> v3:
>  - Return EEXIST if PID is already in use (Christian)
>  - Drop CLONE_SET_TID (Christian and Oleg)
>  - Use idr_is_empty() instead of idr_get_cursor() (Oleg)
>  - Handle different `struct clone_args` sizes (Dmitry)
> 
> v4:
>  - Rework struct size check with defines (Christian)
>  - Reduce number of set_tid checks (Oleg)
>  - Less parentheses and more robust code (Oleg)
>  - Do ns_capable() on correct user_ns (Oleg, Christian)
> 
> v5:
>  - make set_tid checks earlier in alloc_pid() (Christian)
>  - remove unnecessary comment and struct size check (Christian)
> 
> v6:
>  - remove CLONE_SET_TID from description (Christian)
>  - add clone3() tests for different clone_args sizes (Christian)
>  - move more set_tid checks to alloc_pid() (Oleg)
>  - make all set_tid checks lockless (Oleg)
> 
> v7:
>  - changed set_tid to be an array to set the PID of a process
>    in multiple nested PID namespaces at the same time as discussed
>    at LPC 2019 (container MC)
> ---
>  include/linux/pid.h           |  3 ++-
>  include/linux/pid_namespace.h |  2 ++
>  include/linux/sched/task.h    |  3 +++
>  include/uapi/linux/sched.h    |  2 ++
>  kernel/fork.c                 | 20 +++++++++++++-
>  kernel/pid.c                  | 50 ++++++++++++++++++++++++++++++-----
>  kernel/pid_namespace.c        |  2 --
>  7 files changed, 71 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/pid.h b/include/linux/pid.h
> index 9645b1194c98..034b7df25888 100644
> --- a/include/linux/pid.h
> +++ b/include/linux/pid.h
> @@ -120,7 +120,8 @@ extern struct pid *find_vpid(int nr);
>  extern struct pid *find_get_pid(int nr);
>  extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
>  
> -extern struct pid *alloc_pid(struct pid_namespace *ns);
> +extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
> +			     size_t set_tid_size);
>  extern void free_pid(struct pid *pid);
>  extern void disable_pid_allocation(struct pid_namespace *ns);
>  
> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index 49538b172483..2ed6af88794b 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -12,6 +12,8 @@
>  #include <linux/ns_common.h>
>  #include <linux/idr.h>
>  
> +/* MAX_PID_NS_LEVEL is needed for limiting size of 'struct pid' */
> +#define MAX_PID_NS_LEVEL 32
>  
>  struct fs_pin;
>  
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index 4b1c3b664f51..f1879884238e 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -26,6 +26,9 @@ struct kernel_clone_args {
>  	unsigned long stack;
>  	unsigned long stack_size;
>  	unsigned long tls;
> +	pid_t *set_tid;
> +	/* Number of elements in *set_tid */
> +	size_t set_tid_size;
>  };
>  
>  /*
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 25b4fa00bad1..13f74c40a629 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -72,6 +72,8 @@ struct clone_args {
>  	__aligned_u64 stack;
>  	__aligned_u64 stack_size;
>  	__aligned_u64 tls;
> +	__aligned_u64 set_tid;
> +	__aligned_u64 set_tid_size;
>  };
>  #endif
>  
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 55af6931c6ec..c70b9224997d 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2026,7 +2026,8 @@ static __latent_entropy struct task_struct *copy_process(
>  	stackleak_task_init(p);
>  
>  	if (pid != &init_struct_pid) {
> -		pid = alloc_pid(p->nsproxy->pid_ns_for_children);
> +		pid = alloc_pid(p->nsproxy->pid_ns_for_children, args->set_tid,
> +				args->set_tid_size);
>  		if (IS_ERR(pid)) {
>  			retval = PTR_ERR(pid);
>  			goto bad_fork_cleanup_thread;
> @@ -2527,6 +2528,7 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
>  					      struct clone_args __user *uargs,
>  					      size_t usize)
>  {
> +	int i;
>  	int err;
>  	struct clone_args args;
>  
> @@ -2539,6 +2541,9 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
>  	if (err)
>  		return err;
>  
> +	if (unlikely(args.set_tid_size > MAX_PID_NS_LEVEL))
> +		return -E2BIG;

I'd prefer EINVAL for this case. Mostly because we do that for other
arguments already. E2BIG currently only reflects invalid size of struct
clone_args itself and which makes it easy to spot for userspace and I'd
like to not overload that error too (The not-overloading-EINVAL-ship
has already sailed a long time ago.).

Also, it seems this misses some more checks, maybe?

if (!args.set_tid && args.set_tid_size > 0)
	return -EINVAL;

if (args.set_tid && args.set_tid_size == 0)
	return -EINVAL;

> +
>  	/*
>  	 * Verify that higher 32bits of exit_signal are unset and that
>  	 * it is a valid signal
> @@ -2556,8 +2561,17 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
>  		.stack		= args.stack,
>  		.stack_size	= args.stack_size,
>  		.tls		= args.tls,
> +		.set_tid	= kargs->set_tid,

Everything setup here uses members from "args" not "kargs".
That difference is very easy to overlook should anything change in this
code in the future. I'd prefer if setting up both (set_tid and
set_tid_size) or at least the set_tid pointer would be done later,
after the copy_from_user() stuff.

So at the top of this function you could do:

pid_t *kset_tid = kargs->set_tid;

> +		.set_tid_size	= args.set_tid_size,
>  	};
>  
> +	for (i = 0; i < args.set_tid_size; i++) {
> +		if (copy_from_user(&kargs->set_tid[i],
> +		    u64_to_user_ptr(args.set_tid + (i * sizeof(args.set_tid))),
> +		    sizeof(pid_t)))
> +			return -EFAULT;
> +	}

All of this indexing doesn't look very nice and introduces racyness.
Can't we do something like:

	if (args->set_tid && copy_from_user(kset_tid, u64_to_user_ptr(args.set_tid), (kargs->set_tid_size * sizeof(pid_t))))
		return -EFAULT;
	kargs->set_tid = kset_tid;

Christian

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v7 1/2] fork: extend clone3() to support setting a PID
  2019-11-11 15:40   ` Adrian Reber
@ 2019-11-11 16:14     ` Christian Brauner
  2019-11-11 16:32       ` Oleg Nesterov
  2019-11-11 23:08       ` Eric W. Biederman
  0 siblings, 2 replies; 12+ messages in thread
From: Christian Brauner @ 2019-11-11 16:14 UTC (permalink / raw)
  To: Adrian Reber
  Cc: Oleg Nesterov, Eric Biederman, Pavel Emelyanov, Jann Horn,
	Dmitry Safonov, linux-kernel, Andrei Vagin, Mike Rapoport,
	Radostin Stoyanov

On Mon, Nov 11, 2019 at 04:40:28PM +0100, Adrian Reber wrote:
> On Mon, Nov 11, 2019 at 04:25:15PM +0100, Oleg Nesterov wrote:
> > On 11/11, Adrian Reber wrote:
> > >
> > > v7:
> > >  - changed set_tid to be an array to set the PID of a process
> > >    in multiple nested PID namespaces at the same time as discussed
> > >    at LPC 2019 (container MC)
> > 
> > cough... iirc you convinced me this is not needed when we discussed
> > the previous version ;) Nevermind, probably my memory fools me.
> 
> You are right. You suggested the same thing and we didn't listen ;)
> 
> > So far I only have some cosmetic nits,
> 
> Thanks for the quick review. I will try to apply your suggestions.
> 
> > > @@ -175,6 +187,18 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> > >
> > >  	for (i = ns->level; i >= 0; i--) {
> > >  		int pid_min = 1;
> > > +		int t_pos = 0;
> >                     ^^^^^
> > 
> > I won't insist, but I'd suggest to cache set_tid[t_pos] instead to make
> > the code a bit more simple.
> > 
> > > @@ -186,12 +210,24 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> > >  		if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
> > >  			pid_min = RESERVED_PIDS;
> > 
> > You can probably move this code into the "else" branch below.
> > 
> > IOW, something like
> > 
> > 
> > 	for (i = ns->level; i >= 0; i--) {
> > 		int xxx = 0;
> > 
> > 		if (set_tid_size) {
> > 			int pos = ns->level - i;
> > 
> > 			xxx = set_tid[pos];
> > 			if (xxx < 1 || xxx >= pid_max)
> > 				return ERR_PTR(-EINVAL);
> > 			/* Also fail if a PID != 1 is requested and no PID 1 exists */
> > 			if (xxx != 1 && !tmp->child_reaper)
> > 				return ERR_PTR(-EINVAL);
> > 			if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
> > 				return ERR_PTR(-EPERM);
> > 			set_tid_size--;
> > 		}
> > 
> > 		idr_preload(GFP_KERNEL);
> > 		spin_lock_irq(&pidmap_lock);
> > 
> > 		if (xxx) {
> > 			nr = idr_alloc(&tmp->idr, NULL, xxx, xxx + 1,
> > 					GFP_ATOMIC);
> > 			/*
> > 			 * If ENOSPC is returned it means that the PID is
> > 			 * alreay in use. Return EEXIST in that case.
> > 			 */
> > 			if (nr == -ENOSPC)
> > 				nr = -EEXIST;
> > 		} else {
> > 			int pid_min = 1;
> > 			/*
> > 			 * init really needs pid 1, but after reaching the
> > 			 * maximum wrap back to RESERVED_PIDS
> > 			 */
> > 			if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
> > 				pid_min = RESERVED_PIDS;
> > 			/*
> > 			 * Store a null pointer so find_pid_ns does not find
> > 			 * a partially initialized PID (see below).
> > 			 */
> > 			nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
> > 					      pid_max, GFP_ATOMIC);
> > 		}
> > 
> > 		...
> > 
> > This way only the "if (set_tid_size)" block has to play with set_tid_size/set_tid.
> > 
> > note also that this way we can easily allow set_tid[some_level] == 0, we can
> > simply do
> > 
> > 	-	if (xxx < 1 || xxx >= pid_max)
> > 	+	if (xxx < 0 || xxx >= pid_max)
> > 
> > although I don't think this is really useful.
> 
> Yes. I explicitly didn't allow 0 as a PID as I didn't thought it would
> be useful (or maybe even valid).

How do you express: I don't care about a specific pid in pidns level
<n>, just give me a random one? For example,

set_tid[0] = 1234
set_tid[1] = 5678
set_tid[2] = random_pid()
set_tid[3] = 9

Wouldn't that be potentially useful?

Christian

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v7 1/2] fork: extend clone3() to support setting a PID
  2019-11-11 16:14     ` Christian Brauner
@ 2019-11-11 16:32       ` Oleg Nesterov
  2019-11-11 23:08       ` Eric W. Biederman
  1 sibling, 0 replies; 12+ messages in thread
From: Oleg Nesterov @ 2019-11-11 16:32 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Adrian Reber, Eric Biederman, Pavel Emelyanov, Jann Horn,
	Dmitry Safonov, linux-kernel, Andrei Vagin, Mike Rapoport,
	Radostin Stoyanov

On 11/11, Christian Brauner wrote:
>
> On Mon, Nov 11, 2019 at 04:40:28PM +0100, Adrian Reber wrote:
> > > note also that this way we can easily allow set_tid[some_level] == 0, we can
> > > simply do
> > >
> > > 	-	if (xxx < 1 || xxx >= pid_max)
> > > 	+	if (xxx < 0 || xxx >= pid_max)
> > >
> > > although I don't think this is really useful.
> >
> > Yes. I explicitly didn't allow 0 as a PID as I didn't thought it would
> > be useful (or maybe even valid).
>
> How do you express: I don't care about a specific pid in pidns level
> <n>,

Yes,

> Wouldn't that be potentially useful?

As I said above, I don't think this is really useful. Just I was thinking
out loud.

I suggested to cache set_tid[pos] rather than pos because this makes the
code simpler, not because this allows this allows to "naturally" handle
the case when set_tid[pos] == 0. Sorry it it was not clear.

Oleg.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v7 1/2] fork: extend clone3() to support setting a PID
  2019-11-11 13:17 [PATCH v7 1/2] fork: extend clone3() to support setting a PID Adrian Reber
                   ` (2 preceding siblings ...)
  2019-11-11 16:12 ` Christian Brauner
@ 2019-11-11 20:41 ` Rasmus Villemoes
  2019-11-12 15:26   ` Adrian Reber
  2019-11-13  8:02   ` Adrian Reber
  3 siblings, 2 replies; 12+ messages in thread
From: Rasmus Villemoes @ 2019-11-11 20:41 UTC (permalink / raw)
  To: Adrian Reber, Christian Brauner, Eric Biederman, Pavel Emelyanov,
	Jann Horn, Oleg Nesterov, Dmitry Safonov
  Cc: linux-kernel, Andrei Vagin, Mike Rapoport, Radostin Stoyanov

On 11/11/2019 14.17, Adrian Reber wrote:
> The main motivation to add set_tid to clone3() is CRIU.
> 
> To restore a process with the same PID/TID CRIU currently uses
> /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
> ns_last_pid and then (quickly) does a clone(). This works most of the
> time, but it is racy. It is also slow as it requires multiple syscalls.
> 
> Extending clone3() to support *set_tid makes it possible restore a
> process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
> race free (as long as the desired PID/TID is available).
> 
> This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
> on clone3() with *set_tid as they are currently in place for ns_last_pid.
> 
> The original version of this change was using a single value for
> set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
> decided to change set_tid to an array to enable setting the PID of a
> process in multiple PID namespaces at the same time. If a process is
> created in a PID namespace it is possible to influence the PID inside
> and outside of the PID namespace. Details also in the corresponding
> selftest.
> 

>  	/*
>  	 * Verify that higher 32bits of exit_signal are unset and that
>  	 * it is a valid signal
> @@ -2556,8 +2561,17 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
>  		.stack		= args.stack,
>  		.stack_size	= args.stack_size,
>  		.tls		= args.tls,
> +		.set_tid	= kargs->set_tid,
> +		.set_tid_size	= args.set_tid_size,
>  	};

This is a bit ugly. And is it even well-defined? I mean, it's a bit
similar to the "i = i++;". So it would be best to avoid.

> +	for (i = 0; i < args.set_tid_size; i++) {
> +		if (copy_from_user(&kargs->set_tid[i],
> +		    u64_to_user_ptr(args.set_tid + (i * sizeof(args.set_tid))),
> +		    sizeof(pid_t)))
> +			return -EFAULT;
> +	}
> +

If I'm reading this (and your test case) right, you expect the user
pointer to point at an array of u64, and here you're copying the first
half of each u64 to the pid_t array. That only works on little-endian.

It seems more obvious (since I don't think there's any disagreement
anywhere on sizeof(pid_t)) to expect the user pointer to point at an
array of pid_t and then simply copy_from_user() the whole thing in one go.

>  	return 0;
>  }
>  
> @@ -2631,6 +2645,10 @@ SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
>  	int err;
>  
>  	struct kernel_clone_args kargs;
> +	pid_t set_tid[MAX_PID_NS_LEVEL];
> +
> +	memset(set_tid, 0, sizeof(set_tid));
> +	kargs.set_tid = set_tid;

Hm, isn't it a bit much to add two cachelines (and dirtying them via the
memset) to the stack footprint of clone3, considering that almost nobody
(relatively speaking) will use this?

So how about copy_clone_args_from_user() does

if (args.set_tid) {
  set_tid = memdup_user(u64_to_user_ptr(), ...)
  if (IS_ERR(set_tid))
    return PTR_ERR(set_tid);
  kargs.set_tid = set_tid;
}

Then somebody needs to free that, but this is probably not the last
clone extension that might need extra data, so one could do

s/long _do_fork/static long __do_fork/

and then create a _do_fork that always cleans up the passed-in kargs, i.e.

long _do_fork(struct kargs *args)
{
  long ret = __do_fork(args);
  kfree(args->set_tid);
  return ret;
}

Rasmus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v7 1/2] fork: extend clone3() to support setting a PID
  2019-11-11 16:14     ` Christian Brauner
  2019-11-11 16:32       ` Oleg Nesterov
@ 2019-11-11 23:08       ` Eric W. Biederman
  2019-11-12 10:24         ` Christian Brauner
  1 sibling, 1 reply; 12+ messages in thread
From: Eric W. Biederman @ 2019-11-11 23:08 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Adrian Reber, Oleg Nesterov, Pavel Emelyanov, Jann Horn,
	Dmitry Safonov, linux-kernel, Andrei Vagin, Mike Rapoport,
	Radostin Stoyanov

Christian Brauner <christian.brauner@ubuntu.com> writes:

> On Mon, Nov 11, 2019 at 04:40:28PM +0100, Adrian Reber wrote:
>> On Mon, Nov 11, 2019 at 04:25:15PM +0100, Oleg Nesterov wrote:
>> > On 11/11, Adrian Reber wrote:
>> > >
>> > > v7:
>> > >  - changed set_tid to be an array to set the PID of a process
>> > >    in multiple nested PID namespaces at the same time as discussed
>> > >    at LPC 2019 (container MC)
>> > 
>> > cough... iirc you convinced me this is not needed when we discussed
>> > the previous version ;) Nevermind, probably my memory fools me.
>> 
>> You are right. You suggested the same thing and we didn't listen ;)
>> 
>> > So far I only have some cosmetic nits,
>> 
>> Thanks for the quick review. I will try to apply your suggestions.
>> 
>> > > @@ -175,6 +187,18 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>> > >
>> > >  	for (i = ns->level; i >= 0; i--) {
>> > >  		int pid_min = 1;
>> > > +		int t_pos = 0;
>> >                     ^^^^^
>> > 
>> > I won't insist, but I'd suggest to cache set_tid[t_pos] instead to make
>> > the code a bit more simple.
>> > 
>> > > @@ -186,12 +210,24 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>> > >  		if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
>> > >  			pid_min = RESERVED_PIDS;
>> > 
>> > You can probably move this code into the "else" branch below.
>> > 
>> > IOW, something like
>> > 
>> > 
>> > 	for (i = ns->level; i >= 0; i--) {
>> > 		int xxx = 0;
>> > 
>> > 		if (set_tid_size) {
>> > 			int pos = ns->level - i;
>> > 
>> > 			xxx = set_tid[pos];
>> > 			if (xxx < 1 || xxx >= pid_max)
>> > 				return ERR_PTR(-EINVAL);
>> > 			/* Also fail if a PID != 1 is requested and no PID 1 exists */
>> > 			if (xxx != 1 && !tmp->child_reaper)
>> > 				return ERR_PTR(-EINVAL);
>> > 			if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
>> > 				return ERR_PTR(-EPERM);
>> > 			set_tid_size--;
>> > 		}
>> > 
>> > 		idr_preload(GFP_KERNEL);
>> > 		spin_lock_irq(&pidmap_lock);
>> > 
>> > 		if (xxx) {
>> > 			nr = idr_alloc(&tmp->idr, NULL, xxx, xxx + 1,
>> > 					GFP_ATOMIC);
>> > 			/*
>> > 			 * If ENOSPC is returned it means that the PID is
>> > 			 * alreay in use. Return EEXIST in that case.
>> > 			 */
>> > 			if (nr == -ENOSPC)
>> > 				nr = -EEXIST;
>> > 		} else {
>> > 			int pid_min = 1;
>> > 			/*
>> > 			 * init really needs pid 1, but after reaching the
>> > 			 * maximum wrap back to RESERVED_PIDS
>> > 			 */
>> > 			if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
>> > 				pid_min = RESERVED_PIDS;
>> > 			/*
>> > 			 * Store a null pointer so find_pid_ns does not find
>> > 			 * a partially initialized PID (see below).
>> > 			 */
>> > 			nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
>> > 					      pid_max, GFP_ATOMIC);
>> > 		}
>> > 
>> > 		...
>> > 
>> > This way only the "if (set_tid_size)" block has to play with set_tid_size/set_tid.
>> > 
>> > note also that this way we can easily allow set_tid[some_level] == 0, we can
>> > simply do
>> > 
>> > 	-	if (xxx < 1 || xxx >= pid_max)
>> > 	+	if (xxx < 0 || xxx >= pid_max)
>> > 
>> > although I don't think this is really useful.
>> 
>> Yes. I explicitly didn't allow 0 as a PID as I didn't thought it would
>> be useful (or maybe even valid).

I agree not allowing 0 sounds very reasonable.

> How do you express: I don't care about a specific pid in pidns level
> <n>, just give me a random one? For example,
>
> set_tid[0] = 1234
> set_tid[1] = 5678
> set_tid[2] = random_pid()
> set_tid[3] = 9
>
> Wouldn't that be potentially useful?

I can't imagine how.

At least in my head the fundamental concept is picking up a container on
one machine and moving it to another machine.  For that operation you
will know starting with the most nested pid namespace the pids that you
want up to some point.  Farther up you don't know.

I can't imagine in what scenario you would not know a pid at outer level
but want a specified pid at an ever farther removed outer level.  What
scenario are you thinking about that could lead to such a situation?

For the me the question is: Are you restoring what you know or not?

Eric

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v7 1/2] fork: extend clone3() to support setting a PID
  2019-11-11 23:08       ` Eric W. Biederman
@ 2019-11-12 10:24         ` Christian Brauner
  0 siblings, 0 replies; 12+ messages in thread
From: Christian Brauner @ 2019-11-12 10:24 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Adrian Reber, Oleg Nesterov, Pavel Emelyanov, Jann Horn,
	Dmitry Safonov, linux-kernel, Andrei Vagin, Mike Rapoport,
	Radostin Stoyanov

On Mon, Nov 11, 2019 at 05:08:57PM -0600, Eric W. Biederman wrote:
> Christian Brauner <christian.brauner@ubuntu.com> writes:
> 
> > On Mon, Nov 11, 2019 at 04:40:28PM +0100, Adrian Reber wrote:
> >> On Mon, Nov 11, 2019 at 04:25:15PM +0100, Oleg Nesterov wrote:
> >> > On 11/11, Adrian Reber wrote:
> >> > >
> >> > > v7:
> >> > >  - changed set_tid to be an array to set the PID of a process
> >> > >    in multiple nested PID namespaces at the same time as discussed
> >> > >    at LPC 2019 (container MC)
> >> > 
> >> > cough... iirc you convinced me this is not needed when we discussed
> >> > the previous version ;) Nevermind, probably my memory fools me.
> >> 
> >> You are right. You suggested the same thing and we didn't listen ;)
> >> 
> >> > So far I only have some cosmetic nits,
> >> 
> >> Thanks for the quick review. I will try to apply your suggestions.
> >> 
> >> > > @@ -175,6 +187,18 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> >> > >
> >> > >  	for (i = ns->level; i >= 0; i--) {
> >> > >  		int pid_min = 1;
> >> > > +		int t_pos = 0;
> >> >                     ^^^^^
> >> > 
> >> > I won't insist, but I'd suggest to cache set_tid[t_pos] instead to make
> >> > the code a bit more simple.
> >> > 
> >> > > @@ -186,12 +210,24 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> >> > >  		if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
> >> > >  			pid_min = RESERVED_PIDS;
> >> > 
> >> > You can probably move this code into the "else" branch below.
> >> > 
> >> > IOW, something like
> >> > 
> >> > 
> >> > 	for (i = ns->level; i >= 0; i--) {
> >> > 		int xxx = 0;
> >> > 
> >> > 		if (set_tid_size) {
> >> > 			int pos = ns->level - i;
> >> > 
> >> > 			xxx = set_tid[pos];
> >> > 			if (xxx < 1 || xxx >= pid_max)
> >> > 				return ERR_PTR(-EINVAL);
> >> > 			/* Also fail if a PID != 1 is requested and no PID 1 exists */
> >> > 			if (xxx != 1 && !tmp->child_reaper)
> >> > 				return ERR_PTR(-EINVAL);
> >> > 			if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
> >> > 				return ERR_PTR(-EPERM);
> >> > 			set_tid_size--;
> >> > 		}
> >> > 
> >> > 		idr_preload(GFP_KERNEL);
> >> > 		spin_lock_irq(&pidmap_lock);
> >> > 
> >> > 		if (xxx) {
> >> > 			nr = idr_alloc(&tmp->idr, NULL, xxx, xxx + 1,
> >> > 					GFP_ATOMIC);
> >> > 			/*
> >> > 			 * If ENOSPC is returned it means that the PID is
> >> > 			 * alreay in use. Return EEXIST in that case.
> >> > 			 */
> >> > 			if (nr == -ENOSPC)
> >> > 				nr = -EEXIST;
> >> > 		} else {
> >> > 			int pid_min = 1;
> >> > 			/*
> >> > 			 * init really needs pid 1, but after reaching the
> >> > 			 * maximum wrap back to RESERVED_PIDS
> >> > 			 */
> >> > 			if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
> >> > 				pid_min = RESERVED_PIDS;
> >> > 			/*
> >> > 			 * Store a null pointer so find_pid_ns does not find
> >> > 			 * a partially initialized PID (see below).
> >> > 			 */
> >> > 			nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
> >> > 					      pid_max, GFP_ATOMIC);
> >> > 		}
> >> > 
> >> > 		...
> >> > 
> >> > This way only the "if (set_tid_size)" block has to play with set_tid_size/set_tid.
> >> > 
> >> > note also that this way we can easily allow set_tid[some_level] == 0, we can
> >> > simply do
> >> > 
> >> > 	-	if (xxx < 1 || xxx >= pid_max)
> >> > 	+	if (xxx < 0 || xxx >= pid_max)
> >> > 
> >> > although I don't think this is really useful.
> >> 
> >> Yes. I explicitly didn't allow 0 as a PID as I didn't thought it would
> >> be useful (or maybe even valid).
> 
> I agree not allowing 0 sounds very reasonable.

Yeah, I think we are all in agreement here.

> 
> > How do you express: I don't care about a specific pid in pidns level
> > <n>, just give me a random one? For example,
> >
> > set_tid[0] = 1234
> > set_tid[1] = 5678
> > set_tid[2] = random_pid()
> > set_tid[3] = 9
> >
> > Wouldn't that be potentially useful?
> 
> I can't imagine how.
> 
> At least in my head the fundamental concept is picking up a container on
> one machine and moving it to another machine.  For that operation you
> will know starting with the most nested pid namespace the pids that you
> want up to some point.  Farther up you don't know.
> 
> I can't imagine in what scenario you would not know a pid at outer level
> but want a specified pid at an ever farther removed outer level.  What
> scenario are you thinking about that could lead to such a situation?
> 
> For the me the question is: Are you restoring what you know or not?

I didn't advocate for making this possible (though I can see how this
would be a neat hacking tool).
Though this whole paragraph highlights one of my concerns with this
whole feature. As it stands it is _only_ useful to CRIU. Which as I said
before is fine but it still makes me queasy when an interface really
just is designed to serve a single use-case; this specific feature even
just a single user.
I'm hopeful that we can find other use-cases for testing. It's probably
already a fun feature for making pid-reuse based kernel exploits way
easier.

Christian

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v7 1/2] fork: extend clone3() to support setting a PID
  2019-11-11 20:41 ` Rasmus Villemoes
@ 2019-11-12 15:26   ` Adrian Reber
  2019-11-13  8:02   ` Adrian Reber
  1 sibling, 0 replies; 12+ messages in thread
From: Adrian Reber @ 2019-11-12 15:26 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Christian Brauner, Eric Biederman, Pavel Emelyanov, Jann Horn,
	Oleg Nesterov, Dmitry Safonov, linux-kernel, Andrei Vagin,
	Mike Rapoport, Radostin Stoyanov

On Mon, Nov 11, 2019 at 09:41:39PM +0100, Rasmus Villemoes wrote:
> On 11/11/2019 14.17, Adrian Reber wrote:
> > The main motivation to add set_tid to clone3() is CRIU.
> > 
> > To restore a process with the same PID/TID CRIU currently uses
> > /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
> > ns_last_pid and then (quickly) does a clone(). This works most of the
> > time, but it is racy. It is also slow as it requires multiple syscalls.
> > 
> > Extending clone3() to support *set_tid makes it possible restore a
> > process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
> > race free (as long as the desired PID/TID is available).
> > 
> > This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
> > on clone3() with *set_tid as they are currently in place for ns_last_pid.
> > 
> > The original version of this change was using a single value for
> > set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
> > decided to change set_tid to an array to enable setting the PID of a
> > process in multiple PID namespaces at the same time. If a process is
> > created in a PID namespace it is possible to influence the PID inside
> > and outside of the PID namespace. Details also in the corresponding
> > selftest.
> > 
> 
> >  	/*
> >  	 * Verify that higher 32bits of exit_signal are unset and that
> >  	 * it is a valid signal
> > @@ -2556,8 +2561,17 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
> >  		.stack		= args.stack,
> >  		.stack_size	= args.stack_size,
> >  		.tls		= args.tls,
> > +		.set_tid	= kargs->set_tid,
> > +		.set_tid_size	= args.set_tid_size,
> >  	};
> 
> This is a bit ugly. And is it even well-defined? I mean, it's a bit
> similar to the "i = i++;". So it would be best to avoid.
> 
> > +	for (i = 0; i < args.set_tid_size; i++) {
> > +		if (copy_from_user(&kargs->set_tid[i],
> > +		    u64_to_user_ptr(args.set_tid + (i * sizeof(args.set_tid))),
> > +		    sizeof(pid_t)))
> > +			return -EFAULT;
> > +	}
> > +
> 
> If I'm reading this (and your test case) right, you expect the user
> pointer to point at an array of u64, and here you're copying the first
> half of each u64 to the pid_t array. That only works on little-endian.
> 
> It seems more obvious (since I don't think there's any disagreement
> anywhere on sizeof(pid_t)) to expect the user pointer to point at an
> array of pid_t and then simply copy_from_user() the whole thing in one go.
> 
> >  	return 0;
> >  }
> >  
> > @@ -2631,6 +2645,10 @@ SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
> >  	int err;
> >  
> >  	struct kernel_clone_args kargs;
> > +	pid_t set_tid[MAX_PID_NS_LEVEL];
> > +
> > +	memset(set_tid, 0, sizeof(set_tid));
> > +	kargs.set_tid = set_tid;
> 
> Hm, isn't it a bit much to add two cachelines (and dirtying them via the
> memset) to the stack footprint of clone3, considering that almost nobody
> (relatively speaking) will use this?
> 
> So how about copy_clone_args_from_user() does
> 
> if (args.set_tid) {
>   set_tid = memdup_user(u64_to_user_ptr(), ...)
>   if (IS_ERR(set_tid))
>     return PTR_ERR(set_tid);
>   kargs.set_tid = set_tid;
> }
> 
> Then somebody needs to free that, but this is probably not the last
> clone extension that might need extra data, so one could do
> 
> s/long _do_fork/static long __do_fork/
> 
> and then create a _do_fork that always cleans up the passed-in kargs, i.e.
> 
> long _do_fork(struct kargs *args)
> {
>   long ret = __do_fork(args);
>   kfree(args->set_tid);
>   return ret;
> }

Thanks for your review. Did you had a look at what Christian suggested?
That should solve most of the points you mentioned. I will also remove
the memset() as it is not necessary at all.

		Adrian


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v7 1/2] fork: extend clone3() to support setting a PID
  2019-11-11 20:41 ` Rasmus Villemoes
  2019-11-12 15:26   ` Adrian Reber
@ 2019-11-13  8:02   ` Adrian Reber
  1 sibling, 0 replies; 12+ messages in thread
From: Adrian Reber @ 2019-11-13  8:02 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Christian Brauner, Eric Biederman, Pavel Emelyanov, Jann Horn,
	Oleg Nesterov, Dmitry Safonov, linux-kernel, Andrei Vagin,
	Mike Rapoport, Radostin Stoyanov

On Mon, Nov 11, 2019 at 09:41:39PM +0100, Rasmus Villemoes wrote:
> On 11/11/2019 14.17, Adrian Reber wrote:
> > The main motivation to add set_tid to clone3() is CRIU.
> > 
> > To restore a process with the same PID/TID CRIU currently uses
> > /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
> > ns_last_pid and then (quickly) does a clone(). This works most of the
> > time, but it is racy. It is also slow as it requires multiple syscalls.
> > 
> > Extending clone3() to support *set_tid makes it possible restore a
> > process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
> > race free (as long as the desired PID/TID is available).
> > 
> > This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
> > on clone3() with *set_tid as they are currently in place for ns_last_pid.
> > 
> > The original version of this change was using a single value for
> > set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
> > decided to change set_tid to an array to enable setting the PID of a
> > process in multiple PID namespaces at the same time. If a process is
> > created in a PID namespace it is possible to influence the PID inside
> > and outside of the PID namespace. Details also in the corresponding
> > selftest.
> > 
> 
> >  	/*
> >  	 * Verify that higher 32bits of exit_signal are unset and that
> >  	 * it is a valid signal
> > @@ -2556,8 +2561,17 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
> >  		.stack		= args.stack,
> >  		.stack_size	= args.stack_size,
> >  		.tls		= args.tls,
> > +		.set_tid	= kargs->set_tid,
> > +		.set_tid_size	= args.set_tid_size,
> >  	};
> 
> This is a bit ugly. And is it even well-defined? I mean, it's a bit
> similar to the "i = i++;". So it would be best to avoid.
> 
> > +	for (i = 0; i < args.set_tid_size; i++) {
> > +		if (copy_from_user(&kargs->set_tid[i],
> > +		    u64_to_user_ptr(args.set_tid + (i * sizeof(args.set_tid))),
> > +		    sizeof(pid_t)))
> > +			return -EFAULT;
> > +	}
> > +
> 
> If I'm reading this (and your test case) right, you expect the user
> pointer to point at an array of u64, and here you're copying the first
> half of each u64 to the pid_t array. That only works on little-endian.
> 
> It seems more obvious (since I don't think there's any disagreement
> anywhere on sizeof(pid_t)) to expect the user pointer to point at an
> array of pid_t and then simply copy_from_user() the whole thing in one go.

Yes, that was wrong. I changed the test case to use an array of pid_t.

		Adrian


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2019-11-13  8:02 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-11 13:17 [PATCH v7 1/2] fork: extend clone3() to support setting a PID Adrian Reber
2019-11-11 13:17 ` [PATCH v7 2/2] selftests: add tests for clone3() Adrian Reber
2019-11-11 15:25 ` [PATCH v7 1/2] fork: extend clone3() to support setting a PID Oleg Nesterov
2019-11-11 15:40   ` Adrian Reber
2019-11-11 16:14     ` Christian Brauner
2019-11-11 16:32       ` Oleg Nesterov
2019-11-11 23:08       ` Eric W. Biederman
2019-11-12 10:24         ` Christian Brauner
2019-11-11 16:12 ` Christian Brauner
2019-11-11 20:41 ` Rasmus Villemoes
2019-11-12 15:26   ` Adrian Reber
2019-11-13  8:02   ` Adrian Reber

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).