All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v10 1/2] fork: extend clone3() to support setting a PID
@ 2019-11-14 14:27 Adrian Reber
  2019-11-14 14:27 ` [PATCH v10 2/2] selftests: add tests for clone3() with *set_tid Adrian Reber
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Adrian Reber @ 2019-11-14 14:27 UTC (permalink / raw)
  To: Christian Brauner, Eric Biederman, Pavel Emelyanov, Jann Horn,
	Oleg Nesterov, Dmitry Safonov, Rasmus Villemoes
  Cc: linux-kernel, Andrei Vagin, Mike Rapoport, Radostin Stoyanov,
	Adrian Reber

The main motivation to add set_tid to clone3() is CRIU.

To restore a process with the same PID/TID CRIU currently uses
/proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
ns_last_pid and then (quickly) does a clone(). This works most of the
time, but it is racy. It is also slow as it requires multiple syscalls.

Extending clone3() to support *set_tid makes it possible restore a
process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
race free (as long as the desired PID/TID is available).

This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
on clone3() with *set_tid as they are currently in place for ns_last_pid.

The original version of this change was using a single value for
set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
decided to change set_tid to an array to enable setting the PID of a
process in multiple PID namespaces at the same time. If a process is
created in a PID namespace it is possible to influence the PID inside
and outside of the PID namespace. Details also in the corresponding
selftest.

To create a process with the following PIDs:

      PID NS level         Requested PID
        0 (host)              31496
        1                        42
        2                         1

For that example the two newly introduced parameters to struct
clone_args (set_tid and set_tid_size) would need to be:

  set_tid[0] = 1;
  set_tid[1] = 42;
  set_tid[2] = 31496;
  set_tid_size = 3;

If only the PIDs of the two innermost nested PID namespaces should be
defined it would look like this:

  set_tid[0] = 1;
  set_tid[1] = 42;
  set_tid_size = 2;

The PID of the newly created process would then be the next available
free PID in the PID namespace level 0 (host) and 42 in the PID namespace
at level 1 and the PID of the process in the innermost PID namespace
would be 1.

The set_tid array is used to specify the PID of a process starting
from the innermost nested PID namespaces up to set_tid_size PID namespaces.

set_tid_size cannot be larger then the current PID namespace level.

Signed-off-by: Adrian Reber <areber@redhat.com>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
---
v2:
 - Removed (size < sizeof(struct clone_args)) as discussed with
   Christian and Dmitry
 - Added comment to ((set_tid != 1) && idr_get_cursor() <= 1) (Oleg)
 - Use idr_alloc() instead of idr_alloc_cyclic() (Oleg)

v3:
 - Return EEXIST if PID is already in use (Christian)
 - Drop CLONE_SET_TID (Christian and Oleg)
 - Use idr_is_empty() instead of idr_get_cursor() (Oleg)
 - Handle different `struct clone_args` sizes (Dmitry)

v4:
 - Rework struct size check with defines (Christian)
 - Reduce number of set_tid checks (Oleg)
 - Less parentheses and more robust code (Oleg)
 - Do ns_capable() on correct user_ns (Oleg, Christian)

v5:
 - make set_tid checks earlier in alloc_pid() (Christian)
 - remove unnecessary comment and struct size check (Christian)

v6:
 - remove CLONE_SET_TID from description (Christian)
 - add clone3() tests for different clone_args sizes (Christian)
 - move more set_tid checks to alloc_pid() (Oleg)
 - make all set_tid checks lockless (Oleg)

v7:
 - changed set_tid to be an array to set the PID of a process
   in multiple nested PID namespaces at the same time as discussed
   at LPC 2019 (container MC)

v8:
 - skip unnecessary memset() (Rasmus)
 - replace set_tid copy loop with a single copy (Christian)
 - more parameter error checking (Christian)
 - cache set_tid in alloc_pid() (Oleg)
 - move code in "else" branch (Oleg)

v9:
 - added kernel-doc to include/uapi/linux/sched.h (Christian)
 - moved a variable to limit its scope; keep all set_tid_size
   related changes in one place (Oleg)

v10:
 - added CLONE_ARGS_SIZE_VER1 to sched.h (Christian)
---
 include/linux/pid.h           |  3 +-
 include/linux/pid_namespace.h |  2 +
 include/linux/sched/task.h    |  3 ++
 include/uapi/linux/sched.h    | 53 ++++++++++++++++++---------
 kernel/fork.c                 | 24 +++++++++++-
 kernel/pid.c                  | 69 +++++++++++++++++++++++++++--------
 kernel/pid_namespace.c        |  2 -
 7 files changed, 119 insertions(+), 37 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 034e3cd60dc0..998ae7d24450 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -124,7 +124,8 @@ extern struct pid *find_vpid(int nr);
 extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
+			     size_t set_tid_size);
 extern void free_pid(struct pid *pid);
 extern void disable_pid_allocation(struct pid_namespace *ns);
 
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 49538b172483..2ed6af88794b 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -12,6 +12,8 @@
 #include <linux/ns_common.h>
 #include <linux/idr.h>
 
+/* MAX_PID_NS_LEVEL is needed for limiting size of 'struct pid' */
+#define MAX_PID_NS_LEVEL 32
 
 struct fs_pin;
 
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 4b1c3b664f51..f1879884238e 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -26,6 +26,9 @@ struct kernel_clone_args {
 	unsigned long stack;
 	unsigned long stack_size;
 	unsigned long tls;
+	pid_t *set_tid;
+	/* Number of elements in *set_tid */
+	size_t set_tid_size;
 };
 
 /*
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 1d500ed03c63..2e649cfa07f4 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -39,24 +39,38 @@
 #ifndef __ASSEMBLY__
 /**
  * struct clone_args - arguments for the clone3 syscall
- * @flags:       Flags for the new process as listed above.
- *               All flags are valid except for CSIGNAL and
- *               CLONE_DETACHED.
- * @pidfd:       If CLONE_PIDFD is set, a pidfd will be
- *               returned in this argument.
- * @child_tid:   If CLONE_CHILD_SETTID is set, the TID of the
- *               child process will be returned in the child's
- *               memory.
- * @parent_tid:  If CLONE_PARENT_SETTID is set, the TID of
- *               the child process will be returned in the
- *               parent's memory.
- * @exit_signal: The exit_signal the parent process will be
- *               sent when the child exits.
- * @stack:       Specify the location of the stack for the
- *               child process.
- * @stack_size:  The size of the stack for the child process.
- * @tls:         If CLONE_SETTLS is set, the tls descriptor
- *               is set to tls.
+ * @flags:        Flags for the new process as listed above.
+ *                All flags are valid except for CSIGNAL and
+ *                CLONE_DETACHED.
+ * @pidfd:        If CLONE_PIDFD is set, a pidfd will be
+ *                returned in this argument.
+ * @child_tid:    If CLONE_CHILD_SETTID is set, the TID of the
+ *                child process will be returned in the child's
+ *                memory.
+ * @parent_tid:   If CLONE_PARENT_SETTID is set, the TID of
+ *                the child process will be returned in the
+ *                parent's memory.
+ * @exit_signal:  The exit_signal the parent process will be
+ *                sent when the child exits.
+ * @stack:        Specify the location of the stack for the
+ *                child process.
+ * @stack_size:   The size of the stack for the child process.
+ * @tls:          If CLONE_SETTLS is set, the tls descriptor
+ *                is set to tls.
+ * @set_tid:      Pointer to an array of type *pid_t. The size
+ *                of the array is defined using @set_tid_size.
+ *                This array is used select PIDs/TIDs for newly
+ *                created processes. The first element in this
+ *                defines the PID in the most nested PID
+ *                namespace. Each additional element in the array
+ *                defines the PID in the parent PID namespace of
+ *                the original PID namespace. If the array has
+ *                less entries than the number of currently
+ *                nested PID namespaces only the PIDs in the
+ *                corresponding namespaces are set.
+ * @set_tid_size: This defines the size of the array referenced
+ *                in @set_tid. This cannot be larger than the
+ *                kernel's limit of nested PID namespaces.
  *
  * The structure is versioned by size and thus extensible.
  * New struct members must go at the end of the struct and
@@ -71,10 +85,13 @@ struct clone_args {
 	__aligned_u64 stack;
 	__aligned_u64 stack_size;
 	__aligned_u64 tls;
+	__aligned_u64 set_tid;
+	__aligned_u64 set_tid_size;
 };
 #endif
 
 #define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
+#define CLONE_ARGS_SIZE_VER1 80 /* sizeof second published struct */
 
 /*
  * Scheduling policies
diff --git a/kernel/fork.c b/kernel/fork.c
index 954e875e72b1..417570263f1f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2087,7 +2087,8 @@ static __latent_entropy struct task_struct *copy_process(
 	stackleak_task_init(p);
 
 	if (pid != &init_struct_pid) {
-		pid = alloc_pid(p->nsproxy->pid_ns_for_children);
+		pid = alloc_pid(p->nsproxy->pid_ns_for_children, args->set_tid,
+				args->set_tid_size);
 		if (IS_ERR(pid)) {
 			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_thread;
@@ -2590,6 +2591,7 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
 {
 	int err;
 	struct clone_args args;
+	pid_t *kset_tid = kargs->set_tid;
 
 	if (unlikely(usize > PAGE_SIZE))
 		return -E2BIG;
@@ -2600,6 +2602,15 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
 	if (err)
 		return err;
 
+	if (unlikely(args.set_tid_size > MAX_PID_NS_LEVEL))
+		return -EINVAL;
+
+	if (unlikely(!args.set_tid && args.set_tid_size > 0))
+		return -EINVAL;
+
+	if (unlikely(args.set_tid && args.set_tid_size == 0))
+		return -EINVAL;
+
 	/*
 	 * Verify that higher 32bits of exit_signal are unset and that
 	 * it is a valid signal
@@ -2617,8 +2628,16 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
 		.stack		= args.stack,
 		.stack_size	= args.stack_size,
 		.tls		= args.tls,
+		.set_tid_size	= args.set_tid_size,
 	};
 
+	if (args.set_tid &&
+		copy_from_user(kset_tid, u64_to_user_ptr(args.set_tid),
+			(kargs->set_tid_size * sizeof(pid_t))))
+		return -EFAULT;
+
+	kargs->set_tid = kset_tid;
+
 	return 0;
 }
 
@@ -2662,6 +2681,9 @@ SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
 	int err;
 
 	struct kernel_clone_args kargs;
+	pid_t set_tid[MAX_PID_NS_LEVEL];
+
+	kargs.set_tid = set_tid;
 
 	err = copy_clone_args_from_user(&kargs, uargs, size);
 	if (err)
diff --git a/kernel/pid.c b/kernel/pid.c
index 7b5f6c963d72..eb32668997c6 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -157,7 +157,8 @@ void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
+		      size_t set_tid_size)
 {
 	struct pid *pid;
 	enum pid_type type;
@@ -166,6 +167,17 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 	struct upid *upid;
 	int retval = -ENOMEM;
 
+	/*
+	 * set_tid_size contains the size of the set_tid array. Starting at
+	 * the most nested currently active PID namespace it tells alloc_pid()
+	 * which PID to set for a process in that most nested PID namespace
+	 * up to set_tid_size PID namespaces. It does not have to set the PID
+	 * for a process in all nested PID namespaces but set_tid_size must
+	 * never be greater than the current ns->level + 1.
+	 */
+	if (set_tid_size > ns->level + 1)
+		return ERR_PTR(-EINVAL);
+
 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
 	if (!pid)
 		return ERR_PTR(retval);
@@ -174,24 +186,51 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 	pid->level = ns->level;
 
 	for (i = ns->level; i >= 0; i--) {
-		int pid_min = 1;
+		int tid = 0;
+
+		if (set_tid_size) {
+			tid = set_tid[ns->level - i];
+			if (tid < 1 || tid >= pid_max)
+				return ERR_PTR(-EINVAL);
+			/*
+			 * Also fail if a PID != 1 is requested and
+			 * no PID 1 exists.
+			 */
+			if (tid != 1 && !tmp->child_reaper)
+				return ERR_PTR(-EINVAL);
+			if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
+				return ERR_PTR(-EPERM);
+			set_tid_size--;
+		}
 
 		idr_preload(GFP_KERNEL);
 		spin_lock_irq(&pidmap_lock);
 
-		/*
-		 * init really needs pid 1, but after reaching the maximum
-		 * wrap back to RESERVED_PIDS
-		 */
-		if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
-			pid_min = RESERVED_PIDS;
-
-		/*
-		 * Store a null pointer so find_pid_ns does not find
-		 * a partially initialized PID (see below).
-		 */
-		nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
-				      pid_max, GFP_ATOMIC);
+		if (tid) {
+			nr = idr_alloc(&tmp->idr, NULL, tid,
+				       tid + 1, GFP_ATOMIC);
+			/*
+			 * If ENOSPC is returned it means that the PID is
+			 * alreay in use. Return EEXIST in that case.
+			 */
+			if (nr == -ENOSPC)
+				nr = -EEXIST;
+		} else {
+			int pid_min = 1;
+			/*
+			 * init really needs pid 1, but after reaching the
+			 * maximum wrap back to RESERVED_PIDS
+			 */
+			if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
+				pid_min = RESERVED_PIDS;
+
+			/*
+			 * Store a null pointer so find_pid_ns does not find
+			 * a partially initialized PID (see below).
+			 */
+			nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
+					      pid_max, GFP_ATOMIC);
+		}
 		spin_unlock_irq(&pidmap_lock);
 		idr_preload_end();
 
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index a6a79f85c81a..d40017e79ebe 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -26,8 +26,6 @@
 
 static DEFINE_MUTEX(pid_caches_mutex);
 static struct kmem_cache *pid_ns_cachep;
-/* MAX_PID_NS_LEVEL is needed for limiting size of 'struct pid' */
-#define MAX_PID_NS_LEVEL 32
 /* Write once array, filled from the beginning. */
 static struct kmem_cache *pid_cache[MAX_PID_NS_LEVEL];
 

base-commit: 7acdfe534e726450fe8051abc2f36380fb2c2c0e
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v10 2/2] selftests: add tests for clone3() with *set_tid
  2019-11-14 14:27 [PATCH v10 1/2] fork: extend clone3() to support setting a PID Adrian Reber
@ 2019-11-14 14:27 ` Adrian Reber
  2019-11-14 14:49   ` Christian Brauner
                     ` (2 more replies)
  2019-11-14 15:50 ` [PATCH v10 1/2] fork: extend clone3() to support setting a PID Oleg Nesterov
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 15+ messages in thread
From: Adrian Reber @ 2019-11-14 14:27 UTC (permalink / raw)
  To: Christian Brauner, Eric Biederman, Pavel Emelyanov, Jann Horn,
	Oleg Nesterov, Dmitry Safonov, Rasmus Villemoes
  Cc: linux-kernel, Andrei Vagin, Mike Rapoport, Radostin Stoyanov,
	Adrian Reber

This tests clone3() with *set_tid to see if all desired PIDs are working
as expected. The tests are trying multiple invalid input parameters as
well as creating processes while specifying a certain PID in multiple
PID namespaces at the same time.

Additionally this moves common clone3() test code into clone3_selftests.h.

Signed-off-by: Adrian Reber <areber@redhat.com>
---
v9:
 - applied all changes from Christian's review (except using the
   NSpid: parsing code from selftests/pidfd/pidfd_fdinfo_test.c)

v10:
 - added even more '\n' and include file fixes (Christian)
---
 tools/testing/selftests/clone3/.gitignore     |   1 +
 tools/testing/selftests/clone3/Makefile       |   2 +-
 tools/testing/selftests/clone3/clone3.c       |   8 +-
 .../selftests/clone3/clone3_clear_sighand.c   |  20 +-
 .../selftests/clone3/clone3_selftests.h       |  33 ++
 .../testing/selftests/clone3/clone3_set_tid.c | 357 ++++++++++++++++++
 6 files changed, 395 insertions(+), 26 deletions(-)
 create mode 100644 tools/testing/selftests/clone3/clone3_selftests.h
 create mode 100644 tools/testing/selftests/clone3/clone3_set_tid.c

diff --git a/tools/testing/selftests/clone3/.gitignore b/tools/testing/selftests/clone3/.gitignore
index 2a30ae18b06e..0dc4f32c6cb8 100644
--- a/tools/testing/selftests/clone3/.gitignore
+++ b/tools/testing/selftests/clone3/.gitignore
@@ -1,2 +1,3 @@
 clone3
 clone3_clear_sighand
+clone3_set_tid
diff --git a/tools/testing/selftests/clone3/Makefile b/tools/testing/selftests/clone3/Makefile
index eb26eb793c80..cf976c732906 100644
--- a/tools/testing/selftests/clone3/Makefile
+++ b/tools/testing/selftests/clone3/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0
 CFLAGS += -g -I../../../../usr/include/
 
-TEST_GEN_PROGS := clone3 clone3_clear_sighand
+TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid
 
 include ../lib.mk
diff --git a/tools/testing/selftests/clone3/clone3.c b/tools/testing/selftests/clone3/clone3.c
index 0f8a9ef40117..4669b3d418e7 100644
--- a/tools/testing/selftests/clone3/clone3.c
+++ b/tools/testing/selftests/clone3/clone3.c
@@ -18,6 +18,7 @@
 #include <sched.h>
 
 #include "../kselftest.h"
+#include "clone3_selftests.h"
 
 /*
  * Different sizes of struct clone_args
@@ -35,11 +36,6 @@ enum test_mode {
 	CLONE3_ARGS_INVAL_EXIT_SIGNAL_NSIG,
 };
 
-static pid_t raw_clone(struct clone_args *args, size_t size)
-{
-	return syscall(__NR_clone3, args, size);
-}
-
 static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
 {
 	struct clone_args args = {
@@ -83,7 +79,7 @@ static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
 
 	memcpy(&args_ext.args, &args, sizeof(struct clone_args));
 
-	pid = raw_clone((struct clone_args *)&args_ext, size);
+	pid = sys_clone3((struct clone_args *)&args_ext, size);
 	if (pid < 0) {
 		ksft_print_msg("%s - Failed to create new process\n",
 				strerror(errno));
diff --git a/tools/testing/selftests/clone3/clone3_clear_sighand.c b/tools/testing/selftests/clone3/clone3_clear_sighand.c
index 0d957be1bdc5..456783ad19d6 100644
--- a/tools/testing/selftests/clone3/clone3_clear_sighand.c
+++ b/tools/testing/selftests/clone3/clone3_clear_sighand.c
@@ -14,30 +14,12 @@
 #include <sys/wait.h>
 
 #include "../kselftest.h"
+#include "clone3_selftests.h"
 
 #ifndef CLONE_CLEAR_SIGHAND
 #define CLONE_CLEAR_SIGHAND 0x100000000ULL
 #endif
 
-#ifndef __NR_clone3
-#define __NR_clone3 -1
-struct clone_args {
-	__aligned_u64 flags;
-	__aligned_u64 pidfd;
-	__aligned_u64 child_tid;
-	__aligned_u64 parent_tid;
-	__aligned_u64 exit_signal;
-	__aligned_u64 stack;
-	__aligned_u64 stack_size;
-	__aligned_u64 tls;
-};
-#endif
-
-static pid_t sys_clone3(struct clone_args *args, size_t size)
-{
-	return syscall(__NR_clone3, args, size);
-}
-
 static void test_clone3_supported(void)
 {
 	pid_t pid;
diff --git a/tools/testing/selftests/clone3/clone3_selftests.h b/tools/testing/selftests/clone3/clone3_selftests.h
new file mode 100644
index 000000000000..988c194d850d
--- /dev/null
+++ b/tools/testing/selftests/clone3/clone3_selftests.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _CLONE3_SELFTESTS_H
+#define _CLONE3_SELFTESTS_H
+
+#define _GNU_SOURCE
+#include <sched.h>
+#include <stdint.h>
+#include <syscall.h>
+#include <linux/types.h>
+
+#define ptr_to_u64(ptr) ((__u64)((uintptr_t)(ptr)))
+
+#ifndef __NR_clone3
+#define __NR_clone3 -1
+struct clone_args {
+	__aligned_u64 flags;
+	__aligned_u64 pidfd;
+	__aligned_u64 child_tid;
+	__aligned_u64 parent_tid;
+	__aligned_u64 exit_signal;
+	__aligned_u64 stack;
+	__aligned_u64 stack_size;
+	__aligned_u64 tls;
+};
+#endif
+
+static pid_t sys_clone3(struct clone_args *args, size_t size)
+{
+	return syscall(__NR_clone3, args, size);
+}
+
+#endif /* _CLONE3_SELFTESTS_H */
diff --git a/tools/testing/selftests/clone3/clone3_set_tid.c b/tools/testing/selftests/clone3/clone3_set_tid.c
new file mode 100644
index 000000000000..f2bf2e7a0ee6
--- /dev/null
+++ b/tools/testing/selftests/clone3/clone3_set_tid.c
@@ -0,0 +1,357 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Based on Christian Brauner's clone3() example.
+ * These tests are assuming to be running in the host's
+ * PID namespace.
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <linux/types.h>
+#include <linux/sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/un.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include <sched.h>
+
+#include "../kselftest.h"
+#include "clone3_selftests.h"
+
+#ifndef MAX_PID_NS_LEVEL
+#define MAX_PID_NS_LEVEL 32
+#endif
+
+static int pipe_1[2];
+static int pipe_2[2];
+
+static int call_clone3_set_tid(pid_t *set_tid,
+			       size_t set_tid_size,
+			       int flags,
+			       int expected_pid,
+			       bool wait_for_it)
+{
+	int status;
+	pid_t pid = -1;
+
+	struct clone_args args = {
+		.flags = flags,
+		.exit_signal = SIGCHLD,
+		.set_tid = ptr_to_u64(set_tid),
+		.set_tid_size = set_tid_size,
+	};
+
+	pid = sys_clone3(&args, sizeof(struct clone_args));
+	if (pid < 0) {
+		ksft_print_msg("%s - Failed to create new process\n",
+			       strerror(errno));
+		return -errno;
+	}
+
+	if (pid == 0) {
+		char tmp = 0;
+
+		ksft_print_msg("I am the child, my PID is %d (expected %d)\n",
+			       getpid(), set_tid[0]);
+		if (wait_for_it) {
+			ksft_print_msg("[%d] Child is ready and waiting\n",
+				       getpid());
+
+			/* Signal the parent that the child is ready */
+			close(pipe_1[0]);
+			write(pipe_1[1], &tmp, 1);
+			close(pipe_1[1]);
+			close(pipe_2[1]);
+			read(pipe_2[0], &tmp, 1);
+			close(pipe_2[0]);
+		}
+
+		if (set_tid[0] != getpid())
+			_exit(EXIT_FAILURE);
+		_exit(EXIT_SUCCESS);
+	}
+
+	if (expected_pid == 0 || expected_pid == pid) {
+		ksft_print_msg("I am the parent (%d). My child's pid is %d\n",
+			       getpid(), pid);
+	} else {
+		ksft_print_msg(
+			"Expected child pid %d does not match actual pid %d\n",
+			expected_pid, pid);
+	}
+
+	if (wait(&status) < 0) {
+		ksft_print_msg("Child returned %s\n", strerror(errno));
+		return -errno;
+	}
+
+	if (!WIFEXITED(status))
+		return -1;
+
+	return WEXITSTATUS(status);
+}
+
+static void test_clone3_set_tid(pid_t *set_tid,
+				size_t set_tid_size,
+				int flags,
+				int expected,
+				int expected_pid,
+				bool wait_for_it)
+{
+	int ret;
+
+	ksft_print_msg(
+		"[%d] Trying clone3() with CLONE_SET_TID to %d and 0x%x\n",
+		getpid(), set_tid[0], flags);
+	ret = call_clone3_set_tid(set_tid, set_tid_size, flags, expected_pid,
+				  wait_for_it);
+	ksft_print_msg(
+		"[%d] clone3() with CLONE_SET_TID %d says :%d - expected %d\n",
+		getpid(), set_tid[0], ret, expected);
+	if (ret != expected)
+		ksft_test_result_fail(
+			"[%d] Result (%d) is different than expected (%d)\n",
+			getpid(), ret, expected);
+	else
+		ksft_test_result_pass(
+			"[%d] Result (%d) matches expectation (%d)\n",
+			getpid(), ret, expected);
+}
+int main(int argc, char *argv[])
+{
+	FILE *f;
+	char buf;
+	char *line;
+	int status;
+	int ret = -1;
+	size_t len = 0;
+	int pid_max = 0;
+	uid_t uid = getuid();
+	char proc_path[100] = {0};
+	pid_t pid, ns1, ns2, ns3, ns_pid;
+	pid_t set_tid[MAX_PID_NS_LEVEL * 2];
+
+	if (pipe(pipe_1) < 0 || pipe(pipe_2) < 0)
+		ksft_exit_fail_msg("pipe() failed\n");
+
+	ksft_print_header();
+	ksft_set_plan(26);
+
+	f = fopen("/proc/sys/kernel/pid_max", "r");
+	if (f == NULL)
+		ksft_exit_fail_msg(
+			"%s - Could not open /proc/sys/kernel/pid_max\n",
+			strerror(errno));
+	fscanf(f, "%d", &pid_max);
+	fclose(f);
+	ksft_print_msg("/proc/sys/kernel/pid_max %d\n", pid_max);
+
+	/* Try invalid settings */
+	memset(&set_tid, 0, sizeof(set_tid));
+	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL + 1, 0, -EINVAL, 0, 0);
+
+	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 2, 0, -EINVAL, 0, 0);
+
+	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 2 + 1, 0,
+			-EINVAL, 0, 0);
+
+	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 42, 0, -EINVAL, 0, 0);
+
+	/*
+	 * This can actually work if this test running in a MAX_PID_NS_LEVEL - 1
+	 * nested PID namespace.
+	 */
+	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL - 1, 0, -EINVAL, 0, 0);
+
+	memset(&set_tid, 0xff, sizeof(set_tid));
+	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL + 1, 0, -EINVAL, 0, 0);
+
+	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 2, 0, -EINVAL, 0, 0);
+
+	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 2 + 1, 0,
+			-EINVAL, 0, 0);
+
+	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 42, 0, -EINVAL, 0, 0);
+
+	/*
+	 * This can actually work if this test running in a MAX_PID_NS_LEVEL - 1
+	 * nested PID namespace.
+	 */
+	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL - 1, 0, -EINVAL, 0, 0);
+
+	memset(&set_tid, 0, sizeof(set_tid));
+	/* Try with an invalid PID */
+	set_tid[0] = 0;
+	test_clone3_set_tid(set_tid, 1, 0, -EINVAL, 0, 0);
+
+	set_tid[0] = -1;
+	test_clone3_set_tid(set_tid, 1, 0, -EINVAL, 0, 0);
+
+	/* Claim that the set_tid array actually contains 2 elements. */
+	test_clone3_set_tid(set_tid, 2, 0, -EINVAL, 0, 0);
+
+	/* Try it in a new PID namespace */
+	if (uid == 0)
+		test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EINVAL, 0, 0);
+	else
+		ksft_test_result_skip("Clone3() with set_tid requires root\n");
+
+	/* Try with a valid PID (1) this should return -EEXIST. */
+	set_tid[0] = 1;
+	if (uid == 0)
+		test_clone3_set_tid(set_tid, 1, 0, -EEXIST, 0, 0);
+	else
+		ksft_test_result_skip("Clone3() with set_tid requires root\n");
+
+	/* Try it in a new PID namespace */
+	if (uid == 0)
+		test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, 0, 0, 0);
+	else
+		ksft_test_result_skip("Clone3() with set_tid requires root\n");
+
+	/* pid_max should fail everywhere */
+	set_tid[0] = pid_max;
+	test_clone3_set_tid(set_tid, 1, 0, -EINVAL, 0, 0);
+
+	if (uid == 0)
+		test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EINVAL, 0, 0);
+	else
+		ksft_test_result_skip("Clone3() with set_tid requires root\n");
+
+	if (uid != 0) {
+		/*
+		 * All remaining tests require root. Tell the framework
+		 * that all those tests are skipped as non-root.
+		 */
+		ksft_cnt.ksft_xskip += ksft_plan - ksft_test_num();
+		goto out;
+	}
+
+	/* Find the current active PID */
+	pid = fork();
+	if (pid == 0) {
+		ksft_print_msg("Child has PID %d\n", getpid());
+		_exit(EXIT_SUCCESS);
+	}
+	(void)wait(NULL);
+
+	/* After the child has finished, its PID should be free. */
+	set_tid[0] = pid;
+	test_clone3_set_tid(set_tid, 1, 0, 0, 0, 0);
+
+	/* This should fail as there is no PID 1 in that namespace */
+	test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EINVAL, 0, 0);
+
+	/*
+	 * Creating a process with PID 1 in the newly created most nested
+	 * PID namespace and PID 'pid' in the parent PID namespace. This
+	 * needs to work.
+	 */
+	set_tid[0] = 1;
+	set_tid[1] = pid;
+	test_clone3_set_tid(set_tid, 2, CLONE_NEWPID, 0, pid, 0);
+
+	ksft_print_msg("unshare PID namespace\n");
+	if (unshare(CLONE_NEWPID) == -1)
+		ksft_exit_fail_msg("unshare(CLONE_NEWPID) failed: %s\n",
+				strerror(errno));
+
+	set_tid[0] = pid;
+
+	/* This should fail as there is no PID 1 in that namespace */
+	test_clone3_set_tid(set_tid, 1, 0, -EINVAL, 0, 0);
+
+	/* Let's create a PID 1 */
+	ns_pid = fork();
+	if (ns_pid == 0) {
+		ksft_print_msg("Child in PID namespace has PID %d\n", getpid());
+		set_tid[0] = 2;
+		test_clone3_set_tid(set_tid, 1, 0, 0, 2, 0);
+
+		set_tid[0] = 1;
+		set_tid[1] = 42;
+		set_tid[2] = pid;
+		/*
+		 * This should fail as there are not enough active PID
+		 * namespaces. Again assuming this is running in the host's
+		 * PID namespace. Not yet nested.
+		 */
+		test_clone3_set_tid(set_tid, 4, CLONE_NEWPID, -EINVAL, 0, 0);
+
+		/*
+		 * This should work and from the parent we should see
+		 * something like 'NSpid:	pid	42	1'.
+		 */
+		test_clone3_set_tid(set_tid, 3, CLONE_NEWPID, 0, 42, true);
+
+		_exit(ksft_cnt.ksft_pass);
+	}
+
+	close(pipe_1[1]);
+	close(pipe_2[0]);
+	while (read(pipe_1[0], &buf, 1) > 0) {
+		ksft_print_msg("[%d] Child is ready and waiting\n", getpid());
+		break;
+	}
+
+	snprintf(proc_path, sizeof(proc_path), "/proc/%d/status", pid);
+	f = fopen(proc_path, "r");
+	if (f == NULL)
+		ksft_exit_fail_msg(
+			"%s - Could not open %s\n",
+			strerror(errno), proc_path);
+
+	while (getline(&line, &len, f) != -1) {
+		if (strstr(line, "NSpid")) {
+			int i;
+
+			/* Verify that all generated PIDs are as expected. */
+			i = sscanf(line, "NSpid:\t%d\t%d\t%d",
+				   &ns3, &ns2, &ns1);
+			if (i != 3)
+				ns1 = ns2 = ns3 = 0;
+			break;
+		}
+	}
+	fclose(f);
+	free(line);
+	close(pipe_2[0]);
+
+	/* Tell the clone3()'d child to finish. */
+	write(pipe_2[1], &buf, 1);
+	close(pipe_2[1]);
+
+	if (wait(&status) < 0) {
+		ksft_print_msg("Child returned %s\n", strerror(errno));
+		ret = -errno;
+		goto out;
+	}
+
+	if (!WIFEXITED(status))
+		ksft_test_result_fail("Child error\n");
+
+	if (WEXITSTATUS(status))
+		/*
+		 * Update the number of total tests with the tests from the
+		 * child processes.
+		 */
+		ksft_cnt.ksft_pass = WEXITSTATUS(status);
+
+	if (ns3 == pid && ns2 == 42 && ns1 == 1)
+		ksft_test_result_pass(
+			"PIDs in all namespaces as expected (%d,%d,%d)\n",
+			ns3, ns2, ns1);
+	else
+		ksft_test_result_fail(
+			"PIDs in all namespaces not as expected (%d,%d,%d)\n",
+			ns3, ns2, ns1);
+out:
+	ret = 0;
+
+	return !ret ? ksft_exit_pass() : ksft_exit_fail();
+}
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v10 2/2] selftests: add tests for clone3() with *set_tid
  2019-11-14 14:27 ` [PATCH v10 2/2] selftests: add tests for clone3() with *set_tid Adrian Reber
@ 2019-11-14 14:49   ` Christian Brauner
  2019-11-14 18:34   ` Andrei Vagin
  2019-11-14 19:05   ` Andrei Vagin
  2 siblings, 0 replies; 15+ messages in thread
From: Christian Brauner @ 2019-11-14 14:49 UTC (permalink / raw)
  To: Adrian Reber
  Cc: Eric Biederman, Pavel Emelyanov, Jann Horn, Oleg Nesterov,
	Dmitry Safonov, Rasmus Villemoes, linux-kernel, Andrei Vagin,
	Mike Rapoport, Radostin Stoyanov

On Thu, Nov 14, 2019 at 03:27:07PM +0100, Adrian Reber wrote:
> This tests clone3() with *set_tid to see if all desired PIDs are working
> as expected. The tests are trying multiple invalid input parameters as
> well as creating processes while specifying a certain PID in multiple
> PID namespaces at the same time.
> 
> Additionally this moves common clone3() test code into clone3_selftests.h.
> 
> Signed-off-by: Adrian Reber <areber@redhat.com>

Thanks, Adrian!
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v10 1/2] fork: extend clone3() to support setting a PID
  2019-11-14 14:27 [PATCH v10 1/2] fork: extend clone3() to support setting a PID Adrian Reber
  2019-11-14 14:27 ` [PATCH v10 2/2] selftests: add tests for clone3() with *set_tid Adrian Reber
@ 2019-11-14 15:50 ` Oleg Nesterov
  2019-11-14 16:40 ` Christian Brauner
  2019-11-14 19:15 ` Andrei Vagin
  3 siblings, 0 replies; 15+ messages in thread
From: Oleg Nesterov @ 2019-11-14 15:50 UTC (permalink / raw)
  To: Adrian Reber
  Cc: Christian Brauner, Eric Biederman, Pavel Emelyanov, Jann Horn,
	Dmitry Safonov, Rasmus Villemoes, linux-kernel, Andrei Vagin,
	Mike Rapoport, Radostin Stoyanov

On 11/14, Adrian Reber wrote:
>
> @@ -2600,6 +2602,15 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
>  	if (err)
>  		return err;
>
> +	if (unlikely(args.set_tid_size > MAX_PID_NS_LEVEL))
> +		return -EINVAL;

so we need this to because copy_from_user() below writes into the
set_tid[MAX_PID_NS_LEVEL] on the caller's stack, then later alloc_pid()
does another "correct" check... We could simply shift that check here,
but probably this would be less clear, so I won't argue.

> @@ -2617,8 +2628,16 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
>  		.stack		= args.stack,
>  		.stack_size	= args.stack_size,
>  		.tls		= args.tls,
> +		.set_tid_size	= args.set_tid_size,
>  	};
...
> +	kargs->set_tid = kset_tid;

this looks a bit strange, you could simply do

		.set_tid_size   = args.set_tid_size,
		.set_tid	= kset_tid,

but this is really minor.

Looks good to me,

Reviewed-by: Oleg Nesterov <oleg@redhat.com>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v10 1/2] fork: extend clone3() to support setting a PID
  2019-11-14 14:27 [PATCH v10 1/2] fork: extend clone3() to support setting a PID Adrian Reber
  2019-11-14 14:27 ` [PATCH v10 2/2] selftests: add tests for clone3() with *set_tid Adrian Reber
  2019-11-14 15:50 ` [PATCH v10 1/2] fork: extend clone3() to support setting a PID Oleg Nesterov
@ 2019-11-14 16:40 ` Christian Brauner
  2019-11-14 19:15 ` Andrei Vagin
  3 siblings, 0 replies; 15+ messages in thread
From: Christian Brauner @ 2019-11-14 16:40 UTC (permalink / raw)
  To: Adrian Reber, Oleg Nesterov
  Cc: Eric Biederman, Pavel Emelyanov, Jann Horn, Dmitry Safonov,
	Rasmus Villemoes, linux-kernel, Andrei Vagin, Mike Rapoport,
	Radostin Stoyanov

On Thu, Nov 14, 2019 at 03:27:06PM +0100, Adrian Reber wrote:
> The main motivation to add set_tid to clone3() is CRIU.
> 
> To restore a process with the same PID/TID CRIU currently uses
> /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
> ns_last_pid and then (quickly) does a clone(). This works most of the
> time, but it is racy. It is also slow as it requires multiple syscalls.
> 
> Extending clone3() to support *set_tid makes it possible restore a
> process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
> race free (as long as the desired PID/TID is available).
> 
> This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
> on clone3() with *set_tid as they are currently in place for ns_last_pid.
> 
> The original version of this change was using a single value for
> set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
> decided to change set_tid to an array to enable setting the PID of a
> process in multiple PID namespaces at the same time. If a process is
> created in a PID namespace it is possible to influence the PID inside
> and outside of the PID namespace. Details also in the corresponding
> selftest.
> 
> To create a process with the following PIDs:
> 
>       PID NS level         Requested PID
>         0 (host)              31496
>         1                        42
>         2                         1
> 
> For that example the two newly introduced parameters to struct
> clone_args (set_tid and set_tid_size) would need to be:
> 
>   set_tid[0] = 1;
>   set_tid[1] = 42;
>   set_tid[2] = 31496;
>   set_tid_size = 3;
> 
> If only the PIDs of the two innermost nested PID namespaces should be
> defined it would look like this:
> 
>   set_tid[0] = 1;
>   set_tid[1] = 42;
>   set_tid_size = 2;
> 
> The PID of the newly created process would then be the next available
> free PID in the PID namespace level 0 (host) and 42 in the PID namespace
> at level 1 and the PID of the process in the innermost PID namespace
> would be 1.
> 
> The set_tid array is used to specify the PID of a process starting
> from the innermost nested PID namespaces up to set_tid_size PID namespaces.
> 
> set_tid_size cannot be larger then the current PID namespace level.
> 
> Signed-off-by: Adrian Reber <areber@redhat.com>
> Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>

Applied.
Thanks, Adrian and Oleg!
Christian

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v10 2/2] selftests: add tests for clone3() with *set_tid
  2019-11-14 14:27 ` [PATCH v10 2/2] selftests: add tests for clone3() with *set_tid Adrian Reber
  2019-11-14 14:49   ` Christian Brauner
@ 2019-11-14 18:34   ` Andrei Vagin
  2019-11-15 15:14     ` Adrian Reber
  2019-11-14 19:05   ` Andrei Vagin
  2 siblings, 1 reply; 15+ messages in thread
From: Andrei Vagin @ 2019-11-14 18:34 UTC (permalink / raw)
  To: Adrian Reber
  Cc: Christian Brauner, Eric Biederman, Pavel Emelyanov, Jann Horn,
	Oleg Nesterov, Dmitry Safonov, Rasmus Villemoes, linux-kernel,
	Mike Rapoport, Radostin Stoyanov

On Thu, Nov 14, 2019 at 03:27:07PM +0100, Adrian Reber wrote:
> This tests clone3() with *set_tid to see if all desired PIDs are working
> as expected. The tests are trying multiple invalid input parameters as
> well as creating processes while specifying a certain PID in multiple
> PID namespaces at the same time.
> 
> Additionally this moves common clone3() test code into clone3_selftests.h.
> 
> Signed-off-by: Adrian Reber <areber@redhat.com>
> ---
> v9:
>  - applied all changes from Christian's review (except using the
>    NSpid: parsing code from selftests/pidfd/pidfd_fdinfo_test.c)
> 
> v10:
>  - added even more '\n' and include file fixes (Christian)
> ---
>  tools/testing/selftests/clone3/.gitignore     |   1 +
>  tools/testing/selftests/clone3/Makefile       |   2 +-
>  tools/testing/selftests/clone3/clone3.c       |   8 +-
>  .../selftests/clone3/clone3_clear_sighand.c   |  20 +-
>  .../selftests/clone3/clone3_selftests.h       |  33 ++
>  .../testing/selftests/clone3/clone3_set_tid.c | 357 ++++++++++++++++++
>  6 files changed, 395 insertions(+), 26 deletions(-)
>  create mode 100644 tools/testing/selftests/clone3/clone3_selftests.h
>  create mode 100644 tools/testing/selftests/clone3/clone3_set_tid.c
> 
> diff --git a/tools/testing/selftests/clone3/.gitignore b/tools/testing/selftests/clone3/.gitignore
> index 2a30ae18b06e..0dc4f32c6cb8 100644
> --- a/tools/testing/selftests/clone3/.gitignore
> +++ b/tools/testing/selftests/clone3/.gitignore
> @@ -1,2 +1,3 @@
>  clone3
>  clone3_clear_sighand
> +clone3_set_tid
> diff --git a/tools/testing/selftests/clone3/Makefile b/tools/testing/selftests/clone3/Makefile
> index eb26eb793c80..cf976c732906 100644
> --- a/tools/testing/selftests/clone3/Makefile
> +++ b/tools/testing/selftests/clone3/Makefile
> @@ -1,6 +1,6 @@
>  # SPDX-License-Identifier: GPL-2.0
>  CFLAGS += -g -I../../../../usr/include/
>  
> -TEST_GEN_PROGS := clone3 clone3_clear_sighand
> +TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid
>  
>  include ../lib.mk
> diff --git a/tools/testing/selftests/clone3/clone3.c b/tools/testing/selftests/clone3/clone3.c
> index 0f8a9ef40117..4669b3d418e7 100644
> --- a/tools/testing/selftests/clone3/clone3.c
> +++ b/tools/testing/selftests/clone3/clone3.c
> @@ -18,6 +18,7 @@
>  #include <sched.h>
>  
>  #include "../kselftest.h"
> +#include "clone3_selftests.h"
>  
>  /*
>   * Different sizes of struct clone_args
> @@ -35,11 +36,6 @@ enum test_mode {
>  	CLONE3_ARGS_INVAL_EXIT_SIGNAL_NSIG,
>  };
>  
> -static pid_t raw_clone(struct clone_args *args, size_t size)
> -{
> -	return syscall(__NR_clone3, args, size);
> -}
> -
>  static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
>  {
>  	struct clone_args args = {
> @@ -83,7 +79,7 @@ static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
>  
>  	memcpy(&args_ext.args, &args, sizeof(struct clone_args));
>  
> -	pid = raw_clone((struct clone_args *)&args_ext, size);
> +	pid = sys_clone3((struct clone_args *)&args_ext, size);
>  	if (pid < 0) {
>  		ksft_print_msg("%s - Failed to create new process\n",
>  				strerror(errno));
> diff --git a/tools/testing/selftests/clone3/clone3_clear_sighand.c b/tools/testing/selftests/clone3/clone3_clear_sighand.c
> index 0d957be1bdc5..456783ad19d6 100644
> --- a/tools/testing/selftests/clone3/clone3_clear_sighand.c
> +++ b/tools/testing/selftests/clone3/clone3_clear_sighand.c
> @@ -14,30 +14,12 @@
>  #include <sys/wait.h>
>  
>  #include "../kselftest.h"
> +#include "clone3_selftests.h"
>  
>  #ifndef CLONE_CLEAR_SIGHAND
>  #define CLONE_CLEAR_SIGHAND 0x100000000ULL
>  #endif
>  
> -#ifndef __NR_clone3
> -#define __NR_clone3 -1
> -struct clone_args {
> -	__aligned_u64 flags;
> -	__aligned_u64 pidfd;
> -	__aligned_u64 child_tid;
> -	__aligned_u64 parent_tid;
> -	__aligned_u64 exit_signal;
> -	__aligned_u64 stack;
> -	__aligned_u64 stack_size;
> -	__aligned_u64 tls;
> -};
> -#endif
> -
> -static pid_t sys_clone3(struct clone_args *args, size_t size)
> -{
> -	return syscall(__NR_clone3, args, size);
> -}
> -
>  static void test_clone3_supported(void)
>  {
>  	pid_t pid;
> diff --git a/tools/testing/selftests/clone3/clone3_selftests.h b/tools/testing/selftests/clone3/clone3_selftests.h
> new file mode 100644
> index 000000000000..988c194d850d
> --- /dev/null
> +++ b/tools/testing/selftests/clone3/clone3_selftests.h
> @@ -0,0 +1,33 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef _CLONE3_SELFTESTS_H
> +#define _CLONE3_SELFTESTS_H
> +
> +#define _GNU_SOURCE
> +#include <sched.h>
> +#include <stdint.h>
> +#include <syscall.h>
> +#include <linux/types.h>
> +
> +#define ptr_to_u64(ptr) ((__u64)((uintptr_t)(ptr)))
> +
> +#ifndef __NR_clone3
> +#define __NR_clone3 -1
> +struct clone_args {
> +	__aligned_u64 flags;
> +	__aligned_u64 pidfd;
> +	__aligned_u64 child_tid;
> +	__aligned_u64 parent_tid;
> +	__aligned_u64 exit_signal;
> +	__aligned_u64 stack;
> +	__aligned_u64 stack_size;
> +	__aligned_u64 tls;
> +};
> +#endif
> +
> +static pid_t sys_clone3(struct clone_args *args, size_t size)
> +{
> +	return syscall(__NR_clone3, args, size);
> +}
> +
> +#endif /* _CLONE3_SELFTESTS_H */
> diff --git a/tools/testing/selftests/clone3/clone3_set_tid.c b/tools/testing/selftests/clone3/clone3_set_tid.c
> new file mode 100644
> index 000000000000..f2bf2e7a0ee6
> --- /dev/null
> +++ b/tools/testing/selftests/clone3/clone3_set_tid.c
> @@ -0,0 +1,357 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Based on Christian Brauner's clone3() example.
> + * These tests are assuming to be running in the host's
> + * PID namespace.
> + */
> +
> +#define _GNU_SOURCE
> +#include <errno.h>
> +#include <linux/types.h>
> +#include <linux/sched.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <stdbool.h>
> +#include <sys/syscall.h>
> +#include <sys/types.h>
> +#include <sys/un.h>
> +#include <sys/wait.h>
> +#include <unistd.h>
> +#include <sched.h>
> +
> +#include "../kselftest.h"
> +#include "clone3_selftests.h"
> +
> +#ifndef MAX_PID_NS_LEVEL
> +#define MAX_PID_NS_LEVEL 32
> +#endif
> +
> +static int pipe_1[2];
> +static int pipe_2[2];
> +
> +static int call_clone3_set_tid(pid_t *set_tid,
> +			       size_t set_tid_size,
> +			       int flags,
> +			       int expected_pid,
> +			       bool wait_for_it)
> +{
> +	int status;
> +	pid_t pid = -1;
> +
> +	struct clone_args args = {
> +		.flags = flags,
> +		.exit_signal = SIGCHLD,
> +		.set_tid = ptr_to_u64(set_tid),
> +		.set_tid_size = set_tid_size,
> +	};
> +
> +	pid = sys_clone3(&args, sizeof(struct clone_args));
> +	if (pid < 0) {
> +		ksft_print_msg("%s - Failed to create new process\n",
> +			       strerror(errno));
> +		return -errno;
> +	}
> +
> +	if (pid == 0) {
> +		char tmp = 0;
> +
> +		ksft_print_msg("I am the child, my PID is %d (expected %d)\n",
> +			       getpid(), set_tid[0]);
> +		if (wait_for_it) {
> +			ksft_print_msg("[%d] Child is ready and waiting\n",
> +				       getpid());
> +
> +			/* Signal the parent that the child is ready */
> +			close(pipe_1[0]);
> +			write(pipe_1[1], &tmp, 1);

I think it would be better if we will check an return code of write and
print an error if it isn't equal to 1.
> +			close(pipe_1[1]);
> +			close(pipe_2[1]);
> +			read(pipe_2[0], &tmp, 1);

the same here

> +			close(pipe_2[0]);
> +		}
> +
> +		if (set_tid[0] != getpid())
> +			_exit(EXIT_FAILURE);
> +		_exit(EXIT_SUCCESS);
> +	}
> +
> +	if (expected_pid == 0 || expected_pid == pid) {
> +		ksft_print_msg("I am the parent (%d). My child's pid is %d\n",
> +			       getpid(), pid);
> +	} else {
> +		ksft_print_msg(
> +			"Expected child pid %d does not match actual pid %d\n",
> +			expected_pid, pid);

ksft_test_result_fail or return -1?

> +	}
> +
> +	if (wait(&status) < 0) {

This test creates a few children, so waitpid(pid, &status, 0) will look better here and
in other places.

> +		ksft_print_msg("Child returned %s\n", strerror(errno));
> +		return -errno;
> +	}
> +
> +	if (!WIFEXITED(status))
> +		return -1;
> +
> +	return WEXITSTATUS(status);
> +}
> +
> +static void test_clone3_set_tid(pid_t *set_tid,
> +				size_t set_tid_size,
> +				int flags,
> +				int expected,
> +				int expected_pid,
> +				bool wait_for_it)
> +{
> +	int ret;
> +
> +	ksft_print_msg(
> +		"[%d] Trying clone3() with CLONE_SET_TID to %d and 0x%x\n",
> +		getpid(), set_tid[0], flags);
> +	ret = call_clone3_set_tid(set_tid, set_tid_size, flags, expected_pid,
> +				  wait_for_it);
> +	ksft_print_msg(
> +		"[%d] clone3() with CLONE_SET_TID %d says :%d - expected %d\n",
> +		getpid(), set_tid[0], ret, expected);
> +	if (ret != expected)
> +		ksft_test_result_fail(
> +			"[%d] Result (%d) is different than expected (%d)\n",
> +			getpid(), ret, expected);
> +	else
> +		ksft_test_result_pass(
> +			"[%d] Result (%d) matches expectation (%d)\n",
> +			getpid(), ret, expected);
> +}
> +int main(int argc, char *argv[])
> +{
> +	FILE *f;
> +	char buf;
> +	char *line;
> +	int status;
> +	int ret = -1;
> +	size_t len = 0;
> +	int pid_max = 0;
> +	uid_t uid = getuid();
> +	char proc_path[100] = {0};
> +	pid_t pid, ns1, ns2, ns3, ns_pid;
> +	pid_t set_tid[MAX_PID_NS_LEVEL * 2];
> +
> +	if (pipe(pipe_1) < 0 || pipe(pipe_2) < 0)
> +		ksft_exit_fail_msg("pipe() failed\n");
> +
> +	ksft_print_header();
> +	ksft_set_plan(26);
> +
> +	f = fopen("/proc/sys/kernel/pid_max", "r");
> +	if (f == NULL)
> +		ksft_exit_fail_msg(
> +			"%s - Could not open /proc/sys/kernel/pid_max\n",
> +			strerror(errno));
> +	fscanf(f, "%d", &pid_max);
> +	fclose(f);
> +	ksft_print_msg("/proc/sys/kernel/pid_max %d\n", pid_max);
> +
> +	/* Try invalid settings */
> +	memset(&set_tid, 0, sizeof(set_tid));
> +	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL + 1, 0, -EINVAL, 0, 0);
> +
> +	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 2, 0, -EINVAL, 0, 0);
> +
> +	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 2 + 1, 0,
> +			-EINVAL, 0, 0);
> +
> +	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 42, 0, -EINVAL, 0, 0);
> +
> +	/*
> +	 * This can actually work if this test running in a MAX_PID_NS_LEVEL - 1
> +	 * nested PID namespace.
> +	 */
> +	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL - 1, 0, -EINVAL, 0, 0);
> +
> +	memset(&set_tid, 0xff, sizeof(set_tid));
> +	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL + 1, 0, -EINVAL, 0, 0);
> +
> +	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 2, 0, -EINVAL, 0, 0);
> +
> +	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 2 + 1, 0,
> +			-EINVAL, 0, 0);
> +
> +	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 42, 0, -EINVAL, 0, 0);
> +
> +	/*
> +	 * This can actually work if this test running in a MAX_PID_NS_LEVEL - 1
> +	 * nested PID namespace.
> +	 */
> +	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL - 1, 0, -EINVAL, 0, 0);
> +
> +	memset(&set_tid, 0, sizeof(set_tid));
> +	/* Try with an invalid PID */
> +	set_tid[0] = 0;
> +	test_clone3_set_tid(set_tid, 1, 0, -EINVAL, 0, 0);
> +
> +	set_tid[0] = -1;
> +	test_clone3_set_tid(set_tid, 1, 0, -EINVAL, 0, 0);
> +
> +	/* Claim that the set_tid array actually contains 2 elements. */
> +	test_clone3_set_tid(set_tid, 2, 0, -EINVAL, 0, 0);
> +
> +	/* Try it in a new PID namespace */
> +	if (uid == 0)
> +		test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EINVAL, 0, 0);
> +	else
> +		ksft_test_result_skip("Clone3() with set_tid requires root\n");
> +
> +	/* Try with a valid PID (1) this should return -EEXIST. */
> +	set_tid[0] = 1;
> +	if (uid == 0)
> +		test_clone3_set_tid(set_tid, 1, 0, -EEXIST, 0, 0);
> +	else
> +		ksft_test_result_skip("Clone3() with set_tid requires root\n");
> +
> +	/* Try it in a new PID namespace */
> +	if (uid == 0)
> +		test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, 0, 0, 0);
> +	else
> +		ksft_test_result_skip("Clone3() with set_tid requires root\n");
> +
> +	/* pid_max should fail everywhere */
> +	set_tid[0] = pid_max;
> +	test_clone3_set_tid(set_tid, 1, 0, -EINVAL, 0, 0);
> +
> +	if (uid == 0)
> +		test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EINVAL, 0, 0);
> +	else
> +		ksft_test_result_skip("Clone3() with set_tid requires root\n");
> +
> +	if (uid != 0) {
> +		/*
> +		 * All remaining tests require root. Tell the framework
> +		 * that all those tests are skipped as non-root.
> +		 */

can we split this test on two separate tests. It will be esier to read
and avoid fiddling with ksft_cnt.

> +		ksft_cnt.ksft_xskip += ksft_plan - ksft_test_num();
> +		goto out;
> +	}
> +
> +	/* Find the current active PID */
> +	pid = fork();
> +	if (pid == 0) {
> +		ksft_print_msg("Child has PID %d\n", getpid());
> +		_exit(EXIT_SUCCESS);
> +	}
> +	(void)wait(NULL);

I think we can check an return code here

> +
> +	/* After the child has finished, its PID should be free. */
> +	set_tid[0] = pid;
> +	test_clone3_set_tid(set_tid, 1, 0, 0, 0, 0);
> +
> +	/* This should fail as there is no PID 1 in that namespace */
> +	test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EINVAL, 0, 0);
> +
> +	/*
> +	 * Creating a process with PID 1 in the newly created most nested
> +	 * PID namespace and PID 'pid' in the parent PID namespace. This
> +	 * needs to work.
> +	 */
> +	set_tid[0] = 1;
> +	set_tid[1] = pid;
> +	test_clone3_set_tid(set_tid, 2, CLONE_NEWPID, 0, pid, 0);
> +
> +	ksft_print_msg("unshare PID namespace\n");
> +	if (unshare(CLONE_NEWPID) == -1)
> +		ksft_exit_fail_msg("unshare(CLONE_NEWPID) failed: %s\n",
> +				strerror(errno));
> +
> +	set_tid[0] = pid;
> +
> +	/* This should fail as there is no PID 1 in that namespace */
> +	test_clone3_set_tid(set_tid, 1, 0, -EINVAL, 0, 0);
> +
> +	/* Let's create a PID 1 */
> +	ns_pid = fork();
> +	if (ns_pid == 0) {
> +		ksft_print_msg("Child in PID namespace has PID %d\n", getpid());
> +		set_tid[0] = 2;
> +		test_clone3_set_tid(set_tid, 1, 0, 0, 2, 0);
> +
> +		set_tid[0] = 1;
> +		set_tid[1] = 42;
> +		set_tid[2] = pid;
> +		/*
> +		 * This should fail as there are not enough active PID
> +		 * namespaces. Again assuming this is running in the host's
> +		 * PID namespace. Not yet nested.
> +		 */
> +		test_clone3_set_tid(set_tid, 4, CLONE_NEWPID, -EINVAL, 0, 0);
> +
> +		/*
> +		 * This should work and from the parent we should see
> +		 * something like 'NSpid:	pid	42	1'.
> +		 */
> +		test_clone3_set_tid(set_tid, 3, CLONE_NEWPID, 0, 42, true);
> +
> +		_exit(ksft_cnt.ksft_pass);
> +	}
> +
> +	close(pipe_1[1]);
> +	close(pipe_2[0]);
> +	while (read(pipe_1[0], &buf, 1) > 0) {

If a child process will crash, this will be a busyloop.

> +		ksft_print_msg("[%d] Child is ready and waiting\n", getpid());
> +		break;
> +	}
> +
> +	snprintf(proc_path, sizeof(proc_path), "/proc/%d/status", pid);
> +	f = fopen(proc_path, "r");
> +	if (f == NULL)
> +		ksft_exit_fail_msg(
> +			"%s - Could not open %s\n",
> +			strerror(errno), proc_path);
> +
> +	while (getline(&line, &len, f) != -1) {
> +		if (strstr(line, "NSpid")) {
> +			int i;
> +
> +			/* Verify that all generated PIDs are as expected. */
> +			i = sscanf(line, "NSpid:\t%d\t%d\t%d",
> +				   &ns3, &ns2, &ns1);
> +			if (i != 3)

report a fail here and print the line?

> +				ns1 = ns2 = ns3 = 0;
> +			break;
> +		}
> +	}
> +	fclose(f);
> +	free(line);
> +	close(pipe_2[0]);
> +
> +	/* Tell the clone3()'d child to finish. */
> +	write(pipe_2[1], &buf, 1);
> +	close(pipe_2[1]);
> +
> +	if (wait(&status) < 0) {
> +		ksft_print_msg("Child returned %s\n", strerror(errno));
> +		ret = -errno;
> +		goto out;
> +	}
> +
> +	if (!WIFEXITED(status))
> +		ksft_test_result_fail("Child error\n");
> +
> +	if (WEXITSTATUS(status))
> +		/*
> +		 * Update the number of total tests with the tests from the
> +		 * child processes.
> +		 */
> +		ksft_cnt.ksft_pass = WEXITSTATUS(status);
> +
> +	if (ns3 == pid && ns2 == 42 && ns1 == 1)
> +		ksft_test_result_pass(
> +			"PIDs in all namespaces as expected (%d,%d,%d)\n",
> +			ns3, ns2, ns1);
> +	else
> +		ksft_test_result_fail(
> +			"PIDs in all namespaces not as expected (%d,%d,%d)\n",
> +			ns3, ns2, ns1);
> +out:
> +	ret = 0;
> +
> +	return !ret ? ksft_exit_pass() : ksft_exit_fail();
> +}
> -- 
> 2.23.0
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v10 2/2] selftests: add tests for clone3() with *set_tid
  2019-11-14 14:27 ` [PATCH v10 2/2] selftests: add tests for clone3() with *set_tid Adrian Reber
  2019-11-14 14:49   ` Christian Brauner
  2019-11-14 18:34   ` Andrei Vagin
@ 2019-11-14 19:05   ` Andrei Vagin
  2 siblings, 0 replies; 15+ messages in thread
From: Andrei Vagin @ 2019-11-14 19:05 UTC (permalink / raw)
  To: Adrian Reber
  Cc: Christian Brauner, Eric Biederman, Pavel Emelyanov, Jann Horn,
	Oleg Nesterov, Dmitry Safonov, Rasmus Villemoes, linux-kernel,
	Mike Rapoport, Radostin Stoyanov

On Thu, Nov 14, 2019 at 03:27:07PM +0100, Adrian Reber wrote:
> This tests clone3() with *set_tid to see if all desired PIDs are working
> as expected. The tests are trying multiple invalid input parameters as
> well as creating processes while specifying a certain PID in multiple
> PID namespaces at the same time.
> 
> Additionally this moves common clone3() test code into clone3_selftests.h.
> 
> Signed-off-by: Adrian Reber <areber@redhat.com>
> ---
> v9:
>  - applied all changes from Christian's review (except using the
>    NSpid: parsing code from selftests/pidfd/pidfd_fdinfo_test.c)
> 
> v10:
>  - added even more '\n' and include file fixes (Christian)
> ---
>  tools/testing/selftests/clone3/.gitignore     |   1 +
>  tools/testing/selftests/clone3/Makefile       |   2 +-
>  tools/testing/selftests/clone3/clone3.c       |   8 +-
>  .../selftests/clone3/clone3_clear_sighand.c   |  20 +-
>  .../selftests/clone3/clone3_selftests.h       |  33 ++
>  .../testing/selftests/clone3/clone3_set_tid.c | 357 ++++++++++++++++++
>  6 files changed, 395 insertions(+), 26 deletions(-)
>  create mode 100644 tools/testing/selftests/clone3/clone3_selftests.h
>  create mode 100644 tools/testing/selftests/clone3/clone3_set_tid.c
> 
> diff --git a/tools/testing/selftests/clone3/.gitignore b/tools/testing/selftests/clone3/.gitignore
> index 2a30ae18b06e..0dc4f32c6cb8 100644
> --- a/tools/testing/selftests/clone3/.gitignore
> +++ b/tools/testing/selftests/clone3/.gitignore
> @@ -1,2 +1,3 @@
>  clone3
>  clone3_clear_sighand
> +clone3_set_tid
> diff --git a/tools/testing/selftests/clone3/Makefile b/tools/testing/selftests/clone3/Makefile
> index eb26eb793c80..cf976c732906 100644
> --- a/tools/testing/selftests/clone3/Makefile
> +++ b/tools/testing/selftests/clone3/Makefile
> @@ -1,6 +1,6 @@
>  # SPDX-License-Identifier: GPL-2.0
>  CFLAGS += -g -I../../../../usr/include/
>  
> -TEST_GEN_PROGS := clone3 clone3_clear_sighand
> +TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid
>  
>  include ../lib.mk
> diff --git a/tools/testing/selftests/clone3/clone3.c b/tools/testing/selftests/clone3/clone3.c
> index 0f8a9ef40117..4669b3d418e7 100644
> --- a/tools/testing/selftests/clone3/clone3.c
> +++ b/tools/testing/selftests/clone3/clone3.c
> @@ -18,6 +18,7 @@
>  #include <sched.h>
>  
>  #include "../kselftest.h"
> +#include "clone3_selftests.h"
>  
>  /*
>   * Different sizes of struct clone_args
> @@ -35,11 +36,6 @@ enum test_mode {
>  	CLONE3_ARGS_INVAL_EXIT_SIGNAL_NSIG,
>  };
>  
> -static pid_t raw_clone(struct clone_args *args, size_t size)
> -{
> -	return syscall(__NR_clone3, args, size);
> -}
> -
>  static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
>  {
>  	struct clone_args args = {
> @@ -83,7 +79,7 @@ static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
>  
>  	memcpy(&args_ext.args, &args, sizeof(struct clone_args));
>  
> -	pid = raw_clone((struct clone_args *)&args_ext, size);
> +	pid = sys_clone3((struct clone_args *)&args_ext, size);
>  	if (pid < 0) {
>  		ksft_print_msg("%s - Failed to create new process\n",
>  				strerror(errno));
> diff --git a/tools/testing/selftests/clone3/clone3_clear_sighand.c b/tools/testing/selftests/clone3/clone3_clear_sighand.c
> index 0d957be1bdc5..456783ad19d6 100644
> --- a/tools/testing/selftests/clone3/clone3_clear_sighand.c
> +++ b/tools/testing/selftests/clone3/clone3_clear_sighand.c
> @@ -14,30 +14,12 @@
>  #include <sys/wait.h>
>  
>  #include "../kselftest.h"
> +#include "clone3_selftests.h"
>  
>  #ifndef CLONE_CLEAR_SIGHAND
>  #define CLONE_CLEAR_SIGHAND 0x100000000ULL
>  #endif
>  
> -#ifndef __NR_clone3
> -#define __NR_clone3 -1
> -struct clone_args {
> -	__aligned_u64 flags;
> -	__aligned_u64 pidfd;
> -	__aligned_u64 child_tid;
> -	__aligned_u64 parent_tid;
> -	__aligned_u64 exit_signal;
> -	__aligned_u64 stack;
> -	__aligned_u64 stack_size;
> -	__aligned_u64 tls;
> -};
> -#endif
> -
> -static pid_t sys_clone3(struct clone_args *args, size_t size)
> -{
> -	return syscall(__NR_clone3, args, size);
> -}
> -
>  static void test_clone3_supported(void)
>  {
>  	pid_t pid;
> diff --git a/tools/testing/selftests/clone3/clone3_selftests.h b/tools/testing/selftests/clone3/clone3_selftests.h
> new file mode 100644
> index 000000000000..988c194d850d
> --- /dev/null
> +++ b/tools/testing/selftests/clone3/clone3_selftests.h
> @@ -0,0 +1,33 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef _CLONE3_SELFTESTS_H
> +#define _CLONE3_SELFTESTS_H
> +
> +#define _GNU_SOURCE
> +#include <sched.h>
> +#include <stdint.h>
> +#include <syscall.h>
> +#include <linux/types.h>
> +
> +#define ptr_to_u64(ptr) ((__u64)((uintptr_t)(ptr)))
> +
> +#ifndef __NR_clone3
> +#define __NR_clone3 -1
> +struct clone_args {
> +	__aligned_u64 flags;
> +	__aligned_u64 pidfd;
> +	__aligned_u64 child_tid;
> +	__aligned_u64 parent_tid;
> +	__aligned_u64 exit_signal;
> +	__aligned_u64 stack;
> +	__aligned_u64 stack_size;
> +	__aligned_u64 tls;

$ make
gcc -g -I../../../../usr/include/    clone3_set_tid.c  -o /mnt/tools/testing/selftests/clone3/clone3_set_tid
clone3_set_tid.c: In function ‘call_clone3_set_tid’:
clone3_set_tid.c:45:4: error: ‘struct clone_args’ has no member named ‘set_tid’; did you mean ‘parent_tid’?
   45 |   .set_tid = ptr_to_u64(set_tid),
      |    ^~~~~~~
      |    parent_tid
clone3_set_tid.c:46:4: error: ‘struct clone_args’ has no member named ‘set_tid_size’
   46 |   .set_tid_size = set_tid_size,
      |    ^~~~~~~~~~~~
make: *** [../lib.mk:138: /mnt/tools/testing/selftests/clone3/clone3_set_tid] Error 1


> +};
> +#endif
> +
> +static pid_t sys_clone3(struct clone_args *args, size_t size)
> +{
> +	return syscall(__NR_clone3, args, size);
> +}
> +
> +#endif /* _CLONE3_SELFTESTS_H */
> diff --git a/tools/testing/selftests/clone3/clone3_set_tid.c b/tools/testing/selftests/clone3/clone3_set_tid.c
> new file mode 100644
> index 000000000000..f2bf2e7a0ee6
> --- /dev/null
> +++ b/tools/testing/selftests/clone3/clone3_set_tid.c
> @@ -0,0 +1,357 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Based on Christian Brauner's clone3() example.
> + * These tests are assuming to be running in the host's
> + * PID namespace.
> + */
> +
> +#define _GNU_SOURCE
> +#include <errno.h>
> +#include <linux/types.h>
> +#include <linux/sched.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <stdbool.h>
> +#include <sys/syscall.h>
> +#include <sys/types.h>
> +#include <sys/un.h>
> +#include <sys/wait.h>
> +#include <unistd.h>
> +#include <sched.h>
> +
> +#include "../kselftest.h"
> +#include "clone3_selftests.h"
> +
> +#ifndef MAX_PID_NS_LEVEL
> +#define MAX_PID_NS_LEVEL 32
> +#endif
> +
> +static int pipe_1[2];
> +static int pipe_2[2];
> +
> +static int call_clone3_set_tid(pid_t *set_tid,
> +			       size_t set_tid_size,
> +			       int flags,
> +			       int expected_pid,
> +			       bool wait_for_it)
> +{
> +	int status;
> +	pid_t pid = -1;
> +
> +	struct clone_args args = {
> +		.flags = flags,
> +		.exit_signal = SIGCHLD,
> +		.set_tid = ptr_to_u64(set_tid),
> +		.set_tid_size = set_tid_size,
> +	};
> +
> +	pid = sys_clone3(&args, sizeof(struct clone_args));
> +	if (pid < 0) {
> +		ksft_print_msg("%s - Failed to create new process\n",
> +			       strerror(errno));
> +		return -errno;
> +	}
> +
> +	if (pid == 0) {
> +		char tmp = 0;
> +
> +		ksft_print_msg("I am the child, my PID is %d (expected %d)\n",
> +			       getpid(), set_tid[0]);
> +		if (wait_for_it) {
> +			ksft_print_msg("[%d] Child is ready and waiting\n",
> +				       getpid());
> +
> +			/* Signal the parent that the child is ready */
> +			close(pipe_1[0]);
> +			write(pipe_1[1], &tmp, 1);
> +			close(pipe_1[1]);
> +			close(pipe_2[1]);
> +			read(pipe_2[0], &tmp, 1);
> +			close(pipe_2[0]);
> +		}
> +
> +		if (set_tid[0] != getpid())
> +			_exit(EXIT_FAILURE);
> +		_exit(EXIT_SUCCESS);
> +	}
> +
> +	if (expected_pid == 0 || expected_pid == pid) {
> +		ksft_print_msg("I am the parent (%d). My child's pid is %d\n",
> +			       getpid(), pid);
> +	} else {
> +		ksft_print_msg(
> +			"Expected child pid %d does not match actual pid %d\n",
> +			expected_pid, pid);
> +	}
> +
> +	if (wait(&status) < 0) {
> +		ksft_print_msg("Child returned %s\n", strerror(errno));
> +		return -errno;
> +	}
> +
> +	if (!WIFEXITED(status))
> +		return -1;
> +
> +	return WEXITSTATUS(status);
> +}
> +
> +static void test_clone3_set_tid(pid_t *set_tid,
> +				size_t set_tid_size,
> +				int flags,
> +				int expected,
> +				int expected_pid,
> +				bool wait_for_it)
> +{
> +	int ret;
> +
> +	ksft_print_msg(
> +		"[%d] Trying clone3() with CLONE_SET_TID to %d and 0x%x\n",
> +		getpid(), set_tid[0], flags);
> +	ret = call_clone3_set_tid(set_tid, set_tid_size, flags, expected_pid,
> +				  wait_for_it);
> +	ksft_print_msg(
> +		"[%d] clone3() with CLONE_SET_TID %d says :%d - expected %d\n",
> +		getpid(), set_tid[0], ret, expected);
> +	if (ret != expected)
> +		ksft_test_result_fail(
> +			"[%d] Result (%d) is different than expected (%d)\n",
> +			getpid(), ret, expected);
> +	else
> +		ksft_test_result_pass(
> +			"[%d] Result (%d) matches expectation (%d)\n",
> +			getpid(), ret, expected);
> +}
> +int main(int argc, char *argv[])
> +{
> +	FILE *f;
> +	char buf;
> +	char *line;
> +	int status;
> +	int ret = -1;
> +	size_t len = 0;
> +	int pid_max = 0;
> +	uid_t uid = getuid();
> +	char proc_path[100] = {0};
> +	pid_t pid, ns1, ns2, ns3, ns_pid;
> +	pid_t set_tid[MAX_PID_NS_LEVEL * 2];
> +
> +	if (pipe(pipe_1) < 0 || pipe(pipe_2) < 0)
> +		ksft_exit_fail_msg("pipe() failed\n");
> +
> +	ksft_print_header();
> +	ksft_set_plan(26);
> +
> +	f = fopen("/proc/sys/kernel/pid_max", "r");
> +	if (f == NULL)
> +		ksft_exit_fail_msg(
> +			"%s - Could not open /proc/sys/kernel/pid_max\n",
> +			strerror(errno));
> +	fscanf(f, "%d", &pid_max);
> +	fclose(f);
> +	ksft_print_msg("/proc/sys/kernel/pid_max %d\n", pid_max);
> +
> +	/* Try invalid settings */
> +	memset(&set_tid, 0, sizeof(set_tid));
> +	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL + 1, 0, -EINVAL, 0, 0);
> +
> +	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 2, 0, -EINVAL, 0, 0);
> +
> +	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 2 + 1, 0,
> +			-EINVAL, 0, 0);
> +
> +	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 42, 0, -EINVAL, 0, 0);
> +
> +	/*
> +	 * This can actually work if this test running in a MAX_PID_NS_LEVEL - 1
> +	 * nested PID namespace.
> +	 */
> +	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL - 1, 0, -EINVAL, 0, 0);
> +
> +	memset(&set_tid, 0xff, sizeof(set_tid));
> +	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL + 1, 0, -EINVAL, 0, 0);
> +
> +	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 2, 0, -EINVAL, 0, 0);
> +
> +	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 2 + 1, 0,
> +			-EINVAL, 0, 0);
> +
> +	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 42, 0, -EINVAL, 0, 0);
> +
> +	/*
> +	 * This can actually work if this test running in a MAX_PID_NS_LEVEL - 1
> +	 * nested PID namespace.
> +	 */
> +	test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL - 1, 0, -EINVAL, 0, 0);
> +
> +	memset(&set_tid, 0, sizeof(set_tid));
> +	/* Try with an invalid PID */
> +	set_tid[0] = 0;
> +	test_clone3_set_tid(set_tid, 1, 0, -EINVAL, 0, 0);
> +
> +	set_tid[0] = -1;
> +	test_clone3_set_tid(set_tid, 1, 0, -EINVAL, 0, 0);
> +
> +	/* Claim that the set_tid array actually contains 2 elements. */
> +	test_clone3_set_tid(set_tid, 2, 0, -EINVAL, 0, 0);
> +
> +	/* Try it in a new PID namespace */
> +	if (uid == 0)
> +		test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EINVAL, 0, 0);
> +	else
> +		ksft_test_result_skip("Clone3() with set_tid requires root\n");
> +
> +	/* Try with a valid PID (1) this should return -EEXIST. */
> +	set_tid[0] = 1;
> +	if (uid == 0)
> +		test_clone3_set_tid(set_tid, 1, 0, -EEXIST, 0, 0);
> +	else
> +		ksft_test_result_skip("Clone3() with set_tid requires root\n");
> +
> +	/* Try it in a new PID namespace */
> +	if (uid == 0)
> +		test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, 0, 0, 0);
> +	else
> +		ksft_test_result_skip("Clone3() with set_tid requires root\n");
> +
> +	/* pid_max should fail everywhere */
> +	set_tid[0] = pid_max;
> +	test_clone3_set_tid(set_tid, 1, 0, -EINVAL, 0, 0);
> +
> +	if (uid == 0)
> +		test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EINVAL, 0, 0);
> +	else
> +		ksft_test_result_skip("Clone3() with set_tid requires root\n");
> +
> +	if (uid != 0) {
> +		/*
> +		 * All remaining tests require root. Tell the framework
> +		 * that all those tests are skipped as non-root.
> +		 */
> +		ksft_cnt.ksft_xskip += ksft_plan - ksft_test_num();
> +		goto out;
> +	}
> +
> +	/* Find the current active PID */
> +	pid = fork();
> +	if (pid == 0) {
> +		ksft_print_msg("Child has PID %d\n", getpid());
> +		_exit(EXIT_SUCCESS);
> +	}
> +	(void)wait(NULL);
> +
> +	/* After the child has finished, its PID should be free. */
> +	set_tid[0] = pid;
> +	test_clone3_set_tid(set_tid, 1, 0, 0, 0, 0);
> +
> +	/* This should fail as there is no PID 1 in that namespace */
> +	test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EINVAL, 0, 0);
> +
> +	/*
> +	 * Creating a process with PID 1 in the newly created most nested
> +	 * PID namespace and PID 'pid' in the parent PID namespace. This
> +	 * needs to work.
> +	 */
> +	set_tid[0] = 1;
> +	set_tid[1] = pid;
> +	test_clone3_set_tid(set_tid, 2, CLONE_NEWPID, 0, pid, 0);
> +
> +	ksft_print_msg("unshare PID namespace\n");
> +	if (unshare(CLONE_NEWPID) == -1)
> +		ksft_exit_fail_msg("unshare(CLONE_NEWPID) failed: %s\n",
> +				strerror(errno));
> +
> +	set_tid[0] = pid;
> +
> +	/* This should fail as there is no PID 1 in that namespace */
> +	test_clone3_set_tid(set_tid, 1, 0, -EINVAL, 0, 0);
> +
> +	/* Let's create a PID 1 */
> +	ns_pid = fork();
> +	if (ns_pid == 0) {
> +		ksft_print_msg("Child in PID namespace has PID %d\n", getpid());
> +		set_tid[0] = 2;
> +		test_clone3_set_tid(set_tid, 1, 0, 0, 2, 0);
> +
> +		set_tid[0] = 1;
> +		set_tid[1] = 42;
> +		set_tid[2] = pid;
> +		/*
> +		 * This should fail as there are not enough active PID
> +		 * namespaces. Again assuming this is running in the host's
> +		 * PID namespace. Not yet nested.
> +		 */
> +		test_clone3_set_tid(set_tid, 4, CLONE_NEWPID, -EINVAL, 0, 0);
> +
> +		/*
> +		 * This should work and from the parent we should see
> +		 * something like 'NSpid:	pid	42	1'.
> +		 */
> +		test_clone3_set_tid(set_tid, 3, CLONE_NEWPID, 0, 42, true);
> +
> +		_exit(ksft_cnt.ksft_pass);
> +	}
> +
> +	close(pipe_1[1]);
> +	close(pipe_2[0]);
> +	while (read(pipe_1[0], &buf, 1) > 0) {
> +		ksft_print_msg("[%d] Child is ready and waiting\n", getpid());
> +		break;
> +	}
> +
> +	snprintf(proc_path, sizeof(proc_path), "/proc/%d/status", pid);
> +	f = fopen(proc_path, "r");
> +	if (f == NULL)
> +		ksft_exit_fail_msg(
> +			"%s - Could not open %s\n",
> +			strerror(errno), proc_path);
> +
> +	while (getline(&line, &len, f) != -1) {
> +		if (strstr(line, "NSpid")) {
> +			int i;
> +
> +			/* Verify that all generated PIDs are as expected. */
> +			i = sscanf(line, "NSpid:\t%d\t%d\t%d",
> +				   &ns3, &ns2, &ns1);
> +			if (i != 3)
> +				ns1 = ns2 = ns3 = 0;
> +			break;
> +		}
> +	}
> +	fclose(f);
> +	free(line);
> +	close(pipe_2[0]);
> +
> +	/* Tell the clone3()'d child to finish. */
> +	write(pipe_2[1], &buf, 1);
> +	close(pipe_2[1]);
> +
> +	if (wait(&status) < 0) {
> +		ksft_print_msg("Child returned %s\n", strerror(errno));
> +		ret = -errno;
> +		goto out;
> +	}
> +
> +	if (!WIFEXITED(status))
> +		ksft_test_result_fail("Child error\n");
> +
> +	if (WEXITSTATUS(status))
> +		/*
> +		 * Update the number of total tests with the tests from the
> +		 * child processes.
> +		 */
> +		ksft_cnt.ksft_pass = WEXITSTATUS(status);
> +
> +	if (ns3 == pid && ns2 == 42 && ns1 == 1)
> +		ksft_test_result_pass(
> +			"PIDs in all namespaces as expected (%d,%d,%d)\n",
> +			ns3, ns2, ns1);
> +	else
> +		ksft_test_result_fail(
> +			"PIDs in all namespaces not as expected (%d,%d,%d)\n",
> +			ns3, ns2, ns1);
> +out:
> +	ret = 0;
> +
> +	return !ret ? ksft_exit_pass() : ksft_exit_fail();
> +}
> -- 
> 2.23.0
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v10 1/2] fork: extend clone3() to support setting a PID
  2019-11-14 14:27 [PATCH v10 1/2] fork: extend clone3() to support setting a PID Adrian Reber
                   ` (2 preceding siblings ...)
  2019-11-14 16:40 ` Christian Brauner
@ 2019-11-14 19:15 ` Andrei Vagin
  2019-11-15  9:34   ` Oleg Nesterov
  3 siblings, 1 reply; 15+ messages in thread
From: Andrei Vagin @ 2019-11-14 19:15 UTC (permalink / raw)
  To: Adrian Reber
  Cc: Christian Brauner, Eric Biederman, Pavel Emelyanov, Jann Horn,
	Oleg Nesterov, Dmitry Safonov, Rasmus Villemoes, linux-kernel,
	Mike Rapoport, Radostin Stoyanov

On Thu, Nov 14, 2019 at 03:27:06PM +0100, Adrian Reber wrote:
...
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 1d500ed03c63..2e649cfa07f4 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
...
> @@ -174,24 +186,51 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>  	pid->level = ns->level;
>  
>  	for (i = ns->level; i >= 0; i--) {
> -		int pid_min = 1;
> +		int tid = 0;
> +
> +		if (set_tid_size) {
> +			tid = set_tid[ns->level - i];
> +			if (tid < 1 || tid >= pid_max)
> +				return ERR_PTR(-EINVAL);

do we need to release pids what have been allocated on previous levels?

nr = -EINVAL;
goto out_free;

Could you add a test for this case?


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v10 1/2] fork: extend clone3() to support setting a PID
  2019-11-14 19:15 ` Andrei Vagin
@ 2019-11-15  9:34   ` Oleg Nesterov
  2019-11-15  9:58     ` Christian Brauner
  0 siblings, 1 reply; 15+ messages in thread
From: Oleg Nesterov @ 2019-11-15  9:34 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Adrian Reber, Christian Brauner, Eric Biederman, Pavel Emelyanov,
	Jann Horn, Dmitry Safonov, Rasmus Villemoes, linux-kernel,
	Mike Rapoport, Radostin Stoyanov

On 11/14, Andrei Vagin wrote:
>
> On Thu, Nov 14, 2019 at 03:27:06PM +0100, Adrian Reber wrote:
> ...
> > diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> > index 1d500ed03c63..2e649cfa07f4 100644
> > --- a/include/uapi/linux/sched.h
> > +++ b/include/uapi/linux/sched.h
> ...
> > @@ -174,24 +186,51 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> >  	pid->level = ns->level;
> >
> >  	for (i = ns->level; i >= 0; i--) {
> > -		int pid_min = 1;
> > +		int tid = 0;
> > +
> > +		if (set_tid_size) {
> > +			tid = set_tid[ns->level - i];
> > +			if (tid < 1 || tid >= pid_max)
> > +				return ERR_PTR(-EINVAL);
>
> do we need to release pids what have been allocated on previous levels?

Heh ;) it is really amazing that nobody noticed this! Thanks Andrei.

> nr = -EINVAL;

retval = -EINVAL;

> goto out_free;

Oleg.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v10 1/2] fork: extend clone3() to support setting a PID
  2019-11-15  9:34   ` Oleg Nesterov
@ 2019-11-15  9:58     ` Christian Brauner
  2019-11-15 10:49       ` Oleg Nesterov
  0 siblings, 1 reply; 15+ messages in thread
From: Christian Brauner @ 2019-11-15  9:58 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrei Vagin, Adrian Reber, Eric Biederman, Pavel Emelyanov,
	Jann Horn, Dmitry Safonov, Rasmus Villemoes, linux-kernel,
	Mike Rapoport, Radostin Stoyanov

On Fri, Nov 15, 2019 at 10:34:20AM +0100, Oleg Nesterov wrote:
> On 11/14, Andrei Vagin wrote:
> >
> > On Thu, Nov 14, 2019 at 03:27:06PM +0100, Adrian Reber wrote:
> > ...
> > > diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> > > index 1d500ed03c63..2e649cfa07f4 100644
> > > --- a/include/uapi/linux/sched.h
> > > +++ b/include/uapi/linux/sched.h
> > ...
> > > @@ -174,24 +186,51 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> > >  	pid->level = ns->level;
> > >
> > >  	for (i = ns->level; i >= 0; i--) {
> > > -		int pid_min = 1;
> > > +		int tid = 0;
> > > +
> > > +		if (set_tid_size) {
> > > +			tid = set_tid[ns->level - i];
> > > +			if (tid < 1 || tid >= pid_max)
> > > +				return ERR_PTR(-EINVAL);
> >
> > do we need to release pids what have been allocated on previous levels?
> 
> Heh ;) it is really amazing that nobody noticed this! Thanks Andrei.
> 
> > nr = -EINVAL;
> 
> retval = -EINVAL;
> 
> > goto out_free;

How do we feel about moving this into a separate helper like below?
Keeps the ugliness out of alloc_pid() itself.

Christian

diff --git a/kernel/pid.c b/kernel/pid.c
index eb32668997c6..d3dfd1bbebaf 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -157,6 +157,31 @@ void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
+static int set_tid_next(pid_t *set_tid, size_t *size, int idx)
+{
+	int tid = 0;
+
+	if (*size) {
+		tid = set_tid[idx];
+		if (tid < 1 || tid >= pid_max)
+			return -EINVAL;
+
+		/*
+		 * Also fail if a PID != 1 is requested and
+		 * no PID 1 exists.
+		 */
+		if (tid != 1 && !tmp->child_reaper)
+			return -EINVAL;
+
+		if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
+			return -EPERM;
+
+		(*size)--;
+	}
+
+	return tid;
+}
+
 struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 		      size_t set_tid_size)
 {
@@ -188,20 +213,10 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 	for (i = ns->level; i >= 0; i--) {
 		int tid = 0;
 
-		if (set_tid_size) {
-			tid = set_tid[ns->level - i];
-			if (tid < 1 || tid >= pid_max)
-				return ERR_PTR(-EINVAL);
-			/*
-			 * Also fail if a PID != 1 is requested and
-			 * no PID 1 exists.
-			 */
-			if (tid != 1 && !tmp->child_reaper)
-				return ERR_PTR(-EINVAL);
-			if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
-				return ERR_PTR(-EPERM);
-			set_tid_size--;
-		}
+		retval = set_tid_next(set_tid, &set_tid_size, ns->level - i);
+		if (retval < 0)
+			goto out_free;
+		tid = retval;
 
 		idr_preload(GFP_KERNEL);
 		spin_lock_irq(&pidmap_lock);

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v10 1/2] fork: extend clone3() to support setting a PID
  2019-11-15  9:58     ` Christian Brauner
@ 2019-11-15 10:49       ` Oleg Nesterov
  2019-11-15 13:08         ` Christian Brauner
  0 siblings, 1 reply; 15+ messages in thread
From: Oleg Nesterov @ 2019-11-15 10:49 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andrei Vagin, Adrian Reber, Eric Biederman, Pavel Emelyanov,
	Jann Horn, Dmitry Safonov, Rasmus Villemoes, linux-kernel,
	Mike Rapoport, Radostin Stoyanov

On 11/15, Christian Brauner wrote:
>
> +static int set_tid_next(pid_t *set_tid, size_t *size, int idx)
> +{
> +	int tid = 0;
> +
> +	if (*size) {
> +		tid = set_tid[idx];
> +		if (tid < 1 || tid >= pid_max)
> +			return -EINVAL;
> +
> +		/*
> +		 * Also fail if a PID != 1 is requested and
> +		 * no PID 1 exists.
> +		 */
> +		if (tid != 1 && !tmp->child_reaper)
> +			return -EINVAL;
> +
> +		if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
> +			return -EPERM;
> +
> +		(*size)--;
> +	}

this needs more args, struct pid_namespace *tmp + pid_t pid_max

> +
> +	return tid;
> +}
> +
>  struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>  		      size_t set_tid_size)
>  {
> @@ -188,20 +213,10 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>  	for (i = ns->level; i >= 0; i--) {
>  		int tid = 0;
>  
> -		if (set_tid_size) {
> -			tid = set_tid[ns->level - i];
> -			if (tid < 1 || tid >= pid_max)
> -				return ERR_PTR(-EINVAL);
> -			/*
> -			 * Also fail if a PID != 1 is requested and
> -			 * no PID 1 exists.
> -			 */
> -			if (tid != 1 && !tmp->child_reaper)
> -				return ERR_PTR(-EINVAL);
> -			if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
> -				return ERR_PTR(-EPERM);
> -			set_tid_size--;
> -		}
> +		retval = set_tid_next(set_tid, &set_tid_size, ns->level - i);
> +		if (retval < 0)
> +			goto out_free;
> +		tid = retval;

Well, if we add a helper then

	static inline int check_tid(tid, max, ns)
	{
		if (tid < 1 || tid >= max)
			return ERR_PTR(-EINVAL);
		/*
		 * Also fail if a PID != 1 is requested and
		 * no PID 1 exists.
		 */
		if (tid != 1 && !ns->child_reaper)
			return ERR_PTR(-EINVAL);
		if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN))
			return ERR_PTR(-EPERM);
		return 0;
	}

	... alloc_pid() ...

		if (set_tid_size) {
			tid = set_tid[ns->level - i];
			retval = check_tid(tid, pid_max, tmp);
			if (retval)
				goto out_free;
			set_tid_size--;
		}

looks more clean to me. But still ugly. IMO,

		if (set_tid_size) {
			tid = set_tid[ns->level - i];

			retval = -EINVAL;
			if (tid < 1 || tid >= pid_max)
				goto out_free;
			/*
			 * Also fail if a PID != 1 is requested and
			 * no PID 1 exists.
			 */
			if (tid != 1 && !tmp->child_reaper)
				goto out_free;
			retval = -EPERM;
			if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
				goto out_free;
			set_tid_size--;
		}

makes more sense.

Oleg.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v10 1/2] fork: extend clone3() to support setting a PID
  2019-11-15 10:49       ` Oleg Nesterov
@ 2019-11-15 13:08         ` Christian Brauner
  2019-11-16 22:55           ` Eric W. Biederman
  0 siblings, 1 reply; 15+ messages in thread
From: Christian Brauner @ 2019-11-15 13:08 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrei Vagin, Adrian Reber, Eric Biederman, Pavel Emelyanov,
	Jann Horn, Dmitry Safonov, Rasmus Villemoes, linux-kernel,
	Mike Rapoport, Radostin Stoyanov

On Fri, Nov 15, 2019 at 11:49:10AM +0100, Oleg Nesterov wrote:
> On 11/15, Christian Brauner wrote:
> >
> > +static int set_tid_next(pid_t *set_tid, size_t *size, int idx)
> > +{
> > +	int tid = 0;
> > +
> > +	if (*size) {
> > +		tid = set_tid[idx];
> > +		if (tid < 1 || tid >= pid_max)
> > +			return -EINVAL;
> > +
> > +		/*
> > +		 * Also fail if a PID != 1 is requested and
> > +		 * no PID 1 exists.
> > +		 */
> > +		if (tid != 1 && !tmp->child_reaper)
> > +			return -EINVAL;
> > +
> > +		if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
> > +			return -EPERM;
> > +
> > +		(*size)--;
> > +	}
> 
> this needs more args, struct pid_namespace *tmp + pid_t pid_max
> 		if (set_tid_size) {
> 			tid = set_tid[ns->level - i];
> 
> 			retval = -EINVAL;
> 			if (tid < 1 || tid >= pid_max)
> 				goto out_free;

I'm not a fan of this pattern of _not_ setting error codes in the actual
error path t but I won't object.

	Christian

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v10 2/2] selftests: add tests for clone3() with *set_tid
  2019-11-14 18:34   ` Andrei Vagin
@ 2019-11-15 15:14     ` Adrian Reber
  2019-11-15 18:39       ` Andrei Vagin
  0 siblings, 1 reply; 15+ messages in thread
From: Adrian Reber @ 2019-11-15 15:14 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Christian Brauner, Eric Biederman, Pavel Emelyanov, Jann Horn,
	Oleg Nesterov, Dmitry Safonov, Rasmus Villemoes, linux-kernel,
	Mike Rapoport, Radostin Stoyanov

On Thu, Nov 14, 2019 at 10:34:21AM -0800, Andrei Vagin wrote:
> > +		/*
> > +		 * This should work and from the parent we should see
> > +		 * something like 'NSpid:	pid	42	1'.
> > +		 */
> > +		test_clone3_set_tid(set_tid, 3, CLONE_NEWPID, 0, 42, true);
> > +
> > +		_exit(ksft_cnt.ksft_pass);
> > +	}
> > +
> > +	close(pipe_1[1]);
> > +	close(pipe_2[0]);
> > +	while (read(pipe_1[0], &buf, 1) > 0) {
> 
> If a child process will crash, this will be a busyloop.

No. To be honest the whole loop is unnecessary. If the loop is entered
there is immediately a break.

> > +		ksft_print_msg("[%d] Child is ready and waiting\n", getpid());
> > +		break;
> > +	}

If the loop is not entered it does not loop. So the whole thing is
useless. I think I was expecting it to block, but it doesn't work that
way.

> > +	snprintf(proc_path, sizeof(proc_path), "/proc/%d/status", pid);
> > +	f = fopen(proc_path, "r");
> > +	if (f == NULL)
> > +		ksft_exit_fail_msg(
> > +			"%s - Could not open %s\n",
> > +			strerror(errno), proc_path);

If the child does not exist anymore, the test will fail here and exit.

Besides this while() I tried to address all your comments in v11. Any
further comments on the test?

		Adrian


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v10 2/2] selftests: add tests for clone3() with *set_tid
  2019-11-15 15:14     ` Adrian Reber
@ 2019-11-15 18:39       ` Andrei Vagin
  0 siblings, 0 replies; 15+ messages in thread
From: Andrei Vagin @ 2019-11-15 18:39 UTC (permalink / raw)
  To: Adrian Reber
  Cc: Christian Brauner, Eric Biederman, Pavel Emelyanov, Jann Horn,
	Oleg Nesterov, Dmitry Safonov, Rasmus Villemoes, LKML,
	Mike Rapoport, Radostin Stoyanov

On Fri, Nov 15, 2019 at 7:14 AM Adrian Reber <areber@redhat.com> wrote:
>
> On Thu, Nov 14, 2019 at 10:34:21AM -0800, Andrei Vagin wrote:
>
> > > +   snprintf(proc_path, sizeof(proc_path), "/proc/%d/status", pid);
> > > +   f = fopen(proc_path, "r");
> > > +   if (f == NULL)
> > > +           ksft_exit_fail_msg(
> > > +                   "%s - Could not open %s\n",
> > > +                   strerror(errno), proc_path);
>
> If the child does not exist anymore, the test will fail here and exit.
>
> Besides this while() I tried to address all your comments in v11. Any
> further comments on the test?

No, I don't have any other comments.

Thanks,
Andrei

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v10 1/2] fork: extend clone3() to support setting a PID
  2019-11-15 13:08         ` Christian Brauner
@ 2019-11-16 22:55           ` Eric W. Biederman
  0 siblings, 0 replies; 15+ messages in thread
From: Eric W. Biederman @ 2019-11-16 22:55 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Oleg Nesterov, Andrei Vagin, Adrian Reber, Pavel Emelyanov,
	Jann Horn, Dmitry Safonov, Rasmus Villemoes, linux-kernel,
	Mike Rapoport, Radostin Stoyanov

Christian Brauner <christian.brauner@ubuntu.com> writes:

> On Fri, Nov 15, 2019 at 11:49:10AM +0100, Oleg Nesterov wrote:
>> On 11/15, Christian Brauner wrote:
>> >
>> > +static int set_tid_next(pid_t *set_tid, size_t *size, int idx)
>> > +{
>> > +	int tid = 0;
>> > +
>> > +	if (*size) {
>> > +		tid = set_tid[idx];
>> > +		if (tid < 1 || tid >= pid_max)
>> > +			return -EINVAL;
>> > +
>> > +		/*
>> > +		 * Also fail if a PID != 1 is requested and
>> > +		 * no PID 1 exists.
>> > +		 */
>> > +		if (tid != 1 && !tmp->child_reaper)
>> > +			return -EINVAL;
>> > +
>> > +		if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
>> > +			return -EPERM;
>> > +
>> > +		(*size)--;
>> > +	}
>> 
>> this needs more args, struct pid_namespace *tmp + pid_t pid_max
>> 		if (set_tid_size) {
>> 			tid = set_tid[ns->level - i];
>> 
>> 			retval = -EINVAL;
>> 			if (tid < 1 || tid >= pid_max)
>> 				goto out_free;
>
> I'm not a fan of this pattern of _not_ setting error codes in the actual
> error path t but I won't object.

If you can show a compiler that actually manages to reliably perform
that code motion it is worth discussing making it go away.  Last I
checked gcc will emit an extra basic block just to handle setting
the error code before jumping to the out_free.  That extra basic block
because of the extra jump tends to be costly.

Not a huge cost but in these days when branches are getting increasingly
expensive and Moore's law wrapping up I prefer code patterns that
generate cood code.

Eric

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2019-11-16 22:56 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-14 14:27 [PATCH v10 1/2] fork: extend clone3() to support setting a PID Adrian Reber
2019-11-14 14:27 ` [PATCH v10 2/2] selftests: add tests for clone3() with *set_tid Adrian Reber
2019-11-14 14:49   ` Christian Brauner
2019-11-14 18:34   ` Andrei Vagin
2019-11-15 15:14     ` Adrian Reber
2019-11-15 18:39       ` Andrei Vagin
2019-11-14 19:05   ` Andrei Vagin
2019-11-14 15:50 ` [PATCH v10 1/2] fork: extend clone3() to support setting a PID Oleg Nesterov
2019-11-14 16:40 ` Christian Brauner
2019-11-14 19:15 ` Andrei Vagin
2019-11-15  9:34   ` Oleg Nesterov
2019-11-15  9:58     ` Christian Brauner
2019-11-15 10:49       ` Oleg Nesterov
2019-11-15 13:08         ` Christian Brauner
2019-11-16 22:55           ` Eric W. Biederman

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.