linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v7 1/2] fork: extend clone3() to support setting a PID
@ 2019-11-11 13:17 Adrian Reber
  2019-11-11 13:17 ` [PATCH v7 2/2] selftests: add tests for clone3() Adrian Reber
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Adrian Reber @ 2019-11-11 13:17 UTC (permalink / raw)
  To: Christian Brauner, Eric Biederman, Pavel Emelyanov, Jann Horn,
	Oleg Nesterov, Dmitry Safonov
  Cc: linux-kernel, Andrei Vagin, Mike Rapoport, Radostin Stoyanov,
	Adrian Reber

The main motivation to add set_tid to clone3() is CRIU.

To restore a process with the same PID/TID CRIU currently uses
/proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
ns_last_pid and then (quickly) does a clone(). This works most of the
time, but it is racy. It is also slow as it requires multiple syscalls.

Extending clone3() to support *set_tid makes it possible restore a
process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
race free (as long as the desired PID/TID is available).

This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
on clone3() with *set_tid as they are currently in place for ns_last_pid.

The original version of this change was using a single value for
set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
decided to change set_tid to an array to enable setting the PID of a
process in multiple PID namespaces at the same time. If a process is
created in a PID namespace it is possible to influence the PID inside
and outside of the PID namespace. Details also in the corresponding
selftest.

To create a process with the following PIDs:

      PID NS level         Requested PID
        0 (host)              31496
        1                        42
        2                         1

For that example the two newly introduced parameters to struct
clone_args (set_tid and set_tid_size) would need to be:

  set_tid[0] = 1;
  set_tid[1] = 42;
  set_tid[2] = 31496;
  set_tid_size = 3;

If only the PIDs of the two innermost nested PID namespaces should be
defined it would look like this:

  set_tid[0] = 1;
  set_tid[1] = 42;
  set_tid_size = 2;

The PID of the newly created process would then be the next available
free PID in the PID namespace level 0 (host) and 42 in the PID namespace
at level 1 and the PID of the process in the innermost PID namespace
would be 1.

The set_tid array is used to specify the PID of a process starting
from the innermost nested PID namespaces up to set_tid_size PID namespaces.

set_tid_size cannot be larger then the current PID namespace level.

Signed-off-by: Adrian Reber <areber@redhat.com>
---
v2:
 - Removed (size < sizeof(struct clone_args)) as discussed with
   Christian and Dmitry
 - Added comment to ((set_tid != 1) && idr_get_cursor() <= 1) (Oleg)
 - Use idr_alloc() instead of idr_alloc_cyclic() (Oleg)

v3:
 - Return EEXIST if PID is already in use (Christian)
 - Drop CLONE_SET_TID (Christian and Oleg)
 - Use idr_is_empty() instead of idr_get_cursor() (Oleg)
 - Handle different `struct clone_args` sizes (Dmitry)

v4:
 - Rework struct size check with defines (Christian)
 - Reduce number of set_tid checks (Oleg)
 - Less parentheses and more robust code (Oleg)
 - Do ns_capable() on correct user_ns (Oleg, Christian)

v5:
 - make set_tid checks earlier in alloc_pid() (Christian)
 - remove unnecessary comment and struct size check (Christian)

v6:
 - remove CLONE_SET_TID from description (Christian)
 - add clone3() tests for different clone_args sizes (Christian)
 - move more set_tid checks to alloc_pid() (Oleg)
 - make all set_tid checks lockless (Oleg)

v7:
 - changed set_tid to be an array to set the PID of a process
   in multiple nested PID namespaces at the same time as discussed
   at LPC 2019 (container MC)
---
 include/linux/pid.h           |  3 ++-
 include/linux/pid_namespace.h |  2 ++
 include/linux/sched/task.h    |  3 +++
 include/uapi/linux/sched.h    |  2 ++
 kernel/fork.c                 | 20 +++++++++++++-
 kernel/pid.c                  | 50 ++++++++++++++++++++++++++++++-----
 kernel/pid_namespace.c        |  2 --
 7 files changed, 71 insertions(+), 11 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 9645b1194c98..034b7df25888 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -120,7 +120,8 @@ extern struct pid *find_vpid(int nr);
 extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
+			     size_t set_tid_size);
 extern void free_pid(struct pid *pid);
 extern void disable_pid_allocation(struct pid_namespace *ns);
 
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 49538b172483..2ed6af88794b 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -12,6 +12,8 @@
 #include <linux/ns_common.h>
 #include <linux/idr.h>
 
+/* MAX_PID_NS_LEVEL is needed for limiting size of 'struct pid' */
+#define MAX_PID_NS_LEVEL 32
 
 struct fs_pin;
 
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 4b1c3b664f51..f1879884238e 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -26,6 +26,9 @@ struct kernel_clone_args {
 	unsigned long stack;
 	unsigned long stack_size;
 	unsigned long tls;
+	pid_t *set_tid;
+	/* Number of elements in *set_tid */
+	size_t set_tid_size;
 };
 
 /*
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 25b4fa00bad1..13f74c40a629 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -72,6 +72,8 @@ struct clone_args {
 	__aligned_u64 stack;
 	__aligned_u64 stack_size;
 	__aligned_u64 tls;
+	__aligned_u64 set_tid;
+	__aligned_u64 set_tid_size;
 };
 #endif
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 55af6931c6ec..c70b9224997d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2026,7 +2026,8 @@ static __latent_entropy struct task_struct *copy_process(
 	stackleak_task_init(p);
 
 	if (pid != &init_struct_pid) {
-		pid = alloc_pid(p->nsproxy->pid_ns_for_children);
+		pid = alloc_pid(p->nsproxy->pid_ns_for_children, args->set_tid,
+				args->set_tid_size);
 		if (IS_ERR(pid)) {
 			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_thread;
@@ -2527,6 +2528,7 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
 					      struct clone_args __user *uargs,
 					      size_t usize)
 {
+	int i;
 	int err;
 	struct clone_args args;
 
@@ -2539,6 +2541,9 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
 	if (err)
 		return err;
 
+	if (unlikely(args.set_tid_size > MAX_PID_NS_LEVEL))
+		return -E2BIG;
+
 	/*
 	 * Verify that higher 32bits of exit_signal are unset and that
 	 * it is a valid signal
@@ -2556,8 +2561,17 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
 		.stack		= args.stack,
 		.stack_size	= args.stack_size,
 		.tls		= args.tls,
+		.set_tid	= kargs->set_tid,
+		.set_tid_size	= args.set_tid_size,
 	};
 
+	for (i = 0; i < args.set_tid_size; i++) {
+		if (copy_from_user(&kargs->set_tid[i],
+		    u64_to_user_ptr(args.set_tid + (i * sizeof(args.set_tid))),
+		    sizeof(pid_t)))
+			return -EFAULT;
+	}
+
 	return 0;
 }
 
@@ -2631,6 +2645,10 @@ SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
 	int err;
 
 	struct kernel_clone_args kargs;
+	pid_t set_tid[MAX_PID_NS_LEVEL];
+
+	memset(set_tid, 0, sizeof(set_tid));
+	kargs.set_tid = set_tid;
 
 	err = copy_clone_args_from_user(&kargs, uargs, size);
 	if (err)
diff --git a/kernel/pid.c b/kernel/pid.c
index 0a9f2e437217..82b3f91c131c 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -157,7 +157,8 @@ void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
+		      size_t set_tid_size)
 {
 	struct pid *pid;
 	enum pid_type type;
@@ -166,6 +167,17 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 	struct upid *upid;
 	int retval = -ENOMEM;
 
+	/*
+	 * set_tid_size contains the size of the set_tid array. Starting at
+	 * the most nested currently active PID namespace it tells alloc_pid()
+	 * which PID to set for a process in that most nested PID namespace
+	 * up to set_tid_size PID namespaces. It does not have to set the PID
+	 * for a process in all nested PID namespaces but set_tid_size must
+	 * never be greater than the current ns->level + 1.
+	 */
+	if (set_tid_size > ns->level + 1)
+		return ERR_PTR(-EINVAL);
+
 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
 	if (!pid)
 		return ERR_PTR(retval);
@@ -175,6 +187,18 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 	for (i = ns->level; i >= 0; i--) {
 		int pid_min = 1;
+		int t_pos = 0;
+
+		if (set_tid_size) {
+			t_pos = - (i - ns->level);
+			if (set_tid[t_pos] < 1 || set_tid[t_pos] >= pid_max)
+				return ERR_PTR(-EINVAL);
+			/* Also fail if a PID != 1 is requested and no PID 1 exists */
+			if (set_tid[t_pos] != 1 && !tmp->child_reaper)
+				return ERR_PTR(-EINVAL);
+			if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
+				return ERR_PTR(-EPERM);
+		}
 
 		idr_preload(GFP_KERNEL);
 		spin_lock_irq(&pidmap_lock);
@@ -186,12 +210,24 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 		if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
 			pid_min = RESERVED_PIDS;
 
-		/*
-		 * Store a null pointer so find_pid_ns does not find
-		 * a partially initialized PID (see below).
-		 */
-		nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
-				      pid_max, GFP_ATOMIC);
+		if (set_tid_size) {
+			nr = idr_alloc(&tmp->idr, NULL, set_tid[t_pos],
+				       set_tid[t_pos] + 1, GFP_ATOMIC);
+			/*
+			 * If ENOSPC is returned it means that the PID is
+			 * alreay in use. Return EEXIST in that case.
+			 */
+			if (nr == -ENOSPC)
+				nr = -EEXIST;
+			set_tid_size--;
+		} else {
+			/*
+			 * Store a null pointer so find_pid_ns does not find
+			 * a partially initialized PID (see below).
+			 */
+			nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
+					      pid_max, GFP_ATOMIC);
+		}
 		spin_unlock_irq(&pidmap_lock);
 		idr_preload_end();
 
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index a6a79f85c81a..d40017e79ebe 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -26,8 +26,6 @@
 
 static DEFINE_MUTEX(pid_caches_mutex);
 static struct kmem_cache *pid_ns_cachep;
-/* MAX_PID_NS_LEVEL is needed for limiting size of 'struct pid' */
-#define MAX_PID_NS_LEVEL 32
 /* Write once array, filled from the beginning. */
 static struct kmem_cache *pid_cache[MAX_PID_NS_LEVEL];
 
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2019-11-13  8:02 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-11 13:17 [PATCH v7 1/2] fork: extend clone3() to support setting a PID Adrian Reber
2019-11-11 13:17 ` [PATCH v7 2/2] selftests: add tests for clone3() Adrian Reber
2019-11-11 15:25 ` [PATCH v7 1/2] fork: extend clone3() to support setting a PID Oleg Nesterov
2019-11-11 15:40   ` Adrian Reber
2019-11-11 16:14     ` Christian Brauner
2019-11-11 16:32       ` Oleg Nesterov
2019-11-11 23:08       ` Eric W. Biederman
2019-11-12 10:24         ` Christian Brauner
2019-11-11 16:12 ` Christian Brauner
2019-11-11 20:41 ` Rasmus Villemoes
2019-11-12 15:26   ` Adrian Reber
2019-11-13  8:02   ` Adrian Reber

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).