* [PATCH v4 1/2] fork: extend clone3() to support CLONE_SET_TID @ 2019-08-08 21:22 Adrian Reber 2019-08-08 21:22 ` [PATCH v4 2/2] selftests: add tests for clone3() Adrian Reber 2019-08-10 1:10 ` [PATCH v4 1/2] fork: extend clone3() to support CLONE_SET_TID Christian Brauner 0 siblings, 2 replies; 7+ messages in thread From: Adrian Reber @ 2019-08-08 21:22 UTC (permalink / raw) To: Christian Brauner, Eric Biederman, Pavel Emelianov, Jann Horn, Oleg Nesterov, Dmitry Safonov Cc: linux-kernel, Andrei Vagin, Mike Rapoport, Radostin Stoyanov, Adrian Reber The main motivation to add set_tid to clone3() is CRIU. To restore a process with the same PID/TID CRIU currently uses /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to ns_last_pid and then (quickly) does a clone(). This works most of the time, but it is racy. It is also slow as it requires multiple syscalls. Extending clone3() to support set_tid makes it possible restore a process using CRIU without accessing /proc/sys/kernel/ns_last_pid and race free (as long as the desired PID/TID is available). This clone3() extension places the same restrictions (CAP_SYS_ADMIN) on clone3() with set_tid as they are currently in place for ns_last_pid. Signed-off-by: Adrian Reber <areber@redhat.com> --- v2: - Removed (size < sizeof(struct clone_args)) as discussed with Christian and Dmitry - Added comment to ((set_tid != 1) && idr_get_cursor() <= 1) (Oleg) - Use idr_alloc() instead of idr_alloc_cyclic() (Oleg) v3: - Return EEXIST if PID is already in use (Christian) - Drop CLONE_SET_TID (Christian and Oleg) - Use idr_is_empty() instead of idr_get_cursor() (Oleg) - Handle different `struct clone_args` sizes (Dmitry) v4: - Rework struct size check with defines (Christian) - Reduce number of set_tid checks (Oleg) - Less parentheses and more robust code (Oleg) - Do ns_capable() on correct user_ns (Oleg, Christian) --- include/linux/pid.h | 2 +- include/linux/sched/task.h | 1 + include/uapi/linux/sched.h | 1 + kernel/fork.c | 25 +++++++++++++++++++++++-- kernel/pid.c | 34 +++++++++++++++++++++++++++------- 5 files changed, 53 insertions(+), 10 deletions(-) diff --git a/include/linux/pid.h b/include/linux/pid.h index 2a83e434db9d..052000db0ced 100644 --- a/include/linux/pid.h +++ b/include/linux/pid.h @@ -116,7 +116,7 @@ extern struct pid *find_vpid(int nr); extern struct pid *find_get_pid(int nr); extern struct pid *find_ge_pid(int nr, struct pid_namespace *); -extern struct pid *alloc_pid(struct pid_namespace *ns); +extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t set_tid); extern void free_pid(struct pid *pid); extern void disable_pid_allocation(struct pid_namespace *ns); diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index 0497091e40c1..4f2a80564332 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -26,6 +26,7 @@ struct kernel_clone_args { unsigned long stack; unsigned long stack_size; unsigned long tls; + pid_t set_tid; }; /* diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h index b3105ac1381a..e1ce103a2c47 100644 --- a/include/uapi/linux/sched.h +++ b/include/uapi/linux/sched.h @@ -45,6 +45,7 @@ struct clone_args { __aligned_u64 stack; __aligned_u64 stack_size; __aligned_u64 tls; + __aligned_u64 set_tid; }; /* diff --git a/kernel/fork.c b/kernel/fork.c index 2852d0e76ea3..2a03f0e201e9 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -117,6 +117,13 @@ */ #define MAX_THREADS FUTEX_TID_MASK +/* + * Different sizes of struct clone_args + */ +#define CLONE3_ARGS_SIZE_V0 64 +/* V1 includes set_tid */ +#define CLONE3_ARGS_SIZE_V1 72 + /* * Protected counters by write_lock_irq(&tasklist_lock) */ @@ -2031,7 +2038,13 @@ static __latent_entropy struct task_struct *copy_process( stackleak_task_init(p); if (pid != &init_struct_pid) { - pid = alloc_pid(p->nsproxy->pid_ns_for_children); + if (args->set_tid && !ns_capable( + p->nsproxy->pid_ns_for_children->user_ns, + CAP_SYS_ADMIN)) { + retval = -EPERM; + goto bad_fork_cleanup_thread; + } + pid = alloc_pid(p->nsproxy->pid_ns_for_children, args->set_tid); if (IS_ERR(pid)) { retval = PTR_ERR(pid); goto bad_fork_cleanup_thread; @@ -2535,9 +2548,14 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs, if (unlikely(size > PAGE_SIZE)) return -E2BIG; - if (unlikely(size < sizeof(struct clone_args))) + /* The struct needs to be at least the size of the original struct. */ + if (unlikely(size < CLONE3_ARGS_SIZE_V0)) return -EINVAL; + if (size < sizeof(struct clone_args)) + memset((void *)&args + size, 0, + sizeof(struct clone_args) - size); + if (unlikely(!access_ok(uargs, size))) return -EFAULT; @@ -2573,6 +2591,9 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs, .tls = args.tls, }; + if (size >= CLONE3_ARGS_SIZE_V1) + kargs->set_tid = args.set_tid; + return 0; } diff --git a/kernel/pid.c b/kernel/pid.c index 0a9f2e437217..9ce89c35c5be 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -157,7 +157,7 @@ void free_pid(struct pid *pid) call_rcu(&pid->rcu, delayed_put_pid); } -struct pid *alloc_pid(struct pid_namespace *ns) +struct pid *alloc_pid(struct pid_namespace *ns, int set_tid) { struct pid *pid; enum pid_type type; @@ -186,12 +186,32 @@ struct pid *alloc_pid(struct pid_namespace *ns) if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS) pid_min = RESERVED_PIDS; - /* - * Store a null pointer so find_pid_ns does not find - * a partially initialized PID (see below). - */ - nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min, - pid_max, GFP_ATOMIC); + if (set_tid) { + /* + * Also fail if a PID != 1 is requested + * and no PID 1 exists. + */ + nr = -EINVAL; + if (set_tid < pid_max && set_tid > 0 && + (set_tid == 1 || !idr_is_empty(&tmp->idr))) + nr = idr_alloc(&tmp->idr, NULL, set_tid, + set_tid + 1, GFP_ATOMIC); + /* + * If ENOSPC is returned it means that the PID is + * alreay in use. Return EEXIST in that case. + */ + if (nr == -ENOSPC) + nr = -EEXIST; + /* Only use set_tid for one PID namespace. */ + set_tid = 0; + } else { + /* + * Store a null pointer so find_pid_ns does not find + * a partially initialized PID (see below). + */ + nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min, + pid_max, GFP_ATOMIC); + } spin_unlock_irq(&pidmap_lock); idr_preload_end(); -- 2.21.0 ^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH v4 2/2] selftests: add tests for clone3() 2019-08-08 21:22 [PATCH v4 1/2] fork: extend clone3() to support CLONE_SET_TID Adrian Reber @ 2019-08-08 21:22 ` Adrian Reber 2019-08-10 1:10 ` [PATCH v4 1/2] fork: extend clone3() to support CLONE_SET_TID Christian Brauner 1 sibling, 0 replies; 7+ messages in thread From: Adrian Reber @ 2019-08-08 21:22 UTC (permalink / raw) To: Christian Brauner, Eric Biederman, Pavel Emelianov, Jann Horn, Oleg Nesterov, Dmitry Safonov Cc: linux-kernel, Andrei Vagin, Mike Rapoport, Radostin Stoyanov, Adrian Reber This tests clone3() with and without set_tid to see if all desired PIDs are working as expected. The test tries to clone3() with a set_tid of -1, 1, pid_max, a PID which is already in use and an unused PID. The same tests are also running in PID namespace. Signed-off-by: Adrian Reber <areber@redhat.com> --- tools/testing/selftests/clone3/.gitignore | 2 + tools/testing/selftests/clone3/Makefile | 11 ++ tools/testing/selftests/clone3/clone3.c | 141 +++++++++++++++ .../testing/selftests/clone3/clone3_set_tid.c | 161 ++++++++++++++++++ 4 files changed, 315 insertions(+) create mode 100644 tools/testing/selftests/clone3/.gitignore create mode 100644 tools/testing/selftests/clone3/Makefile create mode 100644 tools/testing/selftests/clone3/clone3.c create mode 100644 tools/testing/selftests/clone3/clone3_set_tid.c diff --git a/tools/testing/selftests/clone3/.gitignore b/tools/testing/selftests/clone3/.gitignore new file mode 100644 index 000000000000..c63c64a78ddf --- /dev/null +++ b/tools/testing/selftests/clone3/.gitignore @@ -0,0 +1,2 @@ +clone3_set_tid +clone3 diff --git a/tools/testing/selftests/clone3/Makefile b/tools/testing/selftests/clone3/Makefile new file mode 100644 index 000000000000..4efcf45b995b --- /dev/null +++ b/tools/testing/selftests/clone3/Makefile @@ -0,0 +1,11 @@ +# SPDX-License-Identifier: GPL-2.0 +uname_M := $(shell uname -m 2>/dev/null || echo not) +ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/) + +CFLAGS += -I../../../../usr/include/ + +ifeq ($(ARCH),x86_64) + TEST_GEN_PROGS := clone3 clone3_set_tid +endif + +include ../lib.mk diff --git a/tools/testing/selftests/clone3/clone3.c b/tools/testing/selftests/clone3/clone3.c new file mode 100644 index 000000000000..55a6915566b8 --- /dev/null +++ b/tools/testing/selftests/clone3/clone3.c @@ -0,0 +1,141 @@ +// SPDX-License-Identifier: GPL-2.0 + +/* Based on Christian Brauner's clone3() example */ + +#define _GNU_SOURCE +#include <errno.h> +#include <linux/types.h> +#include <linux/sched.h> +#include <stdio.h> +#include <stdlib.h> +#include <sys/syscall.h> +#include <sys/types.h> +#include <sys/un.h> +#include <sys/wait.h> +#include <unistd.h> +#include <sched.h> + +#include "../kselftest.h" + +static pid_t raw_clone(struct clone_args *args) +{ + return syscall(__NR_clone3, args, sizeof(struct clone_args)); +} + +static int call_clone3(int flags) +{ + struct clone_args args = {0}; + pid_t ppid = -1; + pid_t pid = -1; + int status; + + args.flags = flags; + args.exit_signal = SIGCHLD; + + pid = raw_clone(&args); + if (pid < 0) { + ksft_print_msg("%s - Failed to create new process\n", + strerror(errno)); + return -errno; + } + + if (pid == 0) { + ksft_print_msg("I am the child, my PID is %d\n", getpid()); + _exit(EXIT_SUCCESS); + } + + ppid = getpid(); + ksft_print_msg("I am the parent (%d). My child's pid is %d\n", + ppid, pid); + + (void)wait(&status); + if (WEXITSTATUS(status)) + return WEXITSTATUS(status); + + return 0; +} + +static int test_clone3(int flags, int expected) +{ + int ret; + + ksft_print_msg("[%d] Trying clone3() with flags 0x%x\n", + getpid(), flags); + ret = call_clone3(flags); + ksft_print_msg("[%d] clone3() with flags says :%d expected %d\n", + getpid(), ret, expected); + if (ret != expected) + ksft_exit_fail_msg( + "[%d] Result (%d) is different than expected (%d)\n", + getpid(), ret, expected); + ksft_test_result_pass("[%d] Result (%d) matches expectation (%d)\n", + getpid(), ret, expected); + return 0; +} +int main(int argc, char *argv[]) +{ + int ret = -1; + pid_t pid; + + ksft_print_header(); + ksft_set_plan(3); + + /* Just a simple clone3() should return 0.*/ + if (test_clone3(0, 0)) + goto on_error; + /* Do a clone3() in a new PID NS.*/ + if (test_clone3(CLONE_NEWPID, 0)) + goto on_error; + ksft_print_msg("First unshare\n"); + if (unshare(CLONE_NEWPID)) + goto on_error; + /* + * Before clone3()ing in a new PID NS with + * CLONE_NEWPID a fork() is necessary. + */ + if (test_clone3(CLONE_NEWPID, -EINVAL)) + goto on_error; + pid = fork(); + if (pid < 0) { + ksft_print_msg("First fork() failed\n"); + goto on_error; + } + if (pid > 0) { + (void)wait(NULL); + goto parent_out; + } + ksft_set_plan(6); + if (test_clone3(CLONE_NEWPID, 0)) + goto on_error; + if (test_clone3(0, 0)) + goto on_error; + ksft_print_msg("Second unshare\n"); + if (unshare(CLONE_NEWPID)) + goto on_error; + /* + * Before clone3()ing in a new PID NS with + * CLONE_NEWPID a fork() is necessary. + */ + if (test_clone3(CLONE_NEWPID, -EINVAL)) + goto on_error; + pid = fork(); + if (pid < 0) { + ksft_print_msg("Second fork() failed\n"); + goto on_error; + } + if (pid > 0) { + (void)wait(NULL); + goto parent_out; + } + ksft_set_plan(8); + if (test_clone3(CLONE_NEWPID, 0)) + goto on_error; + if (test_clone3(0, 0)) + goto on_error; + +parent_out: + ret = 0; +on_error: + + return !ret ? ksft_exit_pass() : ksft_exit_fail(); +} diff --git a/tools/testing/selftests/clone3/clone3_set_tid.c b/tools/testing/selftests/clone3/clone3_set_tid.c new file mode 100644 index 000000000000..f5012e84dcb3 --- /dev/null +++ b/tools/testing/selftests/clone3/clone3_set_tid.c @@ -0,0 +1,161 @@ +// SPDX-License-Identifier: GPL-2.0 + +/* Based on Christian Brauner's clone3() example */ + +#define _GNU_SOURCE +#include <errno.h> +#include <linux/types.h> +#include <linux/sched.h> +#include <stdio.h> +#include <stdlib.h> +#include <sys/syscall.h> +#include <sys/types.h> +#include <sys/un.h> +#include <sys/wait.h> +#include <unistd.h> +#include <sched.h> + +#include "../kselftest.h" + +static pid_t raw_clone(struct clone_args *args) +{ + return syscall(__NR_clone3, args, sizeof(struct clone_args)); +} + +static int call_clone3_set_tid(int set_tid, int flags) +{ + struct clone_args args = {0}; + pid_t ppid = -1; + pid_t pid = -1; + int status; + + args.flags = flags; + args.exit_signal = SIGCHLD; + args.set_tid = set_tid; + + pid = raw_clone(&args); + if (pid < 0) { + ksft_print_msg("%s - Failed to create new process\n", + strerror(errno)); + return -errno; + } + + if (pid == 0) { + ksft_print_msg("I am the child, my PID is %d (expected %d)\n", + getpid(), set_tid); + if (set_tid != getpid()) + _exit(EXIT_FAILURE); + _exit(EXIT_SUCCESS); + } + + ppid = getpid(); + ksft_print_msg("I am the parent (%d). My child's pid is %d\n", + ppid, pid); + + (void)wait(&status); + if (WEXITSTATUS(status)) + return WEXITSTATUS(status); + + return 0; +} + +static int test_clone3_set_tid(int set_tid, int flags, int expected) +{ + int ret; + + ksft_print_msg( + "[%d] Trying clone3() with CLONE_SET_TID to %d and 0x%x\n", + getpid(), set_tid, flags); + ret = call_clone3_set_tid(set_tid, flags); + ksft_print_msg( + "[%d] clone3() with CLONE_SET_TID %d says :%d - expected %d\n", + getpid(), set_tid, ret, expected); + if (ret != expected) + ksft_exit_fail_msg( + "[%d] Result (%d) is different than expected (%d)\n", + getpid(), ret, expected); + ksft_test_result_pass("[%d] Result (%d) matches expectation (%d)\n", + getpid(), ret, expected); + return 0; +} +int main(int argc, char *argv[]) +{ + FILE *f; + int pid_max = 0; + pid_t pid; + pid_t ns_pid; + int ret = -1; + + ksft_print_header(); + ksft_set_plan(13); + + f = fopen("/proc/sys/kernel/pid_max", "r"); + if (f == NULL) + ksft_exit_fail_msg( + "%s - Could not open /proc/sys/kernel/pid_max\n", + strerror(errno)); + fscanf(f, "%d", &pid_max); + fclose(f); + ksft_print_msg("/proc/sys/kernel/pid_max %d\n", pid_max); + + /* First try with an invalid PID */ + if (test_clone3_set_tid(-1, 0, -EINVAL)) + goto on_error; + if (test_clone3_set_tid(-1, CLONE_NEWPID, -EINVAL)) + goto on_error; + /* Then with PID 1 */ + if (test_clone3_set_tid(1, 0, -EEXIST)) + goto on_error; + /* PID 1 should not fail in a PID namespace */ + if (test_clone3_set_tid(1, CLONE_NEWPID, 0)) + goto on_error; + /* pid_max should fail everywhere */ + if (test_clone3_set_tid(pid_max, 0, -EINVAL)) + goto on_error; + if (test_clone3_set_tid(pid_max, CLONE_NEWPID, -EINVAL)) + goto on_error; + /* Find the current active PID */ + pid = fork(); + if (pid == 0) { + ksft_print_msg("Child has PID %d\n", getpid()); + sleep(1); + _exit(EXIT_SUCCESS); + } + /* Try to create a process with that PID should fail */ + if (test_clone3_set_tid(pid, 0, -EEXIST)) + goto on_error; + (void)wait(NULL); + /* After the child has finished, try again with the same PID */ + if (test_clone3_set_tid(pid, 0, 0)) + goto on_error; + /* This should fail as there is no PID 1 in that namespace */ + if (test_clone3_set_tid(pid, CLONE_NEWPID, -EINVAL)) + goto on_error; + unshare(CLONE_NEWPID); + if (test_clone3_set_tid(10, 0, -EINVAL)) + goto on_error; + /* Let's create a PID 1 */ + ns_pid = fork(); + if (ns_pid == 0) { + ksft_print_msg("Child in PID namespace has PID %d\n", getpid()); + sleep(1); + _exit(EXIT_SUCCESS); + } + /* + * Now, after the unshare() it should be possible to create a process + * with another ID than 1 in the PID namespace. + */ + if (test_clone3_set_tid(2, 0, 0)) + goto on_error; + /* Use a different PID in this namespace. */ + if (test_clone3_set_tid(2222, 0, 0)) + goto on_error; + if (test_clone3_set_tid(1, 0, -EEXIST)) + goto on_error; + (void)wait(NULL); + + ret = 0; +on_error: + + return !ret ? ksft_exit_pass() : ksft_exit_fail(); +} -- 2.21.0 ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH v4 1/2] fork: extend clone3() to support CLONE_SET_TID 2019-08-08 21:22 [PATCH v4 1/2] fork: extend clone3() to support CLONE_SET_TID Adrian Reber 2019-08-08 21:22 ` [PATCH v4 2/2] selftests: add tests for clone3() Adrian Reber @ 2019-08-10 1:10 ` Christian Brauner 2019-08-10 5:59 ` Adrian Reber 1 sibling, 1 reply; 7+ messages in thread From: Christian Brauner @ 2019-08-10 1:10 UTC (permalink / raw) To: Adrian Reber Cc: Eric Biederman, Pavel Emelianov, Jann Horn, Oleg Nesterov, Dmitry Safonov, linux-kernel, Andrei Vagin, Mike Rapoport, Radostin Stoyanov On Thu, Aug 08, 2019 at 11:22:21PM +0200, Adrian Reber wrote: > The main motivation to add set_tid to clone3() is CRIU. > > To restore a process with the same PID/TID CRIU currently uses > /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to > ns_last_pid and then (quickly) does a clone(). This works most of the > time, but it is racy. It is also slow as it requires multiple syscalls. > > Extending clone3() to support set_tid makes it possible restore a > process using CRIU without accessing /proc/sys/kernel/ns_last_pid and > race free (as long as the desired PID/TID is available). > > This clone3() extension places the same restrictions (CAP_SYS_ADMIN) > on clone3() with set_tid as they are currently in place for ns_last_pid. > > Signed-off-by: Adrian Reber <areber@redhat.com> > --- > v2: > - Removed (size < sizeof(struct clone_args)) as discussed with > Christian and Dmitry > - Added comment to ((set_tid != 1) && idr_get_cursor() <= 1) (Oleg) > - Use idr_alloc() instead of idr_alloc_cyclic() (Oleg) > > v3: > - Return EEXIST if PID is already in use (Christian) > - Drop CLONE_SET_TID (Christian and Oleg) > - Use idr_is_empty() instead of idr_get_cursor() (Oleg) > - Handle different `struct clone_args` sizes (Dmitry) > > v4: > - Rework struct size check with defines (Christian) > - Reduce number of set_tid checks (Oleg) > - Less parentheses and more robust code (Oleg) > - Do ns_capable() on correct user_ns (Oleg, Christian) > --- > include/linux/pid.h | 2 +- > include/linux/sched/task.h | 1 + > include/uapi/linux/sched.h | 1 + > kernel/fork.c | 25 +++++++++++++++++++++++-- > kernel/pid.c | 34 +++++++++++++++++++++++++++------- > 5 files changed, 53 insertions(+), 10 deletions(-) > > diff --git a/include/linux/pid.h b/include/linux/pid.h > index 2a83e434db9d..052000db0ced 100644 > --- a/include/linux/pid.h > +++ b/include/linux/pid.h > @@ -116,7 +116,7 @@ extern struct pid *find_vpid(int nr); > extern struct pid *find_get_pid(int nr); > extern struct pid *find_ge_pid(int nr, struct pid_namespace *); > > -extern struct pid *alloc_pid(struct pid_namespace *ns); > +extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t set_tid); > extern void free_pid(struct pid *pid); > extern void disable_pid_allocation(struct pid_namespace *ns); > > diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h > index 0497091e40c1..4f2a80564332 100644 > --- a/include/linux/sched/task.h > +++ b/include/linux/sched/task.h > @@ -26,6 +26,7 @@ struct kernel_clone_args { > unsigned long stack; > unsigned long stack_size; > unsigned long tls; > + pid_t set_tid; > }; > > /* > diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h > index b3105ac1381a..e1ce103a2c47 100644 > --- a/include/uapi/linux/sched.h > +++ b/include/uapi/linux/sched.h > @@ -45,6 +45,7 @@ struct clone_args { > __aligned_u64 stack; > __aligned_u64 stack_size; > __aligned_u64 tls; > + __aligned_u64 set_tid; > }; > > /* > diff --git a/kernel/fork.c b/kernel/fork.c > index 2852d0e76ea3..2a03f0e201e9 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -117,6 +117,13 @@ > */ > #define MAX_THREADS FUTEX_TID_MASK > > +/* > + * Different sizes of struct clone_args > + */ > +#define CLONE3_ARGS_SIZE_V0 64 > +/* V1 includes set_tid */ > +#define CLONE3_ARGS_SIZE_V1 72 > + > /* > * Protected counters by write_lock_irq(&tasklist_lock) > */ > @@ -2031,7 +2038,13 @@ static __latent_entropy struct task_struct *copy_process( > stackleak_task_init(p); > > if (pid != &init_struct_pid) { > - pid = alloc_pid(p->nsproxy->pid_ns_for_children); > + if (args->set_tid && !ns_capable( > + p->nsproxy->pid_ns_for_children->user_ns, > + CAP_SYS_ADMIN)) { > + retval = -EPERM; > + goto bad_fork_cleanup_thread; > + } > + pid = alloc_pid(p->nsproxy->pid_ns_for_children, args->set_tid); > if (IS_ERR(pid)) { > retval = PTR_ERR(pid); > goto bad_fork_cleanup_thread; > @@ -2535,9 +2548,14 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs, > if (unlikely(size > PAGE_SIZE)) > return -E2BIG; > > - if (unlikely(size < sizeof(struct clone_args))) > + /* The struct needs to be at least the size of the original struct. */ I don't think you need that comment. I think the macro is pretty self-explanatory. If you want it to be even clearer you could even make it CLONE3_ARGS_SIZE_MIN but V0 is good enough. :) > + if (unlikely(size < CLONE3_ARGS_SIZE_V0)) > return -EINVAL; > > + if (size < sizeof(struct clone_args)) > + memset((void *)&args + size, 0, > + sizeof(struct clone_args) - size); > + > if (unlikely(!access_ok(uargs, size))) > return -EFAULT; > > @@ -2573,6 +2591,9 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs, > .tls = args.tls, > }; > > + if (size >= CLONE3_ARGS_SIZE_V1) > + kargs->set_tid = args.set_tid; Hm, the if-condition is not needed though, right? At this point we will have already copied from struct clone_args __user *uargs into struct clone_args args. If we hit that codepath that means the kernel definitely has a field for set_tid in its struct clone_args. :) So this could probably just be: .tls = args.tls, .set_tid = args.set_tid, } ? > + > return 0; > } > > diff --git a/kernel/pid.c b/kernel/pid.c > index 0a9f2e437217..9ce89c35c5be 100644 > --- a/kernel/pid.c > +++ b/kernel/pid.c > @@ -157,7 +157,7 @@ void free_pid(struct pid *pid) > call_rcu(&pid->rcu, delayed_put_pid); > } > > -struct pid *alloc_pid(struct pid_namespace *ns) > +struct pid *alloc_pid(struct pid_namespace *ns, int set_tid) > { > struct pid *pid; > enum pid_type type; > @@ -186,12 +186,32 @@ struct pid *alloc_pid(struct pid_namespace *ns) > if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS) > pid_min = RESERVED_PIDS; > > - /* > - * Store a null pointer so find_pid_ns does not find > - * a partially initialized PID (see below). > - */ > - nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min, > - pid_max, GFP_ATOMIC); > + if (set_tid) { > + /* > + * Also fail if a PID != 1 is requested > + * and no PID 1 exists. > + */ > + nr = -EINVAL; > + if (set_tid < pid_max && set_tid > 0 && Hm, you're already in the if-branch hat verified if (set_tid) so the set_tid > 0 conjunct seems redundant. :) > + (set_tid == 1 || !idr_is_empty(&tmp->idr))) > + nr = idr_alloc(&tmp->idr, NULL, set_tid, > + set_tid + 1, GFP_ATOMIC); I'm confused, shouldn't this be if (set_tid < pid_max || (set_tid == 1 && !idr_is_emtpy(&tmp->idf))) > + /* > + * If ENOSPC is returned it means that the PID is > + * alreay in use. Return EEXIST in that case. > + */ > + if (nr == -ENOSPC) > + nr = -EEXIST; > + /* Only use set_tid for one PID namespace. */ > + set_tid = 0; > + } else { > + /* > + * Store a null pointer so find_pid_ns does not find > + * a partially initialized PID (see below). > + */ > + nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min, > + pid_max, GFP_ATOMIC); > + } > spin_unlock_irq(&pidmap_lock); > idr_preload_end(); > > -- > 2.21.0 > ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v4 1/2] fork: extend clone3() to support CLONE_SET_TID 2019-08-10 1:10 ` [PATCH v4 1/2] fork: extend clone3() to support CLONE_SET_TID Christian Brauner @ 2019-08-10 5:59 ` Adrian Reber 2019-08-11 6:51 ` Christian Brauner 0 siblings, 1 reply; 7+ messages in thread From: Adrian Reber @ 2019-08-10 5:59 UTC (permalink / raw) To: Christian Brauner Cc: Eric Biederman, Pavel Emelianov, Jann Horn, Oleg Nesterov, Dmitry Safonov, linux-kernel, Andrei Vagin, Mike Rapoport, Radostin Stoyanov On Sat, Aug 10, 2019 at 03:10:34AM +0200, Christian Brauner wrote: > On Thu, Aug 08, 2019 at 11:22:21PM +0200, Adrian Reber wrote: > > The main motivation to add set_tid to clone3() is CRIU. > > > > To restore a process with the same PID/TID CRIU currently uses > > /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to > > ns_last_pid and then (quickly) does a clone(). This works most of the > > time, but it is racy. It is also slow as it requires multiple syscalls. > > > > Extending clone3() to support set_tid makes it possible restore a > > process using CRIU without accessing /proc/sys/kernel/ns_last_pid and > > race free (as long as the desired PID/TID is available). > > > > This clone3() extension places the same restrictions (CAP_SYS_ADMIN) > > on clone3() with set_tid as they are currently in place for ns_last_pid. > > > > Signed-off-by: Adrian Reber <areber@redhat.com> > > --- > > v2: > > - Removed (size < sizeof(struct clone_args)) as discussed with > > Christian and Dmitry > > - Added comment to ((set_tid != 1) && idr_get_cursor() <= 1) (Oleg) > > - Use idr_alloc() instead of idr_alloc_cyclic() (Oleg) > > > > v3: > > - Return EEXIST if PID is already in use (Christian) > > - Drop CLONE_SET_TID (Christian and Oleg) > > - Use idr_is_empty() instead of idr_get_cursor() (Oleg) > > - Handle different `struct clone_args` sizes (Dmitry) > > > > v4: > > - Rework struct size check with defines (Christian) > > - Reduce number of set_tid checks (Oleg) > > - Less parentheses and more robust code (Oleg) > > - Do ns_capable() on correct user_ns (Oleg, Christian) > > --- > > include/linux/pid.h | 2 +- > > include/linux/sched/task.h | 1 + > > include/uapi/linux/sched.h | 1 + > > kernel/fork.c | 25 +++++++++++++++++++++++-- > > kernel/pid.c | 34 +++++++++++++++++++++++++++------- > > 5 files changed, 53 insertions(+), 10 deletions(-) > > > > diff --git a/include/linux/pid.h b/include/linux/pid.h > > index 2a83e434db9d..052000db0ced 100644 > > --- a/include/linux/pid.h > > +++ b/include/linux/pid.h > > @@ -116,7 +116,7 @@ extern struct pid *find_vpid(int nr); > > extern struct pid *find_get_pid(int nr); > > extern struct pid *find_ge_pid(int nr, struct pid_namespace *); > > > > -extern struct pid *alloc_pid(struct pid_namespace *ns); > > +extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t set_tid); > > extern void free_pid(struct pid *pid); > > extern void disable_pid_allocation(struct pid_namespace *ns); > > > > diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h > > index 0497091e40c1..4f2a80564332 100644 > > --- a/include/linux/sched/task.h > > +++ b/include/linux/sched/task.h > > @@ -26,6 +26,7 @@ struct kernel_clone_args { > > unsigned long stack; > > unsigned long stack_size; > > unsigned long tls; > > + pid_t set_tid; > > }; > > > > /* > > diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h > > index b3105ac1381a..e1ce103a2c47 100644 > > --- a/include/uapi/linux/sched.h > > +++ b/include/uapi/linux/sched.h > > @@ -45,6 +45,7 @@ struct clone_args { > > __aligned_u64 stack; > > __aligned_u64 stack_size; > > __aligned_u64 tls; > > + __aligned_u64 set_tid; > > }; > > > > /* > > diff --git a/kernel/fork.c b/kernel/fork.c > > index 2852d0e76ea3..2a03f0e201e9 100644 > > --- a/kernel/fork.c > > +++ b/kernel/fork.c > > @@ -117,6 +117,13 @@ > > */ > > #define MAX_THREADS FUTEX_TID_MASK > > > > +/* > > + * Different sizes of struct clone_args > > + */ > > +#define CLONE3_ARGS_SIZE_V0 64 > > +/* V1 includes set_tid */ > > +#define CLONE3_ARGS_SIZE_V1 72 > > + > > /* > > * Protected counters by write_lock_irq(&tasklist_lock) > > */ > > @@ -2031,7 +2038,13 @@ static __latent_entropy struct task_struct *copy_process( > > stackleak_task_init(p); > > > > if (pid != &init_struct_pid) { > > - pid = alloc_pid(p->nsproxy->pid_ns_for_children); > > + if (args->set_tid && !ns_capable( > > + p->nsproxy->pid_ns_for_children->user_ns, > > + CAP_SYS_ADMIN)) { > > + retval = -EPERM; > > + goto bad_fork_cleanup_thread; > > + } > > + pid = alloc_pid(p->nsproxy->pid_ns_for_children, args->set_tid); > > if (IS_ERR(pid)) { > > retval = PTR_ERR(pid); > > goto bad_fork_cleanup_thread; > > @@ -2535,9 +2548,14 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs, > > if (unlikely(size > PAGE_SIZE)) > > return -E2BIG; > > > > - if (unlikely(size < sizeof(struct clone_args))) > > + /* The struct needs to be at least the size of the original struct. */ > > I don't think you need that comment. I think the macro is pretty > self-explanatory. If you want it to be even clearer you could even make > it CLONE3_ARGS_SIZE_MIN but V0 is good enough. :) Will remove the comment. > > + if (unlikely(size < CLONE3_ARGS_SIZE_V0)) > > return -EINVAL; > > > > + if (size < sizeof(struct clone_args)) > > + memset((void *)&args + size, 0, > > + sizeof(struct clone_args) - size); > > + > > if (unlikely(!access_ok(uargs, size))) > > return -EFAULT; > > > > @@ -2573,6 +2591,9 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs, > > .tls = args.tls, > > }; > > > > + if (size >= CLONE3_ARGS_SIZE_V1) > > + kargs->set_tid = args.set_tid; > > Hm, the if-condition is not needed though, right? At this point we will > have already copied from struct clone_args __user *uargs into struct > clone_args args. If we hit that codepath that means the kernel > definitely has a field for set_tid in its struct clone_args. :) So this > could probably just be: > > .tls = args.tls, > .set_tid = args.set_tid, > } > > ? Right. > > + > > return 0; > > } > > > > diff --git a/kernel/pid.c b/kernel/pid.c > > index 0a9f2e437217..9ce89c35c5be 100644 > > --- a/kernel/pid.c > > +++ b/kernel/pid.c > > @@ -157,7 +157,7 @@ void free_pid(struct pid *pid) > > call_rcu(&pid->rcu, delayed_put_pid); > > } > > > > -struct pid *alloc_pid(struct pid_namespace *ns) > > +struct pid *alloc_pid(struct pid_namespace *ns, int set_tid) > > { > > struct pid *pid; > > enum pid_type type; > > @@ -186,12 +186,32 @@ struct pid *alloc_pid(struct pid_namespace *ns) > > if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS) > > pid_min = RESERVED_PIDS; > > > > - /* > > - * Store a null pointer so find_pid_ns does not find > > - * a partially initialized PID (see below). > > - */ > > - nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min, > > - pid_max, GFP_ATOMIC); > > + if (set_tid) { > > + /* > > + * Also fail if a PID != 1 is requested > > + * and no PID 1 exists. > > + */ > > + nr = -EINVAL; > > + if (set_tid < pid_max && set_tid > 0 && > > Hm, you're already in the if-branch hat verified if (set_tid) so the > set_tid > 0 conjunct seems redundant. :) Yes, but I dropped all checks to see if set_tid is negative as suggested by Oleg and moved it here. > > + (set_tid == 1 || !idr_is_empty(&tmp->idr))) > > + nr = idr_alloc(&tmp->idr, NULL, set_tid, > > + set_tid + 1, GFP_ATOMIC); > > I'm confused, shouldn't this be > > if (set_tid < pid_max || (set_tid == 1 && !idr_is_emtpy(&tmp->idf))) Now I am also confused ;). This does not work. This will always return true if set_tid is less than pid_max. So pid_max needs to be something like 1 for the check after || to make sense, right? But you really got me confused here right now. Right now I still think what I did is correct. > > + /* > > + * If ENOSPC is returned it means that the PID is > > + * alreay in use. Return EEXIST in that case. > > + */ > > + if (nr == -ENOSPC) > > + nr = -EEXIST; > > + /* Only use set_tid for one PID namespace. */ > > + set_tid = 0; > > + } else { > > + /* > > + * Store a null pointer so find_pid_ns does not find > > + * a partially initialized PID (see below). > > + */ > > + nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min, > > + pid_max, GFP_ATOMIC); > > + } > > spin_unlock_irq(&pidmap_lock); > > idr_preload_end(); > > > > -- > > 2.21.0 > > ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v4 1/2] fork: extend clone3() to support CLONE_SET_TID 2019-08-10 5:59 ` Adrian Reber @ 2019-08-11 6:51 ` Christian Brauner 2019-08-11 19:06 ` Christian Brauner 0 siblings, 1 reply; 7+ messages in thread From: Christian Brauner @ 2019-08-11 6:51 UTC (permalink / raw) To: Adrian Reber Cc: Eric Biederman, Pavel Emelianov, Jann Horn, Oleg Nesterov, Dmitry Safonov, linux-kernel, Andrei Vagin, Mike Rapoport, Radostin Stoyanov On Sat, Aug 10, 2019 at 07:59:18AM +0200, Adrian Reber wrote: > On Sat, Aug 10, 2019 at 03:10:34AM +0200, Christian Brauner wrote: > > On Thu, Aug 08, 2019 at 11:22:21PM +0200, Adrian Reber wrote: > > > The main motivation to add set_tid to clone3() is CRIU. > > > > > > To restore a process with the same PID/TID CRIU currently uses > > > /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to > > > ns_last_pid and then (quickly) does a clone(). This works most of the > > > time, but it is racy. It is also slow as it requires multiple syscalls. > > > > > > Extending clone3() to support set_tid makes it possible restore a > > > process using CRIU without accessing /proc/sys/kernel/ns_last_pid and > > > race free (as long as the desired PID/TID is available). > > > > > > This clone3() extension places the same restrictions (CAP_SYS_ADMIN) > > > on clone3() with set_tid as they are currently in place for ns_last_pid. > > > > > > Signed-off-by: Adrian Reber <areber@redhat.com> > > > --- > > > v2: > > > - Removed (size < sizeof(struct clone_args)) as discussed with > > > Christian and Dmitry > > > - Added comment to ((set_tid != 1) && idr_get_cursor() <= 1) (Oleg) > > > - Use idr_alloc() instead of idr_alloc_cyclic() (Oleg) > > > > > > v3: > > > - Return EEXIST if PID is already in use (Christian) > > > - Drop CLONE_SET_TID (Christian and Oleg) > > > - Use idr_is_empty() instead of idr_get_cursor() (Oleg) > > > - Handle different `struct clone_args` sizes (Dmitry) > > > > > > v4: > > > - Rework struct size check with defines (Christian) > > > - Reduce number of set_tid checks (Oleg) > > > - Less parentheses and more robust code (Oleg) > > > - Do ns_capable() on correct user_ns (Oleg, Christian) > > > --- > > > include/linux/pid.h | 2 +- > > > include/linux/sched/task.h | 1 + > > > include/uapi/linux/sched.h | 1 + > > > kernel/fork.c | 25 +++++++++++++++++++++++-- > > > kernel/pid.c | 34 +++++++++++++++++++++++++++------- > > > 5 files changed, 53 insertions(+), 10 deletions(-) > > > > > > diff --git a/include/linux/pid.h b/include/linux/pid.h > > > index 2a83e434db9d..052000db0ced 100644 > > > --- a/include/linux/pid.h > > > +++ b/include/linux/pid.h > > > @@ -116,7 +116,7 @@ extern struct pid *find_vpid(int nr); > > > extern struct pid *find_get_pid(int nr); > > > extern struct pid *find_ge_pid(int nr, struct pid_namespace *); > > > > > > -extern struct pid *alloc_pid(struct pid_namespace *ns); > > > +extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t set_tid); > > > extern void free_pid(struct pid *pid); > > > extern void disable_pid_allocation(struct pid_namespace *ns); > > > > > > diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h > > > index 0497091e40c1..4f2a80564332 100644 > > > --- a/include/linux/sched/task.h > > > +++ b/include/linux/sched/task.h > > > @@ -26,6 +26,7 @@ struct kernel_clone_args { > > > unsigned long stack; > > > unsigned long stack_size; > > > unsigned long tls; > > > + pid_t set_tid; > > > }; > > > > > > /* > > > diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h > > > index b3105ac1381a..e1ce103a2c47 100644 > > > --- a/include/uapi/linux/sched.h > > > +++ b/include/uapi/linux/sched.h > > > @@ -45,6 +45,7 @@ struct clone_args { > > > __aligned_u64 stack; > > > __aligned_u64 stack_size; > > > __aligned_u64 tls; > > > + __aligned_u64 set_tid; > > > }; > > > > > > /* > > > diff --git a/kernel/fork.c b/kernel/fork.c > > > index 2852d0e76ea3..2a03f0e201e9 100644 > > > --- a/kernel/fork.c > > > +++ b/kernel/fork.c > > > @@ -117,6 +117,13 @@ > > > */ > > > #define MAX_THREADS FUTEX_TID_MASK > > > > > > +/* > > > + * Different sizes of struct clone_args > > > + */ > > > +#define CLONE3_ARGS_SIZE_V0 64 > > > +/* V1 includes set_tid */ > > > +#define CLONE3_ARGS_SIZE_V1 72 > > > + > > > /* > > > * Protected counters by write_lock_irq(&tasklist_lock) > > > */ > > > @@ -2031,7 +2038,13 @@ static __latent_entropy struct task_struct *copy_process( > > > stackleak_task_init(p); > > > > > > if (pid != &init_struct_pid) { > > > - pid = alloc_pid(p->nsproxy->pid_ns_for_children); > > > + if (args->set_tid && !ns_capable( > > > + p->nsproxy->pid_ns_for_children->user_ns, > > > + CAP_SYS_ADMIN)) { > > > + retval = -EPERM; > > > + goto bad_fork_cleanup_thread; > > > + } > > > + pid = alloc_pid(p->nsproxy->pid_ns_for_children, args->set_tid); > > > if (IS_ERR(pid)) { > > > retval = PTR_ERR(pid); > > > goto bad_fork_cleanup_thread; > > > @@ -2535,9 +2548,14 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs, > > > if (unlikely(size > PAGE_SIZE)) > > > return -E2BIG; > > > > > > - if (unlikely(size < sizeof(struct clone_args))) > > > + /* The struct needs to be at least the size of the original struct. */ > > > > I don't think you need that comment. I think the macro is pretty > > self-explanatory. If you want it to be even clearer you could even make > > it CLONE3_ARGS_SIZE_MIN but V0 is good enough. :) > > Will remove the comment. > > > > + if (unlikely(size < CLONE3_ARGS_SIZE_V0)) > > > return -EINVAL; > > > > > > + if (size < sizeof(struct clone_args)) > > > + memset((void *)&args + size, 0, > > > + sizeof(struct clone_args) - size); > > > + > > > if (unlikely(!access_ok(uargs, size))) > > > return -EFAULT; > > > > > > @@ -2573,6 +2591,9 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs, > > > .tls = args.tls, > > > }; > > > > > > + if (size >= CLONE3_ARGS_SIZE_V1) > > > + kargs->set_tid = args.set_tid; > > > > Hm, the if-condition is not needed though, right? At this point we will > > have already copied from struct clone_args __user *uargs into struct > > clone_args args. If we hit that codepath that means the kernel > > definitely has a field for set_tid in its struct clone_args. :) So this > > could probably just be: > > > > .tls = args.tls, > > .set_tid = args.set_tid, > > } > > > > ? > > Right. > > > > + > > > return 0; > > > } > > > > > > diff --git a/kernel/pid.c b/kernel/pid.c > > > index 0a9f2e437217..9ce89c35c5be 100644 > > > --- a/kernel/pid.c > > > +++ b/kernel/pid.c > > > @@ -157,7 +157,7 @@ void free_pid(struct pid *pid) > > > call_rcu(&pid->rcu, delayed_put_pid); > > > } > > > > > > -struct pid *alloc_pid(struct pid_namespace *ns) > > > +struct pid *alloc_pid(struct pid_namespace *ns, int set_tid) > > > { > > > struct pid *pid; > > > enum pid_type type; > > > @@ -186,12 +186,32 @@ struct pid *alloc_pid(struct pid_namespace *ns) > > > if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS) > > > pid_min = RESERVED_PIDS; > > > > > > - /* > > > - * Store a null pointer so find_pid_ns does not find > > > - * a partially initialized PID (see below). > > > - */ > > > - nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min, > > > - pid_max, GFP_ATOMIC); > > > + if (set_tid) { > > > + /* > > > + * Also fail if a PID != 1 is requested > > > + * and no PID 1 exists. > > > + */ > > > + nr = -EINVAL; > > > + if (set_tid < pid_max && set_tid > 0 && > > > > Hm, you're already in the if-branch hat verified if (set_tid) so the > > set_tid > 0 conjunct seems redundant. :) > > Yes, but I dropped all checks to see if set_tid is negative as suggested > by Oleg and moved it here. > > > > + (set_tid == 1 || !idr_is_empty(&tmp->idr))) > > > + nr = idr_alloc(&tmp->idr, NULL, set_tid, > > > + set_tid + 1, GFP_ATOMIC); > > > > I'm confused, shouldn't this be > > > > if (set_tid < pid_max || (set_tid == 1 && !idr_is_emtpy(&tmp->idf))) > > Now I am also confused ;). This does not work. This will always return > true if set_tid is less than pid_max. So pid_max needs to be something > like 1 for the check after || to make sense, right? But you really got > me confused here right now. Right now I still think what I did is > correct. I missed the part where you reset set_tid to 0 below and mis-parsed the rightmost conjunct in the if statement. One thing I dislike is that you do if (set_tid) { if ([...] && set_tid > 0 && [...]) } For consistency this should rather be: if (set_tid) { if ([...] && set_tid && [...]) } Overall I'd rather write the second if as: if (set_tid && set_tid < pid_max && (set_tid == 1 || !idr_is_empty(&tmp->idr))) to place the most basic condition (i.e. set_tid is non-zero) leftmost. The right-most conjunct (set_tid == 1 || !idr_is_empty(&tmp->idr)) is non-trivial to parse. Someone not familiar with that code would need time to figure out why and how exactly 1 is special-cased. I think overall this is begging for alloc_pid() to be documented now if you don't mind. :) Christian ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v4 1/2] fork: extend clone3() to support CLONE_SET_TID 2019-08-11 6:51 ` Christian Brauner @ 2019-08-11 19:06 ` Christian Brauner 2019-08-11 20:11 ` Adrian Reber 0 siblings, 1 reply; 7+ messages in thread From: Christian Brauner @ 2019-08-11 19:06 UTC (permalink / raw) To: Adrian Reber Cc: Eric Biederman, Pavel Emelianov, Jann Horn, Oleg Nesterov, Dmitry Safonov, linux-kernel, Andrei Vagin, Mike Rapoport, Radostin Stoyanov On Sun, Aug 11, 2019 at 08:51:48AM +0200, Christian Brauner wrote: > On Sat, Aug 10, 2019 at 07:59:18AM +0200, Adrian Reber wrote: > > On Sat, Aug 10, 2019 at 03:10:34AM +0200, Christian Brauner wrote: > > > On Thu, Aug 08, 2019 at 11:22:21PM +0200, Adrian Reber wrote: > > > > The main motivation to add set_tid to clone3() is CRIU. > > > > > > > > To restore a process with the same PID/TID CRIU currently uses > > > > /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to > > > > ns_last_pid and then (quickly) does a clone(). This works most of the > > > > time, but it is racy. It is also slow as it requires multiple syscalls. > > > > > > > > Extending clone3() to support set_tid makes it possible restore a > > > > process using CRIU without accessing /proc/sys/kernel/ns_last_pid and > > > > race free (as long as the desired PID/TID is available). > > > > > > > > This clone3() extension places the same restrictions (CAP_SYS_ADMIN) > > > > on clone3() with set_tid as they are currently in place for ns_last_pid. > > > > > > > > Signed-off-by: Adrian Reber <areber@redhat.com> > > > > --- > > > > v2: > > > > - Removed (size < sizeof(struct clone_args)) as discussed with > > > > Christian and Dmitry > > > > - Added comment to ((set_tid != 1) && idr_get_cursor() <= 1) (Oleg) > > > > - Use idr_alloc() instead of idr_alloc_cyclic() (Oleg) > > > > > > > > v3: > > > > - Return EEXIST if PID is already in use (Christian) > > > > - Drop CLONE_SET_TID (Christian and Oleg) > > > > - Use idr_is_empty() instead of idr_get_cursor() (Oleg) > > > > - Handle different `struct clone_args` sizes (Dmitry) > > > > > > > > v4: > > > > - Rework struct size check with defines (Christian) > > > > - Reduce number of set_tid checks (Oleg) > > > > - Less parentheses and more robust code (Oleg) > > > > - Do ns_capable() on correct user_ns (Oleg, Christian) > > > > --- > > > > include/linux/pid.h | 2 +- > > > > include/linux/sched/task.h | 1 + > > > > include/uapi/linux/sched.h | 1 + > > > > kernel/fork.c | 25 +++++++++++++++++++++++-- > > > > kernel/pid.c | 34 +++++++++++++++++++++++++++------- > > > > 5 files changed, 53 insertions(+), 10 deletions(-) > > > > > > > > diff --git a/include/linux/pid.h b/include/linux/pid.h > > > > index 2a83e434db9d..052000db0ced 100644 > > > > --- a/include/linux/pid.h > > > > +++ b/include/linux/pid.h > > > > @@ -116,7 +116,7 @@ extern struct pid *find_vpid(int nr); > > > > extern struct pid *find_get_pid(int nr); > > > > extern struct pid *find_ge_pid(int nr, struct pid_namespace *); > > > > > > > > -extern struct pid *alloc_pid(struct pid_namespace *ns); > > > > +extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t set_tid); > > > > extern void free_pid(struct pid *pid); > > > > extern void disable_pid_allocation(struct pid_namespace *ns); > > > > > > > > diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h > > > > index 0497091e40c1..4f2a80564332 100644 > > > > --- a/include/linux/sched/task.h > > > > +++ b/include/linux/sched/task.h > > > > @@ -26,6 +26,7 @@ struct kernel_clone_args { > > > > unsigned long stack; > > > > unsigned long stack_size; > > > > unsigned long tls; > > > > + pid_t set_tid; > > > > }; > > > > > > > > /* > > > > diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h > > > > index b3105ac1381a..e1ce103a2c47 100644 > > > > --- a/include/uapi/linux/sched.h > > > > +++ b/include/uapi/linux/sched.h > > > > @@ -45,6 +45,7 @@ struct clone_args { > > > > __aligned_u64 stack; > > > > __aligned_u64 stack_size; > > > > __aligned_u64 tls; > > > > + __aligned_u64 set_tid; > > > > }; > > > > > > > > /* > > > > diff --git a/kernel/fork.c b/kernel/fork.c > > > > index 2852d0e76ea3..2a03f0e201e9 100644 > > > > --- a/kernel/fork.c > > > > +++ b/kernel/fork.c > > > > @@ -117,6 +117,13 @@ > > > > */ > > > > #define MAX_THREADS FUTEX_TID_MASK > > > > > > > > +/* > > > > + * Different sizes of struct clone_args > > > > + */ > > > > +#define CLONE3_ARGS_SIZE_V0 64 > > > > +/* V1 includes set_tid */ > > > > +#define CLONE3_ARGS_SIZE_V1 72 > > > > + > > > > /* > > > > * Protected counters by write_lock_irq(&tasklist_lock) > > > > */ > > > > @@ -2031,7 +2038,13 @@ static __latent_entropy struct task_struct *copy_process( > > > > stackleak_task_init(p); > > > > > > > > if (pid != &init_struct_pid) { > > > > - pid = alloc_pid(p->nsproxy->pid_ns_for_children); > > > > + if (args->set_tid && !ns_capable( > > > > + p->nsproxy->pid_ns_for_children->user_ns, > > > > + CAP_SYS_ADMIN)) { > > > > + retval = -EPERM; > > > > + goto bad_fork_cleanup_thread; > > > > + } > > > > + pid = alloc_pid(p->nsproxy->pid_ns_for_children, args->set_tid); > > > > if (IS_ERR(pid)) { > > > > retval = PTR_ERR(pid); > > > > goto bad_fork_cleanup_thread; > > > > @@ -2535,9 +2548,14 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs, > > > > if (unlikely(size > PAGE_SIZE)) > > > > return -E2BIG; > > > > > > > > - if (unlikely(size < sizeof(struct clone_args))) > > > > + /* The struct needs to be at least the size of the original struct. */ > > > > > > I don't think you need that comment. I think the macro is pretty > > > self-explanatory. If you want it to be even clearer you could even make > > > it CLONE3_ARGS_SIZE_MIN but V0 is good enough. :) > > > > Will remove the comment. > > > > > > + if (unlikely(size < CLONE3_ARGS_SIZE_V0)) > > > > return -EINVAL; > > > > > > > > + if (size < sizeof(struct clone_args)) > > > > + memset((void *)&args + size, 0, > > > > + sizeof(struct clone_args) - size); > > > > + > > > > if (unlikely(!access_ok(uargs, size))) > > > > return -EFAULT; > > > > > > > > @@ -2573,6 +2591,9 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs, > > > > .tls = args.tls, > > > > }; > > > > > > > > + if (size >= CLONE3_ARGS_SIZE_V1) > > > > + kargs->set_tid = args.set_tid; > > > > > > Hm, the if-condition is not needed though, right? At this point we will > > > have already copied from struct clone_args __user *uargs into struct > > > clone_args args. If we hit that codepath that means the kernel > > > definitely has a field for set_tid in its struct clone_args. :) So this > > > could probably just be: > > > > > > .tls = args.tls, > > > .set_tid = args.set_tid, > > > } > > > > > > ? > > > > Right. > > > > > > + > > > > return 0; > > > > } > > > > > > > > diff --git a/kernel/pid.c b/kernel/pid.c > > > > index 0a9f2e437217..9ce89c35c5be 100644 > > > > --- a/kernel/pid.c > > > > +++ b/kernel/pid.c > > > > @@ -157,7 +157,7 @@ void free_pid(struct pid *pid) > > > > call_rcu(&pid->rcu, delayed_put_pid); > > > > } > > > > > > > > -struct pid *alloc_pid(struct pid_namespace *ns) > > > > +struct pid *alloc_pid(struct pid_namespace *ns, int set_tid) > > > > { > > > > struct pid *pid; > > > > enum pid_type type; > > > > @@ -186,12 +186,32 @@ struct pid *alloc_pid(struct pid_namespace *ns) > > > > if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS) > > > > pid_min = RESERVED_PIDS; > > > > > > > > - /* > > > > - * Store a null pointer so find_pid_ns does not find > > > > - * a partially initialized PID (see below). > > > > - */ > > > > - nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min, > > > > - pid_max, GFP_ATOMIC); > > > > + if (set_tid) { > > > > + /* > > > > + * Also fail if a PID != 1 is requested > > > > + * and no PID 1 exists. > > > > + */ > > > > + nr = -EINVAL; > > > > + if (set_tid < pid_max && set_tid > 0 && > > > > > > Hm, you're already in the if-branch hat verified if (set_tid) so the > > > set_tid > 0 conjunct seems redundant. :) > > > > Yes, but I dropped all checks to see if set_tid is negative as suggested > > by Oleg and moved it here. > > > > > > + (set_tid == 1 || !idr_is_empty(&tmp->idr))) > > > > + nr = idr_alloc(&tmp->idr, NULL, set_tid, > > > > + set_tid + 1, GFP_ATOMIC); > > > > > > I'm confused, shouldn't this be > > > > > > if (set_tid < pid_max || (set_tid == 1 && !idr_is_emtpy(&tmp->idf))) > > > > Now I am also confused ;). This does not work. This will always return > > true if set_tid is less than pid_max. So pid_max needs to be something > > like 1 for the check after || to make sense, right? But you really got > > me confused here right now. Right now I still think what I did is > > correct. > > I missed the part where you reset set_tid to 0 below and mis-parsed the > rightmost conjunct in the if statement. > > One thing I dislike is that you do > > if (set_tid) { > if ([...] && set_tid > 0 && [...]) > } Thinking about this a bit more you probably did the explicit set_tid > 0 to catch the case where it is negative? But also, I don't understand why we do any work at all before having verified that set_tid is sensible, i.e. why do we call kmem_cache_calloc(), do idr_preload(), and take a spinlock, before even verifying that our parameters are sane? If there's no specific reason for this I suggest to patch alloc_pid() like this: -struct pid *alloc_pid(struct pid_namespace *ns) +struct pid *alloc_pid(struct pid_namespace *ns, int set_tid) { struct pid *pid; enum pid_type type; @@ -166,6 +166,9 @@ struct pid *alloc_pid(struct pid_namespace *ns) struct upid *upid; int retval = -ENOMEM; + if (set_tid < 0 || set_tid >= pid_max) + return ERR_PTR(-EINVAL); + pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL); if (!pid) return ERR_PTR(retval); @@ -186,12 +189,31 @@ struct pid *alloc_pid(struct pid_namespace *ns) if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS) pid_min = RESERVED_PIDS; - /* - * Store a null pointer so find_pid_ns does not find - * a partially initialized PID (see below). - */ - nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min, - pid_max, GFP_ATOMIC); + if (set_tid) { + /* + * Also fail if a PID != 1 is requested + * and no PID 1 exists. + */ + nr = -EINVAL; + if (set_tid == 1 || !idr_is_empty(&tmp->idr)) + nr = idr_alloc(&tmp->idr, NULL, set_tid, + set_tid + 1, GFP_ATOMIC); + /* + * If ENOSPC is returned it means that the PID is + * alreay in use. Return EEXIST in that case. + */ + if (nr == -ENOSPC) + nr = -EEXIST; + /* Only use set_tid for one PID namespace. */ + set_tid = 0; + } else { + /* + * Store a null pointer so find_pid_ns does not find + * a partially initialized PID (see below). + */ + nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min, + pid_max, GFP_ATOMIC); + } spin_unlock_irq(&pidmap_lock); idr_preload_end() This makes things a lot more clearer in my opinion. First, verify that the pre-conditions are met. Second, verify that the conditions are met which depend on the state of the pid namespace, i.e. there's either already a pid 1 or pid 1 is requested. We should also do this since alloc_pid() is exported in a header file and so we can't and shouldn't rely on the fact that all callers will pass in something sensible for set_tid. > > For consistency this should rather be: > > if (set_tid) { > if ([...] && set_tid && [...]) > } > > Overall I'd rather write the second if as: > > if (set_tid && set_tid < pid_max && (set_tid == 1 || !idr_is_empty(&tmp->idr))) > > to place the most basic condition (i.e. set_tid is non-zero) leftmost. > > The right-most conjunct (set_tid == 1 || !idr_is_empty(&tmp->idr)) is > non-trivial to parse. Someone not familiar with that code would need > time to figure out why and how exactly 1 is special-cased. > I think overall this is begging for alloc_pid() to be documented now if > you don't mind. :) > > Christian ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v4 1/2] fork: extend clone3() to support CLONE_SET_TID 2019-08-11 19:06 ` Christian Brauner @ 2019-08-11 20:11 ` Adrian Reber 0 siblings, 0 replies; 7+ messages in thread From: Adrian Reber @ 2019-08-11 20:11 UTC (permalink / raw) To: Christian Brauner Cc: Eric Biederman, Pavel Emelianov, Jann Horn, Oleg Nesterov, Dmitry Safonov, linux-kernel, Andrei Vagin, Mike Rapoport, Radostin Stoyanov On Sun, Aug 11, 2019 at 09:06:59PM +0200, Christian Brauner wrote: > On Sun, Aug 11, 2019 at 08:51:48AM +0200, Christian Brauner wrote: > > On Sat, Aug 10, 2019 at 07:59:18AM +0200, Adrian Reber wrote: > > > On Sat, Aug 10, 2019 at 03:10:34AM +0200, Christian Brauner wrote: > > > > On Thu, Aug 08, 2019 at 11:22:21PM +0200, Adrian Reber wrote: > > > > > The main motivation to add set_tid to clone3() is CRIU. > > > > > > > > > > To restore a process with the same PID/TID CRIU currently uses > > > > > /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to > > > > > ns_last_pid and then (quickly) does a clone(). This works most of the > > > > > time, but it is racy. It is also slow as it requires multiple syscalls. > > > > > > > > > > Extending clone3() to support set_tid makes it possible restore a > > > > > process using CRIU without accessing /proc/sys/kernel/ns_last_pid and > > > > > race free (as long as the desired PID/TID is available). > > > > > > > > > > This clone3() extension places the same restrictions (CAP_SYS_ADMIN) > > > > > on clone3() with set_tid as they are currently in place for ns_last_pid. > > > > > > > > > > Signed-off-by: Adrian Reber <areber@redhat.com> > > > > > --- > > > > > v2: > > > > > - Removed (size < sizeof(struct clone_args)) as discussed with > > > > > Christian and Dmitry > > > > > - Added comment to ((set_tid != 1) && idr_get_cursor() <= 1) (Oleg) > > > > > - Use idr_alloc() instead of idr_alloc_cyclic() (Oleg) > > > > > > > > > > v3: > > > > > - Return EEXIST if PID is already in use (Christian) > > > > > - Drop CLONE_SET_TID (Christian and Oleg) > > > > > - Use idr_is_empty() instead of idr_get_cursor() (Oleg) > > > > > - Handle different `struct clone_args` sizes (Dmitry) > > > > > > > > > > v4: > > > > > - Rework struct size check with defines (Christian) > > > > > - Reduce number of set_tid checks (Oleg) > > > > > - Less parentheses and more robust code (Oleg) > > > > > - Do ns_capable() on correct user_ns (Oleg, Christian) > > > > > --- > > > > > include/linux/pid.h | 2 +- > > > > > include/linux/sched/task.h | 1 + > > > > > include/uapi/linux/sched.h | 1 + > > > > > kernel/fork.c | 25 +++++++++++++++++++++++-- > > > > > kernel/pid.c | 34 +++++++++++++++++++++++++++------- > > > > > 5 files changed, 53 insertions(+), 10 deletions(-) > > > > > > > > > > diff --git a/include/linux/pid.h b/include/linux/pid.h > > > > > index 2a83e434db9d..052000db0ced 100644 > > > > > --- a/include/linux/pid.h > > > > > +++ b/include/linux/pid.h > > > > > @@ -116,7 +116,7 @@ extern struct pid *find_vpid(int nr); > > > > > extern struct pid *find_get_pid(int nr); > > > > > extern struct pid *find_ge_pid(int nr, struct pid_namespace *); > > > > > > > > > > -extern struct pid *alloc_pid(struct pid_namespace *ns); > > > > > +extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t set_tid); > > > > > extern void free_pid(struct pid *pid); > > > > > extern void disable_pid_allocation(struct pid_namespace *ns); > > > > > > > > > > diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h > > > > > index 0497091e40c1..4f2a80564332 100644 > > > > > --- a/include/linux/sched/task.h > > > > > +++ b/include/linux/sched/task.h > > > > > @@ -26,6 +26,7 @@ struct kernel_clone_args { > > > > > unsigned long stack; > > > > > unsigned long stack_size; > > > > > unsigned long tls; > > > > > + pid_t set_tid; > > > > > }; > > > > > > > > > > /* > > > > > diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h > > > > > index b3105ac1381a..e1ce103a2c47 100644 > > > > > --- a/include/uapi/linux/sched.h > > > > > +++ b/include/uapi/linux/sched.h > > > > > @@ -45,6 +45,7 @@ struct clone_args { > > > > > __aligned_u64 stack; > > > > > __aligned_u64 stack_size; > > > > > __aligned_u64 tls; > > > > > + __aligned_u64 set_tid; > > > > > }; > > > > > > > > > > /* > > > > > diff --git a/kernel/fork.c b/kernel/fork.c > > > > > index 2852d0e76ea3..2a03f0e201e9 100644 > > > > > --- a/kernel/fork.c > > > > > +++ b/kernel/fork.c > > > > > @@ -117,6 +117,13 @@ > > > > > */ > > > > > #define MAX_THREADS FUTEX_TID_MASK > > > > > > > > > > +/* > > > > > + * Different sizes of struct clone_args > > > > > + */ > > > > > +#define CLONE3_ARGS_SIZE_V0 64 > > > > > +/* V1 includes set_tid */ > > > > > +#define CLONE3_ARGS_SIZE_V1 72 > > > > > + > > > > > /* > > > > > * Protected counters by write_lock_irq(&tasklist_lock) > > > > > */ > > > > > @@ -2031,7 +2038,13 @@ static __latent_entropy struct task_struct *copy_process( > > > > > stackleak_task_init(p); > > > > > > > > > > if (pid != &init_struct_pid) { > > > > > - pid = alloc_pid(p->nsproxy->pid_ns_for_children); > > > > > + if (args->set_tid && !ns_capable( > > > > > + p->nsproxy->pid_ns_for_children->user_ns, > > > > > + CAP_SYS_ADMIN)) { > > > > > + retval = -EPERM; > > > > > + goto bad_fork_cleanup_thread; > > > > > + } > > > > > + pid = alloc_pid(p->nsproxy->pid_ns_for_children, args->set_tid); > > > > > if (IS_ERR(pid)) { > > > > > retval = PTR_ERR(pid); > > > > > goto bad_fork_cleanup_thread; > > > > > @@ -2535,9 +2548,14 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs, > > > > > if (unlikely(size > PAGE_SIZE)) > > > > > return -E2BIG; > > > > > > > > > > - if (unlikely(size < sizeof(struct clone_args))) > > > > > + /* The struct needs to be at least the size of the original struct. */ > > > > > > > > I don't think you need that comment. I think the macro is pretty > > > > self-explanatory. If you want it to be even clearer you could even make > > > > it CLONE3_ARGS_SIZE_MIN but V0 is good enough. :) > > > > > > Will remove the comment. > > > > > > > > + if (unlikely(size < CLONE3_ARGS_SIZE_V0)) > > > > > return -EINVAL; > > > > > > > > > > + if (size < sizeof(struct clone_args)) > > > > > + memset((void *)&args + size, 0, > > > > > + sizeof(struct clone_args) - size); > > > > > + > > > > > if (unlikely(!access_ok(uargs, size))) > > > > > return -EFAULT; > > > > > > > > > > @@ -2573,6 +2591,9 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs, > > > > > .tls = args.tls, > > > > > }; > > > > > > > > > > + if (size >= CLONE3_ARGS_SIZE_V1) > > > > > + kargs->set_tid = args.set_tid; > > > > > > > > Hm, the if-condition is not needed though, right? At this point we will > > > > have already copied from struct clone_args __user *uargs into struct > > > > clone_args args. If we hit that codepath that means the kernel > > > > definitely has a field for set_tid in its struct clone_args. :) So this > > > > could probably just be: > > > > > > > > .tls = args.tls, > > > > .set_tid = args.set_tid, > > > > } > > > > > > > > ? > > > > > > Right. > > > > > > > > + > > > > > return 0; > > > > > } > > > > > > > > > > diff --git a/kernel/pid.c b/kernel/pid.c > > > > > index 0a9f2e437217..9ce89c35c5be 100644 > > > > > --- a/kernel/pid.c > > > > > +++ b/kernel/pid.c > > > > > @@ -157,7 +157,7 @@ void free_pid(struct pid *pid) > > > > > call_rcu(&pid->rcu, delayed_put_pid); > > > > > } > > > > > > > > > > -struct pid *alloc_pid(struct pid_namespace *ns) > > > > > +struct pid *alloc_pid(struct pid_namespace *ns, int set_tid) > > > > > { > > > > > struct pid *pid; > > > > > enum pid_type type; > > > > > @@ -186,12 +186,32 @@ struct pid *alloc_pid(struct pid_namespace *ns) > > > > > if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS) > > > > > pid_min = RESERVED_PIDS; > > > > > > > > > > - /* > > > > > - * Store a null pointer so find_pid_ns does not find > > > > > - * a partially initialized PID (see below). > > > > > - */ > > > > > - nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min, > > > > > - pid_max, GFP_ATOMIC); > > > > > + if (set_tid) { > > > > > + /* > > > > > + * Also fail if a PID != 1 is requested > > > > > + * and no PID 1 exists. > > > > > + */ > > > > > + nr = -EINVAL; > > > > > + if (set_tid < pid_max && set_tid > 0 && > > > > > > > > Hm, you're already in the if-branch hat verified if (set_tid) so the > > > > set_tid > 0 conjunct seems redundant. :) > > > > > > Yes, but I dropped all checks to see if set_tid is negative as suggested > > > by Oleg and moved it here. > > > > > > > > + (set_tid == 1 || !idr_is_empty(&tmp->idr))) > > > > > + nr = idr_alloc(&tmp->idr, NULL, set_tid, > > > > > + set_tid + 1, GFP_ATOMIC); > > > > > > > > I'm confused, shouldn't this be > > > > > > > > if (set_tid < pid_max || (set_tid == 1 && !idr_is_emtpy(&tmp->idf))) > > > > > > Now I am also confused ;). This does not work. This will always return > > > true if set_tid is less than pid_max. So pid_max needs to be something > > > like 1 for the check after || to make sense, right? But you really got > > > me confused here right now. Right now I still think what I did is > > > correct. > > > > I missed the part where you reset set_tid to 0 below and mis-parsed the > > rightmost conjunct in the if statement. > > > > One thing I dislike is that you do > > > > if (set_tid) { > > if ([...] && set_tid > 0 && [...]) > > } > > Thinking about this a bit more you probably did the explicit set_tid > 0 to > catch the case where it is negative? But also, I don't understand why we do any > work at all before having verified that set_tid is sensible, i.e. why do we > call kmem_cache_calloc(), do idr_preload(), and take a spinlock, before even > verifying that our parameters are sane? If there's no specific reason for this > I suggest to patch alloc_pid() like this: > > -struct pid *alloc_pid(struct pid_namespace *ns) > +struct pid *alloc_pid(struct pid_namespace *ns, int set_tid) > { > struct pid *pid; > enum pid_type type; > @@ -166,6 +166,9 @@ struct pid *alloc_pid(struct pid_namespace *ns) > struct upid *upid; > int retval = -ENOMEM; > > + if (set_tid < 0 || set_tid >= pid_max) > + return ERR_PTR(-EINVAL); > + > pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL); > if (!pid) > return ERR_PTR(retval); > @@ -186,12 +189,31 @@ struct pid *alloc_pid(struct pid_namespace *ns) > if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS) > pid_min = RESERVED_PIDS; > > - /* > - * Store a null pointer so find_pid_ns does not find > - * a partially initialized PID (see below). > - */ > - nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min, > - pid_max, GFP_ATOMIC); > + if (set_tid) { > + /* > + * Also fail if a PID != 1 is requested > + * and no PID 1 exists. > + */ > + nr = -EINVAL; > + if (set_tid == 1 || !idr_is_empty(&tmp->idr)) > + nr = idr_alloc(&tmp->idr, NULL, set_tid, > + set_tid + 1, GFP_ATOMIC); > + /* > + * If ENOSPC is returned it means that the PID is > + * alreay in use. Return EEXIST in that case. > + */ > + if (nr == -ENOSPC) > + nr = -EEXIST; > + /* Only use set_tid for one PID namespace. */ > + set_tid = 0; > + } else { > + /* > + * Store a null pointer so find_pid_ns does not find > + * a partially initialized PID (see below). > + */ > + nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min, > + pid_max, GFP_ATOMIC); > + } > spin_unlock_irq(&pidmap_lock); > idr_preload_end() > > This makes things a lot more clearer in my opinion. First, verify that the > pre-conditions are met. Second, verify that the conditions are met which depend > on the state of the pid namespace, i.e. there's either already a pid 1 or pid 1 > is requested. > We should also do this since alloc_pid() is exported in a header file and so we > can't and shouldn't rely on the fact that all callers will pass in something > sensible for set_tid. I like that. I will update the patch and send it out to match this suggestion. Adrian ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2019-08-11 20:12 UTC | newest] Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-08-08 21:22 [PATCH v4 1/2] fork: extend clone3() to support CLONE_SET_TID Adrian Reber 2019-08-08 21:22 ` [PATCH v4 2/2] selftests: add tests for clone3() Adrian Reber 2019-08-10 1:10 ` [PATCH v4 1/2] fork: extend clone3() to support CLONE_SET_TID Christian Brauner 2019-08-10 5:59 ` Adrian Reber 2019-08-11 6:51 ` Christian Brauner 2019-08-11 19:06 ` Christian Brauner 2019-08-11 20:11 ` Adrian Reber
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).