All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 1/2] fork: add clone3
@ 2019-06-03 14:43 Christian Brauner
  2019-06-03 14:43 ` [PATCH v2 2/2] arch: wire-up clone3() syscall on x86 Christian Brauner
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Christian Brauner @ 2019-06-03 14:43 UTC (permalink / raw)
  To: viro, linux-kernel, torvalds, jannh
  Cc: keescook, fweimer, oleg, arnd, dhowells, Christian Brauner,
	Pavel Emelyanov, Andrew Morton, Adrian Reber, Andrei Vagin,
	linux-api

This adds the clone3 system call.

As mentioned several times already (cf. [7], [8]) here's the promised
patchset for clone3().

We recently merged the CLONE_PIDFD patchset (cf. [1]). It took the last
free flag from clone().

Independent of the CLONE_PIDFD patchset a time namespace has been discussed
at Linux Plumber Conference last year and has been sent out and reviewed
(cf. [5]). It is expected that it will go upstream in the not too distant
future. However, it relies on the addition of the CLONE_NEWTIME flag to
clone(). The only other good candidate - CLONE_DETACHED - is currently not
recyclable as we have identified at least two large or widely used
codebases that currently pass this flag (cf. [2], [3], and [4]). Given that
CLONE_PIDFD grabbed the last clone() flag the time namespace is effectively
blocked. clone3() has the advantage that it will unblock this patchset
again.

The idea is to keep clone3() very simple and close to the original clone(),
specifically, to keep on supporting old clone()-based workloads.
We know there have been various creative proposals how a new process
creation syscall or even api is supposed to look like. Some people even
going so far as to argue that the traditional fork()+exec() split should be
abandoned in favor of an in-kernel version of spawn(). Independent of
whether or not we personally think spawn() is a good idea this patchset has
and does not want to have anything to do with this.
One stance we take is that there's no real good alternative to
clone()+exec() and we need and want to support this model going forward;
independent of spawn().
The following requirements guided clone3():
- bump the number of available flags
- move arguments that are currently passed as separate arguments
  in clone() into a dedicated struct clone_args
  - choose a struct layout that is easy to handle on 32 and on 64 bit
  - choose a struct layout that is extensible
  - give new flags that currently need to abuse another flag's dedicated
    return argument in clone() their own dedicated return argument
    (e.g. CLONE_PIDFD)
  - use a separate kernel internal struct kernel_clone_args that is
    properly typed according to current kernel conventions in fork.c and is
    different from  the uapi struct clone_args
- port _do_fork() to use kernel_clone_args so that all process creation
  syscalls such as fork(), vfork(), clone(), and clone3() behave identical
  (Arnd suggested, that we can probably also port do_fork() itself in a
   separate patchset.)
- ease of transition for userspace from clone() to clone3()
  This very much means that we do *not* remove functionality that userspace
  currently relies on as the latter is a good way of creating a syscall
  that won't be adopted.
- do not try to be clever or complex: keep clone3() as dumb as possible

In accordance with Linus suggestions (cf. [11]), clone3() has the following
signature:

/* uapi */
struct clone_args {
        __aligned_u64 flags;
        __aligned_u64 pidfd;
        __aligned_u64 child_tid;
        __aligned_u64 parent_tid;
        __aligned_u64 exit_signal;
        __aligned_u64 stack;
        __aligned_u64 stack_size;
        __aligned_u64 tls;
};

/* kernel internal */
struct kernel_clone_args {
        u64 flags;
        int __user *pidfd;
        int __user *child_tid;
        int __user *parent_tid;
        int exit_signal;
        unsigned long stack;
        unsigned long stack_size;
        unsigned long tls;
};

long sys_clone3(struct clone_args __user *uargs, size_t size)

clone3() cleanly supports all of the supported flags from clone() and thus
all legacy workloads.
The advantage of sticking close to the old clone() is the low cost for
userspace to switch to this new api. Quite a lot of userspace apis (e.g.
pthreads) are based on the clone() syscall. With the new clone3() syscall
supporting all of the old workloads and opening up the ability to add new
features should make switching to it for userspace more appealing. In
essence, glibc can just write a simple wrapper to switch from clone() to
clone3().

There has been some interest in this patchset already. We have received a
patch from the CRIU corner for clone3() that would set the PID/TID of a
restored process without /proc/sys/kernel/ns_last_pid to eliminate a race.

/* User visible differences to legacy clone() */
- CLONE_DETACHED will cause EINVAL with clone3()
- CSIGNAL is deprecated
  It is superseeded by a dedicated "exit_signal" argument in struct
  clone_args freeing up space for additional flags.
  This is based on a suggestion from Andrei and Linus (cf. [9] and [10])

/* References */
[1]: b3e5838252665ee4cfa76b82bdf1198dca81e5be
[2]: https://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/SandboxFilter.cpp#343
[3]: https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_create.c#n233
[4]: https://sources.debian.org/src/blcr/0.8.5-2.3/cr_module/cr_dump_self.c/?hl=740#L740
[5]: https://lore.kernel.org/lkml/20190425161416.26600-1-dima@arista.com/
[6]: https://lore.kernel.org/lkml/20190425161416.26600-2-dima@arista.com/
[7]: https://lore.kernel.org/lkml/CAHrFyr5HxpGXA2YrKza-oB-GGwJCqwPfyhD-Y5wbktWZdt0sGQ@mail.gmail.com/
[8]: https://lore.kernel.org/lkml/20190524102756.qjsjxukuq2f4t6bo@brauner.io/
[9]: https://lore.kernel.org/lkml/20190529222414.GA6492@gmail.com/
[10]: https://lore.kernel.org/lkml/CAHk-=whQP-Ykxi=zSYaV9iXsHsENa+2fdj-zYKwyeyed63Lsfw@mail.gmail.com/
[11]: https://lore.kernel.org/lkml/CAHk-=wieuV4hGwznPsX-8E0G2FKhx3NjZ9X3dTKh5zKd+iqOBw@mail.gmail.com/

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Adrian Reber <adrian@lisas.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
---
v1:
- Linus Torvalds <torvalds@linux-foundation.org>:
  - redesign based on Linus proposal
  - switch from arg-based to revision-based naming scheme: s/clone6/clone3/
- Arnd Bergmann <arnd@arndb.de>:
  - use a single copy_from_user() instead of multiple get_user() calls
    since the latter have a constant overhead on some architectures
  - a range of other tweaks and suggestions
v2:
- Linus Torvalds <torvalds@linux-foundation.org>,
  Andrei Vagin <avagin@gmail.com>:
  - replace CSIGNAL flag with dedicated exit_signal argument in struct
    clone_args
- Christian Brauner <christian@brauner.io>:
  - improve naming for some struct clone_args members
---
 arch/x86/ia32/sys_ia32.c   |  12 ++-
 include/linux/sched/task.h |  14 ++-
 include/linux/syscalls.h   |   6 ++
 include/uapi/linux/sched.h |  17 ++++
 kernel/fork.c              | 188 ++++++++++++++++++++++++++++---------
 5 files changed, 190 insertions(+), 47 deletions(-)

diff --git a/arch/x86/ia32/sys_ia32.c b/arch/x86/ia32/sys_ia32.c
index a43212036257..26525c0cd5e6 100644
--- a/arch/x86/ia32/sys_ia32.c
+++ b/arch/x86/ia32/sys_ia32.c
@@ -237,6 +237,14 @@ COMPAT_SYSCALL_DEFINE5(x86_clone, unsigned long, clone_flags,
 		       unsigned long, newsp, int __user *, parent_tidptr,
 		       unsigned long, tls_val, int __user *, child_tidptr)
 {
-	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr,
-			tls_val);
+	struct kernel_clone_args args = {
+		.flags = (clone_flags & ~CSIGNAL),
+		.child_tid = child_tidptr,
+		.parent_tid = parent_tidptr,
+		.exit_signal = (clone_flags & CSIGNAL),
+		.stack = newsp,
+		.tls = tls_val,
+	};
+
+	return _do_fork(&args);
 }
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index f1227f2c38a4..9444d4f2bd83 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -8,11 +8,23 @@
  */
 
 #include <linux/sched.h>
+#include <linux/compiler_types.h>
 
 struct task_struct;
 struct rusage;
 union thread_union;
 
+struct kernel_clone_args {
+	u64 flags;
+	int __user *pidfd;
+	int __user *child_tid;
+	int __user *parent_tid;
+	int exit_signal;
+	unsigned long stack;
+	unsigned long stack_size;
+	unsigned long tls;
+};
+
 /*
  * This serializes "schedule()" and also protects
  * the run-queue from deletions/modifications (but
@@ -73,7 +85,7 @@ extern void do_group_exit(int);
 extern void exit_files(struct task_struct *);
 extern void exit_itimers(struct signal_struct *);
 
-extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *, unsigned long);
+extern long _do_fork(struct kernel_clone_args *kargs);
 extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
 struct task_struct *fork_idle(int);
 struct mm_struct *copy_init_mm(void);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe1be5b..254db24af0cd 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -70,6 +70,7 @@ struct sigaltstack;
 struct rseq;
 union bpf_attr;
 struct io_uring_params;
+struct clone_args;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -852,6 +853,11 @@ asmlinkage long sys_clone(unsigned long, unsigned long, int __user *,
 	       int __user *, unsigned long);
 #endif
 #endif
+
+#ifdef __ARCH_WANT_SYS_CLONE
+asmlinkage long sys_clone3(struct clone_args __user *uargs, size_t size);
+#endif
+
 asmlinkage long sys_execve(const char __user *filename,
 		const char __user *const __user *argv,
 		const char __user *const __user *envp);
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index ed4ee170bee2..634af5ec07b5 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -2,6 +2,8 @@
 #ifndef _UAPI_LINUX_SCHED_H
 #define _UAPI_LINUX_SCHED_H
 
+#include <linux/types.h>
+
 /*
  * cloning flags:
  */
@@ -30,6 +32,21 @@
 #define CLONE_NEWPID		0x20000000	/* New pid namespace */
 #define CLONE_NEWNET		0x40000000	/* New network namespace */
 #define CLONE_IO		0x80000000	/* Clone io context */
+#define CLONE_MAX ~0U
+
+/*
+ * Arguments for the clone3 syscall
+ */
+struct clone_args {
+	__aligned_u64 flags;
+	__aligned_u64 pidfd;
+	__aligned_u64 child_tid;
+	__aligned_u64 parent_tid;
+	__aligned_u64 exit_signal;
+	__aligned_u64 stack;
+	__aligned_u64 stack_size;
+	__aligned_u64 tls;
+};
 
 /*
  * Scheduling policies
diff --git a/kernel/fork.c b/kernel/fork.c
index b4cba953040a..32f27e1d99da 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1760,19 +1760,19 @@ static __always_inline void delayed_free_task(struct task_struct *tsk)
  * flags). The actual kick-off is left to the caller.
  */
 static __latent_entropy struct task_struct *copy_process(
-					unsigned long clone_flags,
-					unsigned long stack_start,
-					unsigned long stack_size,
-					int __user *parent_tidptr,
-					int __user *child_tidptr,
 					struct pid *pid,
 					int trace,
-					unsigned long tls,
-					int node)
+					int node,
+					struct kernel_clone_args *args)
 {
 	int pidfd = -1, retval;
 	struct task_struct *p;
 	struct multiprocess_signals delayed;
+	u64 clone_flags = args->flags;
+	int __user *child_tidptr = args->child_tid;
+	unsigned long tls = args->tls;
+	unsigned long stack_start = args->stack;
+	unsigned long stack_size = args->stack_size;
 
 	/*
 	 * Don't allow sharing the root directory with processes in a different
@@ -1821,27 +1821,12 @@ static __latent_entropy struct task_struct *copy_process(
 	}
 
 	if (clone_flags & CLONE_PIDFD) {
-		int reserved;
-
 		/*
-		 * - CLONE_PARENT_SETTID is useless for pidfds and also
-		 *   parent_tidptr is used to return pidfds.
 		 * - CLONE_DETACHED is blocked so that we can potentially
 		 *   reuse it later for CLONE_PIDFD.
 		 * - CLONE_THREAD is blocked until someone really needs it.
 		 */
-		if (clone_flags &
-		    (CLONE_DETACHED | CLONE_PARENT_SETTID | CLONE_THREAD))
-			return ERR_PTR(-EINVAL);
-
-		/*
-		 * Verify that parent_tidptr is sane so we can potentially
-		 * reuse it later.
-		 */
-		if (get_user(reserved, parent_tidptr))
-			return ERR_PTR(-EFAULT);
-
-		if (reserved != 0)
+		if (clone_flags & (CLONE_DETACHED | CLONE_THREAD))
 			return ERR_PTR(-EINVAL);
 	}
 
@@ -2062,7 +2047,7 @@ static __latent_entropy struct task_struct *copy_process(
 			goto bad_fork_free_pid;
 
 		pidfd = retval;
-		retval = put_user(pidfd, parent_tidptr);
+		retval = put_user(pidfd, args->pidfd);
 		if (retval)
 			goto bad_fork_put_pidfd;
 	}
@@ -2105,7 +2090,7 @@ static __latent_entropy struct task_struct *copy_process(
 		if (clone_flags & CLONE_PARENT)
 			p->exit_signal = current->group_leader->exit_signal;
 		else
-			p->exit_signal = (clone_flags & CSIGNAL);
+			p->exit_signal = args->exit_signal;
 		p->group_leader = p;
 		p->tgid = p->pid;
 	}
@@ -2313,8 +2298,11 @@ static inline void init_idle_pids(struct task_struct *idle)
 struct task_struct *fork_idle(int cpu)
 {
 	struct task_struct *task;
-	task = copy_process(CLONE_VM, 0, 0, NULL, NULL, &init_struct_pid, 0, 0,
-			    cpu_to_node(cpu));
+	struct kernel_clone_args args = {
+		.flags = CLONE_VM,
+	};
+
+	task = copy_process(&init_struct_pid, 0, cpu_to_node(cpu), &args);
 	if (!IS_ERR(task)) {
 		init_idle_pids(task);
 		init_idle(task, cpu);
@@ -2334,18 +2322,15 @@ struct mm_struct *copy_init_mm(void)
  * It copies the process, and if successful kick-starts
  * it and waits for it to finish using the VM if required.
  */
-long _do_fork(unsigned long clone_flags,
-	      unsigned long stack_start,
-	      unsigned long stack_size,
-	      int __user *parent_tidptr,
-	      int __user *child_tidptr,
-	      unsigned long tls)
+long _do_fork(struct kernel_clone_args *args)
 {
+	u64 clone_flags = args->flags;
 	struct completion vfork;
 	struct pid *pid;
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
+	int __user *parent_tidptr = args->parent_tid;
 
 	/*
 	 * Determine whether and which event to report to ptracer.  When
@@ -2356,7 +2341,7 @@ long _do_fork(unsigned long clone_flags,
 	if (!(clone_flags & CLONE_UNTRACED)) {
 		if (clone_flags & CLONE_VFORK)
 			trace = PTRACE_EVENT_VFORK;
-		else if ((clone_flags & CSIGNAL) != SIGCHLD)
+		else if (args->exit_signal != SIGCHLD)
 			trace = PTRACE_EVENT_CLONE;
 		else
 			trace = PTRACE_EVENT_FORK;
@@ -2365,8 +2350,7 @@ long _do_fork(unsigned long clone_flags,
 			trace = 0;
 	}
 
-	p = copy_process(clone_flags, stack_start, stack_size, parent_tidptr,
-			 child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
+	p = copy_process(NULL, trace, NUMA_NO_NODE, args);
 	add_latent_entropy();
 
 	if (IS_ERR(p))
@@ -2414,8 +2398,16 @@ long do_fork(unsigned long clone_flags,
 	      int __user *parent_tidptr,
 	      int __user *child_tidptr)
 {
-	return _do_fork(clone_flags, stack_start, stack_size,
-			parent_tidptr, child_tidptr, 0);
+	struct kernel_clone_args args = {
+		.flags = (clone_flags & ~CSIGNAL),
+		.child_tid = child_tidptr,
+		.parent_tid = parent_tidptr,
+		.exit_signal = (clone_flags & CSIGNAL),
+		.stack = stack_start,
+		.stack_size = stack_size,
+	};
+
+	return _do_fork(&args);
 }
 #endif
 
@@ -2424,15 +2416,25 @@ long do_fork(unsigned long clone_flags,
  */
 pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
 {
-	return _do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
-		(unsigned long)arg, NULL, NULL, 0);
+	struct kernel_clone_args args = {
+		.flags = ((flags | CLONE_VM | CLONE_UNTRACED) & ~CSIGNAL),
+		.exit_signal = (flags & CSIGNAL),
+		.stack = (unsigned long)fn,
+		.stack_size = (unsigned long)arg,
+	};
+
+	return _do_fork(&args);
 }
 
 #ifdef __ARCH_WANT_SYS_FORK
 SYSCALL_DEFINE0(fork)
 {
 #ifdef CONFIG_MMU
-	return _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0);
+	struct kernel_clone_args args = {
+		.exit_signal = SIGCHLD,
+	};
+
+	return _do_fork(&args);
 #else
 	/* can not support in nommu mode */
 	return -EINVAL;
@@ -2443,8 +2445,12 @@ SYSCALL_DEFINE0(fork)
 #ifdef __ARCH_WANT_SYS_VFORK
 SYSCALL_DEFINE0(vfork)
 {
-	return _do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0,
-			0, NULL, NULL, 0);
+	struct kernel_clone_args args = {
+		.flags = CLONE_VFORK | CLONE_VM,
+		.exit_signal = SIGCHLD,
+	};
+
+	return _do_fork(&args);
 }
 #endif
 
@@ -2472,7 +2478,101 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
 		 unsigned long, tls)
 #endif
 {
-	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
+	struct kernel_clone_args args = {
+		.flags = (clone_flags & ~CSIGNAL),
+		.pidfd = parent_tidptr,
+		.child_tid = child_tidptr,
+		.parent_tid = parent_tidptr,
+		.exit_signal = (clone_flags & CSIGNAL),
+		.stack = newsp,
+		.tls = tls,
+	};
+
+	/* clone(CLONE_PIDFD) uses parent_tidptr to return a pidfd */
+	if ((clone_flags & CLONE_PIDFD) && (clone_flags & CLONE_PARENT_SETTID))
+		return -EINVAL;
+
+	return _do_fork(&args);
+}
+
+static bool clone3_flags_valid(u64 flags)
+{
+	if (flags & ~CLONE_MAX)
+		return false;
+
+	if (flags & (CLONE_DETACHED | CSIGNAL))
+		return false;
+
+	return true;
+}
+
+static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
+				     struct clone_args __user *uargs,
+				     size_t size)
+{
+	struct clone_args args;
+
+	if (unlikely(size > PAGE_SIZE))
+		return -E2BIG;
+
+	if (unlikely(size < sizeof(struct clone_args)))
+		return -EINVAL;
+
+	if (unlikely(!access_ok(uargs, size)))
+		return -EFAULT;
+
+	if (size > sizeof(struct clone_args)) {
+		unsigned char __user *addr;
+		unsigned char __user *end;
+		unsigned char val;
+
+		addr = (void __user *)uargs + sizeof(struct clone_args);
+		end = (void __user *)uargs + size;
+
+		for (; addr < end; addr++) {
+			if (get_user(val, addr))
+				return -EFAULT;
+			if (val)
+				return -E2BIG;
+		}
+
+		size = sizeof(struct clone_args);
+	}
+
+	if (copy_from_user(&args, uargs, size))
+		return -EFAULT;
+
+	if (!clone3_flags_valid(args.flags))
+		return -EINVAL;
+
+	if ((args.flags & (CLONE_THREAD | CLONE_PARENT)) && args.exit_signal)
+		return -EINVAL;
+
+	memset(kargs, 0, sizeof(*kargs));
+
+	kargs->flags = args.flags;
+	kargs->exit_signal = args.exit_signal;
+	kargs->child_tid = u64_to_user_ptr(args.child_tid);
+	kargs->parent_tid = u64_to_user_ptr(args.parent_tid);
+	kargs->pidfd = u64_to_user_ptr(args.pidfd);
+	kargs->stack = args.stack;
+	kargs->stack_size = args.stack_size;
+	kargs->tls = args.tls;
+
+	return 0;
+}
+
+SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
+{
+	int err;
+
+	struct kernel_clone_args kargs;
+
+	err = copy_clone_args_from_user(&kargs, uargs, size);
+	if (err)
+		return err;
+
+	return _do_fork(&kargs);
 }
 #endif
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v2 2/2] arch: wire-up clone3() syscall on x86
  2019-06-03 14:43 [PATCH v2 1/2] fork: add clone3 Christian Brauner
@ 2019-06-03 14:43 ` Christian Brauner
  2019-06-04  9:28 ` [PATCH v2 1/2] fork: add clone3 David Howells
  2019-06-04 10:36 ` Arnd Bergmann
  2 siblings, 0 replies; 8+ messages in thread
From: Christian Brauner @ 2019-06-03 14:43 UTC (permalink / raw)
  To: viro, linux-kernel, torvalds, jannh
  Cc: keescook, fweimer, oleg, arnd, dhowells, Christian Brauner,
	Andrew Morton, Adrian Reber, linux-api, linux-arch, x86

Wire up the clone3() call on x86.

This patch only wires up clone3() on x86. Some of the arches look like they
need special assembly massaging and it is probably smarter if the
appropriate arch maintainers would do the actual wiring.

Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Adrian Reber <adrian@lisas.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: x86@kernel.org
---
v1: unchanged
v2: unchanged
---
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 include/uapi/asm-generic/unistd.h      | 4 +++-
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ad968b7bac72..80e26211feff 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -438,3 +438,4 @@
 431	i386	fsconfig		sys_fsconfig			__ia32_sys_fsconfig
 432	i386	fsmount			sys_fsmount			__ia32_sys_fsmount
 433	i386	fspick			sys_fspick			__ia32_sys_fspick
+436	i386	clone3			sys_clone3			__ia32_sys_clone3
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index b4e6f9e6204a..7968f0b5b5e8 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
 431	common	fsconfig		__x64_sys_fsconfig
 432	common	fsmount			__x64_sys_fsmount
 433	common	fspick			__x64_sys_fspick
+436	common	clone3			__x64_sys_clone3/ptregs
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a87904daf103..45bc87687c47 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -844,9 +844,11 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
 __SYSCALL(__NR_fsmount, sys_fsmount)
 #define __NR_fspick 433
 __SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_clone3 436
+__SYSCALL(__NR_clone3, sys_clone3)
 
 #undef __NR_syscalls
-#define __NR_syscalls 434
+#define __NR_syscalls 437
 
 /*
  * 32 bit systems traditionally used different
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 1/2] fork: add clone3
  2019-06-03 14:43 [PATCH v2 1/2] fork: add clone3 Christian Brauner
  2019-06-03 14:43 ` [PATCH v2 2/2] arch: wire-up clone3() syscall on x86 Christian Brauner
@ 2019-06-04  9:28 ` David Howells
  2019-06-04  9:43   ` Christian Brauner
  2019-06-04 10:36 ` Arnd Bergmann
  2 siblings, 1 reply; 8+ messages in thread
From: David Howells @ 2019-06-04  9:28 UTC (permalink / raw)
  To: Christian Brauner
  Cc: dhowells, viro, linux-kernel, torvalds, jannh, keescook, fweimer,
	oleg, arnd, Pavel Emelyanov, Andrew Morton, Adrian Reber,
	Andrei Vagin, linux-api

Christian Brauner <christian@brauner.io> wrote:

> +#include <linux/compiler_types.h>

I suspect you don't want to include that directly.

Also, to avoid bloating linux/sched/task.h yet further, maybe put this in
linux/sched/clone.h?

> -extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *, unsigned long);
> +extern long _do_fork(struct kernel_clone_args *kargs);
>  extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);

Maybe these could move into linux/sched/clone.h too.

> +#define CLONE_MAX ~0U

Can you add a comment summarising the meaning?

> +	u64 clone_flags = args->flags;
> +	int __user *child_tidptr = args->child_tid;
> +	unsigned long tls = args->tls;
> +	unsigned long stack_start = args->stack;
> +	unsigned long stack_size = args->stack_size;

Some of these are only used once, so it's probably not worth sticking them in
local variables.

> -		if (clone_flags &
> -		    (CLONE_DETACHED | CLONE_PARENT_SETTID | CLONE_THREAD))
> -			return ERR_PTR(-EINVAL);

Did this error check get lost?  I can see part of it further on, but the check
on CLONE_PARENT_SETTID is absent.

> +	int __user *parent_tidptr = args->parent_tid;

There's only one usage remaining after this patch, so a local var doesn't gain
a lot.

>  pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
>  {
> -	return _do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
> -		(unsigned long)arg, NULL, NULL, 0);
> +	struct kernel_clone_args args = {
> +		.flags = ((flags | CLONE_VM | CLONE_UNTRACED) & ~CSIGNAL),
> +		.exit_signal = (flags & CSIGNAL),

Kernel threads can have exit signals?

> +static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
> +				     struct clone_args __user *uargs,
> +				     size_t size)

I would make this "noinline".  If it gets inlined, local variable "args" may
still be on the stack when _do_fork() gets called.

David

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 1/2] fork: add clone3
  2019-06-04  9:28 ` [PATCH v2 1/2] fork: add clone3 David Howells
@ 2019-06-04  9:43   ` Christian Brauner
  2019-06-04  9:56     ` Christian Brauner
  2019-06-04 10:42     ` David Laight
  0 siblings, 2 replies; 8+ messages in thread
From: Christian Brauner @ 2019-06-04  9:43 UTC (permalink / raw)
  To: David Howells
  Cc: viro, linux-kernel, torvalds, jannh, keescook, fweimer, oleg,
	arnd, Pavel Emelyanov, Andrew Morton, Adrian Reber, Andrei Vagin,
	linux-api

On Tue, Jun 04, 2019 at 10:28:12AM +0100, David Howells wrote:
> Christian Brauner <christian@brauner.io> wrote:
> 
> > +#include <linux/compiler_types.h>
> 
> I suspect you don't want to include that directly.
> 
> Also, to avoid bloating linux/sched/task.h yet further, maybe put this in
> linux/sched/clone.h?

Yeah, not the worst idea.
Though I'd leave the flags where they are and just add struct
kernel_clone_args in there. But I assume that's what you meant anyway.

> 
> > -extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *, unsigned long);
> > +extern long _do_fork(struct kernel_clone_args *kargs);
> >  extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
> 
> Maybe these could move into linux/sched/clone.h too.

Meh, that could be a separate cleanup patch after clone3() has been
merged.

> 
> > +#define CLONE_MAX ~0U
> 
> Can you add a comment summarising the meaning?

Yes, can do.

> 
> > +	u64 clone_flags = args->flags;
> > +	int __user *child_tidptr = args->child_tid;
> > +	unsigned long tls = args->tls;
> > +	unsigned long stack_start = args->stack;
> > +	unsigned long stack_size = args->stack_size;
> 
> Some of these are only used once, so it's probably not worth sticking them in
> local variables.

[1]:
Ok, will double check.
This was just to minimize copy-paste erros for variables which were used
multiple times.

> 
> > -		if (clone_flags &
> > -		    (CLONE_DETACHED | CLONE_PARENT_SETTID | CLONE_THREAD))
> > -			return ERR_PTR(-EINVAL);
> 
> Did this error check get lost?  I can see part of it further on, but the check
> on CLONE_PARENT_SETTID is absent.

No, it's only relevant for legacy clone() since it uses the
parent_tidptr argument to return the pidfd. clone3() has a dedicated
return argument for that in clone_args.
The check for legacy clone() is now done in legacy clone() directly.
copy_process() should only do generic checks for all version of
clone(),fork(),vfork(), etc.

> 
> > +	int __user *parent_tidptr = args->parent_tid;
> 
> There's only one usage remaining after this patch, so a local var doesn't gain
> a lot.

Yes, that leads back to [1].

> 
> >  pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
> >  {
> > -	return _do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
> > -		(unsigned long)arg, NULL, NULL, 0);
> > +	struct kernel_clone_args args = {
> > +		.flags = ((flags | CLONE_VM | CLONE_UNTRACED) & ~CSIGNAL),
> > +		.exit_signal = (flags & CSIGNAL),
> 
> Kernel threads can have exit signals?

Yes,

kernel/kthread.c:       pid = kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | SIGCHLD);
kernel/umh.c:   pid = kernel_thread(call_usermodehelper_exec_async, sub_info, SIGCHLD);

And even if they couldn't have. This is just to make sure that if they
ever would we'd be prepared.

> 
> > +static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
> > +				     struct clone_args __user *uargs,
> > +				     size_t size)
> 
> I would make this "noinline".  If it gets inlined, local variable "args" may
> still be on the stack when _do_fork() gets called.

Hm, can do.

Thanks!
Christian

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 1/2] fork: add clone3
  2019-06-04  9:43   ` Christian Brauner
@ 2019-06-04  9:56     ` Christian Brauner
  2019-06-04 10:42     ` David Laight
  1 sibling, 0 replies; 8+ messages in thread
From: Christian Brauner @ 2019-06-04  9:56 UTC (permalink / raw)
  To: David Howells
  Cc: viro, linux-kernel, torvalds, jannh, keescook, fweimer, oleg,
	arnd, Pavel Emelyanov, Andrew Morton, Adrian Reber, Andrei Vagin,
	linux-api

On Tue, Jun 04, 2019 at 11:43:17AM +0200, Christian Brauner wrote:
> On Tue, Jun 04, 2019 at 10:28:12AM +0100, David Howells wrote:
> > Christian Brauner <christian@brauner.io> wrote:
> > 
> > > +#include <linux/compiler_types.h>
> > 
> > I suspect you don't want to include that directly.
> > 
> > Also, to avoid bloating linux/sched/task.h yet further, maybe put this in
> > linux/sched/clone.h?
> 
> Yeah, not the worst idea.
> Though I'd leave the flags where they are and just add struct
> kernel_clone_args in there. But I assume that's what you meant anyway.

Actually, I would like to defer this to the cleanup patch too.
This way the patch stays small and clean and task.h is currently the
right place to put it.

> 
> > 
> > > -extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *, unsigned long);
> > > +extern long _do_fork(struct kernel_clone_args *kargs);
> > >  extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
> > 
> > Maybe these could move into linux/sched/clone.h too.
> 
> Meh, that could be a separate cleanup patch after clone3() has been
> merged.
> 
> > 
> > > +#define CLONE_MAX ~0U
> > 
> > Can you add a comment summarising the meaning?
> 
> Yes, can do.
> 
> > 
> > > +	u64 clone_flags = args->flags;
> > > +	int __user *child_tidptr = args->child_tid;
> > > +	unsigned long tls = args->tls;
> > > +	unsigned long stack_start = args->stack;
> > > +	unsigned long stack_size = args->stack_size;
> > 
> > Some of these are only used once, so it's probably not worth sticking them in
> > local variables.
> 
> [1]:
> Ok, will double check.
> This was just to minimize copy-paste erros for variables which were used
> multiple times.
> 
> > 
> > > -		if (clone_flags &
> > > -		    (CLONE_DETACHED | CLONE_PARENT_SETTID | CLONE_THREAD))
> > > -			return ERR_PTR(-EINVAL);
> > 
> > Did this error check get lost?  I can see part of it further on, but the check
> > on CLONE_PARENT_SETTID is absent.
> 
> No, it's only relevant for legacy clone() since it uses the
> parent_tidptr argument to return the pidfd. clone3() has a dedicated
> return argument for that in clone_args.
> The check for legacy clone() is now done in legacy clone() directly.
> copy_process() should only do generic checks for all version of
> clone(),fork(),vfork(), etc.
> 
> > 
> > > +	int __user *parent_tidptr = args->parent_tid;
> > 
> > There's only one usage remaining after this patch, so a local var doesn't gain
> > a lot.
> 
> Yes, that leads back to [1].
> 
> > 
> > >  pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
> > >  {
> > > -	return _do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
> > > -		(unsigned long)arg, NULL, NULL, 0);
> > > +	struct kernel_clone_args args = {
> > > +		.flags = ((flags | CLONE_VM | CLONE_UNTRACED) & ~CSIGNAL),
> > > +		.exit_signal = (flags & CSIGNAL),
> > 
> > Kernel threads can have exit signals?
> 
> Yes,
> 
> kernel/kthread.c:       pid = kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | SIGCHLD);
> kernel/umh.c:   pid = kernel_thread(call_usermodehelper_exec_async, sub_info, SIGCHLD);
> 
> And even if they couldn't have. This is just to make sure that if they
> ever would we'd be prepared.
> 
> > 
> > > +static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
> > > +				     struct clone_args __user *uargs,
> > > +				     size_t size)
> > 
> > I would make this "noinline".  If it gets inlined, local variable "args" may
> > still be on the stack when _do_fork() gets called.
> 
> Hm, can do.
> 
> Thanks!
> Christian

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 1/2] fork: add clone3
  2019-06-03 14:43 [PATCH v2 1/2] fork: add clone3 Christian Brauner
  2019-06-03 14:43 ` [PATCH v2 2/2] arch: wire-up clone3() syscall on x86 Christian Brauner
  2019-06-04  9:28 ` [PATCH v2 1/2] fork: add clone3 David Howells
@ 2019-06-04 10:36 ` Arnd Bergmann
  2019-06-04 10:49   ` Christian Brauner
  2 siblings, 1 reply; 8+ messages in thread
From: Arnd Bergmann @ 2019-06-04 10:36 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Al Viro, Linux Kernel Mailing List, Linus Torvalds, Jann Horn,
	Kees Cook, Florian Weimer, Oleg Nesterov, David Howells,
	Pavel Emelyanov, Andrew Morton, Adrian Reber, Andrei Vagin,
	Linux API

On Mon, Jun 3, 2019 at 4:44 PM Christian Brauner <christian@brauner.io> wrote:

> +
> +#ifdef __ARCH_WANT_SYS_CLONE
> +asmlinkage long sys_clone3(struct clone_args __user *uargs, size_t size);
> +#endif

I would leave it outside of __ARCH_WANT_SYS_CLONE, as far
as I can tell the only reason for that #ifdef is so architectures that
have their own sys_clone implementation can opt out of the generic
one, but we don't want that for new syscalls.

In fact, I'd prefer to drop the symbol entirely and have a different
symbol with the opposite meaning such as
__ARCH_NONSTANDARD_SYS_CLONE that only gets
selected by sparc, ia64 and m68k. That should be a separate
patch though, and I'm not asking you to do it, unless you
want to clean up a little more.

       Arnd

^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [PATCH v2 1/2] fork: add clone3
  2019-06-04  9:43   ` Christian Brauner
  2019-06-04  9:56     ` Christian Brauner
@ 2019-06-04 10:42     ` David Laight
  1 sibling, 0 replies; 8+ messages in thread
From: David Laight @ 2019-06-04 10:42 UTC (permalink / raw)
  To: 'Christian Brauner', David Howells
  Cc: viro, linux-kernel, torvalds, jannh, keescook, fweimer, oleg,
	arnd, Pavel Emelyanov, Andrew Morton, Adrian Reber, Andrei Vagin,
	linux-api

From: Christian Brauner
> Sent: 04 June 2019 10:43
...
> > > +	u64 clone_flags = args->flags;
> > > +	int __user *child_tidptr = args->child_tid;
> > > +	unsigned long tls = args->tls;
> > > +	unsigned long stack_start = args->stack;
> > > +	unsigned long stack_size = args->stack_size;
> >
> > Some of these are only used once, so it's probably not worth sticking them in
> > local variables.
> 
> [1]:
> Ok, will double check.
> This was just to minimize copy-paste erros for variables which were used
> multiple times.

Even the ones that are used multiple times may be better being
repeatedly read from args->xxx.

If you are "lucky" 'args' will be in a register variables
so all the accesses are cheap.
With too many locals everything gets copied onto the actual
stack - especially likely if there are any function calls
(that might conceivably change *args) after the locals
are initialised.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 1/2] fork: add clone3
  2019-06-04 10:36 ` Arnd Bergmann
@ 2019-06-04 10:49   ` Christian Brauner
  0 siblings, 0 replies; 8+ messages in thread
From: Christian Brauner @ 2019-06-04 10:49 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Al Viro, Linux Kernel Mailing List, Linus Torvalds, Jann Horn,
	Kees Cook, Florian Weimer, Oleg Nesterov, David Howells,
	Pavel Emelyanov, Andrew Morton, Adrian Reber, Andrei Vagin,
	Linux API

On Tue, Jun 04, 2019 at 12:36:24PM +0200, Arnd Bergmann wrote:
> On Mon, Jun 3, 2019 at 4:44 PM Christian Brauner <christian@brauner.io> wrote:
> 
> > +
> > +#ifdef __ARCH_WANT_SYS_CLONE
> > +asmlinkage long sys_clone3(struct clone_args __user *uargs, size_t size);
> > +#endif
> 
> I would leave it outside of __ARCH_WANT_SYS_CLONE, as far
> as I can tell the only reason for that #ifdef is so architectures that
> have their own sys_clone implementation can opt out of the generic
> one, but we don't want that for new syscalls.
> 
> In fact, I'd prefer to drop the symbol entirely and have a different
> symbol with the opposite meaning such as
> __ARCH_NONSTANDARD_SYS_CLONE that only gets
> selected by sparc, ia64 and m68k. That should be a separate
> patch though, and I'm not asking you to do it, unless you
> want to clean up a little more.

I am totally up for this but I would prefer if we land clone3() in the
5.3 merge window and then for 5.3 rc{2,3} do the cleanups that David and
you suggested.
This leaves this patchset lean and easy to review.

Christian

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2019-06-04 10:49 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-03 14:43 [PATCH v2 1/2] fork: add clone3 Christian Brauner
2019-06-03 14:43 ` [PATCH v2 2/2] arch: wire-up clone3() syscall on x86 Christian Brauner
2019-06-04  9:28 ` [PATCH v2 1/2] fork: add clone3 David Howells
2019-06-04  9:43   ` Christian Brauner
2019-06-04  9:56     ` Christian Brauner
2019-06-04 10:42     ` David Laight
2019-06-04 10:36 ` Arnd Bergmann
2019-06-04 10:49   ` Christian Brauner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.