linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/2] fork: add clone6
@ 2019-05-26 10:26 Christian Brauner
  2019-05-26 10:26 ` [PATCH 2/2] arch: wire-up clone6() syscall on x86 Christian Brauner
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Christian Brauner @ 2019-05-26 10:26 UTC (permalink / raw)
  To: viro, linux-kernel, torvalds, jannh
  Cc: fweimer, oleg, arnd, dhowells, Christian Brauner,
	Pavel Emelyanov, Andrew Morton, Adrian Reber, Andrei Vagin,
	linux-api

This adds the clone6 system call.

As mentioned several times already (cf. [7], [8]) here's the promised
patchset for clone6().

We recently merged the CLONE_PIDFD patchset (cf. [1]). It took the last
free flag from clone().

Independent of the CLONE_PIDFD patchset a time namespace has been discussed
at Linux Plumber Conference last year and has been sent out and reviewed
(cf. [5]). It is expected that it will go upstream in the not too distant
future. However, it relies on the addition of the CLONE_NEWTIME flag to
clone(). The only other good candidate - CLONE_DETACHED - is currently not
recycable as we have identified at least two large or widely used codebases
that currently pass this flag (cf. [2], [3], and [4]). Given that we
grabbed the last clone() flag we effectively blocked the time namespace
patchset. It just seems right that we unblock it again.

The idea is to keep clone6() very simple and close to the original clone(),
specifically, to keep on supporting old clone()-based workloads.
We know there have been various creative proposals how a new process
creation syscall or even api is supposed to look like. Some people even
going so far as to argue that the traditional fork()+exec() split should be
abandoned in favor of an in-kernel version of spawn(). Independent of
whether or not we personally think spawn() is a good idea this patchset has
and does not want to have anything to do with this.
One stance we take is that there's no real good alternative to
clone()+exec() and we need and want to support this model going forward
independent of spawn().
The following requirements guided us for clone6():
- bump the number of available flags as much as possible while ensuring
  that all flag arguments are passed in registers so they remain easily
  accessible for seccomp.
- move non-flag arguments that are currently passed as separate arguments
  in clone() into a dedicated struct
  - choose a struct layout that is easy to handle on 32 and on 64 bit
  - choose a struct layout that is extensible
  - give new flags that currently need to abuse another flag's dedicated
    return argument in clone() their own dedicated return argument
    (e.g. CLONE_PIDFD)
- ease of transition for userspace from clone() to clone6()
- do not try to be clever or complex: keep clone6() as dumb as possible

What we came up with is clone6() which has the following signature:

struct clone6_args {
        __s32 pidfd;
        __aligned_u64 parent_tidptr;
        __aligned_u64 child_tidptr;
        __aligned_u64 stack;
        __aligned_u64 stack_size;
        __aligned_u64 tls;
};

long sys_clone6(struct clone6_args __user *uargs,
                unsigned int flags1,
                unsigned int flags2,
                unsigned int flags3,
                unsigned int flags4,
                unsigned int flags5);

clone6() cleanly supports all of the supported flags from clone() in
flags1, i.e. flags1 is full and all legacy workloads are supported with
clone6().
With clone6() we have 160 flag values in total which - even for a feature
heavy syscall like clone - should hold quite a while. If they are really
all taken at some point we can simply bite the bullet and start adding
additional flag arguments into struct clone6 itself.

Another advantage of sticking close to the old clone() is the low cost for
userspace to switch to this new api. Quite a lot of userspace apis (e.g.
pthreads) are based on the clone() syscall. With the new clone6() syscall
supporting all of the old workloads and opening up the ability to add new
features should make switching to it for userspace more appealing. In
essence, glibc can just write a simple wrapper to switch from clone() to
clone6().

There has been some interest in this patchset already. We have received a
patch from the CRIU corner for clone6() that would set the PID/TID of a
restored process without /proc/sys/kernel/ns_last_pid to eliminate a race.

/* References */
[1]: b3e5838252665ee4cfa76b82bdf1198dca81e5be
[2]: https://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/SandboxFilter.cpp#343
[3]: https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_create.c#n233
[4]: https://sources.debian.org/src/blcr/0.8.5-2.3/cr_module/cr_dump_self.c/?hl=740#L740
[5]: https://lore.kernel.org/lkml/20190425161416.26600-1-dima@arista.com/
[6]: https://lore.kernel.org/lkml/20190425161416.26600-2-dima@arista.com/
[7]: https://lore.kernel.org/lkml/CAHrFyr5HxpGXA2YrKza-oB-GGwJCqwPfyhD-Y5wbktWZdt0sGQ@mail.gmail.com/
[8]: https://lore.kernel.org/lkml/20190524102756.qjsjxukuq2f4t6bo@brauner.io/

Co-developed-by: Jann Horn <jannh@google.com>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Adrian Reber <adrian@lisas.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
---
 arch/x86/ia32/sys_ia32.c   |   9 ++-
 include/linux/sched/task.h |   2 +-
 include/linux/syscalls.h   |   9 +++
 include/uapi/linux/sched.h |  14 ++++
 kernel/fork.c              | 161 ++++++++++++++++++++++++++++---------
 5 files changed, 152 insertions(+), 43 deletions(-)

diff --git a/arch/x86/ia32/sys_ia32.c b/arch/x86/ia32/sys_ia32.c
index a43212036257..55a8c550ba74 100644
--- a/arch/x86/ia32/sys_ia32.c
+++ b/arch/x86/ia32/sys_ia32.c
@@ -237,6 +237,11 @@ COMPAT_SYSCALL_DEFINE5(x86_clone, unsigned long, clone_flags,
 		       unsigned long, newsp, int __user *, parent_tidptr,
 		       unsigned long, tls_val, int __user *, child_tidptr)
 {
-	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr,
-			tls_val);
+	struct clone6_args args = {
+		.stack = newsp,
+		.parent_tidptr = (uintptr_t)parent_tidptr,
+		.tls = tls_val,
+		.child_tidptr = (uintptr_t)child_tidptr
+	};
+	return _do_fork(clone_flags, &args);
 }
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index f1227f2c38a4..06e7c0df6ab0 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -73,7 +73,7 @@ extern void do_group_exit(int);
 extern void exit_files(struct task_struct *);
 extern void exit_itimers(struct signal_struct *);
 
-extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *, unsigned long);
+extern long _do_fork(u64 clone_flags, struct clone6_args *uargs);
 extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
 struct task_struct *fork_idle(int);
 struct mm_struct *copy_init_mm(void);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe1be5b..235df5c5e711 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -70,6 +70,7 @@ struct sigaltstack;
 struct rseq;
 union bpf_attr;
 struct io_uring_params;
+struct clone6_args;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -852,6 +853,14 @@ asmlinkage long sys_clone(unsigned long, unsigned long, int __user *,
 	       int __user *, unsigned long);
 #endif
 #endif
+
+#ifdef __ARCH_WANT_SYS_CLONE
+asmlinkage long sys_clone6(struct clone6_args __user *uargs,
+			   unsigned int flags1, unsigned int flags2,
+			   unsigned int flags3, unsigned int flags4,
+			   unsigned int flags5);
+#endif
+
 asmlinkage long sys_execve(const char __user *filename,
 		const char __user *const __user *argv,
 		const char __user *const __user *envp);
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index ed4ee170bee2..b8d2809c5bc6 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -2,6 +2,8 @@
 #ifndef _UAPI_LINUX_SCHED_H
 #define _UAPI_LINUX_SCHED_H
 
+#include <linux/types.h>
+
 /*
  * cloning flags:
  */
@@ -31,6 +33,18 @@
 #define CLONE_NEWNET		0x40000000	/* New network namespace */
 #define CLONE_IO		0x80000000	/* Clone io context */
 
+/*
+ * Arguments for the clone6 syscall
+ */
+struct clone6_args {
+	__s32 pidfd;
+	__aligned_u64 parent_tidptr;
+	__aligned_u64 child_tidptr;
+	__aligned_u64 stack;
+	__aligned_u64 stack_size;
+	__aligned_u64 tls;
+};
+
 /*
  * Scheduling policies
  */
diff --git a/kernel/fork.c b/kernel/fork.c
index b4cba953040a..ffd5471c64af 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1760,19 +1760,21 @@ static __always_inline void delayed_free_task(struct task_struct *tsk)
  * flags). The actual kick-off is left to the caller.
  */
 static __latent_entropy struct task_struct *copy_process(
-					unsigned long clone_flags,
-					unsigned long stack_start,
-					unsigned long stack_size,
-					int __user *parent_tidptr,
-					int __user *child_tidptr,
+					u64 clone_flags,
 					struct pid *pid,
 					int trace,
-					unsigned long tls,
-					int node)
+					int node,
+					struct clone6_args *args,
+					bool is_clone6)
 {
 	int pidfd = -1, retval;
 	struct task_struct *p;
 	struct multiprocess_signals delayed;
+	int __user *child_tidptr = u64_to_user_ptr(args->child_tidptr);
+	int __user *parent_tidptr = NULL;
+	unsigned long tls = args->tls;
+	unsigned long stack_start = args->stack;
+	unsigned long stack_size = args->stack_size;
 
 	/*
 	 * Don't allow sharing the root directory with processes in a different
@@ -1824,25 +1826,32 @@ static __latent_entropy struct task_struct *copy_process(
 		int reserved;
 
 		/*
-		 * - CLONE_PARENT_SETTID is useless for pidfds and also
-		 *   parent_tidptr is used to return pidfds.
 		 * - CLONE_DETACHED is blocked so that we can potentially
 		 *   reuse it later for CLONE_PIDFD.
 		 * - CLONE_THREAD is blocked until someone really needs it.
 		 */
-		if (clone_flags &
-		    (CLONE_DETACHED | CLONE_PARENT_SETTID | CLONE_THREAD))
+		if (clone_flags & (CLONE_DETACHED | CLONE_THREAD))
 			return ERR_PTR(-EINVAL);
 
-		/*
-		 * Verify that parent_tidptr is sane so we can potentially
-		 * reuse it later.
-		 */
-		if (get_user(reserved, parent_tidptr))
-			return ERR_PTR(-EFAULT);
+		if (!is_clone6) {
+			/*
+			 * For non-clone6() versions parent_tidptr is used to
+			 * return pidfds.
+			 */
+			if (clone_flags & CLONE_PARENT_SETTID)
+				return ERR_PTR(-EINVAL);
 
-		if (reserved != 0)
-			return ERR_PTR(-EINVAL);
+			/*
+			 * Verify that parent_tidptr is sane so we can
+			 * potentially reuse it later.
+			 */
+			parent_tidptr = u64_to_user_ptr(args->parent_tidptr);
+			if (get_user(reserved, parent_tidptr))
+				return ERR_PTR(-EFAULT);
+
+			if (reserved != 0)
+				return ERR_PTR(-EINVAL);
+		}
 	}
 
 	/*
@@ -2062,9 +2071,14 @@ static __latent_entropy struct task_struct *copy_process(
 			goto bad_fork_free_pid;
 
 		pidfd = retval;
-		retval = put_user(pidfd, parent_tidptr);
-		if (retval)
-			goto bad_fork_put_pidfd;
+		if (!is_clone6) {
+			/* store pidfd in parent_tidptr for legacy clone */
+			retval = put_user(pidfd, parent_tidptr);
+			if (retval)
+				goto bad_fork_put_pidfd;
+		} else {
+			args->pidfd = pidfd;
+		}
 	}
 
 #ifdef CONFIG_BLOCK
@@ -2313,8 +2327,10 @@ static inline void init_idle_pids(struct task_struct *idle)
 struct task_struct *fork_idle(int cpu)
 {
 	struct task_struct *task;
-	task = copy_process(CLONE_VM, 0, 0, NULL, NULL, &init_struct_pid, 0, 0,
-			    cpu_to_node(cpu));
+	struct clone6_args args = { 0 };
+
+	task = copy_process(CLONE_VM, &init_struct_pid, 0,
+			    cpu_to_node(cpu), &args, false);
 	if (!IS_ERR(task)) {
 		init_idle_pids(task);
 		init_idle(task, cpu);
@@ -2334,18 +2350,15 @@ struct mm_struct *copy_init_mm(void)
  * It copies the process, and if successful kick-starts
  * it and waits for it to finish using the VM if required.
  */
-long _do_fork(unsigned long clone_flags,
-	      unsigned long stack_start,
-	      unsigned long stack_size,
-	      int __user *parent_tidptr,
-	      int __user *child_tidptr,
-	      unsigned long tls)
+static long _do_clone_common(u64 clone_flags, struct clone6_args *args,
+			     bool is_clone6)
 {
 	struct completion vfork;
 	struct pid *pid;
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
+	int __user *parent_tidptr = u64_to_user_ptr(args->parent_tidptr);
 
 	/*
 	 * Determine whether and which event to report to ptracer.  When
@@ -2365,8 +2378,8 @@ long _do_fork(unsigned long clone_flags,
 			trace = 0;
 	}
 
-	p = copy_process(clone_flags, stack_start, stack_size, parent_tidptr,
-			 child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
+	p = copy_process(clone_flags, NULL, trace, NUMA_NO_NODE, args,
+			 is_clone6);
 	add_latent_entropy();
 
 	if (IS_ERR(p))
@@ -2405,6 +2418,11 @@ long _do_fork(unsigned long clone_flags,
 	return nr;
 }
 
+long _do_fork(u64 clone_flags, struct clone6_args *args)
+{
+	return _do_clone_common(clone_flags, args, false);
+}
+
 #ifndef CONFIG_HAVE_COPY_THREAD_TLS
 /* For compatibility with architectures that call do_fork directly rather than
  * using the syscall entry points below. */
@@ -2414,8 +2432,14 @@ long do_fork(unsigned long clone_flags,
 	      int __user *parent_tidptr,
 	      int __user *child_tidptr)
 {
-	return _do_fork(clone_flags, stack_start, stack_size,
-			parent_tidptr, child_tidptr, 0);
+	struct clone6_args args = {
+		.stack = stack_start,
+		.stack_size = stack_size,
+		.parent_tidptr = (uintptr_t)parent_tidptr,
+		.child_tidptr = (uintptr_t)child_tidptr
+	};
+
+	return _do_fork(clone_flags, &args);
 }
 #endif
 
@@ -2424,15 +2448,20 @@ long do_fork(unsigned long clone_flags,
  */
 pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
 {
-	return _do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
-		(unsigned long)arg, NULL, NULL, 0);
+	struct clone6_args args = {
+		.stack = (unsigned long)fn,
+		.stack_size = (unsigned long)arg,
+	};
+	return _do_fork(flags|CLONE_VM|CLONE_UNTRACED, &args);
 }
 
 #ifdef __ARCH_WANT_SYS_FORK
 SYSCALL_DEFINE0(fork)
 {
 #ifdef CONFIG_MMU
-	return _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0);
+	struct clone6_args args = { 0 };
+
+	return _do_fork(SIGCHLD, &args);
 #else
 	/* can not support in nommu mode */
 	return -EINVAL;
@@ -2443,8 +2472,9 @@ SYSCALL_DEFINE0(fork)
 #ifdef __ARCH_WANT_SYS_VFORK
 SYSCALL_DEFINE0(vfork)
 {
-	return _do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0,
-			0, NULL, NULL, 0);
+	struct clone6_args args = { 0 };
+
+	return _do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, &args);
 }
 #endif
 
@@ -2472,7 +2502,58 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
 		 unsigned long, tls)
 #endif
 {
-	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
+	struct clone6_args args = {
+		.stack = newsp,
+		.parent_tidptr = (uintptr_t)parent_tidptr,
+		.tls = tls,
+		.child_tidptr = (uintptr_t)child_tidptr
+	};
+	return _do_fork(clone_flags, &args);
+}
+
+SYSCALL_DEFINE6(clone6, struct clone6_args __user*, uargs,
+			unsigned int, flags1,
+			unsigned int, flags2,
+			unsigned int, flags3,
+			unsigned int, flags4,
+			unsigned int, flags5)
+{
+	struct clone6_args args = { 0 };
+	u64 flags = flags1;
+	int result;
+
+	if (flags2 || flags3 || flags4 || flags5)
+		return -EINVAL;
+
+	/*
+	 * flags1 is full so we only need to verify that CLONE_DETACHED
+	 * is not passed since we can't use it.
+	 */
+	if (flags & CLONE_DETACHED)
+		return -EINVAL;
+
+	result = get_user(args.stack, &uargs->stack);
+#if defined(CONFIG_CLONE_BACKWARDS3)
+	if (!result)
+		result = get_user(args.stack_size, &uargs->stack_size);
+#endif
+	if (!result && (flags & (CLONE_CHILD_CLEARTID | CLONE_CHILD_SETTID)))
+		result = get_user(args.child_tidptr, &uargs->child_tidptr);
+	if (!result && (flags & CLONE_PARENT_SETTID))
+		result = get_user(args.parent_tidptr, &uargs->parent_tidptr);
+	if (!result && (flags & CLONE_SETTLS))
+		result = get_user(args.tls, &uargs->tls);
+	if (result)
+		return -EFAULT;
+
+	result = _do_clone_common(flags, &args, true);
+
+	if (flags & CLONE_PIDFD) {
+		if (put_user(args.pidfd, &uargs->pidfd))
+			return -EFAULT;
+	}
+
+	return result;
 }
 #endif
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/2] arch: wire-up clone6() syscall on x86
  2019-05-26 10:26 [PATCH 1/2] fork: add clone6 Christian Brauner
@ 2019-05-26 10:26 ` Christian Brauner
  2019-05-27 10:02   ` Arnd Bergmann
  2019-05-26 16:50 ` [PATCH 1/2] fork: add clone6 Linus Torvalds
  2019-05-28 15:23 ` Eric W. Biederman
  2 siblings, 1 reply; 15+ messages in thread
From: Christian Brauner @ 2019-05-26 10:26 UTC (permalink / raw)
  To: viro, linux-kernel, torvalds, jannh
  Cc: fweimer, oleg, arnd, dhowells, Christian Brauner, Andrew Morton,
	Adrian Reber, linux-api, linux-arch, x86

Wire up the clone6() call on x86.

This patch only wires up clone6() on x86. Some of the arches look like they
need special assembly massaging and it is probably smarter if the
appropriate arch maintainers would do the actual wiring.

Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Adrian Reber <adrian@lisas.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: x86@kernel.org
---
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 include/uapi/asm-generic/unistd.h      | 4 +++-
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ad968b7bac72..dffcd57990b3 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -438,3 +438,4 @@
 431	i386	fsconfig		sys_fsconfig			__ia32_sys_fsconfig
 432	i386	fsmount			sys_fsmount			__ia32_sys_fsmount
 433	i386	fspick			sys_fspick			__ia32_sys_fspick
+436	i386	clone6			sys_clone6			__ia32_sys_clone6
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index b4e6f9e6204a..73bf4cc099a2 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
 431	common	fsconfig		__x64_sys_fsconfig
 432	common	fsmount			__x64_sys_fsmount
 433	common	fspick			__x64_sys_fspick
+436	common	clone6			__x64_sys_clone6/ptregs
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a87904daf103..500bdb4c5e36 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -844,9 +844,11 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
 __SYSCALL(__NR_fsmount, sys_fsmount)
 #define __NR_fspick 433
 __SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_clone6 436
+__SYSCALL(__NR_clone6, sys_clone6)
 
 #undef __NR_syscalls
-#define __NR_syscalls 434
+#define __NR_syscalls 437
 
 /*
  * 32 bit systems traditionally used different
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] fork: add clone6
  2019-05-26 10:26 [PATCH 1/2] fork: add clone6 Christian Brauner
  2019-05-26 10:26 ` [PATCH 2/2] arch: wire-up clone6() syscall on x86 Christian Brauner
@ 2019-05-26 16:50 ` Linus Torvalds
  2019-05-27 10:42   ` Christian Brauner
  2019-05-28 15:23 ` Eric W. Biederman
  2 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2019-05-26 16:50 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Al Viro, Linux List Kernel Mailing, Jann Horn, Florian Weimer,
	Oleg Nesterov, Arnd Bergmann, David Howells, Pavel Emelyanov,
	Andrew Morton, Adrian Reber, Andrei Vagin, Linux API

On Sun, May 26, 2019 at 3:27 AM Christian Brauner <christian@brauner.io> wrote:
>
> This adds the clone6 system call.

No, this is not acceptable.

> +       struct clone6_args args = {

First of all, we don't pass in "clone6_args" to the actual implementation.

Passing in lots of args as a structure is fine. But it damn well isn't
a "clone6" structure. It's just a "clone_args" inside the kernel, and
there should be a separate clone_uapi_args for the user/kernel
interface.

The user interface (in the uapi file) may be called "clone6_args", but
that is *not* then what we pass around inside the kernel.

But even for the uapi version, the "clone6" name doesn't make any
sense as a name to begin with. It's not the sixth revision of clone,
and that clone6 structure isn't even "six arguments", because it has
lots of other arguments (with the extra flags registers you did, but
I'll get to that later).

Finally, the actual definition of that thing is wrong for a uapi interface too:

   struct clone6_args {
          __s32 pidfd;
          __aligned_u64 parent_tidptr;
          __aligned_u64 child_tidptr;
          ...

because now it has a hole in that structure definition because of
alignment issues. So make "pidfd" be 64-bit too. You clearly tried to
make this be compat-aware etc, but we really don't want to have parts
of user structures with random padding that we then don't check.

So the *user* visible structure should be full of those __aligned_u64
to make sure everything is aligned and doesn't care about compat
models.

But the *kernel* structure that we use should be nice and tailored for
kernel use, and use proper kernel types.

And related to that disgusting thing:

> -extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *, unsigned long);
> +extern long _do_fork(u64 clone_flags, struct clone6_args *uargs);

This is the correct thing to do (except for the "clone6" name), but:

>  static __latent_entropy struct task_struct *copy_process(
> -                                       unsigned long clone_flags,
> -                                       unsigned long stack_start,
> -                                       unsigned long stack_size,
> -                                       int __user *parent_tidptr,
> -                                       int __user *child_tidptr,
> +                                       u64 clone_flags,
>                                         struct pid *pid,
>                                         int trace,
> -                                       unsigned long tls,
> -                                       int node)
> +                                       int node,
> +                                       struct clone6_args *args,
> +                                       bool is_clone6)

But this is absolutely wrong.

That "bool is_clone6" is garbage.

The in-kernel "struict clone_args" should just be everything that
clone needs to know.

So the in-kernel "struct clone_args" should never ever need some
"is_clone6" boolean to specify how you then treat the arguments.

Stuff like this:

> +       int __user *child_tidptr = u64_to_user_ptr(args->child_tidptr);

should have been done in the stubs that set up the "struct clone_args"
thing, not in copy_process().

So all those

        if (is_clone6) ...

things need to go away, and it just needs to be the caller (who knows
what kind of clone call it is) setting up the clone_args properly,
depending on the *different* user interfaces.

And no, we don't do crazy stuff like this either:

+SYSCALL_DEFINE6(clone6, struct clone6_args __user*, uargs,
+                       unsigned int, flags1,
+                       unsigned int, flags2,
+                       unsigned int, flags3,
+                       unsigned int, flags4,
+                       unsigned int, flags5)

where this is wrong because it randomly just decides that everything
is flags (BS), and then doubles down on stupidity by making them
"unsigned int", so that the tests of the flags actually don't test the
upper bits anyway.

Why would you reserve 5 words of flags for future use when you have a
whole structure in user space that you _didn't_ reserve anything in?

So remove all those random flags, put ONE of them in the structure you
already have (as a "__aligned_u64", so that you actually get 64 bits,
not the 32 in "unsigned int"), and then perhaps you can add *one*
other register argument, which is the *size* of the structure that
user space is passing in.

That way, if we ever expand things, we'll just add new flags to the
end of the in-memory structure we have, but old binaries that don't
know about the size will continue to pass in the old size, and we'll
be all good.

So I hate the whole patch with a passion. It does absolutely
everything wrong from an interface standpoint.

I don't hate the notion of just adding flags. But do it cleanly, not
with random "is_clone6" bits.

And no, the structure we pass in from user space must *NOT* be the
same structure that we just copy blindly around as a kernel structure.
We've done that mistake before, and it is *always* a mistake. Think
"struct stat" and friends. No, we have a "struct kstat" for the kernel
version, and then we convert at the system call boundary. Let's not
repeat traditional mistakes.

And part of the conversion is exactly that

  Make everybody use the same in-kernel interface, so that the
lower-down routines don't have those kinds of "if it's this system
call, do this, otherwise, do that"

kind of horrible nasty mis-designs.

So to summarize:

 (a) make a separate kernel-only "clone_args" structure that is
unified and works for every single version of clone/fork/vfork, and
that has pointers and types in the kernel native format.

     So this will have things like "int __user *parent_tidptr" etc,
and something like *one* u64 flag field.

 (b) have the new system call have a nice compat-safe but
_independent_ format ("struct clone_uapi_struct")

 (c) you can now make the new system-call interface be something like

int clone_ext(struct clone_uapi_struct __user *uargs, size_t size);
{
        struct clone_uapi_struct kargs;
        struct clone_args args;

        // The API is defined as a stucture of 64-bit words
        if (size & 7)
                return -EINVAL;
        if (!access_ok(uargs, size))
                return -EFAULT;

        // If the user passes in new flags we don't
        // know about (because it was compiled against
        // a newer kernel than what is runn ing), make
        // sure they are zero
        if (size > sizeof(kargs)) {
                size_t n = (size - sizeof(kargs))>>3;
                u64 __user *ptr = (u64 __user *)(uargs+1)

                if (n > 10)
                        return -EINVAL;
                do {
                        u64 val;
                        if (get_user(val, ptr++))
                                return -EFAULT;
                        if (val)
                                return -EINVAL;
                } while (--n);

                // Ok, everything else was zero, we use
                // the part we know about
                size = sizeof(kargs);
        }
        memset(&kargs, 0, sizeof(kargs));
        memcpy_from_user(&kargs, uargs, size);

        .. now convert 'kargs' to 'args' ..

See? The above may be a bit *unnecessarily* anal about the whole
checking etc, but it's actually fairly simple in the end. And it means
that we have that "convert to internal format" in just one place - the
place where it makes sense. And it's a lot cleaner interface to user
mode, and allows for easy expansion later.

NOTE! By all means make "clone_ext()" take some of the other arguments
as part of the argument registers, not everything has to be part of
"struct clone_uapi_struct". But none of this "flag1..5" stuff. That's
what a struct is for.

Maybe the basic ":u64 clone_flags" could be the first argument (but
64-bit arguments can be a bit nasty for compat layers, so it's
probably not even worth it - once you have to copy an argument
structure from user space, you might as well put everything there).

                Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] arch: wire-up clone6() syscall on x86
  2019-05-26 10:26 ` [PATCH 2/2] arch: wire-up clone6() syscall on x86 Christian Brauner
@ 2019-05-27 10:02   ` Arnd Bergmann
  2019-05-27 10:45     ` Christian Brauner
  0 siblings, 1 reply; 15+ messages in thread
From: Arnd Bergmann @ 2019-05-27 10:02 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Al Viro, Linux Kernel Mailing List, Linus Torvalds, Jann Horn,
	Florian Weimer, Oleg Nesterov, David Howells, Andrew Morton,
	Adrian Reber, Linux API, linux-arch, the arch/x86 maintainers

On Sun, May 26, 2019 at 12:27 PM Christian Brauner <christian@brauner.io> wrote:
>
> Wire up the clone6() call on x86.
>
> This patch only wires up clone6() on x86. Some of the arches look like they
> need special assembly massaging and it is probably smarter if the
> appropriate arch maintainers would do the actual wiring.

Why do some architectures need special cases here? I'd prefer to have
new system calls always get defined in a way that avoids this, and
have a common entry point for everyone.

Looking at the m68k sys_clone comment in
arch/m68k/kernel/process.c, it seems that this was done as an
optimization to deal with an inferior ABI. Similar code is present
in h8300, ia64, nios2, and sparc. If all of them just do this to
shave off a few cycles from the system call entry, I really
couldn't care less.

       Arnd

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] fork: add clone6
  2019-05-26 16:50 ` [PATCH 1/2] fork: add clone6 Linus Torvalds
@ 2019-05-27 10:42   ` Christian Brauner
  2019-05-27 19:27     ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Christian Brauner @ 2019-05-27 10:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Linux List Kernel Mailing, Jann Horn, Florian Weimer,
	Oleg Nesterov, Arnd Bergmann, David Howells, Pavel Emelyanov,
	Andrew Morton, Adrian Reber, Andrei Vagin, Linux API

Moin,

Wasn't near a computer yesterday so sorry for the late reply. :) I
(I should note that this was supposed to be prefixed with RFC. But *shrug*.)

On Sun, May 26, 2019 at 09:50:32AM -0700, Linus Torvalds wrote:
> On Sun, May 26, 2019 at 3:27 AM Christian Brauner <christian@brauner.io> wrote:
> >
> > This adds the clone6 system call.
> 
> No, this is not acceptable.
> 
> > +       struct clone6_args args = {
> 
> First of all, we don't pass in "clone6_args" to the actual implementation.
> 
> Passing in lots of args as a structure is fine. But it damn well isn't
> a "clone6" structure. It's just a "clone_args" inside the kernel, and
> there should be a separate clone_uapi_args for the user/kernel
> interface.

Yeah, that makes sense. This is similar to what we recently did for
signals, i.e. kernel_siginfo_t as the kernel internal struct and only
expose siginfo_t to userspace.

> 
> The user interface (in the uapi file) may be called "clone6_args", but
> that is *not* then what we pass around inside the kernel.
> 
> But even for the uapi version, the "clone6" name doesn't make any
> sense as a name to begin with. It's not the sixth revision of clone,

Ok, no argument about the argument struct.

But for the name of the actual syscall itself we originally used
the revision number aka clone3() (because of clone2 on ia64).
We chose clone6() after we asked around what the current convention is:
revision number or argument number. And we were told that it is common
to use the arg number. And from the syscall list it looked like people
were right:
- accept4(/* 4 args */)
- dup2(/* 2 args */)
- dup3(/* 3 args */)
- eventfd2(/* 2 args */)
- pipe2(/* 2 args */)
- pselect6(/* 6 args, including structs */)
- signalfd4(/* 4 args, one of them a struct */)
- umount2(/* 2 args */)
- wait3(/* 3 args, one of them a struct */)
- wait4(/* 4 args, one of them a struct */)

Anyway, we can name it clone3() for the next revision.
(The "argument number" naming scheme struck us as kind of odd. Jann
 pointed out that if we ever have to have another syscall that already
 takes 6 arguments what would it be named "pselect6.1"?)

> and that clone6 structure isn't even "six arguments", because it has
> lots of other arguments (with the extra flags registers you did, but
> I'll get to that later).
> 
> Finally, the actual definition of that thing is wrong for a uapi interface too:
> 
>    struct clone6_args {
>           __s32 pidfd;
>           __aligned_u64 parent_tidptr;
>           __aligned_u64 child_tidptr;
>           ...
> 
> because now it has a hole in that structure definition because of
> alignment issues. So make "pidfd" be 64-bit too. You clearly tried to
> make this be compat-aware etc, but we really don't want to have parts
> of user structures with random padding that we then don't check.

Noted.

> 
> So the *user* visible structure should be full of those __aligned_u64
> to make sure everything is aligned and doesn't care about compat
> models.
> 
> But the *kernel* structure that we use should be nice and tailored for
> kernel use, and use proper kernel types.

Noted.

> 
> And related to that disgusting thing:
> 
> > -extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *, unsigned long);
> > +extern long _do_fork(u64 clone_flags, struct clone6_args *uargs);
> 
> This is the correct thing to do (except for the "clone6" name), but:
> 
> >  static __latent_entropy struct task_struct *copy_process(
> > -                                       unsigned long clone_flags,
> > -                                       unsigned long stack_start,
> > -                                       unsigned long stack_size,
> > -                                       int __user *parent_tidptr,
> > -                                       int __user *child_tidptr,
> > +                                       u64 clone_flags,
> >                                         struct pid *pid,
> >                                         int trace,
> > -                                       unsigned long tls,
> > -                                       int node)
> > +                                       int node,
> > +                                       struct clone6_args *args,
> > +                                       bool is_clone6)
> 
> But this is absolutely wrong.
> 
> That "bool is_clone6" is garbage.
> 
> The in-kernel "struict clone_args" should just be everything that
> clone needs to know.

This goes back to the missing split between an in-kernel struct and uapi
struct, but sure.

> 
> So the in-kernel "struct clone_args" should never ever need some
> "is_clone6" boolean to specify how you then treat the arguments.
> 
> Stuff like this:
> 
> > +       int __user *child_tidptr = u64_to_user_ptr(args->child_tidptr);
> 
> should have been done in the stubs that set up the "struct clone_args"
> thing, not in copy_process().

Noted.

> 
> So all those
> 
>         if (is_clone6) ...
> 
> things need to go away, and it just needs to be the caller (who knows
> what kind of clone call it is) setting up the clone_args properly,
> depending on the *different* user interfaces.

Right.

> 
> And no, we don't do crazy stuff like this either:
> 
> +SYSCALL_DEFINE6(clone6, struct clone6_args __user*, uargs,
> +                       unsigned int, flags1,
> +                       unsigned int, flags2,
> +                       unsigned int, flags3,
> +                       unsigned int, flags4,
> +                       unsigned int, flags5)
> 
> where this is wrong because it randomly just decides that everything
> is flags (BS), and then doubles down on stupidity by making them
> "unsigned int", so that the tests of the flags actually don't test the

For all system calls that use flag arguments so far "unsigned int" was
the way to go because of 32bit and because of prior convention.
We debated having this be a 64 bit. But honestly, this is one of those
junctions where it becomes a matter of: "have you been around long
enough to simply ignore prior conventions?".

> upper bits anyway.
> 
> Why would you reserve 5 words of flags for future use when you have a
> whole structure in user space that you _didn't_ reserve anything in?
> 
> So remove all those random flags, put ONE of them in the structure you
> already have (as a "__aligned_u64", so that you actually get 64 bits,
> not the 32 in "unsigned int"), and then perhaps you can add *one*
> other register argument, which is the *size* of the structure that
> user space is passing in.

Sure, that was what I originally had in mind but the valid point was
made that passing all flags arguments in registers makes it easy for
seccomp to filter them.
It is a cleaner design to do it in the struct, sure. I don't have
quarrels with moving the flags into the struct itself.

> 
> That way, if we ever expand things, we'll just add new flags to the
> end of the in-memory structure we have, but old binaries that don't
> know about the size will continue to pass in the old size, and we'll
> be all good.

Right, that's the way the commit message outlines what we would've
wanted to do once we run out of flags in the register arguments.

> 
> So I hate the whole patch with a passion. It does absolutely
> everything wrong from an interface standpoint.

He, fair enough. Let's see how fast we can fix this then. :)

> 
> I don't hate the notion of just adding flags. But do it cleanly, not
> with random "is_clone6" bits.
> 
> And no, the structure we pass in from user space must *NOT* be the
> same structure that we just copy blindly around as a kernel structure.
> We've done that mistake before, and it is *always* a mistake. Think
> "struct stat" and friends. No, we have a "struct kstat" for the kernel
> version, and then we convert at the system call boundary. Let's not
> repeat traditional mistakes.
> 
> And part of the conversion is exactly that
> 
>   Make everybody use the same in-kernel interface, so that the
> lower-down routines don't have those kinds of "if it's this system
> call, do this, otherwise, do that"
> 
> kind of horrible nasty mis-designs.
> 
> So to summarize:
> 
>  (a) make a separate kernel-only "clone_args" structure that is
> unified and works for every single version of clone/fork/vfork, and
> that has pointers and types in the kernel native format.
> 
>      So this will have things like "int __user *parent_tidptr" etc,
> and something like *one* u64 flag field.
> 
>  (b) have the new system call have a nice compat-safe but
> _independent_ format ("struct clone_uapi_struct")
> 
>  (c) you can now make the new system-call interface be something like
> 
> int clone_ext(struct clone_uapi_struct __user *uargs, size_t size);
> {
>         struct clone_uapi_struct kargs;
>         struct clone_args args;
> 
>         // The API is defined as a stucture of 64-bit words
>         if (size & 7)
>                 return -EINVAL;
>         if (!access_ok(uargs, size))
>                 return -EFAULT;
> 
>         // If the user passes in new flags we don't
>         // know about (because it was compiled against
>         // a newer kernel than what is runn ing), make
>         // sure they are zero
>         if (size > sizeof(kargs)) {
>                 size_t n = (size - sizeof(kargs))>>3;
>                 u64 __user *ptr = (u64 __user *)(uargs+1)
> 
>                 if (n > 10)
>                         return -EINVAL;
>                 do {
>                         u64 val;
>                         if (get_user(val, ptr++))
>                                 return -EFAULT;
>                         if (val)
>                                 return -EINVAL;
>                 } while (--n);
> 
>                 // Ok, everything else was zero, we use
>                 // the part we know about
>                 size = sizeof(kargs);
>         }
>         memset(&kargs, 0, sizeof(kargs));
>         memcpy_from_user(&kargs, uargs, size);
> 
>         .. now convert 'kargs' to 'args' ..
> 
> See? The above may be a bit *unnecessarily* anal about the whole
> checking etc, but it's actually fairly simple in the end. And it means
> that we have that "convert to internal format" in just one place - the
> place where it makes sense. And it's a lot cleaner interface to user
> mode, and allows for easy expansion later.
> 
> NOTE! By all means make "clone_ext()" take some of the other arguments
> as part of the argument registers, not everything has to be part of
> "struct clone_uapi_struct". But none of this "flag1..5" stuff. That's
> what a struct is for.
> 
> Maybe the basic ":u64 clone_flags" could be the first argument (but
> 64-bit arguments can be a bit nasty for compat layers, so it's
> probably not even worth it - once you have to copy an argument
> structure from user space, you might as well put everything there).

Hm, still pondering whether having one unsigned int argument passed
through registers that captures all the flags from the old clone() would
be a good idea. But I'll call that once the other code is written and if
I *should* think it is a good idea and people hate it they can just yell
and we can remove it.

Thanks the pointers!
Christian

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] arch: wire-up clone6() syscall on x86
  2019-05-27 10:02   ` Arnd Bergmann
@ 2019-05-27 10:45     ` Christian Brauner
  2019-05-27 12:28       ` Arnd Bergmann
  0 siblings, 1 reply; 15+ messages in thread
From: Christian Brauner @ 2019-05-27 10:45 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Al Viro, Linux Kernel Mailing List, Linus Torvalds, Jann Horn,
	Florian Weimer, Oleg Nesterov, David Howells, Andrew Morton,
	Adrian Reber, Linux API, linux-arch, the arch/x86 maintainers

On Mon, May 27, 2019 at 12:02:37PM +0200, Arnd Bergmann wrote:
> On Sun, May 26, 2019 at 12:27 PM Christian Brauner <christian@brauner.io> wrote:
> >
> > Wire up the clone6() call on x86.
> >
> > This patch only wires up clone6() on x86. Some of the arches look like they
> > need special assembly massaging and it is probably smarter if the
> > appropriate arch maintainers would do the actual wiring.
> 
> Why do some architectures need special cases here? I'd prefer to have
> new system calls always get defined in a way that avoids this, and
> have a common entry point for everyone.
> 
> Looking at the m68k sys_clone comment in
> arch/m68k/kernel/process.c, it seems that this was done as an
> optimization to deal with an inferior ABI. Similar code is present
> in h8300, ia64, nios2, and sparc. If all of them just do this to
> shave off a few cycles from the system call entry, I really
> couldn't care less.

I'm happy to wire all arches up at the same time in the next revision. I
just wasn't sure why some of them were assemblying the living hell out
of clone; especially ia64. I really didn't want to bother touching all
of this just for an initial RFC.

Christian

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] arch: wire-up clone6() syscall on x86
  2019-05-27 10:45     ` Christian Brauner
@ 2019-05-27 12:28       ` Arnd Bergmann
  2019-05-27 12:34         ` Christian Brauner
  0 siblings, 1 reply; 15+ messages in thread
From: Arnd Bergmann @ 2019-05-27 12:28 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Al Viro, Linux Kernel Mailing List, Linus Torvalds, Jann Horn,
	Florian Weimer, Oleg Nesterov, David Howells, Andrew Morton,
	Adrian Reber, Linux API, linux-arch, the arch/x86 maintainers

On Mon, May 27, 2019 at 12:45 PM Christian Brauner <christian@brauner.io> wrote:
> On Mon, May 27, 2019 at 12:02:37PM +0200, Arnd Bergmann wrote:
> > On Sun, May 26, 2019 at 12:27 PM Christian Brauner <christian@brauner.io> wrote:
> > >
> > > Wire up the clone6() call on x86.
> > >
> > > This patch only wires up clone6() on x86. Some of the arches look like they
> > > need special assembly massaging and it is probably smarter if the
> > > appropriate arch maintainers would do the actual wiring.
> >
> > Why do some architectures need special cases here? I'd prefer to have
> > new system calls always get defined in a way that avoids this, and
> > have a common entry point for everyone.
> >
> > Looking at the m68k sys_clone comment in
> > arch/m68k/kernel/process.c, it seems that this was done as an
> > optimization to deal with an inferior ABI. Similar code is present
> > in h8300, ia64, nios2, and sparc. If all of them just do this to
> > shave off a few cycles from the system call entry, I really
> > couldn't care less.
>
> I'm happy to wire all arches up at the same time in the next revision. I
> just wasn't sure why some of them were assemblying the living hell out
> of clone; especially ia64. I really didn't want to bother touching all
> of this just for an initial RFC.

Don't worry about doing all architectures for the RFC, I mainly want this
to be done consistently by the time it gets into linux-next.

One thing to figure out though is whether we need the stack_size argument
that a couple of architectures pass. It's usually hardwired to zero,
but not all the time, and I don't know the history of this.

       Arnd

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] arch: wire-up clone6() syscall on x86
  2019-05-27 12:28       ` Arnd Bergmann
@ 2019-05-27 12:34         ` Christian Brauner
  2019-05-27 18:48           ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Christian Brauner @ 2019-05-27 12:34 UTC (permalink / raw)
  To: Arnd Bergmann, linux-ia64
  Cc: Al Viro, Linux Kernel Mailing List, Linus Torvalds, Jann Horn,
	Florian Weimer, Oleg Nesterov, David Howells, Andrew Morton,
	Adrian Reber, Linux API, linux-arch, the arch/x86 maintainers

On Mon, May 27, 2019 at 02:28:33PM +0200, Arnd Bergmann wrote:
> On Mon, May 27, 2019 at 12:45 PM Christian Brauner <christian@brauner.io> wrote:
> > On Mon, May 27, 2019 at 12:02:37PM +0200, Arnd Bergmann wrote:
> > > On Sun, May 26, 2019 at 12:27 PM Christian Brauner <christian@brauner.io> wrote:
> > > >
> > > > Wire up the clone6() call on x86.
> > > >
> > > > This patch only wires up clone6() on x86. Some of the arches look like they
> > > > need special assembly massaging and it is probably smarter if the
> > > > appropriate arch maintainers would do the actual wiring.
> > >
> > > Why do some architectures need special cases here? I'd prefer to have
> > > new system calls always get defined in a way that avoids this, and
> > > have a common entry point for everyone.
> > >
> > > Looking at the m68k sys_clone comment in
> > > arch/m68k/kernel/process.c, it seems that this was done as an
> > > optimization to deal with an inferior ABI. Similar code is present
> > > in h8300, ia64, nios2, and sparc. If all of them just do this to
> > > shave off a few cycles from the system call entry, I really
> > > couldn't care less.
> >
> > I'm happy to wire all arches up at the same time in the next revision. I
> > just wasn't sure why some of them were assemblying the living hell out
> > of clone; especially ia64. I really didn't want to bother touching all
> > of this just for an initial RFC.
> 
> Don't worry about doing all architectures for the RFC, I mainly want this
> to be done consistently by the time it gets into linux-next.
> 
> One thing to figure out though is whether we need the stack_size argument
> that a couple of architectures pass. It's usually hardwired to zero,
> but not all the time, and I don't know the history of this.

Afaict, stack_size is *only* used on ia64:

/*
 * sys_clone2(u64 flags, u64 ustack_base, u64 ustack_size, u64 parent_tidptr, u64 child_tidptr,
 *	      u64 tls)
 */
GLOBAL_ENTRY(sys_clone2)
	/*
	 * Allocate 8 input registers since ptrace() may clobber them
	 */
	.prologue ASM_UNW_PRLG_RP|ASM_UNW_PRLG_PFS, ASM_UNW_PRLG_GRSAVE(8)
	alloc r16=ar.pfs,8,2,6,0
	DO_SAVE_SWITCH_STACK
	adds r2=PT(R16)+IA64_SWITCH_STACK_SIZE+16,sp
	mov loc0=rp
	mov loc1=r16				// save ar.pfs across do_fork
	.body
	mov out1=in1
	mov out2=in2
	tbit.nz p6,p0=in0,CLONE_SETTLS_BIT
	mov out3=in3	// parent_tidptr: valid only w/CLONE_PARENT_SETTID
	;;
(p6)	st8 [r2]=in5				// store TLS in r16 for copy_thread()
	mov out4=in4	// child_tidptr:  valid only w/CLONE_CHILD_SETTID or CLONE_CHILD_CLEARTID
	mov out0=in0				// out0 = clone_flags
	br.call.sptk.many rp=do_fork
.ret1:	.restore sp
	adds sp=IA64_SWITCH_STACK_SIZE,sp	// pop the switch stack
	mov ar.pfs=loc1
	mov rp=loc0
	br.ret.sptk.many rp
END(sys_clone2)

I'm not sure if this needs to be because of architectural constraints or
if it just is a historic artifact.
(Ccing ia64 now to see what they have to say.)

Christian

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] arch: wire-up clone6() syscall on x86
  2019-05-27 12:34         ` Christian Brauner
@ 2019-05-27 18:48           ` Linus Torvalds
  0 siblings, 0 replies; 15+ messages in thread
From: Linus Torvalds @ 2019-05-27 18:48 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Arnd Bergmann, linux-ia64, Al Viro, Linux Kernel Mailing List,
	Jann Horn, Florian Weimer, Oleg Nesterov, David Howells,
	Andrew Morton, Adrian Reber, Linux API, linux-arch,
	the arch/x86 maintainers

On Mon, May 27, 2019 at 5:34 AM Christian Brauner <christian@brauner.io> wrote:
>
> Afaict, stack_size is *only* used on ia64:

That's because ia64 "stacks" are an odd non-stack thing (like so much
of the architecture).

In computer science, a stack is a FIFO that grows/shrinks according to
use. In practical implementations, it also has a direction, but the
"size" is basically not relevant if you just allow it to grow
dynamically. The key word here being "dynamically": the stack size is
inherently a dynamic thing.

So you don't really need a "stack size". The whole concept doesn't
make sense, outside of the obvious maximum limit things (ie
RLIMIT_STACK) and simply just hitting other allocations.

But ia64 is "special".

The ia64 stack isn't actually a stack. It's *two* stacks, growing in
opposite directions. One for the hardware spilling of the register
state and call frame ("backing store"), and one for the traditional
software stack.

So on ia64, the stack size suddenly becomes a fixed thing, because
it's not a dynamically growing single stack that grows in one
direction, it's literally a fixed virtual area that has two different
stacks growing towards each other.

Btw, don't get me wrong. Two stacks can be a good thing, and a lot of
security people want to have two stacks - one for actual call frame
data, and a separate one for automatic stack variables that have their
address taken.

Having separate stacks avoids the whole traditional stack smash model
(well, it avoids the one that overwrites the return frame - you can
still possibly have security issues because one function smashes the
automatic stack of a caller and then cause the caller to be confused
and do something insecure).

And the ia64 double stack kind of works that way automatically. So
"double stack" very much isn't wrong per se, but doing it the way ia64
did was too inflexible and the register stack (and rotation) was and
is just a bad idea.

Two stacks without the hw register renaming and flushing can be
lovely, and can even merit some hw support (ie the whole "Shadow
stack" model).

              Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] fork: add clone6
  2019-05-27 10:42   ` Christian Brauner
@ 2019-05-27 19:27     ` Linus Torvalds
  2019-05-27 19:36       ` Jann Horn
  2019-05-28 10:08       ` Christian Brauner
  0 siblings, 2 replies; 15+ messages in thread
From: Linus Torvalds @ 2019-05-27 19:27 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Al Viro, Linux List Kernel Mailing, Jann Horn, Florian Weimer,
	Oleg Nesterov, Arnd Bergmann, David Howells, Pavel Emelyanov,
	Andrew Morton, Adrian Reber, Andrei Vagin, Linux API

On Mon, May 27, 2019 at 3:42 AM Christian Brauner <christian@brauner.io> wrote:
>
> Hm, still pondering whether having one unsigned int argument passed
> through registers that captures all the flags from the old clone() would
> be a good idea.

That sounds like a reasonable thing to do.

Maybe we could continue to call the old flags CLONE_XYZ and continue
to pass them in as "flags" argument, and then we have CLONE_EXT_XYZ
flags for a new 64-bit flag field that comes in through memory in the
new clone_args thing?

That would make the flag arguments less "unified", but might make for
a simpler patch, and might make it easier for the old interfaces to
just pass in the clone flags they already have in registers instead of
setting them up in the clone_args structure.

I don't think it's necessarily wrong for the interface to show some
effects of legacy models, as long as we don't have those kinds of "if
(is_clone6)" nasties.

                   Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] fork: add clone6
  2019-05-27 19:27     ` Linus Torvalds
@ 2019-05-27 19:36       ` Jann Horn
  2019-05-30 18:26         ` Kees Cook
  2019-05-28 10:08       ` Christian Brauner
  1 sibling, 1 reply; 15+ messages in thread
From: Jann Horn @ 2019-05-27 19:36 UTC (permalink / raw)
  To: Linus Torvalds, Kees Cook, Christian Brauner
  Cc: Al Viro, Linux List Kernel Mailing, Florian Weimer,
	Oleg Nesterov, Arnd Bergmann, David Howells, Pavel Emelyanov,
	Andrew Morton, Adrian Reber, Andrei Vagin, Linux API

+Kees

On Mon, May 27, 2019 at 9:27 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, May 27, 2019 at 3:42 AM Christian Brauner <christian@brauner.io> wrote:
> > Hm, still pondering whether having one unsigned int argument passed
> > through registers that captures all the flags from the old clone() would
> > be a good idea.
>
> That sounds like a reasonable thing to do.
>
> Maybe we could continue to call the old flags CLONE_XYZ and continue
> to pass them in as "flags" argument, and then we have CLONE_EXT_XYZ
> flags for a new 64-bit flag field that comes in through memory in the
> new clone_args thing?

With the current seccomp model, that would have the unfortunate effect
of making it impossible to filter out new clone flags - which would
likely mean that people who want to sandbox their code would not use
the new clone() because they don't want their sandboxed code to be
able to create time namespaces and whatever other new fancy things
clone() might support in the future. This is why I convinced Christian
to pass flags in registers for the first patch version.

The alternative I see would be to somehow extend seccomp to support
argument structures that are passed in memory - that would probably
require quite a bit of new plumbing though, both in the kernel and in
userspace code that configures seccomp filters.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] fork: add clone6
  2019-05-27 19:27     ` Linus Torvalds
  2019-05-27 19:36       ` Jann Horn
@ 2019-05-28 10:08       ` Christian Brauner
  2019-05-28 14:15         ` Andy Lutomirski
  1 sibling, 1 reply; 15+ messages in thread
From: Christian Brauner @ 2019-05-28 10:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Linux List Kernel Mailing, Jann Horn, Florian Weimer,
	Oleg Nesterov, Arnd Bergmann, David Howells, Pavel Emelyanov,
	Andrew Morton, Adrian Reber, Andrei Vagin, Linux API

On Mon, May 27, 2019 at 12:27:08PM -0700, Linus Torvalds wrote:
> On Mon, May 27, 2019 at 3:42 AM Christian Brauner <christian@brauner.io> wrote:
> >
> > Hm, still pondering whether having one unsigned int argument passed
> > through registers that captures all the flags from the old clone() would
> > be a good idea.
> 
> That sounds like a reasonable thing to do.
> 
> Maybe we could continue to call the old flags CLONE_XYZ and continue
> to pass them in as "flags" argument, and then we have CLONE_EXT_XYZ
> flags for a new 64-bit flag field that comes in through memory in the
> new clone_args thing?

Hm. I think I'll try a first version without an additional register
flags argument. And here's why: I'm not sure it buys us a lot especially
if we're giving up on making this convenient for seccomp anyway.
And with that out of the way (at least for the moment) I would really
like to make this interface consistent. But we can revisit this when I
have the code.

Christian

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] fork: add clone6
  2019-05-28 10:08       ` Christian Brauner
@ 2019-05-28 14:15         ` Andy Lutomirski
  0 siblings, 0 replies; 15+ messages in thread
From: Andy Lutomirski @ 2019-05-28 14:15 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Linus Torvalds, Al Viro, Linux List Kernel Mailing, Jann Horn,
	Florian Weimer, Oleg Nesterov, Arnd Bergmann, David Howells,
	Pavel Emelyanov, Andrew Morton, Adrian Reber, Andrei Vagin,
	Linux API

On Tue, May 28, 2019 at 3:08 AM Christian Brauner <christian@brauner.io> wrote:
>
> On Mon, May 27, 2019 at 12:27:08PM -0700, Linus Torvalds wrote:
> > On Mon, May 27, 2019 at 3:42 AM Christian Brauner <christian@brauner.io> wrote:
> > >
> > > Hm, still pondering whether having one unsigned int argument passed
> > > through registers that captures all the flags from the old clone() would
> > > be a good idea.
> >
> > That sounds like a reasonable thing to do.
> >
> > Maybe we could continue to call the old flags CLONE_XYZ and continue
> > to pass them in as "flags" argument, and then we have CLONE_EXT_XYZ
> > flags for a new 64-bit flag field that comes in through memory in the
> > new clone_args thing?
>
> Hm. I think I'll try a first version without an additional register
> flags argument. And here's why: I'm not sure it buys us a lot especially
> if we're giving up on making this convenient for seccomp anyway.
> And with that out of the way (at least for the moment) I would really
> like to make this interface consistent. But we can revisit this when I
> have the code.
>

Seems reasonable.  Once the interface is nailed down, we can see if it
makes sense to break out some flags into a register.  I would guess
that all the unsharing flags are a good candidate.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] fork: add clone6
  2019-05-26 10:26 [PATCH 1/2] fork: add clone6 Christian Brauner
  2019-05-26 10:26 ` [PATCH 2/2] arch: wire-up clone6() syscall on x86 Christian Brauner
  2019-05-26 16:50 ` [PATCH 1/2] fork: add clone6 Linus Torvalds
@ 2019-05-28 15:23 ` Eric W. Biederman
  2 siblings, 0 replies; 15+ messages in thread
From: Eric W. Biederman @ 2019-05-28 15:23 UTC (permalink / raw)
  To: Christian Brauner
  Cc: viro, linux-kernel, torvalds, jannh, fweimer, oleg, arnd,
	dhowells, Pavel Emelyanov, Andrew Morton, Adrian Reber,
	Andrei Vagin, linux-api

Christian Brauner <christian@brauner.io> writes:

> This adds the clone6 system call.
>
> As mentioned several times already (cf. [7], [8]) here's the promised
> patchset for clone6().
>
> We recently merged the CLONE_PIDFD patchset (cf. [1]). It took the last
> free flag from clone().
>
> Independent of the CLONE_PIDFD patchset a time namespace has been discussed
> at Linux Plumber Conference last year and has been sent out and reviewed
> (cf. [5]). It is expected that it will go upstream in the not too distant
> future. However, it relies on the addition of the CLONE_NEWTIME flag to
> clone(). The only other good candidate - CLONE_DETACHED - is currently not
> recycable as we have identified at least two large or widely used codebases
> that currently pass this flag (cf. [2], [3], and [4]). Given that we
> grabbed the last clone() flag we effectively blocked the time namespace
> patchset. It just seems right that we unblock it again.

I am not certain just extending clone is the right way to go.

- Last I looked glibc does not support calling clone without creating
  a stack first.  Which makes it unpleasant to support clone as a fork
  with extra flags as container runtimes would appreciate.

- Tying namespace creation to process creation is unnecessary.
  I admit both the time and the pid namespace actually need a new
  process before you can use them, but the trick of having a namespace
  for children and a namespace the current process uses seems to handle
  that case nicely.

- There is cruft in clone current runtimes do not use.
  The entire CSIGNAL mask. Also: CLONE_PARENT, CLONE_DETACHED.  And
  probably one or two other bits that I am not remembering right now.

  It would probably make sense to make all of the old linux-thread
  support optional so we can compile it out, and in a decade or two
  get rid of it as unused code.

Maybe some of this is time critical and doing everything in a single
system call makes sense.  But I don't a few extra microseconds matters
in container creation.  It feels to me like the road to better
maintenance of the kernel would just be to move work out of clone.

It certainly feels like we could implement all of the current
clone functionality on top of a simpler clone that I have described.

Perhaps we want sys_createns that like setns works on a single
namespace at a time.

Eric

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] fork: add clone6
  2019-05-27 19:36       ` Jann Horn
@ 2019-05-30 18:26         ` Kees Cook
  0 siblings, 0 replies; 15+ messages in thread
From: Kees Cook @ 2019-05-30 18:26 UTC (permalink / raw)
  To: Jann Horn
  Cc: Linus Torvalds, Christian Brauner, Al Viro,
	Linux List Kernel Mailing, Florian Weimer, Oleg Nesterov,
	Arnd Bergmann, David Howells, Pavel Emelyanov, Andrew Morton,
	Adrian Reber, Andrei Vagin, Linux API

On Mon, May 27, 2019 at 09:36:18PM +0200, Jann Horn wrote:
> +Kees
> 
> On Mon, May 27, 2019 at 9:27 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Mon, May 27, 2019 at 3:42 AM Christian Brauner <christian@brauner.io> wrote:
> > > Hm, still pondering whether having one unsigned int argument passed
> > > through registers that captures all the flags from the old clone() would
> > > be a good idea.
> >
> > That sounds like a reasonable thing to do.
> >
> > Maybe we could continue to call the old flags CLONE_XYZ and continue
> > to pass them in as "flags" argument, and then we have CLONE_EXT_XYZ
> > flags for a new 64-bit flag field that comes in through memory in the
> > new clone_args thing?
> 
> With the current seccomp model, that would have the unfortunate effect
> of making it impossible to filter out new clone flags - which would
> likely mean that people who want to sandbox their code would not use
> the new clone() because they don't want their sandboxed code to be
> able to create time namespaces and whatever other new fancy things
> clone() might support in the future. This is why I convinced Christian
> to pass flags in registers for the first patch version.
> 
> The alternative I see would be to somehow extend seccomp to support
> argument structures that are passed in memory - that would probably
> require quite a bit of new plumbing though, both in the kernel and in
> userspace code that configures seccomp filters.

FWIW, the only path forward on this that I've been able to see is to
normalize how syscalls read memory from userspace, and to basically
provide a cache (i.e. copy from userspace once) that will be examined by
both seccomp and later kernel functions. I have not been able to imagine
an API that wasn't a massive amount of work to implement, though. Maybe
it could be done only for a few kinds of arguments (file paths, certain
structures, etc) but I haven't made any progress on it.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2019-05-30 18:26 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-26 10:26 [PATCH 1/2] fork: add clone6 Christian Brauner
2019-05-26 10:26 ` [PATCH 2/2] arch: wire-up clone6() syscall on x86 Christian Brauner
2019-05-27 10:02   ` Arnd Bergmann
2019-05-27 10:45     ` Christian Brauner
2019-05-27 12:28       ` Arnd Bergmann
2019-05-27 12:34         ` Christian Brauner
2019-05-27 18:48           ` Linus Torvalds
2019-05-26 16:50 ` [PATCH 1/2] fork: add clone6 Linus Torvalds
2019-05-27 10:42   ` Christian Brauner
2019-05-27 19:27     ` Linus Torvalds
2019-05-27 19:36       ` Jann Horn
2019-05-30 18:26         ` Kees Cook
2019-05-28 10:08       ` Christian Brauner
2019-05-28 14:15         ` Andy Lutomirski
2019-05-28 15:23 ` Eric W. Biederman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).