LKML Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v4 0/2] Syscall User Redirection
@ 2020-07-16 19:31 Gabriel Krisman Bertazi
  2020-07-16 19:31 ` [PATCH v4 1/2] kernel: Implement selective syscall userspace redirection Gabriel Krisman Bertazi
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Gabriel Krisman Bertazi @ 2020-07-16 19:31 UTC (permalink / raw)
  To: tglx
  Cc: linux-kernel, kernel, willy, luto, gofmanp, keescook,
	linux-kselftest, shuah, Gabriel Krisman Bertazi

Hi,

This is v4 of Syscall User Redirection.  The implementation itself is
not modified from v3, it only applies the latest round of reviews to the
selftests.

__NR_syscalls is not really exported in header files other than
asm-generic for every architecture, so it felt safer to optionally
expose it with a fallback to a high value.

Also, I didn't expose tests for PR_GET as that is not currently
implemented.  If possible, I'd have it supported by a future patchset,
since it is not immediately necessary to support this feature.

Finally, one question: Which tree would this go through?

Gabriel Krisman Bertazi (2):
  kernel: Implement selective syscall userspace redirection
  selftests: Add kselftest for syscall user dispatch

 arch/Kconfig                                  |  20 ++
 arch/x86/Kconfig                              |   1 +
 arch/x86/entry/common.c                       |   5 +
 arch/x86/include/asm/thread_info.h            |   4 +-
 arch/x86/kernel/signal_compat.c               |   2 +-
 fs/exec.c                                     |   2 +
 include/linux/sched.h                         |   3 +
 include/linux/syscall_user_dispatch.h         |  50 ++++
 include/uapi/asm-generic/siginfo.h            |   3 +-
 include/uapi/linux/prctl.h                    |   5 +
 kernel/Makefile                               |   1 +
 kernel/fork.c                                 |   1 +
 kernel/sys.c                                  |   5 +
 kernel/syscall_user_dispatch.c                |  92 +++++++
 tools/testing/selftests/Makefile              |   1 +
 .../syscall_user_dispatch/.gitignore          |   2 +
 .../selftests/syscall_user_dispatch/Makefile  |   9 +
 .../selftests/syscall_user_dispatch/config    |   1 +
 .../syscall_user_dispatch.c                   | 256 ++++++++++++++++++
 19 files changed, 460 insertions(+), 3 deletions(-)
 create mode 100644 include/linux/syscall_user_dispatch.h
 create mode 100644 kernel/syscall_user_dispatch.c
 create mode 100644 tools/testing/selftests/syscall_user_dispatch/.gitignore
 create mode 100644 tools/testing/selftests/syscall_user_dispatch/Makefile
 create mode 100644 tools/testing/selftests/syscall_user_dispatch/config
 create mode 100644 tools/testing/selftests/syscall_user_dispatch/syscall_user_dispatch.c

-- 
2.27.0


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v4 1/2] kernel: Implement selective syscall userspace redirection
  2020-07-16 19:31 [PATCH v4 0/2] Syscall User Redirection Gabriel Krisman Bertazi
@ 2020-07-16 19:31 ` Gabriel Krisman Bertazi
  2020-07-16 21:06   ` Matthew Wilcox
                     ` (2 more replies)
  2020-07-16 19:31 ` [PATCH v4 2/2] selftests: Add kselftest for syscall user dispatch Gabriel Krisman Bertazi
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 21+ messages in thread
From: Gabriel Krisman Bertazi @ 2020-07-16 19:31 UTC (permalink / raw)
  To: tglx
  Cc: linux-kernel, kernel, willy, luto, gofmanp, keescook,
	linux-kselftest, shuah, Gabriel Krisman Bertazi

Introduce a mechanism to quickly disable/enable syscall handling for a
specific process and redirect to userspace via SIGSYS.  This is useful
for processes with parts that require syscall redirection and parts that
don't, but who need to perform this boundary crossing really fast,
without paying the cost of a system call to reconfigure syscall handling
on each boundary transition.  This is particularly important for Windows
games running over Wine.

The proposed interface looks like this:

  prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <start_addr>, <end_addr>, [selector])

The range [<start_addr>,<end_addr>] is a part of the process memory map
that is allowed to by-pass the redirection code and dispatch syscalls
directly, such that in fast paths a process doesn't need to disable the
trap nor the kernel has to check the selector.  This is essential to
return from SIGSYS to a blocked area without triggering another SIGSYS
from rt_sigreturn.

selector is an optional pointer to a char-sized userspace memory region
that has a key switch for the mechanism. This key switch is set to
either PR_SYS_DISPATCH_ON, PR_SYS_DISPATCH_OFF to enable and disable the
redirection without calling the kernel.

The feature is meant to be set per-thread and it is disabled on
fork/clone/execv.

Internally, this doesn't add overhead to the syscall hot path, and it
requires very little per-architecture support.  I avoided using seccomp,
even though it duplicates some functionality, due to previous feedback
that maybe it shouldn't mix with seccomp since it is not a security
mechanism.  And obviously, this should never be considered a security
mechanism, since any part of the program can by-pass it by using the
syscall dispatcher.

For the sysinfo benchmark, which measures the overhead added to
executing a native syscall that doesn't require interception, the
overhead using only the direct dispatcher region to issue syscalls is
pretty much irrelevant.  The overhead of using the selector goes around
40ns for a native (unredirected) syscall in my system, and it is (as
expected) dominated by the supervisor-mode user-address access.  In
fact, with SMAP off, the overhead is consistently less than 5ns on my
test box.

Right now, it is only supported by x86_64 and x86, but it should be
easily enabled for other architectures.

An example code using this interface can be found at:
  https://gitlab.collabora.com/krisman/syscall-disable-personality

Changes since v2:
  (Matthew Wilcox suggestions)
  - Drop __user on non-ptr type.
  - Move #define closer to similar defs
  - Allow a memory region that can dispatch directly
  (Kees Cook suggestions)
  - Improve kconfig summary line
  - Move flag cleanup on execve to begin_new_exec
  - Hint branch predictor in the syscall path
  (Me)
  - Convert selector to char

Changes since RFC:
  (Kees Cook suggestions)
  - Don't mention personality while explaining the feature
  - Use syscall_get_nr
  - Remove header guard on several places
  - Convert WARN_ON to WARN_ON_ONCE
  - Explicit check for state values
  - Rename to syscall user dispatcher

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Paul Gofman <gofmanp@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
---
 arch/Kconfig                          | 20 ++++++
 arch/x86/Kconfig                      |  1 +
 arch/x86/entry/common.c               |  5 ++
 arch/x86/include/asm/thread_info.h    |  4 +-
 arch/x86/kernel/signal_compat.c       |  2 +-
 fs/exec.c                             |  2 +
 include/linux/sched.h                 |  3 +
 include/linux/syscall_user_dispatch.h | 50 +++++++++++++++
 include/uapi/asm-generic/siginfo.h    |  3 +-
 include/uapi/linux/prctl.h            |  5 ++
 kernel/Makefile                       |  1 +
 kernel/fork.c                         |  1 +
 kernel/sys.c                          |  5 ++
 kernel/syscall_user_dispatch.c        | 92 +++++++++++++++++++++++++++
 14 files changed, 191 insertions(+), 3 deletions(-)
 create mode 100644 include/linux/syscall_user_dispatch.h
 create mode 100644 kernel/syscall_user_dispatch.c

diff --git a/arch/Kconfig b/arch/Kconfig
index 8cc35dc556c7..0ebd971d0d8f 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -465,6 +465,26 @@ config SECCOMP_FILTER
 
 	  See Documentation/userspace-api/seccomp_filter.rst for details.
 
+config HAVE_ARCH_SYSCALL_USER_DISPATCH
+	bool
+	help
+	  An arch should select this symbol if it provides all of these things:
+	  - TIF_SYSCALL_USER_DISPATCH
+	  - syscall_get_arch
+	  - syscall_rollback
+	  - syscall_get_nr
+	  - SIGSYS siginfo_t support
+
+config SYSCALL_USER_DISPATCH
+	bool "Support syscall redirection to userspace dispatcher"
+	depends on HAVE_ARCH_SYSCALL_USER_DISPATCH
+	help
+	  Enable tasks to ask the kernel to redirect syscalls not
+	  issued from a predefined dispatcher back to userspace,
+	  depending on a userspace memory selector.
+
+	  This option is useful to optimize games running over Wine.
+
 config HAVE_ARCH_STACKLEAK
 	bool
 	help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 883da0abf779..466a3a9c0708 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -149,6 +149,7 @@ config X86
 	select HAVE_ARCH_COMPAT_MMAP_BASES	if MMU && COMPAT
 	select HAVE_ARCH_PREL32_RELOCATIONS
 	select HAVE_ARCH_SECCOMP_FILTER
+	select HAVE_ARCH_SYSCALL_USER_DISPATCH
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_STACKLEAK
 	select HAVE_ARCH_TRACEHOOK
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index bd3f14175193..6c1360a7f260 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -138,6 +138,11 @@ static long syscall_trace_enter(struct pt_regs *regs)
 			return -1L;
 	}
 
+	if (work & _TIF_SYSCALL_USER_DISPATCH) {
+		if (do_syscall_user_dispatch(regs))
+			return -1L;
+	}
+
 #ifdef CONFIG_SECCOMP
 	/*
 	 * Do seccomp after ptrace, to catch any tracer changes.
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 8de8ceccb8bc..b26a9f2f0491 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -93,6 +93,7 @@ struct thread_info {
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* IA32 compatibility process */
 #define TIF_SLD			18	/* Restore split lock detection on context switch */
+#define TIF_SYSCALL_USER_DISPATCH 19	/* Redirect syscall for userspace handling */
 #define TIF_MEMDIE		20	/* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG	21	/* idle is polling for TIF_NEED_RESCHED */
 #define TIF_IO_BITMAP		22	/* uses I/O bitmap */
@@ -123,6 +124,7 @@ struct thread_info {
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
 #define _TIF_SLD		(1 << TIF_SLD)
+#define _TIF_SYSCALL_USER_DISPATCH (1 << TIF_SYSCALL_USER_DISPATCH)
 #define _TIF_POLLING_NRFLAG	(1 << TIF_POLLING_NRFLAG)
 #define _TIF_IO_BITMAP		(1 << TIF_IO_BITMAP)
 #define _TIF_FORCED_TF		(1 << TIF_FORCED_TF)
@@ -136,7 +138,7 @@ struct thread_info {
 /* Work to do before invoking the actual syscall. */
 #define _TIF_WORK_SYSCALL_ENTRY	\
 	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT |	\
-	 _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT)
+	 _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT | _TIF_SYSCALL_USER_DISPATCH)
 
 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW_BASE					\
diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index 9ccbf0576cd0..210aecc6eab9 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -31,7 +31,7 @@ static inline void signal_compat_build_tests(void)
 	BUILD_BUG_ON(NSIGBUS  != 5);
 	BUILD_BUG_ON(NSIGTRAP != 5);
 	BUILD_BUG_ON(NSIGCHLD != 6);
-	BUILD_BUG_ON(NSIGSYS  != 1);
+	BUILD_BUG_ON(NSIGSYS  != 2);
 
 	/* This is part of the ABI and can never change in size: */
 	BUILD_BUG_ON(sizeof(compat_siginfo_t) != 128);
diff --git a/fs/exec.c b/fs/exec.c
index e6e8a9a70327..849f618ed790 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1386,6 +1386,8 @@ int begin_new_exec(struct linux_binprm * bprm)
 	flush_thread();
 	me->personality &= ~bprm->per_clear;
 
+	clear_tsk_syscall_user_dispatch(me);
+
 	/*
 	 * We have to apply CLOEXEC before we change whether the process is
 	 * dumpable (in setup_new_exec) to avoid a race with a process in userspace
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 692e327d7455..407b868146e1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -32,6 +32,7 @@
 #include <linux/posix-timers.h>
 #include <linux/rseq.h>
 #include <linux/kcsan.h>
+#include <linux/syscall_user_dispatch.h>
 
 /* task_struct member predeclarations (sorted alphabetically): */
 struct audit_context;
@@ -953,6 +954,8 @@ struct task_struct {
 #endif
 	struct seccomp			seccomp;
 
+	struct syscall_user_dispatch	syscall_dispatch;
+
 	/* Thread group tracking: */
 	u64				parent_exec_id;
 	u64				self_exec_id;
diff --git a/include/linux/syscall_user_dispatch.h b/include/linux/syscall_user_dispatch.h
new file mode 100644
index 000000000000..a49e2de93705
--- /dev/null
+++ b/include/linux/syscall_user_dispatch.h
@@ -0,0 +1,50 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2020 Collabora Ltd.
+ */
+#ifndef _SYSCALL_USER_DISPATCH_H
+#define _SYSCALL_USER_DISPATCH_H
+
+struct task_struct;
+static void clear_tsk_thread_flag(struct task_struct *tsk, int flag);
+
+#ifdef CONFIG_SYSCALL_USER_DISPATCH
+struct syscall_user_dispatch {
+	char __user *selector;
+	unsigned long dispatcher_start;
+	unsigned long dispatcher_end;
+};
+
+int do_syscall_user_dispatch(struct pt_regs *regs);
+int set_syscall_user_dispatch(int mode, unsigned long dispatcher_start,
+			      unsigned long dispatcher_end,
+			      char __user *selector);
+
+static inline void clear_tsk_syscall_user_dispatch(struct task_struct *tsk)
+{
+	clear_tsk_thread_flag(tsk, TIF_SYSCALL_USER_DISPATCH);
+}
+
+#else
+struct syscall_user_dispatch {};
+
+static inline int set_syscall_user_dispatch(int mode, unsigned long dispatcher_start,
+					    unsigned long dispatcher_end,
+					    char __user *selector)
+{
+	return -EINVAL;
+}
+
+static inline int do_syscall_user_dispatch(struct pt_regs *regs)
+{
+	WARN_ON_ONCE(1);
+	return 0;
+}
+
+static inline void clear_tsk_syscall_user_dispatch(struct task_struct *tsk)
+{
+}
+
+#endif /* CONFIG_SYSCALL_USER_DISPATCH */
+
+#endif /* _SYSCALL_USER_DISPATCH_H */
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index cb3d6c267181..37741908b846 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -284,7 +284,8 @@ typedef struct siginfo {
  * SIGSYS si_codes
  */
 #define SYS_SECCOMP	1	/* seccomp triggered */
-#define NSIGSYS		1
+#define SYS_USER_DISPATCH 2	/* syscall user dispatch triggered */
+#define NSIGSYS		2
 
 /*
  * SIGEMT si_codes
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 07b4f8131e36..96265246383d 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -238,4 +238,9 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER		57
 #define PR_GET_IO_FLUSHER		58
 
+/* Dispatch syscalls to a userspace handler */
+#define PR_SET_SYSCALL_USER_DISPATCH	59
+# define PR_SYS_DISPATCH_OFF		0
+# define PR_SYS_DISPATCH_ON		1
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index f3218bc5ec69..158b8c61592f 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -92,6 +92,7 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
 obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
 obj-$(CONFIG_HARDLOCKUP_DETECTOR_PERF) += watchdog_hld.o
 obj-$(CONFIG_SECCOMP) += seccomp.o
+obj-$(CONFIG_SYSCALL_USER_DISPATCH) += syscall_user_dispatch.o
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
diff --git a/kernel/fork.c b/kernel/fork.c
index 142b23645d82..c6b64a849fec 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -921,6 +921,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	clear_user_return_notifier(tsk);
 	clear_tsk_need_resched(tsk);
 	set_task_stack_end_magic(tsk);
+	clear_tsk_syscall_user_dispatch(tsk);
 
 #ifdef CONFIG_STACKPROTECTOR
 	tsk->stack_canary = get_random_canary();
diff --git a/kernel/sys.c b/kernel/sys.c
index 00a96746e28a..d85880873c92 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -42,6 +42,7 @@
 #include <linux/syscore_ops.h>
 #include <linux/version.h>
 #include <linux/ctype.h>
+#include <linux/syscall_user_dispatch.h>
 
 #include <linux/compat.h>
 #include <linux/syscalls.h>
@@ -2527,6 +2528,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 
 		error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
 		break;
+	case PR_SET_SYSCALL_USER_DISPATCH:
+		error = set_syscall_user_dispatch((int) arg2, arg3, arg4,
+						  (char __user *) arg5);
+		break;
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/syscall_user_dispatch.c b/kernel/syscall_user_dispatch.c
new file mode 100644
index 000000000000..39ee29c2b91f
--- /dev/null
+++ b/kernel/syscall_user_dispatch.c
@@ -0,0 +1,92 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2020 Collabora Ltd.
+ */
+#include <linux/sched.h>
+#include <linux/prctl.h>
+#include <linux/syscall_user_dispatch.h>
+#include <linux/uaccess.h>
+#include <linux/signal.h>
+
+#include <asm/syscall.h>
+
+#include <linux/sched/signal.h>
+#include <linux/sched/task_stack.h>
+
+static void trigger_sigsys(struct pt_regs *regs)
+{
+	struct kernel_siginfo info;
+
+	clear_siginfo(&info);
+	info.si_signo = SIGSYS;
+	info.si_code = SYS_USER_DISPATCH;
+	info.si_call_addr = (void __user *)KSTK_EIP(current);
+	info.si_errno = 0;
+	info.si_arch = syscall_get_arch(current);
+	info.si_syscall = syscall_get_nr(current, regs);
+
+	force_sig_info(&info);
+}
+
+int do_syscall_user_dispatch(struct pt_regs *regs)
+{
+	struct syscall_user_dispatch *sd = &current->syscall_dispatch;
+	unsigned long ip = instruction_pointer(regs);
+	char state;
+
+	if (likely(ip >= sd->dispatcher_start && ip <= sd->dispatcher_end))
+		return 0;
+
+	if (likely(sd->selector)) {
+		if (unlikely(__get_user(state, sd->selector)))
+			do_exit(SIGSEGV);
+
+		if (likely(state == 0))
+			return 0;
+
+		if (state != 1)
+			do_exit(SIGSEGV);
+	}
+
+	syscall_rollback(current, regs);
+	trigger_sigsys(regs);
+
+	return 1;
+}
+
+int set_syscall_user_dispatch(int mode, unsigned long dispatcher_start,
+			      unsigned long dispatcher_end, char __user *selector)
+{
+	switch (mode) {
+	case PR_SYS_DISPATCH_OFF:
+		if (dispatcher_start || dispatcher_end || selector)
+			return -EINVAL;
+		break;
+	case PR_SYS_DISPATCH_ON:
+		/*
+		 * Validate the direct dispatcher region just for basic
+		 * sanity.  If the user is able to submit a syscall from
+		 * an address, that address is obviously valid.
+		 */
+		if (dispatcher_end < dispatcher_start)
+			return -EINVAL;
+
+		if (selector && !access_ok(selector, 1))
+			return -EFAULT;
+
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	current->syscall_dispatch.selector = selector;
+	current->syscall_dispatch.dispatcher_start = dispatcher_start;
+	current->syscall_dispatch.dispatcher_end = dispatcher_end;
+
+	if (mode == PR_SYS_DISPATCH_ON)
+		set_tsk_thread_flag(current, TIF_SYSCALL_USER_DISPATCH);
+	else
+		clear_tsk_thread_flag(current, TIF_SYSCALL_USER_DISPATCH);
+
+	return 0;
+}
-- 
2.27.0


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v4 2/2] selftests: Add kselftest for syscall user dispatch
  2020-07-16 19:31 [PATCH v4 0/2] Syscall User Redirection Gabriel Krisman Bertazi
  2020-07-16 19:31 ` [PATCH v4 1/2] kernel: Implement selective syscall userspace redirection Gabriel Krisman Bertazi
@ 2020-07-16 19:31 ` Gabriel Krisman Bertazi
  2020-07-16 20:04 ` [PATCH v4 0/2] Syscall User Redirection Kees Cook
  2020-08-02 12:01 ` Pavel Machek
  3 siblings, 0 replies; 21+ messages in thread
From: Gabriel Krisman Bertazi @ 2020-07-16 19:31 UTC (permalink / raw)
  To: tglx
  Cc: linux-kernel, kernel, willy, luto, gofmanp, keescook,
	linux-kselftest, shuah, Gabriel Krisman Bertazi

Implement functionality tests for syscall user dispatch.  In order to
make the test portable, refrain from open coding syscall dispatchers and
calculating glibc memory ranges.

changes since v3:
  - Sort entry in Makefile
  - Add SPDX header
  - Use __NR_syscalls if available

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
---
 tools/testing/selftests/Makefile              |   1 +
 .../syscall_user_dispatch/.gitignore          |   2 +
 .../selftests/syscall_user_dispatch/Makefile  |   9 +
 .../selftests/syscall_user_dispatch/config    |   1 +
 .../syscall_user_dispatch.c                   | 256 ++++++++++++++++++
 5 files changed, 269 insertions(+)
 create mode 100644 tools/testing/selftests/syscall_user_dispatch/.gitignore
 create mode 100644 tools/testing/selftests/syscall_user_dispatch/Makefile
 create mode 100644 tools/testing/selftests/syscall_user_dispatch/config
 create mode 100644 tools/testing/selftests/syscall_user_dispatch/syscall_user_dispatch.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 1195bd85af38..07a3c682f1e4 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -53,6 +53,7 @@ TARGETS += sparc64
 TARGETS += splice
 TARGETS += static_keys
 TARGETS += sync
+TARGETS += syscall_user_dispatch
 TARGETS += sysctl
 TARGETS += timens
 ifneq (1, $(quicktest))
diff --git a/tools/testing/selftests/syscall_user_dispatch/.gitignore b/tools/testing/selftests/syscall_user_dispatch/.gitignore
new file mode 100644
index 000000000000..637f08107add
--- /dev/null
+++ b/tools/testing/selftests/syscall_user_dispatch/.gitignore
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+syscall_user_dispatch
diff --git a/tools/testing/selftests/syscall_user_dispatch/Makefile b/tools/testing/selftests/syscall_user_dispatch/Makefile
new file mode 100644
index 000000000000..eeb07a791057
--- /dev/null
+++ b/tools/testing/selftests/syscall_user_dispatch/Makefile
@@ -0,0 +1,9 @@
+# SPDX-License-Identifier: GPL-2.0
+top_srcdir = ../../../..
+INSTALL_HDR_PATH = $(top_srcdir)/usr
+LINUX_HDR_PATH = $(INSTALL_HDR_PATH)/include/
+
+CFLAGS += -Wall -I$(LINUX_HDR_PATH)
+
+TEST_GEN_PROGS := syscall_user_dispatch
+include ../lib.mk
diff --git a/tools/testing/selftests/syscall_user_dispatch/config b/tools/testing/selftests/syscall_user_dispatch/config
new file mode 100644
index 000000000000..22c4dfe167ca
--- /dev/null
+++ b/tools/testing/selftests/syscall_user_dispatch/config
@@ -0,0 +1 @@
+CONFIG_SYSCALL_USER_DISPATCH=y
diff --git a/tools/testing/selftests/syscall_user_dispatch/syscall_user_dispatch.c b/tools/testing/selftests/syscall_user_dispatch/syscall_user_dispatch.c
new file mode 100644
index 000000000000..ed783fc447a2
--- /dev/null
+++ b/tools/testing/selftests/syscall_user_dispatch/syscall_user_dispatch.c
@@ -0,0 +1,256 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2020 Collabora Ltd.
+ *
+ * Test code for syscall user dispatch
+ */
+
+#define _GNU_SOURCE
+#include <sys/prctl.h>
+#include <sys/sysinfo.h>
+#include <sys/syscall.h>
+#include <signal.h>
+
+#include <asm/unistd.h>
+#include "../kselftest_harness.h"
+
+#ifndef PR_SET_SYSCALL_USER_DISPATCH
+# define PR_SET_SYSCALL_USER_DISPATCH	59
+# define PR_SYS_DISPATCH_OFF	0
+# define PR_SYS_DISPATCH_ON	1
+#endif
+
+#ifndef SYS_USER_DISPATCH
+# define SYS_USER_DISPATCH	2
+#endif
+
+#ifdef __NR_syscalls
+# define MAGIC_SYSCALL_1 (__NR_syscalls + 1) /* Bad Linux syscall number */
+#else
+# define MAGIC_SYSCALL_1 (0xff00)  /* Bad Linux syscall number */
+#endif
+
+#define SYSCALL_DISPATCH_ON(x) ((x) = 1)
+#define SYSCALL_DISPATCH_OFF(x) ((x) = 0)
+
+/* Test Summary:
+ *
+ * - dispatch_trigger_sigsys: Verify if PR_SET_SYSCALL_USER_DISPATCH is
+ *   able to trigger SIGSYS on a syscall.
+ *
+ * - bad_selector: Test that a bad selector value triggers SIGSEGV.
+ *
+ * - bad_prctl_param: Test that the API correctly rejects invalid
+ *   parameters on prctl
+ *
+ * - dispatch_and_return: Test that a syscall is selectively dispatched
+ *   to userspace depending on the value of selector.
+ *
+ * - disable_dispatch: Test that the PR_SYS_DISPATCH_OFF correctly
+ *   disables the dispatcher
+ *
+ * - direct_dispatch_range: Test that a syscall within the allowed range
+ *   can bypass the dispatcher.
+ */
+
+TEST_SIGNAL(dispatch_trigger_sigsys, SIGSYS)
+{
+	char sel = 0;
+	struct sysinfo info;
+	int ret;
+
+	ret = sysinfo(&info);
+	ASSERT_EQ(0, ret);
+
+	ret = prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, 0, 0, &sel);
+	ASSERT_EQ(0, ret) {
+		TH_LOG("Kernel does not support CONFIG_SYSCALL_USER_DISPATCH");
+	}
+
+	SYSCALL_DISPATCH_ON(sel);
+
+	sysinfo(&info);
+
+	EXPECT_FALSE(true) {
+		TH_LOG("Unreachable!");
+	}
+}
+
+TEST_SIGNAL(bad_selector, SIGSEGV)
+{
+	char sel = -1;
+	long ret;
+
+	ret = prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, 0, 0, &sel);
+	ASSERT_EQ(0, ret) {
+		TH_LOG("Kernel does not support CONFIG_SYSCALL_USER_DISPATCH");
+	}
+	EXPECT_FALSE(true) {
+		TH_LOG("Unreachable!");
+	}
+}
+
+TEST(bad_prctl_param)
+{
+	char sel = 0;
+	int op;
+
+	/* Invalid op */
+	op = -1;
+	prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0, 0, &sel);
+	ASSERT_EQ(EINVAL, errno);
+
+	/* PR_SYS_DISPATCH_OFF */
+	op = PR_SYS_DISPATCH_OFF;
+
+	/* start_addr != 0 */
+	prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x1, 0xff, 0);
+	EXPECT_EQ(EINVAL, errno);
+
+	/* end_addr != 0 */
+	prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x0, 0xff, 0);
+	EXPECT_EQ(EINVAL, errno);
+
+	/* sel != NULL */
+	prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x0, 0x0, &sel);
+	EXPECT_EQ(EINVAL, errno);
+
+	/* Valid parameter */
+	errno = 0;
+	prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x0, 0x0, 0x0);
+	EXPECT_EQ(0, errno);
+
+	/* PR_SYS_DISPATCH_ON */
+	op = PR_SYS_DISPATCH_ON;
+
+	/* start_addr > end_addr */
+	prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x1, 0x0, &sel);
+	EXPECT_EQ(EINVAL, errno);
+
+	/* Invalid selector */
+	prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x0, 0x1, (void *) -1);
+	ASSERT_EQ(EFAULT, errno);
+}
+
+/*
+ * Use global selector for handle_sigsys tests, to avoid passing
+ * selector to signal handler
+ */
+char glob_sel;
+int nr_syscalls_emulated;
+int si_code;
+
+static void handle_sigsys(int sig, siginfo_t *info, void *ucontext)
+{
+	si_code = info->si_code;
+
+	if (info->si_syscall == MAGIC_SYSCALL_1)
+		nr_syscalls_emulated++;
+
+	/* In preparation for sigreturn. */
+	SYSCALL_DISPATCH_OFF(glob_sel);
+}
+
+TEST(dispatch_and_return)
+{
+	long ret;
+	struct sigaction act;
+	sigset_t mask;
+
+	glob_sel = 0;
+	nr_syscalls_emulated = 0;
+	si_code = 0;
+
+	memset(&act, 0, sizeof(act));
+	sigemptyset(&mask);
+
+	act.sa_sigaction = handle_sigsys;
+	act.sa_flags = SA_SIGINFO;
+	act.sa_mask = mask;
+
+	ret = sigaction(SIGSYS, &act, NULL);
+	ASSERT_EQ(0, ret);
+
+	/* Make sure selector is good prior to prctl. */
+	SYSCALL_DISPATCH_OFF(glob_sel);
+
+	ret = prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, 0, 0, &glob_sel);
+	ASSERT_EQ(0, ret) {
+		TH_LOG("Kernel does not support CONFIG_SYSCALL_USER_DISPATCH");
+	}
+
+	/* MAGIC_SYSCALL_1 doesn't exist. */
+	SYSCALL_DISPATCH_OFF(glob_sel);
+	ret = syscall(MAGIC_SYSCALL_1);
+	EXPECT_EQ(-1, ret) {
+		TH_LOG("Dispatch triggered unexpectedly");
+	}
+
+	/* MAGIC_SYSCALL_1 should be emulated. */
+	nr_syscalls_emulated = 0;
+	SYSCALL_DISPATCH_ON(glob_sel);
+
+	ret = syscall(MAGIC_SYSCALL_1);
+	EXPECT_EQ(MAGIC_SYSCALL_1, ret) {
+		TH_LOG("Failed to intercept syscall");
+	}
+	EXPECT_EQ(1, nr_syscalls_emulated) {
+		TH_LOG("Failed to emulate syscall");
+	}
+	ASSERT_EQ(SYS_USER_DISPATCH, si_code) {
+		TH_LOG("Bad si_code in SIGSYS");
+	}
+}
+
+TEST(disable_dispatch)
+{
+	int ret;
+	struct sysinfo info;
+	char sel = 0;
+
+	ret = prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, 0, 0, &sel);
+	ASSERT_EQ(0, ret) {
+		TH_LOG("Kernel does not support CONFIG_SYSCALL_USER_DISPATCH");
+	}
+
+	/* MAGIC_SYSCALL_1 doesn't exist. */
+	SYSCALL_DISPATCH_OFF(glob_sel);
+
+	ret = prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_OFF, 0, 0, 0);
+	EXPECT_EQ(0, ret) {
+		TH_LOG("Failed to unset syscall user dispatch");
+	}
+
+	/* Shouldn't have any effect... */
+	SYSCALL_DISPATCH_ON(glob_sel);
+
+	ret = syscall(__NR_sysinfo, &info);
+	EXPECT_EQ(0, ret) {
+		TH_LOG("Dispatch triggered unexpectedly");
+	}
+}
+
+TEST(direct_dispatch_range)
+{
+	int ret = 0;
+	struct sysinfo info;
+	char sel = 0;
+
+	/*
+	 * Instead of calculating libc addresses; allow the entire
+	 * memory map and lock the selector.
+	 */
+	ret = prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, 0, -1L, &sel);
+	ASSERT_EQ(0, ret) {
+		TH_LOG("Kernel does not support CONFIG_SYSCALL_USER_DISPATCH");
+	}
+
+	SYSCALL_DISPATCH_ON(sel);
+
+	ret = sysinfo(&info);
+	ASSERT_EQ(0, ret) {
+		TH_LOG("Dispatch triggered unexpectedly");
+	}
+}
+
+TEST_HARNESS_MAIN
-- 
2.27.0


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 0/2] Syscall User Redirection
  2020-07-16 19:31 [PATCH v4 0/2] Syscall User Redirection Gabriel Krisman Bertazi
  2020-07-16 19:31 ` [PATCH v4 1/2] kernel: Implement selective syscall userspace redirection Gabriel Krisman Bertazi
  2020-07-16 19:31 ` [PATCH v4 2/2] selftests: Add kselftest for syscall user dispatch Gabriel Krisman Bertazi
@ 2020-07-16 20:04 ` Kees Cook
  2020-07-16 20:22   ` Christian Brauner
  2020-08-02 12:01 ` Pavel Machek
  3 siblings, 1 reply; 21+ messages in thread
From: Kees Cook @ 2020-07-16 20:04 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi, Andy Lutomirski, Matthew Wilcox
  Cc: tglx, linux-kernel, kernel, gofmanp, linux-api, x86,
	linux-kselftest, shuah

On Thu, Jul 16, 2020 at 03:31:39PM -0400, Gabriel Krisman Bertazi wrote:
> This is v4 of Syscall User Redirection.  The implementation itself is
> not modified from v3, it only applies the latest round of reviews to the
> selftests.
> 
> __NR_syscalls is not really exported in header files other than
> asm-generic for every architecture, so it felt safer to optionally
> expose it with a fallback to a high value.
> 
> Also, I didn't expose tests for PR_GET as that is not currently
> implemented.  If possible, I'd have it supported by a future patchset,
> since it is not immediately necessary to support this feature.

Thanks! That all looks good to me.

> Finally, one question: Which tree would this go through?

I haven't heard from several other x86 maintainers yet (which is where
I would normally expect this series to land), but I would be comfortable
taking this through my seccomp tree if I got Acks/Reviews at least from
Andy and Matthew.

Andy, Matthew, what do you think of this?

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 0/2] Syscall User Redirection
  2020-07-16 20:04 ` [PATCH v4 0/2] Syscall User Redirection Kees Cook
@ 2020-07-16 20:22   ` Christian Brauner
  2020-07-16 20:25     ` Kees Cook
  0 siblings, 1 reply; 21+ messages in thread
From: Christian Brauner @ 2020-07-16 20:22 UTC (permalink / raw)
  To: Kees Cook
  Cc: Gabriel Krisman Bertazi, Andy Lutomirski, Matthew Wilcox, tglx,
	linux-kernel, kernel, gofmanp, linux-api, x86, linux-kselftest,
	shuah, jannh

On Thu, Jul 16, 2020 at 01:04:38PM -0700, Kees Cook wrote:
> On Thu, Jul 16, 2020 at 03:31:39PM -0400, Gabriel Krisman Bertazi wrote:
> > This is v4 of Syscall User Redirection.  The implementation itself is
> > not modified from v3, it only applies the latest round of reviews to the
> > selftests.
> > 
> > __NR_syscalls is not really exported in header files other than
> > asm-generic for every architecture, so it felt safer to optionally
> > expose it with a fallback to a high value.
> > 
> > Also, I didn't expose tests for PR_GET as that is not currently
> > implemented.  If possible, I'd have it supported by a future patchset,
> > since it is not immediately necessary to support this feature.
> 
> Thanks! That all looks good to me.

Don't have any problem with this but did this ever get exposure on
linux-api? This is the first time I see this pop up.

Christian

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 0/2] Syscall User Redirection
  2020-07-16 20:22   ` Christian Brauner
@ 2020-07-16 20:25     ` Kees Cook
  2020-07-16 20:29       ` Christian Brauner
  0 siblings, 1 reply; 21+ messages in thread
From: Kees Cook @ 2020-07-16 20:25 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Gabriel Krisman Bertazi, Andy Lutomirski, Matthew Wilcox, tglx,
	linux-kernel, kernel, gofmanp, linux-api, x86, linux-kselftest,
	shuah, jannh

On Thu, Jul 16, 2020 at 10:22:34PM +0200, Christian Brauner wrote:
> On Thu, Jul 16, 2020 at 01:04:38PM -0700, Kees Cook wrote:
> > On Thu, Jul 16, 2020 at 03:31:39PM -0400, Gabriel Krisman Bertazi wrote:
> > > This is v4 of Syscall User Redirection.  The implementation itself is
> > > not modified from v3, it only applies the latest round of reviews to the
> > > selftests.
> > > 
> > > __NR_syscalls is not really exported in header files other than
> > > asm-generic for every architecture, so it felt safer to optionally
> > > expose it with a fallback to a high value.
> > > 
> > > Also, I didn't expose tests for PR_GET as that is not currently
> > > implemented.  If possible, I'd have it supported by a future patchset,
> > > since it is not immediately necessary to support this feature.
> > 
> > Thanks! That all looks good to me.
> 
> Don't have any problem with this but did this ever get exposure on
> linux-api? This is the first time I see this pop up.

I thought I'd added it to CC in the past, but that might have been other
recent unrelated threads. Does this need a full repost there too, you
think?

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 0/2] Syscall User Redirection
  2020-07-16 20:25     ` Kees Cook
@ 2020-07-16 20:29       ` Christian Brauner
  2020-07-16 20:30         ` Gabriel Krisman Bertazi
  0 siblings, 1 reply; 21+ messages in thread
From: Christian Brauner @ 2020-07-16 20:29 UTC (permalink / raw)
  To: Kees Cook
  Cc: Gabriel Krisman Bertazi, Andy Lutomirski, Matthew Wilcox, tglx,
	linux-kernel, kernel, gofmanp, linux-api, x86, linux-kselftest,
	shuah, jannh

On Thu, Jul 16, 2020 at 01:25:43PM -0700, Kees Cook wrote:
> On Thu, Jul 16, 2020 at 10:22:34PM +0200, Christian Brauner wrote:
> > On Thu, Jul 16, 2020 at 01:04:38PM -0700, Kees Cook wrote:
> > > On Thu, Jul 16, 2020 at 03:31:39PM -0400, Gabriel Krisman Bertazi wrote:
> > > > This is v4 of Syscall User Redirection.  The implementation itself is
> > > > not modified from v3, it only applies the latest round of reviews to the
> > > > selftests.
> > > > 
> > > > __NR_syscalls is not really exported in header files other than
> > > > asm-generic for every architecture, so it felt safer to optionally
> > > > expose it with a fallback to a high value.
> > > > 
> > > > Also, I didn't expose tests for PR_GET as that is not currently
> > > > implemented.  If possible, I'd have it supported by a future patchset,
> > > > since it is not immediately necessary to support this feature.
> > > 
> > > Thanks! That all looks good to me.
> > 
> > Don't have any problem with this but did this ever get exposure on
> > linux-api? This is the first time I see this pop up.
> 
> I thought I'd added it to CC in the past, but that might have been other
> recent unrelated threads. Does this need a full repost there too, you
> think?

Nah, wasn't my intention to force a repost. Seems that several people
have looked this over. :) Just curious why it didn't get to linux-api
and we know quite some people who only do look at linux-api (for sanity). :)

Christian

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 0/2] Syscall User Redirection
  2020-07-16 20:29       ` Christian Brauner
@ 2020-07-16 20:30         ` Gabriel Krisman Bertazi
  2020-07-16 21:06           ` Carlos O'Donell
  0 siblings, 1 reply; 21+ messages in thread
From: Gabriel Krisman Bertazi @ 2020-07-16 20:30 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kees Cook, Andy Lutomirski, Matthew Wilcox, tglx, linux-kernel,
	kernel, gofmanp, linux-api, x86, linux-kselftest, shuah, jannh

Christian Brauner <christian.brauner@ubuntu.com> writes:

> On Thu, Jul 16, 2020 at 01:25:43PM -0700, Kees Cook wrote:
>> On Thu, Jul 16, 2020 at 10:22:34PM +0200, Christian Brauner wrote:
>> > On Thu, Jul 16, 2020 at 01:04:38PM -0700, Kees Cook wrote:
>> > > On Thu, Jul 16, 2020 at 03:31:39PM -0400, Gabriel Krisman Bertazi wrote:
>> > > > This is v4 of Syscall User Redirection.  The implementation itself is
>> > > > not modified from v3, it only applies the latest round of reviews to the
>> > > > selftests.
>> > > > 
>> > > > __NR_syscalls is not really exported in header files other than
>> > > > asm-generic for every architecture, so it felt safer to optionally
>> > > > expose it with a fallback to a high value.
>> > > > 
>> > > > Also, I didn't expose tests for PR_GET as that is not currently
>> > > > implemented.  If possible, I'd have it supported by a future patchset,
>> > > > since it is not immediately necessary to support this feature.
>> > > 
>> > > Thanks! That all looks good to me.
>> > 
>> > Don't have any problem with this but did this ever get exposure on
>> > linux-api? This is the first time I see this pop up.
>> 
>> I thought I'd added it to CC in the past, but that might have been other
>> recent unrelated threads. Does this need a full repost there too, you
>> think?
>
> Nah, wasn't my intention to force a repost. Seems that several people
> have looked this over. :) Just curious why it didn't get to linux-api
> and we know quite some people who only do look at linux-api (for sanity). :)

That's my mistake.  I didn't think about it when submitting :(

If this get re-spinned again I will make sure to CC linux-api.

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 1/2] kernel: Implement selective syscall userspace redirection
  2020-07-16 19:31 ` [PATCH v4 1/2] kernel: Implement selective syscall userspace redirection Gabriel Krisman Bertazi
@ 2020-07-16 21:06   ` Matthew Wilcox
  2020-07-16 21:26     ` Kees Cook
  2020-07-17  0:20   ` Andy Lutomirski
  2020-07-20 10:08   ` Thomas Gleixner
  2 siblings, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2020-07-16 21:06 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: tglx, linux-kernel, kernel, luto, gofmanp, keescook,
	linux-kselftest, shuah

On Thu, Jul 16, 2020 at 03:31:40PM -0400, Gabriel Krisman Bertazi wrote:
> selector is an optional pointer to a char-sized userspace memory region
> that has a key switch for the mechanism. This key switch is set to
> either PR_SYS_DISPATCH_ON, PR_SYS_DISPATCH_OFF to enable and disable the
> redirection without calling the kernel.
> 
> The feature is meant to be set per-thread and it is disabled on
> fork/clone/execv.

Disabled on exec.  Disabled in the child on clone/fork (and vfork, I
think).

That means we don't need to worry about it interacting badly with
a setuid program, right?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 0/2] Syscall User Redirection
  2020-07-16 20:30         ` Gabriel Krisman Bertazi
@ 2020-07-16 21:06           ` Carlos O'Donell
  0 siblings, 0 replies; 21+ messages in thread
From: Carlos O'Donell @ 2020-07-16 21:06 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi, Christian Brauner
  Cc: Kees Cook, Andy Lutomirski, Matthew Wilcox, tglx, linux-kernel,
	kernel, gofmanp, linux-api, x86, linux-kselftest, shuah, jannh

On 7/16/20 4:30 PM, Gabriel Krisman Bertazi wrote:
> Christian Brauner <christian.brauner@ubuntu.com> writes:
> 
>> On Thu, Jul 16, 2020 at 01:25:43PM -0700, Kees Cook wrote:
>>> On Thu, Jul 16, 2020 at 10:22:34PM +0200, Christian Brauner wrote:
>>>> On Thu, Jul 16, 2020 at 01:04:38PM -0700, Kees Cook wrote:
>>>>> On Thu, Jul 16, 2020 at 03:31:39PM -0400, Gabriel Krisman Bertazi wrote:
>>>>>> This is v4 of Syscall User Redirection.  The implementation itself is
>>>>>> not modified from v3, it only applies the latest round of reviews to the
>>>>>> selftests.
>>>>>>
>>>>>> __NR_syscalls is not really exported in header files other than
>>>>>> asm-generic for every architecture, so it felt safer to optionally
>>>>>> expose it with a fallback to a high value.
>>>>>>
>>>>>> Also, I didn't expose tests for PR_GET as that is not currently
>>>>>> implemented.  If possible, I'd have it supported by a future patchset,
>>>>>> since it is not immediately necessary to support this feature.
>>>>>
>>>>> Thanks! That all looks good to me.
>>>>
>>>> Don't have any problem with this but did this ever get exposure on
>>>> linux-api? This is the first time I see this pop up.
>>>
>>> I thought I'd added it to CC in the past, but that might have been other
>>> recent unrelated threads. Does this need a full repost there too, you
>>> think?
>>
>> Nah, wasn't my intention to force a repost. Seems that several people
>> have looked this over. :) Just curious why it didn't get to linux-api
>> and we know quite some people who only do look at linux-api (for sanity). :)
> 
> That's my mistake.  I didn't think about it when submitting :(
> 
> If this get re-spinned again I will make sure to CC linux-api.

Thank you! It helps C library implementors stay up to date and comment
on changes that impact userspace ABIs and APIs. This patch set was new
to me. Interesting new feature.

-- 
Cheers,
Carlos.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 1/2] kernel: Implement selective syscall userspace redirection
  2020-07-16 21:06   ` Matthew Wilcox
@ 2020-07-16 21:26     ` Kees Cook
  0 siblings, 0 replies; 21+ messages in thread
From: Kees Cook @ 2020-07-16 21:26 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Gabriel Krisman Bertazi, tglx, linux-kernel, kernel, luto,
	gofmanp, linux-kselftest, shuah

On Thu, Jul 16, 2020 at 10:06:01PM +0100, Matthew Wilcox wrote:
> On Thu, Jul 16, 2020 at 03:31:40PM -0400, Gabriel Krisman Bertazi wrote:
> > selector is an optional pointer to a char-sized userspace memory region
> > that has a key switch for the mechanism. This key switch is set to
> > either PR_SYS_DISPATCH_ON, PR_SYS_DISPATCH_OFF to enable and disable the
> > redirection without calling the kernel.
> > 
> > The feature is meant to be set per-thread and it is disabled on
> > fork/clone/execv.
> 
> Disabled on exec.  Disabled in the child on clone/fork (and vfork, I
> think).
> 
> That means we don't need to worry about it interacting badly with
> a setuid program, right?

Right, that's the intention.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 1/2] kernel: Implement selective syscall userspace redirection
  2020-07-16 19:31 ` [PATCH v4 1/2] kernel: Implement selective syscall userspace redirection Gabriel Krisman Bertazi
  2020-07-16 21:06   ` Matthew Wilcox
@ 2020-07-17  0:20   ` Andy Lutomirski
  2020-07-17  2:15     ` Gabriel Krisman Bertazi
  2020-07-20  9:23     ` Thomas Gleixner
  2020-07-20 10:08   ` Thomas Gleixner
  2 siblings, 2 replies; 21+ messages in thread
From: Andy Lutomirski @ 2020-07-17  0:20 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Thomas Gleixner, LKML, kernel, Matthew Wilcox, Andrew Lutomirski,
	Paul Gofman, Kees Cook, open list:KERNEL SELFTEST FRAMEWORK,
	Shuah Khan

On Thu, Jul 16, 2020 at 12:31 PM Gabriel Krisman Bertazi
<krisman@collabora.com> wrote:
>

This is quite nice.  I have a few comments, though:

You mentioned rt_sigreturn().  Should this automatically exempt the
kernel-provided signal restorer on architectures (e.g. x86_32) that
provide one?

The amount of syscall entry wiring that arches need to do is IMO
already a bit out of hand.  Should we instead rename TIF_SECCOMP to
TIF_SYSCALL_INTERCEPTION and have one generic callback that handles
seccomp and this new thing?

> +int do_syscall_user_dispatch(struct pt_regs *regs)
> +{
> +       struct syscall_user_dispatch *sd = &current->syscall_dispatch;
> +       unsigned long ip = instruction_pointer(regs);
> +       char state;
> +
> +       if (likely(ip >= sd->dispatcher_start && ip <= sd->dispatcher_end))
> +               return 0;
> +
> +       if (likely(sd->selector)) {
> +               if (unlikely(__get_user(state, sd->selector)))
> +                       do_exit(SIGSEGV);
> +
> +               if (likely(state == 0))
> +                       return 0;
> +
> +               if (state != 1)
> +                       do_exit(SIGSEGV);

This seems a bit extreme and hard to debug if it ever happens.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 1/2] kernel: Implement selective syscall userspace redirection
  2020-07-17  0:20   ` Andy Lutomirski
@ 2020-07-17  2:15     ` Gabriel Krisman Bertazi
  2020-07-17  4:48       ` Andy Lutomirski
  2020-07-20  9:23     ` Thomas Gleixner
  1 sibling, 1 reply; 21+ messages in thread
From: Gabriel Krisman Bertazi @ 2020-07-17  2:15 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, LKML, kernel, Matthew Wilcox, Paul Gofman,
	Kees Cook, open list:KERNEL SELFTEST FRAMEWORK, Shuah Khan

Andy Lutomirski <luto@kernel.org> writes:

> On Thu, Jul 16, 2020 at 12:31 PM Gabriel Krisman Bertazi
> <krisman@collabora.com> wrote:
>>
>
> This is quite nice.  I have a few comments, though:
>
> You mentioned rt_sigreturn().  Should this automatically exempt the
> kernel-provided signal restorer on architectures (e.g. x86_32) that
> provide one?

That seems reasonable.  Not sure how easy it is to do it, though.

> The amount of syscall entry wiring that arches need to do is IMO
> already a bit out of hand.  Should we instead rename TIF_SECCOMP to
> TIF_SYSCALL_INTERCEPTION and have one generic callback that handles
> seccomp and this new thing?

Considering the previous suggestion from Kees to hide it inside the
tracehook and Thomas rework of this path, I'm not sure what is the best
solution here, but some rework of these flags is due.  Thomas suggested
expanding these flags to 64 bits and having some arch specific and
arch-agnostic flags.  With the storage expansion and arch-agnostic flags,
would this still be desirable?

>> +int do_syscall_user_dispatch(struct pt_regs *regs)
>> +{
>> +       struct syscall_user_dispatch *sd = &current->syscall_dispatch;
>> +       unsigned long ip = instruction_pointer(regs);
>> +       char state;
>> +
>> +       if (likely(ip >= sd->dispatcher_start && ip <= sd->dispatcher_end))
>> +               return 0;
>> +
>> +       if (likely(sd->selector)) {
>> +               if (unlikely(__get_user(state, sd->selector)))
>> +                       do_exit(SIGSEGV);
>> +
>> +               if (likely(state == 0))
>> +                       return 0;
>> +
>> +               if (state != 1)
>> +                       do_exit(SIGSEGV);
>
> This seems a bit extreme and hard to debug if it ever happens.

Makes sense, but I don't see a better way to return the error here.
Maybe a SIGSYS with a different si_errno?  Alternatively, we could
revert to the previous behavior of allowing syscalls on state != 0, that
existed in v1.  What do you think?

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 1/2] kernel: Implement selective syscall userspace redirection
  2020-07-17  2:15     ` Gabriel Krisman Bertazi
@ 2020-07-17  4:48       ` Andy Lutomirski
  2020-07-21 12:06         ` Mark Rutland
  0 siblings, 1 reply; 21+ messages in thread
From: Andy Lutomirski @ 2020-07-17  4:48 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Andy Lutomirski, Thomas Gleixner, LKML, kernel, Matthew Wilcox,
	Paul Gofman, Kees Cook, open list:KERNEL SELFTEST FRAMEWORK,
	Shuah Khan

On Thu, Jul 16, 2020 at 7:15 PM Gabriel Krisman Bertazi
<krisman@collabora.com> wrote:
>
> Andy Lutomirski <luto@kernel.org> writes:
>
> > On Thu, Jul 16, 2020 at 12:31 PM Gabriel Krisman Bertazi
> > <krisman@collabora.com> wrote:
> >>
> >
> > This is quite nice.  I have a few comments, though:
> >
> > You mentioned rt_sigreturn().  Should this automatically exempt the
> > kernel-provided signal restorer on architectures (e.g. x86_32) that
> > provide one?
>
> That seems reasonable.  Not sure how easy it is to do it, though.

For better or for worse, it's currently straightforward because the code is:

__kernel_sigreturn:
.LSTART_sigreturn:
        popl %eax               /* XXX does this mean it needs unwind info? */
        movl $__NR_sigreturn, %eax
        SYSCALL_ENTER_KERNEL

and SYSCALL_ENTER_KERNEL is hardwired as int $0x80.  (The latter is
probably my fault, for better or for worse.)  So this would change to:

__vdso32_sigreturn_syscall:
  SYSCALL_ENTER_KERNEL

and vdso2c would wire up __vdso32_sigreturn_syscall.  Then there would
be something like:

bool arch_syscall_is_vdso_sigreturn(struct pt_regs *regs);

and that would be that.  Does anyone have an opinion as to whether
this is a good idea?  Modern glibc shouldn't be using this mechanism,
I think, but I won't swear to it.

>
> > The amount of syscall entry wiring that arches need to do is IMO
> > already a bit out of hand.  Should we instead rename TIF_SECCOMP to
> > TIF_SYSCALL_INTERCEPTION and have one generic callback that handles
> > seccomp and this new thing?
>
> Considering the previous suggestion from Kees to hide it inside the
> tracehook and Thomas rework of this path, I'm not sure what is the best
> solution here, but some rework of these flags is due.  Thomas suggested
> expanding these flags to 64 bits and having some arch specific and
> arch-agnostic flags.  With the storage expansion and arch-agnostic flags,
> would this still be desirable?

I think it would be desirable to consolidate this to avoid having
multiple arches need to separately wire up all of these mechanisms.
I'm not sure that the initial upstream implementation needs this, but
it might be nice to support this out of the box on all arches with
seccomp support.

>
> >> +int do_syscall_user_dispatch(struct pt_regs *regs)
> >> +{
> >> +       struct syscall_user_dispatch *sd = &current->syscall_dispatch;
> >> +       unsigned long ip = instruction_pointer(regs);
> >> +       char state;
> >> +
> >> +       if (likely(ip >= sd->dispatcher_start && ip <= sd->dispatcher_end))
> >> +               return 0;
> >> +
> >> +       if (likely(sd->selector)) {
> >> +               if (unlikely(__get_user(state, sd->selector)))
> >> +                       do_exit(SIGSEGV);
> >> +
> >> +               if (likely(state == 0))
> >> +                       return 0;
> >> +
> >> +               if (state != 1)
> >> +                       do_exit(SIGSEGV);
> >
> > This seems a bit extreme and hard to debug if it ever happens.
>
> Makes sense, but I don't see a better way to return the error here.
> Maybe a SIGSYS with a different si_errno?  Alternatively, we could
> revert to the previous behavior of allowing syscalls on state != 0, that
> existed in v1.  What do you think?
>

I don't have a strong opinion.  SIGSYS with different si_errno is
probably reasonable.

--Andy

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 1/2] kernel: Implement selective syscall userspace redirection
  2020-07-17  0:20   ` Andy Lutomirski
  2020-07-17  2:15     ` Gabriel Krisman Bertazi
@ 2020-07-20  9:23     ` Thomas Gleixner
  2020-07-20  9:44       ` Will Deacon
  1 sibling, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2020-07-20  9:23 UTC (permalink / raw)
  To: Andy Lutomirski, Gabriel Krisman Bertazi
  Cc: LKML, kernel, Matthew Wilcox, Andrew Lutomirski, Paul Gofman,
	Kees Cook, open list:KERNEL SELFTEST FRAMEWORK, Shuah Khan

Andy Lutomirski <luto@kernel.org> writes:
> On Thu, Jul 16, 2020 at 12:31 PM Gabriel Krisman Bertazi
> <krisman@collabora.com> wrote:
> The amount of syscall entry wiring that arches need to do is IMO
> already a bit out of hand.  Should we instead rename TIF_SECCOMP to
> TIF_SYSCALL_INTERCEPTION and have one generic callback that handles
> seccomp and this new thing?

The right way to go is to consolidate all the stupidly different
entry/exit work handling implementations and have exactly one in generic
code, i.e. what I posted a few days ago.

Then we can make new features only available in the generic version by
hiding the new functionality in the core code and not exposing the
functions to architecture implementations.

Making it easy for architectures to keep their own variant forever just
proliferates the mess we have right now.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 1/2] kernel: Implement selective syscall userspace redirection
  2020-07-20  9:23     ` Thomas Gleixner
@ 2020-07-20  9:44       ` Will Deacon
  0 siblings, 0 replies; 21+ messages in thread
From: Will Deacon @ 2020-07-20  9:44 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Gabriel Krisman Bertazi, LKML, kernel,
	Matthew Wilcox, Paul Gofman, Kees Cook,
	open list:KERNEL SELFTEST FRAMEWORK, Shuah Khan

On Mon, Jul 20, 2020 at 11:23:13AM +0200, Thomas Gleixner wrote:
> Andy Lutomirski <luto@kernel.org> writes:
> > On Thu, Jul 16, 2020 at 12:31 PM Gabriel Krisman Bertazi
> > <krisman@collabora.com> wrote:
> > The amount of syscall entry wiring that arches need to do is IMO
> > already a bit out of hand.  Should we instead rename TIF_SECCOMP to
> > TIF_SYSCALL_INTERCEPTION and have one generic callback that handles
> > seccomp and this new thing?
> 
> The right way to go is to consolidate all the stupidly different
> entry/exit work handling implementations and have exactly one in generic
> code, i.e. what I posted a few days ago.
> 
> Then we can make new features only available in the generic version by
> hiding the new functionality in the core code and not exposing the
> functions to architecture implementations.
> 
> Making it easy for architectures to keep their own variant forever just
> proliferates the mess we have right now.

Couldn't agree more. We recently added PTRACE_SYSEMU to arm64 and I deeply
regret doing that now that yet another way to rewrite the syscall number
has come along. I only just untangled some of the mess in our entry code
for that, so I can't say I'm looking forward to opening it right back up
to support this new feature. Much better to do it in the core code instead.

Will

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 1/2] kernel: Implement selective syscall userspace redirection
  2020-07-16 19:31 ` [PATCH v4 1/2] kernel: Implement selective syscall userspace redirection Gabriel Krisman Bertazi
  2020-07-16 21:06   ` Matthew Wilcox
  2020-07-17  0:20   ` Andy Lutomirski
@ 2020-07-20 10:08   ` Thomas Gleixner
  2020-07-20 13:46     ` Gabriel Krisman Bertazi
  2 siblings, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2020-07-20 10:08 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: linux-kernel, kernel, willy, luto, gofmanp, keescook,
	linux-kselftest, shuah, Gabriel Krisman Bertazi

Gabriel,

Gabriel Krisman Bertazi <krisman@collabora.com> writes:
> Introduce a mechanism to quickly disable/enable syscall handling for a
> specific process and redirect to userspace via SIGSYS.  This is useful
> for processes with parts that require syscall redirection and parts that
> don't, but who need to perform this boundary crossing really fast,
> without paying the cost of a system call to reconfigure syscall handling
> on each boundary transition.  This is particularly important for Windows
> games running over Wine.
>
> The proposed interface looks like this:
>
>   prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <start_addr>, <end_addr>, [selector])
>
> The range [<start_addr>,<end_addr>] is a part of the process memory map
> that is allowed to by-pass the redirection code and dispatch syscalls
> directly, such that in fast paths a process doesn't need to disable the
> trap nor the kernel has to check the selector.  This is essential to
> return from SIGSYS to a blocked area without triggering another SIGSYS
> from rt_sigreturn.

Why isn't rt_sigreturn() exempt from that redirection in the first place?

> ---
>  arch/Kconfig                          | 20 ++++++
>  arch/x86/Kconfig                      |  1 +
>  arch/x86/entry/common.c               |  5 ++
>  arch/x86/include/asm/thread_info.h    |  4 +-
>  arch/x86/kernel/signal_compat.c       |  2 +-
>  fs/exec.c                             |  2 +
>  include/linux/sched.h                 |  3 +
>  include/linux/syscall_user_dispatch.h | 50 +++++++++++++++
>  include/uapi/asm-generic/siginfo.h    |  3 +-
>  include/uapi/linux/prctl.h            |  5 ++
>  kernel/Makefile                       |  1 +
>  kernel/fork.c                         |  1 +
>  kernel/sys.c                          |  5 ++
>  kernel/syscall_user_dispatch.c        | 92 +++++++++++++++++++++++++++

A big combo patch is not how we do that. Please split it up into the
core part and a patch enabling it for a particular architexture.

As I said in my reply to Andy, this wants to go on top of the generic
entry/exit work stuff:

  https://lore.kernel.org/r/20200716182208.180916541@linutronix.de

and then syscall_user_dispatch.c ends up in kernel/entry/ and the
dispatching function is not exposed outside of that directory.

I'm going to post a new version later today. Will cc you.

> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -93,6 +93,7 @@ struct thread_info {
>  #define TIF_NOTSC		16	/* TSC is not accessible in userland */
>  #define TIF_IA32		17	/* IA32 compatibility process */
>  #define TIF_SLD			18	/* Restore split lock detection on context switch */
> +#define TIF_SYSCALL_USER_DISPATCH 19	/* Redirect syscall for userspace handling */

There are two other things out there which compete about the last TIF
bits on x86, so we need to clean that up first.

> +static void trigger_sigsys(struct pt_regs *regs)
> +{
> +	struct kernel_siginfo info;
> +
> +	clear_siginfo(&info);
> +	info.si_signo = SIGSYS;
> +	info.si_code = SYS_USER_DISPATCH;
> +	info.si_call_addr = (void __user *)KSTK_EIP(current);
> +	info.si_errno = 0;
> +	info.si_arch = syscall_get_arch(current);
> +	info.si_syscall = syscall_get_nr(current, regs);
> +
> +	force_sig_info(&info);
> +}
> +
> +int do_syscall_user_dispatch(struct pt_regs *regs)
> +{
> +	struct syscall_user_dispatch *sd = &current->syscall_dispatch;
> +	unsigned long ip = instruction_pointer(regs);
> +	char state;
> +
> +	if (likely(ip >= sd->dispatcher_start && ip <= sd->dispatcher_end))
> +		return 0;
> +
> +	if (likely(sd->selector)) {
> +		if (unlikely(__get_user(state, sd->selector)))

__get_user() mandates an explicit access_ok() which happened in the
prctl(). So this wants a comment why there is none right here.

> +			do_exit(SIGSEGV);
> +
> +		if (likely(state == 0))
> +			return 0;
> +
> +		if (state != 1)
> +			do_exit(SIGSEGV);

If that happens its going to be quite interesting to debug.

Also please use proper defines which are exposed to user space instead
of 0/1.

> +	}
> +
> +	syscall_rollback(current, regs);
> +	trigger_sigsys(regs);
> +
> +	return 1;
> +}
> +
> +int set_syscall_user_dispatch(int mode, unsigned long dispatcher_start,
> +			      unsigned long dispatcher_end, char __user *selector)
> +{
> +	switch (mode) {
> +	case PR_SYS_DISPATCH_OFF:
> +		if (dispatcher_start || dispatcher_end || selector)
> +			return -EINVAL;
> +		break;
> +	case PR_SYS_DISPATCH_ON:
> +		/*
> +		 * Validate the direct dispatcher region just for basic
> +		 * sanity.  If the user is able to submit a syscall from
> +		 * an address, that address is obviously valid.
> +		 */
> +		if (dispatcher_end < dispatcher_start)
> +			return -EINVAL;
> +
> +		if (selector && !access_ok(selector, 1))

  sizeof(*selector)

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 1/2] kernel: Implement selective syscall userspace redirection
  2020-07-20 10:08   ` Thomas Gleixner
@ 2020-07-20 13:46     ` Gabriel Krisman Bertazi
  0 siblings, 0 replies; 21+ messages in thread
From: Gabriel Krisman Bertazi @ 2020-07-20 13:46 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, kernel, willy, luto, gofmanp, keescook,
	linux-kselftest, shuah


Hi Thomas,

Thanks for the valuable feedback!

Thomas Gleixner <tglx@linutronix.de> writes:
> Gabriel Krisman Bertazi <krisman@collabora.com> writes:
>> Introduce a mechanism to quickly disable/enable syscall handling for a
>> specific process and redirect to userspace via SIGSYS.  This is useful
>> for processes with parts that require syscall redirection and parts that
>> don't, but who need to perform this boundary crossing really fast,
>> without paying the cost of a system call to reconfigure syscall handling
>> on each boundary transition.  This is particularly important for Windows
>> games running over Wine.
>>
>> The proposed interface looks like this:
>>
>>   prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <start_addr>, <end_addr>, [selector])
>>
>> The range [<start_addr>,<end_addr>] is a part of the process memory map
>> that is allowed to by-pass the redirection code and dispatch syscalls
>> directly, such that in fast paths a process doesn't need to disable the
>> trap nor the kernel has to check the selector.  This is essential to
>> return from SIGSYS to a blocked area without triggering another SIGSYS
>> from rt_sigreturn.
>
> Why isn't rt_sigreturn() exempt from that redirection in the first
> place?

This was actually a design decision for me.

The main use case I'm considering is emulation of applications written
for other OSs (games over wine), which means this dispatcher code is
exposed to applications built against different ABIs, who trigger
syscalls with bogus parameters (from a linux perspective)

In this emulation scenario, I cannot really trust the syscall number
means rt_sigreturn, so I try to only base the dispatcher decision on the
memory region and selector variable.

I think the best we can do is what Andy said: to exempt rt_sigreturn
when it comes from the vdso, for architectures that do it that way.

>
>> ---
>>  arch/Kconfig                          | 20 ++++++
>>  arch/x86/Kconfig                      |  1 +
>>  arch/x86/entry/common.c               |  5 ++
>>  arch/x86/include/asm/thread_info.h    |  4 +-
>>  arch/x86/kernel/signal_compat.c       |  2 +-
>>  fs/exec.c                             |  2 +
>>  include/linux/sched.h                 |  3 +
>>  include/linux/syscall_user_dispatch.h | 50 +++++++++++++++
>>  include/uapi/asm-generic/siginfo.h    |  3 +-
>>  include/uapi/linux/prctl.h            |  5 ++
>>  kernel/Makefile                       |  1 +
>>  kernel/fork.c                         |  1 +
>>  kernel/sys.c                          |  5 ++
>>  kernel/syscall_user_dispatch.c        | 92 +++++++++++++++++++++++++++
>
> A big combo patch is not how we do that. Please split it up into the
> core part and a patch enabling it for a particular architexture.
>
> As I said in my reply to Andy, this wants to go on top of the generic
> entry/exit work stuff:
>
>   https://lore.kernel.org/r/20200716182208.180916541@linutronix.de
>
> and then syscall_user_dispatch.c ends up in kernel/entry/ and the
> dispatching function is not exposed outside of that directory.
>
> I'm going to post a new version later today. Will cc you.

Thanks. Will do!


-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 1/2] kernel: Implement selective syscall userspace redirection
  2020-07-17  4:48       ` Andy Lutomirski
@ 2020-07-21 12:06         ` Mark Rutland
  0 siblings, 0 replies; 21+ messages in thread
From: Mark Rutland @ 2020-07-21 12:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Gabriel Krisman Bertazi, Thomas Gleixner, LKML, kernel,
	Matthew Wilcox, Paul Gofman, Kees Cook,
	open list:KERNEL SELFTEST FRAMEWORK, Shuah Khan, will,
	catalin.marinas

On Thu, Jul 16, 2020 at 09:48:50PM -0700, Andy Lutomirski wrote:
> On Thu, Jul 16, 2020 at 7:15 PM Gabriel Krisman Bertazi
> <krisman@collabora.com> wrote:
> >
> > Andy Lutomirski <luto@kernel.org> writes:
> >
> > > On Thu, Jul 16, 2020 at 12:31 PM Gabriel Krisman Bertazi
> > > <krisman@collabora.com> wrote:
> > >>
> > >
> > > This is quite nice.  I have a few comments, though:
> > >
> > > You mentioned rt_sigreturn().  Should this automatically exempt the
> > > kernel-provided signal restorer on architectures (e.g. x86_32) that
> > > provide one?
> >
> > That seems reasonable.  Not sure how easy it is to do it, though.
> 
> For better or for worse, it's currently straightforward because the code is:
> 
> __kernel_sigreturn:
> .LSTART_sigreturn:
>         popl %eax               /* XXX does this mean it needs unwind info? */
>         movl $__NR_sigreturn, %eax
>         SYSCALL_ENTER_KERNEL
> 
> and SYSCALL_ENTER_KERNEL is hardwired as int $0x80.  (The latter is
> probably my fault, for better or for worse.)  So this would change to:
> 
> __vdso32_sigreturn_syscall:
>   SYSCALL_ENTER_KERNEL
> 
> and vdso2c would wire up __vdso32_sigreturn_syscall.  Then there would
> be something like:
> 
> bool arch_syscall_is_vdso_sigreturn(struct pt_regs *regs);
> 
> and that would be that.  Does anyone have an opinion as to whether
> this is a good idea?  Modern glibc shouldn't be using this mechanism,
> I think, but I won't swear to it.

On arm64 sigreturn is always through the vdso, so IIUC we'd certainly
need something like this. Otherwise it'd be the user's responsibility to
register the vdso sigtramp range when making the prctl, and flip the
selector in each signal handler, which sounds both painful and fragile.

Mark.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 0/2] Syscall User Redirection
  2020-07-16 19:31 [PATCH v4 0/2] Syscall User Redirection Gabriel Krisman Bertazi
                   ` (2 preceding siblings ...)
  2020-07-16 20:04 ` [PATCH v4 0/2] Syscall User Redirection Kees Cook
@ 2020-08-02 12:01 ` Pavel Machek
  2020-08-04 14:26   ` Gabriel Krisman Bertazi
  3 siblings, 1 reply; 21+ messages in thread
From: Pavel Machek @ 2020-08-02 12:01 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: tglx, linux-kernel, kernel, willy, luto, gofmanp, keescook,
	linux-kselftest, shuah

Hi!

> This is v4 of Syscall User Redirection.  The implementation itself is
> not modified from v3, it only applies the latest round of reviews to the
> selftests.
> 
> __NR_syscalls is not really exported in header files other than
> asm-generic for every architecture, so it felt safer to optionally
> expose it with a fallback to a high value.
> 
> Also, I didn't expose tests for PR_GET as that is not currently
> implemented.  If possible, I'd have it supported by a future patchset,
> since it is not immediately necessary to support this feature.
> 
> Finally, one question: Which tree would this go through?

Should it come with Documentation?

How does it interact with ptrace, seccomp and similar?
									Pavel

(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 0/2] Syscall User Redirection
  2020-08-02 12:01 ` Pavel Machek
@ 2020-08-04 14:26   ` Gabriel Krisman Bertazi
  0 siblings, 0 replies; 21+ messages in thread
From: Gabriel Krisman Bertazi @ 2020-08-04 14:26 UTC (permalink / raw)
  To: Pavel Machek
  Cc: tglx, linux-kernel, kernel, willy, luto, gofmanp, keescook,
	linux-kselftest, shuah

Pavel Machek <pavel@ucw.cz> writes:

> Hi!
>
>> This is v4 of Syscall User Redirection.  The implementation itself is
>> not modified from v3, it only applies the latest round of reviews to the
>> selftests.
>> 
>> __NR_syscalls is not really exported in header files other than
>> asm-generic for every architecture, so it felt safer to optionally
>> expose it with a fallback to a high value.
>> 
>> Also, I didn't expose tests for PR_GET as that is not currently
>> implemented.  If possible, I'd have it supported by a future patchset,
>> since it is not immediately necessary to support this feature.
>> 
>> Finally, one question: Which tree would this go through?
>
> Should it come with Documentation?

Hi,

Thanks for the review.

I will prepare it for the next iteration.

> How does it interact with ptrace, seccomp and similar?

That is a very good question.

Regarding seccomp, this must take precedence, since the use case is
emulation (it can be invoked with a different ABI) such that seccomp
filtering by syscall number doesn't make sense in the first place.  In
addition, either the syscall is dispatched back to userspace, in which
case there is no resource for seccomp to protect, or the syscall will be
executed, and seccomp will execute next.

Regarding ptrace, I experimented with before and after, and while the
same ABI argument applies, I felt it was easier to debug if I let ptrace
happen for syscalls that are dispatched back to userspace.  In addition,
doing it after ptrace makes the code in syscall_exit_work slightly
simpler, since it doesn't require special handling for this feature.

For PTRACE_SYSEMU in particular, either placing this before or after is
a bit odd.  I don't think there is a right answer for this one, but I
don't see anyone wanting to use these features at the same time.  I
think as long as it is documented behavior, it should be fine either
way.

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, back to index

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-16 19:31 [PATCH v4 0/2] Syscall User Redirection Gabriel Krisman Bertazi
2020-07-16 19:31 ` [PATCH v4 1/2] kernel: Implement selective syscall userspace redirection Gabriel Krisman Bertazi
2020-07-16 21:06   ` Matthew Wilcox
2020-07-16 21:26     ` Kees Cook
2020-07-17  0:20   ` Andy Lutomirski
2020-07-17  2:15     ` Gabriel Krisman Bertazi
2020-07-17  4:48       ` Andy Lutomirski
2020-07-21 12:06         ` Mark Rutland
2020-07-20  9:23     ` Thomas Gleixner
2020-07-20  9:44       ` Will Deacon
2020-07-20 10:08   ` Thomas Gleixner
2020-07-20 13:46     ` Gabriel Krisman Bertazi
2020-07-16 19:31 ` [PATCH v4 2/2] selftests: Add kselftest for syscall user dispatch Gabriel Krisman Bertazi
2020-07-16 20:04 ` [PATCH v4 0/2] Syscall User Redirection Kees Cook
2020-07-16 20:22   ` Christian Brauner
2020-07-16 20:25     ` Kees Cook
2020-07-16 20:29       ` Christian Brauner
2020-07-16 20:30         ` Gabriel Krisman Bertazi
2020-07-16 21:06           ` Carlos O'Donell
2020-08-02 12:01 ` Pavel Machek
2020-08-04 14:26   ` Gabriel Krisman Bertazi

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git
	git clone --mirror https://lore.kernel.org/lkml/7 lkml/git/7.git
	git clone --mirror https://lore.kernel.org/lkml/8 lkml/git/8.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org
	public-inbox-index lkml

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git