All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v7 0/7] Syscall User Dispatch
@ 2020-11-18  3:28 Gabriel Krisman Bertazi
  2020-11-18  3:28 ` [PATCH v7 1/7] x86: vdso: Expose sigreturn address on vdso to the kernel Gabriel Krisman Bertazi
                   ` (8 more replies)
  0 siblings, 9 replies; 19+ messages in thread
From: Gabriel Krisman Bertazi @ 2020-11-18  3:28 UTC (permalink / raw)
  To: luto, tglx, keescook
  Cc: christian.brauner, peterz, willy, shuah, linux-kernel, linux-api,
	linux-kselftest, x86, gofmanp, Gabriel Krisman Bertazi, kernel

Hi,

This is the v7 of syscall user dispatch.  This version is a bit
different from v6 on the following points, after the modifications
requested on that submission.

* The interface no longer receives  <start>,<end> end parameters, but
  <start>,<length> as suggested by PeterZ.

* Syscall User Dispatch is now done before ptrace, and this means there
  is some SYSCALL_WORK_EXIT work that needs to be done.  No challenges
  there, but I'd like to draw attention to that region of the code that
  is new in this submission.

* The previous TIF_SYSCALL_USER_DISPATCH is now handled through
  SYSCALL_WORK flags.

* Introduced a new test as patch 6, which benchmarks the fast submission
  path and test the return in blocked selector state.

* Nothing is architecture dependent anymore. No config switches.  It
  only depends on CONFIG_GENERIC_ENTRY.

Other smaller changes are documented one each commit.

This was tested using the kselftests tests in patch 5 and 6 and compiled
tested with !CONFIG_GENERIC_ENTRY.

This patchset is based on the core/entry branch of the TIP tree.

A working tree with this patchset is available at:

  https://gitlab.collabora.com/krisman/linux -b syscall-user-dispatch-v7

Previous submissions are archived at:

RFC/v1: https://lkml.org/lkml/2020/7/8/96
v2: https://lkml.org/lkml/2020/7/9/17
v3: https://lkml.org/lkml/2020/7/12/4
v4: https://www.spinics.net/lists/linux-kselftest/msg16377.html
v5: https://lkml.org/lkml/2020/8/10/1320
v6: https://lkml.org/lkml/2020/9/4/1122

Gabriel Krisman Bertazi (7):
  x86: vdso: Expose sigreturn address on vdso to the kernel
  signal: Expose SYS_USER_DISPATCH si_code type
  kernel: Implement selective syscall userspace redirection
  entry: Support Syscall User Dispatch on common syscall entry
  selftests: Add kselftest for syscall user dispatch
  selftests: Add benchmark for syscall user dispatch
  docs: Document Syscall User Dispatch

 .../admin-guide/syscall-user-dispatch.rst     |  87 +++++
 arch/x86/entry/vdso/vdso2c.c                  |   2 +
 arch/x86/entry/vdso/vdso32/sigreturn.S        |   2 +
 arch/x86/entry/vdso/vma.c                     |  15 +
 arch/x86/include/asm/elf.h                    |   2 +
 arch/x86/include/asm/vdso.h                   |   2 +
 arch/x86/kernel/signal_compat.c               |   2 +-
 fs/exec.c                                     |   3 +
 include/linux/entry-common.h                  |   2 +
 include/linux/sched.h                         |   2 +
 include/linux/syscall_user_dispatch.h         |  40 +++
 include/linux/thread_info.h                   |   2 +
 include/uapi/asm-generic/siginfo.h            |   3 +-
 include/uapi/linux/prctl.h                    |   5 +
 kernel/entry/Makefile                         |   2 +-
 kernel/entry/common.c                         |  17 +
 kernel/entry/common.h                         |  16 +
 kernel/entry/syscall_user_dispatch.c          | 102 ++++++
 kernel/fork.c                                 |   1 +
 kernel/sys.c                                  |   5 +
 tools/testing/selftests/Makefile              |   1 +
 .../syscall_user_dispatch/.gitignore          |   3 +
 .../selftests/syscall_user_dispatch/Makefile  |   9 +
 .../selftests/syscall_user_dispatch/config    |   1 +
 .../syscall_user_dispatch/sud_benchmark.c     | 200 +++++++++++
 .../syscall_user_dispatch/sud_test.c          | 310 ++++++++++++++++++
 26 files changed, 833 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/admin-guide/syscall-user-dispatch.rst
 create mode 100644 include/linux/syscall_user_dispatch.h
 create mode 100644 kernel/entry/common.h
 create mode 100644 kernel/entry/syscall_user_dispatch.c
 create mode 100644 tools/testing/selftests/syscall_user_dispatch/.gitignore
 create mode 100644 tools/testing/selftests/syscall_user_dispatch/Makefile
 create mode 100644 tools/testing/selftests/syscall_user_dispatch/config
 create mode 100644 tools/testing/selftests/syscall_user_dispatch/sud_benchmark.c
 create mode 100644 tools/testing/selftests/syscall_user_dispatch/sud_test.c

-- 
2.29.2


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v7 1/7] x86: vdso: Expose sigreturn address on vdso to the kernel
  2020-11-18  3:28 [PATCH v7 0/7] Syscall User Dispatch Gabriel Krisman Bertazi
@ 2020-11-18  3:28 ` Gabriel Krisman Bertazi
  2020-11-18  3:28 ` [PATCH v7 2/7] signal: Expose SYS_USER_DISPATCH si_code type Gabriel Krisman Bertazi
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 19+ messages in thread
From: Gabriel Krisman Bertazi @ 2020-11-18  3:28 UTC (permalink / raw)
  To: luto, tglx, keescook
  Cc: christian.brauner, peterz, willy, shuah, linux-kernel, linux-api,
	linux-kselftest, x86, gofmanp, Gabriel Krisman Bertazi, kernel

Syscall user redirection requires the signal trampoline code to not be
captured, in order to support returning with a locked selector while
avoiding recursion back into the signal handler.  For ia-32, which has
the trampoline in the vDSO, expose the entry points to the kernel, such
that it can avoid dispatching syscalls from that region to userspace.

Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>

---
Changes since V5
  - Change return address to bool (Andy)
---
 arch/x86/entry/vdso/vdso2c.c           |  2 ++
 arch/x86/entry/vdso/vdso32/sigreturn.S |  2 ++
 arch/x86/entry/vdso/vma.c              | 15 +++++++++++++++
 arch/x86/include/asm/elf.h             |  2 ++
 arch/x86/include/asm/vdso.h            |  2 ++
 5 files changed, 23 insertions(+)

diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c
index 7380908045c7..2d0f3d8bcc25 100644
--- a/arch/x86/entry/vdso/vdso2c.c
+++ b/arch/x86/entry/vdso/vdso2c.c
@@ -101,6 +101,8 @@ struct vdso_sym required_syms[] = {
 	{"__kernel_sigreturn", true},
 	{"__kernel_rt_sigreturn", true},
 	{"int80_landing_pad", true},
+	{"vdso32_rt_sigreturn_landing_pad", true},
+	{"vdso32_sigreturn_landing_pad", true},
 };
 
 __attribute__((format(printf, 1, 2))) __attribute__((noreturn))
diff --git a/arch/x86/entry/vdso/vdso32/sigreturn.S b/arch/x86/entry/vdso/vdso32/sigreturn.S
index c3233ee98a6b..1bd068f72d4c 100644
--- a/arch/x86/entry/vdso/vdso32/sigreturn.S
+++ b/arch/x86/entry/vdso/vdso32/sigreturn.S
@@ -18,6 +18,7 @@ __kernel_sigreturn:
 	movl $__NR_sigreturn, %eax
 	SYSCALL_ENTER_KERNEL
 .LEND_sigreturn:
+SYM_INNER_LABEL(vdso32_sigreturn_landing_pad, SYM_L_GLOBAL)
 	nop
 	.size __kernel_sigreturn,.-.LSTART_sigreturn
 
@@ -29,6 +30,7 @@ __kernel_rt_sigreturn:
 	movl $__NR_rt_sigreturn, %eax
 	SYSCALL_ENTER_KERNEL
 .LEND_rt_sigreturn:
+SYM_INNER_LABEL(vdso32_rt_sigreturn_landing_pad, SYM_L_GLOBAL)
 	nop
 	.size __kernel_rt_sigreturn,.-.LSTART_rt_sigreturn
 	.previous
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 50e5d3a2e70a..de60cd37070b 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -436,6 +436,21 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 }
 #endif
 
+bool arch_syscall_is_vdso_sigreturn(struct pt_regs *regs)
+{
+#if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
+	const struct vdso_image *image = current->mm->context.vdso_image;
+	unsigned long vdso = (unsigned long) current->mm->context.vdso;
+
+	if (in_ia32_syscall() && image == &vdso_image_32) {
+		if (regs->ip == vdso + image->sym_vdso32_sigreturn_landing_pad ||
+		    regs->ip == vdso + image->sym_vdso32_rt_sigreturn_landing_pad)
+			return true;
+	}
+#endif
+	return false;
+}
+
 #ifdef CONFIG_X86_64
 static __init int vdso_setup(char *s)
 {
diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index 44a9b9940535..66bdfe838d61 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -388,6 +388,8 @@ extern int compat_arch_setup_additional_pages(struct linux_binprm *bprm,
 	compat_arch_setup_additional_pages(bprm, interpreter,		\
 					   (ex->e_machine == EM_X86_64))
 
+extern bool arch_syscall_is_vdso_sigreturn(struct pt_regs *regs);
+
 /* Do not change the values. See get_align_mask() */
 enum align_flags {
 	ALIGN_VA_32	= BIT(0),
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index bbcdc7b8f963..589f489dd375 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -27,6 +27,8 @@ struct vdso_image {
 	long sym___kernel_rt_sigreturn;
 	long sym___kernel_vsyscall;
 	long sym_int80_landing_pad;
+	long sym_vdso32_sigreturn_landing_pad;
+	long sym_vdso32_rt_sigreturn_landing_pad;
 };
 
 #ifdef CONFIG_X86_64
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v7 2/7] signal: Expose SYS_USER_DISPATCH si_code type
  2020-11-18  3:28 [PATCH v7 0/7] Syscall User Dispatch Gabriel Krisman Bertazi
  2020-11-18  3:28 ` [PATCH v7 1/7] x86: vdso: Expose sigreturn address on vdso to the kernel Gabriel Krisman Bertazi
@ 2020-11-18  3:28 ` Gabriel Krisman Bertazi
  2020-11-18  3:28 ` [PATCH v7 3/7] kernel: Implement selective syscall userspace redirection Gabriel Krisman Bertazi
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 19+ messages in thread
From: Gabriel Krisman Bertazi @ 2020-11-18  3:28 UTC (permalink / raw)
  To: luto, tglx, keescook
  Cc: christian.brauner, peterz, willy, shuah, linux-kernel, linux-api,
	linux-kselftest, x86, gofmanp, Gabriel Krisman Bertazi, kernel

SYS_USER_DISPATCH will be triggered when a syscall is sent to userspace
by the Syscall User Dispatch mechanism.  This adjusts eventual
BUILD_BUG_ON around the tree.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 arch/x86/kernel/signal_compat.c    | 2 +-
 include/uapi/asm-generic/siginfo.h | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index a7f3e12cfbdb..d7b51870f16b 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -31,7 +31,7 @@ static inline void signal_compat_build_tests(void)
 	BUILD_BUG_ON(NSIGBUS  != 5);
 	BUILD_BUG_ON(NSIGTRAP != 5);
 	BUILD_BUG_ON(NSIGCHLD != 6);
-	BUILD_BUG_ON(NSIGSYS  != 1);
+	BUILD_BUG_ON(NSIGSYS  != 2);
 
 	/* This is part of the ABI and can never change in size: */
 	BUILD_BUG_ON(sizeof(compat_siginfo_t) != 128);
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index 7aacf9389010..d2597000407a 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -286,7 +286,8 @@ typedef struct siginfo {
  * SIGSYS si_codes
  */
 #define SYS_SECCOMP	1	/* seccomp triggered */
-#define NSIGSYS		1
+#define SYS_USER_DISPATCH 2	/* syscall user dispatch triggered */
+#define NSIGSYS		2
 
 /*
  * SIGEMT si_codes
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v7 3/7] kernel: Implement selective syscall userspace redirection
  2020-11-18  3:28 [PATCH v7 0/7] Syscall User Dispatch Gabriel Krisman Bertazi
  2020-11-18  3:28 ` [PATCH v7 1/7] x86: vdso: Expose sigreturn address on vdso to the kernel Gabriel Krisman Bertazi
  2020-11-18  3:28 ` [PATCH v7 2/7] signal: Expose SYS_USER_DISPATCH si_code type Gabriel Krisman Bertazi
@ 2020-11-18  3:28 ` Gabriel Krisman Bertazi
  2020-11-19 12:36   ` Peter Zijlstra
  2020-11-19 17:43   ` Gabriel Krisman Bertazi
  2020-11-18  3:28 ` [PATCH v7 4/7] entry: Support Syscall User Dispatch on common syscall entry Gabriel Krisman Bertazi
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 19+ messages in thread
From: Gabriel Krisman Bertazi @ 2020-11-18  3:28 UTC (permalink / raw)
  To: luto, tglx, keescook
  Cc: christian.brauner, peterz, willy, shuah, linux-kernel, linux-api,
	linux-kselftest, x86, gofmanp, Gabriel Krisman Bertazi, kernel

Introduce a mechanism to quickly disable/enable syscall handling for a
specific process and redirect to userspace via SIGSYS.  This is useful
for processes with parts that require syscall redirection and parts that
don't, but who need to perform this boundary crossing really fast,
without paying the cost of a system call to reconfigure syscall handling
on each boundary transition.  This is particularly important for Windows
games running over Wine.

The proposed interface looks like this:

  prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <off>, <length>, [selector])

The range [<offset>,<offset>+len] is a part of the process memory map
that is allowed to by-pass the redirection code and dispatch syscalls
directly, such that in fast paths a process doesn't need to disable the
trap nor the kernel has to check the selector.  This is essential to
return from SIGSYS to a blocked area without triggering another SIGSYS
from rt_sigreturn.

selector is an optional pointer to a char-sized userspace memory region
that has a key switch for the mechanism. This key switch is set to
either PR_SYS_DISPATCH_ON, PR_SYS_DISPATCH_OFF to enable and disable the
redirection without calling the kernel.

The feature is meant to be set per-thread and it is disabled on
fork/clone/execv.

Internally, this doesn't add overhead to the syscall hot path, and it
requires very little per-architecture support.  I avoided using seccomp,
even though it duplicates some functionality, due to previous feedback
that maybe it shouldn't mix with seccomp since it is not a security
mechanism.  And obviously, this should never be considered a security
mechanism, since any part of the program can by-pass it by using the
syscall dispatcher.

For the sysinfo benchmark, which measures the overhead added to
executing a native syscall that doesn't require interception, the
overhead using only the direct dispatcher region to issue syscalls is
pretty much irrelevant.  The overhead of using the selector goes around
40ns for a native (unredirected) syscall in my system, and it is (as
expected) dominated by the supervisor-mode user-address access.  In
fact, with SMAP off, the overhead is consistently less than 5ns on my
test box.

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Paul Gofman <gofmanp@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: linux-api@vger.kernel.org
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>

---
Changes since v6:
  (Matthew Wilcox)
  - Use unsigned long for mode
  (peterZ)
  - Change interface to {offset,len}
  - Use SYSCALL_WORK interface instead of TIF flags

Changes since v4:
  (Andy Lutomirski)
  - Allow sigreturn coming from vDSO
  - Exit with SIGSYS instead of SIGSEGV on bad selector
  (Thomas Gleixner)
  - Use sizeof selector in access_ok
  - Document usage of __get_user
  - Use constant for state value
  - Split out x86 parts
  - Rebase on top of Gleixner's common entry code
  - Don't expose do_syscall_user_dispatch

Changes since v3:
  - NTR.

Changes since v2:
  (Matthew Wilcox suggestions)
  - Drop __user on non-ptr type.
  - Move #define closer to similar defs
  - Allow a memory region that can dispatch directly
  (Kees Cook suggestions)
  - Improve kconfig summary line
  - Move flag cleanup on execve to begin_new_exec
  - Hint branch predictor in the syscall path
  (Me)
  - Convert selector to char

Changes since RFC:
  (Kees Cook suggestions)
  - Don't mention personality while explaining the feature
  - Use syscall_get_nr
  - Remove header guard on several places
  - Convert WARN_ON to WARN_ON_ONCE
  - Explicit check for state values
  - Rename to syscall user dispatcher
---
 fs/exec.c                             |   3 +
 include/linux/sched.h                 |   2 +
 include/linux/syscall_user_dispatch.h |  40 ++++++++++
 include/linux/thread_info.h           |   2 +
 include/uapi/linux/prctl.h            |   5 ++
 kernel/entry/Makefile                 |   2 +-
 kernel/entry/common.h                 |  16 ++++
 kernel/entry/syscall_user_dispatch.c  | 102 ++++++++++++++++++++++++++
 kernel/fork.c                         |   1 +
 kernel/sys.c                          |   5 ++
 10 files changed, 177 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/syscall_user_dispatch.h
 create mode 100644 kernel/entry/common.h
 create mode 100644 kernel/entry/syscall_user_dispatch.c

diff --git a/fs/exec.c b/fs/exec.c
index 547a2390baf5..aee36e5733ce 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -64,6 +64,7 @@
 #include <linux/compat.h>
 #include <linux/vmalloc.h>
 #include <linux/io_uring.h>
+#include <linux/syscall_user_dispatch.h>
 
 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
@@ -1302,6 +1303,8 @@ int begin_new_exec(struct linux_binprm * bprm)
 	flush_thread();
 	me->personality &= ~bprm->per_clear;
 
+	clear_syscall_work_syscall_user_dispatch(me);
+
 	/*
 	 * We have to apply CLOEXEC before we change whether the process is
 	 * dumpable (in setup_new_exec) to avoid a race with a process in userspace
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 063cd120b459..6d1c1f5e74fe 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -34,6 +34,7 @@
 #include <linux/rseq.h>
 #include <linux/seqlock.h>
 #include <linux/kcsan.h>
+#include <linux/syscall_user_dispatch.h>
 
 /* task_struct member predeclarations (sorted alphabetically): */
 struct audit_context;
@@ -965,6 +966,7 @@ struct task_struct {
 	unsigned int			sessionid;
 #endif
 	struct seccomp			seccomp;
+	struct syscall_user_dispatch	syscall_dispatch;
 
 	/* Thread group tracking: */
 	u64				parent_exec_id;
diff --git a/include/linux/syscall_user_dispatch.h b/include/linux/syscall_user_dispatch.h
new file mode 100644
index 000000000000..9517ea16f090
--- /dev/null
+++ b/include/linux/syscall_user_dispatch.h
@@ -0,0 +1,40 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2020 Collabora Ltd.
+ */
+#ifndef _SYSCALL_USER_DISPATCH_H
+#define _SYSCALL_USER_DISPATCH_H
+
+#include <linux/thread_info.h>
+
+#ifdef CONFIG_GENERIC_ENTRY
+
+struct syscall_user_dispatch {
+	char __user *selector;
+	unsigned long offset;
+	unsigned long len;
+	bool on_dispatch;
+};
+
+int set_syscall_user_dispatch(unsigned long mode, unsigned long offset,
+			      unsigned long len, char __user *selector);
+
+#define clear_syscall_work_syscall_user_dispatch(tsk) \
+	clear_task_syscall_work(tsk, SYSCALL_USER_DISPATCH)
+
+#else
+struct syscall_user_dispatch {};
+
+static inline int set_syscall_user_dispatch(unsigned long mode, unsigned long offset,
+					    unsigned long len, char __user *selector)
+{
+	return -EINVAL;
+}
+
+static inline void clear_syscall_work_syscall_user_dispatch(struct task_struct *tsk)
+{
+}
+
+#endif /* CONFIG_GENERIC_ENTRY */
+
+#endif /* _SYSCALL_USER_DISPATCH_H */
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 317363212ae9..45708a2602b9 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -41,6 +41,7 @@ enum syscall_work_bit {
 	SYSCALL_WORK_BIT_SYSCALL_TRACE,
 	SYSCALL_WORK_BIT_SYSCALL_EMU,
 	SYSCALL_WORK_BIT_SYSCALL_AUDIT,
+	SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH,
 };
 
 #define SYSCALL_WORK_SECCOMP		BIT(SYSCALL_WORK_BIT_SECCOMP)
@@ -48,6 +49,7 @@ enum syscall_work_bit {
 #define SYSCALL_WORK_SYSCALL_TRACE	BIT(SYSCALL_WORK_BIT_SYSCALL_TRACE)
 #define SYSCALL_WORK_SYSCALL_EMU	BIT(SYSCALL_WORK_BIT_SYSCALL_EMU)
 #define SYSCALL_WORK_SYSCALL_AUDIT	BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
+#define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
 
 #include <asm/thread_info.h>
 
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 7f0827705c9a..90deb41c8a34 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -247,4 +247,9 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER		57
 #define PR_GET_IO_FLUSHER		58
 
+/* Dispatch syscalls to a userspace handler */
+#define PR_SET_SYSCALL_USER_DISPATCH	59
+# define PR_SYS_DISPATCH_OFF		0
+# define PR_SYS_DISPATCH_ON		1
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/entry/Makefile b/kernel/entry/Makefile
index 34c8a3f1c735..095c775e001e 100644
--- a/kernel/entry/Makefile
+++ b/kernel/entry/Makefile
@@ -9,5 +9,5 @@ KCOV_INSTRUMENT := n
 CFLAGS_REMOVE_common.o	 = -fstack-protector -fstack-protector-strong
 CFLAGS_common.o		+= -fno-stack-protector
 
-obj-$(CONFIG_GENERIC_ENTRY) 		+= common.o
+obj-$(CONFIG_GENERIC_ENTRY) 		+= common.o syscall_user_dispatch.o
 obj-$(CONFIG_KVM_XFER_TO_GUEST_WORK)	+= kvm.o
diff --git a/kernel/entry/common.h b/kernel/entry/common.h
new file mode 100644
index 000000000000..cd0c4e5f143e
--- /dev/null
+++ b/kernel/entry/common.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _COMMON_H
+#define _COMMON_H
+
+bool do_syscall_user_dispatch(struct pt_regs *regs);
+
+static inline bool on_syscall_dispatch(void)
+{
+	if (unlikely(current->syscall_dispatch.on_dispatch)) {
+		current->syscall_dispatch.on_dispatch = false;
+		return true;
+	}
+	return false;
+}
+
+#endif
diff --git a/kernel/entry/syscall_user_dispatch.c b/kernel/entry/syscall_user_dispatch.c
new file mode 100644
index 000000000000..131c38a0b628
--- /dev/null
+++ b/kernel/entry/syscall_user_dispatch.c
@@ -0,0 +1,102 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2020 Collabora Ltd.
+ */
+#include <linux/sched.h>
+#include <linux/prctl.h>
+#include <linux/syscall_user_dispatch.h>
+#include <linux/uaccess.h>
+#include <linux/signal.h>
+#include <linux/elf.h>
+
+#include <asm/syscall.h>
+
+#include <linux/sched/signal.h>
+#include <linux/sched/task_stack.h>
+
+static void trigger_sigsys(struct pt_regs *regs)
+{
+	struct kernel_siginfo info;
+
+	clear_siginfo(&info);
+	info.si_signo = SIGSYS;
+	info.si_code = SYS_USER_DISPATCH;
+	info.si_call_addr = (void __user *)KSTK_EIP(current);
+	info.si_errno = 0;
+	info.si_arch = syscall_get_arch(current);
+	info.si_syscall = syscall_get_nr(current, regs);
+
+	force_sig_info(&info);
+}
+
+bool do_syscall_user_dispatch(struct pt_regs *regs)
+{
+	struct syscall_user_dispatch *sd = &current->syscall_dispatch;
+	char state;
+
+	if (likely(instruction_pointer(regs) - sd->offset < sd->len))
+		return false;
+
+	if (unlikely(arch_syscall_is_vdso_sigreturn(regs)))
+		return false;
+
+	if (likely(sd->selector)) {
+		/*
+		 * access_ok() is performed once, at prctl time, when
+		 * the selector is loaded by userspace.
+		 */
+		if (unlikely(__get_user(state, sd->selector)))
+			do_exit(SIGSEGV);
+
+		if (likely(state == PR_SYS_DISPATCH_OFF))
+			return false;
+
+		if (state != PR_SYS_DISPATCH_ON)
+			do_exit(SIGSYS);
+	}
+
+	sd->on_dispatch = true;
+	syscall_rollback(current, regs);
+	trigger_sigsys(regs);
+
+	return true;
+}
+
+int set_syscall_user_dispatch(unsigned long mode, unsigned long offset,
+			      unsigned long len, char __user *selector)
+{
+	switch (mode) {
+	case PR_SYS_DISPATCH_OFF:
+		if (offset || len || selector)
+			return -EINVAL;
+		break;
+	case PR_SYS_DISPATCH_ON:
+		/*
+		 * Validate the direct dispatcher region just for basic
+		 * sanity against overflow and a 0-sized dispatcher
+		 * region.  If the user is able to submit a syscall from
+		 * an address, that address is obviously valid.
+		 */
+		if (offset && offset + len <= offset)
+			return -EINVAL;
+
+		if (selector && !access_ok(selector, sizeof(*selector)))
+			return -EFAULT;
+
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	current->syscall_dispatch.selector = selector;
+	current->syscall_dispatch.offset = offset;
+	current->syscall_dispatch.len = len;
+	current->syscall_dispatch.on_dispatch = false;
+
+	if (mode == PR_SYS_DISPATCH_ON)
+		set_syscall_work(SYSCALL_USER_DISPATCH);
+	else
+		clear_syscall_work(SYSCALL_USER_DISPATCH);
+
+	return 0;
+}
diff --git a/kernel/fork.c b/kernel/fork.c
index 02b689a23457..4a5ecb41f440 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -906,6 +906,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	clear_user_return_notifier(tsk);
 	clear_tsk_need_resched(tsk);
 	set_task_stack_end_magic(tsk);
+	clear_syscall_work_syscall_user_dispatch(tsk);
 
 #ifdef CONFIG_STACKPROTECTOR
 	tsk->stack_canary = get_random_canary();
diff --git a/kernel/sys.c b/kernel/sys.c
index a730c03ee607..51f00fe20e4d 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -42,6 +42,7 @@
 #include <linux/syscore_ops.h>
 #include <linux/version.h>
 #include <linux/ctype.h>
+#include <linux/syscall_user_dispatch.h>
 
 #include <linux/compat.h>
 #include <linux/syscalls.h>
@@ -2530,6 +2531,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 
 		error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
 		break;
+	case PR_SET_SYSCALL_USER_DISPATCH:
+		error = set_syscall_user_dispatch(arg2, arg3, arg4,
+						  (char __user *) arg5);
+		break;
 	default:
 		error = -EINVAL;
 		break;
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v7 4/7] entry: Support Syscall User Dispatch on common syscall entry
  2020-11-18  3:28 [PATCH v7 0/7] Syscall User Dispatch Gabriel Krisman Bertazi
                   ` (2 preceding siblings ...)
  2020-11-18  3:28 ` [PATCH v7 3/7] kernel: Implement selective syscall userspace redirection Gabriel Krisman Bertazi
@ 2020-11-18  3:28 ` Gabriel Krisman Bertazi
  2020-11-18  3:28 ` [PATCH v7 5/7] selftests: Add kselftest for syscall user dispatch Gabriel Krisman Bertazi
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 19+ messages in thread
From: Gabriel Krisman Bertazi @ 2020-11-18  3:28 UTC (permalink / raw)
  To: luto, tglx, keescook
  Cc: christian.brauner, peterz, willy, shuah, linux-kernel, linux-api,
	linux-kselftest, x86, gofmanp, Gabriel Krisman Bertazi, kernel

Syscall User Dispatch (SUD) must take precedence over seccomp and
ptrace, since the use case is emulation (it can be invoked with a
different ABI) such that seccomp filtering by syscall number doesn't
make sense in the first place.  In addition, either the syscall is
dispatched back to userspace, in which case there is no resource for to
trace, or the syscall will be executed, and seccomp/ptrace will execute
next.

Since SUD runs before tracepoints, it needs to be a SYSCALL_WORK_EXIT as
well, just to prevent a trace exit event when dispatch was triggered.
For that, the on_syscall_dispatch() examines context to skip the
tracepoint, audit and other work.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>

---
Changes since v6:
  - Update do_syscall_intercept signature (Christian Brauner)
  - Move it to before tracepoints
  - Use SYSCALL_WORK flags
---
 include/linux/entry-common.h |  2 ++
 kernel/entry/common.c        | 17 +++++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 49b26b216e4e..a6e98b4ba8e9 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -44,10 +44,12 @@
 				 SYSCALL_WORK_SYSCALL_TRACE |		\
 				 SYSCALL_WORK_SYSCALL_EMU |		\
 				 SYSCALL_WORK_SYSCALL_AUDIT |		\
+				 SYSCALL_WORK_SYSCALL_USER_DISPATCH |	\
 				 ARCH_SYSCALL_WORK_ENTER)
 #define SYSCALL_WORK_EXIT	(SYSCALL_WORK_SYSCALL_TRACEPOINT |	\
 				 SYSCALL_WORK_SYSCALL_TRACE |		\
 				 SYSCALL_WORK_SYSCALL_AUDIT |		\
+				 SYSCALL_WORK_SYSCALL_USER_DISPATCH |	\
 				 ARCH_SYSCALL_WORK_EXIT)
 
 /*
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 91e8fd50adf4..cef136fc7cb5 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -5,6 +5,8 @@
 #include <linux/livepatch.h>
 #include <linux/audit.h>
 
+#include "common.h"
+
 #define CREATE_TRACE_POINTS
 #include <trace/events/syscalls.h>
 
@@ -46,6 +48,16 @@ static long syscall_trace_enter(struct pt_regs *regs, long syscall,
 {
 	long ret = 0;
 
+	/*
+	 * Handle Syscall User Dispatch.  This must comes first, since
+	 * the ABI here can be something that doesn't make sense for
+	 * other syscall_work features.
+	 */
+	if (work & SYSCALL_WORK_SYSCALL_USER_DISPATCH) {
+		if (do_syscall_user_dispatch(regs))
+			return -1L;
+	}
+
 	/* Handle ptrace */
 	if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) {
 		ret = arch_syscall_enter_tracehook(regs);
@@ -230,6 +242,11 @@ static void syscall_exit_work(struct pt_regs *regs, unsigned long work)
 {
 	bool step;
 
+	if (work & SYSCALL_WORK_SYSCALL_USER_DISPATCH) {
+		if (on_syscall_dispatch())
+			return;
+	}
+
 	audit_syscall_exit(regs);
 
 	if (work & SYSCALL_WORK_SYSCALL_TRACEPOINT)
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v7 5/7] selftests: Add kselftest for syscall user dispatch
  2020-11-18  3:28 [PATCH v7 0/7] Syscall User Dispatch Gabriel Krisman Bertazi
                   ` (3 preceding siblings ...)
  2020-11-18  3:28 ` [PATCH v7 4/7] entry: Support Syscall User Dispatch on common syscall entry Gabriel Krisman Bertazi
@ 2020-11-18  3:28 ` Gabriel Krisman Bertazi
  2020-11-18  3:28 ` [PATCH v7 6/7] selftests: Add benchmark " Gabriel Krisman Bertazi
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 19+ messages in thread
From: Gabriel Krisman Bertazi @ 2020-11-18  3:28 UTC (permalink / raw)
  To: luto, tglx, keescook
  Cc: christian.brauner, peterz, willy, shuah, linux-kernel, linux-api,
	linux-kselftest, x86, gofmanp, Gabriel Krisman Bertazi, kernel

Implement functionality tests for syscall user dispatch.  In order to
make the test portable, refrain from open coding syscall dispatchers and
calculating glibc memory ranges.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Acked-by: Kees Cook <keescook@chromium.org>

---
Changes since v6:
  - Update selftests to reflect {offset,len} api change
Changes since v4:
  - Update bad selector test to reflect change in API

Changes since v3:
  - Sort entry in Makefile
  - Add SPDX header
  - Use __NR_syscalls if available
---
 tools/testing/selftests/Makefile              |   1 +
 .../syscall_user_dispatch/.gitignore          |   3 +
 .../selftests/syscall_user_dispatch/Makefile  |   9 +
 .../selftests/syscall_user_dispatch/config    |   1 +
 .../syscall_user_dispatch/sud_test.c          | 310 ++++++++++++++++++
 5 files changed, 324 insertions(+)
 create mode 100644 tools/testing/selftests/syscall_user_dispatch/.gitignore
 create mode 100644 tools/testing/selftests/syscall_user_dispatch/Makefile
 create mode 100644 tools/testing/selftests/syscall_user_dispatch/config
 create mode 100644 tools/testing/selftests/syscall_user_dispatch/sud_test.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index d9c283503159..96d5682fb9d8 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -56,6 +56,7 @@ TARGETS += sparc64
 TARGETS += splice
 TARGETS += static_keys
 TARGETS += sync
+TARGETS += syscall_user_dispatch
 TARGETS += sysctl
 TARGETS += tc-testing
 TARGETS += timens
diff --git a/tools/testing/selftests/syscall_user_dispatch/.gitignore b/tools/testing/selftests/syscall_user_dispatch/.gitignore
new file mode 100644
index 000000000000..f539615ad5da
--- /dev/null
+++ b/tools/testing/selftests/syscall_user_dispatch/.gitignore
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0-only
+sud_test
+sud_benchmark
diff --git a/tools/testing/selftests/syscall_user_dispatch/Makefile b/tools/testing/selftests/syscall_user_dispatch/Makefile
new file mode 100644
index 000000000000..8e15fa42bcda
--- /dev/null
+++ b/tools/testing/selftests/syscall_user_dispatch/Makefile
@@ -0,0 +1,9 @@
+# SPDX-License-Identifier: GPL-2.0
+top_srcdir = ../../../..
+INSTALL_HDR_PATH = $(top_srcdir)/usr
+LINUX_HDR_PATH = $(INSTALL_HDR_PATH)/include/
+
+CFLAGS += -Wall -I$(LINUX_HDR_PATH)
+
+TEST_GEN_PROGS := sud_test
+include ../lib.mk
diff --git a/tools/testing/selftests/syscall_user_dispatch/config b/tools/testing/selftests/syscall_user_dispatch/config
new file mode 100644
index 000000000000..039e303e59d7
--- /dev/null
+++ b/tools/testing/selftests/syscall_user_dispatch/config
@@ -0,0 +1 @@
+CONFIG_GENERIC_ENTRY=y
diff --git a/tools/testing/selftests/syscall_user_dispatch/sud_test.c b/tools/testing/selftests/syscall_user_dispatch/sud_test.c
new file mode 100644
index 000000000000..6498b050ef89
--- /dev/null
+++ b/tools/testing/selftests/syscall_user_dispatch/sud_test.c
@@ -0,0 +1,310 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2020 Collabora Ltd.
+ *
+ * Test code for syscall user dispatch
+ */
+
+#define _GNU_SOURCE
+#include <sys/prctl.h>
+#include <sys/sysinfo.h>
+#include <sys/syscall.h>
+#include <signal.h>
+
+#include <asm/unistd.h>
+#include "../kselftest_harness.h"
+
+#ifndef PR_SET_SYSCALL_USER_DISPATCH
+# define PR_SET_SYSCALL_USER_DISPATCH	59
+# define PR_SYS_DISPATCH_OFF	0
+# define PR_SYS_DISPATCH_ON	1
+#endif
+
+#ifndef SYS_USER_DISPATCH
+# define SYS_USER_DISPATCH	2
+#endif
+
+#ifdef __NR_syscalls
+# define MAGIC_SYSCALL_1 (__NR_syscalls + 1) /* Bad Linux syscall number */
+#else
+# define MAGIC_SYSCALL_1 (0xff00)  /* Bad Linux syscall number */
+#endif
+
+#define SYSCALL_DISPATCH_ON(x) ((x) = 1)
+#define SYSCALL_DISPATCH_OFF(x) ((x) = 0)
+
+/* Test Summary:
+ *
+ * - dispatch_trigger_sigsys: Verify if PR_SET_SYSCALL_USER_DISPATCH is
+ *   able to trigger SIGSYS on a syscall.
+ *
+ * - bad_selector: Test that a bad selector value triggers SIGSYS with
+ *   si_errno EINVAL.
+ *
+ * - bad_prctl_param: Test that the API correctly rejects invalid
+ *   parameters on prctl
+ *
+ * - dispatch_and_return: Test that a syscall is selectively dispatched
+ *   to userspace depending on the value of selector.
+ *
+ * - disable_dispatch: Test that the PR_SYS_DISPATCH_OFF correctly
+ *   disables the dispatcher
+ *
+ * - direct_dispatch_range: Test that a syscall within the allowed range
+ *   can bypass the dispatcher.
+ */
+
+TEST_SIGNAL(dispatch_trigger_sigsys, SIGSYS)
+{
+	char sel = 0;
+	struct sysinfo info;
+	int ret;
+
+	ret = sysinfo(&info);
+	ASSERT_EQ(0, ret);
+
+	ret = prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, 0, 0, &sel);
+	ASSERT_EQ(0, ret) {
+		TH_LOG("Kernel does not support CONFIG_SYSCALL_USER_DISPATCH");
+	}
+
+	SYSCALL_DISPATCH_ON(sel);
+
+	sysinfo(&info);
+
+	EXPECT_FALSE(true) {
+		TH_LOG("Unreachable!");
+	}
+}
+
+TEST(bad_prctl_param)
+{
+	char sel = 0;
+	int op;
+
+	/* Invalid op */
+	op = -1;
+	prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0, 0, &sel);
+	ASSERT_EQ(EINVAL, errno);
+
+	/* PR_SYS_DISPATCH_OFF */
+	op = PR_SYS_DISPATCH_OFF;
+
+	/* offset != 0 */
+	prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x1, 0x0, 0);
+	EXPECT_EQ(EINVAL, errno);
+
+	/* len != 0 */
+	prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x0, 0xff, 0);
+	EXPECT_EQ(EINVAL, errno);
+
+	/* sel != NULL */
+	prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x0, 0x0, &sel);
+	EXPECT_EQ(EINVAL, errno);
+
+	/* Valid parameter */
+	errno = 0;
+	prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x0, 0x0, 0x0);
+	EXPECT_EQ(0, errno);
+
+	/* PR_SYS_DISPATCH_ON */
+	op = PR_SYS_DISPATCH_ON;
+
+	/* Dispatcher region is bad (offset > 0 && len == 0) */
+	prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x1, 0x0, &sel);
+	EXPECT_EQ(EINVAL, errno);
+	prctl(PR_SET_SYSCALL_USER_DISPATCH, op, -1L, 0x0, &sel);
+	EXPECT_EQ(EINVAL, errno);
+
+	/* Invalid selector */
+	prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x0, 0x1, (void *) -1);
+	ASSERT_EQ(EFAULT, errno);
+
+	/*
+	 * Dispatcher range overflows unsigned long
+	 */
+	prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, 1, -1L, &sel);
+	ASSERT_EQ(EINVAL, errno) {
+		TH_LOG("Should reject bad syscall range");
+	}
+
+	/*
+	 * Allowed range overflows usigned long
+	 */
+	prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, -1L, 0x1, &sel);
+	ASSERT_EQ(EINVAL, errno) {
+		TH_LOG("Should reject bad syscall range");
+	}
+}
+
+/*
+ * Use global selector for handle_sigsys tests, to avoid passing
+ * selector to signal handler
+ */
+char glob_sel;
+int nr_syscalls_emulated;
+int si_code;
+int si_errno;
+
+static void handle_sigsys(int sig, siginfo_t *info, void *ucontext)
+{
+	si_code = info->si_code;
+	si_errno = info->si_errno;
+
+	if (info->si_syscall == MAGIC_SYSCALL_1)
+		nr_syscalls_emulated++;
+
+	/* In preparation for sigreturn. */
+	SYSCALL_DISPATCH_OFF(glob_sel);
+}
+
+TEST(dispatch_and_return)
+{
+	long ret;
+	struct sigaction act;
+	sigset_t mask;
+
+	glob_sel = 0;
+	nr_syscalls_emulated = 0;
+	si_code = 0;
+	si_errno = 0;
+
+	memset(&act, 0, sizeof(act));
+	sigemptyset(&mask);
+
+	act.sa_sigaction = handle_sigsys;
+	act.sa_flags = SA_SIGINFO;
+	act.sa_mask = mask;
+
+	ret = sigaction(SIGSYS, &act, NULL);
+	ASSERT_EQ(0, ret);
+
+	/* Make sure selector is good prior to prctl. */
+	SYSCALL_DISPATCH_OFF(glob_sel);
+
+	ret = prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, 0, 0, &glob_sel);
+	ASSERT_EQ(0, ret) {
+		TH_LOG("Kernel does not support CONFIG_SYSCALL_USER_DISPATCH");
+	}
+
+	/* MAGIC_SYSCALL_1 doesn't exist. */
+	SYSCALL_DISPATCH_OFF(glob_sel);
+	ret = syscall(MAGIC_SYSCALL_1);
+	EXPECT_EQ(-1, ret) {
+		TH_LOG("Dispatch triggered unexpectedly");
+	}
+
+	/* MAGIC_SYSCALL_1 should be emulated. */
+	nr_syscalls_emulated = 0;
+	SYSCALL_DISPATCH_ON(glob_sel);
+
+	ret = syscall(MAGIC_SYSCALL_1);
+	EXPECT_EQ(MAGIC_SYSCALL_1, ret) {
+		TH_LOG("Failed to intercept syscall");
+	}
+	EXPECT_EQ(1, nr_syscalls_emulated) {
+		TH_LOG("Failed to emulate syscall");
+	}
+	ASSERT_EQ(SYS_USER_DISPATCH, si_code) {
+		TH_LOG("Bad si_code in SIGSYS");
+	}
+	ASSERT_EQ(0, si_errno) {
+		TH_LOG("Bad si_errno in SIGSYS");
+	}
+}
+
+TEST_SIGNAL(bad_selector, SIGSYS)
+{
+	long ret;
+	struct sigaction act;
+	sigset_t mask;
+	struct sysinfo info;
+
+	glob_sel = 0;
+	nr_syscalls_emulated = 0;
+	si_code = 0;
+	si_errno = 0;
+
+	memset(&act, 0, sizeof(act));
+	sigemptyset(&mask);
+
+	act.sa_sigaction = handle_sigsys;
+	act.sa_flags = SA_SIGINFO;
+	act.sa_mask = mask;
+
+	ret = sigaction(SIGSYS, &act, NULL);
+	ASSERT_EQ(0, ret);
+
+	/* Make sure selector is good prior to prctl. */
+	SYSCALL_DISPATCH_OFF(glob_sel);
+
+	ret = prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, 0, 0, &glob_sel);
+	ASSERT_EQ(0, ret) {
+		TH_LOG("Kernel does not support CONFIG_SYSCALL_USER_DISPATCH");
+	}
+
+	glob_sel = -1;
+
+	sysinfo(&info);
+
+	/* Even though it is ready to catch SIGSYS, the signal is
+	 * supposed to be uncatchable.
+	 */
+
+	EXPECT_FALSE(true) {
+		TH_LOG("Unreachable!");
+	}
+}
+
+TEST(disable_dispatch)
+{
+	int ret;
+	struct sysinfo info;
+	char sel = 0;
+
+	ret = prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, 0, 0, &sel);
+	ASSERT_EQ(0, ret) {
+		TH_LOG("Kernel does not support CONFIG_SYSCALL_USER_DISPATCH");
+	}
+
+	/* MAGIC_SYSCALL_1 doesn't exist. */
+	SYSCALL_DISPATCH_OFF(glob_sel);
+
+	ret = prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_OFF, 0, 0, 0);
+	EXPECT_EQ(0, ret) {
+		TH_LOG("Failed to unset syscall user dispatch");
+	}
+
+	/* Shouldn't have any effect... */
+	SYSCALL_DISPATCH_ON(glob_sel);
+
+	ret = syscall(__NR_sysinfo, &info);
+	EXPECT_EQ(0, ret) {
+		TH_LOG("Dispatch triggered unexpectedly");
+	}
+}
+
+TEST(direct_dispatch_range)
+{
+	int ret = 0;
+	struct sysinfo info;
+	char sel = 0;
+
+	/*
+	 * Instead of calculating libc addresses; allow the entire
+	 * memory map and lock the selector.
+	 */
+	ret = prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, 0, -1L, &sel);
+	ASSERT_EQ(0, ret) {
+		TH_LOG("Kernel does not support CONFIG_SYSCALL_USER_DISPATCH");
+	}
+
+	SYSCALL_DISPATCH_ON(sel);
+
+	ret = sysinfo(&info);
+	ASSERT_EQ(0, ret) {
+		TH_LOG("Dispatch triggered unexpectedly");
+	}
+}
+
+TEST_HARNESS_MAIN
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v7 6/7] selftests: Add benchmark for syscall user dispatch
  2020-11-18  3:28 [PATCH v7 0/7] Syscall User Dispatch Gabriel Krisman Bertazi
                   ` (4 preceding siblings ...)
  2020-11-18  3:28 ` [PATCH v7 5/7] selftests: Add kselftest for syscall user dispatch Gabriel Krisman Bertazi
@ 2020-11-18  3:28 ` Gabriel Krisman Bertazi
  2020-11-18  3:28 ` [PATCH v7 7/7] docs: Document Syscall User Dispatch Gabriel Krisman Bertazi
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 19+ messages in thread
From: Gabriel Krisman Bertazi @ 2020-11-18  3:28 UTC (permalink / raw)
  To: luto, tglx, keescook
  Cc: christian.brauner, peterz, willy, shuah, linux-kernel, linux-api,
	linux-kselftest, x86, gofmanp, Gabriel Krisman Bertazi, kernel

This is the patch I'm using to evaluate the impact syscall user dispatch
has on native syscall (syscalls not redirected to userspace) when
enabled for the process and submiting syscalls though the unblocked
dispatch selector. It works by running a step to define a baseline of
the cost of executing sysinfo, then enabling SUD, and rerunning that
step.

On my test machine, an AMD Ryzen 5 1500X, I have the following results
with the latest version of syscall user dispatch patches.

root@olga:~# syscall_user_dispatch/sud_benchmark
  Calibrating test set to last ~5 seconds...
  test iterations = 37500000
  Avg syscall time 134ns.
  Caught sys_ff00
  trapped_call_count 1, native_call_count 0.
  Avg syscall time 147ns.
  Interception overhead: 9.7% (+13ns).

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
---
 .../selftests/syscall_user_dispatch/Makefile  |   2 +-
 .../syscall_user_dispatch/sud_benchmark.c     | 200 ++++++++++++++++++
 2 files changed, 201 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/syscall_user_dispatch/sud_benchmark.c

diff --git a/tools/testing/selftests/syscall_user_dispatch/Makefile b/tools/testing/selftests/syscall_user_dispatch/Makefile
index 8e15fa42bcda..03c120270953 100644
--- a/tools/testing/selftests/syscall_user_dispatch/Makefile
+++ b/tools/testing/selftests/syscall_user_dispatch/Makefile
@@ -5,5 +5,5 @@ LINUX_HDR_PATH = $(INSTALL_HDR_PATH)/include/
 
 CFLAGS += -Wall -I$(LINUX_HDR_PATH)
 
-TEST_GEN_PROGS := sud_test
+TEST_GEN_PROGS := sud_test sud_benchmark
 include ../lib.mk
diff --git a/tools/testing/selftests/syscall_user_dispatch/sud_benchmark.c b/tools/testing/selftests/syscall_user_dispatch/sud_benchmark.c
new file mode 100644
index 000000000000..6689f1183dbf
--- /dev/null
+++ b/tools/testing/selftests/syscall_user_dispatch/sud_benchmark.c
@@ -0,0 +1,200 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2020 Collabora Ltd.
+ *
+ * Benchmark and test syscall user dispatch
+ */
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <string.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <errno.h>
+#include <time.h>
+#include <sys/time.h>
+#include <unistd.h>
+#include <sys/sysinfo.h>
+#include <sys/prctl.h>
+#include <sys/syscall.h>
+
+#ifndef PR_SET_SYSCALL_USER_DISPATCH
+# define PR_SET_SYSCALL_USER_DISPATCH	59
+# define PR_SYS_DISPATCH_OFF	0
+# define PR_SYS_DISPATCH_ON	1
+#endif
+
+#ifdef __NR_syscalls
+# define MAGIC_SYSCALL_1 (__NR_syscalls + 1) /* Bad Linux syscall number */
+#else
+# define MAGIC_SYSCALL_1 (0xff00)  /* Bad Linux syscall number */
+#endif
+
+/*
+ * To test returning from a sigsys with selector blocked, the test
+ * requires some per-architecture support (i.e. knowledge about the
+ * signal trampoline address).  On i386, we know it is on the vdso, and
+ * a small trampoline is open-coded for x86_64.  Other architectures
+ * that have a trampoline in the vdso will support TEST_BLOCKED_RETURN
+ * out of the box, but don't enable them until they support syscall user
+ * dispatch.
+ */
+#if defined(__x86_64__) || defined(__i386__)
+#define TEST_BLOCKED_RETURN
+#endif
+
+#ifdef __x86_64__
+void* (syscall_dispatcher_start)(void);
+void* (syscall_dispatcher_end)(void);
+#else
+unsigned long syscall_dispatcher_start = 0;
+unsigned long syscall_dispatcher_end = 0;
+#endif
+
+unsigned long trapped_call_count = 0;
+unsigned long native_call_count = 0;
+
+char selector;
+#define SYSCALL_BLOCK   (selector = PR_SYS_DISPATCH_ON)
+#define SYSCALL_UNBLOCK (selector = PR_SYS_DISPATCH_OFF)
+
+#define CALIBRATION_STEP 100000
+#define CALIBRATE_TO_SECS 5
+int factor;
+
+static double one_sysinfo_step(void)
+{
+	struct timespec t1, t2;
+	int i;
+	struct sysinfo info;
+
+	clock_gettime(CLOCK_MONOTONIC, &t1);
+	for (i = 0; i < CALIBRATION_STEP; i++)
+		sysinfo(&info);
+	clock_gettime(CLOCK_MONOTONIC, &t2);
+	return (t2.tv_sec - t1.tv_sec) + 1.0e-9 * (t2.tv_nsec - t1.tv_nsec);
+}
+
+static void calibrate_set(void)
+{
+	double elapsed = 0;
+
+	printf("Calibrating test set to last ~%d seconds...\n", CALIBRATE_TO_SECS);
+
+	while (elapsed < 1) {
+		elapsed += one_sysinfo_step();
+		factor += CALIBRATE_TO_SECS;
+	}
+
+	printf("test iterations = %d\n", CALIBRATION_STEP * factor);
+}
+
+static double perf_syscall(void)
+{
+	unsigned int i;
+	double partial = 0;
+
+	for (i = 0; i < factor; ++i)
+		partial += one_sysinfo_step()/(CALIBRATION_STEP*factor);
+	return partial;
+}
+
+static void handle_sigsys(int sig, siginfo_t *info, void *ucontext)
+{
+	char buf[1024];
+	int len;
+
+	SYSCALL_UNBLOCK;
+
+	/* printf and friends are not signal-safe. */
+	len = snprintf(buf, 1024, "Caught sys_%x\n", info->si_syscall);
+	write(1, buf, len);
+
+	if (info->si_syscall == MAGIC_SYSCALL_1)
+		trapped_call_count++;
+	else
+		native_call_count++;
+
+#ifdef TEST_BLOCKED_RETURN
+	SYSCALL_BLOCK;
+#endif
+
+#ifdef __x86_64__
+	__asm__ volatile("movq $0xf, %rax");
+	__asm__ volatile("leaveq");
+	__asm__ volatile("add $0x8, %rsp");
+	__asm__ volatile("syscall_dispatcher_start:");
+	__asm__ volatile("syscall");
+	__asm__ volatile("nop"); /* Landing pad within dispatcher area */
+	__asm__ volatile("syscall_dispatcher_end:");
+#endif
+
+}
+
+int main(void)
+{
+	struct sigaction act;
+	double time1, time2;
+	int ret;
+	sigset_t mask;
+
+	memset(&act, 0, sizeof(act));
+	sigemptyset(&mask);
+
+	act.sa_sigaction = handle_sigsys;
+	act.sa_flags = SA_SIGINFO;
+	act.sa_mask = mask;
+
+	calibrate_set();
+
+	time1 = perf_syscall();
+	printf("Avg syscall time %.0lfns.\n", time1 * 1.0e9);
+
+	ret = sigaction(SIGSYS, &act, NULL);
+	if (ret) {
+		perror("Error sigaction:");
+		exit(-1);
+	}
+
+	fprintf(stderr, "Enabling syscall trapping.\n");
+
+	if (prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
+		  syscall_dispatcher_start,
+		  (syscall_dispatcher_end - syscall_dispatcher_start + 1),
+		  &selector)) {
+		perror("prctl failed\n");
+		exit(-1);
+	}
+
+	SYSCALL_BLOCK;
+	syscall(MAGIC_SYSCALL_1);
+
+#ifdef TEST_BLOCKED_RETURN
+	if (selector == PR_SYS_DISPATCH_OFF) {
+		fprintf(stderr, "Failed to return with selector blocked.\n");
+		exit(-1);
+	}
+#endif
+
+	SYSCALL_UNBLOCK;
+
+	if (!trapped_call_count) {
+		fprintf(stderr, "syscall trapping does not work.\n");
+		exit(-1);
+	}
+
+	time2 = perf_syscall();
+
+	if (native_call_count) {
+		perror("syscall trapping intercepted more syscalls than expected\n");
+		exit(-1);
+	}
+
+	printf("trapped_call_count %lu, native_call_count %lu.\n",
+	       trapped_call_count, native_call_count);
+	printf("Avg syscall time %.0lfns.\n", time2 * 1.0e9);
+	printf("Interception overhead: %.1lf%% (+%.0lfns).\n",
+	       100.0 * (time2 / time1 - 1.0), 1.0e9 * (time2 - time1));
+	return 0;
+
+}
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v7 7/7] docs: Document Syscall User Dispatch
  2020-11-18  3:28 [PATCH v7 0/7] Syscall User Dispatch Gabriel Krisman Bertazi
                   ` (5 preceding siblings ...)
  2020-11-18  3:28 ` [PATCH v7 6/7] selftests: Add benchmark " Gabriel Krisman Bertazi
@ 2020-11-18  3:28 ` Gabriel Krisman Bertazi
  2020-11-18  8:48   ` Florian Weimer
  2020-11-18  8:47 ` [PATCH v7 0/7] " Florian Weimer
  2020-11-19 12:38 ` Peter Zijlstra
  8 siblings, 1 reply; 19+ messages in thread
From: Gabriel Krisman Bertazi @ 2020-11-18  3:28 UTC (permalink / raw)
  To: luto, tglx, keescook
  Cc: christian.brauner, peterz, willy, shuah, linux-kernel, linux-api,
	linux-kselftest, x86, gofmanp, Gabriel Krisman Bertazi, kernel

Explain the interface, provide some background and security notes.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
 .../admin-guide/syscall-user-dispatch.rst     | 87 +++++++++++++++++++
 1 file changed, 87 insertions(+)
 create mode 100644 Documentation/admin-guide/syscall-user-dispatch.rst

diff --git a/Documentation/admin-guide/syscall-user-dispatch.rst b/Documentation/admin-guide/syscall-user-dispatch.rst
new file mode 100644
index 000000000000..e2fb36926f97
--- /dev/null
+++ b/Documentation/admin-guide/syscall-user-dispatch.rst
@@ -0,0 +1,87 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Syscall User Dispatch
+=====================
+
+Background
+----------
+
+Compatibility layers like Wine need a way to efficiently emulate system
+calls of only a part of their process - the part that has the
+incompatible code - while being able to execute native syscalls without
+a high performance penalty on the native part of the process.  Seccomp
+falls short on this task, since it has limited support to efficiently
+filter syscalls based on memory regions, and it doesn't support removing
+filters.  Therefore a new mechanism is necessary.
+
+Syscall User Dispatch brings the filtering of the syscall dispatcher
+address back to userspace.  The application is in control of a flip
+switch, indicating the current personality of the process.  A
+multiple-personality application can then flip the switch without
+invoking the kernel, when crossing the compatibility layer API
+boundaries, to enable/disable the syscall redirection and execute
+syscalls directly (disabled) or send them to be emulated in userspace
+through a SIGSYS.
+
+The goal of this design is to provide very quick compatibility layer
+boundary crosses, which is achieved by not executing a syscall to change
+personality every time the compatibility layer executes.  Instead, a
+userspace memory region exposed to the kernel indicates the current
+personality, and the application simply modifies that variable to
+configure the mechanism.
+
+There is a relatively high cost associated with handling signals on most
+architectures, like x86, but at least for Wine, syscalls issued by
+native Windows code are currently not known to be a performance problem,
+since they are quite rare, at least for modern gaming applications.
+
+Since this mechanism is designed to capture syscalls issued by
+non-native applications, it must function on syscalls whose invocation
+ABI is completely unexpected to Linux.  Syscall User Dispatch, therefore
+doesn't rely on any of the syscall ABI to make the filtering.  It uses
+only the syscall dispatcher address and the userspace key.
+
+Interface
+---------
+
+A process can setup this mechanism on supported kernels
+CONFIG_SYSCALL_USER_DISPATCH) by executing the following prctl:
+
+  prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])
+
+<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable and
+disable the mechanism globally for that thread.  When
+PR_SYS_DISPATCH_OFF is used, the other fields must be zero.
+
+<offset> and <offset+length> delimit a closed memory region interval
+from which syscalls are always executed directly, regardless of the
+userspace selector.  This provides a fast path for the C library, which
+includes the most common syscall dispatchers in the native code
+applications, and also provides a way for the signal handler to return
+without triggering a nested SIGSYS on (rt_)sigreturn.  Users of this
+interface should make sure that at least the signal trampoline code is
+included in this region. In addition, for syscalls that implement the
+trampoline code on the vDSO, that trampoline is never intercepted.
+
+[selector] is a pointer to a char-sized region in the process memory
+region, that provides a quick way to enable disable syscall redirection
+thread-wide, without the need to invoke the kernel directly.  selector
+can be set to PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF.  Any other
+value should terminate the program with a SIGSYS.
+
+Security Notes
+--------------
+
+Syscall User Dispatch provides functionality for compatibility layers to
+quickly capture system calls issued by a non-native part of the
+application, while not impacting the Linux native regions of the
+process.  It is not a mechanism for sandboxing system calls, and it
+should not be seen as a security mechanism, since it is trivial for a
+malicious application to subvert the mechanism by jumping to an allowed
+dispatcher region prior to executing the syscall, or to discover the
+address and modify the selector value.  If the use case requires any
+kind of security sandboxing, Seccomp should be used instead.
+
+Any fork or exec of the existing process resets the mechanism to
+PR_SYS_DISPATCH_OFF.
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v7 0/7] Syscall User Dispatch
  2020-11-18  3:28 [PATCH v7 0/7] Syscall User Dispatch Gabriel Krisman Bertazi
                   ` (6 preceding siblings ...)
  2020-11-18  3:28 ` [PATCH v7 7/7] docs: Document Syscall User Dispatch Gabriel Krisman Bertazi
@ 2020-11-18  8:47 ` Florian Weimer
  2020-11-18 17:01   ` Gabriel Krisman Bertazi
  2020-11-19 12:38 ` Peter Zijlstra
  8 siblings, 1 reply; 19+ messages in thread
From: Florian Weimer @ 2020-11-18  8:47 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: luto, tglx, keescook, christian.brauner, peterz, willy, shuah,
	linux-kernel, linux-api, linux-kselftest, x86, gofmanp, kernel

* Gabriel Krisman Bertazi:

> This is the v7 of syscall user dispatch.  This version is a bit
> different from v6 on the following points, after the modifications
> requested on that submission.

Is this supposed to work with existing (Linux) libcs, or do you bring
your own low-level run-time libraries?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v7 7/7] docs: Document Syscall User Dispatch
  2020-11-18  3:28 ` [PATCH v7 7/7] docs: Document Syscall User Dispatch Gabriel Krisman Bertazi
@ 2020-11-18  8:48   ` Florian Weimer
  2020-11-18 17:02     ` Gabriel Krisman Bertazi
  0 siblings, 1 reply; 19+ messages in thread
From: Florian Weimer @ 2020-11-18  8:48 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: luto, tglx, keescook, christian.brauner, peterz, willy, shuah,
	linux-kernel, linux-api, linux-kselftest, x86, gofmanp, kernel

* Gabriel Krisman Bertazi:

> +Interface
> +---------
> +
> +A process can setup this mechanism on supported kernels
> +CONFIG_SYSCALL_USER_DISPATCH) by executing the following prctl:
> +
> +  prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])
> +
> +<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable and
> +disable the mechanism globally for that thread.  When
> +PR_SYS_DISPATCH_OFF is used, the other fields must be zero.
> +
> +<offset> and <offset+length> delimit a closed memory region interval
> +from which syscalls are always executed directly, regardless of the
> +userspace selector.  This provides a fast path for the C library, which
> +includes the most common syscall dispatchers in the native code
> +applications, and also provides a way for the signal handler to return
> +without triggering a nested SIGSYS on (rt_)sigreturn.  Users of this
> +interface should make sure that at least the signal trampoline code is
> +included in this region. In addition, for syscalls that implement the
> +trampoline code on the vDSO, that trampoline is never intercepted.
> +
> +[selector] is a pointer to a char-sized region in the process memory
> +region, that provides a quick way to enable disable syscall redirection
> +thread-wide, without the need to invoke the kernel directly.  selector
> +can be set to PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF.  Any other
> +value should terminate the program with a SIGSYS.

Is this a process property or a task/thread property?  The last
paragraph says “thread-wide”, but the first paragraph says “process”.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v7 0/7] Syscall User Dispatch
  2020-11-18  8:47 ` [PATCH v7 0/7] " Florian Weimer
@ 2020-11-18 17:01   ` Gabriel Krisman Bertazi
  2020-11-18 17:22     ` Florian Weimer
  0 siblings, 1 reply; 19+ messages in thread
From: Gabriel Krisman Bertazi @ 2020-11-18 17:01 UTC (permalink / raw)
  To: Florian Weimer
  Cc: luto, tglx, keescook, christian.brauner, peterz, willy, shuah,
	linux-kernel, linux-api, linux-kselftest, x86, gofmanp, kernel

Florian Weimer <fw@deneb.enyo.de> writes:

> * Gabriel Krisman Bertazi:
>
>> This is the v7 of syscall user dispatch.  This version is a bit
>> different from v6 on the following points, after the modifications
>> requested on that submission.
>
> Is this supposed to work with existing (Linux) libcs, or do you bring
> your own low-level run-time libraries?

Hi Florian,

The main use case is to intercept Windows system calls of an application
running over Wine. While Wine is using an unmodified glibc to execute
its own native Linux syscalls, the Windows libraries might be directly
issuing syscalls that we need to capture. So there is a mix. While this
mechanism is compatible with existing libc, we might have other
libraries executing a syscall instruction directly.

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v7 7/7] docs: Document Syscall User Dispatch
  2020-11-18  8:48   ` Florian Weimer
@ 2020-11-18 17:02     ` Gabriel Krisman Bertazi
  0 siblings, 0 replies; 19+ messages in thread
From: Gabriel Krisman Bertazi @ 2020-11-18 17:02 UTC (permalink / raw)
  To: Florian Weimer
  Cc: luto, tglx, keescook, christian.brauner, peterz, willy, shuah,
	linux-kernel, linux-api, linux-kselftest, x86, gofmanp, kernel

Florian Weimer <fw@deneb.enyo.de> writes:

> * Gabriel Krisman Bertazi:
>
>> +Interface
>> +---------
>> +
>> +A process can setup this mechanism on supported kernels
>> +CONFIG_SYSCALL_USER_DISPATCH) by executing the following prctl:
>> +
>> +  prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])
>> +
>> +<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable and
>> +disable the mechanism globally for that thread.  When
>> +PR_SYS_DISPATCH_OFF is used, the other fields must be zero.
>> +
>> +<offset> and <offset+length> delimit a closed memory region interval
>> +from which syscalls are always executed directly, regardless of the
>> +userspace selector.  This provides a fast path for the C library, which
>> +includes the most common syscall dispatchers in the native code
>> +applications, and also provides a way for the signal handler to return
>> +without triggering a nested SIGSYS on (rt_)sigreturn.  Users of this
>> +interface should make sure that at least the signal trampoline code is
>> +included in this region. In addition, for syscalls that implement the
>> +trampoline code on the vDSO, that trampoline is never intercepted.
>> +
>> +[selector] is a pointer to a char-sized region in the process memory
>> +region, that provides a quick way to enable disable syscall redirection
>> +thread-wide, without the need to invoke the kernel directly.  selector
>> +can be set to PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF.  Any other
>> +value should terminate the program with a SIGSYS.
>
> Is this a process property or a task/thread property?  The last
> paragraph says “thread-wide”, but the first paragraph says “process”.

It is per-thread, as it doesn't survive across clone/fork syscalls.  I
will fix the first paragraph of this text.

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v7 0/7] Syscall User Dispatch
  2020-11-18 17:01   ` Gabriel Krisman Bertazi
@ 2020-11-18 17:22     ` Florian Weimer
  0 siblings, 0 replies; 19+ messages in thread
From: Florian Weimer @ 2020-11-18 17:22 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: luto, tglx, keescook, christian.brauner, peterz, willy, shuah,
	linux-kernel, linux-api, linux-kselftest, x86, gofmanp, kernel

* Gabriel Krisman Bertazi:

> The main use case is to intercept Windows system calls of an application
> running over Wine. While Wine is using an unmodified glibc to execute
> its own native Linux syscalls, the Windows libraries might be directly
> issuing syscalls that we need to capture. So there is a mix. While this
> mechanism is compatible with existing libc, we might have other
> libraries executing a syscall instruction directly.

Please raise this on libc-alpha, it's an unexpected compatibility
constraint on glibc.  Thanks.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v7 3/7] kernel: Implement selective syscall userspace redirection
  2020-11-18  3:28 ` [PATCH v7 3/7] kernel: Implement selective syscall userspace redirection Gabriel Krisman Bertazi
@ 2020-11-19 12:36   ` Peter Zijlstra
  2020-11-19 17:43   ` Gabriel Krisman Bertazi
  1 sibling, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2020-11-19 12:36 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: luto, tglx, keescook, christian.brauner, willy, shuah,
	linux-kernel, linux-api, linux-kselftest, x86, gofmanp, kernel

On Tue, Nov 17, 2020 at 10:28:36PM -0500, Gabriel Krisman Bertazi wrote:
>   prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <off>, <length>, [selector])
> 
> The range [<offset>,<offset>+len] is a part of the process memory map

> +	if (likely(instruction_pointer(regs) - sd->offset < sd->len))
> +		return false;

The actual implementation ^ is: [<offset>, <offset>+<length>).

Which seems consistent and right, so I would suggest simply changing the
Changelog, something that could be done when applying.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v7 0/7] Syscall User Dispatch
  2020-11-18  3:28 [PATCH v7 0/7] Syscall User Dispatch Gabriel Krisman Bertazi
                   ` (7 preceding siblings ...)
  2020-11-18  8:47 ` [PATCH v7 0/7] " Florian Weimer
@ 2020-11-19 12:38 ` Peter Zijlstra
  2020-11-21  0:24   ` Kees Cook
  8 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2020-11-19 12:38 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: luto, tglx, keescook, christian.brauner, willy, shuah,
	linux-kernel, linux-api, linux-kselftest, x86, gofmanp, kernel

On Tue, Nov 17, 2020 at 10:28:33PM -0500, Gabriel Krisman Bertazi wrote:
> Gabriel Krisman Bertazi (7):
>   x86: vdso: Expose sigreturn address on vdso to the kernel
>   signal: Expose SYS_USER_DISPATCH si_code type
>   kernel: Implement selective syscall userspace redirection
>   entry: Support Syscall User Dispatch on common syscall entry
>   selftests: Add kselftest for syscall user dispatch
>   selftests: Add benchmark for syscall user dispatch
>   docs: Document Syscall User Dispatch

Aside from the one little nit this looks good to me.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v7 3/7] kernel: Implement selective syscall userspace redirection
  2020-11-18  3:28 ` [PATCH v7 3/7] kernel: Implement selective syscall userspace redirection Gabriel Krisman Bertazi
  2020-11-19 12:36   ` Peter Zijlstra
@ 2020-11-19 17:43   ` Gabriel Krisman Bertazi
  2020-11-21  0:18     ` Kees Cook
  1 sibling, 1 reply; 19+ messages in thread
From: Gabriel Krisman Bertazi @ 2020-11-19 17:43 UTC (permalink / raw)
  To: luto, Rich Felker
  Cc: tglx, keescook, christian.brauner, peterz, willy, shuah,
	linux-kernel, linux-api, linux-kselftest, x86, gofmanp, kernel

Gabriel Krisman Bertazi <krisman@collabora.com> writes:

> Introduce a mechanism to quickly disable/enable syscall handling for a
> specific process and redirect to userspace via SIGSYS.  This is useful
> for processes with parts that require syscall redirection and parts that
> don't, but who need to perform this boundary crossing really fast,
> without paying the cost of a system call to reconfigure syscall handling
> on each boundary transition.  This is particularly important for Windows
> games running over Wine.

I raised a discussion about this on libc-alpha, as requested by Florian.
At the moment, there was some back and forth on why the use-case is not
done by seccomp, but a more interesting point about user_notif was
raised by Rich Felker (cc'ed).

SIGSYS, as a signal handler, is limited in what can be done inside it.
Rich suggested the user_notif design is a better solution.  I understand
that from a Wine perspective, SIGSYS suffices for their work, but would
it make sense to extend SUD interface to support a user_notif-like
interface?  Would this be acceptable as future work to be added when/if
needed, or should we design it from the start?

The existing interface could be extended with a flags field as part of
the opcode passed in argument 2, which is currently reserved, and then
return a FD, just like seccomp(2) does.  So it is not like the current
patches couldn't be extended in the future if needed, unless I'm
mistaken.

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v7 3/7] kernel: Implement selective syscall userspace redirection
  2020-11-19 17:43   ` Gabriel Krisman Bertazi
@ 2020-11-21  0:18     ` Kees Cook
  2020-11-22  4:01       ` Andy Lutomirski
  0 siblings, 1 reply; 19+ messages in thread
From: Kees Cook @ 2020-11-21  0:18 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: luto, Rich Felker, tglx, christian.brauner, peterz, willy, shuah,
	linux-kernel, linux-api, linux-kselftest, x86, gofmanp, kernel

On Thu, Nov 19, 2020 at 12:43:05PM -0500, Gabriel Krisman Bertazi wrote:
> The existing interface could be extended with a flags field as part of
> the opcode passed in argument 2, which is currently reserved, and then
> return a FD, just like seccomp(2) does.  So it is not like the current
> patches couldn't be extended in the future if needed, unless I'm
> mistaken.

Yes, I'd prefer this series go in as-is, and if there is a need for
extending the API, arg2 can have more values added.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v7 0/7] Syscall User Dispatch
  2020-11-19 12:38 ` Peter Zijlstra
@ 2020-11-21  0:24   ` Kees Cook
  0 siblings, 0 replies; 19+ messages in thread
From: Kees Cook @ 2020-11-21  0:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Gabriel Krisman Bertazi, luto, tglx, christian.brauner, willy,
	shuah, linux-kernel, linux-api, linux-kselftest, x86, gofmanp,
	kernel

On Thu, Nov 19, 2020 at 01:38:27PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 10:28:33PM -0500, Gabriel Krisman Bertazi wrote:
> > Gabriel Krisman Bertazi (7):
> >   x86: vdso: Expose sigreturn address on vdso to the kernel
> >   signal: Expose SYS_USER_DISPATCH si_code type
> >   kernel: Implement selective syscall userspace redirection
> >   entry: Support Syscall User Dispatch on common syscall entry
> >   selftests: Add kselftest for syscall user dispatch
> >   selftests: Add benchmark for syscall user dispatch
> >   docs: Document Syscall User Dispatch
> 
> Aside from the one little nit this looks good to me.
> 
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Agreed, and thank you Gabriel for the SYSCALL_WORK series too. :) That's
so nice to have!

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v7 3/7] kernel: Implement selective syscall userspace redirection
  2020-11-21  0:18     ` Kees Cook
@ 2020-11-22  4:01       ` Andy Lutomirski
  0 siblings, 0 replies; 19+ messages in thread
From: Andy Lutomirski @ 2020-11-22  4:01 UTC (permalink / raw)
  To: Kees Cook
  Cc: Gabriel Krisman Bertazi, Andrew Lutomirski, Rich Felker,
	Thomas Gleixner, Christian Brauner, Peter Zijlstra,
	Matthew Wilcox, Shuah Khan, LKML, Linux API,
	open list:KERNEL SELFTEST FRAMEWORK, X86 ML, Paul Gofman, kernel

On Fri, Nov 20, 2020 at 4:18 PM Kees Cook <keescook@chromium.org> wrote:
>
> On Thu, Nov 19, 2020 at 12:43:05PM -0500, Gabriel Krisman Bertazi wrote:
> > The existing interface could be extended with a flags field as part of
> > the opcode passed in argument 2, which is currently reserved, and then
> > return a FD, just like seccomp(2) does.  So it is not like the current
> > patches couldn't be extended in the future if needed, unless I'm
> > mistaken.
>
> Yes, I'd prefer this series go in as-is, and if there is a need for
> extending the API, arg2 can have more values added.

I agree.



>
> --
> Kees Cook

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2020-11-22  4:01 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-18  3:28 [PATCH v7 0/7] Syscall User Dispatch Gabriel Krisman Bertazi
2020-11-18  3:28 ` [PATCH v7 1/7] x86: vdso: Expose sigreturn address on vdso to the kernel Gabriel Krisman Bertazi
2020-11-18  3:28 ` [PATCH v7 2/7] signal: Expose SYS_USER_DISPATCH si_code type Gabriel Krisman Bertazi
2020-11-18  3:28 ` [PATCH v7 3/7] kernel: Implement selective syscall userspace redirection Gabriel Krisman Bertazi
2020-11-19 12:36   ` Peter Zijlstra
2020-11-19 17:43   ` Gabriel Krisman Bertazi
2020-11-21  0:18     ` Kees Cook
2020-11-22  4:01       ` Andy Lutomirski
2020-11-18  3:28 ` [PATCH v7 4/7] entry: Support Syscall User Dispatch on common syscall entry Gabriel Krisman Bertazi
2020-11-18  3:28 ` [PATCH v7 5/7] selftests: Add kselftest for syscall user dispatch Gabriel Krisman Bertazi
2020-11-18  3:28 ` [PATCH v7 6/7] selftests: Add benchmark " Gabriel Krisman Bertazi
2020-11-18  3:28 ` [PATCH v7 7/7] docs: Document Syscall User Dispatch Gabriel Krisman Bertazi
2020-11-18  8:48   ` Florian Weimer
2020-11-18 17:02     ` Gabriel Krisman Bertazi
2020-11-18  8:47 ` [PATCH v7 0/7] " Florian Weimer
2020-11-18 17:01   ` Gabriel Krisman Bertazi
2020-11-18 17:22     ` Florian Weimer
2020-11-19 12:38 ` Peter Zijlstra
2020-11-21  0:24   ` Kees Cook

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.